Evaluating on Tenu Tech Brief

Feed: https://cluster-site.onrender.com/tags/evaluating/
Recent content in Evaluating on Tenu Tech Brief
Generator: Hugo 0.146.0 (en-us)
Last updated: Wed, 25 Feb 2026 07:59:13 +0000

Implicit Intelligence -- Evaluating Agents on What Users Don't Say
https://cluster-site.onrender.com/posts/implicit-intelligence--evaluating-agents-on-what-users-dont-say/
Wed, 25 Feb 2026 05:00:00 +0000
• Computer Science > Artificial Intelligence [Submitted on 23 Feb 2026]

Reusability Report: Evaluating the performance of a meta-learning foundation model on predicting the antibacterial activity of natural products
https://cluster-site.onrender.com/posts/reusability-report-evaluating-the-performance-of-a-meta-learning-foundation-model-on-predicting-the-antibacterial-activity-of-natural-products/
Tue, 24 Feb 2026 00:35:20 +0000
• Abstract: Deep learning foundation models are becoming increasingly popular for use in bioactivity prediction. Recently, Feng et al. developed ActFound, a bioactive foundation...

Evaluating Text-based Conversational Agents for Mental Health: A Systematic Review of Metrics, Methods and Usage Contexts
https://cluster-site.onrender.com/posts/evaluating-text-based-conversational-agents-for-mental-health-a-systematic-review-of-metrics-methods-and-usage-contexts/
Mon, 23 Feb 2026 05:00:00 +0000
• Computer Science > Human-Computer Interaction [Submitted on 8 Jan 2026]

The Token Games: Evaluating Language Model Reasoning with Puzzle Duels
https://cluster-site.onrender.com/posts/the-token-games-evaluating-language-model-reasoning-with-puzzle-duels/
Mon, 23 Feb 2026 05:00:00 +0000
• Computer Science > Artificial Intelligence [Submitted on 19 Feb 2026]

WorkflowPerturb: Calibrated Stress Tests for Evaluating Multi-Agent Workflow Metrics
https://cluster-site.onrender.com/posts/workflowperturb-calibrated-stress-tests-for-evaluating-multi-agent-workflow-metrics/
Mon, 23 Feb 2026 05:00:00 +0000
• Computer Science > Artificial Intelligence [Submitted on 20 Feb 2026]

Reusability Report: Evaluating the performance of a meta-learning foundation model on predicting the antibacterial activity of natural products
https://cluster-site.onrender.com/posts/reusability-report-evaluating-the-performance-of-a-meta-learning-foundation-model-on-predicting-the-antibacterial-activity-of-natural-products/
Sun, 22 Feb 2026 00:35:32 +0000
• Abstract: Deep learning foundation models are becoming increasingly popular for use in bioactivity prediction. Recently, Feng et al. developed ActFound, a bioactive foundation...

Evaluating Malleable Job Scheduling in HPC Clusters using Real-World Workloads
https://cluster-site.onrender.com/posts/evaluating-malleable-job-scheduling-in-hpc-clusters-using-real-world-workloads/
Fri, 20 Feb 2026 05:00:00 +0000
• Computer Science > Distributed, Parallel, and Cluster Computing [Submitted on 19 Feb 2026]

Evidence-Grounded Subspecialty Reasoning: Evaluating a Curated Clinical Intelligence Layer on the 2025 Endocrinology Board-Style Examination
https://cluster-site.onrender.com/posts/evidence-grounded-subspecialty-reasoning-evaluating-a-curated-clinical-intelligence-layer-on-the-2025-endocrinology-board-style-examination/
Thu, 19 Feb 2026 05:00:00 +0000
• Computer Science > Artificial Intelligence [Submitted on 17 Feb 2026]

A roadmap for evaluating moral competence in large language models
https://cluster-site.onrender.com/posts/a-roadmap-for-evaluating-moral-competence-in-large-language-models/
Thu, 19 Feb 2026 01:58:18 +0000
• Abstract: The question of whether large language models (LLMs) can exhibit moral capabilities is of growing interest and urgency, as these systems are deployed in sensitive roles...

Reusability Report: Evaluating the performance of a meta-learning foundation model on predicting the antibacterial activity of natural products
https://cluster-site.onrender.com/posts/reusability-report-evaluating-the-performance-of-a-meta-learning-foundation-model-on-predicting-the-antibacterial-activity-of-natural-products/
Thu, 19 Feb 2026 01:18:25 +0000
• Nature Machine Intelligence, Published online: 12 February 2026; doi:10.1038/s42256-026-01187-y
• This Reusability Report tests the ability of a foundation model, ActFound, to pred...

ResearchGym: Evaluating Language Model Agents on Real-World AI Research
https://cluster-site.onrender.com/posts/researchgym-evaluating-language-model-agents-on-real-world-ai-research/
Wed, 18 Feb 2026 05:00:00 +0000
• Computer Science > Artificial Intelligence [Submitted on 16 Feb 2026]

ProMoral-Bench: Evaluating Prompting Strategies for Moral Reasoning and Safety in LLMs
https://cluster-site.onrender.com/posts/promoral-bench-evaluating-prompting-strategies-for-moral-reasoning-and-safety-in-llms/
Tue, 17 Feb 2026 05:00:00 +0000
• Computer Science > Artificial Intelligence [Submitted on 5 Feb 2026]

TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks
https://cluster-site.onrender.com/posts/temporalbench-a-benchmark-for-evaluating-llm-based-agents-on-contextual-and-event-informed-time-series-tasks/
Tue, 17 Feb 2026 05:00:00 +0000
• Computer Science > Artificial Intelligence [Submitted on 5 Feb 2026]

OpenEnv in Practice: Evaluating Tool-Using Agents in Real-World Environments
https://cluster-site.onrender.com/posts/openenv-in-practice-evaluating-tool-using-agents-in-real-world-environments/
Thu, 12 Feb 2026 00:00:00 +0000
• AI agents often perform impressively in controlled research settings, yet struggle when deployed in r...

A practical blueprint for evaluating conversational AI at scale
https://cluster-site.onrender.com/posts/a-practical-blueprint-for-evaluating-conversational-ai-at-scale/
Thu, 02 Oct 2025 16:00:00 +0000
• LLM applications present a deceptively simple interface: a single text box.
• But behind that minimalism runs a chain of probabilistic stages, including intent classification, do...