<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Evaluating on Tenu Tech Brief</title>
    <link>https://cluster-site.onrender.com/tags/evaluating/</link>
    <description>Recent content in Evaluating on Tenu Tech Brief</description>
    <generator>Hugo -- 0.146.0</generator>
    <language>en-us</language>
    <lastBuildDate>Wed, 25 Feb 2026 07:59:13 +0000</lastBuildDate>
    <atom:link href="https://cluster-site.onrender.com/tags/evaluating/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Implicit Intelligence -- Evaluating Agents on What Users Don&#39;t Say</title>
      <link>https://cluster-site.onrender.com/posts/implicit-intelligence--evaluating-agents-on-what-users-dont-say/</link>
      <pubDate>Wed, 25 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/implicit-intelligence--evaluating-agents-on-what-users-dont-say/</guid>
      <description>Computer Science &gt; Artificial Intelligence [Submitted on 23 Feb 2026] Title: Implicit Intelligence &#8211; Evaluating Agents on What Users Don&#8217;t Say</description>
    </item>
    <item>
      <title>Reusability Report: Evaluating the performance of a meta-learning foundation model on predicting the antibacterial activity of natural products</title>
      <link>https://cluster-site.onrender.com/posts/reusability-report-evaluating-the-performance-of-a-meta-learning-foundation-model-on-predicting-the-antibacterial-activity-of-natural-products/</link>
      <pubDate>Tue, 24 Feb 2026 00:35:20 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/reusability-report-evaluating-the-performance-of-a-meta-learning-foundation-model-on-predicting-the-antibacterial-activity-of-natural-products/</guid>
      <description>Abstract: Deep learning foundation models are becoming increasingly popular for use in bioactivity prediction. Recently, Feng et al. developed ActFound, a bioactive foundation</description>
    </item>
    <item>
      <title>Evaluating Text-based Conversational Agents for Mental Health: A Systematic Review of Metrics, Methods and Usage Contexts</title>
      <link>https://cluster-site.onrender.com/posts/evaluating-text-based-conversational-agents-for-mental-health-a-systematic-review-of-metrics-methods-and-usage-contexts/</link>
      <pubDate>Mon, 23 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/evaluating-text-based-conversational-agents-for-mental-health-a-systematic-review-of-metrics-methods-and-usage-contexts/</guid>
      <description>Computer Science &gt; Human-Computer Interaction [Submitted on 8 Jan 2026] Title: Evaluating Text-based Conversational Agents for Mental Health: A Systematic Review of Metrics, Metho</description>
    </item>
    <item>
      <title>The Token Games: Evaluating Language Model Reasoning with Puzzle Duels</title>
      <link>https://cluster-site.onrender.com/posts/the-token-games-evaluating-language-model-reasoning-with-puzzle-duels/</link>
      <pubDate>Mon, 23 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/the-token-games-evaluating-language-model-reasoning-with-puzzle-duels/</guid>
      <description>Computer Science &gt; Artificial Intelligence [Submitted on 19 Feb 2026] Title: The Token Games: Evaluating Language Model Reasoning with Puzzle Duels</description>
    </item>
    <item>
      <title>WorkflowPerturb: Calibrated Stress Tests for Evaluating Multi-Agent Workflow Metrics</title>
      <link>https://cluster-site.onrender.com/posts/workflowperturb-calibrated-stress-tests-for-evaluating-multi-agent-workflow-metrics/</link>
      <pubDate>Mon, 23 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/workflowperturb-calibrated-stress-tests-for-evaluating-multi-agent-workflow-metrics/</guid>
      <description>Computer Science &gt; Artificial Intelligence [Submitted on 20 Feb 2026] Title: WorkflowPerturb: Calibrated Stress Tests for Evaluating Multi-Agent Workflow Metrics</description>
    </item>
    <item>
      <title>Reusability Report: Evaluating the performance of a meta-learning foundation model on predicting the antibacterial activity of natural products</title>
      <link>https://cluster-site.onrender.com/posts/reusability-report-evaluating-the-performance-of-a-meta-learning-foundation-model-on-predicting-the-antibacterial-activity-of-natural-products/</link>
      <pubDate>Sun, 22 Feb 2026 00:35:32 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/reusability-report-evaluating-the-performance-of-a-meta-learning-foundation-model-on-predicting-the-antibacterial-activity-of-natural-products/</guid>
      <description>Abstract: Deep learning foundation models are becoming increasingly popular for use in bioactivity prediction. Recently, Feng et al. developed ActFound, a bioactive foundation</description>
    </item>
    <item>
      <title>Evaluating Malleable Job Scheduling in HPC Clusters using Real-World Workloads</title>
      <link>https://cluster-site.onrender.com/posts/evaluating-malleable-job-scheduling-in-hpc-clusters-using-real-world-workloads/</link>
      <pubDate>Fri, 20 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/evaluating-malleable-job-scheduling-in-hpc-clusters-using-real-world-workloads/</guid>
      <description>Computer Science &gt; Distributed, Parallel, and Cluster Computing [Submitted on 19 Feb 2026] Title: Evaluating Malleable Job Scheduling in HPC Clusters using Real-World Workloads</description>
    </item>
    <item>
      <title>Evidence-Grounded Subspecialty Reasoning: Evaluating a Curated Clinical Intelligence Layer on the 2025 Endocrinology Board-Style Examination</title>
      <link>https://cluster-site.onrender.com/posts/evidence-grounded-subspecialty-reasoning-evaluating-a-curated-clinical-intelligence-layer-on-the-2025-endocrinology-board-style-examination/</link>
      <pubDate>Thu, 19 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/evidence-grounded-subspecialty-reasoning-evaluating-a-curated-clinical-intelligence-layer-on-the-2025-endocrinology-board-style-examination/</guid>
      <description>Computer Science &gt; Artificial Intelligence [Submitted on 17 Feb 2026] Title: Evidence-Grounded Subspecialty Reasoning: Evaluating a Curated Clinical Intelligence Layer on the 2025</description>
    </item>
    <item>
      <title>A roadmap for evaluating moral competence in large language models</title>
      <link>https://cluster-site.onrender.com/posts/a-roadmap-for-evaluating-moral-competence-in-large-language-models/</link>
      <pubDate>Thu, 19 Feb 2026 01:58:18 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/a-roadmap-for-evaluating-moral-competence-in-large-language-models/</guid>
      <description>Abstract: The question of whether large language models (LLMs) can exhibit moral capabilities is of growing interest and urgency, as these systems are deployed in sensitive roles</description>
    </item>
    <item>
      <title>Reusability Report: Evaluating the performance of a meta-learning foundation model on predicting the antibacterial activity of natural products</title>
      <link>https://cluster-site.onrender.com/posts/reusability-report-evaluating-the-performance-of-a-meta-learning-foundation-model-on-predicting-the-antibacterial-activity-of-natural-products/</link>
      <pubDate>Thu, 19 Feb 2026 01:18:25 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/reusability-report-evaluating-the-performance-of-a-meta-learning-foundation-model-on-predicting-the-antibacterial-activity-of-natural-products/</guid>
      <description>Nature Machine Intelligence, Published online: 12 February 2026; doi:10.1038/s42256-026-01187-y. This Reusability Report tests the ability of a foundation model, ActFound, to pred</description>
    </item>
    <item>
      <title>ResearchGym: Evaluating Language Model Agents on Real-World AI Research</title>
      <link>https://cluster-site.onrender.com/posts/researchgym-evaluating-language-model-agents-on-real-world-ai-research/</link>
      <pubDate>Wed, 18 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/researchgym-evaluating-language-model-agents-on-real-world-ai-research/</guid>
      <description>Computer Science &gt; Artificial Intelligence [Submitted on 16 Feb 2026] Title: ResearchGym: Evaluating Language Model Agents on Real-World AI Research</description>
    </item>
    <item>
      <title>ProMoral-Bench: Evaluating Prompting Strategies for Moral Reasoning and Safety in LLMs</title>
      <link>https://cluster-site.onrender.com/posts/promoral-bench-evaluating-prompting-strategies-for-moral-reasoning-and-safety-in-llms/</link>
      <pubDate>Tue, 17 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/promoral-bench-evaluating-prompting-strategies-for-moral-reasoning-and-safety-in-llms/</guid>
      <description>Computer Science &gt; Artificial Intelligence [Submitted on 5 Feb 2026] Title: ProMoral-Bench: Evaluating Prompting Strategies for Moral Reasoning and Safety in LLMs</description>
    </item>
    <item>
      <title>TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks</title>
      <link>https://cluster-site.onrender.com/posts/temporalbench-a-benchmark-for-evaluating-llm-based-agents-on-contextual-and-event-informed-time-series-tasks/</link>
      <pubDate>Tue, 17 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/temporalbench-a-benchmark-for-evaluating-llm-based-agents-on-contextual-and-event-informed-time-series-tasks/</guid>
      <description>Computer Science &gt; Artificial Intelligence [Submitted on 5 Feb 2026] Title: TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series</description>
    </item>
    <item>
      <title>OpenEnv in Practice: Evaluating Tool-Using Agents in Real-World Environments</title>
      <link>https://cluster-site.onrender.com/posts/openenv-in-practice-evaluating-tool-using-agents-in-real-world-environments/</link>
      <pubDate>Thu, 12 Feb 2026 00:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/openenv-in-practice-evaluating-tool-using-agents-in-real-world-environments/</guid>
      <description>AI agents often perform impressively in controlled research settings, yet struggle when deployed in r</description>
    </item>
    <item>
      <title>A practical blueprint for evaluating conversational AI at scale</title>
      <link>https://cluster-site.onrender.com/posts/a-practical-blueprint-for-evaluating-conversational-ai-at-scale/</link>
      <pubDate>Thu, 02 Oct 2025 16:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/a-practical-blueprint-for-evaluating-conversational-ai-at-scale/</guid>
      <description>LLM applications present a deceptively simple interface: a single text box. But behind that minimalism runs a chain of probabilistic stages, including intent classification, do</description>
    </item>
  </channel>
</rss>
