<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Evaluation on Tenu Tech Brief</title>
    <link>https://cluster-site.onrender.com/tags/evaluation/</link>
    <description>Recent content in Evaluation on Tenu Tech Brief</description>
    <generator>Hugo -- 0.146.0</generator>
    <language>en-us</language>
    <lastBuildDate>Thu, 26 Feb 2026 06:03:06 +0000</lastBuildDate>
    <atom:link href="https://cluster-site.onrender.com/tags/evaluation/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>TRACE: Trajectory-Aware Comprehensive Evaluation for Deep Research Agents</title>
      <link>https://cluster-site.onrender.com/posts/trace-trajectory-aware-comprehensive-evaluation-for-deep-research-agents/</link>
      <pubDate>Thu, 26 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/trace-trajectory-aware-comprehensive-evaluation-for-deep-research-agents/</guid>
      <description>• Computer Science &gt; Computation and Language [Submitted on 5 Feb 2026]</description>
    </item>
    <item>
      <title>CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation</title>
      <link>https://cluster-site.onrender.com/posts/causalreasoningbenchmark-a-real-world-benchmark-for-disentangled-evaluation-of-causal-identification-and-estimation/</link>
      <pubDate>Wed, 25 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/causalreasoningbenchmark-a-real-world-benchmark-for-disentangled-evaluation-of-causal-identification-and-estimation/</guid>
      <description>• Computer Science &gt; Artificial Intelligence [Submitted on 24 Feb 2026]</description>
    </item>
    <item>
      <title>DREAM: Deep Research Evaluation with Agentic Metrics</title>
      <link>https://cluster-site.onrender.com/posts/dream-deep-research-evaluation-with-agentic-metrics/</link>
      <pubDate>Tue, 24 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/dream-deep-research-evaluation-with-agentic-metrics/</guid>
      <description>• DREAM introduces agentic evaluation for AI research agents, addressing lack of ground truth. • Highlights Mirage of Synthesis: surface fluency can mask factual and reasoning flaw</description>
    </item>
    <item>
      <title>A Systematic Evaluation of the Potential of Carbon-Aware Execution for Scientific Workflows</title>
      <link>https://cluster-site.onrender.com/posts/a-systematic-evaluation-of-the-potential-of-carbon-aware-execution-for-scientific-workflows/</link>
      <pubDate>Mon, 23 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/a-systematic-evaluation-of-the-potential-of-carbon-aware-execution-for-scientific-workflows/</guid>
      <description>• Computer Science &gt; Distributed, Parallel, and Cluster Computing [Submitted on 20 Aug 2025 (v1), last revised 20 Feb 2026 (this version, v3)]</description>
    </item>
    <item>
      <title>Simple Baselines are Competitive with Code Evolution</title>
      <link>https://cluster-site.onrender.com/posts/simple-baselines-are-competitive-with-code-evolution/</link>
      <pubDate>Fri, 20 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/simple-baselines-are-competitive-with-code-evolution/</guid>
      <description>• Code evolution uses LLMs to mutate code, yet lacks baseline comparisons. • Authors test two simple baselines across math bounds, agentic scaffolds, and ML contests. • Baselines m</description>
    </item>
    <item>
      <title>A Methodology for Identifying Evaluation Items for Practical Dialogue Systems Based on Business-Dialogue System Alignment Models</title>
      <link>https://cluster-site.onrender.com/posts/a-methodology-for-identifying-evaluation-items-for-practical-dialogue-systems-based-on-business-dialogue-system-alignment-models/</link>
      <pubDate>Thu, 19 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/a-methodology-for-identifying-evaluation-items-for-practical-dialogue-systems-based-on-business-dialogue-system-alignment-models/</guid>
      <description>• Computer Science &gt; Human-Computer Interaction [Submitted on 10 Jan 2026]</description>
    </item>
    <item>
      <title>Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents</title>
      <link>https://cluster-site.onrender.com/posts/toward-scalable-verifiable-reward-proxy-state-based-evaluation-for-multi-turn-tool-calling-llm-agents/</link>
      <pubDate>Thu, 19 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/toward-scalable-verifiable-reward-proxy-state-based-evaluation-for-multi-turn-tool-calling-llm-agents/</guid>
      <description>• Computer Science &gt; Artificial Intelligence [Submitted on 18 Feb 2026]</description>
    </item>
    <item>
      <title>BotzoneBench: Scalable LLM Evaluation via Graded AI Anchors</title>
      <link>https://cluster-site.onrender.com/posts/botzonebench-scalable-llm-evaluation-via-graded-ai-anchors/</link>
      <pubDate>Tue, 17 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/botzonebench-scalable-llm-evaluation-via-graded-ai-anchors/</guid>
      <description>• Computer Science &gt; Artificial Intelligence [Submitted on 22 Jan 2026]</description>
    </item>
    <item>
      <title>PlotChain: Deterministic Checkpointed Evaluation of Multimodal LLMs on Engineering Plot Reading</title>
      <link>https://cluster-site.onrender.com/posts/plotchain-deterministic-checkpointed-evaluation-of-multimodal-llms-on-engineering-plot-reading/</link>
      <pubDate>Tue, 17 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/plotchain-deterministic-checkpointed-evaluation-of-multimodal-llms-on-engineering-plot-reading/</guid>
      <description>• Computer Science &gt; Artificial Intelligence [Submitted on 29 Jan 2026]</description>
    </item>
    <item>
      <title>Group Note Draft: W3C Accessibility Guidelines Evaluation Methodology (WCAG-EM) 2.0</title>
      <link>https://cluster-site.onrender.com/posts/group-note-draft-w3c-accessibility-guidelines-evaluation-methodology-wcag-em-2.0/</link>
      <pubDate>Thu, 05 Feb 2026 04:21:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/group-note-draft-w3c-accessibility-guidelines-evaluation-methodology-wcag-em-2.0/</guid>
      <description>• The Accessibility Guidelines Working Group has published the first draft of a Group Note titled W3C Accessibility Guidelines Evaluation Methodology (WCAG-EM) 2.0. • WCAG-EM desc</description>
    </item>
    <item>
      <title>Community Evals: Because we&#39;re done trusting black-box leaderboards over the community</title>
      <link>https://cluster-site.onrender.com/posts/community-evals-because-were-done-trusting-black-box-leaderboards-over-the-community/</link>
      <pubDate>Wed, 04 Feb 2026 00:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/community-evals-because-were-done-trusting-black-box-leaderboards-over-the-community/</guid>
      <description>• Evaluation metrics saturated; MMLU &gt;91%, GSM8K &gt;94%, yet real‑world tasks still fail. • Inconsistent benchmark scores across papers, model cards, and platforms create no single t</description>
    </item>
    <item>
      <title>CAISI Evaluation of DeepSeek AI Models Finds Shortcomings and Risks</title>
      <link>https://cluster-site.onrender.com/posts/caisi-evaluation-of-deepseek-ai-models-finds-shortcomings-and-risks/</link>
      <pubDate>Tue, 30 Sep 2025 12:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/caisi-evaluation-of-deepseek-ai-models-finds-shortcomings-and-risks/</guid>
    </item>
  </channel>
</rss>
