Evaluation on Tenu Tech Brief

Recent content in Evaluation on Tenu Tech Brief
https://cluster-site.onrender.com/tags/evaluation/
Hugo -- 0.146.0 · en-us · Thu, 26 Feb 2026 06:03:06 +0000

TRACE: Trajectory-Aware Comprehensive Evaluation for Deep Research Agents
https://cluster-site.onrender.com/posts/trace-trajectory-aware-comprehensive-evaluation-for-deep-research-agents/
Thu, 26 Feb 2026 05:00:00 +0000
• Computer Science > Computation and Language [Submitted on 5 Feb 2026]

CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation
https://cluster-site.onrender.com/posts/causalreasoningbenchmark-a-real-world-benchmark-for-disentangled-evaluation-of-causal-identification-and-estimation/
Wed, 25 Feb 2026 05:00:00 +0000
• Computer Science > Artificial Intelligence [Submitted on 24 Feb 2026]

DREAM: Deep Research Evaluation with Agentic Metrics
https://cluster-site.onrender.com/posts/dream-deep-research-evaluation-with-agentic-metrics/
Tue, 24 Feb 2026 05:00:00 +0000
• DREAM introduces agentic evaluation for AI research agents, addressing lack of ground truth.
• Highlights Mirage of Synthesis: surface fluency can mask factual and reasoning flaw

A Systematic Evaluation of the Potential of Carbon-Aware Execution for Scientific Workflows
https://cluster-site.onrender.com/posts/a-systematic-evaluation-of-the-potential-of-carbon-aware-execution-for-scientific-workflows/
Mon, 23 Feb 2026 05:00:00 +0000
• Computer Science > Distributed, Parallel, and Cluster Computing [Submitted on 20 Aug 2025 (v1), last revised 20 Feb 2026 (this version, v3)]

Simple Baselines are Competitive with Code Evolution
https://cluster-site.onrender.com/posts/simple-baselines-are-competitive-with-code-evolution/
Fri, 20 Feb 2026 05:00:00 +0000
• Code evolution uses LLMs to mutate code, yet lacks baseline comparisons.
• Authors test two simple baselines across math bounds, agentic scaffolds, and ML contests.
• Baselines m

A Methodology for Identifying Evaluation Items for Practical Dialogue Systems Based on Business-Dialogue System Alignment Models
https://cluster-site.onrender.com/posts/a-methodology-for-identifying-evaluation-items-for-practical-dialogue-systems-based-on-business-dialogue-system-alignment-models/
Thu, 19 Feb 2026 05:00:00 +0000
• Computer Science > Human-Computer Interaction [Submitted on 10 Jan 2026]

Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents
https://cluster-site.onrender.com/posts/toward-scalable-verifiable-reward-proxy-state-based-evaluation-for-multi-turn-tool-calling-llm-agents/
Thu, 19 Feb 2026 05:00:00 +0000
• Computer Science > Artificial Intelligence [Submitted on 18 Feb 2026]

BotzoneBench: Scalable LLM Evaluation via Graded AI Anchors
https://cluster-site.onrender.com/posts/botzonebench-scalable-llm-evaluation-via-graded-ai-anchors/
Tue, 17 Feb 2026 05:00:00 +0000
• Computer Science > Artificial Intelligence [Submitted on 22 Jan 2026]

PlotChain: Deterministic Checkpointed Evaluation of Multimodal LLMs on Engineering Plot Reading
https://cluster-site.onrender.com/posts/plotchain-deterministic-checkpointed-evaluation-of-multimodal-llms-on-engineering-plot-reading/
Tue, 17 Feb 2026 05:00:00 +0000
• Computer Science > Artificial Intelligence [Submitted on 29 Jan 2026]

Group Note Draft: W3C Accessibility Guidelines Evaluation Methodology (WCAG-EM) 2.0
https://cluster-site.onrender.com/posts/group-note-draft-w3c-accessibility-guidelines-evaluation-methodology-wcag-em-2.0/
Thu, 05 Feb 2026 04:21:00 +0000
• The Accessibility Guidelines Working Group has published the first draft of a Group Note titled W3C Accessibility Guidelines Evaluation Methodology (WCAG-EM) 2.0.
• WCAG-EM desc

Community Evals: Because we're done trusting black-box leaderboards over the community
https://cluster-site.onrender.com/posts/community-evals-because-were-done-trusting-black-box-leaderboards-over-the-community/
Wed, 04 Feb 2026 00:00:00 +0000
• Evaluation metrics saturated; MMLU >91%, GSM8K >94%, yet real‑world tasks still fail.
• Inconsistent benchmark scores across papers, model cards, and platforms create no single t

CAISI Evaluation of DeepSeek AI Models Finds Shortcomings and Risks
https://cluster-site.onrender.com/posts/caisi-evaluation-of-deepseek-ai-models-finds-shortcomings-and-risks/
Tue, 30 Sep 2025 12:00:00 +0000