<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Benchmark on Tenu Tech Brief</title>
    <link>https://cluster-site.onrender.com/tags/benchmark/</link>
    <description>Recent content in Benchmark on Tenu Tech Brief</description>
    <generator>Hugo -- 0.146.0</generator>
    <language>en-us</language>
    <lastBuildDate>Thu, 26 Feb 2026 06:03:06 +0000</lastBuildDate>
    <atom:link href="https://cluster-site.onrender.com/tags/benchmark/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices</title>
      <link>https://cluster-site.onrender.com/posts/proactivemobile-a-comprehensive-benchmark-for-boosting-proactive-intelligence-on-mobile-devices/</link>
      <pubDate>Thu, 26 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/proactivemobile-a-comprehensive-benchmark-for-boosting-proactive-intelligence-on-mobile-devices/</guid>
      <description>• Computer Science &amp;gt; Artificial Intelligence [Submitted on 25 Feb 2026] ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices</description>
    </item>
    <item>
      <title>CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation</title>
      <link>https://cluster-site.onrender.com/posts/cage-a-framework-for-culturally-adaptive-red-teaming-benchmark-generation/</link>
      <pubDate>Wed, 25 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/cage-a-framework-for-culturally-adaptive-red-teaming-benchmark-generation/</guid>
      <description>• Computer Science &amp;gt; Computers and Society [Submitted on 9 Feb 2026] CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation</description>
    </item>
    <item>
      <title>CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation</title>
      <link>https://cluster-site.onrender.com/posts/causalreasoningbenchmark-a-real-world-benchmark-for-disentangled-evaluation-of-causal-identification-and-estimation/</link>
      <pubDate>Wed, 25 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/causalreasoningbenchmark-a-real-world-benchmark-for-disentangled-evaluation-of-causal-identification-and-estimation/</guid>
      <description>• Computer Science &amp;gt; Artificial Intelligence [Submitted on 24 Feb 2026] CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation</description>
    </item>
    <item>
      <title>PreScience: A Benchmark for Forecasting Scientific Contributions</title>
      <link>https://cluster-site.onrender.com/posts/prescience-a-benchmark-for-forecasting-scientific-contributions/</link>
      <pubDate>Wed, 25 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/prescience-a-benchmark-for-forecasting-scientific-contributions/</guid>
      <description>• Computer Science &amp;gt; Artificial Intelligence [Submitted on 24 Feb 2026] PreScience: A Benchmark for Forecasting Scientific Contributions. Abstract: Can AI systems train</description>
    </item>
    <item>
      <title>ABD: Default Exception Abduction in Finite First Order Worlds</title>
      <link>https://cluster-site.onrender.com/posts/abd-default-exception-abduction-in-finite-first-order-worlds/</link>
      <pubDate>Tue, 24 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/abd-default-exception-abduction-in-finite-first-order-worlds/</guid>
      <description>• ABD benchmark tests default‑exception abduction in finite first‑order logical worlds. • Models generate sparse exception formulas to restore satisfiability under abnormality pred</description>
    </item>
    <item>
      <title>Agentic AI for Scalable and Robust Optical Systems Control</title>
      <link>https://cluster-site.onrender.com/posts/agentic-ai-for-scalable-and-robust-optical-systems-control/</link>
      <pubDate>Tue, 24 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/agentic-ai-for-scalable-and-robust-optical-systems-control/</guid>
      <description>• AgentOptics framework uses agentic AI for autonomous optical system control via MCP. • Interprets natural language tasks, executes protocol-compliant actions across heterogeneous</description>
    </item>
    <item>
      <title>Asymptotic Semantic Collapse in Hierarchical Optimization</title>
      <link>https://cluster-site.onrender.com/posts/asymptotic-semantic-collapse-in-hierarchical-optimization/</link>
      <pubDate>Tue, 24 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/asymptotic-semantic-collapse-in-hierarchical-optimization/</guid>
      <description>• Asymptotic Semantic Collapse: dominant context absorbs individual semantics in multi‑agent language systems. • Dominant Anchor Node with infinite inertia drives asymptotic alignm</description>
    </item>
    <item>
      <title>INDUCTION: Finite-Structure Concept Synthesis in First-Order Logic</title>
      <link>https://cluster-site.onrender.com/posts/induction-finite-structure-concept-synthesis-in-first-order-logic/</link>
      <pubDate>Tue, 24 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/induction-finite-structure-concept-synthesis-in-first-order-logic/</guid>
      <description>• INDUCTION benchmark tests finite-structure concept synthesis in first‑order logic across small relational worlds. • Models output a single logical formula that uniformly explains</description>
    </item>
    <item>
      <title>INSURE-Dial: A Phase-Aware Conversational Dataset &amp; Benchmark for Compliance Verification and Phase Detection</title>
      <link>https://cluster-site.onrender.com/posts/insure-dial-a-phase-aware-conversational-dataset-%5C-benchmark-for-compliance-verification-and-phase-detection/</link>
      <pubDate>Tue, 24 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/insure-dial-a-phase-aware-conversational-dataset-%5C-benchmark-for-compliance-verification-and-phase-detection/</guid>
      <description>• INSURE‑Dial is the first public benchmark for compliance‑aware voice agents in insurance calls. • Corpus contains 50 real AI‑initiated calls and 1,000 synthetic calls, averaging</description>
    </item>
    <item>
      <title>ReportLogic: Evaluating Logical Quality in Deep Research Reports</title>
      <link>https://cluster-site.onrender.com/posts/reportlogic-evaluating-logical-quality-in-deep-research-reports/</link>
      <pubDate>Tue, 24 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/reportlogic-evaluating-logical-quality-in-deep-research-reports/</guid>
      <description>• LLMs increasingly synthesize research into structured reports, but logical reliability remains unassessed. • ReportLogic benchmark quantifies report‑level logical quality for dee</description>
    </item>
    <item>
      <title>AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding</title>
      <link>https://cluster-site.onrender.com/posts/amuse-audio-visual-benchmark-and-alignment-framework-for-agentic-multi-speaker-understanding/</link>
      <pubDate>Tue, 24 Feb 2026 00:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/amuse-audio-visual-benchmark-and-alignment-framework-for-agentic-multi-speaker-understanding/</guid>
      <description>• AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding</description>
    </item>
    <item>
      <title>IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages</title>
      <link>https://cluster-site.onrender.com/posts/indicjr-a-judge-free-benchmark-of-jailbreak-robustness-in-south-asian-languages/</link>
      <pubDate>Fri, 20 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/indicjr-a-judge-free-benchmark-of-jailbreak-robustness-in-south-asian-languages/</guid>
      <description>• Computer Science &amp;gt; Artificial Intelligence [Submitted on 18 Feb 2026] IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages</description>
    </item>
    <item>
      <title>When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation</title>
      <link>https://cluster-site.onrender.com/posts/when-ai-benchmarks-plateau-a-systematic-study-of-benchmark-saturation/</link>
      <pubDate>Fri, 20 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/when-ai-benchmarks-plateau-a-systematic-study-of-benchmark-saturation/</guid>
      <description>• Computer Science &amp;gt; Artificial Intelligence [Submitted on 18 Feb 2026] When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation</description>
    </item>
    <item>
      <title>How Uncertain Is the Grade? A Benchmark of Uncertainty Metrics for LLM-Based Automatic Assessment</title>
      <link>https://cluster-site.onrender.com/posts/how-uncertain-is-the-grade-a-benchmark-of-uncertainty-metrics-for-llm-based-automatic-assessment/</link>
      <pubDate>Thu, 19 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/how-uncertain-is-the-grade-a-benchmark-of-uncertainty-metrics-for-llm-based-automatic-assessment/</guid>
      <description>• Computer Science &amp;gt; Artificial Intelligence [Submitted on 17 Feb 2026] How Uncertain Is the Grade? A Benchmark of Uncertainty Metrics for LLM-Based Automatic Assessment</description>
    </item>
    <item>
      <title>Benchmark of 1.4 million checked protein structures could sharpen AI predictions</title>
      <link>https://cluster-site.onrender.com/posts/benchmark-of-1.4-million-checked-protein-structures-could-sharpen-ai-predictions/</link>
      <pubDate>Wed, 18 Feb 2026 22:10:01 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/benchmark-of-1.4-million-checked-protein-structures-could-sharpen-ai-predictions/</guid>
      <description>• University of Missouri researchers have released the world&amp;rsquo;s largest collection of protein models with quality assessment, a groundbreaking new resource that could accelerate drug</description>
    </item>
    <item>
      <title>AI system TongGeometry generates and solves olympiad-level geometry problems</title>
      <link>https://cluster-site.onrender.com/posts/ai-system-tonggeometry-generates-and-solves-olympiad-level-geometry-problems/</link>
      <pubDate>Tue, 17 Feb 2026 17:20:01 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/ai-system-tonggeometry-generates-and-solves-olympiad-level-geometry-problems/</guid>
      <description>• TongGeometry, an AI system, autonomously generates olympiad-level geometry problems for high-level competition. • It also solves these problems, matching human expert accuracy an</description>
    </item>
    <item>
      <title>Benchmark cuts Metaplanet target, says earnings show &#39;promise and peril&#39; of bitcoin pivot</title>
      <link>https://cluster-site.onrender.com/posts/benchmark-cuts-metaplanet-target-says-earnings-show-promise-and-peril-of-bitcoin-pivot/</link>
      <pubDate>Tue, 17 Feb 2026 17:05:57 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/benchmark-cuts-metaplanet-target-says-earnings-show-promise-and-peril-of-bitcoin-pivot/</guid>
      <description>• Analysts say Metaplanet&amp;rsquo;s bitcoin-linked income business is becoming critical to funding expansion while avoiding forced BTC sales.</description>
    </item>
    <item>
      <title>TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks</title>
      <link>https://cluster-site.onrender.com/posts/temporalbench-a-benchmark-for-evaluating-llm-based-agents-on-contextual-and-event-informed-time-series-tasks/</link>
      <pubDate>Tue, 17 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/temporalbench-a-benchmark-for-evaluating-llm-based-agents-on-contextual-and-event-informed-time-series-tasks/</guid>
      <description>• Computer Science &amp;gt; Artificial Intelligence [Submitted on 5 Feb 2026] TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks</description>
    </item>
    <item>
      <title>TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks</title>
      <link>https://cluster-site.onrender.com/posts/temporalbench-a-benchmark-for-evaluating-llm-based-agents-on-contextual-and-event-informed-time-series-tasks/</link>
      <pubDate>Tue, 17 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/temporalbench-a-benchmark-for-evaluating-llm-based-agents-on-contextual-and-event-informed-time-series-tasks/</guid>
      <description>• TemporalBench offers a multi-domain benchmark for temporal reasoning in LLM agents. • Four-tier taxonomy tests historical structure, context-free, contextual, and event-condition</description>
    </item>
    <item>
      <title>Benchmark cuts Coinbase price target by 37% but says business is &#39;more diversified and durable&#39; than ever</title>
      <link>https://cluster-site.onrender.com/posts/benchmark-cuts-coinbase-price-target-by-37-but-says-business-is-more-diversified-and-durable-than-ever/</link>
      <pubDate>Sat, 14 Feb 2026 22:12:54 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/benchmark-cuts-coinbase-price-target-by-37-but-says-business-is-more-diversified-and-durable-than-ever/</guid>
      <description>• Benchmark&amp;rsquo;s Mark Palmer cut his COIN price target to $267 from $421, citing worsening crypto market conditions, while reiterating a buy rating.</description>
    </item>
    <item>
      <title>Alyah ⭐️: Toward Robust Evaluation of Emirati Dialect Capabilities in Arabic LLMs</title>
      <link>https://cluster-site.onrender.com/posts/alyah-%EF%B8%8F-toward-robust-evaluation-of-emirati-dialect-capabilities-in-arabic-llms/</link>
      <pubDate>Tue, 27 Jan 2026 10:26:42 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/alyah-%EF%B8%8F-toward-robust-evaluation-of-emirati-dialect-capabilities-in-arabic-llms/</guid>
      <description>• Arabic LLMs often evaluated only on Modern Standard Arabic, neglecting dialects. • Dialects like Emirati carry unique vocabulary, syntax, and cultural context. • Existing benchma</description>
    </item>
    <item>
      <title>AssetOpsBench: Bridging the Gap Between AI Agent Benchmarks and Industrial Reality</title>
      <link>https://cluster-site.onrender.com/posts/assetopsbench-bridging-the-gap-between-ai-agent-benchmarks-and-industrial-reality/</link>
      <pubDate>Wed, 21 Jan 2026 06:25:31 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/assetopsbench-bridging-the-gap-between-ai-agent-benchmarks-and-industrial-reality/</guid>
      <description>• AssetOpsBench introduces a multi-agent benchmark for industrial asset lifecycle management. • Covers 2.3M sensor telemetry points and 4.2K work orders. • Evaluates agents across</description>
    </item>
    <item>
      <title>How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark</title>
      <link>https://cluster-site.onrender.com/posts/how-to-evaluate-jailbreak-methods-a-case-study-with-the-strongreject-benchmark/</link>
      <pubDate>Wed, 28 Aug 2024 15:30:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/how-to-evaluate-jailbreak-methods-a-case-study-with-the-strongreject-benchmark/</guid>
      <description>• When we began studying jailbreak evaluations, we found a fascinating paper claiming that you could jailbreak frontier LLMs simply by translating forbidden prompts into obscure la</description>
    </item>
    <item>
      <title>How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark</title>
      <link>https://cluster-site.onrender.com/posts/how-to-evaluate-jailbreak-methods-a-case-study-with-the-strongreject-benchmark/</link>
      <pubDate>Wed, 28 Aug 2024 15:30:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/how-to-evaluate-jailbreak-methods-a-case-study-with-the-strongreject-benchmark/</guid>
      <description>• Researchers tested jailbreak via Scots Gaelic translation, initially replicating 43% success claim. • GPT-4 responded with bomb instructions in Gaelic, but full output differed f</description>
    </item>
    <item>
      <title>Are We Ready for Multi-Image Reasoning? Launching VHs: The Visual Haystacks Benchmark!</title>
      <link>https://cluster-site.onrender.com/posts/are-we-ready-for-multi-image-reasoning-launching-vhs-the-visual-haystacks-benchmark/</link>
      <pubDate>Sat, 20 Jul 2024 09:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/are-we-ready-for-multi-image-reasoning-launching-vhs-the-visual-haystacks-benchmark/</guid>
      <description>• Humans can sift through thousands of images, spotting subtle patterns, a skill AI still struggles to match. • Traditional VQA systems answer questions about single images, missing</description>
    </item>
  </channel>
</rss>
