<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Benchmarking on Tenu Tech Brief</title>
    <link>https://cluster-site.onrender.com/tags/benchmarking/</link>
    <description>Recent content in Benchmarking on Tenu Tech Brief</description>
    <generator>Hugo -- 0.146.0</generator>
    <language>en-us</language>
    <lastBuildDate>Wed, 25 Feb 2026 07:59:13 +0000</lastBuildDate>
    <atom:link href="https://cluster-site.onrender.com/tags/benchmarking/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Benchmarking Distilled Language Models: Performance and Efficiency in Resource-Constrained Settings</title>
      <link>https://cluster-site.onrender.com/posts/benchmarking-distilled-language-models-performance-and-efficiency-in-resource-constrained-settings/</link>
      <pubDate>Wed, 25 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/benchmarking-distilled-language-models-performance-and-efficiency-in-resource-constrained-settings/</guid>
      <description>• Computer Science &gt; Computation and Language. Submitted on 28 Jan 2026.</description>
    </item>
    <item>
      <title>Benchmarking Early Deterioration Prediction Across Hospital-Rich and MCI-Like Emergency Triage Under Constrained Sensing</title>
      <link>https://cluster-site.onrender.com/posts/benchmarking-early-deterioration-prediction-across-hospital-rich-and-mci-like-emergency-triage-under-constrained-sensing/</link>
      <pubDate>Wed, 25 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/benchmarking-early-deterioration-prediction-across-hospital-rich-and-mci-like-emergency-triage-under-constrained-sensing/</guid>
      <description>• Computer Science &gt; Computers and Society. Submitted on 9 Feb 2026.</description>
    </item>
    <item>
      <title>Beyond single-channel agentic benchmarking</title>
      <link>https://cluster-site.onrender.com/posts/beyond-single-channel-agentic-benchmarking/</link>
      <pubDate>Tue, 24 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/beyond-single-channel-agentic-benchmarking/</guid>
      <description>• Current AI safety benchmarks assess agents in isolation, ignoring human‑AI interaction dynamics. • Single‑channel evaluation misrepresents operational safety, unlike redundancy‑b</description>
    </item>
    <item>
      <title>DREAM: Deep Research Evaluation with Agentic Metrics</title>
      <link>https://cluster-site.onrender.com/posts/dream-deep-research-evaluation-with-agentic-metrics/</link>
      <pubDate>Tue, 24 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/dream-deep-research-evaluation-with-agentic-metrics/</guid>
      <description>• DREAM introduces agentic evaluation for AI research agents, addressing lack of ground truth. • Highlights Mirage of Synthesis: surface fluency can mask factual and reasoning flaw</description>
    </item>
    <item>
      <title>AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks</title>
      <link>https://cluster-site.onrender.com/posts/agentlab-benchmarking-llm-agents-against-long-horizon-attacks/</link>
      <pubDate>Fri, 20 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/agentlab-benchmarking-llm-agents-against-long-horizon-attacks/</guid>
      <description>• Computer Science &gt; Artificial Intelligence. Submitted on 18 Feb 2026.</description>
    </item>
    <item>
      <title>LLM-WikiRace: Benchmarking Long-term Planning and Reasoning over Real-World Knowledge Graphs</title>
      <link>https://cluster-site.onrender.com/posts/llm-wikirace-benchmarking-long-term-planning-and-reasoning-over-real-world-knowledge-graphs/</link>
      <pubDate>Fri, 20 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/llm-wikirace-benchmarking-long-term-planning-and-reasoning-over-real-world-knowledge-graphs/</guid>
      <description>• Computer Science &gt; Artificial Intelligence. Submitted on 18 Feb 2026.</description>
    </item>
    <item>
      <title>Linux 7.0 Showing Some Early Performance Regressions On Intel Panther Lake</title>
      <link>https://cluster-site.onrender.com/posts/linux-7.0-showing-some-early-performance-regressions-on-intel-panther-lake/</link>
      <pubDate>Wed, 18 Feb 2026 21:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/linux-7.0-showing-some-early-performance-regressions-on-intel-panther-lake/</guid>
      <description>• Linux 7.0 kernel introduces regressions on Intel Panther Lake, reducing CPU and iGPU performance. • Benchmarks on MSI Prestige 14 with Core Ultra X7 358H show slower results than</description>
    </item>
    <item>
      <title>Quantifying construct validity in large language model evaluations</title>
      <link>https://cluster-site.onrender.com/posts/quantifying-construct-validity-in-large-language-model-evaluations/</link>
      <pubDate>Wed, 18 Feb 2026 05:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/quantifying-construct-validity-in-large-language-model-evaluations/</guid>
      <description>• LLM benchmarks often misrepresent true model capabilities due to contamination and annotator errors. • Construct validity is essential to ensure benchmarks truly measure desired</description>
    </item>
    <item>
      <title>LLM Inference Benchmarking - Measure What Matters</title>
      <link>https://cluster-site.onrender.com/posts/llm-inference-benchmarking-measure-what-matters/</link>
      <pubDate>Fri, 06 Feb 2026 14:46:06 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/llm-inference-benchmarking-measure-what-matters/</guid>
      <description>• By Piyush Srivastava, Karnik Modi, Stephen Varela, and Rithish Ramesh. • Production-grade LLM inference is a complex systems challenge, requiring deep co-designs - from hardware pri</description>
    </item>
    <item>
      <title>Advancing AI benchmarking with Game Arena</title>
      <link>https://cluster-site.onrender.com/posts/advancing-ai-benchmarking-with-game-arena/</link>
      <pubDate>Mon, 02 Feb 2026 17:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/advancing-ai-benchmarking-with-game-arena/</guid>
      <description>• DeepMind expands Game Arena, adding Werewolf and poker to benchmark AI beyond perfect-information games. • Werewolf tests social deduction, communication, and deception handling</description>
    </item>
    <item>
      <title>Ceilometer: Uber&#39;s Adaptive Benchmarking Framework</title>
      <link>https://cluster-site.onrender.com/posts/ceilometer-ubers-adaptive-benchmarking-framework/</link>
      <pubDate>Thu, 20 Nov 2025 14:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/ceilometer-ubers-adaptive-benchmarking-framework/</guid>
      <description>• At Uber, scale and reliability define our infrastructure. • Every new server type, kernel upgrade, and configuration change must be rigorously vetted before it touch</description>
    </item>
  </channel>
</rss>
