Benchmark on Tenu Tech Brief

https://cluster-site.onrender.com/tags/benchmark/
Recent content in Benchmark on Tenu Tech Brief
Generator: Hugo -- 0.146.0
Language: en-us
Last build: Thu, 26 Feb 2026 06:03:06 +0000

ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices
https://cluster-site.onrender.com/posts/proactivemobile-a-comprehensive-benchmark-for-boosting-proactive-intelligence-on-mobile-devices/
Thu, 26 Feb 2026 05:00:00 +0000
• Computer Science > Artificial Intelligence, submitted on 25 Feb 2026

CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation
https://cluster-site.onrender.com/posts/cage-a-framework-for-culturally-adaptive-red-teaming-benchmark-generation/
Wed, 25 Feb 2026 05:00:00 +0000
• Computer Science > Computers and Society, submitted on 9 Feb 2026

CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation
https://cluster-site.onrender.com/posts/causalreasoningbenchmark-a-real-world-benchmark-for-disentangled-evaluation-of-causal-identification-and-estimation/
Wed, 25 Feb 2026 05:00:00 +0000
• Computer Science > Artificial Intelligence, submitted on 24 Feb 2026

PreScience: A Benchmark for Forecasting Scientific Contributions
https://cluster-site.onrender.com/posts/prescience-a-benchmark-for-forecasting-scientific-contributions/
Wed, 25 Feb 2026 05:00:00 +0000
• Computer Science > Artificial Intelligence, submitted on 24 Feb 2026
• Abstract: Can AI systems train

ABD: Default Exception Abduction in Finite First Order Worlds
https://cluster-site.onrender.com/posts/abd-default-exception-abduction-in-finite-first-order-worlds/
Tue, 24 Feb 2026 05:00:00 +0000
• ABD benchmark tests default‑exception abduction in finite first‑order logical worlds.
• Models generate sparse exception formulas to restore satisfiability under abnormality pred

Agentic AI for Scalable and Robust Optical Systems Control
https://cluster-site.onrender.com/posts/agentic-ai-for-scalable-and-robust-optical-systems-control/
Tue, 24 Feb 2026 05:00:00 +0000
• AgentOptics framework uses agentic AI for autonomous optical system control via MCP.
• Interprets natural language tasks, executes protocol-compliant actions across heterogeneous

Asymptotic Semantic Collapse in Hierarchical Optimization
https://cluster-site.onrender.com/posts/asymptotic-semantic-collapse-in-hierarchical-optimization/
Tue, 24 Feb 2026 05:00:00 +0000
• Asymptotic Semantic Collapse: dominant context absorbs individual semantics in multi‑agent language systems.
• Dominant Anchor Node with infinite inertia drives asymptotic alignm

INDUCTION: Finite-Structure Concept Synthesis in First-Order Logic
https://cluster-site.onrender.com/posts/induction-finite-structure-concept-synthesis-in-first-order-logic/
Tue, 24 Feb 2026 05:00:00 +0000
• INDUCTION benchmark tests finite-structure concept synthesis in first‑order logic across small relational worlds.
• Models output a single logical formula that uniformly explains

INSURE-Dial: A Phase-Aware Conversational Dataset & Benchmark for Compliance Verification and Phase Detection
https://cluster-site.onrender.com/posts/insure-dial-a-phase-aware-conversational-dataset-%5C-benchmark-for-compliance-verification-and-phase-detection/
Tue, 24 Feb 2026 05:00:00 +0000
• INSURE‑Dial is the first public benchmark for compliance‑aware voice agents in insurance calls.
• Corpus contains 50 real AI‑initiated calls and 1,000 synthetic calls, averaging

ReportLogic: Evaluating Logical Quality in Deep Research Reports
https://cluster-site.onrender.com/posts/reportlogic-evaluating-logical-quality-in-deep-research-reports/
Tue, 24 Feb 2026 05:00:00 +0000
• LLMs increasingly synthesize research into structured reports, but logical reliability remains unassessed.
• ReportLogic benchmark quantifies report‑level logical quality for dee

AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding
https://cluster-site.onrender.com/posts/amuse-audio-visual-benchmark-and-alignment-framework-for-agentic-multi-speaker-understanding/
Tue, 24 Feb 2026 00:00:00 +0000

IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages
https://cluster-site.onrender.com/posts/indicjr-a-judge-free-benchmark-of-jailbreak-robustness-in-south-asian-languages/
Fri, 20 Feb 2026 05:00:00 +0000
• Computer Science > Artificial Intelligence, submitted on 18 Feb 2026

When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation
https://cluster-site.onrender.com/posts/when-ai-benchmarks-plateau-a-systematic-study-of-benchmark-saturation/
Fri, 20 Feb 2026 05:00:00 +0000
• Computer Science > Artificial Intelligence, submitted on 18 Feb 2026

How Uncertain Is the Grade? A Benchmark of Uncertainty Metrics for LLM-Based Automatic Assessment
https://cluster-site.onrender.com/posts/how-uncertain-is-the-grade-a-benchmark-of-uncertainty-metrics-for-llm-based-automatic-assessment/
Thu, 19 Feb 2026 05:00:00 +0000
• Computer Science > Artificial Intelligence, submitted on 17 Feb 2026

Benchmark of 1.4 million checked protein structures could sharpen AI predictions
https://cluster-site.onrender.com/posts/benchmark-of-1.4-million-checked-protein-structures-could-sharpen-ai-predictions/
Wed, 18 Feb 2026 22:10:01 +0000
• University of Missouri researchers have released the world’s largest collection of protein models with quality assessment, a groundbreaking new resource that could accelerate drug

AI system TongGeometry generates and solves olympiad-level geometry problems
https://cluster-site.onrender.com/posts/ai-system-tonggeometry-generates-and-solves-olympiad-level-geometry-problems/
Tue, 17 Feb 2026 17:20:01 +0000
• TongGeometry, an AI system, autonomously generates olympiad-level geometry problems for high-level competition.
• It also solves these problems, matching human expert accuracy an

Benchmark cuts Metaplanet target, says earnings show 'promise and peril' of bitcoin pivot
https://cluster-site.onrender.com/posts/benchmark-cuts-metaplanet-target-says-earnings-show-promise-and-peril-of-bitcoin-pivot/
Tue, 17 Feb 2026 17:05:57 +0000
• Analysts say Metaplanet’s bitcoin-linked income business is becoming critical to funding expansion while avoiding forced BTC sales.

TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks
https://cluster-site.onrender.com/posts/temporalbench-a-benchmark-for-evaluating-llm-based-agents-on-contextual-and-event-informed-time-series-tasks/
Tue, 17 Feb 2026 05:00:00 +0000
• Computer Science > Artificial Intelligence, submitted on 5 Feb 2026
• TemporalBench offers a multi-domain benchmark for temporal reasoning in LLM agents.
• Four-tier taxonomy tests historical structure, context-free, contextual, and event-condition

Benchmark cuts Coinbase price target by 37% but says business is 'more diversified and durable' than ever
https://cluster-site.onrender.com/posts/benchmark-cuts-coinbase-price-target-by-37-but-says-business-is-more-diversified-and-durable-than-ever/
Sat, 14 Feb 2026 22:12:54 +0000
• Benchmark’s Mark Palmer cut his COIN price target to $267 from $421, citing worsening crypto market conditions, while reiterating a buy rating.

Alyah ⭐️: Toward Robust Evaluation of Emirati Dialect Capabilities in Arabic LLMs
https://cluster-site.onrender.com/posts/alyah-%EF%B8%8F-toward-robust-evaluation-of-emirati-dialect-capabilities-in-arabic-llms/
Tue, 27 Jan 2026 10:26:42 +0000
• Arabic LLMs often evaluated only on Modern Standard Arabic, neglecting dialects.
• Dialects like Emirati carry unique vocabulary, syntax, and cultural context.
• Existing benchma

AssetOpsBench: Bridging the Gap Between AI Agent Benchmarks and Industrial Reality
https://cluster-site.onrender.com/posts/assetopsbench-bridging-the-gap-between-ai-agent-benchmarks-and-industrial-reality/
Wed, 21 Jan 2026 06:25:31 +0000
• AssetOpsBench introduces a multi-agent benchmark for industrial asset lifecycle management.
• Covers 2.3M sensor telemetry points and 4.2K work orders.
• Evaluates agents across

How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark
https://cluster-site.onrender.com/posts/how-to-evaluate-jailbreak-methods-a-case-study-with-the-strongreject-benchmark/
Wed, 28 Aug 2024 15:30:00 +0000
• When we began studying jailbreak evaluations, we found a fascinating paper claiming that you could jailbreak frontier LLMs simply by translating forbidden prompts into obscure la
• Researchers tested jailbreak via Scots Gaelic translation, initially replicating 43% success claim.
• GPT-4 responded with bomb instructions in Gaelic, but full output differed f

Are We Ready for Multi-Image Reasoning? Launching VHs: The Visual Haystacks Benchmark!
https://cluster-site.onrender.com/posts/are-we-ready-for-multi-image-reasoning-launching-vhs-the-visual-haystacks-benchmark/
Sat, 20 Jul 2024 09:00:00 +0000
• Humans can sift through thousands of images, spotting subtle patterns, a skill AI still struggles to match.
• Traditional VQA systems answer questions about single images, missing