<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Evaluate on Tenu Tech Brief</title>
    <link>https://cluster-site.onrender.com/tags/evaluate/</link>
    <description>Recent content in Evaluate on Tenu Tech Brief</description>
    <generator>Hugo -- 0.146.0</generator>
    <language>en-us</language>
    <lastBuildDate>Wed, 25 Feb 2026 07:59:04 +0000</lastBuildDate>
    <atom:link href="https://cluster-site.onrender.com/tags/evaluate/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Why we no longer evaluate SWE-bench Verified</title>
      <link>https://cluster-site.onrender.com/posts/why-we-no-longer-evaluate-swe-bench-verified/</link>
      <pubDate>Mon, 23 Feb 2026 11:00:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/why-we-no-longer-evaluate-swe-bench-verified/</guid>
      <description>• SWE-bench Verified is increasingly contaminated and mismeasures frontier coding progress. • Our analysis shows flawed tests and training leakage. • We recommend SWE-bench Pro.</description>
    </item>
    <item>
      <title>The Peace-Athabasca Delta is at risk. Here&#39;s what we can do to evaluate the threats</title>
      <link>https://cluster-site.onrender.com/posts/the-peace-athabasca-delta-is-at-risk.-heres-what-we-can-do-to-evaluate-the-threats/</link>
      <pubDate>Wed, 18 Feb 2026 17:26:25 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/the-peace-athabasca-delta-is-at-risk.-heres-what-we-can-do-to-evaluate-the-threats/</guid>
      <description>• River deltas are among the most complex and productive environments on Earth. • Yet, they face serious threats from upstream industrialization and climate change, which alter sup</description>
    </item>
    <item>
      <title>How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark</title>
      <link>https://cluster-site.onrender.com/posts/how-to-evaluate-jailbreak-methods-a-case-study-with-the-strongreject-benchmark/</link>
      <pubDate>Wed, 28 Aug 2024 15:30:00 +0000</pubDate>
      <guid>https://cluster-site.onrender.com/posts/how-to-evaluate-jailbreak-methods-a-case-study-with-the-strongreject-benchmark/</guid>
      <description>• When we began studying jailbreak evaluations, we found a fascinating paper claiming that you could jailbreak frontier LLMs simply by translating forbidden prompts into obscure la</description>
    </item>
  </channel>
</rss>
