<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Alexander Barry]]></title><description><![CDATA[Statistical Consultant]]></description><link>https://abstatisticalconsulting.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!2BQe!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2462b60-1b88-4d8c-9ba0-ebdae8d48910_288x288.jpeg</url><title>Alexander Barry</title><link>https://abstatisticalconsulting.substack.com</link></image><generator>Substack</generator><lastBuildDate>Tue, 12 May 2026 07:20:27 GMT</lastBuildDate><atom:link href="https://abstatisticalconsulting.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Alexander Barry]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[abstatisticalconsulting@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[abstatisticalconsulting@substack.com]]></itunes:email><itunes:name><![CDATA[Alexander Barry]]></itunes:name></itunes:owner><itunes:author><![CDATA[Alexander Barry]]></itunes:author><googleplay:owner><![CDATA[abstatisticalconsulting@substack.com]]></googleplay:owner><googleplay:email><![CDATA[abstatisticalconsulting@substack.com]]></googleplay:email><googleplay:author><![CDATA[Alexander Barry]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Predicting GPT 5.5 Time Horizon from its ECI]]></title><description><![CDATA[And comparing the predictiveness for OpenAI and Anthropic LLMs]]></description><link>https://abstatisticalconsulting.substack.com/p/predicting-gpt-55-time-horizon-from</link><guid 
isPermaLink="false">https://abstatisticalconsulting.substack.com/p/predicting-gpt-55-time-horizon-from</guid><dc:creator><![CDATA[Alexander Barry]]></dc:creator><pubDate>Thu, 30 Apr 2026 19:58:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!9z7R!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F899b3c36-4f77-4e3d-bb19-86f160497193_2700x1800.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Note: While I have worked with both METR and Epoch in my capacity as a statistical consultant, this post is based entirely on public information. For more information, or to enquire about hiring me for a project, see <a href="http://abstats.co.uk/">abstats.co.uk</a>.</em></p><p>In my <a href="https://abstatisticalconsulting.substack.com/p/predicting-time-horizon-from-anthropics">last post</a>, I looked at using Anthropic&#8217;s internal version of the Epoch Capabilities Index (&#8216;AECI&#8217;) to predict <a href="https://metr.org/time-horizons/">METR time horizon</a> values for Mythos Preview and Opus 4.7.<br><br>We can do the same using the &#8216;official&#8217; <a href="https://epoch.ai/eci">ECI from Epoch AI</a>. The particular application here is predicting the time horizon for GPT 5.5, which has an ECI value but no official time horizon results yet.</p>
<h1>GPT 5.5 Time Horizon Prediction</h1><p>We will keep the same structure as last time, modelling the logarithm of time horizon as a linear function of ECI, this time using only the OpenAI LLMs:<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9z7R!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F899b3c36-4f77-4e3d-bb19-86f160497193_2700x1800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9z7R!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F899b3c36-4f77-4e3d-bb19-86f160497193_2700x1800.png 424w, https://substackcdn.com/image/fetch/$s_!9z7R!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F899b3c36-4f77-4e3d-bb19-86f160497193_2700x1800.png 848w, https://substackcdn.com/image/fetch/$s_!9z7R!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F899b3c36-4f77-4e3d-bb19-86f160497193_2700x1800.png 1272w, 
https://substackcdn.com/image/fetch/$s_!9z7R!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F899b3c36-4f77-4e3d-bb19-86f160497193_2700x1800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9z7R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F899b3c36-4f77-4e3d-bb19-86f160497193_2700x1800.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/899b3c36-4f77-4e3d-bb19-86f160497193_2700x1800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:333940,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://abstatisticalconsulting.substack.com/i/195874313?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F899b3c36-4f77-4e3d-bb19-86f160497193_2700x1800.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9z7R!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F899b3c36-4f77-4e3d-bb19-86f160497193_2700x1800.png 424w, https://substackcdn.com/image/fetch/$s_!9z7R!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F899b3c36-4f77-4e3d-bb19-86f160497193_2700x1800.png 848w, 
https://substackcdn.com/image/fetch/$s_!9z7R!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F899b3c36-4f77-4e3d-bb19-86f160497193_2700x1800.png 1272w, https://substackcdn.com/image/fetch/$s_!9z7R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F899b3c36-4f77-4e3d-bb19-86f160497193_2700x1800.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>This would place GPT 5.5 behind the current leader, Opus 4.6, on 50% time horizon, but slightly ahead on 80% time horizon, where Gemini 3.1 Pro currently leads with 1h 30m.</p><p>Both values are below my <a href="https://abstatisticalconsulting.substack.com/p/predicting-time-horizon-from-anthropics">previous predictions</a> for Opus 4.7 and Mythos Preview (50% TH of 18.8 and 40.3 hours, and 80% TH of 2.6 and 5.5 hours, respectively).</p><p>Notably, these values for GPT 5.5 are low enough that they should be just about within the limits of what the current TH1.1 task suite can estimate.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>The OpenAI fit also seems somewhat influenced by the relatively low time horizon results for GPT-5.3 Codex and GPT-5.4, which were partially caused by an <a href="https://x.com/METR_Evals/status/2042640545126965441?s=20">unusual number of reward hacking attempts</a>. Removing these two models gives a somewhat different fit:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7cs1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b51d59-7230-4d9a-8600-a123bc0444b2_2700x1800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7cs1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b51d59-7230-4d9a-8600-a123bc0444b2_2700x1800.png 424w, https://substackcdn.com/image/fetch/$s_!7cs1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b51d59-7230-4d9a-8600-a123bc0444b2_2700x1800.png 848w, 
https://substackcdn.com/image/fetch/$s_!7cs1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b51d59-7230-4d9a-8600-a123bc0444b2_2700x1800.png 1272w, https://substackcdn.com/image/fetch/$s_!7cs1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b51d59-7230-4d9a-8600-a123bc0444b2_2700x1800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7cs1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b51d59-7230-4d9a-8600-a123bc0444b2_2700x1800.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/82b51d59-7230-4d9a-8600-a123bc0444b2_2700x1800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:325302,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://abstatisticalconsulting.substack.com/i/195874313?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b51d59-7230-4d9a-8600-a123bc0444b2_2700x1800.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!7cs1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b51d59-7230-4d9a-8600-a123bc0444b2_2700x1800.png 424w, 
https://substackcdn.com/image/fetch/$s_!7cs1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b51d59-7230-4d9a-8600-a123bc0444b2_2700x1800.png 848w, https://substackcdn.com/image/fetch/$s_!7cs1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b51d59-7230-4d9a-8600-a123bc0444b2_2700x1800.png 1272w, https://substackcdn.com/image/fetch/$s_!7cs1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b51d59-7230-4d9a-8600-a123bc0444b2_2700x1800.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<h1>Comparing to AECI Fit</h1><p>We can also compare the results from using the official ECI to predict Opus 4.7&#8217;s time horizon to our previous attempts using the AECI:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Hz5m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc81fa6ae-6444-4360-a086-5552c2577b12_2700x1800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Hz5m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc81fa6ae-6444-4360-a086-5552c2577b12_2700x1800.png 424w, https://substackcdn.com/image/fetch/$s_!Hz5m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc81fa6ae-6444-4360-a086-5552c2577b12_2700x1800.png 848w, https://substackcdn.com/image/fetch/$s_!Hz5m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc81fa6ae-6444-4360-a086-5552c2577b12_2700x1800.png 1272w, https://substackcdn.com/image/fetch/$s_!Hz5m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc81fa6ae-6444-4360-a086-5552c2577b12_2700x1800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Hz5m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc81fa6ae-6444-4360-a086-5552c2577b12_2700x1800.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c81fa6ae-6444-4360-a086-5552c2577b12_2700x1800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:340655,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://abstatisticalconsulting.substack.com/i/195874313?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc81fa6ae-6444-4360-a086-5552c2577b12_2700x1800.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Hz5m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc81fa6ae-6444-4360-a086-5552c2577b12_2700x1800.png 424w, https://substackcdn.com/image/fetch/$s_!Hz5m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc81fa6ae-6444-4360-a086-5552c2577b12_2700x1800.png 848w, https://substackcdn.com/image/fetch/$s_!Hz5m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc81fa6ae-6444-4360-a086-5552c2577b12_2700x1800.png 1272w, https://substackcdn.com/image/fetch/$s_!Hz5m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc81fa6ae-6444-4360-a086-5552c2577b12_2700x1800.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cORk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc50f2dcb-76a4-4156-a380-4fa3bbbbe53b_2700x1800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cORk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc50f2dcb-76a4-4156-a380-4fa3bbbbe53b_2700x1800.png 424w, 
https://substackcdn.com/image/fetch/$s_!cORk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc50f2dcb-76a4-4156-a380-4fa3bbbbe53b_2700x1800.png 848w, https://substackcdn.com/image/fetch/$s_!cORk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc50f2dcb-76a4-4156-a380-4fa3bbbbe53b_2700x1800.png 1272w, https://substackcdn.com/image/fetch/$s_!cORk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc50f2dcb-76a4-4156-a380-4fa3bbbbe53b_2700x1800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cORk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc50f2dcb-76a4-4156-a380-4fa3bbbbe53b_2700x1800.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c50f2dcb-76a4-4156-a380-4fa3bbbbe53b_2700x1800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:351813,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://abstatisticalconsulting.substack.com/i/195874313?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc50f2dcb-76a4-4156-a380-4fa3bbbbe53b_2700x1800.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!cORk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc50f2dcb-76a4-4156-a380-4fa3bbbbe53b_2700x1800.png 424w, https://substackcdn.com/image/fetch/$s_!cORk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc50f2dcb-76a4-4156-a380-4fa3bbbbe53b_2700x1800.png 848w, https://substackcdn.com/image/fetch/$s_!cORk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc50f2dcb-76a4-4156-a380-4fa3bbbbe53b_2700x1800.png 1272w, https://substackcdn.com/image/fetch/$s_!cORk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc50f2dcb-76a4-4156-a380-4fa3bbbbe53b_2700x1800.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>The AECI has slightly higher R<sup>2</sup> values, perhaps because Anthropic&#8217;s internal benchmarks are more SWE-focused than those that make up the overall ECI.</p><p>Using the ECI reduces the predicted time horizons for Opus 4.7 compared to using the AECI. This is largely because Opus 4.7&#8217;s ECI is very similar to that of Opus 4.6, whereas there is a larger gap between their AECI values.</p><h1>All-lab Fit</h1><p>In the above sections I used separate fits for Anthropic and OpenAI LLMs. We can also try a combined fit that uses all LLMs that have both ECI and time horizon values:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cc3B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6026a564-1b18-4b1b-8f47-b0d4c28d0ab3_2700x1800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cc3B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6026a564-1b18-4b1b-8f47-b0d4c28d0ab3_2700x1800.png 424w, https://substackcdn.com/image/fetch/$s_!cc3B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6026a564-1b18-4b1b-8f47-b0d4c28d0ab3_2700x1800.png 848w, 
https://substackcdn.com/image/fetch/$s_!cc3B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6026a564-1b18-4b1b-8f47-b0d4c28d0ab3_2700x1800.png 1272w, https://substackcdn.com/image/fetch/$s_!cc3B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6026a564-1b18-4b1b-8f47-b0d4c28d0ab3_2700x1800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cc3B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6026a564-1b18-4b1b-8f47-b0d4c28d0ab3_2700x1800.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6026a564-1b18-4b1b-8f47-b0d4c28d0ab3_2700x1800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:366235,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://abstatisticalconsulting.substack.com/i/195874313?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6026a564-1b18-4b1b-8f47-b0d4c28d0ab3_2700x1800.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cc3B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6026a564-1b18-4b1b-8f47-b0d4c28d0ab3_2700x1800.png 424w, 
https://substackcdn.com/image/fetch/$s_!cc3B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6026a564-1b18-4b1b-8f47-b0d4c28d0ab3_2700x1800.png 848w, https://substackcdn.com/image/fetch/$s_!cc3B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6026a564-1b18-4b1b-8f47-b0d4c28d0ab3_2700x1800.png 1272w, https://substackcdn.com/image/fetch/$s_!cc3B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6026a564-1b18-4b1b-8f47-b0d4c28d0ab3_2700x1800.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>This is a somewhat worse fit than we saw above; it seems that the relationships genuinely are slightly different for Anthropic and OpenAI LLMs. This isn&#8217;t too surprising, as it is what we would expect if one lab&#8217;s LLMs are consistently slightly better at software engineering relative to their other abilities.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>I also included GPT 3.5, with an ECI of 119, from <a href="https://epoch.ai/data-insights/ai-capabilities-progress-has-sped-up">Epoch&#8217;s attempt to extend the ECI to earlier LLMs</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Unlike Opus 4.7 and Mythos, which may saturate the task suite too completely for accurate 50% time horizon 
estimates.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Predicting Time Horizon from Anthropic's Internal ECI]]></title><description><![CDATA[Predicting Time Horizon from Anthropic's Internal ECI for Mythos Preview and Opus 4.7]]></description><link>https://abstatisticalconsulting.substack.com/p/predicting-time-horizon-from-anthropics</link><guid isPermaLink="false">https://abstatisticalconsulting.substack.com/p/predicting-time-horizon-from-anthropics</guid><dc:creator><![CDATA[Alexander Barry]]></dc:creator><pubDate>Wed, 22 Apr 2026 13:51:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Uym3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e742eda-e89e-4d6e-bdfd-dd3961c1bc25_2700x1800.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Note: While I have worked with both METR and Epoch in my capacity as a statistical consultant, this post is based entirely on public information. For more information or to enquire about hiring me for a project see <a href="http://abstats.co.uk">abstats.co.uk</a></em></p><h1>Anthropic&#8217;s Internal ECI</h1><p>Starting with Mythos Preview, Anthropic included a new &#8216;Anthropic ECI&#8217; or &#8216;AECI&#8217; in their system cards. 
This is based on the same methodology as the <a href="https://epoch.ai/eci">Epoch Capabilities Index</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> but with additional information from their internal benchmarks.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://abstatisticalconsulting.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>Unfortunately they only share the results in the form of a pretty but hard to read plot:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WMqt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F947c4b9d-15e4-44c2-9f0d-533e4cd9a8b2_1769x1040.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WMqt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F947c4b9d-15e4-44c2-9f0d-533e4cd9a8b2_1769x1040.png 424w, 
https://substackcdn.com/image/fetch/$s_!WMqt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F947c4b9d-15e4-44c2-9f0d-533e4cd9a8b2_1769x1040.png 848w, https://substackcdn.com/image/fetch/$s_!WMqt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F947c4b9d-15e4-44c2-9f0d-533e4cd9a8b2_1769x1040.png 1272w, https://substackcdn.com/image/fetch/$s_!WMqt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F947c4b9d-15e4-44c2-9f0d-533e4cd9a8b2_1769x1040.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WMqt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F947c4b9d-15e4-44c2-9f0d-533e4cd9a8b2_1769x1040.png" width="1456" height="856" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/947c4b9d-15e4-44c2-9f0d-533e4cd9a8b2_1769x1040.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:856,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:221337,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://abstatisticalconsulting.substack.com/i/194081160?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F947c4b9d-15e4-44c2-9f0d-533e4cd9a8b2_1769x1040.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!WMqt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F947c4b9d-15e4-44c2-9f0d-533e4cd9a8b2_1769x1040.png 424w, https://substackcdn.com/image/fetch/$s_!WMqt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F947c4b9d-15e4-44c2-9f0d-533e4cd9a8b2_1769x1040.png 848w, https://substackcdn.com/image/fetch/$s_!WMqt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F947c4b9d-15e4-44c2-9f0d-533e4cd9a8b2_1769x1040.png 1272w, https://substackcdn.com/image/fetch/$s_!WMqt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F947c4b9d-15e4-44c2-9f0d-533e4cd9a8b2_1769x1040.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">AECI plot taken directly from the Opus 4.7 model card</figcaption></figure></div><p> I extracted all of the values from this to obtain:</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/drvwE/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/21c8fd5e-2a4f-4a8c-9115-969c90a5df63_1220x1524.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/13561e61-a8bf-4a0d-b94d-c6cbe7fce0d5_1220x1716.png&quot;,&quot;height&quot;:895,&quot;title&quot;:&quot;Anthropic Internal ECI values&quot;,&quot;description&quot;:&quot;Extracted from Opus 4.7 model card, and compared to Epoch's official ECI values. 
Note Anthropic CIs are 95% compared to Epoch's 90%, and are calculated with different bootstrap approaches.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/drvwE/1/" width="730" height="895" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><h1>Mapping to Time Horizon</h1><p>When Epoch released the ECI they noted that it was very predictive of <a href="https://metr.org/time-horizons/">METR&#8217;s 50% Time Horizon</a> results.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>We can leverage this to see how AECI relates to 50% and 80% Time Horizon, to get early estimates of how Opus 4.7 and Mythos would perform:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Uym3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e742eda-e89e-4d6e-bdfd-dd3961c1bc25_2700x1800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Uym3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e742eda-e89e-4d6e-bdfd-dd3961c1bc25_2700x1800.png 424w, 
https://substackcdn.com/image/fetch/$s_!Uym3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e742eda-e89e-4d6e-bdfd-dd3961c1bc25_2700x1800.png 848w, https://substackcdn.com/image/fetch/$s_!Uym3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e742eda-e89e-4d6e-bdfd-dd3961c1bc25_2700x1800.png 1272w, https://substackcdn.com/image/fetch/$s_!Uym3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e742eda-e89e-4d6e-bdfd-dd3961c1bc25_2700x1800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Uym3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e742eda-e89e-4d6e-bdfd-dd3961c1bc25_2700x1800.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2e742eda-e89e-4d6e-bdfd-dd3961c1bc25_2700x1800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:241156,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://abstatisticalconsulting.substack.com/i/194081160?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e742eda-e89e-4d6e-bdfd-dd3961c1bc25_2700x1800.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!Uym3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e742eda-e89e-4d6e-bdfd-dd3961c1bc25_2700x1800.png 424w, https://substackcdn.com/image/fetch/$s_!Uym3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e742eda-e89e-4d6e-bdfd-dd3961c1bc25_2700x1800.png 848w, https://substackcdn.com/image/fetch/$s_!Uym3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e742eda-e89e-4d6e-bdfd-dd3961c1bc25_2700x1800.png 1272w, https://substackcdn.com/image/fetch/$s_!Uym3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e742eda-e89e-4d6e-bdfd-dd3961c1bc25_2700x1800.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aQKw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa48df9e4-5325-43c9-a65f-095f1374a0cd_2700x1800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aQKw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa48df9e4-5325-43c9-a65f-095f1374a0cd_2700x1800.png 424w, https://substackcdn.com/image/fetch/$s_!aQKw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa48df9e4-5325-43c9-a65f-095f1374a0cd_2700x1800.png 848w, https://substackcdn.com/image/fetch/$s_!aQKw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa48df9e4-5325-43c9-a65f-095f1374a0cd_2700x1800.png 1272w, https://substackcdn.com/image/fetch/$s_!aQKw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa48df9e4-5325-43c9-a65f-095f1374a0cd_2700x1800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aQKw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa48df9e4-5325-43c9-a65f-095f1374a0cd_2700x1800.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a48df9e4-5325-43c9-a65f-095f1374a0cd_2700x1800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:236033,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://abstatisticalconsulting.substack.com/i/194081160?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa48df9e4-5325-43c9-a65f-095f1374a0cd_2700x1800.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aQKw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa48df9e4-5325-43c9-a65f-095f1374a0cd_2700x1800.png 424w, https://substackcdn.com/image/fetch/$s_!aQKw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa48df9e4-5325-43c9-a65f-095f1374a0cd_2700x1800.png 848w, https://substackcdn.com/image/fetch/$s_!aQKw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa48df9e4-5325-43c9-a65f-095f1374a0cd_2700x1800.png 1272w, https://substackcdn.com/image/fetch/$s_!aQKw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa48df9e4-5325-43c9-a65f-095f1374a0cd_2700x1800.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Both show a clear relationship, especially the 50% time horizon, with an R^2 of &gt;0.99 (for log-scale Time Horizon). 
We predict:</p><ul><li><p>Opus 4.7:</p><ul><li><p>50% Time Horizon: 18.8 hours</p></li><li><p>80% Time Horizon: 2.6 hours</p></li></ul></li><li><p>Mythos Preview:</p><ul><li><p>50% Time Horizon: 40.3 hours</p></li><li><p>80% Time Horizon: 5.5 hours</p></li></ul></li></ul><p>However, we shouldn&#8217;t expect to see actual 50% Time Horizon values this high from METR, as the longest task in their current TH1.1 task suite is 30 hours (and very few are over 16 hours).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a><br><br>Thus, until METR update their task suite to include longer tasks, we expect they may not be able to accurately measure the 50% time horizon of the most capable models. The 80% time horizon should still fall within the measurable range, however, so we will be able to compare these predictions with the measured values once they are released.</p>
<p></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>See the two <a href="https://abstatisticalconsulting.substack.com/p/kicking-the-tires-of-the-epoch-capabilities">previous</a> <a href="https://abstatisticalconsulting.substack.com/p/kicking-the-tires-of-the-epoch-capabilities-741">posts</a> on this Substack for more discussion of the ECI.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>In the version shared with Mythos Preview they also accidentally scaled 3.5 Sonnet (New), rather than the original 3.5 Sonnet release that Epoch uses, to have ECI 130. They fixed this issue in the version released with the Opus 4.7 system card. 
If only someone could have foreseen releasing a model with the same name twice causing confusion.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>In particular fitting the log of time horizon as a linear transformation of ECI.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>You can view the interactive task success rate plot on <a href="https://metr.org/time-horizons/">this page</a> (which I helped create) to better understand the tasks that go into calculating METR&#8217;s time horizon results.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Kicking the Tires of the Epoch Capabilities Index (ECI) Part 2: Uncertainty and Alternative Models]]></title><description><![CDATA[In this post I discuss potential problems with Epoch's confidence intervals for the ECI, and develop alternative Bayesian models to construct the index.]]></description><link>https://abstatisticalconsulting.substack.com/p/kicking-the-tires-of-the-epoch-capabilities-741</link><guid isPermaLink="false">https://abstatisticalconsulting.substack.com/p/kicking-the-tires-of-the-epoch-capabilities-741</guid><dc:creator><![CDATA[Alexander Barry]]></dc:creator><pubDate>Sun, 15 Feb 2026 06:19:44 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f6559a23-41d7-42c5-bdd3-f01dd791b297_2980x1500.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is the second post in my three part series on the Epoch Capabilities Index (ECI). For background on the ECI and my replication of Epoch&#8217;s process of constructing it see the <a href="https://t.co/FI3JcaeqUA">first post</a>. 
See part three (upcoming) for multidimensional extensions of ECI, the relative importance of different benchmarks for calculating the ECI, and whether the trend in ECI improvements has been speeding up.</em></p><h1>Introduction</h1><p>In the <a href="https://abstatisticalconsulting.substack.com/p/kicking-the-tires-of-the-epoch-capabilities?r=79zn9d&amp;utm_campaign=post&amp;utm_medium=web&amp;triedRedirect=true">first post</a> of this series I replicated Epoch&#8217;s exact method for constructing the ECI. In this post I explore alternative models and see how they impact the results and model fit. First, however, I will discuss the process Epoch use to construct their confidence intervals for the ECI, why I think they might not be appropriate, and how Bayesian models could be a superior alternative.</p><p>Note that while I found some issues with the data used to construct the ECI in the first post, for this post I will continue using the same underlying data as accessed on 2026/02/04. In part three of this series (upcoming) I will look at the impacts of any updates to the data.</p><h1>Uncertainty</h1><p>As discussed in the last post, Epoch use (non-hierarchical) bootstrapping to produce the confidence intervals for their ECI results. The idea of bootstrapping is to repeatedly resample with replacement from the data as a way of estimating what the data &#8216;could have been&#8217;. This is a powerful and flexible approach, but to be valid it relies on having enough datapoints that the resamples are representative of real-world data.</p><p>While ~1250 benchmark results contribute to the ECI overall, the average number of results per LLM is just 8.7 and the minimum for inclusion is only 4<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. 
This matters because the ECI model is naturally hierarchical (the data is grouped by benchmark/LLM), which makes the conditions for bootstrap validity more demanding. It&#8217;s not enough to have a large total sample; we also need sufficient data within each group, and per-LLM counts this low mean the theoretical guarantees that justify bootstrapped confidence intervals don&#8217;t apply.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>Fortunately there is an alternative, as using <a href="https://en.wikipedia.org/wiki/Bayesian_statistics">Bayesian statistics</a> lets us sidestep bootstrapping entirely, and directly model the hierarchical structure. Bayesian models naturally produce uncertainty estimates as part of their fitting process, giving us valid CIs<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> as long as we can come up with reasonable prior distributions for our parameters.</p><h1>Loss</h1><p>As covered in part 1 the original ECI model finds the parameters that minimise the following loss function:<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sum_{n=1}^{N} \\left( (\\text{predicted_score}_n - \\text{actual_score}_n)^2 \\right) + \\frac{0.1}{M + 2B - 2} \\left( \\sum_{b=1}^{B} (\\alpha_b^2 + D_b^2) + \\sum_{m=1}^{M} C_m^2 \\right)&quot;,&quot;id&quot;:&quot;NGYFPQFZHL&quot;}" data-component-name="LatexBlockToDOM"></div><p>where </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{predicted_score(m,b)} = \\sigma\\bigl(\\alpha_b \\cdot (C_m - D_b)\\bigr) 
\\quad \\text{where} \\quad \\sigma(x) := \\frac{1}{1 + e^{-x}}&quot;,&quot;id&quot;:&quot;CEIUQWMAXF&quot;}" data-component-name="LatexBlockToDOM"></div><p>This is a frequentist approach to statistical modelling, finding the single set of parameter values that best fits the data given the specified model.</p><p>However, a penalised regression such as the one above can equivalently be viewed<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> as finding the posterior mode of a corresponding Bayesian model (the &#8216;<a href="https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation">maximum a posteriori&#8217; (MAP) estimator</a>).</p><p>Considering this equivalent Bayesian model has a number of benefits. Firstly, as discussed in the previous section, Bayesian models can generate principled uncertainty estimates without relying on bootstrapping. Secondly, it makes it natural to consider model extensions (like allowing different benchmarks to have different noise levels) that would be awkward to motivate as changes to a loss function but are straightforward as modelling choices. 
I explore this in the next section.</p><h1>Bayesian Models</h1><h2>Base</h2><p>As discussed above, we can directly convert Epoch&#8217;s frequentist model into a Bayesian model with equivalent loss, obtaining the likelihood:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{score}_{m,b} \\sim \\text{Normal}\\!\\left(\\text{predicted_score}(m,b),\\; \\sigma^2\\right)&quot;,&quot;id&quot;:&quot;OFCQPAKNZK&quot;}" data-component-name="LatexBlockToDOM"></div><p>With priors on the parameters:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;C_m \\sim \\text{Normal}(0,\\; \\sigma_p^2), \\quad D_b \\sim \\text{Normal}(0,\\; \\sigma_p^2), \\quad \\alpha_b \\sim \\text{Normal}(0,\\; \\sigma_p^2)&quot;,&quot;id&quot;:&quot;SYPYJRLSQF&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sigma \\sim \\text{Uniform}(0.01,\\; 1), \\quad \\sigma_p = \\sigma \\sqrt{10(M+2B-2)}&quot;,&quot;id&quot;:&quot;PHMEPUJKXJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Note that the sigma parameter here is not present in the original frequentist model<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> but is required for there to be a sensible interpretation of the model predictions when doing the goodness-of-fit checks covered later in this post.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a></p><p>The model is otherwise kept as similar as possible to the frequentist model, with WinoGrande being used to set the scale by having its discrimination parameter fixed to 1, and the constraints that &#120572; &#8712; [0.1,10] and <em>C, D</em> &#8712; [-10,10] still applied.</p><h2>Improved Normal</h2><p>It&#8217;s apparent from inspecting the Base Bayesian model that it (and thus also the 
frequentist model) assumes the amount of noise in the benchmark scores is constant across all LLMs and benchmarks, but this doesn&#8217;t seem very likely to be true. So in this model we relax the assumption by allowing the different benchmarks to have different variances, meaning that some will be expected to be noisier than others.</p><p>The base model also models the actual scores as being normally distributed, despite the fact that they can only fall in [0,1]. We can address this by truncating the normal distribution to only allow outputs between 0 and 1 (and scaling the rest of the density accordingly).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a></p><p>The likelihood looks very similar, just with a variance parameter that is now allowed to vary by benchmark:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{score}_{m,b} \\sim \\text{Normal}\\!\\left(\\text{predicted_score}(m,b),\\; \\sigma_b^2\\right), \\quad \\text{truncated to } [0, 1]&quot;,&quot;id&quot;:&quot;UWZQKCEFND&quot;}" data-component-name="LatexBlockToDOM"></div><p>When making this model I also took the opportunity to make various other minor changes<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a> that I think are natural for Bayesian modelling, such as removing the constraints on the possible parameter values and changing the priors to a more flexible setup:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;C_m \\sim \\text{Normal}(0,\\; \\tau_{CD}^2), \\quad D_b \\sim \\text{Normal}(0,\\; \\tau_{CD}^2), \\quad \\alpha_b \\sim \\text{LogNormal}(0,\\; \\tau_\\alpha^2)&quot;,&quot;id&quot;:&quot;LGZYYAEGDE&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" 
data-attrs="{&quot;persistentExpression&quot;:&quot;\\tau_{CD} \\sim \\text{LogNormal}(\\log 3,\\; 1), \\quad \\tau_\\alpha \\sim \\text{LogNormal}(\\log 0.5,\\; 0.5) \\quad \\sigma_b \\sim \\text{LogNormal}(\\log 0.05,\\; 0.5)&quot;,&quot;id&quot;:&quot;WTWPDULXPI&quot;}" data-component-name="LatexBlockToDOM"></div><p>And instead of setting the scale by anchoring WinoGrande to have alpha = 1, I fix the average benchmark difficulty to 0 and the (geometric) average slope to 1<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>, which I think is cleaner than fixing a specific benchmark:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sum_b D_b = 0, \\quad \\prod_b \\alpha_b = 1&quot;,&quot;id&quot;:&quot;UIHAUOBTIF&quot;}" data-component-name="LatexBlockToDOM"></div><h2>Improved Beta</h2><p>Another natural question to consider is whether the assumption of normally distributed noise/errors is correct (even after allowing for different variances for different benchmarks as above).<br><br>As an intuition for why this might not be appropriate, under the models with normal errors we penalise a predicted score of 55% when the true score is 50% just as much as a predicted score of 95% when the true score is 90%<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a>, even though 95% might seem intuitively further away from 90% than 55% is from 50% in a sense that matters.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a></p><p>One way to deal with this is to instead assume the score follows a <a href="https://en.wikipedia.org/wiki/Beta_distribution">beta distribution</a> with expected value equal to the predicted score, but whose errors penalise mistakes more as the predicted score gets close to 0 or 1.</p><p>This 
just involves replacing the likelihood above with:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{score}_{m,b} \\sim \\text{Beta}\\!\\left(\\text{predicted_score}(m,b) \\cdot \\left(\\frac{1}{4\\sigma_b^2} - 1\\right),\\;\\; \\left(1 - \\text{predicted_score}(m,b)\\right) \\cdot \\left(\\frac{1}{4\\sigma_b^2} - 1\\right)\\right)&quot;,&quot;id&quot;:&quot;UGJWJFEPSC&quot;}" data-component-name="LatexBlockToDOM"></div><p>while leaving all priors the same (with sigma_b still being allowed to vary across different benchmarks). All this does is give us a distribution where:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbb{E}[\\text{score}_{m,b}] = \\text{predicted_score}(m,b), &quot;,&quot;id&quot;:&quot;NRNRGKITQS&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Var}(\\text{score}_{m,b}) = \\sigma_b^2 \\cdot 4 \\cdot \\text{predicted_score}(m,b)\\,(1 - \\text{predicted_score}(m,b))\\, &quot;,&quot;id&quot;:&quot;JWLCKWFPPN&quot;}" data-component-name="LatexBlockToDOM"></div><p>So, as desired, the expected value always equals the predicted score, while the variance (noise) is at its maximum at a predicted score of 0.5 and decreases as the predicted score moves away from that and closer to 0 or 1.</p><h1>Implementation</h1><p>I wrote code in Stan (specialised software for fitting Bayesian models) to implement all three of the Bayesian models discussed above, generating posterior samples that we can use for CIs for the parameters and any derived results. 
<br><br>All models fit well with no errors; see the appendix for more details on the fitting parameters and convergence statistics.</p><h1>Results</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uh_X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8129fcd9-e90a-4c0b-a351-5d5ac0b82191_1800x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uh_X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8129fcd9-e90a-4c0b-a351-5d5ac0b82191_1800x1200.png 424w, https://substackcdn.com/image/fetch/$s_!uh_X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8129fcd9-e90a-4c0b-a351-5d5ac0b82191_1800x1200.png 848w, https://substackcdn.com/image/fetch/$s_!uh_X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8129fcd9-e90a-4c0b-a351-5d5ac0b82191_1800x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!uh_X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8129fcd9-e90a-4c0b-a351-5d5ac0b82191_1800x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uh_X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8129fcd9-e90a-4c0b-a351-5d5ac0b82191_1800x1200.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8129fcd9-e90a-4c0b-a351-5d5ac0b82191_1800x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:184650,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://abstatisticalconsulting.substack.com/i/187690321?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8129fcd9-e90a-4c0b-a351-5d5ac0b82191_1800x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uh_X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8129fcd9-e90a-4c0b-a351-5d5ac0b82191_1800x1200.png 424w, https://substackcdn.com/image/fetch/$s_!uh_X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8129fcd9-e90a-4c0b-a351-5d5ac0b82191_1800x1200.png 848w, https://substackcdn.com/image/fetch/$s_!uh_X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8129fcd9-e90a-4c0b-a351-5d5ac0b82191_1800x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!uh_X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8129fcd9-e90a-4c0b-a351-5d5ac0b82191_1800x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We see that the SOTA ECI results for the Bayesian models are generally very similar to the original Epoch model, but with some deviations and typically smaller confidence intervals.</p><p>We can also look in particular at which LLM the different models think is best by comparing to Gemini 3 Pro (which Epoch finds strongest currently):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Wx0s!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f964f94-c6eb-4e96-80c2-d6875a95f86c_1800x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!Wx0s!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f964f94-c6eb-4e96-80c2-d6875a95f86c_1800x900.png 424w, https://substackcdn.com/image/fetch/$s_!Wx0s!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f964f94-c6eb-4e96-80c2-d6875a95f86c_1800x900.png 848w, https://substackcdn.com/image/fetch/$s_!Wx0s!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f964f94-c6eb-4e96-80c2-d6875a95f86c_1800x900.png 1272w, https://substackcdn.com/image/fetch/$s_!Wx0s!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f964f94-c6eb-4e96-80c2-d6875a95f86c_1800x900.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Wx0s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f964f94-c6eb-4e96-80c2-d6875a95f86c_1800x900.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0f964f94-c6eb-4e96-80c2-d6875a95f86c_1800x900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:113115,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://abstatisticalconsulting.substack.com/i/187690321?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f964f94-c6eb-4e96-80c2-d6875a95f86c_1800x900.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Wx0s!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f964f94-c6eb-4e96-80c2-d6875a95f86c_1800x900.png 424w, https://substackcdn.com/image/fetch/$s_!Wx0s!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f964f94-c6eb-4e96-80c2-d6875a95f86c_1800x900.png 848w, https://substackcdn.com/image/fetch/$s_!Wx0s!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f964f94-c6eb-4e96-80c2-d6875a95f86c_1800x900.png 1272w, https://substackcdn.com/image/fetch/$s_!Wx0s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f964f94-c6eb-4e96-80c2-d6875a95f86c_1800x900.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The two improved Bayesian models actually find GPT-5.2 stronger than Gemini 3 Pro, with the Improved Normal model finding the difference to be statistically significant at p = 0.05, but in both cases it is only a change from being 1 ECI point lower to 1 ECI point higher.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Gknk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e5a2be8-ec2d-4362-a326-f6ef8f98cf4a_1500x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Gknk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e5a2be8-ec2d-4362-a326-f6ef8f98cf4a_1500x900.png 424w, https://substackcdn.com/image/fetch/$s_!Gknk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e5a2be8-ec2d-4362-a326-f6ef8f98cf4a_1500x900.png 848w, https://substackcdn.com/image/fetch/$s_!Gknk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e5a2be8-ec2d-4362-a326-f6ef8f98cf4a_1500x900.png 1272w, https://substackcdn.com/image/fetch/$s_!Gknk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e5a2be8-ec2d-4362-a326-f6ef8f98cf4a_1500x900.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Gknk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e5a2be8-ec2d-4362-a326-f6ef8f98cf4a_1500x900.png" width="1456" height="874" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2e5a2be8-ec2d-4362-a326-f6ef8f98cf4a_1500x900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:874,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:64773,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://abstatisticalconsulting.substack.com/i/187690321?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e5a2be8-ec2d-4362-a326-f6ef8f98cf4a_1500x900.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Gknk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e5a2be8-ec2d-4362-a326-f6ef8f98cf4a_1500x900.png 424w, https://substackcdn.com/image/fetch/$s_!Gknk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e5a2be8-ec2d-4362-a326-f6ef8f98cf4a_1500x900.png 848w, https://substackcdn.com/image/fetch/$s_!Gknk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e5a2be8-ec2d-4362-a326-f6ef8f98cf4a_1500x900.png 1272w, https://substackcdn.com/image/fetch/$s_!Gknk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e5a2be8-ec2d-4362-a326-f6ef8f98cf4a_1500x900.png 
1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The models also have similar ECI/year trends, although the Beta model is around 10% faster than the rest (but this is within the noise of the different estimates). 
We will do a deeper dive into the ECI/year trends in the next post.</p><p>See full results (ECI values for every LLM and difficulty and discrimination values for every benchmark, for every model) in the appendix.</p><h1>Model Comparison</h1><p>We have seen the results for all the models, and I have gestured towards some theoretical considerations which might favour the Bayesian models, but it remains unclear if they should in fact be preferred. In this section we will look at various criteria on which we can compare the models, to see which fit the data best and are most predictive.</p><h2>Goodness of Fit</h2><p>&#8216;Goodness of fit&#8217; tests are one way of assessing how well a model fits the data. The general idea is picking a (hopefully quite general) feature of the data, and comparing it to how we would expect it to behave if the model is correct.</p><p>We can start with some simple examples of this, comparing <a href="https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot">QQ plots</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fCnz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320a9fba-6edd-4491-9322-769af85405af_1800x1500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fCnz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320a9fba-6edd-4491-9322-769af85405af_1800x1500.png 424w, https://substackcdn.com/image/fetch/$s_!fCnz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320a9fba-6edd-4491-9322-769af85405af_1800x1500.png 848w, 
https://substackcdn.com/image/fetch/$s_!fCnz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320a9fba-6edd-4491-9322-769af85405af_1800x1500.png 1272w, https://substackcdn.com/image/fetch/$s_!fCnz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320a9fba-6edd-4491-9322-769af85405af_1800x1500.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fCnz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320a9fba-6edd-4491-9322-769af85405af_1800x1500.png" width="1456" height="1213" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/320a9fba-6edd-4491-9322-769af85405af_1800x1500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1213,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:178956,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://abstatisticalconsulting.substack.com/i/187690321?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320a9fba-6edd-4491-9322-769af85405af_1800x1500.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fCnz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320a9fba-6edd-4491-9322-769af85405af_1800x1500.png 424w, 
https://substackcdn.com/image/fetch/$s_!fCnz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320a9fba-6edd-4491-9322-769af85405af_1800x1500.png 848w, https://substackcdn.com/image/fetch/$s_!fCnz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320a9fba-6edd-4491-9322-769af85405af_1800x1500.png 1272w, https://substackcdn.com/image/fetch/$s_!fCnz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320a9fba-6edd-4491-9322-769af85405af_1800x1500.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here the frequentist model<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a> and the Base Bayesian model do notably worse than the improved models, with the tails departing substantially from the main fit. Both improved models still have some issues, with the Improved Beta showing modest deviations in the mid-low and mid-high sections<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a>, and the Improved Normal struggling mostly at the upper tail.</p><p>See also the appendix for a full set of predicted vs actual plots for each model, broken down by benchmark.</p><p>I also looked at the &#8216;Posterior Predictive P-values&#8217; (PPP) by comparing the size of the squared Pearson residuals for the actual data to those simulated from the <a href="https://en.wikipedia.org/wiki/Posterior_predictive_distribution">posterior predictive distribution</a> of each model. 
Here p values correspond to the chance of the models producing data more extreme than the data we actually saw, and values close to 0.5 are ideal, with values close to 0 or 1 being concerning:</p><ul><li><p>Frequentist: Method not applicable, but fit should be similar to Base Bayesian model</p></li><li><p>Base: p = 0.03 (pretty concerning)</p></li><li><p>Improved Normal: p = 0.062 (somewhat concerning)</p></li><li><p>Improved Beta: p = 0.399 (good)</p></li></ul><p>We can also break these down per benchmark (where ideally each benchmark would also be close to p=0.5):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZG75!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b6c8d29-8fbc-4032-8da7-ad3ea735ca7f_2100x1800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZG75!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b6c8d29-8fbc-4032-8da7-ad3ea735ca7f_2100x1800.png 424w, https://substackcdn.com/image/fetch/$s_!ZG75!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b6c8d29-8fbc-4032-8da7-ad3ea735ca7f_2100x1800.png 848w, https://substackcdn.com/image/fetch/$s_!ZG75!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b6c8d29-8fbc-4032-8da7-ad3ea735ca7f_2100x1800.png 1272w, https://substackcdn.com/image/fetch/$s_!ZG75!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b6c8d29-8fbc-4032-8da7-ad3ea735ca7f_2100x1800.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ZG75!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b6c8d29-8fbc-4032-8da7-ad3ea735ca7f_2100x1800.png" width="1456" height="1248" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3b6c8d29-8fbc-4032-8da7-ad3ea735ca7f_2100x1800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1248,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:236386,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://abstatisticalconsulting.substack.com/i/187690321?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b6c8d29-8fbc-4032-8da7-ad3ea735ca7f_2100x1800.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZG75!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b6c8d29-8fbc-4032-8da7-ad3ea735ca7f_2100x1800.png 424w, https://substackcdn.com/image/fetch/$s_!ZG75!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b6c8d29-8fbc-4032-8da7-ad3ea735ca7f_2100x1800.png 848w, https://substackcdn.com/image/fetch/$s_!ZG75!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b6c8d29-8fbc-4032-8da7-ad3ea735ca7f_2100x1800.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ZG75!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b6c8d29-8fbc-4032-8da7-ad3ea735ca7f_2100x1800.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This makes it clear how poorly the assumption of constant variance across benchmarks fits the data: for most benchmarks the Base model expects either more or less noise than we saw in the data (corresponding to p values close to 0 or 1). Both improved models stay much closer to 0.5 across the set of benchmarks. 
The Improved Normal model underpredicts noise on average, whereas the Improved Beta has a mix of under and overpredicting.</p><h2>Cross Validation</h2><p>An alternative approach to assess the models is to see how well they fit on new data that wasn&#8217;t included when they were being trained. Even without new data we can simulate this by using leave one out cross validation (LOO CV).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-16" href="#footnote-16" target="_self">16</a> </p><p>It isn&#8217;t immediately clear what the correct measure of error to use for this is, so I look at the squared error (which Epoch&#8217;s model is trained to minimise), the mean absolute error, and a scaled version of the squared error that penalises errors more strongly when they are closer to 0 or 1<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-17" href="#footnote-17" target="_self">17</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3uO4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab43bdb8-c311-46e6-883e-151ecae0721a_2100x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3uO4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab43bdb8-c311-46e6-883e-151ecae0721a_2100x900.png 424w, https://substackcdn.com/image/fetch/$s_!3uO4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab43bdb8-c311-46e6-883e-151ecae0721a_2100x900.png 848w, 
https://substackcdn.com/image/fetch/$s_!3uO4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab43bdb8-c311-46e6-883e-151ecae0721a_2100x900.png 1272w, https://substackcdn.com/image/fetch/$s_!3uO4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab43bdb8-c311-46e6-883e-151ecae0721a_2100x900.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3uO4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab43bdb8-c311-46e6-883e-151ecae0721a_2100x900.png" width="1456" height="624" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ab43bdb8-c311-46e6-883e-151ecae0721a_2100x900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:624,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:88953,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://abstatisticalconsulting.substack.com/i/187690321?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab43bdb8-c311-46e6-883e-151ecae0721a_2100x900.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3uO4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab43bdb8-c311-46e6-883e-151ecae0721a_2100x900.png 424w, 
https://substackcdn.com/image/fetch/$s_!3uO4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab43bdb8-c311-46e6-883e-151ecae0721a_2100x900.png 848w, https://substackcdn.com/image/fetch/$s_!3uO4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab43bdb8-c311-46e6-883e-151ecae0721a_2100x900.png 1272w, https://substackcdn.com/image/fetch/$s_!3uO4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab43bdb8-c311-46e6-883e-151ecae0721a_2100x900.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We see that the improved Bayesian models do best: each wins on one measure, and the two essentially draw on the scaled RMSE. Both beat the base and replication models on most measures, although the frequentist model slightly beats the Improved Beta on MAE.</p><p>Another measure (only available for the Bayesian models) is the expected log pointwise predictive density (ELPD),<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-18" href="#footnote-18" target="_self">18</a> which is less interpretable but has the advantage of allowing us to compute confidence intervals for the size of the difference between models:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4AQJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74559b2-22fb-4fb6-bb58-516ee83d7abe_1500x750.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4AQJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74559b2-22fb-4fb6-bb58-516ee83d7abe_1500x750.png 424w, https://substackcdn.com/image/fetch/$s_!4AQJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74559b2-22fb-4fb6-bb58-516ee83d7abe_1500x750.png 848w, https://substackcdn.com/image/fetch/$s_!4AQJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74559b2-22fb-4fb6-bb58-516ee83d7abe_1500x750.png 1272w, 
https://substackcdn.com/image/fetch/$s_!4AQJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74559b2-22fb-4fb6-bb58-516ee83d7abe_1500x750.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4AQJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74559b2-22fb-4fb6-bb58-516ee83d7abe_1500x750.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d74559b2-22fb-4fb6-bb58-516ee83d7abe_1500x750.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:56747,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://abstatisticalconsulting.substack.com/i/187690321?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74559b2-22fb-4fb6-bb58-516ee83d7abe_1500x750.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4AQJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74559b2-22fb-4fb6-bb58-516ee83d7abe_1500x750.png 424w, https://substackcdn.com/image/fetch/$s_!4AQJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74559b2-22fb-4fb6-bb58-516ee83d7abe_1500x750.png 848w, 
https://substackcdn.com/image/fetch/$s_!4AQJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74559b2-22fb-4fb6-bb58-516ee83d7abe_1500x750.png 1272w, https://substackcdn.com/image/fetch/$s_!4AQJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74559b2-22fb-4fb6-bb58-516ee83d7abe_1500x750.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here the Improved Beta model does best, and the 95% CIs for the other models do not overlap zero.</p><h2>Improved Beta Seems 
Best</h2><p>Since it has by far the best posterior predictive p value and wins on most of the cross-validated error metrics, I conclude that the Improved Beta Bayesian model is the best fit for the data, and will use it as the basis for the analysis in post 3 (upcoming). </p><p>The Improved Normal model also makes a good showing, however, and I suspect the &#8216;true&#8217; distribution of the data is somewhere in between, with benchmark noise declining close to 0/1 but not quite in the manner the beta model assumes.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-19" href="#footnote-19" target="_self">19</a></p><p>I recommend Epoch move to this kind of model (or some further refinement) for the ECI calculations. It is worth noting, though, that the differences between the ECI results from the different models are usually only 1-2 ECI points (there can be more disagreement for the very weak models, see the appendix), and there is heavy overlap of the CIs in every case.</p><h1>Conclusion</h1><p>Epoch&#8217;s ECI relies on bootstrapping for its confidence intervals, but the hierarchical structure and low number of benchmark results per LLM mean this has unclear theoretical support. Moving to Bayesian models sidesteps this entirely, producing uncertainty estimates as a natural part of fitting the model.</p><p>I investigated three Bayesian alternatives to Epoch&#8217;s model: first, directly translating Epoch&#8217;s model to a Bayesian framework; second, allowing benchmarks to have different amounts of noise in their results (&#8216;Improved Normal&#8217;) and making other minor improvements; and third, assuming less noise when scores are close to 0% or 100% (the &#8216;Improved Beta&#8217; model). 
This last model performs best overall in terms of cross-validated error and the goodness-of-fit checks I performed.</p><p>The resulting ECI values are generally very similar to Epoch&#8217;s, but with some differences, most notably narrower CIs and a (small) change in the top-ranked LLM, with both improved models giving GPT 5.2 a higher ECI than Gemini 3 Pro. I recommend Epoch consider adopting this kind of model, although the practical differences are usually small.</p><p>In part three (upcoming) I will use the &#8216;Improved Beta&#8217; model to explore further extensions of the ECI model, and also look at the relative importance of different benchmarks and whether the trend in ECI improvements has been speeding up.</p><h1>Appendix</h1><h2>Bayesian Model Convergence</h2><p>I ran each model with 4 chains, each with 4000 warmup and sampling iterations. adapt_delta was set to 0.95 and max_treedepth to 12. Each model takes around 10 minutes to fit.</p><p>All models mixed well, with no divergent transitions or max treedepth hits. The largest rhat value was 1.01, the smallest E-BFMI was 0.4 and the smallest ESS (tail and bulk) was 183, although that was only in the base model; the other two have &gt;400. 
</p><p>Replication code is available <a href="https://github.com/Alexander-Barry/ECI_replication">here</a>.</p><h1>Full LLM Results</h1><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/qKpLR/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ac80b1d4-9dc3-4bcd-93f2-188e7cacb487_1220x1022.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0315132c-9fc6-4d5e-8da7-1727e479d2fe_1220x1022.png&quot;,&quot;height&quot;:542,&quot;title&quot;:&quot;Created with Datawrapper&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/qKpLR/1/" width="730" height="542" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><h1>Full Benchmark Results</h1><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/g534c/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eee3a085-8c91-47c1-9be4-50ce1d0a6321_1220x830.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0b12f3a5-4269-439c-a9f2-3990481d61b1_1220x830.png&quot;,&quot;height&quot;:441,&quot;title&quot;:&quot;Created with Datawrapper&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/g534c/1/" 
width="730" height="441" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><h1>Predicted vs Actual Scatterplots</h1><h2>Frequentist</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Cbyq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faec97bcb-d67b-4893-8634-fd1d03bbcd22_2700x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Cbyq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faec97bcb-d67b-4893-8634-fd1d03bbcd22_2700x2400.png 424w, https://substackcdn.com/image/fetch/$s_!Cbyq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faec97bcb-d67b-4893-8634-fd1d03bbcd22_2700x2400.png 848w, https://substackcdn.com/image/fetch/$s_!Cbyq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faec97bcb-d67b-4893-8634-fd1d03bbcd22_2700x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!Cbyq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faec97bcb-d67b-4893-8634-fd1d03bbcd22_2700x2400.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Cbyq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faec97bcb-d67b-4893-8634-fd1d03bbcd22_2700x2400.png" width="1456" height="1294" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aec97bcb-d67b-4893-8634-fd1d03bbcd22_2700x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1294,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:660649,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://abstatisticalconsulting.substack.com/i/187690321?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faec97bcb-d67b-4893-8634-fd1d03bbcd22_2700x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Cbyq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faec97bcb-d67b-4893-8634-fd1d03bbcd22_2700x2400.png 424w, https://substackcdn.com/image/fetch/$s_!Cbyq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faec97bcb-d67b-4893-8634-fd1d03bbcd22_2700x2400.png 848w, https://substackcdn.com/image/fetch/$s_!Cbyq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faec97bcb-d67b-4893-8634-fd1d03bbcd22_2700x2400.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Cbyq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faec97bcb-d67b-4893-8634-fd1d03bbcd22_2700x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Base (Bayesian)</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yY-D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68f70690-1ce9-42f1-9d21-02a34b837b1f_2700x2400.png" 
data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yY-D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68f70690-1ce9-42f1-9d21-02a34b837b1f_2700x2400.png 424w, https://substackcdn.com/image/fetch/$s_!yY-D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68f70690-1ce9-42f1-9d21-02a34b837b1f_2700x2400.png 848w, https://substackcdn.com/image/fetch/$s_!yY-D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68f70690-1ce9-42f1-9d21-02a34b837b1f_2700x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!yY-D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68f70690-1ce9-42f1-9d21-02a34b837b1f_2700x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yY-D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68f70690-1ce9-42f1-9d21-02a34b837b1f_2700x2400.png" width="1456" height="1294" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/68f70690-1ce9-42f1-9d21-02a34b837b1f_2700x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1294,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:659376,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://abstatisticalconsulting.substack.com/i/187690321?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68f70690-1ce9-42f1-9d21-02a34b837b1f_2700x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yY-D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68f70690-1ce9-42f1-9d21-02a34b837b1f_2700x2400.png 424w, https://substackcdn.com/image/fetch/$s_!yY-D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68f70690-1ce9-42f1-9d21-02a34b837b1f_2700x2400.png 848w, https://substackcdn.com/image/fetch/$s_!yY-D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68f70690-1ce9-42f1-9d21-02a34b837b1f_2700x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!yY-D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68f70690-1ce9-42f1-9d21-02a34b837b1f_2700x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Improved Normal (Bayesian)</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ajvx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ce84303-00be-461f-a1d3-7e05fb6be1fb_2700x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ajvx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ce84303-00be-461f-a1d3-7e05fb6be1fb_2700x2400.png 424w, 
https://substackcdn.com/image/fetch/$s_!ajvx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ce84303-00be-461f-a1d3-7e05fb6be1fb_2700x2400.png 848w, https://substackcdn.com/image/fetch/$s_!ajvx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ce84303-00be-461f-a1d3-7e05fb6be1fb_2700x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!ajvx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ce84303-00be-461f-a1d3-7e05fb6be1fb_2700x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ajvx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ce84303-00be-461f-a1d3-7e05fb6be1fb_2700x2400.png" width="1456" height="1294" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0ce84303-00be-461f-a1d3-7e05fb6be1fb_2700x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1294,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:655951,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://abstatisticalconsulting.substack.com/i/187690321?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ce84303-00be-461f-a1d3-7e05fb6be1fb_2700x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!ajvx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ce84303-00be-461f-a1d3-7e05fb6be1fb_2700x2400.png 424w, https://substackcdn.com/image/fetch/$s_!ajvx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ce84303-00be-461f-a1d3-7e05fb6be1fb_2700x2400.png 848w, https://substackcdn.com/image/fetch/$s_!ajvx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ce84303-00be-461f-a1d3-7e05fb6be1fb_2700x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!ajvx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ce84303-00be-461f-a1d3-7e05fb6be1fb_2700x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Improved Beta (Bayesian)</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tHQF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab0a7d8-3431-4320-a4b0-646f1905395e_2700x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tHQF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab0a7d8-3431-4320-a4b0-646f1905395e_2700x2400.png 424w, https://substackcdn.com/image/fetch/$s_!tHQF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab0a7d8-3431-4320-a4b0-646f1905395e_2700x2400.png 848w, https://substackcdn.com/image/fetch/$s_!tHQF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab0a7d8-3431-4320-a4b0-646f1905395e_2700x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!tHQF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab0a7d8-3431-4320-a4b0-646f1905395e_2700x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tHQF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab0a7d8-3431-4320-a4b0-646f1905395e_2700x2400.png" width="1456" height="1294" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6ab0a7d8-3431-4320-a4b0-646f1905395e_2700x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1294,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:660164,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://abstatisticalconsulting.substack.com/i/187690321?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab0a7d8-3431-4320-a4b0-646f1905395e_2700x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tHQF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab0a7d8-3431-4320-a4b0-646f1905395e_2700x2400.png 424w, https://substackcdn.com/image/fetch/$s_!tHQF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab0a7d8-3431-4320-a4b0-646f1905395e_2700x2400.png 848w, https://substackcdn.com/image/fetch/$s_!tHQF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab0a7d8-3431-4320-a4b0-646f1905395e_2700x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!tHQF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab0a7d8-3431-4320-a4b0-646f1905395e_2700x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>The smallest benchmark has 10 results, with an average of 33.7.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>We also need there to be large enough numbers of benchmarks and LLMs themselves, although this is likely satisfied with the 37 benchmarks and ~150 LLMs.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div 
class="footnote-content"><p>As discussed in the first post, Epoch actually use a non-hierarchical bootstrap setup, but I think this is doubly inappropriate: a hierarchical bootstrap would be more valid (due to the hierarchical structure of the data), but even that would not be well justified theoretically (because of the low number of data points per LLM).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Technically Bayesian models produce <em>Credible Intervals</em> as opposed to frequentist <em>Confidence Intervals</em>. While extremely long debates are possible about the merits and interpretations of both, for the purposes of these posts I will use and describe both interchangeably as &#8216;CIs&#8217;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>For more details see the <a href="https://abstatisticalconsulting.substack.com/p/kicking-the-tires-of-the-epoch-capabilities">first post</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p><a href="https://bjlkeng.io/posts/probabilistic-interpretation-of-regularization/">This post</a> gives more details on how to think about the equivalence.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Since it just controls the overall scale of the loss, but this means it isn&#8217;t required for finding the parameter values that minimise it in the 
original penalised regression.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>I gave it a uniform prior to minimise the amount that it would contribute to the loss when fitting the model, to keep things as close to the frequentist model as possible.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>This means that the true mean and variance of the distribution do not match the distribution&#8217;s parameters, but since it mostly seems like the model should be able to learn around this I took no steps to address it, and leave it as an extension for further work. When calculating the Pearson residuals I use the true mean and variance of the truncated distribution.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Ideally I would consider the impact of each change in isolation, but as I believe these adjustments are individually well-motivated, and in the interest of time, I will just combine them.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>On the untransformed scale before we convert Claude Sonnet 3.5 to 130 and ChatGPT 5.0 to 150.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>Since in any model with normally distributed 
errors (or in Epoch&#8217;s model) the loss is a function of just (predicted - actual), which is 0.05 in both cases.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>One way to think about this would be considering the relative error instead of the absolute error.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p>To standardise the residuals for the frequentist model I used their observed standard deviation.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p>The beta QQ plot here uses an alternative method (PIT vs uniform) to the direct quantile comparisons in the other two plots, so its shape isn&#8217;t directly comparable, but it remains the case that a perfect fit would be a straight line on y=x.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-16" href="#footnote-anchor-16" class="footnote-number" contenteditable="false" target="_self">16</a><div class="footnote-content"><p>For the replication model I compute the LOO CV directly, but as the Bayesian models take longer to fit I instead use the PSIS-LOO approximation from the loo R package. 
On each Bayesian model ~50/1248 samples have high Pareto-k values which indicate they might have biased results, but in the interests of time I did not investigate or address this.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-17" href="#footnote-anchor-17" class="footnote-number" contenteditable="false" target="_self">17</a><div class="footnote-content"><p>This is constructed by scaling each residual by min(10, 2/sqrt(p(1-p))) where p is the actual value observed.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-18" href="#footnote-anchor-18" class="footnote-number" contenteditable="false" target="_self">18</a><div class="footnote-content"><p>Also calculated using PSIS LOO via the loo package.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-19" href="#footnote-anchor-19" class="footnote-number" contenteditable="false" target="_self">19</a><div class="footnote-content"><p>One could explore models to try and capture this behaviour, but in the interests of time I do not do this here.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Kicking the Tires of the Epoch Capabilities Index (ECI) Part 1: Introduction and Replication]]></title><description><![CDATA[Using Epoch's public data and code I replicate the construction of the ECI, and discover some issues and undisclosed steps in their process.]]></description><link>https://abstatisticalconsulting.substack.com/p/kicking-the-tires-of-the-epoch-capabilities</link><guid isPermaLink="false">https://abstatisticalconsulting.substack.com/p/kicking-the-tires-of-the-epoch-capabilities</guid><dc:creator><![CDATA[Alexander Barry]]></dc:creator><pubDate>Wed, 11 Feb 2026 18:19:06 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d0c6862c-9162-4f75-83a6-a69d351e315c_2285x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>I&#8217;m 
Alexander Barry, an independent statistical consultant, and this is the first post on my Substack. I expect to post here occasionally when I have thoughts or work that I think would be interesting to share. For more information or to enquire about hiring me for a project see my website <a href="http://abstats.co.uk">abstats.co.uk</a></em></p><h1>Introduction</h1><p>In December 2025 <a href="https://epoch.ai/">Epoch AI</a> launched their <a href="https://epoch.ai/benchmarks/eci">Epoch Capabilities Index</a> (ECI). This seeks to combine measures of LLM performance from many benchmarks together to create a unified scale that allows comparisons between LLMs, built on the paper <a href="https://arxiv.org/abs/2512.00193">A Rosetta Stone for AI Benchmarks</a> they wrote in conjunction with DeepMind.<br><br>The approach they use is inspired by <a href="https://en.wikipedia.org/wiki/Item_response_theory">Item Response Theory</a>, an area that was originally developed for use in human testing but has recently been seeing use in LLM evaluations, perhaps most prominently in <a href="https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/">METR&#8217;s Time Horizon</a> work.<br><br>I have been working with METR on a statistical model for time horizon calculations, and so have been thinking a lot about this area. 
With this background when I saw the ECI launch and read the accompanying paper I thought it would be interesting to see if I could apply some of my knowledge to the ECI.<br><br>This is part one of a three part series of posts about the ECI:</p><ul><li><p>In part 1 (this post) I describe the process of directly replicating the ECI, highlighting some details of how it is constructed that are either not obvious or differ from the information on Epoch&#8217;s website or the paper.</p></li><li><p>In <a href="https://abstatisticalconsulting.substack.com/p/kicking-the-tires-of-the-epoch-capabilities-741">part 2</a> I develop various alternative statistical models for constructing the ECI, and look at how these impact the results.</p></li><li><p>In part 3 (upcoming) I look at derived results from the ECI, such as whether the trend of increasing ECI over time is accelerating, the relative importance of the different benchmarks, and the impact of adding additional dimensions of ability.</p></li></ul><p>These posts will go into relatively high amounts of technical detail about the statistical models involved, so if that is not of interest I suggest liberal skimming. 
I will summarise the important takeaways in the conclusion of each post.</p><h1>Epoch&#8217;s Methodology</h1><p>To replicate the ECI I rely on three main sources of information released by Epoch:</p><ol><li><p>The &#8220;<a href="https://arxiv.org/abs/2512.00193">A Rosetta stone for AI Benchmarks</a>&#8221; paper they published with DeepMind, and the <a href="https://github.com/epoch-research/benchmark-stitching/">replication code</a> they published alongside the paper.</p></li><li><p>The information on the <a href="https://epoch.ai/benchmarks/eci">ECI section</a> of the Epoch website, and the public <a href="https://github.com/epoch-research/eci-public">Github repo</a> they released for the ECI.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p></li><li><p>The <a href="https://epoch.ai/benchmarks/use-this-data">data Epoch release on their website</a>, accessed on 2026/02/04</p></li></ol><h2>Epoch&#8217;s Data</h2><p>The list of benchmarks used to calculate the &#8216;live&#8217; ECI results given on Epoch&#8217;s website is outdated<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>, but fortunately (and to their credit) they <a href="https://epoch.ai/benchmarks/use-this-data">provide the full dataset</a> so it is possible to reconstruct from this which benchmarks are used<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. 
For a full list see the appendix.</p><p>They process the data by:<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p><ol><li><p>Removing any LLMs released prior to 2023/01/01 and any benchmarks outside the 37 selected.</p></li><li><p>Combining together the results from any LLMs they consider to have the same base model, taking the maximum result whenever there are multiple results for the same benchmark.</p><ol><li><p>As far as I can tell they don&#8217;t have any public list of which LLMs they think should be combined in this way, but in one of the spreadsheets they include both &#8220;Model name&#8221; and &#8220;Model version&#8221; columns, with the former being the criterion they use for combining LLMs.<br><br>Note the way they combine models conflicts with their claim<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> that they only aggregate together models with the same release date, as 24/144 of the sets of aggregated LLMs (same &#8216;Model&#8217; column and ECI values in Epoch&#8217;s data) contain LLMs with different release dates, and 12 of them have release dates that differ by over 30 days. An example of this is that DeepSeek V3 (released Dec 2024) and DeepSeek V3-0324 (released March 2025) seem to be aggregated together and treated as a single model launched in March. <br><br>When LLMs with different release dates are aggregated together in this way the release date that Epoch assign to the resulting aggregated LLM seems to be arbitrary<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>, which can cause problems with evaluating the trend in ECI over time, e.g. 
Gemini 1.5 Pro is given release date 2024-05-24, despite all of its benchmark results that contribute to the ECI coming from the 002 release launched on 2024-09-24. This is especially problematic as it results in Gemini 1.5 Pro appearing to be SOTA on release (given as 2024-05-24), but this is purely an artifact of its incorrectly assigned release date. <br><br>Epoch have confirmed this is a mistake in how the LLMs are aggregated and they will correct it in an update. Since the goal of this post is replication I will proceed with the same (flawed) approach as Epoch here, but will explore changing this in future posts.</p></li></ol></li><li><p>Removing any (aggregated) LLMs with &lt;4 benchmark results (from the set of 37 benchmarks)</p></li><li><p>Linearly rescaling results on any benchmarks on which guessing is possible so that guessing would (on average) correspond with a score of zero.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> If this results in any scores less than zero these are replaced with zero.</p><ol><li><p>The list of which benchmarks Epoch considers to allow guessing (and the probability of correctly guessing) is not given on their website or in the paper, but is in the ECI Github repo. 
It matches intuitions with n-option multiple choice questions being given 1/n guessing rates.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a></p></li></ol></li></ol><p>Once these steps are complete we are left with a dataset of 144 LLMs&#8217; performance on 37 benchmarks, with 1248 total benchmark scores (an average of 8.7 benchmark results per LLM).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a></p><h2>Epoch&#8217;s Model</h2><p>Epoch predict the performance (from 0 to 100%) of LLM m on benchmark b as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{predicted_score(m,b)} = \\sigma\\bigl(\\alpha_b \\cdot (C_m - D_b)\\bigr) \\quad \\text{where} \\quad \\sigma(x) := \\frac{1}{1 + e^{-x}}&quot;,&quot;id&quot;:&quot;FGUGDFPVQZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here &#120572;<em><sub>b</sub></em> and <em>D<sub>b</sub></em> are parameters reflecting the benchmarks discrimination and difficulty respectively and <em>C<sub>m</sub></em> is the LLM&#8217;s ability. 
When <em>C<sub>m</sub></em> = <em>D<sub>b</sub></em> the LLM is predicted to score exactly 50% on the benchmark, and &#120572;<em><sub>b</sub></em> controls how quickly this changes as <em>C<sub>m</sub></em> gets smaller or larger than<em> D<sub>b</sub></em>.<br><br>To fix the scale of the model they set the discrimination parameter of the WinoGrande benchmark to &#120572; = 1.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a> The model is then fit by finding (via optimisation) the parameter values that minimise the following loss:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sum_{n=1}^{N} \\left( (\\text{predicted_score}_n - \\text{actual_score}_n)^2 \\right) + \\frac{0.1}{M + 2B - 2} \\left( \\sum_{b=1}^{B} (\\alpha_b^2 + D_b^2) + \\sum_{m=1}^{M} C_m^2 \\right)&quot;,&quot;id&quot;:&quot;UVEPPKFNCQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>where N is the total number of datapoints (1248), M is the number of LLMs (144) and B the number of benchmarks (37), with the constraints that &#120572; &#8712; [0.1,10] and <em>C, D</em> &#8712; [-10,10].<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a> </p><p>This corresponds to minimising the squared prediction error, weighting all models and benchmarks equally, plus very weak L2 regularisation where the parameters are shrunk towards zero, but scaled so that the total penalty from all the parameters is weighted only 1/10th as much as a prediction error on a single data point. 
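</p><p>As a minimal sketch of the model and loss just described (an illustration only, not Epoch&#8217;s code or my replication code), the following Python implements the formulas as written above. The function names, the tuple layout of the data and the assumption that scores are on a 0 to 1 scale are all hypothetical choices made for the example.</p>

```python
import math


def sigmoid(x):
    """Logistic function from the predicted score formula."""
    return 1.0 / (1.0 + math.exp(-x))


def predicted_score(alpha_b, c_m, d_b):
    """Predicted score (0 to 1) for an LLM with ability c_m on a benchmark
    with discrimination alpha_b and difficulty d_b."""
    return sigmoid(alpha_b * (c_m - d_b))


def eci_loss(scores, alpha, difficulty, ability, reg=0.1):
    """Squared prediction error plus the scaled L2 penalty.

    scores is a list of (llm_index, benchmark_index, actual_score) tuples;
    this layout is an assumption made for the sketch.
    """
    n_llms, n_benchmarks = len(ability), len(alpha)
    squared_error = sum(
        (predicted_score(alpha[b], ability[m], difficulty[b]) - actual) ** 2
        for m, b, actual in scores
    )
    penalty = (
        sum(a ** 2 for a in alpha)
        + sum(d ** 2 for d in difficulty)
        + sum(c ** 2 for c in ability)
    )
    # The penalty is divided by M + 2B - 2, as in the loss formula above.
    return squared_error + reg / (n_llms + 2 * n_benchmarks - 2) * penalty


# Penalty coefficient for the dataset sizes quoted in the post.
penalty_coefficient = 0.1 / (144 + 2 * 37 - 2)
```

<p>With the counts given above (M = 144, B = 37) the denominator M + 2B - 2 is 216, so the coefficient on the summed squared parameters is 0.1/216, roughly 0.00046. The sketch leaves out the box constraints on the parameters and the fixing of the WinoGrande discrimination parameter, which would be handled by the optimiser.</p><p>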
Note that dividing by the number of parameters in the regularisation term is not the conventional approach, and results in much weaker regularisation than you would get from the conventional approach with the same nominal regularisation strength.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a> </p><p>After the model is fit the results are linearly transformed so that Claude Sonnet 3.5 has ECI 130 and GPT 5 has ECI 150.</p><p>Epoch construct confidence intervals for the ECI results by bootstrapping, taking a non-hierarchical approach of resampling from the entire dataset with replacement without e.g. taking into account which LLMs the datapoints correspond to. They use 100 bootstrap samples, which is quite low.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a></p><p>The non-hierarchical approach means that it is possible for LLMs/benchmarks to be totally dropped from individual bootstrap samples. If this occurs then due to the penalisation any alpha values will be shrunk to 0.1 (the lower bound), and C/D values shrunk to zero on the raw scale, which corresponds to an ECI value of ~124.8. This occurs in ~1.8% of bootstrap samples for LLMs with the minimum 4 benchmark results, and so can potentially shift their 90% CIs somewhat.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a></p><h1>Replication Setup</h1><p>To try and replicate the ECI I wrote R code to implement the data processing and model fitting process described above as closely as possible, starting with the raw data available on Epoch&#8217;s website, as accessed on 2026/02/04.</p><p>I used 10,000 bootstrap samples, and wrote custom C++ code to evaluate the loss function given above for speed when finding the optimal parameters. 
With this it takes about 1 minute to run all the replication code.</p><p>My full replication code is <a href="https://github.com/Alexander-Barry/ECI_replication">available here</a>.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a></p><h1>Replication Results</h1><p>My replication results match Epoch&#8217;s results very closely, both for the LLM ECI results and for the benchmark parameters:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6u5o!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6551d95d-83c5-48df-9e30-1da15068662a_2100x750.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6u5o!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6551d95d-83c5-48df-9e30-1da15068662a_2100x750.png 424w, https://substackcdn.com/image/fetch/$s_!6u5o!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6551d95d-83c5-48df-9e30-1da15068662a_2100x750.png 848w, https://substackcdn.com/image/fetch/$s_!6u5o!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6551d95d-83c5-48df-9e30-1da15068662a_2100x750.png 1272w, https://substackcdn.com/image/fetch/$s_!6u5o!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6551d95d-83c5-48df-9e30-1da15068662a_2100x750.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!6u5o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6551d95d-83c5-48df-9e30-1da15068662a_2100x750.png" width="1456" height="520" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6551d95d-83c5-48df-9e30-1da15068662a_2100x750.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:520,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:146099,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://abstatisticalconsulting.substack.com/i/185281398?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6551d95d-83c5-48df-9e30-1da15068662a_2100x750.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6u5o!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6551d95d-83c5-48df-9e30-1da15068662a_2100x750.png 424w, https://substackcdn.com/image/fetch/$s_!6u5o!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6551d95d-83c5-48df-9e30-1da15068662a_2100x750.png 848w, https://substackcdn.com/image/fetch/$s_!6u5o!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6551d95d-83c5-48df-9e30-1da15068662a_2100x750.png 1272w, https://substackcdn.com/image/fetch/$s_!6u5o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6551d95d-83c5-48df-9e30-1da15068662a_2100x750.png 
1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Looking only at the LLMs Epoch considers SOTA on launch<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-16" href="#footnote-16" target="_self">16</a> and comparing the confidence intervals:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VSyV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd691e3b7-8f68-4421-aafb-19c6031e5656_1800x1200.png" 
data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VSyV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd691e3b7-8f68-4421-aafb-19c6031e5656_1800x1200.png 424w, https://substackcdn.com/image/fetch/$s_!VSyV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd691e3b7-8f68-4421-aafb-19c6031e5656_1800x1200.png 848w, https://substackcdn.com/image/fetch/$s_!VSyV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd691e3b7-8f68-4421-aafb-19c6031e5656_1800x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!VSyV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd691e3b7-8f68-4421-aafb-19c6031e5656_1800x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VSyV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd691e3b7-8f68-4421-aafb-19c6031e5656_1800x1200.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d691e3b7-8f68-4421-aafb-19c6031e5656_1800x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:152012,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://abstatisticalconsulting.substack.com/i/185281398?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd691e3b7-8f68-4421-aafb-19c6031e5656_1800x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VSyV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd691e3b7-8f68-4421-aafb-19c6031e5656_1800x1200.png 424w, https://substackcdn.com/image/fetch/$s_!VSyV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd691e3b7-8f68-4421-aafb-19c6031e5656_1800x1200.png 848w, https://substackcdn.com/image/fetch/$s_!VSyV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd691e3b7-8f68-4421-aafb-19c6031e5656_1800x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!VSyV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd691e3b7-8f68-4421-aafb-19c6031e5656_1800x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The values generally match very closely, albeit with slightly different confidence intervals, which I believe is caused by random variation due to Epoch only using 100 bootstrap samples.</p><p>For the full results (ECI results for every LLM and difficulty and discrimination parameters for every benchmark) see the appendix.</p><h1>Conclusion</h1><p>Due to Epoch releasing all their data and code publicly I was able to replicate their ECI results very closely, and I think they deserve praise for this transparency. 
</p><p>However, there are a few instances where the process used to construct the ECI does not match their public description:</p><ul><li><p>The list of benchmarks on their website is out of date, with two changes since it was published, as is the number of LLMs included.</p></li><li><p>The way that results from different versions of the same &#8216;base&#8217; LLM are aggregated sometimes combines LLMs with very different release dates, despite Epoch stating that only LLMs with the same release date are combined. This causes particular issues because the release date they assign to the resulting aggregated model is effectively arbitrary, which can result in cases such as Gemini 1.5 Pro being declared SOTA on launch on 2024-05-24 entirely due to benchmark results from the 002 update released on 2024-09-24.</p></li><li><p>Epoch&#8217;s use of only 100 bootstrap samples adds unnecessary noise to the confidence intervals of their ECI estimates, and their non-hierarchical approach will slightly bias the confidence intervals for LLMs with few benchmark results towards an ECI of ~125.</p></li><li><p>The level of regularisation used in the regression is much weaker than stated in the paper (by a factor of ~200) due to an unconventional loss setup, and various other details (restricted parameter ranges and the exact bootstrap setup) are not publicly stated anywhere.</p></li></ul><p>While the latter two are technical details, the first two seem at least potentially relevant to public understanding of ECI results. 
I sent this post to Epoch for pre-publication review and they confirmed that both issues exist and will be fixed in future updates.</p><p>In <a href="https://abstatisticalconsulting.substack.com/p/kicking-the-tires-of-the-epoch-capabilities-741">part 2</a> I present various ways I attempt to improve on this model, and how they impact the ECI results.</p><h1>Appendix</h1><h2>List of included benchmarks</h2><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/IrVNJ/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e0fe9f05-8443-4d62-a0b1-f66455569f75_1220x2750.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8d5de041-eb93-4486-b178-cb89a4668664_1220x2750.png&quot;,&quot;height&quot;:1302,&quot;title&quot;:&quot;Created with Datawrapper&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/IrVNJ/1/" width="730" height="1302" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use 
strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><h2>List of aggregated LLMs</h2><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/9DQvY/3/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/89062a1e-9771-4092-bfb7-63b46367cf0f_1220x2490.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/631845e9-398b-4c30-8b98-0604855366a2_1220x2490.png&quot;,&quot;height&quot;:1424,&quot;title&quot;:&quot;Created with Datawrapper&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/9DQvY/3/" width="730" height="1424" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><h2>Full Results</h2><h3>LLM Results</h3><div id="datawrapper-iframe" class="datawrapper-wrap outer" 
data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/CvOxx/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a7b8c119-c47e-477f-a35f-3d35b3995da5_1220x1946.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/87bad743-776c-4489-a198-7670679cb964_1220x1946.png&quot;,&quot;height&quot;:878,&quot;title&quot;:&quot;Created with Datawrapper&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/CvOxx/1/" width="730" height="878" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><h3>Benchmark Results</h3><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/QqkmL/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bded227e-907f-41e8-8a4e-b323aa743718_1220x1498.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0b5472a1-ab0d-47e0-9d10-7c165690202d_1220x1498.png&quot;,&quot;height&quot;:708,&quot;title&quot;:&quot;Created with Datawrapper&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/QqkmL/1/" width="730" height="708" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 
0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>The ECI repo isn&#8217;t linked anywhere on the Epoch website, but they pointed me to it in private communication. Currently the public code isn&#8217;t used directly to calculate the ECI, but they say it is representative and they plan to switch over to it in the future.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>They list &#8216;LiveBench&#8217; and &#8216;SuperGLUE&#8217; which seem to have been replaced with &#8216;Chess Puzzles&#8217; and &#8216;The Agent Company&#8217;. 
Epoch confirm this is out of date and they plan to update it soon.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>I did this by checking which benchmarks are given difficulty and discrimination parameters in the &#8220;eci_benchmark_difficulties_and_slopes.csv&#8221; spreadsheet that is included in the data.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>They mention the existence of all of these processing steps in their <a href="https://epoch.ai/benchmarks/eci">description of the ECI</a>, although they don&#8217;t always give all the details.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>The claim is made in the <a href="https://epoch.ai/benchmarks/eci#data">ECI Data section</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>I believe the release date Epoch list on their website is determined simply by whichever row for a given model happens to come first in the &#8220;epoch_capabilities_index.csv&#8221; spreadsheet, and as far as I can tell this ordering is arbitrary (e.g. 
it is not consistently the first or the last date associated with a given model, or the date from which most of the benchmark results are drawn).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>E.g. multiple-choice questions with 4 options would be transformed by f(x) = (x - 0.25)/0.75 so that 25% &#8594; 0% and 100% &#8594; 100%.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>This includes &#8220;OTIS Mock AIME 2024-2025&#8221;, whose answers are integers from 0 to 999, and which thus has a 1/1000 guessing rate.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>Note this also does not match the LLM and datapoint figures on Epoch&#8217;s website, which Epoch confirm are out of date and have not been updated as new LLMs have been added.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Technically WinoGrande&#8217;s difficulty parameter <em>D</em> is also set to zero, but this is done after the model is fit by a simple shift to all <em>C</em> and <em>D</em> parameters, so it does not have any effect given that the results are then transformed again to the ECI scale. 
In particular, the penalisation and range restrictions are applied to the <em>C</em> and <em>D</em> values before they are shifted to have WinoGrande&#8217;s <em>D</em>=0.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>Note this is not discussed in the paper, and is only apparent from inspecting the replication code. Since WinoGrande is fixed to have an &#120572; value of 1, the &#120572; restriction corresponds to an assumption that no benchmark is &gt;10x more/less discriminative than WinoGrande. The limits on <em>C</em> and <em>D</em> correspond (in the current fit) to assuming no LLM or benchmark has an ability/difficulty level below -95.6 or above 345.1 on the ECI scale.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>This detail is not included in the paper, which states a penalisation strength of 0.1, but is apparent from inspecting the replication code. 
The setup used instead corresponds to a conventional penalisation strength of lambda = ~0.0005.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>This is not covered anywhere on the website (or in the paper, which does not use the bootstrapped confidence intervals) but it is apparent from inspecting the code in the ECI GitHub repo.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p>In cases where these shrunk values would otherwise fall entirely outside the 90% CI, this shifts the interval to effectively be e.g. (7%, 97%) instead of (5%, 95%).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p>Who replicates the replicators?</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-16" href="#footnote-anchor-16" class="footnote-number" contenteditable="false" target="_self">16</a><div class="footnote-content"><p>Starting with &#8216;GPT-4 (March 2024)&#8217; as Epoch also do when presenting their ECI &#8216;frontier trend&#8217;.</p></div></div>]]></content:encoded></item></channel></rss>