Alexander Barry

Predicting GPT 5.5 Time Horizon from its ECI

Alexander Barry — Thu, 30 Apr 2026 19:58:52 GMT

Note: While I have worked with both METR and Epoch in my capacity as a statistical consultant, this post is based entirely on public information. For more information, or to enquire about hiring me for a project, see abstats.co.uk

In my last post, I looked at using Anthropic’s internal version of the Epoch Capabilities Index (‘AECI’) to predict METR time horizon values for Mythos Preview and Opus 4.7.

We can also do the same using the ‘official’ ECI from Epoch AI, with the particular application of predicting the time horizon for GPT 5.5, which has an ECI value but does not yet have official time horizon results.

GPT 5.5 Time Horizon Prediction

We will keep the same structure as last time, modelling the logarithm of time horizon as a linear function of ECI, this time using just the OpenAI LLMs:[1]

This would place GPT 5.5 behind current leader Opus 4.6 on 50% time horizon, but slightly ahead on 80% time horizon, where Gemini 3.1 Pro currently leads with 1h 30m.

Both of these are behind my previous predictions for Opus 4.7 and Mythos Preview (50% TH of 18.8 and 40.3 hours, 80% TH of 2.6 and 5.5 hours, respectively).

Notably, these values for GPT 5.5 are low enough that they should be just about within the limits of what the current TH1.1 task suite can estimate.[2]
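As a minimal sketch of the kind of fit used in this section: ordinary least squares of log time horizon on ECI, followed by a prediction for a new ECI value. The (ECI, time horizon) pairs below are invented purely for illustration and are not the values behind the predictions in this post, and the function names are hypothetical.

```python
import math

def fit_log_linear(eci_values, horizon_hours):
    """Least squares fit of log(time horizon) as a linear function of ECI."""
    logs = [math.log(h) for h in horizon_hours]
    n = len(eci_values)
    mean_x = sum(eci_values) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in eci_values)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(eci_values, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict_horizon(intercept, slope, eci_value):
    """Back-transform the fitted line to get a time horizon in hours."""
    return math.exp(intercept + slope * eci_value)

# Hypothetical data: the time horizon quadruples for every 10 ECI points.
example_eci = [130.0, 140.0, 150.0]
example_horizon = [0.5, 2.0, 8.0]

intercept, slope = fit_log_linear(example_eci, example_horizon)
prediction = predict_horizon(intercept, slope, 155.0)
print(f"Predicted time horizon at ECI 155: {prediction:.1f} hours")
```

Because the fit is linear on the log scale, a fixed difference in ECI corresponds to a fixed multiplicative change in predicted time horizon, which is why small ECI gaps between frontier LLMs can translate into large gaps in predicted hours.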

The OpenAI fit also seems somewhat influenced by GPT-5.3 Codex and GPT-5.4’s relatively low time horizon results, which were partially caused by an unusually high number of reward hacking attempts. Removing them gives a somewhat different fit:

Comparing to AECI Fit

We can also compare the results from using the official ECI to predict Opus 4.7’s time horizon to our previous attempts using the AECI:

The AECI has slightly higher R^2 values, perhaps because Anthropic’s internal benchmarks are more SWE-focused than those that make up the overall ECI.

Using the ECI reduces the predicted time horizons for Opus 4.7 compared to using the AECI. This is largely due to Opus 4.7’s ECI being very similar to that of Opus 4.6, whereas there is a larger gap between their AECI values.

All-lab Fit

In the above sections I used separate fits for Anthropic and OpenAI LLMs. We can try a combined fit that uses all LLMs which have both ECI and time horizon values:

This is a somewhat worse fit than we saw above - it seems that the relationships genuinely are slightly different for Anthropic vs OpenAI LLMs. This isn’t too surprising, as it is what we would expect if one lab’s LLMs are consistently slightly better at software engineering compared to their other abilities.

[1] I also included GPT 3.5 with ECI 119 from Epoch’s attempt to extend the ECI to earlier LLMs.

[2] Unlike Opus 4.7 and Mythos, which seem likely to saturate the task suite too completely for accurate 50% time horizon estimates.

Predicting Time Horizon from Anthropic's Internal ECI

Alexander Barry — Wed, 22 Apr 2026 13:51:16 GMT

Anthropic’s Internal ECI

Starting with Mythos Preview, Anthropic included a new ‘Anthropic ECI’ or ‘AECI’ in their system cards. This is based on the same methodology as the Epoch Capabilities Index[1] but with additional information from their internal benchmarks.[2]

Unfortunately they only share the results in the form of a pretty but hard to read plot:

AECI plot taken directly from the Opus 4.7 model card

I extracted all of the values from this to obtain:

Mapping to Time Horizon

When Epoch released the ECI they noted that it was very predictive of METR’s 50% Time Horizon results.[3]

We can leverage this to see how AECI relates to 50% and 80% Time Horizon, to get early estimates of how Opus 4.7 and Mythos would perform:

Both show a clear relationship, especially the 50% time horizon, with an R^2 of >0.99 (for log-scale Time Horizon). We predict:

Opus 4.7:
- 50% Time Horizon: 18.8 hours
- 80% Time Horizon: 2.6 hours
Mythos Preview:
- 50% Time Horizon: 40.3 hours
- 80% Time Horizon: 5.5 hours

However, we shouldn’t expect to see actual 50% Time Horizon values this high from METR, as the longest task in their current TH1.1 task suite is 30 hours (and very few are over 16 hours).[4]

Thus until METR update their task suite to include more tasks, we expect they may not be able to accurately measure the 50% time horizon of the most capable models. The 80% time horizon should still fall within the measurable range however, so we will be able to compare those when they are released to these predicted values.

[1] See the two previous posts on this Substack for more discussion of the ECI.

[2] In the version shared with Mythos Preview they also accidentally scaled 3.5 Sonnet (New) to have ECI 130 instead of the original 3.5 Sonnet release, as Epoch does. They fixed this issue in the release with the Opus 4.7 system card. If only someone could have foreseen releasing a model with the same name twice causing confusion.

[3] In particular, when fitting the log of time horizon as a linear function of ECI.

[4] You can view the interactive task success rate plot on this page (which I helped create) to better understand the tasks that go into calculating METR’s time horizon results.

Kicking the Tires of the Epoch Capabilities Index (ECI) Part 2: Uncertainty and Alternative Models

Alexander Barry — Sun, 15 Feb 2026 06:19:44 GMT

This is the second post in my three part series on the Epoch Capabilities Index (ECI). For background on the ECI and my replication of Epoch’s process of constructing it see the first post. See part three (upcoming) for multidimensional extensions of ECI, the relative importance of different benchmarks for calculating the ECI, and whether the trend in ECI improvements has been speeding up.

Introduction

In the first post of this series I replicated Epoch’s exact method for constructing the ECI. In this post I explore alternative models, and see how they impact the results and model fit. However, I will first discuss the process Epoch use to construct their confidence intervals for the ECI, why I think it might not be appropriate, and how Bayesian models could be a superior alternative.

Note that while I found some issues with the data used to construct the ECI in the first post, for this post I will continue using the same underlying data as accessed on 2026/02/04. In part three of this series (upcoming) I will look at the impacts of any updates to the data.

Uncertainty

As discussed in the last post, Epoch use (non-hierarchical) bootstrapping to produce the confidence intervals for their ECI results. The idea of bootstrapping is to repeatedly resample with replacement from the data as a way of estimating what the data ‘could have been’. This is a powerful and flexible approach, but to be valid it relies on having enough datapoints that the resamples are representative of real world data.
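To make the idea concrete, here is a minimal sketch of a plain (non-hierarchical) percentile bootstrap for a simple statistic. It is not Epoch's code: the scores are made up, the function name is hypothetical, and the real procedure refits the whole ECI model on every resample rather than taking a mean.

```python
import random
import statistics

def bootstrap_ci(data, statistic, n_resamples=2000, level=0.90, seed=0):
    """Percentile bootstrap interval for `statistic` computed on `data`."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        # Resample n datapoints with replacement, ignoring any grouping.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    estimates.sort()
    lower_index = int((1 - level) / 2 * n_resamples)
    upper_index = int((1 + level) / 2 * n_resamples) - 1
    return estimates[lower_index], estimates[upper_index]

# Hypothetical benchmark scores for a single LLM.
scores = [0.42, 0.55, 0.61, 0.38, 0.70, 0.49, 0.58, 0.66]
low, high = bootstrap_ci(scores, statistics.mean)
print(f"90% bootstrap interval for the mean: ({low:.3f}, {high:.3f})")
```

With only a handful of datapoints, the resamples can only ever recombine the same few values, which is the intuition behind the concern about small per-LLM counts.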

While ~1250 benchmark results contribute to the ECI overall, the average number of results per LLM is just 8.7 and the minimum for inclusion is only 4.[1] This matters because the ECI model is naturally hierarchical (the data is grouped by benchmark/LLM), which makes the conditions for bootstrap validity more demanding. It’s not enough to have a large total sample; we also need sufficient data within each group, and per-LLM counts this low mean the theoretical guarantees that justify bootstrapped confidence intervals don’t apply.[2][3]

Fortunately there is an alternative: using Bayesian statistics lets us sidestep bootstrapping entirely and directly model the hierarchical structure. Bayesian models naturally produce uncertainty estimates as part of their fitting process, giving us valid CIs[4] as long as we can come up with reasonable prior distributions for our parameters.

Loss

As covered in part 1, the original ECI model finds the parameters that minimise the following loss function:[5]

where

This is a frequentist approach to statistical modelling, finding the single set of parameter values that best fits the data given the specified model.

However, a penalised regression such as the one above can equivalently be viewed[6] as finding the posterior mode of a corresponding Bayesian model (the ‘maximum a posteriori’ (MAP) estimator).

Considering this equivalent Bayesian model has a number of benefits. Firstly, as discussed in the previous section, Bayesian models can generate principled uncertainty estimates without relying on bootstrapping. Secondly it makes it natural to consider model extensions (like allowing different benchmarks to have different noise levels) that would be awkward to motivate as changes to a loss function but are straightforward as modelling choices. I explore this in the next section.
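As a generic illustration of this equivalence (not Epoch's exact loss; the symbols f_θ, σ, τ and λ are introduced here purely for illustration), take squared error with an L2 penalty, and compare it with the negative log posterior of a model with normal errors and a zero-mean normal prior on each parameter:

```latex
% Penalised least squares
\hat{\theta} = \arg\min_{\theta} \; \sum_{i} \bigl(y_i - f_\theta(x_i)\bigr)^2 + \lambda \sum_{j} \theta_j^2

% Bayesian model: y_i \sim \mathcal{N}(f_\theta(x_i), \sigma^2), \quad \theta_j \sim \mathcal{N}(0, \tau^2)
-\log p(\theta \mid y)
  = \frac{1}{2\sigma^2} \sum_{i} \bigl(y_i - f_\theta(x_i)\bigr)^2
  + \frac{1}{2\tau^2} \sum_{j} \theta_j^2 + \text{const}

% Multiplying by 2\sigma^2 leaves the minimiser unchanged, so the MAP estimate
% equals the penalised least squares solution with \lambda = \sigma^2 / \tau^2.
```

Under these assumptions a weak penalty (small λ) corresponds to a wide prior (large τ relative to σ), and σ only rescales the objective without moving its minimum.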

Bayesian Models

Base

As discussed above we can directly convert Epoch’s frequentist model into a Bayesian model with equivalent loss, obtaining the likelihood:

With priors on the parameters:

Note the sigma parameter here is not present in the original frequentist model,[7] but it is required for there to be a sensible interpretation of the model predictions when doing the goodness of fit checks covered later in this post.[8]

The model is otherwise kept as similar as possible to the frequentist model, with WinoGrande being used to set the scale by having its discrimination parameter fixed to 1, and the constraints that 𝛼 ∈ [0.1,10] and C, D ∈ [-10,10] still applied.

Improved Normal

It’s apparent from inspecting the Base Bayesian model that it (and thus also the frequentist model) assumes the amount of noise in the benchmark scores is constant across all LLMs and benchmarks, but this doesn’t seem very likely to be true. So in this model we relax the assumption by allowing different benchmarks to have different variances, meaning that some are expected to be noisier than others.

The Base model also models the actual scores as being normally distributed, despite the fact that they can only fall into [0,1]. We can address this by truncating the normal distribution to only allow outputs between 0 and 1 (and scaling the rest of the density accordingly).[9]
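One consequence of truncation is that the mean of the truncated distribution no longer equals its location parameter once that parameter sits near a boundary. The sketch below uses the standard closed-form mean of a truncated normal; the parameter values are hypothetical and the function names are illustrative only.

```python
import math

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def truncated_normal_mean(mu, sigma, lower=0.0, upper=1.0):
    """Mean of a Normal(mu, sigma) truncated to [lower, upper]."""
    a = (lower - mu) / sigma
    b = (upper - mu) / sigma
    mass = normal_cdf(b) - normal_cdf(a)  # probability kept after truncation
    return mu + sigma * (normal_pdf(a) - normal_pdf(b)) / mass

# Hypothetical location parameters with the same scale.
centre_mean = truncated_normal_mean(0.5, 0.1)
edge_mean = truncated_normal_mean(0.95, 0.1)
print(f"location 0.50 -> mean {centre_mean:.3f}")
print(f"location 0.95 -> mean {edge_mean:.3f}")
```

In the middle of the range the truncation is negligible, but near 1 the mean is pulled back towards the interior, since the part of the density above 1 has been cut away.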

The likelihood looks very similar, just with a different variance parameter that is now allowed to vary by benchmark:

When making this model I also took the opportunity to make various other minor changes[10] that I think are natural for Bayesian modelling, such as removing the constraints on the possible parameter values and changing the priors to a more flexible setup:

And instead of setting the scale by anchoring WinoGrande to have alpha = 1, I fix the average benchmark difficulty to 0 and slope to 1,[11] which I think is cleaner than fixing a specific benchmark:

Improved Beta

Another natural question to consider is whether the assumption of normally distributed noise/errors is correct (even after allowing for different variances for different benchmarks as above).

As an intuition for why this might not be appropriate, under the models with normal errors we penalise a predicted score of 55% when the true score is 50% just as much as a predicted score of 95% when the true score is 90%,[12] even though 95% might seem intuitively further away from 90% than 55% is from 50% in a sense that matters.[13]

One way to deal with this is to instead assume the score follows a beta distribution with expected value equal to the predicted score, but which penalises errors more heavily as the predicted score gets close to 0 or 1.

This just involves replacing the likelihood above with:

while leaving all priors the same (with sigma_b still being allowed to vary across different benchmarks). All this does is give us a distribution where:

So, as desired, the expected value is always equal to the predicted score, while the variance (noise) is at its maximum at a predicted score of 0.5 and decreases as the predicted score moves closer to 0 or 1.
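A minimal sketch of this behaviour, using the common mean-precision parametrisation of the beta distribution. This is an assumption for illustration: the exact way sigma_b enters the likelihood in the model above may differ, and the `precision` value here is hypothetical.

```python
def beta_shape_parameters(mean, precision):
    """Shape parameters of a beta distribution with the given mean."""
    return mean * precision, (1.0 - mean) * precision

def beta_mean_and_variance(alpha, beta):
    """Standard formulas for the mean and variance of Beta(alpha, beta)."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1.0))
    return mean, variance

precision = 50.0  # hypothetical; larger values mean less noise
variances = {}
for predicted in (0.1, 0.5, 0.9):
    alpha, beta = beta_shape_parameters(predicted, precision)
    mean, variance = beta_mean_and_variance(alpha, beta)
    variances[predicted] = variance
    print(f"predicted {predicted:.1f}: mean {mean:.2f}, sd {variance ** 0.5:.3f}")
```

In this parametrisation the variance works out to predicted * (1 - predicted) / (precision + 1), so it peaks at 0.5 and shrinks symmetrically towards 0 and 1.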

Implementation

I wrote Stan code (specialised software for fitting Bayesian models) to implement all three of the Bayesian models discussed above, generating posterior samples that we can use for CIs for the parameters and any derived results.

All models fit well with no errors; see the appendix for more details on the fitting parameters and convergence statistics.

Results

We see that the SOTA ECI results for the Bayesian models are generally very similar to the original Epoch model, but with some deviations and typically smaller confidence intervals.

We can also look in particular at which LLM the different models think is best by comparing to Gemini 3 Pro (which Epoch finds strongest currently):

The two improved Bayesian models actually find GPT-5.2 stronger than Gemini 3 Pro, with the Improved Normal model finding the difference to be statistically significant at p = 0.05, but in both cases it is only a change from being 1 ECI point lower to 1 ECI point higher.

The models also have similar ECI/year trends, although the Beta model is around 10% faster than the rest (but this is within the noise of the different estimates). We will do a deeper dive into the ECI/year trends in the next post.

See full results (ECI values for every LLM and difficulty and discrimination values for every benchmark, for every model) in the appendix.

Model Comparison

We have seen the results for all the models, and I have gestured towards some theoretical considerations which might favour the Bayesian models, but it remains unclear if they should in fact be preferred. In this section we will look at various criteria on which we can compare the models, to see which fit the data best and are most predictive.

Goodness of Fit

‘Goodness of fit’ tests are one way of assessing how well a model fits the data. The general idea is picking a (hopefully quite general) feature of the data, and comparing it to how we would expect it to behave if the model is correct.

We can start with some simple examples of this, comparing QQ plots:

Here the frequentist model[14] and the Base Bayesian model do notably worse than the improved models, with the tails coming substantially away from the main fit. Both improved models still have some issues, with the Improved Beta having modest deviation in the mid-low and mid-high sections,[15] and the Improved Normal struggling mostly at the upper tail.

Also see the appendix for a full set of predicted vs actual plots for each model broken down by benchmark.

I also looked at the ‘Posterior Predictive P-values’ (PPP) by comparing the size of the squared Pearson residuals for the actual data to those simulated from the posterior predictive distribution of each model. Here p values correspond to the chance of the models producing data more extreme than the data we actually saw, and values close to 0.5 are ideal, with values close to 0 or 1 being concerning:

Frequentist: Method not applicable, but fit should be similar to Base Bayesian model
Base: p = 0.03 (pretty concerning)
Improved Normal: p = 0.062 (somewhat concerning)
Improved Beta: p = 0.399 (good)
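As a rough sketch of how a posterior predictive p-value of this kind can be computed, assuming a model with normal errors: for each posterior draw, simulate a replicated dataset, then compare the summed squared Pearson residuals of the replicated and observed data. The data, the draws and the function names below are all hypothetical, and this is not the code used for the results above.

```python
import random

def squared_pearson_discrepancy(values, predicted, sigma):
    """Sum of squared standardised residuals."""
    return sum(((y - mu) / sigma) ** 2 for y, mu in zip(values, predicted))

def posterior_predictive_p_value(observed, posterior_draws, seed=0):
    """posterior_draws: list of (predicted_means, sigma), one per posterior draw."""
    rng = random.Random(seed)
    exceed = 0
    for predicted, sigma in posterior_draws:
        # Simulate data the model would generate under this draw.
        replicated = [rng.gauss(mu, sigma) for mu in predicted]
        t_rep = squared_pearson_discrepancy(replicated, predicted, sigma)
        t_obs = squared_pearson_discrepancy(observed, predicted, sigma)
        if t_rep >= t_obs:
            exceed += 1
    return exceed / len(posterior_draws)

# Hypothetical case: the observed data are much noisier than the model expects.
predicted_means = [0.2, 0.4, 0.6, 0.8] * 5
draws = [(predicted_means, 0.05) for _ in range(500)]
noisy_observed = [mu + 0.25 for mu in predicted_means]  # five sigma away
p_value = posterior_predictive_p_value(noisy_observed, draws)
print(f"posterior predictive p-value: {p_value:.3f}")
```

Here the model badly underpredicts the noise, so almost no replicated dataset is as extreme as the observed one and the p-value lands near 0; a model that overpredicted the noise would instead give a value near 1.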

We can also break these down per benchmark (where ideally each benchmark would also be close to p=0.5):

This makes it clear how poorly the assumption of constant variance across benchmarks fits the data, since for most benchmarks the Base model expects either more or less noise than we saw in the data (corresponding to p values close to 0 or 1). The improved models both stay much closer to 0.5 across the set of benchmarks. The Improved Normal model underpredicts noise on average, whereas the Improved Beta has a mix of under and overpredicting.

Cross Validation

An alternative approach to assess the models is to see how well they fit on new data that wasn’t included when they were being trained. Even without new data we can simulate this by using leave one out cross validation (LOO CV).[16]

It isn’t immediately clear what the correct measure of error to use for this is, so I look at the squared error (which Epoch’s model is trained to minimise), the mean absolute error, and a scaled version of the squared error that penalises errors more strongly when they are closer to 0 or 1:[17]

We see that the improved Bayesian models do best, each winning on one measure and both essentially drawing on the scaled RMSE, and both beating the base and replication models on most measures, although the frequentist model slightly beats the Improved Beta on MAE.
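For reference, the three error measures can be sketched as follows, using the scaling factor given in the footnote, min(10, 2/sqrt(p(1-p))) with p the observed score. The example scores are hypothetical, and treating an observed score of exactly 0 or 1 as hitting the cap of 10 is an assumption.

```python
import math

def scale_factor(actual):
    """Residual multiplier that grows as the observed score nears 0 or 1."""
    spread = actual * (1.0 - actual)
    if spread <= 0.0:
        return 10.0  # assumed: the cap applies at exactly 0 or 1
    return min(10.0, 2.0 / math.sqrt(spread))

def error_metrics(actual, predicted):
    """Return (RMSE, MAE, scaled RMSE) for held-out predictions."""
    residuals = [p - a for a, p in zip(actual, predicted)]
    n = len(residuals)
    rmse = math.sqrt(sum(r * r for r in residuals) / n)
    mae = sum(abs(r) for r in residuals) / n
    scaled = [r * scale_factor(a) for r, a in zip(residuals, actual)]
    scaled_rmse = math.sqrt(sum(r * r for r in scaled) / n)
    return rmse, mae, scaled_rmse

# Hypothetical held-out scores: both predictions are off by 5 points.
actual_scores = [0.50, 0.90]
predicted_scores = [0.55, 0.95]
rmse, mae, scaled_rmse = error_metrics(actual_scores, predicted_scores)
print(f"RMSE {rmse:.3f}, MAE {mae:.3f}, scaled RMSE {scaled_rmse:.3f}")
```

Both errors count equally towards RMSE and MAE, but the scaled version weights the miss at 90% more heavily than the miss at 50%.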

Another measure (only available for the Bayesian models) is the Expected Log Predictive Density (ELPD),[18] which is less interpretable but has the advantage of allowing us to compute confidence intervals for the size of the difference between models:

Here the Improved Beta model does best, with the 95% CIs for the other models not overlapping zero.

Improved Beta Seems Best

Since it has by far the best posterior predictive p value, and wins on most of the cross validated error metrics I conclude the Improved Beta Bayesian model is the best fit for the data, and will use it as the basis for the analysis in post 3 (upcoming).

The Improved Normal model also has a good showing however, and I suspect the ‘true’ distribution of the data is somewhere in between, with benchmark noise declining close to 0/1 but not quite in the manner the beta model assumes.[19]

I recommend Epoch move to this kind of model (or some further refinement) for the ECI calculations, although it is worth noting that the differences between the ECI results from the different models are usually only 1-2 ECI points (although there can be more disagreement for the very weak models, see the appendix), and there is heavy overlap of the CIs in every case.

Conclusion

Epoch’s ECI relies on bootstrapping for its confidence intervals, but the hierarchical structure and low number of benchmark results per LLM mean this has unclear theoretical support. Moving to Bayesian models sidesteps this entirely, producing uncertainty estimates as a natural part of fitting the model.

I investigated three Bayesian alternatives to Epoch’s model: first directly translating Epoch’s model to a Bayesian framework, second allowing benchmarks to have different amounts of noise in their results (‘Improved Normal’) and making other minor improvements, and third assuming less noise when scores are close to 0% or 100% (the ‘Improved Beta’ model). This last model performs best overall in terms of cross-validated error and the goodness of fit checks I performed.

The resulting ECI values are generally very similar to Epoch’s, but with some differences; most notably narrower CIs and a (small) change in the top ranked LLM, with both improved models giving GPT 5.2 a higher ECI than Gemini 3 Pro. I recommend Epoch consider adopting this kind of model, although the practical differences are usually small.

In part three (upcoming) I will use the ‘Improved Beta’ model to explore further extensions of the ECI model, and also look at the relative importance of different benchmarks and whether the trend in ECI improvements has been speeding up.

Appendix

Bayesian Model Convergence

I ran each model on 4 chains, each with 4000 warmup and sampling iterations. adapt_delta was set to 0.95 and max_treedepth to 12. Each model takes around 10 minutes to fit.

All models mixed well, with no divergent transitions or max treedepth hits. The largest R-hat value was 1.01, the smallest E-BFMI was 0.4 and the smallest ESS (tail and bulk) was 183, although that was only in the Base model, and the other two have >400.

Replication code is available here.

Full LLM Results

Full Benchmark Results

Predicted vs Actual Scatterplots

Frequentist

Base (Bayesian)

Improved Normal (Bayesian)

Improved Beta (Bayesian)

[1] The smallest benchmark has 10 results, with an average of 33.7.

[2] We also need there to be large enough numbers of benchmarks and LLMs themselves, although this is likely satisfied with the 37 benchmarks and ~150 LLMs.

[3] As discussed in the first post, Epoch actually use a non-hierarchical bootstrap setup, but I think this is doubly inappropriate: a hierarchical bootstrap would be more valid (due to the hierarchical structure of the data) but still not well theoretically justified (because of the low number of data points per LLM).

[4] Technically Bayesian models produce Credible Intervals as opposed to frequentist Confidence Intervals. While extremely long debates are possible about the merits and interpretations of both, for the purposes of these posts I will use and describe both interchangeably as ‘CIs’.

[5] For more details see the first post.

[6] This post gives more details on how to think about the equivalence.

[7] It just controls the overall scale of the loss, which means it isn’t required for finding the parameter values that minimise it in the original penalised regression.

[8] I gave it a uniform prior to minimise the amount that it would contribute to the loss when fitting the model, to keep things as close to the frequentist model as possible.

[9] This means that the true mean and variance of the distribution do not match the distribution’s parameters, but since it seems like the model should mostly be able to learn around this I took no steps to address it, and leave it as an extension for further work. When calculating the Pearson residuals I use the true mean and variance of the truncated distribution.

[10] Ideally I would consider the impact of each change in isolation, but as I believe these adjustments are individually well-motivated, and in the interest of time, I will just combine them.

[11] On the untransformed scale, before we convert Claude Sonnet 3.5 to 130 and GPT 5 to 150.

[12] Since in any model with normally distributed errors (or in Epoch’s model) the loss is a function of just (predicted - actual), which is 0.05 in both cases.

[13] One way to think about this would be considering the relative error instead of the absolute error.

[14] To standardise the residuals for the frequentist model I used their observed standard deviation.

[15] The beta QQ plot here uses an alternative method (PIT vs uniform) to the direct quantile comparisons in the other two plots, so its shape isn’t directly comparable, but it remains the case that a perfect fit would be a straight line on y=x.

[16] For the replication model I compute the LOO CV directly, but as the Bayesian models take longer to fit I instead use the PSIS-LOO approximation from the loo R package. On each Bayesian model ~50/1248 samples have high Pareto-k values, which indicate they might have biased results, but in the interests of time I did not investigate or address this.

[17] This is constructed by scaling each residual by min(10, 2/sqrt(p(1-p))) where p is the actual value observed.

[18] Also calculated using PSIS-LOO via the loo package.

[19] One could explore models to try and capture this behaviour, but in the interests of time I do not do this here.

Kicking the Tires of the Epoch Capabilities Index (ECI) Part 1: Introduction and Replication

Alexander Barry — Wed, 11 Feb 2026 18:19:06 GMT

I’m Alexander Barry, an independent statistical consultant, and this is the first post on my Substack. I expect to post here occasionally when I have thoughts or work that I think would be interesting to share. For more information or to enquire about hiring me for a project see my website abstats.co.uk

Introduction

In December 2025 Epoch AI launched their Epoch Capabilities Index (ECI). This seeks to combine measures of LLM performance from many benchmarks together to create a unified scale that allows comparisons between LLMs, built on the paper A Rosetta Stone for AI Benchmarks they wrote in conjunction with DeepMind.

The approach they use is inspired by Item Response Theory, an area that was originally developed for use in human testing but has recently been seeing use in LLM evaluations, perhaps most prominently in METR’s Time Horizon work.

I have been working with METR on a statistical model for time horizon calculations, and so have been thinking a lot about this area. With this background when I saw the ECI launch and read the accompanying paper I thought it would be interesting to see if I could apply some of my knowledge to the ECI.

This is part one of a three part series of posts about the ECI:

In part 1 (this post) I describe the process of directly replicating the ECI, highlighting some details of how it is constructed that are either not obvious or differ from the information on Epoch’s website or the paper.
In part 2 I develop various alternative statistical models for constructing the ECI, and look at how these impact the results.
In part 3 (upcoming) I look at derived results from the ECI, such as whether the trend of increasing ECI over time is accelerating, the relative importance of the different benchmarks, and the impact of adding additional dimensions of ability.

These posts will go into relatively high amounts of technical detail about the statistical models involved, so if that is not of interest I suggest liberal skimming. I will summarise the important takeaways in the conclusion of each post.

Epoch’s Methodology

To replicate the ECI I rely on three main sources of information released by Epoch:

The “A Rosetta stone for AI Benchmarks” paper they published with DeepMind, and the replication code they published alongside the paper.
The information on the ECI section of the Epoch website, and the public GitHub repo they released for the ECI.[1]
The data Epoch release on their website, accessed on 2026/02/04

Epoch’s Data

The list of benchmarks used to calculate the ‘live’ ECI results given on Epoch’s website is outdated,[2] but fortunately (and to their credit) they provide the full dataset, so it is possible to reconstruct from this which benchmarks are used.[3] For a full list see the appendix.

They process the data by:[4]

Removing any LLMs released prior to 2023/01/01 and any benchmarks outside the 37 selected.
Combining together the results from any LLMs they consider to have the same base model, taking the maximum result whenever there are multiple results for the same benchmark.
1. As far as I can tell they don’t have any public list of which LLMs they think should be combined in this way, but in one of the spreadsheets they include both “Model name” and “Model version” columns, with the former being the criterion they use for combining LLMs.
  
  Note that the way they combine models conflicts with their claim[5] that they only aggregate together models with the same release date, as 24/144 of the sets of aggregated LLMs (same ‘Model’ column and ECI values in Epoch’s data) contain LLMs with different release dates, and 12 of them have release dates that differ by over 30 days. An example of this is that DeepSeek V3 (released Dec 2024) and DeepSeek V3-0324 (released March 2025) seem to be aggregated together and treated as a single model launched in March.
  
  When LLMs with different release dates are aggregated together in this way, the release date that Epoch assign to the resulting aggregated LLM seems to be arbitrary,[6] which can cause problems with evaluating the trend in ECI over time. For example, Gemini 1.5 Pro is given release date 2024-05-24, despite all of its benchmark results that contribute to the ECI coming from the 002 release launched on 2024-09-24. This is especially problematic as it results in Gemini 1.5 Pro appearing to be SOTA on release (given as 2024-05-24), but this is purely an artifact of its incorrectly assigned release date.
  
  Epoch have confirmed this is a mistake in how the LLMs are aggregated and they will correct it in an update. Since the goal of this post is replication I will proceed with the same (flawed) approach as Epoch here, but will explore changing this in future posts.
Removing any (aggregated) LLMs with <4 benchmark results (from the set of 37 benchmarks)
Linearly rescaling results on any benchmarks on which guessing is possible so that guessing would (on average) correspond to a score of zero.[7] If this results in any scores less than zero these are replaced with zero.
1. The list of which benchmarks Epoch considers to allow guessing (and the probability of correctly guessing) is not given on their website or in the paper, but is in the ECI GitHub repo. It matches intuitions, with n-option multiple choice questions being given 1/n guessing rates.[8]
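The aggregation, filtering and rescaling steps above can be sketched roughly as follows. This is not Epoch's code or the replication code: the function names and rows are hypothetical, and the rescaling assumes the linear map that sends the guess rate to 0 while leaving a perfect score at 1.

```python
def aggregate_max(results):
    """results: iterable of (model_name, benchmark, score); keep the max per pair."""
    best = {}
    for model_name, benchmark, score in results:
        key = (model_name, benchmark)
        best[key] = max(score, best.get(key, score))
    return best

def drop_sparse_models(best, min_results=4):
    """Remove aggregated LLMs with fewer than min_results benchmark results."""
    counts = {}
    for model_name, _ in best:
        counts[model_name] = counts.get(model_name, 0) + 1
    return {key: score for key, score in best.items()
            if counts[key[0]] >= min_results}

def rescale_for_guessing(score, guess_rate):
    """Map guess_rate -> 0 and 1 -> 1 linearly, replacing negatives with 0."""
    rescaled = (score - guess_rate) / (1.0 - guess_rate)
    return max(0.0, rescaled)

# Hypothetical rows: two versions of one base model on the same benchmark.
rows = [
    ("model-a", "bench-1", 0.60),
    ("model-a", "bench-1", 0.70),
    ("model-a", "bench-2", 0.20),
]
best = aggregate_max(rows)
# A four-option multiple choice benchmark has a guess rate of 1/4.
chance_adjusted = rescale_for_guessing(best[("model-a", "bench-2")], guess_rate=0.25)
print(best[("model-a", "bench-1")], chance_adjusted)
```

In this toy example the model scores below chance on the multiple choice benchmark, so its adjusted score is floored at zero, and with only two benchmark results it would then be removed by the minimum-results filter.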

Once these steps are complete we are left with a dataset of 144 LLMs’ performance on 37 benchmarks, with 1248 total benchmark scores (an average of 8.7 benchmark results per LLM).[9]

Epoch’s Model

Epoch predict the performance (from 0 to 100%) of LLM m on benchmark b as:

Here 𝛼_b and D_b are parameters reflecting the benchmark’s discrimination and difficulty respectively and C_m is the LLM’s ability. When C_m = D_b the LLM will be predicted to score exactly 50% on the benchmark, and 𝛼_b controls how quickly this changes as C_m gets smaller or larger than D_b.
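A minimal sketch of a prediction function with these properties, assuming the usual Item Response Theory form of a logistic curve in 𝛼_b (C_m - D_b). The logistic form is an assumption here, chosen because it matches the behaviour described, and the parameter values are hypothetical.

```python
import math

def predicted_score(ability, difficulty, discrimination):
    """Predicted benchmark score in (0, 1) for one LLM on one benchmark."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

# When ability equals difficulty the prediction is exactly 50%.
at_difficulty = predicted_score(ability=1.0, difficulty=1.0, discrimination=2.0)

# A higher discrimination makes the score change faster around that point.
gentle = predicted_score(ability=1.5, difficulty=1.0, discrimination=1.0)
steep = predicted_score(ability=1.5, difficulty=1.0, discrimination=4.0)
print(at_difficulty, round(gentle, 3), round(steep, 3))
```

The same half-point ability advantage gives a much higher predicted score on the steeper benchmark, which is what makes high-discrimination benchmarks informative about LLMs near their difficulty level.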

To fix the scale of the model they set the discrimination parameter of the WinoGrande benchmark to 𝛼 = 1.[10] The model is then fit by finding (via optimisation) the parameter values that minimise the following loss:

where N is the total number of datapoints (1248), M is the number of LLMs (144) and B the number of benchmarks (37), with the constraints that 𝛼 ∈ [0.1,10] and C, D ∈ [-10,10].[11]

This corresponds to minimising the squared prediction error, weighting all models and benchmarks equally, plus very weak L2 regularisation where the parameters are shrunk towards zero, but scaled so that the total penalty from all the parameters is weighted only 1/10th as much as a prediction error on a single data point. Note that dividing by the number of parameters in the regularisation term is not the conventional approach, and results in much weaker regularisation than you would get from the conventional approach with the same nominal regularisation strength.[12]

After the model is fit the results are linearly transformed so that Claude Sonnet 3.5 has ECI 130 and GPT 5 has ECI 150.
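This final step is just the unique linear map that passes through the two anchor points. A small sketch, with made-up raw ability values for the two anchor LLMs and a hypothetical function name:

```python
def make_eci_transform(raw_low, raw_high, eci_low=130.0, eci_high=150.0):
    """Linear map sending raw_low -> eci_low and raw_high -> eci_high."""
    scale = (eci_high - eci_low) / (raw_high - raw_low)
    offset = eci_low - scale * raw_low
    return lambda raw: offset + scale * raw

# Hypothetical raw abilities for Claude Sonnet 3.5 and GPT 5.
raw_sonnet_35 = 0.5
raw_gpt_5 = 2.5
to_eci = make_eci_transform(raw_sonnet_35, raw_gpt_5)

print(to_eci(raw_sonnet_35), to_eci(raw_gpt_5), to_eci(0.0))
```

The value that a raw ability of zero maps to depends on the fitted anchor abilities, so the number printed for it here reflects only these made-up inputs.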

Epoch construct confidence intervals for the ECI results by bootstrapping, taking a non-hierarchical approach of resampling from the entire dataset with replacement without e.g. taking into account which LLMs the datapoints correspond to. They use 100 bootstrap samples, which is quite low.[13]

The non-hierarchical approach means that it is possible for LLMs/benchmarks to be totally dropped from individual bootstrap samples. If this occurs then due to the penalisation any alpha values will be shrunk to 0.1 (the lower bound), and C/D values shrunk to zero on the raw scale, which corresponds to an ECI value of ~124.8. This occurs in ~1.8% of bootstrap samples for LLMs with the minimum 4 benchmark results, and so can potentially shift their 90% CIs somewhat.[14]
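The ~1.8% figure is consistent with a simple calculation, assuming each of the 1248 draws in a bootstrap sample is uniform over the dataset and independent of the others: an LLM is dropped when none of its datapoints is ever drawn. The function name below is hypothetical.

```python
def probability_never_sampled(n_results, n_total):
    """Chance that none of an LLM's n_results datapoints appears in a
    resample of n_total draws taken with replacement."""
    return (1.0 - n_results / n_total) ** n_total

p_dropped = probability_never_sampled(4, 1248)
p_average = probability_never_sampled(9, 1248)  # roughly the average LLM
print(f"4 results: {p_dropped:.4f}, 9 results: {p_average:.6f}")
```

The probability is close to exp(-4) for an LLM with 4 results and falls off quickly as the number of results grows, so the effect is concentrated on the LLMs with the fewest benchmark results.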

Replication Setup

To try and replicate the ECI I wrote R code to implement the data processing and model fitting process described above as closely as possible, starting with the raw data available on Epoch’s website, as accessed on 2026/02/04.

I used 10,000 bootstrap samples, and for speed wrote custom C++ code to evaluate the loss function given above when finding the optimal parameters. With this it takes about 1 minute to run all the replication code.

My full replication code is available here.[15]

Replication Results

My replication results match Epoch’s results very closely, both for the LLM ECI results and for the benchmark parameters:

Looking only at the LLMs Epoch considers SOTA on launch[16] and comparing the confidence intervals:

The values generally match very closely, albeit with slightly different confidence intervals, which I believe is caused by random variation due to Epoch only using 100 bootstrap samples.

For the full results (ECI results for every LLM and difficulty and discrimination parameters for every benchmark) see the appendix.

Conclusion

Due to Epoch releasing all their data and code publicly I was able to replicate their ECI results very closely, and I think they deserve praise for this transparency.

However there are a few instances where the process used to construct the ECI does not match their public description:

The list of benchmarks included on their website is out of date, with two changes since it was published, and so is the number of LLMs included.
The way that results from different versions of the same ‘base’ LLMs are aggregated sometimes results in LLMs with very different release dates being combined together, despite them saying only LLMs with the same release date are combined. This in particular causes issues as the release date they assign to the resulting aggregated model is effectively arbitrary, which can result in cases such as Gemini 1.5 Pro being declared SOTA on launch on 2024-05-24 entirely due to benchmark results from the 002 update released on 2024-09-24.
The fact that Epoch only uses 100 bootstrap samples adds unnecessary noise to the confidence intervals in their ECI estimates, and their non-hierarchical approach will slightly bias the confidence intervals for LLMs with small numbers of benchmark results towards an ECI of ~125.
The level of regularisation used in the regression is much weaker than stated in the paper (by a factor of ~200) due to an unconventional loss setup, and various other details (restricted parameter ranges, and the exact bootstrap setup) are not publicly stated anywhere.

While the latter two are technical details, the first two seem at least potentially relevant to public understanding of ECI results. I sent this post to Epoch for pre-publication review and they confirmed both issues exist, and will be fixed in future updates.

In part 2 I present various ways I attempt to improve on this model, and how they impact the ECI results.

Appendix

List of included benchmarks

List of aggregated LLMs

Full Results

LLM Results

Benchmark Results

The ECI repo isn’t linked anywhere on the Epoch website, but they pointed me to it in private communication. Currently the public code isn’t used directly to calculate the ECI, but they say it is representative and they plan to switch over to it in the future.

They list ‘LiveBench’ and ‘SuperGLUE’ which seem to have been replaced with ‘Chess Puzzles’ and ‘The Agent Company’. Epoch confirm this is out of date and they plan to update it soon.

I did this by checking which benchmarks are given difficulty and discrimination parameters in the “eci_benchmark_difficulties_and_slopes.csv” spreadsheet that is included in the data.

They mention the existence of all of these processing steps in their description of the ECI, although don’t always give all the details.

The claim is made in the ECI Data section.

I believe the release date Epoch list on their website is determined simply by whatever the first row in the “epoch_capabilities_index.csv” spreadsheet happens to be, and as far as I can tell this ordering is arbitrary (e.g. it is not consistently the first or the last date associated with a given model, or the date from which most of the benchmark results are drawn).

E.g. multiple choice questions with 4 options would be transformed by f(x) = (x - 0.25)/0.75 so that 25% → 0% and 100% → 100%.

This includes “OTIS Mock AIME 2024-2025” whose answers are integers from 0 to 999, and thus has a 1/1000 guessing rate.
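
The chance adjustment described in the two notes above can be sketched in Python as follows. This generalises the 4-option formula to an arbitrary guessing rate g as f(x) = (x - g)/(1 - g), which is an assumption on my part about the general form; the function name is hypothetical, and no clipping of below-chance scores is applied here.

```python
def chance_adjust(score: float, guess_rate: float) -> float:
    """Rescale a raw accuracy so that guess_rate maps to 0 and 1.0 maps to 1.0."""
    return (score - guess_rate) / (1 - guess_rate)


# Four-option multiple choice: chance level is 0.25.
print(chance_adjust(0.25, 0.25))  # at chance level
print(chance_adjust(1.00, 0.25))  # perfect score

# Integer answers from 0 to 999: chance level is 1/1000, so the
# adjustment barely changes the raw score.
print(chance_adjust(0.50, 1 / 1000))
```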

Note this also does not match the LLM and datapoint figures on Epoch’s website which Epoch confirm are out of date and have not been updated as new LLMs have been added.

Technically WinoGrande’s difficulty parameter D is also set to zero, but this is done after the model is fit by a simple shift to all C and D parameters, so it does not have any effects given the results are then transformed again to the ECI scale. In particular the penalisation and range restrictions are applied to the C and D values before they are shifted to have WinoGrande’s D=0.

Note this is not discussed in the paper, and is only apparent by inspecting the replication code. Since WinoGrande is fixed to have an 𝛼 value of 1 the 𝛼 restriction corresponds to an assumption that no benchmark is >10x more/less discriminative than WinoGrande. The limits on C and D correspond (in the current fit) to assuming no LLM or benchmark has an ability/difficulty level below -95.6 or above 345.1 on the ECI scale.

This detail is not included in the paper, which states a penalisation strength of 0.1, but is apparent from inspecting the replication code. The setup used instead corresponds to a conventional penalisation strength of lambda = ~0.0005.

This is not covered anywhere on the website (or in the paper, which does not use the bootstrapped confidence intervals) but it is apparent from inspecting the code in the ECI GitHub repo.

In the case where these shrunk values would otherwise be entirely outside the 90% CI it would shift it to effectively be e.g. (7%, 97%) instead of (5%, 95%).

Who replicates the replicators?

Starting with ‘GPT-4 (March 2024)’ as Epoch also do when presenting their ECI ‘frontier trend’.