ImageBench

Estimated Preference Score

ImageBench V1 now publishes three per-model scores. The original pass-rate is now called Capability — it measures whether a model did what the prompt asked. Alongside it we add Estimated Preference Score (EPS), an aesthetic human-preference axis derived from the HPSv3 reward model. The headline Overall score is the average of the two.

Two complementary axes

Capability answers a question with a right and wrong answer: if the prompt asks for “three red balls on the left and two blue cubes on the right,” either the image has that arrangement or it does not. That is exactly what the routed multi-VLM PASS/FAIL judge measures over all 192 images.

But two models can both satisfy a prompt and still differ enormously in how good the result looks. Capability is blind to that. EPS is the second axis: it estimates how much a human would prefer the image, scored separately from the PASS/FAIL verdict on whether the prompt was followed. A model can be highly capable and aesthetically plain, or beautiful and unreliable. The two scores separate those cases instead of averaging them away inside a single number.

Where the signal comes from: HPSv3

EPS is built on HPSv3, a reward model trained on the HPDv3 human-preference dataset with a Qwen2-VL-7B backbone. HPSv3 is a Bradley-Terry preference model: given a prompt and an image, it emits a single scalar mu, a preference logit. The model also reports an uncertainty term, hps_sigma, which EPS does not use.

The important property of a Bradley-Terry model is that only differences of mu are meaningful — the absolute offset is arbitrary. For two images generated from the same prompt, the probability a human prefers image A over image B is

P(A preferred over B) = sigmoid(mu_A − mu_B)

The scale is in logits: a gap of 1 corresponds to roughly a 73% preference (sigmoid(1) ≈ 0.731), and a gap of 0 to an even 50%. A raw mu on its own tells you nothing; only its difference from a well-chosen reference is interpretable.
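To make the "only differences matter" property concrete, here is a minimal Python sketch of the preference formula above. The function names are ours, chosen for illustration; they are not part of HPSv3 or the ImageBench codebase.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def preference_probability(mu_a: float, mu_b: float) -> float:
    """Bradley-Terry probability that image A is preferred over image B.

    Only the difference mu_a - mu_b enters the formula.
    """
    return sigmoid(mu_a - mu_b)


# A gap of 1 logit gives roughly a 73% preference.
p_gap_one = preference_probability(1.0, 0.0)

# Shifting both logits by the same offset leaves the probability unchanged.
p_shifted = preference_probability(11.0, 10.0)

# Equal logits mean an even split.
p_equal = preference_probability(3.2, 3.2)

print(f"{p_gap_one:.3f} {p_shifted:.3f} {p_equal:.3f}")
```

The second call is the point of the exercise: an image pair at (11, 10) is indistinguishable from a pair at (1, 0), which is why a raw mu cannot be published as a score on its own.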

The frozen reference

To turn per-image logits into a published, comparable score we pin a reference for each prompt. For every prompt p we take the median of mu over a baseline field of 32 models and call it mu_ref(p). That reference is computed once, saved to public/bench/data/eps-reference-v1.json, and never recomputed. A model’s EPS is then its average win probability against that frozen reference:

EPS(m) = 100 × mean_p sigmoid( mu(m, p) − mu_ref(p) )

Read it plainly: EPS is the estimated probability that a human prefers this model's image over a frozen, median-quality reference for the same prompt, averaged over the eligible prompts and scaled to 0–100. An EPS of 50 means the model is, on average, as good as the July-2026 median model: it wins about half the time. Higher is more preferred; lower is less.
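The two steps, building the per-prompt reference and scoring a model against it, can be sketched as follows. This is an illustration under stated assumptions, not the ImageBench pipeline: the function names, the nested-dict data layout, and the toy logits are all hypothetical, and the real layout of eps-reference-v1.json is not shown here.

```python
import math
from statistics import median


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def build_reference(baseline_mu: dict[str, dict[str, float]]) -> dict[str, float]:
    """Per-prompt median of mu over a baseline field.

    baseline_mu maps model name -> {prompt id -> mu} (hypothetical layout).
    """
    prompts = set()
    for scores in baseline_mu.values():
        prompts.update(scores)
    return {
        p: median(s[p] for s in baseline_mu.values() if p in s)
        for p in prompts
    }


def eps(model_mu: dict[str, float], mu_ref: dict[str, float]) -> float:
    """100 x mean win probability against the per-prompt reference."""
    wins = [sigmoid(model_mu[p] - mu_ref[p]) for p in mu_ref if p in model_mu]
    return 100.0 * sum(wins) / len(wins)


# Toy field of three models and two prompts (made-up logits).
baseline = {
    "a": {"p1": 1.0, "p2": 4.0},
    "b": {"p1": 2.0, "p2": 5.0},
    "c": {"p1": 3.0, "p2": 6.0},
}
mu_ref = build_reference(baseline)

eps_a = eps(baseline["a"], mu_ref)  # one logit below the median on every prompt
eps_b = eps(baseline["b"], mu_ref)  # sits exactly on the median
eps_c = eps(baseline["c"], mu_ref)  # one logit above the median on every prompt
print(round(eps_a, 1), round(eps_b, 1), round(eps_c, 1))
```

In this toy field, the model that matches the median on every prompt scores exactly 50, and the models one logit above and below land symmetrically around it.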

Why freeze it: stability

Freezing mu_ref(p) is what makes EPS a stable, publishable number rather than a moving target. Because the reference never changes, adding a new model scores only that model — every existing EPS on the leaderboard stays exactly the same. If the reference were recomputed from the current field each time, every published score would drift whenever a new model arrived, and historical numbers would be meaningless. A frozen reference keeps the axis anchored to a fixed point in time.
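A small Python sketch shows the drift that freezing avoids. It uses a single prompt and made-up logits; the model names and numbers are illustrative assumptions only.

```python
import math
from statistics import median


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def eps_single_prompt(mu: float, mu_ref: float) -> float:
    """EPS restricted to one prompt, to keep the example small."""
    return 100.0 * sigmoid(mu - mu_ref)


# Hypothetical baseline field on one prompt.
field = {"model_a": 1.0, "model_b": 2.0, "model_c": 3.0}
frozen_ref = median(field.values())  # computed once, then never touched

before = {m: eps_single_prompt(mu, frozen_ref) for m, mu in field.items()}

# A strong new model arrives.
field_with_new = {**field, "model_new": 6.0}

# Frozen reference: existing scores are untouched.
after_frozen = {m: eps_single_prompt(field_with_new[m], frozen_ref) for m in field}

# Recomputed reference: the median moves, so every old score moves too.
moving_ref = median(field_with_new.values())
after_moving = {m: eps_single_prompt(field_with_new[m], moving_ref) for m in field}

print(before["model_b"], after_frozen["model_b"], after_moving["model_b"])
```

With the frozen reference, model_b keeps its score of 50. With a recomputed reference, the median shifts from 2.0 to 2.5 and model_b drops below 50 even though its images did not change.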

What EPS excludes, and why

HPSv3 was trained on HPDv3, a dataset of human aesthetic preferences. It has no reliable signal for whether a rendered word is spelled correctly or a chart is laid out sensibly — those are correctness judgements, not taste. Scoring them with a preference model would add noise, not information.

So EPS excludes the prompts where preference is the wrong tool: all Text Rendering prompts, plus Graphical design → Layout & Design and Graphical design → Data Visualisation. That is 9 of the 64 tags, leaving 55 tags — 165 of the 192 images — in the EPS basis. Capability still covers all 192: text rendering is a capability, it is simply one that an aesthetic reward model cannot judge.
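The exclusion rule and its arithmetic can be written down as a short Python sketch. The (category, subtag) representation and the function name are assumptions made for illustration, and the even split of three images per tag is inferred from the 192 and 64 totals rather than stated by the benchmark.

```python
# Exclusion rule as described above; the tuple representation is hypothetical.
EXCLUDED_CATEGORIES = {"Text Rendering"}
EXCLUDED_TAGS = {
    ("Graphical design", "Layout & Design"),
    ("Graphical design", "Data Visualisation"),
}


def is_eps_eligible(category: str, subtag: str) -> bool:
    """True if images under this tag count toward EPS."""
    if category in EXCLUDED_CATEGORIES:
        return False
    return (category, subtag) not in EXCLUDED_TAGS


# Counts taken from the text.
TOTAL_IMAGES = 192
TOTAL_TAGS = 64
EXCLUDED_TAG_COUNT = 9

# Assumption: images are spread evenly across tags.
images_per_tag = TOTAL_IMAGES // TOTAL_TAGS
excluded_images = EXCLUDED_TAG_COUNT * images_per_tag
eligible_tags = TOTAL_TAGS - EXCLUDED_TAG_COUNT
eligible_images = TOTAL_IMAGES - excluded_images

print(images_per_tag, excluded_images, eligible_tags, eligible_images)
```

Under that even-split assumption the numbers reconcile: three images per tag, 27 excluded images, and 165 images across 55 tags in the EPS basis.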

Combining into Overall

The new headline score, and the default sort on the leaderboard, is a simple equal-weight blend of the two axes:

Overall = 0.5 × Capability + 0.5 × EPS

Both components are on a 0–100 scale, so Overall is too. The 50/50 split is a deliberate statement that following the prompt and looking good matter equally — you can still sort by either axis alone to see where a model’s strength actually lies. See the V1 methodology post for how Capability itself is scored.
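As a quick illustration of the blend, here is a minimal Python sketch. The function name, the range check, and the example scores are hypothetical; the scores are not real leaderboard values.

```python
def overall(capability: float, eps: float) -> float:
    """Equal-weight blend of the two axes, both on a 0-100 scale."""
    for name, value in (("capability", capability), ("eps", eps)):
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} must be in [0, 100], got {value}")
    return 0.5 * capability + 0.5 * eps


# Two made-up models with opposite profiles.
precise_but_plain = overall(capability=80.0, eps=60.0)
pretty_but_unreliable = overall(capability=60.0, eps=80.0)

print(precise_but_plain, pretty_but_unreliable)
```

Both made-up models land on the same Overall of 70, which is why sorting by a single axis remains useful: the blend alone cannot tell these two profiles apart.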

What the two axes reveal

Splitting the scores apart is not academic — it changes the ranking in ways a single number would hide.

  • gpt-image-2 and nano-banana-2 lead on both axes — capable and preferred. These are the models that top the Overall board.
  • qwen-image scores high on aesthetics (EPS around 67) despite only mid capability. It makes attractive images that do not always do exactly what the prompt asked — a story a pass-rate alone would bury.
  • ideogram-v4 and the krea family rank high on Capability but low on EPS. Their strength is text and precision, or a distinctive look that HPDv3’s taste does not reward — exactly the case where the two axes should, and do, disagree.

Limitations

  • HPDv3 taste is not universal. EPS reflects the aesthetic preferences baked into one human-preference dataset. A divisive or stylized look can be genuinely good and still score low because it is not what the annotators tended to prefer.
  • EPS is preference, not correctness. A high EPS says an image is likely to be liked, not that it satisfies the prompt. That is what Capability is for; read the two together.
  • Text and layout are excluded. The 27 images in text-rendering and layout/data-visualisation tags contribute nothing to EPS. For text-heavy use cases, lean on Capability, which still covers those images.
  • The reference is a snapshot. EPS is measured against the July-2026 median field. It is stable by design, but “as good as the median model” means the median at that time.