Why a Few Good Prompts Are Enough
The instinct behind a benchmark is “more is better” — more prompts, more images, more coverage. But a result from language-model evaluation says otherwise: a small, carefully chosen set of examples can estimate a model's score almost as well as the full set. This is the theory that shapes how ImageBench is built.
A lesson from language models
In 2024, the tinyBenchmarks paper (Maia Polo et al., ICML 2024) made the point concretely. The authors showed that about 100 well-chosen questions can stand in for MMLU's 14,000 and still estimate a model's full-benchmark score to within roughly 2%. That is a 140× reduction in evaluation cost for a couple of percentage points of error.
| Benchmark | Original size | Reduction | Estimation error |
|---|---|---|---|
| MMLU | 14,000 | 140× | ±1.5% |
| HellaSwag | 10,000 | 100× | ±1.1% |
| TruthfulQA | ~800 | 8× | ±1.1% |
| GSM8K | 1,319 | 13× | ±2.0% |
| ARC | 1,172 | 11× | ±1.0% |
| Winogrande | 1,267 | 12× | ±1.1% |
The result only works because the examples are chosen, not sampled at random. A random 100 questions would give you a noisy estimate. The right 100 give you a precise one. To understand why, you need to see that not every question is equally informative.
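To get a feel for how noisy a random 100 is, the sketch below computes the standard error of a pass rate estimated from a simple random sample. The true score of 0.60 is an assumed value chosen only for illustration, and standard error is not the same metric as the estimation error reported in the table, so treat the comparison as rough.

```python
import math


def random_sample_standard_error(true_score: float, sample_size: int, pool_size: int) -> float:
    """Standard error of a pass rate estimated from a simple random sample
    drawn without replacement from a pool of `pool_size` questions."""
    variance = true_score * (1.0 - true_score) / sample_size
    # Finite population correction: sampling without replacement from a finite pool.
    correction = (pool_size - sample_size) / (pool_size - 1)
    return math.sqrt(variance * correction)


# Assumed values for illustration: a model whose true MMLU-style score is 60%.
se = random_sample_standard_error(true_score=0.60, sample_size=100, pool_size=14_000)
print(f"standard error of a random 100: {se:.3f}")
print(f"approximate 95% interval: +/- {1.96 * se:.3f}")
```

Under these assumptions the standard error comes out near five percentage points, so a random subset of this size can easily miss by more than the couple of points quoted for the chosen subset.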
Not all prompts are equally informative
The tool underneath tinyBenchmarks is Item Response Theory (IRT), a framework originally built to score standardized tests. Its core idea is that whether a test-taker gets an item right depends on two things about the item and one thing about the taker:
| Parameter | What it captures |
|---|---|
| β (difficulty) | How hard a prompt is. A high-β prompt is one that most models fail; a low-β prompt is one almost everyone passes. |
| α (discrimination) | How sharply a prompt separates strong models from weak ones. High-α prompts are the informative ones: the verdict flips from fail to pass right around the ability level you care about. |
| θ (ability) | The latent skill of a model. In the multidimensional version this is not a single number but a vector of capabilities that a good prompt set can triangulate. |
IRT ties these together with a simple curve. The probability that model l passes prompt i is a sigmoid σ of the model's ability, scaled by how discriminating the prompt is, minus the prompt's difficulty:
P(pass) = σ( αᵢ · θₗ − βᵢ )
The intuition is what matters. A prompt that every model passes (very low difficulty) or that every model fails (very high difficulty) adds almost nothing to a ranking — the answer is a foregone conclusion. The valuable prompts are the discriminating ones: the prompts where better models pass and weaker ones fail, right around the ability range you are trying to resolve. Fit an IRT model to how existing models performed, and you can read off which prompts those are — then keep them and drop the rest.
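The intuition can be made concrete with a minimal sketch of the curve above in its one-dimensional form. For this parameterization, the Fisher information a prompt carries about θ is α² · p · (1 − p), which peaks when the pass probability is 0.5 and collapses toward zero for foregone conclusions. The three prompts and their parameter values are invented for illustration.

```python
import math


def pass_probability(alpha: float, theta: float, beta: float) -> float:
    """P(pass) = sigmoid(alpha * theta - beta), the curve given in the text."""
    return 1.0 / (1.0 + math.exp(-(alpha * theta - beta)))


def item_information(alpha: float, theta: float, beta: float) -> float:
    """Fisher information about theta carried by one pass/fail outcome."""
    p = pass_probability(alpha, theta, beta)
    return alpha ** 2 * p * (1.0 - p)


# Hypothetical prompts as (alpha, beta) pairs; the values are illustrative only.
prompts = {
    "everyone_passes": (1.0, -4.0),
    "everyone_fails": (1.0, 4.0),
    "discriminating": (2.0, 0.0),
}

theta = 0.0  # the ability level we want to resolve
information = {
    name: item_information(alpha, theta, beta)
    for name, (alpha, beta) in prompts.items()
}

for name, value in information.items():
    alpha, beta = prompts[name]
    p = pass_probability(alpha, theta, beta)
    print(f"{name:16s} P(pass)={p:.3f}  information={value:.3f}")
```

With these made-up values, the discriminating prompt carries roughly fifty times the information of either foregone-conclusion prompt at θ = 0.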
Mapping the idea onto image models
tinyBenchmarks was written for language models, where an answer is right or wrong. But a capability benchmark for image models has the same shape. Record, for every model and every prompt, a single bit: did this model pass this prompt? That gives a response matrix — models down the rows, prompts across the columns, a pass/fail in each cell. It is exactly the object IRT consumes.
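A response matrix of this kind can be sketched in a few lines. The model names, prompt names, and pass/fail values below are all made up for illustration and do not describe any real ImageBench data.

```python
# Columns: one entry per prompt (names are hypothetical).
prompt_names = ["red_apple", "street_sign_text", "five_object_hand", "impossible_prompt"]

# Rows: one list of pass (1) / fail (0) bits per model (names are hypothetical).
responses = {
    "model_a": [1, 1, 1, 0],
    "model_b": [1, 1, 0, 0],
    "model_c": [1, 0, 0, 0],
}

# Per-model score: fraction of prompts passed (row mean).
model_scores = {
    model: sum(row) / len(row) for model, row in responses.items()
}

# Per-prompt pass rate: fraction of models that passed (column mean).
pass_rates = {
    name: sum(row[col] for row in responses.values()) / len(responses)
    for col, name in enumerate(prompt_names)
}

# Prompts that every model passes or every model fails cannot change a ranking.
uninformative = [name for name, rate in pass_rates.items() if rate in (0.0, 1.0)]

print("model scores:", model_scores)
print("pass rates:", pass_rates)
print("uninformative prompts:", uninformative)
```

The column means are a crude stand-in for difficulty; fitting an IRT model to the same matrix is what would yield α and β for each prompt.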
Under that lens the IRT parameters read naturally for text-to-image:
- Difficulty (β) — “a photorealistic hand holding five distinct objects” is hard; “a red apple” is not.
- Discrimination (α) — a prompt that strong models render correctly and weak models do not is worth many prompts that every model handles.
- Ability (θ) — a model's latent capability, ideally broken out by dimension: text rendering, spatial reasoning, human realism, and so on.
The payoff is the same as in the language case. Once you know each prompt's difficulty and discrimination, you can select the handful that carry the most information and estimate a new model's full-benchmark score from just those — instead of re-running everything.
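The select-then-estimate loop might look like the sketch below. It is a deliberately simplified, one-dimensional illustration and not the tinyBenchmarks estimator: it assumes the item parameters have already been fitted, picks prompts by Fisher information at a target ability, and estimates θ by grid-search maximum likelihood. All prompt names, parameter values, and observed results are hypothetical.

```python
import math


def pass_probability(alpha: float, theta: float, beta: float) -> float:
    """P(pass) = sigmoid(alpha * theta - beta)."""
    return 1.0 / (1.0 + math.exp(-(alpha * theta - beta)))


def select_prompts(items: dict, target_theta: float, k: int) -> list:
    """Keep the k prompts with the most Fisher information at target_theta."""
    def information(name: str) -> float:
        alpha, beta = items[name]
        p = pass_probability(alpha, target_theta, beta)
        return alpha ** 2 * p * (1.0 - p)
    return sorted(items, key=information, reverse=True)[:k]


def estimate_theta(items: dict, observed: dict) -> float:
    """Grid-search maximum-likelihood estimate of theta on [-4, 4]."""
    def log_likelihood(theta: float) -> float:
        total = 0.0
        for name, passed in observed.items():
            alpha, beta = items[name]
            p = pass_probability(alpha, theta, beta)
            total += math.log(p if passed else 1.0 - p)
        return total
    grid = [step / 100.0 for step in range(-400, 401)]
    return max(grid, key=log_likelihood)


def predict_full_score(items: dict, theta: float) -> float:
    """Expected pass rate over every prompt in the full benchmark."""
    return sum(pass_probability(a, theta, b) for a, b in items.values()) / len(items)


# Hypothetical fitted parameters: prompt name -> (alpha, beta).
items = {
    "red_apple": (0.8, -3.0),
    "two_objects_left_right": (1.5, -0.5),
    "street_sign_text": (2.0, 0.0),
    "five_object_hand": (1.8, 0.8),
    "mirror_reflection": (1.2, 3.5),
    "long_paragraph_text": (0.5, 4.0),
}

selected = select_prompts(items, target_theta=0.0, k=3)

# Hypothetical results from running a new model on the selected prompts only.
observed = {
    "street_sign_text": True,
    "five_object_hand": False,
    "two_objects_left_right": True,
}

theta_hat = estimate_theta(items, observed)
predicted_score = predict_full_score(items, theta_hat)
print("selected prompts:", selected)
print(f"estimated theta: {theta_hat:.2f}")
print(f"predicted full-benchmark pass rate: {predicted_score:.3f}")
```

The grid search keeps the example dependency-free; a real pipeline would more likely use a proper optimizer and a multidimensional θ.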
Where ImageBench stands today
To be clear about what is theory and what is shipping: ImageBench does not yet fit an IRT model or extrapolate scores from a reduced prompt set. The method is documented in the V1 methodology.
What the theory already shapes is the design philosophy. Benchmark v1 is built from prompts chosen to be discriminating — grouped into capability categories, written so that a specific, checkable question separates a passing image from a failing one — rather than from a huge random sweep.
Further reading
- tinyBenchmarks: evaluating LLMs with fewer examples — Maia Polo, Weber, Choshen, Sun, Xu, Yurochkin (ICML 2024).
- tinyBenchmarks code and datasets — reference implementation of the estimators.
- Anchor Points — the direct precursor: selecting representative examples by clustering model predictions.