ImageBench V1 Methodology
ImageBench V1 evaluates text-to-image models with fixed prompts, fixed scoring questions, and a fixed multi-VLM routing strategy. It is designed to be easily reproducible and fully automated. The goal is to produce a meaningful ordering of models at the capability level, even when individual tests are imperfect.
Benchmark structure
V1 contains 64 tests. Each test has 3 prompt variants, so each model run produces 192 images total.
| Category | Tests | What is evaluated |
|---|---|---|
| Text Rendering | 5 | Spelling accuracy, multi-line layout, typography style |
| Spatial Reasoning | 19 | Counting, relative position, scale, compositionality |
| Human Realism | 14 | Faces, expressions, hands, full-body coherence |
| Professional Studio | 9 | Camera and lighting control, color precision |
| Graphical Design | 8 | Layout, style diversity, data visualization |
| Truthfulness | 9 | Physics, reflections, world knowledge constraints |
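One way a single test could be represented is sketched below. This is illustrative only: the field names and schema are assumptions, not the benchmark's actual format. What matters is that each test bundles its fixed prompt variants with its own scoring question.

```python
from dataclasses import dataclass

@dataclass
class Test:
    """One V1 test: a fixed prompt family plus its binary scoring question."""
    test_id: str                # illustrative identifier; the real scheme is not shown here
    category: str               # one of the six categories above
    subcategory: str            # finer grouping used in aggregation
    difficulty: str             # difficulty tier used in aggregation
    prompt_variants: list[str]  # the 3 fixed prompt wordings
    eval_question: str          # concrete PASS/FAIL question posed to the judges
```

Keeping the prompts and the scoring question in one record means a run is fully determined by the suite plus the generator, with nothing chosen at evaluation time.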
Evaluation pipeline
- Generate 192 images for the target model from the fixed V1 prompt suite.
- Ask vision judges a concrete binary question per image and parse a PASS/FAIL verdict.
- Apply category-level preferred/fallback routing to produce one blended verdict per image.
- Aggregate overall, category, subcategory, and difficulty pass rates for publication (a sketch of this pipeline follows).
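A minimal sketch of that loop, assuming the `Test` record above. `generate_image` is a hypothetical stand-in for the actual generation call, and `judge_image` is sketched under "Multi-VLM routing" below.

```python
def evaluate_model(model, tests):
    """Run the fixed V1 suite: generate, judge with routing, then aggregate."""
    verdicts = []
    for test in tests:                       # 64 tests
        for prompt in test.prompt_variants:  # 3 variants each -> 192 images
            image = generate_image(model, prompt)  # hypothetical generation call
            verdict = judge_image(test.category, image, test.eval_question)
            verdicts.append((test, verdict))
    passed = sum(1 for _, v in verdicts if v == "PASS")
    return verdicts, passed / len(verdicts)  # per-image verdicts + headline score
```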
Each evaluation question is written specifically for the test. A spatial reasoning test asking for “three red balls on the left and two blue cubes on the right” gets a question that asks exactly that — not a generic quality question. This anchors the scoring to the prompt intent, not to general aesthetics.
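For the spatial example above, the paired test and question might look like this. The identifier, subcategory, difficulty tier, and exact question wording are hypothetical:

```python
example_test = Test(
    test_id="spatial-balls-cubes",   # hypothetical identifier
    category="Spatial Reasoning",
    subcategory="counting",          # hypothetical grouping
    difficulty="medium",             # hypothetical tier
    prompt_variants=[
        "Three red balls on the left and two blue cubes on the right",
        # ...plus two fixed rewordings of the same scene
    ],
    eval_question=(
        "Are there exactly three red balls on the left side and exactly "
        "two blue cubes on the right side? Answer PASS or FAIL."
    ),
)
```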
Multi-VLM routing
No single vision-language model (VLM) is best at every category. We route each evaluation to the VLM that performs best on that category in our calibration study. A fallback VLM is used when the preferred model is unavailable or returns an ambiguous verdict.
| Category | Preferred VLM |
|---|---|
| Text Rendering | qwen3-vl |
| Spatial Reasoning | qwen35-122b |
| Human Realism | qwen3-vl |
| Professional Studio | gemma4-26b |
| Graphical Design | qwen3-vl |
| Truthfulness | qwen36-27b |
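A sketch of how such routing could work, under the assumptions above. `ask_vlm` and `UnavailableError` are hypothetical stand-ins for the actual judge client, and the fallback model name is a placeholder, since the document does not specify it:

```python
class UnavailableError(Exception):
    """Raised by ask_vlm (hypothetical) when a judge model cannot be reached."""

PREFERRED_VLM = {
    "Text Rendering": "qwen3-vl",
    "Spatial Reasoning": "qwen35-122b",
    "Human Realism": "qwen3-vl",
    "Professional Studio": "gemma4-26b",
    "Graphical Design": "qwen3-vl",
    "Truthfulness": "qwen36-27b",
}
FALLBACK_VLM = "fallback-vlm"  # placeholder name; the fallback model is not specified here

def judge_image(category, image, question):
    """Ask the category's preferred judge; fall back if unavailable or ambiguous."""
    try:
        verdict = ask_vlm(PREFERRED_VLM[category], image, question)
    except UnavailableError:
        return ask_vlm(FALLBACK_VLM, image, question)
    if verdict not in ("PASS", "FAIL"):  # ambiguous verdict -> reroute
        return ask_vlm(FALLBACK_VLM, image, question)
    return verdict
```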
Scoring
The headline score is the blended routing pass rate across all 192 images:
score = PASS count / total evaluated images
The same formula applies per category, subcategory, and difficulty tier. This lets you see not just the overall rank, but exactly where a model excels or struggles.
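As a sketch, the same tally grouped by category, reusing the `(test, verdict)` pairs returned by the pipeline sketch above; swapping `test.category` for `test.subcategory` or `test.difficulty` yields the other breakdowns:

```python
from collections import defaultdict

def pass_rates(verdicts):
    """Blended pass rates: overall plus one rate per category."""
    by_cat = defaultdict(lambda: [0, 0])  # category -> [pass count, total]
    for test, verdict in verdicts:
        by_cat[test.category][0] += (verdict == "PASS")
        by_cat[test.category][1] += 1
    rates = {cat: p / t for cat, (p, t) in by_cat.items()}
    rates["overall"] = (sum(p for p, _ in by_cat.values())
                        / sum(t for _, t in by_cat.values()))
    return rates
```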
Limitations
- VLM judges can hallucinate or be inconsistent on edge cases. The routing strategy reduces this, but does not eliminate it.
- 64 tests is a small set. High variance on a handful of tests can shift overall rankings. Three prompt variants per test mitigate this but do not eliminate it.
- The benchmark measures capability adherence, not aesthetic quality or user preference. A model can score poorly and still produce beautiful images.
- Scores reflect the model at the time of evaluation. API models change without notice; local models are pinned to specific checkpoints.