ImageBench V1 Methodology
ImageBench V1 evaluates text-to-image models with fixed prompts, fixed scoring questions, and a fixed multi-VLM routing strategy. It is designed to be easily reproducible and fully automated. The goal is to produce a meaningful ordering of models at the capability level, even when individual tests are imperfect.
Benchmark structure
V1 contains 64 tests. Each test has 3 prompt variants, so each model run produces 192 images total.
| Category | Tests | What is evaluated |
|---|---|---|
| Text Rendering | 5 | Spelling accuracy, multi-line layout, typography style |
| Spatial Reasoning | 19 | Counting, relative position, scale, compositionality |
| Human Realism | 14 | Faces, expressions, hands, full-body coherence |
| Professional Studio | 9 | Camera and lighting control, color precision |
| Graphical Design | 8 | Layout, style diversity, data visualization |
| Truthfulness | 9 | Physics, reflections, world knowledge constraints |
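One way a single test could be represented is sketched below. This is illustrative only: the field names and schema are assumptions, not the benchmark's actual format. What matters is that each test bundles its fixed prompt variants with its own scoring question.

```python
from dataclasses import dataclass

@dataclass
class Test:
    """One V1 test: a fixed prompt family plus its binary scoring question."""
    test_id: str                # illustrative identifier; the real scheme is not shown here
    category: str               # one of the six categories above
    subcategory: str            # finer grouping used in aggregation
    difficulty: str             # difficulty tier used in aggregation
    prompt_variants: list[str]  # the 3 fixed prompt wordings
    eval_question: str          # concrete PASS/FAIL question posed to the judges
```

Keeping the prompts and the scoring question in one record means a run is fully determined by the suite plus the generator, with nothing chosen at evaluation time.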
Evaluation pipeline
- Generate 192 images for the target model from the fixed V1 prompt suite.
- Ask vision judges a concrete binary question per image and parse a PASS/FAIL verdict.
- Apply category-level preferred/fallback routing to produce one blended verdict per image.
- Aggregate overall, category, subcategory, and difficulty pass rates for publication (a sketch of this pipeline follows).
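A minimal sketch of that loop, assuming the `Test` record above. `generate_image` is a hypothetical stand-in for the actual generation call, and `judge_image` is sketched under "Multi-VLM routing" below.

```python
def evaluate_model(model, tests):
    """Run the fixed V1 suite: generate, judge with routing, then aggregate."""
    verdicts = []
    for test in tests:                       # 64 tests
        for prompt in test.prompt_variants:  # 3 variants each -> 192 images
            image = generate_image(model, prompt)  # hypothetical generation call
            verdict = judge_image(test.category, image, test.eval_question)
            verdicts.append((test, verdict))
    passed = sum(1 for _, v in verdicts if v == "PASS")
    return verdicts, passed / len(verdicts)  # per-image verdicts + headline score
```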
Each evaluation question is written specifically for the test. A spatial reasoning test asking for “three red balls on the left and two blue cubes on the right” gets a question that asks exactly that — not a generic quality question. This anchors the scoring to the prompt intent, not to general aesthetics.
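For the spatial example above, the paired test and question might look like this. The identifier, subcategory, difficulty tier, and exact question wording are hypothetical:

```python
example_test = Test(
    test_id="spatial-balls-cubes",   # hypothetical identifier
    category="Spatial Reasoning",
    subcategory="counting",          # hypothetical grouping
    difficulty="medium",             # hypothetical tier
    prompt_variants=[
        "Three red balls on the left and two blue cubes on the right",
        # ...plus two fixed rewordings of the same scene
    ],
    eval_question=(
        "Are there exactly three red balls on the left side and exactly "
        "two blue cubes on the right side? Answer PASS or FAIL."
    ),
)
```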
Multi-VLM routing
No single vision-language model (VLM) is best at every category. We route each evaluation to the VLM that performs best on that category in our calibration study. A fallback VLM is used when the preferred model is unavailable or returns an ambiguous verdict.
| Category | Preferred VLM |
|---|---|
| Text Rendering | qwen3-vl |
| Spatial Reasoning | qwen35-122b |
| Human Realism | qwen3-vl |
| Professional Studio | gemma4-26b |
| Graphical Design | qwen3-vl |
| Truthfulness | qwen36-27b |
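A sketch of how such routing could work, under the assumptions above. `ask_vlm` and `UnavailableError` are hypothetical stand-ins for the actual judge client, and the fallback model name is a placeholder, since the document does not specify it:

```python
class UnavailableError(Exception):
    """Raised by ask_vlm (hypothetical) when a judge model cannot be reached."""

PREFERRED_VLM = {
    "Text Rendering": "qwen3-vl",
    "Spatial Reasoning": "qwen35-122b",
    "Human Realism": "qwen3-vl",
    "Professional Studio": "gemma4-26b",
    "Graphical Design": "qwen3-vl",
    "Truthfulness": "qwen36-27b",
}
FALLBACK_VLM = "fallback-vlm"  # placeholder name; the fallback model is not specified here

def judge_image(category, image, question):
    """Ask the category's preferred judge; fall back if unavailable or ambiguous."""
    try:
        verdict = ask_vlm(PREFERRED_VLM[category], image, question)
    except UnavailableError:
        return ask_vlm(FALLBACK_VLM, image, question)
    if verdict not in ("PASS", "FAIL"):  # ambiguous verdict -> reroute
        return ask_vlm(FALLBACK_VLM, image, question)
    return verdict
```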
Scoring
The headline score is the blended routing pass rate across all 192 images:
score = PASS count / total evaluated images
The same formula applies per category, subcategory, and difficulty tier. This lets you see not just the overall rank, but exactly where a model excels or struggles.
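As a sketch, the same tally grouped by category, reusing the `(test, verdict)` pairs returned by the pipeline sketch above; swapping `test.category` for `test.subcategory` or `test.difficulty` yields the other breakdowns:

```python
from collections import defaultdict

def pass_rates(verdicts):
    """Blended pass rates: overall plus one rate per category."""
    by_cat = defaultdict(lambda: [0, 0])  # category -> [pass count, total]
    for test, verdict in verdicts:
        by_cat[test.category][0] += (verdict == "PASS")
        by_cat[test.category][1] += 1
    rates = {cat: p / t for cat, (p, t) in by_cat.items()}
    rates["overall"] = (sum(p for p, _ in by_cat.values())
                        / sum(t for _, t in by_cat.values()))
    return rates
```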
Limitations
- VLM judges can hallucinate or be inconsistent on edge cases. The routing strategy reduces this, but does not eliminate it.
- 64 tests is a small set. High variance on a handful of tests can shift overall rankings. Three prompt variants per test mitigate this but do not eliminate it.
- The benchmark measures capability adherence, not aesthetic quality or user preference. A model can score poorly and still produce beautiful images.
- Scores reflect the model at the time of evaluation. API models change without notice; local models are pinned to specific checkpoints.