Benchmark V1 Methodology
ImageBench V1 evaluates text-to-image models with fixed prompts, fixed scoring questions, and a fixed multi-VLM routing strategy. It is designed to be reproducible and fully automated. The goal is to produce a meaningful ordering of models at the capability level, even when individual tests are imperfect.
Design goals
- Broad capability coverage across six categories and multiple difficulty tiers.
- Three prompt variants per test to reduce the influence of a single lucky or unlucky sample.
- VLM-graded PASS/FAIL decisions using concrete, test-specific evaluation criteria.
- Reproducibility through fixed prompt definitions and a deterministic routing policy.
Benchmark structure
V1 contains 64 tests. Each test has 3 prompt variants, so each model run produces 192 images total.
| Category | Tests | What is evaluated |
|---|---|---|
| Text Rendering | 5 | Spelling accuracy, multi-line layout, typography style |
| Spatial Reasoning | 19 | Counting, relative position, scale, compositionality |
| Human Realism | 14 | Faces, expressions, hands, full-body coherence |
| Professional Studio | 9 | Camera and lighting control, color precision |
| Graphical Design | 8 | Layout, style diversity, data visualization |
| Truthfulness | 9 | Physics, reflections, world knowledge constraints |
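As a quick consistency check, the per-category counts in the table can be summed to confirm the stated totals of 64 tests and 192 images. The sketch below only restates numbers from the table; the constant and variable names are illustrative and not part of the benchmark code.

```python
# Test counts per category, copied from the table above.
TESTS_PER_CATEGORY = {
    "Text Rendering": 5,
    "Spatial Reasoning": 19,
    "Human Realism": 14,
    "Professional Studio": 9,
    "Graphical Design": 8,
    "Truthfulness": 9,
}
VARIANTS_PER_TEST = 3  # three prompt variants per test

total_tests = sum(TESTS_PER_CATEGORY.values())
total_images = total_tests * VARIANTS_PER_TEST

# Images generated per category for a single model run.
images_per_category = {
    category: count * VARIANTS_PER_TEST
    for category, count in TESTS_PER_CATEGORY.items()
}

print(total_tests, total_images)
```

Because categories differ in size (Spatial Reasoning contributes 57 images, Text Rendering only 15), the overall pass rate weights categories unequally, which is one reason per-category rates are reported alongside it.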
Evaluation pipeline
- Generate 192 images for the target model from the fixed V1 prompt suite.
- Ask vision judges a concrete binary question per image and parse a PASS/FAIL verdict.
- Apply category-level preferred/fallback routing to produce one blended verdict per image.
- Aggregate overall, category, subcategory, and difficulty pass rates for publication.
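The verdict-parsing and routing steps above might look roughly like the following. This is a minimal sketch under stated assumptions, not the benchmark's actual implementation: the function names `parse_verdict` and `blended_verdict` are hypothetical, and the fallback rule shown here (use the fallback judge only when the preferred judge's reply cannot be parsed) is an assumption, since the document does not specify when the fallback applies.

```python
import re
from typing import Optional


def parse_verdict(raw: str) -> Optional[bool]:
    """Return True for PASS, False for FAIL, None if the reply is unusable.

    Hypothetical parser: a reply counts only if it contains exactly one
    kind of verdict keyword as a standalone word.
    """
    found = set(re.findall(r"\b(PASS|FAIL)\b", raw.upper()))
    if len(found) != 1:
        return None  # no keyword, or both keywords (ambiguous)
    return found.pop() == "PASS"


def blended_verdict(preferred_raw: str, fallback_raw: str) -> Optional[bool]:
    """Assumed routing rule: trust the preferred judge when it parses,
    otherwise fall back to the second judge."""
    preferred = parse_verdict(preferred_raw)
    if preferred is not None:
        return preferred
    return parse_verdict(fallback_raw)


print(blended_verdict("Verdict: PASS", "FAIL"))      # preferred judge wins
print(blended_verdict("I am not sure.", "FAIL"))     # fallback is used
```

Keeping the rule a pure function of the two raw replies means the same judge outputs always yield the same blended verdict, which fits the reproducibility goal.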
Multi-VLM routing table
| Category | Preferred VLM |
|---|---|
| Text Rendering | qwen3-vl |
| Spatial Reasoning | qwen35-122b |
| Human Realism | qwen3-vl |
| Professional Studio | gemma4-26b |
| Graphical Design | qwen3-vl |
| Truthfulness | qwen36-27b |
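One way to keep the routing deterministic is to hold the table as a static mapping and look judges up by category. The sketch below mirrors the table above; the names `PREFERRED_VLM` and `preferred_judge` are illustrative, and fallback judges are omitted because the table does not list them.

```python
# Category -> preferred judge, copied from the routing table above.
PREFERRED_VLM = {
    "Text Rendering": "qwen3-vl",
    "Spatial Reasoning": "qwen35-122b",
    "Human Realism": "qwen3-vl",
    "Professional Studio": "gemma4-26b",
    "Graphical Design": "qwen3-vl",
    "Truthfulness": "qwen36-27b",
}


def preferred_judge(category: str) -> str:
    """Return the preferred VLM for a category, failing loudly on typos."""
    try:
        return PREFERRED_VLM[category]
    except KeyError:
        raise ValueError(f"unknown V1 category: {category!r}") from None


print(preferred_judge("Truthfulness"))
```

Raising on an unknown category, rather than silently choosing a default judge, avoids a mislabeled test being scored by the wrong model.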
Scoring
The headline score is the blended routing score, computed from eval_results_multi-vlm.csv:
score = PASS count / total evaluated images
The same computation is reported per category, subcategory, and difficulty.
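The pass-rate computation can be sketched as a single grouping function applied with different keys. The column names (`category`, `difficulty`, `verdict`), the difficulty labels, and the sample rows below are assumptions made for illustration; the actual schema of eval_results_multi-vlm.csv is not described here, and the rows are synthetic, not benchmark results.

```python
import csv
import io
from collections import defaultdict


def pass_rates(rows, group_by=None):
    """Compute PASS count / total rows, overall or per value of `group_by`."""
    counts = defaultdict(lambda: [0, 0])  # group -> [passes, total]
    for row in rows:
        group = row[group_by] if group_by else "overall"
        counts[group][0] += row["verdict"].strip().upper() == "PASS"
        counts[group][1] += 1
    return {group: passes / total for group, (passes, total) in counts.items()}


# Synthetic rows with an assumed schema, for illustration only.
SAMPLE_CSV = """category,difficulty,verdict
Text Rendering,easy,PASS
Text Rendering,hard,FAIL
Truthfulness,easy,PASS
Truthfulness,hard,PASS
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
overall = pass_rates(rows)
by_category = pass_rates(rows, group_by="category")
by_difficulty = pass_rates(rows, group_by="difficulty")

print(overall, by_category, by_difficulty)
```

To score a real run, the in-memory sample would be replaced by opening the results file with `csv.DictReader`, after adjusting the column names to match the file.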