ImageBench V1 Methodology

ImageBench V1 evaluates text-to-image models with fixed prompts, fixed scoring questions, and a fixed multi-VLM routing strategy. It is designed to be fully automated and easy to reproduce. The goal is a meaningful ordering of models at the capability level, even when individual tests are imperfect.

Benchmark structure

V1 contains 64 tests. Each test has three prompt variants, so each model run produces 192 images in total.

Category            | Tests | What is evaluated
------------------- | ----- | -----------------------------------------------------
Text Rendering      | 5     | Spelling accuracy, multi-line layout, typography style
Spatial Reasoning   | 19    | Counting, relative position, scale, compositionality
Human Realism       | 14    | Faces, expressions, hands, full-body coherence
Professional Studio | 9     | Camera and lighting control, color precision
Graphical Design    | 8     | Layout, style diversity, data visualization
Truthfulness        | 9     | Physics, reflections, world knowledge constraints
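
For concreteness, here is a minimal sketch of how a single test could be represented. The `Test` class, its field names, and the example prompts are illustrative assumptions, not ImageBench's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Test:
    category: str                # one of the six categories above
    prompt_variants: list[str]   # three phrasings of the same constraint
    judge_question: str          # binary PASS/FAIL question for this test

# Illustrative test; the real V1 suite holds 64 of these, and the
# prompt wordings below are invented for the example.
example = Test(
    category="Spatial Reasoning",
    prompt_variants=[
        "two cats sitting to the left of a dog",
        "a dog with two cats seated on its left",
        "two cats on the left side, one dog on the right side",
    ],
    judge_question="Are there exactly two cats positioned to the left of a dog?",
)

# 64 tests x 3 prompt variants = 192 images per model run.
```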

Evaluation pipeline

  1. Generate 192 images for the target model from the fixed V1 prompt suite.
  2. Ask vision judges a concrete binary question per image and parse a PASS/FAIL verdict.
  3. Apply category-level preferred/fallback routing to produce one blended verdict per image.
  4. Aggregate overall, category, subcategory, and difficulty pass rates for publication.

Each evaluation question is written specifically for its test. A spatial reasoning test asking for “three red balls on the left and two blue cubes on the right” gets a question that asks exactly that, not a generic quality question. This anchors scoring to the prompt's intent rather than to general aesthetics.
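
As a rough sketch of step 2, assuming a generic `ask_vlm` callable that stands in for whatever judge API is actually used (the prompt framing and the parsing rule are assumptions):

```python
import re
from typing import Callable

def parse_verdict(reply: str) -> str | None:
    """Pull PASS/FAIL out of a judge reply; None means ambiguous."""
    match = re.search(r"\b(PASS|FAIL)\b", reply.upper())
    return match.group(1) if match else None

def judge_image(ask_vlm: Callable[[bytes, str], str],
                image: bytes, question: str) -> str | None:
    """Ask one VLM the test's binary question about one image (step 2)."""
    reply = ask_vlm(image, f"{question} Answer with PASS or FAIL only.")
    return parse_verdict(reply)
```

An ambiguous reply, one with no clean PASS/FAIL token, returns None, which is what triggers the fallback routing described next.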

Multi-VLM routing

No single vision-language model (VLM) is best at every category. We route each evaluation to the VLM that performs best on that category in our calibration study. A fallback VLM is used when the preferred model is unavailable or returns an ambiguous verdict.

Category            | Preferred VLM
------------------- | -------------
Text Rendering      | qwen3-vl
Spatial Reasoning   | qwen35-122b
Human Realism       | qwen3-vl
Professional Studio | gemma4-26b
Graphical Design    | qwen3-vl
Truthfulness        | qwen36-27b
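
A minimal sketch of the routing, using the table above. The fallback model and the handling of verdicts that remain unresolvable are assumptions the post does not specify; the `ask` callable is the hypothetical `judge_image` from the previous sketch, bound to a named model:

```python
from typing import Callable

PREFERRED_VLM = {
    "Text Rendering":      "qwen3-vl",
    "Spatial Reasoning":   "qwen35-122b",
    "Human Realism":       "qwen3-vl",
    "Professional Studio": "gemma4-26b",
    "Graphical Design":    "qwen3-vl",
    "Truthfulness":        "qwen36-27b",
}
FALLBACK_VLM = "qwen3-vl"  # assumption: the post does not name the fallback model

def blended_verdict(category: str, image: bytes, question: str,
                    ask: Callable[[str, bytes, str], str | None]) -> str:
    """Route to the category's preferred judge; fall back when it is
    unavailable or returns an ambiguous verdict (step 3)."""
    try:
        verdict = ask(PREFERRED_VLM[category], image, question)
    except ConnectionError:  # preferred judge unavailable
        verdict = None
    if verdict is None:      # ambiguous reply, or the call above failed
        verdict = ask(FALLBACK_VLM, image, question)
    return verdict if verdict is not None else "FAIL"  # assumption: count as FAIL
```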

Scoring

The headline score is the blended routing pass rate across all 192 images:

score = PASS count / total evaluated images

The same formula applies per category, subcategory, and difficulty tier. This lets you see not just the overall rank, but exactly where a model excels or struggles.
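
The aggregation is plain grouped counting; a sketch over (category, verdict) pairs, with hypothetical helper names:

```python
def pass_rate(verdicts: list[str]) -> float:
    # score = PASS count / total evaluated images
    return verdicts.count("PASS") / len(verdicts)

def category_rates(results: list[tuple[str, str]]) -> dict[str, float]:
    """Per-category pass rates; the same grouping works for
    subcategory and difficulty tier."""
    grouped: dict[str, list[str]] = {}
    for category, verdict in results:
        grouped.setdefault(category, []).append(verdict)
    return {cat: pass_rate(vs) for cat, vs in grouped.items()}
```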

Limitations

  • VLM judges can hallucinate or be inconsistent on edge cases. The routing strategy reduces this, but does not eliminate it.
  • 64 tests is a small set. High variance on a handful of tests can shift overall rankings. Three prompt variants per test mitigate this but do not eliminate it.
  • The benchmark measures capability adherence, not aesthetic quality or user preference. A model can score poorly and still produce beautiful images.
  • Scores reflect the model at the time of evaluation. API models change without notice; local models are pinned to specific checkpoints.