ImageBench

Benchmark V1 Methodology

ImageBench V1 evaluates text-to-image models with fixed prompts, fixed scoring questions, and a fixed multi-VLM routing strategy. It is designed to be easily reproducible and fully automated. The goal is to produce a meaningful ordering of models at the capability level, even when individual tests are imperfect.

Design goals

  • Broad capability coverage across six categories and multiple difficulty tiers.
  • Three prompt variants per test to reduce lucky or unlucky single-sample variance.
  • VLM-graded PASS/FAIL decisions using concrete, test-specific evaluation criteria.
  • Reproducibility through fixed prompt definitions and deterministic routing policy.

Benchmark structure

V1 contains 64 tests. Each test has 3 prompt variants, so each model run produces 64 × 3 = 192 images in total.
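
For illustration, one minimal way to represent the suite in code (a sketch only; the field names and ID scheme are hypothetical, not the actual V1 prompt definitions):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TestCase:
    test_id: str                    # hypothetical ID scheme, e.g. "text_rendering_01"
    category: str                   # one of the six categories in the table below
    difficulty: str                 # difficulty tier label
    variants: tuple[str, str, str]  # exactly three fixed prompt variants

# A full suite is 64 TestCase entries, so one model run yields 64 * 3 = 192 images.
```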

Category             Tests  What is evaluated
Text Rendering           5  Spelling accuracy, multi-line layout, typography style
Spatial Reasoning       19  Counting, relative position, scale, compositionality
Human Realism           14  Faces, expressions, hands, full-body coherence
Professional Studio      9  Camera and lighting control, color precision
Graphical Design         8  Layout, style diversity, data visualization
Truthfulness             9  Physics, reflections, world knowledge constraints

Evaluation pipeline

  1. Generate 192 images for the target model from the fixed V1 prompt suite.
  2. Ask the vision judges a concrete binary question per image and parse a PASS/FAIL verdict (a parsing sketch follows this list).
  3. Apply category-level preferred/fallback routing to produce one blended verdict per image.
  4. Aggregate overall, category, subcategory, and difficulty pass rates for publication.
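
A minimal sketch of the verdict parsing in step 2, assuming judges are instructed to end their reply with the literal token PASS or FAIL (the reply format is an assumption, not the documented V1 protocol):

```python
def parse_verdict(reply: str) -> str:
    """Map a judge's free-text reply to a PASS/FAIL verdict.

    Only the final line is inspected, so any reasoning above the
    verdict token is ignored. Unparseable replies default to FAIL
    (a conservative choice assumed here, not specified by V1).
    """
    tail = reply.strip().upper()
    last_line = tail.splitlines()[-1] if tail else ""
    if "PASS" in last_line and "FAIL" not in last_line:
        return "PASS"
    return "FAIL"
```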

Multi-VLM routing table

Category             Preferred VLM
Text Rendering       qwen3-vl
Spatial Reasoning    qwen35-122b
Human Realism        qwen3-vl
Professional Studio  gemma4-26b
Graphical Design     qwen3-vl
Truthfulness         qwen36-27b
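
As a sketch of how the preferred/fallback policy from step 3 might be applied (the fallback order and the missing-verdict default are assumptions; V1's actual fallback list is not shown in the table above):

```python
PREFERRED = {
    "Text Rendering": "qwen3-vl",
    "Spatial Reasoning": "qwen35-122b",
    "Human Realism": "qwen3-vl",
    "Professional Studio": "gemma4-26b",
    "Graphical Design": "qwen3-vl",
    "Truthfulness": "qwen36-27b",
}

def route_verdict(category: str, verdicts: dict[str, str],
                  fallback_order: list[str]) -> str:
    """Blend per-judge verdicts into one verdict per image.

    `verdicts` maps VLM name -> "PASS"/"FAIL". The category's
    preferred judge wins when it produced a verdict; otherwise the
    fallback order is walked. Deterministic given fixed inputs.
    """
    preferred = PREFERRED[category]
    if preferred in verdicts:
        return verdicts[preferred]
    for vlm in fallback_order:
        if vlm in verdicts:
            return verdicts[vlm]
    return "FAIL"  # no judge answered: conservative default (assumption)
```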

Scoring

The headline score is the blended routing score from eval_results_multi-vlm.csv:

score = PASS count / total evaluated images

The same computation is reported per category, subcategory, and difficulty tier.
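
A minimal sketch of this aggregation, assuming eval_results_multi-vlm.csv holds one row per evaluated image with category and verdict columns (the column names are assumptions about the file layout):

```python
import csv
from collections import defaultdict

def pass_rates(path: str = "eval_results_multi-vlm.csv") -> dict[str, float]:
    """Compute the overall and per-category pass rates.

    Assumes one row per evaluated image with at least a `category`
    column and a `verdict` column holding "PASS" or "FAIL"; the real
    column names may differ.
    """
    passes, totals = defaultdict(int), defaultdict(int)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            for key in ("overall", row["category"]):
                totals[key] += 1
                if row["verdict"] == "PASS":
                    passes[key] += 1
    return {key: passes[key] / totals[key] for key in totals}
```

The same loop extends to subcategory and difficulty breakdowns by adding those columns to the key tuple.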