Benchmark V1 Methodology

ImageBench V1 evaluates text-to-image models with fixed prompts, fixed scoring questions, and a fixed multi-VLM routing strategy. It is designed to be reproducible and fully automated. The goal is to produce a meaningful ordering of models at the capability level, even when individual tests are imperfect.

Design goals

Broad capability coverage across six categories and multiple difficulty tiers.
Three prompt variants per test to reduce the influence of a single lucky or unlucky sample.
VLM-graded PASS/FAIL decisions using concrete, test-specific evaluation criteria.
Reproducibility through fixed prompt definitions and a deterministic routing policy.

Benchmark structure

V1 contains 64 tests. Each test has 3 prompt variants, so each model run produces 192 images total.

Category	Tests	What is evaluated
Text Rendering	5	Spelling accuracy, multi-line layout, typography style
Spatial Reasoning	19	Counting, relative position, scale, compositionality
Human Realism	14	Faces, expressions, hands, full-body coherence
Professional Studio	9	Camera and lighting control, color precision
Graphical Design	8	Layout, style diversity, data visualization
Truthfulness	9	Physics, reflections, world knowledge constraints
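
The totals can be checked directly from the table. The short sketch below copies the per-category test counts from the table and confirms that they add up to 64 tests and 192 images; the variable names are illustrative and are not taken from the benchmark code.

```python
# Test counts per category, copied from the table above.
TESTS_PER_CATEGORY = {
    "Text Rendering": 5,
    "Spatial Reasoning": 19,
    "Human Realism": 14,
    "Professional Studio": 9,
    "Graphical Design": 8,
    "Truthfulness": 9,
}
VARIANTS_PER_TEST = 3  # each test has 3 prompt variants

total_tests = sum(TESTS_PER_CATEGORY.values())
total_images = total_tests * VARIANTS_PER_TEST

# Images generated per category in one model run.
images_per_category = {
    category: count * VARIANTS_PER_TEST
    for category, count in TESTS_PER_CATEGORY.items()
}

print(total_tests, total_images)  # 64 192
```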

Evaluation pipeline

Generate 192 images for the target model from the fixed V1 prompt suite.
Ask vision judges a concrete binary question per image and parse a PASS/FAIL verdict.
Apply category-level preferred/fallback routing to produce one blended verdict per image.
Aggregate overall, category, subcategory, and difficulty pass rates for publication.
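
The verdict-parsing step can be sketched as follows. This is a minimal illustration, assuming the judge is instructed to answer with the word PASS or FAIL somewhere in its reply; the function name parse_verdict and the rule that ambiguous replies yield no verdict are assumptions, not the benchmark's documented behavior.

```python
import re
from typing import Optional

# Assumed convention: the judge reply contains the standalone word PASS or FAIL.
_VERDICT_RE = re.compile(r"\b(PASS|FAIL)\b")


def parse_verdict(response: str) -> Optional[bool]:
    """Return True for PASS, False for FAIL, None when the reply is unclear.

    A reply that mentions both words, or neither, is treated as unparseable
    so that a downstream step can decide what to do with it.
    """
    found = set(_VERDICT_RE.findall(response.upper()))
    if len(found) != 1:
        return None
    return found.pop() == "PASS"
```

Returning None rather than guessing keeps unparseable replies visible, which matters when a fallback judge is available for the same image.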

Multi-VLM routing table

Category	Preferred VLM
Text Rendering	qwen3-vl
Spatial Reasoning	qwen35-122b
Human Realism	qwen3-vl
Professional Studio	gemma4-26b
Graphical Design	qwen3-vl
Truthfulness	qwen36-27b
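
One way to express the preferred/fallback routing in code is shown below. The preferred judges are copied from the table; the fallback order is not specified in this document, so fallback_order is a hypothetical parameter supplied by the caller, and blended_verdict is an illustrative name.

```python
from typing import Dict, List, Optional

# Preferred judge per category, copied from the routing table above.
PREFERRED_VLM = {
    "Text Rendering": "qwen3-vl",
    "Spatial Reasoning": "qwen35-122b",
    "Human Realism": "qwen3-vl",
    "Professional Studio": "gemma4-26b",
    "Graphical Design": "qwen3-vl",
    "Truthfulness": "qwen36-27b",
}


def blended_verdict(
    category: str,
    verdicts: Dict[str, Optional[bool]],
    fallback_order: List[str],
) -> Optional[bool]:
    """Pick one verdict per image: preferred judge first, then fallbacks.

    `verdicts` maps judge name to True (PASS), False (FAIL), or None
    (no usable verdict). `fallback_order` is an assumption of this sketch.
    """
    for judge in [PREFERRED_VLM[category], *fallback_order]:
        verdict = verdicts.get(judge)
        if verdict is not None:
            return verdict
    return None  # no judge produced a usable verdict
```

Because the lookup order is fixed for each category, the same set of judge verdicts always yields the same blended verdict, which is what makes the routing deterministic.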

Scoring

The headline score is the blended routing score computed from eval_results_multi-vlm.csv:

score = PASS count / total evaluated images

The same computation is reported per category, subcategory, and difficulty.
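
The pass-rate computation can be sketched as below. The column names (category, difficulty, verdict) and the four sample rows are assumptions made for illustration; they are not the actual schema or contents of eval_results_multi-vlm.csv.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows with an assumed schema, for illustration only.
SAMPLE_CSV = """category,difficulty,verdict
Text Rendering,easy,PASS
Text Rendering,hard,FAIL
Truthfulness,easy,PASS
Truthfulness,hard,PASS
"""


def pass_rates(rows, key=None):
    """Return PASS count / total evaluated images, overall or grouped by `key`."""
    counts = defaultdict(lambda: [0, 0])  # group -> [passes, total]
    for row in rows:
        group = row[key] if key else "overall"
        counts[group][0] += row["verdict"] == "PASS"
        counts[group][1] += 1
    return {group: passes / total for group, (passes, total) in counts.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(pass_rates(rows))              # {'overall': 0.75}
print(pass_rates(rows, "category"))  # {'Text Rendering': 0.5, 'Truthfulness': 1.0}
```

Passing a different key, such as "difficulty", reuses the same function for each reporting level.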
