Blog

Analysis, methodology, and findings from the ImageBench project.

Can Local VLMs Recognize Celebrities?

A 90-image public-figure recognition study across global icons, field-famous specialists, and long-tail Wikipedia notables.

Which VLM Should Judge Style Diversity?

A VLM calibration study for ImageBench Style Diversity, selecting Qwen 3.5 122B as the route judge.

RealBench V1 Methodology

How RealBench measures photorealism using paired real/AI images and human votes from the ImageBench community.

Which VLM Best Detects Bad Hands?

A detector calibration study using Nano Banana Pro and Bonsai Image 4B hand outputs.

GPT Image 2 vs Flux 2 vs Nano Banana 2

Side-by-side benchmark results for leading AI image generation models.

Best AI Image Generator for Text Rendering

Which models handle spelling, posters, labels, typography, and small text best.

ImageBench V1 Methodology

How ImageBench V1 is designed, scored, and reported across 64 tests and 6 capability categories.

Quality Metrics Across 10 Models

MUSIQ, NIQE, NIMA, and TOPIQ scores for all 10 ImageBench V1 models, and why the results are still preliminary.