Blog
Analysis, methodology, and findings from the ImageBench project.
01 · 6 min
Can Local VLMs Recognize Celebrities?
A 90-image public-figure recognition study across global icons, field-famous specialists, and long-tail Wikipedia notables.
02 · 5 min
Which VLM Should Judge Style Diversity?
A VLM calibration study for ImageBench Style Diversity, selecting Qwen 3.5 122B as the route judge.
03 · 3 min
RealBench V1 Methodology
How RealBench measures photorealism — paired real/AI images and human votes from the ImageBench community.
04 · 4 min
Which VLM Best Detects Bad Hands?
A detector calibration study using Nano Banana Pro and Bonsai Image 4B hand outputs.
05 · 5 min
GPT Image 2 vs Flux 2 vs Nano Banana 2
Side-by-side benchmark results for leading AI image generation models.
06 · 5 min
Best AI Image Generator for Text Rendering
Which models handle spelling, posters, labels, typography, and small text best.
07 · 5 min
ImageBench V1 Methodology
How ImageBench V1 is designed, scored, and reported across 64 tests and 6 capability categories.
08 · 7 min
Quality Metrics Across 10 Models
MUSIQ, NIQE, NIMA, and TOPIQ scores for all 10 ImageBench V1 models — and why the results are still preliminary.