# Quality Metrics Across 10 Models
We ran four image quality metrics — MUSIQ, NIQE, NIMA, and TOPIQ — on all 192 images per model in the ImageBench V1 suite. Here is what the numbers reveal about aesthetic quality, and why it is not the same as capability.
## What are these metrics?
These metrics assess image quality without needing a reference “perfect” image to compare against. They model what human observers find pleasant, sharp, or natural — purely from the image itself. This makes them practical for benchmarking generative models where there is no ground truth.
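To make the reference-free distinction concrete, here is a toy contrast in plain Python. It is deliberately not one of the four metrics above, and the 6×6 "images" are made up: mean squared error needs a ground-truth reference, while a crude sharpness proxy such as the variance of a Laplacian filter response is computed from the test image alone.

```python
from statistics import pvariance

def mse(a, b):
    """Full-reference: requires a ground-truth image b to compare against."""
    flat_a = [p for row in a for p in row]
    flat_b = [p for row in b for p in row]
    return sum((x - y) ** 2 for x, y in zip(flat_a, flat_b)) / len(flat_a)

def laplacian_variance(img):
    """No-reference: a crude sharpness score from the test image alone.
    High variance of the 4-neighbour Laplacian response = many strong edges."""
    h, w = len(img), len(img[0])
    responses = [
        4 * img[y][x] - img[y - 1][x] - img[y + 1][x] - img[y][x - 1] - img[y][x + 1]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)

# Tiny made-up grayscale "images": a sharp step edge vs a blurred one.
sharp   = [[0, 0, 0, 255, 255, 255] for _ in range(6)]
blurred = [[0, 42, 128, 128, 213, 255] for _ in range(6)]
```

The real no-reference metrics below are far more sophisticated, but they share this property: `laplacian_variance(sharp)` can be computed for a generated image with no "correct" version to diff against, which is exactly the situation when benchmarking generative models.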
The four metrics used here each capture a different aspect of quality:
- **MUSIQ** (Multi-scale Image Quality Transformer). Trained on human opinion scores from multiple IQA datasets. Unlike CNN-based methods, it handles variable-resolution inputs natively, capturing blur, noise, and compression artifacts at multiple scales without resizing.
- **NIQE** (Natural Image Quality Evaluator). A completely blind metric with no training on human opinion scores: it fits a multivariate Gaussian to natural image patches and measures the statistical distance of a test image from that model. Heavily penalizes over-sharpening and unusual textures.
- **NIMA** (Neural Image Assessment). Trained on the AVA dataset of ~255,000 images rated by photographers on a 1–10 aesthetic scale. Predicts the full score distribution rather than a single mean, capturing composition, color harmony, and overall photographic appeal.
- **TOPIQ** (Top-down Perceptual Image Quality). Uses a semantic-aware encoder (a CLIP backbone) to evaluate quality at three levels: low-level distortion, mid-level texture, and high-level semantic content fidelity. Achieves strong correlation with human judgments across both synthetic and authentic distortion datasets.
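As a concrete illustration of the "full score distribution" point in the NIMA description: the scalar NIMA score reported in tables is just the expectation of the predicted 1–10 distribution, and the spread is extra information a single number discards. A minimal sketch, with a made-up probability vector:

```python
import math

def nima_mean_std(probs):
    """Collapse a NIMA-style score distribution P(score=1)..P(score=10)
    into a mean aesthetic score and its standard deviation."""
    scores = range(1, 11)
    mu = sum(s * p for s, p in zip(scores, probs))
    var = sum(((s - mu) ** 2) * p for s, p in zip(scores, probs))
    return mu, math.sqrt(var)

# Hypothetical predicted distribution, peaked around 5-6 (for illustration only).
probs = [0.01, 0.02, 0.05, 0.15, 0.30, 0.25, 0.12, 0.06, 0.03, 0.01]
mean, std = nima_mean_std(probs)  # mean ≈ 5.49
```

A sharply peaked distribution and a flat one can share the same mean, so keeping the distribution distinguishes confidently-mediocre images from divisive ones; the table below necessarily reports only the mean.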
## Full results table
Each value is the mean over 192 images. **Bold** = best in column; *italics* = worst. Models are sorted by ImageBench V1 pass rate (capability score), not by the quality metrics.

| Model | Pass Rate ↑ | MUSIQ ↑ | NIQE ↓ | NIMA ↑ | TOPIQ ↑ |
|---|---|---|---|---|---|
| openai/gpt-image-2 | **95.8%** | **72.74** | 5.53 | 5.42 | **0.677** |
| fal/fal-ai/nano-banana-2 | 93.8% | 72.10 | **4.23** | 5.44 | 0.626 |
| bfl/flux-2-max | 78.6% | 70.07 | 5.13 | 5.56 | 0.641 |
| fal/fal-ai/nano-banana-pro | 78.6% | 71.19 | 4.39 | 5.38 | 0.602 |
| z-image-local/z-image-turbo | 75.5% | 70.80 | 5.12 | *5.25* | 0.630 |
| bfl/flux-2-klein-9b | 75.5% | 71.13 | 5.24 | 5.30 | 0.627 |
| bfl/flux-2-pro | 73.4% | 70.06 | 5.16 | 5.55 | 0.645 |
| nucleus-local/nucleus-image | 67.2% | *64.82* | *6.62* | 5.44 | *0.513* |
| bfl/flux-2-klein-4b | 63.5% | 69.05 | 5.35 | 5.26 | 0.587 |
| sana-local/sana-1.5-1.6b | *52.6%* | 70.77 | 5.79 | **5.87** | 0.619 |

All metrics computed over n=192 images per model. NIQE: lower is better; all others: higher is better.
## Caveats and preliminary takeaway
Honestly, these preliminary results don’t seem very useful yet, for two reasons. First, these metrics are meant to be run on image sets designed for image quality assessment; the 192 prompts in ImageBench V1 were built to probe capability (text rendering, spatial reasoning, truthfulness, and so on), not to stress-test perceptual quality. Running quality metrics on this corpus mixes the signal we want to measure with the signal the corpus was designed around. Second, before drawing any conclusions we’d need to verify how each metric correlates with human judgment on this particular distribution of generated images: published correlation coefficients come from natural-photo datasets and may not transfer.
With those caveats in mind, the preliminary observation is that three of the four metrics — MUSIQ, NIQE, and TOPIQ — broadly track the capability pass rate (nucleus-image at the bottom, nano-banana-2 and gpt-image-2 at the top). NIMA is the odd one out: it ranks Sana highest despite Sana having the lowest capability score. It would be interesting to dig into why NIMA diverges from the other three — whether it’s the AVA aesthetic training distribution, a sensitivity to stylistic choices, or something else specific to this set of images.
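One way to make "broadly tracks" concrete is a Spearman rank correlation between pass rate and each metric across the ten models. The sketch below hard-codes the table values from above and uses a small tie-aware ranking helper; it is illustrative only, not part of the benchmark harness.

```python
from statistics import mean

# Table values from above: (pass_rate, MUSIQ, NIQE, NIMA, TOPIQ) per model.
data = {
    "gpt-image-2":     (95.8, 72.74, 5.53, 5.42, 0.677),
    "nano-banana-2":   (93.8, 72.10, 4.23, 5.44, 0.626),
    "flux-2-max":      (78.6, 70.07, 5.13, 5.56, 0.641),
    "nano-banana-pro": (78.6, 71.19, 4.39, 5.38, 0.602),
    "z-image-turbo":   (75.5, 70.80, 5.12, 5.25, 0.630),
    "flux-2-klein-9b": (75.5, 71.13, 5.24, 5.30, 0.627),
    "flux-2-pro":      (73.4, 70.06, 5.16, 5.55, 0.645),
    "nucleus-image":   (67.2, 64.82, 6.62, 5.44, 0.513),
    "flux-2-klein-4b": (63.5, 69.05, 5.35, 5.26, 0.587),
    "sana-1.5-1.6b":   (52.6, 70.77, 5.79, 5.87, 0.619),
}

def ranks(xs):
    """Ascending 1-based ranks, with tied values assigned their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman rho = Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

cols = list(zip(*data.values()))
pass_rate = cols[0]
for name, col in zip(("MUSIQ", "NIQE", "NIMA", "TOPIQ"), cols[1:]):
    print(f"{name}: rho = {spearman(pass_rate, col):+.2f}")
```

On these numbers the signs line up with the qualitative reading: MUSIQ and TOPIQ correlate positively with pass rate, NIQE negatively (consistent with lower-is-better), and NIMA's correlation is close to zero.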