Automated metrics let you evaluate image generation models without human raters. They're fast, cheap, and reproducible — but each measures a narrow aspect of quality, and misusing them leads to misleading conclusions.
This guide covers the metrics that matter, what each actually measures, and when to use (or avoid) each one.
FID — Fréchet Inception Distance
FID is the most widely reported metric in image generation research. It measures how similar the distribution of generated images is to a distribution of real images.
How It Works
- Pass both generated and reference images through an Inception v3 network pretrained on ImageNet
- Extract activations from the penultimate layer (2048-dimensional feature vectors)
- Fit a multivariate Gaussian to each set of features
- Compute the Fréchet distance between the two Gaussians
The formula compares the means and covariances of the two distributions: FID = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^(1/2)). Lower distance means the generated distribution is closer to the real distribution.
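The steps above can be sketched directly from pre-extracted Inception features (assumed here to be arrays of shape (N, D), e.g. D = 2048 for Inception v3's penultimate layer); this is a minimal illustration, not a drop-in replacement for a reference implementation such as clean-fid:

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between two Gaussians:
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2*(S1 @ S2)^(1/2))."""
    diff = mu1 - mu2
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        # sqrtm can return tiny imaginary components from numerical noise
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

def fid_from_features(feats_real, feats_gen):
    """Fit a Gaussian to each (N, D) feature array and compare them."""
    mu_r, sigma_r = feats_real.mean(axis=0), np.cov(feats_real, rowvar=False)
    mu_g, sigma_g = feats_gen.mean(axis=0), np.cov(feats_gen, rowvar=False)
    return frechet_distance(mu_r, sigma_r, mu_g, sigma_g)
```

Identical feature sets give a distance of zero; shifting the generated distribution's mean increases it quadratically.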
Interpretation
- Lower is better. FID = 0 means the distributions are identical.
- State-of-the-art FID scores on COCO-30K: 6–15 for top models.
- FID below 10 is generally excellent. FID above 50 indicates significant quality gaps.
When to Use
FID is appropriate for comparing the overall distributional quality of a model's outputs against a reference dataset. It's most useful during model training and for high-level model comparison.
Limitations
- Sample size sensitivity. FID requires at least 10,000–50,000 images for stable estimates. At smaller sizes, variance is high and comparisons are unreliable. Chong & Forsyth (2020) showed FID estimates can shift by 10+ points with only 5,000 samples.
- Reference dataset dependency. FID scores are only meaningful relative to a specific reference set. FID on COCO ≠ FID on ImageNet.
- Ignores individual quality. A model producing 90% excellent images and 10% garbage can achieve the same FID as a model producing uniformly mediocre images.
- Inception v3 bias. The feature extractor was trained on ImageNet classification. It may miss visual features important for artistic or specialized domains.
- Only partially sensitive to mode dropping. A model that drops rare modes can still achieve a low FID as long as its remaining outputs cover most of the reference distribution's probability mass.
CLIP Score
CLIP Score measures text-image alignment — how well a generated image matches its text prompt.
How It Works
- Encode the text prompt using CLIP's text encoder
- Encode the generated image using CLIP's image encoder
- Compute the cosine similarity between the two embeddings
- Scale by 100 for readability (convention varies)
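Given embeddings already produced by CLIP's two encoders, the score itself is one line of linear algebra. A minimal sketch (note: the original CLIPScore paper uses w · max(cos, 0) with w = 2.5; the plain scaled cosine shown here is the other common convention):

```python
import numpy as np

def clip_score(text_emb, image_emb, scale=100.0):
    """Scaled cosine similarity between a CLIP text embedding and a
    CLIP image embedding (both assumed to be 1-D numpy arrays)."""
    t = text_emb / np.linalg.norm(text_emb)
    v = image_emb / np.linalg.norm(image_emb)
    return scale * float(t @ v)
```

An image embedding identical to the text embedding scores 100; orthogonal embeddings score 0.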
Interpretation
- Higher is better. A score of 100 would mean perfect alignment (never achieved in practice).
- Typical range for modern models: 25–35 on diverse prompt sets.
- Scores above 30 generally indicate strong prompt adherence.
When to Use
CLIP Score is the standard metric for prompt fidelity — verifying that a model generates what was asked for. It's fast (one forward pass per image) and doesn't need a reference dataset.
Limitations
- Quality-blind. CLIP Score doesn't distinguish between a beautiful rendering and an ugly one, as long as both depict the right content. A blurry photo of a cat scores similarly to a sharp one.
- Saturation at high quality. Beyond a certain quality threshold, CLIP Score stops differentiating. The gap between "good" and "great" is often invisible to CLIP.
- Compositional weakness. CLIP struggles with complex prompts involving multiple objects, spatial relationships, and attribute binding. "A blue sphere on top of a red cube" may score well even if colors are swapped.
- Gameable. Text rendered directly onto images can artificially inflate CLIP Score without actual visual correspondence.
LPIPS — Learned Perceptual Image Patch Similarity
LPIPS measures perceptual similarity between two specific images — how similar they look to a human observer.
How It Works
- Pass both images through a pretrained network (typically VGG or AlexNet)
- Extract features at multiple layers
- Normalize and compute weighted L2 distances between corresponding feature maps
- Sum across layers to produce a single distance score
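The per-layer computation can be sketched over pre-extracted feature maps. This toy version assumes features are given as dicts of (C, H, W) arrays with per-channel weights; real LPIPS uses learned linear weights on VGG/AlexNet activations:

```python
import numpy as np

def lpips_like(feats_a, feats_b, weights):
    """Toy LPIPS-style distance. feats_a/feats_b: dict layer -> (C, H, W)
    arrays; weights: dict layer -> (C,) per-channel weights."""
    total = 0.0
    for layer, fa in feats_a.items():
        fb = feats_b[layer]
        # unit-normalize each spatial position along the channel axis
        na = fa / (np.linalg.norm(fa, axis=0, keepdims=True) + 1e-10)
        nb = fb / (np.linalg.norm(fb, axis=0, keepdims=True) + 1e-10)
        diff2 = (na - nb) ** 2                           # (C, H, W)
        w = weights[layer][:, None, None]                # broadcast over space
        total += float((w * diff2).sum(axis=0).mean())   # avg space, sum layers
    return total
```

Identical feature maps produce a distance of zero, and the score grows with perceptual (feature-space) differences.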
Interpretation
- Lower is better (it's a distance metric). LPIPS = 0 means the images are perceptually identical.
- Typical range: 0.0–1.0. Below 0.1 indicates very high similarity.
When to Use
LPIPS requires a reference image to compare against. Common applications:
- Consistency measurement: generate the same prompt multiple times, compute pairwise LPIPS to quantify output variance
- Image-to-image evaluation: compare model output to an expected result
- Style transfer: measure how much structure is preserved
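The consistency-measurement use case above reduces to averaging a pairwise distance over all output pairs. A small sketch, with the actual LPIPS computation abstracted behind a caller-supplied `distance_fn` (an assumed wrapper, not implemented here):

```python
import itertools
import numpy as np

def pairwise_consistency(images, distance_fn):
    """Mean pairwise perceptual distance over several generations of the
    same prompt. Lower mean = more consistent outputs."""
    dists = [distance_fn(a, b) for a, b in itertools.combinations(images, 2)]
    return float(np.mean(dists))
```

For n generations this evaluates n(n−1)/2 pairs, so keep n modest when the distance function is a full network forward pass.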
Limitations
- Requires a reference. Not applicable to open-ended text-to-image evaluation.
- Network bias. The underlying VGG/AlexNet was trained on ImageNet, potentially missing domain-specific perceptual features.
SSIM and PSNR
These are pixel-level metrics from signal processing.
- SSIM (Structural Similarity Index) compares luminance, contrast, and structural patterns. Range: −1 to 1, higher is better.
- PSNR (Peak Signal-to-Noise Ratio) measures the ratio between the maximum possible pixel value and the mean squared error between the two images, expressed in decibels. Higher is better.
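PSNR in particular is simple enough to state in a few lines, assuming 8-bit images with a maximum pixel value of 255:

```python
import numpy as np

def psnr(a, b, max_val=255.0):
    """PSNR in dB: 10 * log10(MAX^2 / MSE). Infinite for identical images."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val**2 / mse)
```

A uniform per-pixel error of 1 (on the 0–255 scale) gives roughly 48 dB, which is why PSNR values above ~40 dB are usually considered near-lossless in compression work.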
When to Use
Almost never for text-to-image evaluation. They're appropriate for:
- Super-resolution (comparing upscaled vs high-res reference)
- Compression quality assessment
- Image restoration tasks
Why They're Limited
Pixel-level metrics don't correlate well with human perception. Two images that look identical to humans can have wildly different PSNR scores, and two images with similar PSNR can look very different. For generated images — where there's no pixel-level ground truth — they're essentially meaningless.
VQAScore
VQAScore reframes image evaluation as visual question answering.
How It Works
- Convert the text prompt into a yes/no question: "A red car parked in front of a blue house" → "Does this image show a red car parked in front of a blue house?"
- Pass the image and question to a VQA model (typically BLIP-2 or a similar vision-language model)
- The model's confidence in answering "yes" becomes the score
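The pipeline's two non-model steps can be sketched directly: a naive prompt-to-question template (real implementations vary, and as noted below the template choice affects scores) and a softmax over the VQA model's yes/no answer logits, which are assumed to come from an external model:

```python
import numpy as np

def to_question(prompt):
    """Naive template turning a prompt into a yes/no question."""
    return f"Does this image show {prompt[0].lower() + prompt[1:]}?"

def yes_probability(yes_logit, no_logit):
    """Softmax over the model's 'yes'/'no' logits; P(yes) is the score."""
    z = np.array([yes_logit, no_logit], dtype=np.float64)
    e = np.exp(z - z.max())
    return float(e[0] / e.sum())
```

Equal logits give a score of 0.5 (the model is maximally uncertain); a strongly positive yes logit pushes the score toward 1.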
Interpretation
- Higher is better. Range: 0–1 (probability).
- Correlates better with human judgment than CLIP Score on compositional prompts.
When to Use
VQAScore excels at compositional evaluation — prompts with multiple objects, attributes, and spatial relationships. Where CLIP Score saturates or fails on complex prompts, VQAScore provides more granular signal.
Lin et al. (2024) showed VQAScore outperforms CLIP Score, BLIP Score, and other metrics on 8 out of 9 text-to-image benchmarks for prompt fidelity evaluation.
Limitations
- Slower. Running a large VQA model makes it 5–10× slower than CLIP Score.
- Question formulation matters. How you convert the prompt to a question affects the score.
- VQA model biases. The underlying model has its own failure modes that transfer to the evaluation.
ImageReward and HPS
These are human preference predictors — models trained to predict which images humans would prefer.
ImageReward
Trained on 137K expert comparisons of text-image pairs. Takes a (text, image) pair and outputs a scalar reward score predicting human preference. Built on BLIP architecture.
HPS (Human Preference Score)
Trained on the Human Preference Dataset v2 (HPD v2) with 798K preference choices. Uses CLIP as a backbone with a preference prediction head.
When to Use
Both are useful for ranking model outputs when you want a holistic quality estimate approximating human judgment:
- RLHF for image generation models
- Automatic best-of-N selection
- Quality screening before human review
Limitations
- Training distribution bias. Both were trained on specific model outputs and may not generalize well to new architectures or styles.
- Aesthetic bias. They tend to favor vibrant, high-contrast images over subtle or muted aesthetics.
- Not compositional. Like CLIP, they struggle with complex multi-object prompts.
Decision Matrix: When to Use Which
| What you're measuring | Primary metric | Fallback |
|---|---|---|
| Overall distributional quality | FID | IS (Inception Score) |
| Prompt adherence (simple prompts) | CLIP Score | — |
| Prompt adherence (complex prompts) | VQAScore | T2I-CompBench metrics |
| Perceptual similarity to reference | LPIPS | SSIM |
| Human preference prediction | ImageReward | HPS v2 |
| Aesthetic quality | HPS v2 | ImageReward |
Use multiple metrics. No single number tells the full story.
Common Pitfalls
Small-Sample FID
Computing FID on fewer than 10,000 images produces unstable estimates. A common mistake in papers and blog posts: reporting FID on 1,000 images and comparing models whose scores differ by 2–3 points. At that sample size, the noise exceeds the signal.
CLIP Score Saturation
Most modern models score between 28 and 33 on CLIP Score for general prompt sets. Reporting a 0.5-point improvement as meaningful is misleading — it's within the noise floor for prompt-level evaluation.
Metric Gaming
Optimizing directly for a metric often degrades actual quality. Models fine-tuned to maximize CLIP Score learn to embed text-like patterns in images. Models optimized for FID can sacrifice diversity for distributional match. Always validate automated metrics with human evaluation on a representative sample.
Cherry-Picking Reference Sets
FID depends entirely on the reference dataset. Choosing a reference set that happens to match your model's strengths produces misleadingly low FID. Always use standardized reference sets and report which one.
Ignoring Confidence Intervals
Metrics are point estimates. Report confidence intervals (via bootstrap or repeated sampling) to distinguish real differences from noise. A 2-point FID difference without error bars is not a result — it's a guess.
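For per-image metrics (CLIP Score, VQAScore, preference scores), a percentile bootstrap over the per-image values is a minimal way to get error bars:

```python
import numpy as np

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-image metric scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=np.float64)
    means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_resamples)
    ])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (float(lo), float(hi))
```

Note this applies to metrics that are averages of per-image scores. FID is a function of the whole sample, not a per-image mean, so its variability must instead be estimated by recomputing it over repeated subsamples or independent generation runs.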