Automated metrics let you evaluate image generation models without human raters. They're fast, cheap, and reproducible — but each measures a narrow aspect of quality, and misusing them leads to misleading conclusions.
This guide covers the metrics that matter, what each actually measures, and when to use (or avoid) each one.
FID — Fréchet Inception Distance
FID is the most widely reported metric in image generation research. It measures how similar the distribution of generated images is to a distribution of real images.
How It Works
- Pass both generated and reference images through an Inception v3 network pretrained on ImageNet
- Extract activations from the penultimate layer (2048-dimensional feature vectors)
- Fit a multivariate Gaussian to each set of features
- Compute the Fréchet distance between the two Gaussians
The formula compares the means and covariances of the two distributions: FID = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^(1/2)). Lower distance means the generated distribution is closer to the real distribution.
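The steps above can be sketched directly from pre-extracted Inception features (assumed here to be arrays of shape (N, D), e.g. D = 2048 for Inception v3's penultimate layer); this is a minimal illustration, not a drop-in replacement for a reference implementation such as clean-fid:

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between two Gaussians:
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2*(S1 @ S2)^(1/2))."""
    diff = mu1 - mu2
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        # sqrtm can return tiny imaginary components from numerical noise
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

def fid_from_features(feats_real, feats_gen):
    """Fit a Gaussian to each (N, D) feature array and compare them."""
    mu_r, sigma_r = feats_real.mean(axis=0), np.cov(feats_real, rowvar=False)
    mu_g, sigma_g = feats_gen.mean(axis=0), np.cov(feats_gen, rowvar=False)
    return frechet_distance(mu_r, sigma_r, mu_g, sigma_g)
```

Identical feature sets give a distance of zero; shifting the generated distribution's mean increases it quadratically.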
Interpretation
- Lower is better. FID = 0 means the distributions are identical.
- State-of-the-art FID scores on COCO-30K: 6–15 for top models.
- FID below 10 is generally excellent. FID above 50 indicates significant quality gaps.
When to Use
FID is appropriate for comparing the overall distributional quality of a model's outputs against a reference dataset. It's most useful during model training and for high-level model comparison.
Limitations
- Sample size sensitivity. FID requires at least 10,000–50,000 images for stable estimates. At smaller sizes, variance is high and comparisons are unreliable. Chong & Forsyth (2020) showed FID estimates can shift by 10+ points with only 5,000 samples.
- Reference dataset dependency. FID scores are only meaningful relative to a specific reference set. FID on COCO ≠ FID on ImageNet.
- Ignores individual quality. A model producing 90% excellent images and 10% garbage can achieve the same FID as a model producing uniformly mediocre images.
- Inception v3 bias. The feature extractor was trained on ImageNet classification. It may miss visual features important for artistic or specialized domains.
- Only partially sensitive to mode dropping. A model that drops rare modes can still achieve a low FID as long as its remaining outputs cover most of the reference distribution's probability mass.
CLIP Score
CLIP Score measures text-image alignment — how well a generated image matches its text prompt.
How It Works
- Encode the text prompt using CLIP's text encoder
- Encode the generated image using CLIP's image encoder
- Compute the cosine similarity between the two embeddings
- Scale by 100 for readability (convention varies)
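Given embeddings already produced by CLIP's two encoders, the score itself is one line of linear algebra. A minimal sketch (note: the original CLIPScore paper uses w · max(cos, 0) with w = 2.5; the plain scaled cosine shown here is the other common convention):

```python
import numpy as np

def clip_score(text_emb, image_emb, scale=100.0):
    """Scaled cosine similarity between a CLIP text embedding and a
    CLIP image embedding (both assumed to be 1-D numpy arrays)."""
    t = text_emb / np.linalg.norm(text_emb)
    v = image_emb / np.linalg.norm(image_emb)
    return scale * float(t @ v)
```

An image embedding identical to the text embedding scores 100; orthogonal embeddings score 0.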
Interpretation
- Higher is better. A score of 100 would mean perfect alignment (never achieved in practice).
- Typical range for modern models: 25–35 on diverse prompt sets.
- Scores above 30 generally indicate strong prompt adherence.
When to Use
CLIP Score is the standard metric for prompt fidelity — verifying that a model generates what was asked for. It's fast (one forward pass per image) and doesn't need a reference dataset.
Limitations
- Quality-blind. CLIP Score doesn't distinguish between a beautiful rendering and an ugly one, as long as both depict the right content. A blurry photo of a cat scores similarly to a sharp one.
- Saturation at high quality. Beyond a certain quality threshold, CLIP Score stops differentiating. The gap between "good" and "great" is often invisible to CLIP.
- Compositional weakness. CLIP struggles with complex prompts involving multiple objects, spatial relationships, and attribute binding. "A blue sphere on top of a red cube" may score well even if colors are swapped.
- Gameable. Text rendered directly onto images can artificially inflate CLIP Score without actual visual correspondence.
LPIPS — Learned Perceptual Image Patch Similarity
LPIPS measures perceptual similarity between two specific images — how similar they look to a human observer.
How It Works
- Pass both images through a pretrained network (typically VGG or AlexNet)
- Extract features at multiple layers
- Normalize and compute weighted L2 distances between corresponding feature maps
- Sum across layers to produce a single distance score
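The per-layer computation can be sketched over pre-extracted feature maps. This toy version assumes features are given as dicts of (C, H, W) arrays with per-channel weights; real LPIPS uses learned linear weights on VGG/AlexNet activations:

```python
import numpy as np

def lpips_like(feats_a, feats_b, weights):
    """Toy LPIPS-style distance. feats_a/feats_b: dict layer -> (C, H, W)
    arrays; weights: dict layer -> (C,) per-channel weights."""
    total = 0.0
    for layer, fa in feats_a.items():
        fb = feats_b[layer]
        # unit-normalize each spatial position along the channel axis
        na = fa / (np.linalg.norm(fa, axis=0, keepdims=True) + 1e-10)
        nb = fb / (np.linalg.norm(fb, axis=0, keepdims=True) + 1e-10)
        diff2 = (na - nb) ** 2                           # (C, H, W)
        w = weights[layer][:, None, None]                # broadcast over space
        total += float((w * diff2).sum(axis=0).mean())   # avg space, sum layers
    return total
```

Identical feature maps produce a distance of zero, and the score grows with perceptual (feature-space) differences.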
Interpretation
- Lower is better (it's a distance metric). LPIPS = 0 means the images are perceptually identical.
- Typical range: 0.0–1.0. Below 0.1 indicates very high similarity.
When to Use
LPIPS requires a reference image to compare against. Common applications:
- Consistency measurement: generate the same prompt multiple times, compute pairwise LPIPS to quantify output variance
- Image-to-image evaluation: compare model output to an expected result
- Style transfer: measure how much structure is preserved
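The consistency-measurement use case above reduces to averaging a pairwise distance over all output pairs. A small sketch, with the actual LPIPS computation abstracted behind a caller-supplied `distance_fn` (an assumed wrapper, not implemented here):

```python
import itertools
import numpy as np

def pairwise_consistency(images, distance_fn):
    """Mean pairwise perceptual distance over several generations of the
    same prompt. Lower mean = more consistent outputs."""
    dists = [distance_fn(a, b) for a, b in itertools.combinations(images, 2)]
    return float(np.mean(dists))
```

For n generations this evaluates n(n−1)/2 pairs, so keep n modest when the distance function is a full network forward pass.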
Limitations
- Requires a reference. Not applicable to open-ended text-to-image evaluation.
- Network bias. The underlying VGG/AlexNet was trained on ImageNet, potentially missing domain-specific perceptual features.
SSIM and PSNR
These are pixel-level metrics from signal processing.
- SSIM (Structural Similarity Index) compares luminance, contrast, and structural patterns. Range: −1 to 1, higher is better.
- PSNR (Peak Signal-to-Noise Ratio) measures the ratio between the maximum possible pixel value and the mean squared error between the two images, expressed in decibels. Higher is better.
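PSNR in particular is simple enough to state in a few lines, assuming 8-bit images with a maximum pixel value of 255:

```python
import numpy as np

def psnr(a, b, max_val=255.0):
    """PSNR in dB: 10 * log10(MAX^2 / MSE). Infinite for identical images."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val**2 / mse)
```

A uniform per-pixel error of 1 (on the 0–255 scale) gives roughly 48 dB, which is why PSNR values above ~40 dB are usually considered near-lossless in compression work.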
When to Use
Almost never for text-to-image evaluation. They're appropriate for:
- Super-resolution (comparing upscaled vs high-res reference)
- Compression quality assessment
- Image restoration tasks
Why They're Limited
Pixel-level metrics don't correlate well with human perception. Two images that look identical to humans can have wildly different PSNR scores, and two images with similar PSNR can look very different. For generated images — where there's no pixel-level ground truth — they're essentially meaningless.
VQAScore
VQAScore reframes image evaluation as visual question answering.
How It Works
- Convert the text prompt into a yes/no question: "A red car parked in front of a blue house" → "Does this image show a red car parked in front of a blue house?"
- Pass the image and question to a VQA model (typically BLIP-2 or a similar vision-language model)
- The model's confidence in answering "yes" becomes the score
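The pipeline's two non-model steps can be sketched directly: a naive prompt-to-question template (real implementations vary, and as noted below the template choice affects scores) and a softmax over the VQA model's yes/no answer logits, which are assumed to come from an external model:

```python
import numpy as np

def to_question(prompt):
    """Naive template turning a prompt into a yes/no question."""
    return f"Does this image show {prompt[0].lower() + prompt[1:]}?"

def yes_probability(yes_logit, no_logit):
    """Softmax over the model's 'yes'/'no' logits; P(yes) is the score."""
    z = np.array([yes_logit, no_logit], dtype=np.float64)
    e = np.exp(z - z.max())
    return float(e[0] / e.sum())
```

Equal logits give a score of 0.5 (the model is maximally uncertain); a strongly positive yes logit pushes the score toward 1.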
Interpretation
- Higher is better. Range: 0–1 (probability).
- Correlates better with human judgment than CLIP Score on compositional prompts.
When to Use
VQAScore excels at compositional evaluation — prompts with multiple objects, attributes, and spatial relationships. Where CLIP Score saturates or fails on complex prompts, VQAScore provides more granular signal.
Lin et al. (2024) showed VQAScore outperforms CLIP Score, BLIP Score, and other metrics on 8 out of 9 text-to-image benchmarks for prompt fidelity evaluation.
Limitations
- Slower. Running a large VQA model makes it 5–10× slower than CLIP Score.
- Question formulation matters. How you convert the prompt to a question affects the score.
- VQA model biases. The underlying model has its own failure modes that transfer to the evaluation.
ImageReward and HPS
These are human preference predictors — models trained to predict which images humans would prefer.
ImageReward
Trained on 137K expert comparisons of text-image pairs. Takes a (text, image) pair and outputs a scalar reward score predicting human preference. Built on BLIP architecture.
HPS (Human Preference Score)
Trained on the Human Preference Dataset v2 (HPD v2) with 798K preference choices. Uses CLIP as a backbone with a preference prediction head.
When to Use
Both are useful for ranking model outputs when you want a holistic quality estimate approximating human judgment:
- RLHF for image generation models
- Automatic best-of-N selection
- Quality screening before human review
Limitations
- Training distribution bias. Both were trained on specific model outputs and may not generalize well to new architectures or styles.
- Aesthetic bias. They tend to favor vibrant, high-contrast images over subtle or muted aesthetics.
- Not compositional. Like CLIP, they struggle with complex multi-object prompts.
Decision Matrix: When to Use Which
| What you're measuring | Primary metric | Fallback |
|---|---|---|
| Overall distributional quality | FID | IS (Inception Score) |
| Prompt adherence (simple prompts) | CLIP Score | — |
| Prompt adherence (complex prompts) | VQAScore | T2I-CompBench metrics |
| Perceptual similarity to reference | LPIPS | SSIM |
| Human preference prediction | ImageReward | HPS v2 |
| Aesthetic quality | HPS v2 | ImageReward |
Use multiple metrics. No single number tells the full story.
Common Pitfalls
Small-Sample FID
Computing FID on fewer than 10,000 images produces unstable estimates. A common mistake in papers and blog posts: reporting FID on 1,000 images and comparing models whose scores differ by 2–3 points. At that sample size, the noise exceeds the signal.
CLIP Score Saturation
Most modern models score between 28 and 33 on CLIP Score for general prompt sets. Reporting a 0.5-point improvement as meaningful is misleading — it's within the noise floor for prompt-level evaluation.
Metric Gaming
Optimizing directly for a metric often degrades actual quality. Models fine-tuned to maximize CLIP Score learn to embed text-like patterns in images. Models optimized for FID can sacrifice diversity for distributional match. Always validate automated metrics with human evaluation on a representative sample.
Cherry-Picking Reference Sets
FID depends entirely on the reference dataset. Choosing a reference set that happens to match your model's strengths produces misleadingly low FID. Always use standardized reference sets and report which one.
Ignoring Confidence Intervals
Metrics are point estimates. Report confidence intervals (via bootstrap or repeated sampling) to distinguish real differences from noise. A 2-point FID difference without error bars is not a result — it's a guess.
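For per-image metrics (CLIP Score, VQAScore, preference scores), a percentile bootstrap over the per-image values is a minimal way to get error bars:

```python
import numpy as np

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-image metric scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=np.float64)
    means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_resamples)
    ])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (float(lo), float(hi))
```

Note this applies to metrics that are averages of per-image scores. FID is a function of the whole sample, not a per-image mean, so its variability must instead be estimated by recomputing it over repeated subsamples or independent generation runs.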