Adversarial prompts — Carefully designed prompts intended to expose model weaknesses, such as attribute binding failures, spatial reasoning errors, or counting mistakes.
Arena ranking — A ranking system derived from pairwise comparisons where models compete in head-to-head matchups, commonly using Elo ratings or Bradley-Terry models to aggregate preferences.
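Bradley-Terry strengths can be fit from a win-count matrix with a few minorization-maximization updates; a minimal sketch (the toy `wins` matrix and function name are illustrative, not from any particular library):

```python
def fit_bradley_terry(wins, n_items, iters=200):
    """Fit Bradley-Terry strengths p from a win-count matrix with
    minorization-maximization updates; P(i beats j) = p[i] / (p[i] + p[j])."""
    p = [1.0] * n_items
    for _ in range(iters):
        new_p = []
        for i in range(n_items):
            total_wins = sum(wins[i][j] for j in range(n_items) if j != i)
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n_items) if j != i)
            new_p.append(total_wins / denom if denom else p[i])
        scale = n_items / sum(new_p)      # strengths are only identified
        p = [x * scale for x in new_p]    # up to a constant factor
    return p

# Two models: model 0 beat model 1 three times and lost once.
strengths = fit_bradley_terry([[0, 3], [1, 0]], n_items=2)
```

The fitted ratio `strengths[0] / (strengths[0] + strengths[1])` recovers the empirical win rate of 0.75 in this toy case.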
CFG (Classifier-Free Guidance) — A technique that steers diffusion models toward stronger text alignment by combining conditional and unconditional noise predictions, extrapolating from the unconditional prediction toward the conditional one by a guidance scale parameter.
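The guidance combination itself is one line; a toy sketch on plain lists (real samplers apply this to full noise-prediction tensors at every denoising step):

```python
def cfg_combine(eps_cond, eps_uncond, scale):
    """Classifier-free guidance on toy 1-D 'noise predictions':
    scale = 1 recovers the plain conditional prediction; larger
    scales extrapolate further in the text-conditioned direction."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]
```

At a typical scale like 7.5, the difference between conditional and unconditional predictions is amplified well beyond the conditional prediction itself.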
CLIP — Contrastive Language-Image Pre-training, a model trained to align text and image embeddings in a shared latent space, widely used for zero-shot classification and text-image similarity.
CLIP Score — A metric measuring text-image alignment by computing cosine similarity between CLIP embeddings of a prompt and generated image; higher scores indicate stronger prompt adherence.
Compositionality — The ability of a model to correctly combine multiple concepts, attributes, and relationships in a single image, such as binding colors to specific objects or understanding spatial arrangements.
ControlNet — An architecture that adds spatial conditioning controls (e.g., edge maps, depth maps, pose skeletons) to diffusion models, enabling precise structural guidance during generation.
DDPM (Denoising Diffusion Probabilistic Models) — A foundational class of generative models that learn to reverse a gradual noising process, generating images by iteratively denoising pure noise.
Deterministic — A process that produces the same output given identical inputs; in image generation, using the same seed and parameters yields identical results.
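A minimal stdlib illustration of the seed-determinism contract (real pipelines must also seed framework and GPU generators, e.g. via `torch.manual_seed`):

```python
import random

def fake_initial_noise(seed, steps=4):
    """Draw toy 'initial noise' from a seeded, isolated RNG: the same
    seed always reproduces the same values, independent of global state."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(steps)]
```

Using an isolated `random.Random(seed)` instance rather than the module-level RNG keeps reproducibility intact even if other code draws random numbers in between.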
Diffusion — A class of generative models that produce images by gradually removing noise from random samples, guided by learned denoising steps conditioned on text or other inputs.
DrawBench — A benchmark containing 200 prompts designed to test compositionality, spatial reasoning, attribute binding, and other challenging aspects of text-to-image generation.
Elo — A rating system originally developed for chess (named after Arpad Elo; not an acronym, though often written "ELO") that ranks models or images based on pairwise comparison outcomes, commonly used in preference-based evaluation.
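The update behind a single matchup is compact; a sketch using the common K-factor of 32 (both the K-factor and the 400-point scale are conventions, not fixed by the method):

```python
def elo_update(rating_a, rating_b, score_a, k=32.0):
    """One Elo update after a matchup; score_a is 1 for a win,
    0.5 for a draw, 0 for a loss."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
```

Two equally rated models exchange exactly K/2 points on a decisive result, and nothing on a draw.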
EU AI Act — European Union legislation regulating high-risk AI systems, including requirements for transparency, safety, and human oversight that affect deployment of generative models.
FID (Fréchet Inception Distance) — A distributional metric comparing generated and real image sets by measuring the distance between Gaussian fits of Inception v3 features; lower is better.
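In one dimension the Fréchet distance between the two Gaussian fits reduces to a closed form, which makes the structure of FID easy to see; a stdlib sketch (real FID applies the matrix version to 2048-dimensional Inception-v3 features):

```python
import math
from statistics import mean, pvariance

def fid_1d(real, gen):
    """Fréchet distance between 1-D Gaussian fits of two samples:
    (mu_r - mu_g)^2 + var_r + var_g - 2 * sqrt(var_r * var_g)."""
    mu_r, mu_g = mean(real), mean(gen)
    var_r, var_g = pvariance(real), pvariance(gen)
    return (mu_r - mu_g) ** 2 + var_r + var_g - 2.0 * math.sqrt(var_r * var_g)
```

Identical distributions score zero; a pure mean shift contributes its squared distance, and mismatched spreads add a penalty even when the means agree.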
GenEval — A benchmark evaluating compositional generation across multiple dimensions including object count, attribute binding, spatial relationships, and complex prompts.
Guardrails — Safety mechanisms that filter or block inappropriate outputs, enforce content policies, or prevent misuse of generative models.
HPS (Human Preference Score) — A learned metric trained on 798K human preferences that predicts which images humans would prefer given a text prompt.
Human evaluation — Assessment of generated images by human raters, considered the gold standard but expensive and requiring careful design to minimize bias.
ImageReward — A reward model trained on 137K expert comparisons to predict human preference for text-image pairs, built on the BLIP architecture.
Inference — The process of generating outputs from a trained model; in diffusion models, this involves iterative denoising from random noise to a final image.
Inter-annotator agreement — The degree to which multiple human raters assign consistent scores or preferences to the same images, measured by metrics like Krippendorff's alpha or Fleiss' kappa.
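Cohen's kappa, the two-rater special case, shows the chance-correction idea these metrics share; a sketch over toy preference labels:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement for two raters; Fleiss' kappa
    generalizes the same correction to many raters."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a)
    return (observed - expected) / (1.0 - expected)
```

A kappa of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement.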
Latent diffusion — Diffusion models that operate in a compressed latent space rather than pixel space, enabling faster generation and lower memory usage while maintaining quality.
Latent space — A compressed, learned representation space where diffusion or other generative models perform computations before decoding to pixel space.
Likert scale — A rating system where annotators assign discrete scores (e.g., 1-5) to quality dimensions; it requires careful anchor definitions and is prone to subjective variation across raters.
LoRA (Low-Rank Adaptation) — A parameter-efficient fine-tuning method that adapts pretrained models by training small low-rank weight updates, commonly used for style or subject customization.
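The core of LoRA is adding a trainable low-rank product to a frozen weight; a toy forward pass on plain lists (shapes are illustrative, and the `scale` argument stands in for the usual alpha/r factor):

```python
def matvec(M, v):
    # Plain-Python matrix-vector product.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, scale=1.0):
    """y = W @ x + scale * B @ (A @ x): frozen base weight W (d x k)
    plus a trained rank-r update, with A (r x k) and B (d x r), small r."""
    return [base + scale * delta
            for base, delta in zip(matvec(W, x), matvec(B, matvec(A, x)))]
```

Only A and B are trained, so a rank-8 adapter for a 4096x4096 weight stores roughly 65K parameters instead of 16M.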
LPIPS (Learned Perceptual Image Patch Similarity) — A perceptual distance metric between two images computed using deep network features; lower values indicate greater perceptual similarity.
Mode collapse — A failure mode where a generative model produces limited diversity, repeatedly generating similar outputs instead of covering the full distribution.
NSFW (Not Safe For Work) — Content containing nudity, violence, or other material inappropriate for general audiences; typically filtered by safety guardrails in production systems.
Pairwise preference — An evaluation method where annotators choose the better of two images rather than assigning absolute scores, reducing bias and improving consistency.
Pareto frontier — The set of models or configurations where no alternative is strictly better across all dimensions; represents optimal trade-offs between competing objectives like quality, speed, and cost.
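Extracting the frontier from a set of scored configurations is a direct dominance check; a sketch where every axis is higher-is-better (invert costs like latency before calling it):

```python
def pareto_frontier(points):
    """Return the points not strictly dominated by any other point.
    Each point is a tuple of scores where higher is better on every axis."""
    def dominates(p, q):
        return (all(a >= b for a, b in zip(p, q))
                and any(a > b for a, b in zip(p, q)))
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Toy (quality, images-per-second) scores for three model configurations.
frontier = pareto_frontier([(0.9, 10.0), (0.8, 50.0), (0.7, 5.0)])
```

The third configuration is dominated (worse on both axes than the first), while the first two trade quality against speed and both survive.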
PartiPrompts — A benchmark of 1,600+ prompts organized by category and challenge level, testing various aspects of text-to-image generation including complex scenes and abstract concepts.
Prompt adherence — The degree to which a generated image accurately reflects the content, attributes, and relationships specified in the text prompt.
PSNR (Peak Signal-to-Noise Ratio) — A pixel-level metric measuring reconstruction quality in decibels; rarely useful for text-to-image evaluation due to poor correlation with perceptual quality.
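The formula is simple enough to compute directly; a sketch over flat 8-bit pixel sequences:

```python
import math

def psnr(img_a, img_b, max_val=255.0):
    """Peak signal-to-noise ratio in decibels: 10 * log10(MAX^2 / MSE)."""
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)
```

Identical images give infinite PSNR and maximally different pixels give 0 dB, which illustrates why it measures reconstruction fidelity rather than generative quality.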
Red teaming — Systematic adversarial testing to identify safety failures, bias amplification, harmful outputs, or policy violations in generative models before deployment.
Seed — A random number initializing the noise distribution in stochastic generation; fixing the seed enables reproducible outputs.
SSIM (Structural Similarity Index) — A metric comparing luminance, contrast, and structure between two images; like PSNR, it's pixel-focused and poorly suited for generative model evaluation.
Stochastic — A process involving randomness that produces different outputs across runs; in image generation, different seeds yield different results even with identical prompts.
T2I-CompBench — Text-to-Image Compositional Benchmark, a suite evaluating attribute binding, object relationships, and complex compositions through automated and human metrics.
VLM (Vision-Language Model) — A model jointly trained on images and text that can perform tasks requiring understanding of both modalities, such as image captioning or visual question answering.
VQAScore — A metric that reframes alignment evaluation as visual question answering: a VQA model scores the likelihood that the image answers "yes" to whether it depicts the prompt; strong for compositional evaluation.