ImageBench

RealBench V1 Methodology

RealBench is a visual Turing test for image models. Instead of grading capabilities, it asks a simpler question: can a model's output pass as a real photograph to a human viewer? The score is built from real votes collected in a quick on-site game where players guess whether each image is a real photo or AI-generated.

How a score is built

  1. Paired images. Each round pairs a real photograph from a curated dataset (Unsplash / Pexels / Flickr30k) with an AI image generated from the same caption, so the two are semantically equivalent rather than random art set against a random photo.
  2. Human votes. Players see one image at a time and guess Real Photo or AI Generated. Every vote is recorded against the AI image's source model.
  3. Realism score. A model's realism score is the share of votes on its AI images that judged them to be real photographs. Higher means it looked more like a real photo to the people voting.
  4. Thresholds & snapshots. Models need at least 10 votes to appear, so early noise does not dominate. The leaderboard is a periodic snapshot; the snapshot date is shown on the hub.
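
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not RealBench's actual code: the `Vote` record, the `tally` and `realism_scores` helpers, and the model names are all hypothetical. Only the 10-vote minimum and the definition of the score come from the methodology.

```python
from collections import defaultdict
from dataclasses import dataclass

MIN_VOTES = 10  # minimum votes a model needs to appear (from the methodology)


@dataclass(frozen=True)
class Vote:
    """One player guess on an AI image (hypothetical record layout)."""
    model: str          # source model of the AI image that was shown
    guessed_real: bool  # True if the player chose "Real Photo"


def tally(votes):
    """Return {model: (real_votes, total_votes)} over all recorded votes."""
    counts = defaultdict(lambda: [0, 0])
    for vote in votes:
        counts[vote.model][0] += int(vote.guessed_real)
        counts[vote.model][1] += 1
    return {model: (real, total) for model, (real, total) in counts.items()}


def realism_scores(votes, min_votes=MIN_VOTES):
    """Share of "real" guesses per model, dropping models below the threshold."""
    return {
        model: real / total
        for model, (real, total) in tally(votes).items()
        if total >= min_votes
    }


# Made-up votes: model-a has 12 votes (6 said "real"), model-b has only 3.
example_votes = (
    [Vote("model-a", True)] * 6
    + [Vote("model-a", False)] * 6
    + [Vote("model-b", True)] * 3
)
print(realism_scores(example_votes))  # {'model-a': 0.5}
```

In this made-up example, model-b is left out even though every one of its votes said "real": three votes fall below the 10-vote minimum, which is the early noise the threshold is meant to filter.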

Scoring

The realism score is the share of votes on a model's AI images that mistook them for real photographs:

realism = votes that said “real” / total votes

A higher realism score means the model fooled more people. Models are ranked most-photoreal first on the leaderboard.
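
As a worked example of the formula and the ranking, the sketch below scores three models from per-model vote counts and sorts them most-photoreal first. The function names, model names, and vote counts are invented for illustration. How RealBench breaks ties is not stated, so the tie behaviour here (input order is kept) is an assumption.

```python
def realism(real_votes: int, total_votes: int) -> float:
    """realism = votes that said "real" / total votes."""
    if total_votes <= 0:
        raise ValueError("total_votes must be positive")
    return real_votes / total_votes


def rank(counts: dict[str, tuple[int, int]]) -> list[tuple[str, float]]:
    """Return (model, realism) pairs sorted most-photoreal first.

    sorted() is stable, so models with equal scores keep their input order.
    That tie rule is an assumption, not part of the methodology.
    """
    scored = [(model, realism(real, total)) for model, (real, total) in counts.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical counts: (votes that said "real", total votes)
counts = {
    "model-a": (30, 120),  # 30 / 120 = 0.25
    "model-b": (45, 100),  # 45 / 100 = 0.45
    "model-c": (12, 40),   # 12 / 40  = 0.30
}

for model, score in rank(counts):
    print(f"{model}: {score:.2f}")
```

With these counts, model-b ranks first at 0.45, followed by model-c at 0.30 and model-a at 0.25. Note that model-a has the most votes but the lowest score: the ranking depends on the share of "real" votes, not on volume.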

Limitations

  • Votes are crowd-sourced from people who choose to play a game, not from a controlled lab panel. Volume and the 10-vote minimum reduce noise but do not remove sampling bias.
  • Realism is not quality. A model can look convincingly real while ignoring the prompt; capability, including how well a model follows the prompt, is what Benchmark V1 measures instead.
  • Scores depend on the image subjects in the pool (landscapes, people, products, and more). A model strong on one subject may score differently as the pool grows.
  • The leaderboard is a snapshot, refreshed periodically rather than live.
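
To get a feel for how much noise remains at low vote counts, one option is to put a confidence interval around a realism score. The sketch below uses the standard Wilson score interval for a proportion. RealBench does not say it reports intervals, so this is purely an illustration, and the function name and the example counts are assumptions. The interval only reflects random vote-to-vote variation; it says nothing about the sampling bias noted above.

```python
import math


def wilson_interval(real_votes: int, total_votes: int, z: float = 1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if total_votes <= 0:
        raise ValueError("total_votes must be positive")
    p = real_votes / total_votes
    z2 = z * z
    denom = 1 + z2 / total_votes
    center = (p + z2 / (2 * total_votes)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total_votes + z2 / (4 * total_votes**2)) / denom
    )
    return center - half_width, center + half_width


# Same observed score (0.50) at two hypothetical vote volumes.
for real, total in [(5, 10), (500, 1000)]:
    low, high = wilson_interval(real, total)
    print(f"{real}/{total}: realism 0.50, interval {low:.2f} to {high:.2f}")
```

At the 10-vote minimum, a score of 0.50 is compatible with a wide range of underlying values (roughly a quarter to three quarters). At 1,000 votes the range narrows to a few points either side, which is why volume matters when comparing models with close scores.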