RealBench V1 Methodology
RealBench is a visual Turing test for image models. Instead of grading capabilities, it asks a simpler question: can a model's output pass as a real photograph to a human? The score comes from real votes collected in a quick on-site game where players guess whether each image is a real photo or AI-generated.
How a score is built
- Paired images. Each round pairs a real photograph from a curated dataset (Unsplash / Pexels / Flickr30k) with an AI image generated from the same caption, so the two images are semantically equivalent rather than random art set against an unrelated photo.
- Human votes. Players see one image at a time and guess Real Photo or AI Generated. Every vote is recorded against the AI image's source model.
- Realism score. A model's realism score is the share of votes on its AI images that judged them to be real photographs. Higher means it looked more like a real photo to the people voting.
- Thresholds & snapshots. Models need at least 10 votes to appear, so early noise does not dominate. The leaderboard is a periodic snapshot; the snapshot date is shown on the hub.
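The steps above can be sketched in a few lines of Python. This is a minimal illustration, not RealBench's actual code: the vote record shape (a `(model, guess)` pair), the model names, and the helper names `tally_votes` and `eligible` are all assumptions made for the example. Only the 10-vote minimum comes from the methodology itself.

```python
from collections import defaultdict

MIN_VOTES = 10  # minimum votes before a model appears (from the methodology)


def tally_votes(votes):
    """Count votes per source model.

    `votes` is an iterable of (model, guess) pairs, where guess is
    "real" or "ai". This record shape is hypothetical.
    """
    tallies = defaultdict(lambda: {"real": 0, "total": 0})
    for model, guess in votes:
        tallies[model]["total"] += 1
        if guess == "real":
            tallies[model]["real"] += 1
    return dict(tallies)


def eligible(tallies, min_votes=MIN_VOTES):
    """Keep only models that have reached the vote threshold."""
    return {m: t for m, t in tallies.items() if t["total"] >= min_votes}


# Made-up votes on AI images from two hypothetical models.
votes = (
    [("model-a", "real")] * 7
    + [("model-a", "ai")] * 5
    + [("model-b", "real")] * 3
)
tallies = tally_votes(votes)
shown = eligible(tallies)  # model-b has only 3 votes, so it is held back
```

Here `model-a` has 12 votes and would appear, while `model-b` stays hidden until it collects at least 10.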
Scoring
The realism score is the share of votes on a model's AI images that mistook them for real photographs:
realism = votes that said “real” on the model's AI images / total votes on the model's AI images
A higher realism score means the model fooled more people. Models are ranked most-photoreal first on the leaderboard.
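As a worked example of the formula and the ranking, the sketch below computes the score for three hypothetical models and sorts them most-photoreal first. The vote counts, model names, and function names are invented for illustration. How RealBench breaks ties is not stated, so this sketch leaves tied models in input order.

```python
def realism(real_votes, total_votes):
    """Share of votes on a model's AI images that said "real"."""
    if total_votes <= 0:
        raise ValueError("total_votes must be positive")
    return real_votes / total_votes


def rank(tallies):
    """Return (model, score) pairs sorted from highest realism to lowest.

    `tallies` maps a model name to a (real_votes, total_votes) tuple.
    """
    scores = {m: realism(r, t) for m, (r, t) in tallies.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up counts: (votes that said "real", total votes on the model's AI images)
tallies = {
    "model-a": (37, 50),  # 37 / 50 = 0.74
    "model-b": (12, 40),  # 12 / 40 = 0.30
    "model-c": (45, 60),  # 45 / 60 = 0.75
}
leaderboard = rank(tallies)
```

With these numbers `model-c` ranks first, `model-a` second, and `model-b` last.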
Limitations
- Votes are crowd-sourced from people choosing to play a game — not a controlled lab panel. Volume and the 10-vote minimum reduce noise but do not remove sampling bias.
- Realism is not quality. A model can look convincingly real while ignoring the prompt; prompt-following is covered by the capability-focused Benchmark V1 instead.
- Scores depend on the image subjects in the pool (landscapes, people, products, and more). A model strong on one subject may score differently as the pool grows.
- The leaderboard is a snapshot, refreshed periodically rather than live.
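To give a feel for how much noise remains near the 10-vote minimum mentioned in the first limitation, the sketch below computes a Wilson score interval for a vote share. This is not part of the RealBench methodology, which reports only the raw share. It is a standard binomial interval used here as an illustration, and it assumes votes are independent, which crowd-sourced votes may not be. The function name and the vote counts are hypothetical.

```python
import math


def wilson_interval(real_votes, total_votes, z=1.96):
    """Approximate 95% Wilson score interval for a vote share.

    z=1.96 corresponds to a two-sided 95% interval under a normal
    approximation; votes are assumed independent.
    """
    if total_votes <= 0:
        raise ValueError("total_votes must be positive")
    p = real_votes / total_votes
    denom = 1 + z**2 / total_votes
    center = (p + z**2 / (2 * total_votes)) / denom
    half = (
        z
        * math.sqrt(p * (1 - p) / total_votes + z**2 / (4 * total_votes**2))
        / denom
    )
    return center - half, center + half


# Same 70% realism share, very different certainty.
low_n = wilson_interval(7, 10)       # at the 10-vote minimum
high_n = wilson_interval(700, 1000)  # after many more votes
```

At 10 votes the interval spans roughly 0.40 to 0.89, so two models with the same displayed score can differ widely in how well that score is pinned down. At 1,000 votes the interval narrows to a few points either side of 0.70. A tighter interval reflects only less random noise; it does nothing about the sampling bias of who chooses to play.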