ImageBench

ImageBench V1

192 evaluations across 6 categories

Pass: 136
Fail: 56
Pass Rate: 71%
Avg Latency: 80.2s
Min Latency: 79.7s
Max Latency: 82.0s
Text Rendering 60%
Spatial Reasoning 79%
Human realism 86%
Truthfulness 63%
Professional Studio 82%
Graphical design 29%
Latency 0%
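The headline pass rate follows from the pass and fail counts reported above. A minimal sketch of that arithmetic; the function name `pass_rate` is illustrative and not part of ImageBench:

```python
def pass_rate(passed: int, failed: int) -> float:
    """Return the pass rate as a percentage of all evaluations."""
    total = passed + failed
    if total == 0:
        raise ValueError("no evaluations recorded")
    return 100.0 * passed / total


# Counts taken from the summary above: 136 passes, 56 fails.
rate = pass_rate(136, 56)
print(f"{136 + 56} evaluations, pass rate {rate:.1f}% (rounds to {round(rate)}%)")
```

The two counts sum to the 192 evaluations stated in the summary, and the unrounded rate rounds to the 71% shown.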
Text Rendering | Typography Style | Easy | qwen-image-local/qwen-image-gen

Prompt: The word 'CHAPTER ONE' typed on aged paper with a vintage typewriter font, complete with slightly uneven ink

Text Rendering 60%

Typography Style 100%
Easy 100%
PASS
The word 'CHAPTER ONE' typed on aged paper with a vintage typewriter font, complete with slightly uneven ink

Evaluator (Qwen 3 VL)

The text “CHAPTER ONE” is spelled correctly and rendered in a bold, blocky font with uneven, slightly smudged ink—strongly evoking a vintage typewriter. The aged paper texture with stains and fibers enhances the authenticity. The style convincingly matches the requested medium. VERDICT: PASS
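Each complete evaluator response on this page ends with `VERDICT: PASS` or `VERDICT: FAIL`. How ImageBench turns that into the PASS/FAIL badge is not shown here; the following is only a minimal sketch of one way to extract it, with `parse_verdict` as a hypothetical helper name:

```python
import re
from typing import Optional

# Matches the closing verdict marker used in the evaluator responses.
VERDICT_RE = re.compile(r"VERDICT:\s*(PASS|FAIL)\b")


def parse_verdict(evaluator_text: str) -> Optional[str]:
    """Return 'PASS' or 'FAIL' from the last verdict marker, or None if absent."""
    matches = VERDICT_RE.findall(evaluator_text)
    return matches[-1] if matches else None


print(parse_verdict("The medium matches the prompt precisely. VERDICT: PASS"))
print(parse_verdict("All elements are c"))  # response cut off before the verdict
```

A response that is cut off before its verdict yields `None`, so a caller would have to treat it as unscored rather than guess.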

PASS
The phrase 'TODAY'S SPECIAL' written in white chalk on a dark green chalkboard with chalk dust

Evaluator (Qwen 3 VL)

Text is spelled correctly and rendered in a convincing chalk-on-chalkboard style with visible texture, smudges, and dust. The white chalk strokes match the medium’s appearance, including uneven edges and residue. VERDICT: PASS

PASS
The word 'OPEN' displayed on a red LED sign with glowing pixel-style letters

Evaluator (Qwen 3 VL)

The text “OPEN” is spelled correctly and rendered in a convincing red LED pixel-style display, with glowing individual dots forming the letters. The medium matches the prompt precisely. VERDICT: PASS

Writing accuracy 50%
Easy 100%
PASS
A single large word 'HELLO' centered on a white background

Evaluator (Qwen 3 VL)

The word "HELLO" is spelled correctly and fully readable. All letters are clearly formed and legible against the white background. No misspellings, missing letters, or illegible parts are present. VERDICT: PASS

PASS
The word 'ADVENTURE' in big bold letters centered on a plain cream background

Evaluator (Qwen 3 VL)

The word "ADVENTURE" is spelled correctly and fully readable, despite its textured, distressed appearance. All letters are intact and legible against the cream background. No misspellings or missing parts are present. VERDICT: PASS

PASS
The word 'SERENITY' written large and centered on a solid light blue background

Evaluator (Qwen 3 VL)

The word "SERENITY" is spelled correctly and fully readable, despite its stylized, brush-like purple font. No letters are missing or misrendered. The text is centered on a light blue background, matching the intent. All characters are discernible. VERDICT: PASS

Medium 67%
PASS
The sentence 'The quick brown fox jumps high' in clean black text on a white background

Evaluator (Qwen 3 VL)

All words are spelled correctly and fully readable: "The", "quick", "brown", "fox", "jumps", "high". No misspellings or illegible parts. Text matches the intended phrase exactly. VERDICT: PASS

FAIL
'Every sunset brings a new dawn ahead' written neatly on a pale yellow background

Evaluator (Qwen 3 VL)

The word "aheed" is misspelled; it should be "ahead". All other words are correctly spelled and readable. The handwriting is legible despite being stylized. VERDICT: FAIL

PASS
'Stars shine brightest in the dark night' displayed on a clean off-white background

Evaluator (Qwen 3 VL)

The text reads: "Stars shine brightest in the dark night". All words are spelled correctly and fully readable. The phrase matches the intended text exactly, with no missing or illegible parts. The cursive script is clear despite decorative lights. VERDICT: PASS

Hard 0%
FAIL
A poster with headline 'SUMMER SALE' in large bold text, body text 'Up to 50% off all items this weekend only' in medium size, and fine print 'Terms and conditions apply. See store for details.' at the bottom in small text

Evaluator (Qwen 3 VL)

The word “conditions” is misspelled as “condltions” in the fine print. The “l” is missing. All other text matches the intended message exactly. The misspelling renders the fine print inaccurate. VERDICT: FAIL
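When both the intended and the rendered strings are available as text, a misspelling like the one quoted above can be located mechanically. The sketch below uses Python's standard `difflib` on the two spellings from the evaluator response; it is an illustration only and not part of the ImageBench evaluator, which judges the image with a vision-language model. The helper name `char_diffs` is hypothetical.

```python
from difflib import SequenceMatcher


def char_diffs(intended, rendered):
    """Return (operation, intended_fragment, rendered_fragment) for each mismatch."""
    matcher = SequenceMatcher(None, intended, rendered)
    return [
        (op, intended[i1:i2], rendered[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


# Spellings quoted in the evaluator response above.
print(char_diffs("conditions", "condltions"))
```

For this pair the diff reports a single substitution: the first `i` of the intended word is rendered as `l`, and no other character differs.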

FAIL
A flyer with large title 'GRAND OPENING', subtitle 'Join us for food, music, and fun' in medium text, and small disclaimer 'RSVP required. Limited seating available.' at the bottom

Evaluator (Qwen 3 VL)

“RSVP requirec.” has a misspelling: “requirec” should be “required.” “Limited serting available.” has a misspelling: “serting” should be “seating.” Both errors are clearly visible and incorrect. VERDICT: FAIL

FAIL
An event banner with headline 'ANNUAL GALA 2025' at the top in large serif font, description 'An evening of fine dining and live jazz' in medium text, and footnote 'Black tie optional. Doors open at 6 PM.' in small text

Evaluator (Qwen 3 VL)

The footer text contains errors: “Black puite auxcimal” is misspelled (should be “Black tie optional”) and “Doors·open” has a misplaced dot. “auxcimal” is not a word. The rest of the text is correct and readable. VERDICT: FAIL

Extreme 33%
FAIL
A vintage movie poster for 'THE LAST VOYAGE' featuring a tagline 'One ship. One chance. No turning back.' with credits block reading 'Directed by James Harlow. Starring Elena Voss, Marcus Chen, and Sofia Reyes. Music by Thomas Grant. A Paramount Pict

Evaluator (Qwen 3 VL)

Misspellings: “turning” → “tuvning” (in tagline), “Starring” → “Sttt” (in credits), “Thomas Grant” → “Thomase Grant” (in credits). Missing words: “Starring” is partially cut off. “Coming Summer 2025” → “Comagd Stwmer 2025” (illegible). Several names are distorted or unreadable. VERDICT: FAIL

PASS
A magazine cover titled 'NATIONAL EXPLORER' with cover lines 'The Hidden Temples of Cambodia — page 34', 'Interview: Dr. Maria Santos on Climate Action', 'Top 10 Trails You Must Hike This Year', and a barcode area with 'Issue 247 | March 2025 | $6.99

Evaluator (Qwen 3 VL)

All text matches the intended content exactly: “NATIONAL EXPLORER”, “The Hidden Temples of Cambodia — page 34”, “Interview: Dr. Maria Santos on Climate Action”, “Top 10 Trails You Must Hike This Year”, and the barcode area “Issue 247 | March 2025 | $6.99”. No misspellings, missing words, or illegible parts detected. VERDICT: PASS

FAIL
A product label for 'GOLDEN HARVEST ORGANIC HONEY' with subtitle 'Raw & Unfiltered', weight '12 oz (340g)', description 'Pure wildflower honey harvested from sustainable apiaries in the Pacific Northwest. No additives or preservatives.', and small te

Evaluator (Qwen 3 VL)

Misspellings: “presenoratives” → “preservatives”, “prointent” → “Product”, “06/2020” (intended 06/2026), “Store in cool, dry place” → “Store in cool, anyplace”. Also, “apliaries” is misspelled. Some text is partially obscured or illegible. Not all words are fully readable or accurate. VERDICT: FAIL
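The Writing accuracy figures above are consistent with a simple count-based roll-up: three prompts per difficulty tier, with the subcategory rate computed over all twelve. A minimal sketch of that check, assuming (the page does not state this) that rates are unweighted and rounded to the nearest percent:

```python
# (passed, total) per difficulty tier, read off the Writing accuracy results above.
tiers = {
    "Easy": (3, 3),
    "Medium": (2, 3),
    "Hard": (0, 3),
    "Extreme": (1, 3),
}


def percent(passed, total):
    """Pass rate rounded to the nearest whole percent."""
    return round(100 * passed / total)


tier_rates = {name: percent(p, t) for name, (p, t) in tiers.items()}
overall = percent(
    sum(p for p, _ in tiers.values()),
    sum(t for _, t in tiers.values()),
)

print(tier_rates)
print(overall)
```

Under the same assumption, adding the three Typography Style passes gives 9 of 15, which matches the 60% shown for Text Rendering.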

Spatial Reasoning 79%

Attributes Binding 100%
Easy 100%
PASS
A red ball and a blue cube sitting side by side on a white surface

Evaluator (Qwen 3.5 122B)

The image shows two objects on a white surface: - Left object: Red, spherical (ball-shaped), medium size. - Right object: Blue, cubic (cube-shaped), medium size. Both objects are side by side. Colors, shapes, and positions match the prompt exactly — no swaps or mismatches. VERDICT: PASS

PASS
A green apple and an orange pumpkin on a wooden cutting board

Evaluator (Qwen 3.5 122B)

The image shows a green apple and an orange pumpkin on a wooden cutting board. The apple is correctly green, round, and appropriately sized relative to the pumpkin. The pumpkin is orange, round with a stem, and also correctly sized. No attributes are swapped — color, shape, and size are all correctly bound to each object. VERDICT: PASS

PASS
A yellow taxi and a black limousine parked next to each other on a street

Evaluator (Qwen 3.5 122B)

The image shows a yellow taxi in the foreground and a black limousine behind it, both on a street. The taxi is correctly colored yellow and has taxi markings (including a “T” logo). The limousine is correctly colored black and has the elongated shape typical of a limo. No attributes are swapped — color, size, and shape are correctly bound to each object. VERDICT: PASS

Medium 100%
PASS
A large red circle, a small blue triangle, and a medium green star arranged on a white canvas

Evaluator (Qwen 3.5 122B)

The image shows: - A large red circle — correct color, size, and shape. - A small blue triangle — correct color, size, and shape. - A medium green star — correct color, size, and shape. All attributes are correctly bound to their respective objects with no swaps. The arrangement on a white canvas is also accurate. VERDICT: PASS

PASS
A tall thin white candle, a short fat red candle, and a medium spiral yellow candle on a shelf

Evaluator (Qwen 3.5 122B)

The image shows three candles on a wooden shelf: 1. Left candle: Tall, thin, white, with a twisted/spiral texture — matches “tall thin white candle”. 2. Middle candle: Short, fat, red, with textured surface — matches “short fat red candle”. 3. Right candle: Medium height, yellow, spiral-shaped — matches “medium spiral yellow candle”. All attributes (color, size, shape) are correctly bound to each object with no swaps. VERDICT: PASS

PASS
A big brown horse, a small white rabbit, and a medium-sized black cat standing in a meadow

Evaluator (Qwen 3.5 122B)

The image contains a big brown horse, a small white rabbit, and a medium-sized black cat — all correctly matched to their described attributes. The horse is large and brown, the rabbit is small and white, and the cat is medium-sized and black. All are standing in a grassy meadow. No attribute swaps or misidentifications are present. VERDICT: PASS

Hard 100%
PASS
A pink fire truck, a blue banana, and a green sun in a surreal landscape

Evaluator (Qwen 3.5 122B)

The image contains: - A pink fire truck: Correct color, shape, and object type. - A blue banana: Correct color and shape; size is surreal but matches prompt intent. - A green sun: The celestial body is green and round, but it resembles a moon more than a sun (no rays, night-like sky). However, the prompt says “green sun,” and color/shape are correct — contextually acceptable as surreal. All attributes are bound correctly to intended objects. No swaps. VERDICT: PASS

PASS
A purple carrot, a red tree with red leaves, and a white watermelon sliced open

Evaluator (Qwen 3.5 122B)

The image contains: - Three purple carrots with green tops — correct color and shape. - A tree in the background with dense red leaves — matches “red tree with red leaves”. - A watermelon sliced open, revealing white flesh with seeds — matches “white watermelon sliced open”. All attributes are correctly bound to their respective objects. No swaps or mismatches observed. VERDICT: PASS

PASS
An orange elephant, a yellow penguin, and a silver strawberry arranged on a table

Evaluator (Qwen 3.5 122B)

The image contains an orange elephant (correct color, shape, and size), a yellow penguin (correct color, shape, and size), and two strawberries — one red and one silver. The prompt asked for “a silver strawberry,” which is present, but also implies only one strawberry total. However, the presence of an extra red strawberry doesn’t violate the core requirement if we interpret “a silver strawberry” as at least one being silver. All specified objects are correctly attributed: elephant = orange, pen

Compositionality 78%
Easy 100%
PASS
A tabby cat wearing oversized red sunglasses

Evaluator (Qwen 3.5 122B)

- Tabby cat: Present — visible fur pattern and facial structure match a tabby. - Oversized red sunglasses: Present — large, bright red frames sit on the cat’s face, covering eyes. - Coherent combination: Yes — glasses fit naturally (though stylized), no anatomical contradictions. All elements are present and coherently combined. VERDICT: PASS

PASS
A golden retriever wearing a tiny cowboy hat

Evaluator (Qwen 3.5 122B)

- Golden retriever: Present — clearly visible, with correct fur color and facial features. - Tiny cowboy hat: Present — brown, with band and buckle, proportionally small on the dog’s head. - Coherent combination: Yes — hat sits naturally on the dog’s head; no anatomical or logical inconsistencies. All elements are accurately depicted and integrated. VERDICT: PASS

PASS
A penguin holding a colorful umbrella

Evaluator (Qwen 3.5 122B)

- Penguin: Present — clearly visible, standing upright with typical black-and-white plumage and orange beak. - Holding: Present — penguin’s flipper is wrapped around the umbrella’s curved handle in a plausible grip. - Colorful umbrella: Present — rainbow-colored panels (red, orange, yellow, green, blue, purple) are clearly depicted. - Coherent combination: Yes — the scene is logically composed; rain is falling, umbrella is open above the penguin, and water droplets enhance realism. All elements

Medium 100%
PASS
A medieval knight in full armor sitting at a desk typing on a modern laptop

Evaluator (Qwen 3.5 122B)

- Medieval knight in full armor: Present — detailed plate and chainmail armor with helmet. - Sitting at a desk: Present — seated at a rustic wooden desk. - Typing on a modern laptop: Present — hands positioned over a sleek, contemporary laptop keyboard. All elements are coherently combined with no contradictions or missing components. VERDICT: PASS

PASS
A Roman gladiator taking a selfie with a smartphone in the Colosseum

Evaluator (Qwen 3.5 122B)

- Roman gladiator: Present — armored figure with helmet, red plume, and muscle cuirass. - Selfie action: Present — holding smartphone up as if taking a photo of himself. - Smartphone: Present — modern device with camera module visible in hand. - Colosseum: Present — iconic Roman amphitheater structure in background. All elements are visually present and coherently combined despite the anachronism. VERDICT: PASS

PASS
An astronaut in a spacesuit riding a bicycle through a park

Evaluator (Qwen 3.5 122B)

- Astronaut: Present — clearly depicted in a full spacesuit with helmet, patches, and life-support pack. - Spacesuit: Present — detailed, realistic design with red stripes, gloves, boots, and visor. - Riding a bicycle: Present — seated on a mountain bike, gripping handlebars, feet on pedals. - Bicycle: Present — black frame, knobby tires, water bottle, functional components visible. - Park setting: Present — grassy area, trees with autumn leaves, paved path, natural lighting. All elements are c

Hard 33%
PASS
An octopus wearing a top hat and monocle playing chess on the surface of the moon with Earth visible in the background

Evaluator (Qwen 3.5 122B)

- Octopus: Present
- Top hat: Present
- Monocle: Present
- Chessboard with pieces: Present
- Moon surface: Present
- Earth in background: Present
All elements are visually present and coherently combined. The scene is surreal but logically assembled per the prompt. VERDICT: PASS

FAIL
A dinosaur in a business suit giving a PowerPoint presentation in a modern office while drinking coffee

Evaluator (Qwen 3.5 122B)

- Dinosaur: Present (T-Rex head and hands) - Business suit: Present (dark suit, white shirt, striped tie) - PowerPoint presentation: Not visible — no screen, slides, or projector shown - Modern office: Present (glass windows, desks, chairs, lighting) - Drinking coffee: Present (holding a coffee cup) Core concept is creatively rendered, but “giving a PowerPoint presentation” is not depicted — only implied by gesture. Since the prompt explicitly requires that element to be present, it’s missing.

FAIL
A robot dressed as a pirate sailing a paper boat on a lake of lava while juggling three kittens

Evaluator (Qwen 3.5 122B)

Let’s examine each element from the prompt: 1. “A robot” — The central figure is a cat with mechanical/robotic features (e.g., metallic hand, glowing eye), so partially robotic. Acceptable as “robot” in stylized form. 2. “dressed as a pirate” — Yes: tricorn hat with skull emblem, bandana, coat, and accessories match pirate attire. 3. “sailing a paper boat” — The boat appears to be made of folded brown material resembling paper or cardboard — plausible interpretation. 4. “on a lake of lava” — Th

Counting 89%
Easy 100%
PASS
Exactly 3 red apples arranged on a wooden table

Evaluator (Qwen 3.5 122B)

The image shows three red apples placed on a wooden surface. The apples are arranged with one in the foreground and two slightly behind, all resting on a textured wooden table. The count of apples is exactly three, matching the prompt’s request. VERDICT: PASS

PASS
Exactly 5 yellow rubber ducks sitting in a row on a bathroom shelf

Evaluator (Qwen 3.5 122B)

The image shows five yellow rubber ducks arranged in a row on what appears to be a bathroom shelf or tub edge. All ducks are identical in color and style, with orange beaks and black eyes. The count matches the prompt’s request exactly: 5 ducks, yellow, rubber-like, in a row, on a bathroom surface. VERDICT: PASS

PASS
Exactly 4 white candles standing on a dark surface

Evaluator (Qwen 3.5 122B)

The image shows four white candles, each with a flame, standing upright in brass candleholders on a dark surface. The count of candles is exactly 4, matching the prompt’s request. All candles are white and positioned on a dark surface as described. VERDICT: PASS

Medium 100%
PASS
2 orange cats and 3 black dogs sitting together on a grassy lawn

Evaluator (Qwen 3.5 122B)

I see 2 orange cats and 3 black dogs sitting together on a grassy lawn. The counts exactly match the prompt’s request: two cats, three dogs, all seated on grass. No extra or missing animals. Composition is accurate. VERDICT: PASS

PASS
4 red roses and 2 white daisies in a glass vase on a table

Evaluator (Qwen 3.5 122B)

The image shows 4 red roses and 2 white daisies arranged in a clear glass vase, placed on a wooden table. The counts match the prompt exactly: 4 roses, 2 daisies, 1 glass vase, 1 table. All requested elements are present and correctly counted. VERDICT: PASS

PASS
1 blue butterfly and 3 yellow butterflies resting on a garden fence

Evaluator (Qwen 3.5 122B)

The image shows 1 blue butterfly and 3 yellow butterflies, all resting on a chain-link fence. The count matches the prompt exactly: one blue, three yellow. All are positioned on the fence structure as described. VERDICT: PASS

Hard 67%
FAIL
Exactly 9 goldfish swimming in a large round fishbowl

Evaluator (Qwen 3.5 122B)

The image shows a round fishbowl with multiple goldfish. Counting carefully: there are 8 clearly visible goldfish inside the bowl. One faint reflection or partial fish appears on the right edge, but it’s not a distinct, fully rendered fish. The prompt requests “exactly 9 goldfish.” Since only 8 are clearly present and the 9th is ambiguous/reflected, the count does not exactly match. VERDICT: FAIL

PASS
A fruit bowl containing exactly 3 bananas, 4 oranges, and 2 apples

Evaluator (Qwen 3.5 122B)

The image shows a fruit bowl with: - 3 bananas (correct) - 4 oranges (correct) - 2 apples (one green, one red — correct) All counts match the prompt exactly. VERDICT: PASS

PASS
Exactly 11 colorful balloons floating against a clear blue sky

Evaluator (Qwen 3.5 122B)

I count 11 colorful balloons in the image: green, blue, pink, orange (x2), yellow (x2), red, purple, and another pink. They are floating against a clear blue sky with some clouds. The count matches exactly what was requested. VERDICT: PASS

Negation 33%
Easy 67%
PASS
A tropical beach scene with no people

Evaluator (Qwen 3.5 122B)

The image depicts a serene tropical beach scene with golden sand, turquoise water, and lush palm trees framing the view. The shoreline curves gently into the distance, where more palms and greenery line the coast. There are no visible people, footprints, or signs of human presence — only natural elements: sand, sea, sky, and vegetation. The exclusion criterion (“no people”) is fully satisfied. VERDICT: PASS

FAIL
A cat lying on a couch without a collar

Evaluator (Qwen 3.5 122B)

The image shows a tabby cat lying on a gray couch. The cat is wearing a brown leather collar with a metal buckle around its neck. Since the prompt explicitly excluded “a collar,” and the collar is clearly present, the excluded element is not absent. VERDICT: FAIL

PASS
A classroom with desks and a chalkboard but no students

Evaluator (Qwen 3.5 122B)

The image shows an empty classroom with wooden desks, chairs, and a large chalkboard at the front. There are potted plants on some desks and backpacks hanging or placed nearby. No students are visible in the scene — the room is unoccupied by people. The excluded element (“students”) is truly absent. VERDICT: PASS

Medium 33%
FAIL
A sandy beach with palm trees, seashells, and a beach umbrella but no water or ocean visible

Evaluator (Qwen 3.5 122B)

The image clearly shows a sandy beach with palm trees, seashells in the foreground, and a thatched beach umbrella. However, behind the umbrella and between the palm trees, there is a large body of water — specifically, an ocean with visible waves breaking on the shore and sparkling under sunlight. The prompt explicitly excluded “water or ocean visible,” which is directly contradicted by this image. VERDICT: FAIL

PASS
An office workspace with a desk, chair, bookshelf, and lamp but no computers or screens

Evaluator (Qwen 3.5 122B)

The image shows a traditional office workspace with a wooden desk, an office chair, a bookshelf filled with books, and a lit lamp. There are no computers, monitors, tablets, or any electronic screens visible on the desk or elsewhere in the room. The excluded elements are truly absent. VERDICT: PASS

FAIL
A kitchen with pots, pans, a stove, and a cutting board but no food

Evaluator (Qwen 3.5 122B)

The image shows a rustic kitchen scene with pots, pans, a stove, and a cutting board — as described. However, there is visible food: a browned item (possibly meat or bread) in the small pan on the cutting board, and what appears to be cooked residue or sauce in other pans. The prompt explicitly excluded “no food,” so this violates the constraint. VERDICT: FAIL

Hard 0%
FAIL
A pepperoni pizza with tomato sauce, basil, and crust but absolutely no cheese

Evaluator (Qwen 3.5 122B)

The image shows a pepperoni pizza with tomato sauce, basil leaves, and a crust. However, there is clearly melted cheese visible between the pepperoni slices and under the basil — it’s bubbly, golden-brown in spots, and stretches slightly where the pizza is sliced. Since the prompt explicitly excluded cheese (“absolutely no cheese”), and cheese is present, the core criterion is violated. VERDICT: FAIL

FAIL
A car body with windows, doors, and headlights but no wheels

Evaluator (Qwen 3.5 122B)

The image shows a complete car, including visible wheels. The prompt explicitly excluded wheels (“no wheels”), but the car in the image clearly has black alloy wheels attached to its body. Therefore, the excluded element is present. VERDICT: FAIL

FAIL
A human face with eyes, nose, and mouth but no eyebrows

Evaluator (Qwen 3.5 122B)

The image shows a close-up of a human face with clearly visible eyes, nose, and mouth. However, the prompt explicitly excluded eyebrows — and in this image, both eyebrows are present, well-defined, and natural-looking. Since the excluded element (eyebrows) is visibly present, the image fails to meet the exclusion criterion. VERDICT: FAIL

Relative Position 83%
Easy 67%
FAIL
A cat sitting on top of a cardboard box

Evaluator (Qwen 3.5 122B)

The cat is positioned inside the open cardboard box, with its front paws resting on the rim and its body seated within the box’s interior. The prompt says “sitting on top of,” which implies the cat is atop the closed or flat surface of the box — not inside it. This misrepresents the actual spatial relationship. VERDICT: FAIL

PASS
A red ball underneath a wooden chair

Evaluator (Qwen 3.5 122B)

The image shows a red, weathered ball positioned directly beneath the seat of a wooden chair, resting on the floor between the chair’s legs. The spatial relationship “underneath” is accurately depicted — the ball is below the chair’s seat and enclosed within the space defined by the chair’s legs. All described positional relationships are correct. VERDICT: PASS

PASS
A coffee mug next to an open book on a desk

Evaluator (Qwen 3.5 122B)

The image shows a coffee mug in the foreground and an open book behind it, both resting on a wooden surface (desk). The mug is positioned in front of the book, not strictly “next to” in a side-by-side sense, but spatially adjacent and co-located on the same plane. The prompt’s phrase “next to” can reasonably include front-back adjacency in casual usage. All objects are correctly placed “on” the desk. No contradictory spatial claims (e.g., under, above) are violated. VERDICT: PASS

Medium 100%
PASS
A bird perched on a branch, the branch extending from a tree next to a park bench

Evaluator (Qwen 3.5 122B)

The image shows a bird perched on a branch that extends from a large tree trunk on the right. In the background, slightly to the left and below the branch, is a park bench. The spatial relationships described — bird “on” branch, branch extending from tree, tree “next to” bench — are all accurately represented. The bench is not directly under or beside the branch but is in proximity, which satisfies “next to.” All positional descriptors are correct. VERDICT: PASS

PASS
A cat sitting on a pillow, the pillow placed on a wooden chair

Evaluator (Qwen 3.5 122B)

The image shows a fluffy cat sitting directly on a light-colored pillow. The pillow is placed on the seat of a wooden chair. All spatial relationships described in the prompt — “cat on pillow,” “pillow on chair” — are accurately represented. No positional errors (left/right/above/below) are present; the arrangement matches the prompt exactly. VERDICT: PASS

PASS
A candle on top of a stack of books, the books next to a potted plant

Evaluator (Qwen 3.5 122B)

The candle is correctly placed on top of a stack of two books. The potted plant is positioned to the left of the books, satisfying “next to.” All specified spatial relationships — “on top of,” “stack of books,” and “next to” — are accurately represented in the image. No contradictions or misplacements observed. VERDICT: PASS

Hard 67%
FAIL
Five chess pieces on a board: a king in the center, a queen to its left, a bishop to its right, a rook above the king, and a knight below the king

Evaluator (Qwen 3.5 122B)

The image shows five wooden chess pieces on a board. The king is in the foreground center. To its left (viewer’s left) is the queen — correct. To its right is a knight — not a bishop as specified. Behind the king (above in 2D view) is a rook — acceptable interpretation of “above.” There is no piece below the king; the knight is to the right-front, not below. Also, the piece described as “bishop” is actually a knight. Thus, multiple positional and identity errors exist. VERDICT: FAIL

PASS
A desk scene: laptop in the center, coffee mug to the left, phone to the right, lamp above, and notebook below

Evaluator (Qwen 3.5 122B)

The image shows a laptop centered on the desk. A coffee mug is to the left of the laptop. A phone is to the right. A lamp is positioned above and behind the laptop. An open notebook lies below (in front of) the laptop. All specified spatial relationships — left, right, above, below — are accurately represented in the image. VERDICT: PASS
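The left, right, above, and below relations in this prompt can also be expressed as comparisons between object centers in image coordinates. The sketch below is only an illustration of that idea: the evaluator shown here is a vision-language model, not a box-based checker, and the `Box` type, the `relation_holds` helper, and every coordinate are hypothetical values made up for the example rather than measurements from the benchmark image.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box in image pixels; y grows downward."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def center(self):
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)


def relation_holds(relation, subject, anchor):
    """Check a 2D relation of `subject` relative to `anchor` using box centers."""
    sx, sy = subject.center
    ax, ay = anchor.center
    checks = {
        "left_of": sx < ax,
        "right_of": sx > ax,
        "above": sy < ay,  # smaller y is higher in the image
        "below": sy > ay,
    }
    return checks[relation]


# Hypothetical boxes for the desk-scene prompt (not taken from the real image).
boxes = {
    "laptop": Box(400, 300, 700, 500),
    "coffee mug": Box(150, 350, 250, 450),
    "phone": Box(800, 380, 880, 480),
    "lamp": Box(500, 50, 600, 250),
    "notebook": Box(450, 550, 650, 700),
}
expected = [
    ("left_of", "coffee mug"),
    ("right_of", "phone"),
    ("above", "lamp"),
    ("below", "notebook"),
]
results = {
    name: relation_holds(rel, boxes[name], boxes["laptop"])
    for rel, name in expected
}
print(results)
```

A purely 2D check like this reads "above" as higher in the frame, whereas the evaluator response accepts "above and behind" and "below (in front of)", so the two approaches can disagree on scenes with depth.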

PASS
Five animals in a field: a horse on the far left, a cow to the right of the horse, a sheep in the center, a goat to the right of the sheep, and a dog on the far right

Evaluator (Qwen 3.5 122B)

The image shows five animals in a field. From left to right: a brown horse is on the far left. To its right stands a large brown cow. In front and slightly left of center is a white sheep. To the sheep’s right is a dark goat with a white face. On the far right is a small brown-and-black dog. All specified spatial relationships — “horse on far left,” “cow to right of horse,” “sheep in center,” “goat to right of sheep,” “dog on far right” — are accurately depicted. VERDICT: PASS

Extreme 100%
PASS
A landscape with a red barn in the foreground at large scale, a windmill in the midground at medium scale, and snow-capped mountains in the background at small scale, all with consistent depth perspective

Evaluator (Qwen 3.5 122B)

The image shows a red barn in the foreground, large and prominent. A windmill stands to the left in the midground, smaller than the barn but clearly visible. Snow-capped mountains dominate the background, appearing distant and smaller due to perspective — consistent with “small scale” in depth. All spatial relationships (foreground/midground/background) and relative scales are correctly rendered. No positional errors (e.g., barn behind mountains or windmill above barn). Depth cues like atmospher

PASS
A beach scene with a sandcastle in the foreground, a surfer on the waves in the midground, and a cruise ship on the horizon in the background, sizes decreasing with distance

Evaluator (Qwen 3.5 122B)

The image shows a sandcastle in the foreground, a surfer on waves in the midground, and a cruise ship on the horizon in the background — matching the prompt’s spatial layout. Sizes decrease with distance as described. All positional relationships (foreground/midground/background) are correctly represented. No incorrect “on,” “under,” “left,” etc., relations are present. VERDICT: PASS

PASS
A city park with a park bench and pigeons in the foreground, a fountain in the midground, and skyscrapers in the background, each layer progressively smaller

Evaluator (Qwen 3.5 122B)

The image correctly shows pigeons on park benches in the foreground, a fountain in the midground, and skyscrapers in the background — matching the prompt’s layered spatial arrangement. Objects are positioned as described: benches and pigeons are closest (foreground), fountain is behind them (midground), and buildings are farthest (background). No incorrect positional relationships (e.g., “above,” “left”) are violated. The scene adheres to the specified depth progression. VERDICT: PASS
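Each evaluator response in this report closes with a `VERDICT: PASS` or `VERDICT: FAIL` tag, and a few responses are cut off before the tag appears. The sketch below shows one way a harness could read that tag and flag cut-off responses. The function name and the choice to return `None` for a missing verdict are assumptions for illustration, not the benchmark's actual code.

```python
import re
from typing import Optional

# Hypothetical pattern: matches a verdict tag at the very end of a response.
VERDICT_RE = re.compile(r"VERDICT:\s*(PASS|FAIL)\s*$")


def extract_verdict(response: str) -> Optional[str]:
    """Return 'PASS' or 'FAIL' if the response ends with a verdict tag, else None."""
    match = VERDICT_RE.search(response.strip())
    return match.group(1) if match else None


complete = "Sizes decrease with distance as described. VERDICT: PASS"
cut_off = "All spatial relationships are correctly rendered. Depth cues like atmospher"

print(extract_verdict(complete))  # PASS
print(extract_verdict(cut_off))   # None
```

Returning `None` rather than guessing keeps a truncated response from being silently counted as a pass or a fail.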

Scale & Proportions 89%
Easy 100%
PASS
A person standing next to a yellow school bus, showing realistic relative scale

Evaluator (Qwen 3.5 122B)

The person’s height relative to the school bus appears realistic — roughly 1/3 to 1/2 the bus’s height, which is consistent with real-world proportions. The bus’s wheels, windows, and door scale appropriately around the human figure. No obvious distortions or size mismatches are visible. VERDICT: PASS

PASS
A ladybug sitting on a large green leaf, shown at realistic proportions

Evaluator (Qwen 3.5 122B)

The ladybug appears appropriately sized relative to the leaf — typical for a real-world macro shot. The leaf’s veins, texture, and serrated edge are detailed and scale-consistent with a natural leaf. The beetle’s body proportions (head, thorax, elytra, legs) also match biological realism. No obvious scaling errors or distortions are present. VERDICT: PASS

PASS
A child holding a basketball, both at correct relative sizes

Evaluator (Qwen 3.5 122B)

The child’s hands are proportionally appropriate for holding a basketball of standard size. The ball fits naturally in his grasp, and its diameter relative to his torso and arms is realistic for a young child. No anatomical or object-scale distortions are visible. The proportions align with real-world expectations. VERDICT: PASS

Medium 67%
PASS
A tiny person standing on the rim of a coffee cup, looking down into it

Evaluator (Qwen 3.5 122B)

The image shows a miniature person standing on the rim of a coffee cup, with steam rising from the dark liquid inside. The person’s scale relative to the cup is consistent with the prompt — they appear tiny compared to the cup, which is rendered at normal size. The proportions are visually coherent and match the described size relationship. VERDICT: PASS

FAIL
A giant tabby cat walking between city skyscrapers like a kaiju

Evaluator (Qwen 3.5 122B)

The image shows a tabby cat with Godzilla-like features (spikes, claws, tail) walking down a city street. However, the cat is not “giant”: it’s roughly human-sized or slightly larger, while the skyscrapers in the background are full-scale buildings. This breaks the intended scale: if the cat were truly kaiju-sized, it would tower over the buildings or at least be comparable in height to them. The proportions are inconsistent with the prompt’s description of a “giant” cat among skyscrapers. VERDICT: FAIL

PASS
A miniature house sitting on the palm of someone's hand

Evaluator (Qwen 3.5 122B)

The image shows a detailed miniature house resting in the palm of a human hand. The house is appropriately scaled to fit comfortably within the palm, with fingers curled slightly around it — consistent with the prompt’s description. Proportions between the hand and the house are realistic for a “miniature” object held in hand. No size inconsistencies are visible. VERDICT: PASS

Hard 100%
PASS
A group photo of an elephant, a horse, a dog, a cat, and a mouse, all standing in a line with correct real-world proportional sizes

Evaluator (Qwen 3.5 122B)

The image shows an elephant, horse, dog, cat, and mouse standing in a line. Their relative sizes are accurately scaled to real-world proportions: the elephant is largest, followed by the horse, then the dog, cat, and finally the smallest — the mouse. All animals are positioned side-by-side with correct size hierarchy preserved. VERDICT: PASS

PASS
A dinner table scene with a plate, fork, knife, wine glass, salt shaker, and peppercorn, all at realistic relative sizes

Evaluator (Qwen 3.5 122B)

The image shows a dinner table setting with a plate, fork, two knives, wine glass, salt and pepper shakers, and scattered peppercorns. All objects appear to be at realistic relative sizes: the plate is appropriately large compared to the cutlery, the wine glass is proportionate to the plate, and the shakers are smaller than the plate but larger than individual peppercorns. The spatial arrangement and scale are consistent with real-world expectations. VERDICT: PASS

PASS
A farm scene with a barn, tractor, farmer, cow, chicken, and a fence post, all at correct proportional scale

Evaluator (Qwen 3.5 122B)

The image displays a farm scene with all requested elements: barn, tractor, farmer, cow, chicken, and fence post. The relative sizes appear consistent — the farmer is appropriately scaled to the tractor and barn; the chickens are small in foreground; the cow is distant but proportionally sized; the fence post matches human scale. No obvious distortions or impossible proportions are visible. VERDICT: PASS
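The tier and subcategory percentages follow from simple pass counts. As a rough check, the sketch below recomputes them from the nine Scale & Proportions outcomes listed above. The per-tier lists are transcribed from those entries, and rounding to the nearest whole percent is an assumption about how the report displays its figures.

```python
# Pass/fail outcomes per difficulty tier, transcribed from the
# Scale & Proportions entries above (True = PASS, False = FAIL).
tiers = {
    "Easy": [True, True, True],
    "Medium": [True, False, True],
    "Hard": [True, True, True],
}


def pass_rate(outcomes):
    """Percentage of passing outcomes, rounded to the nearest whole percent."""
    return round(100 * sum(outcomes) / len(outcomes))


for name, outcomes in tiers.items():
    print(f"{name} {pass_rate(outcomes)}%")

# The subcategory figure pools every outcome across tiers.
all_outcomes = [o for outcomes in tiers.values() for o in outcomes]
print(f"Scale & Proportions {pass_rate(all_outcomes)}%")
```

With 8 passes out of 9, the pooled rate comes to 89%, which matches the figure shown for this subcategory.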

Human realism 86%

Faces & Expressions 83%
Easy 100%
PASS
A close-up portrait of a young woman with a warm natural smile

Evaluator (Qwen 3 VL)

Face is symmetric with natural, correctly proportioned features. Smile appears warm and genuine, clearly conveying happiness. No noticeable distortions or unnatural artifacts. Lighting and skin texture look authentic. Meets all prompt criteria. VERDICT: PASS

PASS
A portrait of a middle-aged man with a calm, serious expression

Evaluator (Qwen 3 VL)

The face is symmetric with natural, correctly proportioned features. The calm, serious expression is clearly readable through neutral lips and steady gaze. No significant distortions are present. Skin texture and hair detail appear authentic. The portrait successfully matches the prompt’s intent. VERDICT: PASS

PASS
A headshot of a teenager with a relaxed, neutral face

Evaluator (Qwen 3 VL)

The face is symmetric with natural, correctly proportioned features. The expression is relaxed and neutral, clearly conveying calmness without distortion. Skin texture and lighting are realistic. No visible artifacts or unnatural deformities. The subject appears to be a teenager. All criteria met. VERDICT: PASS

Medium 67%
FAIL
A close-up of a person's face showing clear disgust, nose wrinkled and lip curled

Evaluator (Qwen 3 VL)

The face is largely symmetric with realistic features. However, the lips are not clearly curled upward or downward in disgust; they appear neutral or slightly downturned. The nose is not visibly wrinkled. The expression reads more as intense displeasure or anger than clear disgust. Minor AI artifacts exist in skin texture and hair, but not severe distortions. VERDICT: FAIL

PASS
A portrait capturing genuine surprise — raised eyebrows, wide eyes, slightly open mouth

Evaluator (Qwen 3 VL)

Face is symmetric with natural features. Eyes are wide, eyebrows raised, mouth slightly open — all clearly conveying genuine surprise. No significant distortions. Skin texture and lighting are realistic. Emotion is unmistakable and well-executed. VERDICT: PASS

PASS
A face showing quiet melancholy — downcast eyes, slightly drawn mouth, distant gaze

Evaluator (Qwen 3 VL)

Face is symmetric with natural, correctly rendered features. Eyes are downcast, lips slightly drawn, and gaze appears distant — all aligning with the prompt’s melancholy description. No visible distortions. Emotional intent is clearly readable. VERDICT: PASS

Hard 67%
PASS
An elderly woman in her 80s with deep wrinkles and white hair, smiling warmly with kind eyes

Evaluator (Qwen 3 VL)

The face is symmetric with natural, aged features: deep wrinkles, white hair, and warm, kind eyes. The smile is clearly readable and genuine. No distortions are present; the expression and details align perfectly with the prompt. VERDICT: PASS

FAIL
A 5-year-old child with round cheeks and large eyes, crying with tears streaming down their face

Evaluator (Qwen 3 VL)

The face is mostly symmetrical with natural features, but the child’s expression is inconsistent: tears are present, yet the mouth is slightly upturned in a smile, contradicting the “crying” prompt. This emotional mismatch undermines the intended emotion. Minor facial distortions exist, such as slightly uneven tear paths and unnatural skin texture. The prompt’s core emotion is not clearly readable. VERDICT: FAIL

PASS
An elderly man in his 70s with age spots and thinning gray hair, looking skeptically over reading glasses

Evaluator (Qwen 3 VL)

Face is symmetric with natural wrinkles, age spots, and thinning gray hair. Expression conveys skepticism via furrowed brow and direct gaze. Glasses are correctly placed. No significant distortions. Features align well with prompt. VERDICT: PASS

Extreme 100%
PASS
A group of three friends at a table: one laughing with head tilted back, one looking shocked with wide eyes, and one rolling their eyes with a smirk

Evaluator (Qwen 3 VL)

Faces are mostly symmetrical with natural features. Emotions are clearly readable: man laughs with head tilted, woman center looks shocked with wide eyes, woman left shows surprise (not rolling eyes). Rightmost person is partially cut off. Minor facial distortions in the central woman’s eyes and mouth, but overall realistic. VERDICT: PASS

PASS
A family photo with four people: a father beaming with pride, a mother smiling gently, a teenager looking bored, and a child grinning excitedly

Evaluator (Qwen 3 VL)

Faces are symmetric with natural features. Emotions are clearly readable: father beams, mother smiles gently, teenager looks bored, child grins excitedly. No significant distortions. All four individuals match the prompt’s emotional descriptions accurately. VERDICT: PASS

PASS
Three coworkers in a meeting: one looking frustrated with furrowed brows, one nodding with a pleased expression, and one looking confused with a tilted head

Evaluator (Qwen 3 VL)

Faces are symmetric with correct features. Emotions are clearly readable: left man shows frustration (furrowed brows), center woman shows confusion (tilted head, furrowed brow), right man shows pleased nodding (smile, upward gaze). No distortions. All three expressions match the prompt. VERDICT: PASS

Full Body 92%
Easy 100%
PASS
A woman standing upright in a relaxed pose, arms at her sides, wearing jeans and a white t-shirt

Evaluator (Qwen 3 VL)

The woman stands upright with arms at her sides, wearing jeans and a white t-shirt — perfectly matching the prompt. Limb count and proportions are anatomically correct. No distortions are visible. The relaxed, neutral pose is clearly recognizable. All criteria are satisfied. VERDICT: PASS

PASS
A man standing straight facing the camera, hands in pockets, normal proportions

Evaluator (Qwen 3 VL)

Limb count and proportions are anatomically correct. The pose — standing straight, facing camera, hands in pockets — is clearly recognizable and matches the prompt. No distortions are visible. The figure’s posture and clothing fit naturally. VERDICT: PASS

PASS
A person standing in a neutral pose in a park, full body visible from head to feet

Evaluator (Qwen 3 VL)

The person’s limbs are correctly counted and proportioned. The neutral standing pose is anatomically accurate and recognizable. No distortions are visible. The full body is clearly shown from head to feet in a park setting. All prompt requirements are met. VERDICT: PASS

Medium 100%
PASS
A woman mid-stride running through a park, hair and clothes showing motion

Evaluator (Qwen 3 VL)

Limb count and proportions are anatomically correct. The running pose is recognizable — mid-stride with arms bent, one leg forward, the other back. Hair and clothing show motion blur appropriately. No significant distortions. The image captures the dynamic action as requested. VERDICT: PASS

PASS
A man captured mid-jump over a puddle, body in the air with bent knees

Evaluator (Qwen 3 VL)

Limb count and proportions are anatomically correct. The pose—mid-air jump with bent knees over a puddle—is recognizable and dynamic. No distortions in limbs or body structure. The motion is plausible, with water splashing realistically. The man’s form and action align perfectly with the prompt. VERDICT: PASS

PASS
A dancer in a dynamic spin, dress flowing outward with one leg extended

Evaluator (Qwen 3 VL)

The dancer’s limbs are correctly counted and proportioned. The pose is anatomically plausible — one leg extended upward, arms arched, dress flowing outward — capturing a dynamic spin. No distortions in body structure or limbs. The action is clearly recognizable as a spinning motion with dramatic fabric movement. VERDICT: PASS

Hard 100%
PASS
A person in a yoga tree pose — standing on one leg with the other foot pressed to the inner thigh, arms raised overhead

Evaluator (Qwen 3 VL)

The person has two legs, correct proportions, and performs the tree pose accurately: standing on one foot, the other foot’s sole pressed to the inner thigh, arms raised overhead in prayer position. No anatomical distortions or limb anomalies are visible. The pose is clearly recognizable and executed correctly. VERDICT: PASS

PASS
A rock climber clinging to a cliff face with arms and legs extended to different holds

Evaluator (Qwen 3 VL)

The climber has correct limb count and proportions. His pose is anatomically plausible for dynamic climbing — arms and legs extended to different holds, with visible muscle strain. The action is recognizable as intense rock climbing. Slight distortion: the lower leg/foot placement appears slightly unnatural, but overall, the pose and anatomy are coherent with the prompt. VERDICT: PASS

PASS
A cellist seated with correct posture, legs around the instrument, bow arm extended

Evaluator (Qwen 3 VL)

The cellist has correct limb count and proportions. The pose is anatomically plausible: seated, legs around the instrument, bow arm extended. The cello is properly supported by a stand, and the bow is held correctly. No significant distortions are visible. The action matches the prompt accurately. VERDICT: PASS

Extreme 67%
PASS
Two tango dancers in a close embrace, the man dipping the woman backward with intertwined legs

Evaluator (Qwen 3 VL)

Limb count and proportions are correct. The pose is a recognizable tango dip with intertwined legs and close embrace. Anatomical structure is plausible, though the woman’s leg positioning shows slight unnatural strain. No major distortions; the action is clearly depicted. VERDICT: PASS

FAIL
Two soccer players mid-tackle, one sliding and the other jumping over with the ball at their feet

Evaluator (Qwen 3 VL)

Limb counts and proportions are anatomically correct. The sliding player’s body is elongated unnaturally, and his right arm appears to pass through the ball. The jumping player’s pose is dynamic but slightly distorted — his left leg is awkwardly bent, and his right foot is oddly positioned. The action is recognizable as a tackle, but the anatomy and motion are inconsistent with real human biomechanics. VERDICT: FAIL

PASS
Two gymnasts performing a synchronized handstand, side by side with identical body alignment

Evaluator (Qwen 3 VL)

Both gymnasts have correct limb count and proportions. Their poses are anatomically plausible handstands with synchronized alignment, though one has a leg extended upward while the other holds the other’s foot — not a standard handstand. The action is recognizable as a synchronized acrobatic pose. Minor distortions: slight asymmetry in hand placement and leg angles. Overall, the core requirement of synchronized handstands with identical alignment is met. VERDICT: PASS

Hands 83%
Easy 100%
PASS
A person standing with both hands visible at their sides, fingers relaxed and naturally spread

Evaluator (Qwen 3 VL)

Both hands show five fingers each, anatomically correct with relaxed, naturally spread fingers. No anomalies detected. Hands rest on hips, consistent with “at sides” posture. Fingers are not clenched or unnatural. All details align with prompt. VERDICT: PASS

PASS
A woman holding her hands out in front of her palms-up, all ten fingers clearly visible

Evaluator (Qwen 3 VL)

Both hands show all ten fingers clearly — five per hand. The gesture is anatomically correct: palms up, fingers slightly spread, wrists neutral. No anomalies in finger count or structure. Nails show minor wear or chipping, but not deformities. The pose matches the prompt precisely. VERDICT: PASS

PASS
A man with arms crossed, both hands resting on opposite upper arms with fingers visible

Evaluator (Qwen 3 VL)

Both hands show five fingers each — no anomalies in count. Fingers are positioned naturally, with palms facing inward and knuckles aligned, consistent with a crossed-arm posture. Anatomy is plausible: no extra or missing digits, joints appear correctly formed, and hand placement on opposite upper arms is realistic. No deformities or distortions. VERDICT: PASS

Medium 67%
PASS
A person holding a white ceramic coffee mug by the handle with their right hand

Evaluator (Qwen 3 VL)

The visible right hand holds the mug handle with thumb and fingers, showing all five fingers clearly. The grip is anatomically correct — thumb wraps around the handle, fingers curl naturally. No anomalies or missing fingers. The prompt is accurately fulfilled. VERDICT: PASS

PASS
A woman writing in a notebook with a pen gripped correctly in her right hand

Evaluator (Qwen 3 VL)

Both visible hands show five fingers each. The right hand grips the pen with a natural, anatomically correct writing posture — thumb, index, and middle fingers controlling the pen, while the ring and pinky support. The left hand rests on the notebook, fingers slightly curled. No anomalies detected. VERDICT: PASS

FAIL
A person playing an acoustic guitar with left hand on the fretboard and right hand strumming

Evaluator (Qwen 3 VL)

Left hand shows 5 fingers pressing strings, grip is anatomically correct for fretting. Right hand is partially visible near body, but only thumb and part of palm are seen — fingers not clearly countable. Since prompt requires right hand strumming and we can’t confirm 5 fingers or proper strumming gesture, the image fails to fully meet the prompt’s anatomical and compositional requirements. VERDICT: FAIL

Hard 100%
PASS
A person making a peace sign with their right hand, clearly showing two fingers up and three curled

Evaluator (Qwen 3 VL)

The right hand clearly shows two extended index and middle fingers, with the ring, pinky, and thumb curled inward — totaling five fingers. The gesture is anatomically correct, with natural finger positioning and no anomalies. All five fingers are visible and properly formed. VERDICT: PASS

PASS
A hand giving a thumbs-up with four fingers curled and thumb extended upward

Evaluator (Qwen 3 VL)

The hand shows four fingers curled (index, middle, ring, pinky) and one thumb extended upward — exactly five digits visible. The gesture is anatomically correct: thumb opposes the palm, fingers curl naturally. No anomalies like extra or missing fingers, malformed joints, or unnatural skin texture. All details align with the prompt. VERDICT: PASS

PASS
A person counting to three on their fingers, with index, middle, and ring fingers extended

Evaluator (Qwen 3 VL)

The hand shows all five fingers extended — thumb, index, middle, ring, and pinky — matching the prompt. The gesture is anatomically correct with no visible anomalies. Fingernails, skin texture, and finger spacing appear natural. The hand is clearly a single hand, not two, so the prompt’s “counting to three” is not literally fulfilled, but the gesture (three fingers extended) is visually present and correctly rendered. VERDICT: PASS

Extreme 67%
PASS
Two people shaking hands in a professional setting, all fingers of both hands anatomically correct

Evaluator (Qwen 3 VL)

Both hands show five clearly defined fingers each. The handshake gesture is anatomically plausible — thumbs interlock, fingers wrap around, and the grip is natural for a professional context. No visible anomalies like extra or missing fingers, malformed joints, or unnatural positioning. The image meets the anatomical and compositional requirements. VERDICT: PASS

FAIL
Two people doing a high-five, palms meeting with all fingers extended on both hands

Evaluator (Qwen 3 VL)

Both hands show only 4 visible fingers each (thumbs are obscured by the gesture). The thumbs are not extended, and the hands are clasped palm-to-palm, not in a high-five. This is anatomically incorrect for a high-five, which requires open palms with all five fingers extended. The gesture is more like a prayer or prayer-like clasp. VERDICT: FAIL

PASS
One person handing a small gift box to another person, both sets of hands gripping the box correctly

Evaluator (Qwen 3 VL)

Both hands show five fingers each, with no missing or extra digits. The grip is anatomically correct: thumbs support the sides, fingers cradle the box, and palms face each other during the transfer. No anomalies in finger count or hand posture. The gesture is natural and consistent with handing an object. VERDICT: PASS

Multi-Subject 83%
Easy 67%
PASS
Two women standing side by side: one with short blonde hair wearing a red jacket, the other with long black hair wearing a blue dress

Evaluator (Qwen 3 VL)

Person 1: Short blonde hair, wearing a red jacket over a black shirt — matches description. Person 2: Long black hair, wearing a blue sleeveless top — matches description. Both are visually distinct and positioned side by side. VERDICT: PASS

FAIL
Two men at a cafe: one with a beard and glasses wearing a green sweater, the other clean-shaven in a black hoodie

Evaluator (Qwen 3 VL)

Person 1: Bearded, wears glasses, green sweater over collared shirt — matches description. Person 2: Has facial hair (light beard), wears black hoodie — does NOT match “clean-shaven” description. VERDICT: FAIL

PASS
Two friends posing: one tall with curly red hair in denim overalls, the other short with straight brown hair in a yellow sundress

Evaluator (Qwen 3 VL)

Person 1 (tall): Curly red hair, denim overalls over a light blue t-shirt — matches description. Person 2 (short): Straight brown hair with bangs, yellow sundress — matches description. Both are visually distinct, clearly posed together, and all features align. VERDICT: PASS

Medium 100%
PASS
Four people at a dinner table: a bald man in a suit, a woman with silver hair in a red blouse, a young man with dreadlocks in a denim jacket, and a woman with a hijab in a green dress

Evaluator (Qwen 3 VL)

Bald man in suit: Present, visually distinct. Silver-haired woman in red blouse: Present, visually distinct. Young man with dreadlocks in denim jacket: Present, visually distinct. Woman with hijab in green dress: Present, visually distinct. All four match descriptions perfectly. VERDICT: PASS

PASS
Three musicians on stage: a tall woman with short pink hair playing bass, a man with a beard and bandana on drums, and a slim man with glasses playing keyboard

Evaluator (Qwen 3 VL)

Tall woman: pink short hair, playing bass — ✅ matches. Slim man with glasses: playing keyboard — ✅ matches. Bearded man with bandana: on drums — ✅ matches (visible in background). All three are visually distinct and correctly portrayed. VERDICT: PASS

PASS
Three hikers on a trail: a man with a backpack and sunhat, a woman with braided hair and hiking poles, and a teenager with a baseball cap and water bottle

Evaluator (Qwen 3 VL)

Man: Sunhat, backpack, hiking pole — matches. Woman: Braided hair, backpack, hiking poles — matches. Teenager: Baseball cap, water bottle, backpack — matches. All three are visually distinct and match described features. VERDICT: PASS
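The Human realism subcategories do not all contain the same number of prompts, so the category figure depends on how the subcategories are combined. The sketch below compares two options using pass counts tallied from the entries listed above: pooling every evaluation, and averaging the four subcategory rates. That the report pools is an inference from these counts, not something the page states.

```python
# (passes, total) per subcategory, counted from the Human realism entries above.
subcategories = {
    "Faces & Expressions": (10, 12),
    "Full Body": (11, 12),
    "Hands": (10, 12),
    "Multi-Subject": (5, 6),
}

# Pooled rate: every evaluation counts equally.
passes = sum(p for p, _ in subcategories.values())
total = sum(t for _, t in subcategories.values())
pooled = round(100 * passes / total)

# Macro average: every subcategory counts equally, regardless of size.
macro = round(sum(100 * p / t for p, t in subcategories.values()) / len(subcategories))

print(f"pooled: {pooled}%  macro: {macro}%")
```

The pooled rate is 36 of 42, which rounds to the 86% shown for Human realism. The macro average rounds to 85%, so the reported figure looks more consistent with pooling.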

Truthfulness 63%

Photorealism 100%
Medium 100%
PASS
A still life on a wooden table featuring a clear glass of water, a polished silver spoon, a folded linen napkin, and a ripe red apple — each material rendered with accurate texture and light behavior

Evaluator (Qwen 3.6 27B)

The image accurately renders the requested materials. The glass shows correct refraction and transparency, the spoon exhibits metallic reflectivity, the wood has a convincing grain texture, the linen napkin displays appropriate fabric folds and weave, and the apple has realistic skin texture and lighting. All visual properties align with physical expectations. VERDICT: PASS

PASS
A workspace desk with a brushed aluminum laptop, a leather-bound notebook, a ceramic coffee mug, and a glass paperweight — all materials visually distinct and realistic

Evaluator (Qwen 3.6 27B)

The image successfully renders the requested materials with high fidelity. The brushed aluminum laptop exhibits a realistic metallic sheen with subtle directional reflections. The leather notebook shows appropriate texture and wear. The ceramic mug has a convincing matte, glazed finish, and the glass paperweight demonstrates correct refraction and transparency. The wood grain on the desk is distinct and natural. All materials are visually distinct and realistic. VERDICT: PASS

PASS
A kitchen counter with a stainless steel pot, a wooden cutting board, a glass bottle of olive oil, and a cotton dish towel — each surface reflecting light correctly

Evaluator (Qwen 3.6 27B)

The image accurately renders the distinct visual properties of each material. The stainless steel pot shows sharp, distorted reflections consistent with polished metal. The glass bottle exhibits translucency and refraction, with the olive oil visible inside. The wooden cutting board displays natural grain and diffuse light scattering, while the cotton towel shows soft fabric texture and matte absorption. All materials reflect light correctly according to their physical characteristics. VERDICT: PASS

Physics & Reflections 75%
Easy 33%
PASS
A red sphere on a white surface with a single shadow cast to the right, consistent with a light source from the upper left

Evaluator (Qwen 3.6 27B)

The image shows a red sphere on a white surface. The shadow is cast to the right and slightly downward, which is consistent with a light source from the upper left. The shadow’s softness and gradient are physically plausible for a diffuse light source. The sphere’s highlight on the upper-left side also matches the lighting direction. No distortions or color anomalies are present in the shadow or reflection. VERDICT: PASS

FAIL
A coffee mug on a sunny windowsill casting a sharp shadow to the left

Evaluator (Qwen 3.6 27B)

The shadow is cast to the left, which matches the prompt. However, the shadow’s shape is physically implausible: it shows a distorted, elongated form that doesn’t correspond to the mug’s actual geometry — especially the handle and rim. The reflection in the coffee also appears unnatural, with an odd dark patch that doesn’t match typical liquid surface behavior. These distortions break physical plausibility. VERDICT: FAIL

FAIL
A person standing outdoors at midday with a short shadow directly below them

Evaluator (Qwen 3.6 27B)

The shadow is extremely elongated and stretches far forward, which contradicts the “midday” condition (where shadows should be short and directly beneath). The lighting also suggests a low sun angle (long shadows, strong backlighting), not midday. Additionally, the shadow’s shape appears distorted and overly broad at the base, lacking realistic perspective tapering. These inconsistencies make the shadow physically implausible for the described scenario. VERDICT: FAIL
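The evaluator's reasoning about midday shadows rests on basic geometry: on flat ground, a vertical object of height h casts a shadow of length h / tan(e), where e is the sun's elevation above the horizon. The sketch below works through that relation for a few elevations. The 1.7 m height and the sample angles are illustrative assumptions, and the actual midday elevation depends on latitude and season.

```python
import math


def shadow_length(height_m: float, sun_elevation_deg: float) -> float:
    """Length of the shadow a vertical object casts on flat ground."""
    return height_m / math.tan(math.radians(sun_elevation_deg))


person = 1.7  # assumed height in metres, for illustration only

# High sun (near midday), mid-height sun, and low sun (near sunrise or sunset).
for elevation in (75, 45, 15):
    length = shadow_length(person, elevation)
    print(f"sun at {elevation} deg: shadow {length:.2f} m")
```

At a 75 degree elevation the shadow is well under half a metre, while at 15 degrees it stretches to more than six metres, which is the kind of elongation the evaluator flags as inconsistent with midday.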

Medium 67%
FAIL
A woman standing in front of a full-length mirror, her reflection accurately showing the back of her outfit and reversed orientation

Evaluator (Qwen 3.6 27B)

The reflection is physically implausible. The woman is facing the mirror, so her reflection should show her front. Instead, the reflection shows her back, which contradicts the laws of optics. Additionally, the reflection’s orientation and details do not match a true mirror image. VERDICT: FAIL

PASS
A cat sitting on a polished marble floor with its reflection visible below

Evaluator (Qwen 3.6 27B)

The reflection is physically plausible. It appears directly beneath the cat, matching its orientation and position. The slight blur and distortion are consistent with a polished but not perfectly mirror-like marble surface. The colors in the reflection match the cat’s fur tones, and the lighting direction aligns with the scene. No obvious artifacts or inconsistencies are present. VERDICT: PASS

PASS
A vase of flowers on a glass table with a clear reflection visible on the table surface

Evaluator (Qwen 3.6 27B)

The reflection is vertically inverted and positioned directly beneath the vase, which is physically correct for a horizontal glass surface. The distortion of the stems and vase shape in the reflection matches the curvature of the glass and the angle of view. The lighting and color tones in the reflection are consistent with the objects above, and the transparency of the glass allows the wood grain to show through both the vase and its reflection. VERDICT: PASS

Hard 100%
PASS
A row of trees reflected in a calm lake at sunset, with the reflection slightly darker and gently rippled

Evaluator (Qwen 3.6 27B)

The reflections are physically plausible. They are vertically inverted, matching the trees’ positions and shapes. The gentle ripples distort the reflection realistically, consistent with a calm lake surface. The reflection is slightly darker than the original, as expected due to light absorption and scattering in water. The warm sunset tones are preserved in the reflection, maintaining color coherence. No directional or orientation errors are present. VERDICT: PASS

PASS
A city skyline reflected in a rain puddle on a street, with mild distortion and color shift

Evaluator (Qwen 3.6 27B)

The reflections in the puddle are physically plausible. The orientation is correct (inverted), and the perspective matches the low camera angle. The distortion is appropriate for a water surface, showing slight waviness and blurring consistent with ripples. The color shift is subtle but present, with the reflection appearing slightly darker and more muted than the actual buildings, which is realistic due to light absorption and surface texture. The direction of the reflection aligns with the pos

PASS
A mountain range reflected in a still alpine lake, with the reflection inverted and subtly muted in color

Evaluator (Qwen 3.6 27B)

The reflection is physically plausible. It is correctly inverted, matching the mountain range’s orientation. The colors are subtly muted and darker, consistent with how water reflects light. There is minimal distortion, suggesting a still lake surface. Shadows and highlights align logically with the lighting direction. No unnatural warping or misalignment is present. VERDICT: PASS

Extreme 100%
PASS
A clear glass sphere sitting on a mirror surface near a pool of water, showing refraction through the sphere, reflection on the mirror, and water reflection simultaneously

Evaluator (Qwen 3.6 27B)

The image displays a clear glass sphere on a reflective surface. The reflection of the sphere on the mirror is vertically inverted and correctly positioned directly beneath the object, matching the expected physics of a mirror reflection. The refraction through the sphere inverts the background scene (pool and trees), which is physically accurate for a convex lens effect. The lighting and shadows are consistent with an outdoor setting. The colors and distortions align with optical principles. V

PASS
A chrome sculpture in a room with a glass floor and large windows, showing multiple reflections and environmental mapping on every surface

Evaluator (Qwen 3.6 27B)

The reflections on the chrome sculpture are physically plausible, accurately mirroring the surrounding windows, greenery, and ceiling with appropriate distortion based on the curved surface. The glass floor correctly reflects the sculpture and the room’s structure. However, the lighting appears inconsistent: the sculpture is brightly lit from above and the side, yet there are no corresponding sharp shadows cast on the floor or walls, which would be expected in such a high-contrast environment. D

PASS
A crystal wine glass on a polished marble counter next to a window overlooking a lake — showing refraction in the glass, reflection on the marble, and the lake's own reflective surface

Evaluator (Qwen 3.6 27B)

The reflections and refractions are physically plausible. The glass stem casts a clear, correctly oriented reflection on the polished marble surface. The lake outside shows a soft, natural reflection of the sky and trees, consistent with water’s reflective properties. Light direction from the window aligns with highlights on the glass and counter. No distortions or color mismatches are evident. VERDICT: PASS
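Some evaluator responses in this report are cut off before their verdict, so anything that reads verdicts from the response text needs to treat a missing verdict as unknown rather than guess. The sketch below assumes each complete response ends with "VERDICT: PASS" or "VERDICT: FAIL", as the complete responses here do. The function name is hypothetical and this is not ImageBench's own parser.

```python
import re
from typing import Optional

# Matches the closing verdict marker used in the evaluator responses.
VERDICT_RE = re.compile(r"VERDICT:\s*(PASS|FAIL)\b")

def parse_verdict(response: str) -> Optional[str]:
    """Return 'PASS' or 'FAIL', or None when the response has no verdict."""
    matches = VERDICT_RE.findall(response)
    # Use the last marker in case the evaluator quotes the phrase earlier.
    return matches[-1] if matches else None

print(parse_verdict("No distortions are evident. VERDICT: PASS"))
print(parse_verdict("The colors align with optical principles. V"))
```

Returning `None` for truncated text keeps an incomplete response from being counted as either a pass or a fail.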

World Knowledge 42%
Easy 67%
PASS
The Eiffel Tower in Paris on a clear day, architecturally accurate with its iron lattice structure and three levels

Evaluator (Qwen 3.6 27B)

The image depicts the Eiffel Tower with high architectural accuracy. The iron lattice structure, the curvature of the arches, and the placement of the platforms are correct. The perspective from below looking up is consistent with real-world photography. The sky is clear with some clouds, matching the prompt. No significant inaccuracies are present. VERDICT: PASS

PASS
The Taj Mahal in Agra seen from the reflecting pool, with accurate dome, minarets, and symmetry

Evaluator (Qwen 3.6 27B)

The image depicts the Taj Mahal with high architectural accuracy. The central dome, four minarets, and the symmetrical layout are correctly represented. The perspective from the reflecting pool is authentic, showing the characteristic long water channel leading to the mausoleum. The surrounding gardens and cypress trees align with the real-world site. No significant factual or structural inaccuracies are present. VERDICT: PASS

FAIL
The Statue of Liberty on her island pedestal, with correct torch, crown, tablet, and green patina

Evaluator (Qwen 3.6 27B)

The subject is recognizable as the Statue of Liberty, but it contains significant factual and architectural inaccuracies. The torch is incorrectly depicted as a solid golden flame rather than the actual metal structure with a glass enclosure. The facial features are distorted and lack the correct proportions. Additionally, the surrounding island layout and background structures do not match the real-world geography of Liberty Island. VERDICT: FAIL

Medium 0%
FAIL
A traditional Japanese shamisen instrument with accurate three-stringed design, wooden body, and bachi plectrum

Evaluator (Qwen 3.6 27B)

The image depicts a shamisen-like instrument with a wooden body and a bachi (plectrum), which aligns with the prompt. However, the instrument has four tuning pegs instead of the traditional three, and the strings appear to be four rather than three. These inaccuracies make it not fully factually accurate for a traditional shamisen. VERDICT: FAIL

FAIL
A medieval astrolabe with accurate rete, tympan, and engraved degree markings on brass

Evaluator (Qwen 3.6 27B)

The image depicts a complex brass instrument resembling an astrolabe, with concentric rings and a central rotating structure. However, the engraved markings are nonsensical gibberish rather than accurate degree markings or Latin text. The mechanical components appear fused and structurally incoherent, lacking the precise engineering of a real medieval astrolabe. While visually evocative, it fails factual accuracy in details critical to the prompt. VERDICT: FAIL

FAIL
A Japanese kintsugi bowl — cracked ceramic repaired with visible gold lacquer lines

Evaluator (Qwen 3.6 27B)

The image depicts a bowl with a crackle glaze pattern filled with gold, which visually mimics the aesthetic of kintsugi. However, kintsugi is defined by the repair of actual broken pieces using lacquer and gold, resulting in distinct, irregular fracture lines. This image shows a uniform, manufactured crackle pattern (crazing) rather than repaired breaks. Therefore, it is not factually accurate to the specific subject of "kintsugi" (repair), but rather a stylistic imitation or crackle-glaze ceram

Hard 0%
FAIL
A detailed diagram of the water cycle showing evaporation from the ocean, condensation forming clouds, precipitation as rain, and collection in rivers flowing back to the sea

Evaluator (Qwen 3.6 27B)

The image depicts a surreal, artistic representation of the world map underwater with fantasy elements like flowers and a figure on a swing. It does not show a diagram of the water cycle with labeled stages (evaporation, condensation, precipitation, collection). The subject is not recognizable as a scientific diagram, nor is it factually accurate in representing the water cycle process. VERDICT: FAIL

FAIL
A cross-section showing tectonic plates at a convergent boundary with subduction, a trench, volcanic arc, and layered mantle

Evaluator (Qwen 3.6 27B)

The image depicts a dramatic landscape with a volcano and a foreground that resembles a geological cross-section. However, it fails to accurately represent the requested scientific concepts. There is no visible subduction zone, oceanic trench, or distinct layered mantle structure. The "cross-section" appears to be a stylized, artistic cliff face rather than a factual diagram of tectonic plates. The core criteria of showing a convergent boundary with subduction are not met. VERDICT: FAIL

FAIL
An anatomical illustration of the human heart showing four chambers, valves, aorta, pulmonary arteries, and veins with correct placement

Evaluator (Qwen 3.6 27B)

The image depicts a human heart with high visual fidelity, but it contains significant anatomical inaccuracies. The internal structures are distorted; for instance, the septum appears to have a honeycomb texture rather than muscular tissue, and the valve structures are misshapen and incorrectly positioned. The major vessels (aorta, pulmonary arteries) are also arranged in a non-anatomical manner. Therefore, it fails the requirement for factual accuracy. VERDICT: FAIL

Extreme 100%
PASS
A Morpho menelaus butterfly with accurate iridescent blue dorsal wings, brown ventral wing pattern with eyespots, and correct body morphology

Evaluator (Qwen 3.6 27B)

The image depicts a butterfly that is clearly recognizable as a Morpho species, featuring the characteristic iridescent blue dorsal wings and brown ventral patterns with eyespots. The body morphology, including antennae and wing venation, appears anatomically correct. While the specific species identification (Morpho menelaus) is difficult to confirm without expert taxonomic analysis, the visual representation aligns well with the general characteristics of Morpho butterflies. There are no obvio

PASS
The Hagia Sophia interior showing the massive central dome, pendentives, semi-domes, and Islamic calligraphy medallions alongside Byzantine mosaics

Evaluator (Qwen 3.6 27B)

The image depicts the interior of the Hagia Sophia with high architectural accuracy. The massive central dome, the semi-domes extending southward, and the pendentives are clearly visible and correctly positioned. The large Islamic calligraphy medallions hanging in the nave are accurately represented, as are the Byzantine mosaics (such as the angel figures) on the pendentives and walls. The lighting and perspective match the real-world structure. VERDICT: PASS

PASS
A specimen of bismuth crystal showing accurate iridescent oxide layers, geometric hopper crystal formation, and characteristic staircase structure

Evaluator (Qwen 3.6 27B)

The image depicts a bismuth crystal with high fidelity. The geometric hopper formation is distinct, featuring the characteristic stepped, staircase-like structure typical of bismuth growth. The iridescent oxide layers are rendered accurately, displaying the expected spectrum of colors (blues, greens, purples) caused by thin-film interference. The subject is clearly recognizable and factually accurate. VERDICT: PASS
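The World Knowledge percentages can be reproduced from the twelve verdicts listed in this subcategory. The sketch below assumes each figure is the share of PASS verdicts rounded to the nearest whole percent; that rounding rule is an assumption that happens to match the numbers shown, not a documented ImageBench formula.

```python
from typing import Dict, List

def pass_rate(verdicts: List[str]) -> int:
    """Percentage of PASS verdicts, rounded to the nearest integer."""
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    passed = sum(1 for verdict in verdicts if verdict == "PASS")
    return round(100 * passed / len(verdicts))

# Verdicts transcribed from the World Knowledge entries, by difficulty tier.
world_knowledge: Dict[str, List[str]] = {
    "Easy": ["PASS", "PASS", "FAIL"],
    "Medium": ["FAIL", "FAIL", "FAIL"],
    "Hard": ["FAIL", "FAIL", "FAIL"],
    "Extreme": ["PASS", "PASS", "PASS"],
}

for tier, verdicts in world_knowledge.items():
    print(f"{tier} {pass_rate(verdicts)}%")

all_verdicts = [v for verdicts in world_knowledge.values() for v in verdicts]
print(f"World Knowledge {pass_rate(all_verdicts)}%")
```

The subcategory figure is computed over all twelve verdicts (5 of 12), not as an average of the four tier percentages, although with equal tier sizes the two agree.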

Professional Studio 82%

Camera & Lighting 83%
Easy 100%
PASS
A wide-angle shot of a modern living room interior showing expansive space with slight barrel distortion at the edges

Evaluator (Gemma 4 26B)

The image features a wide-angle perspective with a distinct circular vignette that creates a strong barrel distortion effect at the edges, matching the prompt's requirement. The lighting is soft and natural, coming from large windows, creating a modern interior feel. The depth of field is deep, keeping the room's expanse in focus. The visual characteristics align perfectly with the specified camera and lighting descriptions. VERDICT: PASS

PASS
A close-up macro shot of a single dewdrop on a rose petal, shallow depth of field

Evaluator (Gemma 4 26B)

The image perfectly matches the prompt. It features a close-up macro shot of a single, clear dewdrop resting on a red rose petal. The depth of field is shallow, with the background being heavily blurred (bokeh effect), focusing all attention on the droplet and the texture of the petal. The lighting is soft and natural, highlighting the transparency and reflections within the water drop. VERDICT: PASS

PASS
An aerial bird's-eye view looking straight down at a traffic roundabout with cars and trees

Evaluator (Gemma 4 26B)

The image perfectly matches the prompt. It features an aerial bird's-eye view looking straight down at a traffic roundabout. The perspective is a top-down orthographic view, the depth of field is deep (keeping the entire scene in focus), and the lighting is even, consistent with a high-angle daylight shot. All elements—the roundabout, cars, and surrounding trees—are correctly positioned according to the description. VERDICT: PASS

Medium 100%
PASS
A portrait lit with classic Rembrandt lighting — triangle of light on the shadowed cheek, moody atmosphere

Evaluator (Gemma 4 26B)

The image features a portrait with a clear "Rembrandt lighting" pattern. There is a distinct, small triangle of light visible on the shadowed cheek (the viewer's left, the subject's right), which is the hallmark of this lighting style. The lighting is moody and directional, creating a dramatic atmosphere. The depth of field is shallow, focusing on the subject's face, and the perspective is a standard portrait shot. VERDICT: PASS

PASS
A landscape photograph bathed in golden hour light — long warm shadows, orange sky, soft glow

Evaluator (Gemma 4 26B)

The image perfectly matches the prompt. It depicts a landscape during golden hour with a prominent orange sky and a bright sun low on the horizon. The lighting creates long, warm shadows across the grassy field and silhouettes the trees. The soft glow and warm color palette are consistent with the requested camera and lighting settings. VERDICT: PASS

PASS
A high-key portrait of a person against a bright white background, minimal shadows, airy and luminous

Evaluator (Gemma 4 26B)

The image perfectly matches the prompt. The background is a bright, seamless white, creating a high-key effect. The lighting is soft, even, and luminous, with minimal shadows on the subject's face and neck. The perspective is a standard portrait shot, and the overall aesthetic is airy and bright, adhering strictly to the requested camera and lighting specifications. VERDICT: PASS

Hard 67%
PASS
A portrait shot at 85mm f/1.4 with creamy bokeh background, backlit with warm rim light separating the subject from a blurred garden

Evaluator (Gemma 4 26B)

The image perfectly matches the prompt. The 85mm perspective is evident in the flattering portrait framing. The depth of field is shallow, creating a "creamy bokeh" background with soft, out-of-focus flowers and foliage. Most importantly, there is a distinct warm rim light catching the subject's hair and shoulders, effectively separating her from the background as requested. VERDICT: PASS

PASS
A street photograph with shallow depth of field as if shot at 50mm f/1.8, subject sharp with soft city lights in the background

Evaluator (Gemma 4 26B)

The image perfectly matches the prompt. The subject is sharp and in focus, while the background features a shallow depth of field with soft, out-of-focus city lights (bokeh). The perspective is a close-up portrait typical of a 50mm lens, and the lighting creates a naturalistic street photography aesthetic. VERDICT: PASS

FAIL
A food photograph shot from above at 35mm f/2.8 with natural window light from the left, slight vignette, warm color temperature

Evaluator (Gemma 4 26B)

The image fails to meet the prompt's requirements. The prompt specifies a shot "from above" (top-down/flat lay perspective), but the image is shot from a low, side-angle perspective. Additionally, the light source is not clearly coming from the left; the lighting is relatively even and lacks the distinct directional shadows and "natural window light" characteristics described. VERDICT: FAIL

Extreme 67%
PASS
A tilt-shift photograph of a city street making real-life look like a miniature model — selective focus strip, saturated colors

Evaluator (Gemma 4 26B)

The image successfully implements the tilt-shift effect. It features a distinct selective focus strip where the center of the street is sharp while the top and bottom are heavily blurred, creating a miniature model appearance. The colors are highly saturated, and the perspective mimics a high-angle miniature shot. The lighting is even and bright, consistent with the described aesthetic. VERDICT: PASS

PASS
A cinematic frame with blue and orange anamorphic lens flare streaking horizontally across a dark moody scene

Evaluator (Gemma 4 26B)

The image features a dark, moody scene with prominent horizontal light streaks. There are distinct blue and orange light patterns: a bright blue starburst/flare on the left and an orange/amber flare/glow on the right. These streaks stretch horizontally across the frame, matching the "blue and orange anamorphic lens flare" description. The lighting pattern and cinematic atmosphere align perfectly with the prompt's requirements. VERDICT: PASS

FAIL
A portrait with dramatic split lighting — half the face brightly lit, the other half in deep shadow, high contrast chiaroscuro

Evaluator (Gemma 4 26B)

The image features a close-up portrait with high-contrast lighting. However, the lighting pattern does not match the "split lighting" description. Instead of one side of the face being in deep shadow, the light is distributed more evenly across the face, with subtle shadows around the nose and eyes, but no dramatic division between light and dark halves. The lighting is soft and diffused rather than high-contrast chiaroscuro. VERDICT: FAIL

Color Precision 92%
Easy 100%
PASS
A bright red sports car parked on a gray asphalt road — the red is vivid and unambiguous

Evaluator (Gemma 4 26B)

The image features a sports car that is a bright, vivid red, which perfectly matches the prompt's requirement for "bright red" and "vivid and unambiguous" color. The car is parked on a gray asphalt road, also matching the description. The color saturation and hue are consistent with the specified colors. VERDICT: PASS

PASS
A woman wearing a royal blue evening dress against a neutral background

Evaluator (Gemma 4 26B)

The woman is wearing a deep, dark blue evening dress, which accurately represents the "royal blue" color specified in the prompt. The background is a solid, light grey/beige, which qualifies as a "neutral background." The colors in the image perfectly match the descriptions provided. VERDICT: PASS

PASS
A field of pure sunflower yellow flowers under a clear blue sky

Evaluator (Gemma 4 26B)

The image accurately reflects the prompt's color requirements. The sunflowers are a vibrant, "pure sunflower yellow," and the sky is a clear, bright blue. There are no deviations in color temperature or hue that would suggest a mismatch with the described colors. VERDICT: PASS

Medium 67%
PASS
A bedroom wall painted in #FF6B35 (burnt orange) with white furniture and natural daylight

Evaluator (Gemma 4 26B)

The wall color in the image is a vibrant, saturated orange that closely aligns with the description of "#FF6B35 (burnt orange)". The color is consistent across the wall surface and matches the warm, reddish-orange tone specified in the prompt. The furniture is white, and the lighting appears to be natural daylight as requested. VERDICT: PASS

PASS
A forest scene where the foliage is precisely #2D5F2D (dark muted green) with brown tree trunks

Evaluator (Gemma 4 26B)

The image depicts a forest scene with foliage that is a dark, muted green, consistent with the description of #2D5F2D. The tree trunks are a natural brown color. The colors in the image align well with the specific hex value and the descriptive text provided in the prompt. VERDICT: PASS

FAIL
A room with walls in #4A90D9 (medium blue) with warm wooden floor and white trim

Evaluator (Gemma 4 26B)

The walls in the image are a deep, saturated blue. While the prompt specifies #4A90D9 (a medium, slightly lighter blue), the color in the image appears significantly darker and more navy than the requested hex code. The wooden floor is warm, and the trim is white, but the primary wall color deviates from the specific medium blue requested. VERDICT: FAIL
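A hex-code mismatch like this one can also be expressed numerically. The sketch below converts hex codes to RGB and measures the straight-line distance between the requested color and a sampled one. It is a rough illustration, not how the evaluator reached its verdict. The sampled navy value `#1F3A6E` is hypothetical, and Euclidean distance in RGB is only a crude proxy for perceived difference.

```python
import math
from typing import Tuple

RGB = Tuple[int, int, int]

def hex_to_rgb(code: str) -> RGB:
    """Convert a '#RRGGBB' string to an (r, g, b) tuple of 0-255 integers."""
    digits = code.lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected a 6-digit hex color, got {code!r}")
    return (int(digits[0:2], 16), int(digits[2:4], 16), int(digits[4:6], 16))

def rgb_distance(a: RGB, b: RGB) -> float:
    """Euclidean distance between two colors in RGB space (0 to about 441.7)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

requested = hex_to_rgb("#4A90D9")  # medium blue from the prompt
sampled = hex_to_rgb("#1F3A6E")    # hypothetical navy sampled from a wall pixel
print(requested, sampled, round(rgb_distance(requested, sampled), 1))
```

A pass/fail threshold on this distance would be a further assumption; a perceptual metric such as CIEDE2000 would be a better basis for one.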

Hard 100%
PASS
A minimalist room with walls in #F5E6CC (warm cream), a sofa in #2C3E50 (dark navy), and accent pillows in #E74C3C (bright red)

Evaluator (Gemma 4 26B)

The image accurately reflects the requested colors. The walls are a warm cream color (matching #F5E6CC), the sofa is a dark navy blue (matching #2C3E50), and the accent pillows are a bright red (matching #E74C3C). The color tones and specific hues align perfectly with the hex codes and descriptions provided in the prompt. VERDICT: PASS

PASS
A product flat-lay on a #1A1A2E (deep dark blue) background, with a #F0C27F (soft gold) watch and #FFFFFF (pure white) earbuds

Evaluator (Gemma 4 26B)

The image matches the prompt's color requirements. The background is a deep, dark blue consistent with #1A1A2E. The watch is a soft gold color, matching #F0C27F. The earbuds are pure white, consistent with #FFFFFF. All specified colors and their corresponding hex values are accurately represented in the flat-lay composition. VERDICT: PASS

PASS
A graphic design composition with background #FAFAFA (near-white), main shape #FF4500 (orange-red), and accent lines #333333 (dark charcoal)

Evaluator (Gemma 4 26B)

The image features a large, rounded shape with a vibrant orange-red color, which closely aligns with the requested #FF4500. The background is a near-white, off-white color, consistent with #FAFAFA. The thick outlines and accent lines are a dark charcoal/black, matching the #333333 requirement. The color accuracy is high and adheres to the specific hex values provided in the prompt. VERDICT: PASS

Extreme 100%
PASS
A smooth gradient on a wall going from #FF0000 (red) on the left to #0000FF (blue) on the right, transitioning evenly through purple

Evaluator (Gemma 4 26B)

The image displays a smooth gradient starting with a bright red (#FF0000) on the left side, transitioning through a deep purple middle section, and ending with a vibrant blue (#0000FF) on the right. The transition is even and follows the color progression specified in the prompt accurately. VERDICT: PASS
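For reference, an even red-to-blue transition can be modeled as per-channel linear interpolation between the two hex codes. The sketch below assumes interpolation directly on sRGB channel values, which is one common convention rather than anything the prompt specifies. The function name is hypothetical.

```python
def gradient_color(start_hex: str, end_hex: str, t: float) -> str:
    """Color at position t (0 = start, 1 = end) of a linear RGB gradient."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be between 0 and 1")
    start = [int(start_hex.lstrip("#")[i:i + 2], 16) for i in (0, 2, 4)]
    end = [int(end_hex.lstrip("#")[i:i + 2], 16) for i in (0, 2, 4)]
    # Interpolate each channel independently and round to the nearest integer.
    mixed = [round(a + (b - a) * t) for a, b in zip(start, end)]
    return "#{:02X}{:02X}{:02X}".format(*mixed)

for position in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(position, gradient_color("#FF0000", "#0000FF", position))
```

Under this model the midpoint is `#800080`, a purple, which matches the "transitioning evenly through purple" wording of the prompt.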

PASS
A sunset sky gradient on a flat surface from #FF6B35 (orange) at the bottom to #1B1464 (deep indigo) at the top

Evaluator (Gemma 4 26B)

The image displays a sunset sky with a gradient. The bottom of the sky/horizon area shows a bright orange color, which aligns with #FF6B35. The top of the image transitions into a deep, dark indigo/purple hue, which aligns with #1B1464. The color transition follows the requested gradient from orange at the bottom to deep indigo at the top. VERDICT: PASS

PASS
A horizontal banner with a gradient from #00C9FF (cyan) on the left to #92FE9D (mint green) on the right

Evaluator (Gemma 4 26B)

The image displays a horizontal banner with a gradient. On the left side, the color is a bright cyan, which aligns with the hex code #00C9FF. On the right side, the color transitions into a light mint green, which matches the hex code #92FE9D. The gradient transition is smooth and accurately reflects the specific color requirements provided in the prompt. VERDICT: PASS

Photorealism 33%
Easy 33%
FAIL
A single red sneaker on a clean white background with soft studio lighting and gentle shadow — product photography style

Evaluator (Gemma 4 26B)

The image fails to meet the criteria for a professional product shot. Upon close inspection, the sneaker exhibits significant AI artifacts, particularly in the texture and structure. The way the laces and the tongue of the shoe merge into the body is nonsensical and physically impossible. The texture of the fabric is inconsistent, and the overall geometry of the shoe is warped and unnatural, making it clearly identifiable as an AI-generated image rather than a real photograph. VERDICT: FAIL

FAIL
A clear glass perfume bottle on a reflective black surface with studio lighting, no background distractions

Evaluator (Gemma 4 26B)

The image contains significant AI artifacts that prevent it from looking like a real photograph. The most obvious issue is the bottom of the bottle, where the glass structure becomes nonsensical and merges into the reflection. Additionally, the internal geometry of the glass is inconsistent, with strange, non-physical light refractions and a lack of a coherent liquid or base. The reflection at the bottom is also distorted and lacks physical accuracy. VERDICT: FAIL

PASS
A ripe green apple on a seamless white background with soft side lighting and subtle shadow — catalog style

Evaluator (Gemma 4 26B)

The image successfully meets the prompt's requirements. It features a ripe green apple on a seamless white background with soft side lighting and a subtle shadow. The texture of the skin, the lighting, and the shadow are consistent with a high-quality studio product shot. There are no obvious AI artifacts, distortions, or unnatural textures visible; the apple looks like a real photograph. VERDICT: PASS

Graphical design 29%

Data Visualisation 0%
Easy 0%
FAIL
A bar chart showing monthly sales for January through June with values 12K, 18K, 15K, 22K, 19K, 25K — labeled axes, color-coded bars, and a title 'Monthly Sales 2025'

Evaluator (Qwen 3 VL)

The chart misrepresents data: y-axis labels are incorrect (e.g., “N6”, “SO”), and x-axis labels show values (12K, 15K) instead of months. The values on bars (18K, 15K, etc.) don’t match the requested sequence (12K, 18K, 15K, 22K, 19K, 25K). The title is correct, but axes are unlabeled with months. Chart type is a bar chart, but visual clutter and errors undermine readability. VERDICT: FAIL

FAIL
A pie chart showing market share: Company A 35%, Company B 25%, Company C 20%, Others 20% — each slice a different color with labels and percentages

Evaluator (Qwen 3 VL)

The chart is a pie chart but mislabels categories: “Company A” and “Company B” are both labeled 35% (should be 35% and 25% respectively), and “Others” is labeled 20% but appears as “Ottuge” and “Socker”/“Ocker” (incorrect names). Percentages and labels are unreadable or wrong. Chart type matches, but content is inaccurate. VERDICT: FAIL

FAIL
A simple line graph showing daily website visitors over one week (Mon-Sun): 1200, 1500, 1100, 1800, 2000, 1700, 1400 — with grid lines, axis labels, and title 'Weekly Traffic'

Evaluator (Qwen 3 VL)

The chart misrepresents data: y-axis labels are illegible symbols, not numbers; x-axis values are reversed and misaligned (e.g., 1400 appears rightmost, but 1200 is leftmost). The data points (1200, 1500, 1100, 1800, 2000, 1700, 1400) are visually placed incorrectly. Grid lines and title exist, but axis labels are unreadable. Chart type is a line graph, but presentation fails basic readability. VERDICT: FAIL

Layout & Design 33%
Easy 100%
PASS
A movie night poster with a large popcorn image in the center, the title 'MOVIE NIGHT' in bold at the top, and the tagline 'Every Friday at 8 PM' at the bottom

Evaluator (Qwen 3 VL)

All required elements are present: bold “MOVIE NIGHT” at top, large popcorn center, and “Every Friday at 8 PM” at bottom. Text is readable and correctly placed. Visual hierarchy is strong, with popcorn as focal point and title/tagline appropriately sized. Layout matches prompt exactly. VERDICT: PASS

PASS
An event poster featuring a guitar illustration, the headline 'LIVE MUSIC FEST' in large text, and 'Tickets available now' as a tagline below

Evaluator (Qwen 3 VL)

All required elements are present: guitar illustration, headline “LIVE MUSIC FEST” with correct visual hierarchy (large, centered), and tagline “Tickets available now” below in a button. Text is readable and correctly placed. Layout follows prompt structure. VERDICT: PASS

A book club poster with an open book illustration, the titlePASS
A book club poster with an open book illustration, the title 'JOIN THE BOOK CLUB' prominently displayed, and 'Meets every Wednesday' as the tagline

Evaluator (Qwen 3 VL)

All required elements are present: “JOIN THE BOOK CLUB” title at top with strong visual hierarchy, an open book illustration centered below, and “Meets every Wednesday” tagline at bottom. Text is readable and correctly placed. Layout follows prompt structure with clear emphasis on the main title. VERDICT: PASS

Medium 0%
A landing page design with a hero header image at the top, aFAIL
A landing page design with a hero header image at the top, a headline and subtitle in the middle, a 'Sign Up' call-to-action button, and a footer with three social media icons

Evaluator (Qwen 3 VL)

The layout includes a hero image, headline/subtitle, “Sign Up” button, and footer with social icons — all present. However, the headline and subtitle text is gibberish and unreadable, violating the requirement for readable, correctly placed text. Visual hierarchy is mostly intact, but illegible text undermines usability. VERDICT: FAIL

A newsletter layout with a banner image header, a two-columnFAIL
A newsletter layout with a banner image header, a two-column body section with text on the left and an image on the right, a highlighted quote block, and a footer with unsubscribe link

Evaluator (Qwen 3 VL)

The layout includes a banner header, two-column body (text left, image right), a highlighted quote block, and a footer with an unsubscribe link. However, all text is gibberish or placeholder, unreadable, and improperly formatted. Visual hierarchy is present but compromised by illegible content. The structure is correct, but functionality fails due to unreadable text. VERDICT: FAIL

A restaurant menu layout with a logo header, three category FAIL
A restaurant menu layout with a logo header, three category sections (appetizers, mains, desserts) each with items and prices, and a footer with address and hours

Evaluator (Qwen 3 VL)

The menu includes a logo header, three category sections (appetizers, mains, desserts), and a footer with address/hours — all visually distinct. However, all text is gibberish, unreadable, and incorrectly placed, violating the prompt’s requirement for readable, correctly positioned content. VERDICT: FAIL

Hard 0%
A magazine double-page spread with a left page containing a FAIL
A magazine double-page spread with a left page containing a large hero photo, and a right page with a two-column article, a pull quote, a sidebar with author bio, and a page number

Evaluator (Qwen 3 VL)

The layout matches the prompt: left page has a large hero photo, right page has two columns, a pull quote, and a sidebar with author bio. Page numbers (18, 19) are visible. However, all text is gibberish — unreadable and incorrectly placed — violating the requirement for readable, correctly placed text. VERDICT: FAIL

A mobile app UI mockup showing a dashboard with a top navigaFAIL
A mobile app UI mockup showing a dashboard with a top navigation bar, a search field, a 2x2 grid of feature cards with icons, a list of recent activity items, and a bottom tab bar

Evaluator (Qwen 3 VL)

The layout includes a top navigation bar, search field, 2x2 feature cards, recent activity list, and bottom tab bar — all present. However, text is garbled and unreadable, violating the requirement for correct, readable text placement. Visual hierarchy is mostly intact, but illegible text undermines usability. VERDICT: FAIL

A fashion magazine spread with a full-bleed left-page photo,FAIL
A fashion magazine spread with a full-bleed left-page photo, right page with three-column layout, headline, body copy, a smaller inset image, and styled page numbers

Evaluator (Qwen 3 VL)

The left page has a full-bleed photo as required. The right page features a headline, body copy, and a smaller inset image within a bordered frame, with page number “+35” at bottom right. However, all text is gibberish and unreadable, violating the requirement for readable, correctly placed text. Layout structure is present but text fails functional criteria. VERDICT: FAIL
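
The 33% shown for Layout & Design is consistent with the per-tier results above: three passes in Easy and none in Medium or Hard, i.e. 3 of 9 evaluations. The report does not state how tiers are aggregated, so the following is only a minimal sketch, assuming every evaluation carries equal weight and percentages are rounded to the nearest whole number. The function name `pass_rate` and the dictionary layout are hypothetical and not part of ImageBench.

```python
def pass_rate(verdicts):
    """Return the share of PASS verdicts as a whole-number percentage."""
    if not verdicts:
        return 0
    passes = sum(1 for v in verdicts if v == "PASS")
    return round(100 * passes / len(verdicts))


# Verdicts transcribed from the Layout & Design results above.
layout_and_design = {
    "Easy": ["PASS", "PASS", "PASS"],
    "Medium": ["FAIL", "FAIL", "FAIL"],
    "Hard": ["FAIL", "FAIL", "FAIL"],
}

# Per-tier rates, then the pooled rate over all nine evaluations
# (assumption: equal weight per evaluation, not per tier).
tier_rates = {tier: pass_rate(v) for tier, v in layout_and_design.items()}
overall = pass_rate([v for vs in layout_and_design.values() for v in vs])

print(tier_rates)
print(overall)
```

With three evaluations per tier, pooling all evaluations and averaging the tier percentages give the same result, so this section alone cannot tell the two aggregation rules apart.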

Style Diversity 33%
Easy 33%
A golden retriever sitting in a garden, painted in the stylePASS
A golden retriever sitting in a garden, painted in the style of an oil painting with visible brushstrokes and rich texture

Evaluator (Qwen 3 VL)

The image clearly depicts a golden retriever in a garden setting. It exhibits visible brushstrokes and rich texture consistent with an oil painting style, especially in the fur and background foliage. The lighting and color palette enhance the painterly effect. All key characteristics of the requested style are present. VERDICT: PASS

A golden retriever sitting in a garden, rendered in JapaneseFAIL
A golden retriever sitting in a garden, rendered in Japanese anime style with flat colors and large expressive eyes

Evaluator (Qwen 3 VL)

The image is a photorealistic depiction of a golden retriever, not Japanese anime style. It lacks flat colors, large expressive eyes, or anime aesthetics. The style requested is entirely absent. VERDICT: FAIL

A golden retriever sitting in a garden, rendered in pixel arFAIL
A golden retriever sitting in a garden, rendered in pixel art style with a 32x32 grid and limited color palette

Evaluator (Qwen 3 VL)

The image lacks true pixel art characteristics: it’s not a 32x32 grid, and the “pixels” are large, soft, and painterly, not blocky. The dog’s fur and background show smooth gradients and fine detail, contradicting the limited palette and low-resolution style. The prompt’s core requirements are not met. VERDICT: FAIL

Medium 0%
A city street scene in the style of Art Deco — geometric shaFAIL
A city street scene in the style of Art Deco — geometric shapes, gold and black palette, ornamental symmetry

Evaluator (Qwen 3 VL)

The image depicts a modern city street with skyscrapers and cars, lacking Art Deco’s defining geometric ornamentation, stylized symmetry, or a dominant gold/black palette. While the warm lighting and vertical lines hint at grandeur, they don’t fulfill the requested style. Key Art Deco elements are absent. VERDICT: FAIL

A woman with a parasol in the style of Impressionism — soft FAIL
A woman with a parasol in the style of Impressionism — soft brushstrokes, dappled light, pastel palette

Evaluator (Qwen 3 VL)

The image shows soft, visible brushstrokes and dappled light filtering through foliage, consistent with Impressionism. The palette is pastel-like, with gentle greens, blues, and pinks. However, the figure’s facial features and clothing are overly detailed and smooth, lacking the loose, painterly quality typical of true Impressionist works. The style is evocative but not authentically rendered. VERDICT: FAIL

A portrait of a man in Pop Art style — bold outlines, Ben-DaFAIL
A portrait of a man in Pop Art style — bold outlines, Ben-Day dots, flat vivid primary colors

Evaluator (Qwen 3 VL)

The image attempts Pop Art with Ben-Day dots in background areas and bold color blocks, but lacks consistent flat primary colors and strong outlines across the entire portrait. Facial textures appear painterly and detailed, conflicting with the requested style. Key Pop Art characteristics are not uniformly applied. VERDICT: FAIL

Hard 33%
A forest rendered in photorealistic watercolor — the detail FAIL
A forest rendered in photorealistic watercolor — the detail and lighting of a photo combined with visible watercolor bleeds and paper texture

Evaluator (Qwen 3 VL)

The image displays photorealistic detail and lighting, but lacks visible watercolor bleeds or paper texture. It appears digitally rendered, not painted. The requested hybrid style is not recognizable. VERDICT: FAIL

A woman in a futuristic city rendered in cyberpunk Art NouvePASS
A woman in a futuristic city rendered in cyberpunk Art Nouveau — neon colors and tech elements combined with flowing organic lines and floral borders

Evaluator (Qwen 3 VL)

The image successfully blends cyberpunk’s neon tech with Art Nouveau’s flowing organic lines and floral borders. The woman’s hair and body feature intricate, curvilinear designs, while the cityscape and frame are adorned with stylized flowers. Vibrant neon hues dominate, and tech elements (e.g., glowing circuits) integrate seamlessly with organic forms. The style is unmistakably recognizable and fully satisfies the prompt. VERDICT: PASS

A medieval castle depicted in Japanese ukiyo-e woodblock priFAIL
A medieval castle depicted in Japanese ukiyo-e woodblock print style combined with photorealistic sky and lighting

Evaluator (Qwen 3 VL)

The image depicts a Japanese castle (not a medieval one), which misrepresents the prompt’s core request. While it has ukiyo-e-inspired elements like layered roofs and stylized composition, the photorealistic sky and lighting are present but clash with the traditional woodblock aesthetic. The style is partially recognizable but fundamentally inaccurate. VERDICT: FAIL

Extreme 67%
An interior scene rendered in soft brutalism — raw concrete PASS
An interior scene rendered in soft brutalism — raw concrete textures combined with rounded pastel furniture and warm diffused light

Evaluator (Qwen 3 VL)

The image clearly showcases soft brutalism: raw, textured concrete walls and arches contrast with rounded, pastel-toned furniture. Warm, diffused lighting enhances the cozy, organic feel. All key elements from the prompt are present and well-executed. VERDICT: PASS

A still life painted in anxious minimalism — sparse compositFAIL
A still life painted in anxious minimalism — sparse composition with uncomfortable negative space, muted colors, and subtle visual tension

Evaluator (Qwen 3 VL)

The image presents a still life with muted, earthy tones and a sparse, slightly cluttered arrangement. Negative space is present but not “uncomfortable,” and visual tension is minimal. The style leans toward traditional realism, not “anxious minimalism.” Key requested elements like unsettling negative space and subtle tension are absent. VERDICT: FAIL

A cityscape rendered in nostalgic futurism — retro-futuristiPASS
A cityscape rendered in nostalgic futurism — retro-futuristic 1960s space-age aesthetics with a melancholic warm-toned patina

Evaluator (Qwen 3 VL)

The image clearly embodies retro-futurism with 1960s space-age aesthetics: bulbous towers, flying cars, and organic architectural curves. Warm, golden lighting and a melancholic patina evoke nostalgia. Visual cues like the sun-drenched haze and aged textures reinforce the requested mood. All key characteristics are present and well-executed. VERDICT: PASS
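
Every evaluator response above ends with a marker of the form `VERDICT: PASS` or `VERDICT: FAIL`. How ImageBench actually reads that marker is not shown in the report. One plausible way to extract it, as a sketch only, is a regular expression anchored at the end of the response; the helper name `extract_verdict` is hypothetical.

```python
import re

# Matches a trailing "VERDICT: PASS" or "VERDICT: FAIL" at the end of a response.
VERDICT_RE = re.compile(r"VERDICT:\s*(PASS|FAIL)\s*$")


def extract_verdict(response):
    """Return "PASS" or "FAIL", or None if no trailing verdict marker is found."""
    match = VERDICT_RE.search(response.strip())
    return match.group(1) if match else None


sample = (
    "The style requested is entirely absent. VERDICT: FAIL"
)
print(extract_verdict(sample))
```

Returning `None` for a response without a marker keeps a malformed evaluator output from being silently counted as either a pass or a fail.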