Tuning Krea-2 Turbo for Better Prompt Following
Most of ImageBench is about ranking models against each other. Sometimes the sharper question is what a single knob inside one model does. Krea-2 Turbo — a 12-billion-parameter distilled diffusion transformer — carries a small linear layer on its text-conditioning path. Re-weighting that one layer, while holding everything else fixed, lets you watch prompt adherence rise, peak, and then fall apart.
A one-variable experiment
The setup is deliberately boring, because that is what makes it readable. Same prompt, same seed (42), same eight denoising steps, same 1024×1024 output, guidance held at 0 (Krea-2 Turbo is distilled, so classifier-free guidance is off). Across ten generations the only thing that changes is the strength of one weight patch, from 0 (patch disabled, i.e. the stock model) up to 10 (extreme).
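As a rough illustration of the one-variable design, the fixed settings above can be written as a frozen config with a single field that varies. This is a minimal sketch, not the harness used for the article: `RunConfig`, `build_sweep`, and `varying_fields` are hypothetical names, and the strength list covers only the six sweep points shown in this post, not all ten generations.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container for the settings the article holds fixed."""
    prompt: str
    seed: int = 42            # same seed for every generation
    steps: int = 8            # eight denoising steps
    width: int = 1024
    height: int = 1024
    guidance: float = 0.0     # classifier-free guidance is off
    patch_strength: float = 0.0  # the only knob that changes


# The six strengths shown in this article (the full run had ten generations).
SHOWN_STRENGTHS = (0.0, 0.1, 0.3, 1.0, 3.0, 10.0)


def build_sweep(prompt, strengths=SHOWN_STRENGTHS):
    """Return one config per strength, identical in every other field."""
    base = RunConfig(prompt=prompt)
    return [replace(base, patch_strength=s) for s in strengths]


def varying_fields(configs):
    """Names of the fields that take more than one value across configs."""
    return {
        f.name
        for f in fields(RunConfig)
        if len({getattr(c, f.name) for c in configs}) > 1
    }
```

Calling `varying_fields(build_sweep(prompt))` is a cheap sanity check before a run: if it returns anything other than `{"patch_strength"}`, the sweep is no longer a one-variable experiment.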
The prompt is a dense, multi-subject scene with many independent, checkable constraints — exactly the kind of instruction-heavy prompt that separates a model that is listening from one that is improvising. We did not write it: we found it on Reddit, and the author was kind enough to let us use it here. It is reproduced in full below, so you can judge adherence for yourself.
We can see a wide, straight red couch in a dimly lit room of a high-end disco club. Sitting in the couch we can see four women: - The first woman is a eighteen years old young woman, blonde hair, with blue eyes, slim body, small breasts, wearing a very short red sequin miniskirt, a sequin red blouse, red scarpin high-heels shoes, glossy red lipstick, dark eyeshadow. She is sitting with her legs crossed and her hands on her lap. She is looking to her friends and laughing hard. - The second woman is a twenty-five years old african american woman, voluptuous body with large breasts, wearing a very short fit green minidress, green scarpin high-heels shoes, discreet lipstick, subtle green eyeshadow. She is looking at the camera with a surprised expression on her face. - The third woman is a twenty years old redhead woman with green eyes, long curly hair, chubby body, large breasts, wearing very fit black pants and a black bustier, black scarpin high-heels shoes, matte pink lipstick, smoky eye pink eyeshadow. She is sitting with her legs apart, one hand on her thigh and the other holding a whisky glass. She is smiling. - The fourth woman is a twenty-three years old asian woman, slim body, wearing a very short black miniskirt and a thin-strapped black blouse, black scarpin high-heels shoes with red soles, glossy red lipstick, gray eyeshadow. She is sitting with her legs crossed and has one hand near her face, holding a cigarette between her fingers. She is looking at the camera and smiling. In the background we can see the couch they are sitting in, and also the wooden wall behind it. The overall mood is sexy, provocative. The overall look of the image is of an amateur, homemade candid photo taken with a cell phone camera. Dimly lit scene, homemade photography.
What the patch actually touches
The patch is 48 bytes. It is not a normal LoRA — there are no low-rank A/B factors. It is a Comfy-style .diff patch: a single weight-delta tensor of shape [1, 12] keyed to text_fusion.projector.weight.
That target is a Linear(12 → 1) layer inside the transformer. It takes 12 selected hidden-state layers from the Qwen3-VL text encoder and collapses them into a single scalar that modulates how the text conditioning is applied — one number that gates how strongly the prompt is allowed to steer generation. The patch re-weights how those 12 encoder layers contribute to that scalar; it introduces no new capability, it just shifts the balance on a path that already exists. (The patch circulates as a community “filter-bypass” modification; the question here is narrower and fully measurable — what it does to prompt adherence.)
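To make the shape of that layer concrete, here is a minimal pure-Python sketch of a Linear(12 → 1) projection and of what re-weighting it means. Several things are assumptions rather than facts from the model: the layer is treated as bias-free, each encoder layer is reduced to a single number for readability, and the function names are invented for illustration. The 48-byte figure is consistent with 12 float32 values, which is an inference, not something the patch file is confirmed to contain.

```python
NUM_LAYERS = 12          # encoder layers feeding the projector (from the article)
BYTES_PER_FLOAT32 = 4    # assumption: the delta is stored as float32

# 12 values x 4 bytes = 48, matching the stated patch size.
PATCH_PAYLOAD_BYTES = NUM_LAYERS * BYTES_PER_FLOAT32


def project(weight_row, layer_values):
    """Linear(12 -> 1) without bias: a dot product over the 12 layer values."""
    if len(weight_row) != NUM_LAYERS or len(layer_values) != NUM_LAYERS:
        raise ValueError("expected exactly 12 weights and 12 layer values")
    return sum(w * x for w, x in zip(weight_row, layer_values))


def patched_row(weight_row, delta_row, strength):
    """Return weight + strength * delta, element by element."""
    if len(weight_row) != len(delta_row):
        raise ValueError("weight and delta must have the same length")
    return [w + strength * d for w, d in zip(weight_row, delta_row)]
```

With `strength = 0` the patched row equals the stock row, so the output is unchanged; any other strength only shifts how much each of the 12 layers contributes to the same single output.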
Mechanically, the server adds delta × scale to that weight before a generation and subtracts it afterward, so the stock weights are back in place between runs. One wrinkle matters for reading the sweep: a Comfy .diff scale runs about 100× a normal LoRA weight, so a strength of 1.0 is roughly a ×100 push.
| Strength | vs. a typical LoRA weight |
|---|---|
| 0.01 | ≈ 1× |
| 0.1 | ≈ 10× |
| 1.0 | ≈ 100× |
| 3.0 | ≈ 300× |
| 10.0 | ≈ 1000× |
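The add-then-subtract mechanic and the table above can be sketched together. This is an illustrative sketch under stated assumptions, not the server's code: weights are modeled as a plain dict of float lists, `patched` and `lora_equivalent` are hypothetical helpers, and the factor of 100 is the article's approximate figure.

```python
from contextlib import contextmanager

DIFF_TO_LORA_FACTOR = 100  # approximate ratio stated in the article


def lora_equivalent(strength):
    """Rough 'typical LoRA weight' multiple for a given .diff strength."""
    return strength * DIFF_TO_LORA_FACTOR


@contextmanager
def patched(weights, key, delta, scale):
    """Add delta * scale to weights[key] in place; subtract it again on exit."""
    row = weights[key]
    if len(row) != len(delta):
        raise ValueError("delta must match the shape of the target weight")
    for i, d in enumerate(delta):
        row[i] += d * scale
    try:
        yield weights
    finally:
        # Undo the patch even if generation raised an exception.
        for i, d in enumerate(delta):
            row[i] -= d * scale
```

A generation would then run inside `with patched(weights, "text_fusion.projector.weight", delta, 1.0):`. One design caveat: adding and subtracting in floating point is not guaranteed to restore the original bits exactly, so a harness that needs exact restoration might snapshot the row and copy it back instead.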
The sweep
Six points along the range, same prompt and seed throughout.
0.0 — off
Base Krea-2 Turbo. A competent club scene, but several stated attributes are missing.
0.1 — ≈10×
Barely moved. Expressions still generic.
0.3 — ≈30×
Expressions snap into place — laughing, surprised, smiling.
1.0 — ≈100×
The sweet spot. Body types, expressions, props, and shoe details all land.
3.0 — ≈300×
Over-driven. Small violations creep back — a shoe color drifts.
10.0 — ≈1000×
Prompt drift and a visible drop in coherence.
Read left to right. With the patch off, the model paints a plausible scene but quietly ignores several of the stated details — the person described as heavier-set comes out slim, the expressions default to generic. By 0.3 the expressions lock in. At 1.0 the rest follows: the described body types, the distinct expressions (laughing hard, surprised, smiling), the cigarette held between the fingers, the red-soled heels. Past 1.0 the extra push starts to cost you — a shoe color drifts away from the prompt, a garment simplifies — and by 10 the image is over-driven, with subjects picking up props they were never given and a clear loss of coherence.
Off vs. the sweet spot
The clearest way to see the effect is the two ends of the useful range, side by side.
Why we sweep before we benchmark
A benchmark score is only meaningful if the model was run at a sensible setting. Defaults are not always the best setting, and one badly chosen parameter can make a capable model look worse than it is. So before wiring a model or a configuration into ImageBench, we characterize the parameters that actually move output — a controlled, one-variable sweep like this one — so the published number reflects the model at a fair setting rather than an arbitrary one.
It is the same instinct behind the rest of the methodology: a benchmark is a measurement instrument, and a measurement is only as good as the care taken setting it up. More on how those measurements turn into published results is in the V1 methodology, and on why a small, well-chosen prompt set does most of the work in Why a Few Good Prompts Are Enough.