How to Write a Great AI Image Prompt: The 6-Part Anatomy
May 31, 2026
You can picture an image perfectly and still get garbage out. The fix is structure.
If you've ever typed "a cool dragon" into an AI image tool and gotten back something flat, generic, and nothing like the picture in your head, the problem isn't the model. It's that you gave it one word of intent and left the other ninety-nine decisions to a coin flip. A great AI image prompt isn't longer or more clever — it's structured. It names the six decisions the model would otherwise guess at.
This guide breaks an image prompt into six parts, then builds one prompt from scratch, one part at a time, writing out the actual prompt text at every step so you can see exactly what each layer adds. You'll get a lazy-vs-structured comparison with both results described in full, a section on what to exclude, the style anchors that actually move the output, and a copy-paste template you can reuse forever. Everything here works in any modern generator — including the AI image generator on SentX.
The 6-part anatomy of an AI image prompt
Every strong image prompt answers six questions, in roughly this order: Subject → Composition → Style → Lighting → Lens/Camera → Mood/Color. Lead with the subject, because the model leans hardest on the words it reads first; layer the rest as refinement.
Here is what each part controls:
- Subject — what is in the frame. The noun plus its defining traits, clothing, action, and setting. This is the single most important part; a vague subject can't be rescued by good lighting.
- Composition — where the subject sits in the frame and how the shot is framed. Close-up, wide shot, rule-of-thirds, centered, low angle, over-the-shoulder. This is the part beginners skip most, and it's why their images feel like flat ID photos.
- Style — the medium and visual language. Photograph, oil painting, watercolor, 3D render, ukiyo-e, charcoal sketch, risograph print. Style is the biggest single lever on the whole "feel" of the result.
- Lighting — where the light comes from and how hard it is. Golden-hour backlight, soft north-window light, hard noon sun, neon glow, candlelight, overcast diffusion. Without it the model defaults to safe, boring, even lighting.
- Lens / Camera — how the image was captured. 35mm, 85mm portrait, macro, wide-angle 24mm, shallow depth of field, tilt-shift. This is what separates "looks like a snapshot" from "looks like it was shot by someone who owns lenses."
- Mood / Color — the emotional temperature and palette. Warm and nostalgic, cold and clinical, muted earth tones, high-contrast and moody, pastel and ethereal. This ties the whole frame together.
Miss any one of these and the model fills the gap with its statistical average — which is exactly the bland output everyone complains about.
Building a prompt one part at a time
The fastest way to internalize the anatomy is to watch a prompt grow. We'll start with a bare subject and add exactly one layer per step, so you can see what each part does to the instruction. Target image: an old fisherman by the sea.
Step 1 — Subject only. State what's in the frame, with defining traits and action.
An elderly fisherman with a weathered face and white stubble, mending a net, standing on a wooden dock by the sea
This is already better than "a fisherman" because it has age, texture (weathered, stubble), an action (mending a net), and a setting (dock by the sea). But the model still has to guess the framing, medium, light, lens, and mood.
Step 2 — Add Composition. Decide where he sits in the frame and the camera angle.
An elderly fisherman with a weathered face and white stubble, mending a net, standing on a wooden dock by the sea, medium close-up from a low angle, subject off-center on the right third, net filling the foreground
Now we've stopped getting a centered passport photo. The low angle gives him presence; the off-center placement leaves room for the sea.
Step 3 — Add Style. Name the medium and visual language.
An elderly fisherman with a weathered face and white stubble, mending a net, standing on a wooden dock by the sea, medium close-up from a low angle, subject off-center on the right third, net filling the foreground, shot as a documentary photograph in the style of mid-century photojournalism
"Documentary photograph" and "photojournalism" pull the model away from glossy, over-rendered, AI-looking output and toward something that reads as real and observed.
Step 4 — Add Lighting. Tell it where the light comes from.
...shot as a documentary photograph in the style of mid-century photojournalism, soft golden-hour light raking across his face from the left, warm rim light on the net
Lighting is the difference between a face and a portrait. The raking side light carves out the wrinkles instead of flattening them.
Step 5 — Add Lens / Camera. Specify how it was captured.
...soft golden-hour light raking across his face from the left, warm rim light on the net, 85mm lens, shallow depth of field, background sea softly blurred
The 85mm and shallow depth of field separate him from the background and give the believable compression of a real portrait lens.
Step 6 — Add Mood / Color. Set the emotional temperature and palette.
An elderly fisherman with a weathered face and white stubble, mending a net, standing on a wooden dock by the sea, medium close-up from a low angle, subject off-center on the right third, net filling the foreground, shot as a documentary photograph in the style of mid-century photojournalism, soft golden-hour light raking across his face from the left, warm rim light on the net, 85mm lens, shallow depth of field, background sea softly blurred, quiet and dignified mood, muted teal-and-amber palette
That's the full six-part prompt. Read it back and notice you can picture the exact image — which means the model can too. There's almost nothing left for it to guess.
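The layering above can also be sketched in code. This is a minimal illustration, not part of any generator's API: the `ImagePrompt` class and its field names are hypothetical, chosen only to mirror the six parts. It joins whichever parts are filled in, in the order the anatomy lists them, so the subject always leads.

```python
from dataclasses import dataclass, fields


@dataclass
class ImagePrompt:
    """Hypothetical container: one field per part, declared in emit order."""
    subject: str
    composition: str = ""
    style: str = ""
    lighting: str = ""
    lens: str = ""
    mood_color: str = ""

    def build(self) -> str:
        # fields() preserves declaration order, so the subject comes first.
        parts = (getattr(self, f.name).strip() for f in fields(self))
        # Empty parts are skipped rather than emitted as blank slots.
        return ", ".join(p for p in parts if p)


fisherman = ImagePrompt(
    subject=("An elderly fisherman with a weathered face and white stubble, "
             "mending a net, standing on a wooden dock by the sea"),
    composition=("medium close-up from a low angle, subject off-center on the "
                 "right third, net filling the foreground"),
    style=("shot as a documentary photograph in the style of "
           "mid-century photojournalism"),
    lighting=("soft golden-hour light raking across his face from the left, "
              "warm rim light on the net"),
    lens="85mm lens, shallow depth of field, background sea softly blurred",
    mood_color="quiet and dignified mood, muted teal-and-amber palette",
)
print(fisherman.build())
```

Keeping each part in its own field is the point of the sketch: you can leave one blank, or swap one out, without retyping the rest.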
Lazy prompt vs. structured prompt, side by side
Here's the same idea written two ways, with both results described honestly.
The lazy version:
old fisherman by the sea, realistic
What you tend to get: A man who could be any age, shot dead-center and chest-up like a stock-photo headshot. Even, flat midday lighting with no shadow to define his features. A generic blue-sky-and-water backdrop in sharp focus, so nothing pops. The whole thing reads as "AI image" — competent, soulless, instantly forgettable. The word "realistic" did almost nothing, because the model didn't know which kind of realistic.
The structured version (the Step-6 prompt above):
What you tend to get: A specific, weathered man placed on the right third with a low, slightly heroic angle. Golden side-light carving real texture into his face. The net in soft foreground, the sea blurred behind from the shallow depth of field, so your eye lands exactly where you want it. A muted teal-and-amber grade that feels like a still from a film. It reads as a photograph someone made, not one a machine averaged.
Same subject. The only difference is that the second prompt answered the six questions the first one left to chance.
Negative direction: what to exclude
Telling the model what you don't want is half the craft, because models love to add clutter, extra fingers, watermarks, and busy backgrounds. There are two ways to do it.
1. Use a negative-prompt field if your tool has one. Many generators have a separate box for things to suppress. Common, genuinely useful entries:
blurry, lowres, extra fingers, deformed hands, watermark, text, signature, jpeg artifacts, oversaturated, plastic skin, busy background
2. State exclusions in plain language if there's no negative field. Phrasing the absence directly inside the prompt works in conversational generators:
...clean uncluttered background, no text or watermark, natural skin texture rather than smooth plastic skin
A note that saves frustration: describe the scene you want as presence, not absence. "Empty road" beats "no cars," because the model latches onto the noun "cars" and sometimes draws them anyway. Save the negative field for true artifacts (extra limbs, watermarks, lowres) and describe everything else as something that is there.
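The two routes can be sketched as a single helper. Everything here is an assumption for illustration: `apply_exclusions` and the `INLINE_PHRASING` table are hypothetical names, and the plain-language phrasings are the ones from the example above. Whether a tool exposes a negative field, and what it calls it, varies by generator.

```python
# Hypothetical table: unwanted artifact -> plain-language phrasing to use
# inside the prompt when the tool has no negative-prompt field.
INLINE_PHRASING = {
    "busy background": "clean uncluttered background",
    "watermark": "no text or watermark",
    "plastic skin": "natural skin texture rather than smooth plastic skin",
}


def apply_exclusions(prompt, unwanted, has_negative_field):
    """Return (prompt, negative_prompt) for the route the tool supports."""
    if has_negative_field:
        # Route 1: leave the prompt alone and list artifacts separately.
        return prompt, ", ".join(unwanted)
    # Route 2: fold each known exclusion into the prompt as plain language.
    inline = [INLINE_PHRASING[u] for u in unwanted if u in INLINE_PHRASING]
    return ", ".join([prompt, *inline]), ""


with_field = apply_exclusions(
    "an empty road at dawn", ["watermark", "busy background"], True)
without_field = apply_exclusions(
    "an empty road at dawn", ["busy background", "watermark"], False)
print(with_field)
print(without_field)
```

Artifacts with no entry in the table are silently dropped on the inline route; a real helper would probably want to warn about them instead.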
Style anchors that actually change the output
Most "style" words are noise. Stacking ultra HD, 8K, highly detailed, masterpiece, trending on artstation does almost nothing now: the model treats them as generic quality filler, and none of them says anything specific about the image. The anchors that actually move the image are specific, named, and information-dense. Use two or three, not ten.
These pull hard:
- A named medium with a technique: cyanotype print, gouache on toned paper, 35mm film with slight grain, low-poly 3D render, linocut, charcoal on newsprint.
- A named art movement or era: art nouveau, Bauhaus poster, 1970s sci-fi paperback cover, Dutch Golden Age still life, ukiyo-e woodblock.
- A named lighting setup: Rembrandt lighting, chiaroscuro, softbox studio light, golden hour backlight, single candle.
- A real lens behavior: macro, tilt-shift, fisheye, 85mm portrait compression, wide-angle 24mm.
These do almost nothing:
beautiful, amazing, high quality, 4K, 8K, award-winning, masterpiece, trending.
The rule: specific terms carry more information per word. "Ukiyo-e woodblock" tells the model more than "Japanese-style art," and "Rembrandt lighting" tells it more than "dramatic lighting." Swap vague adjectives for named techniques and the output changes immediately. Once you've nailed a still you love, you can even bring it to life: see turning a still image into motion.
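If you want to catch filler before you hit generate, a small check like the one below can help. It is a sketch under assumptions: `split_filler` is a hypothetical helper, the `FILLER` set is just the low-value words named in this section, and it only works on comma-separated prompts.

```python
# Low-value quality words named in this section, lowercased for matching.
FILLER = {
    "ultra hd", "8k", "4k", "highly detailed", "masterpiece",
    "trending on artstation", "trending", "beautiful", "amazing",
    "high quality", "award-winning",
}


def split_filler(prompt):
    """Split a comma-separated prompt into (cleaned prompt, dropped filler)."""
    phrases = [p.strip() for p in prompt.split(",") if p.strip()]
    kept = [p for p in phrases if p.lower() not in FILLER]
    dropped = [p for p in phrases if p.lower() in FILLER]
    return ", ".join(kept), dropped


cleaned, dropped = split_filler(
    "old fisherman, 8K, masterpiece, Rembrandt lighting, highly detailed")
print(cleaned)
print(dropped)
```

Each dropped word is a slot you can refill with a named medium, lighting setup, or lens.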
The reusable copy-paste template
Keep this in a note. Fill the brackets, delete what you don't need, and you have a structured prompt every time:
[SUBJECT: who/what + defining traits + clothing + action + setting],
[COMPOSITION: shot type + camera angle + where the subject sits in frame],
[STYLE: named medium or art movement, e.g. 35mm film photograph / gouache painting / low-poly 3D render],
[LIGHTING: source + quality, e.g. soft golden-hour backlight / hard noon sun / single candle],
[LENS: focal length + depth of field, e.g. 85mm, shallow depth of field],
[MOOD/COLOR: emotional temperature + palette, e.g. quiet and nostalgic, muted earth tones]
Negative (if your tool has the field):
blurry, lowres, extra fingers, deformed hands, watermark, text, signature, oversaturated, plastic skin
Worked fill-in for a completely different subject, so you can see it generalize:
A red fox curled asleep in fresh snow, tail wrapped over its nose,
extreme close-up from ground level, fox centered with snow filling the frame,
wildlife photograph,
cold blue overcast light with faint warm glow on the fur,
300mm telephoto, very shallow depth of field, soft snow bokeh,
calm and intimate mood, cool blue palette with warm amber accents
Paste that into any generator and you'll get a deliberate, photographer's image instead of a guess. If you want to try it right now, open SentX's image generator and run the lazy version and the structured version back to back — seeing the gap on your own subject is the fastest way to make the anatomy stick.
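One easy slip with a bracket template is pasting it with a slot still unfilled. The check below is a rough sketch, not a feature of any tool: `unfilled_slots` is a hypothetical helper, and it assumes slots keep the `[LABEL: hint]` shape used in the template above.

```python
import re

# Matches a leftover template slot such as "[LENS: focal length + ...]"
# and captures its label (uppercase letters, spaces, or a slash).
SLOT = re.compile(r"\[([A-Z][A-Z/ ]*):[^\]]*\]")


def unfilled_slots(prompt_text):
    """Return the labels of template slots that were never filled in."""
    return SLOT.findall(prompt_text)


draft = (
    "A red fox curled asleep in fresh snow, tail wrapped over its nose,\n"
    "[COMPOSITION: shot type + camera angle + where the subject sits in frame],\n"
    "wildlife photograph,\n"
    "[MOOD/COLOR: emotional temperature + palette]"
)
print(unfilled_slots(draft))
```

An empty list means every bracket was either filled or deleted.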
Frequently asked questions
How long should an AI image prompt be?
Long enough to answer the six parts, and no longer — usually one to three lines. Past that, extra words start competing with each other and the model loses the thread. If you're padding with 8K, ultra-detailed, masterpiece, award-winning, you've hit the point of diminishing returns. Cut the filler and add a specific detail instead: a named lens, a named lighting setup, a concrete color.
Should I write prompts as a sentence or as comma-separated tags?
Either works; most modern generators handle both. Comma-separated phrases (elderly fisherman, golden-hour light, 85mm, muted palette) give you fine control over each part. Natural sentences read more easily and work well in conversational tools. Pick whichever you'll actually maintain; the structure matters far more than the grammar.
Why does my prompt produce something different every time?
Image generators are random by design — the same prompt with a different starting seed produces a different image. Structure reduces the range of that randomness: a tightly specified prompt varies within a narrow band you'll like, while a vague one can land anywhere. If you find a result you love, keep the prompt and reuse it; many tools let you lock or reuse a seed to stay close to a result.
What's the single biggest mistake beginners make?
Skipping composition and lighting. Almost everyone writes a decent subject and then leaves the framing and the light to chance — which is exactly why their images look like flat, centered, evenly-lit stock photos. Add a shot type and a light source and your output jumps a tier immediately.
Do these prompts work for AI video too?
The same six-part thinking applies: you still specify subject, composition, style, lighting, lens, and mood. Video adds motion and timing on top (camera moves, what the subject does over the shot's duration). The anatomy here is the foundation; layer movement direction on top of it. If you're heading that way, the principles in writing better AI prompts and adding motion to a still image carry straight over.
How do I get consistent results across multiple images?
Lock the parts you want stable and vary only one. Keep the same style, lighting, lens, and mood lines, and change only the subject or composition between generations. Because you've separated the prompt into discrete parts, you can hold five of them constant and turn just one dial — which is impossible when your whole prompt is a single vague phrase. Tools that carry context across a session make this easier: you can refine "same style, new subject" without re-typing the whole structure each time.
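Holding five parts constant and turning one dial might look like the sketch below. The names are hypothetical (`vary_one`, `PART_ORDER`, the `locked` dictionary), and the second subject is invented purely for illustration; the locked lines are shortened from the fox example above.

```python
PART_ORDER = ("subject", "composition", "style", "lighting", "lens", "mood_color")


def vary_one(base, part, options):
    """Return one prompt per option, changing only `part` and keeping the rest."""
    if part not in PART_ORDER:
        raise ValueError(f"unknown part: {part!r}")
    prompts = []
    for option in options:
        # Copy the locked parts and override just the one being varied.
        parts = {**base, part: option}
        prompts.append(
            ", ".join(parts[name] for name in PART_ORDER if parts.get(name)))
    return prompts


locked = {
    "composition": "extreme close-up from ground level",
    "style": "wildlife photograph",
    "lighting": "cold blue overcast light",
    "lens": "300mm telephoto, very shallow depth of field",
    "mood_color": "calm and intimate mood, cool blue palette",
}
series = vary_one(
    locked, "subject",
    ["a red fox asleep in snow", "a snowy owl on a fence post"])
for prompt in series:
    print(prompt)
```

Every prompt in the series shares the same style, lighting, lens, and mood text word for word, which is what keeps a set of images looking like a set.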