Temperature and top-p: tuning when the answer matters more than novelty

Temperature and top-p are the two sampling parameters every team adjusts and almost none tune systematically. The default of 0.7 is everyone’s first guess, the second guess is 0, and that’s where most projects stop. The real cost shows up later: classification tasks running with creative-writing temperatures, and creative writing tasks suffocating at temperature zero.

The decision rule that actually scales

For tasks with a single correct answer — classification, extraction, structured output — temperature should be 0 and top-p doesn’t matter. For tasks with many acceptable answers — summarization, rewriting — 0.5 to 0.7 with top-p around 0.9 is a reasonable starting point. For genuinely creative work, 0.8 to 1.0 is the right band, but always with top-p capped to avoid the tail of low-probability tokens that cause incoherence.
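The decision rule above can be sketched as a per-task lookup. The task names and the `sampling_params` helper are illustrative, not any particular API's interface:

```python
# Per-task sampling presets following the decision rule:
# one correct answer -> greedy; many acceptable answers -> moderate;
# creative -> high temperature, but always cap the nucleus.
SAMPLING_PRESETS = {
    # single correct answer: temperature 0, top_p irrelevant
    "classification": {"temperature": 0.0, "top_p": 1.0},
    "extraction":     {"temperature": 0.0, "top_p": 1.0},
    # many acceptable answers: moderate temperature, nucleus cap
    "summarization":  {"temperature": 0.6, "top_p": 0.9},
    "rewriting":      {"temperature": 0.6, "top_p": 0.9},
    # genuinely creative work: hot, but trimmed tail
    "creative":       {"temperature": 0.9, "top_p": 0.95},
}

def sampling_params(task: str) -> dict:
    """Return sampling parameters for a task; fail loudly on unknown tasks
    rather than silently falling back to a project-wide default."""
    try:
        return SAMPLING_PRESETS[task]
    except KeyError:
        raise ValueError(f"No sampling preset for task {task!r}") from None
```

Failing loudly on an unknown task is the point: a silent default would reintroduce the "one config per project" problem the rule exists to avoid.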

What the defaults hide

Setting temperature to 0 doesn’t make models deterministic — floating-point nondeterminism on GPUs and ties between top token probabilities can still flip the argmax, so two identical calls can produce different outputs. If you need true reproducibility, you also need to pin a seed, and not every API exposes one. Treat temperature 0 as low-variance, not zero-variance, and your tests will stop being flaky.
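One way to write tests that treat temperature 0 as low-variance rather than deterministic is to sample a handful of times and assert on a stable majority instead of exact equality on a single call. A minimal sketch, where `call_model` is a stand-in for your actual API call and the `quorum` threshold is an assumption you'd tune:

```python
from collections import Counter

def stable_output(call_model, prompt: str, runs: int = 5, quorum: float = 0.8):
    """Call the model `runs` times and return the majority output.

    Raises AssertionError if no single output reaches the `quorum`
    fraction of runs -- i.e. the endpoint is too unstable to test.
    """
    counts = Counter(call_model(prompt) for _ in range(runs))
    output, hits = counts.most_common(1)[0]
    if hits / runs < quorum:
        raise AssertionError(f"Output too unstable: {dict(counts)}")
    return output
```

A test built on this helper tolerates the occasional flipped token at temperature 0 but still fails when an endpoint drifts, which is usually the behavior you actually care about.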

The teams that ship reliable LLM features pick sampling parameters per task, not per project. The default config is the wrong config for half your endpoints.
