Decoding & Sampling

Understand how token selection strategies control output quality, diversity, and consistency.

Difficulty intermediate
Read time 9 min
decoding sampling temperature top-p top-k beam-search greedy
Updated February 11, 2026

What Is Decoding & Sampling?

Imagine an LLM is writing one token at a time while staring at a giant keyboard of possible next tokens. At each step, it has a preference list: some tokens are very likely, others are possible but less likely.

Decoding is the strategy for choosing the next token from that list.

  • If you always pick the most likely token, you get very consistent output.
  • If you allow controlled randomness, you get more variety and creativity.

That is why the same prompt can produce different outputs: the model weights are the same, but the token selection policy can differ.

Technical definition: after a forward pass, the model outputs logits for the next token. Decoding transforms these logits into a probability distribution and chooses a token according to rules such as greedy, temperature sampling, top-k, top-p, or beam search.

Why Does It Matter?

Decoding is one of the strongest runtime controls you have. It directly affects:

  • Consistency: deterministic settings are useful for grading, extraction, or policy-sensitive text.
  • Creativity: stochastic settings help brainstorming, storytelling, and ideation.
  • Safety and reliability: aggressive randomness can increase format errors or drift from instructions.
  • Evaluation: comparing prompts is noisy unless decoding settings are fixed.

Many teams debug prompts when the real issue is a decoding mismatch. A prompt that looks “unstable” may become reliable with stricter decoding, and a prompt that feels boring may improve with mild sampling.

How It Works

At each generation step, the model produces logits z_i for each token i in the vocabulary.

  1. Convert logits to probabilities

    Usually with softmax:

    p_i = exp(z_i) / sum_j exp(z_j)

  2. Optionally reshape distribution with temperature

    p_i(T) = exp(z_i / T) / sum_j exp(z_j / T)

    • T < 1: sharper distribution, safer and more repetitive.
    • T > 1: flatter distribution, more diverse and risky.
  3. Apply candidate filtering (optional)

    • Top-k: keep only the k highest-probability tokens.
    • Top-p (nucleus): keep the smallest token set whose cumulative probability exceeds p.
  4. Select next token

    • Deterministically (greedy / beam best path).
    • Stochastically (sample from filtered distribution).
  5. Append the token and repeat until a stop condition is met (a minimal code sketch of these steps follows this list).
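
A minimal Python sketch of one such decoding step, assuming the raw logits for the next token are already available. The function names, default values, and toy logits below are illustrative, not any particular library's API.

  import numpy as np

  def softmax(logits):
      # Step 1: convert raw logits to a probability distribution.
      z = logits - np.max(logits)            # subtract max for numerical stability
      e = np.exp(z)
      return e / e.sum()

  def decode_step(logits, temperature=1.0, top_k=None, top_p=None, greedy=False, rng=None):
      # Pick one next-token id from raw logits.
      rng = rng or np.random.default_rng()
      logits = np.asarray(logits, dtype=float)

      if greedy:
          return int(np.argmax(logits))      # step 4, deterministic: argmax

      probs = softmax(logits / temperature)  # steps 1-2: temperature, then softmax

      if top_k is not None:                  # step 3a: keep only the k most likely tokens
          cutoff = np.sort(probs)[-top_k]
          probs = np.where(probs >= cutoff, probs, 0.0)
          probs = probs / probs.sum()

      if top_p is not None:                  # step 3b: smallest set covering mass >= p
          order = np.argsort(probs)[::-1]
          cumulative = np.cumsum(probs[order])
          keep = order[: int(np.searchsorted(cumulative, top_p)) + 1]
          mask = np.zeros_like(probs)
          mask[keep] = probs[keep]
          probs = mask / mask.sum()

      return int(rng.choice(len(probs), p=probs))  # step 4, stochastic: sample

  # Toy 5-token vocabulary with made-up logits.
  logits = [2.0, 1.5, 0.3, -1.0, -2.0]
  print(decode_step(logits, greedy=True))                 # always the argmax (index 0)
  print(decode_step(logits, temperature=0.7, top_p=0.9))  # varies from run to run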

Major decoding strategies

  • Greedy decoding: Always choose the argmax token. Fast and deterministic, but can get trapped in repetitive local choices.

  • Beam search: Track multiple candidate sequences (beams), keeping the top global paths by accumulated score. Better for tasks like translation where global coherence matters, but often less diverse and sometimes overly generic.

  • Top-k sampling: Sample from only the k most likely options. Prevents absurd low-probability picks while preserving variety.

  • Top-p (nucleus) sampling: Use a dynamic candidate set sized by probability mass. Adapts better than a fixed top-k when the model's confidence changes from step to step.

  • Temperature sampling: Scales randomness across all of the above methods (the snippet below shows how these appear as generation parameters in practice).
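
In practice you rarely implement these loops yourself; libraries expose the same knobs as generation parameters. A sketch using the Hugging Face transformers generate API, where the model name and parameter values are illustrative and any causal LM works the same way:

  from transformers import AutoModelForCausalLM, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2")
  inputs = tokenizer("Write one tagline for an eco-friendly running shoe:",
                     return_tensors="pt")

  # Greedy decoding: deterministic, always the highest-probability token.
  greedy = model.generate(**inputs, max_new_tokens=20, do_sample=False)

  # Beam search: track several candidate sequences, keep the best overall path.
  beam = model.generate(**inputs, max_new_tokens=20, num_beams=4, do_sample=False)

  # Temperature + top-k + top-p sampling: stochastic, more varied output.
  sampled = model.generate(**inputs, max_new_tokens=20, do_sample=True,
                           temperature=0.7, top_k=50, top_p=0.9)

  print(tokenizer.decode(sampled[0], skip_special_tokens=True))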

Practical mental model

Think of logits as a city map with many roads:

  • Greedy always takes the widest road.
  • Beam explores a few major roads in parallel.
  • Top-k opens only the biggest intersections.
  • Top-p opens enough roads to cover most traffic.
  • Temperature changes how adventurous your driver is.

Example behavior snapshot

Prompt: “Write one tagline for a new eco-friendly running shoe.”

  • Greedy, T=0: “Run lighter, live greener.”
  • Top-p 0.9, T=0.7: similar but with moderate variation.
  • Top-p 0.95, T=1.1: more novel, sometimes too poetic or off-brief.

No single setting is universally best; the right choice depends on the task objective.

Key Terminology

  • Logits: Raw model scores before conversion to probabilities.
  • Temperature: Parameter that sharpens or flattens token probabilities (see the quick numeric illustration after this list).
  • Top-k sampling: Sampling only from the k most likely next tokens.
  • Top-p (nucleus) sampling: Sampling from the smallest set of tokens covering probability mass p.
  • Beam search: Multi-path deterministic decoding that optimizes sequence-level score.
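
A quick numeric illustration of the temperature entry above, using toy logits for a three-token vocabulary chosen only for the example:

  import numpy as np

  logits = np.array([2.0, 1.0, 0.0])   # toy scores for a 3-token vocabulary

  for T in (0.5, 1.0, 2.0):
      p = np.exp(logits / T) / np.exp(logits / T).sum()
      print(f"T={T}: {np.round(p, 3)}")

  # Lower T concentrates probability on the top token (sharper distribution);
  # higher T spreads it out (flatter), matching the T < 1 vs T > 1 behavior above.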

Real-World Applications

  • Structured extraction and classification: Often use low temperature or greedy decoding for repeatability.
  • Marketing copy tools: Use moderate sampling to produce multiple candidate drafts.
  • Code generation assistants: Use stricter decoding for syntax reliability, then optional high-diversity regeneration on demand.
  • Conversational assistants: Balance temperature and top-p to avoid both robotic repetition and chaotic drift.

Common Misconceptions

  1. “Temperature alone controls randomness.” It is important, but top-k/top-p and penalties also strongly shape output behavior.

  2. “Lower temperature always means higher quality.” Lower temperature improves consistency, not necessarily relevance or creativity for all tasks.

  3. “If two runs differ, the model is broken.” Variation is expected under sampling. Deterministic settings (or a fixed random seed) are needed when reproducibility matters; the short check below illustrates this.
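
To make the last point concrete, here is a small self-contained check with toy logits; production APIs typically expose an equivalent seed or deterministic-decoding option.

  import numpy as np

  logits = np.array([2.0, 1.5, 0.3, -1.0, -2.0])
  probs = np.exp(logits / 0.9) / np.exp(logits / 0.9).sum()

  # Two sampled runs with the same seed pick the same token...
  a = np.random.default_rng(0).choice(len(probs), p=probs)
  b = np.random.default_rng(0).choice(len(probs), p=probs)
  assert a == b

  # ...while runs with different seeds may legitimately differ. That is sampling
  # working as intended, not a broken model.
  c = np.random.default_rng(1).choice(len(probs), p=probs)
  print(a, c)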

Further Reading

  • Holtzman et al. (2020), The Curious Case of Neural Text Degeneration (top-p motivation).
  • OpenAI API documentation on text generation parameters (temperature, top-p, and related controls).
  • Hugging Face documentation on generation strategies (greedy, beam, sampling).