Tokenization

Interactive

Learn how text is split into tokens, why subword tokenizers exist, and how tokenization affects LLM behavior and cost.

Difficulty beginner
Read time 8 min
Tags: tokenization, tokens, subword, BPE, WordPiece, SentencePiece, Unigram, NLP, LLM
Updated February 8, 2026

What Is Tokenization?

Tokenization is the step where raw text gets chopped into smaller pieces—called tokens—that a model can work with.

Analogy: imagine you’re building with LEGO. You can’t pour in “a castle” as one blob; you need bricks. Tokenization is the process of turning your text into the “bricks” the model is trained to recognize and manipulate.

At first glance, you might think tokens are just words. Sometimes they are. But modern AI systems often use subword tokens (pieces of words) or even byte-level tokens. That lets the model handle rare words, new names, slang, typos, and many languages without needing a gigantic vocabulary.

Technically, tokenization maps a string of characters into a sequence of discrete symbols from a fixed vocabulary (plus some special symbols). Those symbols are then converted into vectors (embeddings) and processed by the model.
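
To make this concrete, here is a minimal sketch using the Hugging Face transformers library; “bert-base-uncased” is only an illustrative checkpoint, and the exact pieces and IDs you get depend entirely on that model’s tokenizer.

    # Sketch: how one specific tokenizer splits text (assumes `pip install transformers`).
    from transformers import AutoTokenizer

    # "bert-base-uncased" is an illustrative choice; every checkpoint ships its own tokenizer.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    text = "Tokenization turns text into bricks."
    tokens = tokenizer.tokenize(text)               # subword strings (the pieces depend on the vocabulary)
    ids = tokenizer.convert_tokens_to_ids(tokens)   # integer IDs from the fixed vocabulary

    print(tokens)
    print(ids)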

Why Does It Matter?

Tokenization matters because it quietly controls a bunch of very practical things:

  • What the model can “see” and learn. The model never sees raw characters directly; it sees token IDs. If a concept is split into weird pieces, the model’s job becomes harder.
  • Cost and speed. Many LLM services price and limit usage by token count, so more tokens usually mean higher cost and slower responses (a token-counting sketch appears just below).
  • Context window limits. Models can only process a fixed number of tokens at once. Tokenization decides how quickly you “spend” that budget.
  • Handling unknown words. Word-level tokenizers fail on new words (they become <UNK>). Subword tokenizers are designed so almost anything can be represented as smaller pieces.
  • Multilingual robustness. Some tokenizers work better across languages; others are biased toward the language distribution they were trained on.

If you’ve ever wondered why “same meaning” prompts can produce different results—or why a short-looking sentence can still be “many tokens”—tokenization is often the culprit.
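
To see the cost and context-budget points in practice, here is a small sketch using the tiktoken library; “cl100k_base” is just one example encoding, and the count it gives only applies to models that actually use that encoding.

    # Sketch: estimate how many tokens a prompt costs (assumes `pip install tiktoken`).
    import tiktoken

    # "cl100k_base" is one example encoding; other model families use different ones.
    enc = tiktoken.get_encoding("cl100k_base")

    prompt = "A short-looking sentence can still be many tokens."
    ids = enc.encode(prompt)

    # The token count (not the character count) is what drives pricing and context usage.
    print(len(ids), "tokens")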

How It Works

A tokenizer is usually a pipeline. The exact steps vary by model, but a common structure looks like this:

1) Normalize the text (sometimes)

Before splitting, many tokenizers normalize text in consistent ways:

  • Unicode normalization (so visually similar characters are treated consistently)
  • Lowercasing (some models do this, many modern LLMs don’t)
  • Handling whitespace or control characters

The point is to reduce “accidental variety” so the model doesn’t waste capacity on irrelevant differences.
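
As a toy illustration of what a normalization pass can involve (real tokenizers bundle their own rules, so treat this as a sketch only):

    # Sketch: a hand-rolled normalization pass (illustrative; real tokenizers ship their own rules).
    import unicodedata

    def normalize(text: str) -> str:
        text = unicodedata.normalize("NFC", text)  # unify visually identical Unicode sequences
        text = text.lower()                        # optional: many modern LLM tokenizers skip this
        text = " ".join(text.split())              # collapse runs of whitespace/control characters
        return text

    print(normalize("Caf\u0065\u0301   au lait"))  # "café au lait" (e + combining accent folded by NFC)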

2) Pre-tokenize into rough chunks

Next, the text is often split into coarse pieces like words and punctuation.

Example:

  • Input: “Hello, world!”
  • Rough split: ["Hello", ",", "world", "!"]

This step is not the final tokenization; it’s just preparing the text for the main algorithm.
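
A rough pre-tokenization pass can be as simple as a regular expression; production tokenizers use more careful rules, especially around whitespace, so this is only a sketch:

    # Sketch: naive pre-tokenization into words and punctuation.
    import re

    def pre_tokenize(text: str) -> list[str]:
        # \w+ grabs runs of letters/digits/underscore; [^\w\s] grabs single punctuation marks
        return re.findall(r"\w+|[^\w\s]", text)

    print(pre_tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']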

3) Apply a subword algorithm (the real magic)

Here’s where modern tokenization earns its keep. Instead of storing every possible word in the vocabulary, we store common subword units.

There are a few major families you’ll run into:

A) BPE (Byte Pair Encoding) / merge-based tokenization

BPE-style tokenizers start with small units (characters or bytes) and repeatedly merge the most frequent neighboring pairs to form larger tokens.

Intuition: if "th" appears constantly, it becomes a token. If "tion" appears constantly, it becomes a token. Over time, frequent patterns become single units.

Example idea (simplified):

  • Start: ["u", "n", "h", "a", "p", "p", "i", "n", "e", "s", "s"]
  • After merges: ["un", "happi", "ness"] (illustrative)

Many modern LLMs use byte-level BPE variants, which start from bytes so they can represent any text reliably (including emojis and weird Unicode) without an <UNK> token.
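
The sketch below runs the merge idea on a tiny made-up corpus; it is a simplified, word-internal version of BPE for building intuition, not a byte-level production tokenizer.

    # Sketch: toy BPE training on a tiny "corpus" of words (simplified for intuition).
    from collections import Counter

    # Each word is a tuple of symbols; the counts stand in for corpus frequencies.
    corpus = {
        ("u", "n", "h", "a", "p", "p", "y"): 5,
        ("h", "a", "p", "p", "i", "n", "e", "s", "s"): 3,
        ("u", "n", "d", "o"): 2,
    }

    def most_frequent_pair(corpus):
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        return pairs.most_common(1)[0][0] if pairs else None

    def merge_pair(corpus, pair):
        merged = {}
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                    out.append(word[i] + word[i + 1])  # fuse the pair into one symbol
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = freq
        return merged

    for step in range(5):  # a handful of merges is enough to see frequent patterns fuse
        pair = most_frequent_pair(corpus)
        corpus = merge_pair(corpus, pair)
        print(f"merge {step + 1}: {pair} -> {''.join(pair)}")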

B) WordPiece (vocabulary + longest-match splitting)

WordPiece also produces subword tokens, but the common “production behavior” is: take a word and repeatedly pick the longest vocabulary entry that matches a prefix of the remaining characters.

Example (classic illustration):

  • Word: "hugs"
  • Tokens might become: ["hug", "##s"] where ## marks “this piece continues a word”.

The big mental model: WordPiece is like a greedy cutter—“use the largest known chunk you can.”
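
Here is a minimal greedy longest-match splitter in the WordPiece style; the vocabulary is made up for illustration, and real WordPiece models learn theirs from data.

    # Sketch: greedy longest-match splitting over a made-up WordPiece-style vocabulary.
    vocab = {"hug", "##s", "##ging", "un", "##happi", "##ness"}

    def wordpiece_split(word: str) -> list[str]:
        pieces, start = [], 0
        while start < len(word):
            end = len(word)
            while end > start:
                piece = word[start:end]
                if start > 0:
                    piece = "##" + piece      # continuation pieces are marked with "##"
                if piece in vocab:
                    pieces.append(piece)
                    break
                end -= 1                      # shrink the candidate until something matches
            else:
                return ["<UNK>"]              # nothing in the vocabulary matched this position
            start = end
        return pieces

    print(wordpiece_split("hugs"))         # ['hug', '##s']
    print(wordpiece_split("unhappiness"))  # ['un', '##happi', '##ness']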

C) Unigram (probabilistic subword selection)

Unigram tokenization (often used via SentencePiece) treats tokenization as choosing a set of subword units that best explain the data. It starts with a large candidate vocabulary and prunes it down, and at inference time it can choose among multiple possible segmentations.

Intuition: instead of rigid “merge rules,” you have a vocabulary with probabilities, and you pick a segmentation that scores well.

This can be useful for languages where word boundaries are tricky or where different segmentations make sense.
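
The sketch below picks the highest-scoring segmentation under a toy unigram vocabulary with invented log-probabilities; real Unigram/SentencePiece models learn both the vocabulary and the probabilities from data.

    # Sketch: best segmentation under a toy unigram model (made-up log-probabilities).
    import math

    log_prob = {"un": -2.0, "happi": -3.0, "ness": -2.5, "unhappi": -6.5,
                "u": -6.0, "n": -5.5, "h": -6.0, "a": -5.5, "p": -5.5,
                "i": -5.5, "e": -5.5, "s": -5.0}

    def segment(text: str) -> list[str]:
        # best[i] = (score of the best segmentation of text[:i], start index of its last piece)
        best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
        for end in range(1, len(text) + 1):
            for start in range(end):
                piece = text[start:end]
                if piece in log_prob:
                    score = best[start][0] + log_prob[piece]
                    if score > best[end][0]:
                        best[end] = (score, start)
        # walk back through the stored split points to recover the pieces
        pieces, i = [], len(text)
        while i > 0:
            start = best[i][1]
            pieces.append(text[start:i])
            i = start
        return list(reversed(pieces))

    print(segment("unhappiness"))   # ['un', 'happi', 'ness'] with this toy vocabulary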

4) Add special tokens

Most models use special markers like:

  • start-of-sequence, end-of-sequence
  • padding (for batching)
  • separators (for paired inputs)
  • “unknown” (some tokenizers avoid <UNK> by design)

These tokens are part of what makes a raw list of token IDs meaningful to the model.
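
Here is a toy sketch of wrapping a sequence (or a pair of sequences) in special markers; the names <s>, </s>, and <sep> are invented for illustration, and every model family has its own conventions.

    # Sketch: wrapping token sequences in made-up special markers (names vary by model).
    BOS, EOS, SEP = "<s>", "</s>", "<sep>"

    def with_special_tokens(tokens, pair=None):
        out = [BOS] + list(tokens)
        if pair is not None:                  # paired inputs, e.g. question + passage
            out += [SEP] + list(pair)
        return out + [EOS]

    print(with_special_tokens(["un", "happi", "ness"]))
    # ['<s>', 'un', 'happi', 'ness', '</s>']
    print(with_special_tokens(["how", "deep"], pair=["very", "deep"]))
    # ['<s>', 'how', 'deep', '<sep>', 'very', 'deep', '</s>']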

5) Output token IDs (and attention masks, etc.)

Finally, tokens are mapped to integer IDs:

  • ["un", "happi", "ness"][421, 9831, 117] (example IDs)

These IDs are what the model actually consumes.
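
To round out the pipeline, here is a toy sketch of mapping token strings to IDs and padding a batch with an attention mask; the vocabulary and ID numbers are invented for illustration.

    # Sketch: token strings -> integer IDs, plus padding and an attention mask (toy vocabulary).
    vocab = {"<pad>": 0, "<s>": 1, "</s>": 2,
             "un": 421, "happi": 9831, "ness": 117, "hug": 52, "##s": 88}

    def encode_batch(batch):
        max_len = max(len(seq) for seq in batch)
        input_ids, attention_mask = [], []
        for seq in batch:
            ids = [vocab[tok] for tok in seq]
            pad = max_len - len(ids)
            input_ids.append(ids + [vocab["<pad>"]] * pad)       # pad shorter sequences to max_len
            attention_mask.append([1] * len(ids) + [0] * pad)    # 1 = real token, 0 = padding
        return input_ids, attention_mask

    ids, mask = encode_batch([["<s>", "un", "happi", "ness", "</s>"],
                              ["<s>", "hug", "##s", "</s>"]])
    print(ids)   # [[1, 421, 9831, 117, 2], [1, 52, 88, 2, 0]]
    print(mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 1, 0]]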

A simple, concrete mental checklist

When you see odd model behavior, ask:

  1. Did my text explode into a lot of tokens?
  2. Did a key term get split into weird fragments?
  3. Am I mixing languages/scripts where the tokenizer is less efficient?
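
A quick way to check the first two questions is to look at the actual fragments; here is a sketch using tiktoken (with “cl100k_base” as an example encoding, as in the counting sketch earlier), though the same idea works with whichever tokenizer your model actually uses.

    # Sketch: inspect how a prompt fragments (assumes `pip install tiktoken`).
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")   # one example encoding, not universal
    prompt = "Summarize the pharmacokinetics of acetaminophen."

    ids = enc.encode(prompt)
    print(len(ids), "tokens")
    print([enc.decode([i]) for i in ids])        # check whether key terms split into odd fragments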

Key Terminology

  • Token: A discrete unit the model processes (often a word piece, not a whole word).
  • Vocabulary: The fixed set of tokens the model knows, each mapped to an integer ID.
  • Subword tokenization: Splitting words into reusable pieces so rare words can be represented without <UNK>.
  • BPE / WordPiece / Unigram: Common tokenization algorithm families used by modern NLP models.
  • Special tokens: Reserved symbols like start/end markers, padding, or separators that structure inputs.

Real-World Applications

  • LLM APIs and billing: Usage limits and pricing often depend on token count; tokenization affects cost directly.
  • Prompt engineering and reliability: A “single word” concept might be multiple tokens; subtle punctuation changes can alter token boundaries and model output.
  • Search and retrieval systems: Tokenization (or a close cousin) underpins indexing, query parsing, and semantic pipelines.
  • Multilingual products: Choosing a tokenizer (and vocabulary) is a big design choice for translation, chatbots, and global apps.
  • Training new models: Tokenizer design influences how efficiently a model learns and how large the embedding matrix must be.

Common Misconceptions

  1. “Tokens are words.” Not reliably. Tokens are often subwords or even byte-level chunks. Two words of similar length can produce very different token counts.

  2. “Tokenization is universal across models.” Each model family can use a different tokenizer and vocabulary. A prompt can be 20 tokens in one model and 30 in another—so token counts and behavior don’t always transfer.

  3. “Tokenization is just a boring preprocessing step.” It’s a design lever. Tokenization affects cost, context limits, multilingual performance, and even what patterns are easier or harder for the model to learn.

Further Reading

  • Sennrich, Haddow & Birch (2016): Neural Machine Translation of Rare Words with Subword Units (classic subword/BPE motivation).
  • Kudo & Richardson (2018): SentencePiece (language-independent tokenization; Unigram + BPE options).
  • Hugging Face LLM Course / Docs: WordPiece, BPE, Unigram tokenization summaries (practical explanations and examples).

Interactive: Tokenizer Playground

Type any text and see how different tokenization strategies split it into tokens.

Sample input: “The tokenizer splits unhappiness into subword pieces.” → 21 tokens

BPE merges common character pairs into subword tokens — the approach used by most LLMs.

Simplified simulation for educational purposes

Interactive: BPE Merge Visualizer

Step through the BPE algorithm and watch how character pairs merge into subword tokens.

Starting state: each character of "unhappiness" is a separate token (11 tokens). Click 'Next Merge' to start applying BPE merge rules.

Simplified BPE simulation for educational purposes