
Zoom-in: Tokenizer

Karify98

If you ask a state-of-the-art language model how many "r"s are in the word "strawberry", there is a good chance it will confidently answer 2 instead of 3. It is a classic quirk that makes people laugh: how can an AI that debugs complex code and solves advanced math fail a primary-school question?

Let's zoom in on the model's translator: the tokenizer.


Layer 1 — The Root Problem: Computers Don't Understand Text

A language model is essentially a neural network that operates only on numbers, arranged as vectors and matrices. It cannot directly read characters like a, b, c or words the way we do.

Therefore, we need an intermediary component called a tokenizer to convert text into a list of numbers before feeding it to the model.

graph TD
    Text["📝 Raw Text: 'road'"] --> Tokenizer["⚙️ Tokenizer"]
    Tokenizer --> Tokens["🔢 Token ID: [1204]"]
    style Text fill:#1e293b,stroke:#475569,color:#cbd5e1
    style Tokenizer fill:#1e293b,stroke:#475569,color:#cbd5e1
    style Tokens fill:#1e293b,stroke:#475569,color:#cbd5e1

The simplest way is to map each word or character to a number. However:

  • If we split character-by-character: Sequences become very long, so the model needs many more computational steps to process a sentence and struggles to grasp word-level meaning.
  • If we split word-by-word: The model's vocabulary grows without bound as new words, tenses, prefixes, and suffixes keep emerging, and any word missing from the vocabulary cannot be represented.

The practical compromise is to slice text into 'sub-words', or tokens, which sit between characters and full words.
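
To make the trade-off concrete, here is a small sketch comparing the three granularities on one phrase. The sub-word split is hand-picked for illustration and is not the output of any real tokenizer.

```python
text = "unbelievable results"

# Character-level: tiny vocabulary, but a long sequence.
char_tokens = list(text)

# Word-level: short sequence, but every new word form needs its own vocabulary entry.
word_tokens = text.split()

# Sub-word level: a hand-picked split (assumption, for illustration only).
subword_tokens = ["un", "believ", "able", " results"]

print("characters:", len(char_tokens))
print("words:     ", len(word_tokens))
print("sub-words: ", len(subword_tokens))
```

The sub-word sequence is much shorter than the character sequence, yet pieces like "un" and "able" can be reused across many other words.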

Layer 2 — The Byte Pair Encoding Mechanism

One of the most widely used tokenization algorithms today is Byte Pair Encoding (BPE). It works like this:

  1. Start by treating every individual character (or every byte, in byte-level variants) as a token.
  2. Scan the massive training dataset to find which pair of adjacent tokens appears most frequently.
  3. Merge that pair into a new token and add it to the vocabulary.
  4. Repeat the process until the vocabulary reaches the target size (e.g., 100,000 tokens).
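
The four steps above can be sketched as a toy trainer. This is a minimal illustration, not a production tokenizer: the function name `train_bpe`, the tiny corpus, and the choice to stop after a fixed number of merges (instead of a target vocabulary size) are all assumptions made for the example.

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Step 1: every word starts as a tuple of single-character tokens.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent token pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        # Step 3: merge the most frequent pair into one new token.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus  # Step 4: repeat on the updated corpus.
    return merges, corpus


# Hypothetical mini-corpus: each word repeated by its frequency.
words = ["hug"] * 10 + ["pug"] * 5 + ["pun"] * 12 + ["bun"] * 4 + ["hugs"] * 5
merges, corpus = train_bpe(words, num_merges=3)
print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```

After three merges, the frequent word "hug" has become a single token, while rarer words such as "pug" are still split into two pieces.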

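When the word "strawberry" goes through the tokenizer, it is split into a few sub-word tokens. The exact split and IDs depend on the tokenizer; as an illustrative example, suppose it becomes two tokens: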

  • First token: straw (ID: 4123)
  • Second token: berry (ID: 8912)

For the model, this word is merely two inputs: [4123, 8912]. Because the model only recognizes these IDs and has no concept of the individual letters inside them, counting the letter "r" is like guessing how many red marbles are inside two sealed boxes without being allowed to open them.
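
A small sketch of the "sealed boxes" idea, using a hypothetical two-entry vocabulary with the illustrative IDs above. Counting letters only becomes possible after decoding the IDs back into text, a step that happens outside the neural network.

```python
# Hypothetical vocabulary (assumption): only the two illustrative tokens.
vocab = {"straw": 4123, "berry": 8912}
id_to_token = {token_id: token for token, token_id in vocab.items()}

# What the model actually receives: a list of integers.
token_ids = [vocab["straw"], vocab["berry"]]

# The integers carry no spelling information on their own.
# To count letters, the IDs must first be decoded back to text.
decoded = "".join(id_to_token[token_id] for token_id in token_ids)
r_count = decoded.count("r")

print(token_ids)         # [4123, 8912]
print(decoded, r_count)  # strawberry 3
```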


Layer 3 — The Disadvantage for Accented/Non-English Languages

Many tokenizers built by large tech companies are trained on predominantly English data. Consequently, they are optimized for English and put other languages at a disadvantage, particularly those with rich diacritic systems like Vietnamese.

Accented characters (such as ư and đ) are often not found as unified tokens in English-skewed vocabularies. As a result, the tokenizer has to break non-English words into much smaller, fragmented chunks.

English: "information" ───> [1 token]
Vietnamese: "thông tin" ───> [3 to 4 tokens]

graph TD
    subgraph English
    A[information] --> B([1 token])
    end
    subgraph Vietnamese
    C[đường] --> D([đư])
    C --> E([ờng])
    end
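
One contributing factor can be seen without any tokenizer at all. Assuming a byte-level tokenizer (a common design, though not something stated above), the starting units are UTF-8 bytes, and accented Vietnamese letters take two or three bytes each where plain ASCII letters take one. The helper name `utf8_profile` is hypothetical.

```python
import unicodedata


def utf8_profile(text):
    """Return (character count, UTF-8 byte count) for NFC-normalized text."""
    normalized = unicodedata.normalize("NFC", text)
    return len(normalized), len(normalized.encode("utf-8"))


for word in ["information", "thông tin", "đường"]:
    chars, nbytes = utf8_profile(word)
    print(f"{word!r}: {chars} characters, {nbytes} bytes")
```

The more bytes a word needs per character, the more merges a byte-level tokenizer must have learned before that word can collapse into a few tokens.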

This discrepancy leads to two real-world consequences:

  1. Higher Costs: API providers typically charge by token count. Users writing in non-English languages can pay roughly 2 to 3 times more than English users to transmit the same amount of information.
  2. Shorter Memory Span: Every model has a limit called the context window: the maximum number of tokens it can take in at once. Consuming more tokens per word means the window fills up faster, so the model loses earlier conversation context sooner.

Newer models have enlarged their vocabularies to improve token efficiency for non-English languages. GPT-4o's tokenizer has roughly twice as many entries as GPT-4's (about 200,000 versus about 100,000), and Llama 3's has about four times as many as Llama 2's (about 128,000 versus 32,000).
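
The two consequences listed above come down to simple arithmetic. In this sketch every number is a made-up assumption for illustration (the price, the context window size, and the tokens-per-word ratios); real values vary by provider, model, and text.

```python
def estimate_cost(num_tokens, price_per_million_tokens):
    """Cost of sending `num_tokens` at a given price per million tokens."""
    return num_tokens / 1_000_000 * price_per_million_tokens


def words_that_fit(context_window, tokens_per_word):
    """Approximate number of words that fit into a context window."""
    return int(context_window / tokens_per_word)


PRICE = 2.00            # hypothetical: USD per million tokens
CONTEXT_WINDOW = 8_000  # hypothetical: tokens

# Hypothetical: the same 1,000-word message in two languages.
english_tokens = 1_000     # about 1.0 token per word
vietnamese_tokens = 2_500  # about 2.5 tokens per word

english_cost = estimate_cost(english_tokens, PRICE)
vietnamese_cost = estimate_cost(vietnamese_tokens, PRICE)

print(f"English:    ${english_cost:.4f}, fits {words_that_fit(CONTEXT_WINDOW, 1.0)} words")
print(f"Vietnamese: ${vietnamese_cost:.4f}, fits {words_that_fit(CONTEXT_WINDOW, 2.5)} words")
```

Under these assumed ratios, the same message costs 2.5 times as much and the context window holds 2.5 times fewer words.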

Understanding tokenizers helps you write more efficient prompts, reduce API costs, and structure your input data so that models can process it more accurately.


Full picture

graph TD
    Input["Raw Text: 'strawberry'"] -->|Tokenization Process| Tokenizer[Tokenizer: BPE Algorithm]
    Tokenizer -->|1. Slice text into sub-words| Subwords["Sub-words: ['straw', 'berry']"]
    Subwords -->|2. Map to numerical values| Tokens["Token IDs: [4123, 8912]"]
    Tokens -->|3. Feed to Neural Network| Model[LLM Neural Network]
    
    style Input fill:#1e293b,stroke:#475569,color:#cbd5e1
    style Tokenizer fill:#1e293b,stroke:#475569,color:#cbd5e1
    style Subwords fill:#1e293b,stroke:#475569,color:#cbd5e1
    style Tokens fill:#1e293b,stroke:#475569,color:#cbd5e1
    style Model fill:#1e293b,stroke:#475569,color:#cbd5e1

Takeaway

A tokenizer acts as a translator between natural language and the numerical domain of LLMs. Because it uses sub-word tokenization algorithms (like Byte Pair Encoding) rather than reading character-by-character, the underlying neural network has no direct concept of individual letters, which causes quirks like miscounting the characters in "strawberry". Additionally, because many tokenizers are optimized for English, non-English languages require more tokens per word, leading to higher API costs and faster context window exhaustion.
