Zoom-in: Tokenizer

If you ask a state-of-the-art language model how many "r"s are in the word "strawberry", there is a good chance it will confidently answer 2 instead of 3. This is a classic quirk that makes people laugh: why can an AI that debugs complex code and solves advanced math fail at a primary-school question?
Let's zoom in on the model's translator: the tokenizer.
Layer 1 — The Root Problem: Computers Don't Understand Text
A language model is essentially a neural network that only works with numbers and mathematical matrices. It cannot directly read characters like a, b, c or words as we do.
Therefore, we need an intermediary component called a tokenizer to convert text into a list of numbers before feeding it to the model.
graph TD
Text["📝 Raw Text: 'road'"] --> Tokenizer["⚙️ Tokenizer"]
Tokenizer --> Tokens["🔢 Token ID: [1204]"]
style Text fill:#1e293b,stroke:#475569,color:#cbd5e1
style Tokenizer fill:#1e293b,stroke:#475569,color:#cbd5e1
style Tokens fill:#1e293b,stroke:#475569,color:#cbd5e1
The simplest way is to map each word or character to a number. However:
- If we split character-by-character: Every sentence becomes a very long sequence, so the model needs many more computational steps to process it and struggles to grasp word-level meaning.
- If we split word-by-word: The vocabulary grows without bound, because new words, tenses, prefixes, and suffixes keep appearing, and any word missing from the vocabulary cannot be represented at all.
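The trade-off between the two naive approaches can be seen on a single sentence. This is a minimal sketch, not how any real tokenizer is built; the sample sentence is made up for illustration.

```python
sentence = "the model reads the road signs"

# Character-level: tiny vocabulary, but a long sequence.
char_tokens = list(sentence)
char_vocab = sorted(set(char_tokens))

# Word-level: short sequence, but every new word form needs its own entry.
word_tokens = sentence.split()
word_vocab = sorted(set(word_tokens))

print(f"characters: {len(char_tokens)} tokens, vocabulary of {len(char_vocab)}")
print(f"words:      {len(word_tokens)} tokens, vocabulary of {len(word_vocab)}")

# A word-level vocabulary has no entry for an unseen form such as "roads".
print("roads" in word_vocab)
```

Sub-word tokenization sits between these two extremes: sequences stay reasonably short while the vocabulary stays a fixed size.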
Layer 2 — The Byte Pair Encoding Mechanism
The most popular algorithm for tokenization today is Byte Pair Encoding (BPE). It works like this:
- Start by treating every individual character (or every byte, in byte-level variants) as a token.
- Scan the massive training dataset to find which pair of tokens appears next to each other most frequently.
- Merge that pair into a new token and add it to the vocabulary dictionary.
- Repeat the process until the dictionary reaches the target size (e.g., 100,000 tokens).
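The four steps above can be sketched in a few lines. This is a toy character-level version under simplifying assumptions: it trains on a small list of words rather than a massive dataset, and the function name `train_bpe` and the sample words are invented for illustration. Production tokenizers add pre-tokenization, byte-level handling, and far more efficient data structures.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn up to num_merges merge rules from a list of words (toy BPE)."""
    # Step 1: each word starts as a tuple of single-character tokens.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Step 2: count how often each adjacent pair of tokens occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        best = max(pair_counts, key=pair_counts.get)
        # Step 3: record the merge and rewrite every word with the new token.
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    # Step 4 (the loop bound): stop once the merge budget is used up.
    return merges, corpus

merges, corpus = train_bpe(["abab", "abab", "ab"], num_merges=5)
print(merges)
print(dict(corpus))
```

Here the merge budget plays the role of the target vocabulary size: each merge adds exactly one new token to the dictionary.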
When the word "strawberry" goes through the tokenizer, it may be split into two tokens (the exact split and IDs vary from one tokenizer to another; the values here are illustrative):
- First token: straw (ID: 4123)
- Second token: berry (ID: 8912)
For the model, this word is merely two inputs: [4123, 8912]. Because the model only recognizes these IDs and has no concept of the individual letters inside them, counting the letter "r" is like guessing how many red marbles are inside two sealed boxes without being allowed to open them.
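To make the sealed-box picture concrete, here is a small sketch of what the model receives. The two-entry vocabulary and its IDs are hypothetical and simply mirror the illustrative numbers above, and the `encode` helper uses a greedy longest-match lookup as a stand-in for a real BPE encoder.

```python
# Hypothetical vocabulary; the IDs are illustrative, not from a real tokenizer.
vocab = {"straw": 4123, "berry": 8912}

def encode(text, vocab):
    """Greedy longest-match lookup, a simplified stand-in for BPE encoding."""
    ids, i = [], 0
    while i < len(text):
        # Try the longest candidate substring first, then shrink it.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i:]!r}")
    return ids

ids = encode("strawberry", vocab)
print(ids)                       # [4123, 8912] is all the model receives
print("strawberry".count("r"))   # 3, trivial when the characters are visible
```

Counting letters is easy for code that sees the string, but the list of IDs carries no spelling information on its own.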
Layer 3 — The Disadvantage for Accented/Non-English Languages
Most tokenizers built by tech giants are primarily trained on English data. Consequently, they are optimized for English and disadvantage other languages, particularly those with complex accent systems like Vietnamese.
Accented characters (such as ư, ờ, đ) often have no single token of their own in English-skewed vocabularies, and the syllables that contain them are rarely frequent enough in the training data to be merged. As a result, the tokenizer has to break non-English words into much smaller, fragmented chunks.
English: "information" ───> [1 token]
Vietnamese: "thông tin" ───> [3 to 4 tokens]
graph TD
subgraph English
A[information] --> B([1 token])
end
subgraph Vietnamese
C[đường] --> D([đư])
C --> E([ờng])
end
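Part of the gap is visible before any merging happens. Assuming a byte-level tokenizer that starts from UTF-8 bytes (one common variant, not necessarily the one any given model uses), accented characters already cost two or three starting units each, so they need more merges to end up as a single token. The helper name `utf8_len` is made up for this sketch.

```python
import unicodedata

def utf8_len(text):
    """Number of UTF-8 bytes in text after NFC normalization."""
    return len(unicodedata.normalize("NFC", text).encode("utf-8"))

for word in ["road", "đường", "information", "thông tin"]:
    chars = len(unicodedata.normalize("NFC", word))
    print(f"{word!r}: {chars} characters, {utf8_len(word)} bytes")
```

Byte counts are only a rough proxy: the final token count depends on which merges the tokenizer learned from its training data.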
This discrepancy leads to two real-world consequences:
- Higher Costs: API providers charge based on token count, so users writing in non-English languages can pay roughly 2 to 3 times more than English users to transmit the same amount of information.
- Shorter Memory Span: Every model has a limit called the context window: the maximum number of tokens it can read and keep in view in a session. Consuming more tokens per word fills that window sooner, so earlier parts of the conversation drop out of context much faster.
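Both consequences come down to simple arithmetic. The sketch below uses entirely hypothetical numbers (the price, the context window size, and the tokens-per-word rates are assumptions chosen to keep the math easy), so read it as an illustration of the mechanism rather than a measurement.

```python
import math

def tokens_for(words, tokens_per_word):
    """Estimated token count for a text of `words` words."""
    return math.ceil(words * tokens_per_word)

def cost_usd(tokens, usd_per_million_tokens):
    """Cost of sending `tokens` tokens at a flat per-token price."""
    return tokens * usd_per_million_tokens / 1_000_000

def words_that_fit(context_window, tokens_per_word):
    """How many words fit before the context window is full."""
    return math.floor(context_window / tokens_per_word)

# Hypothetical values, for illustration only.
PRICE = 2.0                                    # USD per million tokens
WINDOW = 8192                                  # context window, in tokens
RATES = {"English": 1.5, "Vietnamese": 3.0}    # tokens per word

for language, rate in RATES.items():
    tokens = tokens_for(1000, rate)
    print(f"{language}: {tokens} tokens for 1000 words, "
          f"${cost_usd(tokens, PRICE):.4f}, "
          f"{words_that_fit(WINDOW, rate)} words fit in the window")
```

With these assumed rates, doubling the tokens per word doubles the bill and halves the amount of text that fits in context.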
Understanding tokenizers helps you write more efficient prompts, save API costs, and structure your input data so that models can process it with maximum accuracy.
Full picture
graph TD
Input["Raw Text: 'strawberry'"] -->|Tokenization Process| Tokenizer[Tokenizer: BPE Algorithm]
Tokenizer -->|1. Slice text into sub-words| Subwords["Sub-words: ['straw', 'berry']"]
Subwords -->|2. Map to numerical values| Tokens["Token IDs: [4123, 8912]"]
Tokens -->|3. Feed to Neural Network| Model[LLM Neural Network]
style Input fill:#1e293b,stroke:#475569,color:#cbd5e1
style Tokenizer fill:#1e293b,stroke:#475569,color:#cbd5e1
style Subwords fill:#1e293b,stroke:#475569,color:#cbd5e1
style Tokens fill:#1e293b,stroke:#475569,color:#cbd5e1
style Model fill:#1e293b,stroke:#475569,color:#cbd5e1
Takeaway
A tokenizer acts as a translator between natural language and the numerical domain of LLMs. Because it uses sub-word tokenization algorithms (like Byte Pair Encoding) rather than reading character-by-character, the underlying neural network has no direct view of individual letters, which causes quirks like miscounting the characters in "strawberry". Additionally, because most tokenizers are optimized for English, non-English languages require more tokens per word, leading to higher API costs and faster context window exhaustion.