Zoom-in: LLM

Every day, millions of people chat with Large Language Models (LLMs) and marvel at their ability to answer questions, write code, or solve complex logic puzzles. However, few stop to ask: what is actually happening inside that black box?

Zoom in on the core mechanism of the model.

Layer 1 — The Essence: The Next-Token Game

Contrary to popular imagination of a conscious entity, all modern LLMs operate on a simple statistical principle: Next-Token Prediction.

Think of the auto-suggest feature on your phone's keyboard, but scaled up to a massive proportion.

graph LR
    Input["✍️ Input: 'Eating fruit, remember the...'"] --> Model["🖥️ Language Model"]
    Model --> Output["Output: 'grower' (Highest probability)"]
    style Input fill:#1e293b,stroke:#475569,color:#cbd5e1
    style Model fill:#1e293b,stroke:#475569,color:#cbd5e1
    style Output fill:#1e293b,stroke:#475569,color:#cbd5e1

When you enter a prompt, the model does not "think" to find the answer. Instead, it calculates the probability of all possible tokens in its vocabulary to appear next, based on what you just wrote. It picks the most optimal token, appends it to the existing sequence, and repeats the process until the entire response is generated.

→ The essence of intelligence here is the ability to predict probability extremely accurately, learned from millions of texts.

Layer 2 — Where is the "Large"?

Why does a simple word-prediction mechanism produce outstanding artificial intelligence? The word "Large" is the key to unlocking this breakthrough reasoning ability, represented by three main pillars:

1. Number of Parameters

Parameters are the mathematical weights inside the neural network. You can think of them as "knobs" adjusting how signals flow through the model.

Small language models have about a few billion parameters (e.g., Llama-3-8B with 8 billion knobs).
Top tier models like GPT-4 can have trillions of parameters. As the number of knobs increases, the model can memorize more complex and subtle relationships between tokens.

2. Training Data Size

The model is trained by reading almost all public human texts on the internet: from books, academic papers, software source code to discussion forums. The total size of this data amounts to tens of trillions of tokens.

3. Compute Power

To adjust billions of knobs by reading that huge amount of data, you need thousands of Graphics Processing Units (GPUs) or Tensor Processing Units (TPUs) running continuously for months.

Layer 3 — Two Stages of Model Growth

A model does not start smart. It must go through two rigorous training phases:

sequenceDiagram
    participant Web as Internet (Raw Data)
    participant Base as Base Model
    participant Chat as Chat Model

    Note over Web, Base: Stage 1: Pre-training
    Web->>Base: Reads trillions of words to learn language rules
    Note over Base: Only knows how to autocomplete raw text

    Note over Base, Chat: Stage 2: Alignment
    Base->>Chat: Fine-tuned on conversation templates & human feedback
    Note over Chat: Becomes a helpful and safe assistant

Pre-training: This is the phase where the model "reads the bookstore." The sole goal is to learn language structures and raw knowledge. After this, a base model is born. If you ask: "What is the capital of Vietnam?", the base model might autocomplete it to: "What is the capital of Thailand? What is the capital of France?" because it assumes you are writing a list of questions.
Alignment: In this step, humans teach the model how to interact. By fine-tuning it on sample dialogue templates and scoring its behaviors, the model learns to be a helpful, polite assistant that answers on-topic and refuses dangerous requests.

→ Learning language rules is Pre-training; learning how to converse is Alignment.

From a pure statistical probability machine, pushing parameter and data scale to the extreme has unlocked reasoning abilities – a fascinating emergent behavior that computer scientists are still actively studying.

Full picture

graph TD
    Data[Internet Data] -->|Pre-training: learning language rules| Base[Base Model: Autocomplete engine]
    Base -->|Alignment: learning human values| Chat[Chat/Instruct Model: Conversational Assistant]
    
    subgraph Scale ["Scale (The Large Part)"]
        Params[Billions of Parameters]
        Compute[Thousands of GPUs]
        Tokens[Trillions of Tokens]
    end
    
    Params & Compute & Tokens --> Data

Takeaway

At its core, a Large Language Model (LLM) is a highly sophisticated next-token prediction machine. Reasoning abilities emerge as scale increases across three dimensions: model parameters, training tokens, and compute power. Finally, the two-phase training process—Pre-training (learning language rules) and Alignment (learning behavior)—transforms a raw text completer into a reliable conversational assistant.

Layer 1 — The Essence: The Next-Token Game

Layer 2 — Where is the "Large"?

1. Number of Parameters

2. Training Data Size

3. Compute Power

Layer 3 — Two Stages of Model Growth

Full picture

Takeaway

Related Posts

Zoom-in: Decoding Parameters

The Things That Never Go Obsolete

Claude Opus 4.8: Anthropic Ships Honesty Improvements, Cuts Fast Mode Pricing 3x