
Zoom-in: Alignment

Karify98

When a large language model finishes its pre-training phase on trillions of raw tokens, it only knows how to do one thing: predict the next token. It cannot yet answer questions reliably or act as a helpful chat partner.

To transform this raw model into a helpful, safe, and honest conversational partner, we must perform a process called alignment. The two most prominent technologies driving this are RLHF and DPO.

Let's zoom in on the process of educating AI behavior.


Layer 1 — The Problem: The Naivety of Base Models

A base model, after self-supervised pre-training, simply continues a piece of text in whatever way looks most like the text it read on the internet.

  • If you ask: "How do I fix a memory leak in NodeJS?"
  • The base model might autocomplete to: "How do I fix a memory leak in Python? How do I fix it in Java?" because it assumes you are writing a list of programming questions.

As a first step, we perform Supervised Fine-Tuning (SFT): the model is trained on high-quality question-and-answer conversations written by humans, so that it learns to respond to a prompt instead of merely continuing it.
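A minimal sketch of what the SFT training signal can look like, assuming the common (but not universal) choice of computing the loss only on the answer tokens and masking out the prompt. The function name `sft_loss` and the toy numbers are hypothetical and exist only for illustration.

```python
import math

def sft_loss(token_logprobs, is_response_token):
    """Average negative log-likelihood over response tokens only.

    token_logprobs: log-probability the model assigned to each target token.
    is_response_token: True for tokens of the human-written answer,
        False for prompt tokens (assumed here to be masked out of the loss).
    """
    selected = [lp for lp, keep in zip(token_logprobs, is_response_token) if keep]
    if not selected:
        raise ValueError("example has no response tokens to learn from")
    return -sum(selected) / len(selected)

# Hypothetical example: 3 prompt tokens followed by 2 answer tokens.
logprobs = [math.log(0.9), math.log(0.8), math.log(0.7), math.log(0.5), math.log(0.25)]
mask = [False, False, False, True, True]
loss = sft_loss(logprobs, mask)
```

Lowering this loss makes the human-written answer more likely given the prompt, which is all SFT optimizes for.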

However, SFT alone is not enough. The model can still generate offensive, toxic, or hallucinated answers, because it is only imitating the patterns in its demonstrations and never sees which of two plausible answers people actually prefer. We need an evaluation mechanism to teach it right from wrong.


Layer 2 — RLHF: The Traditional Reinforcement Learning Method

Reinforcement Learning from Human Feedback (RLHF) is the core technology that powered OpenAI's ChatGPT breakthrough. The RLHF process consists of three steps:

flowchart TD
    A[Supervised Fine-Tuned SFT Model] --> B[Humans score and rank sample answers]
    B --> C[Train an independent Reward Model]
    C --> D[Use Reinforcement Learning PPO to adjust the Policy based on Reward Model scores]
    D --> E[Successfully Aligned Model]

  1. Feedback Data Collection: Human evaluators compare multiple answers generated by the model for the same prompt and rank them based on helpfulness and safety.
  2. Train a Reward Model: A separate neural network is trained to output a score representing how much humans would prefer a given answer.
  3. Optimize via Reinforcement Learning: Using reinforcement learning, specifically Proximal Policy Optimization (PPO), the main model's weights are adjusted to maximize the score provided by the reward model.
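Steps 1 and 2 can be sketched as follows, assuming the widely used pairwise (Bradley-Terry style) formulation in which the reward model is trained so that the preferred answer scores higher than the dispreferred one. The helper names `ranking_to_pairs` and `reward_model_loss` are hypothetical, and the scores stand in for the outputs of a real reward network.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_answers):
    """Turn one human ranking (best answer first) into (preferred, dispreferred) pairs."""
    # combinations() keeps the input order, so the first element of each pair ranks higher.
    return list(combinations(ranked_answers, 2))

def reward_model_loss(score_preferred, score_dispreferred):
    """Pairwise loss: -log(sigmoid(score_preferred - score_dispreferred))."""
    margin = score_preferred - score_dispreferred
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

pairs = ranking_to_pairs(["answer A", "answer B", "answer C"])
loss_correct_order = reward_model_loss(2.0, -1.0)   # reward model agrees with the humans
loss_wrong_order = reward_model_loss(-1.0, 2.0)     # reward model disagrees
```

The loss is small when the reward model already orders the pair the way the human did and large when it disagrees. In step 3, the trained scorer then stands in for the human evaluators.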

Weaknesses of RLHF: The pipeline is complex and notoriously unstable. PPO is highly sensitive to hyperparameters and prone to divergence during training. In addition, the model being trained, the reward model, and a frozen reference model (plus, typically, a value model used by PPO) must all be kept in memory at the same time, which demands massive GPU resources.
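To see why a reference model has to stay loaded during training, here is a sketch of the reward signal under one common formulation, which the post does not spell out and which is assumed here: the reward-model score is reduced by a penalty proportional to how far the policy's log-probability has moved from the reference model's. The function name and the default `beta` are hypothetical.

```python
def kl_shaped_reward(reward_score, policy_logprob, ref_logprob, beta=0.1):
    """Reward-model score minus a penalty for drifting away from the reference model.

    policy_logprob / ref_logprob: log-probability of the sampled answer under
    the policy being trained and under the frozen reference model.
    """
    log_ratio = policy_logprob - ref_logprob
    return reward_score - beta * log_ratio

# Same reward-model score, but in the second case the policy has become far more
# confident in this answer than the reference model was, so it is penalized.
unchanged = kl_shaped_reward(1.0, policy_logprob=-5.0, ref_logprob=-5.0)
drifted = kl_shaped_reward(1.0, policy_logprob=-2.0, ref_logprob=-5.0)
```

Every sampled answer therefore needs a forward pass through the policy, the reward model, and the reference model, which is where much of the memory cost comes from.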


Layer 3 — DPO: The Simplicity Leap

In 2023, Stanford researchers introduced Direct Preference Optimization (DPO), shifting the paradigm of model alignment.

The mathematical intuition behind DPO is bold: Why waste resources training an intermediate reward model and running complex reinforcement learning, when we can directly optimize the policy using human preference data?

The authors showed that the KL-constrained optimization objective of RLHF can be rewritten as a simple binary classification loss over pairs of preferred and dispreferred answers:

$$\mathcal{L}_{DPO}(\pi_\theta; \pi_{ref}) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)} \right) \right]$$

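(Where: $x$ is the prompt, $y_w$ is the preferred answer, $y_l$ is the dispreferred answer, $\pi_\theta$ is the model being trained, $\pi_{ref}$ is the frozen reference model (typically the SFT model), $\sigma$ is the logistic sigmoid, and $\beta$ controls how strongly the policy is held close to the reference)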
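Why the reward model drops out can be sketched in three steps. The reward function $r(x, y)$, the normalizer $Z(x)$, and the optimal policy $\pi^*$ are symbols introduced here for illustration; they do not appear elsewhere in this post, and this is an outline of the standard argument, not a full proof. Start from the KL-constrained objective that RLHF optimizes:

$$\max_{\pi} \; \mathbb{E}_{x,\, y \sim \pi(\cdot|x)} \left[ r(x, y) \right] - \beta \, \mathrm{KL}\left( \pi(\cdot|x) \,\|\, \pi_{ref}(\cdot|x) \right)$$

Its optimum has a closed form, which can be solved for the reward:

$$\pi^*(y|x) = \frac{1}{Z(x)} \, \pi_{ref}(y|x) \exp\left( \frac{1}{\beta} r(x, y) \right) \quad \Longrightarrow \quad r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{ref}(y|x)} + \beta \log Z(x)$$

Assuming human preferences follow a Bradley-Terry model, only the difference of the two rewards matters, so the intractable $\beta \log Z(x)$ term cancels:

$$p(y_w \succ y_l \mid x) = \sigma\left( r(x, y_w) - r(x, y_l) \right) = \sigma\left( \beta \log \frac{\pi^*(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi^*(y_l|x)}{\pi_{ref}(y_l|x)} \right)$$

Replacing $\pi^*$ with the trainable $\pi_\theta$ and taking the negative log-likelihood of the observed preferences gives the DPO loss.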

This formula compares how much more likely the trained policy makes each answer than the reference policy does, for both the preferred and the dispreferred answer. Minimizing it raises the relative probability of preferred responses and lowers that of dispreferred ones. The loss contains no explicit KL-divergence term; the KL constraint of the RLHF objective is built into the log-ratios against $\pi_{ref}$, and $\beta$ sets how strongly the model is kept from drifting too far from its original stable state.
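The loss for a single preference pair can be sketched in a few lines of plain Python. This is an illustration, not a training implementation: the function name `dpo_loss`, the default `beta`, and the example log-probabilities are hypothetical. Each argument is the log-probability of a whole answer given the prompt, i.e. the sum of its token log-probabilities.

```python
import math

def dpo_loss(policy_w, policy_l, ref_w, ref_l, beta=0.1):
    """DPO loss for one (prompt, preferred, dispreferred) triple.

    policy_w, policy_l: log pi_theta(y_w|x) and log pi_theta(y_l|x)
    ref_w, ref_l:       log pi_ref(y_w|x)   and log pi_ref(y_l|x)
    """
    # beta * (log-ratio of the preferred answer - log-ratio of the dispreferred one)
    logit = beta * ((policy_w - ref_w) - (policy_l - ref_l))
    # -log(sigmoid(logit)), computed as softplus(-logit) for numerical stability
    return max(-logit, 0.0) + math.log1p(math.exp(-abs(logit)))

# At the start of training the policy equals the reference, so the loss is log(2).
loss_start = dpo_loss(-10.0, -12.0, ref_w=-10.0, ref_l=-12.0)
# After the policy shifts probability toward the preferred answer, the loss drops.
loss_later = dpo_loss(-8.0, -14.0, ref_w=-10.0, ref_l=-12.0)
```

Only two models are involved, the policy and the frozen reference, and the reference log-probabilities can be computed once ahead of time since that model never changes.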

DPO converts a complex reinforcement learning problem into a simple binary classification task, making training much more stable and much easier to run.

Because it is simpler and cheaper to run while reaching comparable quality in many reported settings, DPO (along with its variants) has been widely adopted and is now a common replacement for PPO-based RLHF in open-weight models.


Full picture

graph TD
    Base[Pre-trained Base Model] -->|1. SFT: Supervised Fine-Tuning| SFT[SFT Model]
    
    subgraph Traditional Alignment: RLHF
        SFT -->|2. Human evaluation| HumanFeedback[Human Preference Dataset]
        HumanFeedback -->|3. Train secondary network| RewardModel[Reward Model]
        RewardModel -->|4. Complex PPO Reinforcement Learning| Policy[Aligned Policy Model]
    end
    
    subgraph Modern Minimalist Alignment: DPO
        SFT -->|Direct optimization via binary classification loss| DPO[Aligned DPO Model]
    end

Takeaway

Alignment is a critical developmental step that transforms an LLM from a raw text autocomplete engine into a safe, helpful, and honest conversational assistant. While traditional RLHF (Reinforcement Learning from Human Feedback) relies on a complex three-step pipeline involving human preference modeling and PPO reinforcement learning, DPO (Direct Preference Optimization) streamlines this process. By mathematically reformulating the optimization objective into a simple binary classification loss, DPO removes the need for a separately trained reward model and a reinforcement learning loop, making LLM alignment substantially more stable and resource-efficient.
