How GPT Actually Works
If you understood the neural network that classifies MNIST digit images, you can see that GPT works on the same principle. This article takes a detailed look at how GPT actually generates sentences.
The Core: Next Token Prediction
GPT looks impressive, but what it does is simple: guessing the next word (token).
Input: "The weather today is"
Output: "nice" (a single next token)
Long sentences come from repeating this.
"The weather today is" → "nice"
"The weather today is nice" → "let's"
"The weather today is nice let's" → "go"
"The weather today is nice let's go" → "walking"
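The repetition above can be sketched in a few lines. This is a toy illustration, not a real model: `predict_next` is a hypothetical stand-in that looks up hard-coded continuations where a trained network would compute them.

```python
# Hypothetical stand-in for the model: a lookup table of continuations.
# A real GPT computes these from 175 billion learned parameters.
def predict_next(tokens):
    continuations = {
        ("The", "weather", "today", "is"): "nice",
        ("The", "weather", "today", "is", "nice"): "let's",
        ("The", "weather", "today", "is", "nice", "let's"): "go",
        ("The", "weather", "today", "is", "nice", "let's", "go"): "walking",
    }
    return continuations[tuple(tokens)]

tokens = ["The", "weather", "today", "is"]
for _ in range(4):                       # generate four more tokens
    tokens.append(predict_next(tokens))  # feed the whole sequence back in
print(" ".join(tokens))  # The weather today is nice let's go walking
```

The key point is the loop: each new token is appended to the input and the whole sequence goes back in for the next prediction.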
Converting Sentences to Numbers
Neural networks only process numbers, so sentences must first be converted into numbers.
Tokenization
Split sentences into pieces (tokens). Not word units, but subword units.
"unhappiness" → ["un", "happi", "ness"]
"programming" → ["program", "ming"]
Why split into pieces instead of whole words? Because you can't fit every word in the world into a dictionary. Splitting into pieces lets you represent almost any text with 50,000-100,000 tokens.
"Anthropic" → not in dictionary → can't process ❌
"Anthropic" → ["Anthrop", "ic"] → can process ✅
| Model | Vocabulary Size |
|---|---|
| GPT-2 / GPT-3 | ~50,000 |
| GPT-3.5 / GPT-4 | ~100,000 |
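The splitting idea can be sketched with a greedy longest-match tokenizer over a tiny made-up vocabulary. Real tokenizers (BPE) use learned merge rules rather than longest match, but the effect is similar:

```python
# Toy greedy longest-match tokenizer. VOCAB is a made-up miniature
# vocabulary; real BPE tokenizers learn ~50,000-100,000 pieces from data.
VOCAB = {"un", "happi", "ness", "program", "ming", "Anthrop", "ic"}

def tokenize(word):
    tokens, i = [], 0
    while i < len(word):
        # take the longest vocabulary piece matching at position i
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character: emit it alone
            i += 1
    return tokens

print(tokenize("unhappiness"))  # ['un', 'happi', 'ness']
print(tokenize("Anthropic"))    # ['Anthrop', 'ic']
```

Even a word the vocabulary has never seen still breaks down into known pieces, which is why nothing is "unprocessable."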
Embedding
Convert each token to a number array (vector).
"today" → 1523 → [0.2, -0.5, 0.8, 0.1, ...] # thousands of numbers long (12,288 for GPT-3)
"weather" → 892 → [0.1, 0.3, -0.2, 0.7, ...]
The "meaning" of words is captured in these vectors.
Comparing to MNIST
| | MNIST | GPT |
|---|---|---|
| Input | 784 pixels | Token vectors |
| Layers | Dense | Transformer blocks |
| Output | Choose from 10 classes | Choose from ~50,000 tokens |
| Result | "5" | "nice" (next token) |
Just like MNIST chooses one from 0-9, GPT chooses one from all tokens (50,000).
Output: [0.001, 0.002, ..., 0.15, ..., 0.003]
"a" "the" "nice" "house"
↑
highest → selected
The core principle is the same.
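The selection step can be written out directly: raw model scores (logits) become probabilities via softmax, and the highest one wins (greedy decoding). The four-word vocabulary and scores are made up for illustration:

```python
import math

# Made-up logits for a four-token vocabulary; a real model emits ~50,000.
vocab = ["a", "the", "nice", "house"]
logits = [0.1, 0.7, 2.5, 0.3]

exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]            # softmax: sums to 1
next_token = vocab[probs.index(max(probs))]      # pick the highest
print(next_token)  # nice
```

In practice GPT often samples from `probs` rather than always taking the maximum, which is what the "temperature" setting controls.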
Long Sentence Generation = Repeated Model Execution
The number of Transformer layers is fixed; GPT-3 has 96. A longer output doesn't mean more layers. It means running the entire model more times.
1 token = 1 model run
100 tokens = 100 model runs
Run 1: ["today"] → through 96 layers → "weather"
Run 2: ["today", "weather"] → through 96 layers → "is"
Run 3: ["today", "weather", "is"] → through 96 layers → "nice"
...
This is why ChatGPT shows responses character by character. It's actually generating tokens one at a time.
When Does It Stop?
End Token (EOS)
Stops when the model outputs a special "end here" token.
["Hello", "!", "<EOS>"] → stop
Maximum Length Limit
Force stops when exceeding the set token count.
max_tokens=100 # cuts off after 100
This is why long responses sometimes get cut off mid-way.
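Both stop conditions fit in one loop. A minimal sketch, using a dummy model that always emits "word" until the sequence reaches five tokens, then emits the end token:

```python
# Dummy stand-in for the model, just to exercise the stop conditions.
def dummy_model(tokens):
    return "<EOS>" if len(tokens) >= 5 else "word"

def generate(prompt_tokens, max_tokens=100):
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):       # stop 2: hard length limit
        next_token = dummy_model(tokens)
        if next_token == "<EOS>":     # stop 1: model says "end here"
            break
        tokens.append(next_token)
    return tokens

print(generate(["Hello", "!"]))             # stops naturally at <EOS>
print(len(generate(["Hi"], max_tokens=3)))  # 4: prompt plus 3 generated, cut off
```

The second call shows the mid-sentence cutoff: the model never got to emit `<EOS>`, the loop simply ran out of budget.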
Everything Goes Back In Each Time
Every time the next token is generated, all previous tokens go back in as input.
Run 1: ["today"] → "weather"
Run 2: ["today", "weather"] → "is"
Run 3: ["today", "weather", "is"] → "nice"
This is because Self-Attention needs to see the relationships between all tokens. It looks inefficient, but in practice a KV Cache stores previously calculated results, so only the new token needs computing.
No cache: 100 tokens → 5,050 calculations
KV Cache: 100 tokens → 100 calculations
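Where 5,050 comes from: without a cache, generating token t re-attends over all t tokens so far, so the total work is 1 + 2 + ... + 100, a triangular number:

```python
n = 100
no_cache = sum(range(1, n + 1))  # 1 + 2 + ... + 100 = n*(n+1)/2
with_cache = n                   # KV cache: one new token's worth per step
print(no_cache, with_cache)      # 5050 100
```

The saving grows quadratically: at 1,000 tokens it's 500,500 versus 1,000.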
Base GPT vs ChatGPT
A GPT with only pre-training doesn't answer questions. It just does "next word prediction."
# Base GPT
Input: "What is the capital of Korea?"
Output: "asked Cheolsu curiously." ← continues the story
Additional training is needed to work like ChatGPT.
| Stage | Content | Result |
|---|---|---|
| Pre-training | Next word prediction | Only continues sentences |
| Instruction tuning | Train on Q&A data | Starts answering questions |
| RLHF | Train on human feedback | Helpful, friendly answers |
Same model structure, but behaves completely differently depending on what data it's trained on.
Transformer Block Structure
GPT really is just stacked Transformer blocks.
[Embedding] → [Transformer block] × 96 → [Output]
Inside each block:
Input
↓
[Self-Attention] ← Understand token relationships
↓
[Feed-Forward Network] ← Two Dense layers
↓
Output
Essentially Self-Attention plus Dense layers: the same Dense concept as in the MNIST neural network.
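The two parts can be sketched in pure Python on tiny 2-dimensional vectors. This is a simplified single-head version with made-up weights; real blocks add residual connections, layer normalization, scaling, and many attention heads:

```python
import math

def matvec(M, v):
    # multiply matrix M by vector v
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    return [e / sum(exps) for e in exps]

# Toy projection matrices (identity here; learned in a real model)
W_q = W_k = W_v = [[1.0, 0.0], [0.0, 1.0]]

def self_attention(xs):
    Q = [matvec(W_q, x) for x in xs]
    K = [matvec(W_k, x) for x in xs]
    V = [matvec(W_v, x) for x in xs]
    out = []
    for q in Q:
        # how much each token attends to every other token
        weights = softmax([sum(a * b for a, b in zip(q, k)) for k in K])
        # weighted mix of all value vectors
        out.append([sum(w * v[d] for w, v in zip(weights, V))
                    for d in range(len(q))])
    return out

def feed_forward(x):
    # two Dense layers with a ReLU in between, weights made up
    hidden = [max(0.0, h) for h in matvec([[1.0, -1.0], [0.5, 0.5]], x)]
    return matvec([[1.0, 0.0], [0.0, 1.0]], hidden)

tokens = [[1.0, 0.0], [0.0, 1.0]]            # two token vectors
attended = self_attention(tokens)             # mix information across tokens
output = [feed_forward(x) for x in attended]  # transform each one
print(output)
```

Note the division of labor: attention is the only place tokens exchange information; the feed-forward network then processes each position independently.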
How Many Parameters?
Calculating for GPT-3 (175B):
| Setting | Value |
|---|---|
| Number of layers | 96 |
| Embedding dimension | 12,288 |
| Attention heads | 96 |
| Feed-Forward size | 49,152 |
Parameters in One Block
Self-Attention:
- Q, K, V matrices: 12,288 × 12,288 × 3 ≈ 450 million
- Output matrix: 12,288 × 12,288 ≈ 150 million
Feed-Forward:
- First Dense: 12,288 × 49,152 ≈ 600 million
- Second Dense: 49,152 × 12,288 ≈ 600 million
One block total: ~1.8 billion
Total
1.8 billion × 96 blocks ≈ 173 billion
+ Embedding layer (~50,000 × 12,288) ≈ 0.6 billion
+ biases, layer norms, and rounding in the estimates above
─────────────────────────
Total ~175 billion (175B)
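The arithmetic above checks out in a few lines (weight matrices only; biases, layer norms, and embeddings add the rest):

```python
d = 12_288                       # embedding dimension
ffn = 4 * d                      # feed-forward size: 49,152

attention = 3 * d * d + d * d    # Q, K, V projections plus output matrix
feed_forward = d * ffn + ffn * d # two Dense layers
per_block = attention + feed_forward
total = per_block * 96

print(f"{per_block / 1e9:.1f}B per block, {total / 1e9:.0f}B total")
```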
Comparison by Model
| Model | Layers | Parameters |
|---|---|---|
| GPT-2 Small | 12 | 124 million |
| GPT-2 Large | 36 | 774 million |
| GPT-3 | 96 | 175 billion |
| GPT-4 | Not disclosed | Estimated 1 trillion+ |
This Is Really All There Is
# The core of GPT (conceptually)
for i in range(96):
    x = self_attention(x)  # Understand relationships
    x = feed_forward(x)    # Transform
next_token = softmax(x)    # Next-token probabilities, after the last block
The structure is surprisingly simple. What's complex is not the structure but the scale.
- 175 billion parameters
- Hundreds of billions of tokens in training data
- Millions of dollars in training costs
Summary
| Question | Answer |
|---|---|
| What does GPT do? | Predict next token, repeat |
| How does it process sentences? | Tokenize → Embed → Number arrays |
| Structure? | Stack 96 Transformer blocks |
| Inside each block? | Self-Attention + Dense |
| Parameter count? | 1.8B per block, 175B total (GPT-3) |
| To become ChatGPT? | Add instruction tuning + RLHF |
Same principle as "Input → Pass through layers → Classify" understood from MNIST. Just scaled up enormously.
See AI Fundamentals: Machine Learning to GPT for the complete context, and MNIST Neural Network Explained for basic concepts.