How GPT Actually Works

If you understood the neural network that classifies images into digits from MNIST, you can see that GPT works on the same principle. This article takes a detailed look at how GPT actually generates sentences.

The Core: Next Token Prediction

GPT looks impressive, but what it does is simple: guess the next word (token).

Input: "The weather today is"
Output: "nice" (the single next token)

Long sentences come from repeating this.

"The weather today is" → "nice"
"The weather today is nice" → "let's"
"The weather today is nice let's" → "go"
"The weather today is nice let's go" → "walking"
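The repetition above is the whole generation loop: predict one token, append it, feed everything back in. A minimal sketch, with a hypothetical `predict_next()` lookup table standing in for the real model's forward pass:

```python
def predict_next(tokens):
    # Toy stand-in for the model: a fixed lookup table instead of a
    # neural network (the real model computes this prediction).
    continuations = {
        ("The", "weather", "today", "is"): "nice",
        ("The", "weather", "today", "is", "nice"): "let's",
        ("The", "weather", "today", "is", "nice", "let's"): "go",
        ("The", "weather", "today", "is", "nice", "let's", "go"): "walking",
    }
    return continuations.get(tuple(tokens))

tokens = ["The", "weather", "today", "is"]
while True:
    nxt = predict_next(tokens)
    if nxt is None:          # nothing left to predict → stop
        break
    tokens.append(nxt)       # the output becomes part of the next input

print(" ".join(tokens))      # The weather today is nice let's go walking
```

The only thing a real model changes is how `predict_next` works; the loop itself stays this simple.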

Converting Sentences to Numbers

Neural networks only process numbers, so sentences must first be converted into numbers.

Tokenization

Split sentences into pieces (tokens). Not word units, but subword units.

"unhappiness" → ["un", "happi", "ness"]
"programming" → ["program", "ming"]

Why split into pieces instead of whole words? Because you can't fit every word in the world into a dictionary. Splitting into pieces lets you represent almost any text with 50,000-100,000 tokens.

"Anthropic" → not in dictionary → can't process ❌
"Anthropic" → ["Anthrop", "ic"] → can process ✅
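A toy greedy tokenizer shows the idea. Real tokenizers (such as BPE) learn their vocabulary from data; the small vocabulary here is made up for the example:

```python
# Illustrative subword vocabulary (real vocabularies have ~50,000 entries).
VOCAB = {"Anthrop", "ic", "un", "happi", "ness", "program", "ming"}

def tokenize(word):
    tokens, start = [], 0
    while start < len(word):
        # Greedily take the longest vocabulary piece matching at this position.
        for end in range(len(word), start, -1):
            if word[start:end] in VOCAB:
                tokens.append(word[start:end])
                start = end
                break
        else:
            return None  # no piece matches → cannot tokenize
    return tokens

print(tokenize("unhappiness"))  # ['un', 'happi', 'ness']
print(tokenize("Anthropic"))    # ['Anthrop', 'ic']
```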
Model     | Token Count
GPT-2/3   | ~50,000
GPT-3.5/4 | ~100,000

Embedding

Convert each token to a number array (vector).

"today" → 1523 → [0.2, -0.5, 0.8, 0.1, ...]  # length in thousands
"weather" → 892 → [0.1, 0.3, -0.2, 0.7, ...]

The "meaning" of words is captured in these vectors.
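Concretely, an embedding layer is just a big lookup table from token id to vector. A sketch with made-up token ids and a tiny dimension of 4 (GPT-3's is 12,288):

```python
import random

random.seed(0)
VOCAB_SIZE, DIM = 50_000, 4   # toy dimension; GPT-3 uses 12,288

# One row per token id; the values start random and are learned in training.
embedding_table = [[random.uniform(-1, 1) for _ in range(DIM)]
                   for _ in range(VOCAB_SIZE)]

token_ids = [1523, 892]       # "today", "weather" (ids are illustrative)
vectors = [embedding_table[t] for t in token_ids]
print(vectors[0])             # one DIM-length vector per token
```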

Comparing to MNIST

       | MNIST          | GPT
Input  | 784 pixels     | Token vectors
Layers | Dense          | Transformer blocks
Output | Choose from 10 | Choose from 50,000
Result | "5"            | "nice" (next token)

Just like MNIST chooses one from 0-9, GPT chooses one from all tokens (50,000).

Output: [0.001, 0.002, ..., 0.15, ..., 0.003]
        "a"     "the"       "nice"     "house"
                             ↑
                       highest → selected

The core principle is the same.
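The selection step above is softmax over the model's output scores, then picking the highest (greedy decoding). The numbers and four-word vocabulary here are illustrative:

```python
import math

logits = [0.1, 1.2, 4.5, 0.3]            # raw output scores from the model
vocab  = ["a", "the", "nice", "house"]

# Softmax turns scores into probabilities that sum to 1.
exps  = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]

# Greedy decoding: pick the index with the highest probability.
best = max(range(len(probs)), key=lambda i: probs[i])
print(vocab[best])                        # "nice" gets the highest probability
```

In practice, models often sample from these probabilities instead of always taking the maximum, which is what makes outputs vary between runs.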

Long Sentence Generation = Repeated Model Execution

The number of Transformer layers is fixed. GPT-3 has 96 layers. Long sentences don't mean more layers; they come from running the entire model multiple times.

1 token = 1 model run
100 tokens = 100 model runs
Run 1: ["today"] → through 96 layers → "weather"
Run 2: ["today", "weather"] → through 96 layers → "is"
Run 3: ["today", "weather", "is"] → through 96 layers → "nice"
...

This is why ChatGPT streams its responses piece by piece: it really is generating tokens one at a time.

When Does It Stop?

End Token (EOS)

Stops when the model outputs a special "end here" token.

["Hello", "!", "<EOS>"] → stop

Maximum Length Limit

Force stops when exceeding the set token count.

max_tokens=100  # cuts off after 100

This is why long responses sometimes get cut off mid-way.
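Both stop conditions fit in one loop. A sketch, again with a hypothetical `predict_next()` that here just replays a canned script:

```python
EOS = "<EOS>"

def predict_next(tokens):
    script = ["Hello", "!", EOS]        # canned outputs for the sketch
    return script[len(tokens)]

def generate(max_tokens=100):
    tokens = []
    while len(tokens) < max_tokens:     # stop condition 2: length limit
        nxt = predict_next(tokens)
        if nxt == EOS:                  # stop condition 1: end token
            break
        tokens.append(nxt)
    return tokens

print(generate())  # ['Hello', '!']
```

A response "cut off mid-way" is simply this loop exiting via the length limit before the model ever emits EOS.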

Everything Goes Back In Each Time

Every time the next token is generated, all previous tokens go back in as input.

Run 1: ["today"] → "weather"
Run 2: ["today", "weather"] → "is"
Run 3: ["today", "weather", "is"] → "nice"

This is because Self-Attention needs to see the relationships between all tokens. It looks inefficient, but in practice the KV Cache stores previously calculated results, so each run only computes the new token.

No cache: 100 tokens → 5,050 calculations
KV Cache: 100 tokens → 100 calculations
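The two numbers above are just arithmetic: without a cache, generating token *i* reprocesses all *i* tokens so far, while with a KV cache each step processes only the one new token:

```python
n = 100

# No cache: step i reprocesses all i tokens → 1 + 2 + ... + 100
no_cache = sum(range(1, n + 1))

# KV cache: each step processes only the newly generated token
with_cache = n

print(no_cache, with_cache)  # 5050 100
```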

Base GPT vs ChatGPT

A GPT with only pre-training doesn't answer questions. It just does "next word prediction."

# Base GPT
Input: "What is the capital of Korea?"
Output: "asked Cheolsu curiously."  ← continues the story

Additional training is needed to work like ChatGPT.

Stage              | Content                 | Result
Pre-training       | Next word prediction    | Only continues sentences
Instruction tuning | Train on Q&A data       | Starts answering questions
RLHF               | Train on human feedback | Helpful, friendly answers

Same model structure, but behaves completely differently depending on what data it's trained on.

Transformer Block Structure

GPT really is just stacked Transformer blocks.

[Embedding] → [Transformer block] × 96 → [Output]

Inside each block:

Input
  ↓
[Self-Attention] ← Understand token relationships
  ↓
[Feed-Forward Network] ← Two Dense layers
  ↓
Output

Essentially Self-Attention + Dense layers: the same Dense concept seen in the MNIST neural network.
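A single block can be sketched at toy scale. This is a data-flow illustration only (dimension 2, one attention head, fixed identity weights, no residual connections or normalization), not a faithful model:

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    return [e / sum(exps) for e in exps]

def self_attention(x, Wq, Wk, Wv):
    Q, K, V = matmul(x, Wq), matmul(x, Wk), matmul(x, Wv)
    out = []
    for q in Q:  # every token attends to every token
        scores = softmax([sum(qi * ki for qi, ki in zip(q, k)) for k in K])
        out.append([sum(s * v[d] for s, v in zip(scores, V))
                    for d in range(len(V[0]))])
    return out

def feed_forward(x, W1, W2):  # two Dense layers with ReLU in between
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, W1)]
    return matmul(hidden, W2)

I = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, to keep the numbers readable
x = [[1.0, 2.0], [3.0, 4.0]]  # two tokens, embedding dimension 2
x = self_attention(x, I, I, I)
x = feed_forward(x, I, I)
print(x)
```

Note that `feed_forward` is literally two Dense layers, the same building block as in the MNIST network; only `self_attention` is new.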

How Many Parameters?

Calculating for GPT-3 (175B):

Setting             | Value
Number of layers    | 96
Embedding dimension | 12,288
Attention heads     | 96
Feed-Forward size   | 49,152

Parameters in One Block

Self-Attention:
- Q, K, V matrices: 12,288 × 12,288 × 3 ≈ 450 million
- Output matrix: 12,288 × 12,288 ≈ 150 million

Feed-Forward:
- First Dense: 12,288 × 49,152 ≈ 600 million
- Second Dense: 49,152 × 12,288 ≈ 600 million

One block total: ~1.8 billion

Total

1.81 billion × 96 blocks ≈ 174 billion
+ Embedding layer (50,257 × 12,288) ≈ 0.6 billion
─────────────────────────
Total ~175 billion (175B)
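The arithmetic above is easy to reproduce (weight matrices only; biases and layer norms are comparatively tiny and omitted):

```python
d_model, d_ff, n_layers = 12_288, 49_152, 96
vocab = 50_257                # GPT-3 vocabulary size

attention    = 3 * d_model * d_model + d_model * d_model  # Q, K, V + output
feed_forward = d_model * d_ff + d_ff * d_model            # two Dense layers
per_block    = attention + feed_forward

total = per_block * n_layers + vocab * d_model            # blocks + embedding
print(f"{per_block / 1e9:.2f}B per block, {total / 1e9:.1f}B total")
# → 1.81B per block, 174.6B total
```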

Comparison by Model

Model       | Layers        | Parameters
GPT-2 Small | 12            | 124 million
GPT-2 Large | 36            | 774 million
GPT-3       | 96            | 175 billion
GPT-4       | Not disclosed | Estimated 1 trillion+

This Is Really All There Is

# The core of GPT (conceptually)
for i in range(96):
    x = self_attention(x)  # Understand relationships
    x = feed_forward(x)    # Transform

probs = softmax(output_layer(x))  # Probability for every token
next_token = pick(probs)          # Highest probability, or sample

The structure is surprisingly simple. What's complex is not the structure but the scale.

  • 175 billion parameters
  • Hundreds of billions of tokens in training data
  • Millions of dollars in training costs

Summary

Question                       | Answer
What does GPT do?              | Predict next token, repeat
How does it process sentences? | Tokenize → Embed → Number arrays
Structure?                     | Stack 96 Transformer blocks
Inside each block?             | Self-Attention + Dense
Parameter count?               | 1.8B per block, 175B total (GPT-3)
To become ChatGPT?             | Add instruction tuning + RLHF

Same principle as "Input → Pass through layers → Classify" understood from MNIST. Just scaled up enormously.

See AI Fundamentals: Machine Learning to GPT for the complete context, and MNIST Neural Network Explained for basic concepts.