How GPT Actually Works
If you understood the neural network that classifies MNIST digit images, you can see that GPT works on the same principle. This article takes a detailed look at how GPT actually generates sentences.
The Core: Next Token Prediction
GPT looks impressive, but what it does is simple: guessing the next word (token).
Input: "The weather today is"
Output: "nice" (a single next token)
Long sentences come from repeating this.
"The weather today is" → "nice"
"The weather today is nice" → "let's"
"The weather today is nice let's" → "go"
"The weather today is nice let's go" → "walking"
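The repetition above can be sketched in a few lines. This is a toy illustration, not a real model: `predict_next` is a hypothetical stand-in that looks up hard-coded continuations where a trained network would compute them.

```python
# Hypothetical stand-in for the model: a lookup table of continuations.
# A real GPT computes these from 175 billion learned parameters.
def predict_next(tokens):
    continuations = {
        ("The", "weather", "today", "is"): "nice",
        ("The", "weather", "today", "is", "nice"): "let's",
        ("The", "weather", "today", "is", "nice", "let's"): "go",
        ("The", "weather", "today", "is", "nice", "let's", "go"): "walking",
    }
    return continuations[tuple(tokens)]

tokens = ["The", "weather", "today", "is"]
for _ in range(4):                       # generate four more tokens
    tokens.append(predict_next(tokens))  # feed the whole sequence back in
print(" ".join(tokens))  # The weather today is nice let's go walking
```

The key point is the loop: each new token is appended to the input and the whole sequence goes back in for the next prediction.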
Converting Sentences to Numbers
Neural networks only process numbers, so sentences must first be converted into numbers.
Tokenization
Split sentences into pieces (tokens). Not word units, but subword units.
"unhappiness" → ["un", "happi", "ness"]
"programming" → ["program", "ming"]
Why split into pieces instead of whole words? Because you can't fit every word in the world into a dictionary. Splitting into pieces lets you represent almost any text with 50,000-100,000 tokens.
"Anthropic" → not in dictionary → can't process ❌
"Anthropic" → ["Anthrop", "ic"] → can process ✅
| Model | Vocabulary Size |
|---|---|
| GPT-2 / GPT-3 | ~50,000 |
| GPT-3.5 / GPT-4 | ~100,000 |
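The splitting idea can be sketched with a greedy longest-match tokenizer over a tiny made-up vocabulary. Real tokenizers (BPE) use learned merge rules rather than longest match, but the effect is similar:

```python
# Toy greedy longest-match tokenizer. VOCAB is a made-up miniature
# vocabulary; real BPE tokenizers learn ~50,000-100,000 pieces from data.
VOCAB = {"un", "happi", "ness", "program", "ming", "Anthrop", "ic"}

def tokenize(word):
    tokens, i = [], 0
    while i < len(word):
        # take the longest vocabulary piece matching at position i
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character: emit it alone
            i += 1
    return tokens

print(tokenize("unhappiness"))  # ['un', 'happi', 'ness']
print(tokenize("Anthropic"))    # ['Anthrop', 'ic']
```

Even a word the vocabulary has never seen still breaks down into known pieces, which is why nothing is "unprocessable."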
Embedding
Convert each token to a number array (vector).
"today" → 1523 → [0.2, -0.5, 0.8, 0.1, ...] # thousands of numbers long (12,288 for GPT-3)
"weather" → 892 → [0.1, 0.3, -0.2, 0.7, ...]
The "meaning" of words is captured in these vectors.
Comparing to MNIST
| | MNIST | GPT |
|---|---|---|
| Input | 784 pixels | Token vectors |
| Layers | Dense | Transformer blocks |
| Output | Choose from 10 classes | Choose from ~50,000 tokens |
| Result | "5" | "nice" (next token) |
Just like MNIST chooses one from 0-9, GPT chooses one from all tokens (50,000).
Output: [0.001, 0.002, ..., 0.15, ..., 0.003]
"a" "the" "nice" "house"
↑
highest → selected
The core principle is the same.
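The selection step can be written out directly: raw model scores (logits) become probabilities via softmax, and the highest one wins (greedy decoding). The four-word vocabulary and scores are made up for illustration:

```python
import math

# Made-up logits for a four-token vocabulary; a real model emits ~50,000.
vocab = ["a", "the", "nice", "house"]
logits = [0.1, 0.7, 2.5, 0.3]

exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]            # softmax: sums to 1
next_token = vocab[probs.index(max(probs))]      # pick the highest
print(next_token)  # nice
```

In practice GPT often samples from `probs` rather than always taking the maximum, which is what the "temperature" setting controls.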
Long Sentence Generation = Repeated Model Execution
The number of Transformer layers is fixed; GPT-3 has 96. A longer output doesn't mean more layers. It means running the entire model more times.
1 token = 1 model run
100 tokens = 100 model runs
Run 1: ["today"] → through 96 layers → "weather"
Run 2: ["today", "weather"] → through 96 layers → "is"
Run 3: ["today", "weather", "is"] → through 96 layers → "nice"
...
This is why ChatGPT shows responses character by character. It's actually generating tokens one at a time.
When Does It Stop?
End Token (EOS)
Stops when the model outputs a special "end here" token.
["Hello", "!", "<EOS>"] → stop
Maximum Length Limit
Force stops when exceeding the set token count.
max_tokens=100 # cuts off after 100
This is why long responses sometimes get cut off mid-way.
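Both stop conditions fit in one loop. A minimal sketch, using a dummy model that always emits "word" until the sequence reaches five tokens, then emits the end token:

```python
# Dummy stand-in for the model, just to exercise the stop conditions.
def dummy_model(tokens):
    return "<EOS>" if len(tokens) >= 5 else "word"

def generate(prompt_tokens, max_tokens=100):
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):       # stop 2: hard length limit
        next_token = dummy_model(tokens)
        if next_token == "<EOS>":     # stop 1: model says "end here"
            break
        tokens.append(next_token)
    return tokens

print(generate(["Hello", "!"]))             # stops naturally at <EOS>
print(len(generate(["Hi"], max_tokens=3)))  # 4: prompt plus 3 generated, cut off
```

The second call shows the mid-sentence cutoff: the model never got to emit `<EOS>`, the loop simply ran out of budget.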
Everything Goes Back In Each Time
Every time the next token is generated, all previous tokens go back in as input.
Run 1: ["today"] → "weather"
Run 2: ["today", "weather"] → "is"
Run 3: ["today", "weather", "is"] → "nice"
This is because Self-Attention needs to see the relationships between all tokens. It looks inefficient, but in practice a KV Cache stores previously calculated results, so only the new token needs computing.
No cache: 100 tokens → 5,050 calculations
KV Cache: 100 tokens → 100 calculations
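Where 5,050 comes from: without a cache, generating token t re-attends over all t tokens so far, so the total work is 1 + 2 + ... + 100, a triangular number:

```python
n = 100
no_cache = sum(range(1, n + 1))  # 1 + 2 + ... + 100 = n*(n+1)/2
with_cache = n                   # KV cache: one new token's worth per step
print(no_cache, with_cache)      # 5050 100
```

The saving grows quadratically: at 1,000 tokens it's 500,500 versus 1,000.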
Base GPT vs ChatGPT
A GPT with only pre-training doesn't answer questions. It just does "next word prediction."
# Base GPT
Input: "What is the capital of Korea?"
Output: "asked Cheolsu curiously." ← continues the story
Additional training is needed to work like ChatGPT.
| Stage | Content | Result |
|---|---|---|
| Pre-training | Next word prediction | Only continues sentences |
| Instruction tuning | Train on Q&A data | Starts answering questions |
| RLHF | Train on human feedback | Helpful, friendly answers |
Same model structure, but behaves completely differently depending on what data it's trained on.
Transformer Block Structure
GPT really is just stacked Transformer blocks.
[Embedding] → [Transformer block] × 96 → [Output]
Inside each block:
Input
↓
[Self-Attention] ← Understand token relationships
↓
[Feed-Forward Network] ← Two Dense layers
↓
Output
Essentially Self-Attention plus Dense layers: the same Dense concept as in the MNIST neural network.
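The two parts can be sketched in pure Python on tiny 2-dimensional vectors. This is a simplified single-head version with made-up weights; real blocks add residual connections, layer normalization, scaling, and many attention heads:

```python
import math

def matvec(M, v):
    # multiply matrix M by vector v
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    return [e / sum(exps) for e in exps]

# Toy projection matrices (identity here; learned in a real model)
W_q = W_k = W_v = [[1.0, 0.0], [0.0, 1.0]]

def self_attention(xs):
    Q = [matvec(W_q, x) for x in xs]
    K = [matvec(W_k, x) for x in xs]
    V = [matvec(W_v, x) for x in xs]
    out = []
    for q in Q:
        # how much each token attends to every other token
        weights = softmax([sum(a * b for a, b in zip(q, k)) for k in K])
        # weighted mix of all value vectors
        out.append([sum(w * v[d] for w, v in zip(weights, V))
                    for d in range(len(q))])
    return out

def feed_forward(x):
    # two Dense layers with a ReLU in between, weights made up
    hidden = [max(0.0, h) for h in matvec([[1.0, -1.0], [0.5, 0.5]], x)]
    return matvec([[1.0, 0.0], [0.0, 1.0]], hidden)

tokens = [[1.0, 0.0], [0.0, 1.0]]            # two token vectors
attended = self_attention(tokens)             # mix information across tokens
output = [feed_forward(x) for x in attended]  # transform each one
print(output)
```

Note the division of labor: attention is the only place tokens exchange information; the feed-forward network then processes each position independently.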
How Many Parameters?
Calculating for GPT-3 (175B):
| Setting | Value |
|---|---|
| Number of layers | 96 |
| Embedding dimension | 12,288 |
| Attention heads | 96 |
| Feed-Forward size | 49,152 |
Parameters in One Block
Self-Attention:
- Q, K, V matrices: 12,288 × 12,288 × 3 ≈ 450 million
- Output matrix: 12,288 × 12,288 ≈ 150 million
Feed-Forward:
- First Dense: 12,288 × 49,152 ≈ 600 million
- Second Dense: 49,152 × 12,288 ≈ 600 million
One block total: ~1.8 billion
Total
1.8 billion × 96 blocks ≈ 173 billion
+ Embedding layer (~50,000 × 12,288) ≈ 0.6 billion
+ biases, layer norms, and rounding in the estimates above
─────────────────────────
Total ~175 billion (175B)
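The arithmetic above checks out in a few lines (weight matrices only; biases, layer norms, and embeddings add the rest):

```python
d = 12_288                       # embedding dimension
ffn = 4 * d                      # feed-forward size: 49,152

attention = 3 * d * d + d * d    # Q, K, V projections plus output matrix
feed_forward = d * ffn + ffn * d # two Dense layers
per_block = attention + feed_forward
total = per_block * 96

print(f"{per_block / 1e9:.1f}B per block, {total / 1e9:.0f}B total")
```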
Comparison by Model
| Model | Layers | Parameters |
|---|---|---|
| GPT-2 Small | 12 | 124 million |
| GPT-2 Large | 36 | 774 million |
| GPT-3 | 96 | 175 billion |
| GPT-4 | Not disclosed | Estimated 1 trillion+ |
This Is Really All There Is
# The core of GPT (conceptually)
for i in range(96):
    x = self_attention(x)  # Understand relationships
    x = feed_forward(x)    # Transform
next_token = softmax(x)    # Next-token probabilities, after the last block
The structure is surprisingly simple. What's complex is not the structure but the scale.
- 175 billion parameters
- Hundreds of billions of tokens in training data
- Millions of dollars in training costs
Summary
| Question | Answer |
|---|---|
| What does GPT do? | Predict next token, repeat |
| How does it process sentences? | Tokenize → Embed → Number arrays |
| Structure? | Stack 96 Transformer blocks |
| Inside each block? | Self-Attention + Dense |
| Parameter count? | 1.8B per block, 175B total (GPT-3) |
| To become ChatGPT? | Add instruction tuning + RLHF |
Same principle as "Input → Pass through layers → Classify" understood from MNIST. Just scaled up enormously.
See AI Fundamentals: Machine Learning to GPT for the complete context, and MNIST Neural Network Explained for basic concepts.