Understanding Neural Networks Through MNIST Handwritten Digit Recognition
When you start learning machine learning, you almost always run into the MNIST handwritten digit dataset: it's the "Hello World" of the field. This article breaks down MNIST classification code line by line to see how neural networks actually work.
The Complete Code
First, let's see the complete code. You can run it directly in Google Colab.
```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Load data
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()
train_images = train_images / 255.0
test_images = test_images / 255.0

# Build model
model = models.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(128, activation='relu'),
    layers.Dense(10, activation='softmax')
])

# Configure training
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Train
model.fit(train_images, train_labels, epochs=5)

# Evaluate
model.evaluate(test_images, test_labels)
```
Running this gives about 97-98% accuracy. Now let's break it down.
Data: 28×28 Pixel Images
MNIST contains 70,000 handwritten digit images (60,000 for training, 10,000 for testing) of the digits 0 through 9. Each image is 28×28 pixels, with each pixel holding a value from 0 (black) to 255 (white).
```python
train_images = train_images / 255.0
```
This line rescales pixel values from the 0-255 range to the 0-1 range. Neural networks train more stably when inputs are small and on a consistent scale.
28×28 = 784 pixels, so each image is represented by 784 numbers.
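The effect of this normalization is easy to check in plain Python, no TensorFlow required (the pixel values below are made up for illustration):

```python
# Hypothetical raw pixel values in the 0-255 range
pixels = [0, 51, 128, 255]

# Same scaling as train_images / 255.0
normalized = [p / 255.0 for p in pixels]

print(normalized)  # every value now lies between 0.0 and 1.0
```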
Sequential: Stacking Layers in Order
```python
model = models.Sequential([...])
```
Sequential does exactly what the name says: like stacking Lego blocks, it's the simplest way to stack layers in order, bottom to top.
Data flows through in order from top to bottom.
```
Image input
    ↓
[Flatten]
    ↓
[Dense + ReLU]
    ↓
[Dense + Softmax]
    ↓
Result output
```
Flatten: Converting Image to 1D
```python
layers.Flatten(input_shape=(28, 28))
```
Flattens the 28×28 2D image into a single line of 784 numbers. Dense layers expect each sample as a 1D vector, so this step is necessary.
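Flattening is just concatenating the rows. A sketch in plain Python, using a tiny made-up 3×3 "image" in place of the real 28×28 grid:

```python
# A tiny 3x3 "image" standing in for the real 28x28 grid (values made up)
image = [
    [0.0, 0.5, 0.0],
    [0.5, 1.0, 0.5],
    [0.0, 0.5, 0.0],
]

# Flatten: concatenate the rows into one long list
flat = [pixel for row in image for pixel in row]

print(len(flat))  # 3 * 3 = 9 numbers, just as 28 * 28 becomes 784
```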
Dense: The Neural Network Layer
```python
layers.Dense(128, activation='relu')
```
Dense means "densely connected": every input connects to every neuron, hence the name. It's also known as a "fully connected layer."
One Dense is one layer of the neural network.
```
Input layer     Hidden layer     Output layer
  (784)            (128)             (10)
    ●                ●                 ●
    ●                ●                 ●
    ●       →        ●        →        ●
    ●                ●                 ●
   ...              ...               ...
          Dense(128)        Dense(10)
```
| Layer | Type | Neurons | Trainable |
|---|---|---|---|
| Flatten | Transform | - | ❌ |
| Dense(128) | Hidden Layer | 128 | ✅ |
| Dense(10) | Output Layer | 10 | ✅ |
What each neuron does is simple.
```
(input1 × weight1) + (input2 × weight2) + ... + (input784 × weight784) + bias
```
Multiply each of 784 inputs by its weight, add them all up, add the bias. That's it.
Weights represent how important each input is. Bias adjusts the overall threshold.
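That weighted sum is short enough to write out directly. A minimal sketch with 3 inputs instead of 784 (all weights and inputs below are made-up numbers):

```python
def neuron(inputs, weights, bias):
    # (input1 * weight1) + (input2 * weight2) + ... + bias
    return sum(x * w for x, w in zip(inputs, weights)) + bias

# Toy example: 3 inputs instead of 784
inputs = [0.5, 0.2, 0.9]
weights = [1.0, -2.0, 0.5]
bias = 0.1

print(neuron(inputs, weights, bias))  # ≈ 0.65
```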
ReLU: Zeroing Out Negatives
```python
activation='relu'
```
ReLU (Rectified Linear Unit) is an activation function. The rule is very simple.
```
Negative → 0
Positive → stays the same
```
Why is it needed? Without activation functions, no matter how many layers you stack, the whole network collapses into a single linear transformation: one straight line. Nonlinear activations like ReLU are what let the network learn complex, curved patterns.
The result after passing through 128 neurons is 128 numbers. Each number represents how "activated" that neuron is. ReLU typically zeroes out roughly half of them, meaning "this feature isn't present."
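ReLU is a one-liner. A sketch with made-up activation values:

```python
def relu(x):
    # Negative -> 0, positive -> stays the same
    return max(0.0, x)

activations = [-1.2, 0.5, -0.3, 2.0]  # made-up neuron outputs
print([relu(a) for a in activations])  # [0.0, 0.5, 0.0, 2.0]
```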
Second Dense: 128 to 10
```python
layers.Dense(10, activation='softmax')
```
Same approach. This time 10 neurons each receive 128 inputs.
```
Neuron 0: weighted sum of 128 inputs → 1.2
Neuron 1: weighted sum of 128 inputs → 0.5
Neuron 2: weighted sum of 128 inputs → 8.7
...
Neuron 9: weighted sum of 128 inputs → 0.3

Result: [1.2, 0.5, 8.7, 0.1, 0.2, 0.4, 0.1, 0.3, 0.2, 0.3]
```
Softmax: Converting to Probabilities
```python
activation='softmax'
```
The 10 numbers above aren't probabilities yet. Softmax transforms them so they sum to 1.
```
[1.2, 0.5, 8.7, ...] → [0.05, 0.02, 0.85, ...]
```
"5% probability of 0, 2% probability of 1, 85% probability of 2..." Selecting the highest gives the prediction.
Complete Flow
```
 [784]   →    [128]    →      [10]
 pixels      features     probabilities

[0.1, 0.9, 0.2, ...] → [0, 2.5, 8.1, ...] → [0.01, 0.02, 0.85, ...]
        784                   128                    10

Result: "85% probability of 2"
```
How Many Weights?
```
Dense(128): 784 × 128 = 100,352 weights + 128 biases
Dense(10):  128 × 10  =   1,280 weights +  10 biases
─────────────────────────────────────────────────────
Total: 101,770
```
These roughly 100,000 numbers are what the model adjusts during training.
ReLU and Softmax have no weights. They're just formulas. What actually trains are the Dense layer weights and biases.
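The count above is plain arithmetic you can verify yourself:

```python
# Each Dense layer has (inputs * neurons) weights plus one bias per neuron
dense1 = 784 * 128 + 128   # 100,352 weights + 128 biases
dense2 = 128 * 10 + 10     # 1,280 weights + 10 biases

print(dense1 + dense2)  # 101770
```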
Stacking More Layers? Deep Learning
Adding more Dense layers makes a deeper neural network.
```python
model = models.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(256, activation='relu'),    # Hidden layer 1
    layers.Dense(128, activation='relu'),    # Hidden layer 2
    layers.Dense(64, activation='relu'),     # Hidden layer 3
    layers.Dense(10, activation='softmax')   # Output layer
])
```
This makes a 4-layer neural network. Stacking layers deep like this is deep learning.
Deeper layers can learn more complex patterns. But for simple problems like MNIST, 2 layers are sufficient.
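Applying the same parameter arithmetic to this deeper model shows how quickly the count grows:

```python
# Layer widths of the deeper model, input to output
layer_sizes = [784, 256, 128, 64, 10]

# Each Dense layer: (inputs * neurons) weights plus one bias per neuron
total = sum(n_in * n_out + n_out
            for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(total)  # 242762, more than double the 2-layer model's 101,770
```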
Training: Finding Optimal Weights
Initially, all weights are random, so the model starts out no better than guessing: about 10% accuracy, one digit in ten.
Here's the training process.
- Predict: Input an image and see the result
- Compare with answer: Calculate how wrong it is (loss)
- Adjust weights: Change slightly toward being less wrong
- Repeat: Repeat 5 times through all 60,000 images (epochs=5)
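The steps above can be sketched with a toy one-weight example: learn `w` so that `w * x` matches `y`, using the same predict → compare → adjust loop (this is plain gradient descent; all numbers are made up):

```python
# Toy data following y = 2 * x, so the ideal weight is 2.0
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0                # initial guess, standing in for random init
learning_rate = 0.05

for epoch in range(20):                    # repeat over the data ("epochs")
    for x, y in data:
        prediction = w * x                 # 1. predict
        error = prediction - y             # 2. compare with the answer
        w -= learning_rate * error * x     # 3. adjust slightly toward less wrong

print(w)  # close to 2.0 after repeating
```

The real model does the same thing, just with ~100,000 weights at once and a cleverer update rule (Adam) than this fixed learning rate.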
```python
model.compile(
    optimizer='adam',                        # how to adjust weights
    loss='sparse_categorical_crossentropy',  # how to measure wrongness
    metrics=['accuracy']                     # what to report during training
)
model.fit(train_images, train_labels, epochs=5)
```
Running shows this.
```
Epoch 1/5 - loss: 0.26 - accuracy: 0.92
Epoch 2/5 - loss: 0.11 - accuracy: 0.96
Epoch 3/5 - loss: 0.08 - accuracy: 0.97
Epoch 4/5 - loss: 0.05 - accuracy: 0.98
Epoch 5/5 - loss: 0.04 - accuracy: 0.98
```
With repetition, loss decreases and accuracy increases. The 100,000 weights gradually approach optimal values.
Summary
Neural network training comes down to this.
- Model structure: 784 inputs → 128 neurons → 10 outputs
- Training targets: Dense layer weights and biases (about 100,000)
- Training method: Predict → Compare with answer → Adjust weights → Repeat
Terms like Sequential, Dense, ReLU, and Softmax look difficult at first, but breaking them down reveals they're combinations of simple operations: multiplying, adding, and transforming numbers.
Understanding this structure lets you read more complex neural networks with the same principles. See AI Fundamentals: Machine Learning to GPT for the complete context.