Understanding Cross-Entropy Loss Function Easily
When reading machine learning code, you often see lines like this:
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
It's easy to accept that optimizer is the "training method" and metrics are the "evaluation criteria," but what exactly is sparse_categorical_crossentropy in loss?
What is a Loss Function?
A loss function calculates "how different the prediction is from the answer" as a number.
Prediction: "This is 2" (85% confident)
Answer: "2"
Loss: 0.16 (low = good prediction)
Prediction: "This is 3" (80% confident, with only 5% on "2")
Answer: "2"
Loss: 3.0 (high = bad prediction)
During training, weights are adjusted in the direction of lowering this loss. That's why loss functions are important.
Breaking Down the Name
Breaking down sparse_categorical_crossentropy:
| Part | Meaning |
|---|---|
| sparse | Answer is a single number (e.g., 5) |
| categorical | Choosing one from multiple options (0-9 classification) |
| cross-entropy | A way of measuring the difference between probability distributions |
Cross-Entropy Calculation
The core is simple. Only look at the predicted probability for the correct answer's position.
Say we input handwritten "2" and the model predicts this.
Prediction: [0.01, 0.02, 0.85, 0.02, 0.02, 0.02, 0.02, 0.02, 0.01, 0.01]
# 0 1 2 3 4 5 6 7 8 9
# ↑
# Answer position = 85%
Cross-entropy takes only the probability at the answer's position (index 2) and computes:
Loss = -log(0.85) = 0.16
If the model was wrong:
Prediction: [0.01, 0.02, 0.05, 0.80, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02]
# ↑
# Answer position = 5%
Loss = -log(0.05) = 3.0
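The two calculations above can be reproduced in a few lines of plain Python (the variable names here are just for illustration):

```python
import math

# the two prediction distributions from the text above
confident = [0.01, 0.02, 0.85, 0.02, 0.02, 0.02, 0.02, 0.02, 0.01, 0.01]
wrong     = [0.01, 0.02, 0.05, 0.80, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02]
answer = 2  # the true digit

# loss = -log(probability assigned to the true class)
loss_good = -math.log(confident[answer])
loss_bad  = -math.log(wrong[answer])
print(round(loss_good, 2), round(loss_bad, 2))  # → 0.16 3.0
```

Note that in the second case the model's 80% confidence in "3" is irrelevant; only the 5% it left on "2" enters the loss.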
Why Use -log?
Because of the properties of the log function.
| Answer Probability | -log(probability) | Meaning |
|---|---|---|
| 100% (1.0) | 0 | Perfect |
| 85% (0.85) | 0.16 | Good |
| 50% (0.5) | 0.69 | 50-50 |
| 10% (0.1) | 2.3 | Very wrong |
| 1% (0.01) | 4.6 | Almost completely wrong |
Higher probability means lower loss, and as the probability drops, the loss rises sharply. The structure penalizes confident mistakes heavily.
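You can verify the table yourself with a short loop (a minimal sketch using only the standard library):

```python
import math

# reproduce the table above: -log(p) rises sharply as p falls
for p in [1.0, 0.85, 0.5, 0.1, 0.01]:
    loss = abs(math.log(p))  # equals -log(p) here, since log(p) <= 0 for p <= 1
    print(f"p = {p:<4}  loss = {loss:.2f}")
```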
sparse vs regular categorical
There are two types depending on answer format.
sparse_categorical_crossentropy
Used when the answer is a single number.
train_labels = [5, 0, 4, 1, 9, 2, ...] # Just numbers
Most datasets like MNIST use this format.
categorical_crossentropy
Used when the answer is one-hot encoded.
# 5 → [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
# 0 → [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
train_labels = [[0,0,0,0,0,1,0,0,0,0], [1,0,0,0,0,0,0,0,0,0], ...]
The calculation result is the same. Only the answer data format differs.
# If answer is a number
loss='sparse_categorical_crossentropy'
# If answer is one-hot encoded
loss='categorical_crossentropy'
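The conversion between the two formats is mechanical. Here is a minimal sketch with a hypothetical helper, `to_one_hot`, that turns sparse labels into one-hot labels:

```python
# hypothetical helper: convert an integer label to a one-hot vector
def to_one_hot(label, num_classes=10):
    vec = [0] * num_classes
    vec[label] = 1
    return vec

train_labels = [5, 0, 4, 1, 9, 2]                       # sparse format
one_hot_labels = [to_one_hot(y) for y in train_labels]  # categorical format
print(one_hot_labels[0])  # → [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
```

Sparse labels are usually preferable in practice: they use less memory and skip this conversion step entirely.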
Why "Cross"-Entropy?
Entropy comes from information theory, where it measures "uncertainty." Cross-entropy measures how different two probability distributions are.
Answer distribution: [0, 0, 1, 0, 0, 0, 0, 0, 0, 0] ← index 2 is 100%
Prediction distribution: [0.01, 0.02, 0.85, ...] ← index 2 is 85%
It's called Cross-Entropy because it calculates by "crossing" how different these two distributions are. You don't need to know the detailed math - just understand "distribution difference = loss."
Other Loss Functions
Classification problems use Cross-Entropy, but there are other loss functions depending on problem type.
| Problem Type | Loss Function | Example |
|---|---|---|
| Multi-class | categorical_crossentropy | 0-9 digit classification |
| Binary | binary_crossentropy | Dog/cat classification |
| Regression | mse (mean squared error) | House price prediction |
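For comparison, the other two loss functions in the table are just as short to write out by hand (a sketch with illustrative function names, not a library API):

```python
import math

def binary_ce(y, p):
    # y is the true label (0 or 1), p is the predicted probability of class 1
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def mse(y_true, y_pred):
    # mean of squared differences; used for regression targets
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

print(round(binary_ce(1, 0.9), 2))        # → 0.11
print(mse([3.0, 5.0], [2.5, 5.5]))        # → 0.25
```

Binary cross-entropy is the two-class special case of the same -log idea; MSE measures distance between numbers rather than between probability distributions.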
Summary
What this one line does:
loss='sparse_categorical_crossentropy'
- Take the probability at the answer's position from the model's predictions
- Calculate the loss as -log(probability)
- Lower probability (more wrong) means higher loss
- Adjust weights in the direction that reduces this loss
It's ultimately a mathematical expression of the goal "increase the probability of getting the right answer."
See MNIST Neural Network Explained for a clearer picture of the complete training process.