Understanding Cross-Entropy Loss Function Easily

When looking at machine learning code, you often see lines like this:

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

We get that optimizer is the "training method" and metrics are the "measurement criteria," but what on earth is sparse_categorical_crossentropy in loss?

What is a Loss Function?

A loss function calculates "how different the prediction is from the answer" as a number.

Prediction: "This is 2" (85% confident)
Answer: "2"
Loss: 0.16 (low = good prediction)

Prediction: "This is 3" (80% confident)
Answer: "2"
Loss: 3.0 (high = wrong)

During training, weights are adjusted in the direction of lowering this loss. That's why loss functions are important.
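To get a feel for these numbers: the loss here is just the negative log of the probability the model assigned to the correct answer (explained below). In plain Python:

```python
import math

# Prediction "this is 2" at 85% confidence, and the answer is 2
print(round(-math.log(0.85), 2))  # → 0.16

# Prediction "this is 3" at 80% confidence, but the answer is 2;
# suppose the model gave the true class "2" only 5% probability
print(round(-math.log(0.05), 2))  # → 3.0
```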

Breaking Down the Name

Breaking down sparse_categorical_crossentropy:

| Part | Meaning |
|---|---|
| sparse | The answer is a single number (e.g., 5) |
| categorical | Choosing one from multiple options (0-9 classification) |
| cross-entropy | A method for calculating the difference between probability distributions |

Cross-Entropy Calculation

The core is simple: look only at the predicted probability at the correct answer's position.

Say we input handwritten "2" and the model predicts this.

Prediction: [0.01, 0.02, 0.85, 0.02, 0.02, 0.02, 0.02, 0.02, 0.01, 0.01]
#            0     1     2     3     4     5     6     7     8     9
#                        ↑
#                 Answer position = 85%

Cross-entropy takes only the probability at the answer's position (index 2) and computes the loss from it.

Loss = -log(0.85) = 0.16

If the model was wrong:

Prediction: [0.01, 0.02, 0.05, 0.80, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02]
#                        ↑
#                 Answer position = 5%

Loss = -log(0.05) = 3.0
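The whole calculation fits in a few lines of NumPy. This is a sketch for a single prediction vector, not the actual Keras implementation (which handles batches and clips probabilities to avoid log(0)):

```python
import numpy as np

def sparse_crossentropy(prediction, label):
    # Take only the probability at the answer's index, then apply -log
    return -np.log(prediction[label])

pred_right = np.array([0.01, 0.02, 0.85, 0.02, 0.02, 0.02, 0.02, 0.02, 0.01, 0.01])
pred_wrong = np.array([0.01, 0.02, 0.05, 0.80, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02])

print(sparse_crossentropy(pred_right, 2))  # ≈ 0.16
print(sparse_crossentropy(pred_wrong, 2))  # ≈ 3.0
```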

Why Use -log?

Because of the properties of the log function.

| Answer Probability | -log(probability) | Meaning |
|---|---|---|
| 100% (1.0) | 0 | Perfect |
| 85% (0.85) | 0.16 | Good |
| 50% (0.5) | 0.69 | 50-50 |
| 10% (0.1) | 2.3 | Very wrong |
| 1% (0.01) | 4.6 | Almost completely wrong |

Higher probability means lower loss, and as the probability drops, the loss rises sharply. The structure penalizes confident mistakes especially hard.
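You can reproduce the table above with a short loop. Note that -log(p) is the same as log(1/p):

```python
import math

# -log(p) == log(1/p): the less likely the model thought the
# correct answer was, the larger the loss
for p in [1.0, 0.85, 0.5, 0.1, 0.01]:
    print(f"p = {p:<5} -> loss = {math.log(1 / p):.2f}")
```

This prints 0.00, 0.16, 0.69, 2.30, and 4.61, matching the table (4.6 rounded).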

sparse vs regular categorical

There are two types depending on answer format.

sparse_categorical_crossentropy

Used when the answer is a single number.

train_labels = [5, 0, 4, 1, 9, 2, ...]  # Just numbers

Most datasets like MNIST use this format.

categorical_crossentropy

Used when the answer is one-hot encoded.

# 5 → [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
# 0 → [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
train_labels = [[0,0,0,0,0,1,0,0,0,0], [1,0,0,0,0,0,0,0,0,0], ...]

The calculation result is the same. Only the answer data format differs.

# If answer is a number
loss='sparse_categorical_crossentropy'

# If answer is one-hot encoded
loss='categorical_crossentropy'
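A NumPy sketch makes the equivalence concrete: a sparse label selects the probability by indexing, while a one-hot label selects the same probability through multiplication:

```python
import numpy as np

prediction = np.array([0.01, 0.02, 0.05, 0.80, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02])

# sparse: the answer is just the number 3
sparse_loss = -np.log(prediction[3])

# categorical: the answer is one-hot encoded; multiplying zeroes out
# every class except the answer before summing
onehot_label = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0])
categorical_loss = -np.sum(onehot_label * np.log(prediction))

print(np.isclose(sparse_loss, categorical_loss))  # → True
```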

Why "Cross"-Entropy?

Entropy originally comes from information theory, where it measures "uncertainty." Cross-entropy measures the difference between two probability distributions.

Answer distribution:     [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]  ← index 2 is 100%
Prediction distribution: [0.01, 0.02, 0.85, ...]          ← index 2 is 85%

It's called cross-entropy because it measures one distribution using the probabilities of the other, "crossing" the two. You don't need to know the detailed math; just remember "distribution difference = loss."
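For the curious, the general formula is H(p, q) = -Σ p(x)·log q(x), where p is the answer distribution and q is the prediction. Because the answer is one-hot, every term except the answer's index vanishes, which is why the whole thing reduces to -log(probability at the answer). A small NumPy sketch:

```python
import numpy as np

def cross_entropy(p, q):
    # H(p, q) = -sum over all classes of p * log(q)
    return -np.sum(p * np.log(q))

answer = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])  # true distribution (one-hot)
prediction = np.array([0.01, 0.02, 0.85, 0.02, 0.02,
                       0.02, 0.02, 0.02, 0.01, 0.01])  # model's distribution

# Every term except index 2 is zero, so this is exactly -log(0.85)
print(round(cross_entropy(answer, prediction), 2))  # → 0.16
```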

Other Loss Functions

Classification problems use Cross-Entropy, but there are other loss functions depending on problem type.

| Problem Type | Loss Function | Example |
|---|---|---|
| Multi-class | categorical_crossentropy | 0-9 digit classification |
| Binary | binary_crossentropy | Dog/cat classification |
| Regression | mse (mean squared error) | House price prediction |
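Rough sketches of the other two, with made-up numbers (the real Keras losses also average over a batch):

```python
import numpy as np

# binary_crossentropy: one probability for the positive class.
# Made-up example: model says 90% "dog", and the label really is dog (1)
y_true, y_pred = 1, 0.9
bce = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
print(round(bce, 2))  # → 0.11

# mse: average squared difference, used for regression.
# Made-up example: predicted vs. actual house prices
actual = np.array([300.0, 450.0])
predicted = np.array([320.0, 440.0])
mse = np.mean((predicted - actual) ** 2)
print(mse)  # → 250.0
```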

Summary

What this one line does:

loss='sparse_categorical_crossentropy'
  1. Take the probability at the answer position from model predictions
  2. Calculate loss with -log(probability)
  3. Lower probability (more wrong) means higher loss
  4. Adjust weights in the direction of reducing this loss

It's ultimately a mathematical expression of the goal "increase the probability of getting the right answer."

See MNIST Neural Network Explained for a clearer picture of the complete training process.