AI Fundamentals: From Machine Learning to GPT

When using AI services like ChatGPT or Claude, you might wonder "How does this actually work?" But searching online leads to complex formulas and jargon. This article breaks down the complete picture of AI in simple terms.

The Big Picture First

Understanding the overall structure makes everything else easier.

Artificial Intelligence (AI)
   └── Machine Learning (ML)
          └── Deep Learning (DL)
                 └── Transformer
                        └── GPT

Deep learning is a type of machine learning, the Transformer is a deep learning architecture, and GPT is a model built on the Transformer. Let's look at each one.

Machine Learning: Machines Learn on Their Own

Traditional programming requires humans to specify rules directly. "If it's red and round, it's an apple" - you code every case. But this approach has limits. You can't code every possible scenario in the world.

Machine learning takes a different approach. Instead of providing rules, you show examples and let the machine find the rules itself.

Traditional: Rules + Data → Answer
Machine Learning: Data + Answers → Rules

Show it 1,000 apple photos and 1,000 banana photos, each labeled "this is an apple" or "this is a banana," and the machine figures out the differences on its own. This is the core of machine learning.
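Here is a toy sketch of "Data + Answers → Rules." Instead of hand-coding what makes an apple an apple, we hand over labeled examples and let a simple nearest-neighbor rule do the classifying. The features and numbers are invented for illustration; real systems learn from far richer data.

```python
# Toy supervised learning: classify by the closest labeled example.
# Features are (redness, roundness) on a 0-1 scale -- invented toy data.

def nearest_neighbor(train, query):
    """Return the label of the training example closest to `query`."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda example: dist(example[0], query))[1]

train = [
    ((0.9, 0.8), "apple"),
    ((0.8, 0.9), "apple"),
    ((0.2, 0.3), "banana"),
    ((0.1, 0.2), "banana"),
]

print(nearest_neighbor(train, (0.85, 0.75)))  # a red, round fruit -> "apple"
```

No rules about color or shape were ever written down; the "rule" emerges from the labeled data. That is the shift the diagram above describes.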

Machine learning broadly divides into three types.

Supervised learning trains with labeled answers. The apple/banana example above falls into this category.

Unsupervised learning finds patterns without labels. Tell it to "group similar things together" and it classifies on its own.

Reinforcement learning learns through trial and error. Like playing a game - remember what increases the score, avoid what decreases it.

Perceptron: A Single Artificial Neuron

Now let's dig into how machines actually "learn." The most fundamental concept is the perceptron.

Our brains work through connected neurons (nerve cells). Scientists thought about mimicking this mathematically, and that's how the perceptron emerged. Think of it as the simplest form of a single artificial neuron.

Consider deciding whether to watch a movie.

  • Looks interesting → importance 3
  • Have time → importance 2
  • Have money → importance 2
  • Friend is going → importance 1

Each condition is 1 if met, 0 if not. Multiply by importance (weight) and add everything up.

(1×3) + (1×2) + (0×2) + (1×1) = 6 points

If my threshold is 5 points? 6 exceeds 5, so "let's watch the movie!"

This is exactly how a perceptron works. Take inputs, multiply each by weights, sum them up, output 1 if it exceeds the threshold, 0 otherwise.
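The movie decision above translates directly into code. A minimal sketch, using the exact weights (3, 2, 2, 1) and threshold (5) from the example:

```python
# A minimal perceptron: weighted sum of inputs, compared to a threshold.

def perceptron(inputs, weights, threshold):
    """Output 1 if the weighted sum exceeds the threshold, else 0."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0

weights = [3, 2, 2, 1]  # looks interesting, have time, have money, friend is going
inputs = [1, 1, 0, 1]   # every condition met except "have money"

print(perceptron(inputs, weights, threshold=5))  # (3+2+0+1) = 6 > 5 -> 1
```

That single comparison is the entire "neuron."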

How does it learn? When the perceptron gets wrong answers, it adjusts the weights slightly. If the movie wasn't fun, lower the weight for "looks interesting." Repeat this thousands of times and it gets more accurate.
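The adjustment step can be sketched too. This is the classic perceptron learning rule: when the prediction is wrong, nudge each weight toward the correct answer. Here it learns the AND function (a task one perceptron can handle); the bias term and epoch count are implementation choices, not from the text.

```python
# Perceptron learning rule: weights move only when the prediction is wrong.

def predict(inputs, weights, bias):
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if total > 0 else 0

def train_perceptron(samples, n_inputs, epochs=25):
    weights = [0] * n_inputs
    bias = 0  # a learned stand-in for the threshold
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(inputs, weights, bias)  # 0 if correct
            weights = [w + error * x for w, x in zip(weights, inputs)]
            bias += error
    return weights, bias

samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # AND gate
weights, bias = train_perceptron(samples, n_inputs=2)

print([predict(x, weights, bias) for x, _ in samples])  # [0, 0, 0, 1]
```

Note that nothing told the machine what AND means; repeated small corrections did all the work.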

But a single perceptron can't solve complex problems. So people started connecting multiple perceptrons.

Deep Learning: Stacking Neural Networks Deep

Connecting multiple perceptrons creates a neural network. Stack those layers of neurons deep? That's deep learning. "Deep" simply means many layers.

Shallow network: Input → [1 layer] → Output
Deep learning:   Input → [layer] → [layer] → [layer] → ... → Output

Why is stacking deeper better? Looking at how cat photos are recognized makes this clear.

Layer 1 learns very simple things like lines, dots, edges. Layer 2 learns shapes like circles and triangles. Layer 3 learns eyes, nose, ears. Layer 4 learns face shapes. Layer 5 decides "this is a cat!"

It builds up from simple to complex, combining progressively. More layers mean understanding more complex concepts.
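A classic small-scale illustration of why layers help: no single perceptron can compute XOR ("exactly one of the two"), but two layers of them can. The weights below are picked by hand for clarity, not learned.

```python
# Two layers of threshold units computing XOR -- impossible with one layer.

def step(total, threshold):
    return 1 if total > threshold else 0

def xor(a, b):
    # Hidden layer: one unit fires for "a OR b", another for "NOT (a AND b)".
    or_unit = step(a + b, 0.5)
    nand_unit = step(-a - b, -1.5)
    # Output layer: fires only when both hidden units fire.
    return step(or_unit + nand_unit, 1.5)

print([xor(a, b) for a in (0, 1) for b in (0, 1)])  # [0, 1, 1, 0]
```

The hidden units learn (or here, encode) simple intermediate concepts, and the output layer combines them — the same simple-to-complex buildup as in the cat example, in miniature.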

The idea of deep learning has been around for a while. But it only exploded recently for these reasons.

First, data has grown enormously. Thanks to the internet, there's endless data for training.

Second, computing power has improved. Especially GPU development enabled parallel computation.

Third, algorithms have improved. Methods were discovered to train deep networks stably.

Transformer: The Game Changer

In 2017, Google published "Attention Is All You Need." This introduced the Transformer architecture, which became the foundation for modern AI like GPT, BERT, and Claude.

Previously, RNN structures were common. RNNs read text word by word sequentially. Processing "I ate rice" means reading "I", then "ate", then "rice" in order.

The problem is it keeps forgetting what came before. As sentences get longer, earlier content fades.

Transformer takes a different approach. It looks at the entire sentence at once while figuring out how each word relates to every other word.

Imagine a classroom with 30 students. The old way is the teacher asking student 1, then student 2, then 3... up to 30 in sequence. The Transformer way is the teacher saying "everyone raise your hands!" and seeing all 30 at once.

The core mechanism enabling this is Self-Attention.

Self-Attention in one phrase: finding "who's related to me?"

"The dog played in the park. It looked happy."

What does "It" refer to? Humans naturally know it's "dog." Self-Attention lets a model work this out numerically.

The word "It" looks at every word in the sentence. It scores how related each word is.

  • dog: ⭐⭐⭐⭐⭐ (highly related)
  • park: ⭐
  • played: ⭐⭐
  • happy: ⭐⭐

"dog" is most related! So it understands "It = dog."

Every word in the sentence does this simultaneously. Everyone looks at everyone else to understand relationships. Since the sentence finds relationships within itself, it's "Self"-Attention.

This means even in very long sentences, relationships between beginning and end aren't lost. And processing all words simultaneously enables parallel computation for speed.
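The scoring above can be sketched numerically. This is a drastic simplification: the word vectors below are invented for illustration (real models learn embeddings with hundreds of dimensions, plus separate query/key projections). Each word's relevance to "It" is a dot product, passed through a softmax so the scores sum to 1.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Invented 3-dimensional word vectors -- purely illustrative.
vectors = {
    "dog":    [0.9, 0.1, 0.0],
    "played": [0.3, 0.7, 0.1],
    "park":   [0.1, 0.2, 0.9],
    "it":     [0.8, 0.2, 0.1],
}

query = vectors["it"]
keys = ["dog", "played", "park"]
raw = [sum(q * k for q, k in zip(query, vectors[w])) for w in keys]
attention = softmax(raw)

for word, weight in zip(keys, attention):
    print(f"{word}: {weight:.2f}")  # "dog" gets the largest weight
```

In a real Transformer, every word runs this computation against every other word at once, which is exactly what makes the whole thing parallelizable.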

GPT: Why Did It Get So Smart?

GPT (Generative Pre-trained Transformer) is a language model built on the Transformer architecture. It's the "GPT" in ChatGPT.

Several reasons make GPT far superior to previous AI.

First, thanks to Transformer architecture, it understands long contexts well. The Self-Attention explained earlier is key.

Second, it does large-scale pre-training. It massively learns "next word prediction" from vast internet text. Through this process, it naturally acquires grammar, common sense, and reasoning abilities.
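"Next word prediction" sounds grand but the objective itself is simple. Here is a drastically simplified sketch: count which word follows which in a tiny corpus and predict the most frequent follower. GPT pursues the same goal with a neural network over vast amounts of text, not with counts.

```python
from collections import Counter, defaultdict

# Count-based "next word prediction" -- the objective, stripped to its core.

corpus = "the dog played in the park the dog looked happy".split()

followers = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    followers[current][nxt] += 1

def predict_next(word):
    """Return the most frequent word seen after `word` in the corpus."""
    return followers[word].most_common(1)[0][0]

print(predict_next("the"))  # "dog" follows "the" most often here
```

The surprising part is not the objective but the side effects: predicting the next word well at enormous scale forces the model to pick up grammar, facts, and reasoning along the way.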

Third, transfer learning is possible. Once well-trained, it can be applied to various tasks like translation, summarization, Q&A. No need to train from scratch each time.

Fourth, there's the scaling effect. As model size, data, and compute increase, performance predictably improves. Amazing capabilities emerged going from GPT-3 to GPT-4.

Fifth, few-shot learning works. Previous models needed massive labeled data for each new task. GPT can perform new tasks with just a few examples in the prompt.

Summary

It was long, but here's the core.

Machine Learning is machines finding rules from data on their own.

Perceptron is the simplest artificial neuron that activates when weighted inputs exceed a threshold.

Deep Learning is stacking these neurons in many layers to learn complex patterns.

Transformer is a structure that sees the entire sentence at once, understanding relationships through Self-Attention.

GPT is a language model made from Transformer + massive pre-training + scaling.

Understanding this flow makes reading AI-related articles much easier going forward. For deeper exploration, check out Neural Network Basics and Perceptron and ADALINE.