Convolutional Neural Networks
A complete, structured learning path covering every concept from pixels to production, with visuals, formulas, and hands-on projects that make deep learning click.
Course Roadmap
18 carefully structured sections. Begin at any point, but we recommend following the sequence for best understanding.
Course Introduction
Before we write a single line of code, understand why CNNs matter and how they reshaped artificial intelligence.
How CNNs Changed AI
5 Lectures
Before 2012, teaching a computer to recognise a cat in a photo required hand-crafted rules: thousands of lines of code describing whiskers, fur, ears. Engineers spent years building these brittle systems.
Then, in 2012, a Convolutional Neural Network called AlexNet cut the top-5 error rate in the ImageNet competition from about 26% to 15.3% in a single year, a relative drop of roughly 40%. No hand-crafted rules. The network learned what features mattered, directly from raw pixels.
Think of a CNN as a toddler learning to recognise animals. At first they see blurry blobs (low-level features), then shapes (mid-level), then full animals (high-level). CNNs build up their understanding in a similar layered way.
Real-World Applications of CNNs
Python & Math Foundations
Deep learning is applied mathematics. This section gives you just enough, no more and no less, to fully understand what's happening inside every CNN.
Python Essentials
4 Lectures
We use Python because it has the richest deep learning ecosystem on the planet. The two libraries you'll live in are NumPy (fast maths) and Matplotlib (visualisation).
# Creating a 3×3 matrix (a grayscale image patch)
import numpy as np
patch = np.array([
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90]
])
print(patch.shape)   # → (3, 3)
print(patch.mean())  # → 50.0 (average pixel brightness)
- Python Refresher
- NumPy Basics
- Matrices & Arrays
- Data Visualisation
Mathematics for CNNs
6 Lectures
The Three Pillars
A CNN has dozens of layers. The chain rule lets us calculate gradients through every layer by multiplying local gradients together; this is what makes backpropagation possible.
- Vectors & Matrices
- Matrix Multiplication
- Derivatives
- Partial Derivatives
- Chain Rule
- Gradient Descent Basics
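To make the chain rule concrete, here is a minimal sketch on a toy two-step function, u = 3x + 1 followed by y = u². Both functions and the names `forward` and `backward` are illustrative assumptions, not part of the course material. The sketch multiplies the two local gradients and compares the result with a finite-difference estimate.

```python
# Chain rule on a toy two-step composition: u = 3x + 1, y = u**2.
# Both "layers" are illustrative assumptions, chosen for easy differentiation.

def forward(x):
    u = 3.0 * x + 1.0   # "layer" 1
    y = u ** 2          # "layer" 2
    return u, y

def backward(x):
    u, _ = forward(x)
    dy_du = 2.0 * u     # local gradient of layer 2
    du_dx = 3.0         # local gradient of layer 1
    return dy_du * du_dx  # chain rule: multiply the local gradients

x = 2.0
analytic = backward(x)  # 2 * (3*2 + 1) * 3 = 42.0

# Numerical check with a central finite difference.
eps = 1e-6
numeric = (forward(x + eps)[1] - forward(x - eps)[1]) / (2 * eps)

print(analytic, round(numeric, 3))
```

Backpropagation applies the same multiplication layer by layer, reusing the values stored during the forward pass.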
Introduction to Neural Networks
CNNs are a specialised type of neural network. Before studying the specialist, understand the generalist: how data flows, how errors are measured, and how the network learns.
From Biological to Artificial Neurons
6 Lectures
A biological neuron receives signals through dendrites, processes them in the cell body, and fires an output signal down the axon if the input is strong enough. Artificial neurons mimic this process mathematically, in a heavily simplified form.
Each artificial neuron: (1) receives inputs, (2) multiplies each by a weight, (3) sums everything up, (4) passes the sum through an activation function, (5) outputs the result.
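The five steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API: the names `neuron` and `relu` are hypothetical, and the `bias` term is an added assumption (it is standard in practice but not listed in the steps above).

```python
def relu(z):
    """ReLU activation: pass positive values through, clamp negatives to 0."""
    return max(0.0, z)

def neuron(inputs, weights, bias=0.0, activation=relu):
    # Steps 1-3: receive inputs, multiply each by its weight, sum everything up.
    # The bias term is an assumption added for realism.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Steps 4-5: pass the sum through the activation and output the result.
    return activation(z)

# 0.5*1.0 + (-0.25)*2.0 + 0.25*3.0 + 0.25 = 1.0
output = neuron([1.0, 2.0, 3.0], [0.5, -0.25, 0.25], bias=0.25)
print(output)  # → 1.0
```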
Activation Functions: Introducing Non-Linearity
| Function | Formula | When to Use | Limitation |
|---|---|---|---|
| ReLU | max(0, x) | Hidden layers of CNNs (default choice) | Neurons can "die" (always output 0) if their input stays negative |
| Sigmoid | 1 / (1 + e^(-x)) | Binary classification output | Vanishing gradient for deep networks |
| Tanh | (e^x - e^(-x)) / (e^x + e^(-x)) | Recurrent networks, some hidden layers | Still suffers vanishing gradient |
| Softmax | e^(x_i) / Σ_j e^(x_j) | Multi-class classification output | Computationally expensive for many classes |
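The formulas in the table translate directly into code. The sketch below uses only the standard library; the function names are illustrative, and subtracting the maximum inside `softmax` is an added numerical-stability trick that does not change the result.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def softmax(xs):
    # Subtracting max(xs) avoids overflow in exp() without changing the output.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(relu(-2.0), sigmoid(0.0), tanh(0.0))
print(probs)  # three probabilities that sum to 1, largest for the input 3.0
```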
Without an activation function, stacking 100 layers is mathematically equivalent to a single linear layer: just one big matrix multiply. Non-linear activations give networks the power to learn complex, curved decision boundaries.
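A small numerical check of this claim, using hand-picked weights (all values here are illustrative assumptions): two linear layers with no activation give the same output as one layer with merged weights, while inserting a ReLU between them breaks the equivalence.

```python
import numpy as np

x = np.array([1.0, -2.0])
W1 = np.array([[1.0, 2.0],
               [3.0, -1.0]])   # first layer weights
W2 = np.array([[1.0, 1.0]])    # second layer weights

# Two linear layers with no activation in between...
two_layers = W2 @ (W1 @ x)
# ...are the same as one linear layer with the merged matrix W2 @ W1.
one_layer = (W2 @ W1) @ x
same_without_activation = np.allclose(two_layers, one_layer)

# With a ReLU between the layers, the merged matrix no longer matches.
with_relu = W2 @ np.maximum(0.0, W1 @ x)
same_with_relu = np.allclose(with_relu, one_layer)

print(same_without_activation, same_with_relu)  # → True False
```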
Introduction to Computer Vision
Computers don't see images the way humans do. They see grids of numbers. Mastering this perspective is the single most important conceptual shift in this entire course.
Images as Matrices of Numbers
6 Lectures
Grayscale Images
A grayscale image is a 2D matrix where each value (pixel) is between 0 (black) and 255 (white). A 28×28 grayscale image has 784 numbers, and that's all a computer ever sees.
RGB Colour Images
A colour image has three channels (Red, Green, Blue) stacked on top of each other. A 224×224 colour image is actually a 3D tensor of shape (224, 224, 3) containing 150,528 values.
Red (255, 0, 0) + Green (0, 255, 0) = Yellow (255, 255, 0). Every colour on your screen is a combination of RGB intensity values.
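The numbers quoted above are easy to confirm with NumPy. This is a minimal sketch using blank placeholder images; the variable names are illustrative.

```python
import numpy as np

# A 28x28 grayscale image: one 8-bit value per pixel.
gray = np.zeros((28, 28), dtype=np.uint8)

# A 224x224 colour image: height x width x 3 channels (R, G, B).
rgb = np.zeros((224, 224, 3), dtype=np.uint8)
rgb[..., 0] = 255  # full red everywhere
rgb[..., 1] = 255  # plus full green gives yellow

print(gray.size)            # → 784
print(rgb.shape, rgb.size)  # → (224, 224, 3) 150528
print(rgb[0, 0].tolist())   # → [255, 255, 0]
```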
Image Preprocessing: Why It Matters
Convolution Operation Deep Dive
The convolution operation is the heart of every CNN. Understand this completely, and the rest of the course follows naturally.
What Is a Convolution? Kernels & Feature Maps
7 Lectures
A kernel (also called a filter) is a tiny matrix, typically 3×3 or 5×5, of learnable numbers. During convolution, this kernel slides across the input image, and at each position, we compute an element-wise product and sum the result.
The result of sliding a kernel across the entire image is called a feature map. It shows where in the image the kernel's pattern was found.
Imagine holding a magnifying glass (kernel) over a document and sliding it across. At each spot, you check if a particular pattern (e.g. the letter "e") matches. The feature map records the match strength at every location.
Stride and Padding Explained
| Parameter | What It Does | Effect on Output Size | Typical Values |
|---|---|---|---|
| Stride = 1 | Kernel moves 1 pixel at a time (standard) | Output ≈ input size (with padding) | Default for most layers |
| Stride = 2 | Kernel jumps 2 pixels, skipping positions | Output ≈ half input size | Used to downsample instead of pooling |
| Padding = 'valid' | No padding; kernel stays inside image | Output shrinks by (kernel_size - 1) | When you want smaller feature maps |
| Padding = 'same' | Zero-pad edges to preserve input size | Output = input size | Most common in modern CNNs |
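The effects in the table follow from the standard output-size formula, floor((n + 2p - k) / s) + 1, for input size n, kernel size k, stride s, and p pixels of zero padding on each side. The helper below is a small sketch of that formula; the function name and argument names are assumptions for illustration.

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Spatial output size for input n, kernel k, with symmetric zero padding."""
    return (n + 2 * padding - k) // stride + 1

valid = conv_output_size(28, 3)                         # no padding
same = conv_output_size(28, 3, padding=1)               # 1 pixel of padding keeps the size
strided = conv_output_size(28, 3, stride=2, padding=1)  # stride 2 halves the size

print(valid, same, strided)  # → 26 28 14
```

For a 3×3 kernel, 'valid' shrinks the output by kernel_size - 1 = 2, and one pixel of padding per side reproduces 'same' at stride 1.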
Visual Labs โ What Different Kernels Detect
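# Applying a convolution in Python (from scratch)
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over a 2D image with stride 1 and no padding."""
    kH, kW = kernel.shape
    iH, iW = image.shape
    # Output dimensions (no padding)
    out = np.zeros((iH - kH + 1, iW - kW + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            region = image[y:y+kH, x:x+kW]
            # Element-wise product, then sum (kernel is not flipped,
            # matching how deep learning libraries define convolution)
            out[y, x] = np.sum(region * kernel)
    return out

# Vertical edge detector kernel (Sobel)
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]])

# Placeholder input: replace with your own 2D grayscale array
my_image = np.zeros((6, 6))
my_image[:, 3:] = 255  # dark left half, bright right half

result = convolve2d(my_image, kernel)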
Pooling Layers
After convolution, feature maps are large. Pooling reduces their size while preserving the most important information, making the network faster and more robust.
Max Pooling vs Average Pooling
5 Lectures
Max Pooling takes the maximum value in each pooling window. It answers: "Was this feature present anywhere in this region?" Widely used because it retains the strongest activations.
Average Pooling takes the mean of the window. Smoother output. Used in some architectures (GoogLeNet) and often in the final global pooling layer.
Global Average Pooling collapses an entire feature map to a single number. It is used at the end of modern architectures instead of fully-connected layers to reduce parameters.
Input: [[1, 3, 2, 4], [5, 6, 7, 8]] → after 2×2 max pool with stride 2 → [[6, 8]]. Each spatial dimension halves, keeping only the strongest signals.
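The worked example above can be reproduced with a short NumPy sketch. The function name `max_pool_2x2` is hypothetical, and the reshape trick assumes the height and width are both even.

```python
import numpy as np

def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on a 2D array with even height and width."""
    h, w = fmap.shape
    # Split the map into non-overlapping 2x2 blocks...
    blocks = fmap.reshape(h // 2, 2, w // 2, 2)
    # ...and keep the maximum inside each block.
    return blocks.max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 7, 8]])
pooled = max_pool_2x2(fmap)
print(pooled.tolist())  # → [[6, 8]]
```

Swapping `max` for `mean` on the same blocks gives average pooling.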
Building Your First CNN
Everything comes together here. You'll build a complete end-to-end Convolutional Neural Network and achieve 98%+ accuracy on the MNIST handwritten digit dataset.
CNN Architecture: The Full Pipeline
5 Lectures
A typical CNN processes an image through a repeating pattern of Convolution → Activation → Pooling blocks, followed by fully-connected layers for final classification.
from tensorflow import keras

# Load MNIST, scale pixels to [0, 1], and add a channel axis: (N, 28, 28, 1)
(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype('float32') / 255.0

model = keras.Sequential([
    # Block 1: Convolution + ReLU + Pooling
    keras.layers.Conv2D(32, (3,3), activation='relu', input_shape=(28,28,1)),
    keras.layers.MaxPooling2D((2,2)),
    # Block 2: Deeper features
    keras.layers.Conv2D(64, (3,3), activation='relu'),
    keras.layers.MaxPooling2D((2,2)),
    # Classifier head
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax')  # 10 digit classes
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, validation_split=0.1)
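After all convolution and pooling layers, we have a 3D tensor (5×5×64 for the model above, since each unpadded 3×3 convolution trims 2 pixels and each pooling layer halves the size: 28 → 26 → 13 → 11 → 5). We "flatten" this into a 1D vector of 1,600 numbers so the fully-connected layers can process it. Think of it as unrolling a cube into a long string.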
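Flattening is just a reshape: a feature tensor of shape (H, W, C) becomes a vector of H·W·C values. The sketch below shows this for a batch of feature maps; `flatten_features` is a hypothetical helper, and the two shapes are examples only.

```python
import numpy as np

def flatten_features(batch):
    """Flatten (N, H, W, C) feature maps into (N, H*W*C) vectors."""
    return batch.reshape(batch.shape[0], -1)

for shape in [(5, 5, 64), (7, 7, 64)]:
    batch = np.zeros((2, *shape))  # a batch of 2 feature tensors
    print(shape, "->", flatten_features(batch).shape)
# → (5, 5, 64) -> (2, 1600)
# → (7, 7, 64) -> (2, 3136)
```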
CNN Mathematics & Backpropagation
This section opens the black box. You'll see exactly how a CNN calculates error, distributes blame backwards through layers, and updates its weights.
How CNNs Learn: The Learning Loop
5 Lectures
Loss Functions & Optimisation
Choosing the right loss function and optimiser is as important as the architecture itself. This section covers every major option and when to use each.
Loss Functions: Measuring Error Precisely
7 Lectures
Optimisers: Smarter Ways to Descend
| Optimiser | Key Idea | Best For |
|---|---|---|
| SGD | Pure gradient descent: simple and reliable | When you want full control and tuning |
| Momentum | Accumulates velocity in gradient direction, like a ball rolling downhill | Training on noisy gradients |
| RMSProp | Adapts learning rate per-parameter based on recent gradient magnitude | Recurrent networks |
| Adam | Combines Momentum + RMSProp: adaptive and fast | Default choice for CNNs |
The learning rate matters most. Too high: the network overshoots and may never converge. Too low: training takes forever. Start with lr=0.001 with Adam. Use learning rate scheduling (e.g. cosine annealing) to reduce it as training progresses.
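As an illustration of learning rate scheduling, here is a minimal sketch of cosine annealing without restarts: the rate follows half a cosine wave from `lr_max` down to `lr_min`. The function and parameter names are assumptions for this example; both Keras and PyTorch ship their own schedulers, which you would normally use instead.

```python
import math

def cosine_annealing(step, total_steps, lr_max=0.001, lr_min=0.0):
    """Learning rate at `step`, decaying from lr_max to lr_min along a cosine curve."""
    progress = min(step, total_steps) / total_steps  # fraction of training done, 0..1
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

schedule = [cosine_annealing(s, 10) for s in range(11)]
for step, lr in enumerate(schedule):
    print(step, f"{lr:.6f}")
```

The rate starts at 0.001, passes through half that value at the midpoint, and reaches 0 at the final step.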
Preventing Overfitting
A model that memorises training data but fails on new data is useless. These techniques force your network to generalise rather than memorise.
Regularisation Techniques
5 Lectures
Underfitting: Both training and validation loss are high. Model is too simple or not trained enough. Fix: more capacity, longer training.
Overfitting: Training loss is low but validation loss is high. Model memorised training data. Fix: regularisation, more data, simpler model.
Deep Learning Frameworks
TensorFlow and PyTorch are the two most widely used deep learning frameworks. This section makes you proficient in both.
TensorFlow / Keras vs PyTorch
6 Lectures
| Feature | TensorFlow / Keras | PyTorch |
|---|---|---|
| Design Philosophy | Production-first; static/dynamic graphs via tf.function | Research-first; dynamic computational graph (eager by default) |
| Ease of Use | Keras API is very beginner-friendly | More Pythonic; feels like NumPy with gradients |
| Industry Use | Dominant in production / mobile (TF Lite) | Dominant in research papers (60%+ of papers) |
| Debugging | Harder to debug graph execution | Easy: standard Python debugger works |
| Our Recommendation | Start here for fast prototyping | Move here for custom research |
# PyTorch equivalent of our MNIST CNN
import torch
import torch.nn as nn

class MyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3),   # 1 channel in, 32 filters, 3×3 kernel
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64*5*5, 128),  # 5×5×64 feature maps after two conv/pool blocks
            nn.ReLU(),
            nn.Linear(128, 10)       # raw logits for the 10 digit classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))
Famous CNN Architectures
Every milestone architecture solved a specific problem. Understanding these designs gives you an intuition for architectural choices that transfer to your own projects.
The Evolution of CNN Architectures
6 Lectures + Analysis
Why ResNet Solved Vanishing Gradients
In very deep networks, gradients can shrink exponentially as they travel backwards through layers; they "vanish" before reaching early layers, which stop learning.
ResNet's solution: skip connections (residual connections). Instead of learning a direct mapping, each block learns a residual (the difference from its input). The gradient can now flow back through the shortcut path unchanged.
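A short derivation shows why the shortcut helps. The notation is an assumption, since the text above introduces no symbols: x is the block input, F(x) the learned residual, y the block output, and L the loss.

```latex
y = x + F(x)
\qquad\Longrightarrow\qquad
\frac{\partial \mathcal{L}}{\partial x}
  = \frac{\partial \mathcal{L}}{\partial y}
    \left( I + \frac{\partial F}{\partial x} \right)
  = \underbrace{\frac{\partial \mathcal{L}}{\partial y}}_{\text{shortcut path}}
  + \underbrace{\frac{\partial \mathcal{L}}{\partial y}\,
    \frac{\partial F}{\partial x}}_{\text{residual path}}
```

Even if the residual-path term becomes very small, the shortcut term passes the upstream gradient through unchanged, so early layers still receive a learning signal.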
Transfer Learning
Why train from scratch when ResNet has already learned from more than a million labelled ImageNet images? Transfer learning lets you build state-of-the-art models in hours, not weeks.
Feature Extraction & Fine-Tuning
4 Lectures
from tensorflow.keras.applications import ResNet50
from tensorflow.keras import layers, Model
# Load ResNet with ImageNet weights, remove top classifier
base = ResNet50(weights='imagenet', include_top=False, input_shape=(224,224,3))
base.trainable = False # Freeze all layers
# Add our custom head for binary classification
x = layers.GlobalAveragePooling2D()(base.output)
x = layers.Dense(256, activation='relu')(x)
output = layers.Dense(1, activation='sigmoid')(x) # Cat vs Dog
model = Model(inputs=base.input, outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
Object Detection
Classification says "there's a cat." Detection says "there's a cat, right there, at those pixel coordinates." This is what powers self-driving cars and security cameras.
From Classification to Detection
5 Lectures
A bounding box is a rectangle defined by four values: (x_min, y_min, x_max, y_max). Object detection models must simultaneously predict the class of the object and the bounding box coordinates.
The Major Approaches
| Method | Approach | Speed | Accuracy |
|---|---|---|---|
| R-CNN | Region proposals → classify each | Slow (47s/image) | High |
| Fast R-CNN | Shared CNN features across proposals | 2s/image | High |
| Faster R-CNN | Region Proposal Network (RPN) | 0.2s/image | Very High |
| YOLO | Single pass: predict all boxes at once | Real-time (30fps+) | Good |
| SSD | Multi-scale feature maps, single pass | Real-time | Good |
YOLO divides the image into an S×S grid. Each cell predicts B bounding boxes and their confidence scores, plus C class probabilities. Everything happens in a single forward pass, which is why it's fast enough for real-time video.
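The grid description above determines the size of the network's output. In the original YOLO (v1) layout, each box carries 5 numbers (x, y, w, h, confidence), so each cell outputs B·5 + C values. The sketch below computes that shape; the function name is hypothetical, and later YOLO versions arrange their outputs differently.

```python
def yolo_output_shape(S, B, C):
    """Shape of a YOLOv1-style prediction tensor.

    Each of the S*S cells predicts B boxes (x, y, w, h, confidence)
    plus C class probabilities shared by the cell.
    """
    return (S, S, B * 5 + C)

# Settings from the original YOLO paper: 7x7 grid, 2 boxes per cell, 20 classes.
shape = yolo_output_shape(S=7, B=2, C=20)
total_values = shape[0] * shape[1] * shape[2]
print(shape, total_values)  # → (7, 7, 30) 1470
```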
Image Segmentation
Where detection draws boxes, segmentation colours every single pixel. Essential for medical imaging, autonomous driving, and satellite analysis.
Semantic vs Instance Segmentation & U-Net
4 Lectures
U-Net has an encoder (contracting path) that captures context and a decoder (expanding path) that enables precise localisation. Skip connections between matching encoder/decoder layers preserve fine-grained spatial information that would otherwise be lost during downsampling.
CNN Deployment
A model nobody can use is just a research experiment. This section takes your trained model from a Python notebook to a live web API or mobile app.
From Model File to Production API
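5 Lectures
Save the trained model with model.save('my_cnn.h5') (Keras) or torch.save(model.state_dict(), 'model.pth') (PyTorch). The Keras call stores both architecture and weights; a PyTorch state_dict stores only the weights, so keep the model class definition alongside it.
# Simple Flask API for your CNN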
from flask import Flask, request, jsonify
import numpy as np
from tensorflow import keras
from PIL import Image
import io

app = Flask(__name__)
model = keras.models.load_model('my_cnn.h5')
class_names = ['airplane', 'cat', 'dog', 'car', 'bird']

@app.route('/predict', methods=['POST'])
def predict():
    # Decode the raw request body as an image and force 3 channels (RGB).
    # The (224, 224) size must match the input size the model was trained with.
    img = Image.open(io.BytesIO(request.data)).convert('RGB').resize((224, 224))
    arr = np.array(img) / 255.0                 # scale pixels to [0, 1]
    pred = model.predict(arr[np.newaxis, ...])  # add the batch axis
    return jsonify({'class': class_names[pred.argmax()],
                    'confidence': float(pred.max())})

if __name__ == '__main__':
    app.run(port=5000)
Advanced Topics
Where CNNs meet the cutting edge of modern AI: Vision Transformers, Generative Models, and beyond.
Modern Vision Systems
5 Lectures
CNNs have built-in inductive biases: they assume local patterns matter (locality) and that the same pattern can be detected wherever it appears (translation equivariance). ViTs build in far fewer such assumptions and must learn them from data. ViTs win at very large scale; CNNs are often better with limited data.
Ethics & Responsible AI
With great power comes great responsibility. Every CNN practitioner must understand the societal consequences of the systems they build.
Societal Impact of Computer Vision
4 Lectures
Gradient-weighted Class Activation Mapping (Grad-CAM) produces a heatmap showing which image regions the CNN focused on when making a decision. If a model classifies a dog image correctly because it looked at the dog, great. If it looked at the leash, that is suspicious. Explainability tools are becoming standard in regulated industries.
