Convolutional Neural Networks — Complete Course

🎓 Complete Beginner to Expert

Convolutional Neural Networks

A complete, structured learning path covering every concept from pixels to production — with visuals, formulas, and hands-on projects that make deep learning click.

18 Sections | 80+ Lectures | Projects


Neural Network Visualization — layers detecting features from pixels to predictions

📋

Navigation

Course Roadmap

18 carefully structured sections. Begin at any point, but we recommend following the sequence for best understanding.

Course Introduction

Goals, roadmap, real-world AI

Python & Math Foundations

NumPy, matrices, gradient descent

Neural Networks

Neurons, backpropagation, activations

Computer Vision Basics

Images as matrices, RGB, pixels

Convolution Deep Dive

Kernels, stride, padding, feature maps

Pooling Layers

Max pooling, average pooling

Building Your First CNN

End-to-end architecture, MNIST

CNN Math & Backprop

Gradients, weight updates

Loss & Optimization

Adam, SGD, learning rate

Preventing Overfitting

Dropout, batch norm, augmentation

Deep Learning Frameworks

TensorFlow, Keras, PyTorch

Famous CNN Architectures

LeNet → ResNet → EfficientNet

Transfer Learning

Pretrained models, fine-tuning

Object Detection

YOLO, R-CNN, bounding boxes

Image Segmentation

U-Net, semantic, instance

CNN Deployment

Flask, ONNX, mobile

Advanced Topics

Vision Transformers, GANs

Ethics & Responsible AI

Bias, privacy, adversarial attacks

🚀

Section 01

Course Introduction

Before we write a single line of code, understand why CNNs matter and how they reshaped artificial intelligence.

🌍

How CNNs Changed AI

5 Lectures

Before 2012, teaching a computer to recognise a cat in a photo required hand-crafted rules — thousands of lines of code describing whiskers, fur, ears. Engineers spent years building these brittle systems.

Then, in 2012, a Convolutional Neural Network called AlexNet slashed the error rate on the ImageNet competition by nearly half — from 26% to 15.3% — in a single year. No hand-crafted rules. The network learned what features mattered, directly from raw pixels.

💡

Beginner Analogy

Think of a CNN as a toddler learning to recognise animals. At first they notice simple edges and blobs (low-level features), then shapes (mid-level), then full animals (high-level). CNNs build up their understanding in a similar layered way.

Computer Vision AI Robot Eye — Computer vision enables machines to interpret and understand the visual world

Real-World Applications of CNNs

🏥

Medical Imaging

Detecting tumours in MRI scans and diabetic retinopathy in eye photographs — often matching radiologist accuracy.

🚗

Self-Driving Cars

Identifying pedestrians, road signs, lane markings, and obstacles in real time at highway speeds.

📱

Face Unlock

Your phone's face recognition can run a neural network on frames from the front camera many times per second.

🌾

Agriculture

Drones scan farmland and CNNs identify diseased crops before the damage spreads — saving entire harvests.

🔍

Quality Control

Factory cameras inspect thousands of products per minute, flagging defects invisible to the human eye.

🎨

Generative Art

Stable Diffusion, DALL·E, and Midjourney all rely on CNN-derived components to understand and generate images.

🧮

Section 02

Python & Math Foundations

Deep learning is applied mathematics. This section gives you just enough — no more, no less — to fully understand what's happening inside every CNN.

🐍

Python Essentials

4 Lectures

We use Python because it has the richest deep learning ecosystem on the planet. The two libraries you'll live in are NumPy (fast maths) and Matplotlib (visualisation).

# Creating a 3×3 matrix (a grayscale image patch)
import numpy as np

patch = np.array([
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90]
])

print(patch.shape)   # → (3, 3)
print(patch.mean())  # → 50.0  (average pixel brightness)

Python Refresher
NumPy Basics
Matrices & Arrays
Data Visualisation

∑

Mathematics for CNNs

6 Lectures

The Three Pillars

Matrix Multiplication

Every fully-connected layer in a neural network is a matrix multiplication, and a convolution can be written as one too. Understanding this unlocks the entire architecture.

Derivatives & Partial Derivatives

The derivative of the loss tells the network which direction to adjust its weights. Partial derivatives extend this to millions of parameters simultaneously.

Gradient Descent

The learning algorithm. Like rolling a ball down a hill to find the lowest point (minimum loss).

Mathematics equations on blackboard — Matrix operations form the mathematical backbone of every neural network

⚙ Gradient Descent Update Rule

w_new = w_old − η · (∂L / ∂w)

w = weight | η (eta) = learning rate | ∂L/∂w = gradient of loss with respect to weight
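As a minimal sketch of the update rule, the loop below applies it to a toy one-parameter loss L(w) = (w − 3)², chosen here purely for illustration. The function names and the learning rate of 0.1 are assumptions, not part of the course material.

```python
# Toy loss L(w) = (w - 3)^2, whose minimum sits at w = 3 (illustrative choice).
def loss_gradient(w):
    # dL/dw for L(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

def gradient_descent(w_start, learning_rate, steps):
    w = w_start
    for _ in range(steps):
        # w_new = w_old - eta * dL/dw
        w = w - learning_rate * loss_gradient(w)
    return w

w_final = gradient_descent(w_start=0.0, learning_rate=0.1, steps=100)
print(round(w_final, 4))  # → 3.0
```

Each step shrinks the distance to the minimum by a constant factor, which is why the weight settles at 3 after enough iterations.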

📝

Why the chain rule matters

A CNN has dozens of layers. The chain rule lets us calculate gradients through every layer by multiplying local gradients together — this is what makes backpropagation possible.
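A small sketch can make the "multiply local gradients" idea concrete. It uses a hypothetical two-step function y = f(g(x)), with g(x) = 3x + 1 and f(u) = u², and checks the chain-rule result against a numerical estimate.

```python
# Two "layers": an inner function g and an outer function f (illustrative choices).
def g(x):
    return 3.0 * x + 1.0

def f(u):
    return u ** 2

def chain_rule_gradient(x):
    u = g(x)
    df_du = 2.0 * u   # local gradient of the outer function
    dg_dx = 3.0       # local gradient of the inner function
    return df_du * dg_dx

def numerical_gradient(x, eps=1e-6):
    # Central difference estimate of dy/dx, used only as a sanity check
    return (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)

print(chain_rule_gradient(2.0))  # → 42.0
```

Backpropagation does the same thing across many layers: each layer supplies its local gradient and the products are accumulated from the output back to the input.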

Vectors & Matrices
Matrix Multiplication
Derivatives
Partial Derivatives
Chain Rule
Gradient Descent Basics

🧠

Section 03

Introduction to Neural Networks

CNNs are a specialised type of neural network. Before studying the specialist, understand the generalist — how data flows, how errors are measured, and how the network learns.

⚡

From Biological to Artificial Neurons

6 Lectures

A biological neuron receives signals through dendrites, processes them in the cell body, and fires an output signal down the axon if the input is strong enough. Artificial neurons are a simplified mathematical model of the same idea.

Each artificial neuron: (1) receives inputs, (2) multiplies each by a weight, (3) sums everything up, (4) passes the sum through an activation function, (5) outputs the result.

⚙ Neuron Output

output = activation( Σ(xᵢ · wᵢ) + b )

x = inputs | w = weights | b = bias
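The formula translates almost word for word into NumPy. The sketch below assumes ReLU as the activation and uses made-up input, weight, and bias values.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def neuron(x, w, b, activation=relu):
    # Weighted sum of inputs plus bias, passed through the activation
    return activation(np.dot(x, w) + b)

x = np.array([1.0, 2.0, 3.0])    # inputs (illustrative values)
w = np.array([0.5, -1.0, 2.0])   # weights (illustrative values)
b = 0.5                          # bias

print(neuron(x, w, b))  # → 5.0
```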

Neural network structure visualization — Layered artificial neural network — input layer, hidden layers, output layer

Activation Functions — Introducing Non-Linearity

Function	Formula	When to Use	Limitation
ReLU	`max(0, x)`	Hidden layers of CNNs (default choice)	Neurons can "die": the gradient is zero for x < 0, so a unit stuck there stops learning
Sigmoid	`1 / (1 + e⁻ˣ)`	Binary classification output	Vanishing gradient for deep networks
Tanh	`(eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)`	Recurrent networks, some hidden layers	Still suffers vanishing gradient
Softmax	`eˣⁱ / Σeˣʲ`	Multi-class classification output	Computationally expensive for many classes
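The four functions in the table can be sketched in a few lines of NumPy. The softmax here subtracts the maximum before exponentiating, a common numerical-stability step that does not change the result.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def softmax(x):
    shifted = x - np.max(x)   # avoids overflow in exp; the output is unchanged
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs.sum())  # the probabilities add up to 1 (up to rounding)
```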

💡

Why Activation Functions Exist

Without an activation function, stacking 100 layers is mathematically identical to one layer — just a big matrix multiply. Non-linear activations give networks the power to learn complex, curved decision boundaries.
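A quick check of this claim, using small hand-picked matrices (the values are illustrative): two stacked linear layers give the same output as one merged layer, while inserting a ReLU between them changes the result.

```python
import numpy as np

x = np.array([[1.0, 2.0]])        # one input row
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])      # first layer weights
W2 = np.array([[1.0],
               [2.0]])            # second layer weights

two_linear_layers = (x @ W1) @ W2
one_merged_layer = x @ (W1 @ W2)            # same result: the layers collapse
with_relu = np.maximum(0.0, x @ W1) @ W2    # the non-linearity breaks the collapse

print(two_linear_layers, one_merged_layer, with_relu)  # → [[1.]] [[1.]] [[2.]]
```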

🔨 Project | Build a Simple Neural Network from Scratch using NumPy

👁️

Section 04

Introduction to Computer Vision

Computers don't see images the way humans do. They see grids of numbers. Mastering this perspective is the single most important conceptual shift in this entire course.

🖼️

Images as Matrices of Numbers

6 Lectures

Data grid matrix visualization — Every image is a grid (matrix) of pixel values — each cell holds a number 0–255

Grayscale Images

A grayscale image is a 2D matrix where each value (pixel) is between 0 (black) and 255 (white). A 28×28 grayscale image has 784 numbers — that's all a computer ever sees.

RGB Colour Images

A colour image has three channels — Red, Green, Blue — stacked on top of each other. A 224×224 colour image is actually a 3D tensor of shape (224, 224, 3) containing 150,528 values.

📝

Colour Mixing

Red (255, 0, 0) + Green (0, 255, 0) = Yellow (255, 255, 0). Every colour on your screen is a combination of RGB intensity values.
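In NumPy terms, the shapes described above look like this. The arrays are blank placeholders rather than real photographs.

```python
import numpy as np

gray = np.zeros((28, 28), dtype=np.uint8)        # grayscale: one number per pixel
colour = np.zeros((224, 224, 3), dtype=np.uint8) # RGB: three numbers per pixel

colour[0, 0] = [255, 255, 0]   # top-left pixel: full red + full green = yellow

print(gray.size)     # → 784
print(colour.shape)  # → (224, 224, 3)
print(colour.size)   # → 150528
```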

Image Preprocessing — Why It Matters

📏

Resizing

CNNs need fixed-size inputs. All images are resized to the same dimensions (e.g. 224×224).

📊

Normalisation

Pixel values (0–255) are scaled to (0–1) or (−1 to 1). This speeds up training dramatically.

🔄

Augmentation

Flipping, rotating, and cropping images creates artificial training variety to improve generalisation.

🎯

Mean Subtraction

Subtracting the dataset mean centres the data around zero, improving gradient flow.
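Three of the preprocessing steps above can be sketched directly in NumPy. The batch below is random noise standing in for real images, and resizing is left out because it normally relies on an image library such as OpenCV or Pillow.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in batch of 8 RGB images, 32×32 pixels, values 0–255 (random placeholder data)
batch = rng.integers(0, 256, size=(8, 32, 32, 3)).astype(np.uint8)

# Normalisation: scale pixel values from 0–255 to 0–1
scaled = batch / 255.0

# Mean subtraction: centre each colour channel around zero
channel_mean = scaled.mean(axis=(0, 1, 2))
centred = scaled - channel_mean

# Augmentation: horizontal flip by reversing the width axis
flipped = scaled[:, :, ::-1, :]

print(scaled.min() >= 0.0, scaled.max() <= 1.0)  # → True True
```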

🔨 Project | Load & Visualise Images Using Python (OpenCV + Matplotlib)

⚙️

Section 05 — Core

Convolution Operation Deep Dive

The convolution operation is the heart of every CNN. Understand this completely, and the rest of the course follows naturally.

🔍

What Is a Convolution? — Kernels & Feature Maps

7 Lectures

A kernel (also called a filter) is a tiny matrix — typically 3×3 or 5×5 — of learnable numbers. During convolution, this kernel slides across the input image, and at each position, we compute an element-wise product and sum the result.

The result of sliding a kernel across the entire image is called a feature map. It shows where in the image the kernel's pattern was found.

💡

Real Analogy

Imagine holding a magnifying glass (kernel) over a document and sliding it across. At each spot, you check if a particular pattern (e.g. the letter "e") matches. The feature map records the match strength at every location.

Circuit board pattern resembling convolution filter — A 3×3 kernel slides across the image, computing dot products to produce a feature map

🔬 Convolution Formula

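(I ★ K)(x, y) = Σₘ Σₙ I(x+m, y+n) · K(m, n)

I = input image | K = kernel (filter) | (x,y) = output position | Strictly speaking this is cross-correlation, which is what CNN layers compute; a textbook convolution flips the kernel first, giving I(x−m, y−n)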

Stride and Padding Explained

Parameter	What It Does	Effect on Output Size	Typical Values
Stride = 1	Kernel moves 1 pixel at a time (standard)	Output ≈ input size (with padding)	Default for most layers
Stride = 2	Kernel jumps 2 pixels — skips positions	Output ≈ half input size	Used to downsample instead of pooling
Padding = 'valid'	No padding — kernel stays inside image	Output shrinks by (kernel_size − 1)	When you want smaller feature maps
Padding = 'same'	Zero-pad edges to preserve input size	Output = input size	Most common in modern CNNs
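The size effects in the table follow from one standard formula: output = floor((input + 2·padding − kernel) / stride) + 1. A minimal sketch, with `conv_output_size` as a hypothetical helper name and symmetric zero padding assumed:

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    # floor((n + 2p - k) / s) + 1, applied to one spatial dimension
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(28, 3))                        # 'valid', stride 1 → 26
print(conv_output_size(28, 3, padding=1))             # 'same' for a 3×3 kernel → 28
print(conv_output_size(28, 3, stride=2, padding=1))   # stride 2 → 14
```

For a 3×3 kernel, 'same' padding corresponds to one pixel of zeros on each side, which is why `padding=1` preserves the input size at stride 1.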

Visual Labs — What Different Kernels Detect

↔️

Edge Detection (Sobel)

A 3×3 kernel like [−1,0,1 / −2,0,2 / −1,0,1] detects vertical edges. Rotating 90° finds horizontal edges.

🌫️

Gaussian Blur

A kernel filled with values that sum to 1 and peak in the centre creates a smooth blur — removing noise.

✨

Sharpen Filter

A kernel with a large positive centre and negative neighbours amplifies edges — making images crisper.

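# Applying a convolution in Python (from scratch)
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kH, kW = kernel.shape
    iH, iW = image.shape
    # Output dimensions (no padding)
    out = np.zeros((iH - kH + 1, iW - kW + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # Element-wise product of the current window and the kernel, then sum
            region = image[y:y+kH, x:x+kW]
            out[y, x] = np.sum(region * kernel)
    return out

# Vertical edge detector kernel
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]])

# Example input: dark left half, bright right half (one vertical edge)
my_image = np.zeros((5, 6))
my_image[:, 3:] = 255

result = convolve2d(my_image, kernel)
print(result.shape)  # → (3, 4)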

🔨 Project | Build the Convolution Operation from Scratch with NumPy

📉

Section 06

Pooling Layers

After convolution, feature maps are large. Pooling reduces their size while preserving the most important information — making the network faster and more robust.

📊

Max Pooling vs Average Pooling

5 Lectures

Data reduction visualization — Pooling reduces spatial dimensions while retaining dominant features

Max Pooling takes the maximum value in each pooling window. It answers: "Was this feature present anywhere in this region?" Widely used because it retains the strongest activations.

Average Pooling takes the mean of the window. Smoother output. Used in some architectures (GoogLeNet) and often in the final global pooling layer.

Global Average Pooling collapses an entire feature map to a single number — used at the end of modern architectures instead of fully-connected layers to reduce parameters.

📝

2×2 Max Pooling Example

Input: [[1, 3, 2, 4], [5, 6, 7, 8]] → After 2×2 max pool with stride 2 → [[6, 8]]. Each spatial dimension halves (2×4 becomes 1×2), keeping only the strongest signal in each window.
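The same example can be reproduced with a short NumPy sketch. `max_pool2d` is a hypothetical helper written for clarity rather than speed.

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w), dtype=feature_map.dtype)
    for y in range(out_h):
        for x in range(out_w):
            # Take the largest value inside the current window
            window = feature_map[y*stride:y*stride+size, x*stride:x*stride+size]
            out[y, x] = window.max()
    return out

feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 7, 8]])
pooled = max_pool2d(feature_map)
print(pooled)  # → [[6 8]]
```

Swapping `window.max()` for `window.mean()` would turn this into average pooling.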

🏗️

Section 07 — Milestone

Building Your First CNN

Everything comes together here. You'll build a complete end-to-end Convolutional Neural Network and achieve 98%+ accuracy on the MNIST handwritten digit dataset.

🏛️

CNN Architecture — The Full Pipeline

5 Lectures

A typical CNN processes an image through a repeating pattern of Convolution → Activation → Pooling blocks, followed by fully-connected layers for final classification.

🖼️

Input

28×28×1

→

⚙️

Conv + ReLU

32 filters

→

📉

MaxPool

2×2

→

⚙️

Conv + ReLU

64 filters

→

📉

MaxPool

2×2

→

📋

Flatten

1D vector

→

🔗

Dense

128 units

→

🎯

Softmax

10 classes

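from tensorflow import keras

# Load MNIST, scale pixels to [0, 1], and add a channel axis: (N, 28, 28) → (N, 28, 28, 1)
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0
x_test = x_test[..., None] / 255.0

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),

    # Block 1: Convolution + ReLU + Pooling
    keras.layers.Conv2D(32, (3,3), activation='relu'),
    keras.layers.MaxPooling2D((2,2)),

    # Block 2: Deeper features
    keras.layers.Conv2D(64, (3,3), activation='relu'),
    keras.layers.MaxPooling2D((2,2)),

    # Classifier head
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax')  # 10 digit classes
])

# Labels are integers 0–9, so use the sparse version of cross-entropy
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5, validation_split=0.1)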

💡

Flattening Explained

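After all convolution and pooling layers, we have a 3D tensor. For the Keras model above, which uses the default 'valid' padding, that tensor is 5×5×64, so we "flatten" it into a 1D vector of 1,600 numbers that the fully-connected layers can process (with 'same' padding it would be 7×7×64 = 3,136). Think of it as unrolling a cube into a long string.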
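How long the flattened vector is depends on the padding used by the convolution layers. The sketch below traces a 28×28 input through two Conv (3×3) + MaxPool (2×2) blocks, assuming stride 1 and 64 filters in the last convolution; the helper names are hypothetical.

```python
def conv_out(size, kernel=3, padding=0):
    # Stride-1 convolution along one spatial dimension
    return size + 2 * padding - kernel + 1

def pool_out(size, window=2):
    # Non-overlapping pooling along one spatial dimension
    return size // window

def flattened_length(input_size, padding, filters=64, blocks=2):
    size = input_size
    for _ in range(blocks):
        size = pool_out(conv_out(size, padding=padding))
    return size, size * size * filters

print(flattened_length(28, padding=0))  # 'valid' padding → (5, 1600)
print(flattened_length(28, padding=1))  # 'same' padding  → (7, 3136)
```

With 'valid' padding the sizes run 28 → 26 → 13 → 11 → 5; with 'same' padding they run 28 → 28 → 14 → 14 → 7.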

🔨 Project | Handwritten Digit Recognition — MNIST Dataset (Target: 99% accuracy)

📐

Section 08

CNN Mathematics & Backpropagation

This section opens the black box. You'll see exactly how a CNN calculates error, distributes blame backwards through layers, and updates its weights.

🔄

How CNNs Learn — The Learning Loop

5 Lectures

Forward Pass

Input flows through every layer from left to right. Each layer transforms the data. At the end, the network produces a prediction (e.g. "70% cat, 30% dog").

Compute Loss

Compare the prediction to the true label using a loss function. If the network predicted "30% cat" but the answer was "cat", the loss is high.

Backward Pass (Backpropagation)

Using the chain rule, calculate how much each weight contributed to the error. This gives us the gradient — the direction to adjust each weight.

Weight Update

Subtract a small fraction (learning rate η) of the gradient from each weight. Repeat millions of times until the network is accurate.

⚙ Weight Update Formula

w_new = w_old − η · (∂L / ∂w)

Repeated for every weight in the network — potentially billions of parameters in modern models.
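The four steps of the loop can be sketched on the smallest possible "network": a single sigmoid neuron trained on four made-up data points. This is an illustrative toy, not a CNN, and the learning rate and step count are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: the label is 1 when the single input feature is positive
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w = np.zeros(1)   # weight
b = 0.0           # bias
eta = 0.5         # learning rate
losses = []

for step in range(200):
    # 1. Forward pass: compute predictions
    y_hat = sigmoid(X @ w + b)
    # 2. Compute loss (binary cross-entropy)
    loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    losses.append(loss)
    # 3. Backward pass: gradients of the loss w.r.t. w and b
    error = y_hat - y
    grad_w = X.T @ error / len(y)
    grad_b = error.mean()
    # 4. Weight update
    w = w - eta * grad_w
    b = b - eta * grad_b

print(losses[0] > losses[-1])  # → True
```

In a real CNN, step 3 is handled by the framework's automatic differentiation, but the loop has the same shape.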

🎯

Section 09

Loss Functions & Optimisation

Choosing the right loss function and optimiser is as important as the architecture itself. This section covers every major option and when to use each.

📉

Loss Functions — Measuring Error Precisely

7 Lectures

📊 Cross-Entropy Loss (Classification)

L = −Σᵢ yᵢ · log(ŷᵢ)

y = true label (one-hot) | ŷ = predicted probability | Penalises overconfident wrong predictions heavily
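A short NumPy sketch shows how sharply the loss punishes a confident wrong answer. The probabilities are made-up examples, and the small `eps` clip is an assumption added to avoid taking log(0).

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1.0)   # guard against log(0)
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([0.0, 1.0, 0.0])             # one-hot: the true class is index 1
confident_right = np.array([0.05, 0.90, 0.05])
confident_wrong = np.array([0.90, 0.05, 0.05])

print(round(cross_entropy(y_true, confident_right), 3))  # → 0.105
print(round(cross_entropy(y_true, confident_wrong), 3))  # → 2.996
```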

Optimisers — Smarter Ways to Descend

Optimiser	Key Idea	Best For
SGD	Gradient descent on random mini-batches, simple and reliable	When you want full control and tuning
Momentum	Accumulates velocity in gradient direction — like a ball rolling downhill	Training on noisy gradients
RMSProp	Adapts learning rate per-parameter based on recent gradient magnitude	Recurrent networks
Adam	Combines Momentum + RMSProp — adaptive and fast	Default choice for CNNs
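To show how Adam combines the two ideas, here is a minimal single-parameter sketch of its update rule, applied to the toy loss (w − 3)². The learning rate of 0.1 is deliberately larger than the usual default so the toy problem converges quickly; the function name is hypothetical.

```python
import math

def adam_minimise(grad_fn, w, steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = 0.0   # running mean of gradients (the Momentum part)
    v = 0.0   # running mean of squared gradients (the RMSProp part)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)   # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

def grad(w):
    # Gradient of the toy loss (w - 3)^2
    return 2.0 * (w - 3.0)

first_step = adam_minimise(grad, w=0.0, steps=1)
w_final = adam_minimise(grad, w=0.0, steps=500)
print(round(first_step, 3))  # → 0.1
```

The first step has size close to `lr` regardless of how large the gradient is, which is the per-parameter scaling inherited from RMSProp.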

💡

Learning Rate — The Most Critical Hyperparameter

Too high → the network overshoots and never converges. Too low → training takes forever. Start with lr=0.001 with Adam. Use learning rate scheduling (e.g. cosine annealing) to reduce it as training progresses.
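Cosine annealing can be written as a one-line formula. The sketch below assumes the schedule runs once from `lr_max` down to `lr_min` over `total_steps`, with no warm-up or restarts.

```python
import math

def cosine_annealing(step, total_steps, lr_max=0.001, lr_min=0.0):
    progress = step / total_steps   # goes from 0.0 to 1.0 over training
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

for step in (0, 50, 100):
    print(step, round(cosine_annealing(step, total_steps=100), 6))
# 0 → 0.001, 50 → 0.0005, 100 → 0.0
```

The learning rate falls slowly at first, fastest in the middle of training, and slowly again near the end.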

🛡️

Section 10

Preventing Overfitting

A model that memorises training data but fails on new data is useless. These techniques force your network to generalise rather than memorise.

⚖️

Regularisation Techniques

5 Lectures

🎲

Dropout

During training, randomly "drop" (zero out) neurons with probability p (typically 0.5). Forces the network to learn redundant representations. Disable during inference.

📊

Batch Normalisation

Normalises activations within each mini-batch. Dramatically speeds up training, allows higher learning rates, and has a mild regularising effect.

🔄

Data Augmentation

Horizontally flip, rotate, crop, and colour-jitter training images. The network sees more variety and can't memorise exact examples.

⏱️

Early Stopping

Monitor validation loss. Stop training when it stops improving — before the model starts overfitting the training set.
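Dropout, from the list above, is simple enough to sketch by hand. This version uses the common "inverted dropout" variant, an assumption on our part: surviving activations are scaled up by 1 / (1 − p) during training so that nothing needs to change at inference.

```python
import numpy as np

def dropout(activations, p, training, rng):
    # At inference time dropout is disabled and activations pass through unchanged
    if not training or p == 0.0:
        return activations
    keep_mask = rng.random(activations.shape) >= p   # keep each unit with probability 1 - p
    return activations * keep_mask / (1.0 - p)       # rescale so the expected value is preserved

rng = np.random.default_rng(0)
x = np.ones((100, 100))

train_out = dropout(x, p=0.5, training=True, rng=rng)
infer_out = dropout(x, p=0.5, training=False, rng=rng)

print(np.unique(train_out))  # → [0. 2.]
```

Frameworks handle this switch for you: Keras uses the `training` flag and PyTorch uses `model.train()` / `model.eval()`.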

Training validation curve — Overfitting: training loss drops while validation loss rises — the model is memorising, not learning

📝

Underfitting vs Overfitting

Underfitting: Both training and validation loss are high. Model is too simple or not trained enough. Fix: more capacity, longer training.

Overfitting: Training loss is low but validation loss is high. Model memorised training data. Fix: regularisation, more data, simpler model.

🔨 Project | Improve CNN Accuracy on CIFAR-10 from 70% → 90%+

🔧

Section 11

Deep Learning Frameworks

TensorFlow and PyTorch are the two frameworks every professional uses. This section makes you proficient in both.

🔷

TensorFlow / Keras vs PyTorch

6 Lectures

Feature	TensorFlow / Keras	PyTorch
Design Philosophy	Production-first; static/dynamic graphs via tf.function	Research-first; dynamic computational graph (eager by default)
Ease of Use	Keras API is very beginner-friendly	More Pythonic; feels like NumPy with gradients
Industry Use	Dominant in production / mobile (TF Lite)	Dominant in research papers (60%+ of papers)
Debugging	Harder to debug graph execution	Easy — standard Python debugger works
Our Recommendation	Start here for fast prototyping	Move here for custom research

# PyTorch equivalent of our MNIST CNN
import torch
import torch.nn as nn

class MyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3),   # 1 channel in, 32 filters, 3×3 kernel: 28×28 → 26×26
            nn.ReLU(),
            nn.MaxPool2d(2),       # 26×26 → 13×13
            nn.Conv2d(32, 64, 3),  # 13×13 → 11×11
            nn.ReLU(),
            nn.MaxPool2d(2),       # 11×11 → 5×5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64*5*5, 128),  # 64 channels × 5 × 5 = 1,600 inputs
            nn.ReLU(),
            nn.Linear(128, 10)       # raw logits; nn.CrossEntropyLoss applies softmax internally
        )

    def forward(self, x):
        return self.classifier(self.features(x))

🏛️

Section 12

Famous CNN Architectures

Every milestone architecture solved a specific problem. Understanding these designs gives you an intuition for architectural choices that transfer to your own projects.

📜

The Evolution of CNN Architectures

6 Lectures + Analysis

1989 / 1998

LeNet

The original CNN by Yann LeCun. Proved CNNs work for handwritten digit recognition.

2012

AlexNet

The watershed moment. Won ImageNet by a massive margin. Popularised ReLU, Dropout, and GPU training.

2014

VGGNet

16–19 layers of uniform 3×3 convolutions. Proved depth matters. Still widely used as a feature extractor.

2014

GoogLeNet

Introduced the Inception module — multiple kernel sizes in parallel. 22 layers but fewer params than AlexNet.

2015

ResNet

Skip connections made very deep networks trainable by easing the vanishing gradient problem. Up to 152 layers. Changed deep learning forever.

2019

EfficientNet

Compound scaling of depth, width, and resolution. State-of-the-art accuracy with fewer parameters.

Why ResNet Solved Vanishing Gradients

In very deep networks, gradients shrink exponentially as they travel backwards through layers — they "vanish" before reaching early layers, which stop learning.

ResNet's solution: skip connections (residual connections). Instead of learning a direct mapping, each block learns a residual (the difference from its input). The gradient can now flow back through the shortcut path unchanged.

🔗 Residual Block

output = F(x) + x

F(x) = transformation through conv layers | x = input (identity shortcut)
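A minimal sketch of the residual formula, with two small dense layers standing in for the convolutional layers of a real ResNet block (a simplification made here to keep the example short):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    fx = relu(x @ W1) @ W2   # F(x): the learned transformation
    return fx + x            # add the identity shortcut

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))    # batch of 4 feature vectors
W1 = rng.normal(size=(8, 8))

# If the block has learned nothing (F(x) = 0), the input passes through untouched
out = residual_block(x, W1, np.zeros((8, 8)))
print(np.array_equal(out, x))  # → True
```

Because the output is F(x) + x, its derivative with respect to x is the derivative of F plus the identity, so some gradient always reaches earlier layers through the shortcut.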

Deep layered structure neural network — ResNet's skip connections allow gradients to flow directly through deep networks

♻️

Section 13 — Power User

Transfer Learning

Why train from scratch when ResNet has already learned from more than a million ImageNet images? Transfer learning lets you build state-of-the-art models in hours, not weeks.

🔁

Feature Extraction & Fine-Tuning

4 Lectures

Cat and dog classification — Transfer learning lets ResNet's ImageNet knowledge be repurposed for Cat vs Dog classification

Load a Pretrained Model

Download ResNet50 with weights trained on ImageNet (about 1.28M training images, 1,000 classes). The model already knows edges, textures, objects.

Freeze the Feature Extractor

Lock the convolutional layers so their weights don't change. We only train the final classification head on our specific task.

Fine-Tune (Optional)

After initial training, unfreeze the last few convolutional layers and train with a very low learning rate to adapt features to your domain.

from tensorflow.keras.applications import ResNet50
from tensorflow.keras import layers, Model

# Load ResNet with ImageNet weights, remove top classifier
base = ResNet50(weights='imagenet', include_top=False, input_shape=(224,224,3))
base.trainable = False  # Freeze all layers

# Add our custom head for binary classification
x = layers.GlobalAveragePooling2D()(base.output)
x = layers.Dense(256, activation='relu')(x)
output = layers.Dense(1, activation='sigmoid')(x)  # Cat vs Dog

model = Model(inputs=base.input, outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

🔨 Project | Cat vs Dog Classifier with ResNet50 — Target: 98%+ accuracy

🎯

Section 14

Object Detection

Classification says "there's a cat." Detection says "there's a cat — right there, at those pixel coordinates." This is what powers self-driving cars and security cameras.

📦

From Classification to Detection

5 Lectures

A bounding box is a rectangle defined by four values: (x_min, y_min, x_max, y_max). Object detection models must simultaneously predict: the class of the object AND the bounding box coordinates.

The Major Approaches

Method	Approach	Speed	Accuracy
R-CNN	Region proposals → classify each	Slow (47s/image)	High
Fast R-CNN	Shared CNN features across proposals	2s/image	High
Faster R-CNN	Region Proposal Network (RPN)	0.2s/image	Very High
YOLO	Single pass — predict all boxes at once	Real-time (30fps+)	Good
SSD	Multi-scale feature maps, single pass	Real-time	Good

Street scene object detection cars pedestrians — YOLO detects cars, pedestrians, and traffic signs simultaneously in real time

💡

YOLO — "You Only Look Once"

YOLO divides the image into an S×S grid. Each cell predicts B bounding boxes and their confidence scores, plus C class probabilities. Everything happens in a single forward pass — that's why it's fast enough for real-time video.
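The size of the prediction tensor follows directly from this description. The sketch below assumes the original YOLO (v1) layout, where each box carries five numbers (x, y, width, height, confidence); later versions arrange their outputs differently.

```python
def yolo_output_shape(S, B, C):
    values_per_box = 5   # x, y, width, height, confidence (YOLO v1 layout)
    return (S, S, B * values_per_box + C)

# Settings used in the original YOLO paper: 7×7 grid, 2 boxes per cell, 20 classes
shape = yolo_output_shape(S=7, B=2, C=20)
print(shape)                           # → (7, 7, 30)
print(shape[0] * shape[1] * shape[2])  # → 1470
```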

🔨 Project | Real-Time Object Detection with YOLOv8 (webcam stream)

🗺️

Section 15

Image Segmentation

Where detection draws boxes, segmentation colours every single pixel. Essential for medical imaging, autonomous driving, and satellite analysis.

🎨

Semantic vs Instance Segmentation & U-Net

4 Lectures

🟩

Semantic Segmentation

Assign a class label to every pixel. All cars are green, all people are red — but individual instances aren't distinguished.

🟦

Instance Segmentation

Like semantic, but distinguishes individual instances. Car #1 is dark blue, Car #2 is light blue. Mask R-CNN is a well-known model for this task.

🏥

U-Net Architecture

Encoder-decoder with skip connections. Designed for biomedical images where training data is scarce. Produces precise boundaries, for example around tumours.

🛣️

Road Segmentation

Driveable surface detection for autonomous vehicles. Every pixel classified as road / non-road in real time.

Medical imaging scan MRI: U-Net was originally designed for biomedical image segmentation, where it was trained on as few as 30 annotated images

📝

U-Net Architecture

U-Net has an encoder (contracting path) that captures context and a decoder (expanding path) that enables precise localisation. Skip connections between matching encoder/decoder layers preserve fine-grained spatial information that would otherwise be lost during downsampling.
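The skip connection itself is just a channel-wise concatenation. The sketch below uses placeholder arrays and nearest-neighbour upsampling as a stand-in for the learned up-convolution of the actual architecture; the shapes are illustrative assumptions.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour upsampling: repeat every row and column twice
    return x.repeat(2, axis=0).repeat(2, axis=1)

encoder_features = np.ones((64, 64, 32))   # saved from the contracting path
bottleneck = np.ones((32, 32, 64))         # deeper, lower-resolution features

decoder_features = upsample2x(bottleneck)  # back to 64×64 resolution
merged = np.concatenate([encoder_features, decoder_features], axis=-1)

print(decoder_features.shape)  # → (64, 64, 64)
print(merged.shape)            # → (64, 64, 96)
```

The merged tensor gives the next decoder layer both the coarse context from deep in the network and the fine spatial detail saved before downsampling.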

🔨 Project | Road Segmentation or Tumour Detection with U-Net

🚀

Section 16

CNN Deployment

A model nobody can use is just a research experiment. This section takes your trained model from a Python notebook to a live web API or mobile app.

🌐

From Model File to Production API

5 Lectures

Save the Model

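model.save('my_cnn.h5') (Keras) stores the architecture and the weights together. torch.save(model.state_dict(), 'model.pth') (PyTorch) stores the weights only, so keep the model class definition alongside the file to rebuild the network before loading.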

Export to ONNX

ONNX (Open Neural Network Exchange) is a universal format. Convert once, run on any hardware — CPU, GPU, mobile chip.

Build a Flask API

Wrap your model in a REST endpoint. Clients send an image via POST request, receive a JSON prediction response in milliseconds.

Mobile Deployment

TensorFlow Lite or Core ML compresses models for on-device inference — no internet required, full privacy.

# Simple Flask API for your CNN
from flask import Flask, request, jsonify
import numpy as np
from tensorflow import keras
from PIL import Image
import io

app = Flask(__name__)
# Assumes the saved model takes 224×224 RGB input and has one output per entry in class_names
model = keras.models.load_model('my_cnn.h5')
class_names = ['airplane', 'cat', 'dog', 'car', 'bird']

@app.route('/predict', methods=['POST'])
def predict():
    # The raw request body is the image file; force 3 channels so grayscale/RGBA uploads also work
    img = Image.open(io.BytesIO(request.data)).convert('RGB').resize((224, 224))
    arr = np.array(img) / 255.0                  # scale pixels to [0, 1]
    pred = model.predict(arr[np.newaxis, ...])   # add a batch axis: (1, 224, 224, 3)
    return jsonify({'class': class_names[int(pred.argmax())],
                    'confidence': float(pred.max())})

if __name__ == '__main__':
    app.run(port=5000)

🔬

Section 17 — Frontier

Advanced Topics

Where CNNs meet the cutting edge of modern AI — Vision Transformers, Generative Models, and beyond.

🌟

Modern Vision Systems

5 Lectures

🔭

Attention Mechanisms

Attention lets networks focus on the most relevant parts of an image for a given task — like how humans look at different areas when answering different questions about a scene.

🤖

Vision Transformers (ViT)

Divide an image into 16×16 patches and treat them like words in a sentence. Transformers (from NLP) then model relationships between patches — matching or beating CNNs on benchmarks.

🎨

GANs Overview

Two networks (Generator + Discriminator) compete. The generator creates fake images; the discriminator tries to catch them. The result: photorealistic synthetic images.

🔗

Self-Supervised Learning

Learn from unlabelled data by creating proxy tasks (predict the rotation of an image, fill in masked patches). DINO and MAE achieve remarkable results without any labels.

💬

Multimodal AI

Models like CLIP and GPT-4V understand both images and text. Ask "What's in this photo?" and get an intelligent answer — the frontier of modern AI.

💡

CNNs vs Vision Transformers

CNNs have a built-in inductive bias: they assume local patterns matter (locality) and that a pattern is detected the same way wherever it appears (translation equivariance). ViTs build in far weaker assumptions and learn most of this structure from data. ViTs win at very large scale; CNNs are often better with limited data.
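The patch step that Vision Transformers start from can be sketched with NumPy reshapes. `image_to_patches` is a hypothetical helper; a real ViT would follow it with a learned linear projection and position embeddings.

```python
import numpy as np

def image_to_patches(image, patch_size=16):
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    rows, cols = h // patch_size, w // patch_size
    # Split height and width into (block index, offset inside block)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    # Bring the two block indices together, then flatten each patch
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * c)

image = np.arange(224 * 224 * 3).reshape(224, 224, 3)   # placeholder image
patches = image_to_patches(image)
print(patches.shape)  # → (196, 768)
```

A 224×224 RGB image becomes a "sentence" of 196 tokens, each holding the 16 × 16 × 3 = 768 raw values of one patch.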

Generative AI futuristic visualization — Modern AI systems fuse vision and language understanding — the frontier of multimodal AI

⚖️

Section 18 — Critical Thinking

Ethics & Responsible AI

With great power comes great responsibility. Every CNN practitioner must understand the societal consequences of the systems they build.

🧭

Societal Impact of Computer Vision

4 Lectures

⚠️

Bias in AI

A face recognition system trained on mostly light-skinned faces will perform significantly worse on darker-skinned faces. Biased training data produces biased models — with real-world consequences in hiring, policing, and lending.

🔒

Privacy Concerns

Mass facial recognition enables surveillance at population scale. Who owns that data? Who has access? The technology outpaces the legal frameworks meant to govern it.

🙈

Adversarial Attacks

Adding small, carefully crafted perturbations to a stop sign, often barely noticeable to a person, can cause a CNN to misclassify it (for example as a yield sign) with high confidence. Self-driving cars, medical diagnostics, and security systems are all vulnerable.

🌐

Responsible Development

Principles: Diverse training data, regular bias audits, explainability tools (Grad-CAM), clear consent for facial data, and red-teaming models before deployment.

📝

Grad-CAM — Making CNNs Explainable

Gradient-weighted Class Activation Mapping (Grad-CAM) produces a heatmap showing which pixels the CNN focused on when making a decision. If a model classifies a dog image correctly because it looked at the dog — great. If it looked at the leash — suspicious. Explainability tools are becoming standard in regulated industries.
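The core Grad-CAM computation is short once you have two arrays from the last convolutional layer: its activations and the gradients of the class score with respect to them. In practice both come from a framework's automatic differentiation; the tiny arrays below are made-up stand-ins.

```python
import numpy as np

def grad_cam(activations, gradients):
    # activations, gradients: arrays of shape (H, W, K) for K feature maps
    weights = gradients.mean(axis=(0, 1))               # one importance weight per feature map
    cam = (activations * weights).sum(axis=-1)          # weighted sum of the feature maps
    cam = np.maximum(0.0, cam)                          # keep only positive evidence (ReLU)
    if cam.max() > 0:
        cam = cam / cam.max()                           # scale to 0–1 for display
    return cam

activations = np.zeros((2, 2, 2))
activations[0, 0, 0] = 1.0    # feature map 0 fires in the top-left corner
activations[1, 1, 1] = 1.0    # feature map 1 fires in the bottom-right corner

gradients = np.ones((2, 2, 2))
gradients[:, :, 1] = -1.0     # feature map 1 counts against the class

print(grad_cam(activations, gradients))
# → [[1. 0.]
#    [0. 0.]]
```

The resulting low-resolution map is normally upsampled to the input size and overlaid on the image as a heatmap.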
