CNNs & Computer Vision
From pixels to perception
Teaching Computers to See
When you look at a photo of a cat, you instantly recognize it. But to a computer, an image is just a grid of numbers—pixel values. Convolutional Neural Networks (CNNs) bridge this gap, learning to extract meaningful patterns from raw pixels.
Why Not Just Use Regular Neural Networks?
Consider a small 256×256 color image. That's 256 × 256 × 3 = 196,608 input values. In a traditional fully-connected network, a first hidden layer of just 1,000 neurons would already need nearly 200 million weights. CNNs avoid this blowup with three key ideas:
- Local connectivity: Each neuron only looks at a small patch, not the whole image
- Weight sharing: The same pattern detector is used across the entire image
- Hierarchy: Simple features combine into complex ones
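To make the parameter savings concrete, here is a back-of-the-envelope sketch comparing a fully-connected first layer with a convolutional one. The layer sizes (1,000 hidden neurons, 64 filters) are illustrative assumptions, not figures from the text:

```python
# Rough parameter counts for a 256x256x3 input (illustrative numbers).

inputs = 256 * 256 * 3        # 196,608 pixel values

# Fully connected: every input connects to every hidden neuron.
hidden = 1000                 # an assumed modest hidden layer
fc_weights = inputs * hidden  # 196,608,000 weights

# Convolutional: 64 filters of size 3x3 over 3 channels,
# shared across every position in the image (weight sharing).
conv_weights = 64 * 3 * 3 * 3  # 1,728 weights

print(f"fully connected: {fc_weights:,}")
print(f"convolutional:   {conv_weights:,}")
```

The same 64 pattern detectors are reused at every location, which is exactly the local-connectivity and weight-sharing idea from the list above.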
Convolutions: Pattern Detectors
A convolution slides a small filter (like a magnifying glass) across an image, checking for a specific pattern at each position:
- Edge detector: Highlights boundaries between light and dark
- Corner detector: Finds intersection points
- Texture detector: Recognizes repeating patterns
The filter is just a small grid of numbers (e.g., 3×3). The network learns what numbers to put in each filter during training.
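The sliding-filter idea can be sketched in a few lines of numpy. This is a minimal stride-1, no-padding convolution with a hand-written vertical-edge kernel (a classic edge detector; in a real CNN the kernel values would be learned):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image (valid padding, stride 1)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Response = elementwise product of patch and kernel, summed.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Vertical-edge filter: fires where the left side is dark, right side bright.
edge_kernel = np.array([[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]])

# Toy image: dark left half (0.0), bright right half (1.0).
image = np.zeros((5, 5))
image[:, 3:] = 1.0

response = convolve2d(image, edge_kernel)
# Strong responses appear only where the filter straddles the dark/bright edge.
```

Away from the boundary the response is zero; right at the dark-to-bright transition it peaks, which is what "checking for a specific pattern at each position" means in practice.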
Feature Hierarchies
CNNs build up understanding in layers:
- Layer 1: Edges, simple textures
- Layer 2: Corners, curves, basic shapes
- Layer 3: Parts of objects (eyes, wheels, windows)
- Layer 4: Whole objects (faces, cars, buildings)
- Final layers: Categories and decisions
This mirrors how our visual cortex processes information!
Pooling: Simplifying and Generalizing
Pooling shrinks the representation by summarizing regions:
- Max pooling: Keep the strongest signal in each region
- Average pooling: Average the values
This makes the network robust to small shifts and reduces computation.
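A minimal max-pooling sketch in numpy, assuming the common 2×2 window with stride 2:

```python
import numpy as np

def max_pool2d(x, size=2):
    """Max pooling with a square window and stride equal to the window size."""
    h, w = x.shape
    x = x[:h - h % size, :w - w % size]  # drop any ragged edge rows/cols
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 1, 5, 6],
              [2, 2, 7, 8]])

pooled = max_pool2d(x)  # each 2x2 block is reduced to its maximum
```

A 4×4 map shrinks to 2×2, quartering the downstream computation; and because only the strongest signal in each block survives, shifting a feature by a pixel within its block leaves the output unchanged.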
Receptive Fields
A neuron's receptive field is the region of the original image it can "see." As you go deeper:
- Early neurons see tiny patches (3×3 pixels)
- Middle neurons see larger areas (dozens of pixels)
- Deep neurons see most or all of the image
This is why deep networks understand context better.
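The growth of the receptive field can be computed directly. For a stack of stride-1 convolutions, each layer extends the field by (kernel − 1) pixels; the sketch below assumes 3×3 kernels throughout, the common case:

```python
def receptive_field(num_layers, kernel=3):
    """Receptive field of a stack of stride-1 conv layers (simplified:
    each layer adds kernel - 1 pixels to the field)."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel - 1
    return rf

for n in (1, 2, 5, 10):
    print(f"{n:2d} layers -> sees a {receptive_field(n)}x{receptive_field(n)} patch")
```

So one layer sees 3×3, two layers see 5×5, and ten layers already see 21×21; add strides or pooling and the field grows much faster, which is how deep neurons come to see most of the image.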
What CNNs Can Do
Image Classification: "This is a cat" vs "This is a dog"
Object Detection: Find and label multiple objects with bounding boxes
Semantic Segmentation: Label every pixel (road, car, pedestrian, sky)
Face Recognition: Match faces across photos
Medical Imaging: Detect tumors, analyze X-rays
Landmark Architectures
LeNet (1998): The original CNN for handwritten digits
AlexNet (2012): Sparked the deep learning revolution by winning ImageNet
VGGNet (2014): Showed that deeper is better (16-19 layers)
ResNet (2015): Enabled 100+ layer networks with skip connections
Vision Transformer (2020): Applied transformer architecture to images