Training: Backprop & Optimization
The engine that powers learning
How Neural Networks Learn
Training a neural network is like teaching someone to throw darts. At first, throws land randomly, but with feedback about where each dart landed and how far off it was, accuracy improves. Neural networks follow the same principle: they make predictions, measure errors, and adjust to do better.
The Learning Loop
Every training iteration follows four steps:
- Forward Pass: Input flows through the network, producing a prediction
- Loss Calculation: Compare prediction to the correct answer
- Backward Pass: Figure out which weights caused the error
- Weight Update: Adjust weights to reduce future errors
This loop repeats millions of times until the network performs well.
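The four steps above can be sketched for a one-parameter model. Everything here (the dataset, the weight `w`, the learning rate) is illustrative, not from any particular library:

```python
# Minimal sketch of the four-step loop for a one-parameter model
# y_pred = w * x, trained on data generated by y = 3 * x.

data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # (input, target) pairs
w = 0.0        # initial weight, deliberately wrong
lr = 0.05      # learning rate

for step in range(200):
    for x, y in data:
        y_pred = w * x                 # 1. forward pass
        loss = (y_pred - y) ** 2       # 2. loss (squared error)
        grad = 2 * (y_pred - y) * x    # 3. backward pass: dloss/dw
        w -= lr * grad                 # 4. weight update

print(round(w, 3))  # w approaches 3.0, the true slope
```

With real networks the only change is scale: the forward pass runs through many layers, and the backward pass computes one gradient per weight.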
Loss Functions: Measuring Mistakes
A loss function quantifies how wrong a prediction is:
- Mean Squared Error: For predicting numbers (regression). Large errors are penalized heavily.
- Cross-Entropy Loss: For classification. Measures how surprised the model is by the correct answer.
- Binary Cross-Entropy: For yes/no decisions.
Lower loss = better predictions. Training aims to minimize loss.
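The first two losses are simple to write down. A minimal sketch (function names are our own, not a library's):

```python
import math

def mse(preds, targets):
    # Mean squared error: average of squared differences.
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def cross_entropy(probs, true_index):
    # Negative log of the probability assigned to the correct class.
    # probs: predicted class probabilities summing to 1.
    return -math.log(probs[true_index])

print(mse([2.0, 4.0], [1.0, 4.0]))        # (1 + 0) / 2 = 0.5
print(cross_entropy([0.1, 0.7, 0.2], 1))  # -log(0.7) ≈ 0.357
```

Note how cross-entropy captures "surprise": a confident correct prediction (probability near 1) gives loss near 0, while a confident wrong one gives a very large loss.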
Gradients: The Direction of Improvement
A gradient tells us how to change each weight to reduce loss:
- If increasing a weight would increase loss → decrease that weight
- If increasing a weight would decrease loss → increase that weight
- If changing a weight doesn't affect loss much → leave it alone
The gradient points toward the steepest increase in loss. We move in the opposite direction to improve.
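The sign logic can be seen by estimating a gradient numerically. The toy loss below is minimized at w = 5; all names are illustrative:

```python
# Nudge w both ways and measure how the loss responds.
def numerical_gradient(loss_fn, w, eps=1e-6):
    return (loss_fn(w + eps) - loss_fn(w - eps)) / (2 * eps)

loss = lambda w: (w - 5.0) ** 2   # toy loss, lowest at w = 5

g = numerical_gradient(loss, 7.0)  # positive: increasing w would increase loss
w = 7.0 - 0.1 * g                  # so we step the other way, toward 5
```

Finite differences like this are far too slow for millions of weights, which is exactly why backpropagation (next section) exists.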
Backpropagation: Tracing Blame
Backpropagation is the algorithm that calculates gradients efficiently. It works backward from the output:
- Start with the final error
- Ask: "Which neurons contributed to this error?"
- For each contributing neuron, ask the same question
- Continue until reaching the input
This "chain of blame" lets us compute gradients for millions of weights in seconds.
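The chain of blame is the chain rule applied layer by layer. A hand-worked sketch for a tiny two-step network (weights and inputs chosen arbitrarily):

```python
import math

# y = sigmoid(w2 * sigmoid(w1 * x)), gradients worked out by hand.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, target = 1.0, 0.0
w1, w2 = 0.5, 0.5

# Forward pass, saving intermediates for reuse
h = sigmoid(w1 * x)
y = sigmoid(w2 * h)
loss = (y - target) ** 2

# Backward pass: each gradient reuses the one computed after it
dloss_dy = 2 * (y - target)
dy_dz2 = y * (1 - y)                 # derivative of sigmoid at its output
dloss_dw2 = dloss_dy * dy_dz2 * h    # blame assigned to w2
dloss_dh = dloss_dy * dy_dz2 * w2    # blame passed back to the hidden neuron
dh_dz1 = h * (1 - h)
dloss_dw1 = dloss_dh * dh_dz1 * x    # blame assigned to w1
```

The efficiency comes from reuse: `dloss_dh` is computed once and shared by everything earlier in the network, so the cost of the backward pass is about the same as the forward pass.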
Optimizers: How to Step
Once we know the gradient, optimizers decide how to update weights:
Stochastic Gradient Descent (SGD): Take a small step opposite to the gradient. Simple but can be slow.
Momentum: Like a ball rolling downhill—remember previous direction to avoid getting stuck in small dips.
Adam: Adapts step size for each weight individually. Often works well out of the box.
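The three update rules, sketched for a single weight. Hyperparameter defaults follow commonly cited values; the function names are our own:

```python
import math

def sgd_step(w, grad, lr=0.01):
    # Plain SGD: small step opposite the gradient.
    return w - lr * grad

def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    # Blend in the previous direction, like a rolling ball.
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad        # running mean of gradients
    v = b2 * v + (1 - b2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - b1 ** t)           # bias correction for early steps t = 1, 2, ...
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v
```

Adam's division by `sqrt(v_hat)` is what gives each weight its own effective step size: weights with consistently large gradients take smaller steps, and vice versa.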
Learning Rate: Step Size Matters
The learning rate controls how big each update step is:
- Too large → overshooting, unstable training, loss explodes
- Too small → very slow progress, stuck in poor solutions
- Just right → steady improvement toward good performance
Common strategies:
- Start with a moderate rate (e.g., 0.001)
- Reduce it as training progresses
- Use learning rate schedules or warmup
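These strategies compose naturally. One hypothetical schedule combining linear warmup with exponential decay (the shape and constants are illustrative, not a standard):

```python
def lr_schedule(step, base_lr=0.001, warmup_steps=100, decay=0.999):
    # Warmup: ramp linearly from near 0 up to base_lr.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    # Afterward: shrink the rate a little every step.
    return base_lr * decay ** (step - warmup_steps)
```

Warmup avoids large, destabilizing updates while the weights are still random; the decay phase lets the model settle into a good solution with ever-finer adjustments.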
Batching: Learn from Groups
Processing all data at once is impractical. Instead, we use mini-batches:
- Take a random subset of data (32-512 examples)
- Compute average gradient over this batch
- Update weights once per batch
Batches provide a noisy but useful estimate of the true gradient.
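One mini-batch update, sketched for the same kind of toy linear model as before (dataset, batch size, and learning rate are illustrative choices):

```python
import random

# Fit y_pred = w * x to data generated by y = 3x, one batch at a time.
random.seed(0)
data = [(x / 100, 3.0 * x / 100) for x in range(1, 101)]

w = 0.0
batch = random.sample(data, 32)   # random subset of the training data
# Average the per-example gradients of squared error over the batch
grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
w -= 0.1 * grad                    # one weight update per batch
```

Averaging over 32 examples smooths out much of the per-example noise while still being far cheaper than a pass over the full dataset.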
Epochs: Full Passes Through Data
One epoch = one complete pass through all training data. Models often train for 10-100+ epochs, seeing each example many times.
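A common way to organize epochs is to reshuffle the data each pass and walk it in batches. The data and batch size below are placeholders, and the inner update is elided:

```python
import random

data = list(range(10))            # stand-in for a 10-example training set
batch_size = 2
batches_seen = 0

for epoch in range(3):            # 3 epochs: each example is seen 3 times
    random.shuffle(data)          # reshuffle so batches differ per epoch
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # ... forward pass, loss, backward pass, weight update ...
        batches_seen += 1
```

Reshuffling each epoch means the model never sees the same batch composition twice, which helps the noisy gradient estimates stay unbiased.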
Overfitting and Regularization
Without safeguards, networks can memorize the training data instead of learning general patterns, scoring well on examples they have seen but poorly on new ones.
Regularization techniques:
- Dropout: Randomly disable neurons during training, forcing redundancy
- Weight decay: Penalize large weights to keep them small
- Early stopping: Stop training when validation performance plateaus
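The first two techniques can be sketched in a few lines. This uses "inverted" dropout, the common formulation where scaling happens at training time; function names are our own:

```python
import random

def dropout(activations, p=0.5, training=True):
    # Zero each activation with probability p during training, and
    # rescale survivors by 1/(1-p) so the expected value is unchanged.
    if not training:
        return list(activations)   # no-op at evaluation time
    return [0.0 if random.random() < p else a / (1 - p) for a in activations]

def weight_decay_update(w, grad, lr=0.01, wd=1e-4):
    # Weight decay: add a pull toward zero to every weight update.
    return w - lr * (grad + wd * w)
```

Dropout forces the network to spread information across neurons, since any one of them may vanish on a given step; weight decay keeps any single weight from dominating the prediction.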