The Vanishing Gradient Problem

by Sven Kriegel

This project uses calculus to explore why vanishing gradients occur in deep neural networks, how they impact training, and which strategies mitigate them.

Deliverables

1. Gradients Drive Learning

Explain gradient descent using calculus (derivatives, partial derivatives) and relate it to neural network weight updates.

Requirements

Define the gradient, using partial derivatives, as the direction of steepest ascent, and its negative as the direction of steepest descent.
Describe how gradients update weights to minimize loss.
Include a simple numerical example (e.g., optimizing a quadratic function).
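
One way the numerical example could look is sketched below. It is only an illustration, and the function f(w) = (w - 3)^2, the starting point 0.0, the learning rate 0.1, and all names in the code are assumptions chosen for this sketch, not part of the brief. The derivative f'(w) = 2(w - 3) plays the role of the gradient, and each step moves w against it.

```python
# Gradient descent on the quadratic f(w) = (w - 3)^2 (an assumed example function).
# Its minimum is at w = 3, where the derivative is zero.

def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    # Derivative of (w - 3)^2 with respect to w.
    return 2.0 * (w - 3.0)

def gradient_descent(w0, learning_rate=0.1, steps=50):
    """Repeatedly apply the update w <- w - learning_rate * f'(w)."""
    w = w0
    history = [w]
    for _ in range(steps):
        w = w - learning_rate * grad(w)
        history.append(w)
    return history

history = gradient_descent(w0=0.0)
for step in (0, 1, 2, 5, 10, 50):
    w = history[step]
    print(f"step {step:2d}: w = {w:.5f}, loss = {loss(w):.8f}")
```

With these settings the distance to the minimum shrinks by a factor of 0.8 at every step, which shows how the size of the gradient controls how far each weight update moves.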

2. Why Gradients Vanish in Deep Networks

Explain the vanishing gradient problem using calculus (chain rule, activation functions) and AI architecture.

Requirements

Show mathematically how repeated chain rule multiplication shrinks gradients (e.g., using sigmoid/tanh derivatives).
Visualize how activation functions (sigmoid vs. ReLU) affect gradient flow.
Link vanishing gradients to slow training/ineffective deep networks.
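
The chain rule argument above can be checked numerically with a small sketch. It assumes a toy network with one unit per layer, a weight of 1.0 in every layer, and the same pre-activation value 0.5 at each layer. These simplifications and all function names are assumptions made for illustration. The sigmoid derivative s(z)(1 - s(z)) never exceeds 0.25, so a product of one such factor per layer shrinks geometrically with depth, while the ReLU derivative is exactly 1 for positive inputs.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # Derivative of the sigmoid; its maximum value is 0.25, reached at z = 0.
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_prime(z):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise (by convention at 0).
    return 1.0 if z > 0 else 0.0

def gradient_factor(activation_prime, pre_activations, weight=1.0):
    """Product of the per-layer chain rule terms weight * activation'(z)."""
    factor = 1.0
    for z in pre_activations:
        factor *= weight * activation_prime(z)
    return factor

depth = 10
pre_activations = [0.5] * depth  # assumed identical at every layer

sigmoid_factor = gradient_factor(sigmoid_prime, pre_activations)
relu_factor = gradient_factor(relu_prime, pre_activations)

print(f"sigmoid, {depth} layers: {sigmoid_factor:.3e}")
print(f"ReLU,    {depth} layers: {relu_factor:.3e}")
```

Plotting `gradient_factor` against `depth` for both activations would be one possible way to produce the requested visualization.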

3. Fixing the Vanishing Gradient

Analyze solutions (ResNet, LSTM, activation functions) and their real-world significance.

Requirements

Describe what causes the vanishing gradient problem in deep neural networks.
Explain how ResNet's skip connections help reduce this problem.¹
Compare the ReLU and sigmoid activation functions, and explain why ReLU works better for deep networks.
Give one example of a real-world application where ResNet or LSTM is used, and explain why the solution helps.
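
The effect of a skip connection can be illustrated with a scalar toy model. This is a sketch under strong assumptions, not how an actual ResNet is built: each "layer" is a single sigmoid unit f(h) = sigmoid(weight * h), and a residual layer computes h + f(h). Real ResNets use convolutional blocks and typically ReLU activations. The derivative of h + f(h) with respect to h is 1 + f'(h), so each backward factor stays at or above 1 instead of at or below 0.25. The function name and parameters are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_factor(depth, weight=1.0, x=0.5, use_skip=False):
    """Product of d(output)/d(input) across `depth` scalar layers.

    Plain layer:    h_next = sigmoid(weight * h),     derivative f'(h)
    Residual layer: h_next = h + sigmoid(weight * h), derivative 1 + f'(h)
    """
    h = x
    factor = 1.0
    for _ in range(depth):
        s = sigmoid(weight * h)
        local = weight * s * (1.0 - s)  # f'(h) by the chain rule
        if use_skip:
            factor *= 1.0 + local       # identity path contributes the 1
            h = h + s
        else:
            factor *= local
            h = s
    return factor

plain = backprop_factor(depth=20, use_skip=False)
residual = backprop_factor(depth=20, use_skip=True)

print(f"plain stack, 20 layers:    {plain:.3e}")
print(f"residual stack, 20 layers: {residual:.3e}")
```

In this toy setting the plain stack's factor collapses toward zero, while the residual stack's factor cannot drop below 1, because the identity path always passes the gradient through unchanged.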

Resources

Neural Networks by 3Blue1Brown. Videos 1–4.
Vanishing Gradient Problem. YouTube. Hint: summarize this!
Vanishing Gradient Problem in Deep Learning: Explained. Shaoni Mukherjee. Good math content.
Gradient Decay in Neural Networks. Lucas on Stats StackExchange.

¹ Bonus. Use simple examples or diagrams.