Activation Functions in Deep Learning (in progress)
An intuitive journey into how they shape learning, bend gradients, and sometimes break them.
Activation functions are what make a neural network more than just a stack of linear equations. Without them, no matter how many layers you add, the network collapses into a single linear mapping.
🤔 Why do we need non-linearity in the first place?
If we keep stacking linear layers, as shown in the animation below, the network still acts like just one big linear layer. You can multiply all those weight matrices into one, and the output will still be a straight line.
The problem is this: sometimes the decision boundary that separates different classes is not a straight line. It might be curved, twisted, or broken into multiple pieces.
A pure linear model cannot bend the input space to match that. No matter how deep you make it, it can only rotate, scale, or shift things around. It cannot fold the space to create meaningful separation.
That’s where non-linearity comes in. Activation functions are what let the network bend, warp, and reshape the space between layers.
They’re the reason neural networks can do more than just draw lines.
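To make the collapse concrete, here is a minimal NumPy sketch (the layer sizes and random weights are arbitrary, chosen only for illustration, and biases are left out): two stacked linear layers produce exactly the same outputs as a single layer whose weight matrix is the product of the two.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two purely linear "layers" with no activation in between.
W1 = rng.normal(size=(4, 3))   # layer 1: 3 inputs -> 4 hidden units
W2 = rng.normal(size=(2, 4))   # layer 2: 4 hidden units -> 2 outputs
x = rng.normal(size=(3,))      # an arbitrary input vector

stacked = W2 @ (W1 @ x)        # output of the two stacked linear layers
collapsed = (W2 @ W1) @ x      # a single layer with weight matrix W2 @ W1

print(np.allclose(stacked, collapsed))  # True: the deep stack is one linear map
```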
Activation functions break that linearity. They bend the space. They allow a network to approximate weird, complex functions. But bending space is only half the story. What really matters is how that bending affects training. Because at the end of the day, all learning happens through gradients. Errors flowing backward. Tweaking the weights.
Info
- So the real question is not just “Is this function non-linear?”
- It is “How does this function treat gradients?”
- If gradients die out or explode or behave unpredictably, training becomes hell.
Sigmoid / Logistic Activation Function
The sigmoid function looks smooth, continuous, and well-behaved, which is exactly why it was so widely used in the early days. But that clean shape hides some nasty training issues.
Definition:
The sigmoid function is defined as:
\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]
This function takes any real-valued input and maps it to a value between 0 and 1.
- For large positive inputs, the output gets very close to 1.
- For large negative inputs, the output approaches 0.
This property makes it useful in models where we want to predict probabilities. After all, probabilities naturally lie between 0 and 1.
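As a quick sanity check (a throwaway sketch, with the sample inputs picked arbitrarily), evaluating the function at a few points shows the squashing into the (0, 1) range:

```python
import math

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + math.exp(-x))

for x in [-10, -3, 0, 3, 10]:
    print(f"sigmoid({x:>3}) = {sigmoid(x):.5f}")
# sigmoid(-10) ~ 0.00005, sigmoid(0) = 0.5, sigmoid(10) ~ 0.99995
```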
Derivative
The sigmoid function is differentiable, which is a crucial requirement for backpropagation. Its derivative is smooth, with no sudden jumps:
\[ \sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr) \]
Sigmoid derivative derivation
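For reference, here is the standard derivation, just the chain rule applied to \[ \sigma(x) = (1 + e^{-x})^{-1} \]:

\[ \sigma'(x) = \frac{e^{-x}}{(1 + e^{-x})^{2}} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \sigma(x)\,\bigl(1 - \sigma(x)\bigr) \]

Since \[ \sigma(x) \] lies in (0, 1), the product \[ \sigma(x)(1 - \sigma(x)) \] is largest when \[ \sigma(x) = 0.5 \], i.e. at \[ x = 0 \], where it equals 0.25.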
Now, let's explore its two main problems.
Flaw #1: The Vanishing Gradient Problem
Let's look at the graph of the derivative of the sigmoid:
It forms a bell-shaped curve that is always positive and peaks at 0.25. The gradient is only significant when the input lies approximately between -3 and +3; outside this range it becomes extremely small, close to zero. That means if a neuron's input is strongly positive or strongly negative, learning slows down drastically. This is called the vanishing gradient problem.
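To put numbers on that (a small sketch; the sample inputs are arbitrary), here is the derivative evaluated at a few points:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x)); maximum of 0.25 at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0, 1, 3, 5, 10]:
    print(f"x = {x:>2}  ->  sigmoid'(x) = {sigmoid_grad(x):.6f}")
# x = 0: 0.250000,  x = 3: ~0.045,  x = 5: ~0.0066,  x = 10: ~0.000045
```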
We can explain all this theoretically, but let's be honest: that's not always intuitive. So let's take a better approach. We want to understand how the gradient becomes smaller and eventually vanishes when using sigmoid in a deep network.
To do that, we’ll walk through a simple example: a small feedforward network with a few layers, all using sigmoid activations. Then we’ll do backpropagation step by step to see exactly how the gradient starts shrinking layer by layer. Once we do that, it will be clear why sigmoid struggles in deep networks.
Example
Network Architecture: the simplest possible chain of three sigmoid layers, each with a single weight (biases omitted to keep the algebra clean):
\[ z_1 = w_1 x, \quad a_1 = \sigma(z_1), \qquad z_2 = w_2 a_1, \quad a_2 = \sigma(z_2), \qquad z_3 = w_3 a_2, \quad a_3 = \sigma(z_3) = \hat{y} \]
Objective: We want to find how much gradient is left by the time backpropagation reaches the first layer, i.e.
\[ \frac{\partial L}{\partial z_1} \]
Info
We begin at the end of the network and apply the chain rule backward.
Step 1: From loss to the last layer
- The loss depends directly on \[ a_3 = \hat{y} \], and \[ a_3 = \sigma(z_3) \], so the derivative is:
\[ \frac{da_3}{dz_3} = \sigma'(z_3) \]
- Therefore:
\[ \frac{\partial L}{\partial z_3} = \frac{\partial L}{\partial a_3} \cdot \sigma'(z_3) \]
Step 2: Backprop to layer 2
Moving one layer back, the chain rule picks up the weight \[ w_3 \] and another sigmoid derivative:
\[ \frac{\partial L}{\partial z_2} = \frac{\partial L}{\partial z_3} \cdot w_3 \cdot \sigma'(z_2) \]
Step 3: Backprop to layer 1
One more step back picks up \[ w_2 \] and a third sigmoid derivative:
\[ \frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial z_2} \cdot w_2 \cdot \sigma'(z_1) = \frac{\partial L}{\partial a_3} \cdot \sigma'(z_3) \cdot w_3 \cdot \sigma'(z_2) \cdot w_2 \cdot \sigma'(z_1) \]
Assume for simplicity:
- All weights are around 1
- Maximum derivative of sigmoid is 0.25
Then we get:
\[ \frac{\partial L}{\partial z_1} \approx \frac{\partial L}{\partial a_3} \cdot (0.25)^3 = 0.015625 \cdot \frac{\partial L}{\partial a_3} \]
After just 3 layers, the gradient at the first layer has shrunk to only about 1.5% of what it was at the output.
Key Takeaways:
- Every time we move one layer backward, we multiply the gradient by a number less than or equal to 0.25
- Multiply many such numbers, and the gradient becomes almost zero
- This is the vanishing gradient problem
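Here is a small numerical sketch of that shrinkage, using the same simplifying assumptions as above (every weight equal to 1 and every sigmoid sitting at the sweet spot where its derivative is 0.25, which is the best case):

```python
# Each backward step through a sigmoid layer multiplies the gradient by
# w * sigma'(z). With w = 1 and sigma'(z) at its maximum of 0.25, the
# gradient shrinks by a factor of 4 per layer -- in practice usually more.
upstream = 1.0        # dL/da3, the gradient arriving at the output
w = 1.0               # assume every weight is around 1
sigmoid_grad = 0.25   # best case: sigma'(z) at its peak

grad = upstream
for layer in [3, 2, 1]:
    grad *= w * sigmoid_grad
    print(f"gradient after backprop through layer {layer}: {grad:.6f}")
# layer 3: 0.25, layer 2: 0.0625, layer 1: 0.015625  (about 1.5% of the original)
```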
Flaw #2: The Output of Sigmoid Is Not Symmetric Around 0
Before we dig into this issue, I think we should ask ourselves: why does the sign of a gradient matter for a weight update in the first place?
A Quick Refresher on Weight Updates
Let's first recall how weights are updated during training:
\[ w_{\text{new}} = w_{\text{old}} - \eta \cdot \frac{\partial L}{\partial w} \]
The sign of the gradient decides the direction of the update: a negative gradient pushes the weight up, a positive gradient pushes it down.
So you see why the gradient needs to be able to take both signs. But look at the sigmoid again: its output is always positive, and if you look at the graph of its derivative, that is always positive too, for any value of x. Remember this, because that's where the problem starts.
During backpropagation, the gradient with respect to a weight is computed as:
\[ \frac{\partial L}{\partial w_i} = \delta \cdot x_i \]
where \[ x_i \] is the input flowing through that weight and \[ \delta \] is the error term arriving from the layers above. Since every \[ x_i \] is positive (it is the output of a sigmoid from the previous layer), and because the derivative of sigmoid is also always positive, the sign of the gradient is the same for every weight of the neuron: it is fixed entirely by \[ \delta \]. In a single update step, all of a neuron's weights can only increase together or decrease together.
Even though in reality the weights should sometimes be pushed in different directions, some up and some down (picture it in 2D), the structure of the sigmoid prevents that flexibility. The result is biased updates and slower convergence.
Another way to put it: the direction of the update vector is always tilted. It's not adaptive. Instead of pointing straight toward the loss minimum, it keeps pulling diagonally. This slanted gradient makes optimization inefficient, especially in high-dimensional spaces.
Even in high-dimensional weight spaces, this problem doesn’t go away.
A model might be trying to reduce the loss by adjusting many weights simultaneously, some increasing, some decreasing, depending on the slope of the loss surface in that region. But when using sigmoid activations, all inputs to a neuron are positive, and the gradient vector for that neuron's weights ends up with every component carrying the same sign.
This breaks the optimizer's ability to descend diagonally or adaptively. The update vector cannot point straight toward the loss minimum; it is always skewed. The result is inefficient learning, an unstable optimization path, and slower convergence, especially in deeper networks.
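A tiny sketch of that constraint (the numbers are made up: one neuron whose inputs come from a previous sigmoid layer and are therefore all positive). Every component of the weight gradient inherits the sign of the upstream error term, so in any single step the weights can only move together.

```python
import numpy as np

# Inputs to this neuron are outputs of a previous sigmoid layer: all in (0, 1).
x = np.array([0.12, 0.85, 0.40, 0.67])

# delta is the upstream error term dL/dz for this neuron (a scalar).
for delta in [+0.3, -0.3]:
    grad_w = delta * x   # dL/dw_i = delta * x_i
    print(f"delta = {delta:+.1f}  ->  grad_w = {np.round(grad_w, 3)}")
# Every component of grad_w shares the sign of delta: the weights all
# increase together or all decrease together in a single update.
```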
