Missing calculation of gradient descent derivatives in the Coursera Deep Learning specialization

February 20, 2018


For some reason, Andrew Ng didn't show how to obtain the derivatives used in backpropagation; he presented only the final results. I didn't feel good about that: it's unpleasant not to understand something from the very beginning. That's why I performed the necessary calculations on my own.

Would you like to see how it's done?

1. Calculation of derivatives for a single (or last) neuron with a sigmoid activation function

The linear part looks as follows:

$$ z^{(i)} = w^T x^{(i)} + b $$

As the activation function we use the sigmoid function \(\sigma\):

$$ \hat{y}^{(i)} = \sigma(z^{(i)}) = \frac{1}{1 + e^{-z^{(i)}}} $$

We'll define the loss function \(\mathcal{L}\) and the cost function \(\mathcal{J}\) just as Andrew did:

$$ \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = -[y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})] $$

$$ \mathcal{J}(w, b) = \frac{1}{m} \sum_{i = 1}^m \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = -\frac{1}{m} \sum_{i = 1}^m [y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})] $$
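
Before we differentiate anything, it may help to see the definitions above as code. Here is a minimal NumPy sketch of the forward pass and the cost; the function names, variable names, and shape conventions are my own, not something from the course:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(w, b, X, Y):
    """Forward pass and cost for a single sigmoid neuron.

    Shapes: X is (n_features, m), Y is (1, m), w is (n_features, 1),
    b is a scalar. Returns the activations and the cost J.
    """
    Z = w.T @ X + b                    # linear part, shape (1, m)
    Y_hat = sigmoid(Z)                 # predicted probabilities
    # cross-entropy cost, averaged over the m examples
    J = -np.mean(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))
    return Y_hat, J
```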

Now we calculate the derivative of the function \(\mathcal{J}(\sigma(z))\) using the chain rule:

$$ \frac{d \mathcal{J}}{d z} = \frac{d \mathcal{J}}{d \sigma} \frac{d \sigma}{d z} $$

The first derivative:

$$ \frac{d \mathcal{J}}{d \sigma} = -\frac{1}{m} \left( \frac{y}{\sigma} - \frac{1 - y}{1 - \sigma} \right) = -\frac{1}{m} \frac{y(1 - \sigma) - \sigma (1 - y)}{\sigma (1 - \sigma)} = -\frac{1}{m} \frac{y - \sigma}{\sigma (1 - \sigma)} = \frac{1}{m} \frac{\sigma -y}{\sigma (1 - \sigma)} $$

And the second one, applying the chain rule to \((1 + e^{-z})^{-1}\):

$$ \frac{d \sigma}{d z} = \frac{e^{-z}}{{(1 + e^{-z})}^2} $$

In the last equation we express \(e^{-z}\) in terms of \(\sigma\) (below, \(\sigma^{-1}\) means \(1/\sigma\), not the inverse function):

$$ \sigma(z) = \frac{1}{1 + e^{-z}} \Rightarrow \sigma^{-1} = 1 + e^{-z} \Rightarrow e^{-z} = \sigma^{-1} - 1 $$

So, we have:

$$ \frac{d \sigma}{d z} = \frac{\sigma^{-1} - 1}{\sigma^{-2}} = \sigma^2(\sigma^{-1} - 1) = \sigma(1 - \sigma) $$
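
This identity is easy to sanity-check numerically. A quick sketch (the test points and step size are arbitrary choices of mine) comparing the analytic derivative \(\sigma(1 - \sigma)\) with a central finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central difference
analytic = sigmoid(z) * (1 - sigmoid(z))
print(np.max(np.abs(numeric - analytic)))  # tiny (~1e-11): the two agree
```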

Now we combine both derivatives together:

$$ \frac{d \mathcal{J}}{d z} = \frac{d \mathcal{J}}{d \sigma} \frac{d \sigma}{d z} = \frac{1}{m} \frac{\sigma - y}{\sigma (1 - \sigma)} \sigma(1 - \sigma) = \frac{1}{m} (\sigma -y) $$

All that remains at the moment is to differentiate with respect to \(w\) and \(b\):

$$ \frac{\partial \mathcal{J}}{\partial w} = \frac{d \mathcal{J}}{d z} \frac{\partial z}{\partial w} = \frac{1}{m} (\sigma - y) x = x \frac{d \mathcal{J}}{d z} $$

$$ \frac{\partial \mathcal{J}}{\partial b} = \frac{d \mathcal{J}}{d z} \frac{\partial z}{\partial b} = \frac{1}{m} (\sigma - y) = \frac{d \mathcal{J}}{d z} $$
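
In code, these formulas vectorize naturally over the m examples. A sketch, assuming the same shape conventions as above (again, the function name and shapes are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients(w, b, X, Y):
    """Gradients of the cost J for a single sigmoid neuron.

    Shapes: X is (n_features, m), Y is (1, m), w is (n_features, 1).
    """
    m = X.shape[1]
    Y_hat = sigmoid(w.T @ X + b)   # shape (1, m)
    dZ = (Y_hat - Y) / m           # dJ/dz = (sigma - y) / m for each example
    dw = X @ dZ.T                  # dJ/dw = x * dJ/dz, summed over examples
    db = np.sum(dZ)                # dJ/db = dJ/dz, summed over examples
    return dw, db
```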

That's all.



2. Calculation of derivatives for a hidden layer

For two layers, forward propagation looks as follows:

$$ z^{[1]} = w^{[1]} x + b^{[1]} \rightarrow a^{[1]} = g^{[1]}(z^{[1]}) \rightarrow z^{[2]} = w^{[2]} a^{[1]} + b^{[2]} \rightarrow a^{[2]} = \sigma^{[2]} (z^{[2]}) \rightarrow \mathcal{L}(a^{[2]}, y) $$

The derivatives for the last layer have already been worked out above, so everything there is clear:

$$ dz^{[2]} = a^{[2]} - y $$ $$ dw^{[2]} = dz^{[2]} a^{[1]T} $$ $$ db^{[2]} = dz^{[2]} $$

But what about the first layer? Andrew provides us with the following formulas:

$$ dz^{[1]} = w^{[2]T} dz^{[2]} \ast g^{[1]'}(z^{[1]}) $$ $$ dw^{[1]} = dz^{[1]} x^T $$ $$ db^{[1]} = dz^{[1]} $$

Let's derive them!

Chain rule:

$$ \frac{d \mathcal{J}}{d z^{[2]}} = \frac{d \mathcal{J}}{d \sigma^{[2]}} \frac{d \sigma^{[2]}}{d z^{[2]}} $$

And again:

$$ \frac{d \mathcal{J}}{d z^{[1]}} = \frac{d \mathcal{J}}{d z^{[2]}} \frac{d z^{[2]}}{d z^{[1]}} $$

The linear part:

$$ z^{[2]} = w^{[2]} a^{[1]} + b^{[2]} = w^{[2]} g^{[1]}(z^{[1]}) + b^{[2]} $$

And we have:

$$ \frac{d z^{[2]}}{d z^{[1]}} = w^{[2]T} \frac{d g^{[1]}}{d z^{[1]}} $$

Therefore:

$$ \frac{d \mathcal{J}}{d z^{[1]}} = \frac{d \mathcal{J}}{d z^{[2]}} \frac{d z^{[2]}}{d z^{[1]}} = w^{[2]T} \frac{d \mathcal{J}}{d z^{[2]}} \frac{d g^{[1]}}{d z^{[1]}} $$

And, as before, we differentiate with respect to \(w\) and \(b\):

$$ \frac{\partial \mathcal{J}}{\partial w^{[1]}} = \frac{d \mathcal{J}}{d z^{[1]}} \frac{\partial z^{[1]}}{\partial w^{[1]}} = \frac{d \mathcal{J}}{d z^{[1]}} x^T $$

$$ \frac{\partial \mathcal{J}}{\partial b^{[1]}} = \frac{d \mathcal{J}}{d z^{[1]}} \frac{\partial z^{[1]}}{\partial b^{[1]}} = \frac{d \mathcal{J}}{d z^{[1]}} $$
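
To close the loop, here is a minimal NumPy sketch of the whole backward pass for the two-layer case. I picked \(\tanh\) as the hidden activation \(g^{[1]}\) (so \(g^{[1]'}(z) = 1 - \tanh^2(z)\)); the function name and shape conventions are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_two_layer(X, Y, W1, b1, W2, b2):
    """Backward pass for a 2-layer net: tanh hidden layer, sigmoid output.

    Shapes: X (n_x, m), Y (1, m), W1 (n_h, n_x), b1 (n_h, 1),
    W2 (1, n_h), b2 (1, 1). Gradients are averaged over the m examples.
    """
    m = X.shape[1]
    # forward propagation
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)
    # backward propagation, following the formulas derived above
    dZ2 = (A2 - Y) / m
    dW2 = dZ2 @ A1.T
    db2 = np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)   # elementwise product with g1'(z1)
    dW1 = dZ1 @ X.T
    db1 = np.sum(dZ1, axis=1, keepdims=True)
    return dW1, db1, dW2, db2
```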

Enjoy!