Key Notes
- The problem is almost linearly separable
- We can therefore design an (almost) error-free linear classifier
Training
Minimize the mean squared error (MSE) until convergence, using gradient descent
How it is written in the book:
$$E = \frac{1}{2}\sum_{k=1}^{N}\lVert g_k - t_k\rVert^2$$
Or, more familiarly:
$$C(W) = \frac{1}{2}\sum_{k=1}^{N}\lVert g(x_k) - t_k\rVert^2$$
And applying the bias trick (append a constant $1$ to every input so the bias is absorbed into $W$), we write
$$C(W) = \frac{1}{2}\sum_{k=1}^{N}\lVert f(W x_k) - t_k\rVert^2$$
Where $t_k$ is the ground truth (label) of sample $x_k$.
We also need to squash the results between $0$ and $1$ → use the sigmoid $\sigma(z) = \frac{1}{1 + e^{-z}}$ as the activation function $f$.
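A minimal sketch of the sigmoid in plain Python (the function name `sigmoid` is my own choice, not from the notes), showing how it squashes any real input into the open interval $(0, 1)$:

```python
import math

def sigmoid(z: float) -> float:
    """Squash any real-valued input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Large negative inputs land near 0, zero maps to exactly 0.5,
# and large positive inputs land near 1:
print(sigmoid(-10.0), sigmoid(0.0), sigmoid(10.0))
```

Because the output is always strictly between $0$ and $1$, it can be read as a class score for a two-class problem.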
No explicit solution for the minimum of the MSE exists. We therefore use a gradient-based technique called Gradient Descent.
From page 79 we know that
$$\nabla_W C = -\sum_{k=1}^{N}\bigl((t_k - g_k) \odot g_k \odot (1 - g_k)\bigr)\, x_k^T$$
Where
$$g_k = \sigma(W x_k), \qquad \sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr)$$
Note that $\odot$ means elementwise multiplication.
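The chain-rule gradient above can be sketched for a single output and a single sample (a simplification of the vectorized form; the helper names are my own):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def mse_gradient(w, x, t):
    """Gradient of C = 1/2 * (g - t)^2 for one sample, where
    g = sigmoid(w . x).  By the chain rule:
        dC/dw_i = -(t - g) * g * (1 - g) * x_i
    """
    g = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [-(t - g) * g * (1.0 - g) * xi for xi in x]
```

The three factors mirror the derivation: $(t - g)$ from the squared error, $g(1 - g)$ from the sigmoid's derivative, and $x_i$ from the linear input.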
Combining these results with Gradient Descent, we get
$$W^{(n+1)} = W^{(n)} + \alpha \sum_{k=1}^{N}\bigl((t_k - g_k) \odot g_k \odot (1 - g_k)\bigr)\, x_k^T$$
Where
- $g_k = \sigma(W^{(n)} x_k)$ is the classifier output
- $t_k$ is the ground truth
- $n$ is the iteration number
- $\alpha$ is the step factor (learning rate)
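The whole procedure (bias trick, sigmoid, MSE gradient, iterative update) can be sketched as a small training loop in plain Python. This is a simplified single-output version with per-sample updates; the function names and the toy data are my own, not from the notes:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, labels, alpha=0.5, iterations=2000):
    """Gradient descent on the MSE of a single sigmoid unit.

    Bias trick: a constant 1 is appended to every input, so the
    bias lives inside the weight vector w instead of being a
    separate parameter.
    """
    xs = [x + [1.0] for x in samples]      # bias trick
    w = [0.0] * len(xs[0])                 # initial weights
    for _ in range(iterations):            # n = iteration number
        for x, t in zip(xs, labels):
            g = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            # Update rule: w <- w + alpha * (t - g) * g * (1 - g) * x
            w = [wi + alpha * (t - g) * g * (1.0 - g) * xi
                 for wi, xi in zip(w, x)]
    return w

# Hypothetical 1-D toy data: class 0 below zero, class 1 above.
w = train([[-2.0], [-1.0], [1.0], [2.0]], [0.0, 0.0, 1.0, 1.0])
```

On data this close to linearly separable, the loop drives the MSE near zero, matching the key note that an (almost) error-free linear classifier exists.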