Training Neural Networks using Steepest Descent
Task
Implement a steepest descent algorithm to train the neural network
Given this fully connected neural network
We want to train it, i.e., solve the following minimization problem: find the parameter vector θ that minimizes the squared-error objective E(θ) = Σᵢ ½ (yᵢ − f_NN(xᵢ, θ))² over the N data points (xᵢ, yᵢ).
Note that this is an unconstrained optimization problem, but since this network is very small we can use a standard gradient algorithm without “cheating.”
Important stuff to notice
f_NN(x, θ) is the neural network output, and ∇_θ f_NN(x, θ) is the gradient of that output with respect to the parameters θ.
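Written out from the code below, the gradient of the squared-error objective follows from the chain rule, and each steepest descent step moves against it:

```latex
\nabla_\theta E(\theta) = -\sum_{i=1}^{N} \bigl(y_i - f_{\mathrm{NN}}(x_i,\theta)\bigr)\,\nabla_\theta f_{\mathrm{NN}}(x_i,\theta),
\qquad
\theta_{k+1} = \theta_k - \alpha\,\nabla_\theta E(\theta_k)
```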
Solution
The solution fills in the gradient accumulation and the parameter update in the main for loop. The complete MATLAB script:
% Data
x = 0:0.2:2*pi;
rng('default');
y = .1*(x + cos(x) + 1) + .1*rand(size(x));
N = size(x,2); % Number of data points
n_theta = 9; % Number of parameters
% Steepest descent parameters
alpha = .1; % Step size (learning rate)
n_iter = 2000; % Number of iterations
% Allocate storage
ES = zeros(1,n_iter+1);
theta = zeros(n_theta,n_iter+1);
theta(:,1) = rand(n_theta,1); % initial value of parameter
% Initial value of E
E = 0;
for i = 1:N,
E = E + .5* (y(i) - f_NN(x(i),theta(:,1)) )^2;
end
ES(1) = E; % ES stores the objective value at each iteration, for plotting
% Steepest descent algorithm
for k=1:n_iter,
% Initialize gradient accumulator
dE_dtheta = zeros(n_theta,1);
% Sum over all data points to calculate gradient
for i = 1:N,
% Compute gradient of the neural network output with respect to parameters
[dydtheta_i, y_i] = gradient_f_NN(x(i),theta(:,k));
% Compute the residual (error) for this data point
e_i = y(i) - y_i;
% Accumulate the gradient of the error function
dE_dtheta = dE_dtheta + (-e_i * dydtheta_i);
end
% Update parameters using steepest descent
theta(:,k+1) = theta(:,k) - alpha * dE_dtheta;
% Recalculate value of E for the updated parameters
E = 0;
for i = 1:N,
E = E + .5* (y(i) - f_NN(x(i),theta(:,k+1)) )^2;
end
ES(k+1) = E;
end
theta_opt = theta(:,end);
figure(1)
plot(0:n_iter,ES);
xlabel('Iteration'); ylabel('E');
figure(2);
plot(x,y,'o'); hold on;
plot(x,f_NN(x,theta_opt),'m--'); hold off
xlabel('x'); ylabel('y'); legend('data','NN fit');
function y = f_NN(x,theta)
% Do a forward pass of the 1-2-1 neural network (first hidden layer: 1 neuron; second: 2 neurons).
%
% x input
% theta parameters theta = (w1, b1, w2, b2, w3, b3, w41, w42, b4)
%
% y output
% Activation function ("logistic")
sigma = @(r) 1./(1+exp(-r));
w1 = theta(1);
b1 = theta(2);
w2 = theta(3);
b2 = theta(4);
w3 = theta(5);
b3 = theta(6);
w41 = theta(7);
w42 = theta(8);
b4 = theta(9);
y1 = sigma(w1*x + b1);
y2 = sigma(w2*y1 + b2);
y3 = sigma(w3*y1 + b3);
y = sigma(w41*y2 + w42*y3 + b4);
end
function [dydtheta, y] = gradient_f_NN(x,theta)
% Calculate the gradient of the 1-2-1 NN wrt parameters.
% Note: This is an inefficient implementation, using "forward" (chain rule)
% differentiation. "Reverse" differentiation aka back propagation
% would be more efficient, but the difference is not
% significant for this small example. Machine learning frameworks use
% reverse mode of automatic differentiation to compute these derivatives.
%
% x input
% theta parameters theta = (w1, b1, w2, b2, w3, b3, w41, w42, b4)
%
% y output
% dydtheta gradient of y wrt theta
% Activation function ("logistic")
sigma = @(r) 1./(1+exp(-r));
% Derivative of activation function (as a function of the output of the sigma function)
dsigmadr = @(s) s.*(1-s);
w1 = theta(1);
b1 = theta(2);
w2 = theta(3);
b2 = theta(4);
w3 = theta(5);
b3 = theta(6);
w41 = theta(7);
w42 = theta(8);
b4 = theta(9);
y1 = sigma(w1*x + b1);
y2 = sigma(w2*y1 + b2);
y3 = sigma(w3*y1 + b3);
y = sigma(w41*y2 + w42*y3 + b4);
dydtheta = [
dsigmadr(y)*w41*dsigmadr(y2)*w2*dsigmadr(y1)*x + dsigmadr(y)*w42*dsigmadr(y3)*w3*dsigmadr(y1)*x; % dydw1
dsigmadr(y)*w41*dsigmadr(y2)*w2*dsigmadr(y1) + dsigmadr(y)*w42*dsigmadr(y3)*w3*dsigmadr(y1); % dydb1
dsigmadr(y)*w41*dsigmadr(y2)*y1; % dydw2
dsigmadr(y)*w41*dsigmadr(y2); % dydb2
dsigmadr(y)*w42*dsigmadr(y3)*y1; % dydw3
dsigmadr(y)*w42*dsigmadr(y3); % dydb3
dsigmadr(y)*y2; % dydw41
dsigmadr(y)*y3; % dydw42
dsigmadr(y); % dydb4
];
end
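As a sanity check, the analytic gradient above can be compared against central finite differences. A Python/NumPy sketch of that check (the function and variable names here are illustrative, not part of the original MATLAB code):

```python
import numpy as np

def sigma(r):
    return 1.0 / (1.0 + np.exp(-r))

def f_nn(x, theta):
    # Forward pass, mirroring f_NN: theta = (w1, b1, w2, b2, w3, b3, w41, w42, b4)
    w1, b1, w2, b2, w3, b3, w41, w42, b4 = theta
    y1 = sigma(w1 * x + b1)
    y2 = sigma(w2 * y1 + b2)
    y3 = sigma(w3 * y1 + b3)
    return sigma(w41 * y2 + w42 * y3 + b4)

def grad_f_nn(x, theta):
    # Forward-mode (chain rule) gradient, mirroring gradient_f_NN
    w1, b1, w2, b2, w3, b3, w41, w42, b4 = theta
    y1 = sigma(w1 * x + b1)
    y2 = sigma(w2 * y1 + b2)
    y3 = sigma(w3 * y1 + b3)
    y = sigma(w41 * y2 + w42 * y3 + b4)
    ds = lambda s: s * (1.0 - s)        # sigma'(r), expressed via sigma(r)
    path2 = ds(y) * w41 * ds(y2)        # chain through y2
    path3 = ds(y) * w42 * ds(y3)        # chain through y3
    return np.array([
        (path2 * w2 + path3 * w3) * ds(y1) * x,  # dy/dw1
        (path2 * w2 + path3 * w3) * ds(y1),      # dy/db1
        path2 * y1,                               # dy/dw2
        path2,                                    # dy/db2
        path3 * y1,                               # dy/dw3
        path3,                                    # dy/db3
        ds(y) * y2,                               # dy/dw41
        ds(y) * y3,                               # dy/dw42
        ds(y),                                    # dy/db4
    ]), y

# Compare against central finite differences at a random point
rng = np.random.default_rng(0)
theta = rng.uniform(size=9)
x = 1.3
g, _ = grad_f_nn(x, theta)
h = 1e-6
fd = np.array([(f_nn(x, theta + h*e) - f_nn(x, theta - h*e)) / (2*h)
               for e in np.eye(9)])
print(np.max(np.abs(g - fd)))  # should be tiny
```

If the two gradients disagree by more than roughly the finite-difference truncation error, one of the nine derivative formulas has a typo.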
NOTE:
dE_dtheta
- ∂E/∂θ
- Represents the gradient of the error function E with respect to the parameters θ
dydtheta
- ∂y/∂θ
- Represents the gradient of the network output y with respect to the parameters θ
dydtheta_i
- ∂y/∂θ evaluated at x = x_i
- Represents the gradient of the output at a specific input data point x_i, with respect to the parameters θ.
The core formula for updating the parameters is the steepest descent step:
Core formula
theta(:,k+1) = theta(:,k) - alpha * dE_dtheta;
The parameter vector from the last iteration, theta(:,end), is taken as the (approximately) optimal set of parameters.
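For readers without MATLAB, the whole procedure can be sketched in Python/NumPy. To keep the sketch short, the analytic gradient is replaced here by central finite differences of E; all names are illustrative, and the data and hyperparameters mimic the script above:

```python
import numpy as np

def sigma(r):
    return 1.0 / (1.0 + np.exp(-r))

def f_nn(x, theta):
    # Forward pass of the same 1-2-1 network; works elementwise on array x
    w1, b1, w2, b2, w3, b3, w41, w42, b4 = theta
    y1 = sigma(w1 * x + b1)
    y2 = sigma(w2 * y1 + b2)
    y3 = sigma(w3 * y1 + b3)
    return sigma(w41 * y2 + w42 * y3 + b4)

def E(theta, x, y):
    # Squared-error objective summed over all data points
    return 0.5 * np.sum((y - f_nn(x, theta)) ** 2)

def grad_E(theta, x, y, h=1e-6):
    # Central finite differences, standing in for the analytic gradient
    return np.array([(E(theta + h*e, x, y) - E(theta - h*e, x, y)) / (2*h)
                     for e in np.eye(theta.size)])

rng = np.random.default_rng(0)
x = np.arange(0, 2*np.pi, 0.2)
y = 0.1 * (x + np.cos(x) + 1) + 0.1 * rng.uniform(size=x.size)

alpha, n_iter = 0.1, 2000
theta = rng.uniform(size=9)     # random initial parameters
E0 = E(theta, x, y)
for _ in range(n_iter):
    theta = theta - alpha * grad_E(theta, x, y)  # steepest descent step
print(E0, E(theta, x, y))  # objective before and after training
```

The final objective should be well below the initial one, matching the decreasing E curve in figure 1 of the MATLAB script.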