# Neural Network

Today I have started looking into Long Short Term Memory. LSTM is a kind of Recurrent Neural Network (RNN). The first impression that I had was LSTM is a combination of Neural Network (NN) and State Space Models, specifically Hidden Markov Models (HMM). That would form RNN. LSTM is RNN with filter. To understand them fully, I decided to code it, starting off with Neural Network.

Following is my short note. There are no fancy visualizations for now and I try to include as much related terms as possible as they used to confuse me a lot.

### Sample Neural Network coded from scratch

from sklearn import datasets
import numpy as np
import math

def sigFn(z):
return 1.0/(1.0 + math.exp(-z))

def weiSum(x, w):
return np.matmul(x, w)

def apply_row_fn(X, row_fn, **kwargs):
return np.array([row_fn(X[i], i, **kwargs) for i in range(X.shape[0])])

def apply_ele_fn(x, i, ele_fn):
return np.array([ele_fn(j) for j in x])

return np.append(x, 1)

def remOne_row(x, i):
return x[:-1]

def get_dEdy_row(x, i, y):
return np.array([-0.5*((1 if y[i] == j else 0) - x[j]) for j in range(x.shape[0])])

def get_dydz_ele(y_hat):
return -(1 - y_hat) * y_hat

def get_SE_row(x, i, y):
return np.array([math.pow(((1 if y[i] == j else 0) - x[j]), 2) for j in range(n_class - 1)])

def get_pred_row(x, i, y):
return np.array([(1 if np.argmax(np.append(x, 1 - np.sum(x))) == y[i] else 0)])

y = iris.target

n_iter = 50
n_h1_units = 10
n_class = 3
W_h1_out = 2 * (np.random.rand(n_h1_units + 1, n_class - 1) - 0.5)
W_in_h1 = 2 * (np.random.rand(iris.data.shape[1] + 1, n_h1_units) - 0.5)
eta = 0.02

for i in range(n_iter):
y_h1 = apply_row_fn(apply_row_fn(weiSum(X, W_in_h1), apply_ele_fn, ele_fn=sigFn), addOne_row)
y_hat = apply_row_fn(weiSum(y_h1, W_h1_out), apply_ele_fn, ele_fn=sigFn)

SE = apply_row_fn(y_hat, get_SE_row, y=y)
ACC = apply_row_fn(y_hat, get_pred_row, y=y)
print("Iter %s: SE = %s | ACC = %s" % (str(i), str(np.sum(SE)), str(100 * float(np.sum(ACC)) / X.shape[0])))

dEdy = apply_row_fn(y_hat, get_dEdy_row, y=y)
dydz = apply_row_fn(y_hat, apply_ele_fn, ele_fn=get_dydz_ele)
dzdy_h1 = W_h1_out
dy_h1dz_h1 = apply_row_fn(y_h1, apply_ele_fn, ele_fn=get_dydz_ele)
dz_h1dw = X
dz_dw_h1 = y_h1

dEdz = dEdy * dydz

W_in_h1 = W_in_h1 + eta * apply_row_fn(np.matmul(np.transpose(dz_h1dw), np.matmul(dEdz, np.transpose(dzdy_h1)) * dy_h1dz_h1), remOne_row)
W_h1_out = W_h1_out + eta * np.matmul(np.transpose(dz_dw_h1), dEdz)

# Iter 156: SE = 64.5131992554 | ACC = 79.3333333333
# Iter 157: SE = 64.5568255983 | ACC = 80.6666666667
# Iter 158: SE = 64.5982885577 | ACC = 82.0
# Iter 159: SE = 64.6376604297 | ACC = 82.0
# Iter 160: SE = 64.6750148136 | ACC = 82.6666666667
# Iter 161: SE = 64.7104259167 | ACC = 81.3333333333

### Background (Skip this if one has some background on Neural Networks)

Let’s begin with the basics, Neural Network. Neural Network without hidden layer can be thought of Linear Regression, weighted sum. Adding the Sigmoid function it then becomes Logistic Regression.

Weighted sum,

\begin{aligned} & z_i = \sum_{j} W_{j}X_{ij} \\ \end{aligned}

where $i$ is the index of data point, $j$ is the index of feature and $z_i$ is the predicted output.

With Sigmoid function,

\begin{aligned} & y_i & & = \frac{1}{1 + e^{-z_i}} \\ &&& = \frac{1}{1 + e^{-\sum_{j} w_{j}x_{ij}}}\\ \end{aligned}

Now $y_i$ has become the predicted output instead of $z_i$. “How do we find the solution, $W?$” you asked. We can use Least Squares or Gradient Descent. As mentioned, Gradient Descent

Generalizing this,

\begin{aligned} & y_i & & = f_1(z_i) \\ &&& = f_1(f_2(x_i))\\ \end{aligned}

where $f_2$ is the weighted sum of features and $f_1$ is $1$ for weighted sum and Sigmoid function in the case of “Logistic Regression”. $f_2$ is called as a logic gate or activation function, $z_i$ is “input” for Neuron $i$ and $y_i$ is the “output”. This “Regression” is also called as Perceptron or Neuron. For simplicity, using it as a classifier, a threshold could be set such that the output above this threshold belongs to 1 of those 2 possible classes. 1 Neuron can classify 2 classes whereas 2 Neurons can classify 3 classes. This is called as 1 vs All logical structure.

Hidden layer(s) is introduced to allow the “system” to perform more complicated work, $y_{i, new} = f_{hidden}(y_{i})$. For an example, when we were young, firstly we learned about alphabets, then $W$ was taught and hence we learned how to make words from them. After that, a “hidden layer” is added so that we know how to make sentences and so on. One might ask, “Why don’t we just learn $W$ that could allow us to write essay directly?”. One could think of the combinations of alphabets vs the combinations of words in an essay, the number of combinations is less with words, hence it is faster and easier to learn. Learning alphabets is easier compared to learning words. Putting that in this context, number of Neurons per hidden layer may not be the same. The layer is hidden because the learned weights might not make any sense to us, it is a knowledge storing capacity.

When there are more layers, the learning capacity increases and hence the buzzword “Deep Learning”. Don’t get shook by it! One could think of it as aligning several batteries serially to increase the Voltage and guess what, Deep Learning is not limited to just Neural Network, it could be a serial combination of any other Machine Learning (ML) algorithms, even a different one at every level. However for the sake of simplicity, let’s limit it to just Neural Network.

What is left to know now is how do we train Neural Network.

For a classification problem, assuming that there are only 3 possible classes and 0 hidden layer, 2 neurons are needed

\begin{aligned} E_{total} = E_1 + E_2\\ E_1 = \frac{1}{2}(y_1 - \hat{y}_1)^2 \\ \hat{y}_1 = \frac{1}{1 + e^{-\sum_{j} w_{j}x_{ij}}} \end{aligned}

where $y_1 \in \{0, 1\}$ is the ground truth, i.e.: $y_1 = 1, y_2 = 0$ if a certain data point belongs to class 1.

So the learning of weight between feature 1, $x_1$ and input for Neuron 1, $z_1$ would be

\begin{aligned} w_{x_1z_1} \gets w_{x_1z_1} - \eta\frac{\partial E_{total}}{\partial w_{x_1z_1}} \\ \frac{\partial E_{total}}{\partial w_{x_1z_1}} = \frac{\partial E_{total}}{\partial y_1}\frac{\partial y_1}{\partial w_{x_1z_1}} \\ \frac{\partial y_1}{\partial w_{x_1z_1}} = \frac{\partial y_1}{\partial z_1}\frac{\partial z_1}{\partial w_{x_1z_1}} \end{aligned}

Neural Network with a single hidden layer is also known as Multi Layer Perceptron (MLP). Assuming that there are 3 classes, and 1 hidden layer with $n_{h_1}$ neurons, there are 2 layer of weights to learn, 1. between hidden layer and output layer, 2. between raw input and hidden layer.

1. Between hidden layer and output layer

\begin{aligned} w_{h_{11}z_1} \gets w_{h_{11}z_1} - \eta\frac{\partial E_{total}}{\partial w_{h_{11}z_1}} \\ \frac{\partial E_{total}}{\partial w_{h_{11}z_1}} = \frac{\partial E_{total}}{\partial y_1}\frac{\partial y_1}{\partial w_{h_{11}z_1}} \\ \frac{\partial y_1}{\partial w_{h_{11}z_1}} = \frac{\partial y_1}{\partial z_1}\frac{\partial z_1}{\partial w_{h_{11}z_1}} \end{aligned}

2. Between input layer and hidden layer

\begin{aligned} w_{x_1h_{11}} \gets w_{x_1h_{11}} - \eta\frac{\partial E_{total}}{\partial w_{x_1h_{11}}} \\ \end{aligned}
\begin{aligned} \frac{\partial E_{total}}{\partial w_{x_1h_{11}}} = \sum_{i}\bigg(\frac{\partial E_i}{\partial y_i}\frac{\partial y_i}{\partial z_i}\frac{\partial z_i}{\partial y_{h_{11}}}\bigg)\frac{\partial y_{h_{11}}}{\partial z_{h_{11}}}\frac{\partial z_{h_{11}}}{\partial w_{x_1h_{11}}} \\ \end{aligned}

where

\begin{aligned} & \frac{\partial E_i}{\partial y_i} & & = -(y_i - \hat{y}_i) \\ & \frac{\partial y_i}{\partial z_i} & & = -\hat{y}_i(1 - \hat{y}_i) \\ & \frac{\partial z_i}{\partial y_{h_{11}}} & & = w_{h_{11}z_i} \\ & \frac{\partial y_{h_{11}}}{\partial z_{h_{11}}} & & = 1 - \hat{y}_{h_{11}} \\ & \frac{\partial z_{h_{11}}}{\partial w_{x_1h_{11}}} & & = x_1 \\ \end{aligned}

to be continued… (RNN and LSTM)

Reference(s):

## 3 thoughts on “Neural Network”

1. […] is just NN of fixed parameters at each hidden […]

2. […] Learning bears the similar idea as Deep Neural Network (DNN). Using a more obvious example that is image processing, what DNN (specifically Convolution […]

3. […] already know what Deep Neural Network is but what about it’s breadth. A broad deep learning means the use of features from […]