Today I started looking into Long Short-Term Memory (LSTM). LSTM is a kind of Recurrent Neural Network (RNN). My first impression was that combining a Neural Network (NN) with a State Space Model, specifically a Hidden Markov Model (HMM), would form an RNN, and that an LSTM is an RNN with filters. To understand them fully, I decided to code them, starting with a Neural Network.

The following is my short note. There are no fancy visualizations for now, and I try to include as many related terms as possible, as they used to confuse me a lot.

### Sample Neural Network coded from scratch

    from sklearn import datasets
    import numpy as np
    import math

    def sigFn(z):
        # Sigmoid activation for a single scalar z.
        return 1.0 / (1.0 + math.exp(-z))

    def weiSum(x, w):
        # Weighted sum of the inputs, as a matrix product.
        return np.matmul(x, w)

    def apply_row_fn(X, row_fn, **kwargs):
        # Call row_fn(row, row_index, **kwargs) on every row of X.
        return np.array([row_fn(X[i], i, **kwargs) for i in range(X.shape[0])])

    def apply_ele_fn(x, i, ele_fn):
        # Call ele_fn on every element of row x.
        return np.array([ele_fn(j) for j in x])

    def addOne_row(x, i):
        # Append a constant 1 that acts as the bias input.
        return np.append(x, 1)

    def remOne_row(x, i):
        # Drop the last element (the entry belonging to the bias unit).
        return x[:-1]

    def get_dEdy_row(x, i, y):
        # dE/dy_hat for each output unit (up to a positive constant factor).
        return np.array([-0.5 * ((1 if y[i] == j else 0) - x[j]) for j in range(x.shape[0])])

    def get_dydz_ele(y_hat):
        # Derivative of the sigmoid, written in terms of its output.
        return (1 - y_hat) * y_hat

    def get_SE_row(x, i, y):
        # Squared error of each output unit against the one-hot target.
        return np.array([math.pow(((1 if y[i] == j else 0) - x[j]), 2) for j in range(x.shape[0])])

    def get_pred_row(x, i, y):
        # 1 if the arg-max class equals the label; the last class scores 1 - sum(x).
        return np.array([(1 if np.argmax(np.append(x, 1 - np.sum(x))) == y[i] else 0)])

    iris = datasets.load_iris()
    y = iris.target
    X = apply_row_fn(iris.data, addOne_row)  # features plus a bias column
    n_iter = 50
    n_h1_units = 10
    n_class = 3
    # Weights start uniformly in [-1, 1); n_class - 1 output neurons cover 3 classes.
    W_h1_out = 2 * (np.random.rand(n_h1_units + 1, n_class - 1) - 0.5)
    W_in_h1 = 2 * (np.random.rand(iris.data.shape[1] + 1, n_h1_units) - 0.5)
    eta = 0.02  # learning rate

    for i in range(n_iter):
        # Forward pass: input -> hidden layer (plus bias unit) -> output.
        y_h1 = apply_row_fn(apply_row_fn(weiSum(X, W_in_h1), apply_ele_fn, ele_fn=sigFn), addOne_row)
        y_hat = apply_row_fn(weiSum(y_h1, W_h1_out), apply_ele_fn, ele_fn=sigFn)
        SE = apply_row_fn(y_hat, get_SE_row, y=y)
        ACC = apply_row_fn(y_hat, get_pred_row, y=y)
        print("Iter %s: SE = %s | ACC = %s" % (str(i), str(np.sum(SE)), str(100 * float(np.sum(ACC)) / X.shape[0])))
        # Backward pass: the individual factors of the chain rule.
        dEdy = apply_row_fn(y_hat, get_dEdy_row, y=y)
        dydz = apply_row_fn(y_hat, apply_ele_fn, ele_fn=get_dydz_ele)
        dzdy_h1 = W_h1_out
        dy_h1dz_h1 = apply_row_fn(y_h1, apply_ele_fn, ele_fn=get_dydz_ele)
        dz_h1dw = X
        dz_dw_h1 = y_h1
        dEdz = dEdy * dydz
        # Gradient descent: step against the gradient for both weight layers.
        W_in_h1 = W_in_h1 - eta * apply_row_fn(np.matmul(np.transpose(dz_h1dw), np.matmul(dEdz, np.transpose(dzdy_h1)) * dy_h1dz_h1), remOne_row)
        W_h1_out = W_h1_out - eta * np.matmul(np.transpose(dz_dw_h1), dEdz)

    # Sample output (excerpt from a longer run):
    # Iter 156: SE = 64.5131992554 | ACC = 79.3333333333
    # Iter 157: SE = 64.5568255983 | ACC = 80.6666666667
    # Iter 158: SE = 64.5982885577 | ACC = 82.0
    # Iter 159: SE = 64.6376604297 | ACC = 82.0
    # Iter 160: SE = 64.6750148136 | ACC = 82.6666666667
    # Iter 161: SE = 64.7104259167 | ACC = 81.3333333333
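
The script above applies the Sigmoid one element at a time, which keeps every step explicit. As a side note, NumPy can apply the same function to a whole matrix at once. The sketch below checks that both give the same numbers; `sigmoid_vec` is a hypothetical name and the random matrix `Z` is made-up data, neither is part of the script.

```python
import math
import numpy as np

def sigFn(z):
    # Scalar Sigmoid, as in the script.
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_vec(Z):
    # Hypothetical vectorized counterpart: works on a whole array at once.
    return 1.0 / (1.0 + np.exp(-Z))

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 3))  # made-up weighted sums: 6 data points, 3 units

loop_result = np.array([[sigFn(v) for v in row] for row in Z])
vec_result = sigmoid_vec(Z)
print(np.allclose(loop_result, vec_result))  # True
```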

### Background (Skip this if one has some background on Neural Networks)

Let’s begin with the basics: the Neural Network. A Neural Network without a hidden layer can be thought of as Linear Regression, i.e. a weighted sum. Adding the Sigmoid function turns it into Logistic Regression.

Weighted sum,

$$z_i = \sum_j w_j x_{ij}$$

where $i$ is the index of the data point, $j$ is the index of the feature and $z_i$ is the predicted output.
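
As a small illustration of the weighted sum (one weight per feature, summed over the features of each data point), here is a sketch with made-up numbers. The names `X_demo`, `w_demo` and `z_demo` are hypothetical.

```python
import numpy as np

# Made-up data: 3 data points (rows) with 2 features each (columns).
X_demo = np.array([[1.0, 2.0],
                   [0.0, 1.0],
                   [3.0, -1.0]])
# Made-up weights, one per feature.
w_demo = np.array([0.5, -1.0])

# Weighted sum per data point: each row of X_demo dotted with w_demo.
z_demo = np.matmul(X_demo, w_demo)
print(z_demo)
```

The first entry, for instance, is 0.5 * 1.0 + (-1.0) * 2.0 = -1.5.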

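With Sigmoid function,

$$\hat{y}_i = \frac{1}{1 + e^{-z_i}}$$

Now $\hat{y}_i$ has become the predicted output instead of the weighted sum $z_i$. “How do we find the solution?” you ask. We can use Least Squares or Gradient Descent. With Gradient Descent, each weight $w_j$ is moved against the gradient of the error $E$, with learning rate $\eta$,

$$w_j \leftarrow w_j - \eta \frac{\partial E}{\partial w_j}$$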
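
To make Gradient Descent concrete, here is a minimal sketch of “Logistic Regression” trained on a made-up one-feature problem. It assumes the error is half the sum of squared differences between label and output; the data, the learning rate of 0.5 and the 200 iterations are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    # Sigmoid: squashes any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Made-up problem: one feature plus a constant 1 for the bias.
X_toy = np.array([[-2.0, 1.0], [-1.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
t_toy = np.array([0.0, 0.0, 1.0, 1.0])  # ground-truth labels

w = np.zeros(2)  # initial weights
eta = 0.5        # learning rate (assumed value)
errors = []
for _ in range(200):
    y_hat = sigmoid(X_toy @ w)
    errors.append(0.5 * np.sum((t_toy - y_hat) ** 2))
    # Chain rule: dE/dw = dE/dy_hat * dy_hat/dz * dz/dw.
    grad = X_toy.T @ (-(t_toy - y_hat) * y_hat * (1.0 - y_hat))
    w = w - eta * grad  # step against the gradient

print(errors[0], errors[-1])
```

The error at the last iteration should be well below the starting value of 0.5, and thresholding the outputs at 0.5 separates the two classes.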

Generalizing this,

$$\hat{y} = f(z)$$

where $z$ is the weighted sum of features and $f$ is the identity for the plain weighted sum, or the Sigmoid function in the case of “Logistic Regression”. $f$ is called a logic gate or activation function, $z$ is the “input” of the Neuron and $\hat{y}$ is its “output”. This “Regression” is also called a Perceptron or Neuron. For simplicity, when using it as a classifier, a threshold can be set such that an output above the threshold belongs to one of the 2 possible classes and an output below it belongs to the other. 1 Neuron can classify 2 classes, whereas 2 Neurons can classify 3 classes. This is called the *1 vs All* logical structure.
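
Here is a small sketch of how neuron outputs can be turned into class labels. `classify_binary` and `decode_class` are hypothetical helpers; the second one mirrors `get_pred_row` from the script, where the last class gets the leftover score 1 minus the sum of the neuron outputs. The threshold of 0.5 is an assumption.

```python
import numpy as np

def classify_binary(y_hat, threshold=0.5):
    # 1 Neuron, 2 classes: above the threshold is class 1, otherwise class 0.
    return 1 if y_hat > threshold else 0

def decode_class(y_hat_row):
    # 2 Neurons, 3 classes: the third class scores whatever is left over.
    scores = np.append(y_hat_row, 1.0 - np.sum(y_hat_row))
    return int(np.argmax(scores))

print(classify_binary(0.8))                # 1
print(decode_class(np.array([0.8, 0.1])))  # 0: Neuron 1 wins
print(decode_class(np.array([0.2, 0.7])))  # 1: Neuron 2 wins
print(decode_class(np.array([0.1, 0.2])))  # 2: neither Neuron fires
```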

Hidden layers are introduced to allow the “system” to perform more complicated work. For example, when we were young, we first learned the alphabet, and then we were taught how to make words from the letters. After that, a “hidden layer” is added so that we know how to make sentences, and so on. One might ask, “Why don’t we just learn one mapping that allows us to write an essay directly?” Compare the number of ways to combine letters into an essay with the number of ways to combine words into one: there are far fewer combinations with words, hence it is faster and easier to learn. Likewise, learning the alphabet is easier than learning words. In this context, the number of Neurons per hidden layer need not be the same. The layer is called hidden because the learned weights might not make any sense to us; it acts as capacity for storing knowledge.

When there are more layers, the learning capacity increases, hence the buzzword “Deep Learning”. Don’t be intimidated by it! One could think of it as connecting several batteries in series to increase the voltage. And guess what: Deep Learning is not limited to Neural Networks; it could be a serial combination of any other Machine Learning (ML) algorithms, even a different one at every level. However, for the sake of simplicity, let’s limit it to Neural Networks.
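
The “batteries in series” picture can be sketched as a loop that feeds the output of each layer into the next one. This is only an illustration: `forward_deep` and the layer sizes are hypothetical, the weights are random rather than trained, and bias inputs are left out to keep it short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_deep(X, weights):
    # Chain the layers: the output of one layer is the input of the next.
    out = X
    for W in weights:
        out = sigmoid(out @ W)
    return out

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(4, 5))  # 4 made-up data points, 5 features
layer_sizes = [5, 8, 6, 2]        # input, two hidden layers, output
weights = [rng.uniform(-1, 1, size=(a, b))
           for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]

y_out = forward_deep(X_demo, weights)
print(y_out.shape)  # (4, 2)
```

Note that the hidden layers here have different sizes (8 and 6 Neurons), which is perfectly fine as long as consecutive weight matrices fit together.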

What is left now is how to train a Neural Network.

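For a classification problem, assuming that there are only 3 possible classes and 0 hidden layers, 2 neurons are needed. The error to minimize is

$$E = \frac{1}{2} \sum_i \sum_{k=1}^{2} (t_{ik} - \hat{y}_{ik})^2$$

where $\hat{y}_{ik}$ is the output of Neuron $k$ for data point $i$ and $t_{ik}$ is the ground truth, i.e. $t_{i1} = 1$ if data point $i$ belongs to class 1 and $0$ otherwise.

So the learning of the weight $w_{11}$ between feature 1, $x_{i1}$, and the input for Neuron 1, $z_{i1}$, with learning rate $\eta$, would be

$$w_{11} \leftarrow w_{11} - \eta \frac{\partial E}{\partial w_{11}}, \qquad \frac{\partial E}{\partial w_{11}} = \sum_i \frac{\partial E}{\partial \hat{y}_{i1}} \frac{\partial \hat{y}_{i1}}{\partial z_{i1}} \frac{\partial z_{i1}}{\partial w_{11}} = -\sum_i (t_{i1} - \hat{y}_{i1}) \, \hat{y}_{i1} (1 - \hat{y}_{i1}) \, x_{i1}$$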
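
One way to convince oneself that the chain rule was applied correctly is to compare it with a finite-difference estimate. The sketch below does this for the weight between feature 1 and Neuron 1, assuming the error is half the sum of squared differences between ground truth and output. The data, targets and weights are made up, and `error` is a hypothetical helper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def error(W, X, T):
    # Assumed error: half the sum of squared differences.
    return 0.5 * np.sum((T - sigmoid(X @ W)) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # 5 made-up data points, 3 features
# Targets for classes 1, 2, 3, 1, 2: the third class is "neither Neuron fires".
T = np.array([[1, 0], [0, 1], [0, 0], [1, 0], [0, 1]], dtype=float)
W = rng.normal(size=(3, 2))  # one column of weights per Neuron

y_hat = sigmoid(X @ W)
# Chain rule: dE/dy_hat * dy_hat/dz * dz/dw, summed over the data points.
grad = X.T @ (-(T - y_hat) * y_hat * (1.0 - y_hat))

# Central finite difference for the weight between feature 1 and Neuron 1.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
numeric = (error(W_plus, X, T) - error(W_minus, X, T)) / (2 * eps)

print(grad[0, 0], numeric)  # the two numbers should agree closely
```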

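A Neural Network with a single hidden layer is also known as a Multi Layer Perceptron (MLP). Assuming that there are 3 classes and 1 hidden layer with $n_{h1}$ neurons, there are 2 layers of weights to learn: 1. between the hidden layer and the output layer, 2. between the raw input and the hidden layer. Below, $z^{h1}_{ih}$ and $y^{h1}_{ih}$ are the input and output of hidden neuron $h$ for data point $i$, and $z_{ik}$ and $\hat{y}_{ik}$ are the input and output of output Neuron $k$.

1. Between hidden layer and output layer, with $w^{out}_{hk}$ the weight from hidden neuron $h$ to output Neuron $k$

$$\frac{\partial E}{\partial w^{out}_{hk}} = \sum_i \frac{\partial E}{\partial \hat{y}_{ik}} \frac{\partial \hat{y}_{ik}}{\partial z_{ik}} \frac{\partial z_{ik}}{\partial w^{out}_{hk}}$$

2. Between input layer and hidden layer, with $w^{h1}_{jh}$ the weight from feature $j$ to hidden neuron $h$

$$\frac{\partial E}{\partial w^{h1}_{jh}} = \sum_i \sum_k \frac{\partial E}{\partial \hat{y}_{ik}} \frac{\partial \hat{y}_{ik}}{\partial z_{ik}} \frac{\partial z_{ik}}{\partial y^{h1}_{ih}} \frac{\partial y^{h1}_{ih}}{\partial z^{h1}_{ih}} \frac{\partial z^{h1}_{ih}}{\partial w^{h1}_{jh}}$$

where

$$\frac{\partial z_{ik}}{\partial w^{out}_{hk}} = y^{h1}_{ih}, \qquad \frac{\partial z_{ik}}{\partial y^{h1}_{ih}} = w^{out}_{hk}, \qquad \frac{\partial y^{h1}_{ih}}{\partial z^{h1}_{ih}} = y^{h1}_{ih} (1 - y^{h1}_{ih}), \qquad \frac{\partial z^{h1}_{ih}}{\partial w^{h1}_{jh}} = x_{ij}$$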
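
The two gradients can be sketched in vectorized NumPy and checked against finite differences. This is an illustration under assumptions, not the script from the top of the post: the error is taken as half the sum of squared differences, the data and weights are random, and `add_bias`, `forward`, `error`, `gradients` and `numeric_grad` are hypothetical helpers. The weight names follow the script.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_bias(A):
    # Append a column of ones so the next layer gets a bias input.
    return np.hstack([A, np.ones((A.shape[0], 1))])

def forward(X, W_in_h1, W_h1_out):
    y_h1 = add_bias(sigmoid(X @ W_in_h1))  # hidden outputs plus bias unit
    y_hat = sigmoid(y_h1 @ W_h1_out)       # network outputs
    return y_h1, y_hat

def error(X, T, W_in_h1, W_h1_out):
    # Assumed error: half the sum of squared differences.
    return 0.5 * np.sum((T - forward(X, W_in_h1, W_h1_out)[1]) ** 2)

def gradients(X, T, W_in_h1, W_h1_out):
    y_h1, y_hat = forward(X, W_in_h1, W_h1_out)
    dEdz = -(T - y_hat) * y_hat * (1.0 - y_hat)          # dE/dy_hat * dy_hat/dz
    grad_out = y_h1.T @ dEdz                             # 1. hidden -> output
    dEdz_h1 = (dEdz @ W_h1_out.T) * y_h1 * (1.0 - y_h1)  # back through the hidden layer
    grad_in = X.T @ dEdz_h1[:, :-1]                      # 2. input -> hidden (bias unit dropped)
    return grad_in, grad_out

rng = np.random.default_rng(0)
X = add_bias(rng.normal(size=(6, 4)))             # 6 made-up points, 4 features + bias
T = np.eye(3)[rng.integers(0, 3, size=6)][:, :2]  # one-hot targets, last class implicit
W_in_h1 = rng.uniform(-1, 1, size=(5, 3))         # 3 hidden neurons
W_h1_out = rng.uniform(-1, 1, size=(4, 2))        # 2 output Neurons

grad_in, grad_out = gradients(X, T, W_in_h1, W_h1_out)

def numeric_grad(which, idx, eps=1e-6):
    # Central finite difference of the error with respect to one weight.
    Ws = [W_in_h1.copy(), W_h1_out.copy()]
    Ws[which][idx] += eps
    e_plus = error(X, T, *Ws)
    Ws[which][idx] -= 2 * eps
    e_minus = error(X, T, *Ws)
    return (e_plus - e_minus) / (2 * eps)

print(grad_in[0, 0], numeric_grad(0, (0, 0)))
print(grad_out[0, 0], numeric_grad(1, (0, 0)))
```

Each printed pair should agree closely. A training step then subtracts the learning rate times `grad_in` and `grad_out` from the respective weight matrices.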

to be continued… (RNN and LSTM)
