
Artificial Neural Networks#

# If the graphviz pictures cause an error:
# (graphviz was removed from requirements.txt again after it caused an error on GitHub when building the Jupyter Book)
#!pip install graphviz


Adaptive basis function models#

\[ f(x) = w_{0} + \sum_{m=1}^{M} w_{m}\phi_{m}(x)\]

Remember ensemble models?

When and Why?#

  • non-linearity

  • many dimensions

When and why?#

Non-linear hypotheses

Given a dataset like this, you could apply logistic regression with many non-linear features, including lots of polynomial terms.

For 100 features, including all second-order polynomial terms gives about 5,000 features, i.e. on the order of \(O(n^2)\) features

\[ g(b_{0}\,+\,b_{1}x_{1}\,+\,b_{2}x_{2}\,+\,b_{3}x_{1}x_{2}\,+\,b_{4}x_{1}^{2}\,+\,b_{5}x_{2}^{2})\]

  • risk of overfitting

  • computationally expensive

  • picking a subset of features can lead to simpler decision boundaries

When and why?#

Many Dimensions

So... let's look at a picture of a cat

1200 x 1200 pixels, downsampled to 50 x 50 pixels

At 50 x 50 pixels: n = 2500 features in greyscale, 7500 in RGB

~3 million quadratic features in greyscale, ~9 million in RGB

The Neuron#

How it started#

The “one learning algorithm” hypothesis: plug in any sensor, and given enough time the brain will learn to deal with it

How it ended#

A logistic unit

show_graph_visualization("../images/neural_networks/nn_2layers.gv")
../_images/f6c8af3a555a6f3a5788fee24616289a5a7a078f2f60bf9eed0c7742721dcf4e.svg

How it ended#

A logistic unit with a bias term

\[\begin{split}x = \begin{bmatrix} x_1\\ x_2 \\ x_3 \end{bmatrix}\end{split}\]
\[\begin{split}w = \begin{bmatrix} w_1\\ w_2 \\ w_3 \end{bmatrix}\end{split}\]
\[bias = b\]

Activation function, for example the sigmoid:

\[ h_w(x) = \frac{1}{1+e^{-(w^{T}x + b)}}\]
show_graph_visualization("../images/neural_networks/nn_2layers_bias.gv")
../_images/d824063d3acd5f656a9a1bf5a084174d88520841f6198d53d6c1f42f4f046a41.svg
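A logistic unit like this one can be sketched in a few lines of NumPy; the input, weight and bias values below are made-up examples, not taken from the notebook.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# made-up example values for a single logistic unit with three inputs
x = np.array([0.5, -1.2, 3.0])   # inputs x1, x2, x3
w = np.array([0.1, 0.4, -0.7])   # weights w1, w2, w3
b = 0.2                          # bias term

h = sigmoid(w @ x + b)           # h_w(x) = sigma(w^T x + b)
print(h)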

A different neuron: the Perceptron#

  • Input and output are numbers

  • Each input is associated with a weight

Example step function:
\( \mathrm{step}(z)=\mathrm{heaviside}(z)=\begin{cases} 0 & z<0\\ 1 & z\ge 0 \end{cases}\)

Rosenblatt, F. (1958), “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain”

Activation Function#

In the first Perceptron a step-function was used.

\[\begin{split} \mathrm{heaviside}\ (z)=\begin{cases} 0 & \text {if} & z<0\\ 1 & \text {if} & z \ge 0 \end{cases}\end{split}\]
\[\begin{split}{sgn}\ (z)=\begin{cases} -1 & \text {if} & z<0\\ 0 & \text {if} & z=0\\ +1 & \text {if} & z > 0 \end{cases}\end{split}\]
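As a small illustration (weights chosen by hand, not part of the original slides), a single perceptron with the heaviside step activation can implement a logical AND:

import numpy as np

def heaviside(z):
    # step activation: 0 for z < 0, 1 for z >= 0
    return np.where(z >= 0, 1, 0)

def perceptron(x, w, b):
    # weighted sum of the inputs followed by the step activation
    return heaviside(w @ x + b)

# hand-picked weights implementing AND(x1, x2)
w_and = np.array([1.0, 1.0])
b_and = -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(np.array(x), w_and, b_and))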

The XOR problem - the end of the Perceptron#


  • Exclusive OR: \(x_1\) or \(x_2\) is 1, but never both

  • Perceptron fails at this simple problem

  • One of the reasons for the first AI Winter (1973)

x1 | x2 | Output
0  | 0  | 0
0  | 1  | 1
1  | 0  | 1
1  | 1  | 0

Logical computations and neurons#

More than one possible outcome - more Perceptrons

  • Here: three possible outcomes

  • Described as a fully connected layer or dense layer

Logical computations and neurons#

The XOR problem - the end of the Perceptron

  • Two lines are needed to separate the classes

    • Two Perceptrons could solve this

  • But this yields two binary outputs

    • Two outputs with three occurring combinations: [00, 10, 11]

Minsky, M. and Papert, S. (1969): “Perceptrons: An Introduction to Computational Geometry”

Logical computations and neurons#

The XOR problem - a multilayer problem

  • Adding a second layer solves the XOR problem

    • Multilayer Perceptron (MLP)

  • Breaks the problem into sub-problems

    • Solves the sub-problems

    • Combines the results (see the hand-wired sketch below)
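To make the sub-problem idea concrete, here is a hand-wired two-layer perceptron (the weights are chosen by hand, not learned): one hidden unit computes OR, the other AND, and the output unit combines them into XOR.

import numpy as np

def step(z):
    return np.where(z >= 0, 1, 0)

def xor_mlp(x1, x2):
    # hidden layer: two sub-problems, solved by hand-chosen weights
    h_or = step(1.0 * x1 + 1.0 * x2 - 0.5)    # OR(x1, x2)
    h_and = step(1.0 * x1 + 1.0 * x2 - 1.5)   # AND(x1, x2)
    # output layer: "OR but not AND" combines the sub-results into XOR
    return step(1.0 * h_or - 1.0 * h_and - 0.5)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_mlp(x1, x2))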

The Neural Network#

Neural network#

show_graph_visualization("../images/neural_networks/nn_3layers.gv")
../_images/2b746c639f815cb38fcd3ff29943d2aa430e6bea92bafbc2370cceec555a2717.svg

Neural network#

layer 0 = input layer

layer 1 = hidden layer (can be more of course)

layer 2 = output layer

show_graph_visualization("../images/neural_networks/nn_3layers_bias.gv")
../_images/c3dc1017c5a0c4310c80138a6136bc10837eb466fe20661d2752c99c0b437fb3.svg

Neural network#

\[\begin{split} \begin{align} \text{The input values: }& x = \begin{bmatrix} x_1\\ x_2 \\ x_3 \end{bmatrix} \\[10px] \text{The tensor of weights: } & W^{(j)}\hspace{1 cm} \\[10px] \text{The matrix of biases: } & b^{(j)}\hspace{1 cm} \\[10px] \text{The output values of each neuron: }& a_i^{(j)} \\[10px] \text{The layer number: }& j \end{align} \end{split}\]
Note: How many dimensions does W for j =1 have?
show_graph_visualization("../images/neural_networks/nn_3layers_bias.gv")
../_images/c3dc1017c5a0c4310c80138a6136bc10837eb466fe20661d2752c99c0b437fb3.svg

Neural network#

\[\begin{split} x = \begin{bmatrix} b^{(1)}\\ x_1\\ x_2 \\ x_3 \end{bmatrix} \hspace{1cm} \ W^{(j)} , a_i^{(j)} \end{split}\]

j is the layer number

How many dimensions does W for j =1 have?

\[\begin{split}\begin{align} a_{1}^{(1)}&=g\Big(W_{10}^{(1)}b^{(1)}\,+\,W_{11}^{(1)}x_{1}\,+\,W_{12}^{(1)}x_{2}\,+\,W_{13}^{(1)}x_{3}\Big)\\ a_{2}^{(1)}&=\,g\Big(W_{20}^{(1)}b^{(1)}\,+\,W_{21}^{(1)}x_{1}\,+\,W_{22}^{(1)}x_{2}\,+\,W_{23}^{(1)}x_{3}\Big)\\ a_{3}^{(1)}&=\,g\Big(W_{30}^{(1)}b^{(1)}\,+\,W_{31}^{(1)}x_{1}\,+\,W_{32}^{(1)}x_{2}\,+\,W_{33}^{(1)}x_{3}\Big)\\ h_W = a_{1}^{(2)}&=\,g\Big(W_{10}^{(2)}b^{(2)}\,+\,W_{11}^{(2)}a_{1}^{(1)}\,+\,W_{12}^{(2)}a_{2}^{(1)}\,+\,W_{13}^{(2)}a_{3}^{(1)}\Big) \end{align}\end{split}\]
show_graph_visualization("../images/neural_networks/nn_3layers_bias.gv")
../_images/c3dc1017c5a0c4310c80138a6136bc10837eb466fe20661d2752c99c0b437fb3.svg

In ‘numpy notation’:

\(\begin{align} W^{(j)}\text{.shape} &= \left(\text{size}_{j+1} , (\text{size}_j + 1)\right)\\ &=(3,4)\\ \end{align}\)
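A minimal NumPy sketch of this forward pass with exactly these shapes; the random weights, the input values and the sigmoid as activation g are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(0)
g = lambda z: 1.0 / (1.0 + np.exp(-z))   # activation function

x = np.array([0.2, -0.4, 0.7])           # the three input values

W1 = rng.normal(size=(3, 4))             # layer 1: 3 units, 3 inputs + bias -> shape (3, 4)
W2 = rng.normal(size=(1, 4))             # layer 2: 1 unit, 3 hidden units + bias -> shape (1, 4)

a0 = np.concatenate(([1.0], x))          # prepend the bias "input"
a1 = g(W1 @ a0)                          # hidden-layer activations a_i^(1), shape (3,)
a1 = np.concatenate(([1.0], a1))         # prepend the bias for the next layer
h = g(W2 @ a1)                           # network output h_W(x) = a_1^(2)
print(W1.shape, W2.shape, h)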

Deep Neural Networks and compact representations#

What Network architecture can solve the Parity function:

\[\begin{split} f_{\mathrm{par}}(x_{1},\dots,x_{D})=\begin{cases} 0 & \text{if } \sum_{j}x_{j} \text{ is odd}\\ 1 & \text{if } \sum_{j}x_{j} \text{ is even} \end{cases}\end{split}\]

Here for D=4:

Neural network with a single hidden layer: 55 neurons required to solve the problem

Deep neural network with D = 4 hidden layers: 16 neurons required to solve the problem (much more compact, as the size grows only linearly with D)

Note: Neural networks with more layers can represent functions in more compact form

Training the network#

Deep learning and parameter search#

How to find the parameters?

  • First intuition: Gradient descent

    • But how to apply the algorithm with many coupled and nested functions?

  • Backpropagation was a breakthrough in research

    • Makes training of very deep networks possible

    • Error is backpropagated through a network (End to Begin)

    • Gradient is calculated step-wise

  • Problem: Gradient of the step function is almost always zero

    • What now?

Werbos, P.J. (1974), “Beyond Regression: New Tools For Prediction And Analysis In The Behavioral Sciences”
Linnainmaa, S. (1970), “The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors”
Rumelhart, D., Hinton, G. and Williams, R. (1986), “Learning Internal Representations by Error Propagation”

Non-linear activation functions#

  • Make Backpropagation possible

    • Gradient at almost any point in space

  • Most common activation functions are

    • Sigmoid \(\hspace{0.5cm} \sigma(z) = \frac{1}{1+\exp{(-z)}}\)

    • Tanh \(\hspace{0.5cm} \tanh(z) = \frac{\exp{(z)} - \exp{(-z)}}{\exp{(z)}+ \exp{(-z)}}\)

    • ReLU (Rectified Linear Unit) \(\hspace{0.5cm} ReLU(z) = \max(0,z)\)
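The three activation functions above, written directly in NumPy (a plain translation of the formulas, nothing framework-specific):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # identical to np.tanh(z)
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

z = np.linspace(-3, 3, 7)
print(sigmoid(z), tanh(z), relu(z), sep="\n")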

Non-linear activation functions#

  • Non-linear activation functions are needed to model more interesting output functions

    • Linear activation functions give linear models

  • Linear activation functions can be used in regression problems

    • The hidden layers should be non-linear nevertheless

  • Hidden layers with linear activation functions and a sigmoid output:

    • This is just logistic regression

Playground

Feed-forward step#

  • In the feed-forward step the data is propagated through the network from input to output

show_graph_visualization("../images/neural_networks/feed-forward-step.gv")
../_images/e36bb2292affbcad6cf034f360099e9943e51a4ff6744b752b09c4d2ea0111e7.svg

Back-propagation#

  • How can we discover in which direction and with which magnitude to tweak the parameters?

    • We do the same as usual: Start at the loss function

  • Binary loss (as in logistic regression):

\[ \mathcal{L}(y,\hat{y})=-\left(y\log\left(\hat{y}\right)+\left(1-y\right)\log\left(1-\hat{y}\right)\right) \hspace{2cm}\text{ where } \hat{y} = a^{[5]}\]
Note: the label \(y\) is either 1 or 0, and \(a^{[5]}\) is the output of the last layer's activation function (here: sigmoid)
show_graph_visualization("../images/neural_networks/back_prop.gv")
../_images/32cff392db2b2e593c34e490f63ee8c4a248eacb21ff6c21eba8a5f373e562af.svg

Feed forward and back propagation#

Back prop - intuition#

Computational graphs#

A directed graph where the nodes correspond to mathematical operations

\(f(x, y, z) = q * z = (x + y)*z\)

Computational graphs - feed forward#

Going from left to right

\(f(x, y, z) = (x + y)*z\), e.g. \(x = 1, y = 3, z = -3\)

Computational graphs - backward propagation#

Objective: compute the gradient of the output with respect to each input, to use in gradient descent

\(f(x, y, z) = (x + y)*z\), e.g. \(x = 1, y = 3, z = -3\)

\[ q\,=\,x\,+\,y\qquad{\frac{\mathrm{d}q}{\mathrm{d}x}}\,=\,1,\;{\frac{\mathrm{d}q}{\mathrm{d}y}}\,=\,1\]
\[ f\,=\,q\,*\,z\qquad{\frac{\mathrm{d}f}{\mathrm{d}q}}\,=\,z,\;{\frac{\mathrm{d}f}{\mathrm{d}z}}\,=\,q\]
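The same toy graph worked out numerically (x = 1, y = 3, z = -3 as above), with the local gradients combined via the chain rule:

# forward pass through the computational graph f = (x + y) * z
x, y, z = 1.0, 3.0, -3.0
q = x + y          # q = 4
f = q * z          # f = -12

# backward pass: local gradients ...
df_dq = z          # df/dq = z = -3
df_dz = q          # df/dz = q = 4
dq_dx = 1.0        # dq/dx = 1
dq_dy = 1.0        # dq/dy = 1

# ... combined by the chain rule
df_dx = df_dq * dq_dx   # = -3
df_dy = df_dq * dq_dy   # = -3
print(f, df_dx, df_dy, df_dz)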

Backward propagation#

Going from right to left

\(f(x, y, z) = (x + y)*z\), e.g. \(x = 1, y = 3, z = -3\)

\[ q\,=\,x\,+\,y\qquad{\frac{\mathrm{d}q}{\mathrm{d}x}}\,=\,1,\;{\frac{\mathrm{d}q}{\mathrm{d}y}}\,=\,1\]
\[ f\,=\,q\,*\,z\qquad{\frac{\mathrm{d}f}{\mathrm{d}q}}\,=\,z,\;{\frac{\mathrm{d}f}{\mathrm{d}z}}\,=\,q\]


Backward propagation - starting at the end#

Applying the chain rule

\(f(x, y, z) = (x + y)*z\), e.g. \(x = 1, y = 3, z = -3\)

\[ q\,=\,x\,+\,y\qquad{\frac{\mathrm{d}q}{\mathrm{d}x}}\,=\,1,\;{\frac{\mathrm{d}q}{\mathrm{d}y}}\,=\,1\]
\[ f\,=\,q\,*\,z\qquad{\frac{\mathrm{d}f}{\mathrm{d}q}}\,=\,z,\;{\frac{\mathrm{d}f}{\mathrm{d}z}}\,=\,q\]

Backward propagation#

Each node is aware of its surroundings, and we know the local gradients, which are easy to compute (each node's operation is usually simple: a sum, a product, exp, ...)

Backward propagation#

Applying the chain rule

Backward propagation#

Gradients add up at branches

Back prop - math#

Back-propagation for binary loss#

Binary Loss (like in Logistic regression):

\[ \mathcal{L}(y,\hat{y})=-\left(y\log\left(\hat{y}\right)+\left(1-y\right)\log\left(1-\hat{y}\right)\right) \hspace{2cm}\text{ where } \hat{y} = a^{[5]}\]
Note: the label \(y\) is either 1 or 0, and \(a^{[5]}\) is the output of the last layer's activation function (here: sigmoid)
show_graph_visualization("../images/neural_networks/back_prop.gv")
../_images/32cff392db2b2e593c34e490f63ee8c4a248eacb21ff6c21eba8a5f373e562af.svg

Back-propagation#

  • To get the derivative of the loss with respect to the parameters:
    • Consider the computation graph
  • Start at the end and differentiate
\[ \frac{\partial\mathcal{L}(y,\hat{y})}{\partial a^{\left[5\right]}}\ = -\frac{y}{a^{\left[5\right]}}\ +\ \frac{1-y}{1-a^{\left[5\right]}}\]

Back-propagation#

  • Then, take the next step backwards in the graph
    • Compute the derivative with respect to the neuron's input \(z^{[5]}\)
\[ \frac{\partial{\mathcal{L}}(y,{\hat{y}})}{\partial z^{[5]}}\,=\,\frac{\partial{\mathcal{L}}(y,{\hat{y}})}{\partial a^{[5]}}\,\circ\,\frac{\partial\sigma(z^{[5]})}{\partial z^{[5]}}\]
show_graph_visualization("../images/neural_networks/back_prop_step_1.gv")
../_images/48812205c7be00455553ad27d916f872e76ad5728f555df0e525c89c9f048262.svg

Back-propagation#

  • To get the derivative with respect to the last layer's parameters
    • Differentiate the neuron with respect to its parameters
show_graph_visualization("../images/neural_networks/back_prop_step_2.gv")
../_images/8f87a16cbf8a2a1bca0bfc1cf99ee68b36aee59b03125ac0d00d2181ea03c810.svg

Back-propagation#

  • Repeat and step back towards the first layer ...
    • ... and its parameters
\[ \frac{\partial\mathcal{L}(y,\hat{y})}{\partial b^{[1]}}\,=\,\frac{\partial \mathcal{L}(y,\hat{y})}{\partial z^{[1]}}\,\circ\,\frac{\partial z^{[1]}}{\partial b^{[1]}}\]
show_graph_visualization("../images/neural_networks/back_prop_all_steps.gv")
../_images/52546c34c4d40e6b7a1090fa25a0023f0ec217437289a8856fbf11bf614ca17e.svg

Back-propagation#

  • Now we have the direction for the parameter update
    • As usual, the learning rate \(\alpha\) gives the step size
  • Make the parameter update (gradient descent, hence the minus sign):
\[\begin{split}\begin{align} W^{[l]} &= W^{[l]}-\alpha \dfrac{\partial\mathcal{L}(y,\hat{y})}{\partial W^{[l]}} \\ b^{[l]} &= b^{[l]}-\alpha \dfrac{\partial\mathcal{L}(y,\hat{y})}{\partial b^{[l]}}\hspace{0.5cm}\text{ for each layer } l \end{align}\end{split}\]
  • Start again with the feed-forward step ...
    • ... and iterate (see the sketch after this list)
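Putting the whole loop together, here is a compact NumPy sketch of feed-forward plus back-propagation for a network with one hidden layer, sigmoid activations and the binary cross-entropy loss; the XOR toy data, the layer sizes and the learning rate are made up for illustration.

import numpy as np

rng = np.random.default_rng(42)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# toy data: XOR, columns are the four training examples
X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]], dtype=float)
y = np.array([[0, 1, 1, 0]], dtype=float)
m = X.shape[1]

# parameters: one hidden layer with 4 units, one sigmoid output unit
W1, b1 = rng.normal(size=(4, 2)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))
alpha = 1.0                                   # learning rate

for epoch in range(10000):
    # feed-forward step
    z1 = W1 @ X + b1;  a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2; a2 = sigmoid(z2)       # y_hat

    # back-propagation (derivatives of the binary cross-entropy loss)
    dz2 = a2 - y                              # dL/dz2 for sigmoid output + cross-entropy
    dW2 = dz2 @ a1.T / m
    db2 = dz2.mean(axis=1, keepdims=True)
    dz1 = (W2.T @ dz2) * a1 * (1 - a1)        # chain rule through the hidden layer
    dW1 = dz1 @ X.T / m
    db1 = dz1.mean(axis=1, keepdims=True)

    # gradient-descent update (note the minus sign)
    W2 -= alpha * dW2;  b2 -= alpha * db2
    W1 -= alpha * dW1;  b1 -= alpha * db1

print(np.round(a2, 2))   # should approach [[0, 1, 1, 0]]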

Feed forward and back propagation#

Training deep neural networks#

  • Backpropagation enabled researchers to train whole neural networks with a single algorithm
    • Before that, layer-wise training had been used
  • However, networks with a large number of layers still had problems
    • Vanishing/exploding gradients

Vanishing/Exploding gradients#

  • First approaches to train deep networks with backpropagation failed
    • Gradients were becoming very small or very large during training
      • Vanishing: direction for parameter update unknown
      • Exploding: training becomes unstable
    • Hochreiter (1991) formalized the problem in his Diploma thesis
      • Since then, the problem has been examined extensively
      • Many approaches have been developed to tackle it

Hochreiter, S. (1991), “Untersuchungen zu dynamischen neuronalen Netzen”
Pascanu, R., Mikolov, T. and Bengio, Y. (2013), “On the difficulty of training Recurrent Neural Networks”

Vanishing/Exploding gradients#

  • Many layers amplify unbalanced parameter values
    • The same holds for the gradients during backward propagation
  • Initialization is crucial
    • Well-balanced parameters help
    • Xavier initialization for tanh
    • He initialization for ReLU
  • Activation functions can have zero or almost-zero gradients
    • Use ReLU, which saturates only in one direction

Activation: linear, so \( \hat{y}=W^{[L]}\cdots W^{[2]}W^{[1]}x\), with e.g.
\( W^{[l]}=\begin{bmatrix}1.5 & 0\\ 0 & 1.5\end{bmatrix}\) in every layer: activations and gradients grow (or shrink) exponentially with the number of layers, as the sketch below illustrates.
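A quick numerical illustration of this effect (linear activations, the same weight matrix in every layer, all values made up):

import numpy as np

W = np.array([[1.5, 0.0],
              [0.0, 1.5]])     # identical weight matrix in every layer
x = np.array([1.0, 1.0])

a = x
for layer in range(1, 21):
    a = W @ a                  # linear activation: the signal grows like 1.5**layer
    if layer % 5 == 0:
        print(layer, a)

# with W = 0.5 * np.eye(2) the same loop shrinks towards zero: a vanishing signal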

Glorot, X., Bordes, A. and Bengio, Y. (2010), “Deep Sparse Rectifier Neural Networks”
Glorot, X. and Bengio Y. (2010), “Understanding the difficulty of training deep feedforward networks”
He, K. et al. (2015), “Delving deep into rectifiers: Surpassing human-level performance in ImageNet classification.”
Sussillo, D. and Abbott, L.F. (2015), “Random Walk Initialization For Training Very Deep Feedforward Networks”
Mishkin, D. and Matas, J. (2016), “All you need is a good init”

Building your own NN#

Hyperparameters#

Hyperparameters#

  • What hyperparameters exist?
    • Learning rate - very important
    • Number of layers - not so much
    • Number of neurons per layer - important
    • Number of features - important
    • Mini-batch size - important
    • Optimization algorithm - important
    • Learning rate decay - not so much
  • How to tune them?
    • Try random values - do not use a grid!


Search Strategies#

  • Hyperparameter search in neural networks is subject to the curse of dimensionality
    • Many parameters to tune
    • It is unknown which ones are important for the problem at hand
  • Usually, an approach is used that combines
    • Random search
    • Coarse-to-fine search
    • Well-scaled sampling (often log-scaled), as sketched below
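A log-scaled random search over the learning rate could be sampled like this; the range 1e-4 to 1e-1 is an arbitrary placeholder, not a recommendation.

import numpy as np

rng = np.random.default_rng(0)

# sample 10 learning rates uniformly on a log scale between 1e-4 and 1e-1
exponents = rng.uniform(low=-4, high=-1, size=10)
learning_rates = 10.0 ** exponents
print(np.sort(learning_rates))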

Regularization#

Regularization#


Remember logistic regression:
  • L2 Regularization:

    \( ||w||_{2}^{2}=\sum_{j=1}^{m}w_{j}^{2}=w^{T}w \)

  • L1 Regularization:

    \( ||w||_{1}=\sum_{j=1}^{m}|w_{j}| \)



Similar in neural networks:
  • L2 Regularization:
    \[||W^{[l]}||_{F}^{2}=\sum_{j=1}^{n_{[l-1]}}\sum_{i=1}^{n_{[l]}}(w_{i j}^{[l]})^{2}\]
  • L1 Regularization:
\[\underbrace{||W^{[l]}||_{1}=\sum_{j=1}^{n_{[l-1]}}\sum_{i=1}^{n_{[l]}}|w_{i j}|^{[l]}}_\text{not used that much}\]

\(\scriptsize J(w,b)=\frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}_{i},y_{i})+\frac{\lambda}{2m}||w||_{2}^{2}+\underbrace{\frac{\lambda}{2m}b^{2}}_\text{negligible impact}\)

\(\scriptsize J(W^{[1]},b^{[1]},\dots,W^{[L]},b^{[L]})=\frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}_{i},y_{i})+\frac{\lambda}{2m}\sum_{l=1}^{L}||W^{[l]}||_{F}^{2}\)
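In code, the Frobenius-norm penalty is simply added on top of the data loss; the weight matrices, lambda and the data-loss value below are placeholders.

import numpy as np

def l2_penalty(weights, lam, m):
    # sum of squared entries (squared Frobenius norm) over all weight matrices
    return lam / (2 * m) * sum(np.sum(W ** 2) for W in weights)

# placeholder values: two layers, m = 5 training examples, lambda = 0.1
weights = [np.ones((3, 4)), np.ones((1, 4))]
data_loss = 0.42                 # stand-in for (1/m) * sum of cross-entropy terms
cost = data_loss + l2_penalty(weights, lam=0.1, m=5)
print(cost)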

The effect of sub-networks#

  • Why does L2 regularization prevent overfitting in neural networks?
    • Intuitively: it creates a simpler network
  • Many parameters are driven towards very small values
    • Signals then flow mainly via a few routes through the network

Extreme L2 Regularization:

\[ W^{[l]}\approx 0 \quad \scriptsize{\text{when the regularization parameter } \lambda \text{ is increased strongly}}\]
\[ J(W^{[l]},b^{[l]})=\frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y},y)+\frac{\lambda}{2m}\sum_{l=1}^{L}||W^{[l]}||_{F}^{2}\]
show_graph_visualization("../images/neural_networks/l2_regularization.gv")
../_images/574779c55984edc4f118c64d489bf63066b2813653fe49f57fc64bb65624ed99.svg

The effect of mid-range activation functions#

  • What role do activation functions play in L2 regularization?
    • Non-linear activation functions possess nearly-linear parts
  • Sufficiently small parameter values enforce small inputs
    • In this input range many activation functions are nearly linear
  • As a result, the network itself is forced to become nearly linear
    • Variance shrinks and bias increases
\[ z^{[l]}= W^{[l]}a^{[l-1]}+b^{[l]}\]
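The nearly-linear regime is easy to check numerically: for small |z| the sigmoid stays close to its tangent at zero, 0.25·z + 0.5.

import numpy as np

z = np.linspace(-0.5, 0.5, 5)
sigmoid = 1.0 / (1.0 + np.exp(-z))
linear_approx = 0.25 * z + 0.5                # tangent of the sigmoid at z = 0
print(np.round(sigmoid - linear_approx, 4))   # the differences stay tiny for small z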

Playground

Regularization techniques in deep learning - Dropout#

  • Dropout is a very powerful regularization method for neural networks
    • Randomly eliminate nodes in each optimization step
  • Dropout prevents units from co-adapting
Hinton et al. (2012), “Improving neural networks by preventing co-adaptation of feature detectors”
Srivastava et al. (2014), “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”

Regularization techniques in deep learning - Intuitive explanation to neural co-adapting#

  • A single neuron relies on the input of its predecessor neurons
    • This bears the risk of relying on a few predecessors that happen to activate on noise
  • Randomly dropping neurons forces each neuron to diversify
    • Any predecessor signal could vanish under dropout (see the sketch below)
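A minimal sketch of (inverted) dropout applied to one layer's activations during training; keep_prob and the activation values are hypothetical.

import numpy as np

rng = np.random.default_rng(0)

def dropout(a, keep_prob=0.8):
    # randomly keep each unit with probability keep_prob ...
    mask = rng.random(a.shape) < keep_prob
    # ... and rescale so the expected activation stays the same (inverted dropout)
    return a * mask / keep_prob

a1 = rng.normal(size=(5, 3))     # activations of a hidden layer (5 units, 3 examples)
print(dropout(a1))               # at test time, dropout is simply switched off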

Regularization techniques in deep learning - Dropout mimics ensembling#

  • Instead of testing many different networks by training each of them in isolation
    • Dropout tests them within a single optimization run
  • Predictions approximate the average over all thinned networks
    • This is similar to ensemble learners
    • In contrast to ensembles, the effect is achieved by using the unthinned network with scaled-down weights
The weights mimic an average over the thinned networks


Regularization techniques in deep learning - Early stopping can be resourceful#

  • Early stopping is used to find the optimal point to stop training
    • Stop when the validation error starts to rise again
  • It builds on the development of the weights during training
    • Weights start small and grow over time
  • It is not the best method to regularize
    • The orthogonalization of optimization and regularization is lost

Wang et al. (1994), “Optimal Stopping and Effective Machine Complexity in Learning”
Caruana et al. (2001), “Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping”

Frameworks#

Python deep learning libraries - Deep learning frameworks#

  • There are a few major deep learning libraries
    • TensorFlow is the one most used in production
  • Since TensorFlow 2.0, Keras is part of TensorFlow
    • Keras is a high-level API to TensorFlow
    • It abstracts away a lot of the coding complexity
  • They ease the building and training of models through
    • modules for layers, optimizers and activation functions
    • parallelized training processes

Python deep learning libraries - TensorFlow and Keras#

  • We focus on TensorFlow and use Keras
    • Well maintained and highly active development
    • Most relevant in business
  • TensorFlow works with a directed acyclic computation graph (DAG)
    • Programmable in Python
    • Executed by a fast C++ backend
    • Tensors flow through the DAG

Abadi et al. (2015), “TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems”
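A minimal Keras example of the kind of fully connected network discussed in this chapter; the layer sizes, activations and optimizer are arbitrary choices for illustration, not a recommendation.

import tensorflow as tf

# a small fully connected network: 2 inputs, one hidden ReLU layer, sigmoid output
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(2,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(X_train, y_train, epochs=10, batch_size=32)   # with your own data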

Python deep learning libraries - The power of TensorFlow#

  • Enables deep learning
    • of very large and deep models
    • on massive data sets
  • Possible via distributed systems with special hardware
    • Subgraphs are placed on different devices
    • Send/Receive nodes communicate across workers
  • Fault tolerance is ensured by
    • monitoring communication between processes
    • re-executing when errors are detected
    • saving checkpoints for economical restarts

Python deep learning libraries - TensorBoard - TensorFlow visualization#

Visualize metrics, distributions and DAG

References#