Skip to main content

4. Understand how a Neural Network works

4. Understand how a Neural Network works


4.1. Graphical Understanding


In order to start in this world, it is important first to have a visual perception of what we are going to deal with, and understand the basics of how Neural Networks work, in the raw form.





A Neural Network (NN) is nothing else than a net of perceptrons that are linked so that input fires another network that produces an output. Of course, a NN has assigned values that allow to make further calculations and learning.


 


So far the two images above represent a simple NN with 1 input, 1 output and 1 hidden layer. The hidden layer is said to be dense (each neuron in a layer x is connected to all neurons in the layer x-1 and all the neurons in the layer x+1).

Depending on how the neurons are organized and how the connections are made, we can find many different types of NN:

 


Types of NN (image source and explanation)




4.2. Simple Neural Network (1 hidden layer)


Before getting to the “real” application of NN with Keras, I think it is important to conceptualize the idea of a NN  with a raw code. We will start with a simple NN with 1 input, 1 output and 1 hidden layer.

Here we can find the code we have used and its corresponding explanation:

import numpy as np






# sigmoid function
def nonlin(x,deriv=False):
   if(deriv==True):
       return x*(1-x)
   return 1/(1+np.exp(-x))




# input dataset
X = np.array([  [0,0,1],
               [0,1,1],
               [1,0,1],
               [1,1,1] ])



# output dataset   
y = np.array([[0,0,1,1]]).T



# seed random numbers to make calculation
# deterministic (just a good practice)
np.random.seed(1)




# initialize weights randomly with mean 0 : Synapse 0
syn0 = 2*np.random.random((3,1)) - 1



# QUESTION




for iter in range(10000):
   # forward propagation
   l0 = X
   l1 = nonlin(np.dot(l0,syn0))




   # how much did we miss?
   l1_error = y - l1



   # multiply how much we missed by the
   # slope of the sigmoid at the values in l1
   l1_delta = l1_error * nonlin(l1,True)



   # update weights
   syn0 += np.dot(l0.T,l1_delta)


print ("Output After Training:")

print (l1)




Imports numpy (linear algebra library), our only dependency.


Activation of a neuron (sigmoid function, nonlinearity)
Generates the derivative of a sigmoid (when deriv=True). The sigmoid output can be used to create its derivative. If it’s a variable “out”, then the derivative is out(1-out)


X: Input dataset matrix, each row is a training example
Training data (4 examples, 3 input nodes)


y: Output dataset matrix, each row is a training example
Label of every data (binary classification). Each row (4) is a training example, and each column (1) an output node.

Numbers still randomly distributed but exactly the same way each time we train.

Initialization of the weights (this neural network has 1 layer of weights: 3 input, 1 output). Every link value is stored in syn0.
syn0: 1st layer of weights, connects l0 to l1 —> weight matrix (dimensions 3 by 1 since we have 3 inputs and 1 output: l0 is of size 3 and l1 of size 1)
! We don’t save layers, but “syn” matrices (they store the learning)

Network training code begins here (iterated for optimization):
l0: First Layer of the Network, specified by the input data
l1 (second layer, hidden) is the prediction step. We let the network “try” to predict the output from the input, and we analyze results to perform better each iteration. Two steps: multiplication of l0 by syn0 & passing output through sigmoid function (defined before as nonlin)

l1 had a “guess”, so we compare how it did by subtracting y (true answer) from l1 (guess), and we obtain a vector of positive and negative numbers

nonlin(l1, True) generates the slopes of the lines of the dots represented by l1
Error weighted derivative: multiplies the points of l1 passed through the sigmoid function (nonlin…), by l1_error

Updates weights by computing the update for each weight for each training example and summing them.

Prints result


Output After Training:
[[0.00966449]
[0.00786506]
[0.99358898]
[0.99211957]]
The code:
Trains the neural network
Passes every data
Evaluates the error
Updates the weights using back propagation



Question: Why does it multiply by 2 and subtract 1?
Answer: The reason why when defining syn0 = 2*np.random.random((3,1)) - 1 we multiply it by  2 and then we subtract 1 is because np.random.random generates a distribution function between 0 and 1 of values with the characteristics given by (3, 1). Since we want it to be mean zero distribution; we multiply it by two so that it’ll be from 0 to 2 (mean 1) and when subtracting the 1 we finally obtain the desired mean 0.

This could be much easily obtained by calculating np.random.randn since this function produces a Gaussian Distribution, thus automatically centered in zero.


4.3. Simple Neural Network (1 hidden layer, ReLU activation function)


Now we will do the same but applying a ReLU Activation Function:

import numpy as np

print("Enter the two values for input layers")

print('a = ')
a = int(input())
# 2
print('b = ')
b = int(input())

weights = {
   'node_0': np.array([2, 4]),
   'node_1': np.array([[4, -5]]),
   'output_node': np.array([2, 7])
}

input_data = np.array([a, b])


def relu(input):
   # Rectified Linear Activation
   output = max(input, 0)
   return(output)


node_0_input = (input_data * weights['node_0']).sum()
node_0_output = relu(node_0_input)

node_1_input = (input_data * weights['node_1']).sum()
node_1_output = relu(node_1_input)

hidden_layer_outputs = np.array([node_0_output, node_1_output])

model_output = (hidden_layer_outputs * weights['output_node']).sum()

print(model_output)
Enter the two values for input layers
a  =
2
b =
3
32

Question: Why do we apply an Activation Function?
Answer: An activation function by definition is a “node attached to the output end of any NN or in between two NN; in order to determine the output of such NN (yes/no). It maps the resulting values in between 0 to 1 or -1 to 1 etc. (depending upon the function).”  url
It is important to include it, otherwise we would not receive the output from the neurons, and the process would be pointless.

Question: Why are we choosing ReLU?
Answer: As far as choosing the “correct” Activation Function (there is not correct one, since it would depend on the characteristics of our problem and the kind of output we want to come up with), you can find more information about each type of AF  in the Glossary of Terms.
In regards to ReLU, it is commonly used for linear regression models (we should not used it for a problem in which we want probabilities in the output, nor when dealing with negative input).

Resultado de imagen para relu activation function
Source: Sarkar, K. "ReLU: Not a differentiable function"


The range of the function is 0 to infinity, and it follows a constant slope for values over zero, but maps to zero any negative value.


4.4. Simple Neural Network (2 hidden layers)


Now that we understand how a simple NN of 1 hidden layer works (or at least we think so), understanding how a simple NN of 2 hidden layers should be easy. Here we can find the code we have used and its corresponding explanation:

import numpy as np




# sigmoid function
def nonlin(x, deriv=False):
   if (deriv == True):
       return (x * (1 - x))
   return 1 / (1 + np.exp(-x))




# input dataset
X = np.array([  [1,1,1],
               [3,3,3],
               [2,2,2],
               [2,2,2] ])




# output dataset
y = np.array([[1],
             [1],
             [0],
             [1]])





# seed random numbers to make calculation
# deterministic (just a good practice)
np.random.seed(1)




# randomly initialize our weights 
#with mean 0

syn0 = 2*np.random.random((3,4)) - 1

syn1 = 2*np.random.random((4,1)) - 1





for j in range(60000):


# Feed forward through layers 0, 1, and 2
   l0 = X
   l1 = nonlin(np.dot(l0,syn0))
   l2 = nonlin(np.dot(l1,syn1))




   # how much did we miss the target value?
   l2_error = y - l2

  


   if (j% 10000) == 0:

       print ('Error:' + str(np.mean(np.abs(l2_error))))


      
   # in what direction is the target value?
   # were we really sure? if so,
# don't change too much.
   l2_delta = l2_error*nonlin(l2,deriv=True)



   # how much did each l1 value contribute to the l2 error (according to the weights)?
   l1_error = l2_delta.dot(syn1.T)
   

   # in what direction is the target l1?

   # were we really sure? if so, 
#don't change too much.

   l1_delta = l1_error * nonlin(l1,deriv=True)



   syn1 += l1.T.dot(l2_delta)
   syn0 += l0.T.dot(l1_delta)
   
print ('Output after training')
print (l2)
Imports numpy (linear algebra library), our only dependency.

Activation of a neuron (sigmoid function, nonlinearity)
Generates the derivative of a sigmoid (when derive=True). The sigmoid output can be used to create its derivative. If it’s a variable “out”, then the derivative is out(1-out)

X: Input dataset matrix, each row is a training example
Training data (4 examples, 3 input nodes)


y: Output dataset matrix, each row is a training example
Label of every data (binary classification). Each row (4) is a training example, and each column (1) an output node.
Same as writing: y = np.array([[1,1,0,1]]).T

Numbers still randomly distributed but exactly the same way each time we train.

Initialization of the weights (this neural network has 2 layers of weights: 3 input neurons, 4 hidden, 1 output). The value of every link is stored in syn0 and syn1.
syn0: 1st layer of weights, Synapse 0, connects l0 to l1 —> weight matrix (dimensions 3 by 4 since we have 3 inputs and 4 outputs: l0 is of size 3 and l1 of size 4)
syn1: 2nd layer of weights, Synapse 1, connects l1 to l2.
! We don’t save layers, but “syn” matrices (they store the learning)
Network training code begins here (iterated for optimization):
l0: First Layer of the Network, specified by the input data
l1 (2nd layer, hidden) and l2 (3rs layer, hidden) are the prediction step. We let the network “try” to predict the output from the input, and we analyze results to perform better each iteration.
Two steps: multiplication of l0 by syn0 & passing output through sigmoid function.

Compare how guess (l2) against real output value.

The % is calculating the remainder between j and 1000, so that if it is zero, an error will be printed. This is done since we are iterating for j in range(6000) so we can get 6 recordings of the error computed (just for our information). Notice it must decrease, otherwise we are doing something wrong!!!

Error weighted derivative

nonlin(l2, True) generates the slopes of the lines of the dots represented by l2
Uses the “confidence weighted error” from l2 to establish an error for l1 (by sending the error across the weights from l2 to l1). We learn how much each node value in l1 “contributed” to the error in l2 (backpropagating step)

Then syn0 update.
Error:0.49641003190272537
Error:0.008584525653247153
Error:0.005789459862507809
Error:0.004629176776769984
Error:0.003958765280273649
Error:0.003510122567861676
Output after training
[[0.00260572]
[0.99672209]
[0.99701711]
[0.00386759]]
The code:
Trains the neural network
Passes every data
Evaluates the error
Updates the weights using back propagation


4.5. Predicting Numbers from zero to three (simple NN with 2 hidden layers)


Awesome, now it is time to apply this concept to an easy example.

import numpy as np

zero = [ 0, 1, 1, 0,  1, 0, 0, 1, 1, 0, 0, 1,  1, 0, 0, 1, 0, 1, 1, 0 ]
one = [  0, 0, 1, 0,  0, 0, 1, 0, 0, 0, 1, 0,   0, 0, 1, 0, 0, 0, 1, 0 ]
two = [  0, 1, 1, 0,   1, 0, 0, 1, 0, 0, 1, 0,  0, 1, 0, 0, 1, 1, 1, 1 ]
three = [  1, 1, 1, 1,  0, 0, 0, 1, 0, 1, 1, 1,  0, 0, 0, 1, 1, 1, 1, 1]

predict = [ 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1 ]

X = np.array((zero, one, two, three), dtype=float)
y = np.array(([0], [1], [2], [3]), dtype=float)
xPredicted = np.array((predict), dtype=float)

# scale units
y = y/3 # max test score is 100

class Neural_Network(object):
# QUESTION: Difference between class and def?
 def __init__(self):
# QUESTION: Why is “self” used here?
   #parameters
   self.inputSize = 20
   self.outputSize = 1
   self.hiddenSize = 20

   #weights
   self.W1 = np.random.randn(self.inputSize, self.hiddenSize)
   self.W2 = np.random.randn(self.hiddenSize, self.outputSize)

 def forward(self, X):
   #forward propagation through our network
   self.z = np.dot(X, self.W1) # dot product of X (input) and first set of 3x2 weights
   self.z2 = self.sigmoid(self.z) # activation function
   self.z3 = np.dot(self.z2, self.W2) # dot product of hidden layer (z2) and second set of 3x1 weights
   o = self.sigmoid(self.z3) # final activation function
   return o

 def sigmoid(self, s):
   # activation function
   return 1/(1+np.exp(-s))

 def sigmoidPrime(self, s):
   #derivative of sigmoid
   return s * (1 - s)

 def backward(self, X, y, o):
   # backward propgate through the network
   self.o_error = y - o # error in output
   self.o_delta = self.o_error*self.sigmoidPrime(o) # applying derivative of sigmoid to error

   self.z2_error = self.o_delta.dot(self.W2.T) # z2 error: how much our hidden layer weights contributed to output error
   self.z2_delta = self.z2_error*self.sigmoidPrime(self.z2) # applying derivative of sigmoid to z2 error

   self.W1 += X.T.dot(self.z2_delta) # adjusting first set (input --> hidden) weights
   self.W2 += self.z2.T.dot(self.o_delta) # adjusting second set (hidden --> output) weights

 def train(self, X, y):
   o = self.forward(X)
   self.backward(X, y, o)

 def predict(self):
   print ('Predicted data based on trained weights: ')
   print ('Input (scaled): \n' + str(xPredicted))
   print ('Actual Output: \n' + str((self.forward(xPredicted))*3))
   print ('Rounded Output: \n' + str(round((self.forward(xPredicted))*3)))

NN = Neural_Network()
for i in range(100): # trains the NN 10,000 times
 print ('#' + str(i) + '\n')
 #print "Input: \n" + str(X)
 #print "Actual Output: \n" + str(y*3)
 print ('Predicted Output: \n' + str(NN.forward(X)*3))
 print ('Loss: \n' + str(np.mean(np.square(y - NN.forward(X))))) # mean sum squared loss
 NN.train(X, y)


Question: What is the difference between “class” and “ref”?
Answer: We should get familiarized with these concepts (definitions for each concept are easily found by looking up  “<CONCEPT> keras”, or in Python documentation)

class is used to define a set or category of things having some property or attribute in common and differentiated from others by kind, type, or quality. It is a template from which we can instantiate objects. We can technically say that a class is a  blue print for individual objects with exact behaviour.

def is used to define a function or a method, one of the instances of the class. A method is like a function that belongs to a class.

# function
def double(x):
   return x * 2

# class
class MyClass(object):
   # method
   def myMethod(self):
       print "Hello, World"

myObject = MyClass()
myObject.myMethod()  # will print "Hello, World”



Question: What is “self” and how do we use it?
Answer: The self represents the instance of the class (defined priorly). By using the “self” keyword, we can access the attributes and methods of such class in Python.
At this point, we should also learn what __init__ is. It represents a reserved method in Python classes, used as a constructor in object oriented concepts. When an object is created from the class, we can call the __init__ method in order to allow the class to initialize the attributes.

Let’s understand the way __init__ works with an example.

Imagine we are creating a game in which we need a motorbike. The attributes might be “brand”, “color”, “power”, “speed_limit”, “year”... and the methods “start”, “stop”, “accelerate”, “break”, “change_gear”... etc. The code would look like this


class Motorbike(object):
def __init__(self, brand, color, year):
self.brand = brand
self.color = color
self.year = year

def start(self):
print("started")

def stop(self):
print("stopped")

def accelarate(self):
print("accelarating...")
"accelarator functionality here"

def change_gear(self, gear_type):
print("gear changed")
" gear related functionality here"

And we can create two different kind of motorbikes objects, but with same class:


gsx = Motorbike("Suzuki", "Grey", 2006)
monster = Motorbike("Ducatti", "Red", 2015)

The arguments  "Suzuki", "Grey", "Red", 2015…  will then pass to "__init__" method to initialize the object.
The keyword "self"  therefore represents the instance of the class, and it binds the attributes with the given arguments so that we can access the methods and attributes.

If the concept is not clear yet or you are curious to learn more about the self variable, these are some useful links I found:

4.6. Deep Neural Networks

 

We have already seen these. They are any “simple” NN with more than one hidden layer.

 



Image source

 


4.7. Convolutional Neural Networks (CNN)

 

It is probably the most used type of NN. It is a class of deep, feed-forward ANN. The arrangement of the neurons is such that the preprocessing is minimized, so the network can learn the filters that in traditional algorithms were hand-engineered. The structure resembles the organization of the animal visual cortex (biological inspiration).

 

Some common applications include image and video recognition, recommender systems and natural language processing. Ref



We can obtain a visual comparison between Simple NN, Deep NN and Convolutional NN in the following image:






 

Image Source

 

And this is what an example of CNN parameters would look like:

Image Source

 

The previous link will direct you to a Tutorial on “Implementing Layer-Wise Relevance Propagation”. Be patient! We will start our tutorials later, once we know the required basis.

 

A possible application would be to determine from an image whether it is a car, a truck, a van, a bicycle…   

Image Source

 

And the learned features correspond to the CNN included in the problem, that would provide the necessary calculations and values for the consequent classification.

Image Source



Although this process has been performed since long time ago by Machine Learning algorithms, the advantage of using Deep Learning now is that we can get rid of the feature extraction and ML Algorithm part.

These steps are substituted by a DL Algorithm, which might seem harder to create in the beginning, but once obtained, the potential it has and the speed of computation are really far away from the traditional ML procedure.

 

Image Source

 

In a category classification problem (in this case, numbers), the application of the CNN will lead to a final fully connected layer with a vector encoding our desired output. In the case of the following image, the digit zero.

Img source

 

 

4.8. The magic of Deep Learning


So far this might seem boring or easy, or maybe pointless. In GitHub Deep CNN we can appreciate some of the “magic” that can be done with Deep Learning tools. Even though we are still somewhat far from creating things like that one, we can get to that point!

Comments