Neural Networks: Setting up the Architecture
It is possible to introduce neural networks without appealing to brain analogies. For example, a simple two-layer network might compute class scores as s = W2 max(0, W1 x). Notice that the non-linearity is critical computationally: if we left it out, the two matrices could be collapsed to a single matrix, and the predicted class scores would again be a linear function of the input. The non-linearity is where we get the wiggle. The area of Neural Networks was originally inspired primarily by the goal of modeling biological neural systems, but it has since diverged and become a matter of engineering and of achieving good results in Machine Learning tasks.
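As a concrete sketch (assuming NumPy and randomly initialized weights; the layer sizes here are illustrative), a two-layer score function of the form s = W2 max(0, W1 x) can be written as:

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(3072)        # e.g. a flattened 32x32x3 image
W1 = np.random.randn(100, 3072)  # first layer weights (100 hidden units)
W2 = np.random.randn(10, 100)    # second layer weights (10 class scores)

h = np.maximum(0, W1.dot(x))     # non-linearity: elementwise max(0, .)
s = W2.dot(h)                    # class scores: a vector of 10 numbers

# Without the max(0, .), s would equal (W2.dot(W1)).dot(x): a single
# linear map, so the two matrices would collapse into one.
```

Removing `np.maximum` makes the whole pipeline one matrix multiply, which is exactly the collapse described above.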
Nonetheless, we begin our discussion with a very brief and high-level description of the biological system that a large portion of this area was inspired by. The basic computational unit of the brain is a neuron. The diagram below shows a cartoon drawing of a biological neuron (left) and a common mathematical model (right).
Each neuron receives input signals from its dendrites and produces output signals along its single axon. The axon eventually branches out and connects via synapses to the dendrites of other neurons. In the computational model of a neuron, the signals that travel along the axons (e.g. x0) interact multiplicatively (e.g. w0 * x0) with the dendrites of the other neuron based on the synaptic strength (e.g. the weight w0); these synaptic strengths are learnable and control the influence of one neuron on another.
In the basic model, the dendrites carry the signal to the cell body, where the signals all get summed. If the final sum is above a certain threshold, the neuron can fire, sending a spike along its axon.
In the computational model, we assume that the precise timings of the spikes do not matter and that only the frequency of the firing communicates information. Based on this rate-code interpretation, we model the firing rate of the neuron with an activation function applied to the summed input.
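This rate-code model of a single neuron can be sketched as follows (assuming NumPy; the weights and bias here are illustrative, not learned):

```python
import numpy as np

class Neuron:
    def __init__(self, weights, bias):
        self.weights = np.asarray(weights, dtype=float)
        self.bias = float(bias)

    def forward(self, inputs):
        # Sum of weighted inputs plus bias, passed through a sigmoid
        # activation, which models the neuron's firing rate in (0, 1).
        cell_body_sum = np.dot(self.weights, inputs) + self.bias
        firing_rate = 1.0 / (1.0 + np.exp(-cell_body_sum))
        return firing_rate

neuron = Neuron(weights=[0.5, -1.2, 0.3], bias=0.1)
rate = neuron.forward(np.array([1.0, 0.5, -0.5]))  # a number in (0, 1)
```

Each neuron thus performs a dot product with its weights, adds its bias, and applies its activation function.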
We will go into more detail about different activation functions at the end of this section. It is important to stress, however, that this model of a biological neuron is very coarse. For example, there are many different types of neurons, each with different properties; the dendrites in biological neurons perform complex nonlinear computations; and the exact timing of the output spikes in many systems is known to be important, suggesting that the rate-code approximation may not hold.
Due to all these and many other simplifications, be prepared to hear groaning sounds from anyone with some neuroscience background if you draw analogies between Neural Networks and real brains. See this review (pdf), or more recently this review, if you are interested. With the sigmoid interpretation of a neuron's output as a probability, we can formulate the cross-entropy loss as we saw in the Linear Classification section, and optimizing it leads to a binary Softmax classifier (also known as logistic regression).
Since the sigmoid function is restricted to be between 0 and 1, the predictions of this classifier are based on whether the output of the neuron is greater than 0.5 (equivalently, whether its pre-activation is greater than 0). Alternatively, we could attach a max-margin hinge loss to the output of the neuron and train it to become a binary Support Vector Machine. In other words, a single neuron can be used to implement a binary classifier (e.g. a binary Softmax or binary SVM classifier).
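A minimal sketch of the binary Softmax interpretation (assuming NumPy; the weights below are hypothetical, not trained): the sigmoid output is read as P(y = 1 | x), and we predict class 1 whenever that probability exceeds 0.5, i.e. whenever the pre-activation is greater than 0.

```python
import numpy as np

def neuron_classify(x, w, b):
    score = np.dot(w, x) + b          # pre-activation of the neuron
    p = 1.0 / (1.0 + np.exp(-score))  # sigmoid: P(y = 1 | x)
    return int(p > 0.5)               # equivalently: int(score > 0)

w = np.array([2.0, -1.0])
b = -0.5
pred = neuron_classify(np.array([1.0, 0.5]), w, b)  # score = 1.0, so class 1
```

Swapping the logistic loss for a hinge loss on `score` would instead train this same neuron as a binary SVM.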
Every activation function (or non-linearity) takes a single number and performs a certain fixed mathematical operation on it.
There are several activation functions you may encounter in practice. The sigmoid non-linearity has the mathematical form σ(x) = 1 / (1 + e^(-x)): it squashes a real-valued number into the range between 0 and 1. In particular, large negative numbers become 0 and large positive numbers become 1. The sigmoid function has seen frequent use historically since it has a nice interpretation as the firing rate of a neuron: from not firing at all (0) to fully-saturated firing at an assumed maximum frequency (1). In practice, however, the sigmoid non-linearity has fallen out of favor and is now rarely used, because it has two major drawbacks: it saturates and kills gradients, and its outputs are not zero-centered. The tanh non-linearity is shown in the image above on the right.
It squashes a real-valued number to the range [-1, 1]. Like the sigmoid neuron, its activations saturate, but unlike the sigmoid neuron its output is zero-centered. Therefore, in practice the tanh non-linearity is always preferred to the sigmoid non-linearity.
Also note that the tanh neuron is simply a scaled sigmoid neuron; in particular, the following holds: tanh(x) = 2σ(2x) − 1. The Rectified Linear Unit (ReLU) has become very popular in the last few years. It computes the function f(x) = max(0, x); in other words, the activation is simply thresholded at zero (see image above on the left). There are several pros and cons to using ReLUs: on the plus side, ReLUs greatly accelerate the convergence of stochastic gradient descent compared to sigmoid/tanh and are very cheap to compute; on the minus side, ReLU units can irreversibly "die" during training if a large gradient update knocks them into a regime where they never activate again.
Leaky ReLUs are one attempt to fix the "dying ReLU" problem: instead of being zero when x < 0, a leaky ReLU has a small positive slope (e.g. 0.01) in the negative region. Some people report success with this form of activation function, but the results are not always consistent. The slope in the negative region can also be made into a parameter of each neuron, as seen in PReLU neurons, introduced in Delving Deep into Rectifiers by Kaiming He et al. However, the consistency of the benefit across tasks is presently unclear.
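The activation functions discussed so far can be sketched in a few lines (assuming NumPy; the 0.01 leaky slope is the conventional default, and in a PReLU-style unit `alpha` would be learned per neuron rather than fixed):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                     # zero-centered, squashes to (-1, 1)

def relu(x):
    return np.maximum(0, x)               # thresholds at zero

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)  # small slope in the negative region

# tanh is a scaled sigmoid: tanh(x) = 2*sigmoid(2x) - 1
x = np.linspace(-3, 3, 7)
assert np.allclose(tanh(x), 2 * sigmoid(2 * x) - 1)
```

The assertion at the end checks the scaled-sigmoid identity numerically on a few points.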
One other relatively popular choice is the Maxout neuron, introduced recently by Goodfellow et al., which generalizes the ReLU and its leaky version by computing max(w1ᵀx + b1, w2ᵀx + b2). The Maxout neuron therefore enjoys all the benefits of a ReLU unit (a linear regime of operation, no saturation) and does not have its drawbacks (dying ReLU). However, unlike ReLU neurons it doubles the number of parameters for every single neuron, leading to a high total number of parameters.
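A maxout unit can be sketched as follows (assuming NumPy; the parameter values are illustrative). Note the two full sets of weights, which is where the parameter doubling comes from:

```python
import numpy as np

def maxout(x, w1, b1, w2, b2):
    # Max of two affine functions of the input; ReLU is the special
    # case w1 = 0, b1 = 0.
    return max(np.dot(w1, x) + b1, np.dot(w2, x) + b2)

x = np.array([1.0, -2.0])
out = maxout(x, np.zeros(2), 0.0, np.array([0.5, 0.5]), 0.0)
# With w1 = 0, b1 = 0 this reduces to a ReLU of the second affine map.
```

Setting the first weight set to zero recovers max(0, w2ᵀx + b2), i.e. an ordinary ReLU neuron.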
This concludes our discussion of the most common types of neurons and their activation functions. As a last comment, it is very rare to mix and match different types of neurons in the same network, even though there is no fundamental problem with doing so.
Neural Networks as neurons in graphs. Neural Networks are modeled as collections of neurons that are connected in an acyclic graph. In other words, the outputs of some neurons can become inputs to other neurons.
Cycles are not allowed, since that would imply an infinite loop in the forward pass of a network. Instead of amorphous blobs of connected neurons, Neural Network models are often organized into distinct layers of neurons. For regular neural networks, the most common layer type is the fully-connected layer, in which neurons between two adjacent layers are fully pairwise connected but neurons within a single layer share no connections.
Below are two example Neural Network topologies that use a stack of fully-connected layers. Notice that when we say N-layer neural network, we do not count the input layer. Therefore, a single-layer neural network describes a network with no hidden layers (input directly mapped to output).
In that sense, you can sometimes hear people say that logistic regression or SVMs are simply a special case of single-layer Neural Networks. Many people do not like the analogies between Neural Networks and real brains and prefer to refer to neurons as units. Unlike all other layers in a Neural Network, the output layer neurons most commonly do not have an activation function (or you can think of them as having a linear identity activation function).
This is because the last output layer is usually taken to represent the class scores (e.g. in classification), which are arbitrary real-valued numbers, or some kind of real-valued target (e.g. in regression). The two metrics that people commonly use to measure the size of neural networks are the number of neurons or, more commonly, the number of parameters. Working with the two example networks in the above picture: the first network has 4 + 2 = 6 neurons (not counting the inputs), [3 x 4] + [4 x 2] = 20 weights and 4 + 2 = 6 biases, for a total of 26 learnable parameters; the second has 4 + 4 + 1 = 9 neurons, [3 x 4] + [4 x 4] + [4 x 1] = 32 weights and 4 + 4 + 1 = 9 biases, for a total of 41 learnable parameters. To give you some context, modern Convolutional Networks contain on the order of 100 million parameters and are usually made up of approximately 10-20 layers (hence deep learning).
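Counting parameters for a stack of fully-connected layers is mechanical: each layer contributes fan_in × fan_out weights plus fan_out biases. A sketch (assuming the two pictured networks have layer sizes 3-4-2 and 3-4-4-1):

```python
def count_params(layer_sizes):
    # layer_sizes includes the input layer, e.g. [3, 4, 2]
    weights = sum(n_in * n_out
                  for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])  # one bias per non-input neuron
    return weights + biases

print(count_params([3, 4, 2]))     # 12 + 8 weights + 6 biases = 26
print(count_params([3, 4, 4, 1]))  # 12 + 16 + 4 weights + 9 biases = 41
```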
However, as we will see, the number of effective connections is significantly greater due to parameter sharing. More on this in the Convolutional Neural Networks module. Repeated matrix multiplications interwoven with activation functions. One of the primary reasons that Neural Networks are organized into layers is that this structure makes it very simple and efficient to evaluate Neural Networks using matrix-vector operations. Working with the example three-layer neural network in the diagram above, the input would be a [3x1] vector.
All connection strengths for a layer can be stored in a single matrix. For example, the first hidden layer's weights W1 would be of size [4x3] and the biases b1 would be a [4x1] vector. Every single neuron has its weights in a row of W1, so the matrix-vector multiplication np.dot(W1, x) evaluates the activations of all neurons in that layer. Similarly, W2 would be a [4x4] matrix that stores the connections of the second hidden layer, and W3 a [1x4] matrix for the last output layer. The full forward pass of this 3-layer neural network is then simply three matrix multiplications, interwoven with the application of the activation function.
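Such a forward pass might look like the following (a sketch assuming NumPy, a sigmoid activation, and randomly initialized parameters):

```python
import numpy as np

f = lambda s: 1.0 / (1.0 + np.exp(-s))  # activation function (here: sigmoid)

np.random.seed(0)
x = np.random.randn(3, 1)               # random input vector [3x1]
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)

h1 = f(np.dot(W1, x) + b1)   # first hidden layer activations [4x1]
h2 = f(np.dot(W2, h1) + b2)  # second hidden layer activations [4x1]
out = np.dot(W3, h2) + b3    # output neuron [1x1]; no activation (a score)
```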
In the above code, W1, W2, W3, b1, b2, b3 are the learnable parameters of the network. Notice also that instead of holding a single input column vector, the variable x could hold an entire batch of training data (where each input example would be a column of x), and then all examples would be efficiently evaluated in parallel.
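Concretely, the same three lines handle a minibatch unchanged, because matrix-matrix multiplication applies the layer to every column at once and the bias columns broadcast (a sketch assuming NumPy, sigmoid activations, and random data; N is the batch size):

```python
import numpy as np

f = lambda s: 1.0 / (1.0 + np.exp(-s))  # sigmoid activation

np.random.seed(1)
N = 32                        # batch size: each column of X is one example
X = np.random.randn(3, N)
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)

H1 = f(np.dot(W1, X) + b1)    # [4xN]; b1 broadcasts across the N columns
H2 = f(np.dot(W2, H1) + b2)   # [4xN]
Out = np.dot(W3, H2) + b3     # [1xN]: one output score per example
```

Column i of `Out` is exactly what the single-example forward pass would produce for column i of `X`.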
The forward pass of a fully-connected layer corresponds to one matrix multiplication, followed by a bias offset and an activation function. One way to look at Neural Networks with fully-connected layers is that they define a family of functions that are parameterized by the weights of the network. A natural question that arises is: what is the representational power of this family of functions?
In particular, are there functions that cannot be modeled with a Neural Network? It turns out that Neural Networks with at least one hidden layer are universal approximators.
That is, it can be shown (e.g. see Approximation by Superpositions of Sigmoidal Function from 1989, or this intuitive explanation from Michael Nielsen) that given any continuous function f(x) and some ε > 0, there exists a Neural Network g(x) with one hidden layer (with a reasonable choice of non-linearity, e.g. sigmoid) such that for all x, |f(x) − g(x)| < ε. In other words, the neural network can approximate any continuous function. If one hidden layer suffices to approximate any function, why use more layers and go deeper? The answer is that the fact that a two-layer Neural Network is a universal approximator, while mathematically cute, is a relatively weak and useless statement in practice. Neural Networks work well in practice because they compactly express nice, smooth functions that fit well with the statistical properties of data we encounter in practice, and because they are easy to learn using our optimization algorithms (e.g. gradient descent).
Similarly, the fact that deeper networks with multiple hidden layers can work better than single-hidden-layer networks is an empirical observation, despite the fact that their representational power is equal.
As an aside, in practice it is often the case that 3-layer neural networks will outperform 2-layer nets, but going even deeper (4, 5, or 6 layers) rarely helps much more. This is in stark contrast to Convolutional Networks, where depth has been found to be an extremely important component of a good recognition system (e.g. on the order of 10 learnable layers).
One argument for this observation is that images contain hierarchical structure (e.g. faces are made up of eyes, which are made up of edges, and so on), so several layers of processing make intuitive sense for this data domain. The full story is, of course, much more involved and a topic of much recent research. If you are interested in these topics, we recommend further reading. How do we decide on what architecture to use when faced with a practical problem? Should we use no hidden layers? One? Two? How large should each layer be? First, note that as we increase the size and number of layers in a Neural Network, the capacity of the network increases.
That is, the space of representable functions grows, since the neurons can collaborate to express many different functions. For example, suppose we had a binary classification problem in two dimensions. We could train three separate neural networks, each with one hidden layer of some size, and obtain the following classifiers:
In the diagram above, we can see that Neural Networks with more neurons can express more complicated functions. However, this is both a blessing (since we can learn to classify more complicated data) and a curse (since it is easier to overfit the training data).
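To make the capacity point concrete, here is a minimal sketch (assuming NumPy; the architecture, hyperparameters, and XOR-style toy data are all illustrative) that trains the same 2-layer binary classifier with different hidden-layer sizes:

```python
import numpy as np

def train_two_layer(X, y, hidden, steps=2000, lr=1.0, seed=0):
    """Tiny 2-layer net: ReLU hidden layer, sigmoid output, logistic
    loss, manual backprop. Returns final training accuracy."""
    rng = np.random.RandomState(seed)
    N = X.shape[0]
    W1 = rng.randn(2, hidden) * 0.5
    b1 = np.zeros(hidden)
    W2 = rng.randn(hidden, 1) * 0.5
    b2 = np.zeros(1)
    for _ in range(steps):
        H = np.maximum(0, X.dot(W1) + b1)             # [N x hidden]
        p = 1.0 / (1.0 + np.exp(-(H.dot(W2) + b2)))   # [N x 1] probabilities
        dscore = (p - y[:, None]) / N                 # grad of logistic loss
        dW2 = H.T.dot(dscore); db2 = dscore.sum(0)
        dH = dscore.dot(W2.T) * (H > 0)               # backprop through ReLU
        dW1 = X.T.dot(dH); db1 = dH.sum(0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    H = np.maximum(0, X.dot(W1) + b1)
    p = 1.0 / (1.0 + np.exp(-(H.dot(W2) + b2)))
    return ((p[:, 0] > 0.5) == y).mean()

# XOR-style data: not linearly separable, so capacity matters.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0, 1, 1, 0])
for hidden in (1, 3, 20):
    print(hidden, train_two_layer(X, y, hidden))
```

A single hidden unit cannot separate this data, while larger hidden layers can; on realistic noisy data, the same extra capacity is what makes overfitting possible, which is why regularization (discussed later) is the preferred way to control it rather than shrinking the network.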