Backpropagation
Backpropagation is a common method of teaching artificial neural networks how to perform a given task. Arthur E. Bryson and Yu-Chi Ho described it as a multi-stage dynamic system optimization method in 1969. It wasn't until 1974 and later, when applied in the context of neural networks and through the work of Paul Werbos, David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams, that it gained recognition, and it led to a “renaissance” in the field of artificial neural network research.

It is a supervised learning method, and is a generalization of the delta rule. It requires a teacher that knows, or can calculate, the desired output for any input in the training set. It is most useful for feed-forward networks (networks that have no feedback, or simply, that have no connections that loop). The term is an abbreviation for "backward propagation of errors". Backpropagation requires that the activation function used by the artificial neurons (or "nodes") be differentiable.

Summary

For better understanding, the backpropagation learning algorithm can be divided into two phases: propagation and weight update.

Phase 1: Propagation

Each propagation involves the following steps:
  1. Forward propagation of a training pattern's input through the neural network in order to generate the propagation's output activations.
  2. Backward propagation of the output error (the difference between the output activations and the training pattern's target) through the neural network in order to generate the deltas of all output and hidden neurons.

Phase 2: Weight update

For each weight-synapse:
  1. Multiply its output delta and input activation to get the gradient of the weight.
  2. Subtract a fraction of the gradient from the weight, moving the weight in the direction opposite to the gradient.

This fraction influences the speed and quality of learning; it is called the learning rate. The sign of a weight's gradient indicates the direction in which the error increases, which is why the weight must be updated in the opposite direction.

Repeat phases 1 and 2 until the performance of the network is satisfactory.
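
Concretely, the phase 2 update for a single weight can be written as a small function. This is a minimal sketch; the names and the default learning rate are illustrative:

  def update_weight(weight, delta, activation, lr=0.1):
      # delta: error term of the neuron this weight feeds into
      # activation: output of the neuron this weight comes from
      gradient = delta * activation   # gradient of the error with respect to this weight
      return weight - lr * gradient   # step in the direction opposite to the gradient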

Modes of learning

There are two modes of learning to choose from: on-line learning and batch learning. In on-line learning, each propagation is followed immediately by a weight update. In batch learning, many propagations occur before the weights are updated. Batch learning requires more memory, but on-line learning requires more weight updates.
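
The difference is only where the update happens in the training loop, as the following sketch illustrates. The gradient of a plain linear unit stands in for the full backpropagation gradient, and the names are illustrative:

  import numpy as np

  def gradient(w, x, target):
      # Squared-error gradient for a linear unit y = w . x.
      return (np.dot(w, x) - target) * x

  def train_online(w, examples, lr=0.1):
      # On-line mode: the weights move after every single example.
      for x, t in examples:
          w = w - lr * gradient(w, x, t)
      return w

  def train_batch(w, examples, lr=0.1):
      # Batch mode: gradients are accumulated and applied once per pass.
      total = sum(gradient(w, x, t) for x, t in examples)
      return w - lr * total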

Algorithm

Actual algorithm for a 3-layer network (only one hidden layer):

Initialize the weights in the network (often randomly)
Do
  For each example e in the training set
    O = neural-net-output(network, e)  ; forward pass
    T = teacher output for e
    Calculate error (T - O) at the output units
    Compute delta_wh for all weights from hidden layer to output layer  ; backward pass
    Compute delta_wi for all weights from input layer to hidden layer  ; backward pass continued
    Update the weights in the network
Until all examples classified correctly or stopping criterion satisfied
Return the network
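
The pseudocode translates fairly directly into code. Below is a minimal NumPy sketch of the same three-layer training loop, using the logistic activation and on-line updates; the network size, learning rate, and the XOR demonstration data are illustrative choices, not part of the algorithm itself:

  import numpy as np

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def train(inputs, targets, hidden_size=4, lr=0.5, epochs=5000, seed=0):
      # inputs: shape (n_examples, n_inputs); targets: shape (n_examples, n_outputs)
      rng = np.random.default_rng(seed)
      w_hidden = rng.normal(scale=0.5, size=(inputs.shape[1], hidden_size))
      w_output = rng.normal(scale=0.5, size=(hidden_size, targets.shape[1]))
      b_hidden = np.zeros(hidden_size)
      b_output = np.zeros(targets.shape[1])

      for _ in range(epochs):
          for x, t in zip(inputs, targets):
              # Phase 1, step 1: forward pass
              h = sigmoid(x @ w_hidden + b_hidden)   # hidden activations
              o = sigmoid(h @ w_output + b_output)   # output activations

              # Phase 1, step 2: backward pass (deltas of output and hidden neurons)
              delta_o = (o - t) * o * (1.0 - o)
              delta_h = (delta_o @ w_output.T) * h * (1.0 - h)

              # Phase 2: weight update, opposite to the gradient
              w_output -= lr * np.outer(h, delta_o)
              b_output -= lr * delta_o
              w_hidden -= lr * np.outer(x, delta_h)
              b_hidden -= lr * delta_h

      return w_hidden, b_hidden, w_output, b_output

  # Example usage: the XOR problem, which needs the hidden layer.
  X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
  T = np.array([[0], [1], [1], [0]], dtype=float)
  w_h, b_h, w_o, b_o = train(X, T)
  print(sigmoid(sigmoid(X @ w_h + b_h) @ w_o + b_o))   # should move toward [0, 1, 1, 0]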

As the algorithm's name implies, the errors propagate backwards from the output nodes to the inner nodes. Technically speaking, backpropagation calculates the gradient of the error of the network with respect to the network's modifiable weights. This gradient is almost always used in a simple stochastic gradient descent algorithm to find weights that minimize the error. Often the term "backpropagation" is used in a more general sense, to refer to the entire procedure encompassing both the calculation of the gradient and its use in stochastic gradient descent. Backpropagation usually allows quick convergence on satisfactory local minima for error in the kind of networks to which it is suited.

Backpropagation networks are necessarily multilayer perceptrons (usually with one input, one hidden, and one output layer). In order for the hidden layer to serve any useful function, multilayer networks must have non-linear activation functions for the multiple layers: a multilayer network using only linear activation functions is equivalent to some single-layer linear network. Non-linear activation functions that are commonly used include the logistic function, the softmax function, and the Gaussian function.
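
This equivalence can be checked directly: composing two linear layers produces another linear map, so a purely linear hidden layer adds no representational power. A small NumPy sketch (the matrix sizes are arbitrary):

  import numpy as np

  rng = np.random.default_rng(0)
  x = rng.normal(size=3)          # an arbitrary input vector
  W1 = rng.normal(size=(4, 3))    # "hidden" layer weights, linear activation
  W2 = rng.normal(size=(2, 4))    # output layer weights, linear activation

  two_layer = W2 @ (W1 @ x)       # input -> hidden -> output, all linear
  one_layer = (W2 @ W1) @ x       # a single equivalent linear layer

  print(np.allclose(two_layer, one_layer))   # True: the hidden layer collapses away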

The backpropagation algorithm for calculating a gradient has been rediscovered a number of times, and is a special case of a more general technique called automatic differentiation in the reverse accumulation mode.
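
Reverse accumulation records the intermediate values of a forward computation and then multiplies local derivatives together from the output back toward the input, which is exactly what backpropagation does layer by layer. A toy sketch for the composition f(x) = sin(x^2), with illustrative names:

  import math

  def f_and_grad(x):
      # Forward sweep: compute and remember the intermediates.
      a = x * x              # a = x^2
      y = math.sin(a)        # y = sin(a)

      # Reverse sweep: propagate the derivative from the output back to the input.
      dy_da = math.cos(a)    # local derivative of sin
      da_dx = 2.0 * x        # local derivative of squaring
      return y, dy_da * da_dx   # chain rule, applied output-to-input

  print(f_and_grad(1.5))     # value of sin(x^2) and its derivative at x = 1.5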

It is also closely related to the Gauss–Newton algorithm, and is part of continuing research in neural backpropagation.

Multithreaded backpropagation

Backpropagation is an iterative process that can often take a great deal of time to complete. When multicore computers are used, multithreaded techniques can greatly decrease the amount of time that backpropagation takes to converge. If batching is being used, it is relatively simple to adapt the backpropagation algorithm to operate in a multithreaded manner.

The training data is broken up into equally large batches, one for each thread. Each thread executes the forward and backward propagations and accumulates its own weight and threshold deltas. At the end of each iteration, all threads pause briefly while the deltas are summed and applied to the neural network, and the next iteration then begins. This multithreaded approach to backpropagation is used by the Encog Neural Network Framework.
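
The scheme described above is a form of data parallelism: each worker computes gradients on its own slice of the data, the partial results are summed, and a single update is applied. A rough Python sketch using a thread pool; the gradient function is a simple linear-model stand-in for a full forward/backward pass, and none of the names come from Encog's API:

  import numpy as np
  from concurrent.futures import ThreadPoolExecutor

  def batch_gradient(weights, batch):
      # Stand-in gradient (least squares for a linear model); a real implementation
      # would run backpropagation over the batch and return the summed deltas.
      xs, ts = batch
      return xs.T @ (xs @ weights - ts) / len(xs)

  def parallel_iteration(weights, batches, lr=0.1):
      # One training iteration: one thread per batch, deltas summed, one update applied.
      with ThreadPoolExecutor(max_workers=len(batches)) as pool:
          gradients = list(pool.map(lambda b: batch_gradient(weights, b), batches))
      return weights - lr * sum(gradients)

In CPython, threads only help if the gradient computation releases the global interpreter lock (as large NumPy operations do); otherwise a process pool plays the same role.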

Limitations

  • Convergence with backpropagation learning can be very slow.
  • Convergence is not guaranteed.
  • Training may settle in a local minimum of the error surface, since gradient descent on a non-convex surface offers no guarantee of reaching the global minimum.
  • Backpropagation learning requires input scaling or normalization; inputs are usually scaled into the range 0.1 to 0.9 for best performance (see the sketch after this list).
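
A common way to perform that scaling is a min-max rescale of each input feature into the target range. A brief NumPy sketch, using the conventional endpoints mentioned above:

  import numpy as np

  def rescale(inputs, low=0.1, high=0.9):
      # Min-max scale each column (feature) of `inputs` into [low, high].
      mins = inputs.min(axis=0)
      maxs = inputs.max(axis=0)
      return low + (inputs - mins) * (high - low) / (maxs - mins)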

