BFGS method
In numerical optimization, the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method is an iterative method for solving unconstrained nonlinear optimization problems.
The BFGS method approximates Newton's method, a class of hill-climbing optimization techniques that seeks a stationary point of a (preferably twice continuously differentiable) function. For such problems, a necessary condition for optimality is that the gradient be zero.
Newton's method and the BFGS method are not guaranteed to converge unless the function has a quadratic Taylor expansion near an optimum. These methods use both the first and second derivatives of the function. However, BFGS has proven to have good performance even for non-smooth optimizations.
In quasi-Newton methods, the Hessian matrix of second derivatives need not be evaluated directly. Instead, the Hessian matrix is approximated using rank-one updates specified by gradient evaluations (or approximate gradient evaluations). Quasi-Newton methods are a generalization of the secant method to find the root of the first derivative for multidimensional problems. In multiple dimensions the secant equation does not specify a unique solution, and quasi-Newton methods differ in how they constrain the solution. The BFGS method is one of the most popular members of this class. Also in common use is L-BFGS, a limited-memory version of BFGS that is particularly suited to problems with very large numbers of variables (e.g., more than 1000). The BFGS-B variant handles simple box constraints.
Rationale
The search direction $p_k$ at stage $k$ is given by the solution of the analogue of the Newton equation
$B_k p_k = -\nabla f(x_k),$
where $B_k$ is an approximation to the Hessian matrix, which is updated iteratively at each stage, and $\nabla f(x_k)$ is the gradient of the function evaluated at $x_k$. A line search in the direction $p_k$ is then used to find the next point $x_{k+1}$. Instead of requiring the full Hessian matrix at the point $x_{k+1}$ to be computed as $B_{k+1}$, the approximate Hessian at stage $k$ is updated by the addition of two matrices:
$B_{k+1} = B_k + U_k + V_k.$
Both $U_k$ and $V_k$ are symmetric rank-one matrices, but they have different bases. The symmetric rank-one assumption here means that we may write
$C = a\, b b^\top,$
where $a$ is a scalar and $b$ is a vector.
So, equivalently, $U_k$ and $V_k$ construct a rank-two update matrix which is robust against the scale problem often suffered in gradient descent searching (e.g., in Broyden's method).
The quasi-Newton condition imposed on this update is
$B_{k+1}(x_{k+1} - x_k) = \nabla f(x_{k+1}) - \nabla f(x_k).$
Algorithm
From an initial guess $x_0$ and an approximate Hessian matrix $B_0$, the following steps are repeated as $x_k$ converges to the solution.
- Obtain a direction $p_k$ by solving $B_k p_k = -\nabla f(x_k)$.
- Perform a line search to find an acceptable stepsize $\alpha_k$ in the direction found in the first step, then update $x_{k+1} = x_k + \alpha_k p_k$.
- Set $s_k = \alpha_k p_k$.
- Set $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$.
- Update $B_{k+1} = B_k + \frac{y_k y_k^\top}{y_k^\top s_k} - \frac{B_k s_k s_k^\top B_k}{s_k^\top B_k s_k}$.
Here $f(x)$ denotes the objective function to be minimized. Convergence can be checked by observing the norm of the gradient, $\|\nabla f(x_k)\|$. Practically, $B_0$ can be initialized with $B_0 = I$, so that the first step will be equivalent to a gradient descent, but further steps are more and more refined by $B_k$, the approximation to the Hessian.
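For illustration, the following is a minimal NumPy sketch of the iteration described above. The backtracking line search, the tolerance, and the safeguard on the curvature condition are illustrative choices rather than part of the original description; a production implementation would typically use a Wolfe-condition line search.

    import numpy as np

    def bfgs(f, grad, x0, tol=1e-6, max_iter=200):
        """Minimal BFGS sketch that maintains the Hessian approximation B_k directly."""
        x = np.asarray(x0, dtype=float)
        B = np.eye(x.size)                    # B_0 = I: the first step is a gradient-descent step
        g = grad(x)
        for _ in range(max_iter):
            if np.linalg.norm(g) < tol:       # convergence test on the gradient norm
                break
            p = np.linalg.solve(B, -g)        # step 1: solve B_k p_k = -grad f(x_k)
            alpha = 1.0                       # step 2: crude backtracking (Armijo) line search
            for _ in range(30):
                if f(x + alpha * p) <= f(x) + 1e-4 * alpha * g.dot(p):
                    break
                alpha *= 0.5
            s = alpha * p                     # step 3: s_k
            x_new = x + s
            g_new = grad(x_new)
            y = g_new - g                     # step 4: y_k
            sy = y.dot(s)
            if sy > 1e-12:                    # curvature condition; skip the update otherwise
                Bs = B.dot(s)                 # step 5: rank-two BFGS update of B
                B = B + np.outer(y, y) / sy - np.outer(Bs, Bs) / s.dot(Bs)
            x, g = x_new, g_new
        return x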
The first step of the algorithm is carried out using the inverse of the matrix $B_k$, which is usually obtained efficiently by applying the Sherman–Morrison formula to the fifth line of the algorithm, giving
$B_{k+1}^{-1} = B_k^{-1} + \frac{(s_k^\top y_k + y_k^\top B_k^{-1} y_k)(s_k s_k^\top)}{(s_k^\top y_k)^2} - \frac{B_k^{-1} y_k s_k^\top + s_k y_k^\top B_k^{-1}}{s_k^\top y_k}.$
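A minimal NumPy sketch of this inverse update (variable and function names are illustrative); maintaining $H_k \approx B_k^{-1}$ lets the search direction be computed as the matrix–vector product $p_k = -H_k \nabla f(x_k)$ instead of a linear solve each iteration:

    import numpy as np

    def bfgs_inverse_update(Hinv, s, y):
        """Update Hinv ~ B_k^{-1} using s_k and y_k (assumes s.y > 0)."""
        sy = s.dot(y)                         # s_k^T y_k
        Hy = Hinv.dot(y)                      # B_k^{-1} y_k
        return (Hinv
                + ((sy + y.dot(Hy)) / sy**2) * np.outer(s, s)
                - (np.outer(Hy, s) + np.outer(s, Hy)) / sy)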
In statistical estimation problems (such as maximum likelihood or Bayesian inference), credible intervals or confidence intervals for the solution can be obtained from the inverse of the final Hessian matrix.
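As a hedged sketch (the model, data, and use of SciPy here are illustrative assumptions, and the inverse Hessian returned by BFGS is only an approximation to the true inverse Hessian), asymptotic standard errors can be read off its diagonal when the objective is a negative log-likelihood:

    import numpy as np
    from scipy.optimize import minimize

    # Illustrative model: maximum-likelihood fit of a normal mean and log-std.
    rng = np.random.default_rng(0)
    data = rng.normal(loc=2.0, scale=1.5, size=500)

    def neg_log_likelihood(theta):
        mu, log_sigma = theta
        sigma = np.exp(log_sigma)
        return 0.5 * np.sum(((data - mu) / sigma) ** 2) + data.size * log_sigma

    res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), method="BFGS")
    cov = res.hess_inv                        # BFGS's final inverse-Hessian approximation
    std_err = np.sqrt(np.diag(cov))           # asymptotic standard errors
    print(res.x, std_err)                     # 95% CIs: res.x +/- 1.96 * std_err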
Implementations
In the MATLAB Optimization Toolbox, the fminunc function uses BFGS with cubic line search when the problem size is set to "medium scale."
The GNU Scientific Library (GSL) implements BFGS as gsl_multimin_fdfminimizer_vector_bfgs2.
In SciPy, the scipy.optimize.fmin_bfgs function implements BFGS.
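A brief usage sketch of scipy.optimize.fmin_bfgs (the Rosenbrock test function and its gradient below are illustrative choices, not part of the library's interface):

    import numpy as np
    from scipy.optimize import fmin_bfgs

    def rosen(x):                             # Rosenbrock function, minimized at (1, 1)
        return 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2

    def rosen_grad(x):                        # analytic gradient; if omitted, fmin_bfgs estimates it numerically
        return np.array([-400.0 * x[0] * (x[1] - x[0]**2) - 2.0 * (1.0 - x[0]),
                         200.0 * (x[1] - x[0]**2)])

    xmin = fmin_bfgs(rosen, x0=np.array([-1.2, 1.0]), fprime=rosen_grad)
    print(xmin)                               # approximately [1.0, 1.0]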
It is also possible to run BFGS using any of the L-BFGS algorithms by setting the parameter L to a very large number.