L-BFGS
The limited-memory BFGS algorithm (L-BFGS) is a member of the broad family of quasi-Newton optimization methods that uses a limited-memory variation of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) update to approximate the inverse Hessian matrix (denoted by $H_k$). Unlike the original BFGS method, which stores a dense approximation, L-BFGS stores only a few vectors that represent the approximation implicitly. Due to its moderate memory requirement, the L-BFGS method is particularly well suited for optimization problems with a large number of variables. L-BFGS never explicitly forms or stores $H_k$. Instead, it maintains a history of the past $m$ updates of the position $x$ and gradient $\nabla f(x)$, where the history size $m$ can generally be small, often less than 10. These updates are used to implicitly carry out operations requiring the $H_k$-vector product. Whereas a straightforward BFGS implementation at the $i$-th iteration would represent the inverse Hessian approximation as informed by all updates $0, 1, \ldots, i-1$, L-BFGS does quite well using updates from only the $m$ most recent iterations $i-m, \ldots, i-1$.
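As a rough sketch of this storage scheme (the helpers below are illustrative and not taken from any particular implementation), the implicit representation amounts to a bounded list of $(s_k, y_k)$ pairs, costing $O(mn)$ memory instead of the $O(n^2)$ of a dense matrix:

```python
from collections import deque

def make_history(m=10):
    # Bounded history of the m most recent (s_k, y_k) update pairs.
    return deque(maxlen=m)

def record_update(history, x_new, x_old, g_new, g_old):
    # Inputs are NumPy arrays: consecutive iterates and their gradients.
    s = x_new - x_old   # position update s_k = x_{k+1} - x_k
    y = g_new - g_old   # gradient update y_k = g_{k+1} - g_k
    if y @ s > 1e-10:   # keep only pairs with positive curvature y_k^T s_k > 0
        history.append((s, y))  # deque(maxlen=m) silently drops the oldest pair
```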
Representation
L-BFGS shares many features with other quasi-Newton algorithms, but is very different in how the matrix-vector multiplication $z = H_k g_k$ that determines the search direction is carried out. There are multiple published approaches to using a history of updates to form this direction vector. Here, we give a common approach, the so-called "two loop recursion."

We take as given $x_k$, the position at the $k$-th iteration, and $g_k \equiv \nabla f(x_k)$, where $f$ is the function being minimized and all vectors are column vectors. We keep the updates $s_k = x_{k+1} - x_k$ and $y_k = g_{k+1} - g_k$, and we define $\rho_k = 1 / (y_k^\top s_k)$. $H_k^0$ will be the 'initial' approximation of the inverse Hessian that our estimate at iteration $k$ begins with.
Then we can compute the (uphill) direction $z = H_k g_k$ as follows:
- $q = g_k$
- For $i = k-1, k-2, \ldots, k-m$:
  - $\alpha_i = \rho_i \, s_i^\top q$
  - $q = q - \alpha_i y_i$
- $z = H_k^0 q$
- For $i = k-m, k-m+1, \ldots, k-1$:
  - $\beta_i = \rho_i \, y_i^\top z$
  - $z = z + s_i (\alpha_i - \beta_i)$
- Stop with $H_k g_k = z$
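The recursion above maps directly onto code. The following is a minimal sketch in Python/NumPy (the function name and argument layout are illustrative, not taken from any particular library), assuming the history is stored oldest-to-newest as $(s_i, y_i)$ pairs and that $H_k^0 = \gamma I$ is a scaled identity:

```python
import numpy as np

def two_loop_recursion(g, history, gamma=1.0):
    """Compute z = H_k g_k via the L-BFGS two-loop recursion.

    g       -- current gradient g_k (1-D array)
    history -- list/deque of (s_i, y_i) pairs, oldest first, at most m entries
    gamma   -- scalar so that H_k^0 = gamma * I (diagonal initial matrix)
    """
    q = np.array(g, dtype=float)
    alphas = []
    # First loop: newest pair to oldest pair.
    for s, y in reversed(history):
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q -= alpha * y
        alphas.append((rho, alpha))
    # Apply the initial inverse Hessian approximation H_k^0 = gamma * I;
    # since it is diagonal, this is an element-by-element multiplication.
    z = gamma * q
    # Second loop: oldest pair to newest pair.
    for (s, y), (rho, alpha) in zip(history, reversed(alphas)):
        beta = rho * (y @ z)
        z += s * (alpha - beta)
    return z  # z = H_k g_k, the "uphill" direction; step along -z to minimize
```

A common choice for the scaling is $\gamma = s_{k-1}^\top y_{k-1} / (y_{k-1}^\top y_{k-1})$, computed from the most recent update pair.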
This formulation is valid whether we are minimizing or maximizing. Note that if we are minimizing, the search direction would be the negative of $z$ (since $z$ is "uphill"), and if we are maximizing, $H_k^0$ should be negative definite rather than positive definite. We would typically perform a backtracking line search in the search direction (any line search would be valid, but L-BFGS does not require exact line searches in order to converge).
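For concreteness, an Armijo-style backtracking search along the descent direction $-z$ can be sketched as follows (the helper name and constants are illustrative choices, not prescribed by L-BFGS):

```python
def backtracking_line_search(f, x, g, z, t=1.0, c=1e-4, shrink=0.5, max_steps=50):
    """Shrink the step size t until the Armijo sufficient-decrease condition holds.

    f -- objective, x -- current point, g -- gradient at x,
    z -- "uphill" direction from the two-loop recursion (we step along -z).
    """
    fx = f(x)
    for _ in range(max_steps):
        # Sufficient decrease: f(x - t z) <= f(x) - c * t * g^T z
        if f(x - t * z) <= fx - c * t * (g @ z):
            break
        t *= shrink
    return x - t * z
```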
Commonly, the initial inverse Hessian $H_k^0$ is represented as a diagonal matrix, so that initially setting $z = H_k^0 q$ requires only an element-by-element multiplication.

This two-loop update only works for the inverse Hessian. Approaches to implementing L-BFGS using the direct approximate Hessian have also been developed, as have other means of approximating the inverse Hessian.
Applications
L-BFGS has been called "the algorithm of choice" for fitting log-linear (MaxEnt) models and conditional random fields with $\ell_2$-regularization by Andrew and Gao (who developed a variant called OWL-QN for $\ell_1$-regularized models).
Implementations
An early, open source implementation of L-BFGS in Fortran exists in Netlib as a shar archive (http://netlib.org/opt/lbfgs_um.shar). Multiple other open source implementations have been produced as translations of this Fortran code (e.g. in Java, and in Python via SciPy). Other implementations exist (e.g. Matlab (Optimization Toolbox), Matlab (BSD)), frequently as part of generic optimization libraries (e.g. Mathematica). A variant which can handle box constraints on the variables, L-BFGS-B, also exists as ACM TOMS algorithm 778, and can be called from R's general-purpose optimizer routine optim by using method="L-BFGS-B". libLBFGS is a C implementation.
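As an illustration of calling one of these implementations, SciPy exposes the box-constrained L-BFGS-B variant through scipy.optimize.minimize; a minimal sketch (the Rosenbrock function is used purely as a stand-in objective) looks like this:

```python
import numpy as np
from scipy.optimize import minimize

def rosenbrock(x):
    # Classic non-convex test function with minimum at (1, 1).
    return (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2

def rosenbrock_grad(x):
    # Analytic gradient; if omitted, the solver approximates it numerically.
    return np.array([
        -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0] ** 2),
        200 * (x[1] - x[0] ** 2),
    ])

result = minimize(
    rosenbrock,
    x0=np.array([-1.2, 1.0]),
    jac=rosenbrock_grad,
    method="L-BFGS-B",
    bounds=[(-2.0, 2.0), (-2.0, 2.0)],  # box constraints, the "B" in L-BFGS-B
)
print(result.x)  # approximately [1.0, 1.0]
```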