Empirical risk minimization
Empirical risk minimization (ERM) is a principle in statistical learning theory which defines a family of learning algorithms and is used to give theoretical bounds on their performance.

Background

Consider the following situation, which is a general setting of many supervised learning problems. We have two spaces of objects $X$ and $Y$ and would like to learn a function $h \colon X \to Y$ (often called a hypothesis) which outputs an object $y \in Y$, given $x \in X$. To do so, we have at our disposal a training set of $n$ examples $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in X$ is an input and $y_i \in Y$ is the corresponding response that we wish to get from $h(x_i)$.

To put it more formally, we assume that there is a joint probability distribution $P(x, y)$ over $X$ and $Y$, and that the training set consists of $n$ instances $(x_1, y_1), \ldots, (x_n, y_n)$ drawn i.i.d. from $P(x, y)$. Note that the assumption of a joint probability distribution allows us to model uncertainty in predictions (e.g. from noise in the data), because $y$ is not a deterministic function of $x$, but rather a random variable with conditional distribution $P(y \mid x)$ for a fixed $x$.
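
As a concrete illustration of this setup, the following sketch (a hypothetical toy example, not part of the original article) draws i.i.d. pairs $(x, y)$ from a simple joint distribution in which the label is a noisy function of the input, so that $y$ really is a random variable given $x$:

```python
import random

def sample_pair():
    """Draw one (x, y) pair from a toy joint distribution P(x, y).

    x is uniform on [-1, 1]; y equals sign(x) but is flipped with
    probability 0.1, so y is a random variable given x rather than a
    deterministic function of x.
    """
    x = random.uniform(-1.0, 1.0)
    y = 1 if x >= 0 else -1
    if random.random() < 0.1:  # label noise: the conditional distribution P(y | x)
        y = -y
    return x, y

# An i.i.d. training set of n examples (x_1, y_1), ..., (x_n, y_n).
training_set = [sample_pair() for _ in range(100)]
```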

We also assume that we are given a non-negative real-valued loss function $L(\hat{y}, y)$ which measures how different the prediction $\hat{y}$ of a hypothesis is from the true outcome $y$. The risk associated with hypothesis $h(x)$ is then defined as the expectation of the loss function:

$$R(h) = \mathbb{E}[L(h(x), y)] = \int L(h(x), y)\, dP(x, y).$$

A loss function commonly used in theory is the 0-1 loss function: $L(\hat{y}, y) = I(\hat{y} \ne y)$, where $I(\cdot)$ is the indicator function.

The ultimate goal of a learning algorithm is to find a hypothesis $h^*$ among a fixed class of functions $\mathcal{H}$ for which the risk $R(h)$ is minimal:

$$h^* = \arg\min_{h \in \mathcal{H}} R(h).$$

Empirical risk minimization

In general, the risk $R(h)$ cannot be computed because the distribution $P(x, y)$ is unknown to the learning algorithm (this situation is referred to as agnostic learning). However, we can compute an approximation, called the empirical risk, by averaging the loss function on the training set:

$$R_{\mathrm{emp}}(h) = \frac{1}{n} \sum_{i=1}^{n} L(h(x_i), y_i).$$
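
For instance, with the 0-1 loss the empirical risk is simply the fraction of training examples that a hypothesis misclassifies. A minimal sketch (the tiny data set and threshold rule below are made up for illustration):

```python
def zero_one_loss(y_pred, y_true):
    """0-1 loss: 1 if the prediction differs from the true label, else 0."""
    return 0 if y_pred == y_true else 1

def empirical_risk(h, training_set):
    """R_emp(h): the average loss of hypothesis h over the training set."""
    return sum(zero_one_loss(h(x), y) for x, y in training_set) / len(training_set)

# A tiny hand-made training set of (x, y) pairs with labels in {-1, +1}.
training_set = [(-0.9, -1), (-0.4, -1), (-0.1, 1), (0.2, 1), (0.7, 1)]

h = lambda x: 1 if x >= 0 else -1   # an illustrative threshold hypothesis
print(empirical_risk(h, training_set))   # 0.2: one of the five examples is misclassified
```
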
The empirical risk minimization principle states that the learning algorithm should choose a hypothesis $\hat{h}$ which minimizes the empirical risk:

$$\hat{h} = \arg\min_{h \in \mathcal{H}} R_{\mathrm{emp}}(h).$$

Thus the learning algorithm defined by the ERM principle consists in solving the above optimization problem.
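
When the hypothesis class is small and finite, the ERM principle can be applied literally by exhaustive search: evaluate the empirical risk of every candidate and keep a minimizer. The sketch below (a hypothetical example over a handful of threshold classifiers, not taken from the article) makes the optimization problem concrete:

```python
def make_threshold_classifier(t):
    """h_t(x) = 1 if x >= t, else -1."""
    return lambda x: 1 if x >= t else -1

def empirical_risk(h, training_set):
    """R_emp(h): average 0-1 loss of h on the training set."""
    return sum(1 for x, y in training_set if h(x) != y) / len(training_set)

training_set = [(-0.9, -1), (-0.4, -1), (-0.1, 1), (0.2, 1), (0.7, 1)]

# A small, finite hypothesis class: threshold classifiers at a few fixed thresholds.
hypothesis_class = {t: make_threshold_classifier(t) for t in (-0.5, -0.2, 0.0, 0.5)}

# ERM: choose the hypothesis with the smallest empirical risk.
best_t = min(hypothesis_class,
             key=lambda t: empirical_risk(hypothesis_class[t], training_set))
print(best_t, empirical_risk(hypothesis_class[best_t], training_set))  # -0.2 attains risk 0.0
```

Realistic hypothesis classes are usually infinite, so ERM is posed as a continuous optimization problem rather than an enumeration, but the principle is the same.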

Computational complexity

Empirical risk minimization for a classification problem with the 0-1 loss function is known to be an NP-hard problem, even for such a relatively simple class of functions as linear classifiers. It can, however, be solved efficiently when the minimal empirical risk is zero, i.e. when the data are linearly separable.

In practice, machine learning algorithms cope with this either by employing a convex approximation to the 0-1 loss function (such as the hinge loss for SVMs), which is easier to optimize, or by imposing assumptions on the distribution (and thus ceasing to be agnostic learning algorithms, to which the above result applies).
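
To illustrate the surrogate-loss idea, the sketch below (a hypothetical toy example; the data, step size and iteration count are made up for illustration) replaces the 0-1 loss with the hinge loss $\max(0, 1 - y(wx + b))$ and minimizes the resulting convex empirical risk of a linear classifier by subgradient descent:

```python
def hinge_loss(w, b, x, y):
    """Convex surrogate for the 0-1 loss: max(0, 1 - y * (w*x + b))."""
    return max(0.0, 1.0 - y * (w * x + b))

def empirical_hinge_risk(w, b, data):
    return sum(hinge_loss(w, b, x, y) for x, y in data) / len(data)

# Toy 1-D training set with labels in {-1, +1}.
data = [(-0.9, -1), (-0.4, -1), (-0.1, 1), (0.2, 1), (0.7, 1)]

# Subgradient descent on the (convex) empirical hinge risk.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(200):
    gw = gb = 0.0
    for x, y in data:
        if 1.0 - y * (w * x + b) > 0.0:  # the subgradient is nonzero only inside the margin
            gw -= y * x
            gb -= y
    w -= lr * gw / len(data)
    b -= lr * gb / len(data)

print(w, b, empirical_hinge_risk(w, b, data))
```

Because the hinge loss upper-bounds the 0-1 loss for labels in $\{-1, +1\}$, driving the empirical hinge risk down also keeps the 0-1 empirical risk small.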