Loss function
In statistics and decision theory, a loss function is a function that maps an event onto a real number intuitively representing some "cost" associated with the event. Typically it is used for parameter estimation, and the event in question is some function of the difference between estimated and true values for an instance of data. In the context of economics, for example, this is usually economic cost or regret. In machine learning, it is the penalty for an incorrect classification of an example.
Definition
Formally, we begin by considering some family of distributions for a random variable X, indexed by some parameter θ.
More intuitively, we can think of X as our "data", perhaps X = (X₁, …, Xₙ), where the Xᵢ are i.i.d. The X is the set of things the decision rule will be making decisions on. There exist a number of possible ways to model our data X, which our decision function can use to make decisions. For a finite number of models, we can thus think of θ as the index to this family of probability models. For an infinite family of models, it is a set of parameters to the family of distributions.
On a more practical note, it is important to understand that, while it is tempting to think of loss functions as necessarily parametric (since they seem to take θ as a "parameter"), the fact that θ need not be finite-dimensional is incompatible with this notion; for example, if the family of probability functions is uncountably infinite, θ indexes an uncountably infinite space.
From here, given a set A of possible actions, a decision rule is a function δ : 𝒳 → A, where 𝒳 is the set of possible observations.
A loss function is a real-valued, lower-bounded function L on Θ × A, where Θ is the set of possible values of θ. The value L(θ, δ(X)) is the cost of action δ(X) when the true parameter is θ.
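To make these objects concrete, here is a minimal Python sketch (an added illustration, not part of the original article) for the simple case of a normal model with unknown mean θ, squared-error loss, and the sample mean as the decision rule; the model, rule, and names are assumptions chosen purely for illustration:

import numpy as np

def sample_data(theta, n, rng):
    # Draw X = (X_1, ..., X_n) i.i.d. from the model indexed by theta (here Normal(theta, 1)).
    return rng.normal(loc=theta, scale=1.0, size=n)

def delta(x):
    # Decision rule: map the observed data to an action (a point estimate of theta).
    return x.mean()

def L(theta, a):
    # Loss of taking action a when the true parameter is theta (squared-error loss).
    return (theta - a) ** 2

rng = np.random.default_rng(0)
theta_true = 2.0
x = sample_data(theta_true, n=50, rng=rng)
print(L(theta_true, delta(x)))   # realized loss L(theta, delta(X)) for this sample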
Decision rules
A decision rule makes a choice using an optimality criterion. Some commonly used criteria are:
- Minimax: choose the decision rule with the lowest worst loss, that is, minimize the worst-case (maximum possible) loss (see the sketch after this list).
- Invariance: choose the optimal decision rule which satisfies an invariance requirement.
- Minimize the expected value of the loss function.
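As a hedged illustration of the first and third criteria (an addition, not from the original text), the following Python sketch compares two candidate decision rules for estimating a normal mean, the sample mean and the constant rule 0, by their worst-case and average Monte Carlo expected loss over a small made-up grid of θ values:

import numpy as np

rng = np.random.default_rng(1)
thetas = np.linspace(-3.0, 3.0, 13)   # hypothetical grid of parameter values
n, reps = 20, 2000                    # sample size and Monte Carlo replications

rules = {
    "sample mean": lambda x: x.mean(),
    "always zero": lambda x: 0.0,
}

def expected_loss(rule, theta):
    # Monte Carlo estimate of the expected squared-error loss at this theta.
    samples = rng.normal(theta, 1.0, size=(reps, n))
    actions = np.apply_along_axis(rule, 1, samples)
    return np.mean((theta - actions) ** 2)

for name, rule in rules.items():
    losses = np.array([expected_loss(rule, t) for t in thetas])
    print(name, "worst-case:", losses.max(), "average:", losses.mean())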
Expected loss
The value of the loss function itself is a random quantity because it depends on the outcome of a random variable X. Both frequentist and Bayesian statistical theory involve making decisions based on the expected value of the loss function; however, this quantity is defined differently under the two paradigms.
Frequentist risk
The expected loss in the frequentist context is obtained by taking the expected value with respect to the probability distribution, Pθ, of the observed data X. This is also referred to as the risk function of the decision rule δ and the parameter θ. Here the decision rule depends on the outcome of X. The risk function is given by
R(θ, δ) = Eθ[L(θ, δ(X))] = ∫ L(θ, δ(x)) dPθ(x).
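For example (a standard worked case added here for illustration, not from the original text): if X = (X₁, …, Xₙ) are i.i.d. observations with mean θ and variance σ², δ(X) is the sample mean X̄, and L is squared-error loss, then
R(θ, δ) = Eθ[(θ − X̄)²] = Var(X̄) = σ²/n,
so the risk of this decision rule does not depend on θ.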
Bayesian expected loss
In a Bayesian approach, the expectation is calculated using the posterior distribution π* of the parameter θ:
ρ(π*, a) = Eπ*[L(θ, a)] = ∫Θ L(θ, a) dπ*(θ).
One then should choose the action a* which minimises the expected loss. Although this will result in choosing the same action as would be chosen using the Bayes risk, the emphasis of the Bayesian approach is that one is only interested in choosing the optimal action under the actual observed data, whereas choosing the actual Bayes optimal decision rule, which is a function of all possible observations, is a much more difficult problem.
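As a small added sketch (hypothetical numbers, not from the original article), the Bayes action under squared-error loss can be found by minimizing the posterior expected loss over a grid of candidate actions:

import numpy as np

# Hypothetical discrete posterior pi* over a few values of theta.
theta_values = np.array([0.0, 1.0, 2.0])
posterior = np.array([0.2, 0.5, 0.3])   # probabilities summing to 1

def posterior_expected_loss(a):
    # rho(pi*, a) = sum over theta of pi*(theta) * L(theta, a), with squared-error loss.
    return np.sum(posterior * (theta_values - a) ** 2)

candidate_actions = np.linspace(-1.0, 3.0, 401)
losses = np.array([posterior_expected_loss(a) for a in candidate_actions])
a_star = candidate_actions[np.argmin(losses)]
print(a_star)   # about 1.1, the posterior mean, as expected under squared-error loss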
Selecting a loss function
Sound statistical practice requires selecting an estimator consistent with the actual loss experienced in the context of a particular applied problem. Thus, in the applied use of loss functions, selecting which statistical method to use to model an applied problem depends on knowing the losses that will be experienced from being wrong under the problem's particular circumstances, which results in the introduction of an element of teleology into problems of scientific decision-making.
A common example involves estimating "location." Under typical statistical assumptions, the mean or average is the statistic for estimating location that minimizes the expected loss experienced under the Taguchi or squared-error loss function, while the median is the estimator that minimizes expected loss experienced under the absolute-difference loss function. Still different estimators would be optimal under other, less common circumstances.
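A quick numerical check of this claim (an added sketch with made-up data, not part of the original article): over a grid of candidate location estimates, the summed squared loss is minimized at the sample mean and the summed absolute loss at the sample median.

import numpy as np

data = np.array([1.0, 2.0, 2.5, 3.0, 10.0])      # hypothetical sample
candidates = np.linspace(0.0, 12.0, 1201)        # grid of candidate estimates

squared_loss = np.array([np.sum((data - c) ** 2) for c in candidates])
absolute_loss = np.array([np.sum(np.abs(data - c)) for c in candidates])

print(candidates[np.argmin(squared_loss)], data.mean())       # both 3.7
print(candidates[np.argmin(absolute_loss)], np.median(data))  # both 2.5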
In economics, when an agent is risk neutral, the loss function is simply expressed in monetary terms, such as profit, income, or end-of-period wealth.
But for risk-averse (or risk-loving) agents, loss is measured as the negative of a utility function, which represents satisfaction and is usually interpreted in ordinal terms rather than in cardinal (absolute) terms.
Other measures of cost are possible, for example mortality or morbidity in the field of public health or safety engineering.
For most optimization algorithms, it is desirable to have a loss function that is globally continuous and differentiable.
Two very commonly used loss functions are the squared loss, L(a) = a², and the absolute loss, L(a) = |a|. However, the absolute loss has the disadvantage that it is not differentiable at a = 0. The squared loss has the disadvantage that it tends to be dominated by outliers: when summing over a set of a's (as in Σᵢ L(aᵢ)), the final sum tends to be the result of a few particularly large a-values, rather than an expression of the average a-value.
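The outlier effect can be seen in a short added sketch (hypothetical numbers): a single large error accounts for almost all of the summed squared loss but a much smaller share of the summed absolute loss.

import numpy as np

errors = np.array([0.5, -0.3, 0.2, -0.4, 8.0])   # one outlier among small errors

squared_total = np.sum(errors ** 2)
absolute_total = np.sum(np.abs(errors))

# Share of the total loss contributed by the single outlier under each loss.
print(errors[-1] ** 2 / squared_total)     # roughly 0.99 of the squared loss
print(abs(errors[-1]) / absolute_total)    # roughly 0.85 of the absolute loss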
Loss functions in Bayesian statistics
One of the consequences of Bayesian inference is that, in addition to the experimental data, the loss function does not in itself wholly determine a decision. What is important is the relationship between the loss function and the prior probability. So it is possible to have two different loss functions which lead to the same decision when the prior probability distributions associated with each compensate for the details of each loss function.
Combining the three elements of the prior probability, the data, and the loss function then allows decisions to be based on maximizing the subjective expected utility, a concept introduced by Leonard J. Savage.
Regret
Savage also argued that, when using non-Bayesian methods such as minimax, the loss function should be based on the idea of regret, i.e., the loss associated with a decision should be the difference between the consequences of the best decision that could have been taken had the underlying circumstances been known and the decision that was in fact taken before they were known.
Quadratic loss function
The use of a quadratic loss function is common, for example when using least squares techniques or Taguchi methods. It is often more mathematically tractable than other loss functions because of the properties of variances, as well as being symmetric: an error above the target causes the same loss as the same magnitude of error below the target. If the target is t, then a quadratic loss function is
λ(x) = C(t − x)²
for some constant C; the value of the constant makes no difference to a decision, and can be ignored by setting it equal to 1.
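A short added derivation (standard, not from the original text) shows why the constant does not matter and connects this to the earlier remark about the mean. In the estimation setting, where the action a plays the role of the target and the loss is C(X − a)²,
E[C(X − a)²] = C[Var(X) + (E[X] − a)²],
which is minimized at a = E[X] for every positive C; this is both why C can be ignored and why the mean is the optimal estimator under squared-error loss.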
Many common statistics, including t-tests, regression models, design of experiments, and much else, use least squares methods applied using linear models theory, which is based on the Taguchi (quadratic) loss function.
The quadratic loss function is also used in linear-quadratic optimal control problems.