Oversampling and undersampling in data analysis
Oversampling and undersampling in data analysis are techniques used to adjust the class distribution of a data set (i.e. the ratio between the different classes/categories represented).

Oversampling and undersampling are opposite and roughly equivalent techniques. They both involve using a bias to select more samples from one class than from another.

The usual reason for oversampling is to correct for a bias in the original dataset. It is useful, for instance, when training a classifier on labelled data from a biased source: labelled training data is valuable, but it often comes from unrepresentative sources.

For example, suppose we have a sample of 1000 people of which about two-thirds (667) are male (perhaps the sample was collected at a football match). We know the general population is 50% female, and we may wish to adjust our dataset to represent this. Simple oversampling will select each of the 333 female examples twice, and this copying will produce a dataset of 1333 samples that is very nearly 50% female (666 of 1333). Simple undersampling will drop male samples at random until 333 remain, giving a balanced dataset of 666 samples, again with 50% female.
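
The example can be sketched in a few lines of Python. This is only an illustration, assuming the sample splits into exactly 667 male and 333 female records (the two-thirds split that the totals imply), so counts may differ by one from rounded figures. The function names and the record layout are hypothetical.

```python
import random

def oversample(minority, factor):
    """Simple oversampling: repeat every minority example `factor` times."""
    return minority * factor

def undersample(majority, target_size, rng):
    """Simple undersampling: keep a random subset, drawn without replacement."""
    return rng.sample(majority, target_size)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical records: (class label, person id).
males = [("male", i) for i in range(667)]
females = [("female", i) for i in range(333)]

# Select each female example twice and keep every male example.
oversampled = males + oversample(females, 2)

# Drop male examples at random until the classes are the same size.
undersampled = undersample(males, len(females), rng) + females

print(len(oversampled), len(undersampled))  # 1333 666
```

Oversampling keeps every original record but repeats some, while undersampling discards records. Which trade-off is acceptable depends on how much data is available.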

There are also more complex oversampling techniques, including the creation
of artificial data points.
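
One well-known technique of this kind is SMOTE, which creates new minority-class points by interpolating between existing ones. The sketch below is a simplified illustration of that idea rather than the reference algorithm. It assumes numeric feature vectors and interpolates between random pairs of minority points, where SMOTE proper would use nearest neighbours. The function name and the sample data are made up.

```python
import random

def synthesize(minority, n_new, rng):
    """Create n_new artificial points on segments between random pairs of minority points."""
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)  # two distinct real minority points
        t = rng.random()                # position along the segment from a to b
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Made-up 2-D feature vectors for the minority class.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2)]
new_points = synthesize(minority, 6, rng)

print(len(minority) + len(new_points))  # 10
```

Because every artificial point lies between two real ones, the new data stays inside the region the minority class already occupies, which avoids the exact duplicates produced by simple copying.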

See also

  • Oversampling in signal processing, which is unrelated.
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 