Applicability Domain
Encyclopedia
The Applicability Domain (AD) of a QSAR is the physico-chemical, structural or biological space, knowledge or information on which the training set of the model has been developed, and for which it is applicable to make predictions for new compounds.
The purpose of AD is to state whether the model's assumptions are met. In general, this is the case for interpolation
rather than for extrapolation
. Although up to now there is no single generally accepted algorithm for determining the AD, there exists a rather systematic approach for defining interpolation regions. The process involves the removal of outliers and a probability density distribution method using kernel-weighted sampling. A recent rigorous benchmarking study of several AD algorithms identified standard-deviation of models as the most reliable approach .
To investigate the AD of a training set of chemicals one can directly analyse properties of the multivariate descriptor space of the training compounds or more indirectly via distance
(or similarity) metrics. When using distance metrics care should be taken to use an orthogonal and significant vector space. This can be achieved by different means of feature selection and successive principal components analysis
.
The purpose of AD is to state whether the model's assumptions are met. In general, this is the case for interpolation
Interpolation
In the mathematical field of numerical analysis, interpolation is a method of constructing new data points within the range of a discrete set of known data points....
rather than for extrapolation
Extrapolation
In mathematics, extrapolation is the process of constructing new data points. It is similar to the process of interpolation, which constructs new points between known points, but the results of extrapolations are often less meaningful, and are subject to greater uncertainty. It may also mean...
. Although up to now there is no single generally accepted algorithm for determining the AD, there exists a rather systematic approach for defining interpolation regions. The process involves the removal of outliers and a probability density distribution method using kernel-weighted sampling. A recent rigorous benchmarking study of several AD algorithms identified standard-deviation of models as the most reliable approach .
To investigate the AD of a training set of chemicals one can directly analyse properties of the multivariate descriptor space of the training compounds or more indirectly via distance
Distance
Distance is a numerical description of how far apart objects are. In physics or everyday discussion, distance may refer to a physical length, or an estimation based on other criteria . In mathematics, a distance function or metric is a generalization of the concept of physical distance...
(or similarity) metrics. When using distance metrics care should be taken to use an orthogonal and significant vector space. This can be achieved by different means of feature selection and successive principal components analysis
Principal components analysis
Principal component analysis is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of uncorrelated variables called principal components. The number of principal components is less than or equal to...
.