Multifactor dimensionality reduction
Encyclopedia
Multifactor dimensionality reduction (MDR) is a data mining
Data mining
Data mining , a relatively young and interdisciplinary field of computer science is the process of discovering new patterns from large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics and database systems...

 approach for detecting and characterizing combinations of attribute
Attribute (computing)
In computing, an attribute is a specification that defines a property of an object, element, or file. It may also refer to or set the specific value for a given instance of such....

s or independent variable
Independent variable
The terms "dependent variable" and "independent variable" are used in similar but subtly different ways in mathematics and statistics as part of the standard terminology in those subjects...

s that interact to influence a dependent or class variable. MDR was designed specifically to identify interaction
Interaction
Interaction is a kind of action that occurs as two or more objects have an effect upon one another. The idea of a two-way effect is essential in the concept of interaction, as opposed to a one-way causal effect...

s among discrete variables that influence a binary
Binary numeral system
The binary numeral system, or base-2 number system, represents numeric values using two symbols, 0 and 1. More specifically, the usual base-2 system is a positional notation with a radix of 2...

 outcome and is considered a nonparametric alternative to traditional statistical methods such as logistic regression
Logistic regression
In statistics, logistic regression is used for prediction of the probability of occurrence of an event by fitting data to a logit function logistic curve. It is a generalized linear model used for binomial regression...

.

The basis of the MDR method is a constructive induction algorithm that converts two or more variables or attributes to a single attribute. This process of constructing a new attribute changes the representation space of the data. The end goal is to create or discover a representation that facilitates the detection of nonlinear or nonadditive interactions among the attributes such that prediction of the class variable is improved over that of the original representation of the data.

Illustrative example

Consider the following simple example using the exclusive OR (XOR) function. XOR is a logical operator that is commonly used in data mining and machine learning
Machine learning
Machine learning, a branch of artificial intelligence, is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data, such as from sensor data or databases...

 as an example of a function that is not linearly separable. The table below represents a simple dataset where the relationship between the attributes (X1 and X2) and the class variable (Y) is defined by the XOR function such that Y = X1 XOR X2.

Table 1
X1 X2 Y
0 0 0
0 1 1
1 0 1
1 1 0


A data mining algorithm would need to discover or approximate the XOR function in order to accurately predict Y using information about X1 and X2. An alternative strategy would be to first change the representation of the data using constructive induction to facilitate predictive modeling. The MDR algorithm would change the representation of the data (X1 and X2) in the following manner. MDR starts by selecting two attributes. In this simple example, X1 and X2 are selected. Each combination of values for X1 and X2 are examined and the number of times Y=1 and/or Y=0 is counted. In this simple example, Y=1 occurs zero times and Y=0 occurs once for the combination of X1=0 and X2=0. With MDR, the ratio of these counts is computed and compared to a fixed threshold. Here, the ratio of counts is 0/1 which is less than our fixed threshold of 1. Since 0/1 < 1 we encode a new attribute (Z) as a 0. When the ratio is greater than one we encode Z as a 1. This process is repeated for all unique combinations of values for X1 and X2. Table 2 illustrates our new transformation of the data.

Table 2
Z Y
0 0
1 1
1 1
0 0


The data mining algorithm now has much less work to do to find a good predictive function. In fact, in this very simple example, the function Y = Z has a classification accuracy of 1. A nice feature of constructive induction methods such as MDR is the ability to use any data mining or machine learning method to analyze the new representation of the data. Decision trees
Decision tree learning
Decision tree learning, used in statistics, data mining and machine learning, uses a decision tree as a predictive model which maps observations about an item to conclusions about the item's target value. More descriptive names for such tree models are classification trees or regression trees...

, neural networks
Neural Networks
Neural Networks is the official journal of the three oldest societies dedicated to research in neural networks: International Neural Network Society, European Neural Network Society and Japanese Neural Network Society, published by Elsevier...

, or a naive Bayes classifier
Naive Bayes classifier
A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong independence assumptions...

 could be used.

Data mining with MDR

As illustrated above, the basic constructive induction algorithm in MDR is very simple. However, its implementation for mining patterns from real data can be computationally complex. As with any data mining algorithm there is always concern about overfitting
Overfitting
In statistics, overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations...

. That is, data mining algorithms are good at finding patterns in completely random data. It is often difficult to determine whether a reported pattern is an important signal or just chance. One approach is to estimate the generalizability of a model to independent datasets using methods such as cross-validation. Typically, models that describe random data typically don't generalize. Another approach is to generate many random permutations of the data to see what the data mining algorithm finds when given the chance to overfit. Permutation testing makes it possible to generate an empirical p-value
P-value
In statistical significance testing, the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. One often "rejects the null hypothesis" when the p-value is less than the significance level α ,...

 for the result. These approaches have all been shown to be useful for choosing and evaluating MDR models.

Applications

MDR has mostly been applied to detecting gene-gene interactions or epistasis
Epistasis
In genetics, epistasis is the phenomenon where the effects of one gene are modified by one or several other genes, which are sometimes called modifier genes. The gene whose phenotype is expressed is called epistatic, while the phenotype altered or suppressed is called hypostatic...

 in genetic studies of common human diseases such as atrial fibrillation
Atrial fibrillation
Atrial fibrillation is the most common cardiac arrhythmia . It is a common cause of irregular heart beat, identified clinically by taking a pulse. Chaotic electrical activity in the two upper chambers of the heart result in the muscle fibrillating , instead of achieving coordinated contraction...

, autism
Autism
Autism is a disorder of neural development characterized by impaired social interaction and communication, and by restricted and repetitive behavior. These signs all begin before a child is three years old. Autism affects information processing in the brain by altering how nerve cells and their...

, bladder cancer
Bladder cancer
Bladder cancer is any of several types of malignant growths of the urinary bladder. It is a disease in which abnormal cells multiply without control in the bladder. The bladder is a hollow, muscular organ that stores urine; it is located in the pelvis...

, breast cancer
Breast cancer
Breast cancer is cancer originating from breast tissue, most commonly from the inner lining of milk ducts or the lobules that supply the ducts with milk. Cancers originating from ducts are known as ductal carcinomas; those originating from lobules are known as lobular carcinomas...

, cardiovascular disease
Cardiovascular disease
Heart disease or cardiovascular disease are the class of diseases that involve the heart or blood vessels . While the term technically refers to any disease that affects the cardiovascular system , it is usually used to refer to those related to atherosclerosis...

, hypertension
Hypertension
Hypertension or high blood pressure is a cardiac chronic medical condition in which the systemic arterial blood pressure is elevated. What that means is that the heart is having to work harder than it should to pump the blood around the body. Blood pressure involves two measurements, systolic and...

, prostate cancer
Prostate cancer
Prostate cancer is a form of cancer that develops in the prostate, a gland in the male reproductive system. Most prostate cancers are slow growing; however, there are cases of aggressive prostate cancers. The cancer cells may metastasize from the prostate to other parts of the body, particularly...

, schizophrenia
Schizophrenia
Schizophrenia is a mental disorder characterized by a disintegration of thought processes and of emotional responsiveness. It most commonly manifests itself as auditory hallucinations, paranoid or bizarre delusions, or disorganized speech and thinking, and it is accompanied by significant social...

, and type II diabetes. However, it can be applied to other domains such as economics
Economics
Economics is the social science that analyzes the production, distribution, and consumption of goods and services. The term economics comes from the Ancient Greek from + , hence "rules of the house"...

, engineering
Engineering
Engineering is the discipline, art, skill and profession of acquiring and applying scientific, mathematical, economic, social, and practical knowledge, in order to design and build structures, machines, devices, systems, materials and processes that safely realize improvements to the lives of...

, meteorology
Meteorology
Meteorology is the interdisciplinary scientific study of the atmosphere. Studies in the field stretch back millennia, though significant progress in meteorology did not occur until the 18th century. The 19th century saw breakthroughs occur after observing networks developed across several countries...

, etc. where interactions among discrete attributes might be important for predicting a binary outcome.

Software

www.epistasis.org provides an open-source and freely-available MDR software package.

See also

  • Dimensionality reduction
    Dimensionality reduction
    In machine learning, dimension reduction is the process of reducing the number of random variables under consideration, and can be divided into feature selection and feature extraction.-Feature selection:...

  • Machine Learning
    Machine learning
    Machine learning, a branch of artificial intelligence, is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data, such as from sensor data or databases...

  • Data Mining
    Data mining
    Data mining , a relatively young and interdisciplinary field of computer science is the process of discovering new patterns from large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics and database systems...


Further reading

  • R. S. Michalski, "Pattern Recognition as Knowledge-Guided Computer Induction," Department of Computer Science Reports, No. 927, University of Illinois, Urbana, June 1978.
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK