Evolutionary data mining
Evolutionary data mining, or genetic data mining, is an umbrella term for any data mining using evolutionary algorithms. While it can be used for mining data from DNA sequences, it is not limited to biological contexts and can be used in any classification-based prediction scenario, which helps "predict the value ... of a user-specified goal attribute based on the values of other attributes." For instance, a banking institution might want to predict whether a customer's credit would be "good" or "bad" based on their age, income and current savings. Evolutionary algorithms for data mining work by creating a series of random rules to be checked against a training dataset. The rules which most closely fit the data are selected and mutated. The process is iterated many times and, eventually, a rule arises that approaches 100% similarity with the training data. This rule is then checked against a test dataset, which was previously unseen by the genetic algorithm.
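To make the notion of a "rule" concrete, the following minimal sketch in Python shows one possible encoding for the credit example above; the attribute names, the fixed-threshold form and the values are illustrative assumptions, not part of any standard algorithm:

from dataclasses import dataclass

@dataclass
class Rule:
    """IF age >= min_age AND income >= min_income AND savings >= min_savings
    THEN predict "good" credit, ELSE predict "bad"."""
    min_age: float
    min_income: float
    min_savings: float

    def predict(self, age: float, income: float, savings: float) -> str:
        if age >= self.min_age and income >= self.min_income and savings >= self.min_savings:
            return "good"
        return "bad"

# One candidate rule, as it might appear in a random initial population:
rule = Rule(min_age=25, min_income=30_000, min_savings=5_000)
print(rule.predict(age=40, income=52_000, savings=12_000))  # prints "good"

A real rule encoding would typically also evolve which attributes, comparison operators and value ranges appear in each rule; the fixed-threshold form above is the simplest case.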
Data preparation
Before databases can be mined for data using evolutionary algorithms, the data first has to be cleaned, which means incomplete, noisy or inconsistent records should be repaired. It is imperative that this be done before the mining takes place, as it helps the algorithms produce more accurate results.
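As a simple illustration of cleaning, the sketch below discards incomplete or inconsistent records; the record fields and validity checks are assumptions invented for the example, and in practice repairs such as imputing missing values or smoothing noisy ones are just as common:

# Toy records: the second is incomplete, the third inconsistent.
records = [
    {"age": 40, "income": 52_000, "savings": 12_000, "credit": "good"},
    {"age": None, "income": 31_000, "savings": 2_000, "credit": "bad"},
    {"age": 35, "income": -5_000, "savings": 8_000, "credit": "good"},
]

def is_complete(rec):
    return all(value is not None for value in rec.values())

def is_consistent(rec):
    return rec["age"] >= 0 and rec["income"] >= 0 and rec["savings"] >= 0

cleaned = [rec for rec in records if is_complete(rec) and is_consistent(rec)]
print(len(cleaned))  # prints 1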
If the data comes from more than one database, it can be integrated, or combined, at this point. When dealing with large datasets, it might also be beneficial to reduce the amount of data being handled. One common method of data reduction works by drawing a normalized sample of data from the database, resulting in much faster, yet statistically equivalent, results.
At this point, the data is split into two equal but mutually exclusive sets: a training dataset and a test dataset. The training dataset is used to let rules evolve that match it closely; the test dataset then either confirms or rejects these rules.
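The sketch below shows both preparation steps, reduction by sampling and the train/test split. Plain random sampling is used here as a stand-in for the normalized sampling described above (a stratified sample would better preserve the statistics of the full database), and all sizes are arbitrary choices for the example:

import random

random.seed(0)  # fixed seed so the example is reproducible

# Stand-in for a large dataset that has already been cleaned and integrated.
full_data = [
    {"age": random.randint(18, 70),
     "income": random.randint(10_000, 120_000),
     "savings": random.randint(0, 50_000),
     "credit": random.choice(["good", "bad"])}
    for _ in range(10_000)
]

# Data reduction: mine a random sample rather than the whole database.
sample = random.sample(full_data, 2_000)

# Split the sample into two equal but mutually exclusive datasets.
half = len(sample) // 2
training_set, test_set = sample[:half], sample[half:]
print(len(training_set), len(test_set))  # prints: 1000 1000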
Data mining
Evolutionary algorithms work by trying to emulate natural evolution. First, a random series of "rules" is set on the training dataset; these rules try to generalize the data into formulas. The rules are checked, and the ones that fit the data best are kept; the rules that do not fit the data are discarded. The rules that were kept are then mutated and multiplied to create new rules.
This process iterates as necessary to produce a rule that matches the dataset as closely as possible. Once such a rule is obtained, it is checked against the test dataset. If the rule still matches the data, it is valid and is kept; if it does not, it is discarded and the process begins again by selecting new random rules.
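Putting the pieces together, the following self-contained sketch runs the whole evolutionary loop on synthetic data. The rule encoding, population size, selection pressure and mutation scheme are all illustrative assumptions; real evolutionary data mining systems differ widely in each of these choices:

import random

random.seed(1)

# Synthetic labelled records (age, income, savings, label). The labels follow
# a hidden rule so that evolution has a pattern to discover.
def make_record():
    age = random.randint(18, 70)
    income = random.randint(10_000, 120_000)
    savings = random.randint(0, 50_000)
    label = "good" if income >= 40_000 and savings >= 5_000 else "bad"
    return (age, income, savings, label)

training_set = [make_record() for _ in range(500)]
test_set = [make_record() for _ in range(500)]

# A rule is a triple of minimum thresholds: predict "good" if all are met.
def random_rule():
    return (random.uniform(18, 70),
            random.uniform(10_000, 120_000),
            random.uniform(0, 50_000))

def predict(rule, rec):
    return "good" if all(rec[i] >= rule[i] for i in range(3)) else "bad"

def fitness(rule, data):
    # Fraction of records the rule classifies correctly.
    return sum(predict(rule, rec) == rec[3] for rec in data) / len(data)

def mutate(rule):
    # Perturb one randomly chosen threshold by a small Gaussian step.
    i = random.randrange(3)
    scale = (70, 120_000, 50_000)[i]
    new = list(rule)
    new[i] += random.gauss(0, 0.05) * scale
    return tuple(new)

# Evolve: keep the best-fitting rules, then mutate them to refill the population.
population = [random_rule() for _ in range(50)]
for generation in range(100):
    population.sort(key=lambda r: fitness(r, training_set), reverse=True)
    survivors = population[:10]                                  # selection
    population = survivors + [mutate(random.choice(survivors))  # mutation
                              for _ in range(40)]

best = max(population, key=lambda r: fitness(r, training_set))
print(f"training accuracy: {fitness(best, training_set):.2%}")
print(f"test accuracy:     {fitness(best, test_set):.2%}")

Because the test dataset played no part in the evolution, a rule that also scores well on it is likely to have generalized the underlying pattern rather than merely memorized the training data.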
See also
- Data mining
- Evolutionary algorithm
- Knowledge discovery
- Pattern mining
- Data analysis