Molecule mining
This page describes mining for molecules. Since molecules may be represented by molecular graphs, this is strongly related to graph mining and structured data mining. The main problem is how to represent molecules in a way that still discriminates between the data instances. One way to do this is to use chemical similarity metrics, which have a long tradition in the field of cheminformatics.
Typical approaches to calculating chemical similarities use chemical fingerprints, but this loses the underlying information about the molecule topology. Mining the molecular graphs directly avoids this problem. So does the inverse QSAR problem, which is preferable for vectorial mappings.
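The loss of topology can be illustrated with a deliberately tiny sketch. The "fingerprint" below is a hypothetical toy (the set of element pairs that occur as bonds), not the fingerprint of any real cheminformatics toolkit, and the molecules are hand-written heavy-atom graphs. The Tanimoto coefficient is used as the similarity measure; the function names are assumptions made for this example.

```python
def bond_fingerprint(atoms, bonds):
    """Toy fingerprint: the set of unordered element pairs that occur as bonds."""
    return {tuple(sorted((atoms[u], atoms[v]))) for u, v in bonds}


def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) coefficient of two fingerprint sets."""
    union = fp_a | fp_b
    if not union:
        return 1.0  # two empty fingerprints are treated as identical
    return len(fp_a & fp_b) / len(union)


# Heavy-atom graphs: (atom labels, list of bonded index pairs).
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])               # C-C-O
propanol = (["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])  # C-C-C-O
dimethyl_ether = (["C", "O", "C"], [(0, 1), (1, 2)])         # C-O-C

fp_ethanol = bond_fingerprint(*ethanol)
fp_propanol = bond_fingerprint(*propanol)
fp_ether = bond_fingerprint(*dimethyl_ether)

# Ethanol and 1-propanol are different graphs, yet the toy fingerprint
# cannot tell them apart because it only records which bond types occur.
sim_same_fp = tanimoto(fp_ethanol, fp_propanol)
sim_ether = tanimoto(fp_ethanol, fp_ether)
print(sim_same_fp, sim_ether)
```

Real fingerprints encode far more fragments than this, but the principle is the same: once a graph is reduced to a set or bit vector of features, distinct molecules can collapse onto the same representation.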
Maximum Common Graph methods
- MCS-HSCS (Highest Scoring Common Substructure (HSCS) ranking strategy for single MCS)
- Small Molecule Subgraph Detector (SMSD): a Java-based software library for calculating the maximum common subgraph (MCS) between small molecules. The MCS can be used to derive a similarity or distance between two molecules. It is also used for screening drug-like compounds by finding molecules that share a common subgraph (substructure).
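As a minimal sketch of the idea behind these methods, the following brute-force search finds a maximum common induced subgraph of two small labeled graphs and turns its size into a similarity score. It is not the algorithm used by SMSD or MCS-HSCS; the function names, the graph encoding, and the Tanimoto-style score are assumptions made for illustration. The problem is NP-hard in general, so exhaustive search like this is only feasible for a handful of atoms, and the common subgraph found here is not required to be connected.

```python
from itertools import combinations, permutations


def max_common_subgraph(atoms_a, bonds_a, atoms_b, bonds_b):
    """Return a mapping {index in A: index in B} of a maximum common
    induced subgraph, found by exhaustive search (tiny graphs only)."""
    edges_a = {frozenset(b) for b in bonds_a}
    edges_b = {frozenset(b) for b in bonds_b}
    # Try the largest candidate size first, so the first hit is maximal.
    for size in range(min(len(atoms_a), len(atoms_b)), 0, -1):
        for nodes_a in combinations(range(len(atoms_a)), size):
            for nodes_b in permutations(range(len(atoms_b)), size):
                pairs = list(zip(nodes_a, nodes_b))
                # Mapped atoms must carry the same element label.
                if any(atoms_a[i] != atoms_b[j] for i, j in pairs):
                    continue
                # Every pair of mapped atoms must be bonded in both
                # graphs or in neither (induced subgraph condition).
                if all(
                    (frozenset((i1, i2)) in edges_a)
                    == (frozenset((j1, j2)) in edges_b)
                    for (i1, j1), (i2, j2) in combinations(pairs, 2)
                ):
                    return dict(pairs)
    return {}


def mcs_similarity(atoms_a, bonds_a, atoms_b, bonds_b):
    """Tanimoto-style score: |MCS| / (|A| + |B| - |MCS|), counted in atoms."""
    common = len(max_common_subgraph(atoms_a, bonds_a, atoms_b, bonds_b))
    return common / (len(atoms_a) + len(atoms_b) - common)


ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])               # C-C-O
propanol = (["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])  # C-C-C-O
dimethyl_ether = (["C", "O", "C"], [(0, 1), (1, 2)])         # C-O-C

mapping = max_common_subgraph(*ethanol, *propanol)
sim_propanol = mcs_similarity(*ethanol, *propanol)
sim_ether = mcs_similarity(*ethanol, *dimethyl_ether)
print(mapping, sim_propanol, sim_ether)
```

Because the comparison works on the graphs themselves, ethanol and 1-propanol are no longer judged identical: the whole ethanol skeleton is found inside 1-propanol, while only a single C-O bond is shared with dimethyl ether.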
Molecular query methods
- Warmr
- AGM
- PolyFARM
- FSG
- MolFea
- MoFa/MoSS
- Gaston
- LAZAR
- ParMol (contains MoFa, FFSM, gSpan, and Gaston)
- optimized gSpan
- SMIREP
- DMax
- SAm/AIm/RHC
- AFGen
- gRed
Methods based on special architectures of neural networks
- BPZ
- ChemNet
- CCS
- MolNet
- Graph machines
Further reading
- Schölkopf, B., K. Tsuda and J. P. Vert: Kernel Methods in Computational Biology, MIT Press, Cambridge, MA, 2004.
- R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, John Wiley & Sons, 2001. ISBN 0-471-05669-3
- Gusfield, D., Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997. ISBN 0-521-58519-8
- R. Todeschini, V. Consonni, Handbook of Molecular Descriptors, Wiley-VCH, 2000. ISBN 3527299130
External links
- Small Molecule Subgraph Detector (SMSD) - a Java-based software library for calculating the maximum common subgraph (MCS) between small molecules.
- 5th International Workshop on Mining and Learning with Graphs, 2007
- Overview for 2006
- Molecule mining (basic chemical expert systems)
- ParMol and master thesis documentation - Java - Open source - Distributed mining - Benchmark algorithm library
- TU München - Kramer group
- Molecule mining (advanced chemical expert systems)
- DMax Chemistry Assistant - commercial software
- AFGen - Software for generating fragment-based descriptors