Chemical database
A chemical database is a database specifically designed to store chemical information. This information is about chemical and crystal structures, spectra, reactions and syntheses, and thermophysical data.
Chemical structures
Chemical structures are traditionally represented using lines indicating chemical bonds between atoms and drawn on paper (2D structural formulae). While these are ideal visual representations for the chemist, they are unsuitable for computational use and especially for search and storage. Small molecules (also called ligands in drug design applications) are usually represented using lists of atoms and their connections. Large molecules such as proteins are, however, more compactly represented using the sequences of their amino acid building blocks.
Large chemical databases for structures are expected to handle the storage and searching of information on millions of molecules taking terabytes of physical memory.
Literature database
Chemical literature databases correlate structures or other chemical information to relevant references such as academic papers or patents. This type of database includes STN and SciFinder. Links to literature are also included in many databases that focus on chemical characterization.
Crystallographic database
Crystallographic databases store X-ray crystal structure data. Common examples include the Protein Data Bank and the Cambridge Structural Database.
NMR spectra database
NMR spectra databases correlate chemical structure with NMR data. These databases often include other characterization data such as FTIR and mass spectrometry.
Reactions database
Most chemical databases store information on stable molecules, but databases for reactions also store intermediates and temporarily created unstable molecules. Reaction databases contain information about products, educts, and reaction mechanisms.
Thermophysical database
Thermophysical data are information about
- phase equilibria including vapor-liquid equilibrium, solubility of gases in liquids, liquids in solids (SLE), heats of mixing, vaporization, and fusion
- caloric data like heat capacity, heat of formation and combustion
- transport properties like viscosity and thermal conductivity
Chemical structure representation
There are two principal techniques for representing chemical structures in digital databases:
- As connection tables / adjacency matrices / lists with additional information on bond (edges) and atom attributes (nodes), such as MDL Molfile, PDB, CML
- As a linear string notation based on depth-first or breadth-first traversal, such as SMILES/SMARTS, SLN, WLN, InChI
These approaches have been refined to allow representation of stereochemical differences and charges as well as special kinds of bonding such as those seen in organometallic compounds. The principal advantage of a computer representation is the possibility for increased storage and fast, flexible search.
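As a rough illustration of the first technique, the sketch below stores ethanol as a small connection table and derives an adjacency list and an adjacency matrix from it. The variable and function names are made up for this example, and hydrogens are left implicit; real formats such as MDL Molfile carry far more detail.

```python
from collections import defaultdict

# Ethanol (CH3-CH2-OH), heavy atoms only; hydrogens are left implicit.
# Atom block: atom index -> element symbol (node attributes).
atoms = {0: "C", 1: "C", 2: "O"}

# Bond block: (atom index, atom index, bond order) (edge attributes).
bonds = [(0, 1, 1), (1, 2, 1)]


def adjacency_list(bonds):
    """Turn a bond list into an adjacency list: atom -> [(neighbour, order)]."""
    adj = defaultdict(list)
    for a, b, order in bonds:
        adj[a].append((b, order))
        adj[b].append((a, order))
    return dict(adj)


def adjacency_matrix(n_atoms, bonds):
    """Symmetric matrix whose entry [i][j] is the bond order (0 = no bond)."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for a, b, order in bonds:
        matrix[a][b] = matrix[b][a] = order
    return matrix


adj = adjacency_list(bonds)
mat = adjacency_matrix(len(atoms), bonds)

# The same molecule written as a linear string notation (SMILES).
ethanol_smiles = "CCO"
```

The connection table makes every atom and bond individually addressable, while the string form is compact and easy to store as a single database column.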
Substructure
Chemists can search databases using parts of structures, parts of their IUPAC names, or constraints on properties. Chemical databases differ from other general-purpose databases particularly in their support for substructure search. This kind of search is achieved by looking for subgraph isomorphism (sometimes also called a monomorphism) and is a widely studied application of graph theory. The algorithms for searching are computationally intensive, often of O(n^3) or O(n^4) time complexity (where n is the number of atoms involved). The intensive component of search is called atom-by-atom searching (ABAS), in which a mapping of the search substructure atoms and bonds onto the target molecule is sought. ABAS searching usually makes use of Ullmann's algorithm or variations of it (e.g. SMSD).
Speedups are achieved by time amortization, that is, some of the time spent on search tasks is saved by using precomputed information. This precomputation typically involves creation of bitstrings representing presence or absence of molecular fragments. By looking at the fragments present in a search structure, it is possible to eliminate the need for ABAS comparison with target molecules that do not possess the fragments that are present in the search structure. This elimination is called screening (not to be confused with the screening procedures used in drug discovery). The bitstrings used for these applications are also called structural keys. The performance of such keys depends on the choice of the fragments used for constructing the keys and the probability of their presence in the database molecules. Another kind of key makes use of hash codes based on fragments derived computationally. These are called 'fingerprints', although the term is sometimes used synonymously with structural keys. The amount of memory needed to store these structural keys and fingerprints can be reduced by 'folding', which is achieved by combining parts of the key using bitwise operations and thereby reducing the overall length.
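The screening and folding steps can be sketched with plain integers used as bitstrings. This is a minimal illustration under stated assumptions: the four-fragment dictionary, the function names, and the three-molecule "database" are hypothetical, and real structural keys use hundreds or thousands of bits.

```python
# Hypothetical fragment dictionary: fragment -> bit position in the key.
FRAGMENT_BITS = {"C=O": 0, "O-H": 1, "aromatic ring": 2, "C-N": 3}
KEY_LENGTH = 4


def structural_key(fragments):
    """Pack the fragments found in a molecule into an integer bitstring."""
    key = 0
    for fragment in fragments:
        key |= 1 << FRAGMENT_BITS[fragment]
    return key


def passes_screen(query_key, target_key):
    """A target can contain the query only if every query bit is set in it."""
    return (query_key & target_key) == query_key


def fold(key, length):
    """Halve a key of `length` bits by OR-ing its upper half onto its lower half."""
    half = length // 2
    return (key >> half) | (key & ((1 << half) - 1))


# Precomputed keys for a toy database (fragments assigned by hand).
database = {
    "benzoic acid": structural_key(["C=O", "O-H", "aromatic ring"]),
    "aniline": structural_key(["aromatic ring", "C-N"]),
    "acetic acid": structural_key(["C=O", "O-H"]),
}

# Query substructure: a carbonyl group attached to an aromatic system.
query = structural_key(["C=O", "aromatic ring"])

# Only molecules that pass the screen need the expensive ABAS comparison.
candidates = [name for name, key in database.items() if passes_screen(query, key)]

# Folding saves memory but loses selectivity: aniline passes the folded screen.
folded_hit = passes_screen(fold(query, KEY_LENGTH), fold(database["aniline"], KEY_LENGTH))
```

Screening never rejects a true match, because a target that contains the query substructure also contains all of its fragments. It can only let false candidates through, and folding increases how often that happens.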
Conformation
Search by matching the 3D conformation of molecules or by specifying spatial constraints is another feature that is of particular use in drug design. Searches of this kind can be computationally very expensive. Many approximate methods have been proposed, for instance BCUTS, special function representations, moments of inertia, ray-tracing histograms, maximum distance histograms, and shape multipoles, to name a few.
Descriptors
All properties of molecules beyond their structure can be split up into either physico-chemical or pharmacological attributes, also called descriptors. On top of that, there exist various artificial and more or less standardized naming systems for molecules that supply more or less ambiguous names and synonyms. The IUPAC name is usually a good choice for representing a molecule's structure in a string that is both human-readable and unique, although it becomes unwieldy for larger molecules. Trivial names, on the other hand, abound with homonyms and synonyms and are therefore a bad choice as a defining database key. While physico-chemical descriptors like molecular weight, (partial) charge, solubility, etc. can mostly be computed directly based on the molecule's structure, pharmacological descriptors can be derived only indirectly using involved multivariate statistics or experimental (screening, bioassay) results. All of those descriptors can, for reasons of computational effort, be stored along with the molecule's representation, and usually are.
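As a small example of a descriptor computed directly from structure, the sketch below derives a molecular weight from element counts and stores it next to the molecule's representation. The record layout and names are assumptions made for illustration, and the atomic weights are standard values rounded to three decimals.

```python
# Standard atomic weights in g/mol, rounded to three decimals.
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def molecular_weight(atom_counts):
    """Sum the atomic weights over all atoms of the molecule."""
    return sum(ATOMIC_WEIGHT[element] * count for element, count in atom_counts.items())


# Ethanol, C2H6O, with its SMILES string as the structure representation.
record = {"smiles": "CCO", "atom_counts": {"C": 2, "H": 6, "O": 1}}

# Compute once and keep the value with the record, so that later
# property searches do not need to recompute it.
record["descriptors"] = {"molecular_weight": molecular_weight(record["atom_counts"])}
```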
Similarity
There is no single definition of molecular similarity; however, the concept may be defined according to the application and is often described as an inverse of a measure of distance in descriptor space. Two molecules might be considered more similar, for instance, if their difference in molecular weights is lower than when compared with others. A variety of other measures could be combined to produce a multivariate distance measure. Distance measures are often classified into Euclidean measures and non-Euclidean measures depending on whether the triangle inequality holds. Maximum common subgraph (MCS) based substructure search (as a similarity or distance measure) is also very common. MCS is also used for screening drug-like compounds by finding hit molecules that share a common subgraph (substructure).
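A minimal sketch of similarity as an inverse of distance in descriptor space follows. The descriptor vectors are made-up, already-scaled numbers, and the particular mapping 1 / (1 + d) is only one of many possible choices, not a standard prescribed by the text.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity(a, b):
    """Map a distance in [0, inf) to a similarity in (0, 1]; identical vectors give 1."""
    return 1.0 / (1.0 + euclidean_distance(a, b))


# Hypothetical two-descriptor vectors for three molecules.
mol_a = (0.0, 0.0)
mol_b = (3.0, 4.0)
mol_c = (3.0, 0.0)

d_ab = euclidean_distance(mol_a, mol_b)
d_ac = euclidean_distance(mol_a, mol_c)
d_cb = euclidean_distance(mol_c, mol_b)

# The triangle inequality holds for the Euclidean measure.
triangle_ok = d_ab <= d_ac + d_cb
```

In practice descriptors with very different ranges (for example molecular weight and partial charge) would need to be scaled before being combined into one distance.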
Chemicals in the databases may be clustered into groups of 'similar' molecules based on similarities. Both hierarchical and non-hierarchical clustering approaches can be applied to chemical entities with multiple attributes. These attributes or molecular properties may be either determined empirically or computationally derived as descriptors. One of the most popular clustering approaches is the Jarvis-Patrick algorithm, which is based on shared k-nearest neighbours.
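The following is a sketch of one common formulation of Jarvis-Patrick clustering, assuming Euclidean distance between descriptor vectors: two molecules are joined when each appears in the other's list of k nearest neighbours and the two lists share at least k_min entries. The function name, parameters, and sample points are illustrative, and published variants differ in details such as whether a point counts as its own neighbour.

```python
import math


def jarvis_patrick(points, k, k_min):
    """Cluster points by shared nearest neighbours; returns lists of point indices."""
    n = len(points)

    # k nearest neighbours of each point, excluding the point itself.
    neighbours = []
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        neighbours.append(set(others[:k]))

    # Union-find structure used to merge points into clusters.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            mutual = j in neighbours[i] and i in neighbours[j]
            if mutual and len(neighbours[i] & neighbours[j]) >= k_min:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Two well-separated groups of hypothetical descriptor vectors.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = jarvis_patrick(points, k=2, k_min=1)
```

Because the method relies only on neighbour lists rather than absolute distances, it is non-hierarchical and needs no distance threshold, only the two parameters k and k_min.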
In pharmacologically oriented chemical repositories, similarity is usually defined in terms of the biological effects of compounds (ADME/tox) that can in turn be semiautomatically inferred from similar combinations of physico-chemical descriptors using QSAR methods.
Registration systems
Database systems for maintaining unique records on chemical compounds are termed registration systems. These are often used for chemical indexing, patent systems and industrial databases.
Registration systems usually enforce uniqueness of the chemical represented in the database through the use of unique representations. By applying rules of precedence for the generation of stringified notations, one can obtain unique ('canonical') string representations such as 'canonical SMILES'. Some registration systems such as the CAS system make use of algorithms to generate unique hash codes to achieve the same objective.
A key difference between a registration system and a simple chemical database is the ability to accurately represent that which is known, unknown, and partially known. For example, a chemical database might store a molecule with stereochemistry unspecified, whereas a chemical registry system requires the registrar to specify whether the stereo configuration is unknown, a specific (known) mixture, or racemic. Each of these would be considered a different record in a chemical registry system.
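The uniqueness rule and the stereo distinction can be sketched as a toy registry. Everything here is hypothetical: the class and enum names are invented, and the structure string is assumed to have been canonicalized already by some external tool, since canonicalization itself is beyond this example.

```python
from enum import Enum


class StereoStatus(Enum):
    """Stereo states that the registrar must choose between."""
    UNKNOWN = "unknown"
    KNOWN_MIXTURE = "known mixture"
    RACEMIC = "racemic"


class Registry:
    """Toy registration system: one record per (structure, stereo status) pair."""

    def __init__(self):
        self._records = {}
        self._next_id = 1

    def register(self, canonical_structure, stereo):
        """Return the existing registry number, or assign a new one."""
        if not isinstance(stereo, StereoStatus):
            raise TypeError("stereo status must be stated explicitly")
        key = (canonical_structure, stereo)
        if key not in self._records:
            self._records[key] = self._next_id
            self._next_id += 1
        return self._records[key]


registry = Registry()
lactic_acid = "CC(O)C(=O)O"  # assumed to be an already canonical string

first = registry.register(lactic_acid, StereoStatus.RACEMIC)
again = registry.register(lactic_acid, StereoStatus.RACEMIC)
other = registry.register(lactic_acid, StereoStatus.UNKNOWN)
```

Registering the same structure with the same stereo status twice returns the same number, while a different stereo status yields a separate record.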
Registration systems also preprocess molecules to avoid considering trivial differences such as differences in halogen ions in chemicals.
An example is the Chemical Abstracts Service (CAS) registration system http://www.cas.org/EO/regsys.html. See also CAS registry number.
Tools
The computational representations are usually made transparent to chemists by graphical display of the data. Data entry is also simplified through the use of chemical structure editors. These editors internally convert the graphical data into computational representations. There are also numerous algorithms for the interconversion of various formats of representation. An open-source utility for conversion is OpenBabel. These search and conversion algorithms are implemented either within the database system itself or, as is now the trend, as external components that fit into standard relational database systems. Both Oracle- and PostgreSQL-based systems make use of cartridge technology that allows user-defined datatypes. These allow the user to make SQL queries with chemical search conditions, for example a query to search for records having a phenyl ring in their structure, represented as a SMILES string in a SMILESCOL column.
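The idea of extending SQL with a chemical condition can be sketched with Python's built-in `sqlite3` module, which lets a Python function be registered as an SQL function. This is not the syntax of any real chemistry cartridge. The table name `CHEMTABLE`, the function name `CONTAINS_PHENYL`, and the sample rows are assumptions for the example, and the "substructure test" is a deliberately naive text match standing in for a real graph search.

```python
import sqlite3


def contains_phenyl(smiles):
    """Toy stand-in for a cartridge's substructure operator.

    A real cartridge parses the SMILES and runs a graph match; this
    placeholder only looks for the literal ring text 'c1ccccc1'.
    """
    return 1 if smiles is not None and "c1ccccc1" in smiles else 0


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under a hypothetical name.
conn.create_function("CONTAINS_PHENYL", 1, contains_phenyl)

conn.execute("CREATE TABLE CHEMTABLE (NAME TEXT, SMILESCOL TEXT)")
conn.executemany(
    "INSERT INTO CHEMTABLE VALUES (?, ?)",
    [("toluene", "Cc1ccccc1"), ("ethanol", "CCO"), ("phenol", "Oc1ccccc1")],
)

# An ordinary SQL query carrying a chemical search condition.
rows = conn.execute(
    "SELECT NAME FROM CHEMTABLE "
    "WHERE CONTAINS_PHENYL(SMILESCOL) = 1 ORDER BY NAME"
).fetchall()
hits = [name for (name,) in rows]
```

The point of the design is that the chemical logic lives in a pluggable function, while storage, indexing of ordinary columns, and query planning remain with the relational database.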
Algorithms for the conversion of IUPAC names to structure representations and vice versa are also used for extracting structural information from text. However, there are difficulties due to the existence of multiple dialects of IUPAC nomenclature. Work is under way to establish a unique IUPAC standard (see InChI).
See also
- Beilstein database and Dortmund Data Bank
- BindingDB
- ChEBI
- ChEMBL
- ChemSpider
- Comparative Toxicogenomics Database
- Computational Chemistry List
- DrugBank
- List of software for molecular mechanics modeling
- LOLI database
- NMR spectra database
- PubChem
- SPRESI database
- eChemportal
Database and registration software
- CDK, a Java open-source library for chemical data handling
- JChem Base and JChem Cartridge, Java and .NET database management and search toolkits from ChemAxon
- Instant JChem, a Java desktop database management and search application from ChemAxon. Personal Edition free.
- SMSD (Small Molecule Subgraph Detector), a Java-based software library for calculating the Maximum Common Subgraph (MCS) between small molecules
- JOELib, a Java chemical data handling software library
- 'Chemical Structure Lookup Service' and 'NCI Enhanced Database Browser', web services of the CADD group at the National Cancer Institute (NCI)
- Pinpoint from Dotmatics is a C-based cartridge for Oracle that supports the free Oracle XE.
- Bingo from GGA Software Services is a free and open-source cartridge for Oracle, Microsoft SQL Server and PostgreSQL.
- MolSql from Scilligence is a chemistry cartridge built on Microsoft .NET for Microsoft SQL Server, supporting the free SQL Server Express edition.
Databases of chemical structures
- Synthesis references database
- Aurora Fine Chemicals
- eChemPortal, a global portal to information on Chemical Substances
- NLM ChemIDplus, biomedical chemicals searchable by name and structure.
- Chembase, a chemical compounds database with data and properties.
- Organic synthesis database
- ZINC, a free database for virtual screening
- ChemSpider, Free access to > 20 Million Chemical Structures, Physical Property Data and Systematic Identifiers
- MMsINC, a free web-oriented database of commercially available compounds for virtual screening and chemoinformatic applications
- ChemIndustry, a free database derived from PubChem data
- OpenCDLig (https://kdd.di.unito.it/casmedchem), a free Web application for host/guest complexes
- NCI/CADD Chemical Structure Lookup Service, lookup in which databases a structure occurs (currently > 70 million indexed chemical structures)
- Chempedia, the open, peer reviewed chemical substance registry
- ChEBI, the free chemical substance registry for biologically relevant molecules
- Chemonaut, billed as the world's most comprehensive source of physically available commercial compounds
Databases of chemical names
- Chemical Substances Database, a free database of chemical names, mainly useful for translation of names between Japanese and English. More than 37,000 entries.
- ChemSub Online, the Free Web Portal and Information System on Chemical Substances, substance names available in 8 languages.
- EuroChem Online Database, the free Chemical Database.