Unified Medical Language System
The Unified Medical Language System (UMLS) is a compendium of many controlled vocabularies in the biomedical sciences (created 1986). It provides a mapping structure among these vocabularies and thus allows one to translate among the various terminology systems; it may also be viewed as a comprehensive thesaurus and ontology of biomedical concepts. UMLS further provides facilities for natural language processing. It is intended to be used mainly by developers of systems in medical informatics.
UMLS consists of Knowledge Sources (databases) and a set of software tools.
The UMLS was designed and is maintained by the US National Library of Medicine, is updated quarterly and may be used for free. The project was initiated in 1986 by Donald A. B. Lindberg, M.D., then Director of the Library of Medicine.
Purpose and applications
The number of biomedical resources available to researchers is enormous. Often this is a problem due to the large volume of documents retrieved when the medical literature is searched. The purpose of the UMLS is to enhance access to this literature by facilitating the development of computer systems that understand biomedical language. This is achieved by overcoming two significant barriers: "the variety of ways the same concepts are expressed in different machine-readable sources & by different people" and "the distribution of useful information among many disparate databases & systems".
The UMLS can be used to design information retrieval or patient record systems, to facilitate the communication between different systems, or to develop systems that parse the biomedical literature. For many of these applications, the UMLS will have to be customized locally according to one's particular needs. The Library of Medicine itself uses it for its PubMed and ClinicalTrials.gov systems.
Licensing
Users of the system are required to sign a "UMLS agreement" and file brief annual usage reports. Academic users may use the UMLS free of charge for research purposes. Commercial or production use requires copyright licenses for some of the incorporated source vocabularies.
Metathesaurus
The Metathesaurus forms the base of the UMLS and comprises over 1 million biomedical concepts and 5 million concept names, all of which stem from the over 100 incorporated controlled vocabularies and classification systems. Some examples of the incorporated controlled vocabularies are ICD-10, MeSH, SNOMED CT, DSM-IV, LOINC, WHO Adverse Drug Reaction Terminology, UK Clinical Terms, RxNorm, Gene Ontology, and OMIM.
The Metathesaurus is organized by concept, and each concept has specific attributes defining its meaning and is linked to the corresponding concept names in the various source vocabularies. Numerous relationships between the concepts are represented, for instance hierarchical ones such as "isa" for subclasses and "is part of" for subunits, and associative ones such as "is caused by" or "in the literature often occurs close to" (the latter being derived from MEDLINE).
The scope of the Metathesaurus is determined by the scope of the source vocabularies. If different vocabularies use different names for the same concept, or if they use the same name for different concepts, then this will be faithfully represented in the Metathesaurus. All hierarchical information from the source vocabularies is retained in the Metathesaurus. Metathesaurus concepts can also link to resources outside of the database, for instance gene sequence databases.
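The concept-centred organization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the concept identifiers, vocabulary labels (`VOCAB_A`, `VOCAB_B`), names, and the helper functions `build_indexes` and `translate` are all hypothetical and are not taken from any UMLS release or API. The sketch shows how one concept can carry different names in different vocabularies, and how one name can point at more than one concept.

```python
from collections import defaultdict

# Hypothetical rows: (concept_id, source_vocabulary, name).
# Identifiers and names are invented for illustration only.
ROWS = [
    ("C_0001", "VOCAB_A", "Myocardial infarction"),
    ("C_0001", "VOCAB_B", "Heart attack"),
    ("C_0002", "VOCAB_A", "Cold"),   # intended sense: low temperature
    ("C_0003", "VOCAB_B", "Cold"),   # intended sense: common cold
]


def build_indexes(rows):
    """Index names by concept and source, and concepts by (lower-cased) name."""
    names_by_concept = defaultdict(lambda: defaultdict(list))
    concepts_by_name = defaultdict(set)
    for concept_id, source, name in rows:
        names_by_concept[concept_id][source].append(name)
        concepts_by_name[name.lower()].add(concept_id)
    return names_by_concept, concepts_by_name


def translate(name, target_source, names_by_concept, concepts_by_name):
    """Map a name to the names used by target_source, per candidate concept.

    An ambiguous name yields one entry per concept it may denote; a concept
    with no name in target_source maps to an empty list.
    """
    result = {}
    for concept_id in sorted(concepts_by_name.get(name.lower(), ())):
        result[concept_id] = list(names_by_concept[concept_id].get(target_source, []))
    return result


names, concepts = build_indexes(ROWS)
print(translate("Heart attack", "VOCAB_A", names, concepts))
print(translate("Cold", "VOCAB_A", names, concepts))
```

Keeping the ambiguity visible in the result, rather than picking one concept, mirrors the statement that differing usage across vocabularies is represented faithfully.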
The Metathesaurus itself is produced by the automated processing of machine-readable versions of the source vocabularies, followed by human intervention of editing and review. It is distributed as a set of relational files and can also be accessed through a Java API.
Semantic Network
Each concept in the Metathesaurus is assigned one or more semantic types (categories), which are linked with one another through semantic relationships. The semantic network is a catalog of these semantic types and relationships. This is a rather broad classification; there are 135 semantic types and 54 relationships in total.
The major semantic types are organisms, anatomical structures, biologic function, chemicals, events, physical objects, and concepts or ideas.
The links among semantic types define the structure of the network and show important relationships between the groupings and concepts. The primary link between semantic types is the "isa" link, establishing a hierarchy of types.
The network also has 5 major categories of non-hierarchical (or associative) relationships, which constitute the remaining 53 relationship types. These are "physically related to", "spatially related to", "temporally related to", "functionally related to" and "conceptually related to".
The information about a semantic type includes an identifier, definition, examples, hierarchical information about the encompassing semantic type(s), and associative relationships. Associative relationships within the Semantic Network are very weak. They capture at most some-some relationships, i.e. they capture the fact that some instance of the first type may be connected by the salient relationship to some instance of the second type. Phrased differently, they capture the fact that a corresponding relational assertion is meaningful (though it need not be true in all cases).
An example of an associative relationship is "may-cause": applied to the pair of terms (smoking, lung cancer), it yields the assertion smoking "may-cause" lung cancer.
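The interplay between the "isa" hierarchy and the weak, some-some associative relationships can be illustrated with a small sketch. The type names, the single `causes` link, and the functions `ancestors` and `is_meaningful` are hypothetical. The sketch also assumes that a relationship stated between two types carries over to their subtypes, which is an assumption of this example and not something stated above. A `True` result only says the assertion is meaningful, not that it holds for every instance.

```python
# Hypothetical fragment of a type hierarchy: child type -> parent type ("isa").
ISA = {
    "Human": "Mammal",
    "Mammal": "Organism",
    "Virus": "Organism",
    "Disease": "Biologic Function",
}

# Hypothetical associative links asserted between pairs of types.
ASSOCIATIVE = {
    ("Organism", "causes", "Biologic Function"),
}


def ancestors(sem_type):
    """Yield sem_type itself, then each type reachable by following "isa"."""
    seen = set()
    while sem_type is not None and sem_type not in seen:
        seen.add(sem_type)
        yield sem_type
        sem_type = ISA.get(sem_type)


def is_meaningful(subject_type, relation, object_type):
    """Check whether the relation is asserted for the types or their ancestors.

    Assumption of this sketch: links are inherited down the "isa" hierarchy.
    """
    return any(
        (s, relation, o) in ASSOCIATIVE
        for s in ancestors(subject_type)
        for o in ancestors(object_type)
    )


print(list(ancestors("Human")))
print(is_meaningful("Virus", "causes", "Disease"))
print(is_meaningful("Disease", "causes", "Virus"))
```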
SPECIALIST Lexicon
The SPECIALIST Lexicon contains information about common English vocabulary, biomedical terms, terms found in MEDLINE and terms found in the UMLS Metathesaurus. Each entry contains syntactic (how words are put together to create meaning), morphological (form and structure) and orthographic (spelling) information. A set of Java programs uses the lexicon to work through the variations in biomedical texts by relating words by their parts of speech, which can be helpful in web searches or searches through an electronic medical record.
Entries may be one-word or multiple-word terms. Records contain four parts: base form (e.g. "run" for "running"); parts of speech (of which SPECIALIST recognizes eleven); a unique identifier; and any available spelling variants.
For example, a query for "anesthetic" would return the following:
{ base=anaesthetic
spelling_variant=anesthetic
entry=E0008769
cat=noun
variants=reg
}
{ base=anaesthetic
spelling_variant=anesthetic
entry=E0008770
cat=adj
variants=inv
position=attrib(3)
}
(Browne et al., 2000)
The SPECIALIST lexicon is available in two formats. The "unit record" format can be seen above, and comprises slots and fillers. A slot is the element (e.g. "base=" or "spelling_variant=") and the fillers are the values attributable to that slot for that entry. The "relational table" format is not yet normalized and contains a great deal of redundant data in the files.
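To make the slot and filler structure concrete, the following is a minimal sketch of a reader for records laid out like the "anesthetic" example. The function `parse_unit_records` is hypothetical and handles only the layout shown in this article (one `slot=filler` pair per line, records delimited by braces); the files actually distributed may differ in detail.

```python
def parse_unit_records(text):
    """Parse "{ slot=filler ... }" records into a list of dictionaries.

    Each slot maps to a list of fillers, so a slot that appears more than
    once in a record keeps all of its values.
    """
    records, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("{"):
            current = {}                 # an opening brace starts a new record
            line = line[1:].strip()
        if current is None or not line:
            continue
        closes = line.endswith("}")      # a closing brace ends the record
        if closes:
            line = line[:-1].strip()
        if "=" in line:
            slot, filler = line.split("=", 1)
            current.setdefault(slot.strip(), []).append(filler.strip())
        if closes:
            records.append(current)
            current = None
    return records


SAMPLE = """\
{ base=anaesthetic
spelling_variant=anesthetic
entry=E0008769
cat=noun
variants=reg
}
{ base=anaesthetic
spelling_variant=anesthetic
entry=E0008770
cat=adj
variants=inv
position=attrib(3)
}
"""

for record in parse_unit_records(SAMPLE):
    print(record["entry"], record["cat"], record["spelling_variant"])
```

Storing fillers as lists is a design choice of the sketch: it avoids silently discarding a value if a slot repeats within one record.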
Inconsistencies and other errors
Given the size and complexity of the UMLS and its permissive policy on integrating terms, errors are inevitable. Errors include ambiguity and redundancy, hierarchical relationship cycles (a concept is both an ancestor and a descendant of another), missing ancestors (semantic types of parent and child concepts are unrelated), and semantic inversion (the child/parent relationship with the semantic types is not consistent with the concepts).
These errors are discovered and resolved by auditing the UMLS. Manual audits can be very time-consuming and costly. Researchers have attempted to address the issue in a number of ways. Automated tools can be used to search for these errors.
For structural inconsistencies (such as loops), a trivial solution that removes relationships based on order would work. However, the same would not apply when the inconsistency is at the term or concept level (context-specific meaning of a term). This requires that an informed search strategy be used (knowledge representation).
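As an illustration of how an automated audit might search for hierarchical relationship cycles, the sketch below runs a depth-first search over a child-to-parents mapping. It is a generic cycle detector, not the method used by any particular UMLS auditing tool; the function `find_cycle` and the concept labels are hypothetical, and the recursive form assumes hierarchies shallow enough for Python's recursion limit.

```python
def find_cycle(parents):
    """Return one cycle as a list of concepts, or None if none is found.

    `parents` maps each concept to the concepts directly above it.
    """
    UNSEEN, ACTIVE, DONE = 0, 1, 2
    state = {}
    path = []

    def visit(node):
        state[node] = ACTIVE
        path.append(node)
        for parent in parents.get(node, ()):
            parent_state = state.get(parent, UNSEEN)
            if parent_state == ACTIVE:
                # The parent is already on the current path: a loop.
                return path[path.index(parent):] + [parent]
            if parent_state == UNSEEN:
                found = visit(parent)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for start in list(parents):
        if state.get(start, UNSEEN) == UNSEEN:
            found = visit(start)
            if found:
                return found
    return None


# "A" is below "B", "B" below "C", and "C" (wrongly) below "A".
print(find_cycle({"A": ["B"], "B": ["C"], "C": ["A"]}))
print(find_cycle({"A": ["B"], "B": ["C"], "C": []}))
```

Detecting the loop is the mechanical part; deciding which relationship in the reported cycle to remove is where ordering rules or domain knowledge come in.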
Supporting software tools
In addition to the knowledge sources, the National Library of Medicine also provides supporting tools.
Third party software
- UMLS-Similarity, an open source software package that implements many measures of semantic similarity and relatedness.
- UMLS-Similarity web interface, a web interface to UMLS-Similarity
Further reading
- Bodenreider, Olivier. (2004) The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32, D267-D270.
- Kumar, Anand and Smith, Barry (2003) The Unified Medical Language System and the Gene Ontology: Some Critical Reflections, in: KI 2003: Advances in Artificial Intelligence (Lecture Notes in Artificial Intelligence 2821), Berlin: Springer, 135–148.
- Smith, Barry, Kumar, Anand and Schulze-Kremer, Steffen (2004) Revising the UMLS Semantic Network, in M. Fieschi, et al. (eds.), Medinfo 2004, Amsterdam: IOS Press, 1700.
External links
- Official UMLS site
- UMLS Summary description, with links to factsheets and documentation for Metathesaurus, Semantic Network, SPECIALIST Lexicon and MetamorphoSys
- UMLS Overview and Tutorial, by Rachel Kleinsorge, Jan Willis, Allen Browne, Alan Aronson
- A Perl module to query a UMLS mysql installation