Concept Mining
Encyclopedia
Concept mining is an activity that results in the extraction of concept
s from artifacts
. Solutions to the task typically involve aspects of artificial intelligence
and statistics
, such as data mining
and text mining
. Because artifacts are typically a loosely structured sequence of words and other symbols (rather than concepts), the problem is nontrivial
, but it can provide powerful insights into the meaning, provenance and similarity of documents.
, and for computational techniques the tendency is to do the same. The thesauri used are either specially created for the task, or a pre-existing language model, usually related to Princeton's WordNet
.
The mappings of words to concepts are often ambiguous. Typically each word in a given language will relate to several possible concepts. Humans use context to disambiguate the various meanings of a given piece of text, where available. Machine translation systems cannot easily infer context.
For the purposes of concept mining however, these ambiguities tend to be less important than they are with machine translation, for in large documents the ambiguities tend to even out, much as is the case with text mining.
There are many techniques for disambiguation
that may be used. Examples are linguistic analysis of the text and the use of word and concept association frequency information that may be inferred from large text corpora. Recently, techniques that base on semantic similarity
between the possible concepts and the context have appeared and gained interest in the scientific community.
. These structures can be used to produce simple tree membership statistics, that can be used to locate any document in a Euclidean concept space. If the size of a document is also considered as another dimension of this space then an extremely efficient indexing system can be created. This technique is currently in commercial use locating similar legal documents in a 2.5 million document corpus.
cousins, and tend to behave more intuitively, in that they map better to the similarity measures a human would generate.
Concept
The word concept is used in ordinary language as well as in almost all academic disciplines. Particularly in philosophy, psychology and cognitive sciences the term is much used and much discussed. WordNet defines concept: "conception, construct ". However, the meaning of the term concept is much...
s from artifacts
Document
The term document has multiple meanings in ordinary language and in scholarship. WordNet 3.1. lists four meanings :* document, written document, papers...
. Solutions to the task typically involve aspects of artificial intelligence
Artificial intelligence
Artificial intelligence is the intelligence of machines and the branch of computer science that aims to create it. AI textbooks define the field as "the study and design of intelligent agents" where an intelligent agent is a system that perceives its environment and takes actions that maximize its...
and statistics
Statistics
Statistics is the study of the collection, organization, analysis, and interpretation of data. It deals with all aspects of this, including the planning of data collection in terms of the design of surveys and experiments....
, such as data mining
Data mining
Data mining , a relatively young and interdisciplinary field of computer science is the process of discovering new patterns from large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics and database systems...
and text mining
Text mining
Text mining, sometimes alternately referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as...
. Because artifacts are typically a loosely structured sequence of words and other symbols (rather than concepts), the problem is nontrivial
Nontrivial
Nontrivial is the opposite of trivial. In contexts where trivial has a formal meaning, nontrivial is its antonym.It is a term common among communities of engineers and mathematicians, to indicate a statement or theorem that is not obvious or easy to prove.-Examples:*In mathematics, it is often...
, but it can provide powerful insights into the meaning, provenance and similarity of documents.
Methods
Traditionally, the conversion of words to concepts has been performed using a thesaurusThesaurus
A thesaurus is a reference work that lists words grouped together according to similarity of meaning , in contrast to a dictionary, which contains definitions and pronunciations...
, and for computational techniques the tendency is to do the same. The thesauri used are either specially created for the task, or a pre-existing language model, usually related to Princeton's WordNet
WordNet
WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets...
.
The mappings of words to concepts are often ambiguous. Typically each word in a given language will relate to several possible concepts. Humans use context to disambiguate the various meanings of a given piece of text, where available. Machine translation systems cannot easily infer context.
For the purposes of concept mining however, these ambiguities tend to be less important than they are with machine translation, for in large documents the ambiguities tend to even out, much as is the case with text mining.
There are many techniques for disambiguation
Word sense disambiguation
In computational linguistics, word-sense disambiguation is an open problem of natural language processing, which governs the process of identifying which sense of a word is used in a sentence, when the word has multiple meanings...
that may be used. Examples are linguistic analysis of the text and the use of word and concept association frequency information that may be inferred from large text corpora. Recently, techniques that base on semantic similarity
Semantic similarity
Semantic similarity or semantic relatedness is a concept whereby a set of documents or terms within term lists are assigned a metric based on the likeness of their meaning / semantic content....
between the possible concepts and the context have appeared and gained interest in the scientific community.
Detecting and indexing similar documents in large corpora
One of the spin-offs of calculating document statistics in the concept domain, rather than the word domain, is that concepts form natural tree structures based on hypernymy and meronymyMeronymy
Meronymy is a semantic relation used in linguistics. A meronym denotes a constituent part of, or a member of something. That is,...
. These structures can be used to produce simple tree membership statistics, that can be used to locate any document in a Euclidean concept space. If the size of a document is also considered as another dimension of this space then an extremely efficient indexing system can be created. This technique is currently in commercial use locating similar legal documents in a 2.5 million document corpus.
Clustering documents by topic
Standard numeric clustering techniques may be used in "concept space" as described above to locate and index documents by the inferred topic. These are numerically far more efficient than their text miningText mining
Text mining, sometimes alternately referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as...
cousins, and tend to behave more intuitively, in that they map better to the similarity measures a human would generate.
See also
- Formal concept analysisFormal concept analysisFormal concept analysis is a principled way of automatically deriving an ontology from a collection of objects and their properties. The term was introduced by Rudolf Wille in 1984, and builds on applied lattice and order theory that was developed by Birkhoff and others in the 1930s.-Intuitive...
- Information extractionInformation extractionInformation extraction is a type of information retrieval whose goal is to automatically extract structured information from unstructured and/or semi-structured machine-readable documents. In most of the cases this activity concerns processing human language texts by means of natural language...
- Compound term processingCompound term processingCompound term processing is the name that is used for a category of techniques in Information retrieval applications that performs matching on the basis of compound terms...