Document-term matrix
A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms (a term-document matrix is its transpose, with rows for terms and columns for documents). There are various schemes for determining the value that each entry in the matrix should take. One such scheme is tf-idf. Document-term matrices are useful in the field of natural language processing.
General Concept
When creating a database of terms that appear in a set of documents, the document-term matrix contains rows corresponding to the documents and columns corresponding to the terms. For instance, if one has the following two (short) documents:
- D1 = "I like databases"
- D2 = "I hate hate databases",
then the document-term matrix would be:
   | I | like | hate | databases |
D1 | 1 | 1    | 0    | 1         |
D2 | 1 | 0    | 2    | 1         |
which shows which documents contain which terms and how many times they appear.
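The matrix above can be reproduced with a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `document_term_matrix` is hypothetical, tokenization is plain whitespace splitting, and the column order is passed in explicitly so that it matches the table.

```python
from collections import Counter

def document_term_matrix(documents, vocabulary):
    """Return one row of raw term counts per document.

    Assumption: terms are whitespace-separated and case-sensitive.
    """
    matrix = []
    for doc in documents:
        counts = Counter(doc.split())          # term -> number of occurrences
        matrix.append([counts[term] for term in vocabulary])
    return matrix

docs = ["I like databases", "I hate hate databases"]
vocabulary = ["I", "like", "hate", "databases"]   # column order of the table
dtm = document_term_matrix(docs, vocabulary)

for name, row in zip(["D1", "D2"], dtm):
    print(name, row)
# D1 [1, 1, 0, 1]
# D2 [1, 0, 2, 1]
```

A real pipeline would usually lowercase the text, strip punctuation, and store the result as a sparse matrix, since most entries are zero for large vocabularies.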
Note that more sophisticated weights can be used; one typical example, among others, would be tf-idf.
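As an illustration of such a weighting, the sketch below applies one common tf-idf variant to the count matrix of the example. The specific formula (raw count multiplied by the natural logarithm of N/df, where N is the number of documents and df the number of documents containing the term) is an assumption; many other variants exist, and the function name `tf_idf` is hypothetical.

```python
import math

def tf_idf(count_matrix):
    """Weight raw counts by inverse document frequency: tf * ln(N / df)."""
    n_docs = len(count_matrix)
    n_terms = len(count_matrix[0])
    # Document frequency: in how many documents does each term occur?
    df = [sum(1 for row in count_matrix if row[j] > 0) for j in range(n_terms)]
    # Terms that never occur get weight 0 to avoid dividing by zero.
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in count_matrix]

# Columns: I, like, hate, databases (the example above).
counts = [[1, 1, 0, 1],
          [1, 0, 2, 1]]
weighted = tf_idf(counts)
```

Under this variant, "I" and "databases" occur in both documents and therefore receive weight 0, while "like" and "hate" keep a positive weight because each occurs in only one document.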
Choice of Terms
One way to view the matrix is that each row represents a document as a vector. In the vector space model, which is normally the one used to compute a document-term matrix, the goal is to represent the topic of a document by the frequency of semantically significant terms. The terms are semantic units of the documents. It is often assumed, for Indo-European languages, that nouns, verbs and adjectives are the more significant categories, and that words from those categories should be kept as terms.
Adding collocations as terms can improve the quality of the vectors, especially when computing similarities between documents.
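One crude way to find candidate collocations is to count adjacent word pairs and keep those that recur. The sketch below does only that. The frequency threshold stands in for a proper association measure, and the function name `frequent_bigrams`, the `min_count` parameter, and the toy documents are all hypothetical.

```python
from collections import Counter

def frequent_bigrams(documents, min_count=2):
    """Return adjacent word pairs that occur at least min_count times."""
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        counts.update(zip(tokens, tokens[1:]))   # consecutive token pairs
    return sorted(" ".join(pair) for pair, n in counts.items() if n >= min_count)

docs = [
    "strong tea with milk",
    "she drinks strong tea",
    "strong coffee and weak tea",
]
collocations = frequent_bigrams(docs)
print(collocations)
# ['strong tea']
```

Each pair returned this way can be appended to the vocabulary as an extra column of the document-term matrix, alongside the single-word terms.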
Improving search results
Latent semantic analysis (LSA, performing singular value decomposition on the document-term matrix) can improve search results by disambiguating polysemous words and searching for synonyms of the query. However, searching in the high-dimensional continuous space is much slower than searching the standard trie data structure of search engines.
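The synonym effect can be seen on a toy corpus. The sketch below assumes NumPy is available and uses a made-up three-document matrix (not taken from this article). It projects documents and a query onto the top k right singular vectors and compares cosine similarities before and after the projection.

```python
import numpy as np

# Hypothetical toy corpus; columns are car, automobile, engine, flower, petal.
dtm = np.array([
    [1, 0, 1, 0, 0],   # D1: "car engine"
    [0, 1, 1, 0, 0],   # D2: "automobile engine"
    [0, 0, 0, 1, 1],   # D3: "flower petal"
], dtype=float)

def cosine(a, b):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Thin SVD: dtm = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(dtm, full_matrices=False)

k = 2                          # number of latent dimensions kept (an assumption)
V_k = Vt[:k].T                 # terms x k projection matrix
doc_vectors = dtm @ V_k        # documents in the latent space
query = np.array([1, 0, 0, 0, 0], dtype=float)   # the one-word query "car"
query_vector = query @ V_k     # query projected the same way

raw_similarity = cosine(query, dtm[1])                       # query vs D2, term space
latent_similarity = cosine(query_vector, doc_vectors[1])     # query vs D2, latent space
```

In the raw term space the query "car" shares no term with D2, so their similarity is 0. In the two-dimensional latent space, "car" and "automobile" load on the same component because both co-occur with "engine", so the query becomes close to D2 while remaining unrelated to D3.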
Finding topics
Multivariate analysis of the document-term matrix can reveal topics/themes of the corpus. Specifically, latent semantic analysis and data clustering can be used, and more recently probabilistic latent semantic analysis and non-negative matrix factorization have been found to perform well for this task.
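As a rough illustration of non-negative matrix factorization on a document-term matrix, the sketch below uses the classic multiplicative update rule for the squared-error objective. It assumes NumPy is available. The function name `nmf`, the iteration count, the random seed, and the toy matrix are all hypothetical choices, and production code would normally rely on an established library instead.

```python
import numpy as np

def nmf(V, n_topics, n_iter=500, seed=0):
    """Approximate V (documents x terms) as W @ H with non-negative factors.

    W holds document-topic weights, H holds topic-term weights.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    W = rng.random((n_docs, n_topics)) + 0.1    # strictly positive start
    H = rng.random((n_topics, n_terms)) + 0.1
    eps = 1e-9                                   # guards against division by zero
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical counts; columns: database, query, garden, flower.
V = np.array([
    [2, 1, 0, 0],
    [4, 2, 0, 0],
    [0, 0, 1, 3],
    [0, 0, 2, 6],
], dtype=float)

W, H = nmf(V, n_topics=2)
error = np.linalg.norm(V - W @ H)    # Frobenius norm of the residual
```

Each row of `H` can be read as a topic by listing the terms with the largest weights, and each row of `W` shows how strongly a document expresses each topic.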
Implementations
- Gensim: Open source Python framework for vector space modelling. Contains memory-efficient algorithms for constructing term-document matrices from text plus common transformations (tf-idf, LSA, LDA).