Co-occurrence networks
Co-occurrence networks are generally used to provide a graphic visualization of potential relationships between people, organizations, concepts or other entities represented within written material. The generation and visualization of co-occurrence networks has become practical with the advent of electronically stored text amenable to text mining.
By way of definition, co-occurrence networks are the collective interconnection of terms based on their paired presence within a specified unit of text. Networks are generated by connecting pairs of terms using a set of criteria defining co-occurrence. For example, terms A and B may be said to “co-occur” if they both appear in a particular article. Another article may contain terms B and C. Linking A to B and B to C creates a co-occurrence network of these three terms. Rules to define co-occurrence within a text corpus can be set according to desired criteria. For example, a more stringent criterion for co-occurrence may require a pair of terms to appear in the same sentence.
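The A, B, C example can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name cooccurrence_edges and its unit parameter are hypothetical, terms are single case-sensitive words, and sentences are split crudely on ".", "!" and "?".

```python
from collections import Counter
from itertools import combinations
import re

def cooccurrence_edges(texts, terms, unit="document"):
    """Count how often each pair of terms shares a unit of text.

    unit="document" treats every text as one unit; unit="sentence"
    first splits each text on ., ! or ? (a crude, assumed splitter).
    """
    counts = Counter()
    for text in texts:
        units = re.split(r"[.!?]+", text) if unit == "sentence" else [text]
        for chunk in units:
            words = set(re.findall(r"\w+", chunk))
            present = sorted(t for t in terms if t in words)
            # Every unordered pair of terms found in this unit is one co-occurrence.
            counts.update(combinations(present, 2))
    return counts

terms = {"A", "B", "C"}
articles = ["A works with B.", "B reports to C."]
edges = cooccurrence_edges(articles, terms)   # links A-B and B-C, once each

# The stricter sentence-level rule drops pairs that only share an article.
loose = cooccurrence_edges(["A met B. C left."], terms)
strict = cooccurrence_edges(["A met B. C left."], terms, unit="sentence")
```

In the last two calls, the article-level rule links all three terms to one another, whereas the sentence-level rule keeps only the A-B pair.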
Co-occurrence networks can be created for any given list of terms (any dictionary) in relation to any collection of texts (any text corpus). Co-occurring pairs of terms can be called “neighbors” and these often group into “neighborhoods” based on their interconnections. Individual terms may have several neighbors. Neighborhoods may connect to one another through at least one individual term or may remain unconnected.
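One possible reading of these notions, offered as an assumption rather than a standard definition, treats a term's neighborhood as the term together with its direct neighbors. Two neighborhoods are then connected when they share at least one term. The helper names neighbor_map and neighborhood below are hypothetical.

```python
def neighbor_map(edges):
    """Map each term to the set of terms it co-occurs with."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency

def neighborhood(adjacency, term):
    """Assumed definition: the term plus its direct neighbors."""
    return {term} | adjacency.get(term, set())

adjacency = neighbor_map([("A", "B"), ("B", "C"), ("X", "Y")])

around_a = neighborhood(adjacency, "A")   # {"A", "B"}
around_c = neighborhood(adjacency, "C")   # {"B", "C"}
around_x = neighborhood(adjacency, "X")   # {"X", "Y"}

shared = around_a & around_c              # {"B"}: B connects the two neighborhoods
isolated = around_a.isdisjoint(around_x)  # True: no common term
```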
Individual terms are, within the context of text mining, symbolically represented as text strings. In the real world, the entity identified by a term normally has several symbolic representations. It is therefore useful to consider terms as being represented by one primary symbol and up to several synonymous alternative symbols. Occurrence of an individual term is established by searching for each known symbolic representation of the term. The process can be augmented through NLP (natural language processing) algorithms that interrogate segments of text for possible alternatives such as word order, spacing and hyphenation. NLP can also be used to identify sentence structure and categorize text strings according to grammar (for example, categorizing a string of text as a noun based on a preceding string of text known to be an article).
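A minimal sketch of dictionary-based term detection follows. It assumes a hypothetical dictionary that maps each primary symbol to a list of alternative symbols, and it handles only case, spacing and hyphenation variants by normalizing the text. It does not attempt word-order variants or grammatical tagging, which would need a real NLP toolkit.

```python
import re

def normalize(text):
    """Lowercase, and treat hyphens and runs of whitespace as one space."""
    return re.sub(r"[\s\-]+", " ", text.lower()).strip()

def find_terms(text, dictionary):
    """Return the primary symbols whose primary or alternative symbol occurs in text.

    dictionary maps a primary symbol to a list of synonymous symbols
    (an assumed layout, chosen for illustration).
    """
    haystack = normalize(text)
    found = set()
    for primary, synonyms in dictionary.items():
        for symbol in [primary, *synonyms]:
            # Word boundaries stop "mail" from matching inside "mailbox".
            pattern = r"\b" + re.escape(normalize(symbol)) + r"\b"
            if re.search(pattern, haystack):
                found.add(primary)
                break
    return found

dictionary = {"e-mail": ["email"], "database": ["data base"]}
hits = find_terms("The data-base rejected my E  mail.", dictionary)
```

Here both terms are reported under their primary symbols even though neither appears in the text in its primary spelling.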
Graphic representation of co-occurrence networks allows them to be visualized and inferences to be drawn regarding relationships between entities in the domain represented by the dictionary of terms applied to the text corpus. Meaningful visualization normally requires simplifications of the network. For example, networks may be drawn such that the number of neighbors connecting to each term is limited. The criteria for limiting neighbors might be based on the absolute number of co-occurrences or more subtle criteria such as “probability” of co-occurrence or the presence of an intervening descriptive term.
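As an illustration of the simplest criterion, the absolute number of co-occurrences, the following sketch keeps a link only if it ranks among the strongest links of both of its terms, so that no term retains more than a fixed number of neighbors. The function name prune and the parameters max_neighbors and min_count are assumptions made for this example. Probability-based or descriptor-based criteria are not covered.

```python
from collections import defaultdict

def prune(counts, max_neighbors=2, min_count=1):
    """Limit every term to at most max_neighbors links.

    counts maps (term_a, term_b) pairs to co-occurrence counts.
    An edge survives only if it is among the max_neighbors strongest
    edges of BOTH endpoints and occurs at least min_count times.
    """
    ranked = defaultdict(list)
    for (a, b), n in counts.items():
        if n >= min_count:
            ranked[a].append((n, b))
            ranked[b].append((n, a))
    # For each term, the names of its strongest partners (ties broken by name).
    top = {
        term: {other for _, other in sorted(links, reverse=True)[:max_neighbors]}
        for term, links in ranked.items()
    }
    return {
        (a, b) for (a, b) in counts
        if b in top.get(a, set()) and a in top.get(b, set())
    }

counts = {("A", "B"): 5, ("A", "C"): 3, ("A", "D"): 1, ("B", "C"): 2}
kept = prune(counts, max_neighbors=2)
```

With these counts, the weak A-D link is dropped because D is not among the two strongest partners of A. Requiring the link to rank highly for both endpoints is a deliberately strict design choice. Accepting a link that ranks highly for either endpoint would keep more of the network but would no longer guarantee the per-term limit.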
Quantitative aspects of the underlying structure of a co-occurrence network might also be informative, such as the overall number of connections between entities, the clustering of entities representing sub-domains, or the detection of synonyms.
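Two of these quantities can be computed directly from the list of links. The sketch below reports the number of distinct connections, the number of neighbors per term, and the local clustering coefficient, a common graph measure used here as one assumed way of quantifying "clustering". The function name summarize is hypothetical.

```python
from itertools import combinations

def summarize(edges):
    """Return edge count, per-term degree and local clustering coefficient."""
    edges = [tuple(e) for e in edges]
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    degree = {term: len(nbrs) for term, nbrs in adjacency.items()}
    clustering = {}
    for term, nbrs in adjacency.items():
        pairs = list(combinations(sorted(nbrs), 2))
        # Fraction of a term's neighbor pairs that are themselves linked.
        linked = sum(1 for x, y in pairs if y in adjacency[x])
        clustering[term] = linked / len(pairs) if pairs else 0.0
    return {
        "edges": len({frozenset(e) for e in edges}),
        "degree": degree,
        "clustering": clustering,
    }

stats = summarize([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")])
```

In this small network, A, B and C form a tightly linked group, so A and B each have a clustering coefficient of 1.0, while D, which hangs off C alone, has 0.0.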
Some working applications of the co-occurrence approach are available to the public through the internet. PubGene is an example of an application that addresses the interests of the biomedical community by presenting networks based on the co-occurrence of genetics-related terms as these appear in MEDLINE records. The web site Name Base is an example of how human relationships can be inferred by examining networks constructed from the co-occurrence of personal names in newspapers and other texts (as in Ozgur et al.).
Networks of information are also used to facilitate efforts to organize and focus publicly available information for law enforcement and intelligence purposes (so-called "open source intelligence" or OSINT). Related techniques include co-citation networks as well as the analysis of hyperlink and content structure on the internet (such as in the analysis of web sites connected to terrorism).
See also Takada et al. and Liu