Overcategorization
Overcategorization, overcategorisation or category clutter is the process of assigning too many categories, classes or index terms to a given document. Wikipedia has developed a set of principles concerning overcategorization (Wikipedia:overcategorization). Interestingly, the concept seems not to appear in the literature of library and information science (LIS), although it is clearly relevant for all kinds of document classification and indexing. In LIS some related concepts have been developed, for example exhaustivity of indexing and information overload, among others.

Basic principles

If too many categories are assigned to a given document, the implications for users depend on how informative the links are. If the user is able to distinguish between useful and useless links, the damage is limited: the user only wastes time selecting links. In many cases, however, the user cannot judge whether a given link will turn out to be fruitful, and has to follow the link and read or skim another document. The worst case is that, even after reading the new document, the user is unable to decide whether it would be useful if its subject matter were thoroughly investigated.

Overcategorization has another unpleasant implication: it makes the system (for example, Wikipedia) difficult to maintain consistently. If the system is inconsistent, a user who considers the links in a given category will not find all documents relevant to that category.

Basically, the problem of overcategorization should be understood from the perspective of relevance and the traditional measures of recall and precision. If too few relevant categories are assigned to a document, recall may decrease. If too many non-relevant categories are assigned, precision becomes lower. The hard job is to say which categories are fruitful or relevant for future use of the document.
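The trade-off between recall and precision can be illustrated with a small calculation. The sketch below is a simplified reading, not a method taken from the LIS literature: it treats the categories assigned to a single document as the "retrieved" set and assumes that the set of truly relevant categories is known in advance, which in practice is exactly the hard part. The function name precision_recall and the example category names are hypothetical.

```python
def precision_recall(assigned, relevant):
    """Return (precision, recall) for one document's category assignment.

    assigned: categories actually attached to the document.
    relevant: categories assumed (for illustration) to be truly relevant.
    """
    assigned, relevant = set(assigned), set(relevant)
    hits = assigned & relevant  # assigned categories that are also relevant
    precision = len(hits) / len(assigned) if assigned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: three categories are assumed to be relevant.
relevant = {"Indexing", "Information retrieval", "Library science"}

# Too few categories: everything assigned is relevant, but much is missing.
sparse = {"Indexing"}

# Too many categories: all relevant ones are present, plus six irrelevant ones.
cluttered = relevant | {
    "Computing", "Knowledge", "Terminology",
    "Concepts", "Wikipedia", "Science",
}

for name, cats in [("sparse", sparse), ("cluttered", cluttered)]:
    p, r = precision_recall(cats, relevant)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

Under these assumptions the sparse assignment scores perfect precision but low recall, while the cluttered assignment reaches full recall at the cost of precision, which is the overcategorization case.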

See also

  • Subject (documents)

  • Subject indexing

  • Information overload

  • Information pollution

  • Relevance

  • Subject indexing#Exhaustivity
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 