Document classification
Encyclopedia
Document classification or document categorization is a problem in both library science
Library science
Library science is an interdisciplinary or multidisciplinary field that applies the practices, perspectives, and tools of management, information technology, education, and other areas to libraries; the collection, organization, preservation, and dissemination of information resources; and the...

, information science
Information science
-Introduction:Information science is an interdisciplinary science primarily concerned with the analysis, collection, classification, manipulation, storage, retrieval and dissemination of information...

 and computer science
Computer science
Computer science or computing science is the study of the theoretical foundations of information and computation and of practical techniques for their implementation and application in computer systems...

. The task is to assign a document
Document
The term document has multiple meanings in ordinary language and in scholarship. WordNet 3.1. lists four meanings :* document, written document, papers...

 to one or more classes
Class (philosophy)
Philosophers sometimes distinguish classes from types and kinds. We can talk about the class of human beings, just as we can talk about the type , human being, or humanity...

 or categories
Categorization
Categorization is the process in which ideas and objects are recognized, differentiated and understood. Categorization implies that objects are grouped into categories, usually for some specific purpose. Ideally, a category illuminates a relationship between the subjects and objects of knowledge...

. This may be done "manually" (or "intellectually") or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is used mainly in information science and computer science. The problems are overlapping, however, and there is therefore also interdisciplinary research on document classification.

The documents to be classified may be texts, images, music, etc. Each kind of document possesses its special classification problems. When not otherwise specified, text classification is implied.

Documents may be classified according to their subjects
Subject (documents)
In library and information science documents are classified and searched by subject - as well as by other attributes such as author, genre and document type. This makes "subject" a fundamental term in this field. Library and information specialists assign subject labels to documents to make them...

 or according to other attributes (such as document type, author, printing year etc.). In the rest of this article only subject classification is considered. There are two main philosophies of subject classification of documents: The content based approach and the request based approach.

"Content based" versus "request based" classification

Content based classification is classification in which the weight given to particular subjects in a document determines the class to which the document is assigned. It is, for example, a rule in much library classification that at least 20% of the content of a book should be about the class to which the book is assigned. In automatic classification it could be the number of times given words appears in a document.

Request oriented classification (or -indexing) is classification in which the anticipated request from users is influencing how documents are being classified. The classifier ask himself: “Under which descriptors should this entity be found?” and “think of all the possible queries and decide for which ones the entity at hand is relevant” (Soergel, 1985, p. 230).

Request oriented classification may be classification that is targeted towards a particular audience or user group. For example, a library or a database for feminist studies may classify/index documents different compared to a historical library. It is probably better, however, to understand request oriented classification as policy based classification: The classification is done according to some ideals and reflects the purpose of the library or database doing the classification. In this way it is not necessarily a kind of classification or indexing based on user studies. Only if empirical data about use or users are applied should request oriented classification be regarded as a user-based approach.

Classification versus indexing

Sometimes a distinction is made between assigning documents to classes ("classification") versus assigning subjects
Subject (documents)
In library and information science documents are classified and searched by subject - as well as by other attributes such as author, genre and document type. This makes "subject" a fundamental term in this field. Library and information specialists assign subject labels to documents to make them...

 to documents ("subject indexing
Subject indexing
Subject indexing is the act of describing or classifying a document by index terms or other symbols in order to indicate what the document is about, to summarize its content or to increase its findability. In other words, it is about identifying and describing the subject of documents...

") but as Frederick Wilfrid Lancaster
Frederick Wilfrid Lancaster
Frederick Wilfrid Lancaster is a British-American information scientist. He immigrated to the USA in 1959; Worked as information specialist by the National Library of Medicine, Bethesda, Md., 1965–68; professor, University of Illinois, Urbana, 1972-92 Professor emeritus, U. Ill., Urbana, 1992-.F. W...

 has argued is this distinction not fruitful. "These terminological distinctions,” he writes, “are quite meaningless and only serve to cause confusion” (Lancaster, 2003, p. 21). The view that this distinction is purely superficial is also supported by the fact that a classification system may be transformed into a thesaurus
Thesaurus
A thesaurus is a reference work that lists words grouped together according to similarity of meaning , in contrast to a dictionary, which contains definitions and pronunciations...

 and vice versa (cf., Aitchison, 1986, 2004; Broughton, 2008; Riesthuis & Bliedung, 1991). Therefore is the act of labeling a document (say by assigning a term from a controlled vocabulary
Controlled vocabulary
Controlled vocabularies provide a way to organize knowledge for subsequent retrieval. They are used in subject indexing schemes, subject headings, thesauri, taxonomies and other form of knowledge organization systems...

 to a document) at the same time to assign that document to the class of documents indexed by that term (all documents indexed or classified as X belong to the same class of documents).

Automatic document classification

Automatic document classification tasks can be divided into two sorts: supervised document classification where some external mechanism (such as human feedback) provides information on the correct classification for documents, and unsupervised document classification (also known as document clustering
Document clustering
Document clustering is closely related to the concept of data clustering. Document clustering is a more specific technique for unsupervised document organization, automatic topic extraction and fast information retrieval or filtering.A web search engine often returns thousands of pages in...

), where the classification must be done entirely without reference to external information. There is also a semi-supervised document classification, where parts of the documents are labeled by the external mechanism.

Techniques

Automatic document classification techniques include:
  • Expectation maximization (EM)
  • Naive Bayes classifier
    Naive Bayes classifier
    A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong independence assumptions...

  • Tf-idf
  • Latent semantic indexing
    Latent semantic indexing
    Latent Semantic Indexing is an indexing and retrieval method that uses a mathematical technique called Singular value decomposition to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words...

  • Support vector machines (SVM)
  • Artificial neural network
    Artificial neural network
    An artificial neural network , usually called neural network , is a mathematical model or computational model that is inspired by the structure and/or functional aspects of biological neural networks. A neural network consists of an interconnected group of artificial neurons, and it processes...

  • K-nearest neighbour algorithms
    K-nearest neighbor algorithm
    In pattern recognition, the k-nearest neighbor algorithm is a method for classifying objects based on closest training examples in the feature space. k-NN is a type of instance-based learning, or lazy learning where the function is only approximated locally and all computation is deferred until...

  • Decision trees
    Decision tree learning
    Decision tree learning, used in statistics, data mining and machine learning, uses a decision tree as a predictive model which maps observations about an item to conclusions about the item's target value. More descriptive names for such tree models are classification trees or regression trees...

     such as ID3
    ID3 algorithm
    In decision tree learning, ID3 is an algorithm used to generate a decision tree invented by Ross Quinlan. ID3 is the precursor to the C4.5 algorithm.-Algorithm:The ID3 algorithm can be summarized as follows:...

     or C4.5
    C4.5 algorithm
    C4.5 is an algorithm used to generate a decision tree developed by Ross Quinlan. C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason, C4.5 is often referred to as a statistical classifier.-Algorithm:C4.5...

  • Concept Mining
    Concept Mining
    Concept mining is an activity that results in the extraction of concepts from artifacts. Solutions to the task typically involve aspects of artificial intelligence and statistics, such as data mining and text mining...

  • Rough set based classifier
  • Soft set based classifier
  • Natural language processing
    Natural language processing
    Natural language processing is a field of computer science and linguistics concerned with the interactions between computers and human languages; it began as a branch of artificial intelligence....

     approaches

Applications

Classification techniques have been applied to
  • spam filtering, a process which tries to discern E-mail spam
    E-mail spam
    Email spam, also known as junk email or unsolicited bulk email , is a subset of spam that involves nearly identical messages sent to numerous recipients by email. Definitions of spam usually include the aspects that email is unsolicited and sent in bulk. One subset of UBE is UCE...

     messages from legitimate emails
  • topic spotting, automatically determining the topic of a text
    • email routing, sending an email sent to a general address to a specific address or mailbox depending on topic
  • language guessing
    Language guessing
    Language identification or language guessing is the process of automatically determining the language a document or piece of text is written in....

    , automatically determining the language of a text
  • genre classification, automatically determining the genre of a text

See also

  • Categorization
    Categorization
    Categorization is the process in which ideas and objects are recognized, differentiated and understood. Categorization implies that objects are grouped into categories, usually for some specific purpose. Ideally, a category illuminates a relationship between the subjects and objects of knowledge...

  • Class (philosophy)
    Class (philosophy)
    Philosophers sometimes distinguish classes from types and kinds. We can talk about the class of human beings, just as we can talk about the type , human being, or humanity...

  • Classification (disambiguation)
  • Compound term processing
    Compound term processing
    Compound term processing is the name that is used for a category of techniques in Information retrieval applications that performs matching on the basis of compound terms...

  • Content-based image retrieval
    Content-based image retrieval
    Content-based image retrieval , also known as query by image content and content-based visual information retrieval is the application of computer vision techniques to the image retrieval problem, that is, the problem of searching for digital images in large databases....

  • Document
    Document
    The term document has multiple meanings in ordinary language and in scholarship. WordNet 3.1. lists four meanings :* document, written document, papers...

  • Supervised learning
    Supervised learning
    Supervised learning is the machine learning task of inferring a function from supervised training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object and a desired output value...

    , unsupervised learning
    Unsupervised learning
    In machine learning, unsupervised learning refers to the problem of trying to find hidden structure in unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution...

  • Document retrieval
    Document retrieval
    Document retrieval is defined as the matching of some stated user query against a set of free-text records. These records could be any type of mainly unstructured text, such as newspaper articles, real estate records or paragraphs in a manual...

  • Document clustering
    Document clustering
    Document clustering is closely related to the concept of data clustering. Document clustering is a more specific technique for unsupervised document organization, automatic topic extraction and fast information retrieval or filtering.A web search engine often returns thousands of pages in...


  • Information retrieval
    Information retrieval
    Information retrieval is the area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web...

  • Knowledge organization
    Knowledge organization
    The term knowledge organization designates a field of study related to Library and Information Science . In this meaning, KO is about activities such as document description, indexing and classification performed in libraries, databases, archives etc...

  • Knowledge Organization System
  • Library classification
    Library classification
    A library classification is a system of coding and organizing documents or library materials according to their subject and allocating a call number to that information resource...

  • Machine learning
    Machine learning
    Machine learning, a branch of artificial intelligence, is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data, such as from sensor data or databases...

  • String metrics
  • Subject (documents)
    Subject (documents)
    In library and information science documents are classified and searched by subject - as well as by other attributes such as author, genre and document type. This makes "subject" a fundamental term in this field. Library and information specialists assign subject labels to documents to make them...

  • Subject indexing
    Subject indexing
    Subject indexing is the act of describing or classifying a document by index terms or other symbols in order to indicate what the document is about, to summarize its content or to increase its findability. In other words, it is about identifying and describing the subject of documents...

  • Text mining
    Text mining
    Text mining, sometimes alternately referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as...

    , web mining
    Web mining
    Web mining - is the application of data mining techniques to discover patterns from the Web.According to analysis targets, web mining can be divided into three different types, which are Web usage mining, Web content mining and Web structure mining.-Web usage mining:Web usage mining is the process...

    , concept mining
    Concept Mining
    Concept mining is an activity that results in the extraction of concepts from artifacts. Solutions to the task typically involve aspects of artificial intelligence and statistics, such as data mining and text mining...

  • RapidMiner - open source text mining
    Text mining
    Text mining, sometimes alternately referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as...

     software used for document classification, e-mail spam detection, e-mail routing, text sentiment analysis
    Sentiment analysis
    Sentiment analysis or opinion mining refers to the application of natural language processing, computational linguistics, and text analytics to identify and extract subjective information in source materials....

    , and other text classification tasks.


Further reading

Publications:

References:


Data sets:
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK