Stemming
In linguistic morphology and information retrieval, stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since 1968. Many search engines treat words with the same stem as synonyms, as a kind of query broadening, a process called conflation.

Stemming programs are commonly referred to as stemming algorithms or stemmers.
Examples
A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty" etc.) as based on the root "cat", and "stemmer", "stemming", "stemmed" as based on "stem". A stemming algorithm reduces the words "fishing", "fished", "fish", and "fisher" to the root word, "fish".
History
The first ever published stemmer was written by Julie Beth Lovins in 1968. This paper was remarkable for its early date and had great influence on later work in this area.
A later stemmer was written by Martin Porter and was published in the July 1980 issue of the journal Program. This stemmer was very widely used and became the de facto standard algorithm for English stemming. Dr. Porter received the Tony Kent Strix award in 2000 for his work on stemming and information retrieval.
Many implementations of the Porter stemming algorithm were written and freely distributed; however, many of these implementations contained subtle flaws. As a result, these stemmers did not match their potential. To eliminate this source of error, Martin Porter released an official free-software implementation of the algorithm around the year 2000. He extended this work over the next few years by building Snowball, a framework for writing stemming algorithms, and implemented an improved English stemmer together with stemmers for several other languages.
Algorithms
There are several types of stemming algorithms which differ with respect to performance and accuracy and how certain stemming obstacles are overcome.

Brute-force algorithms
Brute force stemmers employ a lookup table which contains relations between root forms and inflected forms. To stem a word, the table is queried to find a matching inflection. If a matching inflection is found, the associated root form is returned.
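As a rough illustration, such a stemmer can be sketched in a few lines of Python; the table entries below are invented examples rather than the contents of any published stemmer:

```python
# A minimal sketch of a brute force stemmer backed by a lookup table.
# The entries are invented for illustration.
LOOKUP_TABLE = {
    "cats": "cat",
    "catlike": "cat",
    "running": "run",
    "ran": "run",        # irregular inflections are just extra rows
    "fishing": "fish",
    "fished": "fish",
}

def stem(word: str) -> str:
    """Return the root form if the inflection is in the table, else the word unchanged."""
    return LOOKUP_TABLE.get(word.lower(), word)

print(stem("running"))  # -> run
print(stem("ran"))      # -> run
print(stem("table"))    # -> table (unknown words pass through unchanged)
```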
Brute force approaches are criticized for their general lack of elegance in that no algorithm is applied that would more quickly converge on a solution. In other words, there are more operations performed during the search than should be necessary. Brute force searches consume immense amounts of storage to host the list of relations (relative to the task). The algorithm is only accurate to the extent that the inflected form already exists in the table. Given the number of words in a given language, like English, it is unrealistic to expect that all word forms can be captured and manually recorded by human action alone. Manual training of the algorithm is overly time-intensive and the ratio between the effort and the increase in accuracy is marginal at best.
Brute force algorithms do overcome some of the challenges faced by the other approaches. Not all inflected word forms in a given language "follow the rules" appropriately. While "running" might be easy to stem to "run" in a suffix stripping approach, the alternate inflection, "ran", is not. Suffix stripping algorithms are somewhat powerless to overcome this problem, short of increasing the number and complexity of the rules, whereas brute force algorithms only require storing a single extra relation between "run" and "ran". That said, this assumes someone bothered to store the relation in the first place, and one of the major problems of improving brute force algorithms is the coverage of the language.
Brute force algorithms are very difficult to design given the immense number of relations that must initially be stored to produce an acceptable level of accuracy (the number can span well into the millions). However, brute force algorithms are easy to improve in that decreasing the stemming error is only a matter of adding more relations to the table. Someone with only minor experience in linguistics is capable of improving the algorithm, unlike the suffix stripping approaches, which require a solid background in linguistics.
For technical accuracy, some programs may use suffix stripping to generate the lookup table given a text corpus, and then only consult the lookup table when stemming. This is not regarded as a brute force approach, although a lookup table is involved.
A brute force algorithm may also be improved by building it upon preliminary part-of-speech tagging. In this case the lookup table includes suffixes and endings associated with a specific part of speech, which significantly reduces the number of overstemming errors.
The production technique
Some programs attempt to automatically generate the table of root and inflected forms. A production algorithm attempts to infer the probable inflections for a given word. For example, if the word is "run", then the algorithm might automatically generate the forms "running" and "runs". In the traditional sense of the concept of stemming, this algorithm is its reverse process. Rather than trying to remove suffixes, the goal of a production algorithm is to generate them. Later, a brute force algorithm can simply query the automatically generated table of word relations to find the root form of a word.

There are many types of heuristic, experimental techniques for identifying inflected forms of words. Some algorithms work phonetically, looking at the final syllables in a word. Some are rather brute force, using rules that seem a lot like normalization rules, inspecting the last few characters. Others are similar to the process of lemmatisation, which takes advantage of additional knowledge about the part of speech of the given word to limit what types of suffixes are considered when generating inflections for the word.
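A minimal sketch of a production algorithm might look as follows; the suffix rules and the consonant-doubling heuristic are illustrative assumptions, and (as the drawback paragraph below notes) some generated forms will not be real words:

```python
# A minimal sketch of a production algorithm with naive English suffix rules.
def produce_inflections(root: str) -> set[str]:
    """Generate probable inflected forms for a given root word."""
    forms = {root + "s", root + "ed", root + "ing"}
    # Words like "run" double the final consonant before "ing": "running".
    if len(root) >= 2 and root[-1] not in "aeiou" and root[-2] in "aeiou":
        forms.add(root + root[-1] + "ing")
    return forms

# Invert the generated relations into a table usable by a brute force stemmer.
table = {form: "run" for form in produce_inflections("run")}
print(table)  # e.g. {'runs': 'run', 'runed': 'run', 'runing': 'run', 'running': 'run'}
```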
Distinguishing this technique from the other approaches is not very important. The other approaches share much of the same logic, but are simply working in the opposite direction. Some stemming approaches may employ both a production technique to generate inflections, and a suffix stripping approach to identify root forms together.
One drawback of the production technique is that there is no guarantee the inflection is real. For example, a technique might join together "run" and "ly" to create the word "runly". "runly" is not a real word. Granted, it is highly unlikely that "runly" will ever be used as input to a stemming algorithm since it is not a real word, so this error may not affect the result and therefore the accuracy of a stemming algorithm. However, such errors do populate the table of relations, and each additional entry can increase the time required to perform the lookup of an inflection to find its corresponding root form.
Suffix-stripping algorithms
Suffix stripping algorithms do not rely on a lookup table that consists of inflected forms and root form relations. Instead, a typically smaller list of "rules" is stored which provides a path for the algorithm, given an input word form, to find its root form. Some examples of the rules include:
- if the word ends in 'ed', remove the 'ed'
- if the word ends in 'ing', remove the 'ing'
- if the word ends in 'ly', remove the 'ly'
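A toy sketch implementing only the three rules above; real stemmers such as the Porter stemmer use many more rules, with conditions on what remains after stripping:

```python
# A minimal suffix stripping stemmer with the three rules listed above.
SUFFIXES = ("ed", "ing", "ly")

def strip_suffix(word: str) -> str:
    for suffix in SUFFIXES:
        # Require a reasonably long remainder so "red" is not stripped to "r".
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(strip_suffix("friendly"))  # -> friend
print(strip_suffix("fished"))    # -> fish
print(strip_suffix("ran"))       # -> ran (irregular forms are untouched)
```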
Suffix stripping approaches enjoy the benefit of being much simpler to maintain than brute force algorithms, assuming the maintainer is sufficiently knowledgeable in the challenges of linguistics and morphology and in encoding suffix stripping rules. Suffix stripping algorithms are sometimes regarded as crude given their poor performance when dealing with exceptional relations (like 'ran' and 'run'). The solutions produced by suffix stripping algorithms are limited to those lexical categories which have well-known suffixes with few exceptions. This, however, is a problem, as not all parts of speech have such a well-formulated set of rules. Lemmatisation attempts to improve upon this challenge.
Additional algorithm criteria
Suffix stripping algorithms may differ in results for a variety of reasons. One such reason is whether the algorithm constrains whether the output word must be a real word in the given language. Some approaches do not require the word to actually exist in the language lexicon (the set of all words in the language). Alternatively, some suffix stripping approaches maintain a database (a large list) of all known morphological word roots that exist as real words. These approaches check the list for the existence of the term prior to making a decision. Typically, if the term does not exist, alternate action is taken. This alternate action may involve several other criteria. The non-existence of an output term may serve to cause the algorithm to try alternate suffix stripping rules. It can be the case that two or more suffix stripping rules apply to the same input term, creating an ambiguity as to which rule to apply. The algorithm may assign (by human hand or stochastically) a priority to one rule or another. Or the algorithm may reject one rule application because it results in a non-existent term whereas the other overlapping rule does not. For example, given the English term friendlies, the algorithm may identify the ies suffix and apply the appropriate rule and achieve the result of friendl. friendl is likely not found in the lexicon, and therefore the rule is rejected.

One improvement upon basic suffix stripping is the use of suffix substitution. Similar to a stripping rule, a substitution rule replaces a suffix with an alternate suffix. For example, there could exist a rule that replaces ies with y. How this affects the algorithm varies with the algorithm's design. To illustrate, the algorithm may identify that both the ies suffix stripping rule and the suffix substitution rule apply. Since the stripping rule results in a non-existent term in the lexicon, but the substitution rule does not, the substitution rule is applied instead. In this example, friendlies becomes friendly instead of friendl. Diving further into the details, a common technique is to apply rules in a cyclical fashion (recursively, as computer scientists would say). After applying the suffix substitution rule in this example scenario, a second pass is made to identify matching rules on the term friendly, where the ly stripping rule is likely identified and accepted. In summary, friendlies becomes (via substitution) friendly, which becomes (via stripping) friend.
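A sketch of lexicon-checked suffix substitution in the spirit of this example; the tiny lexicon and rule list are illustrative assumptions:

```python
# A minimal sketch mirroring friendlies -> friendly -> friend.
LEXICON = {"friend", "friendly", "run", "fish"}

# (suffix, replacement) pairs; an empty replacement is a plain stripping rule.
RULES = [("ies", ""), ("ies", "y"), ("ly", "")]

def stem(word: str) -> str:
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in LEXICON:       # reject rules that leave the lexicon
                return stem(candidate)     # re-apply rules in a second pass
    return word

print(stem("friendlies"))  # -> friend (via the intermediate form friendly)
```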
This example also helps illustrate the difference between a rule-based approach and a brute force approach. In a brute force approach, the algorithm would search for friendlies in the set of hundreds of thousands of inflected word forms and ideally find the corresponding root form friend. In the rule-based approach, the three rules mentioned above are applied in succession to converge on the same solution. Chances are that the rule-based approach is faster.
Lemmatisation algorithms
A more complex approach to the problem of determining a stem of a word is lemmatisation. This process involves first determining the part of speech of a word, and applying different normalization rules for each part of speech. The part of speech is detected prior to attempting to find the root since, for some languages, the stemming rules change depending on a word's part of speech.
This approach is highly conditional upon obtaining the correct lexical category (part of speech). While there is overlap between the normalization rules for certain categories, identifying the wrong category or being unable to produce the right category limits the added benefit of this approach over suffix stripping algorithms. The basic idea is that, if we are able to grasp more information about the word to be stemmed, then we are able to more accurately apply normalization rules (which are, more or less, suffix stripping rules).
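The idea can be sketched as follows; the toy tagger and rule table are assumptions for illustration, whereas real lemmatisers rely on trained part-of-speech taggers and far richer rule sets:

```python
# A minimal sketch of POS-conditioned normalization.
POS_RULES = {
    "VERB": [("ing", ""), ("ed", "")],
    "NOUN": [("s", "")],
}

def toy_pos_tag(word: str) -> str:
    # Hypothetical stand-in for a real part-of-speech tagger.
    return "VERB" if word.endswith(("ing", "ed")) else "NOUN"

def lemmatise(word: str) -> str:
    for suffix, replacement in POS_RULES.get(toy_pos_tag(word), []):
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

print(lemmatise("fishing"))  # -> fish (tagged VERB, so the 'ing' rule applies)
print(lemmatise("cats"))     # -> cat  (tagged NOUN, so the plural 's' rule applies)
```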
Stochastic algorithms
Stochastic algorithms involve using probability to identify the root form of a word. Stochastic algorithms are trained (they "learn") on a table of root form to inflected form relations in order to develop a probabilistic model. The model is typically expressed in the form of complex linguistic rules, similar in nature to those in suffix stripping or lemmatisation. Stemming is performed by inputting an inflected form to the trained model and having the model produce the root form according to its internal ruleset. This again resembles suffix stripping and lemmatisation, except that the decisions involved (which rule is most appropriate, whether to stem the word at all or simply return it unchanged, whether to apply two different rules sequentially) are made on the grounds that the output word will have the highest probability of being correct (which is to say, the smallest probability of being incorrect, which is how it is typically measured).
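As an illustration, stochastic rule selection might be sketched as follows; the rule probabilities are invented, standing in for values that would be learned from a training table:

```python
# A minimal sketch of choosing the stemming decision with the highest
# probability of being correct.
RULE_PROBS = {
    ("ies", "y"): 0.85,  # P(output is correct | this substitution applies)
    ("ies", ""): 0.10,
    ("s", ""): 0.70,
}
P_NO_CHANGE = 0.30       # leaving the word unstemmed is also a candidate decision

def stochastic_stem(word: str) -> str:
    candidates = [(P_NO_CHANGE, word)]
    for (suffix, replacement), p in RULE_PROBS.items():
        if word.endswith(suffix):
            candidates.append((p, word[: -len(suffix)] + replacement))
    return max(candidates)[1]  # choose the output most likely to be correct

print(stochastic_stem("ponies"))  # -> pony (the 0.85 rule beats the others)
```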
Some lemmatisation algorithms are stochastic in that, given a word which may belong to multiple parts of speech, a probability is assigned to each possible part. This may take into account the surrounding words, called the context, or not. Context-free grammars do not take into account any additional information. In either case, after assigning the probabilities to each possible part of speech, the most likely part of speech is chosen, and from there the appropriate normalization rules are applied to the input word to produce the normalized (root) form.
n-gram analysis
To further explain, consider the well-known technique of n-gram analysis. Here the term gram is a unit of measurement of text that may refer to a single character, a single syllable, or a single word. Which one applies depends on which author or programmer is describing the algorithm (and their background: a linguist is more likely to be referring to syllables, a computer scientist to characters, and a software company to words). The prefix n, as usual in computer science jargon, represents 'any number of' or 'a variable number of'. There are frequently used subsets of n-grams, such as bigrams (a.k.a. digrams, bi-grams, di-grams), representing a sequence of two grams (two characters, two words, or two syllables in a row, consecutively in the text). Trigrams (three units) are also popular.

In language modeling, one could write a software program that stores a large table of all word bigrams found in a large body of text (called a corpus). The algorithm would scan word by word, like a sliding window, across the text from the beginning, storing two-word sequences as it moves along until the last word of the last document is reached. The output is what is called a bigrams table. Each row in the table may be called a tuple, storing the first word and then the second word. Additional characteristics may be stored about each bigram, such as its frequency (the total count of the times the whole bigram was discovered) in whatever body of texts was used to generate the table. The table may also store the frequency of each individual word. For example, this sentence contains bigrams like "for example" and "this sentence".
From the table an algorithm can garner the probability of the second word following the first word by assessing each word's frequency and the frequency of the bigrams and perhaps other criteria like the total number of bigrams or the total number of bigrams where the first word is the same. For example, one would be able to state conclusions like: the presence of the word post indicates a probability of the following word being office of 50%, because the sequence occurred this way 50% of the time in the text (where post occurred with and without office following it).
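A short sketch of both steps, building the bigram table from a toy corpus and then estimating the conditional probability; the corpus is invented for illustration:

```python
# A minimal sketch of a word-bigram table and P(second word | first word).
from collections import Counter

corpus = "the post office is near the post box and the post office is open".split()

bigrams = Counter(zip(corpus, corpus[1:]))  # frequency of each two-word sequence
unigrams = Counter(corpus)                  # frequency of each individual word

def prob_next(first: str, second: str) -> float:
    """Estimate the probability that `second` follows `first`."""
    return bigrams[(first, second)] / unigrams[first]

print(prob_next("post", "office"))  # -> 0.666..., since "post" is followed by
                                    #    "office" in 2 of its 3 occurrences
```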
In the bigram relationship, the first word can be understood as the context that describes the second word. It qualifies the second word, providing additional insight into its meaning. Humans quite frequently examine the context of surrounding words to determine the meaning of a written word. Words can have multiple meanings, called senses. The bigram-provided context provides a way to distinguish which sense of the word is in use (most of the time, at least; literary puns and the like are the obvious exception). As mentioned elsewhere, the stemming algorithm can use this context as additional information to assist in determining the outcome of the algorithm (whether or not the current word is stemmed, which of one or more possible stems is appropriate, whether to use substitution or stripping, whether the word is a verb or noun or some other lexical category, etc.). Because the algorithm uses probabilities in determining the outcome, it is a stochastic algorithm.
Entering the realm of stochastic algorithms may greatly complicate the stemming algorithm, making it much harder to develop, distribute, and maintain, and there is considerable debate over the benefits. If a simple suffix stripper can reach about 80% accuracy after a few days or weeks of programming, how much more valuable is a context-driven n-gram stemmer that achieves 90% accuracy after several months of work? There are also many problems with stochastic stemmers. What if the body of text used to derive the probabilities of the bigrams is not a representative sample of all documents ever written in a given language (and it is likely not)? What if the algorithm's bigram table becomes corrupted by learning from 'bad text'? Reconfiguring a brute force algorithm that employs a lookup table is as simple as removing a row from the table; but how does one reconfigure a probabilistic network (a problem analogous to some of those behind neural networks)? Such obstacles are researched in computer science.
Hybrid approaches
Hybrid approaches use two or more of the approaches described above in unison. A simple example is a suffix tree algorithm which first consults a lookup table using brute force. However, instead of trying to store the entire set of relations between words in a given language, the lookup table is kept small and is only used to store a minute amount of "frequent exceptions" like "ran => run". If the word is not in the exception list, apply suffix stripping or lemmatisation and output the result.
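A minimal sketch of this hybrid scheme, with an illustrative exception table and suffix list:

```python
# A small exception table consulted first, with suffix stripping as fallback.
EXCEPTIONS = {"ran": "run", "geese": "goose"}

def hybrid_stem(word: str) -> str:
    if word in EXCEPTIONS:                 # brute force lookup for exceptions
        return EXCEPTIONS[word]
    for suffix in ("ing", "ed", "s"):      # otherwise strip a common suffix
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(hybrid_stem("ran"))      # -> run  (from the exception table)
print(hybrid_stem("fishing"))  # -> fish (via suffix stripping)
```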
Affix stemmers
In linguistics, the term affix refers to either a prefix or a suffix. In addition to dealing with suffixes, several approaches also attempt to remove common prefixes. For example, given the word indefinitely, identify that the leading "in" is a prefix that can be removed. Many of the same approaches mentioned earlier apply, but go by the name affix stripping. A study of affix stemming for several European languages can be found here.
Matching algorithms
Such algorithms use a stem database (for example a set of documents that contain stem words). These stems, as mentioned above, are not necessarily valid words themselves (but rather common sub-strings, as the "brows" in "browse" and in "browsing"). In order to stem a word the algorithm tries to match it with stems from the database, applying various constraints, such as on the relative length of the candidate stem within the word (so that, for example, the short prefix "be", which is the stem of such words as "be", "been" and "being", would not be considered as the stem of the word "beside").
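A rough sketch of such a matching step; the stem database and the 0.5 relative-length threshold are illustrative assumptions:

```python
# A minimal matching algorithm: longest database stem that begins the word,
# subject to a relative-length constraint.
STEM_DB = {"brows", "run", "be", "friend"}

def match_stem(word: str):
    candidates = [s for s in STEM_DB if word.startswith(s)]
    # Constrain relative length so "be" is not accepted as the stem of "beside".
    candidates = [s for s in candidates if len(s) / len(word) >= 0.5]
    return max(candidates, key=len) if candidates else None

print(match_stem("browsing"))  # -> brows (5/8 of the word, so it passes)
print(match_stem("beside"))    # -> None  (the short stem "be" is rejected)
```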
Language challenges
While much of the early academic work in this area was focused on the English language (with significant use of the Porter Stemmer algorithm), many other languages have been investigated.

Hebrew and Arabic are still considered difficult research languages for stemming. English stemmers are fairly trivial (with only occasional problems, such as "dries" being the third-person singular present form of the verb "dry", or "axes" being the plural of "axe" as well as "axis"); but stemmers become harder to design as the morphology, orthography, and character encoding of the target language become more complex. For example, an Italian stemmer is more complex than an English one (because of more possible verb inflections), a Russian one is more complex still (more possible noun declensions), and a Hebrew one even more so (due to non-catenative morphology, a writing system without vowels, and the requirement of prefix stripping; Hebrew stems can be two, three or four characters, but not more). A stemmer for Hungarian, by contrast, would be easier to implement due to the language's precise rules for flexion.
Multilingual stemming
Multilingual stemming applies morphological rules of two or more languages simultaneously instead of rules for only a single language when interpreting a search query. Commercial systems using multilingual stemming exist.

Error metrics
There are two error measurements in stemming algorithms, overstemming and understemming. Overstemming is an error where two separate inflected words are stemmed to the same root, but should not have been—a false positive. Understemming is an error where two separate inflected words should be stemmed to the same root, but are not—a false negative. Stemming algorithms attempt to minimize each type of error, although reducing one type can lead to increasing the other.
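A toy sketch of how these two error types might be counted against hand-labelled word pairs; the pairs and the deliberately crude stemmer are invented so that each error type occurs once:

```python
# Counting overstemming (false positives) and understemming (false negatives).
pairs = [("run", "running", True), ("universe", "university", False)]

def crude_stem(word: str) -> str:
    return word[:5]  # truncation, purely to demonstrate both error types

overstems = understems = 0
for a, b, same_root in pairs:
    merged = crude_stem(a) == crude_stem(b)
    if merged and not same_root:
        overstems += 1    # false positive: unrelated words conflated
    elif not merged and same_root:
        understems += 1   # false negative: related words kept apart
print(overstems, understems)  # -> 1 1
```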
Information retrieval
Stemmers are common elements in query systems such as Web search engines, since a user who runs a query on "daffodils" would probably also be interested in documents that contain the word "daffodil" (without the s). The effectiveness of stemming for English query systems was soon found to be rather limited, however, and this led early information retrieval researchers to deem stemming irrelevant in general. An alternative approach, based on searching for n-grams rather than stems, may be used instead. Also, recent research has shown greater benefits for retrieval in other languages.
Use in commercial products
Many commercial companies have been using stemming since at least the 1980s and have produced algorithmic and lexical stemmers in many languages. The Snowball stemmers have been compared with commercial lexical stemmers with varying results.
Google search
Google search adopted word stemming in 2003. Previously a search for "fish" would not have returned "fishing". Other software search algorithms vary in their use of word stemming. Programs that simply search for substrings will obviously find "fish" in "fishing", but when searching for "fishes" they will not find occurrences of the word "fish".
See also
- Root (linguistics) - linguistic definition of the term "root"
- Stem (linguistics) - linguistic definition of the term "stem"
- Morphology (linguistics)
- Lemma (morphology) - linguistic definition
- Lemmatization
- Lexeme
- Inflection
- Derivation (linguistics) - stemming is a form of reverse derivation
- Natural language processing - stemming is generally regarded as a form of NLP
- Text mining - stemming algorithms play a major role in commercial NLP software
- Computational linguistics
External links
- Themis - open source IR framework, includes Porter stemmer implementation (PostgreSQL, Java API)
- Snowball - free stemming algorithms for many languages, includes source code, including stemmers for five Romance languages
- Snowball on C# - port of Snowball stemmers for C# (14 languages)
- http://snowball.tartarus.org/wrappers/guide.html - Python bindings to Snowball API
- Ruby-Stemmer - Ruby extension to Snowball API
- PECL - PHP extension to the Snowball API
- Oleander Porter's algorithm - stemming library in C++ released under BSD
- Unofficial home page of the Lovins stemming algorithm - with source code in a couple of languages
- Official home page of the Porter stemming algorithm - including source code in several languages
- Official home page of the Lancaster stemming algorithm - Lancaster University, UK
- Modifications to the Lancaster Stemming Algorithm - Extensions to improve the handling of errors in the rules, allow interactive testing, provide more precise stems, and add some flexibility for implementing finite state automata.
- Official home page of the UEA-Lite Stemmer - University of East Anglia, UK
- Overview of stemming algorithms
- PTStemmer - A Java/Python/.Net stemming toolkit for the Portuguese language
- jsSnowball - open source JavaScript implementation of Snowball stemming algorithms for many languages
- hindi_stemmer - open source stemmer for Hindi
- czech_stemmer - open source stemmer for Czech