Stylometry
Stylometry is the application of the study of linguistic style, usually to written language, but it has successfully been applied to music and to fine-art paintings as well.

Stylometry is often used to attribute authorship to anonymous or disputed documents. It has legal as well as academic and literary applications, ranging from the question of the authorship of Shakespeare's works to forensic linguistics.

History

Stylometry grew out of earlier techniques of analyzing texts for evidence of authenticity, authorial identity, and other questions. An early example is Lorenzo Valla's 1439 proof that the Donation of Constantine was a forgery, an argument based partly on a comparison of the Latin with that used in authentic 4th-century documents.

The modern practice of the discipline received major impetus from the study of authorship problems in English Renaissance drama. Researchers and readers observed that some playwrights of the era had distinctive patterns of language preferences, and attempted to use those patterns to identify authors in uncertain or collaborative works. Early efforts were not always successful: in 1901, one researcher attempted to use John Fletcher's preference for "'em," the contracted form of "them," as a marker to distinguish between Fletcher and Philip Massinger in their collaborations, but he mistakenly employed an edition of Massinger's works in which the editor had expanded all instances of "'em" to "them."

The basics of stylometry were set out by the Polish philosopher Wincenty Lutosławski in his book "Principes de stylometrie" (1890). Lutosławski used the method to build a chronology of Plato's Dialogues.

The development of computers and their capacity for analyzing large quantities of data enhanced this type of effort by orders of magnitude. That capacity, however, did not guarantee quality output. In the early 1960s, Rev. A. Q. Morton produced a computer analysis of the fourteen Epistles of the New Testament attributed to St. Paul, which indicated that six different authors had written that body of work. A check of his method, applied to the works of James Joyce, gave the result that Ulysses was written by five separate individuals, none of whom had any part in A Portrait of the Artist as a Young Man.


In time, however, and with practice, researchers and scholars have refined their approaches and methods to yield better results. One notable early success was the resolution of disputed authorship in twelve of the Federalist Papers by Frederick Mosteller and David Wallace.

While questions of initial assumptions and methodology still arise (and, perhaps, always will), few now dispute the basic premise that linguistic analysis of written texts can produce valuable information and insight. (Indeed, this was apparent even before the advent of computers: the successful application of a textual/linguistic approach to the Fletcher canon by Cyrus Hoy and others yielded clear results in the late 1950s and early '60s.)

An example of a modern study is the analysis of Ronald Reagan's radio commentaries of uncertain authorship.

Methods

Modern stylometry draws heavily on the aid of computers for statistical analysis, on artificial intelligence, and on access to the growing corpus of texts available via the Internet. Software systems such as Signature (freeware produced by Dr Peter Millican of Oxford University) and JGAAP (the Java Graphical Authorship Attribution Program, freeware produced by Dr Patrick Juola of Duquesne University) make stylometric analysis increasingly practicable, even for the non-expert.

Whereas in the past, stylometry emphasized the rarest or most striking elements of a text, contemporary techniques can isolate identifying patterns even in common parts of speech.

Writer invariant

The primary stylometric method is the writer invariant: a property of a text that is similar in all texts by a given author and different in texts by different authors. An example of a writer invariant is the frequency of function words used by the writer.

In one such method, the text is analyzed to find the 50 most common words. The text is then broken into 5,000-word chunks, and each chunk is analyzed to find the frequency of those 50 words in that chunk. This generates a 50-number profile for each chunk, which places the chunk at a point in a 50-dimensional space. This 50-dimensional space is flattened into a plane using principal components analysis (PCA). The result is a display of points that corresponds to an author's style. If two literary works are placed on the same plane, the resulting pattern may show whether both works were by the same author or by different authors.
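
The procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used by any particular stylometry package: the function names (tokenize, most_common_words, chunk_profiles, project_2d) are hypothetical, the tokenizer is a crude regular expression, and the PCA step is a small pure-Python power iteration chosen to avoid external libraries.

```python
from collections import Counter
import math
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def most_common_words(tokens, n=50):
    """Return the n most frequent words, most frequent first."""
    return [word for word, _ in Counter(tokens).most_common(n)]


def chunk_profiles(tokens, vocab, chunk_size=5000):
    """Return one relative-frequency vector per full chunk of tokens."""
    profiles = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        counts = Counter(tokens[start:start + chunk_size])
        profiles.append([counts[word] / chunk_size for word in vocab])
    return profiles


def _unit(vector):
    """Scale a non-zero vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def project_2d(rows, steps=200):
    """Project rows onto their first two principal components.

    Pure-Python PCA via power iteration with deflation; assumes at least
    two columns (vocabulary words) and two rows (chunks).
    """
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in rows]
    cov = [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    axes = []
    for _axis in range(2):
        v = [1.0 / (j + 1) for j in range(d)]  # arbitrary starting direction
        for axis in axes:  # keep the start orthogonal to axes already found
            dot = sum(x * y for x, y in zip(v, axis))
            v = [x - dot * y for x, y in zip(v, axis)]
        v = _unit(v)
        for _step in range(steps):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left along any remaining direction
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[a] * cov[a][b] * v[b]
                         for a in range(d) for b in range(d))
        # Deflate: remove the variance explained by the axis just found.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(d)]
               for a in range(d)]
        axes.append(v)
    return [[sum(x * y for x, y in zip(row, axis)) for axis in axes]
            for row in centred]
```

To compare two works, one plausible usage is to tokenize both, take the vocabulary from the combined tokens, compute chunk_profiles for each work with that shared vocabulary, and pass the concatenated rows to project_2d; chunks by different authors would then be expected, though not guaranteed, to form separate clusters on the plane.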

Neural networks

Neural networks can be used to analyze authorship of texts.

Genetic Algorithms

The genetic algorithm is another artificial intelligence technique used in stylometry. One such method starts with a set of rules (100 in the scheme described here). An example rule might be, "If but appears more than 1.7 times in every thousand words, then the text is by author X". The program is presented with text and uses the rules to determine authorship. The rules are tested against a set of known texts and each rule is given a fitness score. The 50 rules with the lowest scores are thrown out. The remaining 50 rules are given small changes and 50 new rules are introduced. This is repeated until the evolved rules correctly attribute the texts.
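
The loop described above can be illustrated with a short Python sketch. Several details are assumptions made only for the example and are not taken from any published system: a rule is a hypothetical (word, threshold, author) triple, there are exactly two candidate authors, each known text is represented as a dictionary of word rates per thousand words, fitness is the fraction of known texts attributed correctly, and the search stops as soon as the best rule attributes every known text correctly.

```python
import random


def predict(rule, rates, authors):
    """Apply one rule: name `author` if the word's rate exceeds the threshold,
    otherwise name the other of the two candidate authors."""
    word, threshold, author = rule
    if rates.get(word, 0.0) > threshold:
        return author
    return authors[1] if author == authors[0] else authors[0]


def fitness(rule, samples, authors):
    """Fraction of known texts that the rule attributes correctly."""
    hits = sum(predict(rule, rates, authors) == label
               for rates, label in samples)
    return hits / len(samples)


def random_rule(rng, words, authors, max_rate):
    """Draw a rule with a random word, threshold and author."""
    return (rng.choice(words), rng.uniform(0.0, max_rate), rng.choice(authors))


def mutate(rule, rng):
    """Give a rule a small change by nudging its threshold."""
    word, threshold, author = rule
    return (word, max(0.0, threshold + rng.gauss(0.0, 0.1)), author)


def evolve(samples, words, authors, max_rate=40.0, pop_size=100,
           max_generations=500, seed=0):
    """Evolve rules until one attributes every known text correctly."""
    rng = random.Random(seed)
    half = pop_size // 2
    population = [random_rule(rng, words, authors, max_rate)
                  for _ in range(pop_size)]
    for _ in range(max_generations):
        ranked = sorted(population,
                        key=lambda rule: fitness(rule, samples, authors),
                        reverse=True)
        best = ranked[0]
        if fitness(best, samples, authors) == 1.0:
            break
        # Discard the lower-scoring half, perturb the rest, add fresh rules.
        survivors = [mutate(rule, rng) for rule in ranked[:half]]
        newcomers = [random_rule(rng, words, authors, max_rate)
                     for _ in range(pop_size - half)]
        population = survivors + newcomers
    return best, fitness(best, samples, authors)
```

Calling evolve with a list of (rates, author) pairs, the candidate words, and a two-element list of authors returns the best rule found and its fitness. A single threshold rule is far weaker than the rule sets a practical system would evolve; the sketch is meant only to make the select, mutate, and replace cycle concrete.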

Rare Pairs

One method for identifying style is called "rare pairs", and relies upon individual habits of collocation. The use of certain words may, for a particular author, idiosyncratically entail the use of other, predictable words.

See also

  • Writeprint
  • Linguistics and the Book of Mormon, Stylometry (Wordprint Studies)
  • Moshe Koppel

Further reading

See also the academic journal Literary and Linguistic Computing (published by the University of Oxford) and the Language Resources and Evaluation journal.

External links

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 