Text Encoding Initiative
The Text Encoding Initiative (TEI) is a text-centric community of practice in the academic field of digital humanities. The community runs a mailing list, meetings and a conference series, and maintains a technical standard, a wiki and a toolset.

The TEI, which started in the 1980s as a consortium of institutions and research projects, maintains and develops a standard for the representation of texts in digital form. Originally sponsored by three scholarly societies on the basis of the manifesto issued after the Vassar Conference, the TEI is now an independent membership consortium, hosted by academic institutions in the US and in Europe. Its major deliverable is a set of Guidelines, which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics. Since 1994, these Guidelines have been a widely used standard for text materials for online research and teaching, and TEI is now the de facto standard for the encoding of electronic texts in the humanities academic community.

Sponsors and organisation

The scholarly societies that originally sponsored the TEI are the Association for Computers and the Humanities, the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing. These three groups first organized the TEI in 1987 as a research effort funded by grants from several agencies. The Guidelines for Electronic Text Encoding and Interchange were released in 1994, co-edited by Lou Burnard (at Oxford University) and Michael Sperberg-McQueen (then at the University of Illinois at Chicago, later at W3C and now an independent consultant).

Today, the TEI Consortium is a member-funded non-profit corporation hosted by:
  • the Research Technologies Service at the University of Oxford,
  • the Scholarly Technology Group at Brown University,
  • a francophone group comprising ATILF, INIST, and LORIA, co-ordinated at Nancy, and
  • the Institute for Advanced Technology in the Humanities at the University of Virginia.

The guidelines

The Guidelines define some 500 different textual components and concepts (word, sentence, character, glyph, person, etc.), which can be expressed using a markup language and defined by a DTD or XML schema. Early versions of the Guidelines used SGML as a means of expression; more recently XML has been adopted. The basic concepts have been stable for over a decade, with TEI P3 (the third version; the "P" stands for "Proposal") published in 1994 and updated in 1999. P4 (2002) was a slight update to accommodate XML; TEI P5 was released in November 2007. P5 includes integration with the xml:lang and xml:id attributes from the W3C (these had previously been TEI-defined attributes), regularisation of local pointing attributes to use the hash (as used in HTML) and unification of the ptr and xptr tags. Together these changes make P5 more regular and bring it closer to current XML practice as promoted by the W3C and as used by other XML variants.
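
As a rough illustration of the P5 conventions described above, the following Python sketch parses a small TEI-like fragment and resolves a local pointer written with the hash. The sample document, the function name resolve_local_pointers and the use of Python's standard xml.etree.ElementTree module are assumptions made for this example and are not taken from the Guidelines; the fragment is deliberately minimal and is not schema-valid TEI (it omits the header, for instance).

```python
import xml.etree.ElementTree as ET

# Namespace of TEI P5 elements, and the W3C namespace bound to the "xml:" prefix.
TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"
XML_ID = f"{{{XML_NS}}}id"
XML_LANG = f"{{{XML_NS}}}lang"

# Hypothetical sample fragment, invented for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text xml:lang="en">
    <body>
      <p xml:id="p1">First paragraph.</p>
      <p xml:id="p2" xml:lang="la">Secundus paragraphus.</p>
      <p>See <ptr target="#p2"/>.</p>
    </body>
  </text>
</TEI>"""


def resolve_local_pointers(xml_text):
    """Map each local '#id' target of a <ptr> to (text, xml:lang) of the element it names."""
    root = ET.fromstring(xml_text)

    # Index every element that carries an xml:id attribute.
    by_id = {}
    for element in root.iter():
        identifier = element.get(XML_ID)
        if identifier is not None:
            by_id[identifier] = element

    resolved = {}
    for ptr in root.iter(f"{{{TEI_NS}}}ptr"):
        target = ptr.get("target", "")
        if target.startswith("#"):  # hash marks a pointer local to this document
            element = by_id.get(target[1:])
            if element is not None:
                # Only the element's own xml:lang is read; inheritance is not modelled.
                resolved[target] = (element.text, element.get(XML_LANG))
    return resolved


pointers = resolve_local_pointers(SAMPLE)
```

Because xml:id and xml:lang live in the W3C namespace rather than being TEI-specific, the same lookup logic would apply to any other XML vocabulary that uses them.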

Initially, supporting the character sets required by European and Asian languages was a major issue. This has now been resolved by the use of Unicode, which XML parsers are required to support.
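
To make the point about Unicode concrete: the XML 1.0 specification requires processors to read both the UTF-8 and UTF-16 encodings, so the same text should survive either serialisation. The sketch below checks this with Python's standard xml.etree.ElementTree parser; the Greek sample line, the variable names and the function parse_paragraph_text are hypothetical choices made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a line of polytonic Greek inside a TEI-namespaced <p>.
TEXT = "μῆνιν ἄειδε θεά"
FRAGMENT = f'<p xmlns="http://www.tei-c.org/ns/1.0" xml:lang="grc">{TEXT}</p>'


def parse_paragraph_text(data: bytes) -> str:
    """Parse encoded XML bytes and return the text content of the root element."""
    return ET.fromstring(data).text


# Serialise the same fragment in the two encodings every XML parser must accept.
utf8_bytes = FRAGMENT.encode("utf-8")
utf16_bytes = FRAGMENT.encode("utf-16")  # Python's "utf-16" codec prepends a byte order mark

# The parser detects the encoding itself, so both byte strings yield the same text.
same_text = parse_paragraph_text(utf8_bytes) == parse_paragraph_text(utf16_bytes)
```

No encoding declaration is needed here because UTF-8 is the XML default and the byte order mark identifies UTF-16.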

The Guidelines are currently maintained as ODD (One Document Does it all) files, from which documentation in PDF or HTML and schemas in Document Type Definition and XML schema format can be generated. The TEI scheme is a modular one, designed to be customized for particular research or production environments. Many different applications of it are possible; one very popular customization is a subset known as TEI Lite.

There is ongoing work on TEI P5 which, although it breaks backward compatibility in a number of ways, has significantly updated the inner workings, including a reorganization of the underlying structures of elements into classes which allow greater and easier customization. Maintenance and development continue under the sponsorship of the TEI Consortium. The TEI component for marking up feature structures (a model of data sometimes used in linguistics) has been adopted as the basis of the ongoing development of an ISO standard for feature structures.

TEI projects

The TEI is used by many projects worldwide. Practically all projects are associated with one or more universities. Some well-known projects that encode texts using TEI include:
Project                             URL                                     Strengths
British National Corpus             http://www.natcorp.ox.ac.uk             100 million word snapshot of current English
Oxford Text Archive                 http://ota.ahds.ac.uk/                  Linguistic data
Perseus Project                     http://www.perseus.tufts.edu/           Greek and Latin texts
Women Writers Project               http://www.wwp.brown.edu/               Early modern women writers (Margaret Cavendish, Eliza Haywood, etc.)
New Zealand Electronic Text Centre  http://www.nzetc.org/                   New Zealand and Pacific Islands texts
The SWORD Project                   http://www.crosswire.org/sword/         Bible software, dictionaries, Christian literature
FreeDict                            http://freedict.org                     Bilingual dictionaries
Text Creation Partnership           http://www.lib.umich.edu/tcp/           Early English and American books
Henrik Ibsen's Writings             tei-c.org/Activities/Projects/he01.xml  Complete works and writings by playwright Henrik Ibsen

TEI customizations

TEI customizations are specializations of the TEI XML specification for use in particular fields or by specific communities.


As of 2011, there is an active proposal to add genetic editing support.

External links

  • TEI Consortium Web site (hosted at University of Virginia) with a list of TEI projects, a form for adding your project and wiki
  • TEI @ Oxford (hosted at Oxford University) with development and backup versions of much of the core content
  • TEI development (hosted at SourceForge.net) with bugtracker, version control, etc.
  • Larger list of TEI Projects
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 