eSpeak
eSpeak is a compact open source software speech synthesizer for Linux, Windows, and other platforms. It uses a formant synthesis method, which allows it to provide many languages in a small size. Much of the programming for eSpeak's languages was based on information found on Wikipedia, with some subsequent feedback from native speakers. Projects using eSpeak include NVDA, Ubuntu and OLPC, and it has also been used by Google Translate.

History

eSpeak is derived from the "Speak" speech synthesizer for British English for Acorn RISC OS computers, which was originally written in 1995.

A rewritten version for Linux appeared in February 2006 and a Windows SAPI 5 version in January 2007. Subsequent development has added and improved support for additional languages.

Because of its small size and many languages, it is included as the default speech synthesizer in the NVDA open source screen reader for Windows, and on the Ubuntu and other Linux installation discs.

The quality of the language voices varies greatly. Some have had more work or feedback from native speakers than others. Most of the people who have helped to improve the various languages are blind users of text-to-speech.

Synthesis method

eSpeak provides two methods of synthesis: the original eSpeak synthesizer and a Klatt synthesizer. In addition, eSpeak can be used as a front-end, providing text-to-phoneme translation and prosody, to MBROLA diphone voices.

The eSpeak and Klatt synthesizers use different types of formant synthesis.

The eSpeak synthesizer creates voiced speech sounds such as vowels and sonorant consonants by adding together sine waves to make the formant peaks. Unvoiced consonants such as /s/ are made by playing recorded sounds. Voiced consonants such as /z/ are made by mixing a synthesized voiced sound with a recorded unvoiced sound.
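The additive approach to voiced sounds can be illustrated with a minimal sketch. This is not eSpeak's own code: the sample rate, the formant values, and the triangular peak shape are all assumptions chosen for illustration. Each harmonic of the pitch is a sine wave, and its amplitude depends on how close it lies to a formant peak.

```python
import math

SAMPLE_RATE = 22050  # Hz; assumed for this sketch

# Hypothetical formant peaks: (centre frequency in Hz, relative height, half-width in Hz).
# These values are illustrative and are not taken from eSpeak's phoneme data.
FORMANTS = [(700.0, 1.0, 150.0), (1200.0, 0.5, 200.0), (2600.0, 0.25, 300.0)]


def harmonic_amplitude(freq):
    """Spectral envelope: a sum of triangular peaks centred on each formant."""
    total = 0.0
    for centre, height, width in FORMANTS:
        distance = abs(freq - centre)
        if distance < width:
            total += height * (1.0 - distance / width)
    return total


def synthesize_voiced(pitch_hz, duration_s):
    """Add together sine waves at multiples of the pitch, weighted by the envelope."""
    n_samples = int(SAMPLE_RATE * duration_s)
    harmonics = []
    k = 1
    while k * pitch_hz < SAMPLE_RATE / 2:  # stay below the Nyquist frequency
        amp = harmonic_amplitude(k * pitch_hz)
        if amp > 0.0:
            harmonics.append((k * pitch_hz, amp))
        k += 1
    samples = []
    for n in range(n_samples):
        t = n / SAMPLE_RATE
        samples.append(sum(a * math.sin(2 * math.pi * f * t) for f, a in harmonics))
    return samples


samples = synthesize_voiced(pitch_hz=100.0, duration_s=0.05)
```

Only harmonics that fall under a formant peak contribute, so the cost of the sum grows with the number of audible harmonics rather than with the full spectrum.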

The Klatt synthesizer mostly uses the same formant data as the eSpeak synthesizer. It produces voiced sounds by starting with a waveform which is rich in harmonics (simulating the vibration of the vocal cords) and then applying digital filters in order to produce speech sounds.
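The source-and-filter idea can be sketched as follows. This is a generic illustration of the technique, not eSpeak's implementation: the impulse-train source, the sample rate, and the formant frequencies and bandwidths are assumptions. The resonator is the standard two-pole digital resonator commonly used in Klatt-style synthesizers.

```python
import math

SAMPLE_RATE = 22050  # Hz; assumed for this sketch


def impulse_train(pitch_hz, n_samples):
    """Harmonic-rich source: one unit pulse per pitch period."""
    period = int(SAMPLE_RATE / pitch_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator(signal, freq_hz, bandwidth_hz):
    """Two-pole digital filter that boosts energy near freq_hz."""
    t = 1.0 / SAMPLE_RATE
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to one
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_filtered(pitch_hz, formants, n_samples):
    """Pass the source through one resonator per formant, in cascade."""
    signal = impulse_train(pitch_hz, n_samples)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz)
    return signal


# Hypothetical (frequency, bandwidth) pairs for an "ah"-like vowel.
voiced = synthesize_filtered(100.0, [(700.0, 130.0), (1200.0, 70.0), (2600.0, 160.0)], 1000)
```

In contrast to the additive method, the harmonics here are all present in the source and the filters shape their relative strength.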

Features

eSpeak can be used as a command-line program, or as a shared library.

It supports Speech Synthesis Markup Language (SSML).
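As a rough sketch of passing SSML to the command-line program, the helper below wraps text in a small SSML document and builds an argument list. The function names are hypothetical. The `-m` option (interpret markup in the input) and the use of the `prosody` element are assumptions here and should be checked against `espeak --help` for the installed version.

```python
from xml.sax.saxutils import escape, quoteattr


def ssml_document(text, rate="slow"):
    """Wrap plain text in a minimal SSML document with a prosody rate."""
    return f"<speak><prosody rate={quoteattr(rate)}>{escape(text)}</prosody></speak>"


def espeak_ssml_command(text, voice="en", rate="slow"):
    """Build an argument list; -m is assumed to enable SSML interpretation."""
    return ["espeak", "-v", voice, "-m", ssml_document(text, rate)]


command = espeak_ssml_command("Tom & Jerry")
```

The argument list can be handed to `subprocess.run` on a system where eSpeak is installed.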

Language voices are identified by the language's ISO 639-1 code. They can be modified by "voice variants". These are text files which can change characteristics such as pitch range, add effects such as echo, whisper and croaky voice, or make systematic adjustments to formant frequencies to change the sound of the voice. For example, "af" is the Afrikaans voice. "af+f2" is the Afrikaans voice modified with the "f2" voice variant, which changes the formants and the pitch range to give a female sound.
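The voice naming scheme above can be sketched as a small helper that joins a language code and an optional variant with "+", then passes the result to the `-v` option. The function names are hypothetical, and the `speak` helper only runs if an `espeak` executable is found on the system.

```python
import shutil
import subprocess


def voice_name(language, variant=None):
    """Combine a language code with an optional voice variant, e.g. 'af+f2'."""
    return f"{language}+{variant}" if variant else language


def espeak_command(text, language, variant=None):
    """Build the argument list for the espeak command-line program."""
    return ["espeak", "-v", voice_name(language, variant), text]


def speak(text, language, variant=None):
    """Run espeak if it is installed; return True if it was invoked."""
    if shutil.which("espeak") is None:
        return False
    subprocess.run(espeak_command(text, language, variant), check=True)
    return True
```

For example, `espeak_command("Hallo", "af", "f2")` yields the arguments for the Afrikaans voice with the "f2" variant.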

eSpeak uses an ASCII representation of phoneme names which is loosely based on the Kirshenbaum system.

Phonetic representations can be included within text input by enclosing them in double square brackets. For example, espeak -v en "Hello [[w3:ld]]" will say "Hello world" in English.
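A minimal sketch of building such input programmatically is shown below. The helper name is hypothetical; the phoneme string and the `-v en` option come from the example above.

```python
def phoneme_markup(phonemes):
    """Wrap an eSpeak phoneme string in double square brackets."""
    return f"[[{phonemes}]]"


# Mix ordinary text with an explicit phonetic spelling of "world".
text = "Hello " + phoneme_markup("w3:ld")
command = ["espeak", "-v", "en", text]
```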

See also

  • Festival Speech Synthesis System
  • Google Translate
  • Microsoft text-to-speech voices
  • Speech


The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 