Speech synthesis
Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech.

Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database. Systems differ in the size of the stored speech units; a system that stores phones or diphones provides the largest output range, but may lack clarity. For specific usage domains, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely "synthetic" voice output.

The quality of a speech synthesizer is judged by its similarity to the human voice and by its ability to be understood. An intelligible text-to-speech program allows people with visual impairments or reading disabilities to listen to written works on a home computer. Many computer operating systems have included speech synthesizers since the early 1980s.

Overview of text processing


A text-to-speech system (or "engine") is composed of two parts: a front-end and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words. This process is often called text normalization, pre-processing, or tokenization. The front-end then assigns phonetic transcriptions to each word, and divides and marks the text into prosodic units, like phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme or grapheme-to-phoneme conversion. Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end. The back-end, often referred to as the synthesizer, then converts the symbolic linguistic representation into sound. In certain systems, this part includes the computation of the target prosody (pitch contour, phoneme durations), which is then imposed on the output speech.
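The two front-end tasks described above can be illustrated with a minimal sketch. The abbreviation table, the digit handling, the tiny lexicon, and the ARPAbet-style phoneme symbols are all assumptions made for illustration; a real front-end would use far larger resources and would also handle multi-digit numbers, dates, and context-dependent pronunciations.

```python
import re

# Hypothetical tables, for illustration only.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
LEXICON = {
    "doctor": ["D", "AA", "K", "T", "ER"],
    "smith": ["S", "M", "IH", "TH"],
    "has": ["HH", "AE", "Z"],
    "two": ["T", "UW"],
    "cats": ["K", "AE", "T", "S"],
}


def normalize(text):
    """Text normalization: expand abbreviations and single digits into words."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif len(token) == 1 and token in "0123456789":
            words.append(DIGITS[int(token)])
        else:
            # Strip punctuation, keeping letters and apostrophes.
            words.append(re.sub(r"[^a-z']", "", token))
    return [w for w in words if w]


def to_phonemes(words):
    """Grapheme-to-phoneme step by lexicon lookup; unknown words are flagged."""
    return [(w, LEXICON.get(w, ["<unk>"])) for w in words]


words = normalize("Dr. Smith has 2 cats.")
transcription = to_phonemes(words)
```

Here `words` holds the written-out form of the input and `transcription` pairs each word with its phoneme list, which is roughly the symbolic representation a back-end would consume (prosody marking is omitted).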

History

Long before electronic signal processing was invented, there were those who tried to build machines to create human speech. Some early legends of the existence of "speaking heads" involved Gerbert of Aurillac (d. 1003 AD), Albertus Magnus (1198–1280), and Roger Bacon (1214–1294).

In 1779, the Danish scientist Christian Kratzenstein, working at the Russian Academy of Sciences, built models of the human vocal tract that could produce the five long vowel sounds (in International Phonetic Alphabet notation, they are [aː], [eː], [iː], [oː] and [uː]). This was followed by the bellows-operated "acoustic-mechanical speech machine" by Wolfgang von Kempelen of Vienna, Austria, described in a 1791 paper. This machine added models of the tongue and lips, enabling it to produce consonants as well as vowels. In 1837, Charles Wheatstone produced a "speaking machine" based on von Kempelen's design, and in 1857, M. Faber built the "Euphonia". Wheatstone's design was resurrected in 1923 by Paget.

In the 1930s, Bell Labs developed the VOCODER, a keyboard-operated electronic speech analyzer and synthesizer that was said to be clearly intelligible. Homer Dudley refined this device into the VODER, which he exhibited at the 1939 New York World's Fair.

The Pattern playback was built by Dr. Franklin S. Cooper and his colleagues at Haskins Laboratories in the late 1940s and completed in 1950. There were several different versions of this hardware device but only one currently survives. The machine converts pictures of the acoustic patterns of speech in the form of a spectrogram back into sound. Using this device, Alvin Liberman and colleagues were able to discover acoustic cues for the perception of phonetic segments (consonants and vowels).

Dominant systems in the 1980s and 1990s were the MITalk system, based largely on the work of Dennis Klatt at MIT, and the Bell Labs system; the latter was one of the first multilingual language-independent systems, making extensive use of natural language processing methods.

Early electronic speech synthesizers sounded robotic and were often barely intelligible. The quality of synthesized speech has steadily improved, but output from contemporary speech synthesis systems is still clearly distinguishable from actual human speech.

As improving cost-performance ratios make speech synthesizers cheaper and more accessible, more people will benefit from the use of text-to-speech programs.

Electronic devices

The first computer-based speech synthesis systems were created in the late 1950s, and the first complete text-to-speech system was completed in 1968. In 1961, physicist John Larry Kelly, Jr. and colleague Louis Gerstman used an IBM 704 computer to synthesize speech, an event among the most prominent in the history of Bell Labs. Kelly's voice recorder synthesizer (vocoder) recreated the song "Daisy Bell", with musical accompaniment from Max Mathews. Coincidentally, Arthur C. Clarke was visiting his friend and colleague John Pierce at the Bell Labs Murray Hill facility. Clarke was so impressed by the demonstration that he used it in the climactic scene of his screenplay for his novel 2001: A Space Odyssey, where the HAL 9000 computer sings the same song as it is being put to sleep by astronaut Dave Bowman. Despite the success of purely electronic speech synthesis, research is still being conducted into mechanical speech synthesizers.

Handheld electronics featuring speech synthesis began emerging in the 1970s. One of the first was the Telesensory Systems Inc. (TSI) Speech+ portable calculator for the blind in 1976. Other devices were produced primarily for educational purposes, such as Speak & Spell, produced by Texas Instruments in 1978. Fidelity released a speaking version of its electronic chess computer in 1979. The first video game to feature speech synthesis was the 1980 shoot 'em up arcade game, Stratovox, from Sun Electronics. Another early example was the arcade version of Berzerk, released that same year. The first multi-player electronic game using voice synthesis was Milton from Milton Bradley Company, which produced the device in 1980.

Synthesizer technologies

The most important qualities of a speech synthesis system are naturalness and intelligibility. Naturalness describes how closely the output sounds like human speech, while intelligibility is the ease with which the output is understood. The ideal speech synthesizer is both natural and intelligible. Speech synthesis systems usually try to maximize both characteristics.

The two primary technologies for generating synthetic speech waveforms are concatenative synthesis and formant synthesis. Each technology has strengths and weaknesses, and the intended uses of a synthesis system will typically determine which approach is used.

Concatenative synthesis

Concatenative synthesis is based on the concatenation (or stringing together) of segments of recorded speech. Generally, concatenative synthesis produces the most natural-sounding synthesized speech. However, differences between natural variations in speech and the nature of the automated techniques for segmenting the waveforms sometimes result in audible glitches in the output. There are three main sub-types of concatenative synthesis.

Unit selection synthesis

Unit selection synthesis uses large databases of recorded speech. During database creation, each recorded utterance is segmented into some or all of the following: individual phones, diphones, half-phones, syllables, morphemes, words, phrases, and sentences. Typically, the division into segments is done using a specially modified speech recognizer set to a "forced alignment" mode with some manual correction afterward, using visual representations such as the waveform and spectrogram. An index of the units in the speech database is then created based on the segmentation and acoustic parameters like the fundamental frequency (pitch), duration, position in the syllable, and neighboring phones. At run time, the desired target utterance is created by determining the best chain of candidate units from the database (unit selection). This process is typically achieved using a specially weighted decision tree.
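Finding the "best chain of candidate units" can be sketched as a shortest-path search. Note that the text above mentions a weighted decision tree; the sketch below instead uses a Viterbi-style dynamic-programming search over a target cost and a join cost, which is another common way to frame the problem and is used here only as an illustration. The unit records, the pitch-only cost functions, and all names are hypothetical.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Return the lowest-cost chain with one candidate unit per target.

    Assumes `targets` is non-empty and every position has at least one
    candidate. best[i][j] holds (cost of the cheapest chain ending in
    candidates[i][j], index of the previous unit in that chain).
    """
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            # Cheapest way to reach unit c from any unit at position i - 1.
            prev_cost, prev_j = min(
                (best[i - 1][j][0] + join_cost(p, c), j)
                for j, p in enumerate(candidates[i - 1])
            )
            row.append((prev_cost + target_cost(targets[i], c), prev_j))
        best.append(row)

    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    chain = []
    for i in range(len(targets) - 1, -1, -1):
        chain.append(candidates[i][j])
        j = best[i][j][1]
    return chain[::-1]


# Toy example: targets are desired pitches in Hz; units carry a recorded pitch.
targets = [100, 110]
candidates = [
    [{"id": "a1", "pitch": 100}, {"id": "a2", "pitch": 130}],
    [{"id": "b1", "pitch": 140}, {"id": "b2", "pitch": 112}],
]
chain = select_units(
    targets,
    candidates,
    target_cost=lambda t, u: abs(t - u["pitch"]),
    join_cost=lambda p, c: abs(p["pitch"] - c["pitch"]),
)
chosen_ids = [u["id"] for u in chain]
```

In this toy case the search picks `a1` followed by `b2`, because both units are close to their targets and the pitch jump between them is small. Real systems weight many more features (duration, spectral continuity, phonetic context).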

Unit selection provides the greatest naturalness, because it applies only a small amount of digital signal processing (DSP) to the recorded speech. DSP often makes recorded speech sound less natural, although some systems use a small amount of signal processing at the point of concatenation to smooth the waveform. The output from the best unit-selection systems is often indistinguishable from real human voices, especially in contexts for which the TTS system has been tuned. However, maximum naturalness typically requires unit-selection speech databases to be very large, in some systems ranging into the gigabytes of recorded data, representing dozens of hours of speech. Also, unit selection algorithms have been known to select segments from a place that results in less than ideal synthesis (e.g. minor words become unclear) even when a better choice exists in the database. Recently, researchers have proposed various automated methods to detect unnatural segments in unit-selection speech synthesis systems.
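The smoothing applied at the point of concatenation can take many forms. As one simple possibility (an assumption, not a description of any particular system), the sketch below joins two segments with a short linear crossfade so that the waveform does not jump abruptly at the seam. Samples are plain Python floats and the function name is hypothetical.

```python
def crossfade_join(left, right, overlap):
    """Join two sample lists, linearly blending `overlap` samples at the seam."""
    overlap = min(overlap, len(left), len(right))
    if overlap <= 0:
        # Nothing to blend: plain concatenation.
        return list(left) + list(right)

    head, tail = left[:-overlap], left[-overlap:]
    blended = []
    for k in range(overlap):
        w = (k + 1) / (overlap + 1)  # weight of the right-hand segment
        blended.append((1 - w) * tail[k] + w * right[k])
    return list(head) + blended + list(right[overlap:])


joined = crossfade_join([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], overlap=2)
```

The result is shorter than a plain concatenation by `overlap` samples, since the blended region replaces the end of the first segment and the start of the second.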

Diphone synthesis

Diphone synthesis uses a minimal speech database containing all the diphones (sound-to-sound transitions) occurring in a language. The number of diphones depends on the phonotactics of the language: for example, Spanish has about 800 diphones, and German about 2500. In diphone synthesis, only one example of each diphone is contained in the speech database. At runtime, the target prosody of a sentence is superimposed on these minimal units by means of digital signal processing techniques such as linear predictive coding, PSOLA or MBROLA. The quality of the resulting speech is generally worse than that of unit-selection systems, but more natural-sounding than the output of formant synthesizers. Diphone synthesis suffers from the sonic glitches of concatenative synthesis and the robotic-sounding nature of formant synthesis, and has few of the advantages of either approach other than small size. As such, its use in commercial applications is declining, although it continues to be used in research because there are a number of freely available software implementations.
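To make the idea of diphones as sound-to-sound transitions concrete, the following sketch turns a phone sequence into the list of diphone names a synthesizer would need to look up. The silence symbol `_` used to pad the utterance edges and the `a-b` naming scheme are assumptions for illustration.

```python
def phones_to_diphones(phones, silence="_"):
    """Pad with silence, then pair each phone with its successor."""
    padded = [silence] + list(phones) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


diphones = phones_to_diphones(["k", "a", "t"])
```

A sequence of n phones yields n + 1 diphones under this padding convention. Because each diphone is stored only once, the database stays small, and pitch and duration must then be imposed by signal processing as described above.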

Domain-specific synthesis

Domain-specific synthesis concatenates prerecorded words and phrases to create complete utterances. It is used in applications where the variety of texts the system will output is limited to a particular domain, like transit schedule announcements or weather reports. The technology is very simple to implement, and has been in commercial use for a long time, in devices like talking clocks and calculators. The level of naturalness of these systems can be very high because the variety of sentence types is limited, and they closely match the prosody and intonation of the original recordings.

Because these systems are limited by the words and phrases in their databases, they are not general-purpose and can only synthesize the combinations of words and phrases with which they have been preprogrammed. The blending of words within naturally spoken language, however, can still cause problems unless the many variations are taken into account. For example, in non-rhotic dialects of English the "r" in words like "clear" /ˈklɪə/ is usually only pronounced when the following word begins with a vowel sound (e.g. "clear out" is realized as /ˌklɪəɾˈʌʊt/). Likewise in French, many final consonants that are otherwise silent are pronounced when followed by a word that begins with a vowel, an effect called liaison. This alternation cannot be reproduced by a simple word-concatenation system, which would require additional complexity to be context-sensitive.
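The kind of context sensitivity described above can be illustrated with a minimal sketch in which each word may have a second recording for use before a vowel. The clip names and the CLIPS inventory are hypothetical, and checking the first letter of the next word is only a crude stand-in for checking its first sound.

```python
VOWEL_LETTERS = set("aeiou")

# Hypothetical clip inventory: each word has a default recording and,
# optionally, a variant recorded with a linking "r" for use before a vowel.
CLIPS = {
    "clear": {"default": "clear.wav", "before_vowel": "clear_r.wav"},
    "out": {"default": "out.wav"},
    "the": {"default": "the.wav"},
    "table": {"default": "table.wav"},
}


def select_clips(words):
    """Pick one recording per word, looking one word ahead for context."""
    chosen = []
    for i, word in enumerate(words):
        variants = CLIPS[word.lower()]
        next_word = words[i + 1].lower() if i + 1 < len(words) else ""
        # Spelling is used as a rough proxy for "begins with a vowel sound".
        if next_word[:1] in VOWEL_LETTERS and "before_vowel" in variants:
            chosen.append(variants["before_vowel"])
        else:
            chosen.append(variants["default"])
    return chosen
```

With this inventory, "clear out" selects the linking-r recording of "clear", while "clear the table" selects the default one. Every additional alternation of this kind multiplies the number of recordings that must be stored.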

Formant synthesis

Formant synthesis does not use human speech samples at runtime. Instead, the synthesized speech output is created using additive synthesis and an acoustic model (physical modelling synthesis). Parameters such as fundamental frequency, voicing, and noise levels are varied over time to create a waveform of artificial speech. This method is sometimes called rules-based synthesis; however, many concatenative systems also have rules-based components.
Many systems based on formant synthesis technology generate artificial, robotic-sounding speech that would never be mistaken for human speech. However, maximum naturalness is not always the goal of a speech synthesis system, and formant synthesis systems have advantages over concatenative systems. Formant-synthesized speech can be reliably intelligible, even at very high speeds, avoiding the acoustic glitches that commonly plague concatenative systems. High-speed synthesized speech is used by the visually impaired to quickly navigate computers using a screen reader. Formant synthesizers are usually smaller programs than concatenative systems because they do not have a database of speech samples. They can therefore be used in embedded systems, where memory and microprocessor power are especially limited. Because formant-based systems have complete control of all aspects of the output speech, a wide variety of prosodies and intonations can be output, conveying not just questions and statements, but a variety of emotions and tones of voice.
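As a rough illustration of the additive approach, the sketch below builds a steady vowel-like sound by summing harmonics of a fundamental frequency, each weighted by a simple envelope with peaks at the formant frequencies. The function names, the envelope shape, and the formant values are assumptions chosen for illustration; the sketch does not reproduce how any particular synthesizer works, and it omits voicing and noise control entirely.

```python
import math


def formant_gain(freq, formants):
    """Spectral envelope: sum of simple resonance peaks.

    formants is a list of (centre_hz, bandwidth_hz) pairs.
    """
    return sum(
        1.0 / (1.0 + ((freq - centre) / bandwidth) ** 2)
        for centre, bandwidth in formants
    )


def synthesize_vowel(f0, formants, duration=0.2, sample_rate=16000):
    """Additive synthesis: sum harmonics of f0 shaped by the formant envelope."""
    n_samples = round(duration * sample_rate)
    # Only use harmonics below the Nyquist frequency.
    n_harmonics = int((sample_rate / 2) // f0)
    amplitudes = [formant_gain(k * f0, formants) for k in range(1, n_harmonics + 1)]
    samples = []
    for n in range(n_samples):
        t = n / sample_rate
        samples.append(
            sum(a * math.sin(2 * math.pi * k * f0 * t)
                for k, a in enumerate(amplitudes, start=1))
        )
    peak = max(abs(s) for s in samples) or 1.0
    return [s / peak for s in samples]  # normalized to the range -1..1


# Illustrative round-number formants for an open vowel (assumed values).
OPEN_VOWEL = [(700, 130), (1200, 70), (2600, 160)]
```

Calling synthesize_vowel(120, OPEN_VOWEL) returns a list of samples that could be written to an audio file. Changing f0 over time would alter the intonation without touching the formants, which is the kind of independent control the paragraph above describes.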

Examples of non-real-time but highly accurate intonation control in formant synthesis include the work done in the late 1970s for the Texas Instruments toy Speak & Spell, and in the early 1980s for Sega arcade machines and many Atari, Inc. arcade games using the TMS5220 LPC chips. Creating proper intonation for these projects was painstaking, and the results have yet to be matched by real-time text-to-speech interfaces.

Articulatory synthesis

Articulatory synthesis refers to computational techniques for synthesizing speech based on models of the human vocal tract and the articulation processes occurring there. The first articulatory synthesizer regularly used for laboratory experiments was developed at Haskins Laboratories in the mid-1970s by Philip Rubin, Tom Baer, and Paul Mermelstein. This synthesizer, known as ASY, was based on vocal tract models developed at Bell Laboratories in the 1960s and 1970s by Paul Mermelstein, Cecil Coker, and colleagues.

Until recently, articulatory synthesis models have not been incorporated into commercial speech synthesis systems. A notable exception is the NeXT-based system originally developed and marketed by Trillium Sound Research, a spin-off company of the University of Calgary, where much of the original research was conducted. Following the demise of the various incarnations of NeXT (started by Steve Jobs in the late 1980s and merged with Apple Computer in 1997), the Trillium software was published under the GNU General Public License, with work continuing as gnuspeech. The system, first marketed in 1994, provides full articulatory-based text-to-speech conversion using a waveguide or transmission-line analog of the human oral and nasal tracts controlled by Carré's "distinctive region model".

HMM-based synthesis

HMM-based synthesis is a synthesis method based on hidden Markov models, also called statistical parametric synthesis. In this system, the frequency spectrum (vocal tract), fundamental frequency (vocal source), and duration (prosody) of speech are modeled simultaneously by HMMs. Speech waveforms are generated from the HMMs themselves based on the maximum likelihood criterion.

Sinewave synthesis

Sinewave synthesis is a technique for synthesizing speech by replacing the formants (main bands of energy) with pure tone whistles.
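The idea can be sketched by generating one sine wave per formant and letting its frequency follow that formant's track over time. In the sketch below, the track format (one tuple of frequencies per 10 ms frame) and the function name are assumptions made for illustration; real formant tracks would be measured from recorded speech.

```python
import math


def sinewave_speech(tracks, frame_duration=0.01, sample_rate=16000):
    """Replace each formant with a pure tone that follows its frequency track.

    tracks is a list of frames; each frame is a tuple of formant
    frequencies in Hz, one per tone.
    """
    samples_per_frame = round(frame_duration * sample_rate)
    n_tones = len(tracks[0])
    phases = [0.0] * n_tones  # running phase keeps each tone continuous
    samples = []
    for frame in tracks:
        for _ in range(samples_per_frame):
            value = 0.0
            for i, freq in enumerate(frame):
                phases[i] += 2 * math.pi * freq / sample_rate
                value += math.sin(phases[i])
            samples.append(value / n_tones)
    return samples
```

Accumulating the phase, rather than recomputing it from the time index, avoids clicks when a tone's frequency changes from one frame to the next.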

Text normalization challenges

The process of normalizing text is rarely straightforward. Texts are full of heteronyms, numbers, and abbreviations that all require expansion into a phonetic representation. There are many spellings in English which are pronounced differently based on context. For example, "My latest project is to learn how to better project my voice" contains two pronunciations of "project".

Most text-to-speech (TTS) systems do not generate semantic representations of their input texts, as processes for doing so are not reliable, well understood, or computationally effective. As a result, various heuristic techniques are used to guess the proper way to disambiguate homographs, like examining neighboring words and using statistics about frequency of occurrence.
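A neighboring-word heuristic of this kind might look like the following sketch. The cue-word list, the informal pronunciation strings, and the choice of the noun reading as the more frequent default are all assumptions made for illustration, not values taken from a real system or corpus.

```python
# Hypothetical pronunciations, written informally rather than in IPA.
PRONUNCIATIONS = {"project": {"noun": "PROJ-ect", "verb": "pro-JECT"}}

# Words assumed to signal that a verb follows (hand-picked, not from a corpus).
VERB_CUES = {"to", "will", "can", "better"}

# Assumed fallback when no cue matches: the reading taken to be more frequent.
DEFAULT_READING = "noun"


def pronounce_homograph(words, index):
    """Choose a pronunciation for words[index] from the preceding word."""
    word = words[index].lower()
    previous = words[index - 1].lower() if index > 0 else ""
    reading = "verb" if previous in VERB_CUES else DEFAULT_READING
    return PRONUNCIATIONS[word][reading]


sentence = "My latest project is to learn how to better project my voice".split()
```

Here pronounce_homograph(sentence, 2) selects the noun reading and pronounce_homograph(sentence, 9) selects the verb reading. A hand-written cue list like this is brittle, which is one reason statistical methods are used instead.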

Recently, TTS systems have begun to use HMMs (discussed above) to assign parts of speech to words, which aids in disambiguating homographs. This technique is quite successful for many cases, such as whether "read" should be pronounced as "red", implying past tense, or as "reed", implying present tense. Typical error rates when using HMMs in this fashion are usually below five percent. These techniques also work well for most European languages, although access to required training corpora is frequently difficult in these languages.
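A toy version of this technique is sketched below: a hidden Markov model whose hidden states are part-of-speech tags, decoded with the Viterbi algorithm. The tag set, vocabulary, and every probability are invented for illustration; a real tagger would estimate them from a training corpus. The sketch uses the past participle ("have read") as its "red" case, since it shares that pronunciation with the past tense.

```python
STATES = ["PRON", "MODAL", "PERF", "VB", "VBN"]
START_P = {"PRON": 1.0}
TRANS_P = {
    "PRON": {"MODAL": 0.4, "PERF": 0.3, "VB": 0.3},
    "MODAL": {"VB": 1.0},    # "will" is followed by a base-form verb
    "PERF": {"VBN": 1.0},    # "have" is followed by a past participle
    "VB": {},
    "VBN": {},
}
EMIT_P = {
    "PRON": {"i": 1.0},
    "MODAL": {"will": 1.0},
    "PERF": {"have": 1.0},
    "VB": {"read": 1.0},
    "VBN": {"read": 1.0},
}
READ_SOUNDS = {"VB": "reed", "VBN": "red"}


def viterbi(words):
    """Return the most probable tag sequence for words under the toy HMM."""
    # Each column maps a state to (best path probability, previous state).
    columns = [{s: (START_P.get(s, 0.0) * EMIT_P[s].get(words[0], 0.0), None)
                for s in STATES}]
    for word in words[1:]:
        column = {}
        for s in STATES:
            column[s] = max(
                (columns[-1][p][0] * TRANS_P[p].get(s, 0.0) * EMIT_P[s].get(word, 0.0), p)
                for p in STATES
            )
        columns.append(column)
    # Backtrack from the most probable final state.
    state = max(STATES, key=lambda s: columns[-1][s][0])
    path = [state]
    for column in reversed(columns[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


def pronounce_read(words):
    """Pick "reed" or "red" for the word "read" from its decoded tag."""
    tags = viterbi(words)
    return READ_SOUNDS[tags[words.index("read")]]
```

With these made-up probabilities, "i will read" decodes to a base-form verb and "i have read" to a past participle, so the two sentences receive different pronunciations of the same spelling.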

Deciding how to convert numbers is another problem that TTS systems have to address. It is a simple programming challenge to convert a number into words (at least in English), like "1325" becoming "one thousand three hundred twenty-five." However, numbers occur in many different contexts; "1325" may also be read as "one three two five", "thirteen twenty-five" or "thirteen hundred and twenty five". A TTS system can often infer how to expand a number based on surrounding words, numbers, and punctuation, and sometimes the system provides a way to specify the context if it is ambiguous. Roman numerals can also be read differently depending on context. For example "Henry VIII" reads as "Henry the Eighth", while "Chapter VIII" reads as "Chapter Eight".
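The readings of "1325" mentioned above can be produced by a short sketch such as the following. The cue-word sets used to pick a reading are hypothetical, the cardinal reading covers only 0 to 9999, and the pairwise reading does not handle cases such as "1905" idiomatically.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Hypothetical context cues; a real front end would use far richer evidence.
YEAR_CUES = {"in", "year", "since"}
DIGIT_CUES = {"code", "pin", "room", "flight"}


def two_digits(n):
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def cardinal(n):
    """Spell out an integer from 0 to 9999 as a cardinal number."""
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest or not parts:
        parts.append(two_digits(rest))
    return " ".join(parts)


def as_digits(text):
    """Read a digit string one digit at a time."""
    return " ".join(ONES[int(ch)] for ch in text)


def as_pairs(text):
    """Read a four-digit string as two two-digit numbers, as for a year."""
    return two_digits(int(text[:2])) + " " + two_digits(int(text[2:]))


def expand_number(text, previous_word=""):
    """Choose a reading for a digit string from the word before it."""
    cue = previous_word.lower()
    if len(text) == 4 and cue in YEAR_CUES:
        return as_pairs(text)
    if cue in DIGIT_CUES:
        return as_digits(text)
    return cardinal(int(text))
```

For example, expand_number("1325", "in") gives the year-style reading, expand_number("1325", "code") reads the digits one by one, and expand_number("1325") falls back to the cardinal.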

Similarly, abbreviations can be ambiguous. For example, the abbreviation "in" for "inches" must be differentiated from the word "in", and the address "12 St John St." uses the same abbreviation for both "Saint" and "Street". TTS systems with intelligent front ends can make educated guesses about ambiguous abbreviations, while others provide the same result in all cases, resulting in nonsensical (and sometimes comical) outputs.
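An "educated guess" of this kind can be sketched with two positional rules: "St" is read as "Saint" when a capitalized word follows it and as "Street" otherwise, and "in" is read as "inches" only directly after a number. Both rules are simplifications assumed for illustration and will misfire on some inputs.

```python
def expand_abbreviations(tokens):
    """Expand "St" and "in" using only the neighboring tokens."""
    expanded = []
    for i, token in enumerate(tokens):
        previous = tokens[i - 1] if i > 0 else ""
        following = tokens[i + 1] if i + 1 < len(tokens) else ""
        bare = token.rstrip(".")
        if bare == "St":
            # Before a capitalized name, guess "Saint"; otherwise "Street".
            expanded.append("Saint" if following[:1].isupper() else "Street")
        elif bare == "in" and previous.replace(".", "", 1).isdigit():
            # Directly after a number, guess the unit of length.
            expanded.append("inches")
        else:
            expanded.append(token)
    return expanded
```

Applied to the tokens of "12 St John St.", this yields "12 Saint John Street"; applied to "put it in the box", it leaves "in" unchanged.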

Text-to-phoneme challenges

Speech synthesis systems use two basic approaches to determine the pronunciation of a word based on its spelling, a process which is often called text-to-phoneme or grapheme-to-phoneme conversion (phoneme is the term used by linguists to describe distinctive sounds in a language). The simplest approach to text-to-phoneme conversion is the dictionary-based approach, where a large dictionary containing all the words of a language and their correct pronunciations is stored by the program. Determining the correct pronunciation of each word is a matter of looking up each word in the dictionary and replacing the spelling with the pronunciation specified in the dictionary. The other approach is rule-based, in which pronunciation rules are applied to words to determine their pronunciations based on their spellings. This is similar to the "sounding out", or synthetic phonics, approach to learning reading.

Each approach has advantages and drawbacks. The dictionary-based approach is quick and accurate, but completely fails if it is given a word which is not in its dictionary. As dictionary size grows, so too do the memory space requirements of the synthesis system. On the other hand, the rule-based approach works on any input, but the complexity of the rules grows substantially as the system takes into account irregular spellings or pronunciations. (Consider that the word "of" is very common in English, yet is the only word in which the letter "f" is pronounced [v].) As a result, nearly all speech synthesis systems use a combination of these approaches.
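A combined approach can be sketched as a dictionary lookup with a rule-based fallback. The tiny lexicon, the ARPAbet-style phoneme symbols, and the naive letter rules below are all assumptions made for illustration and are far too crude for real English.

```python
# Hypothetical exception dictionary (ARPAbet-style symbols).
LEXICON = {"of": "AH V", "the": "DH AH", "one": "W AH N"}

# Naive letter-to-sound rules: two-letter patterns are tried first.
DIGRAPH_RULES = {"sh": "SH", "ch": "CH", "th": "TH", "ph": "F"}
LETTER_RULES = {
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F", "g": "G",
    "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L", "m": "M", "n": "N",
    "o": "AA", "p": "P", "q": "K", "r": "R", "s": "S", "t": "T", "u": "AH",
    "v": "V", "w": "W", "x": "K S", "y": "Y", "z": "Z",
}


def letter_to_sound(word):
    """Rule-based conversion: scan left to right, preferring digraphs."""
    phonemes = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPH_RULES:
            phonemes.append(DIGRAPH_RULES[pair])
            i += 2
        else:
            if word[i] in LETTER_RULES:
                phonemes.append(LETTER_RULES[word[i]])
            i += 1  # characters without a rule are skipped
    return " ".join(phonemes)


def to_phonemes(word):
    """Look the word up in the dictionary, falling back to the rules."""
    word = word.lower()
    return LEXICON.get(word) or letter_to_sound(word)
```

The word "of" shows why the dictionary is consulted first: the rules alone would give it an "F" sound, while the lexicon entry supplies the irregular "V". A word outside the lexicon, such as "ship", still receives a pronunciation from the rules.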

Languages with a phonemic orthography have a very regular writing system, and the prediction of the pronunciation of words based on their spellings is quite successful. Speech synthesis systems for such languages often use the rule-based method extensively, resorting to dictionaries only for those few words, like foreign names and borrowings, whose pronunciations are not obvious from their spellings. On the other hand, speech synthesis systems for languages like English, which have extremely irregular spelling systems, are more likely to rely on dictionaries, and to use rule-based methods only for unusual words, or words that aren't in their dictionaries.

Evaluation challenges

The consistent evaluation of speech synthesis systems may be difficult because of a lack of universally agreed objective evaluation criteria. Different organizations often use different speech data. The quality of speech synthesis systems also depends to a large degree on the quality of the production technique (which may involve analogue or digital recording) and on the facilities used to replay the speech. Evaluating speech synthesis systems has therefore often been compromised by differences between production techniques and replay facilities.

Recently, however, some researchers have started to evaluate speech synthesis systems using a common speech dataset.

Prosodics and emotional content

A study in the journal "Speech Communication" by Amy Drahota and colleagues at the University of Portsmouth, UK, reported that listeners to voice recordings could determine, at better than chance levels, whether or not the speaker was smiling. It was suggested that identification of the vocal features that signal emotional content may be used to help make synthesized speech sound more natural.

Dedicated hardware

  • Votrax
    • SC-01A (analog formant)
    • SC-02 / SSI-263 / "Artic 263"
  • General Instrument SP0256-AL2 (CTS256A-AL2)
  • Magnevation SpeakJet (www.speechchips.com TTS256)
  • Savage Innovations SoundGin
  • National Semiconductor DT1050 Digitalker (Mozer)
  • Silicon Systems SSI 263 (analog formant)
  • Texas Instruments LPC Speech Chips
    • TMS5110A
    • TMS5200
  • Oki Semiconductor
    • ML22825 (ADPCM)
    • ML22573 (HQADPCM)
  • Toshiba T6721A
  • Philips / Signetics
    • Mullard MEA8000
    • PCF8200
  • TextSpeak Embedded TTS Modules

Atari

Arguably, the first speech system integrated into an operating system was that of the 1400XL/1450XL personal computers designed by Atari, Inc. using the Votrax SC01 chip in 1983. The 1400XL/1450XL computers used a finite state machine to enable World English Spelling text-to-speech synthesis. However, the 1400XL/1450XL personal computers never shipped in quantity.

The Atari ST computers were sold with "stspeech.tos" on floppy disk.

Apple

The first speech system integrated into an operating system that shipped in quantity was Apple Computer's MacInTalk in 1984. The software was licensed from third-party developers Joseph Katz and Mark Barton (later, SoftVoice, Inc.) and was featured during the 1984 introduction of the Macintosh computer. Since the 1980s, Macintosh computers have offered text-to-speech capabilities through the MacInTalk software. In the early 1990s Apple expanded its capabilities, offering system-wide text-to-speech support. With the introduction of faster PowerPC-based computers, they included higher-quality voice sampling. Apple also introduced speech recognition into its systems, which provided a fluid command set. More recently, Apple has added sample-based voices. Starting as a curiosity, the speech system of Apple Macintosh has evolved into a fully supported program, PlainTalk, for people with vision problems. VoiceOver was first featured in Mac OS X Tiger (10.4). During 10.4 (Tiger) and the first releases of 10.5 (Leopard) there was only one standard voice shipping with Mac OS X. Starting with 10.6 (Snow Leopard), the user can choose from a wide range of voices. VoiceOver voices feature the taking of realistic-sounding breaths between sentences, as well as improved clarity at high read rates over PlainTalk. Mac OS X also includes say, a command-line based application that converts text to audible speech. The AppleScript Standard Additions includes a say verb that allows a script to use any of the installed voices and to control the pitch, speaking rate and modulation of the spoken text.

The Apple iOS operating system used on the iPhone, iPad and iPod Touch uses VoiceOver speech synthesis for accessibility. Some third party applications also provide speech synthesis to facilitate navigating, reading web pages or translating text.

AmigaOS

The second operating system with advanced speech synthesis capabilities was AmigaOS, introduced in 1985. The voice synthesis was licensed by Commodore International from SoftVoice, Inc., who also developed the original MacInTalk text-to-speech system. It featured a complete system of voice emulation, with both male and female voices and "stress" indicator markers, made possible by advanced features of the Amiga hardware audio chipset. It was divided into a narrator device and a translator library. Amiga Speak Handler featured a text-to-speech translator. AmigaOS considered speech synthesis a virtual hardware device, so the user could even redirect console output to it. Some Amiga programs, such as word processors, made extensive use of the speech system.

Microsoft Windows

Modern Windows desktop systems can use SAPI 4 and SAPI 5 components to support speech synthesis and speech recognition. SAPI 4.0 was available as an optional add-on for Windows 95 and Windows 98. Windows 2000 added Narrator, a text-to-speech utility for people with visual impairments. Third-party programs such as CoolSpeech can perform various text-to-speech tasks such as reading text aloud from a specified website, email account, text document, the Windows clipboard, or the user's keyboard typing. Not all programs can use speech synthesis directly. Some programs can use plug-ins, extensions or add-ons to read text aloud. Third-party programs are available that can read text from the system clipboard.

Microsoft Speech Server is a server-based package for voice synthesis and recognition. It is designed for network use with web applications and call centers.

Text-to-speech (TTS) refers to the ability of computers to read text aloud. A TTS engine converts written text to a phonemic representation, then converts the phonemic representation to waveforms that can be output as sound. TTS engines with different languages, dialects and specialized vocabularies are available through third-party publishers.
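The two stages described above (text to phonemic representation, then phonemic representation to waveform) can be illustrated with a deliberately simplified Python sketch. Everything in it is an assumption made for illustration: the two-word lexicon, the phoneme symbols, and the mapping of each phoneme to a pure tone are hypothetical stand-ins, and a real engine would use a full pronunciation model and recorded units or an acoustic model instead.

```python
import math

# Hypothetical toy lexicon: word -> phoneme symbols (illustrative only).
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

# Hypothetical phoneme -> tone frequency (Hz). A real engine would produce
# speech-like sound here, not a pure tone.
PHONEME_TONE_HZ = {
    "HH": 200.0, "AH": 220.0, "L": 240.0, "OW": 260.0,
    "W": 280.0, "ER": 300.0, "D": 320.0,
}

SAMPLE_RATE = 8000          # samples per second
SAMPLES_PER_PHONEME = 400   # 50 ms of audio per phoneme at 8 kHz


def text_to_phonemes(text):
    """Stage 1: convert written text to a phonemic representation."""
    phonemes = []
    for word in text.lower().split():
        word = word.strip(".,!?;:")
        if word not in LEXICON:
            raise KeyError(f"word not in toy lexicon: {word!r}")
        phonemes.extend(LEXICON[word])
    return phonemes


def phonemes_to_waveform(phonemes):
    """Stage 2: convert the phonemic representation to waveform samples."""
    samples = []
    for phoneme in phonemes:
        freq = PHONEME_TONE_HZ[phoneme]
        for n in range(SAMPLES_PER_PHONEME):
            samples.append(math.sin(2 * math.pi * freq * n / SAMPLE_RATE))
    return samples


def synthesize(text):
    """Run both stages and return a list of samples in the range [-1, 1]."""
    return phonemes_to_waveform(text_to_phonemes(text))
```

Keeping the two stages as separate functions mirrors the description: the phoneme list is the only thing passed between them, so either stage could be swapped out independently.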

Android

Version 1.6 of Android added support for speech synthesis (TTS).

Internet

A number of applications, plugins and gadgets can read messages directly from an e-mail client and web pages from a web browser or Google Toolbar; one example is Text-to-voice, an add-on to Firefox. Some specialized software can narrate RSS feeds. On one hand, online RSS narrators simplify information delivery by allowing users to listen to their favourite news sources and to convert them to podcasts. On the other hand, online RSS readers are available on almost any PC connected to the Internet. Users can download generated audio files to portable devices, e.g. with the help of a podcast receiver, and listen to them while walking, jogging or commuting to work.
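As a rough illustration of the first step an RSS narrator might perform, the Python sketch below turns a feed into a list of sentences ready to be passed to a speech synthesizer. The sample feed, the function name `feed_to_narration` and the wording of the spoken lines are all hypothetical; the speech output itself is left out, since it depends on whichever TTS engine is available.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed used only for this illustration.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First headline</title>
      <description>Summary of the first story.</description>
    </item>
    <item>
      <title>Second headline</title>
      <description>Summary of the second story.</description>
    </item>
  </channel>
</rss>
"""


def feed_to_narration(rss_xml):
    """Return a list of plain-text lines to be read aloud, one per feed item."""
    root = ET.fromstring(rss_xml)
    channel = root.find("channel")
    if channel is None:
        raise ValueError("not an RSS 2.0 document: missing <channel>")

    feed_title = channel.findtext("title", default="unknown feed").strip()
    lines = [f"News from {feed_title}."]

    for index, item in enumerate(channel.findall("item"), start=1):
        title = item.findtext("title", default="").strip()
        description = item.findtext("description", default="").strip()
        lines.append(f"Item {index}: {title}. {description}")
    return lines
```

Each returned line could then be synthesized and saved as an audio file, which is the form in which a podcast receiver would pick it up.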

A growing field in Internet-based TTS is web-based assistive technology, e.g. 'Browsealoud' from a UK company and Readspeaker. Such technology can deliver TTS functionality to anyone with access to a web browser, whether for reasons of accessibility, convenience, entertainment or information. The non-profit project Pediaphon was created in 2006 to provide a similar web-based TTS interface to Wikipedia.

Other work is being done in the context of the W3C through the W3C Audio Incubator Group, with the involvement of the BBC and Google Inc.

Others

  • Some e-book readers, such as the Amazon Kindle.
  • Some models of Texas Instruments home computers produced in 1979 and 1981 (Texas Instruments TI-99/4 and TI-99/4A) were capable of text-to-phoneme synthesis or reciting complete words and phrases (text-to-dictionary), using a very popular Speech Synthesizer peripheral. TI used a proprietary codec to embed complete spoken phrases into applications, primarily video games.
  • IBM's OS/2 Warp 4 included VoiceType, a precursor to IBM ViaVoice.
  • Various systems run on free and open-source platforms, including Linux. They include open-source programs such as the Festival Speech Synthesis System, which uses diphone-based synthesis (and can use a limited number of MBROLA voices), and gnuspeech from the Free Software Foundation, which uses articulatory synthesis.
  • Companies that developed speech synthesis systems but are no longer in this business include BeST Speech (bought by L&H), Eloquent Technology (bought by SpeechWorks), Lernout & Hauspie (bought by Nuance), SpeechWorks (bought by Nuance), and Rhetorical Systems (bought by Nuance).
  • GPS navigation units produced by Garmin, Magellan, TomTom and others use speech synthesis for automobile navigation.

Speech synthesis markup languages

A number of markup languages have been established for the rendition of text as speech in an XML-compliant format. The most recent is Speech Synthesis Markup Language (SSML), which became a W3C recommendation in 2004. Older speech synthesis markup languages include Java Speech Markup Language (JSML) and SABLE. Although each of these was proposed as a standard, none of them has been widely adopted.
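To give a sense of what such markup looks like, the following Python sketch builds a small SSML document with the standard library's XML tools. The element names used (`speak`, `emphasis`, `break`) and the namespace URI come from the SSML specification; the helper name `build_ssml` and its parameters are hypothetical and chosen only for this example. How the result is rendered depends entirely on the synthesizer that consumes it.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Serialize SSML elements with a default namespace instead of an ns0: prefix.
ET.register_namespace("", SSML_NS)


def build_ssml(sentence, emphasized_word, pause_ms=500, lang="en-US"):
    """Return an SSML string that stresses one word and ends with a pause."""
    if emphasized_word not in sentence:
        raise ValueError("emphasized_word must occur in sentence")

    speak = ET.Element(f"{{{SSML_NS}}}speak", {"version": "1.0", XML_LANG: lang})

    # Split the sentence around the first occurrence of the word to stress.
    before, _, after = sentence.partition(emphasized_word)
    speak.text = before

    emphasis = ET.SubElement(speak, f"{{{SSML_NS}}}emphasis", {"level": "strong"})
    emphasis.text = emphasized_word
    emphasis.tail = after

    # A trailing pause, expressed in milliseconds.
    ET.SubElement(speak, f"{{{SSML_NS}}}break", {"time": f"{pause_ms}ms"})

    return ET.tostring(speak, encoding="unicode")
```

For example, `build_ssml("Please mind the gap.", "mind")` produces a `speak` element whose text is unchanged but whose second word is wrapped in an `emphasis` element, followed by a half-second `break`.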

Speech synthesis markup languages are distinguished from dialogue markup languages. VoiceXML, for example, includes tags related to speech recognition, dialogue management and touchtone dialing, in addition to text-to-speech markup.

Applications

Speech synthesis has long been a vital assistive technology tool, and its application in this area is significant and widespread. It allows environmental barriers to be removed for people with a wide range of disabilities. The longest-standing application has been in the use of screen readers for people with visual impairment, but text-to-speech systems are now commonly used by people with dyslexia and other reading difficulties as well as by pre-literate children. They are also frequently employed to aid those with severe speech impairment, usually through a dedicated voice output communication aid.

Speech synthesis techniques are also used in entertainment productions such as games and animations. In 2007, Animo Limited announced the development of a software application package based on its speech synthesis software FineSpeech, explicitly geared towards customers in the entertainment industries, able to generate narration and lines of dialogue according to user specifications. The application reached maturity in 2008, when NEC Biglobe announced a web service that allows users to create phrases from the voices of Code Geass: Lelouch of the Rebellion R2 characters.

See also

  • Text-to-voice (Mozilla Firefox extension)
  • Microsoft text-to-speech voices
  • Loquendo
  • CereProc
  • Comparison of speech synthesizers
  • Articulatory synthesis
  • Chinese speech synthesis
  • Natural language processing
  • Paperless office
  • Comparison of screen readers
  • Comparison of file readers
  • Sinewave synthesis
  • Speech processing
  • Silent speech interface
  • Vocaloid


The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 