IVONA
IVONA is a multilingual speech synthesis system developed by the Polish IT company IVONA Software.
It offers a complete text-to-speech system with various APIs.

Inside IVONA

The IVONA text-to-speech system was described at Blizzard Challenge 2006 and at Blizzard Challenge 2007 (in a special version prepared for the challenge). It is composed of two parts: a front-end and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols such as numbers and abbreviations into the equivalent written-out words. This process is often called text normalization, pre-processing, or tokenization. The front-end then assigns a phonetic transcription to each word, and divides and marks the text into prosodic units such as phrases, clauses, and sentences. Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end. The back-end, often referred to as the synthesizer, then converts the symbolic linguistic representation into sound.
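
The text normalization step of a front-end can be illustrated with a minimal Python sketch. It is not IVONA's implementation: the abbreviation table, the helper names normalize and number_to_words, and the restriction to numbers below 100 are all assumptions made for the example.

```python
import re

# Hypothetical lookup table; a real front-end would use much larger,
# language-specific resources and context to disambiguate.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "no.": "number"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0-99."""
    if not 0 <= n < 100:
        raise ValueError("this sketch only handles 0-99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def normalize(text: str) -> list[str]:
    """Return lower-case word tokens with numbers and abbreviations expanded."""
    words = []
    for token in text.lower().split():
        # Check abbreviations before stripping punctuation, because the
        # trailing period is part of the abbreviation itself.
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
            continue
        token = token.strip(".,;:!?")
        if re.fullmatch(r"[0-9]{1,2}", token):
            words.extend(number_to_words(int(token)).split())
        elif token:
            # Anything else (including longer numbers) passes through as-is.
            words.append(token)
    return words


print(normalize("Dr. Smith lives at 42 Elm St."))
```

The resulting word list is what a later stage would look up in a pronunciation lexicon to obtain phonetic transcriptions.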

Unit selection synthesis

IVONA uses Unit Selection with Limited Time-scale Modification (USLTM), described in its Blizzard Challenge 2006 paper. Unit selection synthesis uses large databases of recorded speech. During database creation, each recorded utterance is segmented into some or all of the following: individual phones, syllables, morphemes, words, phrases, and sentences. The division into segments is done using a specially modified speech recognizer. An index of the units in the speech database is then created based on the segmentation and on acoustic parameters such as the fundamental frequency (pitch), duration, position in the syllable, and neighboring phones. At runtime, the desired target utterance is created by determining the best chain of candidate units from the database (unit selection).
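
The search for the best chain of candidate units is commonly framed as a dynamic-programming problem over a target cost (how well a unit matches the requested pitch and duration) and a join cost (how smoothly two adjacent units connect). The Python sketch below shows that generic formulation only; the Unit fields, the cost functions, and their weights are assumptions for illustration and are not taken from the USLTM paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    phone: str       # phone label of the recorded segment
    pitch: float     # mean fundamental frequency in Hz
    duration: float  # length in seconds


def target_cost(unit, want_pitch, want_duration):
    # Hypothetical weighting of pitch and duration mismatch.
    return abs(unit.pitch - want_pitch) / 100.0 + abs(unit.duration - want_duration) * 10.0


def join_cost(left, right):
    # Hypothetical penalty for a pitch jump at the concatenation point.
    return abs(left.pitch - right.pitch) / 100.0


def select_units(targets, database):
    """targets: list of (phone, pitch, duration) tuples; database: list of Unit.
    Returns the chain of units with the lowest total target + join cost."""
    if not targets:
        return []
    layers = []
    for phone, _, _ in targets:
        candidates = [u for u in database if u.phone == phone]
        if not candidates:
            raise ValueError(f"no unit for phone {phone!r}")
        layers.append(candidates)

    # costs[j] = cost of the best chain ending in candidate j of the current layer
    costs = [target_cost(u, targets[0][1], targets[0][2]) for u in layers[0]]
    back = []  # back[i][j] = best predecessor index in layer i for candidate j of layer i + 1
    for i in range(1, len(layers)):
        new_costs, pointers = [], []
        for u in layers[i]:
            prev = min(range(len(layers[i - 1])),
                       key=lambda k: costs[k] + join_cost(layers[i - 1][k], u))
            total = (costs[prev] + join_cost(layers[i - 1][prev], u)
                     + target_cost(u, targets[i][1], targets[i][2]))
            new_costs.append(total)
            pointers.append(prev)
        costs = new_costs
        back.append(pointers)

    # Trace the cheapest chain backwards from the last layer.
    j = min(range(len(costs)), key=costs.__getitem__)
    chain = [layers[-1][j]]
    for i in range(len(back) - 1, -1, -1):
        j = back[i][j]
        chain.append(layers[i][j])
    chain.reverse()
    return chain
```

A call such as select_units([("h", 120, 0.05), ("i", 130, 0.10)], database) returns one Unit per target phone. In a real system the candidates would come from the index described above rather than from a linear scan of the database.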

Unit selection provides the greatest naturalness, because it applies digital signal processing (DSP) to the recorded speech only at concatenation points. DSP often makes recorded speech sound less natural.
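
As a rough illustration of processing confined to the concatenation point, the Python sketch below joins two recorded segments with a short linear crossfade and leaves every other sample untouched. The crossfade is a stand-in chosen for simplicity; the source does not say which smoothing IVONA applies, and the function name and parameters are hypothetical.

```python
def crossfade_concat(left, right, overlap):
    """Join two lists of audio samples, blending `overlap` samples at the seam.

    Samples outside the overlap region are copied unchanged, so the
    recorded speech is only modified around the concatenation point.
    """
    if overlap < 0 or overlap > min(len(left), len(right)):
        raise ValueError("overlap must fit inside both segments")
    if overlap == 0:
        return list(left) + list(right)

    start = len(left) - overlap
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of the right-hand segment
        blended.append((1 - w) * left[start + i] + w * right[i])
    return list(left[:start]) + blended + list(right[overlap:])


print(crossfade_concat([1.0] * 4, [0.0] * 4, 3))
```

The output is shorter than the two inputs combined by exactly the overlap length, because the overlapping samples are merged rather than played twice.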

Generated speech quality

The IVONA text-to-speech system received the highest Mean Opinion Score (MOS) at the scientific contest Blizzard Challenge 2007 in Bonn, Germany. The sentences read out by IVONA were evaluated by experts, a group of British and American students, and volunteers recruited via the Internet. IVONA's average MOS of 3.9 points was the highest among all participating speech synthesizers. A recording of a real person scored 4.7.

IVONA was also evaluated at Blizzard Challenge 2006 in Pittsburgh, USA, where it received the best Mean Opinion Score (MOS) from speech experts and undergraduates in the full-database results.
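
A Mean Opinion Score is the arithmetic mean of listener ratings, conventionally given on a scale from 1 (bad) to 5 (excellent). The Python sketch below shows the calculation; the ratings in the usage line are made-up illustrative values, not Blizzard Challenge data, and the function name is hypothetical.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average a list of listener ratings on the 1-5 opinion scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return mean(ratings)


# Invented ratings from five hypothetical listeners for one synthesized sentence.
print(mean_opinion_score([4, 4, 3, 5, 4]))
```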

Voices and languages

IVONA currently offers the following voices, grouped by language:

American English: Salli, Ivy, Kimberly, Kendra, Jennifer, Joey, Eric and Chipmunk Skippy

American Spanish: Penélope, Miguel

British English: Emma, Amy and Brian

Welsh English: Geraint, Gwyneth

Welsh: Geraint, Gwyneth

German: Marlene, Hans

French: Céline, Mathieu

Castilian Spanish: Conchita, Enrique

Polish: Maja, Ewa, Jacek and Jan

Romanian: Carmen

See also

  • Speech synthesis
  • Language
  • Natural language processing
  • Speech processing
  • Speech recognition
  • List of screen readers

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 