IVONA - AbsoluteAstronomy.com

IVONA is a multilingual speech synthesis system developed by the Polish IT company IVONA Software.
It offers a full text-to-speech system with various APIs.

Inside IVONA

The IVONA text-to-speech system was described at Blizzard Challenge 2006 and at Blizzard Challenge 2007 (in a special version prepared for the challenge). It is composed of two parts: a front-end and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words. This process is often called text normalization, pre-processing, or tokenization. The front-end then assigns phonetic transcriptions to each word, and divides and marks the text into prosodic units, like phrases, clauses, and sentences. Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end. The back-end, often referred to as the synthesizer, then converts the symbolic linguistic representation into sound.
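The text normalization step described above can be illustrated with a minimal Python sketch. It is not IVONA's front-end: the abbreviation table, the function names, and the restriction to integers from 0 to 999 are assumptions made purely for illustration, and a production system would rely on much larger, language-specific rules.

```python
import re

# Hypothetical abbreviation table; a real front-end would use a far larger,
# language-specific lexicon and would disambiguate by context.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "km": "kilometers"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def normalize(text: str) -> str:
    """Expand known abbreviations and small integers into written-out words."""
    out = []
    for token in text.split():
        if token in ABBREVIATIONS:
            out.append(ABBREVIATIONS[token])
        elif re.fullmatch(r"\d{1,3}", token):
            out.append(number_to_words(int(token)))
        else:
            out.append(token)  # ordinary words pass through unchanged
    return " ".join(out)


print(normalize("Dr. Smith walked 42 km"))
```

The output of this stage is still plain text; assigning phonetic transcriptions and prosodic boundaries would be separate steps applied to the normalized words.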

Unit selection synthesis

IVONA uses Unit Selection with Limited Time-scale Modification (USLTM), described in their Blizzard Challenge 2006 paper. Unit selection synthesis uses large databases of recorded speech. During database creation, each recorded utterance is segmented into some or all of the following: individual phones, syllables, morphemes, words, phrases, and sentences. The division into segments is done using a specially modified speech recognizer. An index of the units in the speech database is then created based on the segmentation and acoustic parameters like the fundamental frequency (pitch), duration, position in the syllable, and neighboring phones. At runtime, the desired target utterance is created by determining the best chain of candidate units from the database (unit selection).
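Finding the "best chain of candidate units" is commonly framed as a shortest-path search over candidates, scored by a target cost (how well a unit matches the desired pitch and duration) and a join cost (how smoothly two adjacent units concatenate). The sketch below shows that general dynamic-programming idea in Python. It is not the USLTM algorithm itself: the Unit fields, both cost functions, and their weights are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    phone: str       # phone label of the recorded unit
    pitch: float     # mean fundamental frequency in Hz
    duration: float  # duration in seconds


def target_cost(unit, want_pitch, want_duration):
    # Hypothetical weighting of pitch and duration mismatch.
    return abs(unit.pitch - want_pitch) / 100.0 + abs(unit.duration - want_duration) * 10.0


def join_cost(left, right):
    # Hypothetical: penalize a pitch jump at the concatenation point.
    return abs(left.pitch - right.pitch) / 100.0


def select_units(targets, database):
    """targets: list of (phone, pitch, duration). Returns the cheapest chain of units."""
    if not targets:
        return []
    candidates = [[u for u in database if u.phone == phone] for phone, _, _ in targets]
    if any(not c for c in candidates):
        raise ValueError("no candidate unit for some target phone")

    # costs[j] = cheapest total cost of a chain ending in candidates[i][j]
    costs = [target_cost(u, *targets[0][1:]) for u in candidates[0]]
    back = []  # back[i-1][j] = index of the best predecessor of candidates[i][j]
    for i in range(1, len(targets)):
        prev = candidates[i - 1]
        new_costs, pointers = [], []
        for u in candidates[i]:
            k = min(range(len(prev)), key=lambda p: costs[p] + join_cost(prev[p], u))
            new_costs.append(costs[k] + join_cost(prev[k], u) + target_cost(u, *targets[i][1:]))
            pointers.append(k)
        costs = new_costs
        back.append(pointers)

    # Trace back from the cheapest final candidate.
    j = min(range(len(costs)), key=costs.__getitem__)
    chain = [candidates[-1][j]]
    for i in range(len(targets) - 1, 0, -1):
        j = back[i - 1][j]
        chain.append(candidates[i - 1][j])
    chain.reverse()
    return chain
```

In this framing, the index described above serves to look up candidates quickly by phone and acoustic parameters; the list comprehension here stands in for that lookup.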

Unit selection provides the greatest naturalness, because it applies digital signal processing (DSP) to the recorded speech only at concatenation points. DSP often makes recorded speech sound less natural.

Generated speech quality

The IVONA text-to-speech system received the highest Mean Opinion Score (MOS) at the Blizzard Challenge 2007 scientific contest in Bonn, Germany. The sentences read out by IVONA were evaluated by experts, a group of British and American students, and volunteers recruited via the Internet. IVONA's average MOS of 3.9 points was the highest among all speech synthesizers; a real person's recording scored 4.7.

IVONA was also evaluated at Blizzard Challenge 2006 in Pittsburgh, USA, where it received the best Mean Opinion Score (MOS) from speech experts and undergraduates for the full database results.

Voices and languages

IVONA currently speaks seven different languages with nineteen voices.

American English: Salli, Ivy, Kimberly, Kendra, Jennifer, Joey, Eric and Chipmunk Skippy

American Spanish: Penélope, Miguel

British English: Emma, Amy and Brian

Welsh English: Geraint, Gwyneth

Welsh: Geraint, Gwyneth

German: Marlene, Hans

French: Céline, Mathieu

Castilian Spanish: Conchita, Enrique

Polish: Maja, Ewa, Jacek and Jan

Romanian: Carmen

External links

IVONA TTS on-line.
See IVONA TTS in action.
Expressivo Text Reader application voiced by IVONA TTS.
Free web service say.expressivo.com - send and publish prompts spoken by IVONA TTS voices.

The source of this article is wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
