Java Speech API
The Java Speech API (JSAPI) specifies a cross-platform application programming interface to support command-and-control recognizers, dictation systems, and speech synthesizers. Although JSAPI defines only an interface, several implementations have been created by third parties; one example is FreeTTS, an open-source speech synthesis system written entirely in Java.
Core technologies
Two core speech technologies are supported through the Java Speech API: speech synthesis and speech recognition.
Speech synthesis
Speech synthesis produces synthetic speech from text generated by an application, an applet, or a user; it is the reverse of speech recognition and is often referred to as text-to-speech technology. The major steps in producing speech from text are as follows:
- Structure analysis: Processes the input text to determine where paragraphs, sentences, and other structures start and end. For most languages, punctuation and formatting data are used in this stage.
- Text pre-processing: Analyzes the input text for special constructs of the language. In English, special treatment is required for abbreviations, acronyms, dates, times, numbers, currency amounts, e-mail addresses, and many other forms. Other languages need special processing for these forms, and most languages have other specialized requirements.
The result of these first two steps is a spoken form of the written text. Here are examples of the differences between written and spoken text:
St. Matthew's hospital is on Main St. -> “Saint Matthew's hospital is on Main Street”
Add $20 to account 55374. -> “Add twenty dollars to account five five, three seven four.”
The remaining steps convert the spoken text to speech:
- Text-to-phoneme conversion: Converts each word to phonemes. A phoneme is a basic unit of sound in a language.
- Prosody analysis: Processes the sentence structure, words, and phonemes to determine the appropriate prosody for the sentence.
- Waveform production: Uses the phonemes and prosody information to produce the audio waveform for each sentence.
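The Java Speech API hides these steps behind the Synthesizer interface. As a minimal sketch (assuming a JSAPI implementation such as FreeTTS is installed and registered with the javax.speech.Central factory), an application simply queues text and lets the engine perform the analysis described above:

```java
import java.util.Locale;

import javax.speech.Central;
import javax.speech.synthesis.Synthesizer;
import javax.speech.synthesis.SynthesizerModeDesc;

public class HelloSynthesis {
    public static void main(String[] args) throws Exception {
        // Ask the Central factory for any synthesizer that speaks English.
        Synthesizer synth = Central.createSynthesizer(
                new SynthesizerModeDesc(Locale.ENGLISH));

        synth.allocate();   // acquire engine resources
        synth.resume();     // ensure audio output is not paused

        // The engine performs structure analysis, text pre-processing,
        // text-to-phoneme conversion, prosody analysis, and waveform production.
        synth.speakPlainText("St. Matthew's hospital is on Main St.", null);

        synth.waitEngineState(Synthesizer.QUEUE_EMPTY);  // wait until speech finishes
        synth.deallocate();                              // release engine resources
    }
}
```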
Speech synthesizers can make errors in any of the processing steps described above. Human ears are well-tuned to detecting these errors, but careful work by developers can minimize errors and improve the speech output quality. The Java Speech API and the Java Speech API Markup Language (JSML) provide many ways for you to improve the output quality of a speech synthesizer.
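As a hedged sketch (element names here follow the JSML specification, and actual support varies by engine; synth is assumed to be an allocated, resumed Synthesizer as in the example above), marked-up text is passed to Synthesizer.speak rather than speakPlainText:

```java
// Emphasis and an explicit pause override the synthesizer's default choices.
// speak() parses the string as JSML and throws JSMLException on bad markup.
String jsml = "<jsml>"
            + "Add $20 to account <emp>55374</emp>."
            + "<break size=\"large\"/>"
            + "Thank you."
            + "</jsml>";
synth.speak(jsml, null);   // second argument: optional SpeakableListener
```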
Speech recognition
Speech recognition provides computers with the ability to listen to spoken language and determine what has been said; in other words, it processes audio input containing speech by converting it to text. The major steps of a typical speech recognizer are as follows:
- Grammar design: Defines the words that may be spoken by a user and the patterns in which they may be spoken.
- Signal processing: Analyzes the spectrum (i.e., the frequency) characteristics of the incoming audio.
- Phoneme recognition: Compares the spectrum patterns to the patterns of the phonemes of the language being recognized.
- Word recognition: Compares the sequence of likely phonemes against the words and patterns of words specified by the active grammars.
- Result generation: Provides the application with information about the words the recognizer has detected in the incoming audio.
A grammar is an object in the Java Speech API that indicates what words a user is expected to say and in what patterns those words may occur. Grammars are important to speech recognizers because they constrain the recognition process. These constraints make recognition faster and more accurate because the recognizer does not have to check for bizarre sentences.
The Java Speech API supports two basic grammar types: rule grammars and dictation grammars. These types differ in various ways, including how applications set up the grammars; the types of sentences they allow; how results are provided; the amount of computational resources required; and how they are used in application design. Rule grammars are defined by JSGF, the Java Speech Grammar Format.
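As a concrete illustration, a minimal rule grammar in JSGF (stored here in a hypothetical file hello.gram) declares a public rule built from alternatives and an optional token:

```
#JSGF V1.0;
grammar hello;
public <greeting> = (hello | good morning) [computer];
```

A minimal recognizer, modeled on the hello-world pattern from the JSAPI documentation (file and class names are illustrative), loads that grammar, commits it, and receives results through a ResultAdapter:

```java
import java.io.FileReader;
import java.util.Locale;

import javax.speech.Central;
import javax.speech.EngineModeDesc;
import javax.speech.recognition.Recognizer;
import javax.speech.recognition.Result;
import javax.speech.recognition.ResultAdapter;
import javax.speech.recognition.ResultEvent;
import javax.speech.recognition.ResultToken;
import javax.speech.recognition.RuleGrammar;

public class HelloRecognizer extends ResultAdapter {

    // Called when the recognizer accepts an utterance matching the grammar.
    public void resultAccepted(ResultEvent e) {
        Result result = (Result) e.getSource();
        for (ResultToken token : result.getBestTokens()) {
            System.out.print(token.getSpokenText() + " ");
        }
        System.out.println();
    }

    public static void main(String[] args) throws Exception {
        // Obtain a recognizer for English from the Central factory.
        Recognizer rec = Central.createRecognizer(
                new EngineModeDesc(Locale.ENGLISH));
        rec.allocate();

        // Load and enable the JSGF rule grammar shown above.
        RuleGrammar grammar = rec.loadJSGF(new FileReader("hello.gram"));
        grammar.setEnabled(true);

        rec.addResultListener(new HelloRecognizer());
        rec.commitChanges();   // grammar changes take effect only after a commit
        rec.requestFocus();    // request recognition focus
        rec.resume();          // start listening
    }
}
```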
The Java Speech API’s classes and interfaces
The different classes and interfaces that form the Java Speech API are grouped into the following three packages:
- javax.speech: Contains classes and interfaces for a generic speech engine.
- javax.speech.synthesis: Contains classes and interfaces for speech synthesis.
- javax.speech.recognition: Contains classes and interfaces for speech recognition.
The Central class acts as a factory that all Java Speech API applications use: it provides static methods for obtaining speech synthesis and speech recognition engines. The Engine interface encapsulates the generic operations that a Java Speech API-compliant speech engine should provide for speech applications.
Speech applications use Engine methods mainly to retrieve the engine's properties and state and to allocate and deallocate its resources. In addition, the Engine interface exposes mechanisms to pause and resume the audio stream generated or processed by the speech engine. The Engine interface is extended by the Synthesizer and Recognizer interfaces, which define additional speech synthesis and speech recognition functionality. The Synthesizer interface encapsulates the operations that a Java Speech API-compliant speech synthesis engine should provide for speech applications.
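As a minimal sketch of that lifecycle, written against the generic Engine interface so it applies equally to a Synthesizer or a Recognizer:

```java
import javax.speech.Engine;

public class EngineLifecycle {
    static void useEngine(Engine engine) throws Exception {
        engine.allocate();                         // acquire engine resources (may be slow)
        engine.waitEngineState(Engine.ALLOCATED);  // block until allocation completes

        engine.pause();    // suspend the engine's audio stream
        engine.resume();   // resume it

        engine.deallocate();                       // release resources when finished
    }
}
```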
The Java Speech API is based on the event-handling model of AWT components. Events generated by the speech engine can be identified and handled as required. There are two ways to handle speech engine events: through the EngineListener interface or through the EngineAdapter class.
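For example (a sketch assuming an Engine obtained as above), an application can subclass EngineAdapter anonymously and override only the events it cares about:

```java
import javax.speech.Engine;
import javax.speech.EngineAdapter;
import javax.speech.EngineEvent;

public class EngineEventDemo {
    // EngineAdapter provides empty implementations of all EngineListener
    // methods, so only the interesting ones need to be overridden.
    static void watch(Engine engine) {
        engine.addEngineListener(new EngineAdapter() {
            public void enginePaused(EngineEvent e) {
                System.out.println("Engine paused: " + e);
            }
            public void engineResumed(EngineEvent e) {
                System.out.println("Engine resumed: " + e);
            }
        });
    }
}
```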
More information about the classes and interfaces of the Java Speech API is available in its Javadoc documentation.
Related Specifications
The Java Speech API was written before the Java Community Process (JCP) existed and targeted the Java Platform, Standard Edition (Java SE). Subsequently, the Java Speech API 2 (JSAPI2) was created as JSR 113 under the JCP. This API targets the Java Platform, Micro Edition (Java ME), but also complies with Java SE.