Voice activity detection
Voice activity detection (VAD), also known as speech activity detection or speech detection, is a technique used in speech processing in which the presence or absence of human speech is detected. The main uses of VAD are in speech coding and speech recognition. It can facilitate speech processing, and can also be used to deactivate some processes during non-speech sections of an audio session: it can avoid unnecessary coding and transmission of silence packets in Voice over Internet Protocol applications, saving on computation and on network bandwidth.
VAD is an important enabling technology for a variety of speech-based applications. Various VAD algorithms have therefore been developed that provide varying features and compromises between latency, sensitivity, accuracy and computational cost. Some VAD algorithms also provide further analysis, for example whether the speech is voiced, unvoiced or sustained. Voice activity detection is usually language independent.
It was first investigated for use on time-assignment speech interpolation (TASI) systems.
Algorithm overview
The typical design of a VAD algorithm is as follows:
- There may first be a noise reduction stage, e.g. via spectral subtraction.
- Then some features or quantities are calculated from a section of the input signal.
- A classification rule is applied to classify the section as speech or non-speech; often this rule checks whether a value exceeds a threshold.
There may be some feedback in this sequence, in which the VAD decision is used to improve the noise estimate in the noise reduction stage, or to adaptively vary the threshold(s). These feedback operations improve the VAD performance in non-stationary noise (i.e. when the noise varies a lot).
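The feature, threshold and feedback steps can be sketched in a few lines of Python. This is a minimal illustration rather than any standardized algorithm: the function names, the 6 dB margin, the smoothing factor `alpha`, the 160-sample frame length, and the assumption that the first frame contains only noise are all choices made for the example. The noise reduction stage is omitted.

```python
import math

def frame_energy_db(frame):
    """Mean power of one frame in dB (small floor avoids log(0))."""
    power = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(power + 1e-12)

def simple_vad(samples, frame_len=160, margin_db=6.0, alpha=0.9):
    """Return one boolean per frame; True means 'speech'.

    Assumptions (illustrative only):
    - the first frame is noise and seeds the noise estimate;
    - a frame is speech when its energy exceeds the noise estimate
      by more than margin_db.
    """
    decisions = []
    noise_db = None
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        # Feature stage: a single feature, the frame energy.
        energy_db = frame_energy_db(samples[start:start + frame_len])
        if noise_db is None:
            noise_db = energy_db
        # Classification stage: compare the feature with a threshold.
        is_speech = energy_db > noise_db + margin_db
        if not is_speech:
            # Feedback: the decision gates the noise estimate update,
            # which in turn moves the threshold for later frames.
            noise_db = alpha * noise_db + (1.0 - alpha) * energy_db
        decisions.append(is_speech)
    return decisions

# Toy signal: three quiet frames, two loud frames, two quiet frames.
quiet = [0.01 * (-1) ** i for i in range(160)]
loud = [0.5 * (-1) ** i for i in range(160)]
print(simple_vad(quiet * 3 + loud * 2 + quiet * 2))
```

Because the noise estimate is only updated on frames classified as non-speech, a slowly rising noise floor raises the threshold with it, which is the kind of adaptation the feedback loop is meant to provide.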
A representative set of recently published VAD methods formulates the decision rule on a frame-by-frame basis using instantaneous measures of the divergence distance between speech and noise. The different measures used in VAD methods include spectral slope, correlation coefficients, log likelihood ratio, cepstral, weighted cepstral, and modified distance measures.
Independently of the choice of VAD algorithm, a compromise must be made between having voice detected as noise and having noise detected as voice (that is, between false negatives and false positives). A VAD operating in a mobile phone must be able to detect speech in the presence of a range of very diverse types of acoustic background noise. In these difficult detection conditions it is often preferable that a VAD should fail safe, indicating speech detected when the decision is in doubt, to lower the chance of losing speech segments. The biggest difficulty in the detection of speech in this environment is the very low signal-to-noise ratios (SNRs) that are encountered. It may be impossible to distinguish between speech and noise using simple level detection techniques when parts of the speech utterance are buried below the noise.
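The trade-off between the two error types can be made concrete with a toy example. The scores and labels below are invented for illustration, and `count_errors` is a hypothetical helper, not part of any VAD standard. Lowering the decision threshold biases the detector towards "speech" when in doubt, which loses fewer speech frames at the cost of more noise frames being passed as speech.

```python
def count_errors(scores, labels, threshold):
    """Count missed speech frames and false alarms for one threshold.

    scores: per-frame detector outputs (higher means more speech-like)
    labels: per-frame ground truth, True for speech
    """
    pairs = list(zip(scores, labels))
    # Speech frames scored at or below the threshold are lost (false negatives).
    missed = sum(1 for s, y in pairs if y and s <= threshold)
    # Noise frames scored above the threshold are passed as speech (false positives).
    false_alarms = sum(1 for s, y in pairs if not y and s > threshold)
    return missed, false_alarms

# Made-up scores for eight frames and their hand-labelled ground truth.
scores = [0.1, 0.3, 0.45, 0.55, 0.6, 0.9, 0.4, 0.2]
labels = [False, False, True, True, False, True, True, False]

for threshold in (0.5, 0.25):
    missed, false_alarms = count_errors(scores, labels, threshold)
    print(f"threshold={threshold}: missed={missed}, false_alarms={false_alarms}")
```

On this data the threshold of 0.5 misses two speech frames and raises one false alarm, while the fail-safe threshold of 0.25 misses none and raises two.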
Applications
- VAD is an integral part of different speech communication systems such as audio conferencing, echo cancellation, speech recognition, speech encoding, and hands-free telephony.
- In the field of multimedia applications, VAD allows simultaneous voice and data applications.
- Similarly, in Universal Mobile Telecommunications Systems (UMTS), it controls and reduces the average bit rate and enhances overall coding quality of speech.
- In cellular radio systems (for instance GSM and CDMA systems) based on Discontinuous Transmission (DTX) mode, VAD is essential for enhancing system capacity by reducing co-channel interference and power consumption in portable digital devices.
For a wide range of applications such as digital mobile radio, Digital Simultaneous Voice and Data (DSVD) or speech storage, it is desirable to provide a discontinuous transmission of speech-coding parameters. Advantages can include lower average power consumption in mobile handsets, higher average bit rate for simultaneous services like data transmission, or a higher capacity on storage chips. However, the improvement depends mainly on the percentage of pauses during speech and the reliability of the VAD used to detect these intervals. On the one hand, it is advantageous to have a low percentage of speech activity. On the other hand clipping, that is the loss of milliseconds of active speech, should be minimized to preserve quality. This is the crucial problem for a VAD algorithm under heavy noise conditions.
Use in telemarketing
One controversial application of VAD is in conjunction with predictive dialers used by telemarketing firms. To maximize agent productivity, telemarketing firms set up predictive dialers to call more numbers than they have agents available, knowing that most calls will end in either "Ring - No Answer" or an answering machine. When a person answers, they typically speak briefly ("Hello", "Good evening", etc.) and then there is a brief period of silence. Answering machine messages usually contain 3–15 seconds of continuous speech. By setting VAD parameters correctly, dialers can determine whether a person or a machine answered the call and, if it is a person, transfer the call to an available agent. If the dialer detects an answering machine, it hangs up. Often, the system correctly detects a person answering the call, but no agent is available. This leaves the called party frustratedly repeating "Hello? Hello?" into the phone and, combined with the volume of calls in which agents did get through, created the impetus to develop "Do Not Call" lists across the US.
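The person-versus-machine rule described above can be sketched as a check on the longest run of continuous speech. This is only an illustration: `classify_answer`, the 20 ms frame duration and the use of the 3-second lower bound as a hard cut-off are assumptions of the example, and a practical system would at least need to tolerate short pauses inside a recorded greeting.

```python
def classify_answer(vad_flags, frame_ms=20, machine_min_ms=3000):
    """Guess 'person' or 'machine' from per-frame VAD flags.

    vad_flags: booleans, True where the VAD reported speech.
    A run of continuous speech of at least machine_min_ms is taken
    to be an answering machine greeting (assumed cut-off).
    """
    run = longest = 0
    for flag in vad_flags:
        run = run + 1 if flag else 0   # length of the current speech run
        longest = max(longest, run)
    return "machine" if longest * frame_ms >= machine_min_ms else "person"

# A short "Hello" (0.5 s) followed by one second of silence.
print(classify_answer([True] * 25 + [False] * 50))
# Four seconds of uninterrupted speech.
print(classify_answer([True] * 200))
```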
Performance evaluation
To evaluate a VAD, its output on test recordings is compared with that of an "ideal" VAD, created by hand-annotating the presence or absence of voice in the recordings. The performance of a VAD is commonly evaluated on the basis of the following four parameters:
- FEC (Front End Clipping): clipping introduced in passing from noise to speech activity;
- MSC (Mid Speech Clipping): clipping due to speech misclassified as noise;
- OVER: noise interpreted as speech due to the VAD flag remaining active in passing from speech activity to noise;
- NDS (Noise Detected as Speech): noise interpreted as speech within a silence period.
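One way to count these four error types from frame-level labels is sketched below. It reflects one reading of the definitions above, not a normative procedure: missed frames at the start of a reference speech segment count as FEC until the VAD first fires, later missed speech frames count as MSC, false detections that continue unbroken from the end of a speech segment count as OVER, and all other false detections count as NDS. The function name and the boolean-list input format are assumptions of the example.

```python
def vad_errors(reference, hypothesis):
    """Count FEC, MSC, OVER and NDS frames.

    reference:  hand-annotated labels, True = speech
    hypothesis: labels produced by the VAD under test
    """
    if len(reference) != len(hypothesis):
        raise ValueError("label sequences must have the same length")
    counts = {"FEC": 0, "MSC": 0, "OVER": 0, "NDS": 0}
    at_onset = False     # VAD has not yet fired in the current speech segment
    in_hangover = False  # VAD flag has stayed active since speech was detected
    prev_ref = False
    for ref, hyp in zip(reference, hypothesis):
        if ref:
            if not prev_ref:
                at_onset = True            # a new reference speech segment starts
            if hyp:
                at_onset = False           # speech detected: onset clipping ends
            else:
                counts["FEC" if at_onset else "MSC"] += 1
            in_hangover = hyp
        else:
            if hyp:
                counts["OVER" if in_hangover else "NDS"] += 1
            else:
                in_hangover = False        # flag dropped: later detections are NDS
        prev_ref = ref
    return counts

reference  = [False, False, True, True, True, True, True, False, False, False, False, False]
hypothesis = [False, True, False, True, False, True, True, True, True, False, True, False]
print(vad_errors(reference, hypothesis))
```

Dividing each count by the number of reference speech frames (for FEC and MSC) or reference noise frames (for OVER and NDS) would turn the counts into rates, if rates are preferred.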
Although the method described above provides useful objective information concerning the performance of a VAD, it is only an approximate measure of the subjective effect. For example, the effects of speech signal clipping can at times be hidden by the presence of background noise, depending on the model chosen for the comfort noise synthesis, so some of the clipping measured with objective tests is in reality not audible. It is therefore important to carry out subjective tests on VADs, the main aim of which is to ensure that the clipping perceived is acceptable. This kind of test requires a certain number of listeners to judge recordings containing the processing results of the VADs being tested. The listeners have to give marks on the following features:
- Quality;
- Comprehension difficulty;
- Audibility of clipping.
These marks, obtained by listening to several speech sequences, are then used to calculate average results for each of the features listed above, thus providing a global estimate of the behavior of the VAD being tested. To conclude, whereas objective methods are very useful in an initial stage to evaluate the quality of a VAD, subjective methods are more significant. As, however, they are more expensive (since they require the participation of a certain number of people for a few days), they are generally only used when a proposal is about to be standardized.
Implementations
- One early standard VAD is that developed by British Telecom for use in the Pan-European digital cellular mobile telephone service in 1991. It uses inverse filtering trained on non-speech segments to filter out background noise, so that it can then more reliably use a simple power-threshold to decide if a voice is present.
- The G.729 standard calculates the following features for its VAD: line spectral frequencies, full-band energy, low-band energy (<1 kHz), and zero-crossing rate. It applies a simple classification using a fixed decision boundary in the space defined by these features, and then applies smoothing and adaptive correction to improve the estimate.
- The GSM standard includes two VAD options developed by ETSI. Option 1 computes the SNR in nine bands and applies a threshold to these values. Option 2 calculates different parameters: channel power, voice metrics, and noise power. It then thresholds the voice metrics using a threshold that varies according to the estimated SNR.
- The Speex audio compression library uses a procedure named Improved Minima Controlled Recursive Averaging, which uses a smoothed representation of spectral power and then looks at the minima of a smoothed periodogram. From version 1.2 it was replaced by what the author called a kludge.
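Three of the features named in the G.729 entry above (full-band energy, low-band energy below 1 kHz, and zero-crossing rate) are simple to compute, and the sketch below shows one generic way to do so. It is not the G.729 reference code: the function names, the naive DFT, the 8 kHz sample rate and the 80-sample (10 ms) frame are assumptions of the example, and the line spectral frequencies and the decision boundary are left out.

```python
import cmath
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def band_energies(frame, sample_rate=8000, cutoff_hz=1000):
    """Return (full_band, low_band) spectral power of one frame.

    Power is summed over the non-negative frequency bins of a naive
    DFT; low_band only includes bins below cutoff_hz.
    """
    n = len(frame)
    full = low = 0.0
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        power = abs(coeff) ** 2
        full += power
        if k * sample_rate / n < cutoff_hz:
            low += power
    return full, low

# Two 10 ms test tones at an assumed 8 kHz sample rate.
low_tone = [math.sin(2 * math.pi * 500 * t / 8000 + 0.1) for t in range(80)]
high_tone = [math.sin(2 * math.pi * 3000 * t / 8000 + 0.1) for t in range(80)]

for name, tone in (("500 Hz", low_tone), ("3000 Hz", high_tone)):
    full, low = band_energies(tone)
    print(name, "low/full =", round(low / full, 3),
          "zcr =", round(zero_crossing_rate(tone), 3))
```

Voiced speech tends to concentrate energy at low frequencies and cross zero less often than broadband noise, which is why features of this kind are useful inputs to a classifier.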