Pitch detection algorithm
A pitch detection algorithm (PDA) is an algorithm designed to estimate the pitch or fundamental frequency of a quasiperiodic or virtually periodic signal, usually a digital recording of speech or a musical note or tone. This can be done in the time domain or the frequency domain.
PDAs are used in various contexts (e.g. phonetics, music information retrieval, speech coding, musical performance systems) and so there may be different demands placed upon the algorithm. There is as yet no single ideal PDA, so a variety of algorithms exist, most falling broadly into the classes given below.
Time-domain approaches
In the time domain, a PDA typically estimates the period of the quasiperiodic signal, then inverts that value to give the frequency.
One simple approach is to measure the distance between zero-crossing points of the signal (i.e. the zero-crossing rate). However, this does not work well with complex waveforms, which are composed of multiple sine waves with differing periods. Nevertheless, there are cases in which zero-crossing can be a useful measure, for example in some speech applications where a single source is assumed. The algorithm's simplicity makes it "cheap" to implement.
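The zero-crossing idea can be sketched in a few lines of Python. The function name `zero_crossing_pitch`, the 8 kHz sample rate, and the synthetic test tones are assumptions made for illustration, not part of any particular PDA. The sketch measures the mean spacing of rising zero crossings and inverts it; the second signal shows how a strong second harmonic adds extra crossings and misleads the estimate.

```python
import math

def zero_crossing_pitch(samples, sample_rate):
    """Estimate frequency from the mean spacing of rising zero crossings."""
    crossings = []
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        if prev < 0.0 <= cur:
            # Linear interpolation gives a sub-sample crossing position.
            crossings.append(i - 1 + (-prev) / (cur - prev))
    if len(crossings) < 2:
        return None  # not enough crossings to measure a period
    mean_period = (crossings[-1] - crossings[0]) / (len(crossings) - 1)
    return sample_rate / mean_period

sample_rate = 8000  # assumed sample rate in Hz
phase = [2 * math.pi * 220.0 * n / sample_rate for n in range(800)]

# A pure 220 Hz tone: one rising crossing per period.
pure_tone = [math.sin(p) for p in phase]
pure_estimate = zero_crossing_pitch(pure_tone, sample_rate)

# Same fundamental plus a strong second harmonic: two rising crossings
# per period, so the estimate lands near 440 Hz instead of 220 Hz.
rich_tone = [math.sin(p) + 1.5 * math.sin(2 * p) for p in phase]
rich_estimate = zero_crossing_pitch(rich_tone, sample_rate)
```

On the pure tone the estimate stays close to 220 Hz, while the harmonic-rich tone is reported roughly an octave too high, which is the failure mode described above.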
More sophisticated approaches compare segments of the signal with other segments offset by a trial period to find a match. AMDF (average magnitude difference function), ASMDF (average squared mean difference function), and other similar autocorrelation algorithms work this way. These algorithms can give quite accurate results for highly periodic signals. However, they have false detection problems (often "octave errors"), can sometimes cope badly with noisy signals (depending on the implementation) and, in their basic implementations, do not deal well with polyphonic sounds (which involve multiple musical notes of different pitches).
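As a minimal sketch of the trial-period idea, the following Python computes an average magnitude difference for each candidate lag and picks the lag with the smallest value. The function names, the search range, and the synthetic 200 Hz test signal are illustrative assumptions; practical implementations add windowing, thresholds, and voicing decisions.

```python
import math

def amdf(samples, lag):
    """Average magnitude difference between the signal and a copy shifted by `lag`."""
    n = len(samples) - lag
    return sum(abs(samples[i] - samples[i + lag]) for i in range(n)) / n

def amdf_pitch(samples, sample_rate, f_min, f_max):
    """Pick the trial period (lag, in samples) with the smallest AMDF value."""
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    best_lag = min(range(min_lag, max_lag + 1), key=lambda lag: amdf(samples, lag))
    return sample_rate / best_lag

sample_rate = 8000  # assumed sample rate in Hz
# 200 Hz fundamental (period of 40 samples) plus a weaker second harmonic.
signal = [
    math.sin(2 * math.pi * 200.0 * n / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 400.0 * n / sample_rate)
    for n in range(400)
]

# Search 120-500 Hz, i.e. lags 16..66, so only one full period is in range.
estimate = amdf_pitch(signal, sample_rate, f_min=120.0, f_max=500.0)
```

The search range matters: if it were widened to include a lag of 80 samples, that lag would match the signal about as well as 40 does, and a near tie of this kind is one way octave errors arise.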
Current time-domain pitch detection algorithms tend to build upon the basic methods referred to above, with additional refinements to bring the performance more in line with a human assessment of pitch. For example, the YIN algorithm and the MPM algorithm are both based upon autocorrelation.
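One refinement described for YIN is to normalize a squared difference function by its running mean and to accept the first sufficiently deep dip rather than the global minimum. The following is a simplified sketch of that idea only: it omits further steps of the full method (such as parabolic interpolation), and the function name, the 0.1 threshold, and the test signal are illustrative assumptions.

```python
import math

def yin_style_pitch(samples, sample_rate, threshold=0.1):
    """Simplified YIN-style estimate using a cumulative mean normalized difference."""
    window = len(samples) // 2
    # Squared difference function d(tau) over a fixed-length window.
    diff = [0.0] * window
    for tau in range(1, window):
        diff[tau] = sum((samples[j] - samples[j + tau]) ** 2 for j in range(window))
    # Normalization: divide d(tau) by the mean of d(1..tau).
    cmnd = [1.0] * window
    running_sum = 0.0
    for tau in range(1, window):
        running_sum += diff[tau]
        cmnd[tau] = diff[tau] * tau / running_sum if running_sum > 0 else 1.0
    # Take the first dip below the threshold and follow it to its local minimum.
    for tau in range(2, window):
        if cmnd[tau] < threshold:
            while tau + 1 < window and cmnd[tau + 1] < cmnd[tau]:
                tau += 1
            return sample_rate / tau
    # Fallback: no dip was deep enough, so use the global minimum.
    best = min(range(1, window), key=lambda t: cmnd[t])
    return sample_rate / best

sample_rate = 8000  # assumed sample rate in Hz
# 200 Hz fundamental (period of 40 samples) plus a weaker second harmonic.
signal = [
    math.sin(2 * math.pi * 200.0 * n / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 400.0 * n / sample_rate)
    for n in range(400)
]
estimate = yin_style_pitch(signal, sample_rate)
```

Because the search stops at the first deep dip, lags of 80, 120, and 160 samples (which match the signal equally well) are never considered, so the estimate does not drop by an octave even though the lag range is unrestricted.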
Frequency-domain approaches
In the frequency domain, polyphonic detection is possible, usually utilizing the fast Fourier transform (FFT) to convert the signal to a frequency spectrum. This requires more processing power as the desired accuracy increases, although the efficiency of the FFT algorithm makes it fast enough for many purposes.
Popular frequency-domain algorithms include: the harmonic product spectrum; cepstral analysis; maximum likelihood, which attempts to match the frequency-domain characteristics to pre-defined frequency maps (useful for detecting the pitch of fixed-tuning instruments); and the detection of peaks due to harmonic series.
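To make the harmonic product spectrum concrete, here is a minimal Python sketch. It multiplies the magnitude at each candidate bin by the magnitudes at its integer multiples and picks the bin with the largest product. For self-containment it uses a direct O(N^2) discrete Fourier transform in place of an FFT; the function names, the three-harmonic setting, and the synthetic signal (chosen so that every component falls exactly on a bin) are assumptions for illustration.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Magnitudes of DFT bins 0..N/2 (direct DFT, standing in for an FFT)."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples)))
        for k in range(n // 2 + 1)
    ]

def harmonic_product_spectrum_pitch(samples, sample_rate, num_harmonics=3):
    """Return the frequency whose first `num_harmonics` harmonics have the largest product."""
    mags = dft_magnitudes(samples)
    limit = (len(mags) - 1) // num_harmonics  # highest bin whose harmonics all exist
    best_bin, best_score = 1, -1.0
    for k in range(1, limit + 1):  # skip bin 0 (DC)
        score = 1.0
        for h in range(1, num_harmonics + 1):
            score *= mags[k * h]
        if score > best_score:
            best_bin, best_score = k, score
    return best_bin * sample_rate / len(samples)

sample_rate = 8000  # assumed sample rate in Hz
# 200 Hz tone whose second harmonic is stronger than its fundamental.
signal = [
    0.3 * math.sin(2 * math.pi * 200.0 * n / sample_rate)
    + 1.0 * math.sin(2 * math.pi * 400.0 * n / sample_rate)
    + 0.6 * math.sin(2 * math.pi * 600.0 * n / sample_rate)
    for n in range(400)
]

mags = dft_magnitudes(signal)
strongest_bin = max(range(1, len(mags)), key=lambda k: mags[k])
strongest_peak_hz = strongest_bin * sample_rate / len(signal)  # 400 Hz harmonic
hps_estimate = harmonic_product_spectrum_pitch(signal, sample_rate)  # 200 Hz
```

Picking the single strongest spectral peak would report the 400 Hz harmonic here, whereas the product over harmonics favours 200 Hz, the only bin whose multiples all carry energy. Real signals rarely sit exactly on bin centres, so a window function would normally be applied first.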
To improve on the pitch estimate derived from the discrete Fourier spectrum, techniques such as spectral reassignment (phase based) or Grandke interpolation (magnitude based) can be used to go beyond the precision provided by the FFT analysis.
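A magnitude-based interpolation of this kind can be sketched as follows. The sketch assumes a Hann window and uses the two-bin ratio formula usually attributed to Grandke, delta = (2a - 1) / (a + 1), where a is the ratio of the larger neighbouring bin magnitude to the peak magnitude. The function names, the direct DFT (standing in for an FFT), and the 450 Hz test tone are illustrative assumptions.

```python
import cmath
import math

def hann_dft_magnitudes(samples):
    """Magnitudes of DFT bins 0..N/2 of the Hann-windowed signal (direct DFT)."""
    n = len(samples)
    windowed = [
        x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n)) for i, x in enumerate(samples)
    ]
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(windowed)))
        for k in range(n // 2 + 1)
    ]

def interpolated_peak_frequency(samples, sample_rate):
    """Refine the strongest bin using the ratio of the two largest adjacent bins."""
    mags = hann_dft_magnitudes(samples)
    k = max(range(1, len(mags) - 1), key=lambda b: mags[b])  # coarse peak bin
    if mags[k + 1] >= mags[k - 1]:
        # The true peak lies between bins k and k + 1.
        alpha = mags[k + 1] / mags[k]
        position = k + (2 * alpha - 1) / (alpha + 1)
    else:
        # The true peak lies between bins k - 1 and k.
        alpha = mags[k - 1] / mags[k]
        position = k - (2 * alpha - 1) / (alpha + 1)
    return position * sample_rate / len(samples)

sample_rate = 8000  # assumed sample rate in Hz
num_samples = 256   # bin spacing is 8000 / 256 = 31.25 Hz
tone = [math.sin(2 * math.pi * 450.0 * n / sample_rate) for n in range(num_samples)]

estimate = interpolated_peak_frequency(tone, sample_rate)
```

With a bin spacing of 31.25 Hz, the nearest bin centre to a 450 Hz tone is 437.5 Hz, so the uninterpolated estimate is off by 12.5 Hz; the interpolated value should land much closer to 450 Hz.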
Fundamental frequency of speech
The fundamental frequency of speech can vary from 40 Hz for low-pitched male voices to 600 Hz for children or high-pitched female voices.
Autocorrelation methods need at least two pitch periods to detect pitch. To detect a fundamental frequency of 40 Hz, whose period is 25 milliseconds (ms), at least 50 ms of the speech signal must therefore be analyzed. However, during 50 ms, speech with higher fundamental frequencies may not necessarily have the same fundamental frequency throughout the window.
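The window-length arithmetic can be checked with a short Python sketch. The helper names are hypothetical, and the two-period requirement is taken from the text above.

```python
def min_window_ms(f0_hz, periods=2):
    """Shortest analysis window (in ms) that contains `periods` pitch periods."""
    return 1000.0 * periods / f0_hz

def periods_in_window(f0_hz, window_ms):
    """Number of pitch periods of a given fundamental that fit in the window."""
    return f0_hz * window_ms / 1000.0

lowest_voice_window = min_window_ms(40.0)              # 50.0 ms
high_voice_periods = periods_in_window(600.0, 50.0)    # 30.0 periods
```

A window sized for a 40 Hz voice therefore spans 30 periods of a 600 Hz voice, which leaves plenty of time for the pitch to drift within a single analysis frame.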