Audio timescale-pitch modification
Time stretching is the process of changing the speed or duration of an audio signal without affecting its pitch.
Pitch scaling or pitch shifting is the opposite: the process of changing the pitch without affecting the speed. There are also more advanced methods used to change speed, pitch, or both at once, as a function of time.
These processes are used, for instance, to match the pitches and tempos of two pre-recorded clips for mixing when the clips cannot be reperformed or resampled. (A drum track containing no pitched instruments could be moderately resampled for tempo without adverse effects, but a pitched track could not.) They are also used to create effects such as increasing the range of an instrument (for example, pitch shifting a guitar down an octave).
Resampling
The simplest way to change the duration or pitch of a digital audio clip is to resample it. This is a mathematical operation that effectively rebuilds a continuous waveform from its samples and then samples that waveform again at a different rate. When the new samples are played at the original sampling frequency, the audio clip sounds faster or slower. Unfortunately, the frequencies in the sample are always scaled at the same rate as the speed, transposing its perceived pitch up or down in the process. In other words, slowing down the recording lowers the pitch, speeding it up raises the pitch, and the two effects cannot be separated. This is analogous to speeding up or slowing down an analog recording, like a phonograph record or tape, creating the chipmunk effect.
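The coupling between speed and pitch can be checked numerically. The following is a minimal sketch, assuming NumPy is available and using plain linear interpolation as a crude stand-in for proper band-limited reconstruction; the helper names `resample_linear` and `dominant_frequency` are illustrative, not taken from any particular library.

```python
import numpy as np

def resample_linear(x, speed):
    """Resample x so that it plays `speed` times faster at the same sample rate."""
    n_out = int(round(len(x) / speed))
    # Fractional read positions in the original clip.
    positions = np.arange(n_out) * speed
    # Linear interpolation approximates the rebuilt continuous waveform.
    return np.interp(positions, np.arange(len(x)), x)

def dominant_frequency(x, sample_rate):
    """Frequency (Hz) of the strongest bin in the windowed magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    return np.argmax(spectrum) * sample_rate / len(x)

sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)      # one second of a 440 Hz tone
fast = resample_linear(tone, 2.0)       # played back twice as fast

print(len(tone), len(fast))
print(dominant_frequency(tone, sample_rate), dominant_frequency(fast, sample_rate))
```

Doubling the speed halves the number of samples and should move the spectral peak up by an octave, which is exactly the inseparable pairing described above.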
Phase vocoder
One way of stretching the length of a signal without affecting the pitch is to build a phase vocoder after Flanagan, Golden, and Portnoff.
Basic steps:
- compute the instantaneous frequency/amplitude relationship of the signal using the STFT, which is the discrete Fourier transform of a short, overlapping and smoothly windowed block of samples;
- apply some processing to the Fourier transform magnitudes and phases (like resampling the FFT blocks); and
- perform an inverse STFT by taking the inverse Fourier transform on each chunk and adding the resulting waveform chunks.
The phase vocoder handles sinusoid components well, but early implementations introduced considerable smearing on transient ("beat") waveforms at all non-integer compression/expansion ratios, which renders the results phasey and diffuse. Recent improvements allow better-quality results at all compression/expansion ratios, but some residual smearing remains.
The phase vocoder technique can also be used to perform pitch shifting, chorusing, timbre manipulation, harmonizing, and other unusual modifications, all of which can be changed as a function of time.
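The basic steps listed for the phase vocoder can be sketched in a few lines of NumPy. This is a simplified illustration rather than a reference implementation: the function name `phase_vocoder_stretch`, the Hann window, the FFT size and the hop size are all assumptions chosen for the example, and no transient handling is attempted.

```python
import numpy as np

def phase_vocoder_stretch(x, stretch, n_fft=1024, hop=256):
    """Time-stretch x by `stretch` (>1 means longer) while keeping its pitch."""
    window = np.hanning(n_fft)

    # Step 1: STFT of short, overlapping, smoothly windowed blocks.
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)

    # Phase advance expected over one hop at each bin's centre frequency.
    omega = 2 * np.pi * hop * np.arange(spec.shape[1]) / n_fft

    # Step 2: read the analysis frames at a slower (or faster) rate and
    # rebuild the phases so each bin keeps its instantaneous frequency.
    steps = np.arange(0, n_frames - 1, 1.0 / stretch)
    out = np.zeros(len(steps) * hop + n_fft)
    acc = phase[0].copy()
    for k, pos in enumerate(steps):
        i = int(pos)
        frac = pos - i
        m = (1 - frac) * mag[i] + frac * mag[i + 1]   # interpolated magnitude

        # Step 3: inverse FFT of each chunk, windowed and overlap-added.
        chunk = np.fft.irfft(m * np.exp(1j * acc), n=n_fft) * window
        out[k * hop:k * hop + n_fft] += chunk

        # Measured phase increment, wrapped to [-pi, pi] around the expected one.
        dphi = phase[i + 1] - phase[i] - omega
        dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))
        acc += omega + dphi

    # Compensate for the summed squared windows in the overlap-add.
    return out * hop / np.sum(window ** 2)

sample_rate = 8000
tone = np.sin(2 * np.pi * 440 * np.arange(sample_rate) / sample_rate)
stretched = phase_vocoder_stretch(tone, 2.0)
print(len(tone), len(stretched))
```

For a steady tone the output is roughly twice as long at the same frequency. On percussive input this sketch would show the smearing discussed above, since every bin's phase is propagated independently.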
SOLA
Rabiner and Schafer in 1978 put forth an alternate solution that works in the time domain: attempt to find the period (or equivalently the fundamental frequency) of a given section of the wave using some pitch detection algorithm (commonly the peak of the signal's autocorrelation, or sometimes cepstral processing), and crossfade one period into another.
This is called time domain harmonic scaling or the synchronized overlap-add method (SOLA) and performs somewhat faster than the phase vocoder on slower machines but fails when the autocorrelation mis-estimates the period of a signal with complicated harmonics (such as orchestral pieces).
Adobe Audition (formerly Cool Edit Pro) seems to solve this by looking for the period closest to a center period that the user specifies, which should be an integer multiple of the tempo, and between 30 Hz and the lowest bass frequency.
This is much more limited in scope than phase-vocoder-based processing, but it can be made much less processor-intensive, which suits real-time applications. It provides the most coherent results for single-pitched sounds like voice or musically monophonic instrument recordings.
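The time-domain idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the published algorithm or any product's implementation: `estimate_period` picks the autocorrelation peak inside a lag range supplied by the caller, and `sola_stretch` aligns each incoming frame with the output tail by an unnormalized cross-correlation search before crossfading. The function names, frame size, overlap and search range are all hypothetical choices.

```python
import numpy as np

def estimate_period(frame, min_lag, max_lag):
    """Period in samples, taken as the autocorrelation peak between two lags."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    return min_lag + int(np.argmax(ac[min_lag:max_lag + 1]))

def sola_stretch(x, stretch, frame=1024, overlap=256, search=128):
    """Time-stretch x by `stretch` using synchronized overlap-add."""
    hop_in = (frame - overlap) / stretch        # how far to advance in the input
    fade_in = np.linspace(0.0, 1.0, overlap)
    out = np.array(x[:frame], dtype=float)
    pos = hop_in
    while int(pos) + search + frame <= len(x):
        start = int(pos)
        tail = out[-overlap:].copy()
        # Find the offset at which the input best lines up with the output tail.
        scores = [np.dot(tail, x[start + k:start + k + overlap]) for k in range(search)]
        best = start + int(np.argmax(scores))
        seg = x[best:best + frame]
        # Crossfade the aligned segment into the tail, then append the rest.
        out[-overlap:] = tail * (1.0 - fade_in) + seg[:overlap] * fade_in
        out = np.concatenate([out, seg[overlap:]])
        pos += hop_in
    return out

sample_rate = 8000
tone = np.sin(2 * np.pi * 200 * np.arange(2 * sample_rate) / sample_rate)
period = estimate_period(tone[:1024], min_lag=20, max_lag=400)
longer = sola_stretch(tone, 1.5)
print(period, len(tone), len(longer))
```

The search range should cover at least one full period of the lowest pitch expected. On dense polyphonic material the best-matching offset becomes ambiguous, which is the failure mode noted above.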
High-end commercial audio processing packages either combine the two techniques (for example by separating the signal into sinusoid and transient waveforms), or use other techniques based on the wavelet transform, or artificial neural network processing, producing the highest-quality time stretching.
Untangling phase and time
Another way to shift pitch and stretch time is to separate phase and time in a monophonic sound, such as that of a melody instrument.
By altering only the time control, you can stretch, shrink or reverse time, or generate loops as needed in sampling synthesizers.
Time shrinkage can also be used for compression purposes.
By altering only the phase control, you can shift the pitch or apply FM synthesis distortions to an existing sound.
This can be used to play instruments as an alternative to wavetable synthesis.
To control phase and time independently, we would need to know the displacement of the sound for every pair of phase and time positions.
Such a function is defined on a cylinder, as shown in the figure.
However, a sound signal is a one-dimensional signal.
You can consider the sound signal as an observation of the full function along a helix on the cylinder. This is drawn as a black line in the figure.
The full function on the cylinder can be approximated by interpolating between points on the helix with (approximately) the same phase.
From this function a different sound signal can be derived.
For example, the grey line in the figure shows the path of a sound that has the same time progression but a lower frequency than the original, or a sound that has the same frequency and a faster time progression, or something in between.
In the end, the whole process can be implemented for discrete sound signals as interpolation between values with similar phase and similar time.
The described technique is used in the monophonic version of the software Melodyne.
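The interpolation on the cylinder can be illustrated with a small sketch. It rests on a strong simplifying assumption that the text does not make, namely that the period is constant and known in advance (the `period` argument, in samples). The function name and both ratio parameters are hypothetical, and this is not a description of how any commercial product works.

```python
import numpy as np

def resynthesize_on_cylinder(x, period, pitch_ratio=1.0, time_ratio=1.0):
    """Read a new path over the (phase, time) cylinder spanned by x.

    pitch_ratio > 1 raises the pitch; time_ratio > 1 makes time run faster.
    """
    n = np.arange(int(len(x) / time_ratio))
    phase = (pitch_ratio * n) % period     # position within one cycle, in samples
    time = time_ratio * n                  # position along the original clip

    # Fractional index of the helix turn that passes through (phase, time),
    # clipped so that both neighbouring turns lie inside the clip.
    turn = np.clip((time - phase) / period, 0.0, (len(x) - 1) / period - 2)
    k0 = np.floor(turn)
    frac = turn - k0

    # Interpolate between the two turns that carry the same phase.
    idx = np.arange(len(x))
    before = np.interp(k0 * period + phase, idx, x)
    after = np.interp((k0 + 1) * period + phase, idx, x)
    return (1 - frac) * before + frac * after

sample_rate = 8000
tone = np.sin(2 * np.pi * np.arange(sample_rate) / 40)       # period of 40 samples (200 Hz)
higher = resynthesize_on_cylinder(tone, 40, pitch_ratio=1.5)  # same length, higher pitch
slower = resynthesize_on_cylinder(tone, 40, time_ratio=0.5)   # same pitch, twice as long
print(len(tone), len(higher), len(slower))
```

With real instrument recordings the period drifts from cycle to cycle, so a practical version would first have to track the phase of the input over time.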
Sinusoidal/Spectral Modeling
Another alternative method for time stretching relies on a spectral model of the signal. In this method, peaks are identified in frames using the STFT of the signal, and sinusoidal "tracks" are created by connecting peaks in adjacent frames. The tracks are then re-synthesized at a new time scale. This method can yield good results on both polyphonic and percussive material, especially when the signal is separated into sub-bands. However, this method is more computationally demanding than other methods.
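A heavily simplified sketch of this approach follows. The assumptions are considerable: peaks are linked across frames simply by their rank in frequency order (a real tracker would handle tracks being born and dying), peak frequencies are taken at bin centres with no refinement, there is no noise or transient model, and the names `find_peaks` and `sinusoidal_stretch` are illustrative only.

```python
import numpy as np

def find_peaks(mag, k):
    """Indices of the k strongest local maxima of a magnitude spectrum, by frequency."""
    is_peak = (mag[1:-1] > mag[:-2]) & (mag[1:-1] >= mag[2:])
    idx = np.flatnonzero(is_peak) + 1
    strongest = idx[np.argsort(mag[idx])[-k:]]
    return np.sort(strongest)

def sinusoidal_stretch(x, sample_rate, stretch, k=1, n_fft=1024, hop=256):
    """Time-stretch x by resynthesizing up to k sinusoidal tracks."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    freqs = np.zeros((n_frames, k))
    amps = np.zeros((n_frames, k))
    for i in range(n_frames):
        mag = np.abs(np.fft.rfft(x[i * hop:i * hop + n_fft] * window))
        peaks = find_peaks(mag, k)
        # Column j holds the j-th peak of every frame; that column is one "track".
        freqs[i, :len(peaks)] = peaks * sample_rate / n_fft
        amps[i, :len(peaks)] = 2 * mag[peaks] / window.sum()

    # Resynthesis: stretch the track envelopes in time, then run one oscillator per track.
    n_out = int((n_frames - 1) * hop * stretch)
    frame_pos = np.arange(n_out) / (hop * stretch)
    out = np.zeros(n_out)
    for j in range(k):
        f = np.interp(frame_pos, np.arange(n_frames), freqs[:, j])
        a = np.interp(frame_pos, np.arange(n_frames), amps[:, j])
        out += a * np.sin(2 * np.pi * np.cumsum(f) / sample_rate)
    return out

sample_rate = 8000
tone = np.sin(2 * np.pi * 440 * np.arange(sample_rate) / sample_rate)
stretched = sinusoidal_stretch(tone, sample_rate, 2.0)
print(len(tone), len(stretched))
```

Because frequencies are quantized to bin centres here, the resynthesized tone can be off by up to half a bin width; practical systems interpolate the peak position to avoid this.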
Speed hearing & Speed Talking
For the specific case of speech, time stretching can be performed using PSOLA.
Time stretching can be used with audio books and recorded lectures.
Slowing down may improve comprehension of foreign languages http://www.enounce.com/whatistsm.shtml.
While one might expect speeding up to reduce comprehension, Herb Friedman says that "Experiments have shown that the brain works most efficiently if the information rate through the ears--via speech--is the 'average' reading rate, which is about 200-300 wpm (words per minute), yet the average rate of speech is in the neighborhood of 100-150 wpm."
Speeding up audio is seen as the equivalent of "speed reading".
Time stretching is often used to adjust radio commercials http://web.archive.org/web/20080527184101/http://www.tvtechnology.com/features/audio_notes/f_audionotes.shtml and the audio of television advertisements http://www.atarimagazines.com/creative/v9n7/122_Variable_speech.php to fit exactly into the 30 or 60 seconds available.
Pitch scaling
These techniques can also be used to transpose an audio sample while holding speed or duration constant. This may be accomplished by time stretching and then resampling back to the original length. Alternatively, the frequency of the sinusoids in a sinusoidal model may be altered directly, and the signal reconstructed at the appropriate time scale.
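The first route, stretching and then resampling back to the original length, can be written as a short composition. In this sketch `stretch_fn` is a hypothetical argument standing for any pitch-preserving time stretcher that takes a signal and a stretch factor (for instance a phase vocoder or SOLA routine); the resampling step uses linear interpolation for brevity.

```python
import numpy as np

def pitch_shift(x, semitones, stretch_fn):
    """Shift the pitch of x by `semitones` while keeping its length.

    stretch_fn(signal, factor) is assumed to return the signal made
    `factor` times longer without changing its pitch.
    """
    ratio = 2.0 ** (semitones / 12.0)
    # Step 1: make the clip `ratio` times longer at the original pitch.
    stretched = stretch_fn(x, ratio)
    # Step 2: resample back to the original length, which plays it `ratio`
    # times faster and therefore multiplies every frequency by `ratio`.
    positions = np.linspace(0, len(stretched) - 1, len(x))
    return np.interp(positions, np.arange(len(stretched)), stretched)
```

For a downward shift the ratio is below 1, so the clip is first shortened and then slowed back down to its original length.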
Transposing can be called frequency scaling or pitch shifting, depending on perspective.
For example, one could move the pitch of every note up by a perfect fifth, keeping the tempo the same.
One can view this transposition as "pitch shifting", "shifting" each note up 7 keys on a piano keyboard, or adding a fixed amount on the Mel scale, or adding a fixed amount in linear pitch space.
One can view the same transposition as "frequency scaling", "scaling" (multiplying) the frequency of every note by 3/2 (or by 2^(7/12) ≈ 1.498 in equal temperament, where a fifth spans 7 semitones).
Musical transposition preserves the ratios of the harmonic frequencies that determine the sound's timbre, unlike the frequency shift performed by amplitude modulation, which adds a fixed frequency offset to the frequency of every note. (In theory one could perform a literal pitch scaling in which the musical pitch space location is scaled [a higher note would be shifted at a greater interval in linear pitch space than a lower note], but that is highly unusual, and not musical.)
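The difference between scaling and shifting shows up in a small worked example. The three partials below are made-up illustration values (a 220 Hz fundamental with two harmonics), and the helper names are hypothetical.

```python
harmonics = [220.0, 440.0, 660.0]   # fundamental plus two harmonics, ratios 1:2:3

def semitones_to_ratio(semitones):
    """Equal-tempered frequency ratio for an interval given in semitones."""
    return 2.0 ** (semitones / 12.0)

def scale(freqs, ratio):
    """Frequency scaling (musical transposition): multiply every partial."""
    return [f * ratio for f in freqs]

def shift(freqs, offset_hz):
    """Frequency shifting: add the same offset in Hz to every partial."""
    return [f + offset_hz for f in freqs]

def ratios(freqs):
    """Each partial relative to the lowest one."""
    return [f / freqs[0] for f in freqs]

scaled = scale(harmonics, 3 / 2)     # up a (just) perfect fifth
shifted = shift(harmonics, 110.0)    # fundamental lands on the same 330 Hz

print(scaled, ratios(scaled))
print(shifted, ratios(shifted))
print(semitones_to_ratio(7))
```

After scaling, the partials remain in the ratio 1:2:3, so the timbre is preserved. After shifting, the fundamental reaches the same 330 Hz but the upper partials are no longer integer multiples of it, which makes the result sound inharmonic.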
Time domain processing works much better here, as smearing is less noticeable, but scaling vocal samples distorts the formants into a sort of Alvin and the Chipmunks-like effect, which may be desirable or undesirable.
A process that preserves the formants and character of a voice involves analyzing the signal with a channel vocoder or LPC vocoder plus any of several pitch detection algorithms and then resynthesizing it at a different fundamental frequency.
A detailed description of older analog recording techniques for pitch shifting can be found within the Alvin and the Chipmunks entry.
See also
- Audio signal processing
- Pitch control
- Sound effects
- Time-compressed speech
- PSOLA
External links
- Time Stretching and Pitch Shifting Overview - A comprehensive overview of current time and pitch modification techniques by Stephan Bernsee
- Stephan Bernsee's smbPitchShift C source code - C source code for doing frequency domain pitch manipulation
- pitchsift.js from KievII - A Javascript pitchshifter based on smbPitchShift code, from the open source KievII library
- The Phase Vocoder: A Tutorial - A good description of the phase vocoder
- New Phase-Vocoder Techniques for Pitch-Shifting, Harmonizing and Other Exotic Effects
- A new Approach to Transient Processing in the Phase Vocoder
- PICOLA and TDHS
- How to build a pitch shifter - Theory, equations, figures and performances of a real-time guitar pitch shifter running on a DSP chip
- Elastique from Zplane - Code used in most of the DJ software