Human speech can be decomposed into four important components: content, timbre, pitch, and rhythm. The content component carries the primary linguistic information in speech, which can be transcribed into text. The timbre component captures the speaker's vocal characteristics and conveys the speaker's identity. The remaining two components, pitch and rhythm, express the speaker's emotion: the variation of pitch reflects the speaker's tone, and rhythm characterizes how fast the speaker pronounces each word or syllable.
Obtaining disentangled representations of these four components is useful in many speech analysis and generation applications. However, existing models can disentangle only timbre, while pitch, rhythm, and content remain entangled. Disentangling the three remaining components is an under-determined problem in the absence of explicit annotations for each component, and such annotations are expensive to obtain.
This paper proposes SpeechSplit, an autoencoder that can blindly decompose speech into content, timbre, pitch, and rhythm by introducing three carefully designed information bottlenecks. SpeechSplit is among the first algorithms that can separately perform style transfer on timbre, pitch, and rhythm without requiring text labels.
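To make the information-bottleneck idea concrete, the toy sketch below shows how restricting an encoder's capacity forces it to discard information. This is not the paper's architecture: the `bottleneck` function, the use of a fixed random projection in place of a learned layer, and all dimensions are illustrative assumptions. The general principle is that each encoder output is squeezed along the channel dimension (fewer features per frame) and the time dimension (fewer frames), so each code can only retain a limited portion of the input signal.

```python
import numpy as np

def bottleneck(features, out_dim, downsample):
    """Toy information bottleneck (hypothetical, for illustration only):
    project each frame to a smaller dimension (channel bottleneck), then
    keep every `downsample`-th frame (temporal bottleneck)."""
    t, d = features.shape
    rng = np.random.default_rng(0)
    # Stand-in for a learned linear layer in a real encoder.
    proj = rng.standard_normal((d, out_dim)) / np.sqrt(d)
    reduced = features @ proj          # shape (t, out_dim): channel bottleneck
    return reduced[::downsample]       # shape (ceil(t/downsample), out_dim): temporal bottleneck

# A 100-frame, 80-bin mel-spectrogram stand-in.
mel = np.random.default_rng(1).standard_normal((100, 80))

# Three bottlenecks with different capacities; the sizes here are made up,
# not the paper's hyperparameters.
content_code = bottleneck(mel, out_dim=8, downsample=2)
rhythm_code  = bottleneck(mel, out_dim=1, downsample=1)
pitch_code   = bottleneck(mel, out_dim=4, downsample=4)

print(content_code.shape)  # (50, 8)
print(rhythm_code.shape)   # (100, 1)
print(pitch_code.shape)    # (25, 4)
```

In the sketch, tuning `out_dim` and `downsample` per branch mimics the design intent: sizing each bottleneck so that only one speech component fits through it, with the decoder recovering the rest from the other branches.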
Audio demo: https://anonymous0818.github.io/