Abstract:
Text-to-Speech (TTS) synthesis faces the inherent challenge of producing multiple speech outputs with varying prosody given a single text input. While previous research has addressed this by predicting prosodic information from both text and speech, additional contextual information, such as video, remains under-utilized despite being available in many applications. To address this challenge, this study explores the potential of integrating visual context to enhance prosody prediction. We propose a novel model, VisualSpeech, which incorporates both visual and textual information for improved prosody generation in TTS. Empirical results demonstrate that visual features provide valuable prosodic cues beyond the textual input, significantly enhancing the naturalness of the synthesized speech.
TTS Results
These samples accompany Section 4.4 of our paper and demonstrate the effectiveness of our proposed model, which uses visual features to improve prosody in TTS. Below is the speech synthesized by all model variants: the baseline (i.e., Fastspeech2), VisualSpeech (Omnivore), and VisualSpeech (ResNet50). As you can hear, speech produced by the proposed VisualSpeech model has better prosody than speech from the baseline Fastspeech2, and its quality is close to the ground truth.
Fastspeech2    VisualSpeech (Omnivore)    VisualSpeech (ResNet50)    Ground-truth
Text: "who's gonna defend your kids."
Text: "It's okay, Rockle."
Text: "and keep moving forward."
Text: "You have to show him balance."
Text: "That's being the detective, be a Lieutenant."
Text: "You are scared to death about their future."
Text: "I'm going to ask the district attorney to petition."
Text: "You will not follow protocol."
Text: "It somewhat looks like you have sewn together."
Text: "Bet your feet one of the con."