GitHub Pages demo

Abstract:

Text-to-Speech (TTS) synthesis faces the inherent challenge of producing multiple speech outputs with varying prosody given a single text input. While previous research has addressed this by predicting prosodic information from both text and speech, additional contextual information, such as video, remains under-utilized despite being available in many applications. To address this challenge, this study explores the potential of integrating visual context to enhance prosody prediction. We propose a novel model, VisualSpeech, which incorporates both visual and textual information for improved prosody generation in TTS. Empirical results demonstrate that visual features provide valuable prosodic cues beyond the textual input, significantly enhancing the naturalness of the synthesized speech.

TTS Results

These samples accompany Section 4.4 of our paper and demonstrate the effectiveness of our proposed model, which uses visual features to improve prosody in TTS. Below is the speech synthesized by each model variant: the baseline (i.e., FastSpeech2), VisualSpeech (Omnivore), and VisualSpeech (ResNet50). As you can hear, speech produced by the proposed VisualSpeech model has better prosody than speech produced by the FastSpeech2 baseline, and its quality is close to that of the ground truth.

FastSpeech2 | VisualSpeech (Omnivore) | VisualSpeech (ResNet50) | Ground-truth

Text: "who's gonna defend your kids."