Abstract

Accent conversion (AC) transforms a non-native speaker's accent into a native accent while preserving the speaker's voice timbre. In this paper, we propose approaches that improve both the applicability and the quality of accent conversion. First, we assume that no reference speech is available at the conversion stage, so we employ an end-to-end text-to-speech system trained on native speech to generate native reference speech. To improve the quality and accent of the converted speech, we introduce reference encoders that allow the model to use multi-source information. The motivation is that acoustic features extracted from the native reference, together with linguistic information, are complementary to conventional phonetic posteriorgrams (PPGs), so they can be concatenated as additional features to improve a baseline system based only on PPGs. Moreover, we optimize the model architecture by using GMM-based attention instead of windowed attention to improve synthesis performance. Experimental results indicate that, when the proposed techniques are applied, the integrated system significantly improves acoustic quality (30% relative increase in mean opinion score) and native accent (68% relative preference) while retaining the voice identity of the non-native speaker.
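
The feature concatenation mentioned in the abstract can be pictured with a small sketch. It assumes, purely for illustration, that each reference encoder produces one fixed-size vector per utterance and that this vector is appended to every PPG frame; the abstract does not state how the reference features are aligned with the frames, and the names `concat_features`, `mel_ref` and `phoneme_ref` are hypothetical.

```python
def concat_features(ppg_frames, mel_ref=None, phoneme_ref=None):
    """Append utterance-level reference vectors to every PPG frame.

    ppg_frames:  list of T frames, each a list of phonetic posteriors.
    mel_ref:     optional vector from a (hypothetical) mel reference encoder.
    phoneme_ref: optional vector from a (hypothetical) phoneme reference encoder.
    Assumption: one reference vector per utterance, repeated on every frame.
    """
    extra = list(mel_ref or []) + list(phoneme_ref or [])
    # Each output frame is the PPG frame followed by the reference features.
    return [list(frame) + extra for frame in ppg_frames]


# Two PPG frames of width 2, a 3-dim mel reference and a 1-dim phoneme reference.
ppg = [[0.9, 0.1], [0.2, 0.8]]
features = concat_features(ppg, mel_ref=[1.0, 2.0, 3.0], phoneme_ref=[4.0])
```

With both reference vectors omitted, the function returns the PPG frames unchanged, which corresponds to a system that uses PPGs only.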


Notes

  • ZHAA is an Arabic speaker of L2 English
  • Dataset (L2-ARCTIC corpus) [1]: https://psi.engr.tamu.edu/l2-arctic-corpus
  • Please view this page in Google Chrome or Microsoft Edge for best quality

  • Systems

    Baseline [2]: For the conversion model, see https://github.com/guanlongzhao/fac-via-ppg.

    System 1: Only structural adjustments are made. We replace the location-sensitive attention with GMM attention, and replace the original Tacotron2 encoder with a CBHG network.

    System 2: In addition to structural adjustments, the mel reference encoder is added.

    System 3: In addition to structural adjustments, both the mel and phoneme reference encoders are added.
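
To make the GMM attention used in Systems 1 to 3 more concrete, here is a minimal sketch of one attention step in the style of Graves's original formulation. It is not taken from the authors' code: the page does not say which GMM attention variant is used, and the function name, argument layout and the projection that would produce `raw_params` are assumptions.

```python
import math


def gmm_attention_step(raw_params, prev_kappa, memory):
    """One decoder step of Graves-style GMM attention (illustrative only).

    raw_params: list of K (w_hat, beta_hat, kappa_hat) tuples, assumed to be
                projected from the decoder state by a layer not shown here.
    prev_kappa: list of K component positions from the previous step.
    memory:     list of T encoder output vectors, each of length D.
    """
    kappa = []
    alignment = [0.0] * len(memory)
    for (w_hat, beta_hat, kappa_hat), k_prev in zip(raw_params, prev_kappa):
        w = math.exp(w_hat)               # mixture weight, kept positive
        beta = math.exp(beta_hat)         # inverse width, kept positive
        k = k_prev + math.exp(kappa_hat)  # position can only move forward
        kappa.append(k)
        # Add this component's Gaussian-shaped window over encoder positions.
        for j in range(len(memory)):
            alignment[j] += w * math.exp(-beta * (k - j) ** 2)
    dim = len(memory[0])
    # Context vector: alignment-weighted sum of the encoder outputs.
    context = [sum(alignment[j] * memory[j][d] for j in range(len(memory)))
               for d in range(dim)]
    return context, alignment, kappa


# Toy usage: 6 encoder steps with one-hot outputs and a single mixture component.
memory = [[1.0 if i == j else 0.0 for i in range(6)] for j in range(6)]
context, alignment, kappa = gmm_attention_step([(0.0, 0.0, 0.0)], [2.0], memory)
```

Because each position is updated by adding a strictly positive step, the attention window can only move forward along the input, which is the usual argument for GMM attention giving more stable alignments than content-based mechanisms.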



    Speech Samples

    Source (non-native accent) audio: ground-truth recordings of speaker ZHAA.
    Target (native accent) audio: speech generated by the TTS system.

    Audio columns: ZHAA, TTS, Baseline, System 1, System 2, System 3

    Sample 1: "The more his opponents grew excited the more Ernest deliberately excited them"

    Sample 2: "Violation of this law was made a high misdemeanor and punished accordingly"

    Sample 3: "The flower of the artistic and intellectual world were revolutionists"

    Sample 4: "The life there was healthful and athletic but too juvenile"

    Sample 5: "Did I possess too much vitality"

    Sample 6: "Captain Doane's orders were swiftly obeyed"

    Sample 7: "The task we set ourselves was threefold"

    Sample 8: "Zilla relaxed her sour mouth long enough to sigh her satisfaction"