Kyubyong/dc_tts

A TensorFlow Implementation of DC-TTS: yet another text-to-speech model


Top Related Projects

  • TTS: :robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)
  • Tacotron2: Tacotron 2 - PyTorch implementation with faster-than-realtime inference
  • Real-Time-Voice-Cloning: Clone a voice in 5 seconds to generate arbitrary speech in real-time
  • WaveRNN: WaveRNN Vocoder + TTS
  • deepvoice3_pytorch: PyTorch implementation of convolutional neural networks-based text-to-speech synthesis models
  • Tacotron: A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)

Quick Overview

The Kyubyong/dc_tts repository is an implementation of the DC-TTS (Deep Convolutional Text-to-Speech) model in TensorFlow. It aims to provide a fast and efficient text-to-speech synthesis system based on the paper "Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention."
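The "guided attention" in the paper's title is a penalty that pushes the text-to-audio attention toward a near-diagonal, monotonic alignment. As a reference, a minimal NumPy sketch of the guide matrix from the paper (this is illustrative, not code from the repository; g = 0.2 is the paper's value):

import numpy as np

def guided_attention_weights(N, T, g=0.2):
    # W[n, t] = 1 - exp(-((n/N - t/T)^2) / (2 * g^2)):
    # near zero along the diagonal, close to 1 far away from it.
    n = np.arange(N)[:, None] / N   # normalized character positions
    t = np.arange(T)[None, :] / T   # normalized mel-frame positions
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g ** 2))

# The guided-attention loss is the mean of W * A over the attention matrix A,
# added to the usual spectrogram reconstruction losses.
W = guided_attention_weights(N=180, T=210)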

Pros

  • Fast inference time compared to traditional TTS models
  • High-quality speech synthesis with natural-sounding output
  • Relatively simple architecture, making it easier to understand and modify
  • Supports multiple languages, including English and Korean

Cons

  • Requires significant computational resources for training
  • Limited documentation and explanations in the repository
  • May require fine-tuning for optimal performance on specific datasets
  • Dependency on older versions of TensorFlow and other libraries

Code Examples

  1. Loading and normalizing text data (text_normalize is provided by the repository's data_load.py):

from data_load import text_normalize

def load_data(text_file):
    # Read one transcript per line and normalize it for the model.
    texts = []
    with open(text_file, 'r') as f:
        for line in f:
            texts.append(text_normalize(line.strip()))
    return texts

texts = load_data('data/texts.txt')

  2. Encoding text into integer sequences (illustrative; the repository maps characters to indices with load_vocab in data_load.py rather than a TextEncoder class):

from text import TextEncoder

encoder = TextEncoder()
encoded_texts = [encoder.encode(text) for text in texts]

  3. Generating mel spectrograms from the graph-building blocks in networks.py (shown schematically; argument and return conventions may differ slightly from the repository):

from networks import TextEnc, AudioEnc, AudioDec, Attention

L = encoded_texts              # batch of character-id sequences
S = previous_mel_frames        # shifted mel frames (teacher forcing or feedback)
K, V = TextEnc(L)              # keys and values from the text encoder
Q = AudioEnc(S)                # queries from the audio encoder
R, alignments, max_attentions = Attention(Q, K, V)
mel_outputs = AudioDec(R)      # coarse mel-spectrogram prediction

Getting Started

  1. Clone the repository:

    git clone https://github.com/Kyubyong/dc_tts.git
    cd dc_tts
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Prepare your dataset and update the hyperparams.py file with appropriate paths and parameters.

  4. Train the model:

    python train.py
    
  5. Synthesize speech:

    python synthesize.py --text "Your text here" --checkpoint /path/to/checkpoint
    

Note: Make sure to use compatible versions of TensorFlow and other dependencies as specified in the repository.
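Since the code targets the TensorFlow 1.x API (tf.layers, tf.contrib), a quick sanity check before training can save a confusing stack trace later. This is a suggested guard, not something the repository ships:

import tensorflow as tf

major = int(tf.__version__.split('.')[0])
if major >= 2:
    raise RuntimeError(
        "dc_tts uses the TensorFlow 1.x API; install a 1.x release "
        "(the README asks for >= 1.3) in a separate virtual environment.")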

Competitor Comparisons

TTS: :robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)

Pros of TTS

  • More comprehensive and actively maintained project with regular updates
  • Supports multiple TTS architectures and models
  • Provides pre-trained models and easy-to-use inference scripts

Cons of TTS

  • Higher complexity and steeper learning curve
  • Requires more computational resources for training and inference

Code Comparison

TTS:

from TTS.utils.synthesizer import Synthesizer

synthesizer = Synthesizer(
    tts_checkpoint="path/to/model.pth",
    tts_config_path="path/to/config.json",
    vocoder_checkpoint="path/to/vocoder.pth",
    vocoder_config="path/to/vocoder_config.json"
)
wav = synthesizer.tts("Hello world!")

dc_tts (illustrative; the repository is driven by running synthesize.py rather than importing a Synthesizer class):

from synthesizer import Synthesizer  # hypothetical wrapper, not part of the repository

synthesizer = Synthesizer()
wav = synthesizer.synthesize("Hello world!")

Key Differences

  • TTS offers more flexibility and options for model architectures
  • dc_tts focuses on a single architecture (DC-TTS) with simpler implementation
  • TTS provides more extensive documentation and community support
  • dc_tts is lighter and easier to get started with for beginners

Use Cases

  • TTS: Suitable for production environments and research projects requiring various TTS models
  • dc_tts: Ideal for learning TTS basics and quick prototyping with DC-TTS architecture

Tacotron2: Tacotron 2 - PyTorch implementation with faster-than-realtime inference

Pros of Tacotron2

  • Higher quality speech synthesis with more natural-sounding output
  • Better handling of long sentences and complex pronunciations
  • Includes a pre-trained model for quick start and experimentation

Cons of Tacotron2

  • More computationally intensive, requiring more powerful hardware
  • Longer training time compared to DC-TTS
  • More complex architecture, potentially harder to modify or fine-tune

Code Comparison

DC-TTS:

def conv1d(inputs, filters, size, rate, padding, activation, training, scope):
    with tf.variable_scope(scope):
        outputs = tf.layers.conv1d(
            inputs,
            filters=filters,
            kernel_size=size,
            dilation_rate=rate,
            padding=padding,
            activation=None)
        outputs = tf.layers.batch_normalization(outputs, training=training)
        outputs = activation(outputs)
    return outputs

Tacotron2:

class ConvNorm(torch.nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=1, stride=1,
                 padding=None, dilation=1, bias=True, w_init_gain='linear'):
        super(ConvNorm, self).__init__()
        if padding is None:
            assert(kernel_size % 2 == 1)
            padding = int(dilation * (kernel_size - 1) / 2)

        self.conv = torch.nn.Conv1d(in_channels, out_channels,
                                    kernel_size=kernel_size, stride=stride,
                                    padding=padding, dilation=dilation,
                                    bias=bias)

Real-Time-Voice-Cloning: Clone a voice in 5 seconds to generate arbitrary speech in real-time

Pros of Real-Time-Voice-Cloning

  • Supports real-time voice cloning with as little as 5 seconds of audio
  • Utilizes a more advanced architecture (SV2TTS) for improved voice quality
  • Offers a user-friendly interface for easy experimentation

Cons of Real-Time-Voice-Cloning

  • Requires more computational resources due to its complex architecture
  • May have a steeper learning curve for beginners
  • Potentially slower inference time compared to dc_tts

Code Comparison

Real-Time-Voice-Cloning:

encoder = SpeakerEncoder("encoder/saved_models/pretrained.pt")
synthesizer = Synthesizer("synthesizer/saved_models/pretrained.pt")
vocoder = WaveRNN("vocoder/saved_models/pretrained.pt")

dc_tts (simplified; not a verbatim excerpt from the repository's synthesize.py):

def synthesize(text, models, configs, dir_name, base_path):
    # Load models
    model = Text2Mel(configs.model)
    ssrn = SSRN(configs.model)

The Real-Time-Voice-Cloning code shows the initialization of three separate models (encoder, synthesizer, and vocoder), while dc_tts uses two models (Text2Mel and SSRN). This reflects the more complex architecture of Real-Time-Voice-Cloning, which contributes to its advanced capabilities but also increases resource requirements.
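For context, here is a conceptual sketch of how the two dc_tts stages fit together; the function names are placeholders passed in as callables, not the repository's API:

def dc_tts_pipeline(text, encode_text, text2mel, ssrn, invert_spectrogram):
    # Stage 1: Text2Mel predicts a coarse, time-reduced mel spectrogram from
    # character ids, with guided attention aligning text to audio frames.
    mel = text2mel(encode_text(text))
    # Stage 2: SSRN upsamples the mel spectrogram in time and frequency to a
    # full linear magnitude spectrogram.
    linear = ssrn(mel)
    # A phase-reconstruction step (e.g. Griffin-Lim) then yields the waveform.
    return invert_spectrogram(linear)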

WaveRNN: WaveRNN Vocoder + TTS

Pros of WaveRNN

  • Faster synthesis speed and improved audio quality
  • More flexible architecture, allowing for various vocoder implementations
  • Active development and community support

Cons of WaveRNN

  • More complex setup and training process
  • Requires more computational resources for training
  • May have longer inference times for high-quality output

Code Comparison

WaveRNN:

def forward(self, x, mels):
    bsize = x.size(0)
    h1, h2 = self.wavernn(x, mels)
    return self.fc(h1), self.fc(h2)

dc_tts (schematic; the repository is TensorFlow-based, so this PyTorch-style forward method is illustrative only):

def forward(self, inputs):
    enc_output = self.encoder(inputs)
    mel_output = self.decoder(enc_output)
    return mel_output

WaveRNN focuses on waveform generation using recurrent neural networks, while dc_tts uses a more traditional encoder-decoder architecture for mel-spectrogram synthesis. WaveRNN's code snippet shows the forward pass of the neural network, handling both waveform and mel-spectrogram inputs. In contrast, dc_tts processes text inputs through an encoder and generates mel-spectrograms using a decoder.

deepvoice3_pytorch: PyTorch implementation of convolutional neural networks-based text-to-speech synthesis models

Pros of deepvoice3_pytorch

  • Supports multi-speaker TTS with speaker embeddings
  • Implements multiple vocoder options (WaveNet, Griffin-Lim)
  • More extensive documentation and examples

Cons of deepvoice3_pytorch

  • More complex architecture, potentially harder to understand and modify
  • Longer training time due to additional components
  • May require more computational resources

Code Comparison

dc_tts:

def text2mel(text):
    # Text-to-mel-spectrogram conversion
    return mel_spectrogram

def ssrn(mel):
    # Spectrogram super-resolution (SSRN), followed by Griffin-Lim to get the waveform
    return waveform

deepvoice3_pytorch:

def text2mel(text, speaker_id):
    # Text-to-mel-spectrogram conversion with speaker embedding
    return mel_spectrogram

def wavenet(mel):
    # Mel-spectrogram-to-waveform conversion using WaveNet
    return waveform

def griffin_lim(mel):
    # Alternative mel-spectrogram-to-waveform conversion
    return waveform

Both repositories implement text-to-speech systems, but deepvoice3_pytorch offers more flexibility with multi-speaker support and multiple vocoder options. However, this comes at the cost of increased complexity and computational requirements. dc_tts provides a simpler architecture that may be easier to understand and modify for specific use cases.

Tacotron: A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)

Pros of Tacotron

  • More established and widely used in the research community
  • Supports both character and phoneme inputs
  • Includes pre-trained models for quick experimentation

Cons of Tacotron

  • Generally slower training and inference times
  • May require more computational resources
  • Less focus on real-time applications

Code Comparison

Tacotron

def create_hparams(hparams_string=None, verbose=False):
    hparams = tf.contrib.training.HParams(
        # Comma-separated list of cleaners to run on text prior to training and eval.
        cleaners='english_cleaners',
        # Audio
        num_mels=80,
        num_freq=1025,
        sample_rate=20000,
        frame_length_ms=50,
        frame_shift_ms=12.5,
        # ...
    )

DC_TTS

def get_spectrograms(fpath):
    # Loading sound file
    y, sr = librosa.load(fpath, sr=hp.sr)

    # Trimming
    y, _ = librosa.effects.trim(y)

    # Preemphasis
    y = np.append(y[0], y[1:] - hp.preemphasis * y[:-1])

    # stft
    linear = librosa.stft(y=y,
                          n_fft=hp.n_fft,
                          hop_length=hp.hop_length,
                          win_length=hp.win_length)

The code snippets show different approaches to audio processing and hyperparameter management in the two projects.
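dc_tts keeps these audio settings in a single hyperparams.py module that the rest of the code imports as hp. A minimal sketch of that pattern (the values shown are illustrative and may not match the repository's exact settings):

class Hyperparams:
    # Signal-processing settings used by get_spectrograms above.
    sr = 22050                           # sampling rate
    n_fft = 2048                         # FFT window size
    frame_shift = 0.0125                 # seconds between successive frames
    frame_length = 0.05                  # seconds per analysis window
    hop_length = int(sr * frame_shift)   # hop size in samples
    win_length = int(sr * frame_length)  # window size in samples
    n_mels = 80                          # number of mel bands
    preemphasis = 0.97                   # preemphasis coefficient

hp = Hyperparams()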


README

A TensorFlow Implementation of DC-TTS: yet another text-to-speech model

I implement yet another text-to-speech model, dc-tts, introduced in Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention. My goal, however, is not just replicating the paper. Rather, I'd like to gain insights about various sound projects.

Requirements

  • NumPy >= 1.11.1
  • TensorFlow >= 1.3 (Note that the API of tf.contrib.layers.layer_norm has changed since 1.3)
  • librosa
  • tqdm
  • matplotlib
  • scipy

Data

I train English models and a Korean model on four different speech datasets.

1. LJ Speech Dataset
2. Nick Offerman's Audiobooks
3. Kate Winslet's Audiobook
4. KSS Dataset

The LJ Speech Dataset has recently become a widely used benchmark for TTS because it is publicly available and contains 24 hours of reasonable-quality samples. Nick's and Kate's audiobooks are used in addition to see whether the model can learn even from smaller amounts of more variable speech data; they are 18 hours and 5 hours long, respectively. Finally, the KSS Dataset is a Korean single-speaker speech dataset that is more than 12 hours long.

Training

  • STEP 0. Download LJ Speech Dataset or prepare your own data.
  • STEP 1. Adjust hyper parameters in hyperparams.py. (If you want to do preprocessing, set prepro to True.)
  • STEP 2. Run python train.py 1 to train Text2Mel. (If you set prepro to True, run python prepro.py first.)
  • STEP 3. Run python train.py 2 to train SSRN.

You can do STEP 2 and STEP 3 at the same time if you have more than one GPU card.
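The numeric argument selects which of the two networks to train. A rough sketch of that dispatch (an assumption about the script's structure, not a verbatim excerpt from train.py):

import sys

def train_text2mel():
    # Placeholder: build the Text2Mel graph (text -> coarse mel) and run its training loop.
    print("training Text2Mel")

def train_ssrn():
    # Placeholder: build the SSRN graph (mel -> full linear spectrogram) and run its training loop.
    print("training SSRN")

if __name__ == "__main__":
    num = int(sys.argv[1])  # 1 for Text2Mel, 2 for SSRN, as in `python train.py 1`
    {1: train_text2mel, 2: train_ssrn}[num]()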

Training Curves

Attention Plot

Sample Synthesis

I generate speech samples based on the Harvard Sentences, as the original paper does. The sentence list is already included in the repo.

  • Run synthesize.py and check the files in samples.

Generated Samples

Dataset   Samples
LJ        50k, 200k, 310k, 800k
Nick      40k, 170k, 300k, 800k
Kate      40k, 160k, 300k, 800k
KSS       400k

Pretrained Model for LJ

Download this.

Notes

  • The paper didn't mention normalization, but without normalization I couldn't get it to work. So I added layer normalization.
  • The paper fixed the learning rate to 0.001, but it didn't work for me, so I decayed it (see the sketch after these notes).
  • I tried to train Text2Mel and SSRN simultaneously, but it didn't work. I guess separating those two networks mitigates the burden of training.
  • The authors claimed that the model can be trained within a day, but unfortunately I was not so lucky. Still, it is obviously much faster to train than Tacotron, as it uses only convolution layers.
  • Thanks to the guided attention, the attention plot looks monotonic almost from the beginning. I guess this holds the alignment tight so it won't lose track.
  • The paper didn't mention dropouts. I applied them as I believe it helps for regularization.
  • Check also other TTS models such as Tacotron and Deep Voice 3.
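As referenced in the learning-rate note above, here is a minimal sketch of a warmup-then-decay (Noam-style) schedule of the kind that works well for this model; the exact schedule and constants used in the repository may differ:

def learning_rate_decay(init_lr, global_step, warmup_steps=4000.0):
    # Linear warmup for warmup_steps, then decay proportional to 1/sqrt(step).
    step = float(global_step) + 1.0
    return init_lr * warmup_steps ** 0.5 * min(step * warmup_steps ** -1.5,
                                               step ** -0.5)

# Starting from the paper's 0.001, the rate peaks at init_lr when step == warmup_steps
# and then decays smoothly.
for step in (0, 1000, 4000, 40000, 400000):
    print(step, learning_rate_decay(1e-3, step))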