Kyubyong/dc_tts

A TensorFlow Implementation of DC-TTS: yet another text-to-speech model


Top Related Projects

  • TTS: :robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)
  • Tacotron2: Tacotron 2 - PyTorch implementation with faster-than-realtime inference
  • Real-Time-Voice-Cloning: Clone a voice in 5 seconds to generate arbitrary speech in real-time
  • WaveRNN: WaveRNN Vocoder + TTS
  • deepvoice3_pytorch: PyTorch implementation of convolutional neural networks-based text-to-speech synthesis models
  • Tacotron: A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)

Quick Overview

The Kyubyong/dc_tts repository is an implementation of the DC-TTS (Deep Convolutional Text-to-Speech) model in TensorFlow. It aims to provide a fast and efficient text-to-speech synthesis system based on the paper "Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention."
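The "guided attention" in the paper's title is a penalty that pushes the text-to-audio attention toward a near-diagonal, monotonic alignment. As a reference, a minimal NumPy sketch of the guide matrix from the paper (this is illustrative, not code from the repository; g = 0.2 is the paper's value):

import numpy as np

def guided_attention_weights(N, T, g=0.2):
    # W[n, t] = 1 - exp(-((n/N - t/T)^2) / (2 * g^2)):
    # near zero along the diagonal, close to 1 far away from it.
    n = np.arange(N)[:, None] / N   # normalized character positions
    t = np.arange(T)[None, :] / T   # normalized mel-frame positions
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g ** 2))

# The guided-attention loss is the mean of W * A over the attention matrix A,
# added to the usual spectrogram reconstruction losses.
W = guided_attention_weights(N=180, T=210)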

Pros

  • Fast inference time compared to traditional TTS models
  • High-quality speech synthesis with natural-sounding output
  • Relatively simple architecture, making it easier to understand and modify
  • Supports multiple languages, including English and Korean

Cons

  • Requires significant computational resources for training
  • Limited documentation and explanations in the repository
  • May require fine-tuning for optimal performance on specific datasets
  • Dependency on older versions of TensorFlow and other libraries

Code Examples

  1. Loading and normalizing text data (text_normalize is provided by the repository's data_load.py):

from data_load import text_normalize

def load_data(text_file):
    # Read one transcript per line and normalize it for the model.
    texts = []
    with open(text_file, 'r') as f:
        for line in f:
            texts.append(text_normalize(line.strip()))
    return texts

texts = load_data('data/texts.txt')

  2. Encoding text into integer sequences (illustrative; the repository maps characters to indices with load_vocab in data_load.py rather than a TextEncoder class):

from text import TextEncoder

encoder = TextEncoder()
encoded_texts = [encoder.encode(text) for text in texts]

  3. Generating mel spectrograms from the graph-building blocks in networks.py (shown schematically; argument and return conventions may differ slightly from the repository):

from networks import TextEnc, AudioEnc, AudioDec, Attention

L = encoded_texts              # batch of character-id sequences
S = previous_mel_frames        # shifted mel frames (teacher forcing or feedback)
K, V = TextEnc(L)              # keys and values from the text encoder
Q = AudioEnc(S)                # queries from the audio encoder
R, alignments, max_attentions = Attention(Q, K, V)
mel_outputs = AudioDec(R)      # coarse mel-spectrogram prediction

Getting Started

  1. Clone the repository:

    git clone https://github.com/Kyubyong/dc_tts.git
    cd dc_tts
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Prepare your dataset and update the hyperparams.py file with appropriate paths and parameters.

  4. Train the model:

    python train.py
    
  5. Synthesize speech:

    python synthesize.py --text "Your text here" --checkpoint /path/to/checkpoint
    

Note: Make sure to use compatible versions of TensorFlow and other dependencies as specified in the repository.
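Since the code targets the TensorFlow 1.x API (tf.layers, tf.contrib), a quick sanity check before training can save a confusing stack trace later. This is a suggested guard, not something the repository ships:

import tensorflow as tf

major = int(tf.__version__.split('.')[0])
if major >= 2:
    raise RuntimeError(
        "dc_tts uses the TensorFlow 1.x API; install a 1.x release "
        "(the README asks for >= 1.3) in a separate virtual environment.")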

Competitor Comparisons

TTS: :robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)

Pros of TTS

  • More comprehensive and actively maintained project with regular updates
  • Supports multiple TTS architectures and models
  • Provides pre-trained models and easy-to-use inference scripts

Cons of TTS

  • Higher complexity and steeper learning curve
  • Requires more computational resources for training and inference

Code Comparison

TTS:

from TTS.utils.synthesizer import Synthesizer

synthesizer = Synthesizer(
    tts_checkpoint="path/to/model.pth",
    tts_config_path="path/to/config.json",
    vocoder_checkpoint="path/to/vocoder.pth",
    vocoder_config="path/to/vocoder_config.json"
)
wav = synthesizer.tts("Hello world!")

dc_tts (illustrative; the repository is driven by running synthesize.py rather than importing a Synthesizer class):

from synthesizer import Synthesizer  # hypothetical wrapper, not part of the repository

synthesizer = Synthesizer()
wav = synthesizer.synthesize("Hello world!")

Key Differences

  • TTS offers more flexibility and options for model architectures
  • dc_tts focuses on a single architecture (DC-TTS) with simpler implementation
  • TTS provides more extensive documentation and community support
  • dc_tts is lighter and easier to get started with for beginners

Use Cases

  • TTS: Suitable for production environments and research projects requiring various TTS models
  • dc_tts: Ideal for learning TTS basics and quick prototyping with DC-TTS architecture

Tacotron2: Tacotron 2 - PyTorch implementation with faster-than-realtime inference

Pros of Tacotron2

  • Higher quality speech synthesis with more natural-sounding output
  • Better handling of long sentences and complex pronunciations
  • Includes a pre-trained model for quick start and experimentation

Cons of Tacotron2

  • More computationally intensive, requiring more powerful hardware
  • Longer training time compared to DC-TTS
  • More complex architecture, potentially harder to modify or fine-tune

Code Comparison

DC-TTS:

def conv1d(inputs, filters, size, rate, padding, activation, training, scope):
    with tf.variable_scope(scope):
        outputs = tf.layers.conv1d(
            inputs,
            filters=filters,
            kernel_size=size,
            dilation_rate=rate,
            padding=padding,
            activation=None)
        outputs = tf.layers.batch_normalization(outputs, training=training)
        outputs = activation(outputs)
    return outputs

Tacotron2:

class ConvNorm(torch.nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=1, stride=1,
                 padding=None, dilation=1, bias=True, w_init_gain='linear'):
        super(ConvNorm, self).__init__()
        if padding is None:
            assert(kernel_size % 2 == 1)
            padding = int(dilation * (kernel_size - 1) / 2)

        self.conv = torch.nn.Conv1d(in_channels, out_channels,
                                    kernel_size=kernel_size, stride=stride,
                                    padding=padding, dilation=dilation,
                                    bias=bias)

Real-Time-Voice-Cloning: Clone a voice in 5 seconds to generate arbitrary speech in real-time

Pros of Real-Time-Voice-Cloning

  • Supports real-time voice cloning with as little as 5 seconds of audio
  • Utilizes a more advanced architecture (SV2TTS) for improved voice quality
  • Offers a user-friendly interface for easy experimentation

Cons of Real-Time-Voice-Cloning

  • Requires more computational resources due to its complex architecture
  • May have a steeper learning curve for beginners
  • Potentially slower inference time compared to dc_tts

Code Comparison

Real-Time-Voice-Cloning:

encoder = SpeakerEncoder("encoder/saved_models/pretrained.pt")
synthesizer = Synthesizer("synthesizer/saved_models/pretrained.pt")
vocoder = WaveRNN("vocoder/saved_models/pretrained.pt")

dc_tts (simplified; not a verbatim excerpt from the repository's synthesize.py):

def synthesize(text, models, configs, dir_name, base_path):
    # Load models
    model = Text2Mel(configs.model)
    ssrn = SSRN(configs.model)

The Real-Time-Voice-Cloning code shows the initialization of three separate models (encoder, synthesizer, and vocoder), while dc_tts uses two models (Text2Mel and SSRN). This reflects the more complex architecture of Real-Time-Voice-Cloning, which contributes to its advanced capabilities but also increases resource requirements.
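For context, here is a conceptual sketch of how the two dc_tts stages fit together; the function names are placeholders passed in as callables, not the repository's API:

def dc_tts_pipeline(text, encode_text, text2mel, ssrn, invert_spectrogram):
    # Stage 1: Text2Mel predicts a coarse, time-reduced mel spectrogram from
    # character ids, with guided attention aligning text to audio frames.
    mel = text2mel(encode_text(text))
    # Stage 2: SSRN upsamples the mel spectrogram in time and frequency to a
    # full linear magnitude spectrogram.
    linear = ssrn(mel)
    # A phase-reconstruction step (e.g. Griffin-Lim) then yields the waveform.
    return invert_spectrogram(linear)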

WaveRNN: WaveRNN Vocoder + TTS

Pros of WaveRNN

  • Faster synthesis speed and improved audio quality
  • More flexible architecture, allowing for various vocoder implementations
  • Active development and community support

Cons of WaveRNN

  • More complex setup and training process
  • Requires more computational resources for training
  • May have longer inference times for high-quality output

Code Comparison

WaveRNN:

def forward(self, x, mels):
    bsize = x.size(0)
    h1, h2 = self.wavernn(x, mels)
    return self.fc(h1), self.fc(h2)

dc_tts (schematic; the repository is TensorFlow-based, so this PyTorch-style forward method is illustrative only):

def forward(self, inputs):
    enc_output = self.encoder(inputs)
    mel_output = self.decoder(enc_output)
    return mel_output

WaveRNN focuses on waveform generation using recurrent neural networks, while dc_tts uses a more traditional encoder-decoder architecture for mel-spectrogram synthesis. WaveRNN's code snippet shows the forward pass of the neural network, handling both waveform and mel-spectrogram inputs. In contrast, dc_tts processes text inputs through an encoder and generates mel-spectrograms using a decoder.

deepvoice3_pytorch: PyTorch implementation of convolutional neural networks-based text-to-speech synthesis models

Pros of deepvoice3_pytorch

  • Supports multi-speaker TTS with speaker embeddings
  • Implements multiple vocoder options (WaveNet, Griffin-Lim)
  • More extensive documentation and examples

Cons of deepvoice3_pytorch

  • More complex architecture, potentially harder to understand and modify
  • Longer training time due to additional components
  • May require more computational resources

Code Comparison

dc_tts:

def text2mel(text):
    # Text-to-mel-spectrogram conversion
    return mel_spectrogram

def ssrn(mel):
    # Spectrogram super-resolution (SSRN), followed by Griffin-Lim to get the waveform
    return waveform

deepvoice3_pytorch:

def text2mel(text, speaker_id):
    # Text-to-mel-spectrogram conversion with speaker embedding
    return mel_spectrogram

def wavenet(mel):
    # Mel-spectrogram-to-waveform conversion using WaveNet
    return waveform

def griffin_lim(mel):
    # Alternative mel-spectrogram-to-waveform conversion
    return waveform

Both repositories implement text-to-speech systems, but deepvoice3_pytorch offers more flexibility with multi-speaker support and multiple vocoder options. However, this comes at the cost of increased complexity and computational requirements. dc_tts provides a simpler architecture that may be easier to understand and modify for specific use cases.

Tacotron: A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)

Pros of Tacotron

  • More established and widely used in the research community
  • Supports both character and phoneme inputs
  • Includes pre-trained models for quick experimentation

Cons of Tacotron

  • Generally slower training and inference times
  • May require more computational resources
  • Less focus on real-time applications

Code Comparison

Tacotron

def create_hparams(hparams_string=None, verbose=False):
    hparams = tf.contrib.training.HParams(
        # Comma-separated list of cleaners to run on text prior to training and eval.
        cleaners='english_cleaners',
        # Audio
        num_mels=80,
        num_freq=1025,
        sample_rate=20000,
        frame_length_ms=50,
        frame_shift_ms=12.5,
        # ...
    )

DC_TTS

def get_spectrograms(fpath):
    # Loading sound file
    y, sr = librosa.load(fpath, sr=hp.sr)

    # Trimming
    y, _ = librosa.effects.trim(y)

    # Preemphasis
    y = np.append(y[0], y[1:] - hp.preemphasis * y[:-1])

    # stft
    linear = librosa.stft(y=y,
                          n_fft=hp.n_fft,
                          hop_length=hp.hop_length,
                          win_length=hp.win_length)

The code snippets show different approaches to audio processing and hyperparameter management in the two projects.
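dc_tts keeps these audio settings in a single hyperparams.py module that the rest of the code imports as hp. A minimal sketch of that pattern (the values shown are illustrative and may not match the repository's exact settings):

class Hyperparams:
    # Signal-processing settings used by get_spectrograms above.
    sr = 22050                           # sampling rate
    n_fft = 2048                         # FFT window size
    frame_shift = 0.0125                 # seconds between successive frames
    frame_length = 0.05                  # seconds per analysis window
    hop_length = int(sr * frame_shift)   # hop size in samples
    win_length = int(sr * frame_length)  # window size in samples
    n_mels = 80                          # number of mel bands
    preemphasis = 0.97                   # preemphasis coefficient

hp = Hyperparams()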


README

A TensorFlow Implementation of DC-TTS: yet another text-to-speech model

I implement yet another text-to-speech model, dc-tts, introduced in Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention. My goal, however, is not just replicating the paper. Rather, I'd like to gain insights about various sound projects.

Requirements

  • NumPy >= 1.11.1
  • TensorFlow >= 1.3 (Note that the API of tf.contrib.layers.layer_norm has changed since 1.3)
  • librosa
  • tqdm
  • matplotlib
  • scipy

Data

I train English models and a Korean model on four different speech datasets.

1. LJ Speech Dataset
2. Nick Offerman's Audiobooks
3. Kate Winslet's Audiobook
4. KSS Dataset

The LJ Speech Dataset has recently become a widely used benchmark for TTS because it is publicly available and contains 24 hours of reasonable-quality samples. Nick's and Kate's audiobooks are used in addition to see whether the model can learn even from smaller amounts of more variable speech data; they are 18 hours and 5 hours long, respectively. Finally, the KSS Dataset is a Korean single-speaker speech dataset that is more than 12 hours long.

Training

  • STEP 0. Download LJ Speech Dataset or prepare your own data.
  • STEP 1. Adjust hyper parameters in hyperparams.py. (If you want to do preprocessing, set prepro to True.)
  • STEP 2. Run python train.py 1 to train Text2Mel. (If you set prepro to True, run python prepro.py first.)
  • STEP 3. Run python train.py 2 to train SSRN.

You can do STEP 2 and STEP 3 at the same time if you have more than one GPU card.
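The numeric argument selects which of the two networks to train. A rough sketch of that dispatch (an assumption about the script's structure, not a verbatim excerpt from train.py):

import sys

def train_text2mel():
    # Placeholder: build the Text2Mel graph (text -> coarse mel) and run its training loop.
    print("training Text2Mel")

def train_ssrn():
    # Placeholder: build the SSRN graph (mel -> full linear spectrogram) and run its training loop.
    print("training SSRN")

if __name__ == "__main__":
    num = int(sys.argv[1])  # 1 for Text2Mel, 2 for SSRN, as in `python train.py 1`
    {1: train_text2mel, 2: train_ssrn}[num]()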

Training Curves

Attention Plot

Sample Synthesis

I generate speech samples based on the Harvard Sentences, as the original paper does. The sentence list is already included in the repo.

  • Run synthesize.py and check the files in samples.

Generated Samples

Dataset   Samples
LJ        50k, 200k, 310k, 800k
Nick      40k, 170k, 300k, 800k
Kate      40k, 160k, 300k, 800k
KSS       400k

Pretrained Model for LJ

Download this.

Notes

  • The paper didn't mention normalization, but without normalization I couldn't get it to work. So I added layer normalization.
  • The paper fixed the learning rate to 0.001, but it didn't work for me, so I decayed it (see the sketch after these notes).
  • I tried to train Text2Mel and SSRN simultaneously, but it didn't work. I guess separating those two networks mitigates the burden of training.
  • The authors claimed that the model can be trained within a day, but unfortunately I was not so lucky. Still, it is obviously much faster to train than Tacotron, as it uses only convolution layers.
  • Thanks to the guided attention, the attention plot looks monotonic almost from the beginning. I guess this holds the alignment tight so it won't lose track.
  • The paper didn't mention dropouts. I applied them as I believe it helps for regularization.
  • Check also other TTS models such as Tacotron and Deep Voice 3.
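As referenced in the learning-rate note above, here is a minimal sketch of a warmup-then-decay (Noam-style) schedule of the kind that works well for this model; the exact schedule and constants used in the repository may differ:

def learning_rate_decay(init_lr, global_step, warmup_steps=4000.0):
    # Linear warmup for warmup_steps, then decay proportional to 1/sqrt(step).
    step = float(global_step) + 1.0
    return init_lr * warmup_steps ** 0.5 * min(step * warmup_steps ** -1.5,
                                               step ** -0.5)

# Starting from the paper's 0.001, the rate peaks at init_lr when step == warmup_steps
# and then decays smoothly.
for step in (0, 1000, 4000, 40000, 400000):
    print(step, learning_rate_decay(1e-3, step))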