
Kyubyong/tacotron

A TensorFlow Implementation of Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model


Top Related Projects

  • mozilla/TTS: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)
  • NVIDIA/tacotron2: Tacotron 2 - PyTorch implementation with faster-than-realtime inference
  • keithito/tacotron: A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
  • CorentinJ/Real-Time-Voice-Cloning: Clone a voice in 5 seconds to generate arbitrary speech in real-time
  • fatchord/WaveRNN: WaveRNN Vocoder + TTS

Quick Overview

Kyubyong/tacotron is a GitHub repository that implements Tacotron, a text-to-speech synthesis model, in TensorFlow. It aims to generate human-like speech from text input using deep learning techniques, based on the original Tacotron paper by Google.

Pros

  • Implements a state-of-the-art text-to-speech model
  • Provides a complete pipeline for training and synthesizing speech
  • Includes pre-trained models for quick experimentation
  • Supports both English and Korean languages

Cons

  • Requires significant computational resources for training
  • May have lower audio quality compared to more recent TTS models
  • Limited documentation and examples for customization
  • Depends on older versions of TensorFlow and other libraries

Code Examples

  1. Loading a pre-trained model and synthesizing speech:

    from synthesizer import Synthesizer
    
    synthesizer = Synthesizer()
    wav = synthesizer.synthesize("Hello, world!")

  2. Preparing a dataset for training:

    from data_load import load_data
    
    fpaths, text_lengths, texts = load_data(mode="train")

  3. Training the Tacotron model:

    from train import train
    
    train(data_path="path/to/dataset", save_path="path/to/save/model")

Getting Started

  1. Clone the repository:

    git clone https://github.com/Kyubyong/tacotron.git
    cd tacotron
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Download pre-trained models or prepare your dataset.

  4. For synthesis using a pre-trained model:

    from synthesizer import Synthesizer
    
    synthesizer = Synthesizer()
    wav = synthesizer.synthesize("Your text here")
    
  5. For training a new model, prepare your dataset and run:

    from train import train
    
    train(data_path="path/to/dataset", save_path="path/to/save/model")
    

Competitor Comparisons


mozilla/TTS: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)

Pros of TTS

  • More comprehensive and actively maintained project
  • Supports multiple TTS architectures and models
  • Includes pre-trained models and easy-to-use inference scripts

Cons of TTS

  • More complex setup and usage due to broader feature set
  • Larger codebase, potentially harder to understand for beginners
  • Requires more computational resources for training and inference

Code Comparison

TTS:

from TTS.utils.synthesizer import Synthesizer

synthesizer = Synthesizer(
    tts_checkpoint="path/to/model.pth",
    tts_config_path="path/to/config.json",
    vocoder_checkpoint="path/to/vocoder.pth",
    vocoder_config="path/to/vocoder_config.json"
)
wavs = synthesizer.tts("Hello world!")

Tacotron:

from synthesizer import Synthesizer

synthesizer = Synthesizer()
wav = synthesizer.synthesize("Hello world!")

The TTS code snippet demonstrates its more flexible architecture, allowing for separate TTS and vocoder models. Tacotron's implementation is simpler but less customizable. TTS offers more control over the synthesis process, while Tacotron provides a more straightforward API for basic usage.

NVIDIA/tacotron2: Tacotron 2 - PyTorch implementation with faster-than-realtime inference

Pros of tacotron2

  • Improved audio quality with WaveNet vocoder integration
  • Better performance and faster training due to NVIDIA optimizations
  • More extensive documentation and examples

Cons of tacotron2

  • Higher computational requirements for training and inference
  • More complex architecture, potentially harder to modify or customize

Code Comparison

tacotron:

def prenet(inputs, is_training, scope="prenet"):
    with tf.variable_scope(scope):
        outputs = tf.layers.dense(inputs, units=256, activation=tf.nn.relu)
        outputs = tf.layers.dropout(outputs, rate=0.5, training=is_training)
        outputs = tf.layers.dense(outputs, units=128, activation=tf.nn.relu)
        outputs = tf.layers.dropout(outputs, rate=0.5, training=is_training)
    return outputs

tacotron2:

class Prenet(nn.Module):
    def __init__(self, in_dim, sizes):
        super(Prenet, self).__init__()
        in_sizes = [in_dim] + sizes[:-1]
        self.layers = nn.ModuleList(
            [LinearNorm(in_size, out_size, bias=False)
             for (in_size, out_size) in zip(in_sizes, sizes)])

    def forward(self, x):
        for linear in self.layers:
            x = F.dropout(F.relu(linear(x)), p=0.5, training=True)
        return x

The tacotron2 implementation uses PyTorch, while the original tacotron uses TensorFlow. The tacotron2 version offers more flexibility with customizable layer sizes and uses a more object-oriented approach.
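
As a hypothetical usage sketch of the Prenet above (the import path is an assumption; Prenet builds on the repo's LinearNorm layer):

    import torch
    from model import Prenet  # assumed location in the tacotron2 repo
    
    # Map 80-dim mel frames through two 256-unit prenet layers.
    prenet = Prenet(in_dim=80, sizes=[256, 256])
    frames = torch.randn(16, 80)  # a batch of 16 mel frames
    out = prenet(frames)          # shape: (16, 256)

Note that forward applies dropout with training=True unconditionally, so the prenet stays stochastic even at inference time; in Tacotron 2 this is a deliberate regularization choice, not a bug.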

keithito/tacotron: A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)

Pros of Tacotron (keithito)

  • More active development and maintenance
  • Better documentation and code organization
  • Includes pre-trained models for quicker start

Cons of Tacotron (keithito)

  • Slightly more complex implementation
  • May require more computational resources

Code Comparison

Tacotron (keithito):

def create_hparams(hparams_string=None, verbose=False):
    hparams = tf.contrib.training.HParams(
        # Comma-separated list of cleaners to run on text prior to training and eval:
        cleaners='english_cleaners',
        # Audio:
        num_mels=80,
        num_freq=1025,
        sample_rate=20000,
        frame_length_ms=50,
        frame_shift_ms=12.5,
        preemphasis=0.97,
        min_level_db=-100,
        ref_level_db=20,
    )

Tacotron (Kyubyong):

def get_spectrograms(sound_file):
    # Loading sound file
    y, sr = librosa.load(sound_file, sr=None)

    # Trimming
    y, _ = librosa.effects.trim(y)

    # Preemphasis
    y = np.append(y[0], y[1:] - hp.preemphasis * y[:-1])

    # stft
    linear = librosa.stft(y=y,
                          n_fft=hp.n_fft,
                          hop_length=hp.hop_length,
                          win_length=hp.win_length)

The code snippets show different approaches to audio processing and hyperparameter management between the two implementations.
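
As a hedged sketch of the two styles: keithito's HParams object (TensorFlow 1.x) accepts string overrides at run time, while Kyubyong's code imports a plain namespace of constants as hp. The field values below are illustrative:

    import tensorflow as tf
    
    # keithito style: tf.contrib HParams supports run-time string overrides.
    hparams = tf.contrib.training.HParams(num_mels=80, sample_rate=20000)
    hparams.parse("sample_rate=22050")  # override a single field
    print(hparams.sample_rate)          # 22050
    
    # Kyubyong style: a plain class of constants, imported as `hp`.
    class Hyperparams:
        n_fft = 2048        # FFT size
        hop_length = 276    # samples between successive STFT frames
        win_length = 1102   # STFT window length
        preemphasis = 0.97  # pre-emphasis coefficient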

CorentinJ/Real-Time-Voice-Cloning: Clone a voice in 5 seconds to generate arbitrary speech in real-time

Pros of Real-Time-Voice-Cloning

  • Offers real-time voice cloning capabilities
  • Includes a user-friendly interface for easy interaction
  • Supports multi-speaker voice cloning

Cons of Real-Time-Voice-Cloning

  • More complex setup and dependencies
  • Requires more computational resources
  • May have longer processing times for voice synthesis

Code Comparison

Tacotron

def griffin_lim(spectrogram):
    X_best = copy.deepcopy(spectrogram)
    for i in range(n_iter):
        X_t = invert_spectrogram(X_best)
        est = librosa.stft(X_t, n_fft, hop_length, win_length=win_length)
        phase = est / np.maximum(1e-8, np.abs(est))
        X_best = spectrogram * phase
    X_t = invert_spectrogram(X_best)
    y = np.real(X_t)
    return y

Real-Time-Voice-Cloning

def load_model(model_dir: Path):
    json_config = model_dir.joinpath("config.json")
    with json_config.open() as f:
        config = json.load(f)
    model = SV2TTS(config)
    checkpoint = torch.load(model_dir.joinpath("model.pt"))
    model.load_state_dict(checkpoint["model_state"])
    return model

The code snippets showcase different aspects of each project. Tacotron focuses on audio processing with the Griffin-Lim algorithm, while Real-Time-Voice-Cloning emphasizes model loading and configuration for voice synthesis.
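
The griffin_lim snippet above calls an invert_spectrogram helper that is not shown. A minimal sketch, assuming it is simply an inverse STFT with the same parameters as the forward transform:

    import librosa
    
    def invert_spectrogram(spectrogram):
        # Inverse STFT back to a time-domain signal; hop_length and
        # win_length must match the forward stft call in griffin_lim.
        return librosa.istft(spectrogram, hop_length=hop_length,
                             win_length=win_length, window="hann")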


fatchord/WaveRNN: WaveRNN Vocoder + TTS

Pros of WaveRNN

  • Faster inference time due to efficient WaveRNN architecture
  • Supports real-time audio synthesis
  • More recent implementation with active development

Cons of WaveRNN

  • Requires more computational resources for training
  • Less extensive documentation compared to Tacotron
  • Narrower focus on vocoder implementation

Code Comparison

Tacotron (preprocessing):

def preprocess(text):
    text = text.lower()
    seq = [char2idx[c] for c in text]
    seq += [char2idx[EOS]]  # End of Sequence
    return np.array(seq, dtype=np.int32)

WaveRNN (audio processing):

def load_wav(path):
    return librosa.load(path, sr=hp.sample_rate)[0]

def save_wav(x, path):
    scipy.io.wavfile.write(path, hp.sample_rate, x.astype(np.float32))

The code snippets highlight different focuses: Tacotron emphasizes text preprocessing for sequence-to-sequence models, while WaveRNN concentrates on audio processing tasks, reflecting their respective roles in the text-to-speech pipeline.
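
For context, the preprocess function above presumes a character-to-index vocabulary. A minimal sketch of that mapping; the actual character set lives in the repo's hyperparameters, so this one is illustrative:

    # P = padding, E = end-of-sequence, then the allowed characters.
    EOS = "E"
    vocab = "PE abcdefghijklmnopqrstuvwxyz'.?"
    char2idx = {char: idx for idx, char in enumerate(vocab)}
    idx2char = {idx: char for idx, char in enumerate(vocab)}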


README

A (Heavily Documented) TensorFlow Implementation of Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model

Requirements

  • NumPy >= 1.11.1
  • TensorFlow >= 1.3
  • librosa
  • tqdm
  • matplotlib
  • scipy

Data

We train the model on three different speech datasets.

  1. LJ Speech Dataset
  2. Nick Offerman's Audiobooks
  3. The World English Bible

The LJ Speech Dataset has recently become a widely used benchmark for TTS because it is publicly available. It contains 24 hours of reasonable-quality samples. Nick's audiobooks are used in addition, to see whether the model can learn from smaller and more variable speech data; they total 18 hours. The World English Bible is a public-domain update of the American Standard Version of 1901 into modern English; its original audio recordings are freely available online. Kyubyong split each chapter by verse manually and aligned the segmented audio clips to the text, 72 hours in total. You can download all three from Kaggle Datasets.

Training

  • STEP 0. Download LJ Speech Dataset or prepare your own data.
  • STEP 1. Adjust hyperparameters in hyperparams.py. (If you want to do preprocessing, set prepro to True; see the sketch after these steps.)
  • STEP 2. Run python train.py. (If you set prepro to True, run python prepro.py first.)
  • STEP 3. Run python eval.py regularly during training.
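
As a rough sketch of what STEP 1 touches, hyperparams.py exposes module-level settings such as the prepro switch. The field names and values below are illustrative, not the repo's exact contents:

    # hyperparams.py (illustrative excerpt)
    class Hyperparams:
        prepro = True                  # True: train from precomputed spectrograms
        data = "path/to/LJSpeech-1.1"  # dataset location
        lr = 0.001                     # initial learning rate (see Notes below)
        batch_size = 32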

Sample Synthesis

We generate speech samples based on the Harvard Sentences, as the original paper does; the sentence list is already included in the repo.

  • Run python synthesize.py and check the files in samples.

Training Curve

Attention Plot

Generated Samples

Pretrained Files

  • Keep in mind 200k steps may not be enough for the best performance.
  • LJ 200k
  • WEB 200k

Notes

  • It's important to monitor the attention plots during training. If the attention plots look good (the alignment looks linear) and then degrade, coming to resemble what they looked like at the beginning of training, then training has gone awry and will most likely need to be restarted from a checkpoint where the attention still looked good; in our experience the loss is unlikely to recover. This deterioration of attention corresponds with a spike in the loss.

  • In the original paper, the authors said, "An important trick we discovered was predicting multiple, non-overlapping output frames at each decoder step", where the number of frames predicted per step is the reduction factor, r. We originally interpreted this as predicting non-sequential frames at each decoding step t, and were thus using the following scheme (with r=5) during decoding.

    t    frame numbers
    -----------------------
    0    [ 0  5 10 15 20]
    1    [ 1  6 11 16 21]
    2    [ 2  7 12 17 22]
    ...
    

    After much experimentation, we were unable to get the model to learn anything useful this way. We then switched to predicting r sequential frames at each decoding step (grouped as in the sketch after these notes).

    t    frame numbers
    -----------------------
    0    [ 0  1  2  3  4]
    1    [ 5  6  7  8  9]
    2    [10 11 12 13 14]
    ...
    

    With this setup we noticed improvements in the attention and have since kept it.

  • Perhaps the most important hyperparameter is the learning rate. With an initial learning rate of 0.002 we were never able to learn a clean attention, and the loss would frequently explode. With an initial learning rate of 0.001 we were able to learn a clean attention and train for much longer, obtaining discernible words during synthesis.

  • Check out other TTS models such as DCTTS or Deep Voice 3.
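
As a concrete sketch of the sequential-frame grouping described above: with reduction factor r, each decoder step targets r consecutive mel frames, which amounts to a simple reshape of the target spectrogram (NumPy, illustrative shapes):

    import numpy as np
    
    r, n_mels = 5, 80
    mels = np.random.randn(20, n_mels).astype(np.float32)  # T = 20 frames
    
    # Decoder step t now targets frames [t*r, ..., t*r + r - 1].
    grouped = mels.reshape(-1, n_mels * r)
    print(grouped.shape)  # (4, 400): 4 decoder steps, 5 frames each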

Differences from the original paper

  • We use Noam-style learning-rate warmup and decay (sketched after this list).
  • We implement gradient clipping.
  • Our training batches are bucketed.
  • After the last convolutional layer of the post-processing net, we apply an affine transformation to bring the dimensionality up to 128 from 80, because the highway net requires an input dimensionality of 128. In the original highway networks paper, the authors mention that the dimensionality of the input can also be increased with zero-padding, but they used the affine transformation in all their experiments. We do not know what the Tacotron authors chose.
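
A minimal sketch of a Noam-style schedule as referenced in the warmup-and-decay bullet above; the warmup length and peak rate are illustrative, not the repo's exact values:

    def noam_lr(step, warmup_steps=4000, init_lr=0.001):
        # Linear warmup to init_lr at warmup_steps, then inverse-sqrt decay.
        step = max(step, 1)
        return init_lr * warmup_steps ** 0.5 * min(
            step ** -0.5, step * warmup_steps ** -1.5)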


Jan. 2018, Kyubyong Park & Tommy Mulc