seamless_communication
Foundational Models for State-of-the-Art Speech and Text Translation
Top Related Projects
- Whisper: Robust Speech Recognition via Large-Scale Weak Supervision
- TTS: :robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)
- ESPnet: End-to-End Speech Processing Toolkit
- SpeechBrain: A PyTorch-based Speech Toolkit
Quick Overview
Seamless Communication is an open-source project by Meta AI that aims to break down language barriers through AI-powered translation and communication tools. It includes models for speech-to-speech, speech-to-text, and text-to-speech translation across multiple languages, with a focus on preserving the speaker's voice and style.
Pros
- Supports a wide range of languages and translation tasks
- Preserves speaker's voice characteristics in speech-to-speech translation
- Open-source, allowing for community contributions and improvements
- Includes pre-trained models for quick deployment
Cons
- Requires significant computational resources for training and inference
- May have limitations in accuracy for less common languages or dialects
- Potential privacy concerns when processing sensitive speech or text data
- Dependency on large language models may introduce biases
Code Examples
# Example 1: Creating a translator from a pre-trained model
# (model and vocoder names follow the v1-style examples used elsewhere on this page;
# newer releases expose Translator under seamless_communication.inference)
import torch
from seamless_communication.models.inference import Translator
translator = Translator("seamlessM4T_medium", "vocoder_36langs", torch.device("cuda:0"), torch.float16)
# Example 2: Speech-to-text translation (S2TT) into English
# audio_input is a path to a 16 kHz .wav file
translated_text, _, _ = translator.predict(audio_input, "s2tt", tgt_lang="eng")
# Example 3: Text-to-speech translation (T2ST) into French
translated_text, wav, sr = translator.predict(text_input, "t2st", tgt_lang="fra", src_lang="eng")
Getting Started
To get started with Seamless Communication:
- Install the package:
pip install git+https://github.com/facebookresearch/seamless_communication.git
- Import the inference API (the v1-style entry point used throughout this page):
import torch
from seamless_communication.models.inference import Translator
- Initialize a translator with a pre-trained model and vocoder:
translator = Translator("seamlessM4T_medium", "vocoder_36langs", torch.device("cuda:0"), torch.float16)
- Perform translation (here, speech-to-text translation into English):
translated_text, _, _ = translator.predict(input_data, "s2tt", tgt_lang="eng")
For more detailed instructions and advanced usage, refer to the project's documentation on GitHub.
Competitor Comparisons
Whisper: Robust Speech Recognition via Large-Scale Weak Supervision
Pros of Whisper
- Highly accurate speech recognition across multiple languages
- Open-source and well-documented, making it easy to use and integrate
- Supports transcription, translation, and language identification tasks
Cons of Whisper
- Limited to audio-only processing, lacking support for other modalities
- Requires significant computational resources for real-time processing
- Does not support real-time speech-to-speech translation
Code Comparison
Whisper:
import whisper
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
Seamless Communication:
import torch
from seamless_communication.models.inference import Translator
# v1-style API; newer releases expose Translator under seamless_communication.inference
translator = Translator("seamlessM4T_medium", "vocoder_36langs", torch.device("cuda:0"))
translated_text, _, _ = translator.predict("Hello, how are you?", "t2tt", tgt_lang="fra", src_lang="eng")
While Whisper focuses on speech recognition and transcription, Seamless Communication offers a more comprehensive suite of tools for multilingual communication, including speech-to-speech translation. Whisper excels in its simplicity and accuracy for audio processing, while Seamless Communication provides a broader range of features for cross-lingual communication tasks.
TTS: :robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)
Pros of TTS
- More focused on text-to-speech tasks, potentially offering deeper specialization
- Longer development history, potentially more stable and mature
- Broader language support for TTS tasks
Cons of TTS
- Less comprehensive in terms of speech-related tasks (e.g., no speech-to-speech translation)
- Potentially less advanced in terms of AI/ML techniques compared to Seamless Communication
- May have fewer resources for development and maintenance
Code Comparison
TTS:
from TTS.api import TTS
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello world!", file_path="output.wav")
Seamless Communication:
import torch
from seamless_communication.models.inference import Translator
translator = Translator("seamlessM4T_medium", "vocoder_36langs", torch.device("cpu"), torch.float32)
# Text-to-text translation from English into French
translated_text, _, _ = translator.predict("Hello world!", "t2tt", tgt_lang="fra", src_lang="eng")
Both repositories offer easy-to-use APIs for their respective tasks, but Seamless Communication provides a more comprehensive set of speech-related functionalities in a single package.
ESPnet: End-to-End Speech Processing Toolkit
Pros of ESPnet
- More comprehensive toolkit covering various speech processing tasks (ASR, TTS, speech enhancement, etc.)
- Longer development history and larger community support
- Extensive documentation and tutorials for easier adoption
Cons of ESPnet
- May have a steeper learning curve due to its broader scope
- Potentially slower inference time for some models compared to Seamless Communication
- Less focus on multilingual and cross-lingual capabilities
Code Comparison
ESPnet example (ASR):
import soundfile
from espnet2.bin.asr_inference import Speech2Text
asr_model = Speech2Text.from_pretrained("espnet/model")
# Speech2Text expects a waveform array rather than a file path
speech, rate = soundfile.read("audio.wav")
text, *_ = asr_model(speech)[0]
print(text)
Seamless Communication example (S2TT):
import torch
from seamless_communication.models.inference import Translator
translator = Translator("seamlessM4T_medium", "vocoder_36langs", torch.device("cuda:0"))
# Speech-to-text translation of the recording into French
translated_text, _, _ = translator.predict("audio.wav", "s2tt", tgt_lang="fra")
print(translated_text)
Both repositories offer powerful speech processing capabilities, but ESPnet provides a more comprehensive toolkit for various tasks, while Seamless Communication focuses on multilingual and cross-lingual applications with potentially faster inference for specific use cases.
SpeechBrain: A PyTorch-based Speech Toolkit
Pros of SpeechBrain
- More comprehensive toolkit for speech-related tasks
- Easier to use for researchers and developers new to speech processing
- Extensive documentation and tutorials available
Cons of SpeechBrain
- Less focused on multilingual and cross-lingual capabilities
- May not be as optimized for large-scale production deployments
- Fewer pre-trained models for specific language pairs
Code Comparison
SpeechBrain:
from speechbrain.pretrained import EncoderDecoderASR
asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-crdnn-rnnlm-librispeech")
transcription = asr_model.transcribe_file("audio_file.wav")
Seamless Communication:
import torch
from seamless_communication.models.inference import Translator
translator = Translator("seamlessM4T_medium", "vocoder_36langs", torch.device("cuda:0"))
# Speech-to-text translation of the recording into French
translated_text, _, _ = translator.predict("audio_file.wav", "s2tt", tgt_lang="fra")
Both repositories offer powerful tools for speech processing, but SpeechBrain provides a more comprehensive toolkit for various speech-related tasks, while Seamless Communication focuses on multilingual and cross-lingual translation capabilities. SpeechBrain may be easier for beginners, while Seamless Communication offers more advanced features for specific use cases.
Seamless Intro
Seamless is a family of AI models that enable more natural and authentic communication across languages. SeamlessM4T is a massive multilingual multimodal machine translation model supporting around 100 languages. SeamlessM4T serves as the foundation for SeamlessExpressive, a model that preserves elements of prosody and voice style across languages, and for SeamlessStreaming, a model supporting simultaneous translation and streaming ASR for around 100 languages. SeamlessExpressive and SeamlessStreaming are combined into Seamless, a unified model featuring multilinguality, real-time and expressive translations.
Links
Demos
 | SeamlessM4T v2 | SeamlessExpressive | SeamlessStreaming |
---|---|---|---|
Demo | SeamlessM4T v2 Demo | SeamlessExpressive Demo | |
HuggingFace Space Demo | 🤗 SeamlessM4T v2 Space | 🤗 SeamlessExpressive Space | 🤗 SeamlessStreaming Space |
Papers
Blog
Tutorial
An exhaustive tutorial, given at the NeurIPS 2023 Seamless EXPO, serves as a one-stop shop for learning how to use the entire suite of Seamless models. Please feel free to play with the notebook.
SeamlessM4T
SeamlessM4T is our foundational all-in-one Massively Multilingual and Multimodal Machine Translation model delivering high-quality translation for speech and text in nearly 100 languages.
SeamlessM4T models support the tasks of:
- Speech-to-speech translation (S2ST)
- Speech-to-text translation (S2TT)
- Text-to-speech translation (T2ST)
- Text-to-text translation (T2TT)
- Automatic speech recognition (ASR)
:star2: We are releasing SeamlessM4T v2, an updated version with our novel UnitY2 architecture. This new model improves over SeamlessM4T v1 in quality as well as inference latency in speech generation tasks.
To learn more about the collection of SeamlessM4T models, the approach used in each, their language coverage and their performance, visit the SeamlessM4T README or 🤗 Model Card.
[!NOTE] Seamless M4T is also available in the 🤗 Transformers library. Visit this section for more details.
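For a quick look at the Transformers route, here is a minimal sketch adapted from the Transformers documentation for text-to-speech translation from English into French; the class name SeamlessM4Tv2Model and the checkpoint id facebook/seamless-m4t-v2-large are assumptions not stated in this README, so defer to the Transformers docs for the exact API:
from transformers import AutoProcessor, SeamlessM4Tv2Model
processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")
# Tokenize English text and generate a French waveform (T2ST); the output is a 16 kHz audio array.
text_inputs = processor(text="Hello, how are you?", src_lang="eng", return_tensors="pt")
audio_array = model.generate(**text_inputs, tgt_lang="fra")[0].cpu().numpy().squeeze()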
SeamlessExpressive
SeamlessExpressive is a speech-to-speech translation model that captures certain underexplored aspects of prosody such as speech rate and pauses, while preserving the style of one's voice and high content translation quality.
To learn more about SeamlessExpressive models, visit the SeamlessExpressive README or 🤗 Model Card.
SeamlessStreaming
SeamlessStreaming is a streaming translation model. The model supports speech as input modality and speech/text as output modalities.
The SeamlessStreaming model supports the following tasks:
- Speech-to-speech translation (S2ST)
- Speech-to-text translation (S2TT)
- Automatic speech recognition (ASR)
To learn more about SeamlessStreaming models, visit the SeamlessStreaming README or 🤗 Model Card.
Seamless
The Seamless model is the unified model for expressive streaming speech-to-speech translations.
What's new
- [12/18/2023] We are open-sourcing our Conformer-based W2v-BERT 2.0 speech encoder as described in Section 3.2.1 of the paper, which is at the core of our Seamless models.
- [12/14/2023] We are releasing the Seamless tutorial given at NeurIPS 2023.
Quick Start
Installation
[!NOTE] One of the prerequisites is fairseq2, which has pre-built packages available only for Linux x86-64 and Apple-silicon Mac computers. In addition, it has a dependency on libsndfile, which might not be installed on your machine. If you experience any installation issues, please refer to its README for further instructions.
pip install .
[!NOTE] Transcribing inference audio for computing metrics uses Whisper, which is automatically installed. Whisper in turn requires the command-line tool ffmpeg to be installed on your system, which is available from most package managers.
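Putting the steps together, a typical setup could look like the following sketch (the apt-get line is just one example of installing ffmpeg; use your platform's package manager):
git clone https://github.com/facebookresearch/seamless_communication.git
cd seamless_communication
pip install .
# ffmpeg is only needed for transcribing inference audio when computing metrics
sudo apt-get install ffmpeg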
Running inference
SeamlessM4T Inference
Here's an example of using the CLI from the root directory to run inference.
S2ST task:
m4t_predict <path_to_input_audio> --task s2st --tgt_lang <tgt_lang> --output_path <path_to_save_audio>
T2TT task:
m4t_predict <input_text> --task t2tt --tgt_lang <tgt_lang> --src_lang <src_lang>
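For instance, a concrete T2TT invocation translating an English sentence into French (the text and language codes are illustrative) could look like:
m4t_predict "Hello, how are you?" --task t2tt --tgt_lang fra --src_lang eng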
Please refer to the inference README for detailed instructions on how to run inference and for the list of supported source and target languages for the speech and text modalities.
For running S2TT/ASR natively (without Python) using GGML, please refer to the unity.cpp section.
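If you prefer calling the models from Python rather than the CLI, here is a minimal sketch using the v1-style Translator shown elsewhere on this page; model, vocoder and task names follow those examples, and newer releases expose Translator under seamless_communication.inference with a slightly different signature, so treat this as a sketch and defer to the inference README:
import torch
from seamless_communication.models.inference import Translator
# Load a SeamlessM4T model and its vocoder once, then reuse the translator.
translator = Translator("seamlessM4T_medium", "vocoder_36langs", torch.device("cuda:0"), torch.float16)
# S2TT: translate an English recording into French text.
translated_text, _, _ = translator.predict("/path/to/input.wav", "s2tt", tgt_lang="fra")
print(translated_text)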
SeamlessExpressive Inference
[!NOTE] Please check the section on how to download the model.
Here's an example of using the CLI from the root directory to run inference.
expressivity_predict <path_to_input_audio> --tgt_lang <tgt_lang> --model_name seamless_expressivity --vocoder_name vocoder_pretssel --output_path <path_to_save_audio>
SeamlessStreaming and Seamless Inference
The Streaming Evaluation README has detailed instructions for running evaluations for the SeamlessStreaming and Seamless models. The CLI has a --no-scoring option that can be used to skip the scoring part and just run inference.
Please check the inference README for more details.
Running SeamlessStreaming Demo
You can duplicate the SeamlessStreaming HF space to run the streaming demo.
You can also run the demo locally by cloning the space from here. See the README of the SeamlessStreaming HF repo for more details on installation.
Running SeamlessM4T & SeamlessExpressive Gradio demos locally
To launch the same demo Space we host on Hugging Face locally:
cd demo
pip install -r requirements.txt
python app.py
Resources and usage
Model
SeamlessM4T models
Model Name | #params | checkpoint | metrics |
---|---|---|---|
SeamlessM4T-Large v2 | 2.3B | 🤗 Model card - checkpoint | metrics |
SeamlessM4T-Large (v1) | 2.3B | 🤗 Model card - checkpoint | metrics |
SeamlessM4T-Medium (v1) | 1.2B | 🤗 Model card - checkpoint | metrics |
SeamlessExpressive models
To access and download SeamlessExpressive, please request the model artifacts through this request form. Upon approval, you will then receive an email with download links to each model artifact.
Please note that SeamlessExpressive is made available under its own License and Acceptable Use Policy.
SeamlessStreaming models
Model Name | #params | checkpoint | metrics |
---|---|---|---|
SeamlessStreaming | 2.5B | 🤗 Model card - monotonic decoder checkpoint - streaming UnitY2 checkpoint | metrics |
Seamless models
The Seamless model is simply the SeamlessStreaming model with the non-expressive vocoder_v2 swapped out for the expressive vocoder_pretssel. Please check out the section above on how to acquire the vocoder_pretssel checkpoint.
W2v-BERT 2.0 speech encoder
Model Name | #params | checkpoint |
---|---|---|
W2v-BERT 2.0 | 600M | 🤗 Model card - checkpoint |
Here's how you can do a forward pass through the speech encoder:
import torch
from fairseq2.data.audio import AudioDecoder, WaveformToFbankConverter
from fairseq2.memory import MemoryBlock
from fairseq2.nn.padding import get_seqs_and_padding_mask
from fairseq2.data import Collater
from pathlib import Path
from seamless_communication.models.conformer_shaw import load_conformer_shaw_model

audio_wav_path, device, dtype = ...

audio_decoder = AudioDecoder(dtype=torch.float32, device=device)
# Convert the decoded waveform into 80-dimensional log-mel filterbank features.
fbank_converter = WaveformToFbankConverter(
    num_mel_bins=80,
    waveform_scale=2**15,
    channel_last=True,
    standardize=True,
    device=device,
    dtype=dtype,
)
collater = Collater(pad_value=1)

model = load_conformer_shaw_model("conformer_shaw", device=device, dtype=dtype)
model.eval()

# Read the raw audio bytes and decode them into a waveform.
with Path(audio_wav_path).open("rb") as fb:
    block = MemoryBlock(fb.read())

decoded_audio = audio_decoder(block)
src = collater(fbank_converter(decoded_audio))["fbank"]
seqs, padding_mask = get_seqs_and_padding_mask(src)

# Run the frontend and the Conformer encoder to obtain frame-level features.
with torch.inference_mode():
    seqs, padding_mask = model.encoder_frontend(seqs, padding_mask)
    seqs, padding_mask = model.encoder(seqs, padding_mask)
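The encoder returns frame-level features in seqs. If you need a single utterance-level vector, one simple option (not part of the repository example) is to mean-pool over the time dimension; this is only valid here because the batch holds a single, unpadded sequence:
# Illustrative follow-up, assuming a batch with one unpadded sequence.
utterance_embedding = seqs.mean(dim=1)  # shape: (batch, model_dim)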
Evaluation
SeamlessM4T Evaluation
To reproduce our results, or to evaluate using the same metrics over your own test sets, please check out the README here.
SeamlessExpressive Evaluation
Below is the script for efficient batched evaluation.
export MODEL_DIR="/path/to/SeamlessExpressive/model"
export TEST_SET_TSV="input.tsv" # Your dataset in a TSV file, with headers "id", "audio"
export TGT_LANG="spa" # Target language to translate into, options including "fra", "deu", "eng" ("cmn" and "ita" are experimental)
export OUTPUT_DIR="tmp/" # Output directory for generated text/unit/waveform
export TGT_TEXT_COL="tgt_text" # The column in your ${TEST_SET_TSV} with reference target text for calculating the BLEU score. You can skip this argument.
export DFACTOR="1.0" # Duration factor for model inference to tune the predicted duration (preddur=DFACTOR*preddur) at each position, which affects the output speech rate. A greater value means a slower speech rate (default: 1.0). See the expressive evaluation README for details on the duration factor we used.
expressivity_evaluate ${TEST_SET_TSV} \
--gated-model-dir ${MODEL_DIR} --task s2st --tgt_lang ${TGT_LANG} \
--audio_root_dir "" --output_path ${OUTPUT_DIR} --ref_field ${TGT_TEXT_COL} \
--model_name seamless_expressivity --vocoder_name vocoder_pretssel \
--text_unk_blocking True --duration_factor ${DFACTOR}
Please check out this README section for more details.
SeamlessStreaming and Seamless Evaluation
Streaming Evaluation README has detailed instructions for running evaluations on the SeamlessStreaming and Seamless models.
Unity.cpp
To enable Seamless Communication Everywhere, we implemented unity.cpp so users can run SeamlessM4T models in GGML, a C tensor library that allows easier integration on a wide range of platforms.
To transcribe/translate a given audio file:
./ggml/bin/unity --model seamlessM4T_medium.ggml input.wav
For build details and more usage examples, please check out unity.cpp.
Expressive Datasets
We created two expressive speech-to-speech translation datasets, mExpresso and mDRAL, between English and five other languages: French, German, Italian, Mandarin and Spanish. We currently open-source the speech-to-text portion of mExpresso for out-of-English directions, and we will open-source the remaining parts of the datasets soon. For details, please check out the README.
SeamlessAlignExpressive
We're introducing the first expressive speech alignment procedure. Starting with raw data, the expressive alignment procedure automatically discovers pairs of audio segments sharing not only the same meaning, but the same overall expressivity. To showcase this procedure, we are making metadata available to create a benchmarking dataset called SeamlessAlignExpressive, which can be used to validate the quality of our alignment method. SeamlessAlignExpressive is the first large-scale (11k+ hours) collection of multilingual audio alignments for expressive translation. More details can be found on the SeamlessAlignExpressive README.
Converting raw audio to units
Please check out the README here. Note that the SeamlessM4T v1 model uses reduced units, while the other models use non-reduced units.
Libraries
Seamless Communication depends on 4 libraries developed by Meta.
fairseq2
fairseq2 is our next-generation open-source library of sequence modeling components that provides researchers and developers with building blocks for machine translation, language modeling, and other sequence generation tasks. All SeamlessM4T models in this repository are powered by fairseq2.
SONAR and BLASER 2.0
SONAR (Sentence-level multimOdal and laNguage-Agnostic Representations) is a multilingual and multimodal sentence embedding space that outperforms existing sentence embeddings such as LASER3 and LaBSE on the xsim and xsim++ multilingual similarity search tasks. SONAR provides text and speech encoders for many languages. SeamlessAlign was mined based on SONAR embeddings.
BLASER 2.0 is our latest model-based evaluation metric for multimodal translation. It is an extension of BLASER, supporting both speech and text. It operates directly on the source signal, and as such, does not require any intermediate ASR system like ASR-BLEU. As in the first version, BLASER 2.0 leverages the similarity between input and output sentence embeddings. SONAR is the underlying embedding space for BLASER 2.0. Scripts to run evaluation with BLASER 2.0 can be found in the SONAR repo.
stopes
As part of the seamless communication project, we've extended the stopes library. Version 1 provided a text-to-text mining tool to build training datasets for translation models. Version 2 has been extended, thanks to SONAR, to support tasks around training large speech translation models. In particular, we provide tools to read/write the fairseq audiozip datasets and a new mining pipeline that can do speech-to-speech, text-to-speech, speech-to-text and text-to-text mining, all based on the new SONAR embedding space.
SimulEval
SimulEval is a library used for evaluating simultaneous translation models. SimulEval also provides a backend for generation using partial/incremental inputs with flexible/extensible states, which is used to implement streaming inference. Users define agents which implement SimulEval's interface, which can be connected together in a pipeline. You can find agents implemented for SeamlessStreaming here.
[Legacy] SeamlessM4T v1 instructions
Finetuning SeamlessM4T v1 models
Please check out the README here.
On-device models
Apart from the SeamlessM4T-Large (2.3B) and SeamlessM4T-Medium (1.2B) models, we are also releasing a small model (281M) targeted at on-device inference. To learn more about the usage and model details, check out the README here.
SeamlessAlign mined dataset
We open-source the metadata to SeamlessAlign, the largest open dataset for multimodal translation, totaling 270k+ hours of aligned Speech and Text data. The dataset can be rebuilt by the community based on the SeamlessAlign readme.
Citation
If you use Seamless in your work or any models/datasets/artifacts published in Seamless, please cite:
@inproceedings{seamless2023,
title="Seamless: Multilingual Expressive and Streaming Speech Translation",
author="{Seamless Communication}, Lo{\"i}c Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, John Hoffman, Min-Jae Hwang, Hirofumi Inaguma, Christopher Klaiber, Ilia Kulikov, Pengwei Li, Daniel Licht, Jean Maillard, Ruslan Mavlyutov, Alice Rakotoarison, Kaushik Ram Sadagopan, Abinesh Ramakrishnan, Tuan Tran, Guillaume Wenzek, Yilin Yang, Ethan Ye, Ivan Evtimov, Pierre Fernandez, Cynthia Gao, Prangthip Hansanti, Elahe Kalbassi, Amanda Kallet, Artyom Kozhevnikov, Gabriel Mejia, Robin San Roman, Christophe Touret, Corinne Wong, Carleigh Wood, Bokai Yu, Pierre Andrews, Can Balioglu, Peng-Jen Chen, Marta R. Costa-juss{\`a}, Maha Elbayad, Hongyu Gong, Francisco Guzm{\'a}n, Kevin Heffernan, Somya Jain, Justine Kao, Ann Lee, Xutai Ma, Alex Mourachko, Benjamin Peloquin, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Anna Sun, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang, Mary Williamson",
journal={ArXiv},
year={2023}
}
License
We have three license categories.
The following non-generative components are MIT licensed as found in MIT_LICENSE:
- W2v-BERT 2.0 speech encoder
- Code
- Text-only part of the mExpresso dataset, found in the SeamlessExpressive README.
- UnitY2 forced alignment extractor found in the UnitY2 Aligner README.
- Speech toxicity tool with the etox dataset found in the ETOX README.
- MuTox (Universal MUltilingual Audio-based TOXicity Dataset and Zero-shot Detector), found in the MuTox README.
The following models are CC-BY-NC 4.0 licensed as found in the LICENSE:
- SeamlessM4T models (v1 and v2).
- SeamlessStreaming models.
The following models are Seamless licensed as found in SEAMLESS_LICENSE:
- Seamless models.
- SeamlessExpressive models.