audiocraft
Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.
Top Related Projects
MusicLM-PyTorch: Implementation of MusicLM, Google's new SOTA model for music generation using attention networks, in PyTorch
Bark: 🔊 Text-Prompted Generative Audio Model
AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
Muzic: Music Understanding and Generation with Artificial Intelligence
Code for the paper "Jukebox: A Generative Model for Music"
Quick Overview
AudioCraft is a PyTorch library and collection of models for audio generation developed by Facebook Research. It includes state-of-the-art models for text-to-audio, text-to-music, and audio compression tasks, with a focus on high-quality and controllable audio synthesis.
Pros
- Offers cutting-edge models for various audio generation tasks
- Provides pre-trained models for easy use and fine-tuning
- Runs on both CPU and GPU, with GPU acceleration for faster processing
- Includes comprehensive documentation and examples
Cons
- Requires significant computational resources for training and inference
- Limited to PyTorch ecosystem, which may not suit all developers
- Steep learning curve for those unfamiliar with deep learning audio techniques
- Potential ethical concerns regarding AI-generated audio content
Code Examples
- Loading and using a pre-trained MusicGen model:
import torchaudio
from audiocraft.models import MusicGen
model = MusicGen.get_pretrained('facebook/musicgen-medium')
descriptions = ['An electronic dance track with a strong beat']
wav = model.generate(descriptions, progress=True)
torchaudio.save('generated_music.wav', wav[0].cpu(), model.sample_rate)
- Generating audio from text using AudioGen:
import torchaudio
from audiocraft.models import AudioGen
model = AudioGen.get_pretrained('facebook/audiogen-medium')
descriptions = ['A dog barking in the distance']
wav = model.generate(descriptions, progress=True)
torchaudio.save('generated_audio.wav', wav[0].cpu(), model.sample_rate)
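- Compressing audio using EnCodec (encode to discrete codes, then decode back to a waveform):
import torch
import torchaudio
from audiocraft.data.audio_utils import convert_audio
from audiocraft.models import EncodecModel
model = EncodecModel.get_pretrained('facebook/encodec_24khz')
wav, sr = torchaudio.load('input_audio.wav')
# Match the model's sample rate and channel count, then add a batch dimension
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)
with torch.no_grad():
    # encode() returns the discrete codes together with an optional scale
    codes, scale = model.encode(wav)
    decoded_wav = model.decode(codes, scale)
torchaudio.save('compressed_audio.wav', decoded_wav.squeeze(0).cpu(), model.sample_rate)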
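EnCodec can act as a tokenizer because it quantizes each latent frame with a stack of codebooks (residual vector quantization, RVQ). Each codebook encodes what the previous ones left over, and the chosen indices are the discrete tokens. The toy sketch below uses tiny hand-picked 2-D codebooks. That is an assumption for illustration only: real codebooks are learned and much larger. The sketch is not AudioCraft's implementation.

```python
from typing import List, Sequence

Vector = Sequence[float]

def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def rvq_encode(frame: Vector, codebooks: List[List[Vector]]) -> List[int]:
    """Return one code index per codebook; each stage quantizes the remaining residual."""
    codes: List[int] = []
    residual = list(frame)
    for book in codebooks:
        # Choose the codeword nearest to what is still left to encode.
        idx = min(range(len(book)), key=lambda i: sq_dist(residual, book[i]))
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, book[idx])]
    return codes

def rvq_decode(codes: List[int], codebooks: List[List[Vector]]) -> List[float]:
    """Reconstruct a frame by summing the selected codeword of every codebook."""
    out = [0.0] * len(codebooks[0][0])
    for idx, book in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, book[idx])]
    return out

# Hypothetical toy codebooks: a coarse one followed by a finer one.
CODEBOOKS = [
    [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0), (1.0, -1.0)],
    [(0.0, 0.0), (0.25, 0.0), (0.0, 0.25), (-0.25, -0.25)],
]

frame = (1.2, 0.9)
codes = rvq_encode(frame, CODEBOOKS)
coarse_err = sq_dist(frame, rvq_decode(codes[:1], CODEBOOKS[:1]))
fine_err = sq_dist(frame, rvq_decode(codes, CODEBOOKS))
print(codes, coarse_err > fine_err)  # prints [1, 1] True
```

Adding codebooks refines the reconstruction but costs more tokens per frame. This is the bitrate/quality trade-off that a neural codec exposes.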
Getting Started
To get started with AudioCraft:
- Install the library:
pip install audiocraft
- Import and use a model:
from audiocraft.models import MusicGen
model = MusicGen.get_pretrained('facebook/musicgen-small')
wav = model.generate(['A cheerful country song with acoustic guitar'])
- Save the generated audio:
import torchaudio
torchaudio.save('output.wav', wav[0].cpu(), model.sample_rate)
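The length of a MusicGen generation is controlled with `set_generation_params(duration=...)`. The MusicGen paper describes a 32 kHz EnCodec tokenizer with 4 codebooks at a 50 Hz frame rate. Its tokens are generated with a "delay" interleaving in which codebook k is shifted by k steps. Assuming those settings, the sketch below estimates how a requested duration translates into frames, tokens and autoregressive steps. It is a back-of-the-envelope illustration, not AudioCraft's pattern provider, which may add special tokens of its own. The helper names are hypothetical.

```python
from typing import Dict, List

def token_budget(duration_s: float, frame_rate_hz: int = 50, num_codebooks: int = 4) -> Dict[str, int]:
    """Estimate sequence sizes for a duration (defaults assume MusicGen's 32 kHz codec)."""
    frames = int(round(duration_s * frame_rate_hz))
    return {
        "frames": frames,
        "tokens": frames * num_codebooks,
        # With a delay pattern, the last codebook finishes num_codebooks - 1 steps late.
        "lm_steps": frames + num_codebooks - 1,
    }

def delay_pattern(codes: List[List[int]], pad: int = -1) -> List[List[int]]:
    """Shift codebook k right by k steps so one LM step emits a token for every codebook."""
    num_books, length = len(codes), len(codes[0])
    grid = [[pad] * (length + num_books - 1) for _ in range(num_books)]
    for k in range(num_books):
        for t in range(length):
            grid[k][t + k] = codes[k][t]
    return grid

print(token_budget(8))
print(delay_pattern([[1, 2, 3], [4, 5, 6]]))
```

The delay pattern keeps the number of autoregressive steps close to the number of frames instead of multiplying it by the number of codebooks. This is part of what makes the single-stage model "simple".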
Competitor Comparisons
MusicLM-PyTorch: Implementation of MusicLM, Google's new SOTA model for music generation using attention networks, in PyTorch
Pros of MusicLM-PyTorch
- Simpler implementation, easier to understand and modify
- Focuses specifically on music generation
- More lightweight and potentially faster to train
Cons of MusicLM-PyTorch
- Less comprehensive feature set compared to AudioCraft
- May require more manual setup and configuration
- Limited to music generation, while AudioCraft covers broader audio tasks
Code Comparison
MusicLM-PyTorch:
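from musiclm_pytorch import MuLaN, AudioSpectrogramTransformer, TextTransformer
# The architecture is assembled from configurable parts, e.g. MuLaN's audio and text towers
audio_transformer = AudioSpectrogramTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64
)
text_transformer = TextTransformer(dim = 512, depth = 6, heads = 8, dim_head = 64)
mulan = MuLaN(audio_transformer = audio_transformer, text_transformer = text_transformer)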
AudioCraft:
from audiocraft.models import MusicGen
model = MusicGen.get_pretrained('facebook/musicgen-medium')
model.set_generation_params(duration=8)
wav = model.generate_unconditional(4)
MusicLM-PyTorch provides a lower-level implementation, allowing greater customization of the model architecture. AudioCraft offers a higher-level API with pre-trained models, making it easier to get started with audio generation tasks out of the box.
Bark: 🔊 Text-Prompted Generative Audio Model
Pros of Bark
- Supports multi-language text-to-speech generation
- Offers voice cloning capabilities
- Provides more fine-grained control over speech characteristics
Cons of Bark
- Primarily geared toward speech, although it can also produce music, background noise and simple sound effects
- Requires more computational resources for inference
- Less extensive documentation and community support
Code Comparison
Bark:
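from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
preload_models()  # download and load the Bark models
text = "Hello, world!"
audio_array = generate_audio(text)
write_wav("bark_audio.wav", SAMPLE_RATE, audio_array)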
AudioCraft:
import torchaudio
from audiocraft.models import MusicGen
model = MusicGen.get_pretrained('facebook/musicgen-small')
wav = model.generate_unconditional(4)
torchaudio.save('audio.wav', wav[0].cpu(), model.sample_rate)
Key Differences
- Bark focuses on speech synthesis, while AudioCraft primarily targets music and sound generation
- AudioCraft offers more comprehensive audio manipulation capabilities
- Bark provides more options for voice customization
- AudioCraft has better integration with PyTorch ecosystem
- Bark's codebase is more accessible for beginners
Use Cases
- Bark: Text-to-speech applications, voice assistants, voice cloning
- AudioCraft: Music production, sound design, audio content creation
Community and Support
- AudioCraft benefits from Facebook's backing and larger developer community
- Bark has a growing community but less extensive resources
AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
Pros of AudioGPT
- Offers a more comprehensive suite of audio-related tasks, including speech recognition and audio captioning
- Provides a user-friendly interface for interacting with the model
- Integrates multiple pre-trained models for various audio tasks
Cons of AudioGPT
- Less focused on high-quality music generation compared to Audiocraft
- May require more computational resources due to its broader scope
- Documentation and examples are less extensive than Audiocraft
Code Comparison
AudioGPT:
# Illustrative pseudocode: check the AudioGPT repository for its actual entry point and interface
from audiogpt import AudioGPT
model = AudioGPT()
result = model.generate_audio("Create a piano melody")
result.save("output.wav")
Audiocraft:
import audiocraft.models as models
from audiocraft.data.audio import audio_write
model = models.MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=10)  # length in seconds
wav = model.generate(["Create a piano melody"])
audio_write("output", wav[0].cpu(), model.sample_rate, strategy="loudness")
Both repositories offer powerful audio generation capabilities, but they cater to different use cases. Audiocraft focuses primarily on high-quality music generation, while AudioGPT provides a broader range of audio-related tasks. The code comparison shows that Audiocraft's API is more specialized for music generation, while AudioGPT offers a more general-purpose interface for various audio tasks.
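In the Audiocraft snippet, `strategy="loudness"` tells `audio_write` how to normalize the level of the waveform before writing it. The sketch below shows, in plain Python, what simpler normalization strategies such as peak and RMS normalization do. The target values and helper names are assumptions. True loudness normalization, which measures perceived loudness, is not reproduced here.

```python
import math
from typing import List

def peak_normalize(samples: List[float], target_peak: float = 1.0) -> List[float]:
    """Scale so the largest absolute sample equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]

def rms_normalize(samples: List[float], target_rms: float = 0.1) -> List[float]:
    """Scale so the root-mean-square level equals target_rms (peaks may exceed 1.0)."""
    if not samples:
        return []
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        return list(samples)
    gain = target_rms / rms
    return [s * gain for s in samples]

print(peak_normalize([0.25, -0.5]))  # prints [0.5, -1.0]
```

Peak normalization guarantees the output stays within range but ignores how loud the audio sounds. Level-based strategies such as RMS or loudness make clips sound more consistent, at the risk of clipping unless the signal is limited or compressed.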
Muzic: Music Understanding and Generation with Artificial Intelligence
Pros of Muzic
- Broader scope, covering various music AI tasks like generation, understanding, and editing
- Includes tools for symbolic music generation and audio-to-MIDI conversion
- Offers a wider range of pre-trained models for different music-related tasks
Cons of Muzic
- Less focused on high-quality audio generation compared to Audiocraft
- May require more setup and configuration due to its broader feature set
- Documentation could be less comprehensive for specific audio generation tasks
Code Comparison
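Muzic (symbolic music, illustrative pseudocode):
# The import path is hypothetical; see the Muzic repository for how each model is actually loaded.
# MusicBERT targets symbolic music understanding, and input_ids must be prepared beforehand.
from muzic.symbolic.musicbert import MusicBERT
model = MusicBERT.from_pretrained('symbolic/musicbert-base')
generated = model.generate(input_ids, max_length=512)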
Audiocraft (Audio generation):
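from audiocraft.models import MusicGen
model = MusicGen.get_pretrained('facebook/musicgen-medium')
model.set_generation_params(duration=8)  # length in seconds
descriptions = ['An upbeat jazz tune with saxophone']
wav = model.generate(descriptions)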
Both repositories offer powerful tools for music AI, but Muzic provides a broader range of features for various music-related tasks, while Audiocraft focuses more on high-quality audio generation. The choice between them depends on the specific requirements of your project.
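The split between the two projects is essentially symbolic versus audio output. A symbolic model produces note events, while Audiocraft produces waveform samples directly. To make that concrete, the sketch below renders a hypothetical list of (MIDI pitch, start, duration) notes into raw samples using plain sine tones. It is illustrative only and not part of either library. The 32 kHz default simply matches MusicGen's output sample rate.

```python
import math
from typing import List, Tuple

Note = Tuple[int, float, float]  # (MIDI pitch, start in seconds, duration in seconds)

def midi_to_hz(pitch: int) -> float:
    """Equal-temperament frequency, with MIDI 69 (A4) at 440 Hz."""
    return 440.0 * 2 ** ((pitch - 69) / 12)

def render_notes(notes: List[Note], sample_rate: int = 32000, amplitude: float = 0.2) -> List[float]:
    """Mix one sine tone per note into a mono buffer of float samples."""
    end = max((start + dur for _, start, dur in notes), default=0.0)
    out = [0.0] * int(round(end * sample_rate))
    for pitch, start, dur in notes:
        freq = midi_to_hz(pitch)
        first = int(round(start * sample_rate))
        last = min(len(out), int(round((start + dur) * sample_rate)))
        for n in range(first, last):
            t = (n - first) / sample_rate
            out[n] += amplitude * math.sin(2 * math.pi * freq * t)
    return out

# Two symbolic events (C4 then E4) expand into thousands of audio samples.
melody = [(60, 0.0, 0.5), (64, 0.5, 0.5)]
samples = render_notes(melody, sample_rate=8000)
print(len(melody), len(samples))  # prints 2 8000
```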
Code for the paper "Jukebox: A Generative Model for Music"
Pros of Jukebox
- More established project with a longer history and potentially more stability
- Capable of generating complete songs with vocals and lyrics
- Offers a wider range of musical styles and genres
Cons of Jukebox
- Requires more computational resources and longer generation times
- Less user-friendly interface and setup process
- Output quality can be inconsistent, especially for vocals
Code Comparison
Jukebox:
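# Simplified pseudocode: in the Jukebox repo, MODELS['5b'] names the hyperparameter sets that
# make_vqvae/make_prior turn into networks, and x stands for a raw audio tensor
vqvae, *priors = MODELS['5b']
raw_to_tokens = vqvae.encode(x.unsqueeze(0)).squeeze(0)
tokens_to_prior = priors[-1].encode(raw_to_tokens.unsqueeze(0)).squeeze(0)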
Audiocraft:
from audiocraft.models import MusicGen
model = MusicGen.get_pretrained('facebook/musicgen-melody')
wav = model.generate_unconditional(
    num_samples=1,
    progress=True,
)
Audiocraft offers a more streamlined API for generating music, while Jukebox requires more complex setup and encoding steps. Audiocraft's approach is more user-friendly and easier to integrate into projects, but Jukebox provides more fine-grained control over the generation process.
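Token rates offer one rough way to see why Jukebox is slower. According to the Jukebox paper, its VQ-VAE compresses 44.1 kHz audio with hop lengths of 8, 32 and 128 at three levels. Sampling runs a prior at the top level plus upsamplers for the lower ones. According to its paper, MusicGen models 4 codebooks at 50 Hz from 32 kHz audio. Under those assumptions, the token rates compare as shown below. This ignores model size and sampling details, so treat it as an order-of-magnitude sketch with a hypothetical helper.

```python
from typing import Dict

def tokens_per_second(sample_rate: int, hop_length: int, num_codebooks: int = 1) -> float:
    """Discrete tokens per second of audio for a codec with the given hop length."""
    return sample_rate / hop_length * num_codebooks

# Assumed settings, taken from the respective papers.
rates: Dict[str, float] = {
    "jukebox_top": tokens_per_second(44100, 128),
    "jukebox_middle": tokens_per_second(44100, 32),
    "jukebox_bottom": tokens_per_second(44100, 8),
    "musicgen": tokens_per_second(32000, 640, num_codebooks=4),  # 50 Hz x 4 codebooks
}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f} tokens/s")
```

Even Jukebox's coarsest level produces more tokens per second than MusicGen, and the lower levels must still be upsampled. This is consistent with the longer generation times listed among its cons.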
README
AudioCraft
AudioCraft is a PyTorch library for deep learning research on audio generation. AudioCraft contains inference and training code for two state-of-the-art AI generative models producing high-quality audio: AudioGen and MusicGen.
Installation
AudioCraft requires Python 3.9, PyTorch 2.1.0. To install AudioCraft, you can run the following:
# Best to make sure you have torch installed first, in particular before installing xformers.
# Don't run this if you already have PyTorch installed.
python -m pip install 'torch==2.1.0'
# You might need the following before trying to install the packages
python -m pip install setuptools wheel
# Then proceed to one of the following
python -m pip install -U audiocraft # stable release
python -m pip install -U git+https://git@github.com/facebookresearch/audiocraft#egg=audiocraft # bleeding edge
python -m pip install -e . # or if you cloned the repo locally (mandatory if you want to train).
python -m pip install -e '.[wm]' # if you want to train a watermarking model
We also recommend having ffmpeg installed, either through your system or Anaconda:
sudo apt-get install ffmpeg
# Or if you are using Anaconda or Miniconda
conda install "ffmpeg<5" -c conda-forge
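Before installing, it can help to check the prerequisites listed above. The following sketch uses only the Python standard library. It reports whether the Python version meets a minimum, whether `ffmpeg` is on the `PATH`, and which `torch`/`audiocraft` versions are installed, if any. Treating Python 3.9 as a minimum is an assumption, and the helper name is hypothetical.

```python
import shutil
import sys
from importlib import metadata
from typing import Dict, Optional, Tuple, Union

def check_environment(min_python: Tuple[int, int] = (3, 9)) -> Dict[str, Union[bool, Optional[str]]]:
    """Report whether basic AudioCraft prerequisites appear to be present."""
    report: Dict[str, Union[bool, Optional[str]]] = {
        "python_ok": sys.version_info[:2] >= min_python,
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
    }
    for package in ("torch", "audiocraft"):
        try:
            report[package] = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = None  # not installed in this environment
    return report

if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```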
Models
At the moment, AudioCraft contains the training code and inference code for:
- MusicGen: A state-of-the-art controllable text-to-music model.
- AudioGen: A state-of-the-art text-to-sound model.
- EnCodec: A state-of-the-art high fidelity neural audio codec.
- Multi Band Diffusion: An EnCodec compatible decoder using diffusion.
- MAGNeT: A state-of-the-art non-autoregressive model for text-to-music and text-to-sound.
- AudioSeal: A state-of-the-art audio watermarking model.
- MusicGen Style: A state-of-the-art text-and-style-to-music model.
Training code
AudioCraft contains PyTorch components for deep learning research in audio and training pipelines for the developed models. For a general introduction to AudioCraft design principles and instructions for developing your own training pipeline, refer to the AudioCraft training documentation.
For reproducing existing work and using the developed training pipelines, refer to the instructions for each specific model, which provide pointers to configuration, example grids, and model/task-specific information and FAQ.
API documentation
We provide some API documentation for AudioCraft.
FAQ
Is the training code available?
Yes! We provide the training code for EnCodec, MusicGen and Multi Band Diffusion.
Where are the models stored?
Hugging Face stores the models in a specific location, which can be overridden for the AudioCraft models by setting the AUDIOCRAFT_CACHE_DIR environment variable.
In order to change the cache location of the other Hugging Face models, please check out the Hugging Face Transformers documentation for the cache setup.
Finally, if you use a model that relies on Demucs (e.g. musicgen-melody) and want to change the download location for Demucs, refer to the Torch Hub documentation.
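As a minimal sketch, the AudioCraft cache location can be redirected from Python by setting AUDIOCRAFT_CACHE_DIR. This assumes the variable is read when models are downloaded, so set it before the first `get_pretrained` call, or export it in the shell instead. The helper name and example directory are hypothetical.

```python
import os
from pathlib import Path

def set_audiocraft_cache(path: str) -> Path:
    """Create the directory if needed and point AUDIOCRAFT_CACHE_DIR at it."""
    cache_dir = Path(path).expanduser().resolve()
    cache_dir.mkdir(parents=True, exist_ok=True)
    os.environ["AUDIOCRAFT_CACHE_DIR"] = str(cache_dir)
    return cache_dir

# Example (hypothetical location), run before loading any AudioCraft model:
# set_audiocraft_cache("~/models/audiocraft")
```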
License
- The code in this repository is released under the MIT license as found in the LICENSE file.
- The model weights in this repository are released under the CC-BY-NC 4.0 license as found in the LICENSE_weights file.
Citation
For the general framework of AudioCraft, please cite the following.
@inproceedings{copet2023simple,
title={Simple and Controllable Music Generation},
author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
year={2023},
}
When referring to a specific model, please cite as mentioned in the model-specific README, e.g. ./docs/MUSICGEN.md, ./docs/AUDIOGEN.md, etc.