
lucidrains/musiclm-pytorch

Implementation of MusicLM, Google's new SOTA model for music generation using attention networks, in Pytorch


Top Related Projects

  • AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
  • Bark: 🔊 Text-Prompted Generative Audio Model
  • AudioCraft: a library for audio processing and generation with deep learning, featuring the state-of-the-art EnCodec audio compressor/tokenizer and MusicGen, a simple and controllable music generation LM with textual and melodic conditioning
  • UniLM: Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
  • Jukebox: Code for the paper "Jukebox: A Generative Model for Music"

Quick Overview

The musiclm-pytorch repository is a PyTorch implementation of MusicLM, Google's model for generating high-fidelity music from text prompts. It conditions AudioLM on embeddings from MuLaN, a contrastively trained joint text-audio model, and can produce diverse, coherent musical compositions from natural language descriptions.

Pros

  • Powerful Text-to-Music Generation: MusicLM translates natural-language prompts into musical compositions, opening up new possibilities for creative expression and composition.
  • Diverse and Coherent Output: Generated pieces exhibit a high degree of diversity while remaining musically coherent and structured.
  • Potential for Artistic Collaboration: Text-to-music generation enables new forms of human-AI collaboration, with the model acting as a creative partner.
  • Extensibility: The PyTorch implementation is open to further research, improvements, and adaptations of the MusicLM architecture.

Cons

  • Computational Complexity: Training and running MusicLM is computationally intensive, requiring significant hardware resources and limiting accessibility.
  • Potential Biases and Limitations: Like any learned model, MusicLM may reflect biases or gaps in its training data, which can affect the quality or diversity of its output.
  • Ethical Considerations: AI-generated music raises questions about intellectual property, authorship, and the potential displacement of human musicians.
  • Lack of Real-Time Interaction: The current implementation does not support real-time interaction or live performance, limiting its use in live music settings.

Code Examples

import torch
from musiclm_pytorch import MuLaN, AudioSpectrogramTransformer, TextTransformer

# MuLaN pairs an audio spectrogram transformer with a text transformer
mulan = MuLaN(
    audio_transformer = AudioSpectrogramTransformer(dim = 512, depth = 6, heads = 8, dim_head = 64),
    text_transformer = TextTransformer(dim = 512, depth = 6, heads = 8, dim_head = 64)
)

# train contrastively on <sound, text> pairs (random tensors stand in for real data)
wavs = torch.randn(2, 1024)
texts = torch.randint(0, 20000, (2, 256))

loss = mulan(wavs, texts)
loss.backward()

This code example shows the first stage of the pipeline: training MuLaN contrastively on paired audio and text, following the usage documented in the README below.

from musiclm_pytorch import MuLaNEmbedQuantizer

# wrap the trained MuLaN so its embeddings can condition the semantic,
# coarse and fine transformers of AudioLM
quantizer = MuLaNEmbedQuantizer(
    mulan = mulan,
    conditioning_dims = (1024, 1024, 1024),
    namespaces = ('semantic', 'coarse', 'fine')
)

wavs = torch.randn(2, 1024)
conds = quantizer(wavs = wavs, namespace = 'semantic')  # (2, 8, 1024)

This code example shows how MuLaN embeddings are quantized and namespaced so they can be passed, via the audio_conditioner keyword, to the trainers of the three AudioLM transformers.

from musiclm_pytorch import MusicLM

# assemble MusicLM from a trained AudioLM (from audiolm-pytorch) and the quantizer above
musiclm = MusicLM(
    audio_lm = audio_lm,
    mulan_embed_quantizer = quantizer
)

# generate several candidates and keep the one MuLaN ranks as the best match
music = musiclm('a classical piano piece with a melancholic and introspective mood', num_samples = 4)

This code example shows how a trained AudioLM and the MuLaN quantizer are combined into MusicLM to generate music from a text prompt; training the AudioLM itself is covered in the README below.

Getting Started

To get started with the musiclm-pytorch project, follow these steps:

  1. Install the package from PyPI (as documented in the README below):
pip install musiclm-pytorch
  2. Or, to work from source, clone the repository and install it locally:
git clone https://github.com/lucidrains/musiclm-pytorch.git
cd musiclm-pytorch
pip install .
  3. Train MuLaN and the AudioLM transformers as described in the README below; the repository does not reference any pre-trained MusicLM weights, so there is nothing to download before training.
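
As a quick sanity check that the installation succeeded, the core classes documented in the README below should import cleanly:

# minimal import check; constructing and training models requires the steps in the README
from musiclm_pytorch import MuLaN, AudioSpectrogramTransformer, TextTransformer, MusicLM
print('musiclm-pytorch imported successfully')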

Competitor Comparisons

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

Pros of AudioGPT

  • Broader scope: Handles multiple audio-related tasks beyond music generation
  • User-friendly interface: Provides a more accessible way to interact with the model
  • Integration with external tools: Incorporates various audio processing libraries

Cons of AudioGPT

  • Less specialized: May not perform as well on specific music generation tasks
  • Potentially higher resource requirements due to its broader functionality
  • Less focus on cutting-edge music generation techniques

Code Comparison

AudioGPT:

def process_audio(audio_file, task):
    if task == "transcribe":
        return transcribe_audio(audio_file)
    elif task == "generate":
        return generate_audio(audio_file)
    # ... other tasks

musiclm-pytorch:

def generate_music(text_prompt, duration):
    tokens = tokenize(text_prompt)
    latents = model.generate(tokens, duration)
    return decode_to_audio(latents)

These simplified snippets highlight the difference in focus between the two projects. AudioGPT offers a more general-purpose approach to audio processing, while musiclm-pytorch is specifically tailored for music generation based on text prompts.

Bark: 🔊 Text-Prompted Generative Audio Model

Pros of Bark

  • More versatile, capable of generating both speech and music
  • Easier to use with a simpler API and pre-trained models
  • Better documentation and examples for quick start

Cons of Bark

  • Less focused on music generation specifically
  • May produce lower quality music outputs compared to MusicLM

Code Comparison

Bark:

from bark import SAMPLE_RATE, generate_audio, preload_models

preload_models()
text_prompt = "Hello, this is a test."
audio_array = generate_audio(text_prompt)

MusicLM:

from musiclm_pytorch import MusicLM

# `audio_lm` is a trained AudioLM from audiolm-pytorch and `quantizer` is a
# MuLaNEmbedQuantizer wrapping a trained MuLaN (see the README below)
musiclm = MusicLM(
    audio_lm = audio_lm,
    mulan_embed_quantizer = quantizer
)

music = musiclm('energetic EDM', num_samples = 4)

Bark offers a more straightforward API for generating audio, while MusicLM provides more control over model parameters and is specifically tailored for music generation. Bark's simplicity makes it easier for beginners, but MusicLM may offer more advanced options for fine-tuning music output.

AudioCraft: Audio processing and generation with deep learning, featuring the EnCodec compressor/tokenizer and the MusicGen music generation LM

Pros of AudioCraft

  • More comprehensive audio generation toolkit with multiple models (MusicGen, AudioGen, EnCodec)
  • Official implementation by Meta AI, likely more optimized and maintained
  • Includes pre-trained models and demo notebooks for easy experimentation

Cons of AudioCraft

  • Larger codebase, potentially more complex to understand and modify
  • Requires more computational resources due to its comprehensive nature
  • May have stricter licensing terms as it's from a large corporation

Code Comparison

AudioCraft:

import torch
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('melody')
# `melody_wavs` is a reference melody waveform and `melody_sample_rate` its sample rate
melody = model.generate_with_chroma(['jazz'], melody_wavs, melody_sample_rate)

musiclm-pytorch:

from musiclm_pytorch import MuLaN, AudioSpectrogramTransformer, TextTransformer

# musiclm-pytorch exposes the individual components; MuLaN is built from an
# audio and a text transformer and trained before MusicLM is assembled
mulan = MuLaN(
    audio_transformer = AudioSpectrogramTransformer(dim = 512, depth = 6, heads = 8, dim_head = 64),
    text_transformer = TextTransformer(dim = 512, depth = 6, heads = 8, dim_head = 64)
)

AudioCraft provides a higher-level API with pre-trained models, while musiclm-pytorch offers a more flexible, lower-level implementation for custom model configurations.

UniLM: Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Pros of UniLM

  • Broader scope: UniLM is a unified pre-training framework for various natural language understanding and generation tasks, while musiclm-pytorch focuses specifically on music generation.
  • More extensive documentation and examples: UniLM provides detailed documentation and numerous examples for various NLP tasks.
  • Backed by Microsoft: Benefits from the resources and support of a large tech company.

Cons of UniLM

  • Steeper learning curve: Due to its broader scope, UniLM may be more complex to understand and implement for specific tasks.
  • Larger codebase: UniLM's versatility comes at the cost of a more extensive codebase, which may be overwhelming for some users.

Code Comparison

UniLM (example of tokenization):

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("microsoft/unilm-base-cased")
tokens = tokenizer.tokenize("Hello, how are you?")

musiclm-pytorch (example of model initialization):

from musiclm_pytorch import TextTransformer

text_transformer = TextTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64
)

Both repositories provide high-quality implementations of their respective models, but they serve different purposes. UniLM is more versatile and suitable for a wide range of NLP tasks, while musiclm-pytorch is specialized for music generation.

Jukebox: Code for the paper "Jukebox: A Generative Model for Music"

Pros of Jukebox

  • More established and well-documented project with extensive research backing
  • Capable of generating high-quality, diverse musical samples across various genres
  • Includes pre-trained models and comprehensive training scripts

Cons of Jukebox

  • Requires significant computational resources for training and inference
  • Limited flexibility in terms of customization and fine-tuning for specific use cases
  • Relatively complex architecture, making it challenging for beginners to understand and modify

Code Comparison

MusicLM-PyTorch:

from musiclm_pytorch import AudioSpectrogramTransformer

audio_transformer = AudioSpectrogramTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64,
    spec_n_fft = 128,
    spec_win_length = 24,
    spec_aug_stretch_factor = 0.8
)

Jukebox:

vqvae = VQVAE(
    input_shape=(n_ctx,),
    levels=3,
    downs_t=(3, 2, 2),
    emb_width=64,
    resblock="1",
    resblock_kernel_sizes=(3, 1),
    resblock_dilation_sizes=((1, 3, 5), (1, 3, 5)),
    hidden_size=128,
    mu_levels=[2048, 1024, 512],
    commitment_cost=0.25,
)

MusicLM-PyTorch offers a more concise and straightforward implementation, while Jukebox provides a more detailed and customizable architecture. The code snippets highlight the difference in complexity and level of control between the two projects.


README

MusicLM - Pytorch

Implementation of MusicLM, Google's new SOTA model for music generation using attention networks, in Pytorch.

They are basically using text-conditioned AudioLM, but surprisingly with the embeddings from a text-audio contrastive learned model named MuLan. MuLan is what will be built out in this repository, with AudioLM modified from the other repository to support the music generation needs here.

Please join us on Discord if you are interested in helping out with the replication alongside the LAION community

What's AI by Louis Bouchard

Appreciation

Install

$ pip install musiclm-pytorch

Usage

MuLaN first needs to be trained

import torch
from musiclm_pytorch import MuLaN, AudioSpectrogramTransformer, TextTransformer

audio_transformer = AudioSpectrogramTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64,
    spec_n_fft = 128,
    spec_win_length = 24,
    spec_aug_stretch_factor = 0.8
)

text_transformer = TextTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64
)

mulan = MuLaN(
    audio_transformer = audio_transformer,
    text_transformer = text_transformer
)

# get a ton of <sound, text> pairs and train

wavs = torch.randn(2, 1024)
texts = torch.randint(0, 20000, (2, 256))

loss = mulan(wavs, texts)
loss.backward()

# after much training, you can embed sounds and text into a joint embedding space
# for conditioning the audio LM

embeds = mulan.get_audio_latents(wavs)  # during training

embeds = mulan.get_text_latents(texts)  # during inference

To obtain the conditioning embeddings for the three transformers that are a part of AudioLM, you must use the MuLaNEmbedQuantizer like so

from musiclm_pytorch import MuLaNEmbedQuantizer

# setup the quantizer with the namespaced conditioning embeddings, unique per quantizer as well as namespace (per transformer)

quantizer = MuLaNEmbedQuantizer(
    mulan = mulan,                          # pass in trained mulan from above
    conditioning_dims = (1024, 1024, 1024), # say all three transformers have model dimensions of 1024
    namespaces = ('semantic', 'coarse', 'fine')
)

# now say you want the conditioning embeddings for semantic transformer

wavs = torch.randn(2, 1024)
conds = quantizer(wavs = wavs, namespace = 'semantic') # (2, 8, 1024) - 8 is number of quantizers

To train (or finetune) the three transformers that are a part of AudioLM, you simply follow the instructions over at audiolm-pytorch for training, but pass in the MuLaNEmbedQuantizer instance to the training classes under the keyword audio_conditioner

ex. SemanticTransformerTrainer

import torch
from audiolm_pytorch import HubertWithKmeans, SemanticTransformer, SemanticTransformerTrainer

wav2vec = HubertWithKmeans(
    checkpoint_path = './hubert/hubert_base_ls960.pt',
    kmeans_path = './hubert/hubert_base_ls960_L9_km500.bin'
)

semantic_transformer = SemanticTransformer(
    num_semantic_tokens = wav2vec.codebook_size,
    dim = 1024,
    depth = 6,
    audio_text_condition = True      # this must be set to True (same for CoarseTransformer and FineTransformer)
).cuda()

trainer = SemanticTransformerTrainer(
    transformer = semantic_transformer,
    wav2vec = wav2vec,
    audio_conditioner = quantizer,   # pass in the MuLaNEmbedQuantizer instance above
    folder = '/path/to/audio/files',
    batch_size = 1,
    data_max_length = 320 * 32,
    num_train_steps = 1
)

trainer.train()

After much training on all three transformers (semantic, coarse, fine), you will pass your finetuned or trained-from-scratch AudioLM and MuLaN wrapped in MuLaNEmbedQuantizer to the MusicLM

# you need the trained AudioLM (audio_lm) from above
# with the MuLaNEmbedQuantizer (quantizer)

from musiclm_pytorch import MusicLM

musiclm = MusicLM(
    audio_lm = audio_lm,                 # `AudioLM` from https://github.com/lucidrains/audiolm-pytorch
    mulan_embed_quantizer = quantizer    # the `MuLaNEmbedQuantizer` from above
)

music = musiclm('the crystalline sounds of the piano in a ballroom', num_samples = 4) # sample 4 and pick the top match with mulan
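
The README stops at generation; to persist the result, one option is torchaudio. This is a minimal sketch, assuming `music` comes back as a float waveform tensor; the sample rate below is a placeholder and must match the codec used inside your AudioLM:

import torchaudio

wav = music.detach().cpu()
if wav.dim() == 1:
    wav = wav.unsqueeze(0)   # torchaudio.save expects a (channels, num_frames) tensor

# 24000 is a placeholder; use the sample rate of the codec in your AudioLM
torchaudio.save('generated_music.wav', wav, sample_rate = 24000)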

Todo

  • mulan seems to be using decoupled contrastive learning, offer that as an option

  • wrap mulan with mulan wrapper and quantize the output, project to audiolm dimensions

  • modify audiolm to accept conditioning embeddings, optionally take care of different dimensions through a separate projection

  • audiolm and mulan goes into musiclm and generate, filter with mulan

  • give dynamic positional bias to self attention in AST

  • implement MusicLM generating multiple samples and selecting top match with MuLaN

  • support variable-length audio with masking in audio transformer

  • add a version of mulan to open clip

  • set all the proper spectrogram hyperparameters

Citations

@inproceedings{Agostinelli2023MusicLMGM,
    title     = {MusicLM: Generating Music From Text},
    author    = {Andrea Agostinelli and Timo I. Denk and Zal{\'a}n Borsos and Jesse Engel and Mauro Verzetti and Antoine Caillon and Qingqing Huang and Aren Jansen and Adam Roberts and Marco Tagliasacchi and Matthew Sharifi and Neil Zeghidour and C. Frank},
    year      = {2023}
}
@article{Huang2022MuLanAJ,
    title   = {MuLan: A Joint Embedding of Music Audio and Natural Language},
    author  = {Qingqing Huang and Aren Jansen and Joonseok Lee and Ravi Ganti and Judith Yue Li and Daniel P. W. Ellis},
    journal = {ArXiv},
    year    = {2022},
    volume  = {abs/2208.12415}
}
@misc{https://doi.org/10.48550/arxiv.2302.01327,
    doi     = {10.48550/ARXIV.2302.01327},
    url     = {https://arxiv.org/abs/2302.01327},
    author  = {Kumar, Manoj and Dehghani, Mostafa and Houlsby, Neil},
    title   = {Dual PatchNorm},
    publisher = {arXiv},
    year    = {2023},
    copyright = {Creative Commons Attribution 4.0 International}
}
@article{Liu2022PatchDropoutEV,
    title   = {PatchDropout: Economizing Vision Transformers Using Patch Dropout},
    author  = {Yue Liu and Christos Matsoukas and Fredrik Strand and Hossein Azizpour and Kevin Smith},
    journal = {ArXiv},
    year    = {2022},
    volume  = {abs/2208.07220}
}
@misc{liu2021swin,
    title   = {Swin Transformer V2: Scaling Up Capacity and Resolution},
    author  = {Ze Liu and Han Hu and Yutong Lin and Zhuliang Yao and Zhenda Xie and Yixuan Wei and Jia Ning and Yue Cao and Zheng Zhang and Li Dong and Furu Wei and Baining Guo},
    year    = {2021},
    eprint  = {2111.09883},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}
@misc{gilmer2023intriguing,
    title  = {Intriguing Properties of Transformer Training Instabilities},
    author = {Justin Gilmer and Andrea Schioppa and Jeremy Cohen},
    year   = {2023},
    status = {to be published - one attention stabilization technique is circulating within Google Brain, being used by multiple teams}
}
@inproceedings{Shukor2022EfficientVP,
    title   = {Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment},
    author  = {Mustafa Shukor and Guillaume Couairon and Matthieu Cord},
    booktitle = {British Machine Vision Conference},
    year    = {2022}
}
@inproceedings{Zhai2023SigmoidLF,
    title   = {Sigmoid Loss for Language Image Pre-Training},
    author  = {Xiaohua Zhai and Basil Mustafa and Alexander Kolesnikov and Lucas Beyer},
    year    = {2023}
}

The only truth is music. - Jack Kerouac

Music is the universal language of mankind. - Henry Wadsworth Longfellow