audio
Data manipulation and transformation for audio signal processing, powered by PyTorch
Top Related Projects
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
An Open Source Machine Learning Framework for Everyone
Python library for audio and music analysis
Magenta: Music and Art Generation with Machine Intelligence
End-to-End Speech Processing Toolkit
kaldi-asr/kaldi is the official location of the Kaldi project.
Quick Overview
PyTorch Audio (torchaudio) is an open-source library for audio and signal processing in PyTorch. It provides a wide range of audio I/O, processing, and feature extraction functionalities, making it easier for researchers and developers to work with audio data in machine learning projects.
Pros
- Seamless integration with PyTorch ecosystem
- Extensive collection of audio processing and feature extraction tools
- GPU acceleration support for faster processing (see the sketch after this list)
- Active development and community support
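Because torchaudio transforms are torch.nn.Module subclasses, moving computation to a GPU is a single call; a minimal sketch using dummy audio:
import torch
import torchaudio.transforms as T
device = "cuda" if torch.cuda.is_available() else "cpu"
mel_transform = T.MelSpectrogram(sample_rate=16000).to(device)  # transforms are nn.Modules
waveform = torch.randn(1, 16000, device=device)  # one second of dummy audio
mel_spec = mel_transform(waveform)  # computed on the GPU when available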
Cons
- Steeper learning curve for those unfamiliar with PyTorch
- Limited documentation for some advanced features
- Occasional inconsistencies in API design across different modules
- Dependency on external libraries for certain functionalities
Code Examples
- Loading and resampling an audio file:
import torchaudio
waveform, sample_rate = torchaudio.load("audio.wav")
resampled_waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=16000)
- Applying a spectrogram transform:
import torch
import torchaudio.transforms as T
spectrogram = T.Spectrogram()
spec = spectrogram(waveform)
- Applying data augmentation:
complex_spec = T.Spectrogram(power=None)(waveform)  # TimeStretch expects a complex spectrogram
stretched_spec = T.TimeStretch(fixed_rate=1.2)(complex_spec)
pitch_shift = T.PitchShift(sample_rate=16000, n_steps=4)  # PitchShift works on raw waveforms
augmented_waveform = pitch_shift(waveform)
- Using a pre-trained model for speech recognition:
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()
labels = bundle.get_labels()
waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)  # match the bundle's expected rate
with torch.inference_mode():
    emission, _ = model(waveform)
    emission = torch.log_softmax(emission, dim=-1)
transcript = post_process(emission, labels)  # post_process: a user-defined decoding step (see below)
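The post_process step above is not part of torchaudio; a minimal greedy CTC decoder sketch, assuming the wav2vec2 label set where index 0 is the CTC blank and "|" marks word boundaries:
import torch
class GreedyCTCDecoder(torch.nn.Module):
    def __init__(self, labels, blank=0):
        super().__init__()
        self.labels = labels
        self.blank = blank
    def forward(self, emission):
        indices = torch.argmax(emission, dim=-1)  # best label per frame
        indices = torch.unique_consecutive(indices)  # collapse repeated labels
        indices = [i for i in indices if i != self.blank]  # drop CTC blanks
        return "".join(self.labels[i] for i in indices).replace("|", " ").strip()
decoder = GreedyCTCDecoder(labels)
transcript = decoder(emission[0])  # emission has shape (batch, time, num_labels)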
Getting Started
To get started with torchaudio, first install it using pip:
pip install torchaudio
Then, you can import and use the library in your Python script:
import torch
import torchaudio
# Load an audio file
waveform, sample_rate = torchaudio.load("path/to/audio.wav")
# Apply a transform
spectrogram = torchaudio.transforms.Spectrogram()
spec = spectrogram(waveform)
# Save the waveform back to disk
torchaudio.save("processed_audio.wav", waveform, sample_rate)
This basic example demonstrates loading an audio file, applying a spectrogram transform, and saving the waveform back to disk. Explore the documentation for more advanced functionality and examples.
Competitor Comparisons
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Pros of Transformers
- Broader scope, covering various NLP tasks and models
- Extensive documentation and community support
- Regular updates and integration with latest research
Cons of Transformers
- Larger codebase, potentially more complex to navigate
- May have higher computational requirements for some tasks
- Less specialized for audio-specific tasks
Code Comparison
Transformers example:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("I love this library!")[0]
print(f"Label: {result['label']}, Score: {result['score']:.4f}")
Audio example:
import torchaudio
waveform, sample_rate = torchaudio.load("audio.wav")
spectrogram = torchaudio.transforms.Spectrogram()(waveform)
print(f"Spectrogram shape: {spectrogram.shape}")
Both libraries offer high-level APIs for their respective domains. Transformers provides ready-to-use pipelines for a broad range of NLP (and some audio) tasks, while torchaudio specializes in audio I/O, processing, and feature extraction. The choice between them depends on the project's requirements and whether its primary focus is general NLP or audio-specific operations.
An Open Source Machine Learning Framework for Everyone
Pros of TensorFlow
- Larger ecosystem with more tools and libraries
- Better support for production deployment and mobile/edge devices
- More extensive documentation and community resources
Cons of TensorFlow
- Steeper learning curve compared to PyTorch
- Less dynamic and flexible for research and prototyping
- Slower release cycle for new features
Code Comparison
TensorFlow:
import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
PyTorch Audio:
import torch.nn as nn
input_dim = 128  # illustrative input feature size
model = nn.Sequential(
    nn.Linear(input_dim, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
    nn.Softmax(dim=1)
)
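Neither snippet touches audio by itself; as an illustration, the PyTorch model above could consume pooled torchaudio features (the shapes and the mean-pooling choice here are assumptions):
import torch
import torchaudio.transforms as T
waveform = torch.randn(1, 16000)  # one second of dummy audio at 16 kHz
mel = T.MelSpectrogram(sample_rate=16000, n_mels=128)(waveform)  # (1, 128, time)
features = mel.mean(dim=-1)  # average over time -> (1, 128), matching input_dim
probs = model(features)  # (1, 10) class probabilities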
Summary
TensorFlow is a more comprehensive framework with better production support, while PyTorch Audio focuses specifically on audio processing tasks. TensorFlow offers a wider range of tools and resources but can be more complex to learn. PyTorch Audio provides a more intuitive and flexible approach for audio-related research and development. The choice between the two depends on the specific project requirements, deployment needs, and developer preferences.
Python library for audio and music analysis
Pros of librosa
- More comprehensive set of audio processing functions and features
- Easier to use for general audio analysis tasks
- Better documentation and examples for beginners
Cons of librosa
- Slower performance compared to PyTorch Audio
- Lacks deep learning integration and GPU acceleration
- Limited real-time processing capabilities
Code Comparison
librosa:
import librosa
y, sr = librosa.load('audio.wav')
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
PyTorch Audio:
import torchaudio
waveform, sample_rate = torchaudio.load('audio.wav')
mfcc_transform = torchaudio.transforms.MFCC(sample_rate=sample_rate, n_mfcc=13)
mfccs = mfcc_transform(waveform)
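Because librosa works on NumPy arrays and torchaudio on tensors, bridging the two is a one-liner:
import torch
import librosa
y, sr = librosa.load('audio.wav', sr=None)  # NumPy float32 array
waveform = torch.from_numpy(y).unsqueeze(0)  # (1, num_samples) torch tensor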
librosa is more straightforward for basic audio processing tasks, while PyTorch Audio integrates seamlessly with PyTorch's ecosystem for deep learning applications. PyTorch Audio offers better performance and GPU acceleration, making it suitable for large-scale audio processing and machine learning tasks. However, librosa provides a wider range of audio analysis functions and is generally easier to use for researchers and developers who don't require deep learning capabilities.
Magenta: Music and Art Generation with Machine Intelligence
Pros of Magenta
- Focuses on creative AI applications in music and art
- Offers a wide range of pre-built models for various creative tasks
- Includes interactive demos and notebooks for easy experimentation
Cons of Magenta
- Less flexible for general audio processing tasks
- Steeper learning curve for users not familiar with TensorFlow
- Smaller community compared to PyTorch ecosystem
Code Comparison
Magenta (TensorFlow-based; a simplified sketch, the actual generator API requires more setup):
import magenta.music as mm
from magenta.models.melody_rnn import melody_rnn_sequence_generator
model = melody_rnn_sequence_generator.get_generator(checkpoint='basic_rnn')
melody = model.generate(steps=128, primer_melody=mm.Melody())
Torchaudio:
import torchaudio
waveform, sample_rate = torchaudio.load('audio.wav')
spectrogram = torchaudio.transforms.Spectrogram()(waveform)
mel_spectrogram = torchaudio.transforms.MelSpectrogram()(waveform)
Summary
Magenta is ideal for creative AI projects in music and art, offering specialized models and tools. Torchaudio, part of the PyTorch ecosystem, provides more general-purpose audio processing capabilities with greater flexibility. Magenta uses TensorFlow, while Torchaudio is built on PyTorch, influencing the coding style and available resources.
End-to-End Speech Processing Toolkit
Pros of ESPnet
- Comprehensive end-to-end speech processing toolkit with support for various tasks (ASR, TTS, Speech Enhancement, etc.)
- Extensive pre-trained models and recipes for different languages and datasets
- Active community and frequent updates
Cons of ESPnet
- Steeper learning curve due to its comprehensive nature
- Potentially more complex setup and configuration process
- May require more computational resources for some tasks
Code Comparison
ESPnet example (ASR training, typically launched from the command line):
python -m espnet2.bin.asr_train \
    --output_dir exp/asr_train \
    --max_epoch 100 \
    --batch_size 32 \
    --accum_grad 2 \
    --train_data_path_and_name_and_type data/train/wav.scp,speech,sound
torchaudio example (Audio loading and preprocessing):
import torchaudio
waveform, sample_rate = torchaudio.load("audio.wav")
spectrogram = torchaudio.transforms.MelSpectrogram()(waveform)
mfcc = torchaudio.transforms.MFCC()(waveform)
While ESPnet provides a more comprehensive toolkit for speech processing tasks, torchaudio offers simpler audio processing capabilities within the PyTorch ecosystem. ESPnet is better suited for end-to-end speech tasks, while torchaudio is more focused on audio I/O and transformations.
kaldi-asr/kaldi is the official location of the Kaldi project.
Pros of Kaldi
- More comprehensive toolkit for speech recognition tasks
- Extensive documentation and recipes for various ASR scenarios
- Larger community and longer history in speech recognition research
Cons of Kaldi
- Steeper learning curve due to complexity
- Less integration with modern deep learning frameworks
- Written primarily in C++, which may be less accessible for some developers
Code Comparison
Kaldi (C++; simplified sketch of its feature-extraction internals):
FeatureWindowFunction feature_window_function(opts);
MelBanks mel_banks(opts.mel_opts, feature_window_function);
ComputeMelBankFeatures(waveform, opts, &mel_banks, &features);
PyTorch Audio (Python):
waveform, sample_rate = torchaudio.load("audio.wav")
mel_spectrogram = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate)(waveform)
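torchaudio also ships a Kaldi compliance interface that mirrors Kaldi's feature computation from Python:
import torchaudio
import torchaudio.compliance.kaldi as kaldi
waveform, sample_rate = torchaudio.load("audio.wav")
fbank = kaldi.fbank(waveform, num_mel_bins=40, sample_frequency=sample_rate)  # (num_frames, num_mel_bins)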
Summary
Kaldi is a more established and comprehensive toolkit for speech recognition, offering a wide range of tools and recipes. It's particularly well-suited for researchers and those working on complex ASR tasks. However, it has a steeper learning curve and is less integrated with modern deep learning frameworks.
PyTorch Audio, on the other hand, is more accessible for developers familiar with PyTorch and Python. It offers seamless integration with the PyTorch ecosystem and is easier to use for simpler audio processing tasks. However, it may not provide as many specialized tools for advanced speech recognition research as Kaldi does.
torchaudio: an audio library for PyTorch
The aim of torchaudio is to apply PyTorch to the audio domain. By supporting PyTorch, torchaudio follows the same philosophy of providing strong GPU acceleration, having a focus on trainable features through the autograd system, and having consistent style (tensor names and dimension names). Therefore, it is primarily a machine learning library and not a general signal processing library. The benefits of PyTorch can be seen in torchaudio: all computations are expressed as PyTorch operations, which makes torchaudio easy to use and feel like a natural extension.
- Support audio I/O (Load files, Save files)
  - Load a variety of audio formats, such as wav, mp3, ogg, flac, opus, and sphere, into a torch Tensor using SoX
  - Kaldi (ark/scp)
- Dataloaders for common audio datasets (see the sketch after this list)
- Audio and speech processing functions
- Common audio transforms
- Compliance interfaces: Run code using PyTorch that aligns with other libraries
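As an illustration of the dataset loaders, a minimal sketch using the small public YESNO corpus (the root path is arbitrary):
import torchaudio
dataset = torchaudio.datasets.YESNO(root="./data", download=True)  # downloads on first use
waveform, sample_rate, labels = dataset[0]  # labels is a list of 0/1 ints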
Installation
Please refer to https://pytorch.org/audio/main/installation.html for the installation and build process of TorchAudio.
API Reference
API Reference is located here: http://pytorch.org/audio/main/
Contributing Guidelines
Please refer to CONTRIBUTING.md
Citation
If you find this package useful, please cite as:
@article{yang2021torchaudio,
title={TorchAudio: Building Blocks for Audio and Speech Processing},
author={Yao-Yuan Yang and Moto Hira and Zhaoheng Ni and Anjali Chourdia and Artyom Astafurov and Caroline Chen and Ching-Feng Yeh and Christian Puhrsch and David Pollack and Dmitriy Genzel and Donny Greenberg and Edward Z. Yang and Jason Lian and Jay Mahadeokar and Jeff Hwang and Ji Chen and Peter Goldsborough and Prabhat Roy and Sean Narenthiran and Shinji Watanabe and Soumith Chintala and Vincent Quenneville-Bélair and Yangyang Shi},
journal={arXiv preprint arXiv:2110.15018},
year={2021}
}
@misc{hwang2023torchaudio,
title={TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch},
author={Jeff Hwang and Moto Hira and Caroline Chen and Xiaohui Zhang and Zhaoheng Ni and Guangzhi Sun and Pingchuan Ma and Ruizhe Huang and Vineel Pratap and Yuekai Zhang and Anurag Kumar and Chin-Yun Yu and Chuang Zhu and Chunxi Liu and Jacob Kahn and Mirco Ravanelli and Peng Sun and Shinji Watanabe and Yangyang Shi and Yumeng Tao and Robin Scheibler and Samuele Cornell and Sean Kim and Stavros Petridis},
year={2023},
eprint={2310.17864},
archivePrefix={arXiv},
primaryClass={eess.AS}
}
Disclaimer on Datasets
This is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use the dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.
If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thanks for your contribution to the ML community!
Pre-trained Model License
The pre-trained models provided in this library may have their own licenses or terms and conditions derived from the dataset used for training. It is your responsibility to determine whether you have permission to use the models for your use case.
For instance, the SquimSubjective model is released under the Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC 4.0) license. See the link for additional details.
Other pre-trained models that have different licenses are noted in the documentation. Please check out the documentation page.