audio
Data manipulation and transformation for audio signal processing, powered by PyTorch
Top Related Projects
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
An Open Source Machine Learning Framework for Everyone
Python library for audio and music analysis
Magenta: Music and Art Generation with Machine Intelligence
End-to-End Speech Processing Toolkit
kaldi-asr/kaldi is the official location of the Kaldi project.
Quick Overview
PyTorch Audio (torchaudio) is an open-source library for audio and signal processing in PyTorch. It provides a wide range of audio I/O, processing, and feature extraction functionalities, making it easier for researchers and developers to work with audio data in machine learning projects.
Pros
- Seamless integration with PyTorch ecosystem
- Extensive collection of audio processing and feature extraction tools
- GPU acceleration support for faster processing (see the sketch after this list)
- Active development and community support
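Because torchaudio transforms are torch.nn.Module subclasses, moving computation to a GPU is a single call; a minimal sketch using dummy audio:
import torch
import torchaudio.transforms as T
device = "cuda" if torch.cuda.is_available() else "cpu"
mel_transform = T.MelSpectrogram(sample_rate=16000).to(device)  # transforms are nn.Modules
waveform = torch.randn(1, 16000, device=device)  # one second of dummy audio
mel_spec = mel_transform(waveform)  # computed on the GPU when available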
Cons
- Steeper learning curve for those unfamiliar with PyTorch
- Limited documentation for some advanced features
- Occasional inconsistencies in API design across different modules
- Dependency on external libraries for certain functionalities
Code Examples
- Loading and resampling an audio file:
import torchaudio
waveform, sample_rate = torchaudio.load("audio.wav")
resampled_waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=16000)
- Applying a spectrogram transform:
import torch
import torchaudio.transforms as T
spectrogram = T.Spectrogram()
spec = spectrogram(waveform)
- Applying data augmentation:
complex_spec = T.Spectrogram(power=None)(waveform)  # TimeStretch expects a complex spectrogram
stretched_spec = T.TimeStretch(fixed_rate=1.2)(complex_spec)
pitch_shift = T.PitchShift(sample_rate=16000, n_steps=4)  # PitchShift works on raw waveforms
augmented_waveform = pitch_shift(waveform)
- Using a pre-trained model for speech recognition:
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()
labels = bundle.get_labels()
waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)  # match the bundle's expected rate
with torch.inference_mode():
    emission, _ = model(waveform)
    emission = torch.log_softmax(emission, dim=-1)
transcript = post_process(emission, labels)  # post_process: a user-defined decoding step (see below)
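The post_process step above is not part of torchaudio; a minimal greedy CTC decoder sketch, assuming the wav2vec2 label set where index 0 is the CTC blank and "|" marks word boundaries:
import torch
class GreedyCTCDecoder(torch.nn.Module):
    def __init__(self, labels, blank=0):
        super().__init__()
        self.labels = labels
        self.blank = blank
    def forward(self, emission):
        indices = torch.argmax(emission, dim=-1)  # best label per frame
        indices = torch.unique_consecutive(indices)  # collapse repeated labels
        indices = [i for i in indices if i != self.blank]  # drop CTC blanks
        return "".join(self.labels[i] for i in indices).replace("|", " ").strip()
decoder = GreedyCTCDecoder(labels)
transcript = decoder(emission[0])  # emission has shape (batch, time, num_labels)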
Getting Started
To get started with torchaudio, first install it using pip:
pip install torchaudio
Then, you can import and use the library in your Python script:
import torch
import torchaudio
# Load an audio file
waveform, sample_rate = torchaudio.load("path/to/audio.wav")
# Apply a transform
spectrogram = torchaudio.transforms.Spectrogram()
spec = spectrogram(waveform)
# Save the waveform back to disk
torchaudio.save("processed_audio.wav", waveform, sample_rate)
This basic example demonstrates loading an audio file, applying a spectrogram transform, and saving the waveform back to disk. Explore the documentation for more advanced functionality and examples.
Competitor Comparisons
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Pros of Transformers
- Broader scope, covering various NLP tasks and models
- Extensive documentation and community support
- Regular updates and integration with latest research
Cons of Transformers
- Larger codebase, potentially more complex to navigate
- May have higher computational requirements for some tasks
- Less specialized for audio-specific tasks
Code Comparison
Transformers example:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("I love this library!")[0]
print(f"Label: {result['label']}, Score: {result['score']:.4f}")
Audio example:
import torchaudio
waveform, sample_rate = torchaudio.load("audio.wav")
spectrogram = torchaudio.transforms.Spectrogram()(waveform)
print(f"Spectrogram shape: {spectrogram.shape}")
Both libraries offer high-level APIs for their respective domains. Transformers provides ready-to-use pipelines for a broad range of NLP (and some audio) tasks, while torchaudio specializes in audio I/O, processing, and feature extraction. The choice between them depends on the project's requirements and whether its primary focus is general NLP or audio-specific operations.
An Open Source Machine Learning Framework for Everyone
Pros of TensorFlow
- Larger ecosystem with more tools and libraries
- Better support for production deployment and mobile/edge devices
- More extensive documentation and community resources
Cons of TensorFlow
- Steeper learning curve compared to PyTorch
- Less dynamic and flexible for research and prototyping
- Slower release cycle for new features
Code Comparison
TensorFlow:
import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
PyTorch Audio:
import torch.nn as nn
input_dim = 128  # illustrative input feature size
model = nn.Sequential(
    nn.Linear(input_dim, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
    nn.Softmax(dim=1)
)
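Neither snippet touches audio by itself; as an illustration, the PyTorch model above could consume pooled torchaudio features (the shapes and the mean-pooling choice here are assumptions):
import torch
import torchaudio.transforms as T
waveform = torch.randn(1, 16000)  # one second of dummy audio at 16 kHz
mel = T.MelSpectrogram(sample_rate=16000, n_mels=128)(waveform)  # (1, 128, time)
features = mel.mean(dim=-1)  # average over time -> (1, 128), matching input_dim
probs = model(features)  # (1, 10) class probabilities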
Summary
TensorFlow is a more comprehensive framework with better production support, while PyTorch Audio focuses specifically on audio processing tasks. TensorFlow offers a wider range of tools and resources but can be more complex to learn. PyTorch Audio provides a more intuitive and flexible approach for audio-related research and development. The choice between the two depends on the specific project requirements, deployment needs, and developer preferences.
Python library for audio and music analysis
Pros of librosa
- More comprehensive set of audio processing functions and features
- Easier to use for general audio analysis tasks
- Better documentation and examples for beginners
Cons of librosa
- Slower performance compared to PyTorch Audio
- Lacks deep learning integration and GPU acceleration
- Limited real-time processing capabilities
Code Comparison
librosa:
import librosa
y, sr = librosa.load('audio.wav')
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
PyTorch Audio:
import torchaudio
waveform, sample_rate = torchaudio.load('audio.wav')
mfcc_transform = torchaudio.transforms.MFCC(sample_rate=sample_rate, n_mfcc=13)
mfccs = mfcc_transform(waveform)
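Because librosa works on NumPy arrays and torchaudio on tensors, bridging the two is a one-liner:
import torch
import librosa
y, sr = librosa.load('audio.wav', sr=None)  # NumPy float32 array
waveform = torch.from_numpy(y).unsqueeze(0)  # (1, num_samples) torch tensor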
librosa is more straightforward for basic audio processing tasks, while PyTorch Audio integrates seamlessly with PyTorch's ecosystem for deep learning applications. PyTorch Audio offers better performance and GPU acceleration, making it suitable for large-scale audio processing and machine learning tasks. However, librosa provides a wider range of audio analysis functions and is generally easier to use for researchers and developers who don't require deep learning capabilities.
Magenta: Music and Art Generation with Machine Intelligence
Pros of Magenta
- Focuses on creative AI applications in music and art
- Offers a wide range of pre-built models for various creative tasks
- Includes interactive demos and notebooks for easy experimentation
Cons of Magenta
- Less flexible for general audio processing tasks
- Steeper learning curve for users not familiar with TensorFlow
- Smaller community compared to PyTorch ecosystem
Code Comparison
Magenta (TensorFlow-based; a simplified sketch, the actual generator API requires more setup):
import magenta.music as mm
from magenta.models.melody_rnn import melody_rnn_sequence_generator
model = melody_rnn_sequence_generator.get_generator(checkpoint='basic_rnn')
melody = model.generate(steps=128, primer_melody=mm.Melody())
Torchaudio:
import torchaudio
waveform, sample_rate = torchaudio.load('audio.wav')
spectrogram = torchaudio.transforms.Spectrogram()(waveform)
mel_spectrogram = torchaudio.transforms.MelSpectrogram()(waveform)
Summary
Magenta is ideal for creative AI projects in music and art, offering specialized models and tools. Torchaudio, part of the PyTorch ecosystem, provides more general-purpose audio processing capabilities with greater flexibility. Magenta uses TensorFlow, while Torchaudio is built on PyTorch, influencing the coding style and available resources.
End-to-End Speech Processing Toolkit
Pros of ESPnet
- Comprehensive end-to-end speech processing toolkit with support for various tasks (ASR, TTS, Speech Enhancement, etc.)
- Extensive pre-trained models and recipes for different languages and datasets
- Active community and frequent updates
Cons of ESPnet
- Steeper learning curve due to its comprehensive nature
- Potentially more complex setup and configuration process
- May require more computational resources for some tasks
Code Comparison
ESPnet example (ASR training, typically launched from the command line):
python -m espnet2.bin.asr_train \
    --output_dir exp/asr_train \
    --max_epoch 100 \
    --batch_size 32 \
    --accum_grad 2 \
    --train_data_path_and_name_and_type data/train/wav.scp,speech,sound
torchaudio example (Audio loading and preprocessing):
import torchaudio
waveform, sample_rate = torchaudio.load("audio.wav")
spectrogram = torchaudio.transforms.MelSpectrogram()(waveform)
mfcc = torchaudio.transforms.MFCC()(waveform)
While ESPnet provides a more comprehensive toolkit for speech processing tasks, torchaudio offers simpler audio processing capabilities within the PyTorch ecosystem. ESPnet is better suited for end-to-end speech tasks, while torchaudio is more focused on audio I/O and transformations.
kaldi-asr/kaldi is the official location of the Kaldi project.
Pros of Kaldi
- More comprehensive toolkit for speech recognition tasks
- Extensive documentation and recipes for various ASR scenarios
- Larger community and longer history in speech recognition research
Cons of Kaldi
- Steeper learning curve due to complexity
- Less integration with modern deep learning frameworks
- Written primarily in C++, which may be less accessible for some developers
Code Comparison
Kaldi (C++; simplified sketch of its feature-extraction internals):
FeatureWindowFunction feature_window_function(opts);
MelBanks mel_banks(opts.mel_opts, feature_window_function);
ComputeMelBankFeatures(waveform, opts, &mel_banks, &features);
PyTorch Audio (Python):
waveform, sample_rate = torchaudio.load("audio.wav")
mel_spectrogram = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate)(waveform)
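torchaudio also ships a Kaldi compliance interface that mirrors Kaldi's feature computation from Python:
import torchaudio
import torchaudio.compliance.kaldi as kaldi
waveform, sample_rate = torchaudio.load("audio.wav")
fbank = kaldi.fbank(waveform, num_mel_bins=40, sample_frequency=sample_rate)  # (num_frames, num_mel_bins)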
Summary
Kaldi is a more established and comprehensive toolkit for speech recognition, offering a wide range of tools and recipes. It's particularly well-suited for researchers and those working on complex ASR tasks. However, it has a steeper learning curve and is less integrated with modern deep learning frameworks.
PyTorch Audio, on the other hand, is more accessible for developers familiar with PyTorch and Python. It offers seamless integration with the PyTorch ecosystem and is easier to use for simpler audio processing tasks. However, it may not provide as many specialized tools for advanced speech recognition research as Kaldi does.
torchaudio: an audio library for PyTorch
The aim of torchaudio is to apply PyTorch to the audio domain. By supporting PyTorch, torchaudio follows the same philosophy of providing strong GPU acceleration, having a focus on trainable features through the autograd system, and having consistent style (tensor names and dimension names). Therefore, it is primarily a machine learning library and not a general signal processing library. The benefits of PyTorch can be seen in torchaudio: all computations are expressed as PyTorch operations, which makes torchaudio easy to use and feel like a natural extension.
- Support audio I/O (Load files, Save files)
  - Load a variety of audio formats, such as wav, mp3, ogg, flac, opus, and sphere, into a torch Tensor using SoX
  - Kaldi (ark/scp)
- Dataloaders for common audio datasets (see the sketch after this list)
- Audio and speech processing functions
- Common audio transforms
- Compliance interfaces: Run code using PyTorch that aligns with other libraries
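As an illustration of the dataset loaders, a minimal sketch using the small public YESNO corpus (the root path is arbitrary):
import torchaudio
dataset = torchaudio.datasets.YESNO(root="./data", download=True)  # downloads on first use
waveform, sample_rate, labels = dataset[0]  # labels is a list of 0/1 ints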
Installation
Please refer to https://pytorch.org/audio/main/installation.html for the installation and build process of TorchAudio.
API Reference
API Reference is located here: http://pytorch.org/audio/main/
Contributing Guidelines
Please refer to CONTRIBUTING.md
Citation
If you find this package useful, please cite as:
@article{yang2021torchaudio,
title={TorchAudio: Building Blocks for Audio and Speech Processing},
author={Yao-Yuan Yang and Moto Hira and Zhaoheng Ni and Anjali Chourdia and Artyom Astafurov and Caroline Chen and Ching-Feng Yeh and Christian Puhrsch and David Pollack and Dmitriy Genzel and Donny Greenberg and Edward Z. Yang and Jason Lian and Jay Mahadeokar and Jeff Hwang and Ji Chen and Peter Goldsborough and Prabhat Roy and Sean Narenthiran and Shinji Watanabe and Soumith Chintala and Vincent Quenneville-Bélair and Yangyang Shi},
journal={arXiv preprint arXiv:2110.15018},
year={2021}
}
@misc{hwang2023torchaudio,
title={TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch},
author={Jeff Hwang and Moto Hira and Caroline Chen and Xiaohui Zhang and Zhaoheng Ni and Guangzhi Sun and Pingchuan Ma and Ruizhe Huang and Vineel Pratap and Yuekai Zhang and Anurag Kumar and Chin-Yun Yu and Chuang Zhu and Chunxi Liu and Jacob Kahn and Mirco Ravanelli and Peng Sun and Shinji Watanabe and Yangyang Shi and Yumeng Tao and Robin Scheibler and Samuele Cornell and Sean Kim and Stavros Petridis},
year={2023},
eprint={2310.17864},
archivePrefix={arXiv},
primaryClass={eess.AS}
}
Disclaimer on Datasets
This is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use the dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.
If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thanks for your contribution to the ML community!
Pre-trained Model License
The pre-trained models provided in this library may have their own licenses or terms and conditions derived from the dataset used for training. It is your responsibility to determine whether you have permission to use the models for your use case.
For instance, the SquimSubjective model is released under the Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC 4.0) license. See the link for additional details.
Other pre-trained models that have different licenses are noted in the documentation. Please check out the documentation page.