pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch

Top Related Projects

  • 🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
  • TensorFlow (186,879 stars): An Open Source Machine Learning Framework for Everyone
  • librosa (7,088 stars): Python library for audio and music analysis
  • Magenta (19,130 stars): Music and Art Generation with Machine Intelligence
  • ESPnet (8,390 stars): End-to-End Speech Processing Toolkit
  • Kaldi (14,200 stars): kaldi-asr/kaldi is the official location of the Kaldi project.

Quick Overview

PyTorch Audio (torchaudio) is an open-source library for audio and signal processing in PyTorch. It provides a wide range of audio I/O, processing, and feature extraction functionalities, making it easier for researchers and developers to work with audio data in machine learning projects.

Pros

  • Seamless integration with PyTorch ecosystem
  • Extensive collection of audio processing and feature extraction tools
  • GPU acceleration support for faster processing
  • Active development and community support

Cons

  • Steeper learning curve for those unfamiliar with PyTorch
  • Limited documentation for some advanced features
  • Occasional inconsistencies in API design across different modules
  • Dependency on external libraries for certain functionalities

Code Examples

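  1. Loading and resampling an audio file:
import torchaudio

waveform, sample_rate = torchaudio.load("audio.wav")
# resample() takes the original rate and the new rate (orig_freq, new_freq)
resampled_waveform = torchaudio.functional.resample(waveform, sample_rate, new_freq=16000)
  2. Applying a spectrogram transform:
import torch
import torchaudio.transforms as T

spectrogram = T.Spectrogram()
spec = spectrogram(waveform)
  3. Applying data augmentation:
# TimeStretch works on a complex spectrogram, not on a raw waveform
complex_spec = T.Spectrogram(power=None)(resampled_waveform)
stretched_spec = T.TimeStretch(n_freq=201, fixed_rate=1.2)(complex_spec)  # 1.2 is an example rate

# PitchShift works on the waveform and needs the shift in semitones (n_steps)
pitch_shift = T.PitchShift(sample_rate=16000, n_steps=2)
augmented_waveform = pitch_shift(resampled_waveform)
  4. Using a pre-trained model for speech recognition:
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()
labels = bundle.get_labels()

# The model expects audio sampled at bundle.sample_rate, so pass the resampled waveform
with torch.inference_mode():
    emission, _ = model(resampled_waveform)

emission = torch.log_softmax(emission, dim=-1)
# post_process is a placeholder for a CTC decoder that you supply yourself
transcript = post_process(emission, labels)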
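
The `post_process` call in the last example is not defined anywhere in the snippet. One simple way such a step might be written is greedy CTC decoding: pick the best label for each frame, merge consecutive repeats, drop the blank token, and turn the word delimiter into a space. The sketch below is an illustration only. The name `greedy_ctc_decode`, the blank token at index 0, and `|` as the word delimiter are assumptions that should be checked against `bundle.get_labels()`. The input is a plain list of per-frame score lists, so the sketch runs without PyTorch.

```python
def greedy_ctc_decode(frame_scores, labels, blank=0, word_delimiter="|"):
    """Greedy CTC decoding over a list of per-frame score lists.

    frame_scores: list of frames, each a list with one score per label.
    labels: sequence of label strings, indexed like the scores.
    blank: index of the CTC blank token (assumed to be 0 here).
    """
    # 1. Pick the highest-scoring label index for every frame.
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    # 2. Merge runs of the same index into a single occurrence.
    collapsed = [idx for pos, idx in enumerate(best) if pos == 0 or idx != best[pos - 1]]
    # 3. Drop blanks and map the remaining indices to characters.
    chars = [labels[idx] for idx in collapsed if idx != blank]
    # 4. Treat the word delimiter as a space.
    return "".join(chars).replace(word_delimiter, " ").strip()


# Toy example with a four-symbol label set (hypothetical, for illustration).
toy_labels = ["-", "|", "H", "I"]
toy_frames = [
    [0.0, 0.0, 1.0, 0.0],  # H
    [0.0, 0.0, 1.0, 0.0],  # H (repeat, merged)
    [1.0, 0.0, 0.0, 0.0],  # blank
    [0.0, 0.0, 0.0, 1.0],  # I
    [0.0, 1.0, 0.0, 0.0],  # word delimiter
]
print(greedy_ctc_decode(toy_frames, toy_labels))
```

With a real emission tensor of shape (batch, frames, labels), something like `greedy_ctc_decode(emission[0].tolist(), labels)` would pass the first utterance to this helper.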

Getting Started

To get started with torchaudio, first install it using pip:

pip install torchaudio

Then, you can import and use the library in your Python script:

import torch
import torchaudio

# Load an audio file
waveform, sample_rate = torchaudio.load("path/to/audio.wav")

# Apply a transform
spectrogram = torchaudio.transforms.Spectrogram()
spec = spectrogram(waveform)

# Save the waveform (the spectrogram is a feature tensor, not audio, so it is not written here)
torchaudio.save("processed_audio.wav", waveform, sample_rate)

This basic example demonstrates loading an audio file, computing a spectrogram, and saving a waveform back to disk. Note that the saved file contains the original, unmodified waveform. Explore the documentation for more advanced functionality and examples.
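
To make the shape of `spec` easier to anticipate, the following sketch computes the expected output shape arithmetically. It assumes the documented `Spectrogram` defaults (`n_fft=400`, a hop length of half the window, and centered frames). `spectrogram_shape` is a hypothetical helper, not part of torchaudio.

```python
def spectrogram_shape(num_channels, num_samples, n_fft=400, hop_length=None):
    """Return (channels, frequency bins, frames) for a centered STFT.

    Assumes the window length equals n_fft and, when hop_length is not
    given, a hop of half the window.
    """
    if hop_length is None:
        hop_length = n_fft // 2
    n_freq = n_fft // 2 + 1                   # one-sided spectrum
    n_frames = 1 + num_samples // hop_length  # centered framing pads both ends
    return (num_channels, n_freq, n_frames)


# One second of mono audio at 16 kHz.
print(spectrogram_shape(1, 16000))  # (1, 201, 81)
```

If the shape printed by `spec.shape` differs from this estimate, the transform was probably constructed with non-default parameters.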

Competitor Comparisons

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

Pros of Transformers

  • Broader scope, covering various NLP tasks and models
  • Extensive documentation and community support
  • Regular updates and integration with latest research

Cons of Transformers

  • Larger codebase, potentially more complex to navigate
  • May have higher computational requirements for some tasks
  • Less specialized for audio-specific tasks

Code Comparison

Transformers example:

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("I love this library!")[0]
print(f"Label: {result['label']}, Score: {result['score']:.4f}")

torchaudio example:

import torchaudio

waveform, sample_rate = torchaudio.load("audio.wav")
spectrogram = torchaudio.transforms.Spectrogram()(waveform)
print(f"Spectrogram shape: {spectrogram.shape}")

Both libraries offer high-level APIs for their respective domains, so the two snippets above solve different problems rather than the same one. Transformers provides easy-to-use pipelines for various NLP tasks, while torchaudio focuses on audio I/O, processing, and feature extraction. Transformers is the more versatile choice for model-driven NLP work; torchaudio is specialized for audio operations. The choice between them depends on the project's requirements and the primary focus of the application.

An Open Source Machine Learning Framework for Everyone

Pros of TensorFlow

  • Larger ecosystem with more tools and libraries
  • Better support for production deployment and mobile/edge devices
  • More extensive documentation and community resources

Cons of TensorFlow

  • Steeper learning curve compared to PyTorch
  • Less dynamic and flexible for research and prototyping
  • Slower release cycle for new features

Code Comparison

TensorFlow:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

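PyTorch (the framework torchaudio builds on):

import torch.nn as nn

input_dim = 40  # example feature size; the original snippet left it undefined

model = nn.Sequential(
    nn.Linear(input_dim, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
    nn.Softmax(dim=1)
)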

Summary

TensorFlow is a more comprehensive framework with better production support, while PyTorch Audio focuses specifically on audio processing tasks. TensorFlow offers a wider range of tools and resources but can be more complex to learn. PyTorch Audio provides a more intuitive and flexible approach for audio-related research and development. The choice between the two depends on the specific project requirements, deployment needs, and developer preferences.

Python library for audio and music analysis

Pros of librosa

  • More comprehensive set of audio processing functions and features
  • Easier to use for general audio analysis tasks
  • Better documentation and examples for beginners

Cons of librosa

  • Slower performance compared to PyTorch Audio
  • Lacks deep learning integration and GPU acceleration
  • Limited real-time processing capabilities

Code Comparison

librosa:

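import librosa

# librosa.load resamples to 22,050 Hz and downmixes to mono by default;
# pass sr=None to keep the file's native sampling rate
y, sr = librosa.load('audio.wav')
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)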

PyTorch Audio:

import torchaudio

waveform, sample_rate = torchaudio.load('audio.wav')
mfcc_transform = torchaudio.transforms.MFCC(sample_rate=sample_rate, n_mfcc=13)
mfccs = mfcc_transform(waveform)

librosa is more straightforward for basic audio processing tasks, while PyTorch Audio integrates seamlessly with PyTorch's ecosystem for deep learning applications. PyTorch Audio offers better performance and GPU acceleration, making it suitable for large-scale audio processing and machine learning tasks. However, librosa provides a wider range of audio analysis functions and is generally easier to use for researchers and developers who don't require deep learning capabilities.
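
Even with matching parameters, the two MFCC snippets above may not produce identical values. One likely contributor is the mel scale: librosa's filterbank defaults to the Slaney-style scale, while torchaudio's mel transforms default to the HTK formula (`mel_scale="htk"`). The sketch below implements both conversions in plain Python to show how far apart they are. The helper names are hypothetical, and the constants follow the commonly published definitions of the two scales.

```python
import math


def hz_to_mel_htk(freq_hz):
    """HTK mel scale: a single logarithmic formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def hz_to_mel_slaney(freq_hz):
    """Slaney mel scale: linear below 1 kHz, logarithmic above."""
    f_sp = 200.0 / 3.0                 # Hz per mel in the linear region
    min_log_hz = 1000.0                # start of the logarithmic region
    min_log_mel = min_log_hz / f_sp    # 15 mels at 1 kHz
    logstep = math.log(6.4) / 27.0     # step size in the logarithmic region
    if freq_hz < min_log_hz:
        return freq_hz / f_sp
    return min_log_mel + math.log(freq_hz / min_log_hz) / logstep


for f in (250.0, 1000.0, 4000.0):
    print(f, round(hz_to_mel_htk(f), 2), round(hz_to_mel_slaney(f), 2))
```

The two scales use different units, so the filter center frequencies they produce do not line up. When comparing features across the two libraries, it helps to set the mel scale, normalization, and sampling rate explicitly on both sides.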

Magenta: Music and Art Generation with Machine Intelligence

Pros of Magenta

  • Focuses on creative AI applications in music and art
  • Offers a wide range of pre-built models for various creative tasks
  • Includes interactive demos and notebooks for easy experimentation

Cons of Magenta

  • Less flexible for general audio processing tasks
  • Steeper learning curve for users not familiar with TensorFlow
  • Smaller community compared to PyTorch ecosystem

Code Comparison

Magenta (using TensorFlow; a schematic sketch, the exact generator API may differ):

import magenta.music as mm
from magenta.models.melody_rnn import melody_rnn_sequence_generator

model = melody_rnn_sequence_generator.get_generator(checkpoint='basic_rnn')
melody = model.generate(steps=128, primer_melody=mm.Melody())

torchaudio:

import torchaudio

waveform, sample_rate = torchaudio.load('audio.wav')
spectrogram = torchaudio.transforms.Spectrogram()(waveform)
# Pass the file's sampling rate; MelSpectrogram otherwise assumes 16 kHz
mel_spectrogram = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate)(waveform)

Summary

Magenta is ideal for creative AI projects in music and art, offering specialized models and tools. Torchaudio, part of the PyTorch ecosystem, provides more general-purpose audio processing capabilities with greater flexibility. Magenta uses TensorFlow, while Torchaudio is built on PyTorch, influencing the coding style and available resources.

End-to-End Speech Processing Toolkit

Pros of ESPnet

  • Comprehensive end-to-end speech processing toolkit with support for various tasks (ASR, TTS, Speech Enhancement, etc.)
  • Extensive pre-trained models and recipes for different languages and datasets
  • Active community and frequent updates

Cons of ESPnet

  • Steeper learning curve due to its comprehensive nature
  • Potentially more complex setup and configuration process
  • May require more computational resources for some tasks

Code Comparison

ESPnet example (ASR training):

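from espnet2.bin.asr_train import main

# main() parses command-line style arguments, so options are passed as a list
# of strings rather than a dict. A real run needs further options
# (for example a token list and validation data).
main([
    "--output_dir", "exp/asr_train",
    "--max_epoch", "100",
    "--batch_size", "32",
    "--accum_grad", "2",
    "--train_data_path_and_name_and_type", "data/train/wav.scp,speech,sound",
])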

torchaudio example (Audio loading and preprocessing):

import torchaudio

waveform, sample_rate = torchaudio.load("audio.wav")
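# Pass the file's sampling rate; both transforms otherwise assume 16 kHz
mel_spectrogram = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate)(waveform)
mfcc = torchaudio.transforms.MFCC(sample_rate=sample_rate)(waveform)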

While ESPnet provides a more comprehensive toolkit for speech processing tasks, torchaudio offers simpler audio processing capabilities within the PyTorch ecosystem. ESPnet is better suited for end-to-end speech tasks, while torchaudio is more focused on audio I/O and transformations.

kaldi-asr/kaldi is the official location of the Kaldi project.

Pros of Kaldi

  • More comprehensive toolkit for speech recognition tasks
  • Extensive documentation and recipes for various ASR scenarios
  • Larger community and longer history in speech recognition research

Cons of Kaldi

  • Steeper learning curve due to complexity
  • Less integration with modern deep learning frameworks
  • Written primarily in C++, which may be less accessible for some developers

Code Comparison

Kaldi (C++):

FeatureWindowFunction feature_window_function(opts);
MelBanks mel_banks(opts.mel_opts, feature_window_function);
ComputeMelBankFeatures(waveform, opts, &mel_banks, &features);

PyTorch Audio (Python):

import torchaudio

waveform, sample_rate = torchaudio.load("audio.wav")
mel_spectrogram = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate)(waveform)

Summary

Kaldi is a more established and comprehensive toolkit for speech recognition, offering a wide range of tools and recipes. It's particularly well-suited for researchers and those working on complex ASR tasks. However, it has a steeper learning curve and is less integrated with modern deep learning frameworks.

PyTorch Audio, on the other hand, is more accessible for developers familiar with PyTorch and Python. It offers seamless integration with the PyTorch ecosystem and is easier to use for simpler audio processing tasks. However, it may not provide as many specialized tools for advanced speech recognition research as Kaldi does.

README

torchaudio: an audio library for PyTorch

The aim of torchaudio is to apply PyTorch to the audio domain. By supporting PyTorch, torchaudio follows the same philosophy: strong GPU acceleration, a focus on trainable features through the autograd system, and a consistent style (tensor names and dimension names). It is therefore primarily a machine learning library, not a general signal processing library. Because all computations are expressed as PyTorch operations, torchaudio is easy to use and feels like a natural extension of PyTorch.

Installation

Please refer to https://pytorch.org/audio/main/installation.html for installation and build process of TorchAudio.

API Reference

API Reference is located here: http://pytorch.org/audio/main/

Contributing Guidelines

Please refer to CONTRIBUTING.md

Citation

If you find this package useful, please cite as:

@article{yang2021torchaudio,
  title={TorchAudio: Building Blocks for Audio and Speech Processing},
  author={Yao-Yuan Yang and Moto Hira and Zhaoheng Ni and Anjali Chourdia and Artyom Astafurov and Caroline Chen and Ching-Feng Yeh and Christian Puhrsch and David Pollack and Dmitriy Genzel and Donny Greenberg and Edward Z. Yang and Jason Lian and Jay Mahadeokar and Jeff Hwang and Ji Chen and Peter Goldsborough and Prabhat Roy and Sean Narenthiran and Shinji Watanabe and Soumith Chintala and Vincent Quenneville-Bélair and Yangyang Shi},
  journal={arXiv preprint arXiv:2110.15018},
  year={2021}
}
@misc{hwang2023torchaudio,
  title={TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch},
  author={Jeff Hwang and Moto Hira and Caroline Chen and Xiaohui Zhang and Zhaoheng Ni and Guangzhi Sun and Pingchuan Ma and Ruizhe Huang and Vineel Pratap and Yuekai Zhang and Anurag Kumar and Chin-Yun Yu and Chuang Zhu and Chunxi Liu and Jacob Kahn and Mirco Ravanelli and Peng Sun and Shinji Watanabe and Yangyang Shi and Yumeng Tao and Robin Scheibler and Samuele Cornell and Sean Kim and Stavros Petridis},
  year={2023},
  eprint={2310.17864},
  archivePrefix={arXiv},
  primaryClass={eess.AS}
}

Disclaimer on Datasets

This is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use the dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.

If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thanks for your contribution to the ML community!

Pre-trained Model License

The pre-trained models provided in this library may have their own licenses or terms and conditions derived from the dataset used for training. It is your responsibility to determine whether you have permission to use the models for your use case.

For instance, the SquimSubjective model is released under the Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC 4.0) license. See the link for additional details.

Other pre-trained models that have a different license are noted in the documentation. Please check out the documentation page.