data
A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
Top Related Projects
TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
Parallel computing with task scheduling
Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
HDF5 for Python -- The h5py package is a Pythonic interface to the HDF5 binary data format.
Quick Overview
PyTorch Data (TorchData) is a library of data loading utilities shared across the PyTorch domain libraries. It builds on torch.utils.data to simplify loading and preprocessing data for PyTorch models. The built-in datasets used in the examples below come from domain libraries such as torchvision.
Pros
- Extensive collection of pre-built datasets for various machine learning tasks
- Flexible and customizable data loading pipelines
- Seamless integration with PyTorch models and training loops
- Efficient data handling with support for parallel processing and caching
Cons
- Learning curve for advanced customization of data loaders
- Limited documentation for some less common datasets
- Occasional inconsistencies in dataset formats across different domains
- May require additional dependencies for certain datasets
Code Examples
- Loading a built-in dataset:
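from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import MNIST
# Load MNIST and convert images to tensors so the DataLoader can batch them
train_dataset = MNIST(root='./data', train=True, download=True, transform=transforms.ToTensor())
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)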
- Creating a custom dataset:
import torch
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

# Usage (random placeholder tensors stand in for real data)
data = torch.randn(100, 8)
labels = torch.randint(0, 2, (100,))
custom_dataset = CustomDataset(data, labels)
custom_loader = DataLoader(custom_dataset, batch_size=16, shuffle=True)
- Applying transformations to data:
from torchvision import transforms
from torchvision.datasets import MNIST
# Define transformations (MNIST images have one channel, so Normalize takes a single mean/std;
# 0.1307 and 0.3081 are the commonly used MNIST statistics)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.1307], std=[0.3081])
])
# Apply transformations to a dataset
transformed_dataset = MNIST(root='./data', train=True, download=True, transform=transform)
Getting Started
To get started with PyTorch Data, follow these steps:
- Install PyTorch and torchvision:
pip install torch torchvision
- Import the necessary modules:
from torch.utils.data import Dataset, DataLoader
from torchvision import datasets, transforms
- Load a dataset and create a data loader:
# Load CIFAR-10 dataset
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = DataLoader(trainset, batch_size=4, shuffle=True, num_workers=2)
Now you can use the trainloader in your PyTorch training loop to iterate over batches of data.
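As a rough illustration of what iterating over such a loader in a training loop can look like, here is a minimal sketch. It makes several assumptions that are not part of the steps above: random tensors stand in for CIFAR-10 so that nothing is downloaded, and the one-layer model, loss, optimizer, and learning rate are placeholders chosen only to keep the loop short.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in for CIFAR-10: 16 random 3x32x32 "images" with labels in [0, 10)
images = torch.randn(16, 3, 32, 32)
labels = torch.randint(0, 10, (16,))
trainloader = DataLoader(TensorDataset(images, labels), batch_size=4, shuffle=True)

# Placeholder model and optimizer (assumptions for illustration only)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

num_batches = 0
for epoch in range(2):
    for inputs, targets in trainloader:
        optimizer.zero_grad()                      # clear gradients from the previous step
        loss = criterion(model(inputs), targets)   # forward pass and loss
        loss.backward()                            # backpropagate
        optimizer.step()                           # update parameters
        num_batches += 1
```

With a real dataset, the loader built in the previous step would replace the synthetic one; the loop body stays the same.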
Competitor Comparisons
TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
Pros of datasets
- Larger collection of ready-to-use datasets
- Better documentation and examples
- Seamless integration with TensorFlow ecosystem
Cons of datasets
- Less flexible for custom dataset creation
- Slower loading times for large datasets
- Limited support for non-TensorFlow frameworks
Code Comparison
datasets:
import tensorflow_datasets as tfds
dataset = tfds.load('mnist', split='train', as_supervised=True)
dataset = dataset.shuffle(1000).batch(32)
for images, labels in dataset:
    pass  # Train model
data:
import torch
from torchvision import datasets, transforms
dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transforms.ToTensor())
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
for images, labels in dataloader:
    pass  # Train model
Both repositories provide easy access to common datasets, but datasets offers a more extensive collection out-of-the-box. data, however, allows for more customization in dataset creation and preprocessing. datasets integrates seamlessly with TensorFlow, while data is designed for PyTorch users. The code examples demonstrate the simplicity of loading and using datasets in both frameworks, with similar syntax for data loading and iteration.
Parallel computing with task scheduling
Pros of Dask
- Designed for parallel and distributed computing, allowing for processing of larger-than-memory datasets
- Integrates well with the PyData ecosystem (NumPy, Pandas, Scikit-learn)
- Flexible task scheduling and execution for complex workflows
Cons of Dask
- Steeper learning curve compared to PyTorch's data utilities
- May have higher overhead for smaller datasets
- Less focused on deep learning specific data processing
Code Comparison
Dask:
import dask.dataframe as dd
df = dd.read_csv('large_dataset.csv')
result = df.groupby('category').mean().compute()
PyTorch Data:
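import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, csv_file):
        self.data = pd.read_csv(csv_file)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Convert the row to a tensor so the default collate function can batch it
        # (assumes every column is numeric)
        return torch.tensor(self.data.iloc[idx].to_numpy(dtype="float32"))

dataset = CustomDataset('dataset.csv')
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)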
Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
Pros of pandas
- Mature and widely-used library with extensive documentation and community support
- Powerful data manipulation and analysis capabilities for structured data
- Seamless integration with other data science libraries in the Python ecosystem
Cons of pandas
- Can be memory-intensive for large datasets
- Learning curve can be steep for beginners due to its extensive functionality
- Performance may be slower compared to specialized libraries for specific tasks
Code Comparison
pandas:
import pandas as pd
df = pd.read_csv('data.csv')
filtered_df = df[df['column'] > 5]
result = filtered_df.groupby('category').mean()
pytorch/data:
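from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, data_file):
        # Read one sample per line; replace with parsing suited to your file format
        with open(data_file) as f:
            self.data = f.read().splitlines()

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]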
The code snippets highlight the different focus areas of the two libraries. pandas is designed for data manipulation and analysis, while pytorch/data is tailored for creating custom datasets for machine learning tasks.
HDF5 for Python -- The h5py package is a Pythonic interface to the HDF5 binary data format.
Pros of h5py
- Specialized for handling HDF5 files, offering efficient storage and access for large datasets
- Language-agnostic, allowing data interchange between different programming environments
- Supports complex data structures and metadata
Cons of h5py
- Limited to HDF5 format, less versatile for other data types
- Steeper learning curve for users unfamiliar with HDF5
- Less integrated with machine learning workflows compared to PyTorch Data
Code Comparison
h5py:
import h5py
with h5py.File('data.h5', 'r') as f:
    dataset = f['dataset_name'][:]
PyTorch Data:
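from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, data_path):
        self.data = load_data(data_path)  # load_data: placeholder for your own loader

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]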
h5py excels in handling HDF5 files, providing efficient storage and access for large datasets across different programming environments. It supports complex data structures and metadata. However, it's limited to the HDF5 format and has a steeper learning curve.
PyTorch Data, on the other hand, is more versatile and integrates seamlessly with PyTorch's machine learning ecosystem. It supports various data formats and is easier to use for those already familiar with PyTorch. However, it may not be as efficient for very large datasets compared to h5py's specialized HDF5 handling.
README
TorchData (see note below on current status)
What is TorchData? | Stateful DataLoader | Install guide | Contributing | License
:warning: June 2024 Status Update: Removing DataPipes and DataLoader V2
We are re-focusing the torchdata repo to be an iterative enhancement of torch.utils.data.DataLoader. We do not plan on continuing development or maintaining the DataPipes and DataLoaderV2 solutions, and they will be removed from the torchdata repo. We'll also be revisiting the DataPipes references in pytorch/pytorch. In release torchdata==0.8.0 (July 2024) they will be marked as deprecated, and in 0.9.0 (Oct 2024) they will be deleted. Existing users are advised to pin to torchdata==0.8.0 or an older version until they are able to migrate away. Subsequent releases will not include DataPipes or DataLoaderV2. The old version of this README is available here. Please reach out if you have suggestions or comments (please use #1196 for feedback).
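Code that still depends on DataPipes or DataLoaderV2 may want to check the installed release against the timeline above. The following is a small sketch in plain Python; the helper names `parse_version` and `has_datapipes` are hypothetical and not part of torchdata, and the only fact encoded is the one stated above (deletion in 0.9.0).

```python
from importlib import metadata


def parse_version(version: str) -> tuple:
    """Turn '0.8.0' into (0, 8, 0), stopping at the first non-numeric piece."""
    parts = []
    for piece in version.split("+")[0].split("."):
        if not piece.isdigit():
            break  # e.g. 'dev20240601' or '0a0' in pre-release strings
        parts.append(int(piece))
    return tuple(parts)


def has_datapipes(torchdata_version: str) -> bool:
    """True for releases before 0.9.0, which still include DataPipes and DataLoaderV2."""
    return parse_version(torchdata_version) < (0, 9)


# Look up the installed release, if torchdata is installed at all
try:
    installed = metadata.version("torchdata")
except metadata.PackageNotFoundError:
    installed = None

if installed is not None and not has_datapipes(installed):
    print(f"torchdata {installed} no longer ships DataPipes; pin to 0.8.0 or older")
```

A check like this only reports the situation; the pinning itself still happens in the project's dependency specification.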
What is TorchData?
The TorchData project is an iterative enhancement to the PyTorch torch.utils.data.DataLoader and torch.utils.data.Dataset/IterableDataset to make them scalable, performant dataloading solutions. We will be iterating on the enhancements under the torchdata repo.
Our first change adds checkpointing to torch.utils.data.DataLoader. It lives in stateful_dataloader, a drop-in replacement for torch.utils.data.DataLoader that defines load_state_dict and state_dict methods. These methods enable mid-epoch checkpointing and give users an API to track custom iteration progress and other custom state from the dataloader workers, such as token buffers and/or RNG states.
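To make the idea of custom state concrete, here is a plain-Python sketch of an iterable that keeps a token buffer and exposes state_dict and load_state_dict methods. It does not import torchdata, the class name `TokenStream` is hypothetical, and how the library collects such state from worker processes is not shown; it only illustrates the kind of state (read position plus buffered tokens) that has to be saved to resume mid-epoch.

```python
class TokenStream:
    """Hypothetical iterable that packs tokens into fixed-size chunks and can resume."""

    def __init__(self, documents, chunk_size):
        self.documents = documents
        self.chunk_size = chunk_size
        self.next_doc = 0   # index of the next document to read
        self.buffer = []    # tokens read but not yet emitted

    def __iter__(self):
        while True:
            # Refill the buffer until a full chunk is available or documents run out
            while len(self.buffer) < self.chunk_size and self.next_doc < len(self.documents):
                self.buffer.extend(self.documents[self.next_doc])
                self.next_doc += 1
            if len(self.buffer) < self.chunk_size:
                return  # drop the incomplete tail
            chunk = self.buffer[:self.chunk_size]
            self.buffer = self.buffer[self.chunk_size:]
            yield chunk

    def state_dict(self):
        # Everything needed to continue exactly where iteration stopped
        return {"next_doc": self.next_doc, "buffer": list(self.buffer)}

    def load_state_dict(self, state):
        self.next_doc = state["next_doc"]
        self.buffer = list(state["buffer"])


documents = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
stream = TokenStream(documents, chunk_size=4)
first = next(iter(stream))          # consume one chunk, then checkpoint
checkpoint = stream.state_dict()

resumed = TokenStream(documents, chunk_size=4)
resumed.load_state_dict(checkpoint)
remaining = list(resumed)           # continues after the first chunk
```

Saving only the document index would lose the token left in the buffer, which is why the buffer is part of the checkpoint.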
Stateful DataLoader
torchdata.stateful_dataloader.StatefulDataLoader is a drop-in replacement for torch.utils.data.DataLoader which provides state_dict and load_state_dict functionality. See the Stateful DataLoader main page for more information and examples. Also check out the examples in this Colab notebook.
Installation
Version Compatibility
The following table lists the corresponding torch and torchdata versions and the supported Python versions.
torch | torchdata | python |
---|---|---|
master / nightly | main / nightly | >=3.8 , <=3.12 |
2.4.0 | 0.8.0 | >=3.8 , <=3.12 |
2.0.0 | 0.6.0 | >=3.8 , <=3.11 |
1.13.1 | 0.5.1 | >=3.7 , <=3.10 |
1.12.1 | 0.4.1 | >=3.7 , <=3.10 |
1.12.0 | 0.4.0 | >=3.7 , <=3.10 |
1.11.0 | 0.3.0 | >=3.7 , <=3.10 |
Local pip or conda
First, set up an environment. We will be installing a PyTorch binary as well as torchdata. If you're using conda, create a conda environment:
conda create --name torchdata
conda activate torchdata
If you wish to use venv instead:
python -m venv torchdata-env
source torchdata-env/bin/activate
Install torchdata:
Using pip:
pip install torchdata
Using conda:
conda install -c pytorch torchdata
From source
pip install .
In case building TorchData from source fails, install the nightly version of PyTorch following the linked guide on the contributing page.
From nightly
The nightly version of TorchData is also provided and is updated daily from the main branch.
Using pip:
pip install --pre torchdata --extra-index-url https://download.pytorch.org/whl/nightly/cpu
Using conda:
conda install torchdata -c pytorch-nightly
Contributing
We welcome PRs! See the CONTRIBUTING file.
Beta Usage and Feedback
We'd love to hear from and work with early adopters to shape our designs. Please reach out by raising an issue if you're interested in using this tooling for your project.
License
TorchData is BSD licensed, as found in the LICENSE file.