datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...


Top Related Projects

data

1,206

A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.

scikit-learn

62,466

scikit-learn: machine learning in Python

pandas

45,255

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

numpy

30,015

The fundamental package for scientific computing with Python.

Quick Overview

TensorFlow Datasets is an open-source library that provides a collection of ready-to-use datasets for machine learning tasks. It offers a simple and consistent interface to access and prepare datasets for use with TensorFlow, making it easier for researchers and developers to focus on model development rather than data preprocessing.

Pros

Large variety of datasets available, covering various domains such as image classification, natural language processing, and speech recognition
Consistent API for loading and preprocessing datasets, simplifying integration with TensorFlow models
Automatic download and caching of datasets, so repeated runs reuse the prepared data instead of downloading it again
Support for custom dataset creation and contribution to the library

Cons

Some datasets may require significant storage space and download time
Limited customization options for certain preprocessing steps
Occasional compatibility issues with specific TensorFlow versions
Some less popular datasets may have limited documentation or support

Code Examples

Loading and preparing a dataset:

import tensorflow as tf
import tensorflow_datasets as tfds

# Load the MNIST dataset
dataset = tfds.load('mnist', split='train', as_supervised=True)

# Prepare the dataset for training
dataset = dataset.map(lambda img, label: (tf.cast(img, tf.float32) / 255.0, label))
dataset = dataset.batch(32).prefetch(tf.data.AUTOTUNE)
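The map and batch steps above can be pictured with plain Python. The following is only a conceptual sketch using the standard library, not how tf.data is implemented; the helper names `batch` and `normalize` are hypothetical and chosen for illustration.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batch(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group consecutive items into lists of at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    current: List[T] = []
    for item in items:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        # The final batch may be smaller than batch_size.
        yield current


def normalize(pixel: int) -> float:
    """Scale an 8-bit pixel value into the range [0, 1]."""
    return pixel / 255.0


# Seven made-up pixel values, normalized and then grouped three at a time.
pixels = [0, 51, 102, 153, 204, 255, 255]
batches = list(batch(map(normalize, pixels), batch_size=3))
print(batches)
```

Seven values with a batch size of three produce two full batches and one partial batch, which is why the last training step of an epoch can see fewer examples than the others.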

Creating a custom dataset:

import tensorflow_datasets as tfds

class MyDataset(tfds.core.GeneratorBasedBuilder):
    VERSION = tfds.core.Version('1.0.0')
    RELEASE_NOTES = {
        '1.0.0': 'Initial release.',
    }

    def _info(self):
        return tfds.core.DatasetInfo(
            builder=self,
            features=tfds.features.FeaturesDict({
                'image': tfds.features.Image(shape=(28, 28, 1)),
                'label': tfds.features.ClassLabel(num_classes=10),
            }),
        )

    def _split_generators(self, dl_manager):
        # Define your data sources and splits here
        ...

    def _generate_examples(self):
        # Generate examples from your data sources
        ...
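The builder skeleton above leaves the example generator empty. As a rough sketch of the shape it is expected to produce, the generator yields pairs of a unique key and a dictionary whose keys match the declared features. The `RECORDS` list and the standalone `generate_examples` function below are hypothetical stand-ins for real data files and for the method on the builder class.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical in-memory records standing in for files on disk.
RECORDS: List[Dict[str, object]] = [
    {"filename": "img_0.png", "label": 3},
    {"filename": "img_1.png", "label": 7},
]


def generate_examples(
    records: Iterable[Dict[str, object]],
) -> Iterator[Tuple[str, Dict[str, object]]]:
    """Yield (key, example) pairs, one per record."""
    for record in records:
        # The key must be unique within a split; the filename is assumed to be.
        key = str(record["filename"])
        example = {
            "image": record["filename"],  # assumed here to be a path to an image file
            "label": record["label"],
        }
        yield key, example


examples = list(generate_examples(RECORDS))
print(examples[0])
```

Using a stable, unique key such as a filename, rather than a running counter that depends on read order, is a common choice for keeping generated data reproducible.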

Using a dataset with a TensorFlow model:

import tensorflow as tf
import tensorflow_datasets as tfds

# Load and prepare the dataset
dataset = tfds.load('cifar10', split='train', as_supervised=True)
dataset = dataset.map(lambda img, label: (tf.cast(img, tf.float32) / 255.0, label))
dataset = dataset.batch(32).prefetch(tf.data.AUTOTUNE)

# Create a simple model
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(32, 32, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10)
])

# Compile and train the model
model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=['accuracy'])
model.fit(dataset, epochs=5)

Getting Started

To get started with TensorFlow Datasets, follow these steps:

Install the library:

pip install tensorflow-datasets

Import the library and load a dataset:

import tensorflow_datasets as tfds

# Load the MNIST dataset
dataset = tfds.load('mnist', split='train', as_supervised=True)

# Iterate through the dataset
for image, label in dataset.take(1):
    print(f"Image shape: {image.shape}")
    print(f"Label: {label.numpy()}")

This will download the MNIST dataset (if not already cached) and print information about the first example in the dataset.
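The "download only if not already cached" behaviour can be illustrated with a small standard-library sketch. This is not the library's implementation; `load_with_cache`, `fake_download`, and the `.bin` file layout are hypothetical names invented for the illustration.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def load_with_cache(name: str, cache_dir: Path, download: Callable[[], bytes]) -> bytes:
    """Return cached bytes for `name`, calling `download` only on a cache miss."""
    cached = cache_dir / f"{name}.bin"
    if cached.exists():
        return cached.read_bytes()
    data = download()
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached.write_bytes(data)
    return data


calls: List[int] = []


def fake_download() -> bytes:
    """Stand-in for a network download; records each invocation."""
    calls.append(1)
    return b"example-bytes"


with tempfile.TemporaryDirectory() as tmp:
    first = load_with_cache("mnist", Path(tmp), fake_download)
    second = load_with_cache("mnist", Path(tmp), fake_download)

print(len(calls))
```

The second call finds the file written by the first one, so the download function runs once even though the data is requested twice.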

Competitor Comparisons

data

1,206

A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.

Pros of data

More flexible and customizable dataset creation
Easier integration with PyTorch ecosystem
Simpler API for data loading and preprocessing

Cons of data

Smaller collection of pre-built datasets compared to datasets
Less standardized dataset formats and interfaces
Fewer built-in features for distributed data loading

Code Comparison

datasets:

import tensorflow_datasets as tfds

dataset = tfds.load('mnist', split='train', as_supervised=True)
dataset = dataset.shuffle(1000).batch(32)

for images, labels in dataset:
    pass  # Train the model on this batch

data:

import torch
from torchvision import datasets, transforms

# download=True fetches MNIST into ./data if it is not already there
dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transforms.ToTensor())
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

for images, labels in dataloader:
    pass  # Train the model on this batch

Both repositories provide tools for loading and preprocessing datasets. data offers more flexibility in dataset creation and easier integration with PyTorch, while datasets has a larger collection of pre-built datasets and more standardized formats. The code examples load MNIST in each ecosystem; the PyTorch snippet relies on torchvision and torch.utils.data.DataLoader, and the two loading loops are otherwise of similar length.

scikit-learn

62,466

scikit-learn: machine learning in Python

Pros of scikit-learn

Comprehensive collection of machine learning algorithms and tools
Easy-to-use API with consistent interface across different models
Extensive documentation and community support

Cons of scikit-learn

Limited support for deep learning and neural networks
Not optimized for large-scale distributed computing
Slower performance compared to specialized libraries for specific tasks

Code Comparison

scikit-learn:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4)
clf = RandomForestClassifier()
clf.fit(X, y)

TensorFlow Datasets:

import tensorflow as tf
import tensorflow_datasets as tfds

dataset = tfds.load('mnist', split='train', shuffle_files=True)
dataset = dataset.shuffle(1000).batch(32).prefetch(tf.data.AUTOTUNE)
for example in dataset:
    image, label = example['image'], example['label']
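The loop above reads each example as a dictionary, whereas the earlier snippets pass as_supervised=True and unpack tuples. A minimal pure-Python sketch of how the two shapes relate is shown here; the `to_supervised` helper and the `SUPERVISED_KEYS` constant are hypothetical, and each real dataset declares its own input and target keys.

```python
from typing import Dict, Tuple

# Assumed (input, target) feature names for an image classification dataset.
SUPERVISED_KEYS: Tuple[str, str] = ("image", "label")


def to_supervised(
    example: Dict[str, object],
    keys: Tuple[str, str] = SUPERVISED_KEYS,
) -> Tuple[object, object]:
    """Pick the input and target entries out of a feature dictionary."""
    input_key, target_key = keys
    return example[input_key], example[target_key]


# A tiny made-up example: a 2x2 "image" and an integer class label.
example = {"image": [[0, 255], [255, 0]], "label": 1}
image, label = to_supervised(example)
print(label)
```

The dictionary form keeps every feature available by name, while the tuple form matches what training loops such as model.fit expect.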

Key Differences

scikit-learn focuses on traditional machine learning algorithms, while TensorFlow Datasets is primarily used for loading and preprocessing data for deep learning tasks
scikit-learn provides a unified API for various ML tasks, whereas TensorFlow Datasets is specifically designed for efficient data loading and preprocessing in TensorFlow workflows
scikit-learn includes built-in datasets for quick experimentation, while TensorFlow Datasets offers a wide range of pre-built datasets from various domains

pandas

45,255

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

Pros of pandas

More versatile for general data manipulation and analysis tasks
Extensive documentation and large community support
Seamless integration with other data science libraries in Python

Cons of pandas

Less optimized for machine learning workflows compared to TensorFlow Datasets
May require more memory for large datasets
Slower performance for certain operations on very large datasets

Code Comparison

pandas:

import pandas as pd

df = pd.read_csv('data.csv')
filtered_df = df[df['column'] > 5]
result = filtered_df.groupby('category').mean()

TensorFlow Datasets:

import tensorflow as tf
import tensorflow_datasets as tfds

dataset = tfds.load('mnist', split='train')
dataset = dataset.filter(lambda x: x['label'] < 5)
dataset = dataset.batch(32).prefetch(tf.data.AUTOTUNE)

The pandas code demonstrates basic data loading, filtering, and aggregation, while the TensorFlow Datasets code shows dataset loading, filtering, and preparation for machine learning tasks. pandas is more flexible for general data manipulation, while TensorFlow Datasets is optimized for machine learning workflows, especially with TensorFlow.

numpy

30,015

The fundamental package for scientific computing with Python.

Pros of NumPy

More general-purpose, suitable for a wide range of numerical computing tasks
Smaller library size, faster import times
Extensive documentation and large community support

Cons of NumPy

Lacks built-in dataset management features
Not optimized for machine learning workflows
Limited support for GPU acceleration

Code Comparison

NumPy:

import numpy as np

# Create and manipulate arrays
arr = np.array([1, 2, 3, 4, 5])
result = arr * 2

TensorFlow Datasets:

import tensorflow_datasets as tfds

# Load a dataset
dataset = tfds.load('mnist', split='train')
for example in dataset.take(1):
    image, label = example['image'], example['label']

NumPy focuses on array operations and mathematical functions, while TensorFlow Datasets specializes in loading and preprocessing datasets for machine learning. NumPy is more versatile for general numerical computing, but TensorFlow Datasets offers streamlined dataset management for ML workflows. The choice between them depends on the specific requirements of your project.


README

TensorFlow Datasets

TensorFlow Datasets provides many public datasets as tf.data.Datasets.

Documentation

To install and use TFDS, we strongly encourage you to start with our getting started guide. Try it interactively in a Colab notebook.

Our documentation contains:

# !pip install tensorflow-datasets
import tensorflow_datasets as tfds
import tensorflow as tf

# Construct a tf.data.Dataset
ds = tfds.load('mnist', split='train', as_supervised=True, shuffle_files=True)

# Build your input pipeline
ds = ds.shuffle(1000).batch(128).prefetch(10).take(5)
for image, label in ds:
  pass

TFDS core values

TFDS has been built with these principles in mind:

Simplicity: Standard use-cases should work out of the box
Performance: TFDS follows best practices and can achieve state-of-the-art speed
Determinism/reproducibility: All users get the same examples in the same order
Customisability: Advanced users can have fine-grained control

If those use cases are not satisfied, please send us feedback.
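The determinism principle listed above can be illustrated in plain Python: when the ordering depends only on the data and a fixed seed, every run sees the same sequence. This is a conceptual sketch with a hypothetical `shuffled` helper, not a description of how TFDS orders its examples.

```python
import random
from typing import List, Sequence


def shuffled(examples: Sequence[int], seed: int) -> List[int]:
    """Return a shuffled copy whose order depends only on `examples` and `seed`."""
    rng = random.Random(seed)  # private generator, unaffected by global random state
    result = list(examples)
    rng.shuffle(result)
    return result


examples = list(range(10))
run_a = shuffled(examples, seed=42)
run_b = shuffled(examples, seed=42)
print(run_a == run_b)
```

Using a dedicated `random.Random` instance rather than the module-level functions keeps the result independent of any other code that draws random numbers in the same process.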

Want a certain dataset?

Adding a dataset is really straightforward by following our guide.

Request a dataset by opening a Dataset request GitHub issue.

And vote on the current set of requests by adding a thumbs-up reaction to the issue.

Citation

Please include the following citation when using tensorflow-datasets for a paper, in addition to any citation specific to the used datasets.

@misc{TFDS,
  title = {{TensorFlow Datasets}, A collection of ready-to-use datasets},
  howpublished = {\url{https://www.tensorflow.org/datasets}},
}

Disclaimers

This is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use the dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.

If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thanks for your contribution to the ML community!

If you're interested in learning more about responsible AI practices, including fairness, please see Google AI's Responsible AI Practices.

tensorflow/datasets is Apache 2.0 licensed. See the LICENSE file.
