Top Related Projects
- data: A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
- scikit-learn: machine learning in Python
- pandas: Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
- NumPy: The fundamental package for scientific computing with Python.
Quick Overview
TensorFlow Datasets is an open-source library that provides a collection of ready-to-use datasets for machine learning tasks. It offers a simple and consistent interface to access and prepare datasets for use with TensorFlow, making it easier for researchers and developers to focus on model development rather than data preprocessing.
Pros
- Large variety of datasets available, covering various domains such as image classification, natural language processing, and speech recognition
- Consistent API for loading and preprocessing datasets, simplifying integration with TensorFlow models
- Automatic download and caching of datasets, so repeated runs skip the download
- Support for custom dataset creation and contribution to the library
Cons
- Some datasets may require significant storage space and download time
- Limited customization options for certain preprocessing steps
- Occasional compatibility issues with specific TensorFlow versions
- Some less popular datasets may have limited documentation or support
Code Examples
- Loading and preparing a dataset:
import tensorflow as tf
import tensorflow_datasets as tfds
# Load the MNIST training split as (image, label) pairs
dataset = tfds.load('mnist', split='train', as_supervised=True)
# Scale pixel values to [0, 1], then batch and prefetch for training
dataset = dataset.map(lambda img, label: (tf.cast(img, tf.float32) / 255.0, label))
dataset = dataset.batch(32).prefetch(tf.data.AUTOTUNE)
- Creating a custom dataset:
import tensorflow_datasets as tfds
class MyDataset(tfds.core.GeneratorBasedBuilder):
    VERSION = tfds.core.Version('1.0.0')
    RELEASE_NOTES = {
        '1.0.0': 'Initial release.',
    }
    def _info(self):
        return tfds.core.DatasetInfo(
            builder=self,
            features=tfds.features.FeaturesDict({
                'image': tfds.features.Image(shape=(28, 28, 1)),
                'label': tfds.features.ClassLabel(num_classes=10),
            }),
        )
    def _split_generators(self, dl_manager):
        # Define your data sources and splits here
        ...
    def _generate_examples(self):
        # Generate examples from your data sources
        ...
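The builder skeleton above leaves both generator methods empty. As a rough illustration of the contract they follow (one example generator per split, each yielding `(key, example)` pairs with keys unique within the split), here is a plain-Python stand-in. `ToyBuilder`, its method names, and the sample records are all hypothetical, and the class deliberately does not subclass anything from TFDS.

```python
class ToyBuilder:
    """Plain-Python stand-in for a dataset builder (hypothetical, not a TFDS class)."""

    def __init__(self, records):
        # `records` maps a split name to a list of (pixels, label) pairs.
        self._records = records

    def split_generators(self):
        # One example generator per split, keyed by split name.
        return {name: self.generate_examples(name) for name in self._records}

    def generate_examples(self, split):
        # Yield (key, example) pairs; the key is unique within its split.
        for index, (pixels, label) in enumerate(self._records[split]):
            yield f"{split}_{index}", {"image": pixels, "label": label}


# Made-up records: two training examples and one test example.
builder = ToyBuilder({
    "train": [([0, 255], 3), ([128, 64], 7)],
    "test": [([10, 20], 1)],
})
splits = {name: list(gen) for name, gen in builder.split_generators().items()}
```

The feature names `image` and `label` mirror the keys declared in `_info` above; in a real builder the yielded dictionaries are expected to match those declared features.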
- Using a dataset with a TensorFlow model:
import tensorflow as tf
import tensorflow_datasets as tfds
# Load and prepare the dataset
dataset = tfds.load('cifar10', split='train', as_supervised=True)
dataset = dataset.map(lambda img, label: (tf.cast(img, tf.float32) / 255.0, label))
dataset = dataset.batch(32).prefetch(tf.data.AUTOTUNE)
# Create a simple model
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(32, 32, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])
# Compile and train the model
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)
model.fit(dataset, epochs=5)
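The `map` and `batch` steps used in the examples above can be easier to follow without TensorFlow in the picture. The sketch below reproduces them in plain Python on made-up data: `normalize` and `batch` are illustrative helpers, not part of any library, and the final partial batch is kept, which matches the default behaviour of `tf.data.Dataset.batch`.

```python
from itertools import islice


def normalize(example):
    """Scale 8-bit pixel values into [0, 1], leaving the label untouched."""
    pixels, label = example
    return [p / 255.0 for p in pixels], label


def batch(iterable, batch_size):
    """Group consecutive items into lists of at most `batch_size` items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


# Five made-up two-pixel "images" with labels 0..4.
examples = [([0, 255], label) for label in range(5)]
batches = list(batch(map(normalize, examples), batch_size=2))
batch_sizes = [len(b) for b in batches]
print(batch_sizes)  # [2, 2, 1]: the last batch is smaller
```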
Getting Started
To get started with TensorFlow Datasets, follow these steps:
- Install the library:
pip install tensorflow-datasets
- Import the library and load a dataset:
import tensorflow_datasets as tfds
# Load the MNIST dataset
dataset = tfds.load('mnist', split='train', as_supervised=True)
# Iterate through the dataset
for image, label in dataset.take(1):
    print(f"Image shape: {image.shape}")
    print(f"Label: {label.numpy()}")
This will download the MNIST dataset (if not already cached) and print information about the first example in the dataset.
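The "download if not already cached" behaviour can be pictured with a small offline sketch. This is not how TFDS implements its cache; `cached_fetch`, `fake_fetch`, and the example URL are hypothetical names chosen for illustration, and the "download" is a stub so the code runs without network access.

```python
import hashlib
import tempfile
from pathlib import Path


def cached_fetch(url, cache_dir, fetch):
    """Return a local path for `url`, calling `fetch(url)` only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Name the cached file after a hash of the URL so any URL maps to a safe file name.
    path = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if not path.exists():
        path.write_bytes(fetch(url))
    return path


# Stand-in for a real download; it records every call it receives.
calls = []


def fake_fetch(url):
    calls.append(url)
    return b"pretend these are dataset bytes"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("https://example.com/mnist.gz", tmp, fake_fetch)
    second = cached_fetch("https://example.com/mnist.gz", tmp, fake_fetch)
    same_path = first == second
    cached_bytes = second.read_bytes()

print(len(calls))  # 1: the second request was served from the cache
```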
Competitor Comparisons
A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
Pros of data
- More flexible and customizable dataset creation
- Easier integration with PyTorch ecosystem
- Simpler API for data loading and preprocessing
Cons of data
- Smaller collection of pre-built datasets compared to TensorFlow Datasets
- Less standardized dataset formats and interfaces
- Fewer built-in features for distributed data loading
Code Comparison
TensorFlow Datasets:
import tensorflow_datasets as tfds
dataset = tfds.load('mnist', split='train', as_supervised=True)
dataset = dataset.shuffle(1000).batch(32)
for images, labels in dataset:
    pass  # Train model
data:
import torch
from torchvision import datasets, transforms
# download=True fetches MNIST into ./data if it is not already there
dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transforms.ToTensor())
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
for images, labels in dataloader:
    pass  # Train model
Both repositories provide tools for loading and preprocessing datasets, but data offers more flexibility in dataset creation and easier integration with PyTorch, while TensorFlow Datasets has a larger collection of pre-built datasets and more standardized formats. The code examples load MNIST in both frameworks; the PyTorch snippet goes through torchvision and torch.utils.data.DataLoader rather than the data package itself. In this comparison, data is positioned as the simpler API and TensorFlow Datasets as the more feature-rich but more complex one.
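The `shuffle(1000)` call in the TensorFlow snippet takes a buffer size rather than shuffling the whole dataset at once. A minimal plain-Python sketch of that buffer-based idea follows; `buffered_shuffle` is an illustrative helper written for this page, not a library function, and it only approximates what `tf.data` does internally.

```python
import random


def buffered_shuffle(iterable, buffer_size, seed=None):
    """Yield items in a locally randomized order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            # Still filling the buffer: nothing is emitted yet.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(10), buffer_size=3, seed=0))
unshuffled = list(buffered_shuffle(range(10), buffer_size=1))
```

With a buffer of size 1 the order is unchanged, and with a small buffer an item can only move a limited distance toward the front, which is why a buffer at least as large as the dataset is needed for a uniform shuffle.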
scikit-learn: machine learning in Python
Pros of scikit-learn
- Comprehensive collection of machine learning algorithms and tools
- Easy-to-use API with consistent interface across different models
- Extensive documentation and community support
Cons of scikit-learn
- Limited support for deep learning and neural networks
- Not optimized for large-scale distributed computing
- Slower performance compared to specialized libraries for specific tasks
Code Comparison
scikit-learn:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=4)
clf = RandomForestClassifier()
clf.fit(X, y)
TensorFlow Datasets:
import tensorflow as tf
import tensorflow_datasets as tfds
dataset = tfds.load('mnist', split='train', shuffle_files=True)
dataset = dataset.shuffle(1000).batch(32).prefetch(tf.data.AUTOTUNE)
for example in dataset:
    image, label = example['image'], example['label']
Key Differences
- scikit-learn focuses on traditional machine learning algorithms, while TensorFlow Datasets is primarily used for loading and preprocessing data for deep learning tasks
- scikit-learn provides a unified API for various ML tasks, whereas TensorFlow Datasets is specifically designed for efficient data loading and preprocessing in TensorFlow workflows
- scikit-learn includes built-in datasets for quick experimentation, while TensorFlow Datasets offers a wide range of pre-built datasets from various domains
Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
Pros of pandas
- More versatile for general data manipulation and analysis tasks
- Extensive documentation and large community support
- Seamless integration with other data science libraries in Python
Cons of pandas
- Less optimized for machine learning workflows compared to TensorFlow Datasets
- May require more memory for large datasets
- Slower performance for certain operations on very large datasets
Code Comparison
pandas:
import pandas as pd
df = pd.read_csv('data.csv')
filtered_df = df[df['column'] > 5]
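# numeric_only=True averages only numeric columns; recent pandas versions raise an error on non-numeric ones
result = filtered_df.groupby('category').mean(numeric_only=True)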
TensorFlow Datasets:
import tensorflow as tf
import tensorflow_datasets as tfds
dataset = tfds.load('mnist', split='train')
dataset = dataset.filter(lambda x: x['label'] < 5)
dataset = dataset.batch(32).prefetch(tf.data.AUTOTUNE)
The pandas code demonstrates basic data loading, filtering, and aggregation, while the TensorFlow Datasets code shows dataset loading, filtering, and preparation for machine learning tasks. pandas is more flexible for general data manipulation, while TensorFlow Datasets is optimized for machine learning workflows, especially with TensorFlow.
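One practical difference between the two snippets is that a `tf.data` pipeline is lazy: the filter runs only as elements are requested, whereas the pandas boolean mask is evaluated over the whole frame at once. The sketch below imitates the lazy style with a plain Python generator; the labels are made up and `labelled_stream` is a hypothetical helper.

```python
from itertools import islice

examined = []


def labelled_stream():
    """Yield made-up examples one at a time, recording each one that is produced."""
    for label in [7, 2, 9, 4, 1, 8, 3]:
        examined.append(label)
        yield {"label": label}


# Lazy filter: nothing runs until the result is iterated.
small_labels = (ex for ex in labelled_stream() if ex["label"] < 5)
first_two = list(islice(small_labels, 2))

print(examined)  # [7, 2, 9, 4]: the stream stopped once two matches were found
```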
The fundamental package for scientific computing with Python.
Pros of NumPy
- More general-purpose, suitable for a wide range of numerical computing tasks
- Smaller library size, faster import times
- Extensive documentation and large community support
Cons of NumPy
- Lacks built-in dataset management features
- Not optimized for machine learning workflows
- Limited support for GPU acceleration
Code Comparison
NumPy:
import numpy as np
# Create and manipulate arrays
arr = np.array([1, 2, 3, 4, 5])
result = arr * 2
TensorFlow Datasets:
import tensorflow_datasets as tfds
# Load a dataset
dataset = tfds.load('mnist', split='train')
for example in dataset.take(1):
    image, label = example['image'], example['label']
NumPy focuses on array operations and mathematical functions, while TensorFlow Datasets specializes in loading and preprocessing datasets for machine learning. NumPy is more versatile for general numerical computing, but TensorFlow Datasets offers streamlined dataset management for ML workflows. The choice between them depends on the specific requirements of your project.
README
TensorFlow Datasets
TensorFlow Datasets provides many public datasets as tf.data.Datasets.
Documentation
To install and use TFDS, we strongly encourage you to start with our getting started guide. Try it interactively in a Colab notebook.
Our documentation contains:
- Tutorials and guides
- List of all available datasets
- The API reference
# !pip install tensorflow-datasets
import tensorflow_datasets as tfds
import tensorflow as tf
# Construct a tf.data.Dataset
ds = tfds.load('mnist', split='train', as_supervised=True, shuffle_files=True)
# Build your input pipeline
ds = ds.shuffle(1000).batch(128).prefetch(10).take(5)
for image, label in ds:
    pass
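The `prefetch(10)` step in the pipeline above lets data preparation overlap with consumption by keeping a small buffer of ready elements. As a rough sketch of that idea only (not of TensorFlow's implementation), the helper below fills a bounded queue from a background thread; `prefetch` and `_SENTINEL` are names invented for this example, and error handling in the producer is left out for brevity.

```python
import queue
import threading

_SENTINEL = object()  # marks the end of the input


def prefetch(iterable, buffer_size):
    """Yield items from `iterable` while a background thread stays up to `buffer_size` items ahead."""
    ready = queue.Queue(maxsize=buffer_size)

    def producer():
        for item in iterable:
            ready.put(item)  # blocks while the buffer is full
        ready.put(_SENTINEL)

    # Daemon thread so an abandoned pipeline does not keep the program alive.
    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = ready.get()
        if item is _SENTINEL:
            return
        yield item


result = list(prefetch(range(5), buffer_size=2))
```

Order is preserved because a single producer feeds a FIFO queue; the buffer size only bounds how far ahead the producer may run.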
TFDS core values
TFDS has been built with these principles in mind:
- Simplicity: Standard use-cases should work out of the box
- Performance: TFDS follows best practices and can achieve state-of-the-art speed
- Determinism/reproducibility: All users get the same examples in the same order
- Customisability: Advanced users can have fine-grained control
If those use cases are not satisfied, please send us feedback.
Want a certain dataset?
Adding a dataset is really straightforward by following our guide.
Request a dataset by opening a Dataset request GitHub issue.
And vote on the current set of requests by adding a thumbs-up reaction to the issue.
Citation
Please include the following citation when using tensorflow-datasets for a paper, in addition to any citation specific to the used datasets.
@misc{TFDS,
  title = {{TensorFlow Datasets}, A collection of ready-to-use datasets},
  howpublished = {\url{https://www.tensorflow.org/datasets}},
}
Disclaimers
This is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use the dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.
If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thanks for your contribution to the ML community!
If you're interested in learning more about responsible AI practices, including fairness, please see Google AI's Responsible AI Practices.
tensorflow/datasets is Apache 2.0 licensed. See the LICENSE file.