tensorflow/datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...

Top Related Projects

  • pytorch/data: A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
  • scikit-learn: machine learning in Python
  • pandas: Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
  • numpy: The fundamental package for scientific computing with Python.

Quick Overview

TensorFlow Datasets is an open-source library that provides a collection of ready-to-use datasets for machine learning tasks. It offers a simple and consistent interface to access and prepare datasets for use with TensorFlow, making it easier for researchers and developers to focus on model development rather than data preprocessing.

Pros

  • Large variety of datasets available, covering various domains such as image classification, natural language processing, and speech recognition
  • Consistent API for loading and preprocessing datasets, simplifying integration with TensorFlow models
  • Automatic download and caching of datasets, saving time and storage space (see the sketch after this list)
  • Support for custom dataset creation and contribution to the library
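
As an illustration of the download-and-cache behaviour, tfds.load accepts a data_dir argument controlling where datasets are stored, and download=False reuses prepared data without touching the network. A minimal sketch (the cache path below is hypothetical):

import tensorflow_datasets as tfds

# First call downloads and prepares the data under data_dir
# (the path here is hypothetical)
ds = tfds.load('mnist', split='train', data_dir='/tmp/tfds')

# Later calls reuse the cached copy; download=False fails fast
# if the prepared data is missing
ds = tfds.load('mnist', split='train', data_dir='/tmp/tfds', download=False)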

Cons

  • Some datasets may require significant storage space and download time
  • Limited customization options for certain preprocessing steps
  • Occasional compatibility issues with specific TensorFlow versions
  • Some less popular datasets may have limited documentation or support

Code Examples

  1. Loading and preparing a dataset:
import tensorflow as tf
import tensorflow_datasets as tfds

# Load the MNIST dataset
dataset = tfds.load('mnist', split='train', as_supervised=True)

# Prepare the dataset for training
dataset = dataset.map(lambda img, label: (tf.cast(img, tf.float32) / 255.0, label))
dataset = dataset.batch(32).prefetch(tf.data.AUTOTUNE)
  2. Creating a custom dataset:
import tensorflow_datasets as tfds

class MyDataset(tfds.core.GeneratorBasedBuilder):
    VERSION = tfds.core.Version('1.0.0')
    RELEASE_NOTES = {
        '1.0.0': 'Initial release.',
    }

    def _info(self):
        return tfds.core.DatasetInfo(
            builder=self,
            features=tfds.features.FeaturesDict({
                'image': tfds.features.Image(shape=(28, 28, 1)),
                'label': tfds.features.ClassLabel(num_classes=10),
            }),
        )

    def _split_generators(self, dl_manager):
        # Define your data sources and splits here
        ...

    def _generate_examples(self):
        # Generate examples from your data sources
        ...
  3. Using a dataset with a TensorFlow model:
import tensorflow as tf
import tensorflow_datasets as tfds

# Load and prepare the dataset
dataset = tfds.load('cifar10', split='train', as_supervised=True)
dataset = dataset.map(lambda img, label: (tf.cast(img, tf.float32) / 255.0, label))
dataset = dataset.batch(32).prefetch(tf.data.AUTOTUNE)

# Create a simple model
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(32, 32, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10)
])

# Compile and train the model
model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=['accuracy'])
model.fit(dataset, epochs=5)
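
Evaluation follows the same pattern; continuing the snippet above, the held-out split is loaded and normalized the same way:

# Load and normalize the test split, then evaluate the trained model
test_ds = tfds.load('cifar10', split='test', as_supervised=True)
test_ds = test_ds.map(lambda img, label: (tf.cast(img, tf.float32) / 255.0, label)).batch(32)
loss, accuracy = model.evaluate(test_ds)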

Getting Started

To get started with TensorFlow Datasets, follow these steps:

  1. Install the library:
pip install tensorflow-datasets
  2. Import the library and load a dataset:
import tensorflow_datasets as tfds

# Load the MNIST dataset
dataset = tfds.load('mnist', split='train', as_supervised=True)

# Iterate through the dataset
for image, label in dataset.take(1):
    print(f"Image shape: {image.shape}")
    print(f"Label: {label.numpy()}")

This will download the MNIST dataset (if not already cached) and print information about the first example in the dataset.
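
It can also help to inspect a dataset's metadata before training; passing with_info=True makes tfds.load additionally return a DatasetInfo object:

import tensorflow_datasets as tfds

ds, info = tfds.load('mnist', split='train', with_info=True)
print(info.features)                      # feature spec: image shape, label classes
print(info.splits['train'].num_examples)  # 60000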

Competitor Comparisons

pytorch/data

A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.

Pros of data

  • More flexible and customizable dataset creation
  • Easier integration with PyTorch ecosystem
  • Simpler API for data loading and preprocessing

Cons of data

  • Smaller collection of pre-built datasets compared to datasets
  • Less standardized dataset formats and interfaces
  • Fewer built-in features for distributed data loading

Code Comparison

datasets:

import tensorflow_datasets as tfds

dataset = tfds.load('mnist', split='train', as_supervised=True)
dataset = dataset.shuffle(1000).batch(32)

for images, labels in dataset:
    pass  # train model here

data:

import torch
from torchvision import datasets, transforms

dataset = datasets.MNIST(root='./data', train=True, transform=transforms.ToTensor())
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

for images, labels in dataloader:
    pass  # train model here

Both repositories provide tools for loading and preprocessing datasets, but data offers more flexibility in dataset creation and easier integration with PyTorch, while datasets ships a larger collection of pre-built datasets in more standardized formats. The code examples load MNIST in both frameworks (the PyTorch side is shown via the common torchvision/DataLoader idiom), highlighting the simpler API of data compared to the more feature-rich but complex API of datasets.

scikit-learn: machine learning in Python

Pros of scikit-learn

  • Comprehensive collection of machine learning algorithms and tools
  • Easy-to-use API with consistent interface across different models
  • Extensive documentation and community support

Cons of scikit-learn

  • Limited support for deep learning and neural networks
  • Not optimized for large-scale distributed computing
  • Slower performance compared to specialized libraries for specific tasks

Code Comparison

scikit-learn:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4)
clf = RandomForestClassifier()
clf.fit(X, y)

TensorFlow Datasets:

import tensorflow as tf
import tensorflow_datasets as tfds

dataset = tfds.load('mnist', split='train', shuffle_files=True)
dataset = dataset.shuffle(1000).batch(32).prefetch(tf.data.AUTOTUNE)
for example in dataset:
    image, label = example['image'], example['label']

Key Differences

  • scikit-learn focuses on traditional machine learning algorithms, while TensorFlow Datasets is primarily used for loading and preprocessing data for deep learning tasks
  • scikit-learn provides a unified API for various ML tasks, whereas TensorFlow Datasets is specifically designed for efficient data loading and preprocessing in TensorFlow workflows
  • scikit-learn includes built-in datasets for quick experimentation, while TensorFlow Datasets offers a wide range of pre-built datasets from various domains (the two can also be combined, as sketched below)
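
The two libraries can be combined: tfds.as_numpy converts a TFDS dataset into NumPy arrays that scikit-learn estimators accept. A minimal sketch (the 1,000-example subset is arbitrary):

import tensorflow_datasets as tfds
from sklearn.linear_model import LogisticRegression

# batch_size=-1 returns the whole split as one batch of NumPy arrays
images, labels = tfds.as_numpy(
    tfds.load('mnist', split='train[:1000]', batch_size=-1, as_supervised=True))

# Flatten and scale the images for a classical estimator
X = images.reshape(len(images), -1) / 255.0
clf = LogisticRegression(max_iter=200).fit(X, labels)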

pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

Pros of pandas

  • More versatile for general data manipulation and analysis tasks
  • Extensive documentation and large community support
  • Seamless integration with other data science libraries in Python

Cons of pandas

  • Less optimized for machine learning workflows compared to TensorFlow Datasets
  • May require more memory for large datasets
  • Slower performance for certain operations on very large datasets

Code Comparison

pandas:

import pandas as pd

df = pd.read_csv('data.csv')
filtered_df = df[df['column'] > 5]
result = filtered_df.groupby('category').mean()

TensorFlow Datasets:

import tensorflow as tf
import tensorflow_datasets as tfds

dataset = tfds.load('mnist', split='train')
dataset = dataset.filter(lambda x: x['label'] < 5)
dataset = dataset.batch(32).prefetch(tf.data.AUTOTUNE)

The pandas code demonstrates basic data loading, filtering, and aggregation, while the TensorFlow Datasets code shows dataset loading, filtering, and preparation for machine learning tasks. pandas is more flexible for general data manipulation, while TensorFlow Datasets is optimized for machine learning workflows, especially with TensorFlow.
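
The two also interoperate: tfds.as_dataframe renders examples from a TFDS dataset as a pandas DataFrame, which is handy for quick inspection. A small sketch:

import tensorflow_datasets as tfds

ds, info = tfds.load('mnist', split='train', with_info=True)
df = tfds.as_dataframe(ds.take(5), info)  # five-row pandas DataFrame
print(df.head())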

NumPy

The fundamental package for scientific computing with Python.

Pros of NumPy

  • More general-purpose, suitable for a wide range of numerical computing tasks
  • Smaller library size, faster import times
  • Extensive documentation and large community support

Cons of NumPy

  • Lacks built-in dataset management features
  • Not optimized for machine learning workflows
  • Limited support for GPU acceleration

Code Comparison

NumPy:

import numpy as np

# Create and manipulate arrays
arr = np.array([1, 2, 3, 4, 5])
result = arr * 2

TensorFlow Datasets:

import tensorflow_datasets as tfds

# Load a dataset
dataset = tfds.load('mnist', split='train')
for example in dataset.take(1):
    image, label = example['image'], example['label']

NumPy focuses on array operations and mathematical functions, while TensorFlow Datasets specializes in loading and preprocessing datasets for machine learning. NumPy is more versatile for general numerical computing, but TensorFlow Datasets offers streamlined dataset management for ML workflows. The choice between them depends on the specific requirements of your project.
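
For moving between the two, tfds.as_numpy iterates a TFDS dataset as plain NumPy arrays, so examples drop straight into NumPy code. A brief sketch:

import tensorflow_datasets as tfds

ds = tfds.load('mnist', split='train')
for example in tfds.as_numpy(ds.take(1)):
    image = example['image']          # numpy.ndarray of shape (28, 28, 1)
    print(image.dtype, image.mean())  # ordinary NumPy operations apply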


README

TensorFlow Datasets

TensorFlow Datasets provides many public datasets as tf.data.Datasets.

Documentation

To install and use TFDS, we strongly encourage you to start with our getting started guide. Try it interactively in a Colab notebook.

Our documentation contains tutorials and guides, the list of all available datasets, and the API reference. A minimal example:

# !pip install tensorflow-datasets
import tensorflow_datasets as tfds
import tensorflow as tf

# Construct a tf.data.Dataset
ds = tfds.load('mnist', split='train', as_supervised=True, shuffle_files=True)

# Build your input pipeline
ds = ds.shuffle(1000).batch(128).prefetch(10).take(5)
for image, label in ds:
  pass

TFDS core values

TFDS has been built with these principles in mind:

  • Simplicity: Standard use-cases should work out-of-the box
  • Performance: TFDS follows best practices and can achieve state-of-the-art speed
  • Determinism/reproducibility: All users get the same examples in the same order
  • Customisability: Advanced users can have fine-grained control (see the sketch below)

If those use cases are not satisfied, please send us feedback.
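
As one example of that fine-grained control, the lower-level tfds.builder API separates download, preparation, and dataset construction into explicit steps. A short sketch:

import tensorflow_datasets as tfds

builder = tfds.builder('mnist')
builder.download_and_prepare()          # explicit, one-time download/prepare step
ds = builder.as_dataset(split='train')  # build the tf.data.Dataset from disk
print(builder.info.features)            # inspect features, splits, citation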

Want a certain dataset?

Adding a dataset is really straightforward by following our guide.

Request a dataset by opening a Dataset request GitHub issue.

And vote on the current set of requests by adding a thumbs-up reaction to the issue.

Citation

Please include the following citation when using tensorflow-datasets for a paper, in addition to any citation specific to the datasets used.

@misc{TFDS,
  title = {{TensorFlow Datasets}, A collection of ready-to-use datasets},
  howpublished = {\url{https://www.tensorflow.org/datasets}},
}

Disclaimers

This is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use the dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.

If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thanks for your contribution to the ML community!

If you're interested in learning more about responsible AI practices, including fairness, please see Google AI's Responsible AI Practices.

tensorflow/datasets is Apache 2.0 licensed. See the LICENSE file.