dataset
Easy-to-use data handling for SQL data stores with support for implicit table creation, bulk loading, and transactions.
Top Related Projects
Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
Intake is a lightweight package for finding, investigating, loading and disseminating data.
🦉 ML Experiments and Data Management with Git
Always know what to expect from your data.
Quick Overview
Dataset is a Python library that provides a simple abstraction layer for reading and writing tabular data in SQL databases. Built on top of SQLAlchemy, it offers a JSON-like interface for inserting, updating, and querying rows, creating tables and columns on the fly as data arrives. The library emphasizes simplicity and ease of use for common data handling tasks.
Pros
- Unified API for any SQLAlchemy-supported database (SQLite, PostgreSQL, MySQL, and more)
- Simple and intuitive interface for inserting, updating, and querying rows
- Automatic schema management: tables and columns are created implicitly as data is inserted
- Convenience helpers for upserts, bulk inserts, and transactions
Cons
- Limited support for complex data transformations compared to more comprehensive data processing libraries
- Performance may be slower for very large datasets compared to specialized tools
- Documentation could be more extensive, especially for advanced use cases
- Fewer features compared to more established data manipulation libraries like pandas
Code Examples
- Connecting to a SQLite database and filtering rows:

import dataset

with dataset.connect('sqlite:///mydatabase.db') as db:
    table = db['mytable']
    for row in table.find(country='USA'):
        print(row['name'], row['age'])
- Bulk-inserting multiple rows with insert_many():

import dataset

data = [
    {'name': 'Alice', 'age': 30},
    {'name': 'Bob', 'age': 25},
    {'name': 'Charlie', 'age': 35}
]

with dataset.connect('sqlite:///mydatabase.db') as db:
    table = db['people']
    table.insert_many(data)
- Performing a simple SQL query:
import dataset

db = dataset.connect('sqlite:///mydatabase.db')
result = db.query('SELECT name, age FROM users WHERE age > 30')
for row in result:
    print(row['name'], row['age'])
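- Updating, upserting, and deleting rows (a minimal sketch against the same illustrative database; upsert() inserts the row if no match on the given key columns exists, otherwise it updates it):

import dataset

db = dataset.connect('sqlite:///mydatabase.db')
table = db['users']

# Update rows matched on the 'name' column
table.update(dict(name='John Doe', age=47), ['name'])

# Insert or update, depending on whether a row with this name already exists
table.upsert(dict(name='Jane Doe', age=34), ['name'])

# Delete rows matching a filter
table.delete(name='John Doe')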
Getting Started
To get started with Dataset, first install it using pip:
pip install dataset
Then, you can use it in your Python code:
import dataset

# Connect to a database (creates it if it doesn't exist)
db = dataset.connect('sqlite:///mydatabase.db')

# Create a table and insert some data
table = db['users']
table.insert(dict(name='John Doe', age=46))

# Query the data
for user in table.find(age={'>=': 30}):
    print(user['name'], user['age'])
This example demonstrates connecting to a SQLite database, creating a table, inserting data, and querying it using Dataset's simple API.
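The project description also mentions transaction support. As a minimal sketch (reusing the illustrative 'users' table), using the database object as a context manager wraps the enclosed operations in a single transaction that commits on success and rolls back on error:

import dataset

db = dataset.connect('sqlite:///mydatabase.db')

# Both inserts are committed together; an exception inside the block rolls them back
with db as tx:
    tx['users'].insert(dict(name='Jane Doe', age=34))
    tx['users'].insert(dict(name='Sam Smith', age=29))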
Competitor Comparisons
Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
Pros of pandas
- More comprehensive and feature-rich data manipulation library
- Highly optimized for performance with large datasets
- Extensive documentation and community support
Cons of pandas
- Steeper learning curve for beginners
- Higher memory usage, since data is loaded entirely into memory rather than queried from the database on demand
- More complex setup and dependencies
Code Comparison
pandas:
import pandas as pd
df = pd.read_csv('data.csv')
filtered = df[df['column'] > 5]
result = filtered.groupby('category').mean()
dataset:
import dataset

db = dataset.connect('sqlite:///data.db')
table = db['mytable']
filtered = table.find(column={'>': 5})
result = db.query('SELECT category, AVG(value) AS avg_value FROM mytable GROUP BY category')
Summary
pandas is a powerful and versatile data manipulation library, ideal for complex analysis and large datasets. dataset offers a simpler, database-oriented approach that's easier to learn and use for basic operations. pandas excels in performance and advanced features, while dataset provides a more straightforward interface for working with tabular data, especially when integrating with databases.
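The two libraries can also be combined: dataset returns rows as plain dictionaries, which pandas accepts directly. A hedged sketch (the table and column names are illustrative, not part of either library):

import dataset
import pandas as pd

db = dataset.connect('sqlite:///data.db')

# db.query() yields one dictionary per row
rows = db.query('SELECT category, value FROM mytable')
df = pd.DataFrame(list(rows))
print(df.groupby('category')['value'].mean())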
Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
Pros of Arrow
- High-performance data processing and analytics capabilities
- Supports multiple programming languages and platforms
- Extensive ecosystem with tools for various data tasks
Cons of Arrow
- Steeper learning curve due to its complexity
- May be overkill for simple data manipulation tasks
- Requires more setup and configuration
Code Comparison
Arrow:
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']})
filtered = table.filter(pc.greater(table['col1'], 1))
Dataset:
import dataset

db = dataset.connect('sqlite:///mydata.db')
table = db['mytable']
filtered = table.find(col1={'gt': 1})
Key Differences
- Arrow focuses on high-performance data processing across languages
- Dataset provides a simpler interface for database operations
- Arrow offers more advanced features for large-scale data analysis
- Dataset is more suitable for quick prototyping and small-scale projects
Use Cases
Arrow:
- Big data processing and analytics
- Cross-language data interchange
- High-performance computing applications
Dataset:
- Rapid prototyping of data-driven applications
- Simple database operations and queries
- Small to medium-scale data manipulation tasks
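The two can also interoperate: rows fetched with dataset can be handed to Arrow for columnar processing. A hedged sketch, assuming pyarrow 7.0 or later for Table.from_pylist (the database and table names are illustrative):

import dataset
import pyarrow as pa

db = dataset.connect('sqlite:///mydata.db')

# table.all() yields one dictionary per row; Arrow builds a columnar table from them
rows = list(db['mytable'].all())
arrow_table = pa.Table.from_pylist(rows)
print(arrow_table.schema)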
Intake is a lightweight package for finding, investigating, loading and disseminating data.
Pros of Intake
- More extensive data catalog capabilities, supporting a wider range of data formats and sources
- Better integration with data science ecosystems like Dask and Pandas
- Active development and larger community support
Cons of Intake
- Steeper learning curve due to more complex features
- Heavier dependencies, potentially leading to larger project footprint
- Less focus on simple tabular data manipulation compared to Dataset
Code Comparison
Dataset:
import dataset
db = dataset.connect('sqlite:///mydatabase.db')
table = db['mytable']
table.insert(dict(name='John', age=30))
Intake:
import intake

catalog = intake.open_catalog('mycatalog.yml')
source = catalog.mydatasource
df = source.read()       # load eagerly into a pandas DataFrame
ddf = source.to_dask()   # or load lazily as a Dask DataFrame
Both libraries aim to simplify data access and manipulation, but Intake focuses more on catalog management and integration with data science tools, while Dataset emphasizes simplicity for working with tabular data in databases.
🦉 ML Experiments and Data Management with Git
Pros of DVC
- Focuses on version control for machine learning projects and large datasets
- Integrates well with existing Git workflows
- Supports remote storage options like S3, Google Cloud, and Azure Blob Storage
Cons of DVC
- Steeper learning curve for users not familiar with Git
- Requires more setup and configuration compared to Dataset
- May be overkill for simpler data management tasks
Code Comparison
Dataset:
import dataset
db = dataset.connect('sqlite:///mydatabase.db')
table = db['mytable']
table.insert(dict(name='John Doe', age=37))
DVC:
dvc init
dvc add data/mydata.csv
git add data/mydata.csv.dvc
dvc push
Summary
Dataset is a lightweight Python library for working with databases, while DVC is a more comprehensive version control system for machine learning projects. Dataset is easier to use for simple data management tasks, while DVC offers more advanced features for tracking large datasets and ML experiments. The choice between them depends on the complexity of your project and your specific data management needs.
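The two can also complement each other: dataset can snapshot a database table to a file, which DVC then versions with the commands shown above. A minimal sketch using only dataset and the standard library (the paths and table name are illustrative, and the table is assumed to be non-empty):

import csv
import dataset

db = dataset.connect('sqlite:///mydatabase.db')
rows = list(db['mytable'].all())

# Write the table out as a CSV snapshot for DVC to track (e.g. with 'dvc add')
with open('data/mydata.csv', 'w', newline='') as fh:
    writer = csv.DictWriter(fh, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)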
Always know what to expect from your data.
Pros of Great Expectations
- More comprehensive data validation and testing capabilities
- Supports a wider range of data sources and integrations
- Offers a suite of pre-built expectations for common data quality checks
Cons of Great Expectations
- Steeper learning curve due to more complex architecture
- Requires more setup and configuration for basic use cases
- Heavier resource usage for large-scale data validation tasks
Code Comparison
Dataset:
import dataset
db = dataset.connect('sqlite:///mydatabase.db')
table = db['users']
table.insert(dict(name='John Doe', age=37))
Great Expectations:
import great_expectations as ge

context = ge.get_context()
suite = context.create_expectation_suite("my_suite")
validator = context.get_validator(
    batch_request={"data_asset_name": "users"},
    expectation_suite=suite
)
validator.expect_column_values_to_be_between("age", min_value=0, max_value=120)
Dataset is more straightforward for simple database operations, while Great Expectations provides more robust data validation capabilities but requires more setup.
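For comparison, a single ad-hoc quality check can be written directly against the database with dataset, at the cost of losing Great Expectations' suite management and reporting. A hedged sketch (the table and column names are illustrative):

import dataset

db = dataset.connect('sqlite:///mydatabase.db')

# Count rows violating the expectation 0 <= age <= 120
result = db.query('SELECT COUNT(*) AS bad FROM users WHERE age < 0 OR age > 120')
bad_rows = list(result)[0]['bad']
assert bad_rows == 0, f'{bad_rows} rows have an out-of-range age'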
README
dataset: databases for lazy people
In short, dataset makes reading and writing data in databases as simple as reading and writing JSON files.
To install dataset, fetch it with pip:
$ pip install dataset
Note: as of version 1.0, dataset is split into two packages, with the data export features now extracted into a stand-alone package, datafreeze. See the relevant repository here.