
h2oai / datatable

A Python package for manipulating 2-dimensional tabular data structures


Top Related Projects

  • pandas (43,205 stars): Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

  • Dask (12,378 stars): Parallel computing with task scheduling

  • Modin (9,742 stars): Scale your Pandas workflows by changing a single line of code

  • Polars (29,137 stars): Dataframes powered by a multithreaded, vectorized query engine, written in Rust

  • Vaex (8,249 stars): Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀

  • cudf (8,226 stars): cuDF - GPU DataFrame Library

Quick Overview

Datatable is a Python library for efficient data manipulation and analysis, particularly for large datasets. It offers fast and memory-efficient operations, with a focus on out-of-memory processing and parallel computation. Datatable is designed to be compatible with pandas, numpy, and other popular data science libraries.
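As a brief, hedged sketch of that parallel workflow (the file name and thread count below are illustrative, not taken from the project docs):

import datatable as dt

# Cap the global thread pool; by default datatable uses all available cores
dt.options.nthreads = 4

# fread parses the file in parallel chunks; "data.csv" is a placeholder path
df = dt.fread("data.csv")
print(df.shape)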

Pros

  • High performance for large datasets, with efficient memory usage
  • Supports out-of-memory processing for datasets larger than available RAM
  • Parallel computation capabilities for faster data manipulation
  • Compatible with pandas and numpy, allowing easy integration into existing workflows

Cons

  • Smaller community and ecosystem compared to pandas
  • Limited documentation and learning resources
  • Fewer built-in functions and methods compared to more established libraries
  • May require additional learning for users familiar with pandas or other data manipulation libraries

Code Examples

  1. Creating a Frame and performing basic operations:
import datatable as dt
from datatable import f

# Create a Frame with two columns
df = dt.Frame({"A": [1, 2, 3], "B": ["x", "y", "z"]})

# Display the Frame
print(df)

# Sum the values in column A
result = df[:, dt.sum(f.A)]
print(result)
  2. Reading a large CSV file efficiently:
import datatable as dt

# Read a large CSV file
df = dt.fread("large_file.csv")

# Get basic information about the datatable
print(df.shape)
print(df.names)
  3. Grouping and aggregating data:
import datatable as dt
from datatable import f, by

# Create a sample Frame
df = dt.Frame({"category": ["A", "B", "A", "C", "B"],
               "value": [10, 20, 15, 25, 30]})

# Group by category and calculate the mean of each group
result = df[:, dt.mean(f.value), by("category")]
print(result)

Getting Started

To get started with datatable, follow these steps:

  1. Install datatable using pip:

    pip install datatable
    
  2. Import datatable in your Python script:

    import datatable as dt
    
  3. Create a simple datatable:

    df = dt.Frame({"A": [1, 2, 3], "B": ["x", "y", "z"]})
    print(df)
    

For more advanced usage and detailed documentation, refer to the official datatable documentation at https://datatable.readthedocs.io/.

Competitor Comparisons

pandas (43,205 stars)

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

Pros of pandas

  • Extensive documentation and large community support
  • Wide range of data manipulation and analysis functions
  • Seamless integration with other Python scientific libraries

Cons of pandas

  • Slower performance for large datasets
  • Higher memory usage, especially for operations on large DataFrames
  • Steeper learning curve for beginners due to numerous functions and methods

Code Comparison

pandas:

import pandas as pd

df = pd.read_csv("data.csv")
result = df.groupby("category")["value"].mean()
filtered = df[df["column"] > 10]

datatable:

import datatable as dt
from datatable import f

df = dt.fread("data.csv")
result = df[:, dt.mean(f.value), dt.by(f.category)]
filtered = df[f.column > 10, :]

Both libraries offer similar functionality, but datatable's syntax is more concise, and it is often faster on large datasets. pandas provides a more familiar API for those coming from other data analysis backgrounds, while datatable focuses on performance and memory efficiency.

Dask (12,378 stars)

Parallel computing with task scheduling

Pros of Dask

  • Distributed computing capabilities for large-scale data processing
  • Integrates well with the PyData ecosystem (NumPy, Pandas, Scikit-learn)
  • Flexible task scheduling and parallel execution

Cons of Dask

  • Steeper learning curve for beginners
  • Can be complex to set up and configure for distributed environments
  • May have higher memory overhead for certain operations

Code Comparison

Dask:

import dask.dataframe as dd

df = dd.read_csv('large_file.csv')
result = df.groupby('column').mean().compute()

datatable:

import datatable as dt
from datatable import f, by

df = dt.fread('large_file.csv')
# Mean of the 'value' column within each group
result = df[:, dt.mean(f.value), by('column')]

Both libraries aim to handle large datasets efficiently, but Dask focuses on distributed computing and integration with existing PyData tools, while datatable emphasizes single-node performance and memory efficiency. Dask offers more flexibility for complex workflows, while datatable provides a simpler API for common data manipulation tasks.

Modin (9,742 stars)

Modin: Scale your Pandas workflows by changing a single line of code

Pros of Modin

  • Seamless integration with pandas API, allowing easy adoption for existing pandas users
  • Distributed computing capabilities, enabling processing of larger-than-memory datasets
  • Support for multiple execution engines (Ray, Dask, etc.) for flexibility in different environments

Cons of Modin

  • Performance may not always surpass pandas, especially for smaller datasets
  • Some pandas functions may not be fully implemented or optimized
  • Potential overhead in setup and configuration compared to simpler alternatives

Code Comparison

Modin:

import modin.pandas as pd

df = pd.read_csv("large_file.csv")
result = df.groupby("column").mean()

Datatable:

import datatable as dt
from datatable import f, by

df = dt.fread("large_file.csv")
result = df[:, dt.mean(f.value), by("column")]

Both libraries aim to improve data processing performance, but Modin focuses on maintaining pandas compatibility, while Datatable introduces a new syntax for data manipulation. Modin may be easier for pandas users to adopt, while Datatable offers potentially faster performance for certain operations.

Polars (29,137 stars)

Dataframes powered by a multithreaded, vectorized query engine, written in Rust

Pros of Polars

  • Faster performance for many operations, especially on larger datasets
  • More comprehensive and actively developed ecosystem of data manipulation tools
  • Better support for handling missing data and null values

Cons of Polars

  • Steeper learning curve, especially for users familiar with pandas
  • Less memory-efficient for certain operations, particularly on smaller datasets
  • More complex API with multiple ways to achieve similar results

Code Comparison

Polars:

import polars as pl

df = pl.read_csv("data.csv")
result = df.filter(pl.col("age") > 30).group_by("city").agg(pl.col("salary").mean())

Datatable:

import datatable as dt
from datatable import f, by

df = dt.fread("data.csv")
result = df[f.age > 30, dt.mean(f.salary), by("city")]

Both libraries offer efficient data manipulation capabilities, but Polars generally provides better performance and a more extensive feature set, while Datatable offers a simpler API and better memory efficiency for certain use cases. The choice between them depends on specific project requirements and user preferences.

Vaex (8,249 stars)

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀

Pros of Vaex

  • Handles out-of-memory datasets efficiently
  • Supports lazy evaluation for improved performance
  • Offers advanced visualization capabilities

Cons of Vaex

  • Steeper learning curve for users familiar with pandas
  • Limited support for certain data types compared to datatable
  • May have slower performance for smaller datasets

Code Comparison

Vaex:

import vaex
df = vaex.from_csv('large_file.csv')
result = df[df.age > 30].mean(df.salary)

datatable:

import datatable as dt
from datatable import f

df = dt.fread('large_file.csv')
result = df[f.age > 30, dt.mean(f.salary)]

Both libraries aim to handle large datasets efficiently, but they differ in syntax and approach. Vaex focuses on out-of-memory processing and visualization, while datatable emphasizes in-memory performance and a pandas-like interface. The choice between them depends on specific use cases and dataset characteristics.

cudf (8,226 stars)

cuDF - GPU DataFrame Library

Pros of cudf

  • GPU-accelerated data processing for faster performance on large datasets
  • Seamless integration with other RAPIDS ecosystem libraries
  • Supports CUDA-enabled GPUs for massive parallelism

Cons of cudf

  • Requires NVIDIA GPU hardware, limiting accessibility
  • Steeper learning curve due to GPU programming concepts
  • May have higher memory requirements for large datasets

Code Comparison

datatable:

import datatable as dt
from datatable import f, by

df = dt.fread("data.csv")
result = df[:, dt.sum(f.numeric_column), by("category")]

cudf:

import cudf
df = cudf.read_csv("data.csv")
result = df.groupby("category")["numeric_column"].sum().reset_index()

Both libraries offer similar functionality for data manipulation, but cudf leverages GPU acceleration for improved performance on compatible hardware. datatable provides a more accessible CPU-based solution with a syntax reminiscent of data.table in R. cudf integrates well with other RAPIDS libraries for end-to-end GPU-accelerated data science workflows, while datatable focuses on efficient CPU-based operations and memory management.


README

datatable


This is a Python package for manipulating 2-dimensional tabular data structures (aka data frames). It is close in spirit to pandas or SFrame; however, we put a specific emphasis on speed and big data support. As the name suggests, the package is closely related to R's data.table and attempts to mimic its core algorithms and API.
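To give a flavour of that data.table-style API, here is a minimal sketch of the DT[i, j, by] query form (the frame and column names are invented for illustration):

import datatable as dt
from datatable import f, by

DT = dt.Frame({"species": ["cat", "dog", "cat"], "weight": [4.0, 10.0, 5.0]})

# DT[i, j, by]: i filters rows, j computes expressions, by groups
result = DT[f.weight > 4.0, dt.mean(f.weight), by("species")]
print(result)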

Requirements: Python 3.6+ (64 bit) and pip 20.3+.

Project goals

datatable started in 2017 as a toolkit for performing big data (up to 100GB) operations on a single-node machine, at the maximum speed possible. Such requirements are dictated by modern machine-learning applications, which need to process large volumes of data and generate many features in order to achieve the best model accuracy. The first user of datatable was Driverless.ai.

The set of features that we want to implement with datatable is at least the following:

  • Column-oriented data storage.

  • Native-C implementation for all datatypes, including strings. Packages such as pandas and numpy already do that for numeric columns, but not for strings.

  • Support for date-time and categorical types. The object type is also supported, but promotion into object is discouraged.

  • All types should support null values, with as little overhead as possible.

  • Data should be stored on disk in the same format as in memory. This will allow us to memory-map data on disk and work on out-of-memory datasets transparently.

  • Work with memory-mapped datasets to avoid loading into memory more data than necessary for each particular operation.

  • Fast data reading from CSV and other formats.

  • Multi-threaded data processing: time-consuming operations should attempt to utilize all cores for maximum efficiency.

  • Efficient algorithms for sorting/grouping/joining.

  • Expressive query syntax (similar to data.table).

  • Minimal amount of data copying, copy-on-write semantics for shared data.

  • Use "rowindex" views in filtering/sorting/grouping/joining operators to avoid unnecessary data copying.

  • Interoperability with pandas / numpy / pyarrow / pure python: the users should have the ability to convert to another data-processing framework with ease.
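
A brief, hedged sketch of how the interoperability and on-disk storage goals look in practice (file names are placeholders; assumes pandas is installed for the conversion):

import datatable as dt

df = dt.Frame({"id": [1, 2, 3], "value": [0.5, 1.5, 2.5]})

# Interoperability: convert to and from other frameworks
pdf = df.to_pandas()    # pandas DataFrame
arr = df.to_numpy()     # numpy array
back = dt.Frame(pdf)    # rebuild a Frame from the pandas object

# The native .jay format mirrors the in-memory layout, so fread can
# memory-map it back instead of copying everything into RAM
df.to_jay("data.jay")
df2 = dt.fread("data.jay")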

Installation

On macOS, Linux, and Windows systems, installing datatable is as easy as

pip install datatable

On all other platforms a source distribution will be needed. For more information see Build instructions.

See also