h2oai/datatable

A Python package for manipulating 2-dimensional tabular data structures


Top Related Projects

  • pandas (43,524 stars): Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

  • dask (12,495 stars): Parallel computing with task scheduling

  • modin (9,845 stars): Scale your Pandas workflows by changing a single line of code

  • polars (29,748 stars): Dataframes powered by a multithreaded, vectorized query engine, written in Rust

  • vaex (8,280 stars): Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀

  • cudf (8,348 stars): GPU DataFrame Library

Quick Overview

Datatable is a Python library for efficient data manipulation and analysis, particularly on large datasets. It offers fast, memory-efficient operations, with support for out-of-memory processing and multi-threaded computation, and it converts easily to and from pandas, numpy, and other popular data science structures.

Pros

  • High performance for large datasets, with efficient memory usage
  • Supports out-of-memory processing for datasets larger than available RAM
  • Parallel computation capabilities for faster data manipulation
  • Compatible with pandas and numpy, allowing easy integration into existing workflows (see the short sketch after this list)
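
For example, round-tripping between datatable and pandas or numpy is a one-liner in each direction. A minimal sketch (assuming pandas and numpy are installed; Frame.to_pandas() and Frame.to_numpy() are part of datatable's public API):

import datatable as dt
import pandas as pd

# Build a datatable Frame directly from a pandas DataFrame
pdf = pd.DataFrame({"A": [1, 2, 3], "B": [4.0, 5.0, 6.0]})
DT = dt.Frame(pdf)

# Convert back when a downstream step expects pandas or numpy
back_to_pandas = DT.to_pandas()
as_numpy = DT.to_numpy()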

Cons

  • Smaller community and ecosystem compared to pandas
  • Limited documentation and learning resources
  • Fewer built-in functions and methods compared to more established libraries
  • May require additional learning for users familiar with pandas or other data manipulation libraries

Code Examples

  1. Creating a datatable and performing basic operations:
import datatable as dt
from datatable import f

# Create a datatable
df = dt.Frame({"A": [1, 2, 3], "B": ["x", "y", "z"]})

# Display the datatable
print(df)

# Sum column A using an f-expression
result = df[:, dt.sum(f.A)]
print(result)
  2. Reading a large CSV file efficiently:
import datatable as dt

# Read a large CSV file
df = dt.fread("large_file.csv")

# Get basic information about the datatable
print(df.shape)
print(df.names)
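
fread also accepts options that help with very large files, such as reading only the first rows or a chosen subset of columns. A short sketch (assuming the file has columns named "A" and "B"; max_nrows and the set form of columns are documented fread parameters):

import datatable as dt

# Read at most 1000 rows, and only the two named columns
df = dt.fread("large_file.csv",
              max_nrows=1000,
              columns={"A", "B"})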
  3. Grouping and aggregating data:
import datatable as dt
from datatable import f, by

# Create a sample datatable
df = dt.Frame({"category": ["A", "B", "A", "C", "B"],
               "value": [10, 20, 15, 25, 30]})

# Group by category and calculate the mean of each group
result = df[:, dt.mean(f.value), by("category")]
print(result)
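
The same DT[i, j, by] pattern extends to several aggregates at once: passing a dictionary of expressions in the j slot also names the output columns. A small sketch building on the frame above (the output column names are arbitrary):

import datatable as dt
from datatable import f, by

df = dt.Frame({"category": ["A", "B", "A", "C", "B"],
               "value": [10, 20, 15, 25, 30]})

# Several aggregates per group, with explicit output column names
result = df[:, {"mean": dt.mean(f.value),
                "total": dt.sum(f.value),
                "rows": dt.count()}, by("category")]
print(result)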

Getting Started

To get started with datatable, follow these steps:

  1. Install datatable using pip:

    pip install datatable
    
  2. Import datatable in your Python script:

    import datatable as dt
    
  3. Create a simple datatable:

    df = dt.Frame({"A": [1, 2, 3], "B": ["x", "y", "z"]})
    print(df)
    

For more advanced usage and detailed documentation, refer to the official datatable documentation at https://datatable.readthedocs.io/.

Competitor Comparisons

pandas (43,524 stars)

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

Pros of pandas

  • Extensive documentation and large community support
  • Wide range of data manipulation and analysis functions
  • Seamless integration with other Python scientific libraries

Cons of pandas

  • Slower performance for large datasets
  • Higher memory usage, especially for operations on large DataFrames
  • Steeper learning curve for beginners due to numerous functions and methods

Code Comparison

pandas:

import pandas as pd

df = pd.read_csv("data.csv")
result = df.groupby("category")["value"].mean()
filtered = df[df["column"] > 10]

datatable:

import datatable as dt
from datatable import f

df = dt.fread("data.csv")
result = df[:, dt.mean(f.value), dt.by(f.category)]
filtered = df[f.column > 10, :]

Both libraries offer similar functionality, but datatable's syntax is more concise, and the library is often faster on large datasets. pandas provides a more familiar API for those coming from other data analysis backgrounds, while datatable focuses on performance and memory efficiency.

dask (12,495 stars)

Parallel computing with task scheduling

Pros of Dask

  • Distributed computing capabilities for large-scale data processing
  • Integrates well with the PyData ecosystem (NumPy, Pandas, Scikit-learn)
  • Flexible task scheduling and parallel execution

Cons of Dask

  • Steeper learning curve for beginners
  • Can be complex to set up and configure for distributed environments
  • May have higher memory overhead for certain operations

Code Comparison

Dask:

import dask.dataframe as dd

df = dd.read_csv('large_file.csv')
result = df.groupby('column')['value'].mean().compute()

datatable:

import datatable as dt
from datatable import f, by

df = dt.fread('large_file.csv')
result = df[:, dt.mean(f.value), by('column')]

Both libraries aim to handle large datasets efficiently, but Dask focuses on distributed computing and integration with existing PyData tools, while datatable emphasizes single-node performance and memory efficiency. Dask offers more flexibility for complex workflows, while datatable provides a simpler API for common data manipulation tasks.
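
Where Dask scales out across many workers, datatable scales up on one node: its thread pool is tunable through the documented dt.options.nthreads setting. A brief sketch (the file name is a placeholder):

import datatable as dt

# datatable uses all available cores by default;
# cap the thread pool, e.g. on a shared machine
dt.options.nthreads = 4

df = dt.fread("large_file.csv")  # CSV parsing itself runs in parallel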

modin (9,845 stars)

Modin: Scale your Pandas workflows by changing a single line of code

Pros of Modin

  • Seamless integration with pandas API, allowing easy adoption for existing pandas users
  • Distributed computing capabilities, enabling processing of larger-than-memory datasets
  • Support for multiple execution engines (Ray, Dask, etc.) for flexibility in different environments

Cons of Modin

  • Performance may not always surpass pandas, especially for smaller datasets
  • Some pandas functions may not be fully implemented or optimized
  • Potential overhead in setup and configuration compared to simpler alternatives

Code Comparison

Modin:

import modin.pandas as pd

df = pd.read_csv("large_file.csv")
result = df.groupby("column")["value"].mean()

datatable:

import datatable as dt
from datatable import f, by

df = dt.fread("large_file.csv")
result = df[:, dt.mean(f.value), by("column")]

Both libraries aim to improve data processing performance, but Modin focuses on maintaining pandas compatibility, while datatable introduces a new syntax for data manipulation. Modin may be easier for pandas users to adopt, while datatable offers potentially faster performance for certain operations.

polars (29,748 stars)

Dataframes powered by a multithreaded, vectorized query engine, written in Rust

Pros of Polars

  • Faster performance for many operations, especially on larger datasets
  • More comprehensive and actively developed ecosystem of data manipulation tools
  • Better support for handling missing data and null values

Cons of Polars

  • Steeper learning curve, especially for users familiar with pandas
  • Less memory-efficient for certain operations, particularly on smaller datasets
  • More complex API with multiple ways to achieve similar results

Code Comparison

Polars:

import polars as pl

df = pl.read_csv("data.csv")
result = df.filter(pl.col("age") > 30).group_by("city").agg(pl.col("salary").mean())

datatable:

import datatable as dt
from datatable import f, by

df = dt.fread("data.csv")
result = df[f.age > 30, dt.mean(f.salary), by("city")]

Both libraries offer efficient data manipulation capabilities, but Polars generally provides better performance and a more extensive feature set, while datatable offers a simpler API and better memory efficiency for certain use cases. The choice between them depends on specific project requirements and user preferences.

vaex (8,280 stars)

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀

Pros of Vaex

  • Handles out-of-memory datasets efficiently
  • Supports lazy evaluation for improved performance
  • Offers advanced visualization capabilities

Cons of Vaex

  • Steeper learning curve for users familiar with pandas
  • Limited support for certain data types compared to datatable
  • May have slower performance for smaller datasets

Code Comparison

Vaex:

import vaex

df = vaex.from_csv('large_file.csv')
over_30 = df[df.age > 30]
result = over_30.mean(over_30.salary)

datatable:

import datatable as dt
from datatable import f

df = dt.fread('large_file.csv')
result = df[f.age > 30, dt.mean(f.salary)]

Both libraries aim to handle large datasets efficiently, but they differ in syntax and approach. Vaex focuses on out-of-memory processing and visualization, while datatable emphasizes in-memory performance and a data.table-style interface. The choice between them depends on specific use cases and dataset characteristics.

cudf (8,348 stars)

cuDF - GPU DataFrame Library

Pros of cudf

  • GPU-accelerated data processing for faster performance on large datasets
  • Seamless integration with other RAPIDS ecosystem libraries
  • Supports CUDA-enabled GPUs for massive parallelism

Cons of cudf

  • Requires NVIDIA GPU hardware, limiting accessibility
  • Steeper learning curve due to GPU programming concepts
  • May have higher memory requirements for large datasets

Code Comparison

datatable:

import datatable as dt
from datatable import f, by

df = dt.fread("data.csv")
result = df[:, dt.sum(f.numeric_column), by("category")]

cudf:

import cudf
df = cudf.read_csv("data.csv")
result = df.groupby("category")["numeric_column"].sum().reset_index()

Both libraries offer similar functionality for data manipulation, but cudf leverages GPU acceleration for improved performance on compatible hardware. datatable provides a more accessible CPU-based solution with a syntax reminiscent of data.table in R. cudf integrates well with other RAPIDS libraries for end-to-end GPU-accelerated data science workflows, while datatable focuses on efficient CPU-based operations and memory management.

README

datatable

This is a Python package for manipulating 2-dimensional tabular data structures (aka data frames). It is close in spirit to pandas or SFrame; however, we put specific emphasis on speed and big-data support. As the name suggests, the package is closely related to R's data.table and attempts to mimic its core algorithms and API.

Requirements: Python 3.6+ (64 bit) and pip 20.3+.

Project goals

datatable started in 2017 as a toolkit for performing big data (up to 100GB) operations on a single-node machine, at the maximum speed possible. Such requirements are dictated by modern machine-learning applications, which need to process large volumes of data and generate many features in order to achieve the best model accuracy. The first user of datatable was Driverless.ai.

The set of features that we want to implement with datatable is at least the following:

  • Column-oriented data storage.

  • Native-C implementation for all datatypes, including strings. Packages such as pandas and numpy already do that for numeric columns, but not for strings.

  • Support for date-time and categorical types. Object type is also supported, but promotion into object is discouraged.

  • All types should support null values, with as little overhead as possible.

  • Data should be stored on disk in the same format as in memory. This allows us to memory-map data on disk and work with out-of-memory datasets transparently (a sketch of this follows the list).

  • Work with memory-mapped datasets to avoid loading into memory more data than necessary for each particular operation.

  • Fast data reading from CSV and other formats.

  • Multi-threaded data processing: time-consuming operations should attempt to utilize all cores for maximum efficiency.

  • Efficient algorithms for sorting/grouping/joining.

  • Expressive query syntax (similar to data.table).

  • Minimal amount of data copying, copy-on-write semantics for shared data.

  • Use "rowindex" views in filtering/sorting/grouping/joining operators to avoid unnecessary data copying.

  • Interoperability with pandas / numpy / pyarrow / pure python: users should be able to convert their data to another data-processing framework with ease.
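
The on-disk-equals-in-memory goal is realized by datatable's native Jay format: a Frame saved with to_jay() can be re-opened as a memory-mapped file, so only the pages an operation actually touches get read. A minimal sketch of the round-trip (file and column names are placeholders):

import datatable as dt
from datatable import f

# Writing a Jay file stores the bytes in the same layout as in memory
DT = dt.fread("large_file.csv")
DT.to_jay("large_file.jay")

# Re-opening memory-maps the file rather than copying it into RAM
DT2 = dt.fread("large_file.jay")
subset = DT2[f.C0 > 0, :]  # "C0" stands in for a real column name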

Installation

On macOS, Linux and Windows systems installing datatable is as easy as

pip install datatable

On all other platforms a source distribution will be needed. For more information see Build instructions.
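
A quick way to confirm the installation is to import the package and print its version (the exact string depends on the release you installed):

import datatable as dt
print(dt.__version__)  # prints the installed version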

See also