zarr-python
An implementation of chunked, compressed, N-dimensional arrays for Python.
Top Related Projects
- dask: Parallel computing with task scheduling
- h5py: HDF5 for Python, a Pythonic interface to the HDF5 binary data format
- xarray: N-D labeled arrays and datasets in Python
- intake: a lightweight package for finding, investigating, loading and disseminating data
- filesystem_spec: a specification that Python filesystems should adhere to
Quick Overview
Zarr-python is an implementation of the Zarr array storage format for Python. It provides a flexible and efficient way to store and access large, chunked, compressed N-dimensional arrays. Zarr is particularly useful for working with big data in scientific computing and data analysis applications.
Pros
- Efficient storage and access of large N-dimensional arrays
- Supports compression and chunking for optimized performance
- Flexible storage backends (local file systems, cloud storage, etc.; see the sketch after this list)
- Integrates well with other scientific Python libraries (NumPy, Dask, etc.)
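As a brief illustration of that backend flexibility, here is a minimal sketch assuming the zarr v2 API used throughout this page (the path `example_backend.zarr` is arbitrary):

```python
import zarr
import numpy as np

data = np.arange(100, dtype='f8').reshape(10, 10)

# Same array, two backends: an in-memory store...
mem = zarr.create((10, 10), chunks=(5, 5), dtype='f8', store=zarr.MemoryStore())
mem[:] = data

# ...and a directory store on the local filesystem
disk = zarr.open('example_backend.zarr', mode='w', shape=(10, 10), chunks=(5, 5), dtype='f8')
disk[:] = data
```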
Cons
- Learning curve for users new to chunked array storage concepts
- Limited support for some specialized array operations compared to NumPy
- Potential performance overhead for small datasets
- Ecosystem still evolving, with some features in development
Code Examples
Creating and writing to a Zarr array:
```python
import zarr
import numpy as np

# Create a 1000x1000 2D array filled with random data
data = np.random.rand(1000, 1000)

# Create a compressed Zarr array
z = zarr.create((1000, 1000), chunks=(100, 100), dtype='f8',
                compressor=zarr.Blosc(cname='zstd', clevel=3))

# Write data to the array
z[:] = data
```
Reading from a Zarr array:
```python
# Read a subset of the data
subset = z[200:400, 300:500]

# Perform operations on the subset
mean_value = subset.mean()
```
Using Zarr with cloud storage (e.g., AWS S3):
```python
import s3fs
import zarr

# Create an S3 filesystem object
s3 = s3fs.S3FileSystem()

# Open a Zarr group stored on S3
store = s3fs.S3Map(root='mybucket/path/to/data', s3=s3, check=False)
group = zarr.open_group(store)

# Access an array within the group
array = group['my_array']
```
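Note that `s3fs` is a separate package (`pip install s3fs`); by default it picks up AWS credentials from the standard environment variables or configuration files.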
Getting Started
To get started with Zarr-python:
- Install Zarr using pip:

  ```bash
  pip install zarr
  ```

- Import Zarr in your Python script:

  ```python
  import zarr
  ```

- Create a simple Zarr array:

  ```python
  import numpy as np

  # Create a 3D Zarr array
  z = zarr.create((1000, 1000, 1000), chunks=(100, 100, 100), dtype='f4')

  # Fill it with random data
  z[:] = np.random.rand(1000, 1000, 1000)

  # Read a slice of data
  subset = z[500:600, 500:600, 500:600]
  ```
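To keep data between sessions, the array can be backed by a directory on disk rather than memory. A minimal sketch (the path `example.zarr` is arbitrary):

```python
import zarr
import numpy as np

# Persist the array in a directory store on disk
z = zarr.open('example.zarr', mode='w', shape=(1000, 1000), chunks=(100, 100), dtype='f4')
z[:] = np.random.rand(1000, 1000)

# Reopen it later, read-only
z2 = zarr.open('example.zarr', mode='r')
print(z2.shape, z2.chunks)
```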
For more advanced usage and configuration options, refer to the Zarr documentation.
Competitor Comparisons
Parallel computing with task scheduling
Pros of Dask
- Broader scope: Dask is a flexible library for parallel computing in Python, offering a wide range of tools beyond just data storage
- Advanced scheduling: Provides sophisticated task scheduling and distributed computing capabilities
- Integrates well with other Python libraries: Works seamlessly with NumPy, Pandas, and Scikit-learn
Cons of Dask
- Steeper learning curve: More complex to set up and use compared to Zarr's focused approach
- Potential overhead: May introduce unnecessary complexity for simple data storage tasks
Code Comparison
Zarr example:
```python
import zarr

z = zarr.create((10000, 10000), chunks=(1000, 1000), dtype='i4')
z[:] = 42
```
Dask example:
```python
import dask.array as da

x = da.random.random((10000, 10000), chunks=(1000, 1000))
y = x + x.T
z = y.mean(axis=0)
result = z.compute()
```
While Zarr focuses on efficient array storage and access, Dask provides a broader set of tools for parallel and distributed computing, including delayed evaluation and task scheduling. Zarr is more specialized for chunked, compressed, N-dimensional arrays, while Dask offers a more comprehensive solution for large-scale data processing and analysis.
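The two also compose well in practice: Dask can wrap a Zarr array lazily, mapping one task to each chunk. A minimal sketch, assuming a Zarr array already exists at the hypothetical path `data.zarr`:

```python
import dask.array as da

# Lazily wrap an existing Zarr array; each Zarr chunk becomes a Dask task
x = da.from_zarr('data.zarr')

# Chunk-parallel reduction
mean = x.mean().compute()

# Write a derived result back out as a new Zarr array
(x + 1).to_zarr('result.zarr')
```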
HDF5 for Python -- The h5py package is a Pythonic interface to the HDF5 binary data format.
Pros of h5py
- Mature and widely adopted in scientific computing
- Native support for complex data structures and metadata
- Efficient handling of large datasets with chunking and compression
Cons of h5py
- Limited support for cloud storage and distributed computing
- Less flexible for concurrent read/write operations
- Requires HDF5 library installation, which can be challenging on some systems
Code Comparison
h5py:
```python
import h5py
import numpy as np

numpy_array = np.random.rand(100, 100).astype('float32')

with h5py.File('data.h5', 'w') as f:
    dset = f.create_dataset('data', (100, 100), dtype='float32')
    dset[:] = numpy_array
```
zarr:
```python
import zarr

z = zarr.open('data.zarr', mode='w', shape=(100, 100), dtype='float32')
z[:] = numpy_array
```
Both libraries offer similar APIs for creating and accessing datasets, but Zarr provides more flexibility in terms of storage backends and distributed computing support. h5py is more established in the scientific community and offers richer support for complex data structures, while Zarr excels in cloud-native and distributed environments.
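One concrete example of that storage flexibility: Zarr can write an array directly into a zip archive, something h5py has no direct counterpart for. A minimal sketch assuming the zarr v2 `ZipStore` API (the filename is arbitrary):

```python
import zarr
import numpy as np

# Store the array inside a single zip file
store = zarr.ZipStore('data.zip', mode='w')
z = zarr.create((100, 100), chunks=(10, 10), dtype='float32', store=store)
z[:] = np.random.rand(100, 100).astype('float32')
store.close()  # ZipStore must be closed to finalize the archive
```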
N-D labeled arrays and datasets in Python
Pros of xarray
- Higher-level API for working with labeled multi-dimensional arrays
- Built-in support for NetCDF files and other common scientific data formats
- Extensive data analysis and visualization capabilities
Cons of xarray
- Steeper learning curve due to more complex API
- Potentially slower performance for basic array operations
- Larger dependency footprint
Code Comparison
xarray:
```python
import xarray as xr

# Create a labeled 2D array
data = xr.DataArray(
    [[1, 2, 3], [4, 5, 6]],
    dims=("x", "y"),
    coords={"x": [10, 20], "y": [100, 200, 300]},
)
```
zarr-python:
```python
import zarr

# Create a 2D array
z = zarr.create((2, 3), dtype=int)
z[:] = [[1, 2, 3], [4, 5, 6]]
```
xarray provides a more feature-rich interface for working with labeled multi-dimensional data, while zarr-python focuses on efficient storage and access of large arrays. xarray is better suited for complex data analysis tasks, while zarr-python excels in scenarios requiring high-performance I/O operations on large datasets.
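The two libraries are complementary rather than competing: xarray can write labeled datasets to Zarr storage and lazily read them back. A minimal sketch (the path `labeled.zarr` is arbitrary):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"temperature": (("x", "y"), np.random.rand(2, 3))},
    coords={"x": [10, 20], "y": [100, 200, 300]},
)

# Round-trip the labeled dataset through a Zarr store
ds.to_zarr("labeled.zarr", mode="w")
reopened = xr.open_zarr("labeled.zarr")
```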
Intake is a lightweight package for finding, investigating, loading and disseminating data.
Pros of Intake
- Provides a unified interface for accessing various data sources, including Zarr
- Offers data cataloging and metadata management capabilities
- Supports lazy loading and efficient data retrieval
Cons of Intake
- Steeper learning curve due to its broader scope
- May introduce additional overhead for simple data access scenarios
- Less specialized for Zarr-specific optimizations
Code Comparison
Zarr usage:
```python
import zarr

array = zarr.open('data.zarr', mode='r')
subset = array[0:100, 0:100]
```
Intake usage:
```python
import intake

catalog = intake.open_catalog('catalog.yml')
dataset = catalog.my_dataset.read()
subset = dataset.sel(x=slice(0, 100), y=slice(0, 100))
```
Summary
Intake provides a more versatile data access solution, supporting multiple data formats and offering cataloging features. However, it may be more complex for simple use cases. Zarr is more focused on efficient array storage and access, making it potentially simpler and more optimized for specific scenarios involving multidimensional arrays.
A specification that python filesystems should adhere to.
Pros of filesystem_spec
- Broader scope: Supports a wide range of file systems and storage backends
- More flexible: Can be used independently of data formats or processing libraries
- Active development: Frequent updates and improvements
Cons of filesystem_spec
- Steeper learning curve: More complex API due to its broader scope
- Less specialized: May lack some Zarr-specific optimizations
Code Comparison
filesystem_spec:
```python
import fsspec

fs = fsspec.filesystem("s3")
with fs.open("mybucket/myfile.txt", "rb") as f:
    content = f.read()
```
zarr-python:
```python
import zarr
import s3fs

s3 = s3fs.S3FileSystem()
store = s3fs.S3Map(root="mybucket/mydata", s3=s3, check=False)
z = zarr.open(store, mode="r")
```
Summary
filesystem_spec is a more general-purpose library for working with various file systems, while zarr-python is specifically designed for the Zarr array format. filesystem_spec offers greater flexibility and broader support for different storage backends, but may require more setup and configuration. zarr-python provides a more streamlined experience for working with Zarr arrays, but is limited to that specific use case.
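The two also compose: `fsspec.get_mapper` turns any fsspec URL into the key-value mapping Zarr expects, generalizing the s3fs-specific setup shown above. A minimal sketch (the bucket and path are hypothetical):

```python
import fsspec
import zarr

# Any fsspec-supported URL (s3://, gcs://, file://, ...) can back a Zarr store
mapper = fsspec.get_mapper("s3://mybucket/mydata")
z = zarr.open(mapper, mode="r")
subset = z[0:100, 0:100]
```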
README
Zarr
What is it?
Zarr is a Python package providing an implementation of compressed, chunked, N-dimensional arrays, designed for use in parallel computing. See the documentation for more information.
Main Features
- Create N-dimensional arrays with any NumPy `dtype`.
- Chunk arrays along any dimension.
- Compress and/or filter chunks using any NumCodecs codec.
- Store arrays in memory, on disk, inside a zip file, on S3, etc...
- Read an array concurrently from multiple threads or processes.
- Write to an array concurrently from multiple threads or processes.
- Organize arrays into hierarchies via groups (see the sketch below).
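A minimal sketch of the group feature, assuming the zarr v2 API (all names are arbitrary):

```python
import zarr

root = zarr.group()                      # in-memory root group
grp = root.create_group('measurements')  # nested group
arr = grp.zeros('temperature', shape=(100, 100), chunks=(10, 10), dtype='f4')

# Arrays are addressable by path, like files in a directory tree
same = root['measurements/temperature']
```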
Where to get it
Zarr can be installed from PyPI using `pip`:

```bash
pip install zarr
```

or via `conda`:

```bash
conda install -c conda-forge zarr
```
For more details, including how to install from source, see the installation documentation.