zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.

Top Related Projects

  • Dask (12,495 stars): Parallel computing with task scheduling
  • h5py (2,097 stars): HDF5 for Python, a Pythonic interface to the HDF5 binary data format
  • xarray (3,588 stars): N-D labeled arrays and datasets in Python
  • Intake (1,015 stars): A lightweight package for finding, investigating, loading and disseminating data
  • filesystem_spec (fsspec): A specification that Python filesystems should adhere to

Quick Overview

Zarr-python is an implementation of the Zarr array storage format for Python. It provides a flexible and efficient way to store and access large, chunked, compressed N-dimensional arrays. Zarr is particularly useful for working with big data in scientific computing and data analysis applications.

Pros

  • Efficient storage and access of large N-dimensional arrays
  • Supports compression and chunking for optimized performance
  • Flexible storage backends (local file systems, cloud storage, etc.)
  • Integrates well with other scientific Python libraries (NumPy, Dask, etc.); see the sketch after this list

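To make the Dask integration concrete, here is a hedged sketch that wraps a Zarr array with dask.array.from_zarr for lazy, chunk-parallel computation; the shapes and chunk sizes are arbitrary choices, not anything prescribed by zarr-python:

import zarr
import dask.array as da

# Create a chunked Zarr array of ones (chunks are lazily filled)
z = zarr.ones((10000, 10000), chunks=(1000, 1000), dtype='f8')

# Wrap it as a Dask array; each Zarr chunk becomes one Dask task
d = da.from_zarr(z)

# Nothing is computed until .compute() is called
total = d.sum().compute()  # 100000000.0
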
Cons

  • Learning curve for users new to chunked array storage concepts
  • Limited support for some specialized array operations compared to NumPy
  • Potential performance overhead for small datasets
  • Ecosystem still evolving, with some features in development

Code Examples

Creating and writing to a Zarr array:

import zarr
import numpy as np
from numcodecs import Blosc

# Create a 1000x1000 2D array filled with random data
data = np.random.rand(1000, 1000)

# Create a chunked, Zstd-compressed Zarr array
# (codecs live in the numcodecs package)
z = zarr.create((1000, 1000), chunks=(100, 100), dtype='f8',
                compressor=Blosc(cname='zstd', clevel=3))

# Write the data into the array
z[:] = data

Reading from a Zarr array:

# Read a rectangular subset; only the chunks overlapping
# the selection are fetched and decompressed
subset = z[200:400, 300:500]

# Slicing returns a NumPy array, so the usual operations apply
mean_value = subset.mean()

Using Zarr with cloud storage (e.g., AWS S3):

import s3fs
import zarr

# Create an S3 filesystem object
s3 = s3fs.S3FileSystem()

# Open a Zarr group stored on S3
store = s3fs.S3Map(root='mybucket/path/to/data', s3=s3, check=False)
group = zarr.open_group(store)

# Access an array within the group
array = group['my_array']
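
Recent 2.x releases of zarr-python can also resolve fsspec-style URLs directly, which shortens the boilerplate above. A hedged sketch (the bucket path is a placeholder, and credentials are assumed to come from the environment):

import zarr

# The 's3://' URL is resolved through fsspec/s3fs under the hood
group = zarr.open_group('s3://mybucket/path/to/data', mode='r')
array = group['my_array']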

Getting Started

To get started with Zarr-python:

  1. Install Zarr using pip:

    pip install zarr
    
  2. Import Zarr in your Python script:

    import zarr
    
  3. Create a simple Zarr array:

    import zarr
    import numpy as np
    
    # Create a 3D Zarr array
    z = zarr.create((1000, 1000, 1000), chunks=(100, 100, 100), dtype='f4')
    
    # Fill it with random data (note: np.random.rand materializes
    # an ~8 GB float64 array in memory before it is written out)
    z[:] = np.random.rand(1000, 1000, 1000)
    
    # Read a slice of data
    subset = z[500:600, 500:600, 500:600]
    

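The array above lives in memory and vanishes with the process. A minimal sketch of persisting to disk instead, using zarr.open with a directory path (the file name example.zarr is an arbitrary choice):

import zarr
import numpy as np

# Create (or overwrite) a persistent array backed by a directory store
z = zarr.open('example.zarr', mode='w', shape=(1000, 1000),
              chunks=(100, 100), dtype='f4')
z[:] = np.random.rand(1000, 1000)

# Reopening later reads the same data back from disk
z2 = zarr.open('example.zarr', mode='r')
assert z2.shape == (1000, 1000)
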
For more advanced usage and configuration options, refer to the Zarr documentation.

Competitor Comparisons

Dask (12,495 stars): Parallel computing with task scheduling

Pros of Dask

  • Broader scope: Dask is a flexible library for parallel computing in Python, offering a wide range of tools beyond just data storage
  • Advanced scheduling: Provides sophisticated task scheduling and distributed computing capabilities
  • Integrates well with other Python libraries: Works seamlessly with NumPy, Pandas, and Scikit-learn

Cons of Dask

  • Steeper learning curve: More complex to set up and use compared to Zarr's focused approach
  • Potential overhead: May introduce unnecessary complexity for simple data storage tasks

Code Comparison

Zarr example:

import zarr
z = zarr.create((10000, 10000), chunks=(1000, 1000), dtype='i4')
z[:] = 42

Dask example:

import dask.array as da
x = da.random.random((10000, 10000), chunks=(1000, 1000))
y = x + x.T
z = y.mean(axis=0)
result = z.compute()

While Zarr focuses on efficient array storage and access, Dask provides a broader set of tools for parallel and distributed computing, including delayed evaluation and task scheduling. Zarr is more specialized for chunked, compressed, N-dimensional arrays, while Dask offers a more comprehensive solution for large-scale data processing and analysis.
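
The two also compose well: Dask can evaluate a lazy computation and persist the result straight into Zarr storage. A hedged sketch using dask.array.to_zarr (the output path out.zarr is a placeholder):

import dask.array as da

# Build a lazy 10000x10000 computation
x = da.random.random((10000, 10000), chunks=(1000, 1000))
y = (x + x.T).mean(axis=0)

# Evaluate the graph and write the result to a Zarr store on disk
da.to_zarr(y, 'out.zarr')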

h5py (2,097 stars): HDF5 for Python, a Pythonic interface to the HDF5 binary data format

Pros of h5py

  • Mature and widely adopted in scientific computing
  • Native support for complex data structures and metadata
  • Efficient handling of large datasets with chunking and compression

Cons of h5py

  • Limited support for cloud storage and distributed computing
  • Less flexible for concurrent read/write operations
  • Requires HDF5 library installation, which can be challenging on some systems

Code Comparison

h5py:

import h5py
import numpy as np

numpy_array = np.random.rand(100, 100).astype('float32')

with h5py.File('data.h5', 'w') as f:
    dset = f.create_dataset('data', (100, 100), dtype='float32')
    dset[:] = numpy_array

zarr:

import zarr
import numpy as np

numpy_array = np.random.rand(100, 100).astype('float32')

z = zarr.open('data.zarr', mode='w', shape=(100, 100), dtype='float32')
z[:] = numpy_array

Both libraries offer similar APIs for creating and accessing datasets, but Zarr provides more flexibility in terms of storage backends and distributed computing support. h5py is more established in the scientific community and offers richer support for complex data structures, while Zarr excels in cloud-native and distributed environments.
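
To make the concurrency point concrete, here is a hedged sketch of zarr-python 2.x's synchronizer API, which uses file-based locks so that multiple processes can safely write to regions that share chunks (the file names are placeholders):

import zarr

# A lock directory shared by all writer processes
synchronizer = zarr.ProcessSynchronizer('data.sync')

z = zarr.open_array('data.zarr', mode='w', shape=(10000, 10000),
                    chunks=(1000, 1000), dtype='i4',
                    synchronizer=synchronizer)

# Each process writes its own region under lock protection
z[0:2000, :] = 42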

xarray (3,588 stars): N-D labeled arrays and datasets in Python

Pros of xarray

  • Higher-level API for working with labeled multi-dimensional arrays
  • Built-in support for NetCDF files and other common scientific data formats
  • Extensive data analysis and visualization capabilities

Cons of xarray

  • Steeper learning curve due to more complex API
  • Potentially slower performance for basic array operations
  • Larger dependency footprint

Code Comparison

xarray:

import xarray as xr

# Create a labeled 2D array
data = xr.DataArray(
    [[1, 2, 3], [4, 5, 6]],
    dims=("x", "y"),
    coords={"x": [10, 20], "y": [100, 200, 300]}
)

zarr-python:

import zarr

# Create a 2D array
z = zarr.create((2, 3), dtype=int)
z[:] = [[1, 2, 3], [4, 5, 6]]

xarray provides a more feature-rich interface for working with labeled multi-dimensional data, while zarr-python focuses on efficient storage and access of large arrays. xarray is better suited for complex data analysis tasks, while zarr-python excels in scenarios requiring high-performance I/O operations on large datasets.
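
In practice the two are complementary: xarray can serialize labeled datasets to Zarr and lazily read them back. A hedged sketch of the to_zarr/open_zarr round trip (the store path labeled.zarr is a placeholder):

import xarray as xr

data = xr.DataArray(
    [[1, 2, 3], [4, 5, 6]],
    dims=("x", "y"),
    coords={"x": [10, 20], "y": [100, 200, 300]},
    name="example",
)

# Write the labeled array to a Zarr store, then read it back lazily
data.to_dataset().to_zarr("labeled.zarr", mode="w")
ds = xr.open_zarr("labeled.zarr")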

Intake (1,015 stars): A lightweight package for finding, investigating, loading and disseminating data

Pros of Intake

  • Provides a unified interface for accessing various data sources, including Zarr
  • Offers data cataloging and metadata management capabilities
  • Supports lazy loading and efficient data retrieval

Cons of Intake

  • Steeper learning curve due to its broader scope
  • May introduce additional overhead for simple data access scenarios
  • Less specialized for Zarr-specific optimizations

Code Comparison

Zarr usage:

import zarr
array = zarr.open('data.zarr', mode='r')
subset = array[0:100, 0:100]

Intake usage:

import intake

catalog = intake.open_catalog('catalog.yml')

# read() is assumed here to yield an xarray Dataset,
# which is what makes the label-based .sel selection work
dataset = catalog.my_dataset.read()
subset = dataset.sel(x=slice(0, 100), y=slice(0, 100))

Summary

Intake provides a more versatile data access solution, supporting multiple data formats and offering cataloging features. However, it may be more complex for simple use cases. Zarr is more focused on efficient array storage and access, making it potentially simpler and more optimized for specific scenarios involving multidimensional arrays.

filesystem_spec (fsspec): A specification that Python filesystems should adhere to

Pros of filesystem_spec

  • Broader scope: Supports a wide range of file systems and storage backends
  • More flexible: Can be used independently of data formats or processing libraries
  • Active development: Frequent updates and improvements

Cons of filesystem_spec

  • Steeper learning curve: More complex API due to its broader scope
  • Less specialized: May lack some Zarr-specific optimizations

Code Comparison

filesystem_spec:

import fsspec

fs = fsspec.filesystem("s3")
with fs.open("mybucket/myfile.txt", "rb") as f:
    content = f.read()

zarr-python:

import zarr
import s3fs

s3 = s3fs.S3FileSystem()
store = s3fs.S3Map(root="mybucket/mydata", s3=s3, check=False)
z = zarr.open(store, mode="r")

Summary

filesystem_spec is a more general-purpose library for working with various file systems, while zarr-python is specifically designed for the Zarr array format. filesystem_spec offers greater flexibility and broader support for different storage backends, but may require more setup and configuration. zarr-python provides a more streamlined experience for working with Zarr arrays, but is limited to that specific use case.
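
In practice the two are used together rather than in competition: fsspec's get_mapper turns any supported URL into the key-value mapping Zarr expects. A brief sketch (the bucket path is a placeholder, with credentials assumed to come from the environment):

import fsspec
import zarr

# Build a key-value store view of a remote location
mapper = fsspec.get_mapper('s3://mybucket/mydata')

# Zarr reads through the mapping, whatever the backend
z = zarr.open(mapper, mode='r')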

README


Zarr

What is it?

Zarr is a Python package providing an implementation of compressed, chunked, N-dimensional arrays, designed for use in parallel computing. See the documentation for more information.

Main Features

  • Create N-dimensional arrays with any NumPy dtype.
  • Chunk arrays along any dimension.
  • Compress and/or filter chunks using any NumCodecs codec.
  • Store arrays in memory, on disk, inside a zip file, on S3, and more.
  • Read an array concurrently from multiple threads or processes.
  • Write to an array concurrently from multiple threads or processes.
  • Organize arrays into hierarchies via groups (see the sketch after this list).

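A minimal sketch of the group hierarchy API, assuming zarr-python 2.x (the names 'foo' and 'bar' are arbitrary):

import zarr

# Create an in-memory root group
root = zarr.group()

# Groups nest like directories; arrays live at the leaves
foo = root.create_group('foo')
bar = foo.zeros('bar', shape=(10000, 10000), chunks=(1000, 1000), dtype='i4')

# Members are addressable by path from the root
assert root['foo/bar'].shape == (10000, 10000)
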
Where to get it

Zarr can be installed from PyPI using pip:

pip install zarr

or via conda:

conda install -c conda-forge zarr

For more details, including how to install from source, see the installation documentation.