# Top Related Projects

- **Intake**: a lightweight package for finding, investigating, loading and disseminating data
- **Dask**: parallel computing with task scheduling
- **zarr-python**: an implementation of chunked, compressed, N-dimensional arrays for Python
- **xarray**: N-D labeled arrays and datasets in Python
- **pandas**: a flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
- **Apache Arrow**: a multi-language toolbox for accelerated data interchange and in-memory processing
# Quick Overview

Filesystem Spec (fsspec) is a unified pythonic interface for interacting with different file systems and storage backends. It provides a consistent API for working with local files, cloud storage, and other remote file systems, allowing developers to write code that works seamlessly across storage types.
## Pros

- Consistent API across multiple file systems and storage backends
- Supports a wide range of storage types, including local, cloud (S3, GCS, Azure), and remote (SSH, FTP)
- Extensible architecture allowing easy addition of new file systems
- Integrates well with other data processing libraries like pandas and Dask

## Cons

- Learning curve for users unfamiliar with the abstraction layer
- Some file system-specific features may not be available through the generic interface
- Performance overhead in some cases due to the abstraction layer
- Documentation can be sparse for some less common use cases
# Code Examples

Opening and reading a file from different storage backends:

```python
import fsspec

# Local file
with fsspec.open("local_file.txt", "r") as f:
    content = f.read()

# S3 file
with fsspec.open("s3://bucket/file.txt", "r") as f:
    content = f.read()

# Google Cloud Storage file
with fsspec.open("gcs://bucket/file.txt", "r") as f:
    content = f.read()
```
Listing files in a directory:

```python
import fsspec

fs = fsspec.filesystem("file")  # Local file system
files = fs.ls("/path/to/directory")

fs = fsspec.filesystem("s3")  # S3 file system
files = fs.ls("s3://bucket/path")
```
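Beyond `ls`, fsspec filesystems also support `glob` for shell-style pattern matching. A runnable sketch on the local filesystem (the temporary directory and file names are illustrative):

```python
import os
import tempfile

import fsspec

# Stage a few files in a temporary directory to match against
tmpdir = tempfile.mkdtemp()
for name in ["a.csv", "b.csv", "notes.txt"]:
    with open(os.path.join(tmpdir, name), "w") as f:
        f.write("data")

# glob matches shell-style wildcards against paths
fs = fsspec.filesystem("file")
csv_files = fs.glob(os.path.join(tmpdir, "*.csv"))

print(sorted(os.path.basename(p) for p in csv_files))  # ['a.csv', 'b.csv']
```

The same call works against remote filesystems, e.g. `fs.glob("s3://bucket/*.csv")` on an S3 filesystem instance.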
Writing data to a file:

```python
import fsspec

data = "Hello, World!"

# Local file
with fsspec.open("output.txt", "w") as f:
    f.write(data)

# S3 file
with fsspec.open("s3://bucket/output.txt", "w") as f:
    f.write(data)
```
# Getting Started

To get started with fsspec, first install it using pip:

```shell
pip install fsspec
```

For specific file systems, you may need to install additional dependencies:

```shell
pip install s3fs   # for S3 support
pip install gcsfs  # for Google Cloud Storage support
```

Then you can start using fsspec in your Python code:

```python
import fsspec

# Open a file (local or remote)
with fsspec.open("path/to/file.txt", "r") as f:
    content = f.read()

# List files in a directory
fs = fsspec.filesystem("file")  # or "s3", "gcs", etc.
files = fs.ls("/path/to/directory")

# Write data to a file
with fsspec.open("output.txt", "w") as f:
    f.write("Hello, World!")
```
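fsspec can also compress and decompress transparently via the `compression` argument of `fsspec.open`; passing `compression="infer"` picks the codec from the file extension. A self-contained sketch using fsspec's built-in in-memory filesystem (the file name is illustrative):

```python
import fsspec

# Write gzip-compressed text; "infer" selects gzip from the ".gz" suffix
with fsspec.open("memory://example.txt.gz", "wt", compression="infer") as f:
    f.write("Hello, compressed world!")

# Read it back; decompression is transparent
with fsspec.open("memory://example.txt.gz", "rt", compression="infer") as f:
    text = f.read()

print(text)  # Hello, compressed world!
```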
# Competitor Comparisons

## Intake

*Intake is a lightweight package for finding, investigating, loading and disseminating data.*
### Pros of Intake

- Provides a high-level API for data cataloging and discovery
- Offers data source plugins for various formats and storage systems
- Supports metadata management and data versioning

### Cons of Intake

- More complex setup and configuration compared to filesystem_spec
- Potentially steeper learning curve for new users
- May introduce overhead for simple data access tasks
### Code Comparison

Intake:

```python
import intake

catalog = intake.open_catalog("my_catalog.yaml")
dataset = catalog.my_dataset.read()
```

filesystem_spec:

```python
import fsspec

fs = fsspec.filesystem("file")
with fs.open("my_file.csv", "r") as f:
    data = f.read()
```

### Summary
Intake is a comprehensive data cataloging and access library, while filesystem_spec focuses on providing a unified interface for various filesystems. Intake offers more advanced features for data management and discovery, but may be overkill for simple file access tasks. filesystem_spec provides a simpler, more lightweight approach to working with different storage systems, but lacks the higher-level data cataloging capabilities of Intake.
## Dask

*Parallel computing with task scheduling.*

### Pros of Dask

- Provides a powerful distributed computing framework for large-scale data processing
- Offers a familiar API that mimics NumPy and pandas for easy adoption
- Includes advanced features like task scheduling and adaptive scaling

### Cons of Dask

- Steeper learning curve due to its more comprehensive feature set
- Potentially overkill for simpler file system operations
- Requires more system resources for setup and execution
### Code Comparison

Dask:

```python
import dask.dataframe as dd

df = dd.read_csv("s3://mybucket/*.csv")
result = df.groupby("column").mean().compute()
```

filesystem_spec:

```python
import fsspec

fs = fsspec.filesystem("s3")
with fs.open("s3://mybucket/file.csv", "rb") as f:
    content = f.read()
```

### Summary
Dask is a comprehensive distributed computing framework ideal for large-scale data processing, while Filesystem Spec focuses on providing a unified interface for various file systems. Dask offers more advanced features but may be complex for simple file operations, whereas Filesystem Spec is lightweight and specialized for file system interactions across different storage backends.
## zarr-python

*An implementation of chunked, compressed, N-dimensional arrays for Python.*

### Pros of zarr-python

- Specialized for efficient storage and access of large N-dimensional arrays
- Supports chunked, compressed, and parallel I/O operations
- Integrates well with NumPy and other scientific Python libraries

### Cons of zarr-python

- More focused on array data, less versatile for general file system operations
- Steeper learning curve for users not familiar with array-based data structures
- Limited support for non-array data types compared to filesystem_spec
### Code Comparison

zarr-python:

```python
import numpy as np
import zarr

# A 10000x10000 int32 array stored in 1000x1000 chunks
z = zarr.create((10000, 10000), chunks=(1000, 1000), dtype="i4")
z[:] = np.random.randint(0, 1000, size=(10000, 10000))
```

filesystem_spec:

```python
from fsspec import filesystem

fs = filesystem("file")
with fs.open("myfile.txt", "w") as f:
    f.write("Hello, world!")
```
The zarr-python example demonstrates creating and populating a large, chunked array, while the filesystem_spec example shows basic file I/O operations. This highlights the different focus areas of the two libraries: zarr-python for array data and filesystem_spec for general file system interactions.
## xarray

*N-D labeled arrays and datasets in Python.*

### Pros of xarray

- Powerful N-dimensional labeled data structures for scientific computing
- Built-in support for complex operations like grouping, resampling, and rolling window computations
- Seamless integration with other scientific Python libraries like NumPy, pandas, and dask

### Cons of xarray

- Steeper learning curve compared to filesystem_spec's simpler file system abstraction
- More focused on scientific data analysis, less versatile for general file system operations
- Larger library size and potentially higher memory footprint
### Code Comparison

xarray:

```python
import numpy as np
import xarray as xr

# Example data: 12 time steps on a 10x20 lat/lon grid
temp_data = np.random.rand(12, 10, 20)
precip_data = np.random.rand(12, 10, 20)

ds = xr.Dataset({
    "temperature": (["time", "lat", "lon"], temp_data),
    "precipitation": (["time", "lat", "lon"], precip_data),
})
```

filesystem_spec:

```python
from fsspec import filesystem

fs = filesystem("s3")
with fs.open("bucket/file.txt", "rb") as f:
    content = f.read()
```
xarray excels in handling multi-dimensional labeled data, while filesystem_spec provides a unified interface for various file systems. xarray is more suitable for scientific data analysis, whereas filesystem_spec offers a simpler approach for file system operations across different storage backends.
## pandas

*Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more.*

### Pros of pandas

- Comprehensive data manipulation and analysis library
- Extensive documentation and large community support
- Powerful data structures like DataFrame and Series

### Cons of pandas

- Larger memory footprint for big datasets
- Steeper learning curve for beginners
- Limited support for distributed computing
### Code Comparison

pandas:

```python
import pandas as pd

df = pd.read_csv("data.csv")
filtered_df = df[df["column"] > 5]
result = filtered_df.groupby("category").mean()
```

filesystem_spec:

```python
import fsspec

with fsspec.open("s3://bucket/data.csv", "r") as f:
    content = f.read()
    # Process content as needed
```
### Key Differences

- pandas focuses on data analysis and manipulation, while filesystem_spec specializes in file system abstractions
- filesystem_spec provides a unified interface for various storage systems, whereas pandas primarily works with in-memory data structures
- pandas offers more advanced data processing capabilities, while filesystem_spec excels in handling different file systems and cloud storage

### Use Cases

- Use pandas for data analysis, cleaning, and transformation tasks
- Choose filesystem_spec when working with multiple storage backends or cloud services
- Combine both libraries for efficient data loading and processing workflows
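The combined workflow can be sketched as follows, using fsspec's in-memory filesystem as a stand-in for a remote store (the file name and sample data are illustrative):

```python
import fsspec
import pandas as pd

# Stage a small CSV in fsspec's in-memory filesystem (stands in for s3://...)
with fsspec.open("memory://data.csv", "w") as f:
    f.write("category,value\na,1\na,3\nb,2\n")

# fsspec handles the storage layer; pandas handles the analysis
with fsspec.open("memory://data.csv", "r") as f:
    df = pd.read_csv(f)

result = df.groupby("category")["value"].mean()
print(result.to_dict())  # {'a': 2.0, 'b': 2.0}
```

In practice pandas also accepts fsspec-style URLs directly (e.g. `pd.read_csv("s3://bucket/data.csv")`), delegating the I/O to fsspec under the hood.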
## Apache Arrow

*Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing.*

### Pros of Arrow

- Comprehensive data processing framework with support for multiple languages
- High-performance columnar memory format for efficient data processing
- Strong community support and active development from the Apache Foundation

### Cons of Arrow

- Steeper learning curve due to its broader scope and complexity
- May be overkill for simple file system operations
- Requires more setup and configuration for basic use cases
### Code Comparison

filesystem_spec:

```python
import fsspec

fs = fsspec.filesystem("s3")
with fs.open("mybucket/myfile.txt", "rb") as f:
    content = f.read()
```

Arrow:

```python
import pyarrow.fs as fs

s3 = fs.S3FileSystem()
with s3.open_input_file("mybucket/myfile.txt") as f:
    content = f.read()
```
### Summary
filesystem_spec is a lightweight, flexible library focused on providing a unified interface for various file systems. It's simpler to use for basic file operations across different storage backends.
Arrow is a more comprehensive data processing framework that includes file system capabilities. It offers high-performance data structures and operations but may be more complex for simple use cases.
Choose filesystem_spec for straightforward file system operations across multiple backends, and Arrow for more advanced data processing needs or when working with large datasets in a columnar format.
# README
## filesystem_spec

A specification for pythonic filesystems.
## Install

```shell
pip install fsspec
```

installs the base fsspec package. Various optionally supported features require extra dependencies, specified as pip extras; e.g.

```shell
pip install fsspec[ssh]
```

installs the dependencies for SSH backend support. Use

```shell
pip install fsspec[full]
```

to install all known extra dependencies.

An up-to-date package is also provided through the conda-forge distribution:

```shell
conda install -c conda-forge fsspec
```
## Purpose

fsspec provides a template, or specification, for a file-system interface that specific implementations should follow, so that applications making use of them can rely on common behaviour and need not worry about the internal implementation decisions of any given backend. Many such implementations are included in this package, or in sister projects such as s3fs and gcsfs.

In addition, if this is well-designed, then additional functionality, such as a key-value store or FUSE mounting of the file-system implementation, may be available for all implementations "for free".
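The key-value store mentioned above is exposed through `fsspec.get_mapper`, which wraps any filesystem in a dict-like interface (this is how libraries such as zarr consume fsspec-backed storage). A minimal sketch with the in-memory filesystem (the root path and key are illustrative):

```python
import fsspec

# Wrap the in-memory filesystem in a dict-like key-value store
store = fsspec.get_mapper("memory://kv-demo")

# Keys map to paths under the root; values are bytes
store["greeting"] = b"hello"

print(store["greeting"])    # b'hello'
print("greeting" in store)  # True
print(sorted(store))
```

Because the mapper implements the standard `MutableMapping` protocol, any consumer written against a plain dict of bytes can read and write through it, regardless of which backend sits underneath.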
## Documentation

Please refer to the documentation on Read the Docs.
## Develop

fsspec uses GitHub Actions for CI. Environment files can be found in the "ci/" directory. Note that the main environment is called "py38", but it is expected that the version of Python installed be adjustable at CI runtime. For local use, pick a version suitable for you.

```shell
# For a new environment (mamba / conda)
mamba create -n fsspec -c conda-forge python=3.9 -y
conda activate fsspec

# Standard dev install with docs and tests
pip install -e ".[dev,doc,test]"

# Full tests except for downstream
pip install s3fs
pip uninstall s3fs
pip install -e ".[dev,doc,test_full]"
pip install s3fs --no-deps
pytest -v

# Downstream tests
sh install_s3fs.sh

# Windows powershell
install_s3fs.sh
```
## Testing

Tests can be run in the dev environment, if activated, via `pytest fsspec`.

The full fsspec suite requires system-level installations of docker, docker-compose, and fuse. If you are only making changes to one backend implementation, it is generally not necessary to run all tests locally.

Contributors are expected to ensure that any change to fsspec does not cause issues or regressions either for other fsspec-related packages such as gcsfs and s3fs, or for downstream users of fsspec. The "downstream" CI run and its corresponding environment file run a set of tests from the dask test suite, plus very minimal tests against pandas and zarr from the test_downstream.py module in this repo.
## Code Formatting

fsspec uses Black to ensure a consistent code format throughout the project. Run `black fsspec` from the root of the filesystem_spec repository to auto-format your code. Additionally, many editors have plugins that will apply `black` as you edit files. `black` is included in the `tox` environments.

Optionally, you may wish to set up pre-commit hooks to run `black` automatically when you make a git commit. Run `pre-commit install --install-hooks` from the root of the filesystem_spec repository to set up the hooks. `black` will now run before you commit, reformatting any changed files. You can format without committing via `pre-commit run`, or skip these checks with `git commit --no-verify`.