
filesystem_spec

A specification that Python filesystems should adhere to.


Top Related Projects

  • Intake: a lightweight package for finding, investigating, loading and disseminating data.
  • Dask: parallel computing with task scheduling.
  • zarr-python: an implementation of chunked, compressed, N-dimensional arrays for Python.
  • xarray: N-D labeled arrays and datasets in Python.
  • pandas: flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more.
  • Apache Arrow: a multi-language toolbox for accelerated data interchange and in-memory processing.

Quick Overview

Filesystem Spec (fsspec) is a unified pythonic interface for interacting with different file systems and storage backends. It provides a consistent API for working with local files, cloud storage, and other remote file systems, allowing developers to write code that can seamlessly work across various storage types.

Pros

  • Consistent API across multiple file systems and storage backends
  • Supports a wide range of storage types, including local, cloud (S3, GCS, Azure), and remote (SSH, FTP)
  • Extensible architecture allowing easy addition of new file systems
  • Integrates well with other data processing libraries like Pandas and Dask
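The "extensible architecture" point can be illustrated with a toy backend: subclass AbstractFileSystem, implement a few methods, and register it under a new protocol name. Everything below (the "dictfs" protocol and the DictFileSystem class) is a hypothetical sketch for illustration, not part of fsspec itself:

```python
import fsspec
from fsspec.spec import AbstractFileSystem


class DictFileSystem(AbstractFileSystem):
    """Toy read-only backend over an in-process dict (hypothetical example)."""

    protocol = "dictfs"

    def __init__(self, data=None, **kwargs):
        super().__init__(**kwargs)
        self.data = dict(data or {})

    def ls(self, path, detail=False, **kwargs):
        # Flat namespace: every key in the dict is a file at the root.
        entries = [
            {"name": name, "size": len(blob), "type": "file"}
            for name, blob in self.data.items()
        ]
        return entries if detail else [e["name"] for e in entries]

    def cat_file(self, path, start=None, end=None, **kwargs):
        return self.data[path.lstrip("/")][start:end]


# Register under the made-up "dictfs" protocol, then use it like any backend.
fsspec.register_implementation("dictfs", DictFileSystem, clobber=True)
fs = fsspec.filesystem("dictfs", data={"hello.txt": b"hi there"})
print(fs.ls("/"))                 # ['hello.txt']
print(fs.cat_file("hello.txt"))   # b'hi there'
```

A real backend would implement more of the AbstractFileSystem interface (e.g. _open, info), but registration alone is enough for the protocol to resolve through fsspec.filesystem and fsspec.open.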

Cons

  • Learning curve for users unfamiliar with the abstraction layer
  • Some file system-specific features may not be available through the generic interface
  • Performance overhead in some cases due to the abstraction layer
  • Documentation can be sparse for some less common use cases

Code Examples

  1. Opening and reading a file from different storage backends:
import fsspec

# Local file
with fsspec.open("local_file.txt", "r") as f:
    content = f.read()

# S3 file
with fsspec.open("s3://bucket/file.txt", "r") as f:
    content = f.read()

# Google Cloud Storage file
with fsspec.open("gcs://bucket/file.txt", "r") as f:
    content = f.read()
  2. Listing files in a directory:
import fsspec

fs = fsspec.filesystem("file")  # Local file system
files = fs.ls("/path/to/directory")

fs = fsspec.filesystem("s3")  # S3 file system
files = fs.ls("s3://bucket/path")
  3. Writing data to a file:
import fsspec

data = "Hello, World!"

with fsspec.open("output.txt", "w") as f:
    f.write(data)

with fsspec.open("s3://bucket/output.txt", "w") as f:
    f.write(data)
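Beyond open, read and write, filesystem instances also expose metadata helpers such as exists, info and ls. A small sketch using fsspec's built-in in-memory backend, which needs no extra dependencies or credentials:

```python
import fsspec

# The "memory" backend ships with fsspec itself, so this runs anywhere.
fs = fsspec.filesystem("memory")
with fs.open("/notes.txt", "wb") as f:
    f.write(b"hello")

print(fs.exists("/notes.txt"))           # True
print(fs.info("/notes.txt")["size"])     # 5
```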

Getting Started

To get started with fsspec, first install it using pip:

pip install fsspec

For specific file systems, you may need to install additional dependencies:

pip install s3fs  # for S3 support
pip install gcsfs  # for Google Cloud Storage support

Then, you can start using fsspec in your Python code:

import fsspec

# Open a file (local or remote)
with fsspec.open("path/to/file.txt", "r") as f:
    content = f.read()

# List files in a directory
fs = fsspec.filesystem("file")  # or "s3", "gcs", etc.
files = fs.ls("/path/to/directory")

# Write data to a file
with fsspec.open("output.txt", "w") as f:
    f.write("Hello, World!")
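Where a path contains a glob pattern, fsspec.open_files expands it into a list of OpenFile objects, one per match. A sketch over temporary local files (the file names are illustrative):

```python
import os
import tempfile

import fsspec

# Create a couple of local CSV files to glob over.
tmp = tempfile.mkdtemp()
for name in ("a.csv", "b.csv"):
    with open(os.path.join(tmp, name), "w") as f:
        f.write("x\n1\n")

# open_files expands the glob into a list of OpenFile objects.
files = fsspec.open_files(os.path.join(tmp, "*.csv"), mode="r")
print(len(files))  # 2
for of in files:
    with of as f:
        print(f.read())
```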

Competitor Comparisons


Intake is a lightweight package for finding, investigating, loading and disseminating data.

Pros of Intake

  • Provides a high-level API for data cataloging and discovery
  • Offers data source plugins for various formats and storage systems
  • Supports metadata management and data versioning

Cons of Intake

  • More complex setup and configuration compared to filesystem_spec
  • Potentially steeper learning curve for new users
  • May introduce overhead for simple data access tasks

Code Comparison

Intake:

import intake
catalog = intake.open_catalog("my_catalog.yaml")
dataset = catalog.my_dataset.read()

filesystem_spec:

import fsspec
fs = fsspec.filesystem("file")
with fs.open("my_file.csv", "r") as f:
    data = f.read()

Summary

Intake is a comprehensive data cataloging and access library, while filesystem_spec focuses on providing a unified interface for various filesystems. Intake offers more advanced features for data management and discovery, but may be overkill for simple file access tasks. filesystem_spec provides a simpler, more lightweight approach to working with different storage systems, but lacks the higher-level data cataloging capabilities of Intake.


Parallel computing with task scheduling

Pros of Dask

  • Provides a powerful distributed computing framework for large-scale data processing
  • Offers a familiar API that mimics NumPy and Pandas for easy adoption
  • Includes advanced features like task scheduling and adaptive scaling

Cons of Dask

  • Steeper learning curve due to its more comprehensive feature set
  • Potentially overkill for simpler file system operations
  • Requires more system resources for setup and execution

Code Comparison

Dask:

import dask.dataframe as dd

df = dd.read_csv('s3://mybucket/*.csv')
result = df.groupby('column').mean().compute()

Filesystem Spec:

import fsspec

fs = fsspec.filesystem('s3')
with fs.open('s3://mybucket/file.csv', 'rb') as f:
    content = f.read()

Summary

Dask is a comprehensive distributed computing framework ideal for large-scale data processing, while Filesystem Spec focuses on providing a unified interface for various file systems. Dask offers more advanced features but may be complex for simple file operations, whereas Filesystem Spec is lightweight and specialized for file system interactions across different storage backends.

An implementation of chunked, compressed, N-dimensional arrays for Python.

Pros of zarr-python

  • Specialized for efficient storage and access of large N-dimensional arrays
  • Supports chunked, compressed, and parallel I/O operations
  • Integrates well with NumPy and other scientific Python libraries

Cons of zarr-python

  • More focused on array data, less versatile for general file system operations
  • Steeper learning curve for users not familiar with array-based data structures
  • Limited support for non-array data types compared to filesystem_spec

Code Comparison

zarr-python:

import zarr
import numpy as np

z = zarr.create((10000, 10000), chunks=(1000, 1000), dtype='i4')
z[:] = np.random.randint(0, 1000, size=(10000, 10000))

filesystem_spec:

from fsspec import filesystem
fs = filesystem("file")
with fs.open("myfile.txt", "w") as f:
    f.write("Hello, world!")

The zarr-python example demonstrates creating and populating a large, chunked array, while the filesystem_spec example shows basic file I/O operations. This highlights the different focus areas of the two libraries: zarr-python for array data and filesystem_spec for general file system interactions.


N-D labeled arrays and datasets in Python

Pros of xarray

  • Powerful N-dimensional labeled data structures for scientific computing
  • Built-in support for complex operations like grouping, resampling, and rolling window computations
  • Seamless integration with other scientific Python libraries like NumPy, pandas, and dask

Cons of xarray

  • Steeper learning curve compared to filesystem_spec's simpler file system abstraction
  • More focused on scientific data analysis, less versatile for general file system operations
  • Larger library size and potentially higher memory footprint

Code Comparison

xarray:

import xarray as xr

ds = xr.Dataset({
    'temperature': (['time', 'lat', 'lon'], temp_data),
    'precipitation': (['time', 'lat', 'lon'], precip_data)
})

filesystem_spec:

from fsspec import filesystem

fs = filesystem('s3')
with fs.open('bucket/file.txt', 'rb') as f:
    content = f.read()

xarray excels in handling multi-dimensional labeled data, while filesystem_spec provides a unified interface for various file systems. xarray is more suitable for scientific data analysis, whereas filesystem_spec offers a simpler approach for file system operations across different storage backends.


Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

Pros of pandas

  • Comprehensive data manipulation and analysis library
  • Extensive documentation and large community support
  • Powerful data structures like DataFrame and Series

Cons of pandas

  • Larger memory footprint for big datasets
  • Steeper learning curve for beginners
  • Limited support for distributed computing

Code Comparison

pandas:

import pandas as pd

df = pd.read_csv('data.csv')
filtered_df = df[df['column'] > 5]
result = filtered_df.groupby('category').mean()

filesystem_spec:

import fsspec

with fsspec.open('s3://bucket/data.csv', 'r') as f:
    content = f.read()
    # Process content as needed

Key Differences

  • pandas focuses on data analysis and manipulation, while filesystem_spec specializes in file system abstractions
  • filesystem_spec provides a unified interface for various storage systems, whereas pandas primarily works with in-memory data structures
  • pandas offers more advanced data processing capabilities, while filesystem_spec excels in handling different file systems and cloud storage

Use Cases

  • Use pandas for data analysis, cleaning, and transformation tasks
  • Choose filesystem_spec when working with multiple storage backends or cloud services
  • Combine both libraries for efficient data loading and processing workflows
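As a sketch of that combined workflow, the bytes can be staged through fsspec (here via its built-in memory:// backend, so no cloud credentials are needed; the file name and columns are illustrative) and then handed to pandas:

```python
import fsspec
import pandas as pd

# Stage a small CSV in fsspec's built-in in-memory backend.
with fsspec.open("memory://demo.csv", "w") as f:
    f.write("category,value\na,1\na,3\nb,2\n")

# pandas accepts any fsspec-openable file object; with a real bucket the
# same pattern works for s3://, gcs://, etc.
with fsspec.open("memory://demo.csv", "r") as f:
    df = pd.read_csv(f)

print(df.groupby("category")["value"].mean())
```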

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing

Pros of Arrow

  • Comprehensive data processing framework with support for multiple languages
  • High-performance columnar memory format for efficient data processing
  • Strong community support and active development from Apache Foundation

Cons of Arrow

  • Steeper learning curve due to its broader scope and complexity
  • May be overkill for simple file system operations
  • Requires more setup and configuration for basic use cases

Code Comparison

filesystem_spec:

import fsspec

fs = fsspec.filesystem('s3')
with fs.open('mybucket/myfile.txt', 'rb') as f:
    content = f.read()

Arrow:

import pyarrow as pa
import pyarrow.fs as fs

s3 = fs.S3FileSystem()
with s3.open_input_file('mybucket/myfile.txt') as f:
    content = f.read()

Summary

filesystem_spec is a lightweight, flexible library focused on providing a unified interface for various file systems. It's simpler to use for basic file operations across different storage backends.

Arrow is a more comprehensive data processing framework that includes file system capabilities. It offers high-performance data structures and operations but may be more complex for simple use cases.

Choose filesystem_spec for straightforward file system operations across multiple backends, and Arrow for more advanced data processing needs or when working with large datasets in a columnar format.


README

filesystem_spec


A specification for pythonic filesystems.

Install

pip install fsspec

installs the base fsspec. Various optional features require extra dependencies, e.g. pip install fsspec[ssh] will install the dependencies for SSH backend support. Use pip install fsspec[full] to install all known extra dependencies.

An up-to-date package is also provided through the conda-forge distribution:

conda install -c conda-forge fsspec

Purpose

To produce a template or specification for a file-system interface that specific implementations should follow, so that applications making use of them can rely on a common behaviour and not have to worry about the internal implementation decisions of any given backend. Many such implementations are included in this package, or in sister projects such as s3fs and gcsfs.

In addition, if this is well designed, then additional functionality, such as a key-value store or FUSE mounting of the file-system implementation, may be available for all implementations "for free".
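The key-value store mentioned here is available as fsspec.get_mapper, which wraps any fsspec path as a dict-like mapping. A minimal sketch using the built-in in-memory backend (the memory://kvdemo path is illustrative):

```python
import fsspec

# get_mapper turns any fsspec location into a MutableMapping of
# key -> bytes; the same call works for s3://, gcs://, local paths, etc.
m = fsspec.get_mapper("memory://kvdemo")
m["key"] = b"value"       # write through the dict-like interface
print(m["key"])           # b'value'
```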

Documentation

Please refer to RTD

Develop

fsspec uses GitHub Actions for CI. Environment files can be found in the "ci/" directory. Note that the main environment is called "py38", but the Python version installed there is expected to be adjustable at CI runtime. For local use, pick a version that suits you.

# For a new environment (mamba / conda).
mamba create -n fsspec -c conda-forge  python=3.9 -y
conda activate fsspec

# Standard dev install with docs and tests.
pip install -e ".[dev,doc,test]"

# Full tests except for downstream
pip install s3fs
pip uninstall s3fs
pip install -e .[dev,doc,test_full]
pip install s3fs --no-deps
pytest -v

# Downstream tests.
sh install_s3fs.sh
# Windows powershell.
install_s3fs.sh

Testing

Tests can be run in the dev environment, if activated, via pytest fsspec.

The full fsspec suite requires a system-level docker, docker-compose, and fuse installation. If only making changes to one backend implementation, it is not generally necessary to run all tests locally.

Contributors are expected to ensure that changes to fsspec do not cause issues or regressions, either for other fsspec-related packages such as gcsfs and s3fs, or for downstream users of fsspec. The "downstream" CI run and corresponding environment file run a set of tests from the dask test suite, plus very minimal tests against pandas and zarr from the test_downstream.py module in this repo.

Code Formatting

fsspec uses Black to ensure a consistent code format throughout the project. Run black fsspec from the root of the filesystem_spec repository to auto-format your code. Additionally, many editors have plugins that will apply black as you edit files. black is included in the tox environments.

Optionally, you may wish to set up pre-commit hooks to automatically run black when you make a git commit. Run pre-commit install --install-hooks from the root of the filesystem_spec repository to set up the hooks. black will then run before each commit, reformatting any changed files. You can format without committing via pre-commit run, or skip these checks with git commit --no-verify.