Top Related Projects
Topic Modelling for Humans
A specification that python filesystems should adhere to.
Parallel computing with task scheduling
Intake is a lightweight package for finding, investigating, loading and disseminating data.
N-D labeled arrays and datasets in Python
Quick Overview
smart_open is a Python library that provides a convenient and efficient way to stream large files for reading and writing, including compressed and remote files. It supports a range of protocols and storage backends, such as local files, S3, HDFS, WebHDFS, HTTP, and HTTPS, through a single, simple interface.
Pros
- Seamless handling of different file formats and protocols
- Efficient streaming of large files without loading them entirely into memory
- Easy-to-use interface with a consistent API across different storage systems
- Supports compression formats like gzip, bzip2, and lzma
Cons
- Dependency on external libraries for certain protocols (e.g., boto3 for S3)
- Limited support for writing to some remote protocols (e.g., HTTP)
- May have performance overhead for small files compared to native Python file operations
- Documentation could be more comprehensive for advanced use cases
Code Examples
Reading a file from S3:
from smart_open import open

with open('s3://mybucket/mykey.txt', 'r') as f:
    content = f.read()
    print(content)
Writing a gzip-compressed file:
from smart_open import open

# The gzip codec is inferred from the .gz extension; pass compression='.gz'
# only if you need to force it for a non-standard extension.
with open('myfile.txt.gz', 'w') as f:
    f.write('Hello, compressed world!')
Streaming a large file from a URL:
from smart_open import open

with open('https://example.com/large_file.csv', 'r') as f:
    for line in f:
        process_line(line)  # process_line is a placeholder for your own handler
Getting Started
To get started with smart_open, first install it using pip:
pip install smart_open
Then, you can use it in your Python code:
from smart_open import open

# Read a local file
with open('myfile.txt', 'r') as f:
    content = f.read()

# Write to an S3 object
with open('s3://mybucket/mykey.txt', 'w') as f:
    f.write('Hello, S3!')

# Read a gzip-compressed file from a URL (codec inferred from the .gz extension)
with open('http://example.com/data.csv.gz', 'r') as f:
    for line in f:
        print(line.strip())
Note that for S3 operations, you'll need to install the boto3
library and configure your AWS credentials.
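If you need explicit control over those credentials, you can install the S3 extras (pip install smart_open[s3]) and pass a pre-configured boto3 client through transport_params, as described later in the README. A minimal sketch; the bucket and key names are placeholders:

import boto3
from smart_open import open

# Assumes credentials are available to boto3 via environment variables,
# a shared credentials file, or an IAM role; bucket/key are placeholders.
session = boto3.Session()
client = session.client('s3')

with open('s3://mybucket/mykey.txt', 'r', transport_params={'client': client}) as f:
    print(f.read())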
Competitor Comparisons
Topic Modelling for Humans
Pros of gensim
- Comprehensive library for topic modeling and document similarity
- Efficient implementations of popular algorithms like Word2Vec and LDA
- Large community and extensive documentation
Cons of gensim
- Steeper learning curve due to its broad scope
- Heavier dependency footprint
- May be overkill for simple text processing tasks
Code comparison
gensim:
from gensim.models import Word2Vec
sentences = [['this', 'is', 'a', 'sentence'], ['another', 'sentence']]
model = Word2Vec(sentences, min_count=1)
vector = model.wv['sentence']
smart_open:
from smart_open import open
with open('s3://bucket/key.txt', 'rb') as f:
    content = f.read()
Key differences
- gensim focuses on advanced NLP tasks, while smart_open specializes in efficient file streaming
- smart_open is more lightweight and easier to integrate for basic I/O operations
- gensim provides a wide range of text analysis tools, whereas smart_open is primarily for handling various file formats and protocols
Both libraries are maintained by the same author (piskvorky) and can be used complementarily in NLP projects, with smart_open potentially serving as a utility within larger gensim-based applications.
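For example, a re-iterable corpus streamed through smart_open can feed gensim directly. A minimal sketch; the S3 URL is a placeholder, and the corpus is assumed to hold one whitespace-tokenized sentence per line:

from smart_open import open
from gensim.models import Word2Vec

class S3Corpus:
    """Stream a tokenized corpus from S3, line by line. Re-iterable, so gensim
    can make multiple passes (vocabulary build, then training epochs)."""
    def __init__(self, url):
        self.url = url

    def __iter__(self):
        with open(self.url, 'r') as f:
            for line in f:
                yield line.split()

# 's3://mybucket/corpus.txt' is a placeholder URL
model = Word2Vec(S3Corpus('s3://mybucket/corpus.txt'), min_count=1)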
A specification that python filesystems should adhere to.
Pros of filesystem_spec
- More comprehensive support for various file systems and protocols
- Provides a unified interface for different storage backends
- Offers advanced features like caching and transaction support
Cons of filesystem_spec
- Steeper learning curve due to more complex API
- May be overkill for simple file operations
- Requires more dependencies for full functionality
Code Comparison
smart_open:
from smart_open import open
with open('s3://bucket/key', 'rb') as f:
    content = f.read()
filesystem_spec:
import fsspec
with fsspec.open('s3://bucket/key', 'rb') as f:
    content = f.read()
Both libraries provide similar basic functionality for opening files, but filesystem_spec offers more advanced features and customization options for working with various file systems and protocols. smart_open is generally simpler to use for basic operations, while filesystem_spec provides a more comprehensive solution for complex file system interactions across different storage backends.
Parallel computing with task scheduling
Pros of Dask
- Powerful distributed computing framework for large-scale data processing
- Integrates well with NumPy, Pandas, and scikit-learn ecosystems
- Offers advanced features like task scheduling and parallel computing
Cons of Dask
- Steeper learning curve due to its complexity and extensive features
- Requires more setup and configuration for distributed computing
- May be overkill for simple file I/O operations
Code Comparison
smart_open:
from smart_open import open
with open('s3://bucket/key.txt', 'r') as f:
    content = f.read()
Dask:
import dask.dataframe as dd
df = dd.read_csv('s3://bucket/key.csv')
result = df.groupby('column').mean().compute()
Key Differences
- smart_open focuses on simplified I/O operations across various protocols
- Dask is a comprehensive data processing framework with distributed computing capabilities
- smart_open is more lightweight and easier to use for basic file operations
- Dask excels in handling large-scale data processing tasks and parallel computing
Intake is a lightweight package for finding, investigating, loading and disseminating data.
Pros of intake
- Provides a unified data access layer for various data sources
- Supports data cataloging and metadata management
- Offers powerful data discovery and exploration capabilities
Cons of intake
- Steeper learning curve compared to smart_open
- May be overkill for simple file access tasks
- Less focused on cloud storage integration
Code comparison
smart_open:
from smart_open import open
with open('s3://bucket/key.txt', 'r') as f:
    content = f.read()
intake:
import intake
catalog = intake.open_catalog('catalog.yml')
dataset = catalog.my_dataset.read()
Key differences
- smart_open focuses on providing a simple, unified interface for reading and writing files from various sources, including cloud storage.
- intake is a more comprehensive data access and cataloging system, offering additional features like data discovery and metadata management.
- smart_open is generally easier to use for basic file operations, while intake provides more advanced data handling capabilities.
- intake requires catalog definitions and is better suited for managing larger datasets and data pipelines.
- smart_open has better support for cloud storage services out of the box, while intake may require additional plugins for some cloud integrations.
N-D labeled arrays and datasets in Python
Pros of xarray
- Powerful N-dimensional array operations and labeled data handling
- Extensive support for working with large datasets and NetCDF files
- Rich ecosystem of tools and plugins for scientific computing
Cons of xarray
- Steeper learning curve due to more complex data structures
- Heavier dependency footprint compared to smart_open
- May be overkill for simple file I/O operations
Code Comparison
xarray:
import xarray as xr
# Open a NetCDF file
ds = xr.open_dataset('data.nc')
# Perform operations on multidimensional data
result = ds['temperature'].mean(dim='time')
smart_open:
from smart_open import open
# Read from a compressed file
with open('data.gz', 'r') as f:
    content = f.read()

# Write to an S3 bucket
with open('s3://bucket/key', 'w') as f:
    f.write('Hello, World!')
xarray is focused on handling multidimensional labeled arrays and datasets, particularly useful for scientific computing and data analysis. smart_open, on the other hand, specializes in providing a unified interface for reading and writing to various file formats and remote storage systems. While xarray offers powerful data manipulation capabilities, smart_open excels in simplifying I/O operations across different storage backends.
README
smart_open - utils for streaming large files in Python
=======================================================

|License|_ |GHA|_ |Coveralls|_ |Downloads|_

.. |License| image:: https://img.shields.io/pypi/l/smart_open.svg
.. |GHA| image:: https://github.com/RaRe-Technologies/smart_open/workflows/Test/badge.svg
.. |Coveralls| image:: https://coveralls.io/repos/github/RaRe-Technologies/smart_open/badge.svg?branch=develop
.. |Downloads| image:: https://pepy.tech/badge/smart-open/month
.. _License: https://github.com/RaRe-Technologies/smart_open/blob/master/LICENSE
.. _GHA: https://github.com/RaRe-Technologies/smart_open/actions?query=workflow%3ATest
.. _Coveralls: https://coveralls.io/github/RaRe-Technologies/smart_open?branch=HEAD
.. _Downloads: https://pypi.org/project/smart-open/
What?
``smart_open`` is a Python 3 library for efficient streaming of very large files from/to storages such as S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or local filesystem. It supports transparent, on-the-fly (de-)compression for a variety of different formats.

``smart_open`` is a drop-in replacement for Python's built-in ``open()``: it can do anything ``open`` can (100% compatible, falls back to native ``open`` wherever possible), plus lots of nifty extra stuff on top.
Python 2.7 is no longer supported. If you need Python 2.7, please use `smart_open 1.10.1 <https://github.com/RaRe-Technologies/smart_open/releases/tag/1.10.0>`_, the last version to support Python 2.
Why?
Working with large remote files, for example using Amazon's `boto3 <https://boto3.amazonaws.com/v1/documentation/api/latest/index.html>`_ Python library, is a pain. ``boto3``'s ``Object.upload_fileobj()`` and ``Object.download_fileobj()`` methods require gotcha-prone boilerplate to use successfully, such as constructing file-like object wrappers.

``smart_open`` shields you from that. It builds on boto3 and other remote storage libraries, but offers a clean unified Pythonic API. The result is less code for you to write and fewer bugs to make.
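To make the difference concrete, here is a rough sketch (not taken from the official docs) of downloading an S3 object as text with plain boto3 versus ``smart_open``; the bucket and key names are placeholders:

.. code-block:: python

import io

import boto3
from smart_open import open

# Plain boto3: download_fileobj wants a binary file-like object,
# and decoding the bytes is up to you.
s3 = boto3.client('s3')
buf = io.BytesIO()
s3.download_fileobj('mybucket', 'mykey.txt', buf)
text = buf.getvalue().decode('utf-8')

# smart_open: one call, streamed, text decoding handled for you.
with open('s3://mybucket/mykey.txt', 'r') as f:
    text = f.read()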
How?
``smart_open`` is well-tested, well-documented, and has a simple Pythonic API:
.. _doctools_before_examples:
.. code-block:: python
>>> from smart_open import open
>>>
>>> # stream lines from an S3 object
>>> for line in open('s3://commoncrawl/robots.txt'):
...     print(repr(line))
...     break
'User-Agent: *\n'

>>> # stream from/to compressed files, with transparent (de)compression:
>>> for line in open('smart_open/tests/test_data/1984.txt.gz', encoding='utf-8'):
...     print(repr(line))
'It was a bright cold day in April, and the clocks were striking thirteen.\n'
'Winston Smith, his chin nuzzled into his breast in an effort to escape the vile\n'
'wind, slipped quickly through the glass doors of Victory Mansions, though not\n'
'quickly enough to prevent a swirl of gritty dust from entering along with him.\n'

>>> # can use context managers too:
>>> with open('smart_open/tests/test_data/1984.txt.gz') as fin:
...     with open('smart_open/tests/test_data/1984.txt.bz2', 'w') as fout:
...         for line in fin:
...             fout.write(line)
74
80
78
79

>>> # can use any IOBase operations, like seek
>>> with open('s3://commoncrawl/robots.txt', 'rb') as fin:
...     for line in fin:
...         print(repr(line.decode('utf-8')))
...         break
...     offset = fin.seek(0)  # seek to the beginning
...     print(fin.read(4))
'User-Agent: *\n'
b'User'

>>> # stream from HTTP
>>> for line in open('http://example.com/index.html'):
...     print(repr(line))
...     break
'\n'
.. _doctools_after_examples:
Other examples of URLs that ``smart_open`` accepts::
s3://my_bucket/my_key
s3://my_key:my_secret@my_bucket/my_key
s3://my_key:my_secret@my_server:my_port@my_bucket/my_key
gs://my_bucket/my_blob
azure://my_bucket/my_blob
hdfs:///path/file
hdfs://path/file
webhdfs://host:port/path/file
./local/path/file
~/local/path/file
local/path/file
./local/path/file.gz
file:///home/user/file
file:///home/user/file.bz2
[ssh|scp|sftp]://username@host//path/file
[ssh|scp|sftp]://username@host/path/file
[ssh|scp|sftp]://username:password@host/path/file
Documentation
Installation
``smart_open`` supports a wide range of storage solutions, including AWS S3, Google Cloud and Azure.
Each individual solution has its own dependencies.
By default, ``smart_open`` does not install any dependencies, in order to keep the installation size small.
You can install these dependencies explicitly using::
pip install smart_open[azure] # Install Azure deps
pip install smart_open[gcs] # Install GCS deps
pip install smart_open[s3] # Install S3 deps
Or, if you don't mind installing a large number of third party libraries, you can install all dependencies using::
pip install smart_open[all]
Be warned that this option increases the installation size significantly, e.g. over 100MB.
If you're upgrading from ``smart_open`` versions 2.x and below, please check out the `Migration Guide <MIGRATING_FROM_OLDER_VERSIONS.rst>`_.
Built-in help
For detailed API info, see the online help:
.. code-block:: python
help('smart_open')
or `click here <https://github.com/RaRe-Technologies/smart_open/blob/master/help.txt>`__ to view the help in your browser.
More examples
For the sake of simplicity, the examples below assume you have all the dependencies installed, i.e. you have done::
pip install smart_open[all]
.. code-block:: python
>>> import os, boto3
>>> from smart_open import open
>>>
>>> # stream content *into* S3 (write mode) using a custom session
>>> session = boto3.Session(
... aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
... aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'],
... )
>>> url = 's3://smart-open-py37-benchmark-results/test.txt'
>>> with open(url, 'wb', transport_params={'client': session.client('s3')}) as fout:
... bytes_written = fout.write(b'hello world!')
... print(bytes_written)
12
.. code-block:: python
# stream from HDFS
for line in open('hdfs://user/hadoop/my_file.txt', encoding='utf8'):
    print(line)

# stream from WebHDFS
for line in open('webhdfs://host:port/user/hadoop/my_file.txt'):
    print(line)

# stream content *into* HDFS (write mode):
with open('hdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')

# stream content *into* WebHDFS (write mode):
with open('webhdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')

# stream from a completely custom s3 server, like s3proxy:
for line in open('s3u://user:secret@host:port@mybucket/mykey.txt'):
    print(line)

# Stream to Digital Ocean Spaces bucket providing credentials from boto3 profile
session = boto3.Session(profile_name='digitalocean')
client = session.client('s3', endpoint_url='https://ams3.digitaloceanspaces.com')
transport_params = {'client': client}
with open('s3://bucket/key.txt', 'wb', transport_params=transport_params) as fout:
    fout.write(b'here we stand')

# stream from GCS
for line in open('gs://my_bucket/my_file.txt'):
    print(line)

# stream content *into* GCS (write mode):
with open('gs://my_bucket/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')

# stream from Azure Blob Storage
connect_str = os.environ['AZURE_STORAGE_CONNECTION_STRING']
transport_params = {
    'client': azure.storage.blob.BlobServiceClient.from_connection_string(connect_str),
}
for line in open('azure://mycontainer/myfile.txt', transport_params=transport_params):
    print(line)

# stream content *into* Azure Blob Storage (write mode):
connect_str = os.environ['AZURE_STORAGE_CONNECTION_STRING']
transport_params = {
    'client': azure.storage.blob.BlobServiceClient.from_connection_string(connect_str),
}
with open('azure://mycontainer/my_file.txt', 'wb', transport_params=transport_params) as fout:
    fout.write(b'hello world')
Compression Handling
The top-level ``compression`` parameter controls compression/decompression behavior when reading and writing.
The supported values for this parameter are:

- ``infer_from_extension`` (default behavior)
- ``disable``
- ``.gz``
- ``.bz2``

By default, ``smart_open`` determines the compression algorithm to use based on the file extension.
.. code-block:: python
>>> from smart_open import open, register_compressor
>>> with open('smart_open/tests/test_data/1984.txt.gz') as fin:
... print(fin.read(32))
It was a bright cold day in Apri
You can override this behavior to either disable compression, or explicitly specify the algorithm to use. To disable compression:
.. code-block:: python
>>> from smart_open import open, register_compressor
>>> with open('smart_open/tests/test_data/1984.txt.gz', 'rb', compression='disable') as fin:
... print(fin.read(32))
b'\x1f\x8b\x08\x08\x85F\x94\\\x00\x031984.txt\x005\x8f=r\xc3@\x08\x85{\x9d\xe2\x1d@'
To specify the algorithm explicitly (e.g. for non-standard file extensions):
.. code-block:: python
>>> from smart_open import open, register_compressor
>>> with open('smart_open/tests/test_data/1984.txt.gzip', compression='.gz') as fin:
... print(fin.read(32))
It was a bright cold day in Apri
You can also easily add support for other file extensions and compression formats. For example, to open xz-compressed files:
.. code-block:: python
>>> import lzma, os
>>> from smart_open import open, register_compressor
>>> def _handle_xz(file_obj, mode):
... return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ)
>>> register_compressor('.xz', _handle_xz)
>>> with open('smart_open/tests/test_data/1984.txt.xz') as fin:
... print(fin.read(32))
It was a bright cold day in Apri
``lzma`` is in the standard library in Python 3.3 and greater.
For 2.7, use `backports.lzma`_.
.. _backports.lzma: https://pypi.org/project/backports.lzma/
Transport-specific Options
``smart_open`` supports a wide range of transport options out of the box, including:
- S3
- HTTP, HTTPS (read-only)
- SSH, SCP and SFTP
- WebHDFS
- GCS
- Azure Blob Storage
Each option involves setting up its own set of parameters.
For example, for accessing S3, you often need to set up authentication, like API keys or a profile name.
``smart_open``'s ``open`` function accepts a keyword argument ``transport_params`` which accepts additional parameters for the transport layer.
Here are some examples of using this parameter:
.. code-block:: python
import boto3

fin = open('s3://commoncrawl/robots.txt', transport_params=dict(client=boto3.client('s3')))
fin = open('s3://commoncrawl/robots.txt', transport_params=dict(buffer_size=1024))
For the full list of keyword arguments supported by each transport option, see the documentation:
.. code-block:: python
help('smart_open.open')
S3 Credentials
``smart_open`` uses the ``boto3`` library to talk to S3.
``boto3`` has `several mechanisms <https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html>`__ for determining the credentials to use.
By default, ``smart_open`` will defer to ``boto3`` and let the latter take care of the credentials.
There are several ways to override this behavior.
The first is to pass a ``boto3.Client`` object as a transport parameter to the ``open`` function.
You can customize the credentials when constructing the session for the client.
``smart_open`` will then use the session when talking to S3.
.. code-block:: python
session = boto3.Session(
aws_access_key_id=ACCESS_KEY,
aws_secret_access_key=SECRET_KEY,
aws_session_token=SESSION_TOKEN,
)
client = session.client('s3', endpoint_url=..., config=...)
fin = open('s3://bucket/key', transport_params={'client': client})
Your second option is to specify the credentials within the S3 URL itself:
.. code-block:: python
fin = open('s3://aws_access_key_id:aws_secret_access_key@bucket/key', ...)
Important: The two methods above are mutually exclusive. If you pass an AWS client and the URL contains credentials, ``smart_open`` will ignore the latter.

Important: ``smart_open`` ignores configuration files from the older ``boto`` library.
Port your old ``boto`` settings to ``boto3`` in order to use them with ``smart_open``.
S3 Advanced Usage
Additional keyword arguments can be propagated to the boto3 methods that are used by ``smart_open`` under the hood, using the ``client_kwargs`` transport parameter.

For instance, to upload a blob with Metadata, ACL, StorageClass, these keyword arguments can be passed to ``create_multipart_upload`` (`docs <https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.create_multipart_upload>`__).
.. code-block:: python
kwargs = {'Metadata': {'version': 2}, 'ACL': 'authenticated-read', 'StorageClass': 'STANDARD_IA'}
fout = open('s3://bucket/key', 'wb', transport_params={'client_kwargs': {'S3.Client.create_multipart_upload': kwargs}})
Iterating Over an S3 Bucket's Contents
Since going over all (or select) keys in an S3 bucket is a very common operation, there's also an extra function ``smart_open.s3.iter_bucket()`` that does this efficiently, processing the bucket keys in parallel (using multiprocessing):
.. code-block:: python
>>> from smart_open import s3
>>> # we use workers=1 for reproducibility; you should use as many workers as you have cores
>>> bucket = 'silo-open-data'
>>> prefix = 'Official/annual/monthly_rain/'
>>> for key, content in s3.iter_bucket(bucket, prefix=prefix, accept_key=lambda key: '/201' in key, workers=1, key_limit=3):
...     print(key, round(len(content) / 2**20))
Official/annual/monthly_rain/2010.monthly_rain.nc 13
Official/annual/monthly_rain/2011.monthly_rain.nc 13
Official/annual/monthly_rain/2012.monthly_rain.nc 13
GCS Credentials
``smart_open`` uses the ``google-cloud-storage`` library to talk to GCS.
``google-cloud-storage`` uses the ``google-cloud`` package under the hood to handle authentication.
There are `several options <https://googleapis.dev/python/google-api-core/latest/auth.html>`__ to provide credentials.
By default, ``smart_open`` will defer to ``google-cloud-storage`` and let it take care of the credentials.
To override this behavior, pass a ``google.cloud.storage.Client`` object as a transport parameter to the ``open`` function.
You can `customize the credentials <https://googleapis.dev/python/storage/latest/client.html>`__ when constructing the client.
``smart_open`` will then use the client when talking to GCS.
To follow along with the example below, refer to `Google's guide <https://cloud.google.com/storage/docs/reference/libraries#setting_up_authentication>`__ to setting up GCS authentication with a service account.
.. code-block:: python
import os
from google.cloud.storage import Client
service_account_path = os.environ['GOOGLE_APPLICATION_CREDENTIALS']
client = Client.from_service_account_json(service_account_path)
fin = open('gs://gcp-public-data-landsat/index.csv.gz', transport_params=dict(client=client))
If you need more credential options, you can create an explicit ``google.auth.credentials.Credentials`` object and pass it to the Client.
To create an API token for use in the example below, refer to the `GCS authentication guide <https://cloud.google.com/storage/docs/authentication#apiauth>`__.
.. code-block:: python
import os
from google.auth.credentials import Credentials
from google.cloud.storage import Client
token = os.environ['GOOGLE_API_TOKEN']
credentials = Credentials(token=token)
client = Client(credentials=credentials)
fin = open('gs://gcp-public-data-landsat/index.csv.gz', transport_params={'client': client})
GCS Advanced Usage
Additional keyword arguments can be propagated to the GCS open method (`docs <https://cloud.google.com/python/docs/reference/storage/latest/google.cloud.storage.blob.Blob#google_cloud_storage_blob_Blob_open>`__), which is used by ``smart_open`` under the hood, using the ``blob_open_kwargs`` transport parameter.

Additional keyword arguments can be propagated to the GCS ``get_blob`` method (`docs <https://cloud.google.com/python/docs/reference/storage/latest/google.cloud.storage.bucket.Bucket#google_cloud_storage_bucket_Bucket_get_blob>`__) when in read mode, using the ``get_blob_kwargs`` transport parameter.

Additional blob properties (`docs <https://cloud.google.com/python/docs/reference/storage/latest/google.cloud.storage.blob.Blob#properties>`__) can be set before an upload, as long as they are not read-only, using the ``blob_properties`` transport parameter.
.. code-block:: python
open_kwargs = {'predefined_acl': 'authenticated-read'}
properties = {'metadata': {'version': 2}, 'storage_class': 'COLDLINE'}
fout = open('gs://bucket/key', 'wb', transport_params={'blob_open_kwargs': open_kwargs, 'blob_properties': properties})
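For read mode, here is a minimal hedged sketch of the ``get_blob_kwargs`` parameter mentioned above; the generation number is a placeholder:

.. code-block:: python

# Pin the read to a specific blob generation; 'generation' is a keyword
# argument of google-cloud-storage's Bucket.get_blob, and the value is a placeholder.
get_kwargs = {'generation': 1234567890}
fin = open('gs://bucket/key', 'rb', transport_params={'get_blob_kwargs': get_kwargs})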
Azure Credentials
``smart_open`` uses the ``azure-storage-blob`` library to talk to Azure Blob Storage.
By default, ``smart_open`` will defer to ``azure-storage-blob`` and let it take care of the credentials.

Azure Blob Storage does not have any way of inferring credentials; therefore, passing an ``azure.storage.blob.BlobServiceClient`` object as a transport parameter to the ``open`` function is required.
You can `customize the credentials <https://docs.microsoft.com/en-us/azure/storage/common/storage-samples-python#authentication>`__ when constructing the client.
``smart_open`` will then use the client when talking to Azure Blob Storage.
To follow along with the example below, refer to `Azure's guide <https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python#copy-your-credentials-from-the-azure-portal>`__ to setting up authentication.
.. code-block:: python
import os
from azure.storage.blob import BlobServiceClient
azure_storage_connection_string = os.environ['AZURE_STORAGE_CONNECTION_STRING']
client = BlobServiceClient.from_connection_string(azure_storage_connection_string)
fin = open('azure://my_container/my_blob.txt', transport_params={'client': client})
If you need more credential options, refer to the `Azure Storage authentication guide <https://docs.microsoft.com/en-us/azure/storage/common/storage-samples-python#authentication>`__.
Azure Advanced Usage
Additional keyword arguments can be propagated to the ``commit_block_list`` method (`docs <https://azuresdkdocs.blob.core.windows.net/$web/python/azure-storage-blob/12.14.1/azure.storage.blob.html#azure.storage.blob.BlobClient.commit_block_list>`__), which is used by ``smart_open`` under the hood for uploads, using the ``blob_kwargs`` transport parameter.
.. code-block:: python
kwargs = {'metadata': {'version': 2}}
fout = open('azure://container/key', 'wb', transport_params={'blob_kwargs': kwargs})
Drop-in replacement of pathlib.Path.open
``smart_open.open`` can also be used with ``Path`` objects.
The built-in ``Path.open()`` is not able to read text from compressed files, so use ``patch_pathlib`` to replace it with ``smart_open.open()`` instead.
This can be helpful when e.g. working with compressed files.
.. code-block:: python
>>> from pathlib import Path
>>> from smart_open.smart_open_lib import patch_pathlib
>>>
>>> _ = patch_pathlib() # replace `Path.open` with `smart_open.open`
>>>
>>> path = Path("smart_open/tests/test_data/crime-and-punishment.txt.gz")
>>>
>>> with path.open("r") as infile:
... print(infile.readline()[:41])
В начале июля, в чрезвычайно жаркое время
How do I ...?
See `this document <howto.md>`__.
Extending smart_open
See `this document <extending.md>`__.
Testing smart_open
``smart_open`` comes with a comprehensive suite of unit tests.
Before you can run the test suite, install the test dependencies::
pip install -e .[test]
Now, you can run the unit tests::
pytest smart_open
The tests are also run automatically with `Travis CI <https://travis-ci.org/RaRe-Technologies/smart_open>`_ on every commit push & pull request.
Comments, bug reports
``smart_open`` lives on `Github <https://github.com/RaRe-Technologies/smart_open>`_. You can file issues or pull requests there. Suggestions, pull requests and improvements welcome!
``smart_open`` is open source software released under the `MIT license <https://github.com/piskvorky/smart_open/blob/master/LICENSE>`_.

Copyright (c) 2015-now `Radim Řehůřek <https://radimrehurek.com>`_.