Top Related Projects
- Dagster: An orchestration platform for the development, production, and observation of data assets.
- Ploomber: The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️
- Metaflow: Open-source platform for developing, scaling, and deploying serious ML, AI, and data science systems.
- Kedro: A toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
- MLflow: Open-source platform for the machine learning lifecycle.
Quick Overview
Papermill is a tool for parameterizing, executing, and analyzing Jupyter Notebooks. It allows users to run notebooks with different parameter sets, execute notebooks remotely, and collect metrics across multiple notebook runs. Papermill is designed to support data science workflows and reproducible research.
Pros
- Enables easy parameterization of Jupyter Notebooks
- Supports remote execution of notebooks
- Facilitates reproducible research and automated reporting
- Integrates well with data science workflows and pipelines
Cons
- May have a learning curve for users new to notebook parameterization
- Limited to Jupyter Notebook format, not applicable to other file types
- Requires additional setup and configuration for advanced features
- Performance may be impacted when dealing with large notebooks or datasets
Code Examples
- Executing a notebook with parameters:
```python
import papermill as pm

pm.execute_notebook(
    'input.ipynb',
    'output.ipynb',
    parameters={'alpha': 0.6, 'ratio': 0.1}
)
```
- Reading notebook output (papermill's old `read_notebook`/`dataframe` API was deprecated and removed; the companion scrapbook library now provides this functionality):
```python
import scrapbook as sb

# "Scraps" are values the executed notebook recorded via sb.glue
nb = sb.read_notebook('output.ipynb')
print(nb.scraps.data_dict)
```
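For the read example above to return anything, the executed notebook needs to record values first. A minimal sketch of the recording side, using scrapbook's `glue` inside the notebook (the name and value are hypothetical):
```python
# Runs inside the executed notebook: records a value under a name
import scrapbook as sb

sb.glue('accuracy', 0.95)
```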
- Executing a notebook with a specific engine and kernel (papermill's Python API selects engines by name via `engine_name`, not by passing an engine class):
```python
import papermill as pm

# engine_name selects a registered execution engine; 'nbclient' is the default
pm.execute_notebook(
    'input.ipynb',
    'output.ipynb',
    engine_name='nbclient',
    kernel_name='python3'
)
```
Getting Started
To get started with Papermill, follow these steps:
- Install Papermill:
```bash
pip install papermill
```
- Create a notebook with parameters:
  - Add a cell with the tag "parameters"
  - Define your default parameters in this cell, as in the sketch below
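A minimal sketch of a parameters cell (the names and defaults are hypothetical placeholders):
```python
# In the notebook, tag this cell "parameters"; papermill treats these as defaults
alpha = 0.1
ratio = 0.5
```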
- Execute the notebook:
```python
import papermill as pm

pm.execute_notebook(
    'input.ipynb',
    'output.ipynb',
    parameters={'param1': value1, 'param2': value2}
)
```
- Analyze the results in the output notebook, or use Papermill's API to extract data programmatically.
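As one approach to the programmatic route, the output notebook is an ordinary .ipynb file, so it can be parsed with nbformat (a minimal sketch):
```python
import nbformat

# The output notebook is plain JSON; nbformat parses it into cells
nb = nbformat.read('output.ipynb', as_version=4)
for cell in nb.cells:
    if cell.cell_type == 'code':
        print(cell.outputs)
```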
Competitor Comparisons
Dagster: An orchestration platform for the development, production, and observation of data assets.
Pros of Dagster
- More comprehensive data orchestration framework with built-in scheduling, monitoring, and error handling
- Supports complex data pipelines with dependencies and conditional execution
- Provides a web-based UI for visualizing and managing workflows
Cons of Dagster
- Steeper learning curve due to its more extensive feature set
- Requires more setup and configuration compared to Papermill's simplicity
- May be overkill for simple notebook execution tasks
Code Comparison
Papermill execution:
```python
import papermill as pm

pm.execute_notebook(
    'input.ipynb',
    'output.ipynb',
    parameters={'alpha': 0.6, 'ratio': 0.1}
)
```
Dagster execution:
```python
from dagster import job, op

@op
def process_data():
    # Processing logic here
    return "processed"

@job
def my_job():
    process_data()

my_job.execute_in_process()
```
Dagster offers a more structured approach to defining data pipelines, while Papermill focuses on simple notebook parameterization and execution. Dagster's code involves defining ops (tasks) and jobs, whereas Papermill directly executes notebooks with parameters.
Ploomber: The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️
Pros of Ploomber
- More comprehensive workflow management, including DAG-based pipeline creation
- Built-in support for various execution environments (local, cloud, clusters)
- Extensive documentation and tutorials for complex data science workflows
Cons of Ploomber
- Steeper learning curve due to more advanced features
- May be overkill for simple notebook parameterization tasks
- Less integration with Jupyter ecosystem compared to Papermill
Code Comparison
Papermill example:
```python
import papermill as pm

pm.execute_notebook(
    'input.ipynb',
    'output.ipynb',
    parameters={'alpha': 0.6, 'ratio': 0.1}
)
```
Ploomber example:
```python
from pathlib import Path
from ploomber import DAG
from ploomber.products import File
from ploomber.tasks import NotebookRunner

dag = DAG()
# Tasks register themselves on the DAG they are given
NotebookRunner(Path('input.ipynb'), File('output.ipynb'), dag=dag)
dag.build()
```
Summary
Ploomber offers more advanced features for complex data science workflows, including DAG-based pipeline creation and support for various execution environments. However, it has a steeper learning curve and may be excessive for simple notebook parameterization tasks. Papermill, on the other hand, is more focused on notebook execution and parameterization, with better integration within the Jupyter ecosystem. The choice between the two depends on the complexity of your workflow and your specific requirements.
Metaflow: Open-source platform for developing, scaling, and deploying serious ML, AI, and data science systems.
Pros of Metaflow
- Designed for large-scale data science workflows and production environments
- Provides built-in versioning and tracking of data artifacts
- Offers seamless integration with cloud computing resources
Cons of Metaflow
- Steeper learning curve due to its more complex architecture
- Less focused on notebook parameterization compared to Papermill
- May be overkill for simpler data science projects
Code Comparison
Papermill example:
```python
import papermill as pm

pm.execute_notebook(
    'input.ipynb',
    'output.ipynb',
    parameters={'alpha': 0.6, 'ratio': 0.1}
)
```
Metaflow example:
```python
from metaflow import FlowSpec, step

class MyFlow(FlowSpec):

    @step
    def start(self):
        self.alpha = 0.6
        self.ratio = 0.1
        self.next(self.process_data)

    @step
    def process_data(self):
        # Data processing logic here
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    MyFlow()
```
Papermill focuses on parameterizing and executing notebooks, while Metaflow provides a more comprehensive framework for defining and managing data science workflows. Papermill is simpler to use for basic notebook automation, whereas Metaflow offers more advanced features for complex, production-grade data science pipelines.
Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
Pros of Kedro
- Provides a comprehensive framework for data science project structure and workflow management
- Offers built-in support for data catalogs, pipelines, and configuration management
- Integrates well with other tools in the data science ecosystem
Cons of Kedro
- Steeper learning curve due to its more complex architecture
- May be overkill for simple projects or one-off notebook executions
- Requires adherence to specific project structure and conventions
Code Comparison
Papermill execution:
```python
import papermill as pm

pm.execute_notebook(
    'input.ipynb',
    'output.ipynb',
    parameters={'alpha': 0.6, 'ratio': 0.1}
)
```
Kedro pipeline execution:
```python
from kedro.framework.session import KedroSession

# Must be run from within a Kedro project
with KedroSession.create() as session:
    session.run(pipeline_name="data_science")
```
Papermill focuses on parameterizing and executing individual notebooks, while Kedro emphasizes building modular pipelines and managing project structure. Papermill is more lightweight and easier to integrate into existing workflows, whereas Kedro provides a more comprehensive framework for data science projects.
MLflow: Open-source platform for the machine learning lifecycle.
Pros of MLflow
- Comprehensive end-to-end ML lifecycle management
- Supports multiple languages and frameworks
- Includes experiment tracking, model registry, and deployment tools
Cons of MLflow
- Steeper learning curve due to more extensive features
- May be overkill for simple notebook parameterization tasks
- Requires more setup and infrastructure
Code Comparison
MLflow:
```python
import mlflow

with mlflow.start_run():
    mlflow.log_param("param1", value1)
    mlflow.log_metric("metric1", value2)
    mlflow.sklearn.log_model(model, "model")
```
Papermill:
```python
import papermill as pm

pm.execute_notebook(
    'input.ipynb',
    'output.ipynb',
    parameters={'param1': value1}
)
```
MLflow is a comprehensive platform for managing the entire machine learning lifecycle, including experiment tracking, model versioning, and deployment. It offers more features and flexibility but may require more setup and learning.
Papermill focuses specifically on parameterizing and executing Jupyter notebooks. It's simpler to use for basic notebook automation tasks but lacks the broader ML lifecycle management capabilities of MLflow.
Choose MLflow for full ML project management or Papermill for straightforward notebook parameterization and execution.
README
papermill is a tool for parameterizing, executing, and analyzing Jupyter Notebooks.
Papermill lets you:
- parameterize notebooks
- execute notebooks
This opens up new opportunities for how notebooks can be used. For example:
- Perhaps you have a financial report that you wish to run with different values on the first or last day of a month, or at the beginning or end of the year; using parameters makes this task easier.
- Do you want to run a notebook and, depending on its results, choose a particular notebook to run next? You can now programmatically execute a workflow without having to copy and paste from notebook to notebook manually, as sketched below.
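A minimal sketch of such conditional chaining (the notebook names and the `model_quality` scrap are hypothetical; it assumes the first notebook records that value with scrapbook):
```python
import papermill as pm
import scrapbook as sb

# Run the first notebook, then pick the next one based on a recorded value
pm.execute_notebook('evaluate.ipynb', 'evaluate_out.ipynb')
quality = sb.read_notebook('evaluate_out.ipynb').scraps.data_dict['model_quality']
next_nb = 'retrain.ipynb' if quality < 0.9 else 'report.ipynb'
pm.execute_notebook(next_nb, 'next_out.ipynb')
```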
Papermill takes an opinionated approach to notebook parameterization and execution based on our experiences using notebooks at scale in data pipelines.
Installation
From the command line:
```bash
pip install papermill
```
For the optional IO dependencies, you can specify individual bundles like `s3` or `azure`, or use `all`. To use Black to format injected parameters, add the `black` extra:
```bash
pip install papermill[all]
```
Python Version Support
This library currently supports Python 3.8+ versions. As minor Python versions are officially sunset by the Python organization, papermill will similarly drop support in the future.
Usage
Parameterizing a Notebook
To parameterize your notebook, designate a cell with the tag `parameters`.
Papermill looks for the `parameters` cell and treats it as defaults for the parameters passed in at execution time. Papermill adds a new cell tagged `injected-parameters` containing the input parameters, in order to overwrite the values in `parameters`. If no cell is tagged with `parameters`, the injected cell is inserted at the top of the notebook.
Additionally, if you rerun a notebook through papermill, it will reuse the `injected-parameters` cell from the prior run. In this case Papermill replaces the old `injected-parameters` cell with the new run's inputs.
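To see which parameters a notebook exposes before running it, recent papermill versions provide `inspect_notebook` (a minimal sketch):
```python
import papermill as pm

# Returns a dict describing each parameter defined in the "parameters" cell
params = pm.inspect_notebook('input.ipynb')
print(params)
```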
Executing a Notebook
The two ways to execute the notebook with parameters are: (1) through the Python API and (2) through the command line interface.
Execute via the Python API
```python
import papermill as pm

pm.execute_notebook(
    'path/to/input.ipynb',
    'path/to/output.ipynb',
    parameters=dict(alpha=0.6, ratio=0.1)
)
```
Execute via CLI
Here's an example of a local notebook being executed and output to an Amazon S3 account:
```bash
$ papermill local/input.ipynb s3://bkt/output.ipynb -p alpha 0.6 -p l1_ratio 0.1
```
NOTE: If you use multiple AWS accounts, and you have properly configured your AWS credentials, then you can specify which account to use by setting the `AWS_PROFILE` environment variable at the command line. For example:
```bash
$ AWS_PROFILE=dev_account papermill local/input.ipynb s3://bkt/output.ipynb -p alpha 0.6 -p l1_ratio 0.1
```
In the above example, two parameters are set, `alpha` and `l1_ratio`, using `-p` (`--parameters` also works). Parameter values that look like booleans or numbers will be interpreted as such. Here are the different ways users may set parameters:
```bash
$ papermill local/input.ipynb s3://bkt/output.ipynb -r version 1.0
```
Using `-r` or `--parameters_raw`, users can set parameters one by one. However, unlike `-p`, the parameter will remain a string, even if it may be interpreted as a number or boolean (here, `version` is injected as the string `'1.0'` rather than the float `1.0`).
```bash
$ papermill local/input.ipynb s3://bkt/output.ipynb -f parameters.yaml
```
Using `-f` or `--parameters_file`, users can provide a YAML file from which parameter values should be read.
```bash
$ papermill local/input.ipynb s3://bkt/output.ipynb -y "
alpha: 0.6
l1_ratio: 0.1"
```
Using `-y` or `--parameters_yaml`, users can directly provide a YAML string containing parameter values.
```bash
$ papermill local/input.ipynb s3://bkt/output.ipynb -b YWxwaGE6IDAuNgpsMV9yYXRpbzogMC4xCg==
```
Using `-b` or `--parameters_base64`, users can provide a base64-encoded YAML string containing parameter values.
When using YAML to pass arguments, through `-y`, `-b`, or `-f`, parameter values can be arrays or dictionaries:
```bash
$ papermill local/input.ipynb s3://bkt/output.ipynb -y "
x:
  - 0.0
  - 1.0
  - 2.0
  - 3.0
linear_function:
  slope: 3.0
  intercept: 1.0"
```
Supported Name Handlers
Papermill supports the following name handlers for input and output paths during execution:
- Local file system: `local`
- HTTP, HTTPS protocol: `http://`, `https://`
- Amazon Web Services: AWS S3 `s3://`
- Azure: Azure DataLake Store, Azure Blob Store `adl://`, `abs://`
- Google Cloud: Google Cloud Storage `gs://`
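This means notebooks can be read from and written to remote storage directly. A minimal sketch using S3 paths (the bucket name is hypothetical; requires the `s3` extra):
```python
import papermill as pm

# Input and output live in S3; papermill handles the IO via its s3:// handler
pm.execute_notebook(
    's3://my-bucket/input.ipynb',
    's3://my-bucket/output.ipynb',
    parameters={'alpha': 0.6}
)
```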
Development Guide
Read CONTRIBUTING.md for guidelines on how to set up a local development environment and make code changes back to Papermill.
For development guidelines look in the DEVELOPMENT_GUIDE.md file. This should inform you on how to make particular additions to the code base.
Documentation
We host the Papermill documentation on ReadTheDocs.