dataprep
Open-source, low-code data preparation library in Python. Collect, clean, and visualize your data in Python with a few lines of code.
Top Related Projects
- ydata-profiling: 1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
- Great Expectations: Always know what to expect from your data.
- DoWhy: a Python library for causal inference that supports explicit modeling and testing of causal assumptions, combining causal graphical models and potential outcomes frameworks in a unified language.
- pandas: flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more.
- SciPy: SciPy library main repository.
Quick Overview
DataPrep is an open-source Python library designed to simplify and accelerate data preparation tasks. It provides a collection of easy-to-use functions for data cleaning, transformation, and exploration, making it an efficient tool for data scientists and analysts to quickly prepare their datasets for analysis or machine learning.
Pros
- User-friendly API with intuitive function names and parameters
- Comprehensive set of data preparation tools in a single library
- Efficient performance, leveraging Dask for parallel processing
- Excellent documentation and examples for easy adoption
Cons
- Limited support for certain specialized data formats
- May have a steeper learning curve for users unfamiliar with data preparation concepts
- Some advanced features might require additional dependencies
Code Examples
- Loading and exploring a dataset:
import pandas as pd
from dataprep.eda import create_report
df = pd.read_csv("your_dataset.csv")
create_report(df)
This code loads a CSV file into a pandas DataFrame and generates an interactive EDA report.
- Cleaning and standardizing values:
from dataprep.clean import clean_country
df = clean_country(df, "country")
This example cleans and standardizes the country names in the "country" column, adding the results to a new "country_clean" column.
- Connecting to a database and querying data:
from dataprep.connector import read_sql
df = read_sql("postgresql://username:password@server:port/database", "SELECT * FROM your_table")
This code connects to a PostgreSQL database through DataPrep's connectorx-backed read_sql function and returns the query results as a DataFrame.
Getting Started
To get started with DataPrep, follow these steps:
- Install the library:
pip install dataprep
- Import the necessary modules:
from dataprep.eda import create_report
from dataprep.clean import clean_text, clean_country
from dataprep.connector import connect
- Load your data and start using DataPrep functions:
import pandas as pd
df = pd.read_csv("your_dataset.csv")
create_report(df)
df = clean_text(df, "text_column")
df = clean_country(df, "country_column")
Now you're ready to explore, clean, and prepare your data using DataPrep's various functions!
Competitor Comparisons
1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
Pros of ydata-profiling
- More comprehensive profiling capabilities, including correlation analysis and advanced visualizations
- Generates interactive HTML reports for easy sharing and exploration
- Supports profiling of large datasets through sampling and multiprocessing
Cons of ydata-profiling
- Slower performance for large datasets compared to dataprep
- More complex configuration options, which may be overwhelming for beginners
- Heavier dependencies, potentially leading to longer installation times
Code Comparison
ydata-profiling:
from ydata_profiling import ProfileReport
profile = ProfileReport(df, title="Profiling Report")
profile.to_file("report.html")
dataprep:
from dataprep.eda import create_report
create_report(df).save("report.html")
Both libraries offer simple APIs for generating reports, but ydata-profiling provides more customization options through its ProfileReport class. dataprep's create_report function is more straightforward but offers fewer configuration options out of the box.
ydata-profiling excels in producing detailed, interactive reports with advanced visualizations, while dataprep focuses on simplicity and faster performance for quick exploratory data analysis tasks.
Always know what to expect from your data.
Pros of Great Expectations
- More comprehensive data validation and testing capabilities
- Extensive documentation and community support
- Integration with various data platforms and workflows
Cons of Great Expectations
- Steeper learning curve due to more complex features
- Requires more setup and configuration
- May be overkill for simple data profiling tasks
Code Comparison
Great Expectations:
import great_expectations as ge
df = ge.read_csv("data.csv")
expectation_suite = df.profile()
validation_result = df.validate(expectation_suite=expectation_suite)
DataPrep:
import pandas as pd
from dataprep.eda import create_report
create_report(pd.read_csv("data.csv"))
Great Expectations offers more granular control over data validation, while DataPrep provides a simpler, more automated approach to data profiling and exploration. Great Expectations is better suited for complex data pipelines and strict data quality requirements, whereas DataPrep excels in quick, visual data analysis and reporting.
DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
Pros of DoWhy
- Focused on causal inference and effect estimation
- Provides a unified framework for causal reasoning
- Supports various causal inference methods and estimators
Cons of DoWhy
- Steeper learning curve for users unfamiliar with causal inference
- More specialized use case compared to general data preparation
- Requires more domain knowledge to use effectively
Code Comparison
DoWhy example:
import dowhy
from dowhy import CausalModel
model = CausalModel(
    data=data,
    treatment='treatment',
    outcome='outcome',
    graph=graph,
)
identified_estimand = model.identify_effect()
estimate = model.estimate_effect(identified_estimand)
DataPrep example:
from dataprep.eda import create_report
from dataprep.clean import clean_country
report = create_report(df)
cleaned_df = clean_country(df, "country_column")
DataPrep focuses on general data preparation and exploratory data analysis, while DoWhy specializes in causal inference. DataPrep offers a more user-friendly approach for data cleaning and visualization, whereas DoWhy provides advanced tools for understanding causal relationships in data. The choice between the two depends on the specific needs of the project and the user's expertise in causal inference.
Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
Pros of pandas
- Mature and widely adopted library with extensive documentation and community support
- Offers a broader range of data manipulation and analysis capabilities
- Highly optimized for performance, especially with large datasets
Cons of pandas
- Steeper learning curve for beginners due to its extensive functionality
- Can be memory-intensive for very large datasets
- Less focus on automated data cleaning and preparation tasks
Code Comparison
pandas:
import pandas as pd
df = pd.read_csv('data.csv')
df['new_column'] = df['column_a'] + df['column_b']
result = df.groupby('category').mean()
DataPrep:
from dataprep.eda import create_report
from dataprep.clean import clean_country
report = create_report(df)
cleaned_data = clean_country(df, 'country_column')
DataPrep focuses on automated data cleaning and exploratory data analysis, offering high-level functions for quick insights and data preparation. pandas provides more granular control over data manipulation and analysis, requiring more explicit coding but offering greater flexibility for complex operations.
SciPy library main repository
Pros of SciPy
- Comprehensive scientific computing library with a wide range of mathematical functions and algorithms
- Well-established, mature project with extensive documentation and community support
- Highly optimized and efficient implementations for numerical operations
Cons of SciPy
- Steeper learning curve for beginners due to its extensive functionality
- Larger package size and potential for slower installation compared to more focused libraries
- May require additional dependencies for certain functionalities
Code Comparison
SciPy example (numerical integration):
from scipy import integrate
def f(x):
    return x**2

result = integrate.quad(f, 0, 1)
print(f"The integral of x^2 from 0 to 1 is: {result[0]}")
DataPrep example (data cleaning):
import pandas as pd
from dataprep.clean import clean_country
df = pd.DataFrame({"country": ["United States", "USA", "U.S.A.", "United States of America"]})
cleaned_df = clean_country(df, "country")
print(cleaned_df)
DataPrep focuses on data preparation and cleaning tasks, while SciPy provides a broader range of scientific computing tools. The choice between them depends on the specific requirements of your project and the type of data processing or analysis you need to perform.
README
Low code data preparation
Currently, you can use DataPrep to:
- Collect data from common data sources (through dataprep.connector)
- Do your exploratory data analysis (through dataprep.eda)
- Clean and standardize data (through dataprep.clean)
- ...more modules are coming
Releases
Installation
pip install -U dataprep
EDA
DataPrep.EDA is the fastest and the easiest EDA (Exploratory Data Analysis) tool in Python. It allows you to understand a Pandas/Dask DataFrame with a few lines of code in seconds.
Create Profile Reports, Fast
You can create a beautiful profile report from a Pandas/Dask DataFrame with the create_report function. DataPrep.EDA has the following advantages compared to other tools:
- 10X Faster: DataPrep.EDA can be 10X faster than Pandas-based profiling tools due to its highly optimized Dask-based computing module.
- Interactive Visualization: DataPrep.EDA generates interactive visualizations in a report, which makes the report look more appealing to end users.
- Big Data Support: DataPrep.EDA naturally supports big data stored in a Dask cluster by accepting a Dask dataframe as input.
The following code demonstrates how to use DataPrep.EDA to create a profile report for the titanic dataset.
from dataprep.datasets import load_dataset
from dataprep.eda import create_report
df = load_dataset("titanic")
create_report(df).show_browser()
Click here to see the generated report of the above code.
Click here to see the benchmark result.
Try DataPrep.EDA Online: DataPrep.EDA Demo in Colab
Innovative System Design
DataPrep.EDA is the only task-centric EDA system in Python. It is carefully designed to improve usability.
- Task-Centric API Design: You can declaratively specify a wide range of EDA tasks in different granularity with a single function call. All needed visualizations will be automatically and intelligently generated for you.
- Auto-Insights: DataPrep.EDA automatically detects and highlights the insights (e.g., a column has many outliers) to facilitate pattern discovery about the data.
- How-to Guide: A how-to guide is provided to show the configuration of each plot function. With this feature, you can easily customize the generated visualizations.
Learn DataPrep.EDA in 2 minutes:
Click here to check all the supported tasks.
Check plot, plot_correlation, plot_missing and create_report to see how each function works.
Clean
DataPrep.Clean contains 140+ functions designed for cleaning and validating data in a DataFrame. It provides:
- A Convenient GUI: incorporated into Jupyter Notebook, users can clean their own DataFrame without any coding (see the video below).
- A Unified API: each function follows the syntax clean_{type}(df, 'column name') (see an example below).
- Speed: the computations are parallelized using Dask. It can clean 50K rows per second on a dual-core laptop (that means cleaning 1 million rows in only 20 seconds).
- Transparency: a report is generated that summarizes the alterations to the data that occurred during cleaning.
The following video shows how to use the GUI of DataPrep.Clean.
The following example shows how to clean and standardize a column of country names.
from dataprep.clean import clean_country
import pandas as pd
df = pd.DataFrame({'country': ['USA', 'country: Canada', '233', ' tr ', 'NA']})
df2 = clean_country(df, 'country')
df2
           country  country_clean
0              USA  United States
1  country: Canada         Canada
2              233        Estonia
3               tr         Turkey
4               NA            NaN
Type validation is also supported:
from dataprep.clean import validate_country
series = validate_country(df['country'])
series
0     True
1    False
2     True
3     True
4    False
Name: country, dtype: bool
Check Documentation of Dataprep.Clean to see how each function works.
Connector
Connector now supports loading data from both web APIs and databases.
Web API
Connector is an intuitive, open-source API wrapper that speeds up development by standardizing calls to multiple APIs as a simple workflow.
Connector provides a simple wrapper to collect structured data from different Web APIs (e.g., Twitter, Spotify), making web data collection easy and efficient, without requiring advanced programming skills.
Do you want to leverage the growing number of websites that are opening their data through public APIs? Connector is for you!
Let's check out the several benefits that Connector offers:
- A unified API: You can fetch data using one or two lines of code to get data from tens of popular websites.
- Auto Pagination: Do you want to invoke a Web API that could return a large result set and need to handle it through pagination? Connector automatically does the pagination for you! Just specify the desired number of returned results (argument _count) without getting into unnecessary detail about a specific pagination scheme.
- Speed: Do you want to fetch results more quickly by making concurrent requests to Web APIs? Through the _concurrency argument, Connector simplifies concurrency, issuing API requests in parallel while respecting the API's rate limit policy.
How to fetch all publications of Andrew Y. Ng?
from dataprep.connector import connect
conn_dblp = connect("dblp", _concurrency = 5)
df = await conn_dblp.query("publication", author = "Andrew Y. Ng", _count = 2000)
Here, you can find detailed Examples.
Connector is designed to be easy to extend. If you want to connect with your own web API, you just have to write a simple configuration file to support it. This configuration file describes the API's main attributes like the URL, query parameters, authorization method, pagination properties, etc.
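For illustration only, such a configuration file might look like the sketch below; the field names here are hypothetical, so consult Connector's documentation for the exact schema:

```json
{
  "version": 1,
  "request": {
    "url": "https://api.example.com/v1/search",
    "method": "GET",
    "params": { "q": true, "limit": false }
  },
  "response": {
    "ctype": "application/json",
    "tablePath": "$.results",
    "schema": {
      "title": { "target": "$.title", "type": "string" },
      "author": { "target": "$.author", "type": "string" }
    }
  }
}
```

The idea is that the request section describes how to call the API (URL, method, which query parameters are required) and the response section describes how to map the returned payload into DataFrame columns.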
Database
Connector has adopted connectorx to enable loading data from databases (PostgreSQL, MySQL, SQL Server, etc.) into Python dataframes (pandas, dask, modin, arrow, polars) in the fastest and most memory-efficient way. [Benchmark]
All you need to do is install connectorx (pip install -U connectorx) and run one line of code:
from dataprep.connector import read_sql
read_sql("postgresql://username:password@server:port/database", "SELECT * FROM lineitem")
Check out here for supported databases and dataframes and more examples usages.
Lineage
A column-level lineage graph for SQL. This tool helps you explore the column-level lineage among your SQL files by creating an interactive graph on a webpage.
A general introduction of the project can be found in this blog post.
The lineage module offers:
- Automatic dependency creation: When there are dependencies among the SQL files and some tables are not yet in the database, the lineage module automatically tries to find the dependency tables and creates them.
- A clean, simple, yet highly interactive user interface: the interface is easy to use, with minimal clutter on the page while showing all of the necessary information.
- A variety of SQL statements: aside from the typical SELECT statement, the lineage module also supports CREATE TABLE/VIEW [IF NOT EXISTS] statements as well as INSERT and DELETE statements.
- dbt support: the lineage module is also implemented in dbt-LineageX; added into a dbt project and using the dbt library fal, it reuses the Python core and creates similar output from the dbt project.
Uses and Demo
The interactive graph looks like this: here is a live demo with the mimic-iv concepts_postgres files (navigation instructions), created with one line of code:
from dataprep.lineage import lineagex
lineagex(sql="path/to/sql", target_schema="schema1", conn_string="postgresql://username:password@server:port/database", search_path_schema="schema1, public")
Check out more detailed usage and examples here.
Documentation
The following documentation can give you an impression of what DataPrep can do:
Contribute
There are many ways to contribute to DataPrep.
- Submit bugs and help us verify fixes as they are checked in.
- Review the source code changes.
- Engage with other DataPrep users and developers on StackOverflow.
- Ask questions & propose new ideas in our Forum.
- Contribute bug fixes.
- Provide use cases and write about your user experience.
Please take a look at our wiki for development documentation!
Acknowledgement
Some functionalities of DataPrep are inspired by the following packages.
- Inspired the report functionality and insights provided in dataprep.eda.
- Inspired the missing value analysis in dataprep.eda.
Citing DataPrep
If you use DataPrep, please consider citing the following paper:
Jinglin Peng, Weiyuan Wu, Brandon Lockhart, Song Bian, Jing Nathan Yan, Linghao Xu, Zhixuan Chi, Jeffrey M. Rzeszotarski, and Jiannan Wang. DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical Modeling in Python. SIGMOD 2021.
BibTeX entry:
@inproceedings{dataprepeda2021,
  author    = {Jinglin Peng and Weiyuan Wu and Brandon Lockhart and Song Bian and Jing Nathan Yan and Linghao Xu and Zhixuan Chi and Jeffrey M. Rzeszotarski and Jiannan Wang},
  title     = {DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical Modeling in Python},
  booktitle = {Proceedings of the 2021 International Conference on Management of Data (SIGMOD '21), June 20--25, 2021, Virtual Event, China},
  year      = {2021}
}