ydata-profiling
1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
Top Related Projects
- ydata-profiling: 1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
- Great Expectations: Always know what to expect from your data.
- DoWhy: DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
- Sweetviz: Visualize and compare datasets, target values and associations, with one line of code.
- dataprep: Open-source low code data preparation library in Python. Collect, clean, and visualize your data in Python with a few lines of code.
- responsible-ai-toolbox: Responsible AI Toolbox is a suite of tools providing model and data exploration and assessment user interfaces and libraries that enable a better understanding of AI systems. These interfaces and libraries empower developers and stakeholders of AI systems to develop and monitor AI more responsibly, and take better data-driven actions.
Quick Overview
YData Profiling (formerly Pandas Profiling) is an open-source Python library that generates profile reports from a pandas DataFrame. It provides a comprehensive analysis of datasets, including statistics, correlations, missing values, and visualizations, all in an interactive HTML report.
Pros
- Generates detailed and interactive reports with minimal code
- Supports large datasets with efficient processing
- Customizable report content and appearance
- Integrates well with Jupyter notebooks and other data science workflows
Cons
- Can be slow for extremely large datasets
- May consume significant memory for complex reports
- Limited customization options for some advanced use cases
- Requires pandas DataFrames as input, which may not suit all data formats
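The last con deserves a concrete illustration: data that arrives in another shape (here, raw CSV text) first has to be turned into records that pandas can ingest. The sketch below is stdlib-only and the column names are invented for illustration; the final pandas step is shown as a comment.

```python
import csv
import io

# Raw CSV text standing in for data that is not yet a DataFrame.
raw = "name,score\nalice,10\nbob,7\n"

# Parse into a list of dicts, one per row, using only the stdlib.
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]["name"], rows[1]["score"])  # alice 7

# With pandas installed, profiling would then start from:
#   df = pd.DataFrame(rows)
#   ProfileReport(df).to_file("report.html")
```

Note that `csv.DictReader` yields strings; numeric columns would still need casting before profiling produces meaningful statistics for them.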
Code Examples
- Basic usage:
from ydata_profiling import ProfileReport
import pandas as pd
df = pd.read_csv("your_dataset.csv")
profile = ProfileReport(df, title="Profiling Report")
profile.to_file("output_report.html")
- Customizing report content:
profile = ProfileReport(df, title="Custom Report", minimal=True, explorative=True)
profile.to_file("custom_report.html")
- Generating a report with correlations and interactions:
profile = ProfileReport(
    df,
    title="Detailed Report",
    correlations={
        "pearson": {"calculate": True},
        "spearman": {"calculate": True},
        "kendall": {"calculate": True},
        "phi_k": {"calculate": True},
        "cramers": {"calculate": True},
    },
    interactions={"continuous": True},
)
profile.to_file("detailed_report.html")
Getting Started
To get started with YData Profiling:
- Install the library:
pip install ydata-profiling
- Import the library and create a report:
from ydata_profiling import ProfileReport
import pandas as pd

# Load your data
df = pd.read_csv("your_dataset.csv")

# Create and save the report
profile = ProfileReport(df, title="My Dataset Report")
profile.to_file("report.html")
- Open the generated HTML file in your web browser to view the interactive report.
Competitor Comparisons
1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
Pros of ydata-profiling
- Comprehensive data profiling capabilities
- Generates detailed HTML reports
- Supports various data types and formats
Cons of ydata-profiling
- May be slower for large datasets
- Limited customization options for reports
- Requires additional dependencies
Code Comparison
ydata-profiling:
import pandas as pd
from ydata_profiling import ProfileReport
df = pd.read_csv("dataset.csv")
profile = ProfileReport(df)
profile.to_file("report.html")
The two repositories are in fact the same project: ydata-profiling is the renamed successor of the original pandas-profiling library, and the old name now redirects to it. The code usage and functionality are identical for both.
The ydata-profiling library provides a powerful and easy-to-use tool for generating comprehensive data profiling reports. It offers a wide range of statistical analyses and visualizations, making it valuable for exploratory data analysis and data quality assessment. However, users should be aware of potential performance limitations with large datasets and consider the need for additional dependencies when integrating it into their projects.
Always know what to expect from your data.
Pros of Great Expectations
- Comprehensive data validation and testing framework
- Supports multiple data sources (databases, APIs, files)
- Extensive documentation and active community support
Cons of Great Expectations
- Steeper learning curve due to its complexity
- Requires more setup and configuration
- Can be overkill for simple data profiling tasks
Code Comparison
ydata-profiling:
import pandas as pd
from ydata_profiling import ProfileReport
df = pd.read_csv("data.csv")
profile = ProfileReport(df)
profile.to_file("report.html")
Great Expectations:
import great_expectations as ge
context = ge.get_context()
suite = context.create_expectation_suite("my_suite")
validator = context.get_validator(
batch_request={"path": "data.csv"},
expectation_suite=suite
)
validator.expect_column_values_to_be_between("age", min_value=0, max_value=120)
Summary
ydata-profiling is more focused on quick and easy data profiling with automatic report generation, while Great Expectations offers a more comprehensive data validation framework with greater flexibility and customization options. The choice between the two depends on the specific needs of your project and the level of control you require over data quality checks.
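To make the contrast concrete, here is a dependency-free sketch of the idea behind an expectation: a declarative, reusable check that returns a structured result instead of a bare pass/fail. The function name and result fields below are illustrative only, not Great Expectations' actual API.

```python
def expect_values_between(values, min_value, max_value):
    """Check every value falls in [min_value, max_value]; report details."""
    unexpected = [v for v in values if not (min_value <= v <= max_value)]
    return {
        "success": not unexpected,       # overall pass/fail
        "element_count": len(values),    # how many values were checked
        "unexpected_list": unexpected,   # the offending values, for debugging
    }

# A column of ages with one out-of-range entry.
ages = [34, 67, 12, 150, 45]
result = expect_values_between(ages, min_value=0, max_value=120)
print(result["success"], result["unexpected_list"])  # False [150]
```

The structured result is what makes such checks composable into suites, reports, and pipeline gates, which is exactly the machinery Great Expectations builds out at scale.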
DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
Pros of DoWhy
- Focuses on causal inference and effect estimation
- Provides a unified framework for causal reasoning
- Supports various causal inference methods and estimators
Cons of DoWhy
- Steeper learning curve for users unfamiliar with causal inference
- More specialized use case compared to general data profiling
- May require additional data preparation for causal analysis
Code Comparison
ydata-profiling:
import pandas as pd
from ydata_profiling import ProfileReport
df = pd.read_csv("your_data.csv")
profile = ProfileReport(df)
profile.to_file("output.html")
DoWhy:
from dowhy import CausalModel

# df: a pandas DataFrame containing the treatment, outcome and common-cause columns
model = CausalModel(
    data=df,
    treatment='treatment_column',
    outcome='outcome_column',
    common_causes=['cause1', 'cause2']
)
identified_estimand = model.identify_effect()
estimate = model.estimate_effect(identified_estimand, method_name="backdoor.linear_regression")
ydata-profiling is designed for general data profiling and generating comprehensive reports, while DoWhy specializes in causal inference and effect estimation. ydata-profiling is more user-friendly for quick data exploration, whereas DoWhy requires a deeper understanding of causal relationships and statistical methods.
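The distinction above, estimating a treatment's effect rather than describing a dataset, can be shown with a dependency-free sketch. The numbers and column layout below are invented for illustration; DoWhy automates this kind of back-door adjustment (and much more) on real data.

```python
# Toy records: (confounder, treated?, outcome). The confounder influences
# both who gets treated and the outcome, so a naive comparison misleads.
data = [
    (0, 1, 3), (0, 1, 3), (0, 0, 2), (0, 0, 2),
    (1, 1, 7), (1, 0, 6), (1, 0, 6), (1, 0, 6),
]

def mean(xs):
    return sum(xs) / len(xs)

# Naive estimate: compare treated vs. untreated, ignoring the confounder.
naive = mean([y for c, t, y in data if t]) - mean([y for c, t, y in data if not t])

# Adjusted estimate: effect within each confounder stratum,
# weighted by stratum size (a simple back-door adjustment).
adjusted = 0.0
for s in sorted({c for c, _, _ in data}):
    rows = [(t, y) for c, t, y in data if c == s]
    effect = mean([y for t, y in rows if t]) - mean([y for t, y in rows if not t])
    adjusted += effect * len(rows) / len(data)

print(round(naive, 3), round(adjusted, 3))
```

Here the naive difference is slightly negative even though the treatment raises the outcome by 1 within every stratum; the adjusted estimate recovers that effect. This is the class of mistake causal tooling exists to prevent, and it lies outside what a profiling report measures.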
Visualize and compare datasets, target values and associations, with one line of code.
Pros of Sweetviz
- Faster processing and report generation, especially for larger datasets
- More visually appealing and interactive reports
- Includes feature correlation analysis and comparison between two datasets
Cons of Sweetviz
- Less customizable than ydata-profiling
- Fewer advanced statistical measures and data quality checks
- Limited support for categorical variables with many unique values
Code Comparison
Sweetviz:
import sweetviz as sv
report = sv.analyze(df)
report.show_html('report.html')
ydata-profiling:
from ydata_profiling import ProfileReport
profile = ProfileReport(df)
profile.to_file('report.html')
Both libraries offer simple, one-line report generation. Sweetviz provides a more streamlined API, while ydata-profiling offers more configuration options through its ProfileReport class.
Sweetviz is ideal for quick, visually appealing reports and dataset comparisons. ydata-profiling excels in detailed statistical analysis and customization, making it suitable for more comprehensive data exploration and quality assessment tasks.
Choose Sweetviz for rapid insights and attractive visualizations, or ydata-profiling for in-depth analysis and flexibility in report generation.
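The dataset-comparison feature both tools advertise boils down to computing the same per-column summaries for two datasets and presenting them side by side. Below is a hedged, stdlib-only sketch of that core; the column names and statistics are invented for illustration, and the real libraries add rendered visuals and many more measures.

```python
def summarize(column):
    """Minimal per-column summary: mean and row count."""
    return {"mean": sum(column) / len(column), "n": len(column)}

def compare(ds_a, ds_b):
    """Summaries for every column the two datasets share, side by side."""
    return {
        col: {"a": summarize(ds_a[col]), "b": summarize(ds_b[col])}
        for col in ds_a.keys() & ds_b.keys()
    }

train = {"age": [20, 30, 40], "fare": [10.0, 20.0, 30.0]}
test = {"age": [25, 35], "fare": [15.0, 45.0]}
print(compare(train, test)["age"])
```

A shift between the two summaries for the same column (e.g. a fare mean that jumps between splits) is exactly the kind of drift these comparison reports are designed to surface.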
Open-source low code data preparation library in Python. Collect, clean, and visualize your data in Python with a few lines of code.
Pros of dataprep
- Offers a wider range of data preparation functions beyond profiling
- Provides interactive visualizations for data exploration
- Supports more data sources, including databases and cloud storage
Cons of dataprep
- Less focused on detailed profiling compared to ydata-profiling
- May have a steeper learning curve due to its broader functionality
- Potentially slower for large datasets due to its comprehensive approach
Code Comparison
ydata-profiling:
import pandas as pd
from ydata_profiling import ProfileReport
df = pd.read_csv("your_data.csv")
profile = ProfileReport(df)
profile.to_file("output.html")
dataprep:
from dataprep.eda import create_report
import pandas as pd
df = pd.read_csv("your_data.csv")
create_report(df).save("output.html")
Both libraries offer simple ways to generate data profiling reports, but dataprep's create_report function is part of a larger ecosystem of data preparation tools. ydata-profiling focuses specifically on generating comprehensive profiling reports, while dataprep provides a broader range of data exploration and preparation functions.
Responsible AI Toolbox is a suite of tools providing model and data exploration and assessment user interfaces and libraries that enable a better understanding of AI systems. These interfaces and libraries empower developers and stakeholders of AI systems to develop and monitor AI more responsibly, and take better data-driven actions.
Pros of responsible-ai-toolbox
- Comprehensive suite of tools for responsible AI development
- Strong focus on model interpretability and fairness assessment
- Backed by Microsoft, ensuring ongoing support and development
Cons of responsible-ai-toolbox
- Steeper learning curve due to more complex features
- Primarily focused on model analysis rather than data profiling
- May be overkill for simple data exploration tasks
Code comparison
ydata-profiling:
import pandas as pd
from ydata_profiling import ProfileReport
df = pd.read_csv("dataset.csv")
profile = ProfileReport(df)
profile.to_file("report.html")
responsible-ai-toolbox:
from raiwidgets import ExplanationDashboard
from interpret.ext.blackbox import TabularExplainer
explainer = TabularExplainer(model, X_train, features=feature_names)
ExplanationDashboard(explainer, X_test, y_test)
ydata-profiling is more focused on quick data profiling and generating reports, while responsible-ai-toolbox provides in-depth model explanations and fairness assessments. The choice between them depends on the specific needs of the project, with ydata-profiling being more suitable for initial data exploration and responsible-ai-toolbox for comprehensive model analysis and responsible AI practices.
README
ydata-profiling
Documentation | Discord | Stack Overflow | Latest changelog
Do you like this project? Show us your love and give feedback!
ydata-profiling's primary goal is to provide a one-line Exploratory Data Analysis (EDA) experience in a consistent and fast solution. Like the handy pandas df.describe() function, ydata-profiling delivers an extended analysis of a DataFrame, while allowing the analysis to be exported in different formats such as HTML and JSON.
The package outputs a simple and digested analysis of a dataset, including time-series and text.
Looking for a scalable solution that can fully integrate with your database systems?
Leverage YData Fabric Data Catalog to connect to different databases and storages (Oracle, Snowflake, PostgreSQL, GCS, S3, etc.) and benefit from an interactive and guided profiling experience in Fabric. Check out the Community Version.
Quickstart
Install
pip install ydata-profiling
or
conda install -c conda-forge ydata-profiling
Start profiling
Start by loading your pandas DataFrame as you normally would, e.g. by using:
import numpy as np
import pandas as pd
from ydata_profiling import ProfileReport
df = pd.DataFrame(np.random.rand(100, 5), columns=["a", "b", "c", "d", "e"])
To generate the standard profiling report, merely run:
profile = ProfileReport(df, title="Profiling Report")
Key features
- Type inference: automatic detection of columns' data types (Categorical, Numerical, Date, etc.)
- Warnings: A summary of the problems/challenges in the data that you might need to work on (missing data, inaccuracies, skewness, etc.)
- Univariate analysis: including descriptive statistics (mean, median, mode, etc) and informative visualizations such as distribution histograms
- Multivariate analysis: including correlations, a detailed analysis of missing data, duplicate rows, and visual support for variables pairwise interaction
- Time-Series: including different statistical information relative to time dependent data such as auto-correlation and seasonality, along with ACF and PACF plots.
- Text analysis: most common categories (uppercase, lowercase, separator), scripts (Latin, Cyrillic) and blocks (ASCII, Cyrillic)
- File and Image analysis: file sizes, creation dates, dimensions, indication of truncated images and existence of EXIF metadata
- Compare datasets: one-line solution to enable a fast and complete report on the comparison of datasets
- Flexible output formats: all analysis can be exported to an HTML report that can be easily shared with different parties, as JSON for an easy integration in automated systems and as a widget in a Jupyter Notebook.
The report contains three additional sections:
- Overview: mostly global details about the dataset (number of records, number of variables, overall missingness and duplicates, memory footprint)
- Alerts: a comprehensive and automatic list of potential data quality issues (high correlation, skewness, uniformity, zeros, missing values, constant values, among others)
- Reproduction: technical details about the analysis (time, version and configuration)
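The univariate statistics described above can be sketched in a few lines of dependency-free Python. This is an illustrative miniature of what the report's per-variable section summarizes (counts, missingness, distinct values, basic descriptive stats); the library itself computes far more, including type inference and visualizations.

```python
def profile_column(values):
    """Minimal per-column profile over a list, treating None as missing."""
    present = [v for v in values if v is not None]
    stats = {
        "count": len(values),                 # total observations
        "missing": len(values) - len(present),  # how many were None
        "distinct": len(set(present)),        # unique non-missing values
    }
    # Descriptive stats only make sense for numeric columns.
    if present and all(isinstance(v, (int, float)) for v in present):
        stats.update(
            mean=sum(present) / len(present),
            min=min(present),
            max=max(present),
        )
    return stats

print(profile_column([4, 8, None, 8, 15]))
```

Scaling this idea to every column, plus pairwise correlations, alerts, and rendering, is essentially what a single ProfileReport call automates.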
Latest features
- Want to scale? Check the latest release with Spark support!
- Looking for how you can do an EDA for Time-Series? Check this blogpost.
- You want to compare 2 datasets and get a report? Check this blogpost.
Spark
Spark support has been released, but we are always looking for an extra pair of hands. Check current work in progress!
Use cases
ydata-profiling can be used for a variety of use cases. The documentation includes guides, tips and tricks for tackling them:
Use case | Description |
---|---|
Comparing datasets | Comparing multiple versions of the same dataset |
Profiling a Time-Series dataset | Generating a report for a time-series dataset with a single line of code |
Profiling large datasets | Tips on how to prepare data and configure ydata-profiling for working with large datasets |
Handling sensitive data | Generating reports which are mindful about sensitive data in the input dataset |
Dataset metadata and data dictionaries | Complementing the report with dataset details and column-specific data dictionaries |
Customizing the report's appearance | Changing the appearance of the report's page and of the contained visualizations |
Profiling Databases | For a seamless profiling experience in your organization's databases, check Fabric Data Catalog, which allows you to consume data from different types of storages such as RDBMSs (Azure SQL, PostgreSQL, Oracle, etc.) and object storages (Google Cloud Storage, AWS S3, Snowflake, etc.), among others. |
Using inside Jupyter Notebooks
There are two interfaces to consume the report inside a Jupyter notebook: through widgets and through an embedded HTML report.
The widget-based interface is achieved by displaying the report as a set of widgets. In a Jupyter Notebook, run:
profile.to_widgets()
The HTML report can be directly embedded in a cell in a similar fashion:
profile.to_notebook_iframe()
Exporting the report to a file
To generate an HTML report file, save the ProfileReport to an object and use the to_file() function:
profile.to_file("your_report.html")
Alternatively, the report's data can be obtained as a JSON file:
# As a JSON string
json_data = profile.to_json()
# As a file
profile.to_file("your_report.json")
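The JSON export is what makes the report machine-consumable, for instance to gate a pipeline on data quality. The sketch below shows that pattern with the stdlib only; the dictionary structure and field names here are invented for illustration and are not the library's actual JSON schema, which nests far more detail.

```python
import json

# Stand-in for profile.to_json(): a hypothetical, simplified export with
# per-variable missing counts. Field names are illustrative only.
report_json = json.dumps({
    "variables": {
        "age": {"n_missing": 1, "n": 100},
        "income": {"n_missing": 40, "n": 100},
    }
})

# Downstream consumer: flag any column with more than 20% missing values.
report = json.loads(report_json)
flagged = [
    name for name, stats in report["variables"].items()
    if stats["n_missing"] / stats["n"] > 0.2
]
print(flagged)
```

A CI job or orchestrator task could fail the run when `flagged` is non-empty, turning the profiling report into an automated quality gate.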
Using in the command line
For standard formatted CSV files (which can be read directly by pandas without additional settings), the ydata_profiling executable can be used in the command line. The example below generates a report named Example Profiling Report, using a configuration file called default.yaml, in the file report.html by processing a data.csv dataset.
ydata_profiling --title "Example Profiling Report" --config_file default.yaml data.csv report.html
Additional details on the CLI are available on the documentation.
Examples
The following example reports showcase the potentialities of the package across a wide range of dataset and data types:
- Census Income (US Adult Census data relating income with other demographic properties)
- NASA Meteorites (comprehensive set of meteorite landing - object properties and locations)
- Titanic (the "Wonderwall" of datasets)
- NZA (open data from the Dutch Healthcare Authority)
- Stata Auto (1978 Automobile data)
- Colors (a simple colors dataset)
- Vektis (Vektis Dutch Healthcare data)
- UCI Bank Dataset (marketing dataset from a bank)
- Russian Vocabulary (100 most common Russian words, showcasing unicode text analysis)
- Website Inaccessibility (website accessibility analysis, showcasing support for URL data)
- Orange prices and
- Coal prices (simple pricing evolution datasets, showcasing the theming options)
- USA Air Quality (Time-series air quality dataset EDA example)
- HCC (Open dataset from healthcare, showcasing compare between two sets of data, before and after preprocessing)
Installation
Additional details, including information about widget support, are available on the documentation.
Using pip
You can install using the pip
package manager by running:
pip install -U ydata-profiling
Extras
The package declares "extras", sets of additional dependencies:
- [notebook]: support for rendering the report in Jupyter notebook widgets.
- [unicode]: support for more detailed Unicode analysis, at the expense of additional disk space.
- [pyspark]: support for pyspark for big dataset analysis
Install these with e.g.
pip install -U ydata-profiling[notebook,unicode,pyspark]
Using conda
You can install using the conda
package manager by running:
conda install -c conda-forge ydata-profiling
From source (development)
Download the source code by cloning the repository or click on Download ZIP to download the latest stable version.
Install it by navigating to the proper directory and running:
pip install -e .
The profiling report is written in HTML and CSS, which means a modern browser is required.
You need Python 3 to run the package. Other dependencies can be found in the requirements files:
Filename | Requirements |
---|---|
requirements.txt | Package requirements |
requirements-dev.txt | Requirements for development |
requirements-test.txt | Requirements for testing |
setup.py | Requirements for widgets etc. |
Integrations
To maximize its usefulness in real world contexts, ydata-profiling has a set of implicit and explicit integrations with a variety of other actors in the Data Science ecosystem:
Integration type | Description |
---|---|
Other DataFrame libraries | How to compute the profiling of data stored in libraries other than pandas |
Great Expectations | Generating Great Expectations expectations suites directly from a profiling report |
Interactive applications | Embedding profiling reports in Streamlit, Dash or Panel applications |
Pipelines | Integration with DAG workflow execution tools like Airflow or Kedro |
Cloud services | Using ydata-profiling in hosted computation services like Lambda, Google Cloud or Kaggle |
IDEs | Using ydata-profiling directly from integrated development environments such as PyCharm |
Support
Need help? Want to share a perspective? Report a bug? Ideas for collaborations? Reach out via the following channels:
- Stack Overflow: ideal for asking questions on how to use the package
- GitHub Issues: bugs, proposals for changes, feature requests
- Discord: ideal for projects discussions, ask questions, collaborations, general chat
Need Help?
Get your questions answered with a product owner by booking a Pawsome chat!
Before reporting an issue on GitHub, check out Common Issues.
Contributing
Learn how to get involved in the Contribution Guide.
A low-threshold place to ask questions or start contributing is the Data Centric AI Community's Discord.
A big thank you to all our amazing contributors!
Contributors wall made with contrib.rocks.