sweetviz
Visualize and compare datasets, target values and associations, with one line of code.
Top Related Projects
1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
Always know what to expect from your data.
An open source python library for automated feature engineering
Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
Quick Overview
Sweetviz is an open-source Python library that generates beautiful, high-density visualizations to kickstart EDA (Exploratory Data Analysis) with a single line of code. It automatically visualizes target values, compares datasets, and analyzes feature associations, making it an efficient tool for data scientists and analysts.
Pros
- Generates comprehensive, visually appealing reports with minimal code
- Supports comparison between two datasets (e.g., training vs. test)
- Provides detailed analysis of feature associations and correlations
- Automatically handles both numerical and categorical data
Cons
- Limited customization options for generated visualizations
- May be slower for very large datasets
- Doesn't support advanced statistical tests or machine learning models
- Requires a modern web browser to view the generated HTML report
Code Examples
Creating a basic report:
import sweetviz as sv
import pandas as pd
# Load your dataset
df = pd.read_csv("your_dataset.csv")
# Create and show the analysis report
my_report = sv.analyze(df)
my_report.show_html()
Comparing two datasets:
import sweetviz as sv
import pandas as pd
# Load your datasets
df_train = pd.read_csv("train_data.csv")
df_test = pd.read_csv("test_data.csv")
# Create a comparison report
compare_report = sv.compare([df_train, "Train"], [df_test, "Test"])
compare_report.show_html()
Analyzing with a target variable:
import sweetviz as sv
import pandas as pd
# Load your dataset
df = pd.read_csv("your_dataset.csv")
# Create a report with a specified target variable
target_report = sv.analyze(df, target_feat="target_column_name")
target_report.show_html()
Getting Started
To get started with Sweetviz, follow these steps:
- Install Sweetviz using pip:
pip install sweetviz
- Import Sweetviz in your Python script:
import sweetviz as sv
- Load your dataset using pandas and create a report:
import pandas as pd
df = pd.read_csv("your_dataset.csv")
report = sv.analyze(df)
report.show_html()
This will generate an HTML report that you can view in your web browser, providing a comprehensive overview of your dataset's characteristics and relationships.
Competitor Comparisons
1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
Pros of ydata-profiling
- More comprehensive analysis, including correlation matrices and advanced statistical measures
- Supports a wider range of data types, including geospatial and image data
- Offers more customization options and configuration settings
Cons of ydata-profiling
- Slower performance, especially on larger datasets
- More complex to use, with a steeper learning curve
- Generates larger report files, which can be unwieldy for sharing
Code Comparison
ydata-profiling:
from ydata_profiling import ProfileReport
profile = ProfileReport(df, title="Profiling Report")
profile.to_file("output.html")
Sweetviz:
import sweetviz as sv
report = sv.analyze(df)
report.show_html("output.html")
Both libraries offer simple ways to generate reports, but ydata-profiling provides more customization options through its ProfileReport class. Sweetviz has a more streamlined API, making it easier to use for quick analyses.
ydata-profiling is better suited for in-depth analysis and customization, while Sweetviz excels in simplicity and speed for quick exploratory data analysis. The choice between the two depends on the specific needs of the project and the user's preference for depth versus simplicity.
DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
Pros of DoWhy
- Focuses on causal inference and effect estimation
- Provides a unified framework for causal reasoning
- Supports various causal inference methods and estimators
Cons of DoWhy
- Steeper learning curve for users new to causal inference
- Less emphasis on automated exploratory data analysis
- Requires more manual configuration for analysis tasks
Code Comparison
DoWhy:
import dowhy
from dowhy import CausalModel
model = CausalModel(
data=data,
treatment='treatment_column',
outcome='outcome_column',
graph=causal_graph
)
identified_estimand = model.identify_effect()
estimate = model.estimate_effect(identified_estimand)
Sweetviz:
import sweetviz as sv
report = sv.analyze(dataframe)
report.show_html()
DoWhy is designed for causal inference and requires more setup, including specifying the causal graph and treatment/outcome variables. Sweetviz, on the other hand, focuses on automated exploratory data analysis and generates reports with minimal configuration.
While DoWhy provides powerful tools for causal reasoning, Sweetviz offers quick and easy data visualization and summary statistics. The choice between the two depends on the specific analysis needs and the user's familiarity with causal inference concepts.
Always know what to expect from your data.
Pros of Great Expectations
- More comprehensive data validation and testing capabilities
- Supports multiple data sources (databases, APIs, cloud storage)
- Extensible with custom expectations and plugins
Cons of Great Expectations
- Steeper learning curve and more complex setup
- Requires more code and configuration for basic tasks
- Heavier resource usage for large datasets
Code Comparison
Sweetviz:
import sweetviz as sv
report = sv.analyze(df)
report.show_html()
Great Expectations:
import great_expectations as ge
context = ge.get_context()
suite = context.create_expectation_suite("my_suite")
validator = context.get_validator(
batch_request={"dataset": df},
expectation_suite=suite
)
validator.expect_column_values_to_be_between("column_name", min_value=0, max_value=100)
Sweetviz offers a simpler, more streamlined approach for quick data profiling and visualization, while Great Expectations provides a more robust framework for data validation and quality assurance. Sweetviz is better suited for rapid exploratory data analysis, whereas Great Expectations excels in maintaining data quality throughout the data pipeline and in production environments.
An open source python library for automated feature engineering
Pros of Featuretools
- Automated feature engineering capabilities
- Supports complex data transformations and time series analysis
- Integrates well with machine learning workflows
Cons of Featuretools
- Steeper learning curve for beginners
- May require more computational resources for large datasets
- Less focus on data visualization and reporting
Code Comparison
Sweetviz:
import sweetviz as sv
report = sv.analyze(df)
report.show_html()
Featuretools:
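import featuretools as ft
# es is an existing EntitySet; Featuretools 1.0+ renamed target_entity to target_dataframe_name
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="customers")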
Summary
Featuretools excels in automated feature engineering and complex data transformations, making it powerful for advanced machine learning tasks. However, it may be more challenging for beginners and resource-intensive. Sweetviz, on the other hand, focuses on quick and easy data analysis and visualization, making it more accessible for rapid insights but less suitable for in-depth feature engineering. The choice between the two depends on the specific needs of the project and the user's expertise level.
Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
Pros of mljar-supervised
- Offers automated machine learning capabilities, including model selection and hyperparameter tuning
- Supports a wide range of ML algorithms and provides explanations for model decisions
- Generates comprehensive HTML reports with detailed model performance metrics
Cons of mljar-supervised
- Requires more computational resources due to its automated ML approach
- May have a steeper learning curve for users new to machine learning concepts
- Less focused on data exploration and visualization compared to Sweetviz
Code Comparison
Sweetviz:
import sweetviz as sv
report = sv.analyze(df)
report.show_html()
mljar-supervised:
from supervised import AutoML
automl = AutoML()
automl.fit(X, y)
automl.report()
Summary
Sweetviz is primarily focused on data analysis and visualization, providing quick insights into datasets. mljar-supervised, on the other hand, is an automated machine learning tool that handles the entire ML pipeline, from data preprocessing to model selection and evaluation. While Sweetviz excels in exploratory data analysis, mljar-supervised is more suitable for users looking to quickly develop and deploy machine learning models with minimal manual intervention.
README
UPDATE (November 2023) - Version 2.3.0: Verbosity parameter added, long-standing issues fixed
In-depth EDA (target analysis, comparison, feature analysis, correlation) in two lines of code!
Sweetviz is an open-source Python library that generates beautiful, high-density visualizations to kickstart EDA (Exploratory Data Analysis) with just two lines of code. Output is a fully self-contained HTML application.
The system is built around quickly visualizing target values and comparing datasets. Its goal is to help quick analysis of target characteristics, training vs testing data, and other such data characterization tasks.
Usage and parameters are described below. You can also find an article describing its features in depth and see examples in action.
Sweetviz development is still ongoing! Please let me know if you run into any data, compatibility or install issues! Thank you for reporting any bugs in the issue tracking system, and I welcome your feedback and questions on usage/features in the GitHub "Discussions" tab!
Examples & mentions
Example HTML report using the Titanic dataset
Example Notebook w/docs on Colab (Jupyter/other notebooks should also work)
Medium Article describing its features in depth
Features
- Target analysis
- Shows how a target value (e.g. "Survived" in the Titanic dataset) relates to other features
- Visualize and compare
- Distinct datasets (e.g. training vs test data)
- Intra-set characteristics (e.g. male versus female)
- Mixed-type associations
- Sweetviz integrates associations for numerical (Pearson's correlation), categorical (uncertainty coefficient) and categorical-numerical (correlation ratio) datatypes seamlessly, to provide maximum information for all data types.
- Type inference
- Automatically detects numerical, categorical and text features, with optional manual overrides
- Summary information
- Type, unique values, missing values, duplicate rows, most frequent values
- Numerical analysis:
- min/max/range, quartiles, mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
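As a rough illustration of the numerical analysis listed above, the following sketch computes several of those statistics with Python's standard library only. It uses textbook population formulas, which is an assumption: Sweetviz's own implementation may use different conventions (for example sample rather than population estimates). The function name describe_numeric is hypothetical and not part of Sweetviz.

```python
import statistics


def describe_numeric(values):
    """Compute a few textbook summary statistics for a list of numbers.

    Assumes the data has a non-zero mean and a non-zero spread.
    """
    mean = statistics.fmean(values)
    median = statistics.median(values)
    std = statistics.pstdev(values)  # population standard deviation
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points

    # Median absolute deviation: the median distance from the median
    mad = statistics.median([abs(v - median) for v in values])

    # Third and fourth central moments, used for skewness and kurtosis
    m3 = statistics.fmean([(v - mean) ** 3 for v in values])
    m4 = statistics.fmean([(v - mean) ** 4 for v in values])

    return {
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "sum": sum(values),
        "mean": mean,
        "median": median,
        "mode": statistics.mode(values),
        "q1": q1,
        "q3": q3,
        "std": std,
        "mad": mad,
        "coefficient_of_variation": std / mean,
        "skewness": m3 / std ** 3,
        "kurtosis": m4 / std ** 4 - 3,  # excess kurtosis (0 for a normal distribution)
    }


if __name__ == "__main__":
    for name, value in describe_numeric([2, 4, 4, 4, 5, 5, 7, 9]).items():
        print(f"{name}: {value}")
```

Comparing such hand-computed values against a report can be a quick sanity check, keeping in mind that small differences may come from the choice of convention rather than from an error.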
New & notable
- Version 2.2: Big compatibility update for python 3.7+ and numpy versions
- Version 2.1: Comet.ml support
- Version 2.0: Jupyter, Colab & other notebook support, report scaling & vertical layout
(see below for docs on these features)
Upgrading
Some people have experienced mixed results when upgrading through pip. To update to the latest version from an existing install, it is recommended to run pip uninstall sweetviz first, then simply install again.
Installation
Sweetviz currently supports Python 3.6+ and Pandas 0.25.3+. Reports are output using the base "os" module, so custom environments such as Google Colab which require custom file operations are not yet supported, although I am looking into a solution.
Using pip
The best way to install sweetviz (other than from source) is to use pip:
pip install sweetviz
Installation issues & fixes
In some rare cases, users have reported errors such as ModuleNotFoundError: No module named 'sweetviz' and AttributeError: module 'sweetviz' has no attribute 'analyze'.
In those cases, we suggest the following:
- Make sure none of your scripts are named sweetviz.py, as that interferes with the library itself. Delete or rename that script (and any associated .pyc files), and try again.
- Try uninstalling the library using pip uninstall sweetviz, then reinstalling.
- The issue may stem from using multiple versions of Python, or from OS permissions. The following Stack Overflow articles have resolved many of these issues reported: Article 1, Article 2, Article 3
- If all else fails, post a bug issue here on GitHub. Thank you for taking the time, it may help resolve the issue for you and everyone else!
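To check the first point in the list above (a local script shadowing the library), it can help to ask Python where it would import sweetviz from. The following is a minimal sketch using only the standard library; the helper name locate_module is hypothetical.

```python
import importlib.util


def locate_module(name):
    """Return the file Python would import `name` from, or None if it cannot be found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    origin = locate_module("sweetviz")
    if origin is None:
        print("sweetviz is not installed for this interpreter")
    else:
        # A path ending in your own sweetviz.py (rather than a site-packages
        # folder) suggests that a local script is shadowing the library.
        print("sweetviz would be imported from:", origin)
```

Running this with the same interpreter that raises the error also helps spot the "multiple versions of Python" case, since each interpreter reports its own location.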
Basic Usage
Creating a report is a quick 2-line process:
- Create a DataframeReport object using one of: analyze(), compare() or compare_intra()
- Use a show_xxx() function to render the report. You can now use either html or notebook report options, as well as scaling (more info on these options below).
Step 1: Create the report
There are 3 main functions for creating reports:
- analyze(...)
- compare(...)
- compare_intra(...)
Analyzing a single dataframe (and its optional target feature)
To analyze a single dataframe, simply use the analyze(...) function, then the show_html(...) function:
import sweetviz as sv
my_report = sv.analyze(my_dataframe)
my_report.show_html() # Default arguments will generate to "SWEETVIZ_REPORT.html"
When run, this will output a 1080p widescreen HTML app in your default browser.
Optional arguments
The analyze() function can take multiple other arguments:
analyze(source: Union[pd.DataFrame, Tuple[pd.DataFrame, str]],
target_feat: str = None,
feat_cfg: FeatureConfig = None,
pairwise_analysis: str = 'auto',
verbosity: str = 'default'):
- source: Either the data frame (as in the example) or a tuple containing the data frame and a name to show in the report, e.g. my_df or [my_df, "Training"]
- target_feat: A string representing the name of the feature to be marked as "target". Only BOOLEAN and NUMERICAL features can be targets for now.
- feat_cfg: A FeatureConfig object representing features to be skipped, or to be forced a certain type in the analysis. The arguments can either be a single string or list of strings. Parameters are skip, force_cat, force_num and force_text. The "force_" arguments override the built-in type detection. They can be constructed as follows:
feature_config = sv.FeatureConfig(skip="PassengerId", force_text=["Age"])
- verbosity: [NEW] Can be set to full, progress_only (to only display the progress bar but not report generation messages) or off (fully quiet, except for errors or warnings). Default verbosity can also be set in the INI override, under the "General" heading (see "The Config file" section below for details).
- pairwise_analysis: Correlations and other associations can take quadratic time (n^2) to complete. The default setting ("auto") will run without warning until a data set contains "association_auto_threshold" features. Past that threshold, you need to explicitly pass the parameter pairwise_analysis="on" (or ="off") since processing that many features would take a long time. This parameter also covers the generation of the association graphs (based on Drazen Zaric's concept).
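To make the quadratic cost of pairwise analysis concrete, here is a small sketch that counts the feature pairs in an association matrix and checks a feature count against a threshold. Both helper names are hypothetical, the threshold of 30 is an illustrative value only (the real one lives in the "association_auto_threshold" config setting), and treating the boundary as inclusive is an assumption.

```python
def association_pairs(n_features):
    """Count ordered feature pairs, excluding each feature paired with itself.

    Ordered pairs are counted because some associations are asymmetrical,
    so (A, B) and (B, A) are separate cells in the matrix.
    """
    return n_features * (n_features - 1)


def needs_explicit_choice(n_features, auto_threshold):
    """Return True when pairwise_analysis should be passed explicitly.

    Assumption: the threshold is inclusive.
    """
    return n_features >= auto_threshold


if __name__ == "__main__":
    illustrative_threshold = 30  # hypothetical value, not the library default
    for n in (10, 50, 200):
        explicit = needs_explicit_choice(n, illustrative_threshold)
        print(f"{n} features: {association_pairs(n)} pairs, explicit on/off needed: {explicit}")
```

Going from 10 to 200 features multiplies the number of pairs by several hundred, which is why the library asks for an explicit "on" or "off" past the threshold.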
Comparing two dataframes (e.g. Test vs Training sets)
To compare two data sets, simply use the compare() function. Its parameters are the same as analyze(), except with an inserted second parameter to cover the comparison dataframe. It is recommended to use the [dataframe, "name"] format of parameters to better differentiate between the base and compared dataframes (e.g. [my_df, "Train"] vs my_df).
my_report = sv.compare([my_dataframe, "Training Data"], [test_df, "Test Data"], "Survived", feature_config)
Comparing two subsets of the same dataframe (e.g. Male vs Female)
Another way to get great insights is to use the comparison functionality to split your dataset into 2 sub-populations.
Support for this is built in through the compare_intra()
function. This function takes a boolean series as one of the arguments, as well as an explicit "name" tuple for naming the (true, false) resulting datasets. Note that internally, this creates 2 separate dataframes to represent each resulting group. As such, it is more of a shorthand function of doing such processing manually.
my_report = sv.compare_intra(my_dataframe, my_dataframe["Sex"] == "male", ["Male", "Female"], "Survived", feature_config)
Step 2: Show the report
Once you have created your report object (e.g. my_report in the examples above), simply call one of its two "show" functions:
show_html()
show_html( filepath='SWEETVIZ_REPORT.html',
open_browser=True,
layout='widescreen',
scale=None)
show_html(...) will create and save an HTML report at the given file path. There are options for:
- layout: Either 'widescreen' or 'vertical'. The widescreen layout displays details on the right side of the screen, as the mouse goes over each feature. The new (as of 2.0) vertical layout is more compact horizontally and enables expanding each detail area upon clicking.
- scale: Use a floating-point number (e.g. scale = 0.8) or None to scale the entire report. This is very useful to fit reports to any output.
- open_browser: Enables the automatic opening of a web browser to show the report. Since under some circumstances this is not desired (or causes issues with some IDEs), you can disable it here.
show_notebook()
show_notebook( w=None,
h=None,
scale=None,
layout='widescreen',
filepath=None,
file_layout=None,
file_scale=None)
show_notebook(...) is new as of 2.0 and will embed an IFRAME element showing the report right inside a notebook (e.g. Jupyter, Google Colab, etc.).
Note that since notebooks are generally a more constrained visual environment, it is probably a good idea to use custom width/height/scale values (w, h, scale) and even set custom default values in an INI override (see below). The options are:
- w (width): Sets the width of the output window for the report (the full report may not fit; use layout and/or scale for the report itself). Can be a percentage string (w="100%") or a number of pixels (w=900).
- h (height): Sets the height of the output window for the report. Can be a number of pixels (h=700) or "Full" to stretch the window to be as tall as all the features (h="Full").
- scale: Same as for show_html(), above.
- layout: Same as for show_html(), above.
- filepath: An OPTIONAL output HTML report.
- file_layout: Layout for the OPTIONAL file output ONLY (same as layout for show_html(), above)
- file_scale: Scale for the OPTIONAL file output ONLY (same as scale for show_html(), above)
Customizing defaults: the Config file
The package contains an INI file for configuration. You can override any setting by providing your own then calling this before creating a report:
sv.config_parser.read("Override.ini")
IMPORTANT #1: it is best to load overrides before any other command, as many of the INI options are used in the report generation.
IMPORTANT #2: always put the header line (e.g. [General]) before a set of values in your override INI file, otherwise your settings will be ignored. See examples below. If setting multiple values, only include the [General] line once.
Most useful config overrides
You can look into the file sweetviz_defaults.ini for what can be overridden (warning: much of it is a work in progress and not well documented), but the most useful overrides are as follows.
Default report layout, size
Override any of these (by putting them in your own INI, again do not forget the header), to avoid having to set them every time you do a "show" command:
Important: note the double '%' if specifying a percentage
[Output_Defaults]
html_layout = widescreen
html_scale = 1.0
notebook_layout = vertical
notebook_scale = 0.9
notebook_width = 100%%
notebook_height = 700
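The double '%' can be surprising, so here is a minimal sketch of why it is needed. It assumes the Sweetviz config parser behaves like Python's standard configparser.ConfigParser with default interpolation (the read() call shown earlier matches that API, but this is an assumption). The helper load_override is hypothetical and writes to a temporary file so the example is self-contained.

```python
import configparser
import tempfile
from pathlib import Path

OVERRIDE_INI = """\
[Output_Defaults]
notebook_layout = vertical
notebook_width = 100%%
notebook_height = 700
"""


def load_override(ini_text):
    """Write ini_text to a temporary Override.ini and read it back with ConfigParser."""
    parser = configparser.ConfigParser()
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = Path(tmp_dir) / "Override.ini"
        path.write_text(ini_text, encoding="utf-8")
        parser.read(path, encoding="utf-8")
    return parser


parser = load_override(OVERRIDE_INI)

# '%%' in the file is an escaped percent sign; the value read back has a single '%'
width = parser.get("Output_Defaults", "notebook_width")
height = parser.getint("Output_Defaults", "notebook_height")
print(width, height)
```

With the standard parser, a lone '%' in a value raises an interpolation error when the value is read, which is why percentages must be written as '%%'.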
Chinese, Japanese, Korean (CJK) character support
[General]
use_cjk_font = 1
*If setting multiple values for [General], only include the [General] line once.
Will switch the font in the graphs to use a CJK-compatible font. Although this font is not as compact, it will get rid of any warnings and "unknown character" symbols for these languages.
Remove Sweetviz logo
[Layout]
show_logo = 0
Will remove the Sweetviz logo from the top of the page.
Set default verbosity level
[General]
default_verbosity = off
*If setting multiple values for [General], only include the [General] line once.
Can be set to full, progress_only (to only display the progress bar but not report generation messages) or off (fully quiet, except for errors or warnings).
Correlation/Association analysis
A major source of insight and unique feature of Sweetviz' associations graph and analysis is that it unifies in a single graph (and detail views):
- Numerical correlation (between numerical features)
- Uncertainty coefficient (for categorical-categorical)
- Correlation ratio (for categorical-numerical)
Squares represent categorical-featured-related variables and circles represent numerical-numerical correlations. Note that the trivial diagonal is left empty, for clarity.
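For readers unfamiliar with the correlation ratio used for categorical-numerical pairs, the following sketch computes it from its textbook definition: the square root of the between-group variation divided by the total variation. This is an illustration only, not Sweetviz's own code, and the function name is hypothetical.

```python
import math
from collections import defaultdict


def correlation_ratio(categories, values):
    """Correlation ratio (eta) between a categorical and a numerical feature.

    Returns a value between 0 (group means are identical) and
    1 (the category fully determines the numerical value).
    """
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)

    overall_mean = sum(values) / len(values)

    # Variation explained by differences between the group means
    between = sum(
        len(group) * (sum(group) / len(group) - overall_mean) ** 2
        for group in groups.values()
    )
    # Total variation of the numerical feature
    total = sum((value - overall_mean) ** 2 for value in values)

    return 0.0 if total == 0 else math.sqrt(between / total)


if __name__ == "__main__":
    fares = [1, 1, 3, 3]
    print(correlation_ratio(["a", "a", "b", "b"], fares))  # category separates the values
    print(correlation_ratio(["a", "b", "a", "b"], fares))  # category carries no information
```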
IMPORTANT: categorical-categorical associations (provided by the SQUARES showing the uncertainty coefficient) are ASYMMETRICAL, meaning that each row represents how much the row title (on the left) gives information on each column. For example, "Sex", "Pclass" and "Fare" are the elements that give the most information on "Survived".
For the Titanic dataset, this information is rather symmetrical but it is not always the case!
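The asymmetry is easier to see on a toy example. The sketch below computes the uncertainty coefficient (Theil's U) from its textbook definition, U(X|Y) = (H(X) - H(X|Y)) / H(X), using only the standard library. It is an illustration and not Sweetviz's implementation; the function names are hypothetical, and returning 1.0 for a constant feature is a convention chosen here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(X) of a list of categorical labels, in nats."""
    n = len(labels)
    return -sum((count / n) * math.log(count / n) for count in Counter(labels).values())


def conditional_entropy(x, y):
    """Conditional entropy H(X | Y): uncertainty left in x once y is known."""
    n = len(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    return -sum(
        (count / n) * math.log(count / y_counts[y_value])
        for (_, y_value), count in xy_counts.items()
    )


def uncertainty_coefficient(x, y):
    """Fraction of the uncertainty in x that is removed by knowing y."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # convention: a constant feature has no uncertainty to explain
    return (h_x - conditional_entropy(x, y)) / h_x


if __name__ == "__main__":
    cabin = ["a", "b", "c", "d"]   # four distinct values
    deck = ["p", "p", "q", "q"]    # fully determined by cabin, but not the reverse
    print("cabin -> deck:", uncertainty_coefficient(deck, cabin))
    print("deck -> cabin:", uncertainty_coefficient(cabin, deck))
```

Here knowing the first feature fully determines the second, while the reverse only removes part of the uncertainty, so the two directions give different values.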
Correlations are also displayed in the detail section of each feature, with the target value highlighted when applicable.
Finally, it is worth noting these correlation/association methods shouldn't be taken as gospel as they make some assumptions on the underlying distribution of data and relationships. However they can be a very useful starting point.
Comet.ml integration
As of 2.1, Sweetviz fully integrates Comet.ml. This means Sweetviz will automatically log any reports generated using show_html() and show_notebook() to your workspace, as long as your API key is set up correctly in your environment.
Additionally, you can use the new function report.log_comet(experiment_object) to explicitly upload a report for a given experiment to your workspace.
You can see an example of a Colab notebook to generate the report, and its corresponding report in a Comet.ml workspace.
Comet report parameters
You can customize how the Sweetviz report looks in your Comet workspace by overriding the [comet_ml_defaults] section of the configuration file. See above for more information on using the INI override.
You can choose to use either the widescreen (horizontal) or vertical layouts, as well as set your preferred scale, by putting the following in your override INI file:
[comet_ml_defaults]
html_layout = vertical
html_scale = 0.85
Troubleshooting / FAQ
- Installation issues
Please see the "Installation issues & fixes" section at the top of this document
- Asian characters, "RuntimeWarning: Glyph ### missing from current font"
See section above regarding CJK characters support. If you find the need for additional character types, definitely post a request in the issue tracking system.
- ...any other issues
Development is ongoing so absolutely feel free to report any issues and/or suggestions in the issue tracking system here or in our forum (you should be able to log in with your Github account!)
Contribute
This is my first open-source project! I built it to be the most useful tool possible and help as many people as possible with their data science work. If it is useful to you, your contribution is more than welcome and can take many forms:
1. Spread the word!
A STAR here on GitHub, and a Twitter or Instagram post are the easiest contribution and can potentially help grow this project tremendously! If you find this project useful, these quick actions from you would mean a lot and could go a long way.
Kaggle notebooks/posts, Medium articles, YouTube video tutorials and other content take more time but will help all the more!
2. Report bugs & issues
I expect there to be many quirks once the project is used by more and more people with a variety of new (& "unclean") data. If you found a bug, please open a new issue here.
3. Suggest and discuss usage/features
To make Sweetviz as useful as possible we need to hear what you would like it to do, or what it could do better! Head over to our Discourse server and post your suggestions there; no login required!
4. Contribute to the development
I definitely welcome the help I can get on this project, simply get in touch on the issue tracker and/or our Discourse forum.
Please note that after a hectic development period, the code itself right now needs a bit of cleanup. :)
Special thanks & related materials
Contributors
A very special thanks to everyone who has contributed on GitHub, through reports, feedback and commits! I want to give a special shout out to Frank Male who has been of tremendous help for fixing issues and setting up the new build pipeline for 2.2.0.
Related materials
I want Sweetviz to be a hub of the best of what's out there, a way to get the most valuable information and visualization, without reinventing the wheel.
As such, I want to point out some of those great resources that were inspiring and integrated into Sweetviz:
- Pandas-Profiling was the original inspiration for this project. Some of its type-detection code was included in Sweetviz.
- Shaked Zychlinski: The Search for Categorical Correlation is a great article about different types of variable interactions that was the basis of that analysis in Sweetviz.
- Drazen Zaric: Better Heatmaps and Correlation Matrix Plots in Python was the basis for our association graphs.