pyjanitor

Clean APIs for data cleaning. Python implementation of R package Janitor



Quick Overview

pyjanitor is a Python library for data cleaning and transformation, inspired by the R package 'janitor'. It extends pandas with a suite of easy-to-use data cleaning functions, aiming to simplify and streamline the data preparation process. pyjanitor is designed to make data cleaning more intuitive and less time-consuming for data scientists and analysts.

Pros

Simplifies common data cleaning tasks with intuitive function names
Integrates seamlessly with pandas, enhancing its functionality
Supports method chaining for cleaner and more readable code
Regularly updated with new features and improvements

Cons

Requires familiarity with pandas for optimal use
Some functions may have performance overhead compared to native pandas operations
Documentation could be more comprehensive for advanced use cases
Limited support for non-tabular data structures

Code Examples

Example 1: Basic data cleaning

import pandas as pd
import janitor

df = pd.DataFrame(...)
cleaned_df = (
    df
    .clean_names()
    .remove_empty()
    .drop_duplicates()
    .encode_categorical()
)

Example 2: Filtering and transforming data

import pandas as pd
import janitor

df = pd.DataFrame(...)
result = (
    df
    .filter_on('column_name > 5')
    .transform_column('date_column', pd.to_datetime)
    .rename_column('old_name', 'new_name')
)

Example 3: Creating new columns based on existing ones

import pandas as pd
import janitor

df = pd.DataFrame(...)
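result = (
    df
    # pandas' assign() evaluates each callable against the DataFrame,
    # so the new columns are computed from the existing ones
    .assign(
        full_name=lambda x: x['first_name'] + ' ' + x['last_name'],
        bmi=lambda x: x['weight'] / (x['height'] ** 2),
    )
)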

Getting Started

To get started with pyjanitor, first install it using pip:

pip install pyjanitor

Then, import it in your Python script along with pandas:

import pandas as pd
import janitor

# Now you can use pyjanitor functions on your pandas DataFrames
df = pd.DataFrame(...)
cleaned_df = df.clean_names().remove_empty()

pyjanitor functions can be chained with regular pandas methods, allowing for flexible and readable data cleaning pipelines.

Competitor Comparisons

pandas

46,172

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

Pros of pandas

Extensive functionality for data manipulation and analysis
Large, active community with frequent updates and improvements
Robust documentation and widespread adoption in data science

Cons of pandas

Steeper learning curve for beginners
Can be memory-intensive for large datasets
Some operations can be verbose or require multiple steps

Code Comparison

pandas:

import pandas as pd

df = pd.read_csv('data.csv')
df = df.dropna()
df['new_column'] = df['column_a'] + df['column_b']
df = df.groupby('category').agg({'value': 'mean'})

pyjanitor:

import janitor
import pandas as pd

df = (
    pd.read_csv('data.csv')
    .clean_names()
    .remove_empty()
    .assign(new_column=lambda x: x['column_a'] + x['column_b'])
    .groupby('category')
    .agg({'value': 'mean'})
)

pyjanitor builds on pandas, offering a more intuitive and readable API for common data cleaning and manipulation tasks. It simplifies many operations that would require multiple steps in pandas, making code more concise and easier to understand. However, pandas remains the more comprehensive and widely-used library, with a broader range of features and better performance for complex operations on large datasets.

dplyr

4,895

dplyr: A grammar of data manipulation

Pros of dplyr

More comprehensive data manipulation toolkit
Tighter integration with other tidyverse packages
Larger community and more extensive documentation

Cons of dplyr

Limited to R programming language
Steeper learning curve for beginners
Less focus on data cleaning operations

Code Comparison

dplyr:

library(dplyr)

df %>%
  filter(age > 18) %>%
  group_by(city) %>%
  summarize(avg_income = mean(income))

pyjanitor:

import pandas as pd
import janitor

df.clean_names() \
  .query("age > 18") \
  .groupby("city") \
  .agg({"income": "mean"})

Both libraries offer efficient data manipulation capabilities, but dplyr provides a more extensive set of functions for data transformation in R, while pyjanitor focuses on data cleaning operations in Python, building upon pandas functionality. dplyr's syntax is often considered more intuitive for complex data operations, but pyjanitor's method chaining approach can be more familiar to Python users. The choice between the two largely depends on the preferred programming language and specific project requirements.

plotly.py

17,541

The interactive graphing library for Python :sparkles:

Pros of plotly.py

Powerful interactive data visualization library with a wide range of chart types
Supports both web-based and static output formats
Large and active community with extensive documentation

Cons of plotly.py

Steeper learning curve compared to simpler plotting libraries
Can be slower for rendering large datasets
Requires additional setup for certain features (e.g., offline mode)

Code Comparison

plotly.py:

import plotly.graph_objects as go

fig = go.Figure(data=go.Bar(y=[2, 3, 1]))
fig.show()

pyjanitor:

import pandas as pd
import janitor

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df.clean_names().remove_empty()

Summary

plotly.py is a comprehensive data visualization library focused on creating interactive and publication-quality charts. It offers a wide range of customization options and supports various output formats. pyjanitor, on the other hand, is primarily a data cleaning and manipulation library that extends pandas functionality. While plotly.py excels in creating complex visualizations, pyjanitor simplifies data preprocessing tasks. The choice between the two depends on the specific needs of your data analysis pipeline.

seaborn

13,336

Statistical data visualization in Python

Pros of seaborn

More comprehensive statistical visualization library
Built on matplotlib, offering a higher-level interface for complex plots
Extensive documentation and examples for various plot types

Cons of seaborn

Focused primarily on visualization, less on data cleaning
Steeper learning curve for users new to data visualization
Less flexibility for custom data transformations

Code Comparison

seaborn:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")  # seaborn's example dataset, fetched on first use
sns.scatterplot(x="total_bill", y="tip", data=tips)
plt.show()

pyjanitor:

import janitor
import pandas as pd

df = pd.DataFrame(...)
df.clean_names().remove_empty().dropna()

Summary

seaborn excels in statistical visualization with a wide range of plot types and built-in themes. It's ideal for creating publication-quality figures but may require more setup for data preprocessing. pyjanitor, on the other hand, focuses on data cleaning and transformation, offering a more intuitive API for common data wrangling tasks. While seaborn provides powerful visualization capabilities, pyjanitor simplifies the data preparation process, making it a valuable tool for data scientists working with messy datasets.

word_cloud

10,427

A little word cloud generator in Python

Pros of word_cloud

Specialized tool for generating word clouds, offering a focused and efficient solution
Provides customization options for word cloud appearance, including colors and shapes
Easy to use with simple API for quick word cloud generation

Cons of word_cloud

Limited in scope compared to pyjanitor's broader data cleaning capabilities
Less frequent updates and maintenance compared to pyjanitor
Fewer contributors and community engagement

Code Comparison

word_cloud:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# text is assumed to be a string holding the source document
wordcloud = WordCloud().generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

pyjanitor:

import janitor
import pandas as pd

df = pd.DataFrame(...)
df = (
    df.clean_names()
    .remove_empty()
    .dropna(subset=['column_name'])
)

The code snippets demonstrate the different focuses of the libraries. word_cloud is specialized for creating word clouds, while pyjanitor offers a range of data cleaning and manipulation functions for pandas DataFrames.

plotnine

4,269

A Grammar of Graphics for Python

Pros of plotnine

Implements the Grammar of Graphics, providing a powerful and flexible approach to data visualization
Offers a wide range of geoms and statistical transformations for creating complex plots
Integrates well with pandas DataFrames, making it easy to work with data in Python

Cons of plotnine

May have a steeper learning curve for users not familiar with the Grammar of Graphics concept
Can be slower for rendering large datasets compared to some other plotting libraries
Less extensive documentation and community support compared to more established libraries

Code Comparison

plotnine:

from plotnine import ggplot, aes, geom_point
ggplot(data, aes(x='x', y='y')) + geom_point()

pyjanitor:

import janitor
import pandas as pd
df = pd.DataFrame(data).clean_names().remove_empty()

While plotnine focuses on data visualization, pyjanitor is primarily used for data cleaning and preprocessing. plotnine provides a declarative approach to creating plots, while pyjanitor offers methods for data manipulation and cleaning within the pandas ecosystem.


README

pyjanitor

pyjanitor is a Python implementation of the R package janitor, and provides a clean API for cleaning data.

Quick start

Installation: conda install -c conda-forge pyjanitor. Read more installation instructions here.
Check out the collection of general functions.

Why janitor?

Originally a port of the R package, pyjanitor has evolved from a set of convenient data cleaning routines into an experiment with the method chaining paradigm.

Data preprocessing usually consists of a series of steps that transform raw data into an understandable, usable format. These steps must run in a particular sequence to succeed. We take a base data file as the starting point and perform actions on it, such as removing null/empty rows, replacing them with other values, adding/renaming/removing columns of data, filtering rows, and others. More formally, these steps, along with their relationships and dependencies, are commonly referred to as a Directed Acyclic Graph (DAG).

The pandas API has been invaluable for the Python data science ecosystem, and implements method chaining of a subset of methods as part of the API. For example, resetting indexes (.reset_index()), dropping null values (.dropna()), and more, are accomplished via the appropriate pd.DataFrame method calls.
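
A minimal sketch of this native chaining, using only pandas; the data is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
df = pd.DataFrame(
    {"city": ["Oslo", "Rome", "Lima"], "temp": [3.0, np.nan, 19.0]}
)

chained = (
    df
    .dropna()                  # drop the row with the missing temperature
    .reset_index(drop=True)    # renumber the remaining rows from 0
)

print(chained)
```

Both methods return a new DataFrame rather than modifying df in place, which is what makes the chain possible.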

Inspired by the ease-of-use and expressiveness of the dplyr package of the R statistical language ecosystem, we have evolved pyjanitor into a language for expressing the data processing DAG for pandas users.

To accomplish this, actions that would otherwise require imperative-style statements can be replaced with method chains that let one read off the logical order of actions taken. Let us see the annotated example below. First off, here is the textual description of a data cleaning pathway:

Create a DataFrame.
Delete one column.
Drop rows with empty values in two particular columns.
Rename another two columns.
Add a new column.

Let's import some libraries and begin with some sample data for this example:

# Libraries
import numpy as np
import pandas as pd
import janitor

# Sample Data curated for this example
company_sales = {
    'SalesMonth': ['Jan', 'Feb', 'Mar', 'April'],
    'Company1': [150.0, 200.0, 300.0, 400.0],
    'Company2': [180.0, 250.0, np.nan, 500.0],
    'Company3': [400.0, 500.0, 600.0, 675.0]
}

In pandas code, most users might type something like this:

# The Pandas Way

# 1. Create a pandas DataFrame from the company_sales dictionary
df = pd.DataFrame.from_dict(company_sales)

# 2. Delete a column from the DataFrame. Say 'Company1'
del df['Company1']

# 3. Drop rows that have empty values in columns 'Company2' and 'Company3'
df = df.dropna(subset=['Company2', 'Company3'])

# 4. Rename 'Company2' to 'Amazon' and 'Company3' to 'Facebook'
df = df.rename(
    {
        'Company2': 'Amazon',
        'Company3': 'Facebook',
    },
    axis=1,
)

# 5. Let's add some data for another company. Say 'Google'
df['Google'] = [450.0, 550.0, 800.0]

# Output looks like this:
# Out[15]:
#   SalesMonth  Amazon  Facebook  Google
# 0        Jan   180.0     400.0   450.0
# 1        Feb   250.0     500.0   550.0
# 3      April   500.0     675.0   800.0

Slightly more advanced users might take advantage of pandas' own method chaining:

df = (
    pd.DataFrame(company_sales)
    .drop(columns="Company1")
    .dropna(subset=["Company2", "Company3"])
    .rename(columns={"Company2": "Amazon", "Company3": "Facebook"})
    .assign(Google=[450.0, 550.0, 800.0])
)

# The output is the same as before, and looks like this:
# Out[15]:
#   SalesMonth  Amazon  Facebook  Google
# 0        Jan   180.0     400.0   450.0
# 1        Feb   250.0     500.0   550.0
# 3      April   500.0     675.0   800.0

With pyjanitor, we enable method chaining with method names that are explicitly named verbs, which describe the action taken.

df = (
    pd.DataFrame.from_dict(company_sales)
    .remove_columns(["Company1"])
    .dropna(subset=["Company2", "Company3"])
    .rename_column("Company2", "Amazon")
    .rename_column("Company3", "Facebook")
    .add_column("Google", [450.0, 550.0, 800.0])
)

# Output looks like this:
# Out[15]:
#   SalesMonth  Amazon  Facebook  Google
# 0        Jan   180.0     400.0   450.0
# 1        Feb   250.0     500.0   550.0
# 3      April   500.0     675.0   800.0

As such, pyjanitor's etymology has a two-fold relationship to "cleanliness". Firstly, it's about extending pandas with convenient data cleaning routines. Secondly, it's about providing a cleaner, method-chaining, verb-based API for common pandas routines.

Installation

pyjanitor is currently installable from PyPI:

pip install pyjanitor

pyjanitor can also be installed with the conda package manager:

conda install pyjanitor -c conda-forge

pyjanitor can be installed by the pipenv environment manager too. This requires enabling prerelease dependencies:

pipenv install --pre pyjanitor

pyjanitor requires Python 3.6+.

Functionality

Current functionality includes:

Cleaning column names (multi-indexes are possible!)
Removing empty rows and columns
Identifying duplicate entries
Encoding columns as categorical
Splitting your data into features and targets (for machine learning)
Adding, removing, and renaming columns
Coalescing multiple columns into a single column
Converting dates (from MATLAB, Excel, Unix) to Python datetime format
Expanding a single column that has delimited, categorical values into dummy-encoded variables
Concatenating and deconcatenating columns, based on a delimiter
Syntactic sugar for filtering the dataframe based on queries on a column
Experimental submodules for finance, biology, chemistry, engineering, and pyspark

API

The idea behind the API is two-fold:

Copy the R package function names, but enable Pythonic use with method chaining or pandas piping.
Add other utility functions that make it easy to do data cleaning/preprocessing in pandas.

Continuing with the company_sales dataframe previously used:

import pandas as pd
import numpy as np
company_sales = {
    'SalesMonth': ['Jan', 'Feb', 'Mar', 'April'],
    'Company1': [150.0, 200.0, 300.0, 400.0],
    'Company2': [180.0, 250.0, np.nan, 500.0],
    'Company3': [400.0, 500.0, 600.0, 675.0]
}

There are three ways to use the API. The first, and most strongly recommended one, is to use pyjanitor's functions as if they were native to pandas.

import janitor  # upon import, functions are registered as part of pandas.

# This cleans the column names and removes any completely empty rows and columns
df = pd.DataFrame.from_dict(company_sales).clean_names().remove_empty()

The second is the functional API.

from janitor import clean_names, remove_empty

df = pd.DataFrame.from_dict(company_sales)
df = clean_names(df)
df = remove_empty(df)

The final way is to use the pipe() method:

from janitor import clean_names, remove_empty
df = (
    pd.DataFrame.from_dict(company_sales)
    .pipe(clean_names)
    .pipe(remove_empty)
)
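
As a sketch of how pipe() relates to a direct function call: pandas passes the DataFrame as the first argument and forwards any further arguments. The helper scale_column below is hypothetical and not part of pyjanitor; it only stands in for a function that takes a DataFrame first.

```python
import pandas as pd


def scale_column(df, column_name, factor=1.0):
    """Hypothetical helper: return a copy with one column multiplied by factor."""
    out = df.copy()
    out[column_name] = out[column_name] * factor
    return out


sales = pd.DataFrame({"month": ["Jan", "Feb"], "revenue": [150.0, 200.0]})

# Direct call and pipe() call are equivalent.
via_call = scale_column(sales, "revenue", factor=2.0)
via_pipe = sales.pipe(scale_column, "revenue", factor=2.0)

print(via_pipe)
```

This is why any function that accepts a DataFrame as its first argument and returns a DataFrame can take part in a chain, whether or not it has been registered as a method.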

Contributing

Follow the development guide for a full description of the process of contributing to pyjanitor.

Adding new functionality

In keeping with the etymology of pyjanitor, contributing a new function is not difficult at all.

Define a function

First off, you will need to define the function that expresses the data processing/cleaning routine, such that it accepts a dataframe as the first argument, and returns a modified dataframe:

import pandas_flavor as pf

@pf.register_dataframe_method
def my_data_cleaning_function(df, arg1, arg2):  # extend with further arguments as needed
    # Put data processing function here.
    return df

We use pandas_flavor to register the function natively on a pandas.DataFrame.

Add a test case

Secondly, we ask that you contribute a test case, to ensure that the function works as intended. Follow the contribution docs for further details.
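
A test case might look roughly like the sketch below. The function strip_whitespace and the test name are hypothetical, and the layout is only one plausible pytest-style arrangement; the contribution docs remain the authority on the project's actual conventions.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def strip_whitespace(df, column_name):
    """Hypothetical cleaning function under test."""
    out = df.copy()
    out[column_name] = out[column_name].str.strip()
    return out


def test_strip_whitespace():
    df = pd.DataFrame({"name": ["  Ada ", "Grace"]})
    expected = pd.DataFrame({"name": ["Ada", "Grace"]})

    result = strip_whitespace(df, "name")

    # The cleaned frame matches the expected frame exactly.
    assert_frame_equal(result, expected)
    # The input frame is left untouched.
    assert df["name"].tolist() == ["  Ada ", "Grace"]
```

Checking both the returned frame and the untouched input guards against a function that silently mutates its argument.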

Feature requests

If you have a feature request, please post it as an issue on the GitHub repository issue tracker. Even better, put in a PR for it! We are more than happy to guide you through the codebase so that you can make a contribution.

Because pyjanitor is currently maintained by volunteers and has no fiscal support, any feature requests will be prioritized according to what maintainers encounter as a need in our day-to-day jobs. Please temper expectations accordingly.

API Policy

pyjanitor only extends or aliases the pandas API (and other dataframe APIs), but will never fix or replace them.

Undesirable pandas behaviour should be reported upstream in the pandas issue tracker. We explicitly do not fix the pandas API. If at some point the pandas devs decide to take something from pyjanitor and internalize it as part of the official pandas API, then we will deprecate it from pyjanitor, while acknowledging the original contributors' contribution as part of the official deprecation record.

Credits

Test data for chemistry submodule can be found at Predictive Toxicology.
