Top Related Projects
- Apache Griffin: Mirror of Apache griffin
- Deequ: a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets
- Pandera: a light-weight, flexible, and expressive statistical data testing library
- Pydantic: data validation using Python type hints
Quick Overview
Great Expectations is an open-source Python library for data validation, documentation, and profiling. It helps data teams maintain data quality by allowing them to express what they expect from their data in a clear, human-readable format and then automatically validate those expectations.
Pros
- Flexible and extensible, supporting various data sources and expectation types
- Provides clear, human-readable documentation of data expectations
- Integrates well with existing data pipelines and workflows
- Offers data profiling capabilities to help discover and define expectations
Cons
- Steep learning curve for beginners
- Can be resource-intensive for large datasets
- Limited support for real-time data validation scenarios
- Some users report occasional inconsistencies in documentation
Code Examples
- Creating a simple expectation:

```python
import great_expectations as ge

# Legacy (pre-1.0) pandas workflow: ge.read_csv returns a PandasDataset
# whose expect_* methods validate immediately and return a result object.
df = ge.read_csv("your_data.csv")
result = df.expect_column_values_to_be_between("age", min_value=0, max_value=120)
print(result.success)
```
- Validating a dataset against expectations:

```python
import great_expectations as ge

df = ge.read_csv("your_data.csv")

# Build an expectation suite programmatically (legacy pre-1.0 API).
expectation_suite = ge.core.ExpectationSuite(expectation_suite_name="my_suite")
expectation_suite.add_expectation(
    ge.core.ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_between",
        kwargs={
            "column": "age",
            "min_value": 0,
            "max_value": 120,
        },
    )
)

validation_result = df.validate(expectation_suite=expectation_suite)
print(validation_result.success)
```
- Generating a data documentation site:

```python
import great_expectations as ge

# Legacy DataContext workflow; assumes a project created with
# `great_expectations init` and a configured datasource named "my_datasource".
context = ge.data_context.DataContext()
suite = context.create_expectation_suite("my_suite")
batch = context.get_batch(
    {"path": "your_data.csv", "datasource": "my_datasource"}, suite
)
results = context.run_validation_operator(
    "action_list_operator", assets_to_validate=[batch]
)
context.build_data_docs()
```
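- Profiling a dataset to bootstrap expectations. The profiling capability mentioned in the pros above can generate a candidate suite from the data itself; this is a minimal sketch using the legacy BasicDatasetProfiler, and newer GX versions handle profiling differently:

```python
import great_expectations as ge
from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler

# Profile the dataset to produce a draft expectation suite plus the
# validation results observed while profiling (legacy pre-1.0 API).
df = ge.read_csv("your_data.csv")
expectation_suite, validation_result = BasicDatasetProfiler.profile(df)
print(len(expectation_suite.expectations))
```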
Getting Started
To get started with Great Expectations:
- Install the library:

```bash
pip install great_expectations
```

- Initialize a new Great Expectations project:

```bash
great_expectations init
```
- Connect to your data source and create expectations:

```python
import great_expectations as gx

# Fluent datasource workflow (GX 0.16/0.17): pandas_default.read_csv
# returns a Validator bound to a default expectation suite.
context = gx.get_context()
validator = context.sources.pandas_default.read_csv("your_data.csv")
validator.expect_column_values_to_be_between("age", min_value=0, max_value=120)
validator.save_expectation_suite(discard_failed_expectations=False)
```
- Validate your data and generate documentation:

```python
# Create and run a Checkpoint, then rebuild the Data Docs site.
checkpoint = context.add_or_update_checkpoint(
    name="my_checkpoint", validator=validator
)
checkpoint_result = checkpoint.run()
context.build_data_docs()
```
Competitor Comparisons
Apache Griffin: Mirror of Apache griffin
Pros of Griffin
- Designed for big data environments, integrating well with Apache Hadoop and Spark ecosystems
- Supports both batch and streaming data quality checks
- Provides a web UI for easier management and visualization of data quality metrics
Cons of Griffin
- Less active community and development compared to Great Expectations
- More complex setup and configuration, especially for non-big data environments
- Limited documentation and examples compared to Great Expectations
Code Comparison
Griffin (Scala):

```scala
// Sketch of a Griffin batch data-quality definition; dataSources,
// evaluateRule, and sinks are configured elsewhere.
val dqDef = DQConfig(
  "sample_dq",
  ProcessType.BatchProcessing,
  dataSources,
  evaluateRule,
  sinks
)
```
Great Expectations (Python):

```python
# Assumes an existing Data Context and a batch_request for the data to test.
expectation_suite = context.create_expectation_suite("my_suite")
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite=expectation_suite,
)
validator.expect_column_values_to_be_between("column_name", min_value=0, max_value=100)
```
Griffin focuses on defining data quality configurations using Scala, while Great Expectations uses a more intuitive Python API for defining expectations. Great Expectations provides a more user-friendly approach to defining data quality checks, whereas Griffin's approach is more suited for big data environments and integration with Apache ecosystem tools.
Deequ: a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets
Pros of Deequ
- Built on Apache Spark, offering better performance for large-scale data processing
- Provides more advanced statistical metrics and anomaly detection capabilities
- Integrates seamlessly with AWS ecosystem and services
Cons of Deequ
- Limited to Scala and Java, while Great Expectations supports Python
- Less extensive documentation and community support
- Narrower focus on data quality, lacking some features like data profiling and documentation generation
Code Comparison
Deequ:

```scala
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel}

// Run completeness and row-count checks on a Spark DataFrame `df`.
val verificationResult = VerificationSuite()
  .onData(df)
  .addCheck(
    Check(CheckLevel.Error, "Data Quality Check")
      .isComplete("id")
      .hasSize(_ >= 1000)
  )
  .run()
```
Great Expectations:

```python
# Equivalent checks: "id" must be non-null and the table must have
# at least 1,000 rows.
expectation_suite = context.create_expectation_suite("my_suite")
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite=expectation_suite,
)
validator.expect_column_values_to_not_be_null("id")
validator.expect_table_row_count_to_be_between(min_value=1000, max_value=None)
```
Both libraries offer data quality validation, but Deequ's syntax is more concise and Spark-oriented, while Great Expectations provides a more Pythonic and readable approach.
Pandera: a light-weight, flexible, and expressive statistical data testing library
Pros of Pandera
- Lightweight and focused on pandas DataFrame validation
- Seamless integration with type hints and static analysis tools
- Supports statistical hypothesis testing for data validation
Cons of Pandera
- Less comprehensive ecosystem compared to Great Expectations
- Limited support for non-pandas data structures
- Fewer built-in data quality checks and expectations
Code Comparison
Pandera:

```python
import pandera as pa

# Declare column types and value checks, then validate a DataFrame.
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, checks=pa.Check.greater_than(0)),
    "column2": pa.Column(str, checks=pa.Check.str_length(1, 100)),
})

validated_df = schema.validate(df)  # raises SchemaError on failure
```
Great Expectations:

```python
import great_expectations as ge

# Wrap an existing pandas DataFrame in the legacy PandasDataset API.
my_df = ge.dataset.PandasDataset(df)
my_df.expect_column_values_to_be_between("column1", min_value=0, max_value=None)
my_df.expect_column_value_lengths_to_be_between("column2", min_value=1, max_value=100)
```
Both libraries offer data validation capabilities, but Pandera focuses on pandas DataFrames with a more concise syntax, while Great Expectations provides a broader range of features and supports multiple data sources.
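The hypothesis-testing support noted in the pros above deserves a quick illustration. This is a minimal sketch using Pandera's built-in two-sample t-test check (it requires scipy, installable via the pandera[hypotheses] extra); the column and group names are illustrative:

```python
import pandera as pa

# Assert that the mean height of group "M" is greater than that of
# group "F" at a 5% significance level.
schema = pa.DataFrameSchema({
    "height": pa.Column(
        float,
        checks=pa.Hypothesis.two_sample_ttest(
            sample1="M",
            sample2="F",
            groupby="sex",
            relationship="greater_than",
            alpha=0.05,
        ),
    ),
    "sex": pa.Column(str, checks=pa.Check.isin(["M", "F"])),
})
```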
Pydantic: data validation using Python type hints
Pros of Pydantic
- Lightweight and fast data validation and settings management
- Seamless integration with Python type hints
- Extensive support for JSON Schema generation and validation
Cons of Pydantic
- Limited scope compared to Great Expectations' comprehensive data quality features
- Less focus on data profiling and expectation suite management
- Primarily designed for individual object validation rather than large datasets
Code Comparison
Pydantic:

```python
from pydantic import BaseModel, Field

class User(BaseModel):
    id: int
    name: str = Field(..., min_length=3)
    email: str
```
Great Expectations:

```python
import great_expectations as ge

# Legacy pandas workflow: dataset-level checks across all rows.
df = ge.read_csv("users.csv")
df.expect_column_values_to_be_unique("id")
df.expect_column_values_to_match_regex("email", r"[^@]+@[^@]+\.[^@]+")
```
Pydantic excels in object-level validation and schema definition, while Great Expectations focuses on dataset-level validation and quality checks. Pydantic is more suitable for API input validation and configuration management, whereas Great Expectations is better suited for data pipeline validation and quality assurance in data engineering workflows.
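The JSON Schema support mentioned in the pros above is a one-liner. A minimal sketch, assuming Pydantic v2 (v1 exposed the same idea as User.schema()):

```python
from pydantic import BaseModel, Field

class User(BaseModel):
    id: int
    name: str = Field(..., min_length=3)
    email: str

# Emit a JSON Schema document describing the model's fields and constraints.
print(User.model_json_schema())
```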
README
About GX Core
GX Core is the engine of the GX platform. It combines the collective wisdom of thousands of community members with a proven track record in data quality deployments worldwide, wrapped into a super-simple package for data teams.
Its powerful technical tools start with Expectations: expressive and extensible unit tests for your data. Expectations foster collaboration by giving teams a common language to express data quality tests in an intuitive way. You can automatically generate documentation for each set of validation results, making it easy for everyone to stay on the same page. This not only simplifies your data quality processes, but helps preserve your organization's institutional knowledge about its data.
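To make this concrete, here is a minimal sketch of defining and running one Expectation against an in-memory DataFrame with the GX Core 1.x API; the data source, asset, and batch definition names are illustrative placeholders:

```python
import pandas as pd
import great_expectations as gx

context = gx.get_context()

# Register an in-memory pandas data source, a DataFrame asset, and a
# batch definition covering the whole DataFrame.
data_source = context.data_sources.add_pandas(name="pandas")
asset = data_source.add_dataframe_asset(name="users")
batch_def = asset.add_batch_definition_whole_dataframe("all_rows")

# Fetch a batch for a sample DataFrame and validate one Expectation.
df = pd.DataFrame({"age": [25, 40, 67]})
batch = batch_def.get_batch(batch_parameters={"dataframe": df})

expectation = gx.expectations.ExpectColumnValuesToBeBetween(
    column="age", min_value=0, max_value=120
)
print(batch.validate(expectation).success)
```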
Learn more about how data teams are using GX Core in our featured case studies.
Integration support policy
GX Core supports Python 3.8 through 3.11. Experimental support for Python 3.12 and later can be enabled by setting a GX_PYTHON_EXPERIMENTAL environment variable when installing great_expectations.
For data sources and other integrations that GX supports, see GX integration support policy for additional information.
Get started
GX recommends deploying GX Core within a virtual environment. For more information about getting started with GX Core, see Get started with Great Expectations.
- Run the following command in an empty base directory inside a Python virtual environment to install GX Core:

```bash
pip install great_expectations
```

- Run the following command to import the great_expectations module and create a Data Context:

```python
import great_expectations as gx

context = gx.get_context()
```
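From here, a typical next step is to attach Expectations to a suite and run them as a group. This is a hedged sketch of that flow in GX Core 1.x, reusing the batch definition and DataFrame from the sketch in the About section; the suite and validation names are placeholders:

```python
# Collect Expectations into a named suite.
suite = context.suites.add(gx.ExpectationSuite(name="my_suite"))
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="age", min_value=0, max_value=120
    )
)

# Bind the suite to a batch definition and run the validation.
validation_def = context.validation_definitions.add(
    gx.ValidationDefinition(name="my_validation", data=batch_def, suite=suite)
)
result = validation_def.run(batch_parameters={"dataframe": df})
print(result.success)
```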
Get support from GX and the community
Support channels are listed in the order in which GX prioritizes issues:
- Issues and PRs in the GX GitHub repository
- Questions posted to the GX Core Discourse forum
- Questions posted to the GX Slack community channel
Contribute
We deeply value the contributions of our community. We're now accepting PRs for bug fixes.
To ensure the long-term quality of the GX Core codebase, we're not yet ready to accept feature contributions to the parts of the codebase that don't have clear APIs for extensions. We're actively working to increase the surface area for contributions. Thank you for being a crucial part of GX's data quality platform!
Levels of contribution readiness
🟢 Ready. Have a clear and public API for extensions.
🟡 Partially ready. Case-by-case.
🔴 Not ready. Will accept contributions that fix existing bugs or workflows.
| GX Component | Readiness | Notes |
|---|---|---|
| Action | 🟢 Ready | |
| CredentialStore | 🟢 Ready | |
| BatchDefinition | 🟡 Partially ready | Formerly known as splitters |
| DataSource | 🔴 Not ready | Includes MetricProvider and ExecutionEngine |
| DataContext | 🔴 Not ready | Also known as Configuration Stores |
| DataAsset | 🔴 Not ready | |
| Expectation | 🔴 Not ready | |
| ValidationDefinition | 🔴 Not ready | |
| Checkpoint | 🔴 Not ready | |
| CustomExpectations | 🔴 Not ready | |
| Data Docs | 🔴 Not ready | Also known as Renderers |
Code of conduct
Everyone interacting in GX Core project codebases, Discourse forums, Slack channels, and email communications is expected to adhere to the GX Community Code of Conduct.