
great-expectations/great_expectations

Always know what to expect from your data.


Top Related Projects

  • Apache Griffin (1,123 stars): Mirror of Apache griffin
  • Deequ (3,244 stars): a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets
  • Pandera (3,245 stars): a light-weight, flexible, and expressive statistical data testing library
  • Pydantic (20,398 stars): data validation using Python type hints

Quick Overview

Great Expectations is an open-source Python library for data validation, documentation, and profiling. It helps data teams maintain data quality by allowing them to express what they expect from their data in a clear, human-readable format and then automatically validate those expectations.

Pros

  • Flexible and extensible, supporting various data sources and expectation types
  • Provides clear, human-readable documentation of data expectations
  • Integrates well with existing data pipelines and workflows
  • Offers data profiling capabilities to help discover and define expectations

Cons

  • Steep learning curve for beginners
  • Can be resource-intensive for large datasets
  • Limited support for real-time data validation scenarios
  • Some users report occasional inconsistencies in documentation

Code Examples

  1. Creating a simple expectation:
import great_expectations as ge

df = ge.read_csv("your_data.csv")
df.expect_column_values_to_be_between("age", min_value=0, max_value=120)
  2. Validating a dataset against an expectation suite (reusing df from the previous example):
expectation_suite = ge.core.ExpectationSuite(expectation_suite_name="my_suite")
expectation_suite.add_expectation(
    ge.core.ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_between",
        kwargs={
            "column": "age",
            "min_value": 0,
            "max_value": 120
        }
    )
)

# Validate the dataset against the suite and check the overall result
validation_result = df.validate(expectation_suite=expectation_suite)
print(validation_result.success)
  3. Generating a data documentation site:
context = ge.data_context.DataContext()
suite = context.create_expectation_suite("my_suite")

# Legacy DataContext API: load a batch for the suite, run the validation operator, and build Data Docs
batch = context.get_batch({"path": "your_data.csv", "datasource": "my_datasource"}, suite)
results = context.run_validation_operator("action_list_operator", assets_to_validate=[batch])
context.build_data_docs()

Getting Started

To get started with Great Expectations:

  1. Install the library:
pip install great_expectations
  2. Initialize a new Great Expectations project:
great_expectations init
  3. Connect to your data source and create expectations:
import great_expectations as ge

context = ge.get_context()

# The default pandas datasource reads the CSV and returns a Validator
validator = context.sources.pandas_default.read_csv("your_data.csv")
validator.expect_column_values_to_be_between("age", min_value=0, max_value=120)
validator.save_expectation_suite()
  4. Validate your data and generate documentation:
checkpoint = context.add_or_update_checkpoint(name="my_checkpoint", validator=validator)
checkpoint_result = checkpoint.run()
context.build_data_docs()
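
Once Data Docs have been built, they can typically be opened straight from the context. A small follow-up sketch (assumes a local Data Docs site, which is the default for a new project):

# Open the generated Data Docs site in your browser
context.open_data_docs()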

Competitor Comparisons

Apache Griffin (1,123 stars): Mirror of Apache griffin

Pros of Griffin

  • Designed for big data environments, integrating well with Apache Hadoop and Spark ecosystems
  • Supports both batch and streaming data quality checks
  • Provides a web UI for easier management and visualization of data quality metrics

Cons of Griffin

  • Less active community and development compared to Great Expectations
  • More complex setup and configuration, especially for non-big data environments
  • Limited documentation and examples compared to Great Expectations

Code Comparison

Griffin (Scala):

// Define a batch data quality job: name, process type, data sources, evaluation rule, and sinks
val dqDef = DQConfig(
  "sample_dq",
  ProcessType.BatchProcessing,
  dataSources,
  evaluateRule,
  sinks
)

Great Expectations (Python):

expectation_suite = context.create_expectation_suite("my_suite")
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite=expectation_suite
)
validator.expect_column_values_to_be_between("column_name", min_value=0, max_value=100)

Griffin focuses on defining data quality configurations using Scala, while Great Expectations uses a more intuitive Python API for defining expectations. Great Expectations provides a more user-friendly approach to defining data quality checks, whereas Griffin's approach is more suited for big data environments and integration with Apache ecosystem tools.

Deequ (3,244 stars): a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets

Pros of Deequ

  • Built on Apache Spark, offering better performance for large-scale data processing
  • Provides more advanced statistical metrics and anomaly detection capabilities
  • Integrates seamlessly with AWS ecosystem and services

Cons of Deequ

  • Limited to Scala and Java, while Great Expectations supports Python
  • Less extensive documentation and community support
  • Narrower focus on data quality, lacking some features like data profiling and documentation generation

Code Comparison

Deequ:

// Require the "id" column to be complete (no nulls) and the dataset to have at least 1,000 rows
val verificationResult = VerificationSuite()
  .onData(df)
  .addCheck(
    Check(CheckLevel.Error, "Data Quality Check")
      .isComplete("id")
      .hasSize(_ >= 1000)
  )
  .run()

Great Expectations:

expectation_suite = context.create_expectation_suite("my_suite")
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite=expectation_suite
)
validator.expect_column_values_to_not_be_null("id")
validator.expect_table_row_count_to_be_between(1000, None)

Both libraries offer data quality validation, but Deequ's syntax is more concise and Spark-oriented, while Great Expectations provides a more Pythonic and readable approach.

Pandera (3,245 stars): a light-weight, flexible, and expressive statistical data testing library

Pros of Pandera

  • Lightweight and focused on pandas DataFrame validation
  • Seamless integration with type hints and static analysis tools
  • Supports statistical hypothesis testing for data validation

Cons of Pandera

  • Less comprehensive ecosystem compared to Great Expectations
  • Limited support for non-pandas data structures
  • Fewer built-in data quality checks and expectations

Code Comparison

Pandera:

import pandera as pa

schema = pa.DataFrameSchema({
    "column1": pa.Column(int, checks=pa.Check.greater_than(0)),
    "column2": pa.Column(str, checks=pa.Check.str_length(1, 100))
})

validated_df = schema.validate(df)

Great Expectations:

import great_expectations as ge

my_df = ge.dataset.PandasDataset(df)
my_df.expect_column_values_to_be_between("column1", min_value=0, max_value=None)
my_df.expect_column_value_lengths_to_be_between("column2", min_value=1, max_value=100)

Both libraries offer data validation capabilities, but Pandera focuses on pandas DataFrames with a more concise syntax, while Great Expectations provides a broader range of features and supports multiple data sources.
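
To illustrate the type-hint and static-analysis integration mentioned in the pros above, here is a minimal sketch of Pandera's class-based schema API (assuming Pandera 0.14+, where DataFrameModel is available; the schema and column names are illustrative):

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series

# Class-based schema: column dtypes come from the annotations, constraints from Field
class MySchema(pa.DataFrameModel):
    column1: Series[int] = pa.Field(gt=0)
    column2: Series[str] = pa.Field(str_length={"min_value": 1, "max_value": 100})

# The decorator validates the input (and output) DataFrame against the annotated schema
@pa.check_types
def transform(df: DataFrame[MySchema]) -> DataFrame[MySchema]:
    return df

transform(pd.DataFrame({"column1": [1, 2], "column2": ["a", "bb"]}))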

Pydantic (20,398 stars): data validation using Python type hints

Pros of Pydantic

  • Lightweight and fast data validation and settings management
  • Seamless integration with Python type hints
  • Extensive support for JSON Schema generation and validation

Cons of Pydantic

  • Limited scope compared to Great Expectations' comprehensive data quality features
  • Less focus on data profiling and expectation suite management
  • Primarily designed for individual object validation rather than large datasets

Code Comparison

Pydantic:

from pydantic import BaseModel, Field

class User(BaseModel):
    id: int
    name: str = Field(..., min_length=3)
    email: str

Great Expectations:

import great_expectations as ge

df = ge.read_csv("users.csv")
df.expect_column_values_to_be_unique("id")
df.expect_column_values_to_match_regex("email", r"[^@]+@[^@]+\.[^@]+")

Pydantic excels in object-level validation and schema definition, while Great Expectations focuses on dataset-level validation and quality checks. Pydantic is more suitable for API input validation and configuration management, whereas Great Expectations is better suited for data pipeline validation and quality assurance in data engineering workflows.
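
As a concrete illustration of the object-level validation and JSON Schema support noted above, here is a minimal sketch (assuming Pydantic v2; v1 uses .parse_obj() and .schema() instead):

from pydantic import BaseModel, Field, ValidationError

class User(BaseModel):
    id: int
    name: str = Field(..., min_length=3)
    email: str

# Values are parsed and coerced on validation; bad data raises ValidationError
user = User.model_validate({"id": "42", "name": "Alice", "email": "alice@example.com"})
print(user.id)  # 42, coerced from the string "42"

try:
    User.model_validate({"id": 1, "name": "Al", "email": "al@example.com"})
except ValidationError as exc:
    print(exc.errors())  # "name" is shorter than 3 characters

# Generate a JSON Schema document for the model
print(User.model_json_schema())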


README


About GX Core

GX Core is the engine of the GX platform. It combines the collective wisdom of thousands of community members with a proven track record in data quality deployments worldwide, wrapped into a super-simple package for data teams.

Its powerful technical tools start with Expectations: expressive and extensible unit tests for your data. Expectations foster collaboration by giving teams a common language to express data quality tests in an intuitive way. You can automatically generate documentation for each set of validation results, making it easy for everyone to stay on the same page. This not only simplifies your data quality processes, but helps preserve your organization’s institutional knowledge about its data.
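
As a rough sketch of what a single Expectation looks like in practice (reusing the fluent pandas API shown in the Getting Started section above; exact method names vary between GX versions, and the file and column names are illustrative):

import great_expectations as gx

context = gx.get_context()

# Read a CSV with the default pandas datasource; this returns a Validator
validator = context.sources.pandas_default.read_csv("your_data.csv")

# An Expectation is a declarative unit test for the data; the result records pass/fail details
result = validator.expect_column_values_to_not_be_null("id")
print(result.success)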

Learn more about how data teams are using GX Core in our featured case studies.

Integration support policy

GX Core supports Python 3.8 through 3.11. Experimental support for Python 3.12 and later can be enabled by setting a GX_PYTHON_EXPERIMENTAL environment variable when installing great_expectations.

For data sources and other integrations that GX supports, see GX integration support policy for additional information.

Get started

GX recommends deploying GX Core within a virtual environment. For more information about getting started with GX Core, see Get started with Great Expectations.

  1. Run the following command in an empty base directory inside a Python virtual environment to install GX Core:

    pip install great_expectations
    
  2. Run the following command to import the great_expectations module and create a Data Context:

    import great_expectations as gx
    
    context = gx.get_context()
    

Get support from GX and the community

Support resources are listed below in the order in which GX prioritizes them:

  1. Issues and PRs in the GX GitHub repository
  2. Questions posted to the GX Core Discourse forum
  3. Questions posted to the GX Slack community channel

Contribute

We deeply value the contributions of our community. We're now accepting PRs for bug fixes.

To ensure the long-term quality of the GX Core codebase, we're not yet ready to accept feature contributions to the parts of the codebase that don't have clear APIs for extensions. We're actively working to increase the surface area for contributions. Thank you for being a crucial part of GX's data quality platform!

Levels of contribution readiness

🟢 Ready. Have a clear and public API for extensions.

🟡 Partially ready. Case-by-case.

🔴 Not ready. Will accept contributions that fix existing bugs or workflows.

GX Component          Readiness           Notes
Action                🟢 Ready
CredentialStore       🟢 Ready
BatchDefinition       🟡 Partially ready  Formerly known as splitters
DataSource            🔴 Not ready        Includes MetricProvider and ExecutionEngine
DataContext           🔴 Not ready        Also known as Configuration Stores
DataAsset             🔴 Not ready
Expectation           🔴 Not ready
ValidationDefinition  🔴 Not ready
Checkpoint            🔴 Not ready
CustomExpectations    🔴 Not ready
Data Docs             🔴 Not ready        Also known as Renderers

Code of conduct

Everyone interacting in GX Core project codebases, Discourse forums, Slack channels, and email communications is expected to adhere to the GX Community Code of Conduct.