Top Related Projects
- Apache Griffin: Mirror of Apache griffin
- Deequ: a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets
- Pandera: a light-weight, flexible, and expressive statistical data testing library
- Pydantic: data validation using Python type hints
Quick Overview
Great Expectations is an open-source Python library for data validation, documentation, and profiling. It helps data teams maintain data quality by allowing them to express what they expect from their data in a clear, human-readable format and then automatically validate those expectations.
Pros
- Flexible and extensible, supporting various data sources and expectation types
- Provides clear, human-readable documentation of data expectations
- Integrates well with existing data pipelines and workflows
- Offers data profiling capabilities to help discover and define expectations
Cons
- Steep learning curve for beginners
- Can be resource-intensive for large datasets
- Limited support for real-time data validation scenarios
- Some users report occasional inconsistencies in documentation
Code Examples
- Creating a simple expectation:

```python
import great_expectations as ge

# Legacy (pre-1.0) pandas workflow: ge.read_csv returns a PandasDataset
# whose expect_* methods validate immediately and return a result object.
df = ge.read_csv("your_data.csv")
result = df.expect_column_values_to_be_between("age", min_value=0, max_value=120)
print(result.success)
```
- Validating a dataset against expectations:

```python
import great_expectations as ge

df = ge.read_csv("your_data.csv")

# Build an expectation suite programmatically (legacy pre-1.0 API).
expectation_suite = ge.core.ExpectationSuite(expectation_suite_name="my_suite")
expectation_suite.add_expectation(
    ge.core.ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_between",
        kwargs={
            "column": "age",
            "min_value": 0,
            "max_value": 120,
        },
    )
)

validation_result = df.validate(expectation_suite=expectation_suite)
print(validation_result.success)
```
- Generating a data documentation site:

```python
import great_expectations as ge

# Legacy DataContext workflow; assumes a project created with
# `great_expectations init` and a configured datasource named "my_datasource".
context = ge.data_context.DataContext()
suite = context.create_expectation_suite("my_suite")
batch = context.get_batch(
    {"path": "your_data.csv", "datasource": "my_datasource"}, suite
)
results = context.run_validation_operator(
    "action_list_operator", assets_to_validate=[batch]
)
context.build_data_docs()
```
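- Profiling a dataset to bootstrap expectations. The profiling capability mentioned in the pros above can generate a candidate suite from the data itself; this is a minimal sketch using the legacy BasicDatasetProfiler, and newer GX versions handle profiling differently:

```python
import great_expectations as ge
from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler

# Profile the dataset to produce a draft expectation suite plus the
# validation results observed while profiling (legacy pre-1.0 API).
df = ge.read_csv("your_data.csv")
expectation_suite, validation_result = BasicDatasetProfiler.profile(df)
print(len(expectation_suite.expectations))
```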
Getting Started
To get started with Great Expectations:
- Install the library:

```bash
pip install great_expectations
```

- Initialize a new Great Expectations project:

```bash
great_expectations init
```
- Connect to your data source and create expectations:

```python
import great_expectations as gx

# Fluent datasource workflow (GX 0.16/0.17): pandas_default.read_csv
# returns a Validator bound to a default expectation suite.
context = gx.get_context()
validator = context.sources.pandas_default.read_csv("your_data.csv")
validator.expect_column_values_to_be_between("age", min_value=0, max_value=120)
validator.save_expectation_suite(discard_failed_expectations=False)
```
- Validate your data and generate documentation:

```python
# Create and run a Checkpoint, then rebuild the Data Docs site.
checkpoint = context.add_or_update_checkpoint(
    name="my_checkpoint", validator=validator
)
checkpoint_result = checkpoint.run()
context.build_data_docs()
```
Competitor Comparisons
Apache Griffin: Mirror of Apache griffin
Pros of Griffin
- Designed for big data environments, integrating well with Apache Hadoop and Spark ecosystems
- Supports both batch and streaming data quality checks
- Provides a web UI for easier management and visualization of data quality metrics
Cons of Griffin
- Less active community and development compared to Great Expectations
- More complex setup and configuration, especially for non-big data environments
- Limited documentation and examples compared to Great Expectations
Code Comparison
Griffin (Scala):

```scala
// Sketch of a Griffin batch data-quality definition; dataSources,
// evaluateRule, and sinks are configured elsewhere.
val dqDef = DQConfig(
  "sample_dq",
  ProcessType.BatchProcessing,
  dataSources,
  evaluateRule,
  sinks
)
```
Great Expectations (Python):

```python
# Assumes an existing Data Context and a batch_request for the data to test.
expectation_suite = context.create_expectation_suite("my_suite")
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite=expectation_suite,
)
validator.expect_column_values_to_be_between("column_name", min_value=0, max_value=100)
```
Griffin focuses on defining data quality configurations using Scala, while Great Expectations uses a more intuitive Python API for defining expectations. Great Expectations provides a more user-friendly approach to defining data quality checks, whereas Griffin's approach is more suited for big data environments and integration with Apache ecosystem tools.
Deequ: a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets
Pros of Deequ
- Built on Apache Spark, offering better performance for large-scale data processing
- Provides more advanced statistical metrics and anomaly detection capabilities
- Integrates seamlessly with AWS ecosystem and services
Cons of Deequ
- Limited to Scala and Java, while Great Expectations supports Python
- Less extensive documentation and community support
- Narrower focus on data quality, lacking some features like data profiling and documentation generation
Code Comparison
Deequ:

```scala
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel}

// Run completeness and row-count checks on a Spark DataFrame `df`.
val verificationResult = VerificationSuite()
  .onData(df)
  .addCheck(
    Check(CheckLevel.Error, "Data Quality Check")
      .isComplete("id")
      .hasSize(_ >= 1000)
  )
  .run()
```
Great Expectations:

```python
# Equivalent checks: "id" must be non-null and the table must have
# at least 1,000 rows.
expectation_suite = context.create_expectation_suite("my_suite")
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite=expectation_suite,
)
validator.expect_column_values_to_not_be_null("id")
validator.expect_table_row_count_to_be_between(min_value=1000, max_value=None)
```
Both libraries offer data quality validation, but Deequ's syntax is more concise and Spark-oriented, while Great Expectations provides a more Pythonic and readable approach.
Pandera: a light-weight, flexible, and expressive statistical data testing library
Pros of Pandera
- Lightweight and focused on pandas DataFrame validation
- Seamless integration with type hints and static analysis tools
- Supports statistical hypothesis testing for data validation
Cons of Pandera
- Less comprehensive ecosystem compared to Great Expectations
- Limited support for non-pandas data structures
- Fewer built-in data quality checks and expectations
Code Comparison
Pandera:

```python
import pandera as pa

# Declare column types and value checks, then validate a DataFrame.
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, checks=pa.Check.greater_than(0)),
    "column2": pa.Column(str, checks=pa.Check.str_length(1, 100)),
})

validated_df = schema.validate(df)  # raises SchemaError on failure
```
Great Expectations:

```python
import great_expectations as ge

# Wrap an existing pandas DataFrame in the legacy PandasDataset API.
my_df = ge.dataset.PandasDataset(df)
my_df.expect_column_values_to_be_between("column1", min_value=0, max_value=None)
my_df.expect_column_value_lengths_to_be_between("column2", min_value=1, max_value=100)
```
Both libraries offer data validation capabilities, but Pandera focuses on pandas DataFrames with a more concise syntax, while Great Expectations provides a broader range of features and supports multiple data sources.
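The hypothesis-testing support noted in the pros above deserves a quick illustration. This is a minimal sketch using Pandera's built-in two-sample t-test check (it requires scipy, installable via the pandera[hypotheses] extra); the column and group names are illustrative:

```python
import pandera as pa

# Assert that the mean height of group "M" is greater than that of
# group "F" at a 5% significance level.
schema = pa.DataFrameSchema({
    "height": pa.Column(
        float,
        checks=pa.Hypothesis.two_sample_ttest(
            sample1="M",
            sample2="F",
            groupby="sex",
            relationship="greater_than",
            alpha=0.05,
        ),
    ),
    "sex": pa.Column(str, checks=pa.Check.isin(["M", "F"])),
})
```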
Pydantic: data validation using Python type hints
Pros of Pydantic
- Lightweight and fast data validation and settings management
- Seamless integration with Python type hints
- Extensive support for JSON Schema generation and validation
Cons of Pydantic
- Limited scope compared to Great Expectations' comprehensive data quality features
- Less focus on data profiling and expectation suite management
- Primarily designed for individual object validation rather than large datasets
Code Comparison
Pydantic:

```python
from pydantic import BaseModel, Field

class User(BaseModel):
    id: int
    name: str = Field(..., min_length=3)
    email: str
```
Great Expectations:

```python
import great_expectations as ge

# Legacy pandas workflow: dataset-level checks across all rows.
df = ge.read_csv("users.csv")
df.expect_column_values_to_be_unique("id")
df.expect_column_values_to_match_regex("email", r"[^@]+@[^@]+\.[^@]+")
```
Pydantic excels in object-level validation and schema definition, while Great Expectations focuses on dataset-level validation and quality checks. Pydantic is more suitable for API input validation and configuration management, whereas Great Expectations is better suited for data pipeline validation and quality assurance in data engineering workflows.
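The JSON Schema support mentioned in the pros above is a one-liner. A minimal sketch, assuming Pydantic v2 (v1 exposed the same idea as User.schema()):

```python
from pydantic import BaseModel, Field

class User(BaseModel):
    id: int
    name: str = Field(..., min_length=3)
    email: str

# Emit a JSON Schema document describing the model's fields and constraints.
print(User.model_json_schema())
```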
README
About GX Core
GX Core is the engine of the GX platform. It combines the collective wisdom of thousands of community members with a proven track record in data quality deployments worldwide, wrapped into a super-simple package for data teams.
Its powerful technical tools start with Expectations: expressive and extensible unit tests for your data. Expectations foster collaboration by giving teams a common language to express data quality tests in an intuitive way. You can automatically generate documentation for each set of validation results, making it easy for everyone to stay on the same page. This not only simplifies your data quality processes, but helps preserve your organization's institutional knowledge about its data.
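To make this concrete, here is a minimal sketch of defining and running one Expectation against an in-memory DataFrame with the GX Core 1.x API; the data source, asset, and batch definition names are illustrative placeholders:

```python
import pandas as pd
import great_expectations as gx

context = gx.get_context()

# Register an in-memory pandas data source, a DataFrame asset, and a
# batch definition covering the whole DataFrame.
data_source = context.data_sources.add_pandas(name="pandas")
asset = data_source.add_dataframe_asset(name="users")
batch_def = asset.add_batch_definition_whole_dataframe("all_rows")

# Fetch a batch for a sample DataFrame and validate one Expectation.
df = pd.DataFrame({"age": [25, 40, 67]})
batch = batch_def.get_batch(batch_parameters={"dataframe": df})

expectation = gx.expectations.ExpectColumnValuesToBeBetween(
    column="age", min_value=0, max_value=120
)
print(batch.validate(expectation).success)
```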
Learn more about how data teams are using GX Core in our featured case studies.
Integration support policy
GX Core supports Python 3.8 through 3.11. Experimental support for Python 3.12 and later can be enabled by setting a GX_PYTHON_EXPERIMENTAL environment variable when installing great_expectations.
For data sources and other integrations that GX supports, see GX integration support policy for additional information.
Get started
GX recommends deploying GX Core within a virtual environment. For more information about getting started with GX Core, see Get started with Great Expectations.
- Run the following command in an empty base directory inside a Python virtual environment to install GX Core:

```bash
pip install great_expectations
```

- Run the following command to import the great_expectations module and create a Data Context:

```python
import great_expectations as gx

context = gx.get_context()
```
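From here, a typical next step is to attach Expectations to a suite and run them as a group. This is a hedged sketch of that flow in GX Core 1.x, reusing the batch definition and DataFrame from the sketch in the About section; the suite and validation names are placeholders:

```python
# Collect Expectations into a named suite.
suite = context.suites.add(gx.ExpectationSuite(name="my_suite"))
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="age", min_value=0, max_value=120
    )
)

# Bind the suite to a batch definition and run the validation.
validation_def = context.validation_definitions.add(
    gx.ValidationDefinition(name="my_validation", data=batch_def, suite=suite)
)
result = validation_def.run(batch_parameters={"dataframe": df})
print(result.success)
```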
Get support from GX and the community
Support channels are listed in the order in which GX prioritizes issues:
- Issues and PRs in the GX GitHub repository
- Questions posted to the GX Core Discourse forum
- Questions posted to the GX Slack community channel
Contribute
We deeply value the contributions of our community. We're now accepting PRs for bug fixes.
To ensure the long-term quality of the GX Core codebase, we're not yet ready to accept feature contributions to the parts of the codebase that don't have clear APIs for extensions. We're actively working to increase the surface area for contributions. Thank you for being a crucial part of GX's data quality platform!
Levels of contribution readiness
🟢 Ready. Have a clear and public API for extensions.
🟡 Partially ready. Case-by-case.
🔴 Not ready. Will accept contributions that fix existing bugs or workflows.
| GX Component | Readiness | Notes |
|---|---|---|
| Action | 🟢 Ready | |
| CredentialStore | 🟢 Ready | |
| BatchDefinition | 🟡 Partially ready | Formerly known as splitters |
| DataSource | 🔴 Not ready | Includes MetricProvider and ExecutionEngine |
| DataContext | 🔴 Not ready | Also known as Configuration Stores |
| DataAsset | 🔴 Not ready | |
| Expectation | 🔴 Not ready | |
| ValidationDefinition | 🔴 Not ready | |
| Checkpoint | 🔴 Not ready | |
| CustomExpectations | 🔴 Not ready | |
| Data Docs | 🔴 Not ready | Also known as Renderers |
Code of conduct
Everyone interacting in GX Core project codebases, Discourse forums, Slack channels, and email communications is expected to adhere to the GX Community Code of Conduct.