deequ
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Top Related Projects
- Great Expectations: Always know what to expect from your data.
- Apache Griffin: Mirror of Apache griffin
- Presidio: Context aware, pluggable and customizable data protection and de-identification SDK for text and images
Quick Overview
Deequ is an open-source library for data quality checks on large-scale datasets in Apache Spark. It provides a declarative language to specify data quality checks and a modular architecture to extend the library with custom checks and metrics.
Pros
- Declarative Approach: Deequ uses a declarative approach to define data quality checks, making it easier to understand and maintain the checks over time.
- Extensibility: The library's modular architecture allows users to extend it with custom checks and metrics, making it highly flexible and adaptable to different use cases.
- Scalability: Deequ is built on top of Apache Spark, allowing it to handle large-scale datasets efficiently.
- Comprehensive Checks: Deequ provides a wide range of built-in checks, including completeness, uniqueness, distribution, and constraint checks, covering a broad spectrum of data quality concerns.
Cons
- Steep Learning Curve: Deequ's declarative approach and modular architecture may present a steep learning curve for users who are not familiar with Apache Spark or data quality concepts.
- Limited Non-Spark Support: Deequ is primarily designed for Apache Spark, and its integration with other data processing frameworks may be limited.
- Dependency on Spark: The library's reliance on Apache Spark may be a drawback for users who prefer to work with other data processing tools or frameworks.
- Potential Performance Overhead: The overhead of running data quality checks on large datasets may impact the overall performance of the data processing pipeline.
Code Examples
Here are a few code examples demonstrating the usage of Deequ:
- Defining a Completeness Check:
val completenessCheck = Check(CheckLevel.Warning, "Completeness check")
  .hasCompleteness("column1", _ >= 0.95)
  .hasCompleteness("column2", _ >= 0.90)
This code defines a completeness check that ensures that at least 95% of the values in column1 and 90% of the values in column2 are non-null.
- Defining a Uniqueness Check:
val uniquenessCheck = Check(CheckLevel.Error, "Uniqueness check")
  .isUnique("column1")
  .hasUniqueness(Seq("column2", "column3"), Check.IsOne)
This code defines a uniqueness check that ensures that the values in column1 are unique and that the combined values of column2 and column3 are also unique. Note that isUnique takes a single column, so the multi-column case uses hasUniqueness with the predefined Check.IsOne assertion.
- Defining a Constraint Check:
val constraintCheck = Check(CheckLevel.Error, "Constraint check")
  .hasMin("column1", _ >= 0)
  .hasMax("column1", _ <= 100)
  .hasApproxQuantile("column2", 0.5, _ == 50.0)
This code defines a constraint check that ensures that the values in column1 are between 0 and 100 and that the median of column2 is approximately 50.0.
- Running the Checks:
val result = VerificationSuite()
  .onData(df)
  .addChecks(Seq(
    completenessCheck,
    uniquenessCheck,
    constraintCheck))
  .run()
This code runs the defined checks on the input DataFrame df and returns a VerificationResult object that contains the results of the checks.
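To act on failures, the VerificationResult can be inspected for constraints that did not hold. Here is a minimal sketch, reusing the result value from the previous snippet:
import com.amazon.deequ.checks.CheckStatus
import com.amazon.deequ.constraints.ConstraintStatus

// Print every constraint that did not pass.
if (result.status != CheckStatus.Success) {
  result.checkResults
    .flatMap { case (_, checkResult) => checkResult.constraintResults }
    .filter { _.status != ConstraintStatus.Success }
    .foreach { r => println(s"${r.constraint}: ${r.message.getOrElse("")}") }
}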
Getting Started
To get started with Deequ, you can follow these steps:
- Add the Deequ dependency to your project:
libraryDependencies += "com.amazon.deequ" % "deequ" % "2.0.0-spark-3.1"
- Import the necessary Deequ classes:
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.Check
import com.amazon.deequ.checks.CheckLevel
import com.amazon.deequ.repository.ResultKey
import com.amazon.deequ.repository.memory.InMemoryMetricsRepository
- Create a DataFrame and define your data quality checks, as sketched below.
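For example, here is a minimal end-to-end sketch. It assumes an existing SparkSession named spark; the column names and tags are placeholders. The check runs on the DataFrame, and the computed metrics are stored in the in-memory repository imported above:
val df = spark.createDataFrame(Seq(
  (1L, "Thingy A"), (2L, "Thingy B"), (3L, null)
)).toDF("id", "productName")

val repository = new InMemoryMetricsRepository()
val resultKey = ResultKey(System.currentTimeMillis(), Map("dataset" -> "items"))

val result = VerificationSuite()
  .onData(df)
  .useRepository(repository)
  .saveOrAppendResult(resultKey)
  .addCheck(
    Check(CheckLevel.Error, "getting started")
      .isComplete("id") // id should never be NULL
      .isUnique("id"))  // id should not contain duplicates
  .run()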
Competitor Comparisons
Great Expectations
Pros of Great Expectations
- More flexible and supports multiple data sources (databases, files, cloud storage)
- Extensive documentation and active community support
- Integrates well with modern data workflows and CI/CD pipelines
Cons of Great Expectations
- Steeper learning curve due to more complex architecture
- Can be slower for large datasets compared to Deequ's Spark-based approach
- Requires more setup and configuration for advanced use cases
Code Comparison
Great Expectations:
import great_expectations as ge
df = ge.read_csv("data.csv")
df.expect_column_values_to_be_between("age", min_value=0, max_value=120)
df.expect_column_values_to_not_be_null("name")
Deequ:
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel}

VerificationSuite()
  .onData(df)
  .addCheck(
    Check(CheckLevel.Error, "Data quality check")
      .isComplete("id")
      .isUnique("id"))
  .run()
Both libraries offer data quality validation, but Great Expectations provides a more Pythonic interface and supports various data sources, while Deequ leverages Spark for efficient processing of large datasets.
Apache Griffin
Pros of Griffin
- Supports multiple data sources (HDFS, Hive, Kafka)
- Provides a web UI for data quality monitoring
- Offers real-time data quality checking capabilities
Cons of Griffin
- Steeper learning curve due to more complex architecture
- Requires more setup and configuration
- Less active development compared to Deequ
Code Comparison
Griffin (Scala):
val dfSource = spark.table("source_table")
val dfTarget = spark.table("target_table")

val rule = BasicRule()
  .in("source_table", "target_table")
  .out("output")
  .compareColumns("id", "id")
  .expectEqual()

val job = BatchDQJob(spark, rule)
job.execute()
Deequ (Scala):
val verificationResult = VerificationSuite()
  .onData(df)
  .addCheck(
    Check(CheckLevel.Error, "Data Quality Check")
      .hasSize(_ >= 1000)
      .isComplete("id")
      .isUnique("id"))
  .run()
Both Griffin and Deequ are data quality tools, but they differ in their approach and features. Griffin offers a more comprehensive solution with support for multiple data sources and real-time checking, while Deequ provides a simpler, more focused approach to data quality checks within Spark environments. The choice between them depends on specific project requirements and infrastructure constraints.
Presidio
Pros of Presidio
- Focuses on data protection and privacy, with built-in PII detection and anonymization
- Supports multiple programming languages (Python, Java, Go)
- Offers flexible deployment options (as a service or library)
Cons of Presidio
- Narrower scope, primarily focused on PII detection and anonymization
- Less extensive data quality validation capabilities
- Smaller community and fewer contributors compared to Deequ
Code Comparison
Presidio (PII detection):
from presidio_analyzer import AnalyzerEngine
analyzer = AnalyzerEngine()
text = "John Smith's phone number is 212-555-5555"
results = analyzer.analyze(text=text, language="en")
Deequ (Data quality validation):
val verificationResult = VerificationSuite()
  .onData(df)
  .addCheck(
    Check(CheckLevel.Error, "Data quality check")
      .isComplete("id")
      .hasSize(_ >= 1000))
  .run()
Presidio is tailored for PII detection and anonymization, making it ideal for privacy-focused applications. Deequ, on the other hand, offers more comprehensive data quality validation features, making it better suited for general data quality assurance tasks. The choice between the two depends on the specific requirements of your project, with Presidio excelling in privacy protection and Deequ in broader data quality checks.
README
Deequ - Unit Tests for Data
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets. We are happy to receive feedback and contributions.
Python users may also be interested in PyDeequ, a Python interface for Deequ. You can find PyDeequ on GitHub, readthedocs, and PyPI.
Requirements and Installation
Deequ depends on Java 8. Deequ version 2.x only runs with Spark 3.1, and vice versa. If you rely on a previous Spark version, please use a Deequ 1.x version (legacy version is maintained in legacy-spark-3.0 branch). We provide legacy releases compatible with Apache Spark versions 2.2.x to 3.0.x. The Spark 2.2.x and 2.3.x releases depend on Scala 2.11 and the Spark 2.4.x, 3.0.x, and 3.1.x releases depend on Scala 2.12.
Available via Maven Central.
Choose the latest release that matches your Spark version from the available versions. Add the release as a dependency to your project. For example, for Spark 3.1.x:
Maven
<dependency>
  <groupId>com.amazon.deequ</groupId>
  <artifactId>deequ</artifactId>
  <version>2.0.0-spark-3.1</version>
</dependency>
sbt
libraryDependencies += "com.amazon.deequ" % "deequ" % "2.0.0-spark-3.1"
Example
Deequ's purpose is to "unit-test" data to find errors early, before the data gets fed to consuming systems or machine learning algorithms. In the following, we will walk you through a toy example to showcase the most basic usage of our library. An executable version of the example is available here.
Deequ works on tabular data, e.g., CSV files, database tables, logs, flattened json files, basically anything that you can fit into a Spark dataframe. For this example, we assume that we work on some kind of Item data, where every item has an id, a productName, a description, a priority and a count of how often it has been viewed.
case class Item(
  id: Long,
  productName: String,
  description: String,
  priority: String,
  numViews: Long
)
Our library is built on Apache Spark and is designed to work with very large datasets (think billions of rows) that typically live in a distributed filesystem or a data warehouse. For the sake of simplicity in this example, we just generate a few toy records though.
val rdd = spark.sparkContext.parallelize(Seq(
  Item(1, "Thingy A", "awesome thing.", "high", 0),
  Item(2, "Thingy B", "available at http://thingb.com", null, 0),
  Item(3, null, null, "low", 5),
  Item(4, "Thingy D", "checkout https://thingd.ca", "low", 10),
  Item(5, "Thingy E", null, "high", 12)))
val data = spark.createDataFrame(rdd)
Most applications that work with data have implicit assumptions about that data, e.g., that attributes have certain types, do not contain NULL values, and so on. If these assumptions are violated, your application might crash or produce wrong outputs. The idea behind deequ is to explicitly state these assumptions in the form of a "unit-test" for data, which can be verified on a piece of data at hand. If the data has errors, we can "quarantine" and fix it, before we feed it to an application.
The main entry point for defining how you expect your data to look is the VerificationSuite from which you can add Checks that define constraints on attributes of the data. In this example, we test for the following properties of our data:
- there are 5 rows in total
- values of the id attribute are never NULL and unique
- values of the productName attribute are never NULL
- the priority attribute can only contain "high" or "low" as value
- numViews should not contain negative values
- at least half of the values in description should contain a url
- the median of numViews should be less than or equal to 10
In code this looks as follows:
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
val verificationResult = VerificationSuite()
  .onData(data)
  .addCheck(
    Check(CheckLevel.Error, "unit testing my data")
      .hasSize(_ == 5) // we expect 5 rows
      .isComplete("id") // should never be NULL
      .isUnique("id") // should not contain duplicates
      .isComplete("productName") // should never be NULL
      // should only contain the values "high" and "low"
      .isContainedIn("priority", Array("high", "low"))
      .isNonNegative("numViews") // should not contain negative values
      // at least half of the descriptions should contain a url
      .containsURL("description", _ >= 0.5)
      // half of the items should have less than 10 views
      .hasApproxQuantile("numViews", 0.5, _ <= 10))
  .run()
After calling run, deequ translates your test into a series of Spark jobs, which it executes to compute metrics on the data. Afterwards, it invokes your assertion functions (e.g., _ == 5 for the size check) on these metrics to see whether the constraints hold on the data. We can inspect the VerificationResult to see if the test found errors:
import com.amazon.deequ.constraints.ConstraintStatus
if (verificationResult.status == CheckStatus.Success) {
  println("The data passed the test, everything is fine!")
} else {
  println("We found errors in the data:\n")

  val resultsForAllConstraints = verificationResult.checkResults
    .flatMap { case (_, checkResult) => checkResult.constraintResults }

  resultsForAllConstraints
    .filter { _.status != ConstraintStatus.Success }
    .foreach { result => println(s"${result.constraint}: ${result.message.get}") }
}
If we run the example, we get the following output:
We found errors in the data:
CompletenessConstraint(Completeness(productName)): Value: 0.8 does not meet the requirement!
PatternConstraint(containsURL(description)): Value: 0.4 does not meet the requirement!
The test found that our assumptions are violated! Only 4 out of 5 (80%) of the values of the productName attribute are non-null, and only 2 out of 5 (40%) of the values of the description attribute contain a url. Fortunately, we ran a test and found the errors; somebody should immediately fix the data :)
More examples
Our library contains much more functionality than what we showed in the basic example. We are in the process of adding more examples for its advanced features. So far, we showcase the following functionality:
- Persistence and querying of computed metrics of the data with a MetricsRepository
- Data profiling of large data sets
- Anomaly detection on data quality metrics over time (see the sketch after this list)
- Automatic suggestion of constraints for large datasets
- Incremental metrics computation on growing data and metric updates on partitioned data (advanced)
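To give a taste of the anomaly detection functionality, here is a minimal sketch based on Deequ's anomaly-check API. It records the dataset size in a metrics repository on every run and flags today's run if the row count more than doubles compared to the previous run; yesterdaysData and todaysData are placeholder DataFrames:
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.analyzers.Size
import com.amazon.deequ.anomalydetection.RelativeRateOfChangeStrategy
import com.amazon.deequ.checks.CheckStatus
import com.amazon.deequ.repository.ResultKey
import com.amazon.deequ.repository.memory.InMemoryMetricsRepository

val metricsRepository = new InMemoryMetricsRepository()

// Baseline run on yesterday's data stores a Size metric in the repository.
VerificationSuite()
  .onData(yesterdaysData)
  .useRepository(metricsRepository)
  .saveOrAppendResult(ResultKey(System.currentTimeMillis() - 24 * 60 * 60 * 1000))
  .addAnomalyCheck(RelativeRateOfChangeStrategy(maxRateIncrease = Some(2.0)), Size())
  .run()

// Today's run is compared against the stored metric and fails
// if the dataset grew by more than a factor of two.
val verificationResult = VerificationSuite()
  .onData(todaysData)
  .useRepository(metricsRepository)
  .saveOrAppendResult(ResultKey(System.currentTimeMillis()))
  .addAnomalyCheck(RelativeRateOfChangeStrategy(maxRateIncrease = Some(2.0)), Size())
  .run()

if (verificationResult.status != CheckStatus.Success) {
  println("Anomaly detected in the Size() metric!")
}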
Citation
If you would like to reference this package in a research paper, please cite:
Sebastian Schelter, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Biessmann, and Andreas Grafberger. 2018. Automating large-scale data quality verification. Proc. VLDB Endow. 11, 12 (August 2018), 1781-1794.
License
This library is licensed under the Apache 2.0 License.