deequ
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Top Related Projects
- Great Expectations: Always know what to expect from your data.
- Apache Griffin: Mirror of Apache griffin
- Presidio: Context aware, pluggable and customizable data protection and de-identification SDK for text and images
Quick Overview
Deequ is an open-source library for data quality checks on large-scale datasets in Apache Spark. It provides a declarative language to specify data quality checks and a modular architecture to extend the library with custom checks and metrics.
Pros
- Declarative Approach: Deequ uses a declarative approach to define data quality checks, making it easier to understand and maintain the checks over time.
- Extensibility: The library's modular architecture allows users to extend it with custom checks and metrics, making it highly flexible and adaptable to different use cases.
- Scalability: Deequ is built on top of Apache Spark, allowing it to handle large-scale datasets efficiently.
- Comprehensive Checks: Deequ provides a wide range of built-in checks, including completeness, uniqueness, distribution, and constraint checks, covering a broad spectrum of data quality concerns.
Cons
- Steep Learning Curve: Deequ's declarative approach and modular architecture may present a steep learning curve for users who are not familiar with Apache Spark or data quality concepts.
- Limited Non-Spark Support: Deequ is primarily designed for Apache Spark, and its integration with other data processing frameworks may be limited.
- Dependency on Spark: The library's reliance on Apache Spark may be a drawback for users who prefer to work with other data processing tools or frameworks.
- Potential Performance Overhead: The overhead of running data quality checks on large datasets may impact the overall performance of the data processing pipeline.
Code Examples
Here are a few code examples demonstrating the usage of Deequ:
- Defining a Completeness Check:
val completenessCheck = Check(CheckLevel.Warning, "Completeness check")
  .hasCompleteness("column1", _ >= 0.95)
  .hasCompleteness("column2", _ >= 0.90)
This code defines a completeness check that ensures that at least 95% of the values in column1 and 90% of the values in column2 are non-null.
- Defining a Uniqueness Check:
val uniquenessCheck = Check(CheckLevel.Error, "Uniqueness check")
  .isUnique("column1")
  .hasUniqueness(Seq("column2", "column3"), Check.IsOne)
This code defines a uniqueness check that ensures that the values in column1 are unique and that the combined values of column2 and column3 are also unique. Note that isUnique takes a single column, so the multi-column case uses hasUniqueness with the predefined Check.IsOne assertion.
- Defining a Constraint Check:
val constraintCheck = Check(CheckLevel.Error, "Constraint check")
  .hasMin("column1", _ >= 0)
  .hasMax("column1", _ <= 100)
  .hasApproxQuantile("column2", 0.5, _ == 50.0)
This code defines a constraint check that ensures that the values in column1 are between 0 and 100 and that the median of column2 is approximately 50.0.
- Running the Checks:
val result = VerificationSuite()
  .onData(df)
  .addChecks(Seq(
    completenessCheck,
    uniquenessCheck,
    constraintCheck))
  .run()
This code runs the defined checks on the input DataFrame df and returns a VerificationResult object that contains the results of the checks.
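To act on failures, the VerificationResult can be inspected for constraints that did not hold. Here is a minimal sketch, reusing the result value from the previous snippet:
import com.amazon.deequ.checks.CheckStatus
import com.amazon.deequ.constraints.ConstraintStatus

// Print every constraint that did not pass.
if (result.status != CheckStatus.Success) {
  result.checkResults
    .flatMap { case (_, checkResult) => checkResult.constraintResults }
    .filter { _.status != ConstraintStatus.Success }
    .foreach { r => println(s"${r.constraint}: ${r.message.getOrElse("")}") }
}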
Getting Started
To get started with Deequ, you can follow these steps:
- Add the Deequ dependency to your project:
libraryDependencies += "com.amazon.deequ" % "deequ" % "2.0.0-spark-3.1"
- Import the necessary Deequ classes:
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.Check
import com.amazon.deequ.checks.CheckLevel
import com.amazon.deequ.repository.ResultKey
import com.amazon.deequ.repository.memory.InMemoryMetricsRepository
- Create a DataFrame and define your data quality checks, as sketched below.
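For example, here is a minimal end-to-end sketch. It assumes an existing SparkSession named spark; the column names and tags are placeholders. The check runs on the DataFrame, and the computed metrics are stored in the in-memory repository imported above:
val df = spark.createDataFrame(Seq(
  (1L, "Thingy A"), (2L, "Thingy B"), (3L, null)
)).toDF("id", "productName")

val repository = new InMemoryMetricsRepository()
val resultKey = ResultKey(System.currentTimeMillis(), Map("dataset" -> "items"))

val result = VerificationSuite()
  .onData(df)
  .useRepository(repository)
  .saveOrAppendResult(resultKey)
  .addCheck(
    Check(CheckLevel.Error, "getting started")
      .isComplete("id") // id should never be NULL
      .isUnique("id"))  // id should not contain duplicates
  .run()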
Competitor Comparisons
Great Expectations
Pros of Great Expectations
- More flexible and supports multiple data sources (databases, files, cloud storage)
- Extensive documentation and active community support
- Integrates well with modern data workflows and CI/CD pipelines
Cons of Great Expectations
- Steeper learning curve due to more complex architecture
- Can be slower for large datasets compared to Deequ's Spark-based approach
- Requires more setup and configuration for advanced use cases
Code Comparison
Great Expectations:
import great_expectations as ge
df = ge.read_csv("data.csv")
df.expect_column_values_to_be_between("age", min_value=0, max_value=120)
df.expect_column_values_to_not_be_null("name")
Deequ:
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel}

VerificationSuite()
  .onData(df)
  .addCheck(
    Check(CheckLevel.Error, "Data quality check")
      .isComplete("id")
      .isUnique("id"))
  .run()
Both libraries offer data quality validation, but Great Expectations provides a more Pythonic interface and supports various data sources, while Deequ leverages Spark for efficient processing of large datasets.
Apache Griffin
Pros of Griffin
- Supports multiple data sources (HDFS, Hive, Kafka)
- Provides a web UI for data quality monitoring
- Offers real-time data quality checking capabilities
Cons of Griffin
- Steeper learning curve due to more complex architecture
- Requires more setup and configuration
- Less active development compared to Deequ
Code Comparison
Griffin (Scala):
val dfSource = spark.table("source_table")
val dfTarget = spark.table("target_table")

val rule = BasicRule()
  .in("source_table", "target_table")
  .out("output")
  .compareColumns("id", "id")
  .expectEqual()

val job = BatchDQJob(spark, rule)
job.execute()
Deequ (Scala):
val verificationResult = VerificationSuite()
  .onData(df)
  .addCheck(
    Check(CheckLevel.Error, "Data Quality Check")
      .hasSize(_ >= 1000)
      .isComplete("id")
      .isUnique("id"))
  .run()
Both Griffin and Deequ are data quality tools, but they differ in their approach and features. Griffin offers a more comprehensive solution with support for multiple data sources and real-time checking, while Deequ provides a simpler, more focused approach to data quality checks within Spark environments. The choice between them depends on specific project requirements and infrastructure constraints.
Presidio
Pros of Presidio
- Focuses on data protection and privacy, with built-in PII detection and anonymization
- Supports multiple programming languages (Python, Java, Go)
- Offers flexible deployment options (as a service or library)
Cons of Presidio
- Narrower scope, primarily focused on PII detection and anonymization
- Less extensive data quality validation capabilities
- Smaller community and fewer contributors compared to Deequ
Code Comparison
Presidio (PII detection):
from presidio_analyzer import AnalyzerEngine
analyzer = AnalyzerEngine()
text = "John Smith's phone number is 212-555-5555"
results = analyzer.analyze(text=text, language="en")
Deequ (Data quality validation):
val verificationResult = VerificationSuite()
  .onData(df)
  .addCheck(
    Check(CheckLevel.Error, "Data quality check")
      .isComplete("id")
      .hasSize(_ >= 1000))
  .run()
Presidio is tailored for PII detection and anonymization, making it ideal for privacy-focused applications. Deequ, on the other hand, offers more comprehensive data quality validation features, making it better suited for general data quality assurance tasks. The choice between the two depends on the specific requirements of your project, with Presidio excelling in privacy protection and Deequ in broader data quality checks.
README
Deequ - Unit Tests for Data
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets. We are happy to receive feedback and contributions.
Python users may also be interested in PyDeequ, a Python interface for Deequ. You can find PyDeequ on GitHub, readthedocs, and PyPI.
Requirements and Installation
Deequ depends on Java 8. Deequ version 2.x only runs with Spark 3.1, and vice versa. If you rely on a previous Spark version, please use a Deequ 1.x version (legacy version is maintained in legacy-spark-3.0 branch). We provide legacy releases compatible with Apache Spark versions 2.2.x to 3.0.x. The Spark 2.2.x and 2.3.x releases depend on Scala 2.11 and the Spark 2.4.x, 3.0.x, and 3.1.x releases depend on Scala 2.12.
Available via Maven Central.
Choose the latest release that matches your Spark version from the available versions. Add the release as a dependency to your project. For example, for Spark 3.1.x:
Maven
<dependency>
  <groupId>com.amazon.deequ</groupId>
  <artifactId>deequ</artifactId>
  <version>2.0.0-spark-3.1</version>
</dependency>
sbt
libraryDependencies += "com.amazon.deequ" % "deequ" % "2.0.0-spark-3.1"
Example
Deequ's purpose is to "unit-test" data to find errors early, before the data gets fed to consuming systems or machine learning algorithms. In the following, we will walk you through a toy example to showcase the most basic usage of our library. An executable version of the example is available here.
Deequ works on tabular data, e.g., CSV files, database tables, logs, flattened json files, basically anything that you can fit into a Spark dataframe. For this example, we assume that we work on some kind of Item data, where every item has an id, a productName, a description, a priority and a count of how often it has been viewed.
case class Item(
  id: Long,
  productName: String,
  description: String,
  priority: String,
  numViews: Long
)
Our library is built on Apache Spark and is designed to work with very large datasets (think billions of rows) that typically live in a distributed filesystem or a data warehouse. For the sake of simplicity in this example, we just generate a few toy records though.
val rdd = spark.sparkContext.parallelize(Seq(
  Item(1, "Thingy A", "awesome thing.", "high", 0),
  Item(2, "Thingy B", "available at http://thingb.com", null, 0),
  Item(3, null, null, "low", 5),
  Item(4, "Thingy D", "checkout https://thingd.ca", "low", 10),
  Item(5, "Thingy E", null, "high", 12)))
val data = spark.createDataFrame(rdd)
Most applications that work with data have implicit assumptions about that data, e.g., that attributes have certain types, do not contain NULL values, and so on. If these assumptions are violated, your application might crash or produce wrong outputs. The idea behind deequ is to explicitly state these assumptions in the form of a "unit-test" for data, which can be verified on a piece of data at hand. If the data has errors, we can "quarantine" and fix it, before we feed it to an application.
The main entry point for defining how you expect your data to look is the VerificationSuite from which you can add Checks that define constraints on attributes of the data. In this example, we test for the following properties of our data:
- there are 5 rows in total
- values of the id attribute are never NULL and unique
- values of the productName attribute are never NULL
- the priority attribute can only contain "high" or "low" as value
- numViews should not contain negative values
- at least half of the values in description should contain a url
- the median of numViews should be less than or equal to 10
In code this looks as follows:
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
val verificationResult = VerificationSuite()
  .onData(data)
  .addCheck(
    Check(CheckLevel.Error, "unit testing my data")
      .hasSize(_ == 5) // we expect 5 rows
      .isComplete("id") // should never be NULL
      .isUnique("id") // should not contain duplicates
      .isComplete("productName") // should never be NULL
      // should only contain the values "high" and "low"
      .isContainedIn("priority", Array("high", "low"))
      .isNonNegative("numViews") // should not contain negative values
      // at least half of the descriptions should contain a url
      .containsURL("description", _ >= 0.5)
      // half of the items should have less than 10 views
      .hasApproxQuantile("numViews", 0.5, _ <= 10))
  .run()
After calling run, deequ translates your test into a series of Spark jobs, which it executes to compute metrics on the data. Afterwards, it invokes your assertion functions (e.g., _ == 5 for the size check) on these metrics to see whether the constraints hold on the data. We can inspect the VerificationResult to see if the test found errors:
import com.amazon.deequ.constraints.ConstraintStatus
if (verificationResult.status == CheckStatus.Success) {
  println("The data passed the test, everything is fine!")
} else {
  println("We found errors in the data:\n")

  val resultsForAllConstraints = verificationResult.checkResults
    .flatMap { case (_, checkResult) => checkResult.constraintResults }

  resultsForAllConstraints
    .filter { _.status != ConstraintStatus.Success }
    .foreach { result => println(s"${result.constraint}: ${result.message.get}") }
}
If we run the example, we get the following output:
We found errors in the data:
CompletenessConstraint(Completeness(productName)): Value: 0.8 does not meet the requirement!
PatternConstraint(containsURL(description)): Value: 0.4 does not meet the requirement!
The test found that our assumptions are violated! Only 4 out of 5 (80%) of the values of the productName attribute are non-null, and only 2 out of 5 (40%) of the values of the description attribute contain a url. Fortunately, we ran a test and found the errors; somebody should immediately fix the data :)
More examples
Our library contains much more functionality than what we showed in the basic example. We are in the process of adding more examples for its advanced features. So far, we showcase the following functionality:
- Persistence and querying of computed metrics of the data with a MetricsRepository
- Data profiling of large data sets
- Anomaly detection on data quality metrics over time (see the sketch after this list)
- Automatic suggestion of constraints for large datasets
- Incremental metrics computation on growing data and metric updates on partitioned data (advanced)
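To give a taste of the anomaly detection functionality, here is a minimal sketch based on Deequ's anomaly-check API. It records the dataset size in a metrics repository on every run and flags today's run if the row count more than doubles compared to the previous run; yesterdaysData and todaysData are placeholder DataFrames:
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.analyzers.Size
import com.amazon.deequ.anomalydetection.RelativeRateOfChangeStrategy
import com.amazon.deequ.checks.CheckStatus
import com.amazon.deequ.repository.ResultKey
import com.amazon.deequ.repository.memory.InMemoryMetricsRepository

val metricsRepository = new InMemoryMetricsRepository()

// Baseline run on yesterday's data stores a Size metric in the repository.
VerificationSuite()
  .onData(yesterdaysData)
  .useRepository(metricsRepository)
  .saveOrAppendResult(ResultKey(System.currentTimeMillis() - 24 * 60 * 60 * 1000))
  .addAnomalyCheck(RelativeRateOfChangeStrategy(maxRateIncrease = Some(2.0)), Size())
  .run()

// Today's run is compared against the stored metric and fails
// if the dataset grew by more than a factor of two.
val verificationResult = VerificationSuite()
  .onData(todaysData)
  .useRepository(metricsRepository)
  .saveOrAppendResult(ResultKey(System.currentTimeMillis()))
  .addAnomalyCheck(RelativeRateOfChangeStrategy(maxRateIncrease = Some(2.0)), Size())
  .run()

if (verificationResult.status != CheckStatus.Success) {
  println("Anomaly detected in the Size() metric!")
}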
Citation
If you would like to reference this package in a research paper, please cite:
Sebastian Schelter, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Biessmann, and Andreas Grafberger. 2018. Automating large-scale data quality verification. Proc. VLDB Endow. 11, 12 (August 2018), 1781-1794.
License
This library is licensed under the Apache 2.0 License.