Apache Griffin

Mirror of Apache Griffin

Quick Overview

Apache Griffin is an open-source Data Quality Service platform designed for big data. It provides a unified process to measure data quality from different perspectives, helping organizations build trusted data assets and improve data quality.

Pros

  • Supports both batch and streaming mode for data quality measurement
  • Offers a flexible rule definition system for various data quality dimensions
  • Provides a user-friendly web UI for easy configuration and visualization
  • Integrates well with popular big data ecosystems (Hadoop, Spark, Hive, etc.)

Cons

  • Steep learning curve for users new to big data technologies
  • Limited documentation and examples for advanced use cases
  • Requires significant setup and configuration for optimal performance
  • May be overkill for smaller-scale data quality needs

Code Examples

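  1. Defining a data quality measure (illustrative builder-style pseudocode; Griffin itself takes measure definitions as JSON configuration):
// A row-count rule is a profiling measure rather than an accuracy measure
val dqMeasure = DQMeasure()
  .setName("total_count")
  .setRule("select count(*) as total from source")
  .setDqType(DQType.Profiling)
  2. Creating a data quality job:
// dataSource and target are assumed to be defined elsewhere
val dqJob = DQJob()
  .setName("example_job")
  .setDataSource(dataSource)
  .setTarget(target)
  .setMeasures(Seq(dqMeasure))
  3. Running a data quality check:
val result = griffin.runJob(dqJob)
println(s"Data quality score: ${result.getScore}")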
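
To make concrete what an accuracy-type measure computes, here is a minimal sketch in plain Python. It is not Griffin code: the function name `accuracy` and the dictionary-based record layout are assumptions made for illustration. The idea it shows is that a source record counts as matched when some target record agrees with it on every compared field, and the score is the matched fraction.

```python
from typing import Dict, Hashable, Iterable, Sequence, Tuple

Record = Dict[str, Hashable]


def accuracy(
    source: Iterable[Record],
    target: Iterable[Record],
    keys: Sequence[str],
) -> Tuple[int, int, float]:
    """Return (matched, total, ratio) for source records found in target.

    Hypothetical helper: a source record is "matched" when at least one
    target record has the same values for every field listed in `keys`.
    """
    def key_of(record: Record) -> tuple:
        return tuple(record.get(k) for k in keys)

    # Index the target once so each source lookup is a set membership test.
    target_keys = {key_of(r) for r in target}

    total = 0
    matched = 0
    for record in source:
        total += 1
        if key_of(record) in target_keys:
            matched += 1

    # An empty source has nothing to mismatch, so treat it as fully accurate.
    ratio = matched / total if total else 1.0
    return matched, total, ratio
```

On a cluster the same comparison would typically be expressed as a join between the two data sets; the set lookup here only stands in for that join.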

Getting Started

  1. Install Apache Griffin:
git clone https://github.com/apache/griffin.git
cd griffin
mvn clean install
  2. Configure your data sources in conf/datasources.json

  3. Define your data quality measures in conf/measures.json

  4. Start the Griffin service:

bin/griffin-service.sh start
  5. Access the web UI at http://localhost:8080 to monitor and manage your data quality jobs

Paths and script names can differ between releases; the Apache Griffin Deployment Guide referenced in the README is the reference for a real installation.

Competitor Comparisons

DataHub: The Metadata Platform for your Data and AI Stack

Pros of DataHub

  • More comprehensive metadata management platform with broader data ecosystem integration
  • Active development and larger community support
  • Richer UI for data discovery, lineage visualization, and governance

Cons of DataHub

  • More complex setup and configuration compared to Griffin
  • Steeper learning curve due to its extensive features
  • Higher resource requirements for deployment and operation

Code Comparison

DataHub (Python client example):

from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.com.linkedin.pegasus2avro.metadata.snapshot import DatasetSnapshot
from datahub.metadata.com.linkedin.pegasus2avro.mxe import MetadataChangeEvent
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter("http://localhost:8080")
dataset_snapshot = DatasetSnapshot(...)
dataset_snapshot.aspects.append(DatasetPropertiesClass(description="Example dataset"))
# The snapshot is wrapped in a MetadataChangeEvent before it is sent to the REST endpoint
emitter.emit_mce(MetadataChangeEvent(proposedSnapshot=dataset_snapshot))

Griffin (Java example for data quality check):

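import java.util.Map;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Rule and ExecutionResult are illustrative types, assumed to be defined elsewhere
public class SampleRule extends Rule {
    @Override
    public ExecutionResult execute(SparkSession spark, Map<String, Dataset<Row>> dataSets) {
        Dataset<Row> df = dataSets.get("source");
        // The check passes only when no row has a null id
        long nullIds = df.filter("id IS NULL").count();
        return new ExecutionResult(nullIds == 0, "Null ID check");
    }
}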

Both projects aim to improve data quality and governance, but DataHub offers a more comprehensive solution for metadata management and data discovery, while Griffin focuses primarily on data quality validation and profiling.

OpenLineage: An Open Standard for lineage metadata collection

Pros of OpenLineage

  • More active development with frequent updates and contributions
  • Broader ecosystem integration, supporting various data platforms and tools
  • Standardized metadata model for easier interoperability

Cons of OpenLineage

  • Steeper learning curve due to more complex architecture
  • Requires more setup and configuration compared to Griffin

Code Comparison

Griffin (Data Quality Check):

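import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Rule is an illustrative base class, assumed to be defined elsewhere
public class AccuracyRule extends Rule {
    @Override
    public boolean execute(Dataset<Row> df) {
        // Accuracy check logic goes here; throwing keeps the stub compilable
        throw new UnsupportedOperationException("accuracy check not implemented");
    }
}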

OpenLineage (Lineage Event):

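from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import (
    InputDataset, Job, OutputDataset, Run, RunEvent, RunState,
)

client = OpenLineageClient()  # transport is resolved from the environment or config
client.emit(
    RunEvent(
        eventType=RunState.START,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=Run(runId=str(uuid4())),  # runId must be a UUID
        job=Job(namespace="my_namespace", name="my_job"),
        producer="https://example.com/my-producer",  # placeholder producer URI
        inputs=[InputDataset(namespace="my_namespace", name="input_table")],
        outputs=[OutputDataset(namespace="my_namespace", name="output_table")],
    )
)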

Summary

Griffin focuses on data quality and validation, while OpenLineage emphasizes data lineage and metadata tracking. OpenLineage offers broader integration capabilities and a standardized metadata model, but may require more setup. Griffin provides simpler data quality checks but has a narrower scope. Choose based on your specific needs for data quality vs. lineage tracking.

ODD Platform: First open-source data discovery and observability platform. We make a life for data practitioners easy so you can focus on your business.

Pros of ODD Platform

  • More active development with frequent updates and contributions
  • Broader scope, covering data discovery, lineage, and quality management
  • User-friendly web interface for easier data exploration and management

Cons of ODD Platform

  • Less mature than Griffin, which is backed by the Apache Software Foundation
  • Steeper learning curve due to more complex architecture
  • Potentially higher resource requirements for deployment

Code Comparison

Griffin (Data Quality Check):

public class AccuracyRule extends Rule {
    @Override
    public boolean execute(Record record) {
        return record.getValue("field") != null;
    }
}

ODD Platform (Data Quality Check):

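from typing import Any, Dict

import pandas as pd

def check_accuracy(df: pd.DataFrame) -> Dict[str, Any]:
    return {
        # int() converts the NumPy integer from sum() into a plain Python int
        "null_count": int(df["field"].isnull().sum()),
        "total_count": len(df),
    }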

Both projects aim to improve data quality and governance, but ODD Platform offers a more comprehensive solution with a modern tech stack. Griffin focuses primarily on data quality and has the advantage of Apache Foundation backing. ODD Platform provides a more user-friendly experience and broader functionality, but may require more resources to set up and maintain.
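
The two snippets above stop at raw null checks and counts. As a rough sketch of the next step, the counts can be folded into a completeness ratio and compared with a threshold. Everything here is hypothetical and belongs to neither project: the function name `completeness`, the `min_ratio` parameter, and the choice to treat a missing key the same as a null value.

```python
from typing import Any, Dict, Iterable, Mapping


def completeness(
    rows: Iterable[Mapping[str, Any]],
    field: str,
    min_ratio: float = 1.0,
) -> Dict[str, Any]:
    """Summarise how often `field` is populated across `rows`.

    Hypothetical helper: a value counts as null when it is None or when
    the key is absent from the row.
    """
    total = 0
    nulls = 0
    for row in rows:
        total += 1
        if row.get(field) is None:
            nulls += 1

    # With no rows there is nothing incomplete, so report a ratio of 1.0.
    ratio = (total - nulls) / total if total else 1.0
    return {
        "null_count": nulls,
        "total_count": total,
        "completeness": ratio,
        "passed": ratio >= min_ratio,
    }
```

Keeping the raw counts next to the ratio lets a report show both the score and the volume it was computed from.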

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Pros of Amundsen

  • More comprehensive metadata management and data discovery platform
  • Active community with regular updates and contributions
  • Integrates with popular data ecosystems like Airflow, Spark, and Tableau

Cons of Amundsen

  • More complex setup and configuration compared to Griffin
  • Requires additional components like Neo4j and Elasticsearch
  • May be overkill for smaller organizations or simpler data quality needs

Code Comparison

Amundsen (Python):

class TableMetadata(BaseModel):
    database: str
    cluster: str
    schema: str
    name: str
    description: Optional[str] = None
    tags: List[str] = []

Griffin (Java):

public class DataConnector {
    private String name;
    private String type;
    private String version;
    private String dataFrameName;
    private Map<String, Object> config;
}

While both projects deal with data management, Amundsen focuses on metadata and discovery, offering a more comprehensive solution for large-scale data ecosystems. Griffin, on the other hand, specializes in data quality and validation, providing a simpler setup for organizations primarily concerned with data integrity.

Amundsen's code example shows its focus on metadata structure, while Griffin's code demonstrates its emphasis on data connections and quality checks. The choice between the two depends on specific organizational needs and existing infrastructure.

README

Apache Griffin

License: Apache 2.0

Data quality (DQ) is a key criterion for many data consumers such as IoT and machine learning; however, there is no standard agreement on how to determine “good” data. Apache Griffin is a model-driven data quality service platform where you can examine your data on demand. It provides a standard process to define data quality measures, executions and reports, allowing those examinations to run across multiple data systems. When you don't trust your data, or are concerned that poorly controlled data can negatively impact critical decisions, you can use Apache Griffin to ensure data quality.
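
The define, execute and report process described above can be pictured with a small, self-contained Python sketch. This is a toy model under stated assumptions, not Griffin's implementation: the `Measure` class and the `execute` and `report` functions are hypothetical names, and in-memory lists stand in for real data systems.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Mapping, Sequence

Rows = Sequence[Mapping[str, Any]]


@dataclass(frozen=True)
class Measure:
    """A measure definition: a name, the data set it reads, and how to score it."""
    name: str
    dataset: str
    compute: Callable[[Rows], float]


def execute(measures: Sequence[Measure], datasets: Mapping[str, Rows]) -> Dict[str, float]:
    """Run every measure against the data set it names and collect the values."""
    return {m.name: m.compute(datasets[m.dataset]) for m in measures}


def report(results: Mapping[str, float]) -> List[str]:
    """Render results as sorted 'name: value' lines."""
    return [f"{name}: {value:.2f}" for name, value in sorted(results.items())]


# Usage: two hypothetical data sets, each with its own measure.
datasets = {
    "users": [{"id": 1}, {"id": None}],
    "orders": [{"id": 7}],
}
measures = [
    Measure("users_id_filled", "users",
            lambda rows: sum(r["id"] is not None for r in rows) / len(rows)),
    Measure("orders_row_count", "orders", lambda rows: float(len(rows))),
]
lines = report(execute(measures, datasets))
```

Separating the definition from the execution is what allows the same measure to be pointed at a different data system without rewriting it.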

Getting Started

Quick Start

You can try running Griffin in Docker by following the Docker guide.

Environment for Dev

Follow the Apache Griffin Development Environment Build Guide to set up a development environment.
If you want to contribute code to Griffin, please follow the Apache Griffin Development Code Style Config Guide to keep the code style consistent.

Deployment at Local

If you want to deploy Griffin in your local environment, please follow the Apache Griffin Deployment Guide.

Community

For more information about Griffin, please visit the Griffin home page.

You can contact us via email. To receive the latest information, subscribe to the dev and user mailing lists by sending an email to:

dev-subscribe@griffin.apache.org
users-subscribe@griffin.apache.org

You can access our issues on the JIRA page.

Contributing

See How to Contribute for details on how to contribute code, documentation, etc.

Here's the most direct way to get your work merged into Apache Griffin:

  • Fork the project on GitHub
  • Clone your fork
  • Implement your feature or bug fix and commit the changes
  • Push the branch to your fork
  • Send a pull request to the Apache Griffin master branch

References