Top Related Projects
The Metadata Platform for your Data Stack
An Open Standard for lineage metadata collection
The first open-source data discovery and observability platform. We make life easy for data practitioners so you can focus on your business.
Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.
Quick Overview
Apache Griffin is an open-source Data Quality Service platform designed for big data. It provides a unified process to measure data quality from different perspectives, helping organizations build trusted data assets and improve data quality.
Pros
- Supports both batch and streaming mode for data quality measurement
- Offers a flexible rule definition system for various data quality dimensions
- Provides a user-friendly web UI for easy configuration and visualization
- Integrates well with popular big data ecosystems (Hadoop, Spark, Hive, etc.)
Cons
- Steep learning curve for users new to big data technologies
- Limited documentation and examples for advanced use cases
- Requires significant setup and configuration for optimal performance
- May be overkill for smaller-scale data quality needs
Code Examples
- Defining a data quality measure:
val dqMeasure = DQMeasure()
  .setName("total_count")
  .setRule("select count(*) as total from source")
  .setDqType(DQType.Accuracy)
- Creating a data quality job:
val dqJob = DQJob()
  .setName("example_job")
  .setDataSource(dataSource)
  .setTarget(target)
  .setMeasures(Seq(dqMeasure))
- Running a data quality check:
val result = griffin.runJob(dqJob)
println(s"Data quality score: ${result.getScore}")
Getting Started
- Install Apache Griffin:
git clone https://github.com/apache/griffin.git
cd griffin
mvn clean install
- Configure your data sources in conf/datasources.json
- Define your data quality measures in conf/measures.json
- Start the Griffin service:
bin/griffin-service.sh start
- Access the web UI at http://localhost:8080 to monitor and manage your data quality jobs
Competitor Comparisons
The Metadata Platform for your Data Stack
Pros of DataHub
- More comprehensive metadata management platform with broader data ecosystem integration
- Active development and larger community support
- Richer UI for data discovery, lineage visualization, and governance
Cons of DataHub
- More complex setup and configuration compared to Griffin
- Steeper learning curve due to its extensive features
- Higher resource requirements for deployment and operation
Code Comparison
DataHub (Python client example):
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.com.linkedin.pegasus2avro.metadata.snapshot import DatasetSnapshot
from datahub.metadata.schema_classes import DatasetPropertiesClass
emitter = DatahubRestEmitter("http://localhost:8080")
dataset_snapshot = DatasetSnapshot(...)
dataset_snapshot.aspects.append(DatasetPropertiesClass(description="Example dataset"))
emitter.emit_metadata(dataset_snapshot)
Griffin (Java example for data quality check):
public class SampleRule extends Rule {
    @Override
    public ExecutionResult execute(SparkSession spark, Map<String, DataFrame> dataSets) {
        DataFrame df = dataSets.get("source");
        long count = df.filter("id IS NULL").count();
        return new ExecutionResult(count == 0, "Null ID check");
    }
}
Both projects aim to improve data quality and governance, but DataHub offers a more comprehensive solution for metadata management and data discovery, while Griffin focuses primarily on data quality validation and profiling.
An Open Standard for lineage metadata collection
Pros of OpenLineage
- More active development with frequent updates and contributions
- Broader ecosystem integration, supporting various data platforms and tools
- Standardized metadata model for easier interoperability
Cons of OpenLineage
- Steeper learning curve due to more complex architecture
- Requires more setup and configuration compared to Griffin
Code Comparison
Griffin (Data Quality Check):
public class AccuracyRule extends Rule {
    @Override
    public boolean execute(DataFrame df) {
        // Example accuracy check: pass only when no record has a null id
        return df.filter("id IS NULL").count() == 0;
    }
}
OpenLineage (Lineage Event):
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import InputDataset, Job, OutputDataset, Run, RunEvent, RunState

client = OpenLineageClient()
client.emit(
    RunEvent(
        eventType=RunState.START,
        eventTime=datetime.now(timezone.utc).isoformat(),
        producer="my_producer",
        job=Job(namespace="my_namespace", name="my_job"),
        run=Run(runId=str(uuid4())),  # run IDs are UUIDs
        inputs=[InputDataset(namespace="my_namespace", name="input_table")],
        outputs=[OutputDataset(namespace="my_namespace", name="output_table")],
    )
)
Summary
Griffin focuses on data quality and validation, while OpenLineage emphasizes data lineage and metadata tracking. OpenLineage offers broader integration capabilities and a standardized metadata model, but may require more setup. Griffin provides simpler data quality checks but has a narrower scope. Choose based on your specific needs for data quality vs. lineage tracking.
The first open-source data discovery and observability platform. We make life easy for data practitioners so you can focus on your business.
Pros of ODD Platform
- More active development with frequent updates and contributions
- Broader scope, covering data discovery, lineage, and quality management
- User-friendly web interface for easier data exploration and management
Cons of ODD Platform
- Less mature than Griffin, which is backed by the Apache Software Foundation
- Steeper learning curve due to more complex architecture
- Potentially higher resource requirements for deployment
Code Comparison
Griffin (Data Quality Check):
public class AccuracyRule extends Rule {
    @Override
    public boolean execute(Record record) {
        return record.getValue("field") != null;
    }
}
ODD Platform (Data Quality Check):
from typing import Any, Dict

import pandas as pd

def check_accuracy(df: pd.DataFrame) -> Dict[str, Any]:
    return {
        "null_count": df["field"].isnull().sum(),
        "total_count": len(df),
    }
Both projects aim to improve data quality and governance, but ODD Platform offers a more comprehensive solution with a modern tech stack. Griffin focuses primarily on data quality and has the advantage of Apache Foundation backing. ODD Platform provides a more user-friendly experience and broader functionality, but may require more resources to set up and maintain.
Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.
Pros of Amundsen
- More comprehensive metadata management and data discovery platform
- Active community with regular updates and contributions
- Integrates with popular data ecosystems like Airflow, Spark, and Tableau
Cons of Amundsen
- More complex setup and configuration compared to Griffin
- Requires additional components like Neo4j and Elasticsearch
- May be overkill for smaller organizations or simpler data quality needs
Code Comparison
Amundsen (Python):
from typing import List, Optional

from pydantic import BaseModel  # assumed for this illustrative snippet

class TableMetadata(BaseModel):
    database: str
    cluster: str
    schema: str
    name: str
    description: Optional[str] = None
    tags: List[str] = []
Griffin (Java):
public class DataConnector {
    private String name;
    private String type;
    private String version;
    private String dataFrameName;
    private Map<String, Object> config;
}
While both projects deal with data management, Amundsen focuses on metadata and discovery, offering a more comprehensive solution for large-scale data ecosystems. Griffin, on the other hand, specializes in data quality and validation, providing a simpler setup for organizations primarily concerned with data integrity.
Amundsen's code example shows its focus on metadata structure, while Griffin's code demonstrates its emphasis on data connections and quality checks. The choice between the two depends on specific organizational needs and existing infrastructure.
README
Apache Griffin
Data quality (DQ) is a key criterion for many data consumers such as IoT and machine learning, yet there is no standard agreement on how to determine "good" data. Apache Griffin is a model-driven data quality service platform where you can examine your data on demand. It provides a standard process to define data quality measures, executions, and reports, allowing these examinations to run across multiple data systems. When you don't trust your data, or are concerned that poorly controlled data can negatively impact critical decisions, you can use Apache Griffin to ensure data quality.
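To make the idea of a "measure" concrete, the sketch below shows the core of an accuracy-style check in plain Spark: the share of source records that find a matching record in a target table. This is only an illustration of the concept under assumed table names and an assumed join key ("id"), not Griffin's DSL or API:
import org.apache.spark.sql.SparkSession

// Illustrative accuracy measure: matched source records / total source records
val spark = SparkSession.builder().appName("accuracy-sketch").master("local[*]").getOrCreate()
val source = spark.table("source") // assumed registered table
val target = spark.table("target") // assumed registered table
val matched = source.join(target, Seq("id"), "left_semi").count()
val total = source.count()
val accuracy = if (total == 0) 1.0 else matched.toDouble / total
println(f"accuracy: ${accuracy * 100}%.2f%%")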
Getting Started
Quick Start
You can try running Griffin in Docker by following the Docker guide.
Environment for Dev
Follow the Apache Griffin Development Environment Build Guide to set up your development environment.
If you want to contribute code to Griffin, please follow the Apache Griffin Development Code Style Config Guide to keep a consistent code style.
Deployment at Local
If you want to deploy Griffin in your local environment, please follow the Apache Griffin Deployment Guide.
Community
For more information about Griffin, please visit our website: the Griffin home page.
You can contact us via email:
- dev-list: dev@griffin.apache.org
- user-list: users@griffin.apache.org
You can also subscribe to the latest information by sending an email to the dev-list and user-list subscription addresses:
dev-subscribe@griffin.apache.org
users-subscribe@griffin.apache.org
You can access our issues on the JIRA page.
Contributing
See How to Contribute for details on how to contribute code, documentation, etc.
Here's the most direct way to get your work merged into Apache Griffin.
- Fork the project on GitHub
- Clone down your fork
- Implement your feature or bug fix and commit your changes
- Push the branch up to your fork
- Send a pull request to the Apache Griffin master branch