Top Related Projects
The Metadata Platform for your Data and AI Stack
An Open Standard for lineage metadata collection
The Metadata Platform for your Data Stack
First open-source data discovery and observability platform. We make a life for data practitioners easy so you can focus on your business.
Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.
Quick Overview
Apache Griffin is an open-source Data Quality Service platform designed for big data. It provides a unified process to measure data quality from different perspectives, helping organizations build trusted data assets and improve data quality.
Pros
- Supports both batch and streaming mode for data quality measurement
- Offers a flexible rule definition system for various data quality dimensions
- Provides a user-friendly web UI for easy configuration and visualization
- Integrates well with popular big data ecosystems (Hadoop, Spark, Hive, etc.)
Cons
- Steep learning curve for users new to big data technologies
- Limited documentation and examples for advanced use cases
- Requires significant setup and configuration for optimal performance
- May be overkill for smaller-scale data quality needs
Code Examples
- Defining a data quality measure:
val dqMeasure = DQMeasure()
  .setName("total_count")
  .setRule("select count(*) as total from source")
  .setDqType(DQType.Accuracy)
- Creating a data quality job:
val dqJob = DQJob()
  .setName("example_job")
  .setDataSource(dataSource)
  .setTarget(target)
  .setMeasures(Seq(dqMeasure))
- Running a data quality check:
val result = griffin.runJob(dqJob)
println(s"Data quality score: ${result.getScore}")
Getting Started
- Install Apache Griffin:
git clone https://github.com/apache/griffin.git
cd griffin
mvn clean install
- Configure your data sources in conf/datasources.json
- Define your data quality measures in conf/measures.json
- Start the Griffin service:
bin/griffin-service.sh start
- Access the web UI at http://localhost:8080 to monitor and manage your data quality jobs
Competitor Comparisons
The Metadata Platform for your Data and AI Stack
Pros of DataHub
- More comprehensive metadata management platform with broader data ecosystem integration
- Active development and larger community support
- Richer UI for data discovery, lineage visualization, and governance
Cons of DataHub
- More complex setup and configuration compared to Griffin
- Steeper learning curve due to its extensive features
- Higher resource requirements for deployment and operation
Code Comparison
DataHub (Python client example):
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.com.linkedin.pegasus2avro.metadata.snapshot import DatasetSnapshot
from datahub.metadata.com.linkedin.pegasus2avro.mxe import MetadataChangeEvent
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter("http://localhost:8080")
dataset_snapshot = DatasetSnapshot(...)
dataset_snapshot.aspects.append(DatasetPropertiesClass(description="Example dataset"))
# Wrap the snapshot in a MetadataChangeEvent before sending it to the REST endpoint
emitter.emit_mce(MetadataChangeEvent(proposedSnapshot=dataset_snapshot))
Griffin (Java example for data quality check):
public class SampleRule extends Rule {
    @Override
    public ExecutionResult execute(SparkSession spark, Map<String, Dataset<Row>> dataSets) {
        // Spark's Java API exposes DataFrames as Dataset<Row>
        Dataset<Row> df = dataSets.get("source");
        // The rule passes only when no row has a null id
        long count = df.filter("id IS NULL").count();
        return new ExecutionResult(count == 0, "Null ID check");
    }
}
Both projects aim to improve data quality and governance, but DataHub offers a more comprehensive solution for metadata management and data discovery, while Griffin focuses primarily on data quality validation and profiling.
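To make "data quality validation and profiling" concrete, here is a minimal sketch in plain Python. It does not use Griffin's or DataHub's API; the function names `profile` and `has_no_nulls` and the sample rows are hypothetical, chosen only to mirror the null-ID check in the Java example above.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def profile(rows: List[Row]) -> Dict[str, Dict[str, int]]:
    """Per-column profile: total rows, null count, and distinct non-null values."""
    columns = sorted({key for row in rows for key in row})
    result: Dict[str, Dict[str, int]] = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        non_null = [v for v in values if v is not None]
        result[col] = {
            "total": len(values),
            "nulls": len(values) - len(non_null),
            "distinct": len(set(non_null)),
        }
    return result


def has_no_nulls(rows: List[Row], column: str) -> bool:
    """Validation rule: passes only when no row has a null in `column`."""
    return all(row.get(column) is not None for row in rows)


# Hypothetical sample data for illustration
rows = [
    {"id": 1, "country": "DE"},
    {"id": None, "country": "DE"},
    {"id": 3, "country": None},
    {"id": 3, "country": "FR"},
]

stats = profile(rows)
id_check_passed = has_no_nulls(rows, "id")
print(stats)
print("Null ID check passed:", id_check_passed)
```

Profiling describes the data as it is, while validation turns an expectation about that data into a pass or fail result.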
An Open Standard for lineage metadata collection
Pros of OpenLineage
- More active development with frequent updates and contributions
- Broader ecosystem integration, supporting various data platforms and tools
- Standardized metadata model for easier interoperability
Cons of OpenLineage
- Steeper learning curve due to more complex architecture
- Requires more setup and configuration compared to Griffin
Code Comparison
Griffin (Data Quality Check):
public class AccuracyRule extends Rule {
    @Override
    public boolean execute(Dataset<Row> df) {
        // Implement accuracy check logic
        throw new UnsupportedOperationException("accuracy check not implemented");
    }
}
OpenLineage (Lineage Event):
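from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.event_v2 import InputDataset, Job, OutputDataset, Run, RunEvent, RunState

client = OpenLineageClient()  # configuration is read from the environment
client.emit(
    RunEvent(
        eventType=RunState.START,
        eventTime=datetime.now(timezone.utc).isoformat(),
        job=Job(namespace="my_namespace", name="my_job"),
        run=Run(runId=str(uuid4())),  # runId must be a UUID
        inputs=[InputDataset(namespace="my_namespace", name="input_table")],
        outputs=[OutputDataset(namespace="my_namespace", name="output_table")],
    )
)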
Summary
Griffin focuses on data quality and validation, while OpenLineage emphasizes data lineage and metadata tracking. OpenLineage offers broader integration capabilities and a standardized metadata model, but may require more setup. Griffin provides simpler data quality checks but has a narrower scope. Choose based on your specific needs for data quality vs. lineage tracking.
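As a rough illustration of what lineage tracking records, the sketch below builds a run event as a plain dictionary using only the standard library. The field names follow the OpenLineage run event as commonly documented, but the sketch omits `schemaURL` and facets, and the helper `build_run_event` and the producer URL are hypothetical. Check it against the OpenLineage specification before relying on it.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Any, Dict, List, Tuple

DatasetRef = Tuple[str, str]  # (namespace, name)


def build_run_event(
    event_type: str,
    job_namespace: str,
    job_name: str,
    inputs: List[DatasetRef],
    outputs: List[DatasetRef],
) -> Dict[str, Any]:
    """Build a simplified lineage event: which job run read and wrote which datasets."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
        "producer": "https://example.com/my-producer",  # hypothetical producer URI
    }


event = build_run_event(
    "START",
    "my_namespace",
    "my_job",
    inputs=[("my_namespace", "input_table")],
    outputs=[("my_namespace", "output_table")],
)
print(json.dumps(event, indent=2))
```

An event like this says nothing about whether the data is good. It only records which job touched which datasets, which is the complementary question to the one Griffin answers.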
First open-source data discovery and observability platform. We make a life for data practitioners easy so you can focus on your business.
Pros of ODD Platform
- More active development with frequent updates and contributions
- Broader scope, covering data discovery, lineage, and quality management
- User-friendly web interface for easier data exploration and management
Cons of ODD Platform
- Less mature project than Griffin, which is developed under the Apache Software Foundation
- Steeper learning curve due to more complex architecture
- Potentially higher resource requirements for deployment
Code Comparison
Griffin (Data Quality Check):
public class AccuracyRule extends Rule {
    @Override
    public boolean execute(Record record) {
        return record.getValue("field") != null;
    }
}
ODD Platform (Data Quality Check):
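from typing import Any, Dict

import pandas as pd

def check_accuracy(df: pd.DataFrame) -> Dict[str, Any]:
    # Report how many values in "field" are missing, alongside the row count
    return {
        "null_count": int(df["field"].isnull().sum()),
        "total_count": len(df),
    }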
Both projects aim to improve data quality and governance, but ODD Platform offers a more comprehensive solution with a modern tech stack. Griffin focuses primarily on data quality and has the advantage of Apache Foundation backing. ODD Platform provides a more user-friendly experience and broader functionality, but may require more resources to set up and maintain.
Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.
Pros of Amundsen
- More comprehensive metadata management and data discovery platform
- Active community with regular updates and contributions
- Integrates with popular data ecosystems like Airflow, Spark, and Tableau
Cons of Amundsen
- More complex setup and configuration compared to Griffin
- Requires additional components like Neo4j and Elasticsearch
- May be overkill for smaller organizations or simpler data quality needs
Code Comparison
Amundsen (Python):
class TableMetadata(BaseModel):
    database: str
    cluster: str
    schema: str
    name: str
    description: Optional[str] = None
    tags: List[str] = []
Griffin (Java):
public class DataConnector {
    private String name;
    private String type;
    private String version;
    private String dataFrameName;
    private Map<String, Object> config;
}
While both projects deal with data management, Amundsen focuses on metadata and discovery, offering a more comprehensive solution for large-scale data ecosystems. Griffin, on the other hand, specializes in data quality and validation, providing a simpler setup for organizations primarily concerned with data integrity.
Amundsen's code example shows its focus on metadata structure, while Griffin's code demonstrates its emphasis on data connections and quality checks. The choice between the two depends on specific organizational needs and existing infrastructure.
README
Apache Griffin
Data quality (DQ) is a key criterion for many data consumers such as IoT and machine learning; however, there is no standard agreement on how to determine "good" data. Apache Griffin is a model-driven data quality service platform where you can examine your data on demand. It provides a standard process to define data quality measures, executions, and reports, allowing those examinations across multiple data systems. When you don't trust your data, or are concerned that poorly controlled data can negatively impact critical decisions, you can use Apache Griffin to ensure data quality.
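The define, execute, and report process described above can be sketched in a few lines of plain Python. This is only a conceptual illustration under stated assumptions: `Measure`, `completeness`, and `run_measures` are hypothetical names, not part of Apache Griffin, which runs its measures on Spark against real data systems.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


@dataclass
class Measure:
    """Definition step: a named function from records to a score in [0, 1]."""
    name: str
    evaluate: Callable[[List[Record]], float]


def completeness(field: str) -> Measure:
    """Fraction of records whose `field` is present and not None."""
    def evaluate(records: List[Record]) -> float:
        if not records:
            return 1.0  # convention chosen here: an empty dataset has nothing missing
        present = sum(1 for r in records if r.get(field) is not None)
        return present / len(records)
    return Measure(name=f"completeness({field})", evaluate=evaluate)


def run_measures(records: List[Record], measures: List[Measure]) -> Dict[str, float]:
    """Execution step: apply every measure and collect the scores into a report."""
    return {m.name: m.evaluate(records) for m in measures}


# Hypothetical sample data for illustration
records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": None, "email": "c@example.com"},
    {"id": 4, "email": "d@example.com"},
]

# Reporting step: print one line per measure
report = run_measures(records, [completeness("id"), completeness("email")])
for name, score in report.items():
    print(f"{name}: {score:.2f}")
```

Keeping the measure definitions separate from the execution step is what lets the same measures be reused across different datasets.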
Getting Started
Quick Start
You can try running Griffin in docker following the docker guide.
Environment for Dev
Follow the Apache Griffin Development Environment Build Guide to set up your development environment.
If you want to contribute code to Griffin, please follow the Apache Griffin Development Code Style Config Guide to keep a consistent code style.
Deployment at Local
If you want to deploy Griffin in your local environment, please follow Apache Griffin Deployment Guide.
Community
For more information about Griffin, please visit our website at: griffin home page.
You can contact us via email:
- dev-list: dev@griffin.apache.org
- user-list: users@griffin.apache.org
You can also subscribe to the latest information by sending an email to the dev-list and user-list subscription addresses:
dev-subscribe@griffin.apache.org
users-subscribe@griffin.apache.org
You can access our issues on the JIRA page.
Contributing
See How to Contribute for details on how to contribute code, documentation, etc.
Here's the most direct way to get your work merged into Apache Griffin.
- Fork the project on GitHub
- Clone down your fork
- Implement your feature or bug fix and commit changes
- Push the branch up to your fork
- Send a pull request to Apache Griffin master branch