Convert Figma logo to code with AI

datahub-project logodatahub

The Metadata Platform for your Data Stack

9,797
2,890
9,797
274

Top Related Projects

10,041

The Metadata Platform for your Data and AI Stack

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

1,851

Apache Atlas

An Open Standard for lineage metadata collection

Quick Overview

DataHub is an open-source metadata platform for the modern data stack. It enables data discovery, data observability, and federated governance to help organizations maximize the value of their data ecosystem. DataHub provides a centralized platform for collecting, organizing, and utilizing metadata from various data sources and tools.

Pros

  • Comprehensive metadata management with support for a wide range of data sources and tools
  • Powerful search capabilities for easy data discovery
  • Extensible architecture allowing for custom integrations and plugins
  • Active community and regular updates

Cons

  • Steep learning curve for initial setup and configuration
  • Resource-intensive for large-scale deployments
  • Limited out-of-the-box visualizations compared to some commercial alternatives
  • Documentation can be inconsistent or outdated in some areas

Code Examples

# Connecting to DataHub using the Python SDK
from datahub.ingestion.sdk.client import DataHubClient

client = DataHubClient(server="http://localhost:8080")
result = client.get_aspect("urn:li:dataset:(urn:li:dataPlatform:bigquery,my_project.my_dataset.my_table,PROD)", "schemaMetadata")
print(result)
# Ingesting metadata using a recipe file
from datahub.ingestion.run.pipeline import Pipeline

Pipeline.create({
    "source": {
        "type": "mysql",
        "config": {
            "username": "user",
            "password": "pass",
            "host_port": "localhost:3306",
            "database": "my_database"
        }
    },
    "sink": {
        "type": "datahub-rest",
        "config": {
            "server": "http://localhost:8080"
        }
    }
}).run()
# Searching for datasets using the GraphQL API
from gql import gql, Client
from gql.transport.requests import RequestsHTTPTransport

transport = RequestsHTTPTransport(url='http://localhost:8080/api/graphql')
client = Client(transport=transport, fetch_schema_from_transport=True)

query = gql('''
query searchDatasets($input: SearchInput!) {
  search(input: $input) {
    searchResults {
      entity {
        urn
        type
        ... on Dataset {
          name
          platform {
            name
          }
        }
      }
    }
  }
}
''')

variables = {
    "input": {
        "type": "DATASET",
        "query": "sales",
        "start": 0,
        "count": 10
    }
}

result = client.execute(query, variable_values=variables)
print(result)

Getting Started

  1. Install DataHub using Docker:

    git clone https://github.com/datahub-project/datahub.git
    cd datahub
    docker-compose -f docker-compose.yml up
    
  2. Access the DataHub web interface at http://localhost:9002

  3. Install the Python SDK:

    pip install acryl-datahub
    
  4. Ingest metadata using a simple Python script:

    from datahub.ingestion.run.pipeline import Pipeline
    
    Pipeline.create({
        "source": {
            "type": "file",
            "config": {"filename": "sample_data.json"}
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"}
        }
    }).run()
    

Competitor Comparisons

10,041

The Metadata Platform for your Data and AI Stack

Pros of datahub

  • More comprehensive metadata management platform
  • Broader ecosystem with integrations for various data sources
  • Active community and regular updates

Cons of datahub

  • Steeper learning curve due to more complex architecture
  • Higher resource requirements for deployment and maintenance
  • May be overkill for smaller organizations or simpler use cases

Code comparison

datahub:

from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create({
    "source": {"type": "mysql", "config": {...}},
    "sink": {"type": "datahub-rest", "config": {...}}
})
pipeline.run()

datahub>:

# No direct code comparison available
# datahub> appears to be a placeholder or non-existent repository

Summary

DataHub is a robust metadata management platform with a wide range of features and integrations. It offers comprehensive solutions for data discovery, lineage, and governance. However, its complexity may be challenging for smaller teams or projects with simpler requirements. The datahub> repository doesn't seem to exist or contain relevant code for comparison, so it's difficult to provide a meaningful contrast between the two.

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Pros of Amundsen

  • Simpler architecture and easier initial setup
  • Strong focus on data discovery and search functionality
  • Better integration with Apache Atlas for metadata management

Cons of Amundsen

  • Less extensive features compared to DataHub
  • Smaller community and ecosystem
  • Limited support for complex data lineage scenarios

Code Comparison

Amundsen (Python):

from amundsen_common.models.table import Table
from amundsen_databuilder.models.neo4j_csv_serde import Neo4jCsvSerializable

class TableMetadata(Neo4jCsvSerializable):
    TABLE_NODE_LABEL = 'Table'

DataHub (Java):

import com.linkedin.metadata.aspect.DatasetAspect;
import com.linkedin.metadata.entity.EntityService;

public class DatasetService {
    private final EntityService _entityService;

Both projects use different programming languages and frameworks, reflecting their distinct approaches to metadata management and data cataloging.

1,851

Apache Atlas

Pros of Atlas

  • More mature project with longer history and established community
  • Stronger integration with Hadoop ecosystem and enterprise data governance
  • Comprehensive data lineage and impact analysis capabilities

Cons of Atlas

  • Steeper learning curve and more complex setup
  • Less focus on modern data stack and cloud-native technologies
  • UI can be less intuitive and user-friendly compared to DataHub

Code Comparison

Atlas (Java):

AtlasEntity entity = new AtlasEntity("hive_table", "employees");
entity.setAttribute("name", "employees");
entity.setAttribute("owner", "hr_department");
atlasClient.createEntity(entity);

DataHub (Python):

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.rest_emitter import DatahubRestEmitter

dataset_urn = make_dataset_urn("hive", "employees", "prod")
emitter = DatahubRestEmitter("http://localhost:8080")
emitter.emit_mce(dataset_urn)

Both projects aim to provide metadata management and data governance solutions, but they differ in their approach and target use cases. Atlas is more suited for traditional enterprise environments with a focus on Hadoop ecosystems, while DataHub is designed for modern data stacks and offers a more user-friendly experience. The choice between them depends on specific organizational needs and existing infrastructure.

An Open Standard for lineage metadata collection

Pros of OpenLineage

  • Focused specifically on data lineage, providing a more specialized solution
  • Lightweight and easier to integrate into existing data pipelines
  • Supports a wider range of data processing frameworks out-of-the-box

Cons of OpenLineage

  • Less comprehensive metadata management compared to DataHub
  • Smaller community and ecosystem of integrations
  • Limited features for data discovery and governance

Code Comparison

OpenLineage example (Python):

from openlineage.client import OpenLineageClient

client = OpenLineageClient.from_environment()
client.emit(run_event)

DataHub example (Python):

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.rest_emitter import DatahubRestEmitter

emitter = DatahubRestEmitter("http://localhost:8080")
dataset_urn = make_dataset_urn("bigquery", "my_dataset.my_table")
emitter.emit_mce(dataset_urn)

Both projects aim to improve data observability and lineage tracking, but they differ in scope and implementation. OpenLineage focuses specifically on lineage tracking across various data processing frameworks, while DataHub offers a more comprehensive metadata management solution. OpenLineage's lightweight nature makes it easier to integrate into existing pipelines, but DataHub provides more extensive features for data discovery and governance. The code examples demonstrate the different approaches: OpenLineage uses a simple client to emit run events, while DataHub requires more setup to emit metadata change events.

Convert Figma logo designs to code with AI

Visual Copilot

Introducing Visual Copilot: A new AI model to turn Figma designs to high quality code using your components.

Try Visual Copilot

README

DataHub

DataHub: The Data Discovery Platform for the Modern Data Stack

Built with ❤️ by Acryl Data and LinkedIn

Version PyPI version build & test Docker Pulls Slack PRs Welcome GitHub commit activity License YouTube Medium Follow

🏠 Hosted DataHub Docs (Courtesy of Acryl Data): datahubproject.io


Quickstart | Features | Roadmap | Adoption | Demo | Town Hall


📣 DataHub Town Hall is the 4th Thursday at 9am US PT of every month - add it to your calendar!

✨ DataHub Community Highlights:

Introduction

DataHub is an open-source data catalog for the modern data stack. Read about the architectures of different metadata systems and why DataHub excels here. Also read our LinkedIn Engineering blog post, check out our Strata presentation and watch our Crunch Conference Talk. You should also visit DataHub Architecture to get a better understanding of how DataHub is implemented.

Features & Roadmap

Check out DataHub's Features & Roadmap.

Demo and Screenshots

There's a hosted demo environment courtesy of Acryl Data where you can explore DataHub without installing it locally

Quickstart

Please follow the DataHub Quickstart Guide to get a copy of DataHub up & running locally using Docker. As the guide assumes some basic knowledge of Docker, we'd recommend you to go through the "Hello World" example of A Docker Tutorial for Beginners if Docker is completely foreign to you.

Development

If you're looking to build & modify datahub please take a look at our Development Guide.

Source Code and Repositories

  • datahub-project/datahub: This repository contains the complete source code for DataHub's metadata model, metadata services, integration connectors and the web application.
  • acryldata/datahub-actions: DataHub Actions is a framework for responding to changes to your DataHub Metadata Graph in real time.
  • acryldata/datahub-helm: Repository of helm charts for deploying DataHub on a Kubernetes cluster
  • acryldata/meta-world: A repository to store recipes, custom sources, transformations and other things to make your DataHub experience magical
  • dbt-impact-action : This repository contains a github action for commenting on your PRs with a summary of the impact of changes within a dbt project
  • datahub-tools : Additional python tools to interact with the DataHub GraphQL endpoints, built by Notion
  • business-glossary-sync-action : This repository contains a github action that opens PRs to update your business glossary yaml file.

Releases

See Releases page for more details. We follow the SemVer Specification when versioning the releases and adopt the Keep a Changelog convention for the changelog format.

Contributing

We welcome contributions from the community. Please refer to our Contributing Guidelines for more details. We also have a contrib directory for incubating experimental features.

Community

Join our Slack workspace for discussions and important announcements. You can also find out more about our upcoming town hall meetings and view past recordings.

Security

See Security Stance for information on DataHub's Security.

Adoption

Here are the companies that have officially adopted DataHub. Please feel free to add yours to the list if we missed it.

Select Articles & Talks

See the full list here.

License

Apache License 2.0.