atlas

Apache Atlas - Open Metadata Management and Governance capabilities across the Hadoop platform and beyond

Top Related Projects

datahub

10,791

The Metadata Platform for your Data and AI Stack

amundsen

4,619

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

OpenLineage

1,992

An Open Standard for lineage metadata collection

Quick Overview

Apache Atlas is an open-source data governance and metadata framework for Hadoop and Big Data platforms. It provides a comprehensive set of tools and services to enable data discovery, lineage, and governance across various data sources and processing frameworks.

Pros

Comprehensive Data Governance: Atlas provides a centralized platform for managing metadata, data lineage, and data policies across multiple data sources and processing frameworks.
Scalable and Extensible: Atlas is designed to handle large-scale data environments and can be extended to support new data sources and processing frameworks.
Flexible Data Modeling: Atlas supports a flexible data model that can be customized to fit the specific needs of an organization's data landscape.
Robust Security and Access Control: Atlas provides fine-grained access control and security features to ensure data privacy and compliance.

Cons

Steep Learning Curve: Setting up and configuring Atlas can be a complex process, especially for organizations new to data governance.
Limited Community Support: Compared to some other open-source projects, the Apache Atlas community may be smaller and less active, which can make it harder to find support and resources.
Integration Challenges: Integrating Atlas with existing data infrastructure and tools can be time-consuming and may require significant technical expertise.
Performance Limitations: Depending on the size and complexity of the data environment, Atlas may experience performance issues, especially when handling large volumes of metadata.

Getting Started

To get started with Apache Atlas, follow these steps:

1. Install Apache Atlas: You can download the latest version of Apache Atlas from the official website. Follow the installation instructions for your specific platform.
2. Configure Data Sources: Atlas supports a variety of data sources, including Hadoop, Hive, Kafka, and more. Configure the necessary connectors and integrations to connect Atlas to your data ecosystem.
3. Define Data Entities and Relationships: Use the Atlas web UI or the REST API to define the data entities and relationships in your organization's data landscape. This includes creating business glossaries, data lineage, and data policies.
4. Manage Data Governance Policies: Leverage Atlas's policy management features to define and enforce data governance policies, such as data classification, access control, and data retention.
5. Utilize Atlas's Metadata Search and Discovery: Take advantage of Atlas's search and discovery capabilities to find and understand the data assets across your organization.
6. Monitor and Audit Data Lineage: Use Atlas's data lineage and impact analysis features to track the flow of data and understand the relationships between different data assets.
7. Integrate Atlas with Other Tools: Explore the various integration options available for Atlas, such as connecting it with data processing frameworks, data catalogs, and business intelligence tools.
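
As a rough sketch of the "Define Data Entities and Relationships" step, the snippet below builds the JSON body and HTTP request for creating an entity through the Atlas v2 REST API (POST /api/atlas/v2/entity). The host, port, qualified name, and the helper names build_entity_payload and build_create_request are illustrative assumptions, and a real hive_table entity may need further attributes (such as a reference to its database) that are omitted here. The request is only prepared, not sent.

```python
import json
import urllib.request

ATLAS_URL = "http://localhost:21000"  # assumption: a local Atlas server on its default port


def build_entity_payload(type_name, qualified_name, **attributes):
    """Wrap attributes in the {"entity": {...}} envelope used by the v2 entity API."""
    attrs = {"qualifiedName": qualified_name, **attributes}
    return {"entity": {"typeName": type_name, "attributes": attrs}}


def build_create_request(payload, base_url=ATLAS_URL):
    """Prepare (but do not send) the POST request for the entity endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/api/atlas/v2/entity",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical table; the "db.table@cluster" qualified name is an assumed example value.
payload = build_entity_payload(
    "hive_table", "default.employees@cluster1", name="employees", owner="hr_department"
)
request = build_create_request(payload)
print(request.get_method(), request.full_url)
print(json.dumps(payload, indent=2))
```

To actually submit the entity, the prepared request would be passed to urllib.request.urlopen together with whatever authentication the Atlas deployment requires.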

By following these steps, you can start leveraging the power of Apache Atlas to improve data governance, metadata management, and data discovery within your organization.
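
To make the "Monitor and Audit Data Lineage" step more concrete, here is a minimal sketch of impact analysis over a lineage result. It assumes a dictionary shaped like the response of the Atlas lineage endpoint, with a "relations" list of fromEntityId/toEntityId pairs; the sample data and the downstream_guids helper are made up for illustration.

```python
from collections import defaultdict, deque


def downstream_guids(lineage, start_guid):
    """Breadth-first walk over lineage relations, returning reachable GUIDs in visit order."""
    edges = defaultdict(list)
    for rel in lineage.get("relations", []):
        edges[rel["fromEntityId"]].append(rel["toEntityId"])

    seen, order, queue = {start_guid}, [], deque([start_guid])
    while queue:
        current = queue.popleft()
        for nxt in edges[current]:
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


# Made-up lineage: table t1 -> process p1 -> table t2 -> process p2 -> table t3
sample = {
    "baseEntityGuid": "t1",
    "relations": [
        {"fromEntityId": "t1", "toEntityId": "p1"},
        {"fromEntityId": "p1", "toEntityId": "t2"},
        {"fromEntityId": "t2", "toEntityId": "p2"},
        {"fromEntityId": "p2", "toEntityId": "t3"},
    ],
}
print(downstream_guids(sample, "t1"))  # ['p1', 't2', 'p2', 't3']
```

Everything returned for a given starting entity is a candidate for review when that entity changes.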

Competitor Comparisons

datahub

10,791

The Metadata Platform for your Data and AI Stack

Pros of DataHub

More modern architecture with a focus on scalability and extensibility
Richer UI and user experience, including advanced search capabilities
Better support for cloud-native environments and microservices

Cons of DataHub

Younger project with potentially less stability compared to Atlas
Smaller community and ecosystem of integrations
Steeper learning curve due to more complex architecture

Code Comparison

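Atlas (Java, simplified):

// Simplified illustration of org.apache.atlas.model.instance.AtlasEntity
public class AtlasEntity extends AtlasStruct implements Serializable {
    private String guid;      // unique identifier assigned by Atlas
    private Status status;    // enum, e.g. ACTIVE or DELETED
    // typeName and the attributes map are inherited from AtlasStruct
}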

DataHub (Python):

class DatasetSnapshot(Snapshot):
    """Snapshot class for datasets"""
    def __init__(self, urn: str, aspects: List[Union[DatasetProperties, SchemaMetadata, ...]] = None):
        super().__init__(urn, aspects)

The two projects use different languages for their core implementations: Atlas is primarily Java, while DataHub combines Python, Java, and TypeScript. The snippets above are simplified illustrations, not verbatim excerpts, of how each project models metadata entities.

Atlas follows a traditional Java object-oriented approach, while the DataHub example relies on Python type hints. This reflects DataHub's more recent development and its focus on developer productivity.

amundsen

4,619

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Pros of Amundsen

More user-friendly interface and search functionality
Better integration with modern data stack tools (e.g., Airflow, dbt)
Faster setup and deployment process

Cons of Amundsen

Less comprehensive metadata management capabilities
Smaller community and ecosystem compared to Atlas
Limited support for complex data lineage scenarios

Code Comparison

Amundsen (Python, simplified):

from typing import List, Optional

from pydantic import BaseModel

class TableMetadata(BaseModel):
    database: str
    cluster: str
    schema: str
    name: str
    description: Optional[str] = None
    tags: List[str] = []

Atlas (Java):

public class AtlasEntity extends AtlasStruct implements Serializable {
    private String guid;
    private Status status;    // enum, e.g. ACTIVE or DELETED
    private String createdBy;
    private String updatedBy;
    // typeName is inherited from AtlasStruct
}

Both projects aim to provide data discovery and metadata management solutions, but they differ in their approach and focus. Amundsen emphasizes ease of use and modern integrations, while Atlas offers more comprehensive metadata management capabilities. The code snippets showcase the different languages and data modeling approaches used by each project.

OpenLineage

1,992

An Open Standard for lineage metadata collection

Pros of OpenLineage

Lightweight and focused specifically on data lineage
Easier integration with modern data stack tools
More active community and frequent updates

Cons of OpenLineage

Less comprehensive metadata management features
Smaller ecosystem of integrations compared to Atlas
Limited governance and security capabilities

Code Comparison

Atlas (Java):

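// atlasClient is assumed to be an AtlasClientV2; a real hive_table also needs a qualifiedName
AtlasEntity entity = new AtlasEntity("hive_table");
entity.setAttribute("name", "employees");
entity.setAttribute("owner", "hr_department");
atlasClient.createEntity(new AtlasEntity.AtlasEntityWithExtInfo(entity));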

OpenLineage (Python):

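from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

# Placeholder backend URL and producer; the run ID must be a UUID.
client = OpenLineageClient(url="http://localhost:5000")
client.emit(RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid4())),
    job=Job(namespace="hive", name="process_employees"),
    producer="https://example.com/placeholder-producer",
    inputs=[Dataset(namespace="hive", name="employees")],
    outputs=[Dataset(namespace="hive", name="processed_employees")],
))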

Summary

OpenLineage is a more modern, lightweight solution focused on data lineage, while Atlas offers a more comprehensive metadata management platform. OpenLineage is easier to integrate with contemporary data tools but has fewer features for governance and security. Atlas provides a broader range of functionalities but may be more complex to set up and maintain.

README

Apache Atlas Overview

The Apache Atlas framework is an extensible set of core foundational governance services that enables enterprises to meet their compliance requirements within Hadoop effectively and efficiently, and that allows integration with the whole enterprise data ecosystem.

This will provide true visibility in Hadoop by using both a prescriptive and forensic model, along with technical and operational audit as well as lineage enriched by business taxonomical metadata. It also enables any metadata consumer to work inter-operably without discrete interfaces to each other -- the metadata store is common.

The metadata veracity is maintained by leveraging Apache Ranger to prevent non-authorized access paths to data at runtime. Security is both role based (RBAC) and attribute based (ABAC).
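
The difference between the two security models can be shown with a toy example. This is not Apache Ranger's API: the User class, the two check functions, and the policy dictionary are hypothetical and only contrast a decision based on role membership with one based on attributes of the user and the resource.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    roles: set = field(default_factory=set)
    attributes: dict = field(default_factory=dict)


def rbac_allows(user, required_role):
    """Role-based check: access depends only on role membership."""
    return required_role in user.roles


def abac_allows(user, resource_tags, blocked_tags_by_dept):
    """Attribute-based check: access depends on user and resource attributes."""
    blocked = blocked_tags_by_dept.get(user.attributes.get("department"), set())
    return not (set(resource_tags) & blocked)


analyst = User("ana", roles={"data_analyst"}, attributes={"department": "marketing"})
policy = {"marketing": {"PII"}}  # hypothetical rule: marketing may not read PII-tagged data

print(rbac_allows(analyst, "data_analyst"))      # True
print(abac_allows(analyst, {"PII"}, policy))     # False
print(abac_allows(analyst, {"PUBLIC"}, policy))  # True
```

In the attribute-based case the same user gets different answers for different resources, depending on the tags attached to them.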

NOTE

Apache Atlas allows contributions via pull requests (PRs) on GitHub. Alternatively, submit changes for review using the Review Board. Also create an Atlas JIRA issue to go along with the review, and mention it in the pull request or Review Board review.

Building Atlas in Docker

Instructions for building and running Atlas in Docker: dev-support/atlas-docker/README.md

Regular Build Process

Get the Atlas sources into your local directory, for example with the following commands:

cd <your-local-directory>
git clone https://github.com/apache/atlas.git
cd atlas

# Checkout the branch or tag you would like to build

# to checkout a branch
git checkout <branch>

# to checkout a tag
git checkout tags/<tag>

Execute the following commands to build Apache Atlas:

export MAVEN_OPTS="-Xms2g -Xmx2g"
mvn clean install
mvn clean package -Pdist

After the above build commands complete successfully, you should see the following files:

distro/target/apache-atlas-<version>-bin.tar.gz
distro/target/apache-atlas-<version>-hbase-hook.tar.gz
distro/target/apache-atlas-<version>-hive-hook.tar.gz
distro/target/apache-atlas-<version>-impala-hook.tar.gz
distro/target/apache-atlas-<version>-kafka-hook.tar.gz
distro/target/apache-atlas-<version>-server.tar.gz
distro/target/apache-atlas-<version>-sources.tar.gz
distro/target/apache-atlas-<version>-sqoop-hook.tar.gz
distro/target/apache-atlas-<version>-storm-hook.tar.gz
distro/target/apache-atlas-<version>-falcon-hook.tar.gz
distro/target/apache-atlas-<version>-couchbase-hook.tar.gz
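
One way to confirm that the build produced every archive in the list above is a small check script. The sketch below is only an illustration: the helper names are hypothetical, and the version string passed in is an example value that must be replaced with the version actually built.

```python
from pathlib import Path

# Artifact suffixes taken from the list of expected build outputs.
SUFFIXES = [
    "bin", "hbase-hook", "hive-hook", "impala-hook", "kafka-hook", "server",
    "sources", "sqoop-hook", "storm-hook", "falcon-hook", "couchbase-hook",
]


def expected_artifacts(version):
    """Return the archive file names expected for the given Atlas version."""
    return [f"apache-atlas-{version}-{suffix}.tar.gz" for suffix in SUFFIXES]


def missing_artifacts(version, target_dir="distro/target"):
    """Return the expected archives that are not present in target_dir."""
    target = Path(target_dir)
    return [name for name in expected_artifacts(version) if not (target / name).is_file()]


version = "3.0.0-SNAPSHOT"  # example value only; use the version you built
missing = missing_artifacts(version)
print(f"{len(missing)} of {len(SUFFIXES)} archives missing")
for name in missing:
    print("  missing:", name)
```

Run it from the root of the Atlas checkout so that the relative distro/target path resolves.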

For more details on installing and running Apache Atlas, please refer to https://atlas.apache.org/#/Installation
