Top Related Projects
- DataHub: The Metadata Platform for your Data Stack
- Amundsen: a metadata-driven application for improving the productivity of data analysts, data scientists, and engineers when interacting with data
- Cartography: a Python tool that consolidates infrastructure assets and the relationships between them in an intuitive graph view powered by a Neo4j database
- Apache Atlas
Quick Overview
Metacat is an open-source metadata management and data discovery service developed by Netflix. It provides a unified view of metadata across various data sources, enabling data discovery, lineage, and governance capabilities for enterprises.
Pros
- Unified Metadata Management: Metacat aggregates metadata from diverse data sources, providing a centralized view of an organization's data assets.
- Data Discovery: The platform offers advanced search and browsing capabilities, making it easier for users to find and understand available data.
- Data Lineage: Metacat tracks the lineage of data, allowing users to understand the origin and transformation of data.
- Scalability: The system is designed to handle large-scale metadata management, supporting enterprises with growing data needs.
Cons
- Complexity: Integrating Metacat with existing data infrastructure may require significant setup and configuration, which can be challenging for some organizations.
- Limited Native Integrations: While Metacat supports a range of data sources, the list of native integrations may not cover all the data sources used by an organization.
- Learning Curve: Users may need to invest time in understanding the Metacat platform and its features, which can be a barrier to adoption.
- Dependency on External Components: Metacat relies on other components, such as Elasticsearch and Hive, which adds complexity to the overall system management.
Code Examples
Metacat is consumed primarily as a standalone metadata service exposing a REST API, rather than as a library you import, so there is no official snippet to embed here; an illustrative call against its API is sketched below.
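The following is a minimal sketch, not taken from the Metacat documentation, that lists the catalogs registered with a Metacat instance. It assumes the service is running locally on port 8080 and exposes the /mds/v1/catalog endpoint described in the project's README further down this page.
import requests
# List the catalogs registered with a locally running Metacat instance.
# Assumes the service is deployed at localhost:8080 (see the README's Getting Started section below).
resp = requests.get("http://localhost:8080/mds/v1/catalog", headers={"Accept": "application/json"})
resp.raise_for_status()
for catalog in resp.json():
    print(catalog)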
Getting Started
Metacat is deployed as a service rather than installed as a library; see the Getting Started section of the README below for build and deployment instructions.
Competitor Comparisons
DataHub: The Metadata Platform for your Data Stack
Pros of DataHub
- More comprehensive data catalog solution with features like data lineage, data quality, and data governance
- Active community development with frequent updates and contributions
- Supports a wider range of data sources and integrations
Cons of DataHub
- More complex setup and configuration compared to Metacat
- Steeper learning curve due to its extensive feature set
- May require more resources to run and maintain
Code Comparison
DataHub (Python client example):
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass, DatasetSnapshotClass, MetadataChangeEventClass
# Attach a description aspect to a dataset and emit it to a local DataHub instance.
emitter = DatahubRestEmitter("http://localhost:8080")
snapshot = DatasetSnapshotClass(
    urn="urn:li:dataset:(urn:li:dataPlatform:mysql,my_database.my_table,PROD)",
    aspects=[DatasetPropertiesClass(description="My dataset description")],
)
emitter.emit(MetadataChangeEventClass(proposedSnapshot=snapshot))
Metacat (Java API example):
import com.netflix.metacat.common.server.api.v1.MetacatV1;
import com.netflix.metacat.common.dto.TableDto;
MetacatV1 api = ...;
TableDto table = api.getTable("catalog", "database", "table");
table.setMetadata(ImmutableMap.of("description", "My table description"));
api.updateTable("catalog", "database", "table", table);
Both projects aim to provide metadata management solutions, but DataHub offers a more comprehensive platform with advanced features, while Metacat focuses on simpler metadata management for data warehouses and lakes.
Amundsen is a metadata-driven application for improving the productivity of data analysts, data scientists, and engineers when interacting with data.
Pros of Amundsen
- More comprehensive data discovery and metadata management solution
- Stronger focus on data lineage and relationships between data assets
- Better suited for large-scale data ecosystems with diverse data sources
Cons of Amundsen
- More complex setup and configuration compared to Metacat
- Requires additional components (e.g., Neo4j, Elasticsearch) for full functionality
- May have a steeper learning curve for users and administrators
Code Comparison
Amundsen (Python):
from pyhocon import ConfigFactory
from databuilder.extractor.neo4j_extractor import Neo4jExtractor
from databuilder.job.job import DefaultJob
from databuilder.loader.file_system_neo4j_csv_loader import FsNeo4jCSVLoader
# Scoped settings read by the extractor and loader when they are wired into a DefaultJob.
job_config = ConfigFactory.from_dict({
    'extractor.neo4j.graph_url': 'bolt://localhost:7687',
    'loader.filesystem_csv_neo4j.node_dir_path': '/tmp/nodes',
    'loader.filesystem_csv_neo4j.relationship_dir_path': '/tmp/relationships',
})
Metacat (Java):
import com.netflix.metacat.common.server.properties.Config;
import com.netflix.metacat.main.api.v1.MetacatV1;
public class MetacatExample {
    private final MetacatV1 api;
    private final Config config;

    public MetacatExample(MetacatV1 api, Config config) {
        this.api = api;
        this.config = config;
    }
}
Cartography is a Python tool that consolidates infrastructure assets and the relationships between them in an intuitive graph view powered by a Neo4j database.
Pros of Cartography
- Focuses on security and infrastructure analysis, providing a more comprehensive view of cloud assets and their relationships
- Offers visualization capabilities, making it easier to understand complex infrastructure setups
- Supports multiple cloud providers (AWS, GCP, Azure) out of the box
Cons of Cartography
- Less emphasis on metadata management and data discovery compared to Metacat
- May require more setup and configuration for data-centric use cases
- Smaller community and fewer integrations with data processing tools
Code Comparison
Cartography (Python):
from cartography.intel.aws import ec2
# Ingest EC2 instances for the given regions into the Neo4j graph.
def sync(neo4j_session, boto3_session, regions, update_tag):
    ec2.sync_ec2_instances(neo4j_session, boto3_session, regions, update_tag)
Metacat (Java):
@Slf4j
@Singleton
public class MetacatThriftHiveClient extends HiveClientFactory {
    @Inject
    public MetacatThriftHiveClient(Config config, MetacatHMSHandler handler) {
        super(config, handler);
    }
}
Apache Atlas
Pros of Atlas
- More comprehensive data governance and lineage capabilities
- Stronger integration with Hadoop ecosystem components
- Active Apache project with broader community support
Cons of Atlas
- Steeper learning curve and more complex setup
- Less focus on cloud-native environments compared to Metacat
- Potentially heavier resource requirements for deployment
Code Comparison
Atlas (Java):
AtlasClient atlasClient = new AtlasClient(atlasUrls, new String[]{"admin", "admin"});
Referenceable db = new Referenceable("hive_db");
db.set("name", "default");
db.set("description", "Default Hive database");
atlasClient.createEntity(db);
Metacat (Java):
MetacatClient metacatClient = new MetacatClient(config);
DatabaseCreateRequestDto createDto = new DatabaseCreateRequestDto();
createDto.setDefinitionMetadata(ImmutableMap.of("owner", "team_data"));
metacatClient.createDatabase("hive", "default", createDto);
Both projects aim to provide metadata management solutions, but Atlas offers more extensive data governance features, while Metacat focuses on cloud-native environments and simplicity. Atlas integrates better with Hadoop ecosystems, whereas Metacat excels in multi-cloud deployments. The code snippets demonstrate the different approaches to creating database entities, with Atlas using a more detailed object model and Metacat opting for a simpler request-based approach.
README
Metacat
Introduction
Metacat is a unified metadata exploration API service. You can explore Hive, RDS, Teradata, Redshift, S3, and Cassandra. Metacat provides information about what data you have, where it resides, and how to process it. Metadata, in the end, is really data about the data, so the primary purpose of Metacat is to give you a place to describe data so that you can do more useful things with it.
Metacat focuses on solving these three problems:
- Federate views of metadata systems.
- Allow arbitrary metadata storage about data sets.
- Enable metadata discovery.
Documentation
TODO
Releases
Builds
Metacat builds are run on Travis CI.
Getting Started
git clone git@github.com:Netflix/metacat.git
cd metacat
./gradlew clean build
Once the build is completed, the Metacat WAR file is generated under the metacat-war/build/libs directory. Metacat needs two basic configurations:
- metacat.plugin.config.location: Path to the directory containing the catalog configuration. Please look at the catalog samples used for functional testing.
- metacat.usermetadata.config.location: Path to the configuration file containing the connection properties used to store user metadata. Please look at the provided sample.
Running Locally
Take the built WAR from metacat-war/build/libs and deploy it to an existing Tomcat as ROOT.war.
The REST API can be accessed @ http://localhost:8080/mds/v1/catalog
Swagger API documentation can be accessed @ http://localhost:8080/swagger-ui/index.html
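As a quick smoke test of a local deployment, the REST API can be called from any HTTP client. The snippet below is an illustrative sketch, not taken from the Metacat documentation: the catalog, database, and table names are placeholders, and the table route is assumed to follow the /mds/v1/catalog base path shown above; consult the Swagger UI for the authoritative list of endpoints.
import requests
# Illustrative only: fetch metadata for one table from a locally deployed Metacat.
# The names below ("hive", "default", "my_table") and the exact route are assumptions;
# see the Swagger UI listed above for the real API surface.
BASE = "http://localhost:8080/mds/v1"
resp = requests.get(f"{BASE}/catalog/hive/database/default/table/my_table",
                    headers={"Accept": "application/json"})
resp.raise_for_status()
print(sorted(resp.json().keys()))  # top-level fields of the returned table metadata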
Docker Compose Example
Pre-requisite: Docker Compose is installed.
To start a self-contained Metacat environment with some sample catalogs, run the command below. This will start a docker-compose cluster containing a Metacat container, a Hive metastore container, a Cassandra container, and a PostgreSQL container.
./gradlew metacatPorts
- metacatPorts: Prints out which exposed ports are mapped to the internal container ports. Look for the host port (MAPPED_PORT) mapped to container port 8080.
REST API can be accessed @ http://localhost:<MAPPED_PORT>/mds/v1/catalog
Swagger API documentation can be accessed @ http://localhost:<MAPPED_PORT>/swagger-ui/index.html
To stop the docker compose cluster:
./gradlew stopMetacatCluster