Top Related Projects
- DataHub: The Metadata Platform for your Data Stack
- Amundsen: a metadata-driven application for improving the productivity of data analysts, data scientists, and engineers when interacting with data
- Cartography: a Python tool that consolidates infrastructure assets and the relationships between them in an intuitive graph view powered by a Neo4j database
- Apache Atlas
Quick Overview
Metacat is an open-source metadata management and data discovery service developed by Netflix. It provides a unified view of metadata across various data sources, enabling data discovery, lineage, and governance capabilities for enterprises.
Pros
- Unified Metadata Management: Metacat aggregates metadata from diverse data sources, providing a centralized view of an organization's data assets.
- Data Discovery: The platform offers advanced search and browsing capabilities, making it easier for users to find and understand available data.
- Data Lineage: Metacat tracks the lineage of data, allowing users to understand the origin and transformation of data.
- Scalability: The system is designed to handle large-scale metadata management, supporting enterprises with growing data needs.
Cons
- Complexity: Integrating Metacat with existing data infrastructure may require significant setup and configuration, which can be challenging for some organizations.
- Limited Native Integrations: While Metacat supports a range of data sources, the list of native integrations may not cover all the data sources used by an organization.
- Learning Curve: Users may need to invest time in understanding the Metacat platform and its features, which can be a barrier to adoption.
- Dependency on External Components: Metacat relies on other components, such as Elasticsearch and Hive, which adds complexity to the overall system management.
Code Examples
Metacat is consumed primarily as a standalone metadata service exposing a REST API, rather than as a library you import, so there is no official snippet to embed here; an illustrative call against its API is sketched below.
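The following is a minimal sketch, not taken from the Metacat documentation, that lists the catalogs registered with a Metacat instance. It assumes the service is running locally on port 8080 and exposes the /mds/v1/catalog endpoint described in the project's README further down this page.
import requests
# List the catalogs registered with a locally running Metacat instance.
# Assumes the service is deployed at localhost:8080 (see the README's Getting Started section below).
resp = requests.get("http://localhost:8080/mds/v1/catalog", headers={"Accept": "application/json"})
resp.raise_for_status()
for catalog in resp.json():
    print(catalog)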
Getting Started
Metacat is deployed as a service rather than installed as a library; see the Getting Started section of the README below for build and deployment instructions.
Competitor Comparisons
DataHub: The Metadata Platform for your Data Stack
Pros of DataHub
- More comprehensive data catalog solution with features like data lineage, data quality, and data governance
- Active community development with frequent updates and contributions
- Supports a wider range of data sources and integrations
Cons of DataHub
- More complex setup and configuration compared to Metacat
- Steeper learning curve due to its extensive feature set
- May require more resources to run and maintain
Code Comparison
DataHub (Python client example):
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass, DatasetSnapshotClass, MetadataChangeEventClass
# Attach a description aspect to a dataset and emit it to a local DataHub instance.
emitter = DatahubRestEmitter("http://localhost:8080")
snapshot = DatasetSnapshotClass(
    urn="urn:li:dataset:(urn:li:dataPlatform:mysql,my_database.my_table,PROD)",
    aspects=[DatasetPropertiesClass(description="My dataset description")],
)
emitter.emit(MetadataChangeEventClass(proposedSnapshot=snapshot))
Metacat (Java API example):
import com.netflix.metacat.common.server.api.v1.MetacatV1;
import com.netflix.metacat.common.dto.TableDto;
MetacatV1 api = ...;
TableDto table = api.getTable("catalog", "database", "table");
table.setMetadata(ImmutableMap.of("description", "My table description"));
api.updateTable("catalog", "database", "table", table);
Both projects aim to provide metadata management solutions, but DataHub offers a more comprehensive platform with advanced features, while Metacat focuses on simpler metadata management for data warehouses and lakes.
Amundsen is a metadata-driven application for improving the productivity of data analysts, data scientists, and engineers when interacting with data.
Pros of Amundsen
- More comprehensive data discovery and metadata management solution
- Stronger focus on data lineage and relationships between data assets
- Better suited for large-scale data ecosystems with diverse data sources
Cons of Amundsen
- More complex setup and configuration compared to Metacat
- Requires additional components (e.g., Neo4j, Elasticsearch) for full functionality
- May have a steeper learning curve for users and administrators
Code Comparison
Amundsen (Python):
from pyhocon import ConfigFactory
from databuilder.extractor.neo4j_extractor import Neo4jExtractor
from databuilder.job.job import DefaultJob
from databuilder.loader.file_system_neo4j_csv_loader import FsNeo4jCSVLoader
# Scoped settings read by the extractor and loader when they are wired into a DefaultJob.
job_config = ConfigFactory.from_dict({
    'extractor.neo4j.graph_url': 'bolt://localhost:7687',
    'loader.filesystem_csv_neo4j.node_dir_path': '/tmp/nodes',
    'loader.filesystem_csv_neo4j.relationship_dir_path': '/tmp/relationships',
})
Metacat (Java):
import com.netflix.metacat.common.server.properties.Config;
import com.netflix.metacat.main.api.v1.MetacatV1;
public class MetacatExample {
    private final MetacatV1 api;
    private final Config config;

    public MetacatExample(MetacatV1 api, Config config) {
        this.api = api;
        this.config = config;
    }
}
Cartography is a Python tool that consolidates infrastructure assets and the relationships between them in an intuitive graph view powered by a Neo4j database.
Pros of Cartography
- Focuses on security and infrastructure analysis, providing a more comprehensive view of cloud assets and their relationships
- Offers visualization capabilities, making it easier to understand complex infrastructure setups
- Supports multiple cloud providers (AWS, GCP, Azure) out of the box
Cons of Cartography
- Less emphasis on metadata management and data discovery compared to Metacat
- May require more setup and configuration for data-centric use cases
- Smaller community and fewer integrations with data processing tools
Code Comparison
Cartography (Python):
from cartography.intel.aws import ec2
# Ingest EC2 instances for the given regions into the Neo4j graph.
def sync(neo4j_session, boto3_session, regions, update_tag):
    ec2.sync_ec2_instances(neo4j_session, boto3_session, regions, update_tag)
Metacat (Java):
@Slf4j
@Singleton
public class MetacatThriftHiveClient extends HiveClientFactory {
    @Inject
    public MetacatThriftHiveClient(Config config, MetacatHMSHandler handler) {
        super(config, handler);
    }
}
Apache Atlas
Pros of Atlas
- More comprehensive data governance and lineage capabilities
- Stronger integration with Hadoop ecosystem components
- Active Apache project with broader community support
Cons of Atlas
- Steeper learning curve and more complex setup
- Less focus on cloud-native environments compared to Metacat
- Potentially heavier resource requirements for deployment
Code Comparison
Atlas (Java):
AtlasClient atlasClient = new AtlasClient(atlasUrls, new String[]{"admin", "admin"});
Referenceable db = new Referenceable("hive_db");
db.set("name", "default");
db.set("description", "Default Hive database");
atlasClient.createEntity(db);
Metacat (Java):
MetacatClient metacatClient = new MetacatClient(config);
DatabaseCreateRequestDto createDto = new DatabaseCreateRequestDto();
createDto.setDefinitionMetadata(ImmutableMap.of("owner", "team_data"));
metacatClient.createDatabase("hive", "default", createDto);
Both projects aim to provide metadata management solutions, but Atlas offers more extensive data governance features, while Metacat focuses on cloud-native environments and simplicity. Atlas integrates better with Hadoop ecosystems, whereas Metacat excels in multi-cloud deployments. The code snippets demonstrate the different approaches to creating database entities, with Atlas using a more detailed object model and Metacat opting for a simpler request-based approach.
README
Metacat
Introduction
Metacat is a unified metadata exploration API service. You can explore Hive, RDS, Teradata, Redshift, S3, and Cassandra. Metacat provides information about what data you have, where it resides, and how to process it. Metadata, in the end, is really data about the data, so the primary purpose of Metacat is to give you a place to describe data so that you can do more useful things with it.
Metacat focuses on solving these three problems:
- Federate views of metadata systems.
- Allow arbitrary metadata storage about data sets.
- Enable metadata discovery.
Documentation
TODO
Releases
Builds
Metacat builds are run on Travis CI.
Getting Started
git clone git@github.com:Netflix/metacat.git
cd metacat
./gradlew clean build
Once the build is completed, the Metacat WAR file is generated under the metacat-war/build/libs directory. Metacat needs two basic configurations:
- metacat.plugin.config.location: Path to the directory containing the catalog configuration. Please look at the catalog samples used for functional testing.
- metacat.usermetadata.config.location: Path to the configuration file containing the connection properties used to store user metadata. Please look at the provided sample.
Running Locally
Take the built WAR from metacat-war/build/libs and deploy it to an existing Tomcat as ROOT.war.
The REST API can be accessed @ http://localhost:8080/mds/v1/catalog
Swagger API documentation can be accessed @ http://localhost:8080/swagger-ui/index.html
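As a quick smoke test of a local deployment, the REST API can be called from any HTTP client. The snippet below is an illustrative sketch, not taken from the Metacat documentation: the catalog, database, and table names are placeholders, and the table route is assumed to follow the /mds/v1/catalog base path shown above; consult the Swagger UI for the authoritative list of endpoints.
import requests
# Illustrative only: fetch metadata for one table from a locally deployed Metacat.
# The names below ("hive", "default", "my_table") and the exact route are assumptions;
# see the Swagger UI listed above for the real API surface.
BASE = "http://localhost:8080/mds/v1"
resp = requests.get(f"{BASE}/catalog/hive/database/default/table/my_table",
                    headers={"Accept": "application/json"})
resp.raise_for_status()
print(sorted(resp.json().keys()))  # top-level fields of the returned table metadata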
Docker Compose Example
Pre-requisite: Docker Compose is installed.
To start a self-contained Metacat environment with some sample catalogs, run the command below. This will start a docker-compose cluster containing a Metacat container, a Hive metastore container, a Cassandra container, and a PostgreSQL container.
./gradlew metacatPorts
- metacatPorts: Prints out which exposed ports are mapped to the internal container ports. Look for the host port (MAPPED_PORT) mapped to container port 8080.
REST API can be accessed @ http://localhost:<MAPPED_PORT>/mds/v1/catalog
Swagger API documentation can be accessed @ http://localhost:<MAPPED_PORT>/swagger-ui/index.html
To stop the docker compose cluster:
./gradlew stopMetacatCluster