Top Related Projects
Mirror of Apache Kafka
Confluent's Apache Kafka Golang client
Apache Pulsar - distributed pub-sub messaging system
Apache Flink
Apache Beam is a unified programming model for Batch and Streaming data processing.
Change data capture for a variety of databases. Please log issues at https://issues.redhat.com/browse/DBZ.
Quick Overview
Databus is an open-source data change capture platform developed by LinkedIn. It provides a unified solution for streaming data changes from various sources, enabling real-time data integration and processing across different systems and applications.
Pros
- Scalable and fault-tolerant architecture suitable for large-scale data pipelines
- Supports multiple data sources and targets, including relational databases and NoSQL systems
- Provides strong consistency guarantees and exactly-once semantics
- Offers flexible data transformation and routing capabilities
Cons
- Steep learning curve due to complex architecture and configuration
- Limited documentation and community support compared to some other data streaming solutions
- Requires significant infrastructure and operational overhead to set up and maintain
- May be overkill for smaller-scale data integration needs
Code Examples
// Creating a Databus event producer
DatabusEventProducer producer = new DatabusEventProducer.Builder()
.setUri("databus://localhost:8080/myDataSource")
.setEventClass(MyEvent.class)
.build();
MyEvent event = new MyEvent("key", "value");
producer.enqueueEvent(event);
// Setting up a Databus consumer
DatabusConsumer consumer = new DatabusConsumer.Builder()
.setUri("databus://localhost:8080/myDataSource")
.setConsumerName("myConsumer")
.build();
consumer.start();
// Implementing a custom event processor
public class MyEventProcessor implements DatabusEventProcessor {
@Override
public void onEvent(DatabusEvent event) {
// Process the event
System.out.println("Received event: " + event.toString());
}
}
Getting Started
To get started with Databus:
-
Clone the repository:
git clone https://github.com/linkedin/databus.git
-
Build the project:
cd databus ./gradlew build
-
Configure your data sources and targets in the
databus-core/src/main/resources/databus.properties
file. -
Start the Databus server:
./gradlew run
-
Implement your custom event producers and consumers using the Databus API.
Competitor Comparisons
Mirror of Apache Kafka
Pros of Kafka
- Higher throughput and scalability, capable of handling millions of messages per second
- More robust ecosystem with extensive tooling and integrations
- Active development and widespread adoption in the industry
Cons of Kafka
- More complex setup and configuration compared to Databus
- Higher resource requirements, especially for larger clusters
- Steeper learning curve for developers new to distributed systems
Code Comparison
Kafka producer example:
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, String> producer = new KafkaProducer<>(props);
Databus producer example:
DbusEventProducer producer = new DbusEventProducer();
producer.start(1000);
DbusEventKey eventKey = new DbusEventKey(1L);
DbusEventInfo eventInfo = new DbusEventInfo(DbusOpcode.UPSERT, 0L, (short)0);
producer.enqueueEvent(eventKey, eventInfo, "payload".getBytes());
Both Kafka and Databus are messaging systems, but Kafka has become more popular due to its scalability and ecosystem. Databus, developed by LinkedIn, is more focused on change data capture and database replication scenarios. Kafka's wider adoption has led to more community support and integrations, making it a more versatile choice for various use cases.
Confluent's Apache Kafka Golang client
Pros of confluent-kafka-go
- Native Go implementation, offering better performance and integration with Go ecosystems
- Actively maintained with regular updates and community support
- Supports both producer and consumer APIs for Apache Kafka
Cons of confluent-kafka-go
- Focused solely on Kafka, lacking broader data streaming capabilities
- Steeper learning curve for developers new to Kafka concepts
- May require additional components for complex data processing pipelines
Code Comparison
Databus (Java):
DbusEventBuffer eventBuffer = new DbusEventBuffer(config);
eventBuffer.start();
DbusEventIterator iter = eventBuffer.acquireIterator("example");
while (iter.hasNext()) {
DbusEvent event = iter.next();
// Process event
}
confluent-kafka-go (Go):
consumer, err := kafka.NewConsumer(&kafka.ConfigMap{"bootstrap.servers": "localhost"})
consumer.SubscribeTopics([]string{"example-topic"}, nil)
for {
msg, err := consumer.ReadMessage(-1)
// Process message
}
Both libraries provide mechanisms for consuming data streams, but confluent-kafka-go is specifically tailored for Kafka, while Databus offers a more generic approach to change data capture and streaming.
Apache Pulsar - distributed pub-sub messaging system
Pros of Pulsar
- Multi-tenancy support with built-in isolation between tenants
- Geo-replication capabilities for global data distribution
- Supports both streaming and queuing messaging models
Cons of Pulsar
- More complex setup and configuration compared to Databus
- Steeper learning curve for developers new to the system
- Higher resource requirements for optimal performance
Code Comparison
Pulsar consumer example:
Consumer<byte[]> consumer = client.newConsumer()
.topic("my-topic")
.subscriptionName("my-subscription")
.subscribe();
while (true) {
Message<byte[]> msg = consumer.receive();
// Process the message
}
Databus consumer example:
DatabusConsumer consumer = new DatabusConsumer();
consumer.registerCallback(new DatabusCallback() {
@Override
public void onDataEvent(DbusEvent event) {
// Process the event
}
});
consumer.start();
Key Differences
- Pulsar offers a more feature-rich and scalable solution for distributed messaging
- Databus is more focused on change data capture and replication scenarios
- Pulsar has broader language support and client libraries
- Databus is tightly integrated with LinkedIn's infrastructure and use cases
Use Cases
- Pulsar: Real-time analytics, event streaming, pub/sub messaging
- Databus: Database replication, change data capture, data synchronization
Apache Flink
Pros of Flink
- More comprehensive data processing framework, supporting batch and stream processing
- Larger and more active community, with frequent updates and contributions
- Extensive ecosystem with various connectors and libraries
Cons of Flink
- Steeper learning curve due to its complexity and wide range of features
- Higher resource requirements for deployment and operation
- May be overkill for simpler data streaming use cases
Code Comparison
Flink (Java):
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> text = env.readTextFile("input.txt");
DataStream<Tuple2<String, Integer>> counts = text
.flatMap(new Tokenizer())
.keyBy(value -> value.f0)
.sum(1);
counts.print();
Databus (Java):
DatabusHttpClientImpl.Config config = new DatabusHttpClientImpl.Config();
config.setUri("http://localhost:8080/databus");
DatabusHttpClientImpl client = new DatabusHttpClientImpl(config);
client.registerDatabusStreamListener(new MyStreamListener());
client.start();
Summary
Flink is a more comprehensive data processing framework with a larger community and ecosystem, while Databus focuses specifically on change data capture and distribution. Flink offers more features but has a steeper learning curve, whereas Databus is simpler but more limited in scope. The choice between the two depends on the specific requirements of the data processing task at hand.
Apache Beam is a unified programming model for Batch and Streaming data processing.
Pros of Beam
- Supports multiple programming languages (Java, Python, Go)
- Offers a unified programming model for batch and streaming data processing
- Has a larger and more active community, with frequent updates and contributions
Cons of Beam
- Steeper learning curve due to its more complex architecture
- May be overkill for simpler data processing tasks
- Requires more setup and configuration compared to Databus
Code Comparison
Databus (Java):
public class SampleConsumer extends AbstractDatabusCombinedConsumer
{
@Override
public ConsumerCallbackResult onDataEvent(DbusEvent event,
DbusEventDecoder eventDecoder)
{
// Process event
}
}
Beam (Java):
public class SampleTransform extends PTransform<PCollection<String>, PCollection<String>>
{
@Override
public PCollection<String> expand(PCollection<String> input) {
return input.apply(ParDo.of(new DoFn<String, String>() {
@ProcessElement
public void processElement(@Element String element, OutputReceiver<String> out) {
// Process element
}
}));
}
}
Both repositories focus on data processing, but Beam offers a more comprehensive solution for various data processing scenarios, while Databus is specifically designed for change data capture and real-time data distribution. Beam's flexibility comes at the cost of increased complexity, while Databus provides a more straightforward approach for its specific use case.
Change data capture for a variety of databases. Please log issues at https://issues.redhat.com/browse/DBZ.
Pros of Debezium
- Wider database support, including MySQL, PostgreSQL, MongoDB, and more
- Active development and community support
- Built-in Kafka Connect integration for easier deployment
Cons of Debezium
- Higher resource consumption due to its Java-based architecture
- Potentially more complex setup for simple use cases
- Less mature than Databus, which has been in production at LinkedIn for longer
Code Comparison
Debezium configuration example:
{
"name": "inventory-connector",
"config": {
"connector.class": "io.debezium.connector.mysql.MySqlConnector",
"database.hostname": "mysql",
"database.port": "3306",
"database.user": "debezium",
"database.password": "dbz",
"database.server.id": "184054",
"database.server.name": "dbserver1",
"database.whitelist": "inventory",
"database.history.kafka.bootstrap.servers": "kafka:9092",
"database.history.kafka.topic": "schema-changes.inventory"
}
}
Databus configuration example:
<databus-client name="example-client">
<sources>
<source name="com.linkedin.events.example.Person">
<subscriptions>
<subscription>example.Person</subscription>
</subscriptions>
</source>
</sources>
<checkpoint-persistence>
<type>file</type>
<file-system>
<root-directory>/tmp/databus-checkpoints</root-directory>
</file-system>
</checkpoint-persistence>
</databus-client>
Convert designs to code with AI
Introducing Visual Copilot: A new AI model to turn Figma designs to high quality code using your components.
Try Visual CopilotREADME
Introduction
==============
In Internet architectures, data systems are typically categorized into source-of-truth systems that serve as primary stores for the user-generated writes, and derived data stores or indexes which serve reads and other complex queries. The data in these secondary stores is often derived from the primary data through custom transformations, sometimes involving complex processing driven by business logic. Similarly, data in caching tiers is derived from reads against the primary data store, but needs to get invalidated or refreshed when the primary data gets mutated. A fundamental requirement emerging from these kinds of data architectures is the need to reliably capture, flow and process primary data changes.
We have built Databus, a source-agnostic distributed change data capture system, which is an integral part of LinkedIn's data processing pipeline. The Databus transport layer provides latencies in the low milliseconds and handles throughput of thousands of events per second per server while supporting infinite look back capabilities and rich subscription functionality.
Use-cases
Typically, Primary OLTP data-stores take user facing writes and some reads, while other specialized systems serve complex queries or accelerate query results through caching. The most common data systems found in these architectures include relational databases, NoSQL data stores, caching engines, search indexes and graph query engines. This specialization has in turn made it critical to have a reliable and scalable data pipeline that can capture these changes happening for primary source-of-truth systems and route them through the rest of the complex data eco-system. There are two families of solutions that are typically used for building such a pipeline.
Application-driven Dual Writes:
In this model, the application layer writes to the database and in parallel, writes to another messaging system. This looks simple to implement since the application code writing to the database is under our control. However it introduces a consistency problem because without a complex co-ordination protocol (e.g. Paxos or 2-Phase Commit ) it is hard to ensure that both the database and the messaging system are in complete lock step with each other in the face of failures. Both systems need to process exactly the same writes and need to serialize them in exactly the same order. Things get even more complex if the writes are conditional or have partial update semantics.
Database Log Mining:
In this model, we make the database the single source-of-truth and extract changes from its transaction or commit log. This solves our consistency issue, but is practically hard to implement because databases like Oracle and MySQL (the primary data stores in use at LinkedIn) have transaction log formats and replication solutions that are proprietary and not guaranteed to have stable on-disk or on-the-wire representations across version upgrades. Since we want to process the data changes with application code and then write to secondary data stores, we need the replication system to be user-space and source-agnostic. This independence from the data source is especially important in fast-moving technology companies, because it avoids technology lock-in and tie-in to binary formats throughout the application stack.
After evaluating the pros and cons of the two approaches, we decided to pursue the log mining option, prioritizing consistency and "single source of truth" over ease of implementation. In this paper, we introduce Databus, Change Data Capture pipeline at LinkedIn, which supports Oracle sources and a wide range of downstream applications. The Social Graph Index which serves all graph queries at LinkedIn, the People Search Index that powers all searches for members at LinkedIn and the various read replicas for our Member Profile data are all fed and kept consistent via Databus.
More details about the architecture, usecases and performance evaluation can be obtained from a paper that got accepted for publication at the ACM Symposium on Cloud Computing - 2012. The slides for the presentation are available here
How to build ?
Databus requires a library distributed by Oracle Inc under Oracle Technology Network License. Please accept that license here, and download ojdbc6.jar with version at 11.2.0.2.0 here. Once you download the driver jar, please copy it under sandbox-repo/com/oracle/ojdbc6/11.2.0.2.0/ and name it ojdbc6-11.2.0.2.0.jar as shown below. We have provided a sample .ivy file to facilitate the build.
Databus will NOT build without this step. After downloading the jars, they may be copied under the directory sandbox-repo as :
- sandbox-repo/com/oracle/ojdbc6/11.2.0.2.0/ojdbc6-11.2.0.2.0.jar
- sandbox-repo/com/oracle/ojdbc6/11.2.0.2.0/ojdbc6-11.2.0.2.0.ivy
Build System
Databus currently needs gradle version 1.0 or above to build. The commands to build are :
gradle -Dopen_source=true assemble
-- builds the jars and command line packagegradle -Dopen_source=true clean
-- cleans the build directorygradle -Dopen_source=true test
-- runs all the unit-tests that come packaged with the source
Licensing
Databus will be licensed under Apache 2.0 license.
Full Documentation
See our wiki for full documentation and examples.
Example Relay
An example of writing a DatabusRelay is available at PersonRelayServer.java. To be able to start a relay process, the code is packaged into a startable command-line package. The tarball may be obtained from build/databus2-example-relay-pkg/distributions/databus2-example-relay-pkg.tgz. This relay is configured to get changestreams for a view "Person".
After extracting to a directory, please cd to that directory and start the relay using the following command :
./bin/start-example-relay.sh person
If the relay is started successfully, the output of the following curl command would look like :
- $
curl http://localhost:11115/sources
[{ânameâ:âcom.linkedin.events.example.person.Personâ,âidâ:40}]
Example Client
An example of writing a DatabusClient is available at PersonClientMain.java. To easily be able to start the client process, the code is packaged into a startable command-line package. The tarball may be obtained from build/databus2-example-client-pkg/distributions/databus2-example-client-pkg.tgz. This client is configured to get data from the relay started previously, and configured to susbscribe for table Person.
After extracting to a directory, please cd to that directory and start the client using the following command :
./bin/start-example-client.sh person
If the client successfully connects to the relay we created earlier, the output of the following curl command would look like below ( indicating a client from localhost has connected to the relay ):
- $
curl http://localhost:11115/relayStats/outbound/http/clients
["localhost"]
Top Related Projects
Mirror of Apache Kafka
Confluent's Apache Kafka Golang client
Apache Pulsar - distributed pub-sub messaging system
Apache Flink
Apache Beam is a unified programming model for Batch and Streaming data processing.
Change data capture for a variety of databases. Please log issues at https://issues.redhat.com/browse/DBZ.
Convert designs to code with AI
Introducing Visual Copilot: A new AI model to turn Figma designs to high quality code using your components.
Try Visual Copilot