databus

Source-agnostic distributed change data capture system

3,663

737

3,663

View on GitHub View on NPM

Top Related Projects

confluent-kafka-go

4,914

Confluent's Apache Kafka Golang client

pulsar

14,586

Apache Pulsar - distributed pub-sub messaging system

beam

8,082

Apache Beam is a unified programming model for Batch and Streaming data processing.

debezium

11,333

Change data capture for a variety of databases. Please log issues at https://issues.redhat.com/browse/DBZ.

Quick Overview

Databus is an open-source data change capture platform developed by LinkedIn. It provides a unified solution for streaming data changes from various sources, enabling real-time data integration and processing across different systems and applications.

Pros

Scalable and fault-tolerant architecture suitable for large-scale data pipelines
Supports multiple data sources and targets, including relational databases and NoSQL systems
Provides strong consistency guarantees and exactly-once semantics
Offers flexible data transformation and routing capabilities

Cons

Steep learning curve due to complex architecture and configuration
Limited documentation and community support compared to some other data streaming solutions
Requires significant infrastructure and operational overhead to set up and maintain
May be overkill for smaller-scale data integration needs

Code Examples

// Creating a Databus event producer
DatabusEventProducer producer = new DatabusEventProducer.Builder()
    .setUri("databus://localhost:8080/myDataSource")
    .setEventClass(MyEvent.class)
    .build();

MyEvent event = new MyEvent("key", "value");
producer.enqueueEvent(event);

// Setting up a Databus consumer
DatabusConsumer consumer = new DatabusConsumer.Builder()
    .setUri("databus://localhost:8080/myDataSource")
    .setConsumerName("myConsumer")
    .build();

consumer.start();

// Implementing a custom event processor
public class MyEventProcessor implements DatabusEventProcessor {
    @Override
    public void onEvent(DatabusEvent event) {
        // Process the event
        System.out.println("Received event: " + event.toString());
    }
}

Getting Started

To get started with Databus:

Clone the repository:

git clone https://github.com/linkedin/databus.git

Build the project:
```
cd databus
./gradlew build
```
Configure your data sources and targets in the databus-core/src/main/resources/databus.properties file.
Start the Databus server:
```
./gradlew run
```
Implement your custom event producers and consumers using the Databus API.

Competitor Comparisons

kafka

29,982

Mirror of Apache Kafka

Pros of Kafka

Higher throughput and scalability, capable of handling millions of messages per second
More robust ecosystem with extensive tooling and integrations
Active development and widespread adoption in the industry

Cons of Kafka

More complex setup and configuration compared to Databus
Higher resource requirements, especially for larger clusters
Steeper learning curve for developers new to distributed systems

Code Comparison

Kafka producer example:

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, String> producer = new KafkaProducer<>(props);

Databus producer example:

DbusEventProducer producer = new DbusEventProducer();
producer.start(1000);
DbusEventKey eventKey = new DbusEventKey(1L);
DbusEventInfo eventInfo = new DbusEventInfo(DbusOpcode.UPSERT, 0L, (short)0);
producer.enqueueEvent(eventKey, eventInfo, "payload".getBytes());

Both Kafka and Databus are messaging systems, but Kafka has become more popular due to its scalability and ecosystem. Databus, developed by LinkedIn, is more focused on change data capture and database replication scenarios. Kafka's wider adoption has led to more community support and integrations, making it a more versatile choice for various use cases.

confluent-kafka-go

4,914

Confluent's Apache Kafka Golang client

Pros of confluent-kafka-go

Native Go implementation, offering better performance and integration with Go ecosystems
Actively maintained with regular updates and community support
Supports both producer and consumer APIs for Apache Kafka

Cons of confluent-kafka-go

Focused solely on Kafka, lacking broader data streaming capabilities
Steeper learning curve for developers new to Kafka concepts
May require additional components for complex data processing pipelines

Code Comparison

Databus (Java):

DbusEventBuffer eventBuffer = new DbusEventBuffer(config);
eventBuffer.start();
DbusEventIterator iter = eventBuffer.acquireIterator("example");
while (iter.hasNext()) {
    DbusEvent event = iter.next();
    // Process event
}

confluent-kafka-go (Go):

consumer, err := kafka.NewConsumer(&kafka.ConfigMap{"bootstrap.servers": "localhost"})
consumer.SubscribeTopics([]string{"example-topic"}, nil)
for {
    msg, err := consumer.ReadMessage(-1)
    // Process message
}

Both libraries provide mechanisms for consuming data streams, but confluent-kafka-go is specifically tailored for Kafka, while Databus offers a more generic approach to change data capture and streaming.

pulsar

14,586

Apache Pulsar - distributed pub-sub messaging system

Pros of Pulsar

Multi-tenancy support with built-in isolation between tenants
Geo-replication capabilities for global data distribution
Supports both streaming and queuing messaging models

Cons of Pulsar

More complex setup and configuration compared to Databus
Steeper learning curve for developers new to the system
Higher resource requirements for optimal performance

Code Comparison

Pulsar consumer example:

Consumer<byte[]> consumer = client.newConsumer()
    .topic("my-topic")
    .subscriptionName("my-subscription")
    .subscribe();

while (true) {
    Message<byte[]> msg = consumer.receive();
    // Process the message
}

Databus consumer example:

DatabusConsumer consumer = new DatabusConsumer();
consumer.registerCallback(new DatabusCallback() {
    @Override
    public void onDataEvent(DbusEvent event) {
        // Process the event
    }
});
consumer.start();

Key Differences

Pulsar offers a more feature-rich and scalable solution for distributed messaging
Databus is more focused on change data capture and replication scenarios
Pulsar has broader language support and client libraries
Databus is tightly integrated with LinkedIn's infrastructure and use cases

Use Cases

Pulsar: Real-time analytics, event streaming, pub/sub messaging
Databus: Database replication, change data capture, data synchronization

flink

24,808

Apache Flink

Pros of Flink

More comprehensive data processing framework, supporting batch and stream processing
Larger and more active community, with frequent updates and contributions
Extensive ecosystem with various connectors and libraries

Cons of Flink

Steeper learning curve due to its complexity and wide range of features
Higher resource requirements for deployment and operation
May be overkill for simpler data streaming use cases

Code Comparison

Flink (Java):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> text = env.readTextFile("input.txt");
DataStream<Tuple2<String, Integer>> counts = text
    .flatMap(new Tokenizer())
    .keyBy(value -> value.f0)
    .sum(1);
counts.print();

Databus (Java):

DatabusHttpClientImpl.Config config = new DatabusHttpClientImpl.Config();
config.setUri("http://localhost:8080/databus");
DatabusHttpClientImpl client = new DatabusHttpClientImpl(config);
client.registerDatabusStreamListener(new MyStreamListener());
client.start();

Summary

Flink is a more comprehensive data processing framework with a larger community and ecosystem, while Databus focuses specifically on change data capture and distribution. Flink offers more features but has a steeper learning curve, whereas Databus is simpler but more limited in scope. The choice between the two depends on the specific requirements of the data processing task at hand.

beam

8,082

Apache Beam is a unified programming model for Batch and Streaming data processing.

Pros of Beam

Supports multiple programming languages (Java, Python, Go)
Offers a unified programming model for batch and streaming data processing
Has a larger and more active community, with frequent updates and contributions

Cons of Beam

Steeper learning curve due to its more complex architecture
May be overkill for simpler data processing tasks
Requires more setup and configuration compared to Databus

Code Comparison

Databus (Java):

public class SampleConsumer extends AbstractDatabusCombinedConsumer
{
  @Override
  public ConsumerCallbackResult onDataEvent(DbusEvent event,
                                            DbusEventDecoder eventDecoder)
  {
    // Process event
  }
}

Beam (Java):

public class SampleTransform extends PTransform<PCollection<String>, PCollection<String>>
{
  @Override
  public PCollection<String> expand(PCollection<String> input) {
    return input.apply(ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(@Element String element, OutputReceiver<String> out) {
        // Process element
      }
    }));
  }
}

Both repositories focus on data processing, but Beam offers a more comprehensive solution for various data processing scenarios, while Databus is specifically designed for change data capture and real-time data distribution. Beam's flexibility comes at the cost of increased complexity, while Databus provides a more straightforward approach for its specific use case.

debezium

11,333

Change data capture for a variety of databases. Please log issues at https://issues.redhat.com/browse/DBZ.

Pros of Debezium

Wider database support, including MySQL, PostgreSQL, MongoDB, and more
Active development and community support
Built-in Kafka Connect integration for easier deployment

Cons of Debezium

Higher resource consumption due to its Java-based architecture
Potentially more complex setup for simple use cases
Less mature than Databus, which has been in production at LinkedIn for longer

Code Comparison

Debezium configuration example:

{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "database.server.name": "dbserver1",
    "database.whitelist": "inventory",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.inventory"
  }
}

Databus configuration example:

<databus-client name="example-client">
  <sources>
    <source name="com.linkedin.events.example.Person">
      <subscriptions>
        <subscription>example.Person</subscription>
      </subscriptions>
    </source>
  </sources>
  <checkpoint-persistence>
    <type>file</type>
    <file-system>
      <root-directory>/tmp/databus-checkpoints</root-directory>
    </file-system>
  </checkpoint-persistence>
</databus-client>

Convert designs to code with AI

Introducing Visual Copilot: A new AI model to turn Figma designs to high quality code using your components.

Try Visual Copilot

README

Introduction

==============

In Internet architectures, data systems are typically categorized into source-of-truth systems that serve as primary stores for the user-generated writes, and derived data stores or indexes which serve reads and other complex queries. The data in these secondary stores is often derived from the primary data through custom transformations, sometimes involving complex processing driven by business logic. Similarly, data in caching tiers is derived from reads against the primary data store, but needs to get invalidated or refreshed when the primary data gets mutated. A fundamental requirement emerging from these kinds of data architectures is the need to reliably capture, flow and process primary data changes.

We have built Databus, a source-agnostic distributed change data capture system, which is an integral part of LinkedIn's data processing pipeline. The Databus transport layer provides latencies in the low milliseconds and handles throughput of thousands of events per second per server while supporting infinite look back capabilities and rich subscription functionality.

Use-cases

Typically, Primary OLTP data-stores take user facing writes and some reads, while other specialized systems serve complex queries or accelerate query results through caching. The most common data systems found in these architectures include relational databases, NoSQL data stores, caching engines, search indexes and graph query engines. This specialization has in turn made it critical to have a reliable and scalable data pipeline that can capture these changes happening for primary source-of-truth systems and route them through the rest of the complex data eco-system. There are two families of solutions that are typically used for building such a pipeline.

Application-driven Dual Writes:

In this model, the application layer writes to the database and in parallel, writes to another messaging system. This looks simple to implement since the application code writing to the database is under our control. However it introduces a consistency problem because without a complex co-ordination protocol (e.g. Paxos or 2-Phase Commit ) it is hard to ensure that both the database and the messaging system are in complete lock step with each other in the face of failures. Both systems need to process exactly the same writes and need to serialize them in exactly the same order. Things get even more complex if the writes are conditional or have partial update semantics.

Database Log Mining:

In this model, we make the database the single source-of-truth and extract changes from its transaction or commit log. This solves our consistency issue, but is practically hard to implement because databases like Oracle and MySQL (the primary data stores in use at LinkedIn) have transaction log formats and replication solutions that are proprietary and not guaranteed to have stable on-disk or on-the-wire representations across version upgrades. Since we want to process the data changes with application code and then write to secondary data stores, we need the replication system to be user-space and source-agnostic. This independence from the data source is especially important in fast-moving technology companies, because it avoids technology lock-in and tie-in to binary formats throughout the application stack.

After evaluating the pros and cons of the two approaches, we decided to pursue the log mining option, prioritizing consistency and "single source of truth" over ease of implementation. In this paper, we introduce Databus, Change Data Capture pipeline at LinkedIn, which supports Oracle sources and a wide range of downstream applications. The Social Graph Index which serves all graph queries at LinkedIn, the People Search Index that powers all searches for members at LinkedIn and the various read replicas for our Member Profile data are all fed and kept consistent via Databus.

More details about the architecture, usecases and performance evaluation can be obtained from a paper that got accepted for publication at the ACM Symposium on Cloud Computing - 2012. The slides for the presentation are available here

How to build ?

Databus requires a library distributed by Oracle Inc under Oracle Technology Network License. Please accept that license here, and download ojdbc6.jar with version at 11.2.0.2.0 here. Once you download the driver jar, please copy it under sandbox-repo/com/oracle/ojdbc6/11.2.0.2.0/ and name it ojdbc6-11.2.0.2.0.jar as shown below. We have provided a sample .ivy file to facilitate the build.

Databus will NOT build without this step. After downloading the jars, they may be copied under the directory sandbox-repo as :

sandbox-repo/com/oracle/ojdbc6/11.2.0.2.0/ojdbc6-11.2.0.2.0.jar
sandbox-repo/com/oracle/ojdbc6/11.2.0.2.0/ojdbc6-11.2.0.2.0.ivy

Build System

Databus currently needs gradle version 1.0 or above to build. The commands to build are :

gradle -Dopen_source=true assemble -- builds the jars and command line package
gradle -Dopen_source=true clean -- cleans the build directory
gradle -Dopen_source=true test -- runs all the unit-tests that come packaged with the source

Licensing

Databus will be licensed under Apache 2.0 license.

Full Documentation

See our wiki for full documentation and examples.

Example Relay

An example of writing a DatabusRelay is available at PersonRelayServer.java. To be able to start a relay process, the code is packaged into a startable command-line package. The tarball may be obtained from build/databus2-example-relay-pkg/distributions/databus2-example-relay-pkg.tgz. This relay is configured to get changestreams for a view "Person".

After extracting to a directory, please cd to that directory and start the relay using the following command :

./bin/start-example-relay.sh person

If the relay is started successfully, the output of the following curl command would look like :

$ curl http://localhost:11115/sources
[{ânameâ:âcom.linkedin.events.example.person.Personâ,âidâ:40}]

Example Client

An example of writing a DatabusClient is available at PersonClientMain.java. To easily be able to start the client process, the code is packaged into a startable command-line package. The tarball may be obtained from build/databus2-example-client-pkg/distributions/databus2-example-client-pkg.tgz. This client is configured to get data from the relay started previously, and configured to susbscribe for table Person.

After extracting to a directory, please cd to that directory and start the client using the following command :

./bin/start-example-client.sh person

If the client successfully connects to the relay we created earlier, the output of the following curl command would look like below ( indicating a client from localhost has connected to the relay ):

$curl http://localhost:11115/relayStats/outbound/http/clients
["localhost"]

Top Related Projects

Convert designs to code with AI

Introducing Visual Copilot: A new AI model to turn Figma designs to high quality code using your components.

Try Visual Copilot

Top Related Projects

Quick Overview

Pros

Cons

Code Examples

Getting Started

Competitor Comparisons

Pros of Kafka

Cons of Kafka

Code Comparison

Pros of confluent-kafka-go

Cons of confluent-kafka-go

Code Comparison

Pros of Pulsar

Cons of Pulsar

Code Comparison

Key Differences

Use Cases

Pros of Flink

Cons of Flink

Code Comparison

Summary

Pros of Beam

Cons of Beam

Code Comparison

Pros of Debezium

Cons of Debezium

Code Comparison

Convert designs to code with AI

README

Introduction

Use-cases

Application-driven Dual Writes:

Database Log Mining:

How to build ?

Build System

Licensing

Full Documentation

Example Relay

Example Client

Top Related Projects

Convert designs to code with AI

NPM DownloadsLast 30 Days