
linkedin/databus

Source-agnostic distributed change data capture system


Top Related Projects

  • Apache Kafka (28,601 stars): Mirror of Apache Kafka
  • confluent-kafka-go: Confluent's Apache Kafka Golang client
  • Apache Pulsar (14,185 stars): distributed pub-sub messaging system
  • Apache Flink (23,929 stars)
  • Apache Beam (7,828 stars): a unified programming model for Batch and Streaming data processing
  • Debezium (10,544 stars): change data capture for a variety of databases (log issues at https://issues.redhat.com/browse/DBZ)

Quick Overview

Databus is an open-source data change capture platform developed by LinkedIn. It provides a unified solution for streaming data changes from various sources, enabling real-time data integration and processing across different systems and applications.

Pros

  • Scalable and fault-tolerant architecture suitable for large-scale data pipelines
  • Supports multiple data sources and targets, including relational databases and NoSQL systems
  • Provides strong consistency guarantees and exactly-once semantics
  • Offers flexible data transformation and routing capabilities

Cons

  • Steep learning curve due to complex architecture and configuration
  • Limited documentation and community support compared to some other data streaming solutions
  • Requires significant infrastructure and operational overhead to set up and maintain
  • May be overkill for smaller-scale data integration needs

Code Examples

// Creating a Databus event producer
DatabusEventProducer producer = new DatabusEventProducer.Builder()
    .setUri("databus://localhost:8080/myDataSource")
    .setEventClass(MyEvent.class)
    .build();

MyEvent event = new MyEvent("key", "value");
producer.enqueueEvent(event);

// Setting up a Databus consumer
DatabusConsumer consumer = new DatabusConsumer.Builder()
    .setUri("databus://localhost:8080/myDataSource")
    .setConsumerName("myConsumer")
    .build();

consumer.start();

// Implementing a custom event processor
public class MyEventProcessor implements DatabusEventProcessor {
    @Override
    public void onEvent(DatabusEvent event) {
        // Process the event
        System.out.println("Received event: " + event.toString());
    }
}

Getting Started

To get started with Databus:

  1. Clone the repository:

    git clone https://github.com/linkedin/databus.git
    
  2. Build the project:

    cd databus
    ./gradlew build
    
  3. Configure your data sources and targets in the databus-core/src/main/resources/databus.properties file.

  4. Start the Databus server:

    ./gradlew run
    
  5. Implement your custom event producers and consumers using the Databus API.

Competitor Comparisons

Mirror of Apache Kafka

Pros of Kafka

  • Higher throughput and scalability, capable of handling millions of messages per second
  • More robust ecosystem with extensive tooling and integrations
  • Active development and widespread adoption in the industry

Cons of Kafka

  • More complex setup and configuration compared to Databus
  • Higher resource requirements, especially for larger clusters
  • Steeper learning curve for developers new to distributed systems

Code Comparison

Kafka producer example:

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, String> producer = new KafkaProducer<>(props);
producer.send(new ProducerRecord<>("example-topic", "key", "value"));
producer.close();

Databus producer example:

DbusEventProducer producer = new DbusEventProducer();
producer.start(1000);
DbusEventKey eventKey = new DbusEventKey(1L);
DbusEventInfo eventInfo = new DbusEventInfo(DbusOpcode.UPSERT, 0L, (short)0);
producer.enqueueEvent(eventKey, eventInfo, "payload".getBytes());

Both Kafka and Databus are messaging systems, but Kafka has become more popular due to its scalability and ecosystem. Databus, developed by LinkedIn, is more focused on change data capture and database replication scenarios. Kafka's wider adoption has led to more community support and integrations, making it a more versatile choice for various use cases.

Confluent's Apache Kafka Golang client

Pros of confluent-kafka-go

  • Native Go implementation, offering better performance and integration with Go ecosystems
  • Actively maintained with regular updates and community support
  • Supports both producer and consumer APIs for Apache Kafka

Cons of confluent-kafka-go

  • Focused solely on Kafka, lacking broader data streaming capabilities
  • Steeper learning curve for developers new to Kafka concepts
  • May require additional components for complex data processing pipelines

Code Comparison

Databus (Java):

DbusEventBuffer eventBuffer = new DbusEventBuffer(config);
eventBuffer.start();
DbusEventIterator iter = eventBuffer.acquireIterator("example");
while (iter.hasNext()) {
    DbusEvent event = iter.next();
    // Process event
}

confluent-kafka-go (Go):

consumer, err := kafka.NewConsumer(&kafka.ConfigMap{"bootstrap.servers": "localhost"})
if err != nil {
    panic(err)
}
consumer.SubscribeTopics([]string{"example-topic"}, nil)
for {
    msg, err := consumer.ReadMessage(-1)
    if err != nil {
        continue
    }
    fmt.Println(string(msg.Value)) // Process message
}

Both libraries provide mechanisms for consuming data streams, but confluent-kafka-go is specifically tailored for Kafka, while Databus offers a more generic approach to change data capture and streaming.

Apache Pulsar - distributed pub-sub messaging system

Pros of Pulsar

  • Multi-tenancy support with built-in isolation between tenants
  • Geo-replication capabilities for global data distribution
  • Supports both streaming and queuing messaging models

Cons of Pulsar

  • More complex setup and configuration compared to Databus
  • Steeper learning curve for developers new to the system
  • Higher resource requirements for optimal performance

Code Comparison

Pulsar consumer example:

Consumer<byte[]> consumer = client.newConsumer()
    .topic("my-topic")
    .subscriptionName("my-subscription")
    .subscribe();

while (true) {
    Message<byte[]> msg = consumer.receive();
    // Process the message
}

Databus consumer example:

DatabusConsumer consumer = new DatabusConsumer();
consumer.registerCallback(new DatabusCallback() {
    @Override
    public void onDataEvent(DbusEvent event) {
        // Process the event
    }
});
consumer.start();

Key Differences

  • Pulsar offers a more feature-rich and scalable solution for distributed messaging
  • Databus is more focused on change data capture and replication scenarios
  • Pulsar has broader language support and client libraries
  • Databus is tightly integrated with LinkedIn's infrastructure and use cases

Use Cases

  • Pulsar: Real-time analytics, event streaming, pub/sub messaging
  • Databus: Database replication, change data capture, data synchronization

Apache Flink

Pros of Flink

  • More comprehensive data processing framework, supporting batch and stream processing
  • Larger and more active community, with frequent updates and contributions
  • Extensive ecosystem with various connectors and libraries

Cons of Flink

  • Steeper learning curve due to its complexity and wide range of features
  • Higher resource requirements for deployment and operation
  • May be overkill for simpler data streaming use cases

Code Comparison

Flink (Java):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> text = env.readTextFile("input.txt");
DataStream<Tuple2<String, Integer>> counts = text
    .flatMap(new Tokenizer())
    .keyBy(value -> value.f0)
    .sum(1);
counts.print();

Databus (Java):

DatabusHttpClientImpl.Config config = new DatabusHttpClientImpl.Config();
config.setUri("http://localhost:8080/databus");
DatabusHttpClientImpl client = new DatabusHttpClientImpl(config);
client.registerDatabusStreamListener(new MyStreamListener());
client.start();

Summary

Flink is a more comprehensive data processing framework with a larger community and ecosystem, while Databus focuses specifically on change data capture and distribution. Flink offers more features but has a steeper learning curve, whereas Databus is simpler but more limited in scope. The choice between the two depends on the specific requirements of the data processing task at hand.

Apache Beam is a unified programming model for Batch and Streaming data processing.

Pros of Beam

  • Supports multiple programming languages (Java, Python, Go)
  • Offers a unified programming model for batch and streaming data processing
  • Has a larger and more active community, with frequent updates and contributions

Cons of Beam

  • Steeper learning curve due to its more complex architecture
  • May be overkill for simpler data processing tasks
  • Requires more setup and configuration compared to Databus

Code Comparison

Databus (Java):

public class SampleConsumer extends AbstractDatabusCombinedConsumer
{
  @Override
  public ConsumerCallbackResult onDataEvent(DbusEvent event,
                                            DbusEventDecoder eventDecoder)
  {
    // Process event
    return ConsumerCallbackResult.SUCCESS;
  }
}

Beam (Java):

public class SampleTransform extends PTransform<PCollection<String>, PCollection<String>>
{
  @Override
  public PCollection<String> expand(PCollection<String> input) {
    return input.apply(ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(@Element String element, OutputReceiver<String> out) {
        // Process element
      }
    }));
  }
}

Both repositories focus on data processing, but Beam offers a more comprehensive solution for various data processing scenarios, while Databus is specifically designed for change data capture and real-time data distribution. Beam's flexibility comes at the cost of increased complexity, while Databus provides a more straightforward approach for its specific use case.

Change data capture for a variety of databases. Please log issues at https://issues.redhat.com/browse/DBZ.

Pros of Debezium

  • Wider database support, including MySQL, PostgreSQL, MongoDB, and more
  • Active development and community support
  • Built-in Kafka Connect integration for easier deployment

Cons of Debezium

  • Higher resource consumption due to its Java-based architecture
  • Potentially more complex setup for simple use cases
  • Less mature than Databus, which has been in production at LinkedIn for longer

Code Comparison

Debezium configuration example:

{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "database.server.name": "dbserver1",
    "database.whitelist": "inventory",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.inventory"
  }
}

Databus configuration example:

<databus-client name="example-client">
  <sources>
    <source name="com.linkedin.events.example.Person">
      <subscriptions>
        <subscription>example.Person</subscription>
      </subscriptions>
    </source>
  </sources>
  <checkpoint-persistence>
    <type>file</type>
    <file-system>
      <root-directory>/tmp/databus-checkpoints</root-directory>
    </file-system>
  </checkpoint-persistence>
</databus-client>


README

Introduction

Join the chat at https://gitter.im/linkedin/databus

In Internet architectures, data systems are typically categorized into source-of-truth systems that serve as primary stores for the user-generated writes, and derived data stores or indexes which serve reads and other complex queries. The data in these secondary stores is often derived from the primary data through custom transformations, sometimes involving complex processing driven by business logic. Similarly, data in caching tiers is derived from reads against the primary data store, but needs to get invalidated or refreshed when the primary data gets mutated. A fundamental requirement emerging from these kinds of data architectures is the need to reliably capture, flow and process primary data changes.
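
To make that last requirement concrete, here is a minimal sketch (purely illustrative, not Databus code) of a derived cache that stays consistent by reacting to change events from the primary store; ChangeEvent and the in-memory cache map are assumptions made for the example.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch, not Databus code: a caching tier stays consistent by invalidating
// entries whenever a change event for the corresponding primary key arrives.
public class CacheInvalidatingListener {
    // Illustrative change event carrying only the primary key of the mutated row.
    public record ChangeEvent(String key) {}

    private final Map<String, Object> cache = new ConcurrentHashMap<>();

    // Called by whatever component delivers the primary store's change stream.
    public void onChange(ChangeEvent event) {
        // Any mutation of the primary row makes the cached copy stale.
        cache.remove(event.key());
    }
}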

We have built Databus, a source-agnostic distributed change data capture system, which is an integral part of LinkedIn's data processing pipeline. The Databus transport layer provides latencies in the low milliseconds and handles throughput of thousands of events per second per server while supporting infinite look back capabilities and rich subscription functionality.

Use-cases


Typically, primary OLTP data stores take user-facing writes and some reads, while other specialized systems serve complex queries or accelerate query results through caching. The most common data systems found in these architectures include relational databases, NoSQL data stores, caching engines, search indexes and graph query engines. This specialization has in turn made it critical to have a reliable and scalable data pipeline that can capture the changes happening in primary source-of-truth systems and route them through the rest of the complex data ecosystem. There are two families of solutions typically used for building such a pipeline.

Application-driven Dual Writes:

In this model, the application layer writes to the database and, in parallel, writes to another messaging system. This looks simple to implement since the application code that writes to the database is under our control. However, it introduces a consistency problem: without a complex coordination protocol (e.g. Paxos or two-phase commit), it is hard to ensure that the database and the messaging system stay in complete lock step with each other in the face of failures. Both systems need to process exactly the same writes and serialize them in exactly the same order. Things get even more complex if the writes are conditional or have partial-update semantics.
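
A minimal sketch of the dual-write pattern just described (illustrative only; Database and MessageBus are hypothetical stand-ins, not Databus classes). The comment marks the failure window that makes this approach hard to get right.

public class DualWriteExample {
    // Hypothetical stand-ins for the two systems being written to in parallel.
    interface Database   { void upsert(String key, String value); }
    interface MessageBus { void publish(String topic, String payload); }

    private final Database db;
    private final MessageBus bus;

    public DualWriteExample(Database db, MessageBus bus) {
        this.db = db;
        this.bus = bus;
    }

    public void updateProfile(String memberId, String newValue) {
        db.upsert(memberId, newValue);             // write #1: source-of-truth database
        // If the process crashes here, the database and the messaging system diverge;
        // closing this window requires a coordination protocol such as 2PC or Paxos.
        bus.publish("profile-updates", newValue);  // write #2: messaging system
    }
}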

Database Log Mining:

In this model, we make the database the single source-of-truth and extract changes from its transaction or commit log. This solves our consistency issue, but is practically hard to implement because databases like Oracle and MySQL (the primary data stores in use at LinkedIn) have transaction log formats and replication solutions that are proprietary and not guaranteed to have stable on-disk or on-the-wire representations across version upgrades. Since we want to process the data changes with application code and then write to secondary data stores, we need the replication system to be user-space and source-agnostic. This independence from the data source is especially important in fast-moving technology companies, because it avoids technology lock-in and tie-in to binary formats throughout the application stack.
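
For contrast, a rough sketch of the log-mining alternative (again illustrative, not the Databus implementation; CommitLogReader and LogRecord stand in for a source-specific fetcher). Because the pipeline replays the commit log, consumers observe a single serialized order of already-committed changes.

import java.util.List;

// Illustrative sketch of database log mining, not the Databus implementation.
public class LogMiningLoop {
    // Hypothetical record of one committed change, keyed by a monotonic sequence number.
    public record LogRecord(long scn, String table, String payload) {}

    // Hypothetical source-specific reader over the database's transaction/commit log.
    public interface CommitLogReader {
        List<LogRecord> readSince(long sinceScn);
    }

    public static void run(CommitLogReader reader, long checkpointScn) throws InterruptedException {
        long scn = checkpointScn;
        while (true) {
            for (LogRecord rec : reader.readSince(scn)) {
                // Changes are seen in commit order, exactly as the database serialized them.
                System.out.printf("table=%s scn=%d payload=%s%n", rec.table(), rec.scn(), rec.payload());
                scn = rec.scn();
            }
            Thread.sleep(100); // poll; a real pipeline would also persist the checkpoint scn
        }
    }
}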

After evaluating the pros and cons of the two approaches, we decided to pursue the log-mining option, prioritizing consistency and a "single source of truth" over ease of implementation. Databus, the Change Data Capture pipeline at LinkedIn, supports Oracle sources and a wide range of downstream applications. The Social Graph Index which serves all graph queries at LinkedIn, the People Search Index that powers all member searches, and the various read replicas for our Member Profile data are all fed and kept consistent via Databus.

More details about the architecture, use cases and performance evaluation can be found in our paper published at the ACM Symposium on Cloud Computing (SoCC) 2012. The slides for the presentation are available here.

How to build?


Databus requires a library distributed by Oracle Inc. under the Oracle Technology Network License. Please accept that license here, and download ojdbc6.jar version 11.2.0.2.0 here. Once you have downloaded the driver jar, copy it under sandbox-repo/com/oracle/ojdbc6/11.2.0.2.0/ and name it ojdbc6-11.2.0.2.0.jar as shown below. We have provided a sample .ivy file to facilitate the build.

Databus will NOT build without this step. After downloading, place the files under the sandbox-repo directory as:

  • sandbox-repo/com/oracle/ojdbc6/11.2.0.2.0/ojdbc6-11.2.0.2.0.jar
  • sandbox-repo/com/oracle/ojdbc6/11.2.0.2.0/ojdbc6-11.2.0.2.0.ivy

Build System


Databus currently needs Gradle version 1.0 or above to build. The commands to build are:

  • gradle -Dopen_source=true assemble -- builds the jars and command line package
  • gradle -Dopen_source=true clean -- cleans the build directory
  • gradle -Dopen_source=true test -- runs all the unit-tests that come packaged with the source

Licensing


Databus will be licensed under the Apache 2.0 license.

Full Documentation


See our wiki for full documentation and examples.

Example Relay


An example of writing a DatabusRelay is available in PersonRelayServer.java. To make it easy to start a relay process, the code is packaged into a startable command-line package; the tarball may be obtained from build/databus2-example-relay-pkg/distributions/databus2-example-relay-pkg.tgz. This relay is configured to serve the change stream for a view named "Person".

After extracting the tarball to a directory, cd to that directory and start the relay with the following command:

  • ./bin/start-example-relay.sh person

If the relay started successfully, the output of the following curl command should look like this:

  • $ curl http://localhost:11115/sources
  • [{"name":"com.linkedin.events.example.person.Person","id":40}]

Example Client


An example of writing a DatabusClient is available in PersonClientMain.java. To make it easy to start the client process, the code is packaged into a startable command-line package; the tarball may be obtained from build/databus2-example-client-pkg/distributions/databus2-example-client-pkg.tgz. This client is configured to get data from the relay started previously and to subscribe to the Person table.
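
For orientation, a consumer callback for this example might look roughly like the sketch below, modeled on the AbstractDatabusCombinedConsumer and DatabusHttpClientImpl snippets shown earlier on this page; class and method names are indicative, and PersonClientMain.java in the repository is the authoritative version.

// Rough sketch modeled on snippets earlier on this page; see PersonClientMain.java
// for the authoritative example shipped with the repository.
public class PersonConsumer extends AbstractDatabusCombinedConsumer {
    @Override
    public ConsumerCallbackResult onDataEvent(DbusEvent event, DbusEventDecoder eventDecoder) {
        // Handle one change event for the Person view.
        System.out.println("Received Person change: " + event);
        return ConsumerCallbackResult.SUCCESS;
    }
}
// The callback would be registered with the client, e.g. via
// DatabusHttpClientImpl.registerDatabusStreamListener(...), as in the earlier snippet.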

After extracting the tarball to a directory, cd to that directory and start the client with the following command:

  • ./bin/start-example-client.sh person

If the client successfully connects to the relay created earlier, the output of the following curl command should look like the following (indicating that a client from localhost has connected to the relay):

  • $ curl http://localhost:11115/relayStats/outbound/http/clients
  • ["localhost"]
