spring-cloud-dataflow

A microservices-based toolkit for streaming and batch data processing in Cloud Foundry and Kubernetes

1,118
584
1,118
305

Top Related Projects

4,830

Apache NiFi

36,684

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

12,810

Conductor is a microservices orchestration engine.

23,929

Apache Flink

17,783

Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.

Quick Overview

Spring Cloud Data Flow is an open-source toolkit for building data integration and real-time data processing pipelines. It provides a unified programming model for creating, deploying, and managing data pipelines on modern platforms like Cloud Foundry and Kubernetes. Spring Cloud Data Flow simplifies the development of microservices-based streaming and batch data processing applications.

Pros

  • Seamless integration with the Spring ecosystem and other popular data processing frameworks
  • Provides a visual UI for designing and managing data pipelines
  • Supports both stream and batch processing use cases
  • Offers built-in monitoring and management capabilities

Cons

  • Steep learning curve for developers new to the Spring ecosystem
  • Limited support for non-JVM languages and frameworks
  • Can be resource-intensive for small-scale applications
  • Requires additional setup and configuration for production deployments

Code Examples

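  1. Defining a simple stream:
// Annotation-based Spring Cloud Stream model (deprecated since Spring Cloud Stream 3.1).
import java.text.SimpleDateFormat;
import java.util.Date;
import org.springframework.cloud.stream.annotation.EnableBinding;
import org.springframework.cloud.stream.messaging.Source;
import org.springframework.context.annotation.Bean;
import org.springframework.integration.annotation.InboundChannelAdapter;
import org.springframework.integration.annotation.Poller;
import org.springframework.integration.core.MessageSource;
import org.springframework.messaging.support.GenericMessage;
@EnableBinding(Source.class)
public class TimeSource {
    @Bean
    @InboundChannelAdapter(value = Source.OUTPUT, poller = @Poller(fixedDelay = "10000", maxMessagesPerPoll = "1"))
    public MessageSource<String> timerMessageSource() {
        return () -> new GenericMessage<>(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date()));
    }
}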

This code defines a source that emits the current timestamp every 10 seconds.

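  2. Creating a processor:
import org.springframework.cloud.stream.annotation.EnableBinding;
import org.springframework.cloud.stream.messaging.Processor;
import org.springframework.integration.annotation.Transformer;

@EnableBinding(Processor.class)
public class UppercaseProcessor {

    // Reads from the input channel and writes the transformed payload to the output channel.
    @Transformer(inputChannel = Processor.INPUT, outputChannel = Processor.OUTPUT)
    public String transform(String message) {
        return message.toUpperCase();
    }
}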

This processor converts incoming messages to uppercase.

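  3. Defining a sink:
import org.springframework.cloud.stream.annotation.EnableBinding;
import org.springframework.cloud.stream.annotation.StreamListener;
import org.springframework.cloud.stream.messaging.Sink;

@EnableBinding(Sink.class)
public class LogSink {

    // Invoked once for every message that arrives on the input channel.
    @StreamListener(Sink.INPUT)
    public void log(String message) {
        System.out.println("Received: " + message);
    }
}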

This sink logs received messages to the console.
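
The three examples above use the annotation-based model (@EnableBinding, @StreamListener), which Spring Cloud Stream deprecated in 3.1 in favor of plain java.util.function types. The following is a minimal sketch of the same source, processor, and sink roles using only JDK types. The class and method names are illustrative, and the hand-written wiring in main stands in for what a binder would do in a real application.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

// Plain-JDK sketch of the source -> processor -> sink roles.
// In a Spring Cloud Stream application each of these would typically be a @Bean;
// here they are wired by hand so the example runs without any framework.
class FunctionalPipelineSketch {

    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Source: produces the current timestamp as text.
    static Supplier<String> timeSource() {
        return () -> LocalDateTime.now().format(FORMAT);
    }

    // Processor: upper-cases whatever it receives.
    static Function<String, String> uppercase() {
        return String::toUpperCase;
    }

    // Sink: prints the message to the console.
    static Consumer<String> logSink() {
        return message -> System.out.println("Received: " + message);
    }

    public static void main(String[] args) {
        // Pull one value from the source, transform it, and hand it to the sink.
        String payload = uppercase().apply(timeSource().get());
        logSink().accept(payload);
    }
}
```

Each role maps to one functional interface: Supplier for a source, Function for a processor, and Consumer for a sink.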

Getting Started

  1. Install Spring Cloud Data Flow Server:

    wget https://repo.spring.io/release/org/springframework/cloud/spring-cloud-dataflow-server/2.9.2/spring-cloud-dataflow-server-2.9.2.jar
    java -jar spring-cloud-dataflow-server-2.9.2.jar
    
  2. Install Spring Cloud Data Flow Shell:

    wget https://repo.spring.io/release/org/springframework/cloud/spring-cloud-dataflow-shell/2.9.2/spring-cloud-dataflow-shell-2.9.2.jar
    java -jar spring-cloud-dataflow-shell-2.9.2.jar
    
  3. Create and deploy a simple stream (in 2.x, deploying streams also requires a running Spring Cloud Skipper server):

    dataflow:> app import --uri https://dataflow.spring.io/kafka-maven-latest
    dataflow:> stream create --name logtime --definition "time | log"
    dataflow:> stream deploy logtime
    

This creates and deploys a simple stream that logs the current time periodically.

Competitor Comparisons

4,830

Apache NiFi

Pros of NiFi

  • More mature and feature-rich, with a longer development history
  • Supports a wider range of data sources and protocols out-of-the-box
  • Provides a user-friendly drag-and-drop interface for designing data flows

Cons of NiFi

  • Steeper learning curve due to its extensive feature set
  • Requires more system resources, especially for large-scale deployments
  • Less integrated with the Spring ecosystem

Code Comparison

NiFi (Java):

public class MyProcessor extends AbstractProcessor {
    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }
        // Process the FlowFile
    }
}

Spring Cloud Data Flow (Java):

@EnableBinding(Processor.class)
public class MyProcessor {
    @StreamListener(Processor.INPUT)
    @SendTo(Processor.OUTPUT)
    public String process(String input) {
        // Process the input (placeholder: pass it through unchanged)
        String processedOutput = input;
        return processedOutput;
    }
}

Both projects aim to simplify data integration and processing, but they have different approaches. NiFi offers a more comprehensive solution with a visual interface, while Spring Cloud Data Flow integrates seamlessly with the Spring ecosystem and focuses on microservices-based stream processing.

36,684

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

Pros of Airflow

  • More extensive ecosystem with a wide range of plugins and integrations
  • Better suited for complex, long-running workflows and ETL processes
  • Stronger community support and more frequent updates

Cons of Airflow

  • Steeper learning curve, especially for users new to Python
  • Can be resource-intensive for smaller-scale deployments
  • Less integrated with the Spring ecosystem and Java-based microservices

Code Comparison

Spring Cloud Data Flow:

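@EnableTask
@EnableBatchProcessing
@SpringBootApplication
public class BatchJobTask {
    // Spring Batch 4.x style; the builder factories are deprecated in Spring Batch 5.
    @Bean
    public Job job(JobBuilderFactory jobBuilderFactory, StepBuilderFactory stepBuilderFactory) {
        // Minimal job configuration: a single step whose tasklet finishes immediately.
        Step step = stepBuilderFactory.get("step")
                .tasklet((contribution, chunkContext) -> RepeatStatus.FINISHED)
                .build();
        return jobBuilderFactory.get("job").start(step).build();
    }
}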

Airflow:

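from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path

def my_task():
    pass  # Task logic

# Placeholder defaults; the start date is an arbitrary example value.
default_args = {'start_date': datetime(2024, 1, 1)}

dag = DAG('my_dag', default_args=default_args, schedule_interval=timedelta(days=1))
task = PythonOperator(task_id='my_task', python_callable=my_task, dag=dag)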

Spring Cloud Data Flow is more tightly integrated with Spring Boot and Java, making it easier for Spring developers to create and manage data pipelines. Airflow, on the other hand, offers more flexibility and scalability for complex workflows, especially in Python-centric environments.

12,810

Conductor is a microservices orchestration engine.

Pros of Conductor

  • More flexible workflow definition with JSON-based DSL
  • Better support for long-running, complex workflows
  • Built-in support for task queuing and prioritization

Cons of Conductor

  • Steeper learning curve due to more complex architecture
  • Less integration with the Spring ecosystem
  • Requires additional setup for external dependencies (e.g., Elasticsearch)

Code Comparison

Spring Cloud Data Flow task definition:

@EnableTask
@SpringBootApplication
public class MyTask {
    public static void main(String[] args) {
        SpringApplication.run(MyTask.class, args);
    }
}

Conductor workflow definition:

{
  "name": "my_workflow",
  "tasks": [
    {
      "name": "my_task",
      "taskReferenceName": "my_task_ref",
      "type": "SIMPLE"
    }
  ]
}

Spring Cloud Data Flow focuses on simple, Spring-based task definitions, while Conductor uses a more detailed JSON structure for workflow definitions. This reflects Conductor's emphasis on complex workflow orchestration compared to Spring Cloud Data Flow's streamlined approach for simpler data processing pipelines.

23,929

Apache Flink

Pros of Flink

  • More powerful and flexible for complex stream processing scenarios
  • Better performance and scalability for large-scale data processing
  • Supports both batch and stream processing with a unified API

Cons of Flink

  • Steeper learning curve and more complex setup
  • Less integration with the Spring ecosystem
  • Requires more low-level programming and configuration

Code Comparison

Spring Cloud Data Flow:

@EnableBinding(Processor.class)
public class MyProcessor {
    @StreamListener(Processor.INPUT)
    @SendTo(Processor.OUTPUT)
    public String process(String input) {
        return input.toUpperCase();
    }
}

Apache Flink:

DataStream<String> input = env.addSource(new FlinkKafkaConsumer<>(...));
DataStream<String> processed = input.map(String::toUpperCase);
processed.addSink(new FlinkKafkaProducer<>(...));

Spring Cloud Data Flow provides a higher-level abstraction with annotations and Spring integration, making it easier to develop simple data processing pipelines. Flink offers more fine-grained control over data processing and allows for more complex operations, but requires more detailed programming.

Both frameworks have their strengths, with Spring Cloud Data Flow excelling in ease of use and Spring ecosystem integration, while Flink shines in performance, scalability, and advanced stream processing capabilities.

17,783

Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.

Pros of Luigi

  • Flexible task model: tasks are plain Python classes that can wrap jobs written in other languages (for example Hadoop or Spark jobs)
  • Better suited for complex, interdependent data pipelines and ETL workflows
  • Stronger support for distributed computing and parallel task execution

Cons of Luigi

  • Steeper learning curve compared to Spring Cloud Data Flow's visual interface
  • Less integrated with the Spring ecosystem and microservices architecture
  • Requires more manual configuration and code for task definitions

Code Comparison

Luigi task definition:

import luigi

class MyTask(luigi.Task):
    def requires(self):
        # SomeOtherTask is another luigi.Task assumed to be defined elsewhere.
        return SomeOtherTask()

    def run(self):
        pass  # Task logic here

Spring Cloud Data Flow stream definition:

http | filter | log

Key Differences

  1. Luigi focuses on batch processing and complex workflows, while Spring Cloud Data Flow excels in real-time data processing and microservices orchestration.
  2. Luigi provides more granular control over task dependencies and execution, whereas Spring Cloud Data Flow offers a higher-level abstraction for stream processing.
  3. Spring Cloud Data Flow integrates seamlessly with Spring Boot applications, making it a natural choice for Java-based microservices architectures.
  4. Luigi's Python-centric approach makes it more accessible to data scientists and analysts, while Spring Cloud Data Flow caters to enterprise Java developers.

Both tools have their strengths, and the choice between them depends on the specific use case, existing technology stack, and team expertise.

README

Spring Cloud Data Flow is a microservices-based toolkit for building streaming and batch data processing pipelines in Cloud Foundry and Kubernetes.

Data processing pipelines consist of Spring Boot apps, built using the Spring Cloud Stream or Spring Cloud Task microservice frameworks.

This makes Spring Cloud Data Flow ideal for a range of data processing use cases, from import/export to event streaming and predictive analytics.


Components

Architecture: The Spring Cloud Data Flow Server is a Spring Boot application that provides a RESTful API and REST clients (Shell, Dashboard, Java DSL). A single Spring Cloud Data Flow installation can orchestrate the deployment of streams and tasks to Local, Cloud Foundry, and Kubernetes.

Familiarize yourself with the Spring Cloud Data Flow architecture and feature capabilities.

Deployer SPI: A Service Provider Interface (SPI) is defined in the Spring Cloud Deployer project. The Deployer SPI provides an abstraction layer for deploying the apps for a given streaming or batch data pipeline and managing the application lifecycle.
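
To make the idea of a deployment abstraction concrete, here is a minimal sketch of an interface with deploy, undeploy, and status operations, plus an in-memory implementation. All names here (SketchDeployer, InMemoryDeployer, the status strings) are hypothetical and deliberately simplified; this is not the Spring Cloud Deployer API, only an illustration of how a server could manage an application lifecycle without knowing the target platform.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class DeployerSketchDemo {
    public static void main(String[] args) {
        SketchDeployer deployer = new InMemoryDeployer();
        String id = deployer.deploy("log-sink");
        System.out.println(id + " -> " + deployer.status(id));
        deployer.undeploy(id);
        System.out.println(id + " -> " + deployer.status(id));
    }
}

// Hypothetical, simplified stand-in for a deployer abstraction.
// These names are illustrative and are NOT the Spring Cloud Deployer API.
interface SketchDeployer {
    String deploy(String appName);       // returns a deployment id
    void undeploy(String deploymentId);  // removes the deployment if present
    String status(String deploymentId);  // "deployed" or "unknown"
}

// In-memory implementation; a platform-specific one would call the platform instead.
class InMemoryDeployer implements SketchDeployer {
    private final Map<String, String> deployments = new ConcurrentHashMap<>();
    private final AtomicInteger counter = new AtomicInteger();

    @Override
    public String deploy(String appName) {
        String id = appName + "-" + counter.incrementAndGet();
        deployments.put(id, appName);
        return id;
    }

    @Override
    public void undeploy(String deploymentId) {
        deployments.remove(deploymentId);
    }

    @Override
    public String status(String deploymentId) {
        return deployments.containsKey(deploymentId) ? "deployed" : "unknown";
    }
}
```

Because callers depend only on the interface, swapping the in-memory implementation for a platform-specific one would not change the calling code.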

Spring Cloud Deployer Implementations: Local, Cloud Foundry, and Kubernetes.

Domain Model: The Spring Cloud Data Flow domain module includes the concept of a stream that is a composition of Spring Cloud Stream applications in a linear data pipeline from a source to a sink, optionally including processor application(s) in between. The domain also includes the concept of a task, which may be any process that does not run indefinitely, including Spring Batch jobs.
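
The linear source, processor(s), sink shape described here can be sketched as a small value type. The example below is a deliberately naive illustration: it splits a definition such as "http | filter | log" on the pipe character and assumes at least a source and a sink. The class name is hypothetical, and the real DSL supports more than this sketch handles (for example application options), so treat it as a model of the concept rather than of the actual parser.

```java
import java.util.ArrayList;
import java.util.List;

class StreamDefinitionSketch {
    final String source;
    final List<String> processors;
    final String sink;

    StreamDefinitionSketch(String source, List<String> processors, String sink) {
        this.source = source;
        this.processors = processors;
        this.sink = sink;
    }

    // Naive parse: splits on '|' only and ignores options, quoting, and other DSL features.
    static StreamDefinitionSketch parse(String definition) {
        List<String> apps = new ArrayList<>();
        for (String part : definition.split("\\|")) {
            String name = part.trim();
            if (name.isEmpty()) {
                throw new IllegalArgumentException("Empty app name in: " + definition);
            }
            apps.add(name);
        }
        if (apps.size() < 2) {
            throw new IllegalArgumentException("Expected at least a source and a sink: " + definition);
        }
        // First app is the source, last is the sink, everything in between is a processor.
        List<String> middle = List.copyOf(apps.subList(1, apps.size() - 1));
        return new StreamDefinitionSketch(apps.get(0), middle, apps.get(apps.size() - 1));
    }

    public static void main(String[] args) {
        StreamDefinitionSketch stream = parse("http | filter | log");
        System.out.println("source=" + stream.source
                + " processors=" + stream.processors
                + " sink=" + stream.sink);
    }
}
```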

Application Registry: The App Registry maintains the metadata of the catalog of reusable applications. For example, if relying on Maven coordinates, an application URI would be of the format: maven://<groupId>:<artifactId>:<version>.
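
As an illustration of the URI format above, the following sketch splits a maven:// URI into its three coordinates. It handles only the three-part form shown in the text; the class name and the example coordinates (org.example:my-source:1.0.0) are made up for the demonstration.

```java
class MavenUriSketch {
    static final String PREFIX = "maven://";

    final String groupId;
    final String artifactId;
    final String version;

    MavenUriSketch(String groupId, String artifactId, String version) {
        this.groupId = groupId;
        this.artifactId = artifactId;
        this.version = version;
    }

    // Parses maven://<groupId>:<artifactId>:<version>; any other shape is rejected.
    static MavenUriSketch parse(String uri) {
        if (!uri.startsWith(PREFIX)) {
            throw new IllegalArgumentException("Not a maven:// URI: " + uri);
        }
        String[] parts = uri.substring(PREFIX.length()).split(":");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected groupId:artifactId:version in: " + uri);
        }
        for (String part : parts) {
            if (part.isEmpty()) {
                throw new IllegalArgumentException("Empty coordinate in: " + uri);
            }
        }
        return new MavenUriSketch(parts[0], parts[1], parts[2]);
    }

    public static void main(String[] args) {
        // Hypothetical coordinates, used only to exercise the parser.
        MavenUriSketch coordinates = parse("maven://org.example:my-source:1.0.0");
        System.out.println(coordinates.groupId + " / "
                + coordinates.artifactId + " / " + coordinates.version);
    }
}
```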

Shell/CLI: The Shell connects to the Spring Cloud Data Flow Server's REST API and supports a DSL that simplifies the process of defining a stream or task and managing its lifecycle.
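
Since the Shell is a client of the server's REST API, any HTTP client can talk to the same endpoints. The sketch below builds (but does not send) a request that would list stream definitions. It assumes a local server on the default port 9393 and the /streams/definitions endpoint; the class and method names are illustrative, and the endpoint should be checked against the REST API guide for the version in use.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class DataFlowRestSketch {
    // Assumption: a local Data Flow server listening on its default port.
    static final String BASE_URL = "http://localhost:9393";

    // Builds a GET request for the stream definitions resource without sending it.
    static HttpRequest listStreamDefinitions(String baseUrl) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/streams/definitions"))
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = listStreamDefinitions(BASE_URL);
        System.out.println(request.method() + " " + request.uri());
        // To actually send it (requires a running server):
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```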


Building

Clone the repo and type

$ ./mvnw -s .settings.xml clean install 

Building on Windows

When using Git on Windows to check out the project, it is important to handle line-endings correctly during checkouts. By default Git will change the line-endings during checkout to CRLF. This is, however, not desired for Spring Cloud Data Flow as this may lead to test failures under Windows.

Therefore, please ensure that you set Git property core.autocrlf to false, e.g. using: $ git config core.autocrlf false. For more information please refer to the Git documentation, Formatting and Whitespace.


Running Locally w/ Oracle

By default, the Dataflow server jar does not include the Oracle database driver dependency. If you want to use Oracle for development/testing when running locally, you can specify the local-dev-oracle Maven profile when building. The following command will include the Oracle driver dependency in the jar:

$ ./mvnw -s .settings.xml clean package -Plocal-dev-oracle

You can follow the steps in the Oracle on Mac ARM64 Wiki to run Oracle XE locally in Docker with Dataflow pointing at it.

NOTE: If you are not running Mac ARM64, skip the steps related to Homebrew and Colima.


Running Locally w/ Microsoft SQL Server

By default, the Dataflow server jar does not include the MSSQL database driver dependency. If you want to use MSSQL for development/testing when running locally, you can specify the local-dev-mssql Maven profile when building. The following command will include the MSSQL driver dependency in the jar:

$ ./mvnw -s .settings.xml clean package -Plocal-dev-mssql

You can follow the steps in the MSSQL on Mac ARM64 Wiki to run MSSQL locally in Docker with Dataflow pointing at it.

NOTE: If you are not running Mac ARM64, skip the steps related to Homebrew and Colima.


Running Locally w/ IBM DB2

By default, the Dataflow server jar does not include the DB2 database driver dependency. If you want to use DB2 for development/testing when running locally, you can specify the local-dev-db2 Maven profile when building. The following command will include the DB2 driver dependency in the jar:

$ ./mvnw -s .settings.xml clean package -Plocal-dev-db2

You can follow the steps in the DB2 on Mac ARM64 Wiki to run DB2 locally in Docker with Dataflow pointing at it.

NOTE: If you are not running Mac ARM64, skip the steps related to Homebrew and Colima.


Contributing

We welcome contributions! See the CONTRIBUTING guide for details.


Code formatting guidelines

  • The directory ./src/eclipse has two files for use with code formatting, eclipse-code-formatter.xml for the majority of the code formatting rules and eclipse.importorder to order the import statements.

  • In Eclipse, import these files by navigating to Window > Preferences and then to the menu items Preferences > Java > Code Style > Formatter and Preferences > Java > Code Style > Organize Imports, respectively.

  • In IntelliJ, install the Eclipse Code Formatter plugin. You can find it by searching "Browse Repositories" under the plugin option within IntelliJ (once installed, you will need to restart IntelliJ for it to take effect). Then navigate to IntelliJ IDEA > Preferences and select the Eclipse Code Formatter. Select the eclipse-code-formatter.xml file for the field Eclipse Java Formatter config file and the file eclipse.importorder for the field Import order. Enable the Eclipse code formatter by clicking Use the Eclipse code formatter, then click the OK button. NOTE: If you configure the Eclipse Code Formatter from File > Other Settings > Default Settings, it will set this policy across all of your IntelliJ projects.

License

Spring Cloud Data Flow is Open Source software released under the Apache 2.0 license.