spring-cloud-dataflow
A microservices-based Streaming and Batch data processing in Cloud Foundry and Kubernetes
Top Related Projects
A microservices-based Streaming and Batch data processing in Cloud Foundry and Kubernetes
Apache NiFi
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Apache Beam is a unified programming model for Batch and Streaming data processing.
Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
Quick Overview
Spring Cloud Data Flow is a toolkit for building data integration and real-time data processing pipelines. It provides out-of-the-box capabilities to deploy and scale data pipelines for use cases such as data ingestion, real-time analytics, and data import/export.
Pros
- Simplifies the development of cloud-native data microservices
- Provides a unified programming model for batch and streaming applications
- Offers a rich ecosystem of pre-built stream and task/batch starters
- Supports multiple runtime platforms (local, Cloud Foundry, Kubernetes)
Cons
- Learning curve can be steep for newcomers to Spring ecosystem
- Configuration and setup can be complex for large-scale deployments
- Limited support for non-JVM languages
- Requires careful resource management in production environments
Code Examples
- Defining a simple stream:
@EnableBinding(Source.class)
public class TimeSource {

    @Bean
    @InboundChannelAdapter(value = Source.OUTPUT, poller = @Poller(fixedDelay = "10000", maxMessagesPerPoll = "1"))
    public MessageSource<String> timerMessageSource() {
        return () -> new GenericMessage<>(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date()));
    }
}
This code defines a source that emits the current timestamp every 10 seconds.
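The @EnableBinding and @StreamListener annotations shown in these examples belong to the older annotation-based Spring Cloud Stream programming model, which has been deprecated in favor of functional (Supplier/Function/Consumer) beans in more recent releases. A minimal sketch of a functional-style equivalent, assuming the poll interval is configured via the spring.cloud.stream.poller.fixed-delay property (class and bean names here are illustrative):
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.function.Supplier;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;

@SpringBootApplication
public class TimeSourceApplication {

    // The binder polls this supplier on a fixed schedule and publishes each value
    // to the output destination; here it emits the current timestamp.
    @Bean
    public Supplier<String> time() {
        return () -> new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date());
    }

    public static void main(String[] args) {
        SpringApplication.run(TimeSourceApplication.class, args);
    }
}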
- Creating a processor:
@EnableBinding(Processor.class)
public class UppercaseProcessor {

    @StreamListener(Processor.INPUT)
    @SendTo(Processor.OUTPUT)
    public String process(String input) {
        return input.toUpperCase();
    }
}
This processor converts incoming messages to uppercase.
- Defining a sink:
@EnableBinding(Sink.class)
public class LogSink {

    @StreamListener(Sink.INPUT)
    public void log(String message) {
        System.out.println("Received: " + message);
    }
}
This sink logs received messages to the console.
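Under the functional model, the processor and sink above map to Function and Consumer beans; a minimal sketch (bean names are illustrative):
import java.util.function.Consumer;
import java.util.function.Function;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class StreamFunctions {

    // Processor: converts each incoming payload to uppercase.
    @Bean
    public Function<String, String> uppercase() {
        return String::toUpperCase;
    }

    // Sink: logs each received payload to the console.
    @Bean
    public Consumer<String> log() {
        return message -> System.out.println("Received: " + message);
    }
}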
Getting Started
- Install Spring Cloud Data Flow Server:
wget https://repo.spring.io/release/org/springframework/cloud/spring-cloud-dataflow-server-local/1.7.4.RELEASE/spring-cloud-dataflow-server-local-1.7.4.RELEASE.jar
java -jar spring-cloud-dataflow-server-local-1.7.4.RELEASE.jar
- Install Spring Cloud Data Flow Shell:
wget https://repo.spring.io/release/org/springframework/cloud/spring-cloud-dataflow-shell/1.7.4.RELEASE/spring-cloud-dataflow-shell-1.7.4.RELEASE.jar
java -jar spring-cloud-dataflow-shell-1.7.4.RELEASE.jar
- Create and deploy a stream:
dataflow:> app import --uri https://dataflow.spring.io/kafka-maven-latest
dataflow:> stream create --name logtime --definition "time | log"
dataflow:> stream deploy logtime
This sets up a local Data Flow server, launches the shell, and creates a simple stream that logs the current time.
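Once deployed, you can check the stream's status from the shell, for example with:
dataflow:> stream list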
Competitor Comparisons
A microservices-based Streaming and Batch data processing in Cloud Foundry and Kubernetes
Pros of spring-cloud-dataflow
- Provides a comprehensive platform for building and deploying data-intensive applications
- Offers a user-friendly UI for managing and monitoring data pipelines
- Supports a wide range of data processing and integration scenarios
Cons of spring-cloud-dataflow
- May have a steeper learning curve for developers new to Spring ecosystem
- Requires additional setup and configuration for complex deployment scenarios
- Could be considered overkill for simpler data processing tasks
Code Comparison
Both repositories contain the same codebase for Spring Cloud Data Flow. As they are identical, there are no relevant code differences to highlight. However, here's a sample of the project structure from both repositories:
spring-cloud-dataflow/
├── spring-cloud-dataflow-core/
├── spring-cloud-dataflow-rest-client/
├── spring-cloud-dataflow-shell-core/
└── spring-cloud-dataflow-server-core/
This structure showcases the modular design of Spring Cloud Data Flow, with separate components for core functionality, REST client, shell interface, and server core.
Apache NiFi
Pros of NiFi
- More mature and actively maintained project with a larger community
- Offers a user-friendly drag-and-drop interface for designing data flows
- Supports a wider range of data sources and protocols out-of-the-box
Cons of NiFi
- Steeper learning curve due to its extensive feature set
- Can be resource-intensive, especially for large-scale deployments
- Less integrated with the Spring ecosystem
Code Comparison
NiFi uses a Java-based configuration approach:
public class MyProcessor extends AbstractProcessor {

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        // Process data
    }
}
Spring Cloud Data Flow uses a more Spring-like configuration:
@EnableBinding(Processor.class)
public class MyProcessor {

    @StreamListener(Processor.INPUT)
    @SendTo(Processor.OUTPUT)
    public String process(String input) {
        // Process data
        return input;
    }
}
Both projects aim to simplify data integration and processing, but they take different approaches. NiFi focuses on a visual, flow-based programming model, while Spring Cloud Data Flow leverages the Spring ecosystem and programming model. The choice between them depends on specific project requirements, existing infrastructure, and team expertise.
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Pros of Airflow
- More extensive ecosystem with a wide range of plugins and integrations
- Better suited for complex, long-running workflows and data pipelines
- Active development and large community support
Cons of Airflow
- Steeper learning curve, especially for non-Python users
- Can be resource-intensive for smaller-scale operations
- Less focus on stream processing compared to Spring Cloud Data Flow
Code Comparison
Spring Cloud Data Flow:
@EnableTask
@SpringBootApplication
public class BatchJobApplication {

    public static void main(String[] args) {
        SpringApplication.run(BatchJobApplication.class, args);
    }
}
Airflow:
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

def my_task():
    print("Hello from Airflow!")

dag = DAG('my_dag', start_date=datetime(2023, 1, 1))
task = PythonOperator(task_id='my_task', python_callable=my_task, dag=dag)
Spring Cloud Data Flow focuses on Spring Boot-based microservices and stream processing, while Airflow provides a more general-purpose workflow management system with Python-based DAGs. Airflow offers greater flexibility for complex workflows, but Spring Cloud Data Flow may be more suitable for Spring-centric environments and real-time data processing scenarios.
Apache Beam is a unified programming model for Batch and Streaming data processing.
Pros of Beam
- Broader language support (Java, Python, Go, etc.)
- More extensive and flexible data processing capabilities
- Active development and larger community support
Cons of Beam
- Steeper learning curve due to more complex API
- Potentially overkill for simpler data processing tasks
- Less integrated with Spring ecosystem
Code Comparison
Spring Cloud Data Flow:
@EnableBinding(Processor.class)
public class MyProcessor {

    @StreamListener(Processor.INPUT)
    @SendTo(Processor.OUTPUT)
    public String process(String input) {
        return input.toUpperCase();
    }
}
Apache Beam:
PCollection<String> input = ...;
PCollection<String> output = input.apply(MapElements.into(TypeDescriptors.strings())
.via((String word) -> word.toUpperCase()));
Key Differences
- Spring Cloud Data Flow focuses on microservices and stream processing within the Spring ecosystem
- Beam is a more general-purpose data processing framework with a unified programming model
- Spring Cloud Data Flow provides a higher-level abstraction, while Beam offers more fine-grained control
- Beam supports both batch and streaming processing, whereas Spring Cloud Data Flow is primarily stream-oriented
- Spring Cloud Data Flow includes built-in orchestration and monitoring tools, while Beam relies on external runners and monitoring solutions
Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
Pros of Luigi
- More flexible and language-agnostic, supporting Python, Scala, and other languages
- Better suited for complex data pipelines and large-scale data processing
- Extensive visualization capabilities for workflow monitoring and debugging
Cons of Luigi
- Steeper learning curve due to its flexibility and extensive features
- Less integrated with Spring ecosystem and Java-based microservices
- Requires more manual configuration and setup compared to Spring Cloud Data Flow
Code Comparison
Luigi task example:
import luigi

class MyTask(luigi.Task):
    def requires(self):
        return SomeOtherTask()

    def run(self):
        # Task logic here
        pass
Spring Cloud Data Flow stream definition:
http | filter | log
Key Differences
- Luigi focuses on building complex data pipelines, while Spring Cloud Data Flow emphasizes stream processing and batch jobs.
- Luigi offers more granular control over task dependencies and execution, whereas Spring Cloud Data Flow provides a higher-level abstraction for data flow.
- Spring Cloud Data Flow integrates seamlessly with Spring Boot applications, making it more suitable for Java-based microservices architectures.
- Luigi's visualization tools are more advanced, offering detailed insights into task execution and dependencies.
- Spring Cloud Data Flow provides easier deployment and scaling options within cloud-native environments.
Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
Pros of Prefect
- More modern and actively maintained project
- Supports Python natively, making it accessible to data scientists
- Offers a wider range of execution environments, including cloud and serverless
Cons of Prefect
- Steeper learning curve for users familiar with Spring ecosystem
- Less integration with Java-based microservices architecture
- Potentially more complex setup for traditional enterprise environments
Code Comparison
Spring Cloud Data Flow:
@EnableTask
@SpringBootApplication
public class TaskApplication {

    public static void main(String[] args) {
        SpringApplication.run(TaskApplication.class, args);
    }
}
Prefect:
from prefect import task, Flow

@task
def hello_task():
    print("Hello, Prefect!")

with Flow("My First Flow") as flow:
    hello_task()
Spring Cloud Data Flow uses Java and Spring Boot annotations to define tasks, while Prefect uses Python decorators and a context manager to define flows. Prefect's syntax is more concise and Pythonic, which may be preferable for data science workflows. However, Spring Cloud Data Flow's approach integrates seamlessly with the broader Spring ecosystem, which can be advantageous in enterprise Java environments.
README
Spring Cloud Data Flow is no longer maintained as an open-source project by Broadcom, Inc.
For information about extended support or commercial options for Spring Cloud Data Flow, please read the official blog post here.
Spring Cloud Data Flow is a microservices-based toolkit for building streaming and batch data processing pipelines in Cloud Foundry and Kubernetes.
Data processing pipelines consist of Spring Boot apps, built using the Spring Cloud Stream or Spring Cloud Task microservice frameworks.
This makes Spring Cloud Data Flow ideal for a range of data processing use cases, from import/export to event streaming and predictive analytics.
Components
Architecture: The Spring Cloud Data Flow Server is a Spring Boot application that provides a RESTful API and REST clients (Shell, Dashboard, Java DSL). A single Spring Cloud Data Flow installation can support orchestrating the deployment of streams and tasks to Local, Cloud Foundry, and Kubernetes.
Familiarize yourself with the Spring Cloud Data Flow architecture and feature capabilities.
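As a rough illustration of the Java DSL mentioned above, a stream can be defined and deployed programmatically against a running server. This is a minimal sketch, assuming a local server at http://localhost:9393 and the time and log applications already registered; exact constructor and builder signatures may vary by release:
import java.net.URI;

import org.springframework.cloud.dataflow.rest.client.DataFlowTemplate;
import org.springframework.cloud.dataflow.rest.client.dsl.Stream;

public class StreamDslExample {

    public static void main(String[] args) throws Exception {
        // Connect to a locally running Data Flow server.
        DataFlowTemplate dataFlow = new DataFlowTemplate(new URI("http://localhost:9393"));

        // Define and deploy a "time | log" stream using the Java DSL.
        Stream stream = Stream.builder(dataFlow)
                .name("logtime")
                .definition("time | log")
                .create()
                .deploy();

        System.out.println("Stream 'logtime' deployed");
    }
}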
Deployer SPI: A Service Provider Interface (SPI) is defined in the Spring Cloud Deployer project. The Deployer SPI provides an abstraction layer for deploying the apps for a given streaming or batch data pipeline and managing the application lifecycle.
Spring Cloud Deployer Implementations: Local, Cloud Foundry, and Kubernetes.
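At the heart of the SPI is an AppDeployer contract; the following is a simplified sketch of its shape based on the Spring Cloud Deployer project (exact signatures and supporting types may differ by version):
// Simplified sketch of the AppDeployer SPI contract. Supporting types such as
// AppDeploymentRequest, AppStatus, and RuntimeEnvironmentInfo come from the
// spring-cloud-deployer-spi module.
public interface AppDeployer {

    // Deploy an application described by the request and return a unique deployment id.
    String deploy(AppDeploymentRequest request);

    // Undeploy a previously deployed application by its deployment id.
    void undeploy(String id);

    // Report the runtime status of a deployed application.
    AppStatus status(String id);

    // Describe the runtime environment backing this deployer implementation.
    RuntimeEnvironmentInfo environmentInfo();
}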
Domain Model: The Spring Cloud Data Flow domain module includes the concept of a stream that is a composition of Spring Cloud Stream applications in a linear data pipeline from a source to a sink, optionally including processor application(s) in between. The domain also includes the concept of a task, which may be any process that does not run indefinitely, including Spring Batch jobs.
Application Registry: The App Registry maintains the metadata of the catalog of reusable applications. For example, if relying on Maven coordinates, an application URI would be of the format: maven://<groupId>:<artifactId>:<version>.
Shell/CLI: The Shell connects to the Spring Cloud Data Flow Server's REST API and supports a DSL that simplifies the process of defining a stream or task and managing its lifecycle.
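For example, a custom task application could be registered with a Maven URI of that form and then launched from the shell (the coordinates and names below are placeholders):
dataflow:> app register --name my-task --type task --uri maven://com.example:my-task:1.0.0
dataflow:> task create my-task-def --definition "my-task"
dataflow:> task launch my-task-def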
Building
Clone the repo and type
$ ./mvnw -s .settings.xml clean install
Looking for more information? Follow this link.
Building on Windows
When using Git on Windows to check out the project, it is important to handle line-endings correctly during checkouts.
By default, Git will change the line-endings during checkout to CRLF. This is, however, not desired for Spring Cloud Data Flow, as this may lead to test failures under Windows.
Therefore, please ensure that you set the Git property core.autocrlf to false, e.g. using: $ git config core.autocrlf false.
For more information please refer to the Git documentation, Formatting and Whitespace.
Running Locally w/ Oracle
By default, the Dataflow server jar does not include the Oracle database driver dependency.
If you want to use Oracle for development/testing when running locally, you can specify the local-dev-oracle Maven profile when building.
The following command will include the Oracle driver dependency in the jar:
$ ./mvnw -s .settings.xml clean package -Plocal-dev-oracle
You can follow the steps in the Oracle on Mac ARM64 Wiki to run Oracle XE locally in Docker with Dataflow pointing at it.
NOTE: If you are not running on Mac ARM64, just skip the steps related to Homebrew and Colima.
Running Locally w/ Microsoft SQL Server
By default, the Dataflow server jar does not include the MSSQL database driver dependency.
If you want to use MSSQL for development/testing when running locally, you can specify the local-dev-mssql Maven profile when building.
The following command will include the MSSQL driver dependency in the jar:
$ ./mvnw -s .settings.xml clean package -Plocal-dev-mssql
You can follow the steps in the MSSQL on Mac ARM64 Wiki to run MSSQL locally in Docker with Dataflow pointing at it.
NOTE: If you are not running on Mac ARM64, just skip the steps related to Homebrew and Colima.
Running Locally w/ IBM DB2
By default, the Dataflow server jar does not include the DB2 database driver dependency.
If you want to use DB2 for development/testing when running locally, you can specify the local-dev-db2 Maven profile when building.
The following command will include the DB2 driver dependency in the jar:
$ ./mvnw -s .settings.xml clean package -Plocal-dev-db2
You can follow the steps in the DB2 on Mac ARM64 Wiki to run DB2 locally in Docker with Dataflow pointing at it.
NOTE: If you are not running on Mac ARM64, just skip the steps related to Homebrew and Colima.
Contributing
We welcome contributions! See the CONTRIBUTING guide for details.
Code formatting guidelines
- The directory ./src/eclipse has two files for use with code formatting: eclipse-code-formatter.xml for the majority of the code formatting rules and eclipse.importorder to order the import statements.
- In Eclipse you import these files by navigating Windows -> Preferences and then the menu items Preferences > Java > Code Style > Formatter and Preferences > Java > Code Style > Organize Imports respectively.
- In IntelliJ, install the plugin Eclipse Code Formatter. You can find it by searching "Browse Repositories" under the plugin option within IntelliJ (once installed you will need to restart IntelliJ for it to take effect). Then navigate to IntelliJ IDEA > Preferences and select the Eclipse Code Formatter. Select the eclipse-code-formatter.xml file for the field Eclipse Java Formatter config file and the file eclipse.importorder for the field Import order. Enable the Eclipse code formatter by clicking Use the Eclipse code formatter, then click the OK button. NOTE: If you configure the Eclipse Code Formatter from File > Other Settings > Default Settings, it will set this policy across all of your IntelliJ projects.
License
Spring Cloud Data Flow is Open Source software released under the Apache 2.0 license.