cruise-control
Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
Top Related Projects
Mirror of Apache Kafka
Confluent's Apache Kafka Golang client
CMAK is a tool for managing Apache Kafka clusters
Quick Overview
Cruise Control is an open-source project developed by LinkedIn to automate the operation of large-scale Apache Kafka clusters. It provides a set of tools and APIs for monitoring, balancing, and managing Kafka clusters, aiming to reduce the operational overhead and improve the reliability of Kafka deployments.
Pros
- Automated load balancing and resource management for Kafka clusters
- Supports multi-tenancy and provides isolation between different applications
- Offers a REST API for easy integration with existing systems and workflows
- Provides a web UI for monitoring and manual interventions when needed
Cons
- Requires careful configuration and tuning for optimal performance
- May introduce additional complexity to the Kafka ecosystem
- Learning curve for teams not familiar with Cruise Control's concepts and operations
- Limited documentation for advanced use cases and troubleshooting
Code Examples
- Starting a rebalance operation:
KafkaCruiseControl kafkaCruiseControl = new KafkaCruiseControl(config);
kafkaCruiseControl.rebalance(goals, dryRun, allowCapacityEstimation, concurrentPartitionMovements,
concurrentLeaderMovements, skipHardGoalCheck, excludedTopics);
- Getting the cluster load:
KafkaCruiseControl kafkaCruiseControl = new KafkaCruiseControl(config);
ClusterModel clusterModel = kafkaCruiseControl.clusterModel(start, end, allowCapacityEstimation, populateReplicaPlacementInfo);
- Adding a new broker to the cluster:
KafkaCruiseControl kafkaCruiseControl = new KafkaCruiseControl(config);
kafkaCruiseControl.addBrokers(brokerIds, dryRun, throttle, goals, allowCapacityEstimation, concurrentPartitionMovements,
concurrentLeaderMovements, skipHardGoalCheck, excludedTopics);
Getting Started
To get started with Cruise Control:
-
Clone the repository:
git clone https://github.com/linkedin/cruise-control.git
-
Build the project:
./gradlew build
-
Configure Cruise Control by editing
config/cruisecontrol.properties
-
Start Cruise Control:
./kafka-cruise-control-start.sh config/cruisecontrol.properties
-
Access the web UI at
http://localhost:9090/kafkacruisecontrol/
For more detailed instructions and configuration options, refer to the project's documentation on GitHub.
Competitor Comparisons
Mirror of Apache Kafka
Pros of Kafka
- Widely adopted, mature distributed streaming platform with extensive ecosystem
- Supports high-throughput, fault-tolerant data pipelines and streaming applications
- Offers built-in partitioning, replication, and fault-tolerance for scalability
Cons of Kafka
- Requires manual configuration and tuning for optimal performance
- Can be complex to set up and manage, especially for smaller deployments
- Limited built-in monitoring and self-healing capabilities
Code Comparison
Kafka (Producer example):
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, String> producer = new KafkaProducer<>(props);
Cruise Control (Anomaly detection configuration):
anomaly.detection.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,\
com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal
While Kafka focuses on core messaging functionality, Cruise Control provides automated management and optimization for Kafka clusters, complementing Kafka's capabilities with features like load balancing, anomaly detection, and self-healing.
Confluent's Apache Kafka Golang client
Pros of confluent-kafka-go
- Native Go implementation for Kafka client, offering better performance and integration with Go ecosystems
- Comprehensive support for Kafka features, including consumer groups, transactions, and admin operations
- Actively maintained by Confluent, ensuring compatibility with latest Kafka versions
Cons of confluent-kafka-go
- Focused solely on Kafka client functionality, lacking broader cluster management features
- Requires more manual configuration and management compared to automated solutions
- Limited built-in monitoring and metrics capabilities
Code Comparison
Cruise-control (Java):
KafkaCruiseControl kafkaCruiseControl = new KafkaCruiseControl(config, time);
kafkaCruiseControl.startUp();
GoalOptimizer.OptimizerResult result = kafkaCruiseControl.getOptimizationProposals();
confluent-kafka-go (Go):
producer, err := kafka.NewProducer(&kafka.ConfigMap{"bootstrap.servers": "localhost"})
topic := "myTopic"
message := &kafka.Message{TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: kafka.PartitionAny}, Value: []byte("test")}
err = producer.Produce(message, nil)
While Cruise-control focuses on cluster management and optimization, confluent-kafka-go provides a low-level client API for Kafka interactions in Go applications. Cruise-control offers automated management features, while confluent-kafka-go requires more manual configuration but provides finer control over Kafka operations.
CMAK is a tool for managing Apache Kafka clusters
Pros of CMAK
- User-friendly web interface for managing Apache Kafka clusters
- Supports multiple cluster management and monitoring
- Provides detailed metrics and visual representations of Kafka topics and brokers
Cons of CMAK
- Less focused on automated cluster balancing and optimization
- May require more manual intervention for cluster management tasks
- Limited integration with other tools compared to Cruise Control
Code Comparison
CMAK (Scala):
def getClusterList: Future[ApiError \/ IndexedSeq[String]] = {
kafkaManager.getClusterList.map { errorOrClusterList =>
errorOrClusterList.map { clusterList =>
clusterList.map(_.name)
}
}
}
Cruise Control (Java):
public void addBroker(BrokerApi brokerApi) {
_brokerApis.add(brokerApi);
_clusterModel.createReplica(brokerApi.brokerId());
_brokerIdToApi.put(brokerApi.brokerId(), brokerApi);
}
Summary
CMAK offers a user-friendly interface for Kafka cluster management, while Cruise Control focuses on automated optimization and balancing. CMAK provides visual representations and supports multiple clusters, but may require more manual intervention. Cruise Control excels in automated management but has a less intuitive interface. The code comparison shows CMAK's use of Scala for web-based operations, while Cruise Control uses Java for core cluster management functions.
Convert designs to code with AI
Introducing Visual Copilot: A new AI model to turn Figma designs to high quality code using your components.
Try Visual CopilotREADME
Cruise Control for Apache Kafka
Introduction
Cruise Control is a product that helps run Apache Kafka clusters at large scale. Due to the popularity of Apache Kafka, many companies have increasingly large Kafka clusters with hundreds of brokers. At LinkedIn, we have 10K+ Kafka brokers, which means broker deaths are an almost daily occurrence and balancing the workload of Kafka also becomes a big overhead.
Kafka Cruise Control is designed to address this operational scalability issue.
Features
Kafka Cruise Control provides the following features out of the box:
-
Resource utilization tracking for brokers, topics, and partitions.
-
Query the current Kafka cluster state to see the online and offline partitions, in-sync and out-of-sync replicas, replicas under
min.insync.replicas
, online and offline logDirs, and distribution of replicas in the cluster. -
Multi-goal rebalance proposal generation for:
- Rack-awareness
- Resource capacity violation checks (CPU, DISK, Network I/O)
- Per-broker replica count violation check
- Resource utilization balance (CPU, DISK, Network I/O)
- Leader traffic distribution
- Replica distribution for topics
- Global replica distribution
- Global leader replica distribution
- Custom goals that you wrote and plugged in
-
Anomaly detection, alerting, and self-healing for the Kafka cluster, including:
- Goal violation
- Broker failure detection
- Metric anomaly detection
- Disk failure detection (not available in
kafka_0_11_and_1_0
branch) - Slow broker detection (not available in
kafka_0_11_and_1_0
branch)
-
Admin operations, including:
- Add brokers
- Remove brokers
- Demote brokers
- Rebalance the cluster
- Fix offline replicas (not available in
kafka_0_11_and_1_0
branch) - Perform preferred leader election (PLE)
- Fix offline replicas
- Adjust replication factor
Environment Requirements
- The
main
(previouslymigrate_to_kafka_2_5
) branch of Cruise Control is compatible with Apache Kafka2.5+
(i.e. Releases with2.5.*
),2.6
(i.e. Releases with2.5.11+
),2.7
(i.e. Releases with2.5.36+
),2.8
(i.e. Releases with2.5.66+
),3.0
(i.e. Releases with2.5.85+
), and3.1
(i.e. Releases with2.5.85+
). - The
migrate_to_kafka_2_4
branch of Cruise Control is compatible with Apache Kafka2.4
(i.e. Releases with2.4.*
). - The
kafka_2_0_to_2_3
branch (deprecated) of Cruise Control is compatible with Apache Kafka2.0
,2.1
,2.2
, and2.3
(i.e. Releases with2.0.*
). - The
kafka_0_11_and_1_0
branch (deprecated) of Cruise Control is compatible with Apache Kafka0.11.0.0
,1.0
, and1.1
(i.e. Releases with0.1.*
). message.format.version
0.10.0
and above is needed.- The
kafka_2_0_to_2_3
andkafka_0_11_and_1_0
branches compile withScala 2.11
. - The branch
migrate_to_kafka_2_4
compiles withScala 2.12
. - The branch
migrate_to_kafka_2_5
compile withScala 2.13
. - This project requires Java 11.
Known Compatibility Issues
- Support for Apache Kafka
2.0
,2.1
,2.2
, and2.3
requires KAFKA-8875 hotfix.
Quick Start
- Get Cruise Control
- (Option-1): via
git clone
git clone https://github.com/linkedin/cruise-control.git && cd cruise-control/
- (Option-2): via browsing the available releases:
- Browse
https://github.com/linkedin/cruise-control/releases
to pick a release -- e.g.0.1.10
- Get and extract the release:
wget https://github.com/linkedin/cruise-control/archive/0.1.10.tar.gz && tar zxvf 0.1.10.tar.gz && cd cruise-control-0.1.10/
- Initialize the local repo:
git init && git add . && git commit -m "Init local repo." && git tag -a 0.1.10 -m "Init local version."
- Browse
- (Option-1): via
- This step is required if
CruiseControlMetricsReporter
is used for metrics collection (i.e. the default for Cruise Control). The metrics reporter periodically samples the Kafka raw metrics on the broker and sends them to a Kafka topic../gradlew jar
(Note: This project requires Java 11)- Copy
./cruise-control-metrics-reporter/build/libs/cruise-control-metrics-reporter-A.B.C.jar
(WhereA.B.C
is the version of the Cruise Control) to your Kafka server dependency jar folder. For Apache Kafka, the folder would becore/build/dependant-libs-SCALA_VERSION/
(for a Kafka source checkout) orlibs/
(for a Kafka release download). - Modify Kafka server configuration to set
metric.reporters
tocom.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter
. For Apache Kafka, server properties are located at./config/server.properties
. - If
SSL
is enabled, ensure that the relevant client configurations are properly set for all brokers in./config/server.properties
. Note thatCruiseControlMetricsReporter
takes all configurations for vanillaKafkaProducer
with a prefix ofcruise.control.metrics.reporter.
-- e.g.cruise.control.metrics.reporter.ssl.truststore.password
. - If the default broker cleanup policy is
compact
, make sure that the topic to which Cruise Control metrics reporter should send messages is created with thedelete
cleanup policy -- the default metrics reporter topic is__CruiseControlMetrics
.
- Start ZooKeeper and Kafka server (See tutorial).
- Modify
config/cruisecontrol.properties
of Cruise Control:- (Required) fill in
bootstrap.servers
andzookeeper.connect
to the Kafka cluster to be monitored. - (Required) update
capacity.config.file
to the path of your capacity file.- Capacity file is a JSON file that provides the capacity of the brokers
- You can start Cruise Control server with the default file (
config/capacityJBOD.json
), but it may not reflect the actual capacity of the brokers - See BrokerCapacityConfigurationFileResolver configurations for more information and examples
- (Optional) set
metric.sampler.class
to your implementation (the default sampler class isCruiseControlMetricsReporterSampler
) - (Optional) set
sample.store.class
to your implementation if you have one (the defaultSampleStore
isKafkaSampleStore
)
- (Required) fill in
- Run the following command
JAR files correspond to your applications and./gradlew jar copyDependantLibs ./kafka-cruise-control-start.sh [-jars PATH_TO_YOUR_JAR_1,PATH_TO_YOUR_JAR_2] config/cruisecontrol.properties [port]
port
enables customizing the Cruise Control port number (default:9090
).- (Note) To emit Cruise Control JMX metrics on a particular port (e.g.
56666
),export JMX_PORT=56666
before runningkafka-cruise-control-start.sh
- (Note) To emit Cruise Control JMX metrics on a particular port (e.g.
- (Verify your setup) Visit
http://localhost:9090/kafkacruisecontrol/state
(orhttp://localhost:\[port\]/kafkacruisecontrol/state
if you specified the port when starting Cruise Control).
Note:
- Cruise Control will need some time to read the raw Kafka metrics from the cluster.
- The metrics of a newly up broker may take a few minutes to get stable. Cruise Control will drop the inconsistent metrics (e.g when topic bytes-in is higher than broker bytes-in), so first few windows may not have enough valid partitions.
REST API
Cruise Control provides a REST API for users to interact with. See the wiki page for more details.
How Does It Work
Cruise Control relies on the recent load information of replicas to optimize the cluster.
Cruise Control periodically collects resource utilization samples at both broker- and partition-level to infer the traffic pattern of each partition. Based on the traffic characteristics and distribution of all the partitions, it derives the load impact of each partition over the brokers. Cruise Control then builds a workload model to simulate the workload of the Kafka cluster. The goal optimizer explores different ways to generate cluster workload optimization proposals based on the user-specified list of goals.
Cruise Control also monitors the liveness of all the brokers in the cluster. To avoid the loss of redundancy, Cruise Control automatically moves replicas from failed brokers to alive ones.
For more details about how Cruise Control achieves that, see these slides.
Configurations for Cruise Control
To read more about the configurations. Check the configurations wiki page.
Artifactory
Published at Jfrog Artifactory. See available releases.
Pluggable Components
More about pluggable components can be found in the pluggable components wiki page.
Metric Sampler
The metric sampler enables users to deploy Cruise Control to various environments and work with the existing metric systems.
Cruise Control provides a metrics reporter that can be configured in your Apache Kafka server. Metrics reporter generates performance metrics to a Kafka metrics topic that can be consumed by Cruise Control.
Sample Store
The Sample Store enables storage of collected metric samples and training samples in an external storage.
Metric sampling uses derived data from the raw metrics, and the accuracy of the derived data depends on the metadata of the cluster at that point. Hence, when we look at the old metrics, if we do not know the metadata at the point the metric was collected, the derived data would not be accurate. Sample Store helps solving this problem by storing the derived data directly to an external storage for later loading.
The default Sample Store implementation produces metric samples back to Kafka.
Goals
The goals in Cruise Control are pluggable with different priorities. The default goals in order of decreasing priority are:
- RackAwareGoal - Ensures that all replicas of each partition are assigned in a rack aware manner -- i.e. no more than one replica of each partition resides in the same rack.
- RackAwareDistributionGoal - A relaxed version of
RackAwareGoal
. Contrary toRackAwareGoal
, as long as replicas of each partition can achieve a perfectly even distribution across the racks, this goal lets placement of multiple replicas of a partition into a single rack. - MinTopicLeadersPerBrokerGoal - Ensures that each alive broker has at least a certain number of leader replica of each topic in a configured set of topics
- ReplicaCapacityGoal - Ensures that the maximum number of replicas per broker is under the specified maximum limit.
- DiskCapacityGoal - Ensures that Disk space usage of each broker is below a given threshold.
- NetworkInboundCapacityGoal - Ensures that inbound network utilization of each broker is below a given threshold.
- NetworkOutboundCapacityGoal - Ensures that outbound network utilization of each broker is below a given threshold.
- CpuCapacityGoal - Ensures that CPU utilization of each broker is below a given threshold.
- ReplicaDistributionGoal - Attempts to make all the brokers in a cluster have a similar number of replicas.
- PotentialNwOutGoal - Ensures that the potential network output (when all the replicas in the broker become leaders) on each of the broker do not exceed the brokerâs network outbound bandwidth capacity.
- DiskUsageDistributionGoal - Attempts to keep the Disk space usage variance among brokers within a certain range relative to the average Disk utilization.
- NetworkInboundUsageDistributionGoal - Attempts to keep the inbound network utilization variance among brokers within a certain range relative to the average inbound network utilization.
- NetworkOutboundUsageDistributionGoal - Attempts to keep the outbound network utilization variance among brokers within a certain range relative to the average outbound network utilization.
- CpuUsageDistributionGoal - Attempts to keep the CPU usage variance among brokers within a certain range relative to the average CPU utilization.
- LeaderReplicaDistributionGoal - Attempts to make all the brokers in a cluster have a similar number of leader replicas.
- LeaderBytesInDistributionGoal - Attempts to equalize the leader bytes in rate on each host.
- TopicReplicaDistributionGoal - Attempts to maintain an even distribution of any topic's partitions across the entire cluster.
- PreferredLeaderElectionGoal - Simply move the leaders to the first replica of each partition.
- KafkaAssignerDiskUsageDistributionGoal - (Kafka-assigner mode) Attempts to distribute disk usage evenly among brokers based on swap.
- IntraBrokerDiskCapacityGoal - (Rebalance-disk mode, not available in
kafka_0_11_and_1_0
branch) Ensures that Disk space usage of each disk is below a given threshold. - IntraBrokerDiskUsageDistributionGoal - (Rebalance-disk mode, not available in
kafka_0_11_and_1_0
branch) Attempts to keep the Disk space usage variance among disks within a certain range relative to the average broker Disk utilization.
Anomaly Notifier
The anomaly notifier allows users to be notified when an anomaly is detected. Anomalies include:
- Broker failure
- Goal violation
- Metric anomaly
- Disk failure (not available in
kafka_0_11_and_1_0
branch) - Slow brokers (not available in
kafka_0_11_and_1_0
branch) - Topic replication factor anomaly (not available in
kafka_0_11_and_1_0
branch) - Topic partition size anomaly (not available in
kafka_0_11_and_1_0
branch) - Maintenance Events (not available in
kafka_0_11_and_1_0
branch)
In addition to anomaly notifications, users can enable actions to be taken in response to an anomaly by turning self-healing on for the relevant anomaly detectors. Multiple anomaly detectors work in harmony using distinct mitigation mechanisms. Their actions broadly fall into the following categories:
- fix - fix the problem right away (e.g. start a rebalance, fix offline replicas)
- check - check the situation again after a configurable delay (e.g. adopt a grace period before fixing broker failures)
- ignore - ignore the anomaly (e.g. self-healing is disabled)
Top Related Projects
Mirror of Apache Kafka
Confluent's Apache Kafka Golang client
CMAK is a tool for managing Apache Kafka clusters
Convert designs to code with AI
Introducing Visual Copilot: A new AI model to turn Figma designs to high quality code using your components.
Try Visual Copilot