logging-flume
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log-like data
Top Related Projects
- logging-flume: Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log-like data
- Logstash: transport and process your logs, events, or other data
- Fluentd: Unified Logging Layer (project under CNCF)
- Vector: a high-performance observability data pipeline
- Telegraf: agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data
- Loki: like Prometheus, but for logs
Quick Overview
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store. It is designed to handle high-volume streaming data and can be used for log aggregation, event processing, and data ingestion into systems like Hadoop.
Pros
- Highly scalable and fault-tolerant architecture
- Flexible and customizable with a wide range of built-in and custom components
- Supports multiple sources and sinks, allowing integration with various data systems
- Provides reliable data delivery with transaction-based approach
Cons
- Can be complex to set up and configure for advanced use cases
- Limited built-in support for data transformation and processing
- May require significant resources for high-volume data streams
- Learning curve for newcomers to understand the concepts and architecture
Code Examples
- Basic Flume configuration for reading from a file and writing to HDFS:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/myapp.log
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Configuring a custom interceptor:
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.example.CustomInterceptor$Builder
a1.sources.r1.interceptors.i1.someProperty = someValue
- Setting up a load balancing sink processor:
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.selector = round_robin
a1.sinkgroups.g1.processor.backoff = true
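The memory channel used in the first example is fast but loses buffered events if the agent process dies. For the transaction-based reliable delivery listed under Pros, the channel can be swapped for Flume's durable file channel; the directory paths below are illustrative:

```
# Durable file channel (replaces a1.channels.c1.type = memory)
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data
```

The file channel persists events to disk between the source and sink transactions, trading some throughput for crash safety.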
Getting Started
- Download and install Apache Flume from the official website.
- Create a configuration file (e.g., example.conf) with your desired setup.
- Start the Flume agent:
$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console
- Flume will now start collecting and processing data according to your configuration.
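To verify a fresh installation before wiring up real data flows, a minimal netcat-to-logger agent is a common smoke test; the component names and port here are illustrative:

```
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
</antml```

With the agent running, lines typed into `telnet localhost 44444` should appear as events in the agent's console output.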
Competitor Comparisons
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log-like data
Pros of logging-flume
- Identical repository, so all features and functionality are the same
- No differences in performance or capabilities
- Consistent development and maintenance as they are the same project
Cons of logging-flume
- No unique advantages over the other repository
- Potential confusion for users due to duplicate repositories
- Redundant maintenance efforts if both repositories are actively managed
Code comparison
As both repositories are identical, there are no code differences to compare. Here's a sample of the code structure found in both repositories:
public class FlumeConfiguration {
    private static final Logger LOGGER = LoggerFactory.getLogger(FlumeConfiguration.class);

    private final Map<String, AgentConfiguration> agentConfigMap;
    private final Properties properties;
}
Both repositories contain the same codebase for Apache Flume, a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store.
Logstash - transport and process your logs, events, or other data
Pros of Logstash
- More extensive plugin ecosystem, allowing for greater flexibility in data processing and output
- Tighter integration with the Elastic Stack, providing seamless compatibility with Elasticsearch and Kibana
- Better support for real-time data processing and analytics
Cons of Logstash
- Higher resource consumption, especially in terms of memory usage
- Steeper learning curve for configuration and setup compared to Flume
- Less suitable for high-volume, high-throughput scenarios without additional tuning
Code Comparison
Flume configuration example:
agent.sources = s1
agent.channels = c1
agent.sinks = k1
agent.sources.s1.type = netcat
agent.sources.s1.bind = localhost
agent.sources.s1.port = 44444
Logstash configuration example:
input {
  tcp {
    port => 44444
    type => "example"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}
Both examples show basic configurations for receiving data on port 44444. Flume uses a more structured approach with explicit source, channel, and sink definitions, while Logstash uses a more streamlined input-output pipeline configuration.
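The richer in-pipeline processing listed among Logstash's pros lives in a filter stage between input and output. A grok filter for Apache access logs might look like this (illustrative):

```
filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}
```

Flume has no direct equivalent in configuration alone; comparable parsing would require a custom interceptor or downstream processing.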
Fluentd: Unified Logging Layer (project under CNCF)
Pros of Fluentd
- More flexible plugin system with over 500 community-contributed plugins
- Better performance and scalability, especially for high-volume data streams
- Easier configuration with a unified logging layer
Cons of Fluentd
- Steeper learning curve due to more complex architecture
- Requires more system resources, especially memory
Code Comparison
Flume configuration example:
agent.sources = s1
agent.channels = c1
agent.sinks = k1
agent.sources.s1.type = netcat
agent.sources.s1.bind = localhost
agent.sources.s1.port = 44444
Fluentd configuration example:
<source>
  @type tcp
  port 24224
  tag myapp.access
</source>

<match myapp.access>
  @type file
  path /var/log/fluent/access
</match>
Both Flume and Fluentd are popular log collection and aggregation tools, but they differ in their approach and capabilities. Flume is designed specifically for Hadoop ecosystems, while Fluentd is more versatile and can be used in various environments. Fluentd's plugin system and unified logging layer make it more adaptable to different use cases, but it may require more resources and have a steeper learning curve compared to Flume's simpler architecture.
A high-performance observability data pipeline.
Pros of Vector
- More modern and actively maintained, with frequent updates and releases
- Supports a wider range of data sources and sinks, including cloud-native integrations
- Written in Rust, offering better performance and lower resource usage
Cons of Vector
- Younger project with a smaller community compared to Flume
- Less documentation and fewer third-party resources available
- Steeper learning curve for users familiar with Java-based tools like Flume
Code Comparison
Vector configuration example:
[sources.apache_logs]
type = "file"
include = ["/var/log/apache2/*.log"]

[transforms.parse_apache]
# grok patterns such as %{COMBINEDAPACHELOG} need the grok parser, not regex_parser
type = "grok_parser"
inputs = ["apache_logs"]
pattern = '%{COMBINEDAPACHELOG}'

[sinks.elasticsearch]
type = "elasticsearch"
inputs = ["parse_apache"]
host = "http://localhost:9200"
Flume configuration example:
agent.sources = webserver
agent.channels = memoryChannel
agent.sinks = elasticsearch
agent.sources.webserver.type = exec
agent.sources.webserver.command = tail -F /var/log/apache2/access.log
agent.channels.memoryChannel.type = memory
agent.sinks.elasticsearch.type = elasticsearch
agent.sinks.elasticsearch.hostNames = 127.0.0.1:9200
Both examples show basic log collection and forwarding to Elasticsearch, but Vector's configuration is more concise and offers built-in parsing capabilities.
Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
Pros of Telegraf
- More versatile: Supports a wider range of input plugins and data sources
- Better performance: Designed for high-throughput data collection
- Active development: More frequent updates and community contributions
Cons of Telegraf
- Steeper learning curve: More complex configuration due to its extensive features
- Resource intensive: Can consume more system resources, especially with many plugins
Code Comparison
Flume configuration example:
agent.sources = netcat-source
agent.sinks = logger-sink
agent.channels = memory-channel
agent.sources.netcat-source.type = netcat
agent.sources.netcat-source.bind = localhost
agent.sources.netcat-source.port = 44444
Telegraf configuration example:
[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false

[[outputs.influxdb]]
  urls = ["http://localhost:8086"]
  database = "telegraf"
Both Flume and Telegraf are data collection and forwarding tools, but they have different strengths. Flume is primarily designed for log data collection and aggregation in Hadoop environments, while Telegraf is a more general-purpose metrics collection agent with broader application support.
Telegraf offers greater flexibility and a wider range of integrations, making it suitable for various monitoring scenarios. However, this versatility comes at the cost of increased complexity and potentially higher resource usage.
Flume's configuration tends to be simpler and more focused on log data, while Telegraf's configuration allows for more detailed customization of metrics collection and processing.
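For a closer analogue to Flume's log-tailing use case, Telegraf's tail input plugin can follow a log file; the file path and grok pattern below are illustrative:

```
[[inputs.tail]]
  files = ["/var/log/myapp.log"]
  from_beginning = false
  data_format = "grok"
  grok_patterns = ["%{COMBINEDAPACHELOG}"]
```

This plays roughly the role of Flume's exec/tail source, with parsing handled inline by the data_format setting.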
Like Prometheus, but for logs.
Pros of Loki
- Designed for cloud-native environments and Kubernetes
- Efficient storage and indexing optimized for logs
- Seamless integration with Grafana for visualization
Cons of Loki
- Less mature compared to Flume
- Focused exclusively on logs; metrics require separate Prometheus-ecosystem tooling
- Steeper learning curve for those unfamiliar with Prometheus ecosystem
Code Comparison
Flume configuration example:
agent.sources = r1
agent.channels = c1
agent.sinks = k1
agent.sources.r1.type = netcat
agent.sources.r1.bind = localhost
agent.sources.r1.port = 44444
Loki configuration example:
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
Both Flume and Loki serve as log collection and aggregation tools, but they cater to different use cases and environments. Flume is more established and versatile for traditional data center setups, while Loki is tailored for modern cloud-native architectures. The code examples showcase the configuration differences, with Flume using a properties-based format and Loki utilizing YAML.
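Loki itself only stores and queries logs; getting logs into it is typically the job of its companion agent, Promtail. A minimal Promtail scrape configuration (paths and labels illustrative) might look like:

```yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          __path__: /var/log/*.log
```

In the Flume model, Promtail fills roughly the role of the agent's source, while Loki acts as the centralized sink.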
README
Project status
[!WARNING] This project is not maintained anymore! It has been marked as dormant by Apache Logging Services consensus on 2024-10-10. Users are advised to migrate to alternatives. For other inquiries, see the support policy.
Welcome to Apache Flume!
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. The system is centrally managed and allows for intelligent dynamic management. It uses a simple extensible data model that allows for online analytic application.
The Apache Flume 1.x (NG) code line is a refactoring of the first generation Flume to solve certain known issues and limitations of the original design.
Apache Flume is open-sourced under the Apache Software Foundation License v2.0.
Documentation
Documentation is included in the binary distribution under the docs directory. In source form, it can be found in the flume-ng-doc directory.
The Flume 1.x guide and FAQ are available here:
Contact us!
- Mailing lists: https://cwiki.apache.org/confluence/display/FLUME/Mailing+Lists
- IRC channel #flume on irc.freenode.net
Bug and Issue tracker.
Compiling Flume
Compiling Flume requires the following tools:
- Oracle Java JDK 1.8
- Apache Maven 3.x
Note: The Apache Flume build requires more memory than the default configuration. We recommend you set the following Maven options:
export MAVEN_OPTS="-Xms512m -Xmx1024m"
To compile Flume and build a distribution tarball, run mvn install from the top-level directory. The artifacts will be placed under flume-ng-dist/target/.