logging-flume
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log-like data
Top Related Projects
- logging-flume: Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log-like data
- Logstash: transport and process your logs, events, or other data
- Fluentd: Unified Logging Layer (project under CNCF)
- Vector: a high-performance observability data pipeline
- Telegraf: agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data
- Loki: like Prometheus, but for logs
Quick Overview
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store. It is designed to handle high-volume streaming data and can be used for log aggregation, event processing, and data ingestion into systems like Hadoop.
Pros
- Highly scalable and fault-tolerant architecture
- Flexible and customizable with a wide range of built-in and custom components
- Supports multiple sources and sinks, allowing integration with various data systems
- Provides reliable data delivery with transaction-based approach
Cons
- Can be complex to set up and configure for advanced use cases
- Limited built-in support for data transformation and processing
- May require significant resources for high-volume data streams
- Learning curve for newcomers to understand the concepts and architecture
Code Examples
- Basic Flume configuration for reading from a file and writing to HDFS:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/myapp.log
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Configuring a custom interceptor:
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.example.CustomInterceptor$Builder
a1.sources.r1.interceptors.i1.someProperty = someValue
- Setting up a load balancing sink processor:
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.selector = round_robin
a1.sinkgroups.g1.processor.backoff = true
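The memory channel used in the first example is fast but loses buffered events if the agent process dies. For the transaction-based reliable delivery listed under Pros, the channel can be swapped for Flume's durable file channel; the directory paths below are illustrative:

```
# Durable file channel (replaces a1.channels.c1.type = memory)
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data
```

The file channel persists events to disk between the source and sink transactions, trading some throughput for crash safety.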
Getting Started
- Download and install Apache Flume from the official website.
- Create a configuration file (e.g., example.conf) with your desired setup.
- Start the Flume agent:
$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console
- Flume will now start collecting and processing data according to your configuration.
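To verify a fresh installation before wiring up real data flows, a minimal netcat-to-logger agent is a common smoke test; the component names and port here are illustrative:

```
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
</antml```

With the agent running, lines typed into `telnet localhost 44444` should appear as events in the agent's console output.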
Competitor Comparisons
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log-like data
Pros of logging-flume
- Identical repository, so all features and functionality are the same
- No differences in performance or capabilities
- Consistent development and maintenance as they are the same project
Cons of logging-flume
- No unique advantages over the other repository
- Potential confusion for users due to duplicate repositories
- Redundant maintenance efforts if both repositories are actively managed
Code comparison
As both repositories are identical, there are no code differences to compare. Here's a sample of the code structure found in both repositories:
public class FlumeConfiguration {
    private static final Logger LOGGER = LoggerFactory.getLogger(FlumeConfiguration.class);

    private final Map<String, AgentConfiguration> agentConfigMap;
    private final Properties properties;
}
Both repositories contain the same codebase for Apache Flume, a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store.
Logstash - transport and process your logs, events, or other data
Pros of Logstash
- More extensive plugin ecosystem, allowing for greater flexibility in data processing and output
- Tighter integration with the Elastic Stack, providing seamless compatibility with Elasticsearch and Kibana
- Better support for real-time data processing and analytics
Cons of Logstash
- Higher resource consumption, especially in terms of memory usage
- Steeper learning curve for configuration and setup compared to Flume
- Less suitable for high-volume, high-throughput scenarios without additional tuning
Code Comparison
Flume configuration example:
agent.sources = s1
agent.channels = c1
agent.sinks = k1
agent.sources.s1.type = netcat
agent.sources.s1.bind = localhost
agent.sources.s1.port = 44444
Logstash configuration example:
input {
  tcp {
    port => 44444
    type => "example"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}
Both examples show basic configurations for receiving data on port 44444. Flume uses a more structured approach with explicit source, channel, and sink definitions, while Logstash uses a more streamlined input-output pipeline configuration.
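The richer in-pipeline processing listed among Logstash's pros lives in a filter stage between input and output. A grok filter for Apache access logs might look like this (illustrative):

```
filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}
```

Flume has no direct equivalent in configuration alone; comparable parsing would require a custom interceptor or downstream processing.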
Fluentd: Unified Logging Layer (project under CNCF)
Pros of Fluentd
- More flexible plugin system with over 500 community-contributed plugins
- Better performance and scalability, especially for high-volume data streams
- Easier configuration with a unified logging layer
Cons of Fluentd
- Steeper learning curve due to more complex architecture
- Requires more system resources, especially memory
Code Comparison
Flume configuration example:
agent.sources = s1
agent.channels = c1
agent.sinks = k1
agent.sources.s1.type = netcat
agent.sources.s1.bind = localhost
agent.sources.s1.port = 44444
Fluentd configuration example:
<source>
  @type tcp
  port 24224
  tag myapp.access
</source>

<match myapp.access>
  @type file
  path /var/log/fluent/access
</match>
Both Flume and Fluentd are popular log collection and aggregation tools, but they differ in their approach and capabilities. Flume is designed specifically for Hadoop ecosystems, while Fluentd is more versatile and can be used in various environments. Fluentd's plugin system and unified logging layer make it more adaptable to different use cases, but it may require more resources and have a steeper learning curve compared to Flume's simpler architecture.
A high-performance observability data pipeline.
Pros of Vector
- More modern and actively maintained, with frequent updates and releases
- Supports a wider range of data sources and sinks, including cloud-native integrations
- Written in Rust, offering better performance and lower resource usage
Cons of Vector
- Younger project with a smaller community compared to Flume
- Less documentation and fewer third-party resources available
- Steeper learning curve for users familiar with Java-based tools like Flume
Code Comparison
Vector configuration example:
[sources.apache_logs]
type = "file"
include = ["/var/log/apache2/*.log"]

[transforms.parse_apache]
# grok patterns such as %{COMBINEDAPACHELOG} need the grok parser, not regex_parser
type = "grok_parser"
inputs = ["apache_logs"]
pattern = '%{COMBINEDAPACHELOG}'

[sinks.elasticsearch]
type = "elasticsearch"
inputs = ["parse_apache"]
host = "http://localhost:9200"
Flume configuration example:
agent.sources = webserver
agent.channels = memoryChannel
agent.sinks = elasticsearch
agent.sources.webserver.type = exec
agent.sources.webserver.command = tail -F /var/log/apache2/access.log
agent.channels.memoryChannel.type = memory
agent.sinks.elasticsearch.type = elasticsearch
agent.sinks.elasticsearch.hostNames = 127.0.0.1:9200
Both examples show basic log collection and forwarding to Elasticsearch, but Vector's configuration is more concise and offers built-in parsing capabilities.
Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
Pros of Telegraf
- More versatile: Supports a wider range of input plugins and data sources
- Better performance: Designed for high-throughput data collection
- Active development: More frequent updates and community contributions
Cons of Telegraf
- Steeper learning curve: More complex configuration due to its extensive features
- Resource intensive: Can consume more system resources, especially with many plugins
Code Comparison
Flume configuration example:
agent.sources = netcat-source
agent.sinks = logger-sink
agent.channels = memory-channel
agent.sources.netcat-source.type = netcat
agent.sources.netcat-source.bind = localhost
agent.sources.netcat-source.port = 44444
Telegraf configuration example:
[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false

[[outputs.influxdb]]
  urls = ["http://localhost:8086"]
  database = "telegraf"
Both Flume and Telegraf are data collection and forwarding tools, but they have different strengths. Flume is primarily designed for log data collection and aggregation in Hadoop environments, while Telegraf is a more general-purpose metrics collection agent with broader application support.
Telegraf offers greater flexibility and a wider range of integrations, making it suitable for various monitoring scenarios. However, this versatility comes at the cost of increased complexity and potentially higher resource usage.
Flume's configuration tends to be simpler and more focused on log data, while Telegraf's configuration allows for more detailed customization of metrics collection and processing.
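For a closer analogue to Flume's log-tailing use case, Telegraf's tail input plugin can follow a log file; the file path and grok pattern below are illustrative:

```
[[inputs.tail]]
  files = ["/var/log/myapp.log"]
  from_beginning = false
  data_format = "grok"
  grok_patterns = ["%{COMBINEDAPACHELOG}"]
```

This plays roughly the role of Flume's exec/tail source, with parsing handled inline by the data_format setting.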
Like Prometheus, but for logs.
Pros of Loki
- Designed for cloud-native environments and Kubernetes
- Efficient storage and indexing optimized for logs
- Seamless integration with Grafana for visualization
Cons of Loki
- Less mature compared to Flume
- Focused exclusively on logs; metrics require separate Prometheus-ecosystem tooling
- Steeper learning curve for those unfamiliar with Prometheus ecosystem
Code Comparison
Flume configuration example:
agent.sources = r1
agent.channels = c1
agent.sinks = k1
agent.sources.r1.type = netcat
agent.sources.r1.bind = localhost
agent.sources.r1.port = 44444
Loki configuration example:
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
Both Flume and Loki serve as log collection and aggregation tools, but they cater to different use cases and environments. Flume is more established and versatile for traditional data center setups, while Loki is tailored for modern cloud-native architectures. The code examples showcase the configuration differences, with Flume using a properties-based format and Loki utilizing YAML.
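Loki itself only stores and queries logs; getting logs into it is typically the job of its companion agent, Promtail. A minimal Promtail scrape configuration (paths and labels illustrative) might look like:

```yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          __path__: /var/log/*.log
```

In the Flume model, Promtail fills roughly the role of the agent's source, while Loki acts as the centralized sink.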
README
Project status
[!WARNING] This project is not maintained anymore! It has been marked as dormant by Apache Logging Services consensus on 2024-10-10. Users are advised to migrate to alternatives. For other inquiries, see the support policy.
Welcome to Apache Flume!
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. The system is centrally managed and allows for intelligent dynamic management. It uses a simple extensible data model that allows for online analytic application.
The Apache Flume 1.x (NG) code line is a refactoring of the first generation Flume to solve certain known issues and limitations of the original design.
Apache Flume is open-sourced under the Apache Software Foundation License v2.0.
Documentation
Documentation is included in the binary distribution under the docs directory. In source form, it can be found in the flume-ng-doc directory.
The Flume 1.x guide and FAQ are available here:
Contact us!
- Mailing lists: https://cwiki.apache.org/confluence/display/FLUME/Mailing+Lists
- IRC channel #flume on irc.freenode.net
Bug and Issue tracker.
Compiling Flume
Compiling Flume requires the following tools:
- Oracle Java JDK 1.8
- Apache Maven 3.x
Note: The Apache Flume build requires more memory than the default configuration. We recommend you set the following Maven options:
export MAVEN_OPTS="-Xms512m -Xmx1024m"
To compile Flume and build a distribution tarball, run mvn install from the top-level directory. The artifacts will be placed under flume-ng-dist/target/.