Apache Hadoop

Top Related Projects

  • Apache Spark (39,274 stars) - A unified analytics engine for large-scale data processing
  • Apache Flink (23,929 stars)
  • Apache Hive (5,524 stars)

Quick Overview

Apache Hadoop is an open-source framework for the distributed storage and processing of large datasets across clusters of computers. It provides a reliable, scalable platform for big data analytics and batch processing.

Pros

  • Scalability: Hadoop can handle petabytes of data by distributing storage and processing across multiple nodes
  • Fault tolerance: The system keeps multiple replicas of each data block, so it continues operating even if individual nodes fail (see the sketch after this list)
  • Cost-effective: Hadoop can run on commodity hardware, reducing infrastructure costs
  • Flexibility: Supports various data types and can integrate with multiple data sources and tools
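
HDFS provides this fault tolerance largely through block replication. As a minimal sketch (the file path is hypothetical), the replication factor of a file can be inspected and changed through the FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path path = new Path("/path/to/file.txt");

// Each HDFS block is stored on several DataNodes (3 by default),
// so the file survives the loss of individual nodes.
short current = fs.getFileStatus(path).getReplication();
fs.setReplication(path, (short) 5);  // request five replicas per block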

Cons

  • Complexity: Setting up and maintaining a Hadoop cluster can be challenging and requires specialized skills
  • Performance: Not ideal for real-time or low-latency processing tasks
  • Small file problem: Hadoop is optimized for large files and can be inefficient when dealing with many small files (a common mitigation is sketched after this list)
  • Steep learning curve: Requires understanding of distributed systems and MapReduce programming model
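
One common mitigation for the small file problem is to pack many small files into a single container file, so the NameNode tracks one large file instead of thousands of tiny ones. A minimal sketch (directory paths are hypothetical) using a SequenceFile keyed by filename:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(new Path("/packed/files.seq")),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class))) {
    // Append each small file as one (filename, contents) record
    for (FileStatus status : fs.listStatus(new Path("/small-files"))) {
        byte[] contents = new byte[(int) status.getLen()];
        try (FSDataInputStream in = fs.open(status.getPath())) {
            in.readFully(contents);
        }
        writer.append(new Text(status.getPath().getName()), new BytesWritable(contents));
    }
}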

Code Examples

  1. Reading a file from HDFS:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Open a file stored in HDFS and print it line by line
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path path = new Path("/path/to/file.txt");
try (FSDataInputStream inputStream = fs.open(path);
     BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream))) {
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);
    }
}
  2. Writing a file to HDFS:
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Create a file in HDFS and write a line of text to it
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path path = new Path("/path/to/output.txt");
try (FSDataOutputStream outputStream = fs.create(path);
     BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(outputStream))) {
    writer.write("Hello, Hadoop!");
}
  3. Running a MapReduce job:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Configure and submit a word-count job, then wait for it to finish
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
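
The job driver above references TokenizerMapper and IntSumReducer without defining them. A minimal version of those inner classes of WordCount, following the classic word-count example from the Hadoop MapReduce tutorial:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Emit (word, 1) for every token in the input line
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Sum the counts emitted for this word
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}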

Getting Started

  1. Download Hadoop from the official Apache website
  2. Set up JAVA_HOME and HADOOP_HOME environment variables
  3. Configure Hadoop by editing core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml (a minimal example follows this list)
  4. Format the HDFS NameNode: bin/hdfs namenode -format
  5. Start Hadoop services: sbin/start-dfs.sh and sbin/start-yarn.sh
  6. Access the NameNode web interface at http://localhost:9870
  7. Run a sample MapReduce job:
    bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar wordcount /input /output
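
For step 3, a minimal pseudo-distributed (single-node) configuration, along the lines of the Hadoop single-node setup guide:

core-site.xml:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>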
    

Competitor Comparisons

Apache Spark - A unified analytics engine for large-scale data processing

Pros of Spark

  • Faster processing speed due to in-memory computation
  • More versatile with support for SQL, streaming, and machine learning
  • Simpler API and easier to use for developers

Cons of Spark

  • Higher memory requirements, which can be costly for large datasets
  • Less mature ecosystem compared to Hadoop
  • Not as efficient for batch processing of very large datasets

Code Comparison

Hadoop MapReduce (Java):

public static class MapClass extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            context.write(new Text(tokenizer.nextToken()), new IntWritable(1));
        }
    }
}

Spark (Scala):

// Word count over an RDD (reduceByKey is an RDD operation,
// so read the file via sparkContext rather than spark.read)
val textFile = spark.sparkContext.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

The Spark code is more concise and easier to read, demonstrating its simpler API. However, Hadoop's MapReduce offers more fine-grained control over the mapping and reducing processes, which can be beneficial for complex data processing tasks.

Apache Flink

Pros of Flink

  • Faster processing for streaming data and real-time analytics
  • More flexible API with support for multiple programming languages
  • Lower latency and higher throughput for stream processing tasks

Cons of Flink

  • Smaller community and ecosystem compared to Hadoop
  • Less mature for batch processing workloads
  • Steeper learning curve for developers new to stream processing

Code Comparison

Flink (Java):

DataStream<String> stream = env.addSource(new FlinkKafkaConsumer<>("topic", new SimpleStringSchema(), properties));
stream.map(s -> s.toUpperCase())
      .filter(s -> s.startsWith("A"))
      .addSink(new FlinkKafkaProducer<>("output-topic", new SimpleStringSchema(), properties));

Hadoop (Java):

Job job = Job.getInstance(conf, "WordCount");
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

The Flink code example shows a simple streaming pipeline, while the Hadoop example demonstrates a typical MapReduce job setup. Flink's API is more concise and expressive for stream processing tasks, whereas Hadoop's MapReduce model is better suited for batch processing operations.

Apache Hive

Pros of Hive

  • Easier to use for data analysts with SQL-like query language (HiveQL)
  • Better suited for data warehousing and analytics on structured data
  • Faster query execution for complex analytical queries

Cons of Hive

  • Limited real-time querying capabilities, since HiveQL statements typically compile to batch jobs
  • Less flexible for unstructured data processing
  • Dependent on Hadoop for underlying storage and processing

Code Comparison

Hive (HiveQL):

SELECT customer_id, SUM(order_total)
FROM orders
GROUP BY customer_id
HAVING SUM(order_total) > 1000;

Hadoop (MapReduce in Java):

public static class MapClass extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        context.write(new Text(fields[0]), new DoubleWritable(Double.parseDouble(fields[1])));
    }
}

public static class ReduceClass extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    public void reduce(Text key, Iterable<DoubleWritable> values, Context context) throws IOException, InterruptedException {
        double sum = 0;
        for (DoubleWritable val : values) {
            sum += val.get();
        }
        if (sum > 1000) {
            context.write(key, new DoubleWritable(sum));
        }
    }
}

README

For the latest information about Hadoop, please visit our website at:

http://hadoop.apache.org/

and our wiki, at:

https://cwiki.apache.org/confluence/display/HADOOP/