Quick Overview
Scalding is a Scala library that makes it easy to write MapReduce jobs in Hadoop. It's built on top of Cascading, a Java library for building data processing workflows. Scalding provides a high-level DSL for expressing complex data transformations, making it easier to write and maintain big data processing pipelines.
Pros
- High-level abstraction: Simplifies writing complex MapReduce jobs
- Scala-based: Leverages Scala's type system and functional programming features
- Integration with Hadoop ecosystem: Works well with other Hadoop tools and frameworks
- Rich set of built-in operations: Provides many common data processing functions out of the box
Cons
- Learning curve: Requires knowledge of Scala and functional programming concepts
- Limited to batch processing: Not suitable for real-time or streaming data processing
- Dependency on Cascading: May introduce additional complexity and potential version conflicts
- Performance overhead: Abstraction layer may introduce some performance overhead compared to raw MapReduce
Code Examples
- Reading and filtering data:
val input = TypedPipe.from(TextLine("input.txt"))
val filtered = input.filter(line => line.contains("important"))
filtered.write(TypedTsv[String]("output.txt"))
- Word count example:
val words = TypedPipe.from(TextLine("input.txt"))
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .sumByKey
  .write(TypedTsv[(String, Int)]("word_count.txt"))
- Join operation:
val users = TypedPipe.from(TypedTsv[(Int, String)]("users.tsv"))
// Re-key orders as (userId, (orderId, amount)) so the join key is explicit.
val orders = TypedPipe.from(TypedTsv[(Int, String, Double)]("orders.tsv"))
  .map { case (userId, orderId, amount) => (userId, (orderId, amount)) }
val joined = users.group.join(orders.group)
  .toTypedPipe
  .map { case (userId, (name, (orderId, amount))) => (userId, name, orderId, amount) }
joined.write(TypedTsv[(Int, String, String, Double)]("user_orders.tsv"))
Getting Started
To use Scalding in your project, add the following dependency to your build.sbt file:
libraryDependencies += "com.twitter" %% "scalding-core" % "0.17.4"
Create a new Scala class that extends Job:
import com.twitter.scalding._
class MyJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine("input.txt"))
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1))
    .sumByKey
    .write(TypedTsv[(String, Int)]("output.txt"))
}
Run the job using the scald.rb script provided by Scalding, or by submitting it to a Hadoop cluster.
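If you would rather launch the job from your own code than use scald.rb, one option is Hadoop's ToolRunner together with Scalding's Tool entry point. The sketch below is illustrative, not canonical: it assumes MyJob is on the classpath (use the fully qualified name if it lives in a package) and that local mode is acceptable.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.util.ToolRunner

object RunMyJob {
  def main(args: Array[String]): Unit = {
    // First argument is the job class name; --local runs without a cluster.
    ToolRunner.run(new Configuration, new com.twitter.scalding.Tool,
      Array("MyJob", "--local"))
  }
}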
Competitor Comparisons
Apache Spark - A unified analytics engine for large-scale data processing
Pros of Spark
- Faster performance for large-scale data processing
- More comprehensive ecosystem with support for streaming, machine learning, and graph processing
- Better support for interactive data exploration and ad-hoc queries
Cons of Spark
- Steeper learning curve, especially for developers new to distributed computing
- Higher memory requirements, which can be challenging in resource-constrained environments
- More complex setup and configuration compared to Scalding
Code Comparison
Scalding (Scala):
val input = TypedPipe.from(TextLine("input.txt"))
val wordCounts = input.flatMap(line => line.split("\\s+"))
  .groupBy(word => word)
  .size
wordCounts.write(TypedTsv[(String, Long)]("output.txt"))
Spark (Scala):
val input = sc.textFile("input.txt")
val wordCounts = input.flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
wordCounts.saveAsTextFile("output.txt")
Both examples implement a simple word count and show how similar the two APIs are in syntax and approach. The main difference is that Spark states the reduction explicitly with reduceByKey, while Scalding derives it from groupBy and size; Spark also keeps intermediate data in memory, which accounts for much of its speed advantage.
Apache Flink
Pros of Flink
- Supports both batch and stream processing, offering more versatility than Scalding
- Provides lower latency and higher throughput for real-time data processing
- Has a larger and more active community, resulting in frequent updates and improvements
Cons of Flink
- Steeper learning curve compared to Scalding, especially for those familiar with Hadoop
- Requires more complex setup and configuration, particularly for cluster deployment
- May have higher resource requirements, especially for large-scale stream processing
Code Comparison
Scalding (Scala):
val input = TypedPipe.from(TextLine("input.txt"))
val wordCounts = input.flatMap(line => line.split("\\s+"))
  .groupBy(word => word)
  .size
wordCounts.write(TypedTsv[(String, Long)]("output.txt"))
Flink (Java):
DataStream<String> text = env.readTextFile("input.txt");
DataStream<Tuple2<String, Integer>> counts = text
    .flatMap(new Tokenizer())
    .keyBy(0)
    .sum(1);
counts.writeAsCsv("output.txt");
Both examples demonstrate a simple word count operation, showcasing the different syntax and approaches used in Scalding and Flink. Scalding's code is more concise due to Scala's expressiveness, while Flink's Java code is more verbose but offers more explicit control over the data processing pipeline.
Apache Beam is a unified programming model for Batch and Streaming data processing.
Pros of Beam
- Supports multiple programming languages (Java, Python, Go)
- Provides a unified model for batch and streaming data processing
- Offers portability across various execution engines (Flink, Spark, Dataflow)
Cons of Beam
- Steeper learning curve due to more complex API
- Less focus on Hadoop-specific optimizations
- Potentially slower development for simple MapReduce jobs
Code Comparison
Scalding (Scala):
val input = TypedPipe.from(TextLine("input.txt"))
val wordCounts = input.flatMap(line => line.split("\\s+"))
  .groupBy(word => word)
  .size
wordCounts.write(TypedTsv[(String, Long)]("output.txt"))
Beam (Java):
PCollection<String> input = p.apply(TextIO.read().from("input.txt"));
PCollection<KV<String, Long>> wordCounts = input
    .apply(FlatMapElements.into(TypeDescriptors.strings())
        .via((String line) -> Arrays.asList(line.split("\\s+"))))
    .apply(Count.perElement());
// TextIO.write() expects strings, so format each KV before writing.
wordCounts
    .apply(MapElements.into(TypeDescriptors.strings())
        .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
    .apply(TextIO.write().to("output.txt"));
Summary
Beam offers a more versatile and portable solution for data processing across different engines and languages, while Scalding provides a simpler, Scala-based approach optimized for Hadoop environments. The choice between the two depends on specific project requirements, team expertise, and the desired level of flexibility in execution environments.
Apache Storm
Pros of Storm
- Real-time processing: Storm excels at processing data streams in real-time, making it suitable for applications requiring immediate insights
- Fault-tolerant and scalable: Built to handle large-scale distributed processing with automatic fault recovery
- Language-agnostic: Supports multiple programming languages, offering flexibility in development
Cons of Storm
- Steeper learning curve: More complex to set up and configure compared to Scalding
- Higher resource requirements: Typically requires more computational resources to run effectively
- Less integrated with Hadoop ecosystem: While it can work with Hadoop, it's not as tightly integrated as Scalding
Code Comparison
Storm topology definition:
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentenceBolt(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCountBolt(), 12).fieldsGrouping("split", new Fields("word"));
Scalding job definition:
class WordCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .flatMap(_.toLowerCase.split("\\s+"))
    .groupBy(identity)
    .size
    .write(TypedTsv[(String, Long)](args("output")))
}
Apache Hadoop
Pros of Hadoop
- More comprehensive ecosystem for distributed computing and big data processing
- Wider industry adoption and community support
- Better suited for handling very large-scale data processing tasks
Cons of Hadoop
- Steeper learning curve and more complex setup
- Can be overkill for smaller data processing tasks
- Requires more hardware resources and infrastructure
Code Comparison
Scalding (Scala):
val input = TypedPipe.from(TextLine("input.txt"))
val wordCounts = input.flatMap(line => line.split("\\s+"))
  .groupBy(word => word)
  .size
wordCounts.write(TypedTsv[(String, Long)]("output.txt"))
Hadoop (Java):
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
The code comparison shows a word count example in both frameworks. Scalding's code is more concise and leverages Scala's functional programming features, while Hadoop's Java code is more verbose but provides fine-grained control over the MapReduce process.
README
Scalding
Scalding is a Scala library that makes it easy to specify Hadoop MapReduce jobs. Scalding is built on top of Cascading, a Java library that abstracts away low-level Hadoop details. Scalding is comparable to Pig, but offers tight integration with Scala, bringing advantages of Scala to your MapReduce jobs.
Word Count
Hadoop is a distributed system for counting words. Here is how it's done in Scalding.
package com.twitter.scalding.examples
import com.twitter.scalding._
import com.twitter.scalding.source.TypedText
class WordCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .flatMap { line => tokenize(line) }
    .groupBy { word => word } // use each word for a key
    .size // in each group, get the size
    .write(TypedText.tsv[(String, Long)](args("output")))

  // Split a piece of text into individual words.
  def tokenize(text: String): Array[String] = {
    // Lowercase each word and remove punctuation.
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
  }
}
Notice that the tokenize function, which is standard Scala, integrates naturally with the rest of the MapReduce job. This is a very powerful feature of Scalding. (Compare it to the use of UDFs in Pig.)
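To make the point concrete, here is a small variation of the job above. The stopWords set and isStopWord helper are invented for this sketch; the point is that ordinary Scala values and functions drop straight into the pipeline with no UDF registration step.

import com.twitter.scalding._
import com.twitter.scalding.source.TypedText

class FilteredWordCountJob(args: Args) extends Job(args) {
  // Plain Scala, defined and used like any other code.
  val stopWords = Set("the", "a", "an", "and", "of")
  def isStopWord(word: String): Boolean = stopWords(word)

  TypedPipe.from(TextLine(args("input")))
    .flatMap { line => line.toLowerCase.split("\\s+") }
    .filterNot(isStopWord) // ordinary Scala predicate, no UDF machinery
    .groupBy(identity)
    .size
    .write(TypedText.tsv[(String, Long)](args("output")))
}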
You can find more example code under examples/. If you're interested in comparing Scalding to other languages, see our Rosetta Code page, which has several MapReduce tasks in Scalding and other frameworks (e.g., Pig and Hadoop Streaming).
Documentation and Getting Started
- Getting Started page on the Scalding Wiki
- Scalding Scaladocs provide details beyond the API References. Prefer these, as they are always up to date.
- REPL in Wonderland, a hands-on tour of the Scalding REPL, requiring only git and java installed.
- Runnable tutorials in the source.
- The API Reference, including many example Scalding snippets:
- The Matrix Library provides a way of working with key-attribute-value scalding pipes:
- The Introduction to Matrix Library contains an overview and a "getting started" example
- The Matrix API Reference contains the Matrix Library API reference with examples
- Introduction to Scalding Execution contains general rules and examples of calling Scalding from inside another application; a minimal sketch follows this list.
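As a rough illustration of that embedding style, the sketch below (file name and mode invented for the example; check the Execution docs for the exact run helpers in your version) builds an Execution value and runs it in local mode from an ordinary main method:

import com.twitter.scalding._

object EmbeddedScalding {
  // Describing the computation creates a value; nothing runs yet.
  val lineCount: Execution[Long] =
    TypedPipe.from(TextLine("input.txt"))
      .toIterableExecution // materialize the pipe's contents
      .map(_.size.toLong)  // ordinary Scala from here on

  def main(args: Array[String]): Unit = {
    // Run in local mode; on a cluster you would supply an Hdfs mode instead.
    println(lineCount.waitFor(Config.default, Local(strictSources = true)))
  }
}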
Please feel free to use the beautiful Scalding logo artwork anywhere.
Contact
For user questions or Scalding development (internals, extending, release planning): https://groups.google.com/forum/#!forum/scalding-dev (a Google search also works as a first step)
In the remote possibility that there exist bugs in this code, please report them to: https://github.com/twitter/scalding/issues
Follow @Scalding on Twitter for updates.
Get Involved + Code of Conduct
Pull requests and bug reports are always welcome!
We use a lightweight form of project governance inspired by the one used by Apache projects. Please see Contributing and Committership for our code of conduct and our pull request review process. The TL;DR: send us a pull request, iterate on the feedback and discussion, and get a +1 from a Committer to have your PR accepted.
The current list of active committers (who can +1 a pull request) can be found here: Committers
A list of contributors to the project can be found here: Contributors
Building
There is a script (called sbt) in the root that loads the correct sbt version to build:
./sbt update (takes 2 minutes or more)
./sbt test
./sbt assembly (needed to make the jar used by the scald.rb script)
The test suite takes a while to run. When you're in sbt, here's a shortcut to run just one test:
> test-only com.twitter.scalding.FileSourceTest
Please refer to the FAQ page if you encounter problems when using sbt.
We use Github Actions to verify the build and Coveralls for code coverage results.
Scalding modules are available from Maven Central. The current groupId and version for all modules are, respectively, "com.twitter" and 0.17.2.
Current published artifacts are:
- scalding-core_2.11, scalding-core_2.12
- scalding-args_2.11, scalding-args_2.12
- scalding-date_2.11, scalding-date_2.12
- scalding-commons_2.11, scalding-commons_2.12
- scalding-avro_2.11, scalding-avro_2.12
- scalding-parquet_2.11, scalding-parquet_2.12
- scalding-repl_2.11, scalding-repl_2.12
The suffix denotes the scala version.
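With sbt, the %% operator appends the matching Scala suffix automatically, so a dependency on scalding-core (using the version listed above) might look like:

libraryDependencies += "com.twitter" %% "scalding-core" % "0.17.2"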
Adopters
- Ebay
- Etsy
- Sharethrough
- Snowplow Analytics
- Soundcloud
To see a full list of users or to add yourself, see the wiki.
Authors:
- Avi Bryant http://twitter.com/avibryant
- Oscar Boykin http://twitter.com/posco
- Argyris Zymnis http://twitter.com/argyris
Thanks for assistance and contributions:
- Sam Ritchie http://twitter.com/sritchie
- Aaron Siegel http://twitter.com/asiegel
- Ian O'Connell http://twitter.com/0x138
- Alex Levenson http://twitter.com/THISWILLWORK
- Jonathan Coveney http://twitter.com/jco
- Kevin Lin http://twitter.com/reconditesea
- Brad Greenlee http://twitter.com/bgreenlee
- Edwin Chen http://twitter.com/edchedch
- Arkajit Dey http://twitter.com/arkajit
- Krishnan Raman http://twitter.com/dxbydt_jasq
- Flavian Vasile http://twitter.com/flavianv
- Chris Wensel http://twitter.com/cwensel
- Ning Liang http://twitter.com/ningliang
- Dmitriy Ryaboy http://twitter.com/squarecog
- Dong Wang http://twitter.com/dongwang218
- Josh Attenberg http://twitter.com/jattenberg
- Juliet Hougland https://twitter.com/j_houg
- Eddie Xie https://twitter.com/eddiex
A full list of contributors can be found on GitHub.
License
Copyright 2016 Twitter, Inc.
Licensed under the Apache License, Version 2.0