Quick Overview
Scalding is a Scala library that makes it easy to write MapReduce jobs in Hadoop. It's built on top of Cascading, a Java library for building data processing workflows. Scalding provides a high-level DSL for expressing complex data transformations, making it easier to write and maintain big data processing pipelines.
Pros
- High-level abstraction: Simplifies writing complex MapReduce jobs
- Scala-based: Leverages Scala's type system and functional programming features
- Integration with Hadoop ecosystem: Works well with other Hadoop tools and frameworks
- Rich set of built-in operations: Provides many common data processing functions out of the box
Cons
- Learning curve: Requires knowledge of Scala and functional programming concepts
- Limited to batch processing: Not suitable for real-time or streaming data processing
- Dependency on Cascading: May introduce additional complexity and potential version conflicts
- Performance overhead: Abstraction layer may introduce some performance overhead compared to raw MapReduce
Code Examples
- Reading and filtering data:
val input = TypedPipe.from(TextLine("input.txt"))
val filtered = input.filter(line => line.contains("important"))
filtered.write(TypedTsv[String]("output.txt"))
- Word count example:
val words = TypedPipe.from(TextLine("input.txt"))
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .sumByKey
  .write(TypedTsv[(String, Int)]("word_count.txt"))
- Join operation:
val users = TypedPipe.from(TypedTsv[(Int, String)]("users.tsv"))
// Re-key orders as (userId, (orderId, amount)) so the join key is explicit.
val orders = TypedPipe.from(TypedTsv[(Int, String, Double)]("orders.tsv"))
  .map { case (userId, orderId, amount) => (userId, (orderId, amount)) }
val joined = users.group.join(orders.group)
  .toTypedPipe
  .map { case (userId, (name, (orderId, amount))) => (userId, name, orderId, amount) }
joined.write(TypedTsv[(Int, String, String, Double)]("user_orders.tsv"))
Getting Started
To use Scalding in your project, add the following dependency to your build.sbt file:
libraryDependencies += "com.twitter" %% "scalding-core" % "0.17.4"
Create a new Scala class that extends Job:
import com.twitter.scalding._
class MyJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine("input.txt"))
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1))
    .sumByKey
    .write(TypedTsv[(String, Int)]("output.txt"))
}
Run the job using the scald.rb script provided by Scalding, or by submitting it to a Hadoop cluster.
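If you would rather launch the job from your own code than use scald.rb, one option is Hadoop's ToolRunner together with Scalding's Tool entry point. The sketch below is illustrative, not canonical: it assumes MyJob is on the classpath (use the fully qualified name if it lives in a package) and that local mode is acceptable.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.util.ToolRunner

object RunMyJob {
  def main(args: Array[String]): Unit = {
    // First argument is the job class name; --local runs without a cluster.
    ToolRunner.run(new Configuration, new com.twitter.scalding.Tool,
      Array("MyJob", "--local"))
  }
}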
Competitor Comparisons
Apache Spark - A unified analytics engine for large-scale data processing
Pros of Spark
- Faster performance for large-scale data processing
- More comprehensive ecosystem with support for streaming, machine learning, and graph processing
- Better support for interactive data exploration and ad-hoc queries
Cons of Spark
- Steeper learning curve, especially for developers new to distributed computing
- Higher memory requirements, which can be challenging in resource-constrained environments
- More complex setup and configuration compared to Scalding
Code Comparison
Scalding (Scala):
val input = TypedPipe.from(TextLine("input.txt"))
val wordCounts = input.flatMap(line => line.split("\\s+"))
  .groupBy(word => word)
  .size
wordCounts.write(TypedTsv[(String, Long)]("output.txt"))
Spark (Scala):
val input = sc.textFile("input.txt")
val wordCounts = input.flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
wordCounts.saveAsTextFile("output.txt")
Both examples implement a simple word count and show how similar the two APIs are in syntax and approach. The main difference is that Spark states the reduction explicitly with reduceByKey, while Scalding derives it from groupBy and size; Spark also keeps intermediate data in memory, which accounts for much of its speed advantage.
Apache Flink
Pros of Flink
- Supports both batch and stream processing, offering more versatility than Scalding
- Provides lower latency and higher throughput for real-time data processing
- Has a larger and more active community, resulting in frequent updates and improvements
Cons of Flink
- Steeper learning curve compared to Scalding, especially for those familiar with Hadoop
- Requires more complex setup and configuration, particularly for cluster deployment
- May have higher resource requirements, especially for large-scale stream processing
Code Comparison
Scalding (Scala):
val input = TypedPipe.from(TextLine("input.txt"))
val wordCounts = input.flatMap(line => line.split("\\s+"))
  .groupBy(word => word)
  .size
wordCounts.write(TypedTsv[(String, Long)]("output.txt"))
Flink (Java):
DataStream<String> text = env.readTextFile("input.txt");
DataStream<Tuple2<String, Integer>> counts = text
    .flatMap(new Tokenizer())
    .keyBy(0)
    .sum(1);
counts.writeAsCsv("output.txt");
Both examples demonstrate a simple word count operation, showcasing the different syntax and approaches used in Scalding and Flink. Scalding's code is more concise due to Scala's expressiveness, while Flink's Java code is more verbose but offers more explicit control over the data processing pipeline.
Apache Beam is a unified programming model for Batch and Streaming data processing.
Pros of Beam
- Supports multiple programming languages (Java, Python, Go)
- Provides a unified model for batch and streaming data processing
- Offers portability across various execution engines (Flink, Spark, Dataflow)
Cons of Beam
- Steeper learning curve due to more complex API
- Less focus on Hadoop-specific optimizations
- Potentially slower development for simple MapReduce jobs
Code Comparison
Scalding (Scala):
val input = TypedPipe.from(TextLine("input.txt"))
val wordCounts = input.flatMap(line => line.split("\\s+"))
  .groupBy(word => word)
  .size
wordCounts.write(TypedTsv[(String, Long)]("output.txt"))
Beam (Java):
PCollection<String> input = p.apply(TextIO.read().from("input.txt"));
PCollection<KV<String, Long>> wordCounts = input
    .apply(FlatMapElements.into(TypeDescriptors.strings())
        .via((String line) -> Arrays.asList(line.split("\\s+"))))
    .apply(Count.perElement());
// TextIO.write() expects strings, so format each KV before writing.
wordCounts
    .apply(MapElements.into(TypeDescriptors.strings())
        .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
    .apply(TextIO.write().to("output.txt"));
Summary
Beam offers a more versatile and portable solution for data processing across different engines and languages, while Scalding provides a simpler, Scala-based approach optimized for Hadoop environments. The choice between the two depends on specific project requirements, team expertise, and the desired level of flexibility in execution environments.
Apache Storm
Pros of Storm
- Real-time processing: Storm excels at processing data streams in real-time, making it suitable for applications requiring immediate insights
- Fault-tolerant and scalable: Built to handle large-scale distributed processing with automatic fault recovery
- Language-agnostic: Supports multiple programming languages, offering flexibility in development
Cons of Storm
- Steeper learning curve: More complex to set up and configure compared to Scalding
- Higher resource requirements: Typically requires more computational resources to run effectively
- Less integrated with Hadoop ecosystem: While it can work with Hadoop, it's not as tightly integrated as Scalding
Code Comparison
Storm topology definition:
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentenceBolt(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCountBolt(), 12).fieldsGrouping("split", new Fields("word"));
Scalding job definition:
class WordCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .flatMap(_.toLowerCase.split("\\s+"))
    .groupBy(identity)
    .size
    .write(TypedTsv[(String, Long)](args("output")))
}
Apache Hadoop
Pros of Hadoop
- More comprehensive ecosystem for distributed computing and big data processing
- Wider industry adoption and community support
- Better suited for handling very large-scale data processing tasks
Cons of Hadoop
- Steeper learning curve and more complex setup
- Can be overkill for smaller data processing tasks
- Requires more hardware resources and infrastructure
Code Comparison
Scalding (Scala):
val input = TypedPipe.from(TextLine("input.txt"))
val wordCounts = input.flatMap(line => line.split("\\s+"))
  .groupBy(word => word)
  .size
wordCounts.write(TypedTsv[(String, Long)]("output.txt"))
Hadoop (Java):
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
The code comparison shows a word count example in both frameworks. Scalding's code is more concise and leverages Scala's functional programming features, while Hadoop's Java code is more verbose but provides fine-grained control over the MapReduce process.
README
Scalding
Scalding is a Scala library that makes it easy to specify Hadoop MapReduce jobs. Scalding is built on top of Cascading, a Java library that abstracts away low-level Hadoop details. Scalding is comparable to Pig, but offers tight integration with Scala, bringing advantages of Scala to your MapReduce jobs.
Word Count
Hadoop is a distributed system for counting words. Here is how it's done in Scalding.
package com.twitter.scalding.examples
import com.twitter.scalding._
import com.twitter.scalding.source.TypedText
class WordCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .flatMap { line => tokenize(line) }
    .groupBy { word => word } // use each word for a key
    .size // in each group, get the size
    .write(TypedText.tsv[(String, Long)](args("output")))

  // Split a piece of text into individual words.
  def tokenize(text: String): Array[String] = {
    // Lowercase each word and remove punctuation.
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
  }
}
Notice that the tokenize function, which is standard Scala, integrates naturally with the rest of the MapReduce job. This is a very powerful feature of Scalding. (Compare it to the use of UDFs in Pig.)
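To make the point concrete, here is a small variation of the job above. The stopWords set and isStopWord helper are invented for this sketch; the point is that ordinary Scala values and functions drop straight into the pipeline with no UDF registration step.

import com.twitter.scalding._
import com.twitter.scalding.source.TypedText

class FilteredWordCountJob(args: Args) extends Job(args) {
  // Plain Scala, defined and used like any other code.
  val stopWords = Set("the", "a", "an", "and", "of")
  def isStopWord(word: String): Boolean = stopWords(word)

  TypedPipe.from(TextLine(args("input")))
    .flatMap { line => line.toLowerCase.split("\\s+") }
    .filterNot(isStopWord) // ordinary Scala predicate, no UDF machinery
    .groupBy(identity)
    .size
    .write(TypedText.tsv[(String, Long)](args("output")))
}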
You can find more example code under examples/. If you're interested in comparing Scalding to other languages, see our Rosetta Code page, which has several MapReduce tasks in Scalding and other frameworks (e.g., Pig and Hadoop Streaming).
Documentation and Getting Started
- Getting Started page on the Scalding Wiki
- Scalding Scaladocs provide details beyond the API References. Prefer these, as they are always up to date.
- REPL in Wonderland, a hands-on tour of the Scalding REPL, requiring only git and java installed.
- Runnable tutorials in the source.
- The API Reference, including many example Scalding snippets:
- The Matrix Library provides a way of working with key-attribute-value scalding pipes:
- The Introduction to Matrix Library contains an overview and a "getting started" example
- The Matrix API Reference contains the Matrix Library API reference with examples
- Introduction to Scalding Execution contains general rules and examples of calling Scalding from inside another application; a minimal sketch follows this list.
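As a rough illustration of that embedding style, the sketch below (file name and mode invented for the example; check the Execution docs for the exact run helpers in your version) builds an Execution value and runs it in local mode from an ordinary main method:

import com.twitter.scalding._

object EmbeddedScalding {
  // Describing the computation creates a value; nothing runs yet.
  val lineCount: Execution[Long] =
    TypedPipe.from(TextLine("input.txt"))
      .toIterableExecution // materialize the pipe's contents
      .map(_.size.toLong)  // ordinary Scala from here on

  def main(args: Array[String]): Unit = {
    // Run in local mode; on a cluster you would supply an Hdfs mode instead.
    println(lineCount.waitFor(Config.default, Local(strictSources = true)))
  }
}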
Please feel free to use the beautiful Scalding logo artwork anywhere.
Contact
For user questions or Scalding development (internals, extending, release planning): https://groups.google.com/forum/#!forum/scalding-dev (a Google search also works as a first step)
In the remote possibility that there exist bugs in this code, please report them to: https://github.com/twitter/scalding/issues
Follow @Scalding on Twitter for updates.
Get Involved + Code of Conduct
Pull requests and bug reports are always welcome!
We use a lightweight form of project governance inspired by the one used by Apache projects. Please see Contributing and Committership for our code of conduct and our pull request review process. The TL;DR: send us a pull request, iterate on the feedback and discussion, and get a +1 from a Committer to have your PR accepted.
The current list of active committers (who can +1 a pull request) can be found here: Committers
A list of contributors to the project can be found here: Contributors
Building
There is a script (called sbt) in the root that loads the correct sbt version to build:
./sbt update (takes 2 minutes or more)
./sbt test
./sbt assembly (needed to make the jar used by the scald.rb script)
The test suite takes a while to run. When you're in sbt, here's a shortcut to run just one test:
> test-only com.twitter.scalding.FileSourceTest
Please refer to the FAQ page if you encounter problems when using sbt.
We use Github Actions to verify the build and Coveralls for code coverage results.
Scalding modules are available from Maven Central. The current groupId and version for all modules are, respectively, "com.twitter" and 0.17.2.
Current published artifacts are:
- scalding-core_2.11, scalding-core_2.12
- scalding-args_2.11, scalding-args_2.12
- scalding-date_2.11, scalding-date_2.12
- scalding-commons_2.11, scalding-commons_2.12
- scalding-avro_2.11, scalding-avro_2.12
- scalding-parquet_2.11, scalding-parquet_2.12
- scalding-repl_2.11, scalding-repl_2.12
The suffix denotes the scala version.
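With sbt, the %% operator appends the matching Scala suffix automatically, so a dependency on scalding-core (using the version listed above) might look like:

libraryDependencies += "com.twitter" %% "scalding-core" % "0.17.2"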
Adopters
- Ebay
- Etsy
- Sharethrough
- Snowplow Analytics
- Soundcloud
To see a full list of users or to add yourself, see the wiki.
Authors:
- Avi Bryant http://twitter.com/avibryant
- Oscar Boykin http://twitter.com/posco
- Argyris Zymnis http://twitter.com/argyris
Thanks for assistance and contributions:
- Sam Ritchie http://twitter.com/sritchie
- Aaron Siegel http://twitter.com/asiegel
- Ian O'Connell http://twitter.com/0x138
- Alex Levenson http://twitter.com/THISWILLWORK
- Jonathan Coveney http://twitter.com/jco
- Kevin Lin http://twitter.com/reconditesea
- Brad Greenlee http://twitter.com/bgreenlee
- Edwin Chen http://twitter.com/edchedch
- Arkajit Dey http://twitter.com/arkajit
- Krishnan Raman http://twitter.com/dxbydt_jasq
- Flavian Vasile http://twitter.com/flavianv
- Chris Wensel http://twitter.com/cwensel
- Ning Liang http://twitter.com/ningliang
- Dmitriy Ryaboy http://twitter.com/squarecog
- Dong Wang http://twitter.com/dongwang218
- Josh Attenberg http://twitter.com/jattenberg
- Juliet Hougland https://twitter.com/j_houg
- Eddie Xie https://twitter.com/eddiex
A full list of contributors can be found on GitHub.
License
Copyright 2016 Twitter, Inc.
Licensed under the Apache License, Version 2.0