twitter/scalding

A Scala API for Cascading

Top Related Projects

40,184

Apache Spark - A unified analytics engine for large-scale data processing

23,929

Apache Flink

7,828

Apache Beam is a unified programming model for Batch and Streaming data processing.

6,592

Apache Storm

14,703

Apache Hadoop

Quick Overview

Scalding is a Scala library that makes it easy to write MapReduce jobs in Hadoop. It's built on top of Cascading, a Java library for building data processing workflows. Scalding provides a high-level DSL for expressing complex data transformations, making it easier to write and maintain big data processing pipelines.

Pros

  • High-level abstraction: Simplifies writing complex MapReduce jobs
  • Scala-based: Leverages Scala's type system and functional programming features
  • Integration with Hadoop ecosystem: Works well with other Hadoop tools and frameworks
  • Rich set of built-in operations: Provides many common data processing functions out of the box

Cons

  • Learning curve: Requires knowledge of Scala and functional programming concepts
  • Limited to batch processing: Not suitable for real-time or streaming data processing
  • Dependency on Cascading: May introduce additional complexity and potential version conflicts
  • Performance overhead: Abstraction layer may introduce some performance overhead compared to raw MapReduce

Code Examples

  1. Reading and filtering data:
val input = TypedPipe.from(TextLine("input.txt"))
val filtered = input.filter(line => line.contains("important"))
// TypedTsv needs the element type of the pipe being written
filtered.write(TypedTsv[String]("output.txt"))
  2. Word count example:
val words = TypedPipe.from(TextLine("input.txt"))
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .sumByKey
  .write(TypedTsv[(String, Int)]("word_count.txt"))
  3. Join operation:
val users = TypedPipe.from(TypedTsv[(Int, String)]("users.tsv"))
// Key each order by userId so that both sides are (key, value) pairs
val orders = TypedPipe.from(TypedTsv[(Int, String, Double)]("orders.tsv"))
  .map { case (userId, orderId, amount) => (userId, (orderId, amount)) }

// Group both sides by key, join them, then flatten the nested result
val joined = users.group.join(orders.group).toTypedPipe.map {
  case (userId, (name, (orderId, amount))) => (userId, name, orderId, amount)
}
joined.write(TypedTsv[(Int, String, String, Double)]("user_orders.tsv"))
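
The join example pairs each user with every order that carries the same user ID and drops keys that appear on only one side (an inner join). The following is a minimal plain-Scala sketch of that behaviour on in-memory lists, with no Scalding dependency; the object name `JoinSketch` and the sample rows are made up for illustration, and a distributed join gives no guarantee about output order.

```scala
object JoinSketch {
  // Hypothetical sample data: (userId, name) and (userId, (orderId, amount)).
  val users: List[(Int, String)] = List((1, "ann"), (2, "bob"), (3, "cy"))
  val orders: List[(Int, (String, Double))] =
    List((1, ("o-1", 9.5)), (1, ("o-2", 3.0)), (2, ("o-3", 20.0)), (4, ("o-4", 1.0)))

  // Inner join: keep only the pairs whose keys match on both sides.
  val joined: List[(Int, String, String, Double)] =
    for {
      (userId, name) <- users
      (orderUserId, (orderId, amount)) <- orders
      if userId == orderUserId
    } yield (userId, name, orderId, amount)
}
```

Here user 3 has no orders and order o-4 has no matching user, so neither appears in `JoinSketch.joined`.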

Getting Started

To use Scalding in your project, add the following dependency to your build.sbt file:

libraryDependencies += "com.twitter" %% "scalding-core" % "0.17.4"

Define a class that extends Job:

import com.twitter.scalding._

class MyJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine("input.txt"))
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1))
    .sumByKey
    .write(TypedTsv[(String, Int)]("output.txt"))
}

Run the job using the scald.rb script provided by Scalding or by submitting it to a Hadoop cluster.

Competitor Comparisons

40,184

Apache Spark - A unified analytics engine for large-scale data processing

Pros of Spark

  • Faster performance for large-scale data processing
  • More comprehensive ecosystem with support for streaming, machine learning, and graph processing
  • Better support for interactive data exploration and ad-hoc queries

Cons of Spark

  • Steeper learning curve, especially for developers new to distributed computing
  • Higher memory requirements, which can be challenging in resource-constrained environments
  • More complex setup and configuration compared to Scalding

Code Comparison

Scalding (Scala):

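import com.twitter.scalding._

// TextLine is only a source; wrap it in a TypedPipe before transforming it
val input = TypedPipe.from(TextLine("input.txt"))
val wordCounts = input.flatMap(line => line.split("\\s+"))
  .groupBy(word => word)
  .size
wordCounts.write(TypedTsv[(String, Long)]("output.txt"))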

Spark (Scala):

val input = sc.textFile("input.txt")
val wordCounts = input.flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
wordCounts.saveAsTextFile("output.txt")

Both examples implement the same word count and read almost identically: each splits lines into words, groups by word, and aggregates per key. The main difference is the execution model: Scalding compiles the pipeline to Hadoop MapReduce jobs through Cascading, while Spark runs it on its own engine.
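
The two listings reach the same counts by slightly different routes: the Scalding one groups by word and takes each group's size, while the Spark one pairs every word with 1 and sums per key. The sketch below reproduces both shapes on an ordinary Scala list, purely to show that they agree; it uses neither library, and `WordCountShapes` and its sample input are hypothetical.

```scala
object WordCountShapes {
  val lines: List[String] = List("a b a", "b c")
  val words: List[String] = lines.flatMap(_.split("\\s+"))

  // Shape of the Scalding listing: group by the word, then take each group's size.
  val bySize: Map[String, Long] =
    words.groupBy(identity).map { case (word, group) => (word, group.size.toLong) }

  // Shape of the Spark listing: pair each word with 1, then sum the 1s per key.
  val bySum: Map[String, Long] =
    words.map(word => (word, 1L))
      .groupBy(_._1)
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
}
```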

23,929

Apache Flink

Pros of Flink

  • Supports both batch and stream processing, offering more versatility than Scalding
  • Provides lower latency and higher throughput for real-time data processing
  • Has a larger and more active community, resulting in frequent updates and improvements

Cons of Flink

  • Steeper learning curve compared to Scalding, especially for those familiar with Hadoop
  • Requires more complex setup and configuration, particularly for cluster deployment
  • May have higher resource requirements, especially for large-scale stream processing

Code Comparison

Scalding (Scala):

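import com.twitter.scalding._

// TextLine is only a source; wrap it in a TypedPipe before transforming it
val input = TypedPipe.from(TextLine("input.txt"))
val wordCounts = input.flatMap(line => line.split("\\s+"))
  .groupBy(word => word)
  .size
wordCounts.write(TypedTsv[(String, Long)]("output.txt"))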

Flink (Java):

DataStream<String> text = env.readTextFile("input.txt");
DataStream<Tuple2<String, Integer>> counts = text
    .flatMap(new Tokenizer())
    .keyBy(0)
    .sum(1);
counts.writeAsCsv("output.txt");

Both examples demonstrate a simple word count operation, showcasing the different syntax and approaches used in Scalding and Flink. Scalding's code is more concise due to Scala's expressiveness, while Flink's Java code is more verbose but offers more explicit control over the data processing pipeline.

7,828

Apache Beam is a unified programming model for Batch and Streaming data processing.

Pros of Beam

  • Supports multiple programming languages (Java, Python, Go)
  • Provides a unified model for batch and streaming data processing
  • Offers portability across various execution engines (Flink, Spark, Dataflow)

Cons of Beam

  • Steeper learning curve due to more complex API
  • Less focus on Hadoop-specific optimizations
  • Potentially slower development for simple MapReduce jobs

Code Comparison

Scalding (Scala):

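import com.twitter.scalding._

// TextLine is only a source; wrap it in a TypedPipe before transforming it
val input = TypedPipe.from(TextLine("input.txt"))
val wordCounts = input.flatMap(line => line.split("\\s+"))
  .groupBy(word => word)
  .size
wordCounts.write(TypedTsv[(String, Long)]("output.txt"))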

Beam (Java):

PCollection<String> input = p.apply(TextIO.read().from("input.txt"));
PCollection<KV<String, Long>> wordCounts = input
    .apply(FlatMapElements.into(TypeDescriptors.strings())
        .via((String line) -> Arrays.asList(line.split("\\s+"))))
    .apply(Count.perElement());
wordCounts.apply(TextIO.write().to("output.txt"));

Summary

Beam offers a more versatile and portable solution for data processing across different engines and languages, while Scalding provides a simpler, Scala-based approach optimized for Hadoop environments. The choice between the two depends on specific project requirements, team expertise, and the desired level of flexibility in execution environments.

6,592

Apache Storm

Pros of Storm

  • Real-time processing: Storm excels at processing data streams in real-time, making it suitable for applications requiring immediate insights
  • Fault-tolerant and scalable: Built to handle large-scale distributed processing with automatic fault recovery
  • Language-agnostic: Supports multiple programming languages, offering flexibility in development

Cons of Storm

  • Steeper learning curve: More complex to set up and configure compared to Scalding
  • Higher resource requirements: Typically requires more computational resources to run effectively
  • Less integrated with Hadoop ecosystem: While it can work with Hadoop, it's not as tightly integrated as Scalding

Code Comparison

Storm topology definition:

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentenceBolt(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCountBolt(), 12).fieldsGrouping("split", new Fields("word"));

Scalding job definition:

class WordCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .flatMap(_.toLowerCase.split("\\s+"))
    .groupBy(identity)
    .size
    .write(TypedTsv[(String, Long)](args("output")))
}

14,703

Apache Hadoop

Pros of Hadoop

  • More comprehensive ecosystem for distributed computing and big data processing
  • Wider industry adoption and community support
  • Better suited for handling very large-scale data processing tasks

Cons of Hadoop

  • Steeper learning curve and more complex setup
  • Can be overkill for smaller data processing tasks
  • Requires more hardware resources and infrastructure

Code Comparison

Scalding (Scala):

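import com.twitter.scalding._

// TextLine is only a source; wrap it in a TypedPipe before transforming it
val input = TypedPipe.from(TextLine("input.txt"))
val wordCounts = input.flatMap(line => line.split("\\s+"))
  .groupBy(word => word)
  .size
wordCounts.write(TypedTsv[(String, Long)]("output.txt"))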

Hadoop (Java):

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

The code comparison shows a word count example in both frameworks. The Hadoop listing contains only the mapper; a complete job also needs a reducer and driver code. Scalding's code is more concise and leverages Scala's functional programming features, while Hadoop's Java code is more verbose but provides fine-grained control over the MapReduce process.
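
To make the division of work concrete, here is a small in-memory sketch of the three phases behind a MapReduce word count, written in plain Scala rather than against the Hadoop API. `mapPhase` mirrors the tokenizing loop of TokenizerMapper; `shuffle` and `reducePhase` are assumptions about what the rest of such a job would do, and all names are hypothetical.

```scala
import java.util.StringTokenizer

object MapReduceSketch {
  // Map phase: emit (word, 1) for every whitespace-separated token.
  def mapPhase(line: String): List[(String, Int)] = {
    val itr = new StringTokenizer(line)
    val out = List.newBuilder[(String, Int)]
    while (itr.hasMoreTokens()) out += ((itr.nextToken(), 1))
    out.result()
  }

  // Shuffle: collect all values that were emitted for the same key.
  def shuffle(pairs: List[(String, Int)]): Map[String, List[Int]] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2)) }

  // Reduce phase (assumed, not part of the listing): sum the values per key.
  def reducePhase(word: String, counts: List[Int]): (String, Int) = (word, counts.sum)

  def wordCount(lines: List[String]): Map[String, Int] =
    shuffle(lines.flatMap(mapPhase)).map { case (w, cs) => reducePhase(w, cs) }
}
```

In a real cluster the framework performs the shuffle between machines; the Scalding version expresses the same grouping and aggregation with groupBy and size.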

README

Scalding

Scalding is a Scala library that makes it easy to specify Hadoop MapReduce jobs. Scalding is built on top of Cascading, a Java library that abstracts away low-level Hadoop details. Scalding is comparable to Pig, but offers tight integration with Scala, bringing advantages of Scala to your MapReduce jobs.

Word Count

Hadoop is a distributed system for counting words. Here is how it's done in Scalding.

package com.twitter.scalding.examples

import com.twitter.scalding._
import com.twitter.scalding.source.TypedText

class WordCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .flatMap { line => tokenize(line) }
    .groupBy { word => word } // use each word for a key
    .size // in each group, get the size
    .write(TypedText.tsv[(String, Long)](args("output")))

  // Split a piece of text into individual words.
  def tokenize(text: String): Array[String] = {
    // Lowercase each word and remove punctuation.
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
  }
}

Notice that the tokenize function, which is standard Scala, integrates naturally with the rest of the MapReduce job. This is a very powerful feature of Scalding. (Compare it to the use of UDFs in Pig.)
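
Because tokenize is ordinary Scala, it can be tried outside any job. The sketch below copies its body into a standalone object (the name `TokenizeSketch` and the sample strings are made up) and also shows one behaviour worth knowing: when the text starts with whitespace, split returns an empty first token, which a caller may want to filter out before counting.

```scala
object TokenizeSketch {
  // Same body as the tokenize method in WordCountJob.
  def tokenize(text: String): Array[String] =
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")

  // Punctuation is stripped and case is folded.
  val plain: List[String] = tokenize("Hello, World! Hello.").toList

  // Leading whitespace produces an empty string as the first token.
  val indented: List[String] = tokenize("  indented line").toList

  // Optional extra step (not in the original job): drop empty tokens.
  val cleaned: List[String] = indented.filter(_.nonEmpty)
}
```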

You can find more example code under examples/. If you're interested in comparing Scalding to other languages, see our Rosetta Code page, which has several MapReduce tasks in Scalding and other frameworks (e.g., Pig and Hadoop Streaming).

Documentation and Getting Started

Please feel free to use the beautiful Scalding logo artwork anywhere.

Contact

For user questions or scalding development (internals, extending, release planning): https://groups.google.com/forum/#!forum/scalding-dev (Google search also works as a first step)

In the remote possibility that there exist bugs in this code, please report them to: https://github.com/twitter/scalding/issues

Follow @Scalding on Twitter for updates.

Chat: Gitter

Get Involved + Code of Conduct

Pull requests and bug reports are always welcome!

We use a lightweight form of project governance inspired by the one used by Apache projects. Please see Contributing and Committership for our code of conduct and our pull request review process. The TL;DR: send us a pull request, iterate on the feedback and discussion, and get a +1 from a Committer to have your PR accepted.

The current list of active committers (who can +1 a pull request) can be found here: Committers

A list of contributors to the project can be found here: Contributors

Building

There is a script (called sbt) in the root that loads the correct sbt version to build:

  1. ./sbt update (takes 2 minutes or more)
  2. ./sbt test
  3. ./sbt assembly (needed to make the jar used by the scald.rb script)

The test suite takes a while to run. When you're in sbt, here's a shortcut to run just one test:

> test-only com.twitter.scalding.FileSourceTest

Please refer to the FAQ page if you encounter problems when using sbt.

We use GitHub Actions to verify the build and Coveralls for code coverage results.

Scalding modules are available from maven central.

The current groupid and version for all modules is, respectively, "com.twitter" and 0.17.2.

Current published artifacts are

  • scalding-core_2.11, scalding-core_2.12
  • scalding-args_2.11, scalding-args_2.12
  • scalding-date_2.11, scalding-date_2.12
  • scalding-commons_2.11, scalding-commons_2.12
  • scalding-avro_2.11, scalding-avro_2.12
  • scalding-parquet_2.11, scalding-parquet_2.12
  • scalding-repl_2.11, scalding-repl_2.12

The suffix denotes the scala version.
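
In an sbt build the suffix is usually not written by hand. A build.sbt fragment along these lines illustrates the relationship; the scalaVersion value is only an assumed example, and the version number is the one stated above.

```scala
// build.sbt (sketch); the scalaVersion below is an assumed example value.
scalaVersion := "2.12.18"

// %% appends the Scala binary version, so this resolves scalding-core_2.12.
libraryDependencies += "com.twitter" %% "scalding-core" % "0.17.2"

// Equivalent explicit form, using a single % and the suffixed artifact name:
// libraryDependencies += "com.twitter" % "scalding-core_2.12" % "0.17.2"
```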

Adopters

  • Ebay
  • Etsy
  • Sharethrough
  • Snowplow Analytics
  • Soundcloud
  • Twitter

To see a full list of users or to add yourself, see the wiki.

Authors:

Thanks for assistance and contributions:

A full list of contributors can be found on GitHub.

License

Copyright 2016 Twitter, Inc.

Licensed under the Apache License, Version 2.0