
Apache Pinot - A realtime distributed OLAP datastore


Top Related Projects

  • Ambry - Distributed object store
  • Apache Druid: a high performance real-time analytics database.
  • Apache HBase
  • Apache Cassandra®
  • Apache Hadoop
  • Apache Spark - A unified analytics engine for large-scale data processing

Quick Overview

Apache Pinot is a real-time distributed OLAP datastore designed to deliver scalable real-time analytics with low latency. It is particularly well-suited for user-facing analytical applications that require fast analytics on freshly ingested data.

Pros

  • High performance and low latency for real-time analytics
  • Scalable and distributed architecture
  • Supports both batch and streaming data ingestion
  • Flexible querying with standard SQL support (plus PQL, the legacy Pinot query language)

Cons

  • Steep learning curve for beginners
  • Limited support for complex joins compared to traditional databases
  • Requires careful schema design for optimal performance
  • Resource-intensive for large-scale deployments

Code Examples

  1. Creating a Pinot table (schema plus table config):
// Define the table schema.
Schema schema = new Schema.SchemaBuilder()
    .setSchemaName("myTable")
    .addSingleValueDimension("dimension1", FieldSpec.DataType.STRING)
    .addMetric("metric1", FieldSpec.DataType.LONG)
    .build();

// Build an OFFLINE table config that references the schema.
TableConfig tableConfig = new TableConfigBuilder(TableType.OFFLINE)
    .setTableName("myTable")
    .setSchemaName("myTable")
    .build();

// _pinotAdmin is a controller client helper used throughout these snippets.
_pinotAdmin.addSchema(schema, true);
_pinotAdmin.addTable(tableConfig);
  2. Querying data with SQL over JDBC:
// The Pinot JDBC driver connects to the controller (default port 9000).
Connection connection = DriverManager.getConnection("jdbc:pinot://localhost:9000");
Statement statement = connection.createStatement();
ResultSet resultSet = statement.executeQuery("SELECT COUNT(*) FROM myTable WHERE dimension1 = 'value'");
  3. Ingesting data from Kafka (REALTIME table):
// Stream settings are passed as a map of streamConfigs properties
// (property keys per the Pinot docs; exact builder methods vary by version).
Map<String, String> streamConfigs = new HashMap<>();
streamConfigs.put("streamType", "kafka");
streamConfigs.put("stream.kafka.topic.name", "myTopic");
streamConfigs.put("stream.kafka.broker.list", "localhost:9092");
streamConfigs.put("stream.kafka.consumer.type", "lowlevel");
streamConfigs.put("stream.kafka.consumer.factory.class.name",
    "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory");

TableConfig realtimeTableConfig = new TableConfigBuilder(TableType.REALTIME)
    .setTableName("myTable")
    .setStreamConfigs(streamConfigs)
    .build();
_pinotAdmin.addTable(realtimeTableConfig);
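
In addition to JDBC, Pinot ships a lightweight Java client (the pinot-java-client module). The sketch below is illustrative only and assumes a broker reachable at localhost:8000 (the quick-start default); check class and method names against the client version you depend on.

import org.apache.pinot.client.Connection;
import org.apache.pinot.client.ConnectionFactory;
import org.apache.pinot.client.ResultSet;
import org.apache.pinot.client.ResultSetGroup;

public class PinotClientExample {
    public static void main(String[] args) {
        // Connect directly to a broker (host:port); adjust for your deployment.
        Connection connection = ConnectionFactory.fromHostList("localhost:8000");
        ResultSetGroup results = connection.execute("SELECT COUNT(*) FROM myTable");
        // The first result set holds the aggregation result.
        ResultSet resultSet = results.getResultSet(0);
        System.out.println("count = " + resultSet.getLong(0, 0));
        connection.close();
    }
}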

Getting Started

  1. Download and install Apache Pinot:
wget https://downloads.apache.org/pinot/apache-pinot-0.10.0/apache-pinot-0.10.0-bin.tar.gz
tar -xvf apache-pinot-0.10.0-bin.tar.gz
cd apache-pinot-0.10.0-bin
  2. Start Pinot components:
bin/quick-start-offline.sh
  3. Create a table and ingest data:
bin/pinot-admin.sh AddTable -schemaFile examples/batch/airlineStats/airlineStats_schema.json -tableConfigFile examples/batch/airlineStats/airlineStats_offline_table_config.json -exec
bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile examples/batch/airlineStats/airlineStats_job_spec.yml
  4. Query data using the Pinot Query Console at http://localhost:9000 (or programmatically, as in the JDBC sketch below)
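
For programmatic access, here is a minimal JDBC sketch against the quick-start controller. It assumes the airlineStats table created above, the default controller port 9000, and the pinot-jdbc-client library on the classpath; treat it as a starting point rather than a definitive recipe.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class QuickStartQuery {
    public static void main(String[] args) throws Exception {
        // Register the Pinot JDBC driver (class name per the pinot-jdbc-client module).
        Class.forName("org.apache.pinot.client.PinotDriver");
        try (Connection connection = DriverManager.getConnection("jdbc:pinot://localhost:9000");
             Statement statement = connection.createStatement();
             ResultSet resultSet = statement.executeQuery("SELECT count(*) FROM airlineStats")) {
            resultSet.next();
            System.out.println("rows ingested: " + resultSet.getLong(1));
        }
    }
}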

Competitor Comparisons

Ambry - Distributed object store

Pros of Ambry

  • Specialized for blob storage, offering efficient handling of large binary objects
  • Simpler architecture, potentially easier to set up and maintain for specific use cases
  • Designed with a focus on durability and availability of stored data

Cons of Ambry

  • More limited in scope compared to Pinot's broader analytical capabilities
  • Less active community and development than Pinot, which benefits from its Apache project governance
  • Fewer built-in features for real-time data ingestion and querying

Code Comparison

Ambry (Java):

BlobId blobId = new BlobId(version, BlobIdType.NATIVE, datacenterId, accountId, containerId, partitionId, isEncrypted, blobDataType);
BlobProperties blobProperties = new BlobProperties(blobSize, serviceId);
MessageFormatInputStream messageFormatInputStream = new PutMessageFormatInputStream(blobId, blobProperties, userMetadata, blobStream);

Pinot (Java):

Schema schema = new Schema.SchemaBuilder().setSchemaName("mySchema")
    .addSingleValueDimension("dimension", FieldSpec.DataType.STRING)
    .addMetric("metric", FieldSpec.DataType.LONG)
    .build();
TableConfig tableConfig = new TableConfigBuilder(TableType.OFFLINE).setTableName("myTable")
    .setSchemaName("mySchema").build();

Both projects use Java, but their APIs reflect their different purposes. Ambry focuses on blob storage operations, while Pinot's code emphasizes schema and table configurations for analytical processing.

Apache Druid: a high performance real-time analytics database.

Pros of Druid

  • More mature project with a longer history and larger community
  • Better support for complex aggregations and multi-dimensional analytics
  • More flexible query language (SQL-like DruidSQL)

Cons of Druid

  • Higher resource consumption and complexity in setup
  • Steeper learning curve for configuration and optimization
  • Less efficient for real-time ingestion of high-volume data streams

Code Comparison

Druid query example:

SELECT COUNT(*) AS count,
       SUM(price) AS total_price
FROM transactions
WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY product_id
HAVING count > 100
ORDER BY total_price DESC
LIMIT 10

Pinot query example:

SELECT COUNT(*) AS count,
       SUM(price) AS total_price
FROM transactions
WHERE "timestamp" >= ago('P1D')
GROUP BY product_id
HAVING COUNT(*) > 100
ORDER BY total_price DESC
LIMIT 10

Both systems support SQL-like queries, but Druid's query language offers more advanced features and flexibility. Pinot's syntax is generally simpler and more focused on real-time analytics use cases.

Apache HBase

Pros of HBase

  • Mature and widely adopted in big data ecosystems
  • Strong support for random read/write operations
  • Excellent scalability for handling large datasets

Cons of HBase

  • Higher latency for real-time queries
  • More complex setup and maintenance
  • Less optimized for analytical workloads

Code Comparison

HBase example (Java):

Put put = new Put(Bytes.toBytes("row1"));
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("qual1"), Bytes.toBytes("value1"));
table.put(put);

Pinot example (Java):

GenericRow row = new GenericRow();
row.putField("dimension1", "value1");
row.putField("metric1", 100);
segmentWriter.index(row);

Key Differences

  • HBase is a distributed, column-oriented database, while Pinot is a real-time distributed OLAP datastore
  • Pinot is optimized for low-latency analytical queries, whereas HBase excels in random access scenarios
  • HBase provides strong consistency, while Pinot focuses on eventual consistency for better query performance

Use Cases

  • HBase: Large-scale data storage, random read/write operations, and applications requiring strong consistency
  • Pinot: Real-time analytics, user-facing dashboards, and scenarios demanding low-latency queries on large datasets

Both projects are part of the Apache Software Foundation and have active communities, but they serve different purposes in the big data ecosystem.

Apache Cassandra®

Pros of Cassandra

  • Highly scalable and distributed architecture, suitable for large-scale deployments
  • Strong support for write-heavy workloads and high availability
  • Mature ecosystem with extensive tooling and community support

Cons of Cassandra

  • Complex query language (CQL) compared to SQL-like queries in Pinot
  • Less efficient for real-time analytics and low-latency queries
  • Requires more manual tuning and optimization for optimal performance

Code Comparison

Cassandra (CQL):

CREATE TABLE users (
  id UUID PRIMARY KEY,
  name TEXT,
  email TEXT
);

Pinot (schema JSON):

{
  "schemaName": "users",
  "dimensionFieldSpecs": [
    { "name": "id", "dataType": "STRING" },
    { "name": "name", "dataType": "STRING" },
    { "name": "email", "dataType": "STRING" }
  ]
}

The two systems take different approaches to table creation. Cassandra uses CQL DDL statements, while Pinot defines tables through a JSON schema (plus a separate table config) that is submitted to the Pinot controller.

Cassandra is well-suited for large-scale, distributed databases with high write throughput, while Pinot excels in real-time analytics and low-latency queries. The choice between the two depends on specific use cases and requirements.

Apache Hadoop

Pros of Hadoop

  • Mature ecosystem with extensive tooling and community support
  • Highly scalable for processing massive datasets across distributed clusters
  • Flexible ecosystem that supports batch processing and, through YARN-based engines, streaming and interactive workloads

Cons of Hadoop

  • Complex setup and configuration process
  • Higher latency for real-time analytics compared to Pinot
  • Steeper learning curve for developers and administrators

Code Comparison

Hadoop MapReduce job:

public class WordCount extends Configured implements Tool {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        // ... (mapper implementation)
    }
    // ... (reducer and main method)
}

Pinot query:

SELECT COUNT(*) AS count, word
FROM myTable
GROUP BY word
ORDER BY count DESC
LIMIT 10

Summary

Hadoop excels in processing large-scale batch data and offers a comprehensive ecosystem. Pinot, on the other hand, is designed for real-time analytics with lower latency. Hadoop's MapReduce paradigm requires more complex Java code, while Pinot uses SQL-like queries for data analysis. Choose Hadoop for diverse big data processing needs, and Pinot for fast, real-time analytics on specific use cases.

Apache Spark - A unified analytics engine for large-scale data processing

Pros of Spark

  • More mature and widely adopted ecosystem with extensive libraries and integrations
  • Supports a broader range of data processing tasks, including batch processing, streaming, and machine learning
  • Offers multiple language APIs (Scala, Java, Python, R) for flexibility in development

Cons of Spark

  • Higher resource consumption and longer startup times, especially for small tasks
  • Steeper learning curve due to its extensive feature set and distributed computing concepts
  • Can be overkill for simpler data processing tasks or smaller datasets

Code Comparison

Spark (PySpark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Example").getOrCreate()
df = spark.read.csv("data.csv", header=True)
result = df.groupBy("column").count()
result.show()

Pinot:

Connection connection = DriverManager.getConnection("jdbc:pinot://localhost:9000");
Statement statement = connection.createStatement();
ResultSet resultSet = statement.executeQuery("SELECT column, COUNT(*) FROM myTable GROUP BY column");
while (resultSet.next()) {
    System.out.println(resultSet.getString(1) + ": " + resultSet.getInt(2));
}

Note: Pinot is primarily used as a real-time distributed OLAP datastore, while Spark is a more general-purpose data processing engine. The code comparison showcases basic data querying in both systems, but their use cases and typical implementations can differ significantly.


README

Apache Pinot


What is Apache Pinot?

Apache Pinot is a real-time distributed OLAP datastore, built to deliver scalable real-time analytics with low latency. It can ingest from batch data sources (such as Hadoop HDFS, Amazon S3, Azure ADLS, Google Cloud Storage) as well as stream data sources (such as Apache Kafka).

Pinot was built by engineers at LinkedIn and Uber and is designed to scale up and out with no upper bound. Performance remains predictable for a given cluster size and an expected queries-per-second (QPS) load.

For getting started guides, deployment recipes, tutorials, and more, please visit our project documentation at https://docs.pinot.apache.org.


Features

Pinot was originally built at LinkedIn to power rich interactive real-time analytics applications such as Who Viewed Profile, Company Analytics, Talent Insights, and many more. UberEats Restaurant Manager is another example of a customer-facing analytics application. At LinkedIn, Pinot powers 50+ user-facing products, ingesting millions of events per second and serving 100k+ queries per second at millisecond latency.

  • Column-oriented: a column-oriented database with various compression schemes such as Run Length, Fixed Bit Length.

  • Pluggable indexing: pluggable indexing technologies such as Sorted Index, Bitmap Index, and Inverted Index (see the configuration sketch after this list).

  • Query optimization: ability to optimize query/execution plan based on query and segment metadata.

  • Stream and batch ingest: near real time ingestion from streams and batch ingestion from Hadoop.

  • Query: SQL based query execution engine.

  • Upsert during real-time ingestion: update the data at scale with consistency.

  • Multi-valued fields: support for multi-valued fields, allowing you to query fields as comma separated values.

  • Cloud-native on Kubernetes: Helm chart provides a horizontally scalable and fault-tolerant clustered deployment that is easy to manage using Kubernetes.
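
As an illustration of the pluggable indexing and multi-valued field support above, here is a minimal Java sketch of how those options can be declared with the same SchemaBuilder/TableConfigBuilder APIs used earlier on this page. Import paths and method names reflect recent Pinot releases and may differ slightly between versions.

import java.util.Arrays;
import org.apache.pinot.spi.config.table.TableConfig;
import org.apache.pinot.spi.config.table.TableType;
import org.apache.pinot.spi.data.FieldSpec;
import org.apache.pinot.spi.data.Schema;
import org.apache.pinot.spi.utils.builder.TableConfigBuilder;

public class IndexingSketch {
    public static void main(String[] args) {
        // Schema with a multi-valued dimension: "tags" can hold several values per row.
        Schema schema = new Schema.SchemaBuilder()
            .setSchemaName("events")
            .addSingleValueDimension("country", FieldSpec.DataType.STRING)
            .addMultiValueDimension("tags", FieldSpec.DataType.STRING)
            .addMetric("clicks", FieldSpec.DataType.LONG)
            .build();

        // Table config with an inverted index on "tags" and "country" as the sorted column.
        TableConfig tableConfig = new TableConfigBuilder(TableType.OFFLINE)
            .setTableName("events")
            .setInvertedIndexColumns(Arrays.asList("tags"))
            .setSortedColumn("country")
            .build();

        System.out.println(schema.getSchemaName() + " / " + tableConfig.getTableName());
    }
}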

Apache Pinot query console

When should I use Pinot?

Pinot is designed to execute real-time OLAP queries with low latency on massive amounts of data and events. In addition to real-time stream ingestion, Pinot also supports batch use cases with the same low latency guarantees. It is well suited to contexts where fast analytics, such as aggregations, are needed on immutable data, possibly with real-time data ingestion. Pinot works very well for querying time series data with lots of dimensions and metrics.

Example query (legacy PQL syntax; standard SQL replaces TOP with ORDER BY ... LIMIT):

SELECT sum(clicks), sum(impressions) FROM AdAnalyticsTable
  WHERE
       ((daysSinceEpoch >= 17849 AND daysSinceEpoch <= 17856)) AND
       accountId IN (123456789)
  GROUP BY
       daysSinceEpoch TOP 100

Building Pinot

More detailed instructions can be found in the Quick Demo section of the documentation.

# Clone a repo
$ git clone https://github.com/apache/pinot.git
$ cd pinot

# Build Pinot
$ mvn clean install -DskipTests -Pbin-dist

# Run the Quick Demo
$ cd build/
$ bin/quick-start-batch.sh

For UI development setup, refer to this doc.

Deploying Pinot to Kubernetes

Please refer to Running Pinot on Kubernetes in our project documentation. Pinot also provides Kubernetes integrations with the interactive query engines Trino and Presto, and with the data visualization tool Apache Superset.

Join the Community

Ask questions and chat with the community on Slack: https://communityinviter.com/apps/apache-pinot/apache-pinot

Documentation

Check out Pinot documentation for a complete description of Pinot's features.

License

Apache Pinot is licensed under the Apache License, Version 2.0.