Apache Parquet Format

Top Related Projects

  • Apache Arrow: a multi-language toolbox for accelerated data interchange and in-memory processing
  • Apache Avro: a data serialization system
  • Apache Thrift: a cross-language RPC and serialization framework
  • Snappy: a fast compressor/decompressor
  • Zstandard (zstd): a fast real-time compression algorithm

Quick Overview

Apache Parquet is a columnar data storage format designed to be efficient, scalable, and interoperable across a wide range of big data processing frameworks and programming languages. It is a key component of the Apache Hadoop ecosystem and is widely used in data warehousing, analytics, and machine learning applications.

Pros

  • Efficient Data Storage: Parquet uses a columnar storage format, which allows for efficient compression and encoding, resulting in smaller file sizes and faster data processing.
  • Cross-Platform Interoperability: Parquet is supported by a wide range of big data processing frameworks, including Apache Spark, Apache Hive, Apache Impala, and others, making it easy to work with data across different tools and platforms.
  • Scalability: Parquet is designed to handle large datasets and can be efficiently partitioned and distributed across multiple nodes in a cluster.
  • Rich Metadata: Parquet files include detailed metadata about the data they contain, which can be used to optimize query performance and enable advanced data management features.

Cons

  • Complexity: Parquet is a relatively complex format, with a number of configuration options and advanced features that can be challenging for beginners to understand and use effectively.
  • Performance Overhead: While Parquet is generally more efficient than row-based storage formats, there is still some performance overhead associated with the columnar storage and compression algorithms.
  • Uneven Tooling for Nested Data: although the format supports complex nested structures natively (via the Dremel model), support for deeply nested data varies across engines and libraries, which can be a limitation for some applications.
  • Ecosystem Coupling: Parquet grew up in the Apache Hadoop ecosystem; the format itself is open, but pipelines built around Hadoop-era tooling can take effort to migrate to other platforms or formats.

Code Examples

Parquet is a file format specification rather than a code library, so there is no single Parquet API; working with the format means using a library that implements it.

Getting Started

To get started with Parquet, you'll need to have a big data processing framework that supports the format, such as Apache Spark or Apache Hive. Here's a brief example of how to read and write Parquet files using Apache Spark (PySpark, assuming an active SparkSession named spark):

# Reading a Parquet file
df = spark.read.parquet("path/to/parquet/file")

# Selecting columns and filtering data
filtered_df = df.select("column1", "column2").where("column3 > 10")

# Writing a Parquet file
filtered_df.write.parquet("path/to/output/parquet/file")

In this example, we first read a Parquet file into a Spark DataFrame, then select a subset of columns and filter the data based on a condition. Finally, we write the filtered data to a new Parquet file.

For more detailed instructions on working with Parquet in different big data processing frameworks, please refer to the Apache Parquet documentation.

Competitor Comparisons

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing

Pros of Arrow

  • Broader scope: Arrow is a more comprehensive data processing framework, while Parquet-format focuses specifically on the Parquet file format
  • Language support: Arrow provides implementations in multiple programming languages, offering greater flexibility
  • In-memory processing: Arrow's columnar in-memory format enables faster data processing and analytics

Cons of Arrow

  • Complexity: Arrow's broader scope can make it more complex to implement and use compared to Parquet-format
  • Resource usage: Arrow's in-memory processing may require more RAM, especially for large datasets

Code Comparison

Arrow (Python):

import pyarrow as pa

# Create an Arrow table
data = [
    pa.array([1, 2, 3, 4]),
    pa.array(['a', 'b', 'c', 'd'])
]
table = pa.Table.from_arrays(data, names=['numbers', 'letters'])

Parquet-format (using pyarrow):

import pyarrow.parquet as pq

# Write a Parquet file
pq.write_table(table, 'example.parquet')

# Read a Parquet file
read_table = pq.read_table('example.parquet')

While both examples use pyarrow, Arrow focuses on in-memory data representation, while Parquet-format deals with file storage and retrieval.

Apache Avro is a data serialization system.

Pros of Avro

  • Supports schema evolution, allowing for easier updates to data structures
  • Offers better support for complex data types and nested structures
  • Provides efficient serialization and deserialization for row-oriented, record-at-a-time workloads

Cons of Avro

  • Larger file sizes compared to Parquet for some use cases
  • Less efficient for columnar data access and analytics queries

Code Comparison

Avro schema example:

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}

Parquet schema example:

message User {
  required binary name (UTF8);
  required int32 age;
}

Both formats support schema definitions, but Avro's JSON-based schema is more human-readable and flexible. Parquet's schema is more compact and closely tied to its columnar storage format.

Avro excels in scenarios requiring frequent schema changes and complex data structures, while Parquet is optimized for analytical workloads and columnar data access. The choice between the two depends on specific use cases and requirements.

Apache Thrift

Pros of Thrift

  • Supports multiple programming languages, enabling cross-language service development
  • Provides a complete RPC framework with built-in serialization
  • Offers both binary and compact protocols for efficient data transfer

Cons of Thrift

  • More complex setup and usage compared to Parquet's focused columnar storage
  • Less optimized for analytical workloads and big data processing
  • Requires more manual schema management and versioning

Code Comparison

Thrift IDL example:

struct Person {
  1: string name
  2: i32 age
  3: bool is_employed
}

Parquet schema example:

message Person {
  required binary name (UTF8);
  required int32 age;
  required boolean is_employed;
}

Summary

Thrift is a versatile RPC framework supporting multiple languages and protocols, while Parquet-format focuses on efficient columnar storage for big data. Thrift offers more flexibility for service development but requires more setup. Parquet excels in analytical workloads and provides simpler schema definition for columnar data storage.

A fast compressor/decompressor

Pros of Snappy

  • Faster compression and decompression speeds compared to Parquet
  • Simpler implementation and easier integration into existing projects
  • Smaller memory footprint during compression/decompression operations

Cons of Snappy

  • Lower compression ratio compared to Parquet, resulting in larger file sizes
  • Lacks built-in support for columnar data storage and querying
  • Not optimized for analytical workloads or big data processing

Code Comparison

Snappy (C++):

snappy::Compress(input_data, input_length, &compressed);
snappy::Uncompress(compressed_data, compressed_length, &uncompressed);

Parquet (Java):

ParquetWriter<GenericRecord> writer = AvroParquetWriter.<GenericRecord>builder(path)
    .withSchema(schema)
    .withCompressionCodec(CompressionCodecName.SNAPPY)
    .build();
writer.write(record);

While Snappy focuses on fast compression/decompression of general data, Parquet is designed for efficient storage and querying of columnar data in big data ecosystems. Snappy can be used as a compression algorithm within Parquet, but Parquet offers additional features for data organization and analysis. The code examples show the simplicity of Snappy's API compared to Parquet's more complex setup for writing data.

Zstandard - Fast real-time compression algorithm

Pros of zstd

  • Higher compression ratios and faster compression/decompression speeds
  • Versatile, can be used for various data types beyond columnar storage
  • Active development with frequent updates and improvements

Cons of zstd

  • Not specifically designed for columnar data storage like Parquet
  • May require additional implementation work to integrate with big data ecosystems
  • Lacks built-in support for schema evolution and nested data structures

Code Comparison

parquet-format:

struct SchemaElement {
  1: optional Type type;
  2: optional i32 type_length;
  3: optional FieldRepetitionType repetition_type;
  4: required string name;
  5: optional i32 num_children;
}

zstd:

size_t ZSTD_compress(void* dst, size_t dstCapacity,
                     const void* src, size_t srcSize,
                     int compressionLevel);

While parquet-format defines data structures for columnar storage, zstd focuses on general-purpose compression algorithms. Parquet is tailored for big data processing, offering schema definitions and optimizations for analytical queries. zstd, on the other hand, provides a flexible compression library that can be applied to various data types and scenarios, potentially offering better compression ratios and performance for raw data compression tasks.

README

Parquet

This repository contains the specification for Apache Parquet and Apache Thrift definitions to read and write Parquet metadata.

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides high performance compression and encoding schemes to handle complex data in bulk and is supported in many programming languages and analytics tools.

Motivation

We created Parquet to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem.

Parquet is built from the ground up with complex nested data structures in mind, and uses the record shredding and assembly algorithm described in the Dremel paper. We believe this approach is superior to simple flattening of nested name spaces.

Parquet is built to support very efficient compression and encoding schemes. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented.

Parquet is built to be used by anyone. The Hadoop ecosystem is rich with data processing frameworks, and we are not interested in playing favorites. We believe that an efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive and difficult to set up dependencies.

Modules

The parquet-format project contains format specifications and Thrift definitions of metadata required to properly read Parquet files.

The parquet-java project contains multiple sub-modules, which implement the core components of reading and writing a nested, column-oriented data stream, map this core onto the parquet format, and provide Hadoop Input/Output Formats, Pig loaders, and other java-based utilities for interacting with Parquet.

The parquet-compatibility project contains compatibility tests that can be used to verify that implementations in different languages can read and write each other's files.

Building

Java resources can be built using mvn package. The current stable version should always be available from Maven Central.

C++ thrift resources can be generated via make.

Thrift can also be code-generated into any other Thrift-supported language.

Glossary

  • Block (HDFS block): This means a block in HDFS and the meaning is unchanged for describing this file format. The file format is designed to work well on top of HDFS.

  • File: An HDFS file that must include the metadata for the file. It does not need to actually contain the data.

  • Row group: A logical horizontal partitioning of the data into rows. There is no physical structure that is guaranteed for a row group. A row group consists of a column chunk for each column in the dataset.

  • Column chunk: A chunk of the data for a particular column. They live in a particular row group and are guaranteed to be contiguous in the file.

  • Page: Column chunks are divided up into pages. A page is conceptually an indivisible unit (in terms of compression and encoding). There can be multiple page types which are interleaved in a column chunk.

Hierarchically, a file consists of one or more row groups. A row group contains exactly one column chunk per column. Column chunks contain one or more pages.

Unit of parallelization

  • MapReduce - File/Row Group
  • IO - Column chunk
  • Encoding/Compression - Page

File format

This file and the Thrift definition should be read together to understand the format.

4-byte magic number "PAR1"
<Column 1 Chunk 1>
<Column 2 Chunk 1>
...
<Column N Chunk 1>
<Column 1 Chunk 2>
<Column 2 Chunk 2>
...
<Column N Chunk 2>
...
<Column 1 Chunk M>
<Column 2 Chunk M>
...
<Column N Chunk M>
File Metadata
4-byte length in bytes of file metadata (little endian)
4-byte magic number "PAR1"

In the above example, there are N columns in this table, split into M row groups. The file metadata contains the start locations of all the column chunks. More details on what is contained in the metadata can be found in the Thrift definition.

File Metadata is written after the data to allow for single pass writing.

Readers are expected to first read the file metadata to find all the column chunks they are interested in. The column chunks should then be read sequentially.
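
The envelope described above can be sketched with plain bytes; the payload below is a placeholder, not real Thrift-encoded metadata:

```python
import struct

# Placeholder bytes standing in for column data and Thrift file metadata.
body = b"<column chunk bytes>"
fake_metadata = b"<thrift file metadata>"

# "PAR1" magic at both ends; the footer length (little-endian u32) precedes the
# trailing magic, so a reader can locate the metadata written after the data.
f = b"PAR1" + body + fake_metadata + struct.pack("<I", len(fake_metadata)) + b"PAR1"

assert f[-4:] == b"PAR1"                       # check trailing magic
footer_len = struct.unpack("<I", f[-8:-4])[0]  # read metadata length
metadata = f[-8 - footer_len:-8]               # slice out the file metadata
assert metadata == fake_metadata
```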

File Layout

Metadata

There are two types of metadata: file metadata and page header metadata. All thrift structures are serialized using the TCompactProtocol.

Metadata diagram

Types

The types supported by the file format are intended to be as minimal as possible, with a focus on how the types affect on-disk storage. For example, 16-bit ints are not explicitly supported in the storage format since they are covered by 32-bit ints with an efficient encoding. This reduces the complexity of implementing readers and writers for the format. The types are:

  • BOOLEAN: 1 bit boolean
  • INT32: 32 bit signed ints
  • INT64: 64 bit signed ints
  • INT96: 96 bit signed ints
  • FLOAT: IEEE 32-bit floating point values
  • DOUBLE: IEEE 64-bit floating point values
  • BYTE_ARRAY: arbitrarily long byte arrays
  • FIXED_LEN_BYTE_ARRAY: fixed length byte arrays

Logical Types

Logical types are used to extend the types that parquet can be used to store, by specifying how the primitive types should be interpreted. This keeps the set of primitive types to a minimum and reuses parquet's efficient encodings. For example, strings are stored with the primitive type BYTE_ARRAY with a STRING annotation. These annotations define how to further decode and interpret the data. Annotations are stored as LogicalType fields in the file metadata and are documented in LogicalTypes.md.

Sort Order

Parquet stores min/max statistics at several levels (such as Column Chunk, Column Index and Data Page). Comparisons between values of a type obey the following rules:

  1. Each logical type has a specified comparison order. If a column is annotated with an unknown logical type, statistics may not be used for pruning data. The sort order for logical types is documented in the LogicalTypes.md page.

  2. For primitive types, the following rules apply:

    • BOOLEAN - false, true

    • INT32, INT64 - Signed comparison.

    • FLOAT, DOUBLE - Signed comparison with special handling of NaNs and signed zeros. The details are documented in the Thrift definition in the ColumnOrder union. They are summarized here but the Thrift definition is considered authoritative:

      • NaNs should not be written to min or max statistics fields.
      • If the computed max value is zero (whether negative or positive), +0.0 should be written into the max statistics field.
      • If the computed min value is zero (whether negative or positive), -0.0 should be written into the min statistics field.

      For backwards compatibility when reading files:

      • If the min is a NaN, it should be ignored.
      • If the max is a NaN, it should be ignored.
      • If the min is +0, the row group may contain -0 values as well.
      • If the max is -0, the row group may contain +0 values as well.
      • When looking for NaN values, min and max should be ignored.
    • BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY - Lexicographic unsigned byte-wise comparison.

Nested Encoding

To encode nested columns, Parquet uses the Dremel encoding with definition and repetition levels. Definition levels specify how many optional fields in the path for the column are defined. Repetition levels specify at which repeated field in the path the value is repeated. The max definition and repetition levels can be computed from the schema (i.e. how much nesting there is). This defines the maximum number of bits required to store the levels (levels are defined for all values in the column).
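
As a sketch of that computation (the helper below is hypothetical, not part of any Parquet library): the max definition level counts the optional and repeated fields on a column's path, and the max repetition level counts the repeated fields.

```python
# Hypothetical helper: derive max definition/repetition levels for one column
# from the repetition ('required'/'optional'/'repeated') of each field on its path.
def max_levels(path_repetitions):
    max_def = sum(1 for r in path_repetitions if r in ("optional", "repeated"))
    max_rep = sum(1 for r in path_repetitions if r == "repeated")
    return max_def, max_rep

# e.g. message doc { optional group links { repeated int64 backward; } }
assert max_levels(["optional", "repeated"]) == (2, 1)
assert max_levels(["required"]) == (0, 0)  # a flat required column needs no levels
```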

Two encodings for the levels are supported: BIT_PACKED and RLE. Only RLE is now used, as it supersedes BIT_PACKED.

Nulls

Nullity is encoded in the definition levels (which are run-length encoded). NULL values are not encoded in the data. For example, in a non-nested schema, a column with 1000 NULLs would be encoded with run-length encoding (0, 1000 times) for the definition levels and nothing else.
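
A toy sketch of that encoding (plain Python, not a real Parquet writer):

```python
from itertools import groupby

# Run-length encode a sequence of definition levels as (level, count) pairs.
def run_length(levels):
    return [(level, len(list(run))) for level, run in groupby(levels)]

# A non-nested optional column holding 1000 NULLs: definition level 0 means NULL,
# so the whole column collapses to a single run and no values are written.
levels = [0] * 1000
assert run_length(levels) == [(0, 1000)]
```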

Data Pages

For data pages, the 3 pieces of information are encoded back to back, after the page header. No padding is allowed in the data page. In order we have:

  1. repetition levels data
  2. definition levels data
  3. encoded values

The value of uncompressed_page_size specified in the header is for all the 3 pieces combined.

The encoded values for the data page are always required. The definition and repetition levels are optional, based on the schema definition. If the column is not nested (i.e. the path to the column has length 1), we do not encode the repetition levels (they would always have the value 0). For data that is required, the definition levels are skipped (if encoded, they would always have the value of the max definition level).

For example, in the case where the column is non-nested and required, the data in the page is only the encoded values.

The supported encodings are described in Encodings.md.

The supported compression codecs are described in Compression.md.

Column chunks

Column chunks are composed of pages written back to back. The pages share a common header and readers can skip over pages they are not interested in. The data for the page follows the header and can be compressed and/or encoded. The compression and encoding is specified in the page metadata.

A column chunk might be partly or completely dictionary encoded. This means that dictionary indexes are stored in the data pages instead of the actual values, while the actual values are stored in the dictionary page. See details in Encodings.md. The dictionary page must be placed at the first position of the column chunk. At most one dictionary page can be placed in a column chunk.

Additionally, files can contain an optional column index to allow readers to skip pages more efficiently. See PageIndex.md for details and the reasoning behind adding these to the format.

Checksumming

Pages of all kinds can be individually checksummed. This allows disabling of checksums at the HDFS file level, to better support single row lookups. Checksums are calculated using the standard CRC32 algorithm - as used in e.g. GZip - on the serialized binary representation of a page (not including the page header itself).
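
In Python, the same CRC32 (as used by gzip) is available in the standard library's zlib module; the page bytes below are placeholders:

```python
import zlib

# CRC32 over the serialized page bytes (the page header itself is excluded).
page_bytes = b"<serialized page payload>"
crc = zlib.crc32(page_bytes) & 0xFFFFFFFF

# A reader recomputes the checksum over the same bytes and compares.
assert crc == zlib.crc32(page_bytes) & 0xFFFFFFFF
```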

Error recovery

If the file metadata is corrupt, the file is lost. If the column metadata is corrupt, that column chunk is lost (but column chunks for this column in other row groups are okay). If a page header is corrupt, the remaining pages in that chunk are lost. If the data within a page is corrupt, that page is lost. The file will be more resilient to corruption with smaller row groups.

Potential extension: With smaller row groups, the biggest issue is placing the file metadata at the end. If an error happens while writing the file metadata, all the data written will be unreadable. This can be fixed by writing the file metadata every Nth row group. Each file metadata would be cumulative and include all the row groups written so far. Combining this with the strategy used for rc or avro files using sync markers, a reader could recover partially written files.

Separating metadata and column data

The format is explicitly designed to separate the metadata from the data. This allows splitting columns into multiple files, as well as having a single metadata file reference multiple parquet files.

Configurations

  • Row group size: Larger row groups allow for larger column chunks which makes it possible to do larger sequential IO. Larger groups also require more buffering in the write path (or a two pass write). We recommend large row groups (512MB - 1GB). Since an entire row group might need to be read, we want it to completely fit on one HDFS block. Therefore, HDFS block sizes should also be set to be larger. An optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block per HDFS file.
  • Data page size: Data pages should be considered indivisible, so smaller data pages allow for more fine-grained reading (e.g. single row lookup). Larger page sizes incur less space overhead (fewer page headers) and potentially less parsing overhead (fewer headers to process). Note: for sequential scans, it is not expected to read a page at a time; this is not the IO chunk. We recommend 8KB for page sizes.

Extensibility

There are many places in the format for compatible extensions:

  • File Version: The file metadata contains a version.
  • Encodings: Encodings are specified by enum and more can be added in the future.
  • Page types: Additional page types can be added and safely skipped.

Testing

The apache/parquet-testing repository contains a set of Parquet files for testing purposes.

Contributing

Comment on the issue and/or contact the parquet-dev mailing list with your questions and ideas. Changes to this core format definition are proposed and discussed in depth on the mailing list. You may also be interested in contributing to the Parquet-Java subproject, which contains all the Java-side implementation and APIs. See the "How To Contribute" section of the Parquet-Java project.

Code of Conduct

We hold ourselves and the Parquet developer community to a code of conduct as described by Twitter OSS: https://github.com/twitter/code-of-conduct/blob/master/code-of-conduct.md.

License

Copyright 2013 Twitter, Cloudera and other contributors.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0