Top Related Projects
- apache/kafka: Mirror of Apache Kafka
- apache/pulsar: Apache Pulsar - distributed pub-sub messaging system
- apache/hadoop: Apache Hadoop
- apache/hbase: Apache HBase
- etcd-io/etcd: Distributed reliable key-value store for the most critical data of a distributed system
- tikv/tikv: Distributed transactional key-value database, originally created to complement TiDB
Quick Overview
Ambry is a distributed object store developed by LinkedIn for storing and serving large binary objects (blobs). It is designed to handle high-throughput, low-latency read and write operations, making it suitable for various use cases such as storing and serving media files, documents, and other large data objects.
Pros
- Highly scalable and distributed architecture
- Low-latency read and write operations
- Support for multi-datacenter replication
- Flexible storage options (in-memory, on-disk, or tiered)
Cons
- Complex setup and configuration process
- Limited documentation for advanced use cases
- Steep learning curve for newcomers to distributed systems
- Primarily designed for large-scale deployments, may be overkill for smaller projects
Code Examples
- Creating a BlobId:
// Arguments: ID format version, ID type, datacenter, account, container, partition, encryption flag, blob data type
BlobId blobId = new BlobId(version, BlobIdType.NATIVE, datacenterId, accountId, containerId, partitionId, isEncrypted, blobDataType);
This code creates a unique identifier for a blob in Ambry.
- Uploading a blob:
BlobProperties blobProperties = new BlobProperties(blobSize, serviceId);
ReadableStreamChannel channel = new ByteBufferReadableStreamChannel(ByteBuffer.wrap(data));
router.putBlob(blobProperties, userMetadata, channel, new PutBlobOptionsBuilder().build(), callback);
This example demonstrates how to upload a blob to Ambry using the Router interface. Note that the router assigns the blob ID itself and hands it back through the callback (or the returned Future) rather than accepting one from the caller.
- Retrieving a blob:
router.getBlob(blobId, new GetBlobOptionsBuilder().build(), new Callback<GetBlobResult>() {
  @Override
  public void onCompletion(GetBlobResult result, Exception exception) {
    if (exception == null) {
      // The content arrives as a ReadableStreamChannel rather than a plain InputStream
      ReadableStreamChannel blobContent = result.getBlobDataChannel();
      // Process the blob content
    } else {
      // Handle the exception
    }
  }
});
This code shows how to retrieve a blob from Ambry using its BlobId.
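- Deleting a blob (a minimal sketch; it assumes the Router's deleteBlob overload that takes the blob ID, the calling service's ID, and a completion callback):
router.deleteBlob(blobId, serviceId, new Callback<Void>() {
  @Override
  public void onCompletion(Void result, Exception exception) {
    // A null exception means the delete succeeded and the blob is now irretrievable
  }
});
This rounds out the blob lifecycle; like putBlob and getBlob, the call is asynchronous.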
Getting Started
To get started with Ambry, follow these steps:
- Clone the repository:
git clone https://github.com/linkedin/ambry.git
- Build the project:
cd ambry
./gradlew build
- Run a single node Ambry server:
./gradlew startServer
- Use the Ambry client to interact with the server:
Properties props = new Properties();
props.setProperty("host.name", "localhost");
props.setProperty("port", "6667");
NetworkClient networkClient = new NetworkClientFactory(props).getNetworkClient();
Router router = new RouterFactory(props, networkClient).getRouter();
// Use the router to perform operations on Ambry
For more detailed instructions and advanced configurations, refer to the project's documentation on GitHub.
Competitor Comparisons
apache/kafka: Mirror of Apache Kafka
Pros of Kafka
- Highly scalable and distributed streaming platform with high throughput
- Robust ecosystem with wide industry adoption and extensive tooling
- Supports real-time data processing and complex event streaming use cases
Cons of Kafka
- Higher complexity and steeper learning curve for setup and management
- Can be resource-intensive, especially for smaller deployments
- Less optimized for large object storage compared to Ambry
Code Comparison
Kafka producer example:
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, String> producer = new KafkaProducer<>(props);
Ambry client example:
RouterConfig routerConfig = new RouterConfig(verifiableProperties);
ClusterMap clusterMap = new HelixClusterManager(clusterMapConfig, helixManagerFactory);
Router router = new NonBlockingRouterFactory(routerConfig, clusterMap).getRouter();
Both Kafka and Ambry are distributed storage systems, but they serve different purposes. Kafka excels in real-time data streaming and processing, while Ambry is designed for large object storage and retrieval. Kafka offers more flexibility for complex event streaming scenarios, whereas Ambry provides better optimization for storing and accessing large files or blobs.
Apache Pulsar - distributed pub-sub messaging system
Pros of Pulsar
- Supports both streaming and queuing messaging models
- Built-in multi-tenancy and geo-replication features
- Scalable architecture with separate storage and serving layers
Cons of Pulsar
- More complex setup and configuration compared to Ambry
- Higher resource requirements for deployment
- Steeper learning curve for developers new to the system
Code Comparison
Ambry (Java):
BlobId blobId = new BlobId(version, BlobId.BlobIdType.NATIVE, dataCenterId, accountId, containerId, partitionId, isEncrypted, BlobId.BlobDataType.DATACHUNK);
BlobProperties blobProperties = new BlobProperties(blobSize, serviceId, accountId, containerId, false);
router.putBlob(blobProperties, userMetadata, channel, new PutBlobOptionsBuilder().build(), callback);
Pulsar (Java):
PulsarClient client = PulsarClient.builder().serviceUrl("pulsar://localhost:6650").build();
Producer<byte[]> producer = client.newProducer().topic("my-topic").create();
producer.send("Hello, Pulsar!".getBytes());
Both Ambry and Pulsar are distributed storage systems, but they serve different purposes. Ambry is primarily designed for blob storage, while Pulsar focuses on messaging and event streaming. Pulsar offers more features for real-time data processing and supports multiple messaging patterns, making it suitable for a wider range of use cases. However, this comes at the cost of increased complexity and resource requirements compared to Ambry's more focused approach to blob storage.
Apache Hadoop
Pros of Hadoop
- Widely adopted and supported by a large community
- Offers a comprehensive ecosystem for big data processing
- Provides robust scalability for handling massive datasets
Cons of Hadoop
- More complex setup and configuration compared to Ambry
- Higher resource requirements for running a full Hadoop cluster
- Steeper learning curve for new users
Code Comparison
Hadoop (HDFS write operation):
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path file = new Path("hdfs://localhost:9000/user/hadoop/file.txt");
FSDataOutputStream out = fs.create(file);
out.writeUTF("Hello, Hadoop!");
out.close();
Ambry (blob write via the Router):
ByteBuffer buffer = ByteBuffer.wrap("Hello, Ambry!".getBytes());
BlobProperties properties = new BlobProperties(buffer.capacity(), "demo-service");
ReadableStreamChannel channel = new ByteBufferReadableStreamChannel(buffer);
router.putBlob(properties, new byte[0], channel, new PutBlobOptionsBuilder().build(), callback);
Both Hadoop and Ambry provide distributed storage solutions, but Hadoop offers a more comprehensive ecosystem for big data processing, while Ambry focuses on providing a simpler, more lightweight blob storage system. Hadoop's HDFS is designed for handling large files and batch processing, whereas Ambry is optimized for smaller objects and faster access times.
Apache HBase
Pros of HBase
- Mature and widely adopted project with a large community and extensive documentation
- Supports real-time read/write access to large datasets
- Offers strong consistency and automatic sharding
Cons of HBase
- Higher complexity and steeper learning curve
- Requires more resources and maintenance overhead
- Less efficient for storing and retrieving small objects
Code Comparison
HBase example (Java):
Put put = new Put(Bytes.toBytes("row1"));
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("qual1"), Bytes.toBytes("value1"));
table.put(put);
Ambry example (Java):
BlobProperties blobProperties = new BlobProperties(blobSize, serviceId);
Router router = new NonBlockingRouterFactory(routerConfig, clusterMap).getRouter();
router.putBlob(blobProperties, userMetadata, blobContent, new PutBlobOptionsBuilder().build(), callback);
Key Differences
- Ambry is designed specifically for storing and serving large binary objects (blobs), while HBase is a more general-purpose columnar database
- HBase provides stronger consistency guarantees, while Ambry focuses on high throughput and low latency for blob storage
- Ambry offers simpler deployment and management for blob storage use cases, whereas HBase provides more flexibility for complex data models
Both projects have their strengths, with HBase being more suitable for general-purpose distributed database needs and Ambry excelling in blob storage scenarios.
etcd-io/etcd: Distributed reliable key-value store for the most critical data of a distributed system
Pros of etcd
- Designed for distributed systems and highly available key-value storage
- Strong consistency and reliability through the Raft consensus algorithm
- Well-suited for storing configuration data and service discovery
Cons of etcd
- Limited scalability for large datasets compared to Ambry
- Not optimized for storing and retrieving large binary objects
- Higher complexity in setup and maintenance for simple use cases
Code Comparison
etcd (Go):
cli, _ := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
defer cli.Close()
ctx, cancel := context.WithTimeout(context.Background(), time.Second)
_, err := cli.Put(ctx, "key", "value")
cancel()
Ambry (Java):
RouterConfig routerConfig = new RouterConfig(verifiableProperties);
Router router = new NonBlockingRouterFactory(routerConfig, clusterMap).getRouter();
BlobProperties blobProperties = new BlobProperties(blobSize, serviceId);
router.putBlob(blobProperties, userMetadata, channel, new PutBlobOptionsBuilder().build(), callback);
Summary
etcd is a distributed key-value store focused on consistency and reliability, making it ideal for configuration management and service discovery in distributed systems. Ambry, on the other hand, is designed for large-scale blob storage and retrieval, offering better scalability for large datasets. While etcd provides strong consistency through the Raft algorithm, Ambry is optimized for high-throughput blob operations. The choice between the two depends on the specific use case and requirements of the system being developed.
tikv/tikv: Distributed transactional key-value database, originally created to complement TiDB
Pros of TiKV
- Designed as a distributed key-value store, offering better scalability for large-scale deployments
- Supports ACID transactions across multiple keys, providing stronger consistency guarantees
- Implements the Raft consensus algorithm, enhancing fault tolerance and data reliability
Cons of TiKV
- Higher complexity due to its distributed nature, potentially requiring more resources to manage
- Steeper learning curve for developers not familiar with distributed systems
- May have higher latency for simple read/write operations compared to Ambry's simpler architecture
Code Comparison
TiKV (Rust):
let client = TransactionClient::new(vec!["127.0.0.1:2379"]).await?;
let txn = client.begin().await?;
txn.put("key".to_owned(), "value".to_owned()).await?;
txn.commit().await?;
Ambry (Java):
Router router = new NonBlockingRouterFactory(routerConfig, clusterMap).getRouter();
BlobProperties blobProperties = new BlobProperties(blobSize, serviceId);
router.putBlob(blobProperties, userMetadata, channel, new PutBlobOptionsBuilder().build(), callback);
README
Ambry
Ambry is a distributed object store that supports storage of trillions of small immutable objects (50K-100K) as well as billions of large objects. It was specifically designed to store and serve media objects in web companies. However, it can be used as a general purpose storage system to store DB backups, search indexes or business reports. The system has the following characteristics:
- Highly available and horizontally scalable
- Low latency and high throughput
- Optimized for both small and large objects
- Cost effective
- Easy to use
Requires at least JDK 1.8.
Documentation
Detailed documentation is available at https://github.com/linkedin/ambry/wiki
Research
Paper introducing Ambry at SIGMOD 2016: http://dprg.cs.uiuc.edu/data/files/2016/ambry.pdf
Reach out to us at ambrydev@googlegroups.com if you would like us to list a paper based on research on Ambry.
Getting Started
Step 1: Download the code, build it and prepare for deployment.
To get the latest code and build it, do
$ git clone https://github.com/linkedin/ambry.git
$ cd ambry
$ ./gradlew allJar
$ cd target
$ mkdir logs
Ambry uses files that provide information about the cluster to route requests from the frontend to servers and for replication between servers. We will use a simple clustermap that contains a single server with one partition. The partition will use /tmp as the mount point.
Step 2: Deploy a server.
$ nohup java -Dlog4j2.configurationFile=file:../config/log4j2.xml -jar ambry.jar --serverPropsFilePath ../config/server.properties --hardwareLayoutFilePath ../config/HardwareLayout.json --partitionLayoutFilePath ../config/PartitionLayout.json > logs/server.log &
Through this command, we configure the log4j properties, provide the server with configuration options and cluster definitions, and redirect output to a log. Note down the process ID returned (serverProcessID) because it will be needed for shutdown.
The log will be available at logs/server.log. Alternatively, you can change the log4j properties to write the log messages to a file instead of standard output.
Step 3: Deploy a frontend.
$ nohup java -Dlog4j2.configurationFile=file:../config/log4j2.xml -cp "*" com.github.ambry.frontend.AmbryFrontendMain --serverPropsFilePath ../config/frontend.properties --hardwareLayoutFilePath ../config/HardwareLayout.json --partitionLayoutFilePath ../config/PartitionLayout.json > logs/frontend.log &
Note down the process ID returned (frontendProcessID) because it will be needed for shutdown. Make sure that the frontend is ready to receive requests:
$ curl http://localhost:1174/healthCheck
GOOD
The log will be available at logs/frontend.log. Alternatively, you can change the log4j properties to write the log messages to a file instead of standard output.
Step 4: Interact with Ambry!
We are now ready to store and retrieve data from Ambry. Let us start by storing a simple image. For demonstration purposes, we will use an image demo.gif that has been copied into the target folder.
POST
$ curl -i -H "x-ambry-service-id:CUrlUpload" -H "x-ambry-owner-id:`whoami`" -H "x-ambry-content-type:image/gif" -H "x-ambry-um-description:Demonstration Image" http://localhost:1174/ --data-binary @demo.gif
HTTP/1.1 201 Created
Location: AmbryID
Content-Length: 0
The cURL command creates a POST request that contains the binary data in demo.gif. Along with the file data, we provide headers that act as blob properties. These include the size of the blob, the service ID, the owner ID and the content type.
In addition to these properties, Ambry also has a provision for arbitrary user defined metadata. We provide x-ambry-um-description as user metadata. Ambry does not interpret this data and it is purely for user annotation.
The Location header in the response is the blob ID of the blob we just uploaded.
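The same upload can be done programmatically. Below is a minimal sketch using the JDK's HttpURLConnection (nothing Ambry-specific is needed on the client side); the frontend address, headers and demo.gif are taken from the steps above, the class name is just for illustration, and error handling is omitted:
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class AmbryPostDemo {
  public static void main(String[] args) throws Exception {
    byte[] data = Files.readAllBytes(Paths.get("demo.gif"));
    HttpURLConnection conn = (HttpURLConnection) new URL("http://localhost:1174/").openConnection();
    conn.setRequestMethod("POST");
    conn.setDoOutput(true);
    // Headers that act as blob properties, mirroring the cURL example above
    conn.setRequestProperty("x-ambry-service-id", "CUrlUpload");
    conn.setRequestProperty("x-ambry-content-type", "image/gif");
    conn.setRequestProperty("x-ambry-um-description", "Demonstration Image");
    try (OutputStream out = conn.getOutputStream()) {
      out.write(data);  // send the binary blob content
    }
    // On success (201 Created), the Location response header carries the new blob ID
    System.out.println(conn.getResponseCode() + " -> " + conn.getHeaderField("Location"));
  }
}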
GET - Blob Info
Now that we stored a blob, let us verify some properties of the blob we uploaded.
$ curl -i http://localhost:1174/AmbryID/BlobInfo
HTTP/1.1 200 OK
x-ambry-blob-size: {Blob size}
x-ambry-service-id: CUrlUpload
x-ambry-creation-time: {Creation time}
x-ambry-private: false
x-ambry-content-type: image/gif
x-ambry-owner-id: {username}
x-ambry-um-desc: Demonstration Image
Content-Length: 0
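From Java, the same information can be read off the response headers; a short sketch continuing from the upload example, where blobId is the ID returned in the Location header:
// Blob properties and user metadata come back as x-ambry-* response headers
HttpURLConnection info = (HttpURLConnection) new URL("http://localhost:1174/" + blobId + "/BlobInfo").openConnection();
System.out.println("size:         " + info.getHeaderField("x-ambry-blob-size"));
System.out.println("content type: " + info.getHeaderField("x-ambry-content-type"));
System.out.println("description:  " + info.getHeaderField("x-ambry-um-desc"));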
GET - Blob
Now that we have verified that Ambry returns properties correctly, let us obtain the actual blob.
$ curl http://localhost:1174/AmbryID > demo-downloaded.gif
$ diff demo.gif demo-downloaded.gif
$
This confirms that the data that was sent in the POST request matches what we received in the GET. If you would like to see the image, simply point your browser to http://localhost:1174/AmbryID and you should see the image that was uploaded!
DELETE
Ambry is an immutable store: blobs cannot be updated, but they can be deleted to make them irretrievable. Let us go ahead and delete the blob we just created.
$ curl -i -X DELETE http://localhost:1174/AmbryID
HTTP/1.1 202 Accepted
Content-Length: 0
You will no longer be able to retrieve the blob properties or data.
$ curl -i http://localhost:1174/AmbryID/BlobInfo
HTTP/1.1 410 Gone
Content-Type: text/plain; charset=UTF-8
Content-Length: 17
Connection: close
Failure: 410 Gone
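The same delete can be issued from Java with a few lines on top of the earlier sketch:
HttpURLConnection del = (HttpURLConnection) new URL("http://localhost:1174/" + blobId).openConnection();
del.setRequestMethod("DELETE");
// Expect 202 Accepted; subsequent requests for the blob will return 410 Gone
System.out.println(del.getResponseCode());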
Step 5: Stop the frontend and server.
$ kill -15 frontendProcessID
$ kill -15 serverProcessID
You can confirm that the services have been shut down by looking at the logs.
Additional information:
In addition to the simple APIs demonstrated above, Ambry provides support for GET of only the user metadata and for HEAD. Besides the POST of binary data that was demonstrated, Ambry also supports POST of multipart/form-data via cURL or web forms.
Other features of interest include:
- Time To Live (TTL): During POST, a TTL in seconds can be provided through the addition of a header named x-ambry-ttl. This means that Ambry will stop serving the blob after the TTL has expired. On GET, expired blobs behave the same way as deleted blobs.
- Private: During POST, providing a header named x-ambry-private with the value true will mark the blob as private. API behavior can be configured based on whether a blob is public or private.
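Both headers slot straight into the HttpURLConnection upload sketch from Step 4; the values below (a one-hour TTL, a private blob) are only illustrative:
conn.setRequestProperty("x-ambry-ttl", "3600");      // stop serving the blob after 3600 seconds
conn.setRequestProperty("x-ambry-private", "true");  // mark the blob as private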