
linkedin/ambry

Distributed object store


Top Related Projects

  • Mirror of Apache Kafka
  • Apache Pulsar - distributed pub-sub messaging system
  • Apache Hadoop
  • Apache HBase
  • etcd - Distributed reliable key-value store for the most critical data of a distributed system
  • TiKV - Distributed transactional key-value database, originally created to complement TiDB

Quick Overview

Ambry is a distributed object store developed by LinkedIn for storing and serving large binary objects (blobs). It is designed to handle high-throughput, low-latency read and write operations, making it suitable for various use cases such as storing and serving media files, documents, and other large data objects.

Pros

  • Highly scalable and distributed architecture
  • Low-latency read and write operations
  • Support for multi-datacenter replication
  • Flexible storage options (in-memory, on-disk, or tiered)

Cons

  • Complex setup and configuration process
  • Limited documentation for advanced use cases
  • Steep learning curve for newcomers to distributed systems
  • Primarily designed for large-scale deployments; may be overkill for smaller projects

Code Examples

  1. Creating a BlobId:
BlobId blobId = new BlobId(version, BlobIdType.NATIVE, datacenterId, accountId, containerId, partitionId, isEncrypted, blobDataType);

This code creates a unique identifier for a blob in Ambry.

  2. Uploading a blob:
BlobProperties blobProperties = new BlobProperties(blobSize, serviceId);
ByteBufferInputStream stream = new ByteBufferInputStream(ByteBuffer.wrap(data));
router.putBlob(blobId, blobProperties, stream, null, callback);

This example demonstrates how to upload a blob to Ambry using the Router interface.

  3. Retrieving a blob:
router.getBlob(blobId, new GetBlobOptions(), new Callback<GetBlobResult>() {
    @Override
    public void onCompletion(GetBlobResult result, Exception exception) {
        if (exception == null) {
            ReadableStreamChannel blobContent = result.getBlobDataChannel();
            // Process the blob content
        } else {
            // Handle the exception
        }
    }
});

This code shows how to retrieve a blob from Ambry using its BlobId.

Getting Started

To get started with Ambry, follow these steps:

  1. Clone the repository:

    git clone https://github.com/linkedin/ambry.git
    
  2. Build the project:

    cd ambry
    ./gradlew build
    
  3. Run a single node Ambry server:

    ./gradlew startServer
    
  4. Use the Ambry client to interact with the server:

    Properties props = new Properties();
    props.setProperty("host.name", "localhost");
    props.setProperty("port", "6667");
    NetworkClient networkClient = new NetworkClientFactory(props).getNetworkClient();
    Router router = new RouterFactory(props, networkClient).getRouter();
    // Use the router to perform operations on Ambry
    

For more detailed instructions and advanced configurations, refer to the project's documentation on GitHub.

Competitor Comparisons

Mirror of Apache Kafka

Pros of Kafka

  • Highly scalable and distributed streaming platform with high throughput
  • Robust ecosystem with wide industry adoption and extensive tooling
  • Supports real-time data processing and complex event streaming use cases

Cons of Kafka

  • Higher complexity and steeper learning curve for setup and management
  • Can be resource-intensive, especially for smaller deployments
  • Less optimized for large object storage compared to Ambry

Code Comparison

Kafka producer example:

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, String> producer = new KafkaProducer<>(props);

Ambry client example:

RouterConfig routerConfig = new RouterConfig(verifiableProperties);
ClusterMap clusterMap = new HelixClusterManager(clusterMapConfig, helixManagerFactory);
Router router = new NonBlockingRouterFactory(routerConfig, clusterMap).getRouter();

Kafka and Ambry are both distributed systems, but they serve different purposes. Kafka excels in real-time data streaming and processing, while Ambry is designed for large object storage and retrieval. Kafka offers more flexibility for complex event streaming scenarios, whereas Ambry provides better optimization for storing and accessing large files or blobs.

Apache Pulsar - distributed pub-sub messaging system

Pros of Pulsar

  • Supports both streaming and queuing messaging models
  • Built-in multi-tenancy and geo-replication features
  • Scalable architecture with separate storage and serving layers

Cons of Pulsar

  • More complex setup and configuration compared to Ambry
  • Higher resource requirements for deployment
  • Steeper learning curve for developers new to the system

Code Comparison

Ambry (Java):

BlobId blobId = new BlobId(version, BlobId.BlobIdType.NATIVE, dataCenterId, accountId, containerId, partitionId, isEncrypted, BlobId.BlobDataType.DATACHUNK);
BlobProperties blobProperties = new BlobProperties(blobSize, serviceId, accountId, containerId, false);
router.putBlob(blobId, blobProperties, userMetadata, channel, new PutBlobOptionsBuilder().build(), callback);

Pulsar (Java):

PulsarClient client = PulsarClient.builder().serviceUrl("pulsar://localhost:6650").build();
Producer<byte[]> producer = client.newProducer().topic("my-topic").create();
producer.send("Hello, Pulsar!".getBytes());

Ambry and Pulsar are both distributed systems, but they serve different purposes. Ambry is primarily designed for blob storage, while Pulsar focuses on messaging and event streaming. Pulsar offers more features for real-time data processing and supports multiple messaging patterns, making it suitable for a wider range of use cases. However, this comes at the cost of increased complexity and resource requirements compared to Ambry's more focused approach to blob storage.

Apache Hadoop

Pros of Hadoop

  • Widely adopted and supported by a large community
  • Offers a comprehensive ecosystem for big data processing
  • Provides robust scalability for handling massive datasets

Cons of Hadoop

  • More complex setup and configuration compared to Ambry
  • Higher resource requirements for running a full Hadoop cluster
  • Steeper learning curve for new users

Code Comparison

Hadoop (HDFS write operation):

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path file = new Path("hdfs://localhost:9000/user/hadoop/file.txt");
FSDataOutputStream out = fs.create(file);
out.writeUTF("Hello, Hadoop!");
out.close();

Ambry (BlobStore write operation):

// Simplified illustration; the actual BlobStore API operates on MessageWriteSets
BlobId blobId = new BlobId(partitionId, "file.txt");
ByteBuffer buffer = ByteBuffer.wrap("Hello, Ambry!".getBytes());
BlobProperties properties = new BlobProperties(buffer.capacity(), "application/text");
store.put(blobId, buffer, properties);

Both Hadoop and Ambry provide distributed storage solutions, but Hadoop offers a more comprehensive ecosystem for big data processing, while Ambry focuses on providing a simpler, more lightweight blob storage system. Hadoop's HDFS is designed for handling large files and batch processing, whereas Ambry is optimized for smaller objects and faster access times.

Apache HBase

Pros of HBase

  • Mature and widely adopted project with a large community and extensive documentation
  • Supports real-time read/write access to large datasets
  • Offers strong consistency and automatic sharding

Cons of HBase

  • Higher complexity and steeper learning curve
  • Requires more resources and maintenance overhead
  • Less efficient for storing and retrieving small objects

Code Comparison

HBase example (Java):

Put put = new Put(Bytes.toBytes("row1"));
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("qual1"), Bytes.toBytes("value1"));
table.put(put);

Ambry example (Java):

BlobProperties blobProperties = new BlobProperties(blobSize, serviceId);
Router router = new RouterFactory(verifiableProperties, clusterMap).getRouter();
router.putBlob(blobId, blobProperties, userMetadata, blobContent, callback);

Key Differences

  • Ambry is designed specifically for storing and serving large binary objects (blobs), while HBase is a more general-purpose columnar database
  • HBase provides stronger consistency guarantees, while Ambry focuses on high throughput and low latency for blob storage
  • Ambry offers simpler deployment and management for blob storage use cases, whereas HBase provides more flexibility for complex data models

Both projects have their strengths, with HBase being more suitable for general-purpose distributed database needs and Ambry excelling in blob storage scenarios.

etcd - Distributed reliable key-value store for the most critical data of a distributed system

Pros of etcd

  • Designed for distributed systems and highly available key-value storage
  • Strong consistency and reliability through the Raft consensus algorithm
  • Well-suited for storing configuration data and service discovery

Cons of etcd

  • Limited scalability for large datasets compared to Ambry
  • Not optimized for storing and retrieving large binary objects
  • Higher complexity in setup and maintenance for simple use cases

Code Comparison

etcd (Go):

cli, _ := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
defer cli.Close()
ctx, cancel := context.WithTimeout(context.Background(), time.Second)
_, err := cli.Put(ctx, "key", "value")
cancel()

Ambry (Java):

RouterConfig routerConfig = new RouterConfig(verifiableProperties);
NonBlockingRouter router = new NonBlockingRouter(routerConfig, clusterMap, notificationSystem);
BlobProperties blobProperties = new BlobProperties(blobSize, serviceId);
router.putBlob(blobId, blobProperties, userMetadata, channel, null);

Summary

etcd is a distributed key-value store focused on consistency and reliability, making it ideal for configuration management and service discovery in distributed systems. Ambry, on the other hand, is designed for large-scale blob storage and retrieval, offering better scalability for large datasets. While etcd provides strong consistency through the Raft algorithm, Ambry is optimized for high-throughput blob operations. The choice between the two depends on the specific use case and requirements of the system being developed.

TiKV - Distributed transactional key-value database, originally created to complement TiDB

Pros of TiKV

  • Designed as a distributed key-value store, offering better scalability for large-scale deployments
  • Supports ACID transactions across multiple keys, providing stronger consistency guarantees
  • Implements the Raft consensus algorithm, enhancing fault tolerance and data reliability

Cons of TiKV

  • Higher complexity due to its distributed nature, potentially requiring more resources to manage
  • Steeper learning curve for developers not familiar with distributed systems
  • May have higher latency for simple read/write operations compared to Ambry's simpler architecture

Code Comparison

TiKV (Rust):

let client = TiKVClient::new(vec!["127.0.0.1:2379"]).await?;
let txn = client.begin().await?;
txn.put("key".to_owned(), "value".to_owned()).await?;
txn.commit().await?;

Ambry (Java):

Router router = new RouterFactory(clusterMap, responseHandler).getRouter();
BlobId blobId = new BlobId(version, BlobId.BlobIdType.NATIVE, dataCenterId, accountId, containerId, partitionId, isEncrypted, BlobId.BlobDataType.DATACHUNK);
router.putBlob(blobId, blobProperties, userMetadata, channel, callback);


README

Ambry


Ambry is a distributed object store that supports storage of trillions of small immutable objects (50K-100K) as well as billions of large objects. It was specifically designed to store and serve media objects in web companies. However, it can be used as a general purpose storage system to store DB backups, search indexes or business reports. The system has the following characteristics:

  1. Highly available and horizontally scalable
  2. Low latency and high throughput
  3. Optimized for both small and large objects
  4. Cost effective
  5. Easy to use

Requires at least JDK 1.8.

Documentation

Detailed documentation is available at https://github.com/linkedin/ambry/wiki

Research

Paper introducing Ambry at SIGMOD 2016: http://dprg.cs.uiuc.edu/data/files/2016/ambry.pdf

Reach out to us at ambrydev@googlegroups.com if you would like us to list a paper that is based off of research on Ambry.

Getting Started

Step 1: Download the code, build it and prepare for deployment.

To get the latest code and build it, do

$ git clone https://github.com/linkedin/ambry.git 
$ cd ambry
$ ./gradlew allJar
$ cd target
$ mkdir logs

Ambry uses files that provide information about the cluster to route requests from the frontend to servers and for replication between servers. We will use a simple clustermap that contains a single server with one partition. The partition will use /tmp as the mount point.
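As a rough illustration of what such a clustermap might look like, here is a sketch of a single-node hardware layout: one datacenter, one machine, one disk mounted at /tmp. The field names below are illustrative assumptions, not the authoritative schema; refer to the sample HardwareLayout.json and PartitionLayout.json files under the repository's config directory for the exact format.

```json
{
  "clusterName": "SingleNodeCluster",
  "version": 1,
  "datacenters": [{
    "name": "Datacenter",
    "dataNodes": [{
      "hostname": "localhost",
      "port": 6667,
      "disks": [{ "mountPath": "/tmp", "capacityInBytes": 1099511627776 }]
    }]
  }]
}
```

The partition layout file then maps each partition to the disks (by mount path) that hold its replicas.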

Step 2: Deploy a server.
$ nohup java -Dlog4j2.configurationFile=file:../config/log4j2.xml -jar ambry.jar --serverPropsFilePath ../config/server.properties --hardwareLayoutFilePath ../config/HardwareLayout.json --partitionLayoutFilePath ../config/PartitionLayout.json > logs/server.log &

Through this command, we configure the log4j properties, provide the server with configuration options and cluster definitions, and redirect output to a log. Note down the process ID returned (serverProcessID) because it will be needed for shutdown.
The log will be available at logs/server.log. Alternatively, you can change the log4j properties to write the log messages to a file instead of standard output.

Step 3: Deploy a frontend.
$ nohup java -Dlog4j2.configurationFile=file:../config/log4j2.xml -cp "*" com.github.ambry.frontend.AmbryFrontendMain --serverPropsFilePath ../config/frontend.properties --hardwareLayoutFilePath ../config/HardwareLayout.json --partitionLayoutFilePath ../config/PartitionLayout.json > logs/frontend.log &

Note down the process ID returned (frontendProcessID) because it will be needed for shutdown. Make sure that the frontend is ready to receive requests.

$ curl http://localhost:1174/healthCheck
GOOD

The log will be available at logs/frontend.log. Alternatively, you can change the log4j properties to write the log messages to a file instead of standard output.

Step 4: Interact with Ambry!

We are now ready to store and retrieve data from Ambry. Let us start by storing a simple image. For demonstration purposes, we will use an image demo.gif that has been copied into the target folder.

POST
$ curl -i -H "x-ambry-service-id:CUrlUpload"  -H "x-ambry-owner-id:`whoami`" -H "x-ambry-content-type:image/gif" -H "x-ambry-um-description:Demonstration Image" http://localhost:1174/ --data-binary @demo.gif
HTTP/1.1 201 Created
Location: AmbryID
Content-Length: 0

The curl command creates a POST request that contains the binary data in demo.gif. Along with the file data, we provide headers that act as blob properties. These include the size of the blob, the service ID, the owner ID, and the content type.
In addition to these properties, Ambry also has a provision for arbitrary user defined metadata. We provide x-ambry-um-description as user metadata. Ambry does not interpret this data and it is purely for user annotation. The Location header in the response is the blob ID of the blob we just uploaded.

GET - Blob Info

Now that we stored a blob, let us verify some properties of the blob we uploaded.

$ curl -i http://localhost:1174/AmbryID/BlobInfo
HTTP/1.1 200 OK
x-ambry-blob-size: {Blob size}
x-ambry-service-id: CUrlUpload
x-ambry-creation-time: {Creation time}
x-ambry-private: false
x-ambry-content-type: image/gif
x-ambry-owner-id: {username}
x-ambry-um-desc: Demonstration Image
Content-Length: 0

GET - Blob

Now that we have verified that Ambry returns properties correctly, let us obtain the actual blob.

$ curl http://localhost:1174/AmbryID > demo-downloaded.gif
$ diff demo.gif demo-downloaded.gif 
$

This confirms that the data that was sent in the POST request matches what we received in the GET. If you would like to see the image, simply point your browser to http://localhost:1174/AmbryID and you should see the image that was uploaded!

DELETE

Ambry is an immutable store: blobs cannot be updated, but they can be deleted to make them irretrievable. Let us go ahead and delete the blob we just created.

$ curl -i -X DELETE http://localhost:1174/AmbryID
HTTP/1.1 202 Accepted
Content-Length: 0

You will no longer be able to retrieve the blob properties or data.

$ curl -i http://localhost:1174/AmbryID/BlobInfo
HTTP/1.1 410 Gone
Content-Type: text/plain; charset=UTF-8
Content-Length: 17
Connection: close

Failure: 410 Gone

Step 5: Stop the frontend and server.
$ kill -15 frontendProcessID
$ kill -15 serverProcessID

You can confirm that the services have been shut down by looking at the logs.

Additional information:

In addition to the simple APIs demonstrated above, Ambry provides support for HEAD and for GET of only the user metadata. Beyond the POST of binary data that was demonstrated, Ambry also supports POST of multipart/form-data via curl or web forms. Other features of interest include:

  • Time To Live (TTL): During POST, a TTL in seconds can be provided through the addition of a header named x-ambry-ttl. This means that Ambry will stop serving the blob after the TTL has expired. On GET, expired blobs behave the same way as deleted blobs.
  • Private: During POST, providing a header named x-ambry-private with the value true will mark the blob as private. API behavior can be configured based on whether a blob is public or private.
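These headers compose with the upload flow from Step 4. As a sketch (reusing the demo.gif file and the frontend port from that example), a single POST can set a one-hour TTL and mark the blob private at the same time:

```shell
# Upload a blob that expires after one hour and is marked private.
# The file demo.gif and port 1174 are reused from Step 4; as before,
# the Location header of the response carries the blob ID.
$ curl -i -H "x-ambry-service-id:CUrlUpload" \
       -H "x-ambry-content-type:image/gif" \
       -H "x-ambry-ttl:3600" \
       -H "x-ambry-private:true" \
       http://localhost:1174/ --data-binary @demo.gif
```

Once the TTL expires, a GET for this blob behaves the same way as a GET for a deleted blob.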