Top Related Projects
- apache/kafka: Mirror of Apache Kafka
- apache/pulsar: Apache Pulsar - distributed pub-sub messaging system
- apache/hadoop: Apache Hadoop
- apache/hbase: Apache HBase
- etcd-io/etcd: Distributed reliable key-value store for the most critical data of a distributed system
- tikv/tikv: Distributed transactional key-value database, originally created to complement TiDB
Quick Overview
Ambry is a distributed object store developed by LinkedIn for storing and serving large binary objects (blobs). It is designed to handle high-throughput, low-latency read and write operations, making it suitable for various use cases such as storing and serving media files, documents, and other large data objects.
Pros
- Highly scalable and distributed architecture
- Low-latency read and write operations
- Support for multi-datacenter replication
- Flexible storage options (in-memory, on-disk, or tiered)
Cons
- Complex setup and configuration process
- Limited documentation for advanced use cases
- Steep learning curve for newcomers to distributed systems
- Primarily designed for large-scale deployments, may be overkill for smaller projects
Code Examples
- Creating a BlobId:
// Arguments: ID format version, ID type, datacenter, account, container, partition, encryption flag, blob data type
BlobId blobId = new BlobId(version, BlobIdType.NATIVE, datacenterId, accountId, containerId, partitionId, isEncrypted, blobDataType);
This code creates a unique identifier for a blob in Ambry.
- Uploading a blob:
BlobProperties blobProperties = new BlobProperties(blobSize, serviceId);
ReadableStreamChannel channel = new ByteBufferReadableStreamChannel(ByteBuffer.wrap(data));
router.putBlob(blobProperties, userMetadata, channel, new PutBlobOptionsBuilder().build(), callback);
This example demonstrates how to upload a blob to Ambry using the Router interface. Note that the router assigns the blob ID itself and hands it back through the callback (or the returned Future) rather than accepting one from the caller.
- Retrieving a blob:
router.getBlob(blobId, new GetBlobOptionsBuilder().build(), new Callback<GetBlobResult>() {
  @Override
  public void onCompletion(GetBlobResult result, Exception exception) {
    if (exception == null) {
      // The content arrives as a ReadableStreamChannel rather than a plain InputStream
      ReadableStreamChannel blobContent = result.getBlobDataChannel();
      // Process the blob content
    } else {
      // Handle the exception
    }
  }
});
This code shows how to retrieve a blob from Ambry using its BlobId.
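- Deleting a blob (a minimal sketch; it assumes the Router's deleteBlob overload that takes the blob ID, the calling service's ID, and a completion callback):
router.deleteBlob(blobId, serviceId, new Callback<Void>() {
  @Override
  public void onCompletion(Void result, Exception exception) {
    // A null exception means the delete succeeded and the blob is now irretrievable
  }
});
This rounds out the blob lifecycle; like putBlob and getBlob, the call is asynchronous.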
Getting Started
To get started with Ambry, follow these steps:
- Clone the repository:
git clone https://github.com/linkedin/ambry.git
- Build the project:
cd ambry
./gradlew build
- Run a single node Ambry server:
./gradlew startServer
- Use the Ambry client to interact with the server:
Properties props = new Properties();
props.setProperty("host.name", "localhost");
props.setProperty("port", "6667");
NetworkClient networkClient = new NetworkClientFactory(props).getNetworkClient();
Router router = new RouterFactory(props, networkClient).getRouter();
// Use the router to perform operations on Ambry
For more detailed instructions and advanced configurations, refer to the project's documentation on GitHub.
Competitor Comparisons
apache/kafka: Mirror of Apache Kafka
Pros of Kafka
- Highly scalable and distributed streaming platform with high throughput
- Robust ecosystem with wide industry adoption and extensive tooling
- Supports real-time data processing and complex event streaming use cases
Cons of Kafka
- Higher complexity and steeper learning curve for setup and management
- Can be resource-intensive, especially for smaller deployments
- Less optimized for large object storage compared to Ambry
Code Comparison
Kafka producer example:
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, String> producer = new KafkaProducer<>(props);
Ambry client example:
RouterConfig routerConfig = new RouterConfig(verifiableProperties);
ClusterMap clusterMap = new HelixClusterManager(clusterMapConfig, helixManagerFactory);
Router router = new NonBlockingRouterFactory(routerConfig, clusterMap).getRouter();
Both Kafka and Ambry are distributed storage systems, but they serve different purposes. Kafka excels in real-time data streaming and processing, while Ambry is designed for large object storage and retrieval. Kafka offers more flexibility for complex event streaming scenarios, whereas Ambry provides better optimization for storing and accessing large files or blobs.
Apache Pulsar - distributed pub-sub messaging system
Pros of Pulsar
- Supports both streaming and queuing messaging models
- Built-in multi-tenancy and geo-replication features
- Scalable architecture with separate storage and serving layers
Cons of Pulsar
- More complex setup and configuration compared to Ambry
- Higher resource requirements for deployment
- Steeper learning curve for developers new to the system
Code Comparison
Ambry (Java):
BlobId blobId = new BlobId(version, BlobId.BlobIdType.NATIVE, dataCenterId, accountId, containerId, partitionId, isEncrypted, BlobId.BlobDataType.DATACHUNK);
BlobProperties blobProperties = new BlobProperties(blobSize, serviceId, accountId, containerId, false);
router.putBlob(blobProperties, userMetadata, channel, new PutBlobOptionsBuilder().build(), callback);
Pulsar (Java):
PulsarClient client = PulsarClient.builder().serviceUrl("pulsar://localhost:6650").build();
Producer<byte[]> producer = client.newProducer().topic("my-topic").create();
producer.send("Hello, Pulsar!".getBytes());
Both Ambry and Pulsar are distributed storage systems, but they serve different purposes. Ambry is primarily designed for blob storage, while Pulsar focuses on messaging and event streaming. Pulsar offers more features for real-time data processing and supports multiple messaging patterns, making it suitable for a wider range of use cases. However, this comes at the cost of increased complexity and resource requirements compared to Ambry's more focused approach to blob storage.
Apache Hadoop
Pros of Hadoop
- Widely adopted and supported by a large community
- Offers a comprehensive ecosystem for big data processing
- Provides robust scalability for handling massive datasets
Cons of Hadoop
- More complex setup and configuration compared to Ambry
- Higher resource requirements for running a full Hadoop cluster
- Steeper learning curve for new users
Code Comparison
Hadoop (HDFS write operation):
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path file = new Path("hdfs://localhost:9000/user/hadoop/file.txt");
FSDataOutputStream out = fs.create(file);
out.writeUTF("Hello, Hadoop!");
out.close();
Ambry (blob write via the Router):
ByteBuffer buffer = ByteBuffer.wrap("Hello, Ambry!".getBytes());
BlobProperties properties = new BlobProperties(buffer.capacity(), "demo-service");
ReadableStreamChannel channel = new ByteBufferReadableStreamChannel(buffer);
router.putBlob(properties, new byte[0], channel, new PutBlobOptionsBuilder().build(), callback);
Both Hadoop and Ambry provide distributed storage solutions, but Hadoop offers a more comprehensive ecosystem for big data processing, while Ambry focuses on providing a simpler, more lightweight blob storage system. Hadoop's HDFS is designed for handling large files and batch processing, whereas Ambry is optimized for smaller objects and faster access times.
Apache HBase
Pros of HBase
- Mature and widely adopted project with a large community and extensive documentation
- Supports real-time read/write access to large datasets
- Offers strong consistency and automatic sharding
Cons of HBase
- Higher complexity and steeper learning curve
- Requires more resources and maintenance overhead
- Less efficient for storing and retrieving small objects
Code Comparison
HBase example (Java):
Put put = new Put(Bytes.toBytes("row1"));
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("qual1"), Bytes.toBytes("value1"));
table.put(put);
Ambry example (Java):
BlobProperties blobProperties = new BlobProperties(blobSize, serviceId);
Router router = new NonBlockingRouterFactory(routerConfig, clusterMap).getRouter();
router.putBlob(blobProperties, userMetadata, blobContent, new PutBlobOptionsBuilder().build(), callback);
Key Differences
- Ambry is designed specifically for storing and serving large binary objects (blobs), while HBase is a more general-purpose columnar database
- HBase provides stronger consistency guarantees, while Ambry focuses on high throughput and low latency for blob storage
- Ambry offers simpler deployment and management for blob storage use cases, whereas HBase provides more flexibility for complex data models
Both projects have their strengths, with HBase being more suitable for general-purpose distributed database needs and Ambry excelling in blob storage scenarios.
etcd-io/etcd: Distributed reliable key-value store for the most critical data of a distributed system
Pros of etcd
- Designed for distributed systems and highly available key-value storage
- Strong consistency and reliability through the Raft consensus algorithm
- Well-suited for storing configuration data and service discovery
Cons of etcd
- Limited scalability for large datasets compared to Ambry
- Not optimized for storing and retrieving large binary objects
- Higher complexity in setup and maintenance for simple use cases
Code Comparison
etcd (Go):
cli, _ := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
defer cli.Close()
ctx, cancel := context.WithTimeout(context.Background(), time.Second)
_, err := cli.Put(ctx, "key", "value")
cancel()
Ambry (Java):
RouterConfig routerConfig = new RouterConfig(verifiableProperties);
Router router = new NonBlockingRouterFactory(routerConfig, clusterMap).getRouter();
BlobProperties blobProperties = new BlobProperties(blobSize, serviceId);
router.putBlob(blobProperties, userMetadata, channel, new PutBlobOptionsBuilder().build(), callback);
Summary
etcd is a distributed key-value store focused on consistency and reliability, making it ideal for configuration management and service discovery in distributed systems. Ambry, on the other hand, is designed for large-scale blob storage and retrieval, offering better scalability for large datasets. While etcd provides strong consistency through the Raft algorithm, Ambry is optimized for high-throughput blob operations. The choice between the two depends on the specific use case and requirements of the system being developed.
tikv/tikv: Distributed transactional key-value database, originally created to complement TiDB
Pros of TiKV
- Designed as a distributed key-value store, offering better scalability for large-scale deployments
- Supports ACID transactions across multiple keys, providing stronger consistency guarantees
- Implements the Raft consensus algorithm, enhancing fault tolerance and data reliability
Cons of TiKV
- Higher complexity due to its distributed nature, potentially requiring more resources to manage
- Steeper learning curve for developers not familiar with distributed systems
- May have higher latency for simple read/write operations compared to Ambry's simpler architecture
Code Comparison
TiKV (Rust):
let client = TransactionClient::new(vec!["127.0.0.1:2379"]).await?;
let txn = client.begin().await?;
txn.put("key".to_owned(), "value".to_owned()).await?;
txn.commit().await?;
Ambry (Java):
Router router = new NonBlockingRouterFactory(routerConfig, clusterMap).getRouter();
BlobProperties blobProperties = new BlobProperties(blobSize, serviceId);
router.putBlob(blobProperties, userMetadata, channel, new PutBlobOptionsBuilder().build(), callback);
README
Ambry
Ambry is a distributed object store that supports storage of trillions of small immutable objects (50K-100K) as well as billions of large objects. It was specifically designed to store and serve media objects in web companies. However, it can be used as a general purpose storage system to store DB backups, search indexes or business reports. The system has the following characteristics:
- Highly available and horizontally scalable
- Low latency and high throughput
- Optimized for both small and large objects
- Cost effective
- Easy to use
Requires at least JDK 1.8.
Documentation
Detailed documentation is available at https://github.com/linkedin/ambry/wiki
Research
Paper introducing Ambry at SIGMOD 2016: http://dprg.cs.uiuc.edu/data/files/2016/ambry.pdf
Reach out to us at ambrydev@googlegroups.com if you would like us to list a paper based on research on Ambry.
Getting Started
Step 1: Download the code, build it and prepare for deployment.
To get the latest code and build it, do
$ git clone https://github.com/linkedin/ambry.git
$ cd ambry
$ ./gradlew allJar
$ cd target
$ mkdir logs
Ambry uses files that provide information about the cluster to route requests from the frontend to servers and for replication between servers. We will use a simple clustermap that contains a single server with one partition. The partition will use /tmp as the mount point.
Step 2: Deploy a server.
$ nohup java -Dlog4j2.configurationFile=file:../config/log4j2.xml -jar ambry.jar --serverPropsFilePath ../config/server.properties --hardwareLayoutFilePath ../config/HardwareLayout.json --partitionLayoutFilePath ../config/PartitionLayout.json > logs/server.log &
Through this command, we configure the log4j properties, provide the server with configuration options and cluster definitions, and redirect output to a log. Note down the process ID returned (serverProcessID) because it will be needed for shutdown.
The log will be available at logs/server.log. Alternatively, you can change the log4j properties to write the log messages to a file instead of standard output.
Step 3: Deploy a frontend.
$ nohup java -Dlog4j2.configurationFile=file:../config/log4j2.xml -cp "*" com.github.ambry.frontend.AmbryFrontendMain --serverPropsFilePath ../config/frontend.properties --hardwareLayoutFilePath ../config/HardwareLayout.json --partitionLayoutFilePath ../config/PartitionLayout.json > logs/frontend.log &
Note down the process ID returned (frontendProcessID) because it will be needed for shutdown. Make sure that the frontend is ready to receive requests:
$ curl http://localhost:1174/healthCheck
GOOD
The log will be available at logs/frontend.log. Alternatively, you can change the log4j properties to write the log messages to a file instead of standard output.
Step 4: Interact with Ambry!
We are now ready to store and retrieve data from Ambry. Let us start by storing a simple image. For demonstration purposes, we will use an image demo.gif that has been copied into the target folder.
POST
$ curl -i -H "x-ambry-service-id:CUrlUpload" -H "x-ambry-owner-id:`whoami`" -H "x-ambry-content-type:image/gif" -H "x-ambry-um-description:Demonstration Image" http://localhost:1174/ --data-binary @demo.gif
HTTP/1.1 201 Created
Location: AmbryID
Content-Length: 0
The cURL command creates a POST request that contains the binary data in demo.gif. Along with the file data, we provide headers that act as blob properties. These include the size of the blob, the service ID, the owner ID and the content type.
In addition to these properties, Ambry also has a provision for arbitrary user defined metadata. We provide x-ambry-um-description as user metadata. Ambry does not interpret this data and it is purely for user annotation.
The Location header in the response is the blob ID of the blob we just uploaded.
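The same upload can be done programmatically. Below is a minimal sketch using the JDK's HttpURLConnection (nothing Ambry-specific is needed on the client side); the frontend address, headers and demo.gif are taken from the steps above, the class name is just for illustration, and error handling is omitted:
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class AmbryPostDemo {
  public static void main(String[] args) throws Exception {
    byte[] data = Files.readAllBytes(Paths.get("demo.gif"));
    HttpURLConnection conn = (HttpURLConnection) new URL("http://localhost:1174/").openConnection();
    conn.setRequestMethod("POST");
    conn.setDoOutput(true);
    // Headers that act as blob properties, mirroring the cURL example above
    conn.setRequestProperty("x-ambry-service-id", "CUrlUpload");
    conn.setRequestProperty("x-ambry-content-type", "image/gif");
    conn.setRequestProperty("x-ambry-um-description", "Demonstration Image");
    try (OutputStream out = conn.getOutputStream()) {
      out.write(data);  // send the binary blob content
    }
    // On success (201 Created), the Location response header carries the new blob ID
    System.out.println(conn.getResponseCode() + " -> " + conn.getHeaderField("Location"));
  }
}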
GET - Blob Info
Now that we stored a blob, let us verify some properties of the blob we uploaded.
$ curl -i http://localhost:1174/AmbryID/BlobInfo
HTTP/1.1 200 OK
x-ambry-blob-size: {Blob size}
x-ambry-service-id: CUrlUpload
x-ambry-creation-time: {Creation time}
x-ambry-private: false
x-ambry-content-type: image/gif
x-ambry-owner-id: {username}
x-ambry-um-desc: Demonstration Image
Content-Length: 0
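From Java, the same information can be read off the response headers; a short sketch continuing from the upload example, where blobId is the ID returned in the Location header:
// Blob properties and user metadata come back as x-ambry-* response headers
HttpURLConnection info = (HttpURLConnection) new URL("http://localhost:1174/" + blobId + "/BlobInfo").openConnection();
System.out.println("size:         " + info.getHeaderField("x-ambry-blob-size"));
System.out.println("content type: " + info.getHeaderField("x-ambry-content-type"));
System.out.println("description:  " + info.getHeaderField("x-ambry-um-desc"));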
GET - Blob
Now that we have verified that Ambry returns properties correctly, let us obtain the actual blob.
$ curl http://localhost:1174/AmbryID > demo-downloaded.gif
$ diff demo.gif demo-downloaded.gif
$
This confirms that the data that was sent in the POST request matches what we received in the GET. If you would like to see the image, simply point your browser to http://localhost:1174/AmbryID and you should see the image that was uploaded!
DELETE
Ambry is an immutable store: blobs cannot be updated, but they can be deleted to make them irretrievable. Let us go ahead and delete the blob we just created.
$ curl -i -X DELETE http://localhost:1174/AmbryID
HTTP/1.1 202 Accepted
Content-Length: 0
You will no longer be able to retrieve the blob properties or data.
$ curl -i http://localhost:1174/AmbryID/BlobInfo
HTTP/1.1 410 Gone
Content-Type: text/plain; charset=UTF-8
Content-Length: 17
Connection: close
Failure: 410 Gone
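The same delete can be issued from Java with a few lines on top of the earlier sketch:
HttpURLConnection del = (HttpURLConnection) new URL("http://localhost:1174/" + blobId).openConnection();
del.setRequestMethod("DELETE");
// Expect 202 Accepted; subsequent requests for the blob will return 410 Gone
System.out.println(del.getResponseCode());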
Step 5: Stop the frontend and server.
$ kill -15 frontendProcessID
$ kill -15 serverProcessID
You can confirm that the services have been shut down by looking at the logs.
Additional information:
In addition to the simple APIs demonstrated above, Ambry provides support for GET of only the user metadata and for HEAD. Besides the POST of binary data that was demonstrated, Ambry also supports POST of multipart/form-data via cURL or web forms.
Other features of interest include:
- Time To Live (TTL): During POST, a TTL in seconds can be provided through the addition of a header named x-ambry-ttl. This means that Ambry will stop serving the blob after the TTL has expired. On GET, expired blobs behave the same way as deleted blobs.
- Private: During POST, providing a header named x-ambry-private with the value true will mark the blob as private. API behavior can be configured based on whether a blob is public or private.
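Both headers slot straight into the HttpURLConnection upload sketch from Step 4; the values below (a one-hour TTL, a private blob) are only illustrative:
conn.setRequestProperty("x-ambry-ttl", "3600");      // stop serving the blob after 3600 seconds
conn.setRequestProperty("x-ambry-private", "true");  // mark the blob as private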