Top Related Projects
gpu.js: GPU Accelerated JavaScript
ArrayFire: a general purpose GPU library
Numba: NumPy aware dynamic Python compiler using LLVM
TensorFlow: An Open Source Machine Learning Framework for Everyone
PyTorch: Tensors and Dynamic neural networks in Python with strong GPU acceleration
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
Quick Overview
Emu is a high-performance, GPU-accelerated computing framework for Rust. It allows developers to write Rust code that can seamlessly run on both CPUs and GPUs, providing a unified programming model for heterogeneous computing.
Pros
- Seamless integration of GPU computing in Rust without separate CUDA or OpenCL code
- High-level abstractions for parallel computing, making GPU programming more accessible
- Automatic memory management and data transfer between CPU and GPU
- Supports both NVIDIA and AMD GPUs through various backends
Cons
- Still in early development, may have stability issues or incomplete features
- Limited documentation and examples compared to more established GPU computing frameworks
- Potential performance overhead compared to hand-optimized CUDA or OpenCL code
- Requires a compatible GPU and drivers for full functionality
Code Examples
- Basic vector addition:
use emu_core::prelude::*;

fn main() {
    let a = vec![1.0; 1_000_000];
    let b = vec![2.0; 1_000_000];
    let mut c = vec![0.0; 1_000_000];
    emu_core::exec(|(a, b, c): (&[f32], &[f32], &mut [f32])| {
        for i in 0..a.len() {
            c[i] = a[i] + b[i];
        }
    }).run((&a, &b, &mut c));
}
- Matrix multiplication:
use emu_core::prelude::*;

fn main() {
    let a = vec![1.0; 1024 * 1024];
    let b = vec![2.0; 1024 * 1024];
    let mut c = vec![0.0; 1024 * 1024];
    emu_core::exec(|(a, b, c): (&[f32], &[f32], &mut [f32])| {
        for i in 0..1024 {
            for j in 0..1024 {
                let mut sum = 0.0;
                for k in 0..1024 {
                    sum += a[i * 1024 + k] * b[k * 1024 + j];
                }
                c[i * 1024 + j] = sum;
            }
        }
    }).run((&a, &b, &mut c));
}
- Custom kernel with shared memory:
use emu_core::prelude::*;

fn main() {
    let mut data = vec![1.0; 1024];
    emu_core::exec(|data: &mut [f32]| {
        let tid = emu_core::thread_idx().x;
        // stage a block-sized slice of the input in shared memory
        let mut shared = emu_core::shared::<[f32; 256]>();
        shared[tid as usize] = data[tid as usize];
        emu_core::sync_threads();
        if tid < 256 {
            data[tid as usize] = shared[tid as usize] * 2.0;
        }
    }).config(256, 1, 256).run(&mut data);
}
Getting Started
- Add Emu to your Cargo.toml:

  [dependencies]
  emu_core = "0.1"

- Import Emu in your Rust code:

  use emu_core::prelude::*;

- Write your GPU-accelerated code using Emu's exec function and run it:

  emu_core::exec(|data: &mut [f32]| {
      // Your GPU code here
  }).run(&mut data);

- Build and run your project with cargo run --release. A complete sketch combining these steps follows this list.
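Putting the steps together, a complete program might look roughly like the sketch below. This assumes the closure-based exec API used in the examples above (not the lower-level compile/spawn API shown in the README section further down); the closure simply doubles every element.

use emu_core::prelude::*;

// A minimal sketch combining the steps above, assuming the closure-based
// exec API from the earlier examples. The closure doubles every element.
fn main() {
    let mut data = vec![1.0f32; 1024];
    emu_core::exec(|data: &mut [f32]| {
        for i in 0..data.len() {
            data[i] *= 2.0;
        }
    }).run(&mut data);
    println!("data[0] = {}", data[0]); // expect 2.0 after the GPU pass
}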
Competitor Comparisons
GPU Accelerated JavaScript
Pros of gpu.js
- More mature and widely adopted project with a larger community
- Supports a broader range of GPU operations and functions
- Better documentation and examples available
Cons of gpu.js
- Steeper learning curve for beginners
- More complex setup process for certain use cases
- Limited support for some advanced GPU features
Code Comparison
emu:
#[gpu_use]
fn add_vectors(a: &[f32], b: &[f32]) -> Vec<f32> {
a.iter().zip(b.iter()).map(|(x, y)| x + y).collect()
}
gpu.js:
const gpu = new GPU();
const addVectors = gpu.createKernel(function(a, b) {
return a[this.thread.x] + b[this.thread.x];
}).setOutput([1024]);
Key Differences
- emu uses Rust with GPU annotations, while gpu.js uses JavaScript
- emu's syntax is closer to standard Rust, making it easier for Rust developers
- gpu.js provides more fine-grained control over GPU operations
- emu has a simpler API, potentially making it more accessible for newcomers to GPU programming
Use Cases
- emu: Better suited for Rust projects and developers familiar with Rust
- gpu.js: Ideal for web-based applications and JavaScript developers
Performance
- Both libraries offer significant performance improvements over CPU-based computations
- gpu.js may have an edge in certain scenarios due to its maturity and optimization
ArrayFire: a general purpose GPU library.
Pros of ArrayFire
- More mature and widely-used library with extensive documentation
- Supports multiple backends (CPU, CUDA, OpenCL) for cross-platform acceleration
- Offers a broader range of optimized functions for scientific computing and signal processing
Cons of ArrayFire
- Larger codebase and more complex API, potentially steeper learning curve
- Requires external dependencies and setup for GPU acceleration
- May have higher overhead for simple operations compared to Emu's lightweight approach
Code Comparison
Emu:
let x = emu::array([1, 2, 3, 4]);
let y = emu::array([5, 6, 7, 8]);
let z = x + y;
ArrayFire:
float hx[] = {1, 2, 3, 4};
float hy[] = {5, 6, 7, 8};
af::array x(4, hx);
af::array y(4, hy);
af::array z = x + y;
Summary
ArrayFire is a more comprehensive and mature library for GPU-accelerated computing, offering cross-platform support and a wide range of optimized functions. Emu, on the other hand, provides a simpler, Rust-native approach to GPU acceleration with a focus on ease of use. While ArrayFire may be better suited for complex scientific computing tasks, Emu could be preferable for Rust developers looking for a lightweight GPU acceleration solution.
NumPy aware dynamic Python compiler using LLVM
Pros of Numba
- More mature and widely adopted project with extensive documentation
- Supports a broader range of Python and NumPy features
- Offers both automatic and manual optimization options
Cons of Numba
- Steeper learning curve for advanced usage
- Limited support for object-oriented programming constructs
- May require more manual intervention for optimal performance
Code Comparison
Emu (Rust):

#[gpu_use]
fn add_vectors(a: &[f32], b: &[f32]) -> Vec<f32> {
    a.iter().zip(b.iter()).map(|(x, y)| x + y).collect()
}

let result = add_vectors(&[1.0, 2.0, 3.0], &[4.0, 5.0, 6.0]);
Numba:
from numba import jit
import numpy as np
@jit(nopython=True)
def add_vectors(a, b):
    return a + b

result = add_vectors(np.array([1, 2, 3]), np.array([4, 5, 6]))
Both Emu and Numba aim to accelerate numerical computations, but they target different languages: Numba JIT-compiles Python/NumPy functions, while Emu brings GPU acceleration to Rust. Emu focuses on simplicity and ease of use, while Numba offers more advanced features and optimization options, with greater flexibility and performance potential for complex use cases within the Python ecosystem. The code comparison shows Numba's decorator-based approach to JIT compilation alongside Emu's attribute-based approach in Rust.
An Open Source Machine Learning Framework for Everyone
Pros of TensorFlow
- Extensive ecosystem with robust tools and libraries
- Highly optimized for large-scale machine learning and deep learning
- Strong community support and extensive documentation
Cons of TensorFlow
- Steeper learning curve for beginners
- Can be overkill for smaller projects or simpler machine learning tasks
- Slower development cycle compared to more lightweight alternatives
Code Comparison
TensorFlow:
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
Emu:
use emu_core::*;

let model = sequential![
    dense(64).with_activation(Activation::ReLU),
    dense(10).with_activation(Activation::Softmax)
];
Summary
TensorFlow is a powerful, industry-standard framework for machine learning and deep learning, offering a comprehensive ecosystem and excellent performance for large-scale projects. However, it can be complex for beginners and may be excessive for simpler tasks.
Emu, on the other hand, is a newer, lightweight GPU-acceleration library written in Rust. It is a general-purpose GPGPU library rather than a dedicated machine learning framework, aiming to provide a simpler, more intuitive API for GPU compute, which can make it more accessible for smaller projects or for building custom kernels. However, it lacks TensorFlow's extensive ML-specific features and community support.
Tensors and Dynamic neural networks in Python with strong GPU acceleration
Pros of PyTorch
- Mature, widely-used framework with extensive community support and documentation
- Comprehensive set of tools for deep learning, including neural network layers, optimizers, and data loading utilities
- Seamless integration with CUDA for GPU acceleration
Cons of PyTorch
- Steeper learning curve for beginners compared to Emu's simplified API
- Larger codebase and more dependencies, potentially leading to longer compilation times
- Less focus on GPU memory optimization compared to Emu's automatic memory management
Code Comparison
PyTorch example:
import torch
x = torch.tensor([1, 2, 3])
y = torch.tensor([4, 5, 6])
z = torch.matmul(x, y)
Emu example:
use emu_core::*;
let x = tensor([1, 2, 3]);
let y = tensor([4, 5, 6]);
let z = x.matmul(y);
Summary
PyTorch offers a robust, feature-rich environment for deep learning with extensive community support. Emu provides a simpler, Rust-native alternative focused on general-purpose GPU compute. While PyTorch is more widely adopted and offers a far broader range of deep learning tools, Emu may appeal to Rust projects that want lightweight GPU acceleration and ease of use.
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
Pros of ONNX Runtime
- Broader industry support and adoption
- More comprehensive feature set for production deployments
- Extensive documentation and community resources
Cons of ONNX Runtime
- Steeper learning curve for beginners
- Larger codebase and more complex architecture
- May be overkill for simpler projects or prototypes
Code Comparison
ONNX Runtime example:
import onnxruntime as ort
session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name
output = session.run(None, {input_name: input_data})
Emu example:
use emu_core::*;
let mut session = Session::new();
session.compile(include_str!("model.emu"));
let output = session.run(vec![("input", input_data)]);
Summary
ONNX Runtime is a more mature and feature-rich solution for deploying machine learning models, particularly in production environments. It offers broader compatibility and extensive resources but may be more complex for beginners.
Emu, on the other hand, is a general-purpose GPU compute library for Rust rather than a model-deployment runtime. It provides a simpler, more lightweight approach that may suit smaller projects or custom GPU workloads, but it lacks the advanced model-serving features and widespread adoption of ONNX Runtime.
README
The old version of Emu (which used macros) is here.
Overview
Emu is a GPGPU library for Rust with a focus on portability, modularity, and performance.
It's a CUDA-esque compute-specific abstraction over WebGPU providing specific functionality to make WebGPU feel more like CUDA. Here's a quick run-down of highlight features...
- Emu can run anywhere - Emu uses WebGPU to support DirectX, Metal, Vulkan (and also OpenGL and browser eventually) as compile targets. This allows Emu to run on pretty much any user interface including desktop, mobile, and browser. By moving heavy computations to the user's device, you can reduce system latency and improve privacy.
- Emu makes compute easier - Emu makes WebGPU feel like CUDA. It does this by providing...
  - DeviceBox<T> as a wrapper for data that lives on the GPU (thereby ensuring type-safe data movement)
  - DevicePool as a no-config auto-managed pool of devices (similar to CUDA)
  - trait Cache - a no-setup-required LRU cache of JITed compute kernels
- Emu is transparent - Emu is a fully transparent abstraction. This means, at any point, you can decide to remove the abstraction and work directly with WebGPU constructs with zero overhead. For example, if you want to mix Emu with WebGPU-based graphics, you can do that with zero overhead. You can also swap out the JIT compiler artifact cache with your own cache, manage the device pool if you wish, and define your own compile-to-SPIR-V compiler that interops with Emu.
- Emu is asynchronous - Emu is fully asynchronous. Most API calls will be non-blocking and can be synchronized by calls to DeviceBox::get when data is read back from the device (a short round-trip sketch follows this list).
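To make that concrete, here is a minimal round-trip sketch. It uses only calls that appear in the full example below (assert_device_pool_initialized, as_device_boxed, and DeviceBox::get).

use emu_core::prelude::*;

// A minimal round trip: initialize the device pool, move a small Vec onto
// the GPU as a DeviceBox, then read it back with the asynchronous get().
fn main() -> Result<(), Box<dyn std::error::Error>> {
    futures::executor::block_on(assert_device_pool_initialized());
    let data: DeviceBox<[f32]> = vec![0.5f32; 8].as_device_boxed()?;
    // get() returns a Future, so we block on it here to read the data back
    println!("{:?}", futures::executor::block_on(data.get())?);
    Ok(())
}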
An example
Here's a quick example of Emu. You can find more in emu_core/examples and the most recent documentation here.
First, we just import a bunch of stuff
use emu_glsl::*;
use emu_core::prelude::*;
use zerocopy::*;
We can define types of structures so that they can be safely serialized and deserialized to/from the GPU.
#[repr(C)]
#[derive(AsBytes, FromBytes, Copy, Clone, Default, Debug)]
struct Rectangle {
    x: u32,
    y: u32,
    w: i32,
    h: i32,
}
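As a small illustration of what those derives provide (a sketch using zerocopy directly, not an Emu API), AsBytes lets a Rectangle be viewed as raw bytes without copying, which is what makes data movement to and from the GPU type-safe:

use zerocopy::AsBytes;

// The AsBytes derive gives Rectangle a zero-copy byte view; the repr(C),
// padding-free layout is what makes this safe.
fn rectangle_as_bytes_demo() {
    let r = Rectangle { x: 10, y: 20, w: 4, h: 3 };
    let bytes: &[u8] = r.as_bytes();
    assert_eq!(bytes.len(), std::mem::size_of::<Rectangle>());
}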
For this example, we make this entire function async but in reality you will only want small blocks of code to be async (like a bunch of asynchronous memory transfers and computation) and these blocks will be sent off to an executor to execute. You definitely don't want to do something like this where you are blocking (by doing an entire compilation step) in your async code.
async fn do_some_stuff() -> Result<(), Box<dyn std::error::Error>> {
    assert_device_pool_initialized().await;

    // first, we move a bunch of rectangles to the GPU
    let mut x: DeviceBox<[Rectangle]> = vec![Default::default(); 128].as_device_boxed()?;

    // then we compile some GLSL code using the GlslCompile compiler and
    // the GlobalCache for caching compiler artifacts
    let c = compile::<String, GlslCompile, _, GlobalCache>(
        GlslBuilder::new()
            .set_entry_point_name("main")
            .add_param_mut()
            .set_code_with_glsl(
                r#"
#version 450
layout(local_size_x = 1) in; // our thread block size is 1, that is we only have 1 thread per block

struct Rectangle {
    uint x;
    uint y;
    int w;
    int h;
};

// make sure to use only a single set and keep all your n parameters in n storage buffers in bindings 0 to n-1
// you shouldn't use push constants or anything OTHER than storage buffers for passing stuff into the kernel
// just use buffers with one buffer per binding
layout(set = 0, binding = 0) buffer Rectangles {
    Rectangle[] rectangles;
}; // this is used as both input and output for convenience

Rectangle flip(Rectangle r) {
    r.x = r.x + r.w;
    r.y = r.y + r.h;
    r.w *= -1;
    r.h *= -1;
    return r;
}

// there should be only one entry point and it should be named "main"
// ultimately, Emu has to kind of restrict how you use GLSL because it is compute focused
void main() {
    uint index = gl_GlobalInvocationID.x; // this gives us the index in the x dimension of the thread space
    rectangles[index] = flip(rectangles[index]);
}
"#,
            )
    )?.finish()?;

    // we spawn 128 threads (really 128 thread blocks)
    unsafe {
        spawn(128).launch(call!(c, &mut x));
    }

    // this is the Future we need to block on to get stuff to happen
    // everything else is non-blocking in the API (except stuff like compilation)
    println!("{:?}", x.get().await?);

    Ok(())
}
And last but certainly not least, we use an executor to execute.
fn main() {
    futures::executor::block_on(do_some_stuff()).expect("failed to do stuff on GPU");
}
Built with Emu
Emu is relatively new but has already been used for GPU acceleration in a variety of projects.
- Used in toil for GPU-accelerated linear algebra
- Used in ipl3hasher for hash collision finding
- Used in bigbang for simulating gravitational acceleration (used an older version of Emu)
Getting started
The latest stable version is on Crates.io. To start using Emu, simply add the following to your Cargo.toml.

[dependencies]
emu_core = "0.1.1"
To understand how to start using Emu, check out the docs. If you have any questions, please ask in the Discord.
Contributing
Feedback, discussion, and PRs would all be very much appreciated. Some relatively high-priority, non-API-breaking things that have yet to be implemented are the following, in rough order of priority.

- Ensure that WebGPU polling is done correctly in DeviceBox::get
- Add support for WGSL as input, using Naga for shader compilation
- Add WASM support in Cargo.toml
- Add benchmarks
- Reuse staging buffers between different DeviceBoxes
- Maybe use uniforms for DeviceBox<T> when T is small (maybe)
If you are interested in any of these or anything else, please don't hesitate to open an issue on GitHub or discuss more on Discord.