Top Related Projects
gpu.js: GPU Accelerated JavaScript
ArrayFire: a general purpose GPU library
Numba: NumPy aware dynamic Python compiler using LLVM
TensorFlow: An Open Source Machine Learning Framework for Everyone
PyTorch: Tensors and Dynamic neural networks in Python with strong GPU acceleration
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
Quick Overview
Emu is a high-performance, GPU-accelerated computing framework for Rust. It allows developers to write Rust code that can seamlessly run on both CPUs and GPUs, providing a unified programming model for heterogeneous computing.
Pros
- Seamless integration of GPU computing in Rust without separate CUDA or OpenCL code
- High-level abstractions for parallel computing, making GPU programming more accessible
- Automatic memory management and data transfer between CPU and GPU
- Supports both NVIDIA and AMD GPUs through various backends
Cons
- Still in early development, may have stability issues or incomplete features
- Limited documentation and examples compared to more established GPU computing frameworks
- Potential performance overhead compared to hand-optimized CUDA or OpenCL code
- Requires a compatible GPU and drivers for full functionality
Code Examples
- Basic vector addition:
use emu_core::prelude::*;

fn main() {
    let a = vec![1.0; 1_000_000];
    let b = vec![2.0; 1_000_000];
    let mut c = vec![0.0; 1_000_000];
    emu_core::exec(|(a, b, c): (&[f32], &[f32], &mut [f32])| {
        for i in 0..a.len() {
            c[i] = a[i] + b[i];
        }
    }).run((&a, &b, &mut c));
}
- Matrix multiplication:
use emu_core::prelude::*;

fn main() {
    let a = vec![1.0; 1024 * 1024];
    let b = vec![2.0; 1024 * 1024];
    let mut c = vec![0.0; 1024 * 1024];
    emu_core::exec(|(a, b, c): (&[f32], &[f32], &mut [f32])| {
        for i in 0..1024 {
            for j in 0..1024 {
                let mut sum = 0.0;
                for k in 0..1024 {
                    sum += a[i * 1024 + k] * b[k * 1024 + j];
                }
                c[i * 1024 + j] = sum;
            }
        }
    }).run((&a, &b, &mut c));
}
- Custom kernel with shared memory:
use emu_core::prelude::*;

fn main() {
    let mut data = vec![1.0; 1024];
    emu_core::exec(|data: &mut [f32]| {
        let tid = emu_core::thread_idx().x;
        // stage a block-sized slice of the input in shared memory
        let mut shared = emu_core::shared::<[f32; 256]>();
        shared[tid as usize] = data[tid as usize];
        emu_core::sync_threads();
        if tid < 256 {
            data[tid as usize] = shared[tid as usize] * 2.0;
        }
    }).config(256, 1, 256).run(&mut data);
}
Getting Started
- Add Emu to your Cargo.toml:

  [dependencies]
  emu_core = "0.1"

- Import Emu in your Rust code:

  use emu_core::prelude::*;

- Write your GPU-accelerated code using Emu's exec function and run it:

  emu_core::exec(|data: &mut [f32]| {
      // Your GPU code here
  }).run(&mut data);

- Build and run your project with cargo run --release. A complete sketch combining these steps follows this list.
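Putting the steps together, a complete program might look roughly like the sketch below. This assumes the closure-based exec API used in the examples above (not the lower-level compile/spawn API shown in the README section further down); the closure simply doubles every element.

use emu_core::prelude::*;

// A minimal sketch combining the steps above, assuming the closure-based
// exec API from the earlier examples. The closure doubles every element.
fn main() {
    let mut data = vec![1.0f32; 1024];
    emu_core::exec(|data: &mut [f32]| {
        for i in 0..data.len() {
            data[i] *= 2.0;
        }
    }).run(&mut data);
    println!("data[0] = {}", data[0]); // expect 2.0 after the GPU pass
}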
Competitor Comparisons
GPU Accelerated JavaScript
Pros of gpu.js
- More mature and widely adopted project with a larger community
- Supports a broader range of GPU operations and functions
- Better documentation and examples available
Cons of gpu.js
- Steeper learning curve for beginners
- More complex setup process for certain use cases
- Limited support for some advanced GPU features
Code Comparison
emu:
#[gpu_use]
fn add_vectors(a: &[f32], b: &[f32]) -> Vec<f32> {
a.iter().zip(b.iter()).map(|(x, y)| x + y).collect()
}
gpu.js:
const gpu = new GPU();
const addVectors = gpu.createKernel(function(a, b) {
return a[this.thread.x] + b[this.thread.x];
}).setOutput([1024]);
Key Differences
- emu uses Rust with GPU annotations, while gpu.js uses JavaScript
- emu's syntax is closer to standard Rust, making it easier for Rust developers
- gpu.js provides more fine-grained control over GPU operations
- emu has a simpler API, potentially making it more accessible for newcomers to GPU programming
Use Cases
- emu: Better suited for Rust projects and developers familiar with Rust
- gpu.js: Ideal for web-based applications and JavaScript developers
Performance
- Both libraries offer significant performance improvements over CPU-based computations
- gpu.js may have an edge in certain scenarios due to its maturity and optimization
ArrayFire: a general purpose GPU library.
Pros of ArrayFire
- More mature and widely-used library with extensive documentation
- Supports multiple backends (CPU, CUDA, OpenCL) for cross-platform acceleration
- Offers a broader range of optimized functions for scientific computing and signal processing
Cons of ArrayFire
- Larger codebase and more complex API, potentially steeper learning curve
- Requires external dependencies and setup for GPU acceleration
- May have higher overhead for simple operations compared to Emu's lightweight approach
Code Comparison
Emu:
let x = emu::array([1, 2, 3, 4]);
let y = emu::array([5, 6, 7, 8]);
let z = x + y;
ArrayFire:
float hx[] = {1, 2, 3, 4};
float hy[] = {5, 6, 7, 8};
af::array x(4, hx);
af::array y(4, hy);
af::array z = x + y;
Summary
ArrayFire is a more comprehensive and mature library for GPU-accelerated computing, offering cross-platform support and a wide range of optimized functions. Emu, on the other hand, provides a simpler, Rust-native approach to GPU acceleration with a focus on ease of use. While ArrayFire may be better suited for complex scientific computing tasks, Emu could be preferable for Rust developers looking for a lightweight GPU acceleration solution.
NumPy aware dynamic Python compiler using LLVM
Pros of Numba
- More mature and widely adopted project with extensive documentation
- Supports a broader range of Python and NumPy features
- Offers both automatic and manual optimization options
Cons of Numba
- Steeper learning curve for advanced usage
- Limited support for object-oriented programming constructs
- May require more manual intervention for optimal performance
Code Comparison
Emu (Rust):

#[gpu_use]
fn add_vectors(a: &[f32], b: &[f32]) -> Vec<f32> {
    a.iter().zip(b.iter()).map(|(x, y)| x + y).collect()
}

let result = add_vectors(&[1.0, 2.0, 3.0], &[4.0, 5.0, 6.0]);
Numba:
from numba import jit
import numpy as np
@jit(nopython=True)
def add_vectors(a, b):
    return a + b

result = add_vectors(np.array([1, 2, 3]), np.array([4, 5, 6]))
Both Emu and Numba aim to accelerate numerical computations, but they target different languages: Numba JIT-compiles Python/NumPy functions, while Emu brings GPU acceleration to Rust. Emu focuses on simplicity and ease of use, while Numba offers more advanced features and optimization options, with greater flexibility and performance potential for complex use cases within the Python ecosystem. The code comparison shows Numba's decorator-based approach to JIT compilation alongside Emu's attribute-based approach in Rust.
An Open Source Machine Learning Framework for Everyone
Pros of TensorFlow
- Extensive ecosystem with robust tools and libraries
- Highly optimized for large-scale machine learning and deep learning
- Strong community support and extensive documentation
Cons of TensorFlow
- Steeper learning curve for beginners
- Can be overkill for smaller projects or simpler machine learning tasks
- Slower development cycle compared to more lightweight alternatives
Code Comparison
TensorFlow:
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
Emu:
use emu_core::*;

let model = sequential![
    dense(64).with_activation(Activation::ReLU),
    dense(10).with_activation(Activation::Softmax)
];
Summary
TensorFlow is a powerful, industry-standard framework for machine learning and deep learning, offering a comprehensive ecosystem and excellent performance for large-scale projects. However, it can be complex for beginners and may be excessive for simpler tasks.
Emu, on the other hand, is a newer, lightweight GPU-acceleration library written in Rust. It is a general-purpose GPGPU library rather than a dedicated machine learning framework, aiming to provide a simpler, more intuitive API for GPU compute, which can make it more accessible for smaller projects or for building custom kernels. However, it lacks TensorFlow's extensive ML-specific features and community support.
Tensors and Dynamic neural networks in Python with strong GPU acceleration
Pros of PyTorch
- Mature, widely-used framework with extensive community support and documentation
- Comprehensive set of tools for deep learning, including neural network layers, optimizers, and data loading utilities
- Seamless integration with CUDA for GPU acceleration
Cons of PyTorch
- Steeper learning curve for beginners compared to Emu's simplified API
- Larger codebase and more dependencies, potentially leading to longer compilation times
- Less focus on GPU memory optimization compared to Emu's automatic memory management
Code Comparison
PyTorch example:
import torch
x = torch.tensor([1, 2, 3])
y = torch.tensor([4, 5, 6])
z = torch.matmul(x, y)
Emu example:
use emu_core::*;
let x = tensor([1, 2, 3]);
let y = tensor([4, 5, 6]);
let z = x.matmul(y);
Summary
PyTorch offers a robust, feature-rich environment for deep learning with extensive community support. Emu provides a simpler, Rust-native alternative focused on general-purpose GPU compute. While PyTorch is more widely adopted and offers a far broader range of deep learning tools, Emu may appeal to Rust projects that want lightweight GPU acceleration and ease of use.
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
Pros of ONNX Runtime
- Broader industry support and adoption
- More comprehensive feature set for production deployments
- Extensive documentation and community resources
Cons of ONNX Runtime
- Steeper learning curve for beginners
- Larger codebase and more complex architecture
- May be overkill for simpler projects or prototypes
Code Comparison
ONNX Runtime example:
import onnxruntime as ort
session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name
output = session.run(None, {input_name: input_data})
Emu example:
use emu_core::*;
let mut session = Session::new();
session.compile(include_str!("model.emu"));
let output = session.run(vec![("input", input_data)]);
Summary
ONNX Runtime is a more mature and feature-rich solution for deploying machine learning models, particularly in production environments. It offers broader compatibility and extensive resources but may be more complex for beginners.
Emu, on the other hand, is a general-purpose GPU compute library for Rust rather than a model-deployment runtime. It provides a simpler, more lightweight approach that may suit smaller projects or custom GPU workloads, but it lacks the advanced model-serving features and widespread adoption of ONNX Runtime.
README
The old version of Emu (which used macros) is here.
Overview
Emu is a GPGPU library for Rust with a focus on portability, modularity, and performance.
It's a CUDA-esque compute-specific abstraction over WebGPU providing specific functionality to make WebGPU feel more like CUDA. Here's a quick run-down of highlight features...
- Emu can run anywhere - Emu uses WebGPU to support DirectX, Metal, Vulkan (and also OpenGL and browser eventually) as compile targets. This allows Emu to run on pretty much any user interface including desktop, mobile, and browser. By moving heavy computations to the user's device, you can reduce system latency and improve privacy.
- Emu makes compute easier - Emu makes WebGPU feel like CUDA. It does this by providing...
  - DeviceBox<T> as a wrapper for data that lives on the GPU (thereby ensuring type-safe data movement)
  - DevicePool as a no-config auto-managed pool of devices (similar to CUDA)
  - trait Cache - a no-setup-required LRU cache of JITed compute kernels
- Emu is transparent - Emu is a fully transparent abstraction. This means, at any point, you can decide to remove the abstraction and work directly with WebGPU constructs with zero overhead. For example, if you want to mix Emu with WebGPU-based graphics, you can do that with zero overhead. You can also swap out the JIT compiler artifact cache with your own cache, manage the device pool if you wish, and define your own compile-to-SPIR-V compiler that interops with Emu.
- Emu is asynchronous - Emu is fully asynchronous. Most API calls will be non-blocking and can be synchronized by calls to DeviceBox::get when data is read back from the device (a short round-trip sketch follows this list).
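To make that concrete, here is a minimal round-trip sketch. It uses only calls that appear in the full example below (assert_device_pool_initialized, as_device_boxed, and DeviceBox::get).

use emu_core::prelude::*;

// A minimal round trip: initialize the device pool, move a small Vec onto
// the GPU as a DeviceBox, then read it back with the asynchronous get().
fn main() -> Result<(), Box<dyn std::error::Error>> {
    futures::executor::block_on(assert_device_pool_initialized());
    let data: DeviceBox<[f32]> = vec![0.5f32; 8].as_device_boxed()?;
    // get() returns a Future, so we block on it here to read the data back
    println!("{:?}", futures::executor::block_on(data.get())?);
    Ok(())
}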
An example
Here's a quick example of Emu. You can find more in emu_core/examples and the most recent documentation here.
First, we just import a bunch of stuff
use emu_glsl::*;
use emu_core::prelude::*;
use zerocopy::*;
We can define types of structures so that they can be safely serialized and deserialized to/from the GPU.
#[repr(C)]
#[derive(AsBytes, FromBytes, Copy, Clone, Default, Debug)]
struct Rectangle {
    x: u32,
    y: u32,
    w: i32,
    h: i32,
}
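As a small illustration of what those derives provide (a sketch using zerocopy directly, not an Emu API), AsBytes lets a Rectangle be viewed as raw bytes without copying, which is what makes data movement to and from the GPU type-safe:

use zerocopy::AsBytes;

// The AsBytes derive gives Rectangle a zero-copy byte view; the repr(C),
// padding-free layout is what makes this safe.
fn rectangle_as_bytes_demo() {
    let r = Rectangle { x: 10, y: 20, w: 4, h: 3 };
    let bytes: &[u8] = r.as_bytes();
    assert_eq!(bytes.len(), std::mem::size_of::<Rectangle>());
}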
For this example, we make this entire function async but in reality you will only want small blocks of code to be async (like a bunch of asynchronous memory transfers and computation) and these blocks will be sent off to an executor to execute. You definitely don't want to do something like this where you are blocking (by doing an entire compilation step) in your async code.
async fn do_some_stuff() -> Result<(), Box<dyn std::error::Error>> {
    assert_device_pool_initialized().await;

    // first, we move a bunch of rectangles to the GPU
    let mut x: DeviceBox<[Rectangle]> = vec![Default::default(); 128].as_device_boxed()?;

    // then we compile some GLSL code using the GlslCompile compiler and
    // the GlobalCache for caching compiler artifacts
    let c = compile::<String, GlslCompile, _, GlobalCache>(
        GlslBuilder::new()
            .set_entry_point_name("main")
            .add_param_mut()
            .set_code_with_glsl(
                r#"
#version 450
layout(local_size_x = 1) in; // our thread block size is 1, that is we only have 1 thread per block

struct Rectangle {
    uint x;
    uint y;
    int w;
    int h;
};

// make sure to use only a single set and keep all your n parameters in n storage buffers in bindings 0 to n-1
// you shouldn't use push constants or anything OTHER than storage buffers for passing stuff into the kernel
// just use buffers with one buffer per binding
layout(set = 0, binding = 0) buffer Rectangles {
    Rectangle[] rectangles;
}; // this is used as both input and output for convenience

Rectangle flip(Rectangle r) {
    r.x = r.x + r.w;
    r.y = r.y + r.h;
    r.w *= -1;
    r.h *= -1;
    return r;
}

// there should be only one entry point and it should be named "main"
// ultimately, Emu has to kind of restrict how you use GLSL because it is compute focused
void main() {
    uint index = gl_GlobalInvocationID.x; // this gives us the index in the x dimension of the thread space
    rectangles[index] = flip(rectangles[index]);
}
"#,
            )
    )?.finish()?;

    // we spawn 128 threads (really 128 thread blocks)
    unsafe {
        spawn(128).launch(call!(c, &mut x));
    }

    // this is the Future we need to block on to get stuff to happen
    // everything else is non-blocking in the API (except stuff like compilation)
    println!("{:?}", x.get().await?);

    Ok(())
}
And last but certainly not least, we use an executor to execute.
fn main() {
    futures::executor::block_on(do_some_stuff()).expect("failed to do stuff on GPU");
}
Built with Emu
Emu is relatively new but has already been used for GPU acceleration in a variety of projects.
- Used in toil for GPU-accelerated linear algebra
- Used in ipl3hasher for hash collision finding
- Used in bigbang for simulating gravitational acceleration (used an older version of Emu)
Getting started
The latest stable version is on Crates.io. To start using Emu, simply add the following to your Cargo.toml.

[dependencies]
emu_core = "0.1.1"
To understand how to start using Emu, check out the docs. If you have any questions, please ask in the Discord.
Contributing
Feedback, discussion, and PRs would all be very much appreciated. Some relatively high-priority, non-API-breaking things that have yet to be implemented are the following, in rough order of priority.

- Ensure that WebGPU polling is done correctly in DeviceBox::get
- Add support for WGSL as input, using Naga for shader compilation
- Add WASM support in Cargo.toml
- Add benchmarks
- Reuse staging buffers between different DeviceBoxes
- Maybe use uniforms for DeviceBox<T> when T is small (maybe)
If you are interested in any of these or anything else, please don't hesitate to open an issue on GitHub or discuss more on Discord.