
calebwin/emu

The write-once-run-anywhere GPGPU library for Rust


Top Related Projects

  • gpu.js: GPU Accelerated JavaScript
  • ArrayFire: a general purpose GPU library
  • Numba: NumPy aware dynamic Python compiler using LLVM
  • TensorFlow: An Open Source Machine Learning Framework for Everyone
  • PyTorch: Tensors and Dynamic neural networks in Python with strong GPU acceleration
  • ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator

Quick Overview

Emu is a GPGPU (general-purpose GPU computing) library for Rust with a focus on portability, modularity, and performance. It lets developers write GPU compute code in Rust that runs across graphics backends (Vulkan, Metal, DirectX) through WebGPU, rather than being tied to a single vendor API.

Pros

  • Seamless integration of GPU computing in Rust without separate CUDA or OpenCL code
  • High-level abstractions for parallel computing, making GPU programming more accessible
  • Automatic memory management and data transfer between CPU and GPU
  • Supports both NVIDIA and AMD GPUs through various backends

Cons

  • Still in early development, may have stability issues or incomplete features
  • Limited documentation and examples compared to more established GPU computing frameworks
  • Potential performance overhead compared to hand-optimized CUDA or OpenCL code
  • Requires a compatible GPU and drivers for full functionality

Code Examples

  1. Basic vector addition:
use emu_core::prelude::*;

fn main() {
    let mut a = vec![1.0; 1000000];
    let mut b = vec![2.0; 1000000];
    let mut c = vec![0.0; 1000000];

    emu_core::exec(|(a, b, c): (&[f32], &[f32], &mut [f32])| {
        for i in 0..a.len() {
            c[i] = a[i] + b[i];
        }
    }).run((&a, &b, &mut c));
}
  2. Matrix multiplication:
use emu_core::prelude::*;

fn main() {
    let a = vec![1.0; 1024 * 1024];
    let b = vec![2.0; 1024 * 1024];
    let mut c = vec![0.0; 1024 * 1024];

    emu_core::exec(|(a, b, c): (&[f32], &[f32], &mut [f32])| {
        for i in 0..1024 {
            for j in 0..1024 {
                let mut sum = 0.0;
                for k in 0..1024 {
                    sum += a[i * 1024 + k] * b[k * 1024 + j];
                }
                c[i * 1024 + j] = sum;
            }
        }
    }).run((&a, &b, &mut c));
}
  3. Custom kernel with shared memory:
use emu_core::prelude::*;

fn main() {
    let mut data = vec![1.0; 1024];

    emu_core::exec(|data: &mut [f32]| {
        let tid = emu_core::thread_idx().x;
        let shared = emu_core::shared::<[f32; 256]>();

        shared[tid as usize] = data[tid as usize];
        emu_core::sync_threads();

        if tid < 256 {
            data[tid as usize] = shared[tid as usize] * 2.0;
        }
    }).config(256, 1, 256).run(&mut data);
}

Getting Started

  1. Add Emu to your Cargo.toml:

    [dependencies]
    emu_core = "0.1"
    
  2. Import Emu in your Rust code:

    use emu_core::prelude::*;
    
  3. Write your GPU-accelerated code using Emu's exec function and run it:

    emu_core::exec(|data: &mut [f32]| {
        // Your GPU code here
    }).run(&mut data);
    
  4. Build and run your project with cargo run --release
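
Putting the steps together, here is a minimal sketch of a complete src/main.rs. It uses the same illustrative exec-style API as the code examples above; treat emu_core::exec and run as placeholders from those examples rather than the crate's verified interface (the published emu_core API shown in the README section further down is lower-level).

use emu_core::prelude::*;

fn main() {
    // data starts on the CPU as ordinary Vecs
    let a = vec![1.0f32; 1_000_000];
    let b = vec![2.0f32; 1_000_000];
    let mut c = vec![0.0f32; 1_000_000];

    // the closure is the work to run on the GPU; run() handles the data transfer
    emu_core::exec(|(a, b, c): (&[f32], &[f32], &mut [f32])| {
        for i in 0..a.len() {
            c[i] = a[i] + b[i];
        }
    })
    .run((&a, &b, &mut c));

    println!("c[0] = {}", c[0]); // expect 3.0
}

Build with cargo run --release as in step 4; a debug build will work but can hide much of the benefit behind unoptimized host-side code.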

Competitor Comparisons

gpu.js: GPU Accelerated JavaScript

Pros of gpu.js

  • More mature and widely adopted project with a larger community
  • Supports a broader range of GPU operations and functions
  • Better documentation and examples available

Cons of gpu.js

  • Steeper learning curve for beginners
  • More complex setup process for certain use cases
  • Limited support for some advanced GPU features

Code Comparison

Emu:

#[gpu_use]
fn add_vectors(a: &[f32], b: &[f32]) -> Vec<f32> {
    a.iter().zip(b.iter()).map(|(x, y)| x + y).collect()
}

gpu.js:

const gpu = new GPU();
const addVectors = gpu.createKernel(function(a, b) {
  return a[this.thread.x] + b[this.thread.x];
}).setOutput([1024]);

Key Differences

  • Emu uses Rust with GPU annotations, while gpu.js uses JavaScript
  • Emu's syntax is closer to standard Rust, making it easier for Rust developers
  • gpu.js provides more fine-grained control over GPU operations
  • Emu has a simpler API, potentially making it more accessible for newcomers to GPU programming

Use Cases

  • Emu: Better suited for Rust projects and developers familiar with Rust
  • gpu.js: Ideal for web-based applications and JavaScript developers

Performance

  • Both libraries offer significant performance improvements over CPU-based computations
  • gpu.js may have an edge in certain scenarios due to its maturity and optimization

ArrayFire: a general purpose GPU library.

Pros of ArrayFire

  • More mature and widely-used library with extensive documentation
  • Supports multiple backends (CPU, CUDA, OpenCL) for cross-platform acceleration
  • Offers a broader range of optimized functions for scientific computing and signal processing

Cons of ArrayFire

  • Larger codebase and more complex API, potentially steeper learning curve
  • Requires external dependencies and setup for GPU acceleration
  • May have higher overhead for simple operations compared to Emu's lightweight approach

Code Comparison

Emu:

let x = emu::array([1, 2, 3, 4]);
let y = emu::array([5, 6, 7, 8]);
let z = x + y;

ArrayFire:

float hx[] = {1, 2, 3, 4};
float hy[] = {5, 6, 7, 8};
af::array x(4, hx);
af::array y(4, hy);
af::array z = x + y;

Summary

ArrayFire is a more comprehensive and mature library for GPU-accelerated computing, offering cross-platform support and a wide range of optimized functions. Emu, on the other hand, provides a simpler, Rust-native approach to GPU acceleration with a focus on ease of use. While ArrayFire may be better suited for complex scientific computing tasks, Emu could be preferable for Rust developers looking for a lightweight GPU acceleration solution.

Numba: NumPy aware dynamic Python compiler using LLVM

Pros of Numba

  • More mature and widely adopted project with extensive documentation
  • Supports a broader range of Python and NumPy features
  • Offers both automatic and manual optimization options

Cons of Numba

  • Steeper learning curve for advanced usage
  • Limited support for object-oriented programming constructs
  • May require more manual intervention for optimal performance

Code Comparison

Emu:

#[gpu_use]
fn add_vectors(a: &[f32], b: &[f32]) -> Vec<f32> {
    a.iter().zip(b.iter()).map(|(x, y)| x + y).collect()
}

let result = add_vectors(&[1.0, 2.0, 3.0], &[4.0, 5.0, 6.0]);

Numba:

from numba import jit
import numpy as np

@jit(nopython=True)
def add_vectors(a, b):
    return a + b

result = add_vectors(np.array([1, 2, 3]), np.array([4, 5, 6]))

Both Emu and Numba aim to accelerate numerical code, but they target different languages: Numba JIT-compiles Python and NumPy functions, while Emu accelerates Rust code on the GPU. Emu focuses on simplicity and ease of use, while Numba offers more advanced features and optimization options, giving it greater flexibility and performance potential for complex use cases. The code comparison shows that both take an annotation-based approach, with Numba using a decorator and explicit optimization flags and Emu using a Rust attribute.

TensorFlow: An Open Source Machine Learning Framework for Everyone

Pros of TensorFlow

  • Extensive ecosystem with robust tools and libraries
  • Highly optimized for large-scale machine learning and deep learning
  • Strong community support and extensive documentation

Cons of TensorFlow

  • Steeper learning curve for beginners
  • Can be overkill for smaller projects or simpler machine learning tasks
  • Slower development cycle compared to more lightweight alternatives

Code Comparison

TensorFlow:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

Emu:

use emu_core::*;

let model = sequential![
    dense(64).with_activation(Activation::ReLU),
    dense(10).with_activation(Activation::Softmax)
];

Summary

TensorFlow is a powerful, industry-standard framework for machine learning and deep learning, offering a comprehensive ecosystem and excellent performance for large-scale projects. However, it can be complex for beginners and may be excessive for simpler tasks.

Emu, on the other hand, is a newer, lightweight GPU computing library written in Rust; it is a general-purpose GPGPU tool rather than a machine learning framework. Its small, Rust-native API can make GPU acceleration more approachable for smaller projects, but it does not offer TensorFlow's ML-specific features, ecosystem, or community support.

PyTorch: Tensors and Dynamic neural networks in Python with strong GPU acceleration

Pros of PyTorch

  • Mature, widely-used framework with extensive community support and documentation
  • Comprehensive set of tools for deep learning, including neural network layers, optimizers, and data loading utilities
  • Seamless integration with CUDA for GPU acceleration

Cons of PyTorch

  • Steeper learning curve for beginners compared to Emu's simplified API
  • Larger codebase and more dependencies, potentially leading to longer compilation times
  • Less focus on GPU memory optimization compared to Emu's automatic memory management

Code Comparison

PyTorch example:

import torch

x = torch.tensor([1, 2, 3])
y = torch.tensor([4, 5, 6])
z = torch.matmul(x, y)

Emu example:

use emu_core::*;

let x = tensor([1, 2, 3]);
let y = tensor([4, 5, 6]);
let z = x.matmul(y);

Summary

PyTorch offers a robust, feature-rich environment for deep learning with extensive community support. Emu is a lighter-weight, Rust-native library aimed at general-purpose GPU compute rather than deep learning. While PyTorch is more widely adopted and offers a far broader range of tools, Emu may appeal to Rust projects that mainly need GPU acceleration of custom computations and value a small, easy-to-use API.

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator

Pros of ONNX Runtime

  • Broader industry support and adoption
  • More comprehensive feature set for production deployments
  • Extensive documentation and community resources

Cons of ONNX Runtime

  • Steeper learning curve for beginners
  • Larger codebase and more complex architecture
  • May be overkill for simpler projects or prototypes

Code Comparison

ONNX Runtime example:

import onnxruntime as ort

session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name
output = session.run(None, {input_name: input_data})

Emu example:

use emu_core::*;

let mut session = Session::new();
session.compile(include_str!("model.emu"));
let output = session.run(vec![("input", input_data)]);

Summary

ONNX Runtime is a more mature and feature-rich solution for deploying machine learning models, particularly in production environments. It offers broader compatibility and extensive resources but may be more complex for beginners.

Emu, on the other hand, is a lightweight GPGPU library for Rust rather than a model-inference runtime, so it is better suited to projects that want to run custom GPU compute kernels than to teams deploying pretrained models. It also lacks the advanced deployment features and widespread adoption of ONNX Runtime.


README

The old version of Emu (which used macros) is here.


Overview

Emu is a GPGPU library for Rust with a focus on portability, modularity, and performance.

It's a CUDA-esque, compute-focused abstraction over WebGPU that provides functionality to make WebGPU feel more like CUDA. Here's a quick run-down of the highlight features...

  • Emu can run anywhere - Emu uses WebGPU to support DirectX, Metal, Vulkan (and also OpenGL and browser eventually) as compile targets. This allows Emu to run on pretty much any user interface including desktop, mobile, and browser. By moving heavy computations to the user's device, you can reduce system latency and improve privacy.

  • Emu makes compute easier - Emu makes WebGPU feel like CUDA (see the short sketch just after this list). It does this by providing...

    • DeviceBox<T> as a wrapper for data that lives on the GPU (thereby ensuring type-safe data movement)
    • DevicePool as a no-config auto-managed pool of devices (similar to CUDA)
    • trait Cache - a no-setup-required LRU cache of JITed compute kernels.

  • Emu is transparent - Emu is a fully transparent abstraction. This means, at any point, you can decide to remove the abstraction and work directly with WebGPU constructs with zero overhead. For example, if you want to mix Emu with WebGPU-based graphics, you can do that with zero overhead. You can also swap out the JIT compiler artifact cache with your own cache, manage the device pool if you wish, and define your own compile-to-SPIR-V compiler that interops with Emu.

  • Emu is asynchronous - Emu is fully asynchronous. Most API calls will be non-blocking and can be synchronized by calls to DeviceBox::get when data is read back from device.
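
To make DeviceBox and the device pool concrete, here is a minimal round-trip sketch: move a buffer onto the GPU and read it back. It uses the same calls that appear in the full example below (assert_device_pool_initialized, as_device_boxed, DeviceBox::get); the exact return types and the blocking-executor boilerplate are assumptions rather than verbatim API documentation.

use emu_core::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // make sure the no-config, auto-managed device pool exists
    futures::executor::block_on(assert_device_pool_initialized());

    // move a buffer onto the GPU inside a type-safe DeviceBox
    let data: DeviceBox<[f32]> = vec![0.5f32; 1024].as_device_boxed()?;

    // most calls are non-blocking; get() is where we synchronize and
    // read the contents back from the device
    let on_host = futures::executor::block_on(data.get())?;
    println!("{:?}", on_host);

    Ok(())
}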

An example

Here's a quick example of Emu. You can find more in emu_core/examples and the most recent documentation here.

First, we just import a bunch of stuff

use emu_glsl::*;
use emu_core::prelude::*;
use zerocopy::*;

We can define structures and derive the zerocopy traits (AsBytes, FromBytes) so that data can be safely serialized and deserialized to/from the GPU.

#[repr(C)]
#[derive(AsBytes, FromBytes, Copy, Clone, Default, Debug)]
struct Rectangle {
    x: u32,
    y: u32,
    w: i32,
    h: i32,
}

For this example, we make this entire function async, but in reality you will only want small blocks of code to be async (like a bunch of asynchronous memory transfers and computation), and those blocks will be sent off to an executor to execute. You definitely don't want to do what this example does and block (by doing an entire compilation step) inside your async code.

async fn do_some_stuff() -> Result<(), Box<dyn std::error::Error>> {
    assert_device_pool_initialized().await;

    // first, we move a bunch of rectangles to the GPU
    let mut x: DeviceBox<[Rectangle]> = vec![Default::default(); 128].as_device_boxed()?;
    
    // then we compile some GLSL code using the GlslCompile compiler and
    // the GlobalCache for caching compiler artifacts
    let c = compile::<String, GlslCompile, _, GlobalCache>(
        GlslBuilder::new()
            .set_entry_point_name("main")
            .add_param_mut()
            .set_code_with_glsl(
            r#"
#version 450
layout(local_size_x = 1) in; // our thread block size is 1, that is we only have 1 thread per block

struct Rectangle {
    uint x;
    uint y;
    int w;
    int h;
};

// make sure to use only a single set and keep all your n parameters in n storage buffers in bindings 0 to n-1
// you shouldn't use push constants or anything OTHER than storage buffers for passing stuff into the kernel
// just use buffers with one buffer per binding
layout(set = 0, binding = 0) buffer Rectangles {
    Rectangle[] rectangles;
}; // this is used as both input and output for convenience

Rectangle flip(Rectangle r) {
    r.x = r.x + r.w;
    r.y = r.y + r.h;
    r.w *= -1;
    r.h *= -1;
    return r;
}

// there should be only one entry point and it should be named "main"
// ultimately, Emu has to kind of restrict how you use GLSL because it is compute focused
void main() {
    uint index = gl_GlobalInvocationID.x; // this gives us the index in the x dimension of the thread space
    rectangles[index] = flip(rectangles[index]);
}
            "#,
        )
    )?.finish()?;
    
    // we spawn 128 threads (really 128 thread blocks)
    unsafe {
        spawn(128).launch(call!(c, &mut x));
    }

    // this is the Future we need to await to get stuff to happen
    // everything else is non-blocking in the API (except stuff like compilation)
    println!("{:?}", x.get().await?);

    Ok(())
}

And last but certainly not least, we use an executor to execute.

fn main() {
    futures::executor::block_on(do_some_stuff()).expect("failed to do stuff on GPU");
}
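
If you would rather not block the main thread until the very end, the async function can instead be handed to a thread-pool executor. Here is a sketch assuming the futures crate with its thread-pool feature enabled and that the futures returned by Emu are Send; do_some_stuff is the async function defined above.

use futures::executor::ThreadPool;
use futures::task::SpawnExt;

fn main() {
    // build a small thread pool to drive the GPU work in the background
    let pool = ThreadPool::new().expect("failed to build thread pool");

    let handle = pool
        .spawn_with_handle(async {
            do_some_stuff().await.expect("failed to do stuff on GPU");
        })
        .expect("failed to spawn GPU work");

    // ... do other CPU-side work here ...

    // block only when the GPU work actually has to be finished
    futures::executor::block_on(handle);
}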

Built with Emu

Emu is relatively new but has already been used for GPU acceleration in a variety of projects.

  • Used in toil for GPU-accelerated linear algebra
  • Used in ipl3hasher for hash collision finding
  • Used in bigbang for simulating gravitational acceleration (uses an older version of Emu)

Getting started

The latest stable version is on Crates.io. To start using Emu, simply add the following line to your Cargo.toml.

[dependencies]
emu_core = "0.1.1"

To understand how to start using Emu, check out the docs. If you have any questions, please ask in the Discord.

Contributing

Feedback, discussion, and PRs would all be very much appreciated. Some relatively high-priority, non-API-breaking things that have yet to be implemented are listed below, in rough order of priority.

  • Ensure that WebGPU polling is done correctly in DeviceBox::get
  • Add support for WGSL as input, use Naga for shader compilation
  • Add WASM support in Cargo.toml
  • Add benchmarks
  • Reuse staging buffers between different DeviceBoxes
  • Maybe use uniforms for DeviceBox<T> when T is small (maybe)

If you are interested in any of these or anything else, please don't hesitate to open an issue on GitHub or discuss more on Discord.