ComputeLibrary
The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
Top Related Projects
An Open Source Machine Learning Framework for Everyone
Tensors and Dynamic neural networks in Python with strong GPU acceleration
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
Open deep learning compiler stack for cpu, gpu and specialized accelerators
OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
oneAPI Deep Neural Network Library (oneDNN)
Quick Overview
The ARM-software/ComputeLibrary is an open-source collection of low-level machine learning functions optimized for ARM Cortex-A CPUs and ARM Mali GPUs. It provides a comprehensive set of software components to accelerate computer vision and machine learning workloads on ARM-based platforms, including mobile devices, embedded systems, and IoT devices.
Pros
- Highly optimized for ARM architectures, providing excellent performance for machine learning tasks
- Supports a wide range of ARM CPUs and Mali GPUs, ensuring compatibility across various devices
- Includes implementations for popular neural network operations, computer vision algorithms, and linear algebra functions
- Regularly updated with new features and optimizations
Cons
- Steep learning curve for developers not familiar with low-level programming or ARM architectures
- Limited documentation and examples compared to some higher-level machine learning frameworks
- Primarily focused on ARM platforms, which may limit portability to other architectures
- Requires careful memory management and optimization for best performance
Code Examples
- Initializing and running a simple convolution operation:
#include <arm_compute/runtime/NEON/NEFunctions.h>
#include <arm_compute/runtime/Tensor.h>
#include <arm_compute/core/Types.h>
using namespace arm_compute;
// Create tensors
Tensor src, weights, dst;
// Configure tensor shapes and data types
src.allocator()->init(TensorInfo(TensorShape(128, 128, 3), 1, DataType::F32));
weights.allocator()->init(TensorInfo(TensorShape(3, 3, 3), 1, DataType::F32));
dst.allocator()->init(TensorInfo(TensorShape(126, 126, 1), 1, DataType::F32));
// Create and configure convolution layer
NEConvolutionLayer conv;
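// PadStrideInfo(stride_x, stride_y, pad_x, pad_y): stride 1 in both directions, no padding; nullptr below means no bias tensor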
conv.configure(&src, &weights, nullptr, &dst, PadStrideInfo(1, 1, 0, 0));
// Allocate tensors
src.allocator()->allocate();
weights.allocator()->allocate();
dst.allocator()->allocate();
// Fill tensors with data (omitted for brevity)
// Run convolution
conv.run();
- Performing matrix multiplication:
#include <arm_compute/runtime/NEON/NEFunctions.h>
#include <arm_compute/runtime/Tensor.h>
#include <arm_compute/core/Types.h>
using namespace arm_compute;
// Create tensors
Tensor a, b, c;
// Configure tensor shapes and data types
a.allocator()->init(TensorInfo(TensorShape(128, 64), 1, DataType::F32));
b.allocator()->init(TensorInfo(TensorShape(64, 32), 1, DataType::F32));
c.allocator()->init(TensorInfo(TensorShape(128, 32), 1, DataType::F32));
// Create and configure matrix multiplication
NEGEMM gemm;
gemm.configure(&a, &b, nullptr, &c, 1.0f, 0.0f);
// Allocate tensors
a.allocator()->allocate();
b.allocator()->allocate();
c.allocator()->allocate();
// Fill tensors with data (omitted for brevity)
// Run matrix multiplication
gemm.run();
- Applying a softmax function:
#include <arm_compute/runtime/NEON/NEFunctions.h>
#include <arm_compute/runtime/Tensor.h>
#include <arm_compute/core/Types.h>
using namespace arm_compute;
// Create tensors
Tensor input, output;
// Configure tensor shapes and data types
input.allocator()->init(TensorInfo(TensorShape(10), 1, DataType::F32));
output.allocator()->init(TensorInfo(TensorShape(10), 1, DataType::F32));
// Create and configure softmax function
NESoftmaxLayer softmax;
softmax.configure(&input, &output);
// Allocate tensors
input.allocator()->allocate();
output.allocator()->allocate();
// Fill input tensor with data (omitted for brevity)
// Run softmax
softmax.run();
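All three NEON examples above execute their workloads through the library's CPU scheduler. As a minimal sketch (assuming the default NEScheduler backend), the worker thread count can be capped before calling run():
#include <arm_compute/runtime/NEON/NEScheduler.h>
using namespace arm_compute;
// Limit the CPU backend to 4 worker threads for all subsequent run() calls
NEScheduler::get().set_num_threads(4);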
Getting Started
- Clone the repository:
git clone https://github.com/ARM-software/ComputeLibrary.git
- Build the library (example native Linux arm64 build; adjust the options to your target):
cd ComputeLibrary
scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=linux arch=arm64-v8a
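Once the build completes, an application can be compiled against the resulting library. The line below is a sketch rather than the official recipe: the include paths, build directory and library names (for example whether a separate libarm_compute_core is produced) depend on the release and the scons options chosen.
# Hypothetical compile/link line for one of the NEON snippets above
g++ -O2 -std=c++14 softmax_example.cpp \
    -IComputeLibrary -IComputeLibrary/include \
    -LComputeLibrary/build -larm_compute \
    -o softmax_example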
Competitor Comparisons
An Open Source Machine Learning Framework for Everyone
Pros of TensorFlow
- Broader ecosystem and community support
- More extensive documentation and learning resources
- Supports a wider range of platforms and devices
Cons of TensorFlow
- Larger footprint and potentially slower performance on ARM devices
- Steeper learning curve for beginners
- Less optimized for specific ARM architectures
Code Comparison
TensorFlow:
import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
ComputeLibrary:
#include <arm_compute/runtime/NEON/NEFunctions.h>
arm_compute::NEFullyConnectedLayer fc1;
arm_compute::NEActivationLayer act1;
fc1.configure(&input, &weights, &biases, &output);
act1.configure(&output, nullptr, arm_compute::ActivationLayerInfo(arm_compute::ActivationLayerInfo::ActivationFunction::RELU));
The TensorFlow code is more concise and abstracts away low-level details, while ComputeLibrary offers more fine-grained control over ARM-specific optimizations. TensorFlow provides a higher-level API, making it easier to build and train models, whereas ComputeLibrary requires more detailed configuration but allows for better performance tuning on ARM devices.
Tensors and Dynamic neural networks in Python with strong GPU acceleration
Pros of PyTorch
- Wider community support and more extensive ecosystem
- Higher-level API, easier for rapid prototyping and research
- Supports dynamic computational graphs, allowing for more flexible model architectures
Cons of PyTorch
- Generally slower performance on ARM-based devices
- Larger memory footprint, which can be a concern on resource-constrained systems
- Less optimized for specific ARM hardware compared to ComputeLibrary
Code Comparison
PyTorch example:
import torch
x = torch.tensor([1, 2, 3])
y = torch.tensor([4, 5, 6])
z = torch.add(x, y)
ComputeLibrary example:
#include <arm_compute/runtime/NEON/NEFunctions.h>
NEArithmeticAddition add;
Tensor x, y, z;
add.configure(&x, &y, &z, ConvertPolicy::SATURATE);
The PyTorch example demonstrates its simplicity and ease of use, while the ComputeLibrary example shows its lower-level approach and direct hardware optimization capabilities.
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
Pros of ONNX Runtime
- Broader platform support, including Windows, Linux, macOS, and mobile devices
- Extensive ecosystem integration with popular ML frameworks like PyTorch and TensorFlow
- Advanced optimizations for various hardware accelerators (CPU, GPU, NPU)
Cons of ONNX Runtime
- Larger codebase and potentially higher resource requirements
- May have a steeper learning curve for developers new to ONNX ecosystem
Code Comparison
ComputeLibrary:
NEConvolutionLayer conv;
conv.configure(&input, &weights, &biases, &output, conv_info);
conv.run();
ONNX Runtime:
import onnxruntime

session = onnxruntime.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name
result = session.run([output_name], {input_name: input_data})
Summary
ComputeLibrary is specifically designed for ARM architectures, offering optimized performance on ARM-based devices. ONNX Runtime provides a more versatile solution with broader platform support and integration capabilities. While ComputeLibrary may offer better performance on ARM devices, ONNX Runtime's flexibility and extensive ecosystem make it suitable for a wider range of applications and deployment scenarios.
Open deep learning compiler stack for cpu, gpu and specialized accelerators
Pros of TVM
- Broader hardware support, including CPUs, GPUs, and various AI accelerators
- End-to-end compiler stack for deep learning, offering more flexibility
- Active open-source community with frequent updates and contributions
Cons of TVM
- Steeper learning curve due to its more comprehensive nature
- May have higher overhead for simple tasks on ARM devices
- Less specialized optimization for ARM architectures compared to ComputeLibrary
Code Comparison
TVM example (tensor addition):
import tvm
from tvm import te
n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")
ComputeLibrary example (tensor addition):
#include <arm_compute/runtime/NEON/NEFunctions.h>
using namespace arm_compute;
NEArithmeticAddition addition;
addition.configure(&input1, &input2, &output, ConvertPolicy::SATURATE);
addition.run();
OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
Pros of OpenBLAS
- Wider platform support, including x86, ARM, POWER, and more
- Extensive optimizations for various CPU architectures
- Well-established and widely used in scientific computing
Cons of OpenBLAS
- Primarily focused on linear algebra operations
- Less emphasis on modern machine learning primitives
- May require more manual optimization for specific use cases
Code Comparison
OpenBLAS:
#include <cblas.h>
double A[6] = {1.0, 2.0, 1.0, -3.0, 4.0, -1.0};
double B[6] = {1.0, 2.0, 1.0, -3.0, 4.0, -1.0};
double C[4];  // 2x2 result matrix
cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, 2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);
ComputeLibrary:
#include <arm_compute/runtime/NEON/NEFunctions.h>
NEGEMM gemm;
gemm.configure(&src0, &src1, nullptr, &dst, 1.0f, 0.0f);
gemm.run();
The ComputeLibrary focuses on ARM-specific optimizations and provides higher-level abstractions for machine learning operations, while OpenBLAS offers a more traditional BLAS interface with broader platform support.
oneAPI Deep Neural Network Library (oneDNN)
Pros of oneDNN
- Broader hardware support, including CPUs and GPUs from multiple vendors
- More extensive documentation and examples
- Active community and frequent updates
Cons of oneDNN
- Steeper learning curve due to more complex API
- Potentially larger binary size and memory footprint
Code Comparison
ComputeLibrary
NEGEMMConvolutionLayer conv;
conv.configure(&input, &weights, &biases, &output, conv_info);
conv.run();
oneDNN
auto conv_desc = convolution_forward::desc(prop_kind::forward,
algorithm::convolution_direct, src_md, weights_md, dst_md,
strides, padding_l, padding_r);
auto conv_prim_desc = convolution_forward::primitive_desc(conv_desc, eng);
auto conv = convolution_forward(conv_prim_desc);
conv.execute(stream, {{DNNL_ARG_SRC, src}, {DNNL_ARG_WEIGHTS, weights},
{DNNL_ARG_DST, dst}});
Both libraries offer optimized implementations for deep learning operations, but oneDNN provides a more flexible and portable solution at the cost of increased complexity. ComputeLibrary focuses on ARM architectures, offering a simpler API for specific use cases.
README
⚠ Deprecation Notice (24.01 announcement): NCHW data format specific optimizations will gradually be removed from the code base in future releases. The implication is that users are expected to translate NCHW models into NHWC in order to benefit from the optimizations.
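As a minimal illustration of what that translation implies at the tensor level (the shape below is hypothetical), the data layout can be requested explicitly when a tensor is initialised:
#include <arm_compute/runtime/Tensor.h>
#include <arm_compute/core/TensorInfo.h>
#include <arm_compute/core/Types.h>
using namespace arm_compute;
Tensor src;
// NHWC shapes in Compute Library are ordered (C, W, H[, N]): 16 channels, 32x32 image
TensorInfo info(TensorShape(16U, 32U, 32U), 1, DataType::F32);
info.set_data_layout(DataLayout::NHWC);
src.allocator()->init(info);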
Compute Library
The Compute Library is a collection of low-level machine learning functions optimized for Arm® Cortex®-A, Arm® Neoverse® and Arm® Mali™ GPU architectures.
The library provides superior performance to other open source alternatives and immediate support for new Arm® technologies, e.g. SVE2.
Key Features:
- Open source software available under a permissive MIT license
- Over 100 machine learning functions for CPU and GPU
- Multiple convolution algorithms (GeMM, Winograd, FFT, Direct and indirect-GeMM)
- Support for multiple data types: FP32, FP16, INT8, UINT8, BFLOAT16
- Micro-architecture optimization for key ML primitives
- Highly configurable build options enabling lightweight binaries
- Advanced optimization techniques such as kernel fusion, fast math enablement and texture utilization (see the sketch after this list)
- Device and workload specific tuning using OpenCL tuner and GeMM optimized heuristics
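As a sketch of how fast math and fused activations surface in the API (assuming the optional parameters of NEConvolutionLayer::configure in recent releases; the tensors are placeholders initialised as in the earlier convolution example):
#include <arm_compute/runtime/NEON/NEFunctions.h>
using namespace arm_compute;
Tensor src, weights, biases, dst; // initialised elsewhere
NEConvolutionLayer conv;
// Fuse a ReLU activation into the convolution and allow fast-math kernel selection
conv.configure(&src, &weights, &biases, &dst,
               PadStrideInfo(1, 1, 1, 1),
               WeightsInfo(),
               Size2D(1U, 1U),
               ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::RELU),
               /* enable_fast_math */ true);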
Repository | Link |
---|---|
Release | https://github.com/arm-software/ComputeLibrary |
Development | https://review.mlplatform.org/#/admin/projects/ml/ComputeLibrary |
Documentation
Note: The documentation includes the reference API, changelogs, build guide, contribution guide, errata, etc.
Pre-built binaries
All the binaries can be downloaded from here or from the tables below.
Platform | Operating System | Release archive (Download) |
---|---|---|
Raspberry Pi 4 | Linux® 32bit | |
Raspberry Pi 4 | Linux® 64bit | |
Odroid N2 | Linux® 64bit | |
HiKey960 | Linux® 64bit |
Architecture | Operating System | Release archive (Download) |
---|---|---|
armv7 | Linux® | |
arm64-v8a | Android™ | |
arm64-v8a | Linux® |
Please refer to the following link for more pre-built binaries:
Pre-built binaries are generated with the following security / good coding practices related flags:
-Wall, -Wextra, -Wformat=2, -Winit-self, -Wstrict-overflow=2, -Wswitch-default, -Woverloaded-virtual, -Wformat-security, -Wctor-dtor-privacy, -Wsign-promo, -Weffc++, -pedantic, -fstack-protector-strong
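When building the library from source, comparable flags can be appended through the build system; a sketch assuming the scons extra_cxx_flags option:
scons Werror=1 neon=1 os=linux arch=arm64-v8a extra_cxx_flags="-fstack-protector-strong -Wformat=2"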
Supported Architectures/Technologies
- Arm® CPUs:
  - Arm® Cortex®-A processor family using Arm® Neon™ technology
  - Arm® Neoverse® processor family
  - Arm® Cortex®-R processor family with Armv8-R AArch64 architecture using Arm® Neon™ technology
  - Arm® Cortex®-X1 processor using Arm® Neon™ technology
- Arm® Mali™ GPUs:
  - Arm® Mali™-G processor family
  - Arm® Mali™-T processor family
- x86
Supported Systems
- Android™
- Bare Metal
- Linux®
- OpenBSD®
- macOS®
- Tizen™
Resources
- Tutorial: Running AlexNet on Raspberry Pi with Compute Library
- Gian Marco's talk on Performance Analysis for Optimizing Embedded Deep Learning Inference Software
- Gian Marco's talk on optimizing CNNs with Winograd algorithms at the EVS
- Gian Marco's talk on using SGEMM and FFTs to Accelerate Deep Learning
Experimental builds
⚠ Important: Bazel and CMake builds are experimental, CPU-only builds; please see the documentation for more details.
How to contribute
Contributions to the Compute Library are more than welcome. If you are interested in contributing, please have a look at our how to contribute guidelines.
Developer Certificate of Origin (DCO)
Before the Compute Library accepts your contribution, you need to certify its origin and give us your permission. To manage this process, we use the Developer Certificate of Origin (DCO) V1.1 (https://developercertificate.org/).
To indicate that you agree to the terms of the DCO, you "sign off" your contribution by adding a line with your name and e-mail address to every git commit message:
Signed-off-by: John Doe <john.doe@example.org>
You must use your real name; no pseudonyms or anonymous contributions are accepted.
Public mailing list
For technical discussion, the ComputeLibrary project has a public mailing list: acl-dev@lists.linaro.org. The list is open to anyone inside or outside of Arm to self-subscribe. To subscribe, please visit the following website: https://lists.linaro.org/mailman3/lists/acl-dev.lists.linaro.org/
License and Contributions
The software is provided under MIT license. Contributions to this project are accepted under the same license.
Other Projects
This project contains code from other projects as listed below. The original license text is included in those source files.
- The OpenCL header library is licensed under Apache License, Version 2.0, which is a permissive license compatible with MIT license.
- The half library is licensed under MIT license.
- The libnpy library is licensed under MIT license.
- The stb image library is either licensed under MIT license or is in Public Domain. It is used by this project under the terms of MIT license.
Trademarks and Copyrights
Android is a trademark of Google LLC.
Arm, Cortex, Mali and Neon are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere.
Bazel is a trademark of Google LLC., registered in the U.S. and other countries.
CMake is a trademark of Kitware, Inc., registered in the U.S. and other countries.
Linux® is the registered trademark of Linus Torvalds in the U.S. and other countries.
Mac and macOS are trademarks of Apple Inc., registered in the U.S. and other countries.
Tizen is a registered trademark of The Linux Foundation.
Windows® is a trademark of the Microsoft group of companies.