
google/XNNPACK

High-efficiency floating-point neural network inference operators for mobile, server, and Web


Top Related Projects

  • tensorflow/tensorflow: An Open Source Machine Learning Framework for Everyone
  • pytorch/pytorch: Tensors and Dynamic neural networks in Python with strong GPU acceleration
  • apple/coremltools: Core ML tools contain supporting tools for Core ML model conversion, editing, and validation.
  • microsoft/onnxruntime: ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
  • alibaba/MNN: MNN is a blazing fast, lightweight deep learning framework, battle-tested by business-critical use cases in Alibaba
  • Tencent/ncnn: ncnn is a high-performance neural network inference framework optimized for the mobile platform

Quick Overview

XNNPACK is a highly optimized library for neural network inference on ARM, x86, WebAssembly, and RISC-V platforms. It provides low-level, per-architecture-optimized implementations of common neural network operators behind a portable API, which high-level machine learning frameworks use to build high-performance inference pipelines.

Pros

  • High Performance: XNNPACK is designed to deliver state-of-the-art performance on a wide range of hardware, leveraging platform-specific optimizations and low-level hardware features.
  • Portable API: The library exposes the same operator API across all supported architectures and dispatches internally to microkernels tuned for each one, so developers can integrate it without writing per-platform code.
  • Extensive Kernel Support: XNNPACK supports a comprehensive set of neural network operations, including convolution, pooling, activation functions, and more, making it a versatile choice for various neural network architectures.
  • Open-Source: XNNPACK is an open-source project, allowing developers to contribute, customize, and extend the library to fit their specific needs.

Cons

  • Complexity: The library's extensive feature set and hardware-specific optimizations can make it challenging for newcomers to understand and integrate into their projects.
  • Dependency Management: XNNPACK relies on several external dependencies, which can add complexity to the build and deployment process, especially in constrained environments.
  • Limited Documentation: While the project has good technical documentation, the overall user experience and getting started guides could be improved to make it more accessible to a wider audience.
  • Licensing: XNNPACK is released under a permissive BSD-style license; as with any dependency, its terms should still be checked against a project's licensing requirements.

Code Examples

Here are a few examples of how to use XNNPACK in your projects. The XNNPACK API evolves between releases (newer versions, for example, split operator setup into separate reshape and setup calls), so treat these as sketches and check xnnpack.h for the exact signatures of the version you build against:

  1. Performing Convolution:
#include <stdlib.h>
#include <xnnpack.h>

void convolution_example() {
    const size_t input_channels = 3;
    const size_t output_channels = 16;
    const size_t kernel_size = 3;
    const size_t batch_size = 1;
    const size_t input_height = 224;
    const size_t input_width = 224;
    const size_t output_height = input_height - kernel_size + 1;  // no padding, stride 1
    const size_t output_width = input_width - kernel_size + 1;

    // Heap-allocate buffers: NHWC layout for activations, OHWI for weights.
    float* input = malloc(batch_size * input_height * input_width * input_channels * sizeof(float));
    float* weights = malloc(output_channels * kernel_size * kernel_size * input_channels * sizeof(float));
    float* bias = calloc(output_channels, sizeof(float));
    float* output = malloc(batch_size * output_height * output_width * output_channels * sizeof(float));

    // Initialize XNNPACK before creating any operators (NULL = default allocator).
    enum xnn_status status = xnn_initialize(NULL /* allocator */);

    // Create a 3x3 convolution: no padding, stride 1, no dilation, one group,
    // with the output clamped to the ReLU6 range [0.0, 6.0].
    xnn_operator_t convolution_op = NULL;
    status = xnn_create_convolution2d_nhwc_f32(
        0, 0, 0, 0,                       // input padding: top, right, bottom, left
        kernel_size, kernel_size,         // kernel height, width
        1, 1,                             // subsampling (stride) height, width
        1, 1,                             // dilation height, width
        1,                                // groups
        input_channels, output_channels,  // input/output channels per group
        input_channels, output_channels,  // input/output channel stride
        weights, bias,
        0.0f, 6.0f,                       // output min/max
        0 /* flags */, &convolution_op);

    // Bind the operator to concrete shapes and buffers, then execute it.
    status = xnn_setup_convolution2d_nhwc_f32(
        convolution_op, batch_size, input_height, input_width,
        input, output, NULL /* thread pool: run single-threaded */);
    status = xnn_run_operator(convolution_op, NULL /* thread pool */);

    // Clean up.
    xnn_delete_operator(convolution_op);
    xnn_deinitialize();
    free(input); free(weights); free(bias); free(output);
}
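The create/setup/run split is deliberate: expensive one-time work such as weight repacking happens at operator creation, while setup only binds shapes and buffer pointers. A framework can therefore create an operator once, re-run it across many inference calls, and repeat setup only when tensor shapes change.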
  2. Performing Max Pooling:
#include <math.h>
#include <stdlib.h>
#include <xnnpack.h>

void pooling_example() {
    const size_t batch_size = 1;
    const size_t channels = 16;
    const size_t input_height = 224;
    const size_t input_width = 224;

    // Heap-allocate NHWC buffers; 2x2 pooling with stride 2 halves H and W.
    float* input = malloc(batch_size * input_height * input_width * channels * sizeof(float));
    float* output = malloc(batch_size * (input_height / 2) * (input_width / 2) * channels * sizeof(float));

    // Initialize XNNPACK and create a 2x2 max-pooling operator with stride 2,
    // no padding, no dilation, and no output clamping.
    enum xnn_status status = xnn_initialize(NULL /* allocator */);
    xnn_operator_t pooling_op = NULL;
    status = xnn_create_max_pooling2d_nhwc_f32(
        0, 0, 0, 0,           // input padding: top, right, bottom, left
        2, 2,                 // pooling window height, width
        2, 2,                 // stride height, width
        1, 1,                 // dilation height, width
        -INFINITY, INFINITY,  // output min/max (no clamping)
        0 /* flags */, &pooling_op);

    // Bind shapes and buffers, then execute single-threaded.
    status = xnn_setup_max_pooling2d_nhwc_f32(
        pooling_op, batch_size, input_height, input_width,
        channels, channels /* input pixel stride */, channels /* output pixel stride */,
        input, output, NULL /* thread pool */);
    status = xnn_run_operator(pooling_op, NULL /* thread pool */);

    // Clean up.
    xnn_delete_operator(pooling_op);
    xnn_deinitialize();
    free(input); free(output);
}
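  3. Applying an Activation (Clamp/ReLU6):

A Clamp operator with bounds [0, 6] implements ReLU6 (see the operator list in the README below). This is a minimal sketch assuming the xnn_create_clamp_nc_f32/xnn_setup_clamp_nc_f32 signatures of older XNNPACK releases:

#include <stdlib.h>
#include <xnnpack.h>

void relu6_example() {
    const size_t channels = 16;
    const size_t batch_size = 224 * 224;  // treat every pixel as a batch element

    float* input = malloc(batch_size * channels * sizeof(float));
    float* output = malloc(batch_size * channels * sizeof(float));

    xnn_initialize(NULL /* allocator */);

    // Clamping each element to [0.0, 6.0] is exactly ReLU6.
    xnn_operator_t clamp_op = NULL;
    xnn_create_clamp_nc_f32(
        channels, channels /* input stride */, channels /* output stride */,
        0.0f, 6.0f,  // output min/max
        0 /* flags */, &clamp_op);

    xnn_setup_clamp_nc_f32(clamp_op, batch_size, input, output, NULL /* thread pool */);
    xnn_run_operator(clamp_op, NULL /* thread pool */);

    xnn_delete_operator(clamp_op);
    xnn_deinitialize();
    free(input); free(output);
}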

Competitor Comparisons

TensorFlow

An Open Source Machine Learning Framework for Everyone

Pros of TensorFlow

  • Comprehensive ecosystem with high-level APIs, tools, and extensive documentation
  • Supports a wide range of platforms and devices
  • Large community and extensive third-party library support

Cons of TensorFlow

  • Steeper learning curve for beginners
  • Can be slower for certain operations compared to XNNPACK's optimized kernels
  • Larger footprint and resource requirements

Code Comparison

XNNPACK (low-level operator implementation):

void xnn_f32_vmax_ukernel__avx_x8(
    size_t n,
    const float* a,
    const float* b,
    float* y,
    const union xnn_f32_output_params params[restrict static 1])
{
  // Implementation details...
}

TensorFlow (high-level API usage):

import tensorflow as tf

a = tf.constant([1.0, 2.0, 3.0])
b = tf.constant([4.0, 5.0, 6.0])
result = tf.maximum(a, b)

XNNPACK focuses on low-level, optimized kernels for neural network operators, while TensorFlow provides a high-level API for building and training machine learning models. XNNPACK is more suitable for performance-critical, embedded applications, whereas TensorFlow offers a more comprehensive solution for general machine learning tasks.

PyTorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Pros of PyTorch

  • Comprehensive deep learning framework with high-level APIs
  • Large ecosystem and community support
  • Flexible and dynamic computational graph

Cons of PyTorch

  • Larger footprint and resource requirements
  • Steeper learning curve for beginners
  • Less optimized for mobile and edge devices

Code Comparison

XNNPACK (low-level operator implementation):

xnn_status xnn_create_convolution2d_nhwc_f32(
    uint32_t input_padding_top,
    uint32_t input_padding_right,
    uint32_t input_padding_bottom,
    uint32_t input_padding_left,
    uint32_t kernel_height,
    uint32_t kernel_width,
    // ... (additional parameters)
)

PyTorch (high-level API):

import torch.nn as nn

conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)
output = conv(input)

XNNPACK focuses on low-level, highly optimized implementations of neural network operators, particularly for mobile and edge devices. It provides fine-grained control over performance-critical parameters.

PyTorch offers a more user-friendly, high-level API for building and training neural networks. It abstracts away many low-level details, making it easier to prototype and experiment with different architectures.

coremltools

Core ML tools contain supporting tools for Core ML model conversion, editing, and validation.

Pros of coremltools

  • Specifically designed for Apple platforms, offering seamless integration with iOS, macOS, and other Apple devices
  • Provides tools for converting models from various frameworks (TensorFlow, PyTorch, etc.) to Core ML format
  • Includes features for model optimization and quantization tailored for Apple hardware

Cons of coremltools

  • Limited to Apple ecosystem, lacking cross-platform support
  • May have a steeper learning curve for developers not familiar with Apple's ML ecosystem
  • Less flexible for low-level optimizations compared to XNNPACK

Code Comparison

XNNPACK (C++):

#include <xnnpack.h>

// Initialize XNNPACK before creating any operators; passing NULL selects
// the default memory allocator.
void init_xnnpack() {
  enum xnn_status status = xnn_initialize(NULL /* allocator */);
  if (status != xnn_status_success) {
    // handle initialization failure
  }
}

coremltools (Python):

import coremltools as ct

model = ct.convert(keras_model, source='keras')
model.save('my_model.mlmodel')

The code snippets highlight the different approaches: XNNPACK focuses on low-level initialization and optimization, while coremltools emphasizes high-level model conversion and deployment for Apple platforms.

ONNX Runtime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator

Pros of ONNX Runtime

  • Broader ecosystem support with ONNX format compatibility
  • More comprehensive, supporting a wider range of ML models and operations
  • Cross-platform support for various hardware accelerators (CPU, GPU, etc.)

Cons of ONNX Runtime

  • Larger footprint and potentially higher resource usage
  • May have more overhead for simpler models or specific use cases

Code Comparison

XNNPACK (C):

xnn_status xnn_initialize(const struct xnn_allocator* allocator);
xnn_status xnn_create_convolution2d_nhwc_f32(...);
xnn_status xnn_setup_convolution2d_nhwc_f32(...);

ONNX Runtime (C++):

Ort::Env env;
Ort::Session session(env, model_path, session_options);
auto output_tensors = session.Run(run_options, input_names, input_tensors, num_inputs, output_names, num_outputs);

XNNPACK focuses on low-level, high-performance neural network operators, while ONNX Runtime provides a higher-level interface for running entire ML models. XNNPACK is more suitable for fine-grained control and optimization of specific operations, whereas ONNX Runtime offers a more comprehensive solution for deploying and running various ML models across different platforms.

MNN

MNN is a blazing fast, lightweight deep learning framework, battle-tested by business-critical use cases in Alibaba

Pros of MNN

  • Supports a wider range of platforms, including mobile, embedded, and IoT devices
  • Offers a comprehensive set of tools for model conversion, visualization, and benchmarking
  • Provides higher-level APIs for easier integration and usage

Cons of MNN

  • Less specialized for low-level optimizations compared to XNNPACK
  • May have a steeper learning curve due to its broader feature set
  • Potentially larger binary size due to more comprehensive functionality

Code Comparison

XNNPACK (C++):

xnn_status xnn_initialize(const struct xnn_allocator* allocator);
xnn_status xnn_create_convolution2d_nhwc_f32(...);
xnn_status xnn_setup_convolution2d_nhwc_f32(...);
xnn_status xnn_run_operator(xnn_operator_t op, pthreadpool_t threadpool);

MNN (C++):

auto net = std::shared_ptr<MNN::Interpreter>(MNN::Interpreter::createFromFile(modelFile));
auto session = net->createSession(config);
net->runSession(session);
auto tensor = net->getSessionOutput(session, "output");

Both libraries offer efficient neural network inference, but XNNPACK focuses on low-level optimizations for specific operations, while MNN provides a more comprehensive solution with higher-level abstractions. XNNPACK may be preferred for fine-grained control and performance optimization, whereas MNN offers a more user-friendly approach with broader platform support and tools for end-to-end deployment.

ncnn

ncnn is a high-performance neural network inference framework optimized for the mobile platform

Pros of ncnn

  • Broader platform support, including mobile and embedded devices
  • More comprehensive model conversion tools
  • Larger community and ecosystem, with more pre-trained models available

Cons of ncnn

  • Generally slower performance compared to XNNPACK
  • Less focus on low-precision inference optimizations
  • More complex API and setup process

Code Comparison

XNNPACK example (C++):

xnn_initialize(nullptr);
xnn_operator_t conv_op = nullptr;
xnn_status status = xnn_create_convolution2d_nhwc_f32(
    /* ... parameters ... */
    &conv_op);

ncnn example (C++):

ncnn::Net net;
net.load_param("model.param");
net.load_model("model.bin");
ncnn::Mat in(224, 224, 3);
ncnn::Mat out;
ncnn::Extractor ex = net.create_extractor();
ex.input("input", in);  // blob names depend on the converted model
ex.extract("output", out);

Both libraries provide efficient neural network inference, but XNNPACK focuses on low-level optimizations for specific hardware, while ncnn offers a higher-level API with broader device support. XNNPACK generally provides better performance, especially for low-precision operations, while ncnn offers more flexibility and easier integration for a wider range of applications and platforms.


README

XNNPACK

XNNPACK is a highly optimized solution for neural network inference on ARM, x86, WebAssembly, and RISC-V platforms. XNNPACK is not intended for direct use by deep learning practitioners and researchers; instead it provides low-level performance primitives for accelerating high-level machine learning frameworks, such as TensorFlow Lite, TensorFlow.js, PyTorch, ONNX Runtime, and MediaPipe.
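For example, TensorFlow Lite integrates XNNPACK through its delegate mechanism. The sketch below uses TensorFlow Lite's C API to route supported operators through XNNPACK; the model path and thread count are placeholder assumptions:

#include "tensorflow/lite/c/c_api.h"
#include "tensorflow/lite/delegates/xnnpack/xnnpack_delegate.h"

void run_with_xnnpack_delegate(void) {
    TfLiteModel* model = TfLiteModelCreateFromFile("model.tflite");

    // Create the XNNPACK delegate; num_threads is a tuning knob.
    TfLiteXNNPackDelegateOptions delegate_options = TfLiteXNNPackDelegateOptionsDefault();
    delegate_options.num_threads = 4;
    TfLiteDelegate* delegate = TfLiteXNNPackDelegateCreate(&delegate_options);

    // Attach the delegate so supported operators run on XNNPACK.
    TfLiteInterpreterOptions* options = TfLiteInterpreterOptionsCreate();
    TfLiteInterpreterOptionsAddDelegate(options, delegate);

    TfLiteInterpreter* interpreter = TfLiteInterpreterCreate(model, options);
    TfLiteInterpreterAllocateTensors(interpreter);
    // ... fill input tensors here ...
    TfLiteInterpreterInvoke(interpreter);

    // Clean up: the interpreter must be deleted before its delegate.
    TfLiteInterpreterDelete(interpreter);
    TfLiteXNNPackDelegateDelete(delegate);
    TfLiteInterpreterOptionsDelete(options);
    TfLiteModelDelete(model);
}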

Supported Architectures

  • ARM64 on Android, iOS, macOS, Linux, and Windows
  • ARMv7 (with NEON) on Android
  • ARMv6 (with VFPv2) on Linux
  • x86 and x86-64 (up to AVX512) on Windows, Linux, macOS, Android, and iOS simulator
  • WebAssembly MVP
  • WebAssembly SIMD
  • WebAssembly Relaxed SIMD (experimental)
  • RISC-V (RV32GC and RV64GC)

Operator Coverage

XNNPACK implements the following neural network operators:

  • 2D Convolution (including grouped and depthwise)
  • 2D Deconvolution (AKA Transposed Convolution)
  • 2D Average Pooling
  • 2D Max Pooling
  • 2D ArgMax Pooling (Max Pooling + indices)
  • 2D Unpooling
  • 2D Bilinear Resize
  • 2D Depth-to-Space (AKA Pixel Shuffle)
  • Add (including broadcasting, two inputs only)
  • Subtract (including broadcasting)
  • Divide (including broadcasting)
  • Maximum (including broadcasting)
  • Minimum (including broadcasting)
  • Multiply (including broadcasting)
  • Squared Difference (including broadcasting)
  • Global Average Pooling
  • Channel Shuffle
  • Fully Connected
  • Abs (absolute value)
  • Bankers' Rounding (rounding to nearest, ties to even)
  • Ceiling (rounding to integer above)
  • Clamp (includes ReLU and ReLU6)
  • Convert (includes fixed-point and half-precision quantization and dequantization)
  • Copy
  • ELU
  • Floor (rounding to integer below)
  • HardSwish
  • Leaky ReLU
  • Negate
  • Sigmoid
  • Softmax
  • Square
  • Tanh
  • Transpose
  • Truncation (rounding to integer towards zero)
  • PReLU

All operators in XNNPACK support NHWC layout, and additionally allow a custom stride along the Channel dimension. Operators can therefore consume a subset of channels in the input tensor and produce a subset of channels in the output tensor, providing zero-cost Channel Split and Channel Concatenation operations, as the sketch below illustrates.
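Here is a minimal sketch of that channel-stride trick: it applies Sigmoid to the first four channels and HardSwish to the last four channels of a single 8-channel NHWC tensor, with no explicit split. Sigmoid and HardSwish are operators from the list above, but the exact create/setup signatures are assumptions based on older XNNPACK releases:

#include <stdlib.h>
#include <xnnpack.h>

void channel_split_example(const float* input, float* output, size_t num_pixels) {
    // One NHWC tensor with 8 channels; each operator keeps a channel
    // stride of 8 but processes only 4 of the channels.
    xnn_initialize(NULL /* allocator */);

    xnn_operator_t sigmoid_op = NULL;
    xnn_operator_t hardswish_op = NULL;
    xnn_create_sigmoid_nc_f32(4 /* channels */, 8 /* input stride */,
                              8 /* output stride */, 0 /* flags */, &sigmoid_op);
    xnn_create_hardswish_nc_f32(4, 8, 8, 0, &hardswish_op);

    // The second operator starts 4 floats into every pixel: a zero-cost split.
    xnn_setup_sigmoid_nc_f32(sigmoid_op, num_pixels, input, output, NULL);
    xnn_setup_hardswish_nc_f32(hardswish_op, num_pixels, input + 4, output + 4, NULL);

    xnn_run_operator(sigmoid_op, NULL /* thread pool */);
    xnn_run_operator(hardswish_op, NULL /* thread pool */);

    xnn_delete_operator(sigmoid_op);
    xnn_delete_operator(hardswish_op);
    xnn_deinitialize();
}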

Performance

Mobile phones

The table below presents single-threaded performance of the XNNPACK library on three generations of MobileNet models and three generations of Pixel phones.

Model                      Pixel, ms   Pixel 2, ms   Pixel 3a, ms
FP32 MobileNet v1 1.0X         82           86            88
FP32 MobileNet v2 1.0X         49           53            55
FP32 MobileNet v3 Large        39           42            44
FP32 MobileNet v3 Small        12           14            14

The following table presents multi-threaded (using as many threads as there are big cores) performance of the XNNPACK library on three generations of MobileNet models and three generations of Pixel phones.

Model                      Pixel, ms   Pixel 2, ms   Pixel 3a, ms
FP32 MobileNet v1 1.0X         43           27            46
FP32 MobileNet v2 1.0X         26           18            28
FP32 MobileNet v3 Large        22           16            24
FP32 MobileNet v3 Small         7            6             8

Benchmarked on March 27, 2020 with end2end_bench --benchmark_min_time=5 on an Android/ARM64 build with Android NDK r21 (bazel build -c opt --config android_arm64 :end2end_bench) and neural network models with randomized weights and inputs.
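Multi-threaded numbers like these rely on XNNPACK's pthreadpool integration. The sketch below shows how a caller supplies a thread pool to an already-created operator; the pool size of 4 is an arbitrary assumption to be tuned to the device's big cores:

#include <pthreadpool.h>
#include <xnnpack.h>

void run_multithreaded(xnn_operator_t op) {
    // Create a pool with 4 worker threads.
    pthreadpool_t threadpool = pthreadpool_create(4);

    // Pass the pool instead of NULL so the operator parallelizes its work.
    xnn_run_operator(op, threadpool);

    pthreadpool_destroy(threadpool);
}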

Raspberry Pi

The table below presents multi-threaded performance of the XNNPACK library on three generations of MobileNet models and four generations of Raspberry Pi boards.

Model                      RPi Zero W       RPi 2            RPi 3+            RPi 4           RPi 4 (ARM64)
                           (BCM2835), ms    (BCM2836), ms    (BCM2837B0), ms   (BCM2711), ms   (BCM2711), ms
FP32 MobileNet v1 1.0X         3919             302              114               72              77
FP32 MobileNet v2 1.0X         1987             191               79               41              46
FP32 MobileNet v3 Large        1658             161               67               38              40
FP32 MobileNet v3 Small         474              50               22               13              15
INT8 MobileNet v1 1.0X         2589             128               46               29              24
INT8 MobileNet v2 1.0X         1495              82               30               20              17

Benchmarked on Feb 8, 2022 with end2end-bench --benchmark_min_time=5 on a Raspbian Buster build with CMake (./scripts/build-local.sh) and neural network models with randomized weights and inputs. INT8 inference was evaluated with a per-channel quantization schema.

Minimum build requirements

  • C11
  • C++14
  • Python 3

Publications

Ecosystem

Machine Learning Frameworks

As noted in the introduction, XNNPACK serves as a backend for high-level machine learning frameworks including TensorFlow Lite, TensorFlow.js, PyTorch, ONNX Runtime, and MediaPipe.

Acknowledgements

XNNPACK is based on the QNNPACK library. Over time, however, its codebase has diverged significantly, and the XNNPACK API is no longer compatible with QNNPACK.