Top Related Projects
- TensorFlow: An Open Source Machine Learning Framework for Everyone
- Turi Create: simplifies the development of custom machine learning models.
- ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
- PyTorch: Tensors and Dynamic neural networks in Python with strong GPU acceleration
- OpenBLAS: an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
- oneDNN: oneAPI Deep Neural Network Library
Quick Overview
Gemmlowp is a small, portable, low-precision GEMM (General Matrix Multiplication) library developed by Google. It focuses on quantized computation, particularly useful for machine learning applications on mobile and embedded devices. The library is designed to be efficient, flexible, and easy to integrate into existing projects.
Pros
- Optimized for low-precision arithmetic, suitable for mobile and embedded devices
- Supports various quantization schemes, allowing for flexible implementation
- Portable across different platforms and architectures
- Designed with a focus on performance and efficiency
Cons
- Limited to low-precision computations, not suitable for high-precision requirements
- May require additional effort to integrate with existing high-level machine learning frameworks
- Documentation could be more comprehensive for easier adoption
- Primarily focused on GEMM operations, limiting its use for other types of computations
Code Examples
- Basic matrix multiplication (uint8 inputs, raw int32 accumulator output):
#include <cstdint>
#include <tuple>
#include "gemmlowp/public/gemmlowp.h"
void MatrixMultiply(const std::uint8_t* lhs, const std::uint8_t* rhs,
                    std::int32_t* result, int m, int n, int k) {
  gemmlowp::GemmContext context;
  gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::RowMajor> lhs_matrix(lhs, m, k);
  gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::ColMajor> rhs_matrix(rhs, k, n);
  gemmlowp::MatrixMap<std::int32_t, gemmlowp::MapOrder::RowMajor> result_matrix(result, m, n);
  // An empty output pipeline (empty std::tuple) leaves the raw int32 accumulators as the result.
  gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::int32_t,
                                   gemmlowp::DefaultL8R8BitDepthParams>(
      &context, lhs_matrix, rhs_matrix, &result_matrix,
      /*lhs_offset=*/-128, /*rhs_offset=*/-128, std::make_tuple());
}
- Building a custom output pipeline (quantize the int32 accumulators back down to uint8):
#include <cstdint>
#include <tuple>
#include "gemmlowp/public/gemmlowp.h"
void QuantizedMatrixMultiply(const std::uint8_t* lhs, const std::uint8_t* rhs,
                             std::uint8_t* result, int m, int n, int k) {
  gemmlowp::GemmContext context;
  gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::RowMajor> lhs_matrix(lhs, m, k);
  gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::ColMajor> rhs_matrix(rhs, k, n);
  gemmlowp::MatrixMap<std::uint8_t, gemmlowp::MapOrder::RowMajor> result_matrix(result, m, n);
  // Stage 1: scale the accumulator down:
  //   ((accum + result_offset) * result_mult_int + rounding) >> result_shift.
  gemmlowp::OutputStageQuantizeDownInt32ToUint8Scale quantize_down_stage;
  quantize_down_stage.result_offset = 0;
  quantize_down_stage.result_mult_int = 1;
  quantize_down_stage.result_shift = 8;
  // Stage 2: saturating cast of the scaled value to uint8.
  gemmlowp::OutputStageSaturatingCastToUint8 saturating_cast_stage;
  // An output pipeline is simply a std::tuple of output stages.
  const auto output_pipeline = std::make_tuple(quantize_down_stage, saturating_cast_stage);
  gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::uint8_t,
                                   gemmlowp::DefaultL8R8BitDepthParams>(
      &context, lhs_matrix, rhs_matrix, &result_matrix,
      /*lhs_offset=*/-128, /*rhs_offset=*/-128, output_pipeline);
}
Getting Started
To use gemmlowp in your project:
- Clone the repository:
  git clone https://github.com/google/gemmlowp.git
- Include the necessary headers in your C++ file:
  #include "gemmlowp/public/gemmlowp.h"
- Compile your project with the gemmlowp include path (gemmlowp is header-only, so there are no library sources to build; on x86, also pass -msse4.1, and add -pthread if you use multi-threaded operation):
  g++ -std=c++11 -I/path/to/gemmlowp your_file.cpp -o your_program
- Use the gemmlowp API, as shown in the code examples above, to perform low-precision matrix multiplications.
Competitor Comparisons
An Open Source Machine Learning Framework for Everyone
Pros of TensorFlow
- Comprehensive machine learning framework with a wide range of tools and libraries
- Large and active community, extensive documentation, and tutorials
- Supports multiple programming languages and platforms
Cons of TensorFlow
- Steeper learning curve due to its complexity and extensive features
- Larger codebase and resource requirements
Code Comparison
gemmlowp:
gemmlowp::GemmContext context;
gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::int32_t, gemmlowp::DefaultL8R8BitDepthParams>(
    &context, lhs, rhs, &result, lhs_offset, rhs_offset, std::make_tuple());
TensorFlow:
import tensorflow as tf
x = tf.constant([[1, 2], [3, 4]])
y = tf.constant([[5, 6], [7, 8]])
result = tf.matmul(x, y)
Summary
gemmlowp is a specialized library for low-precision matrix multiplication, focusing on efficiency and performance for embedded systems. TensorFlow is a comprehensive machine learning framework with a broader scope and more features. While gemmlowp offers a lightweight solution for specific use cases, TensorFlow provides a complete ecosystem for various machine learning tasks at the cost of increased complexity and resource requirements.
Turi Create simplifies the development of custom machine learning models.
Pros of Turicreate
- Offers a high-level API for machine learning tasks, making it more user-friendly
- Provides a wide range of ML algorithms and tools for various applications
- Includes visualization capabilities for data exploration and model evaluation
Cons of Turicreate
- Larger and more complex codebase, potentially harder to integrate into specific projects
- May have higher computational requirements due to its comprehensive feature set
- Less focused on low-level optimizations compared to gemmlowp
Code Comparison
Turicreate (Python):
import turicreate as tc
data = tc.SFrame('dataset.csv')
model = tc.image_classifier.create(data, target='label', model='resnet-50')
predictions = model.predict(new_data)
gemmlowp (C++):
#include "gemmlowp/public/gemmlowp.h"
typedef gemmlowp::VectorMap<const std::uint8_t, gemmlowp::VectorShape::Col> InputVector;
typedef gemmlowp::VectorMap<std::int32_t, gemmlowp::VectorShape::Col> OutputVector;
gemmlowp::GemmContext context;
gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::int32_t, gemmlowp::DefaultL8R8BitDepthParams>(
    &context, lhs, rhs, &result, lhs_offset, rhs_offset, output_pipeline);
Turicreate focuses on providing a high-level ML framework with various algorithms and tools, while gemmlowp specializes in low-level matrix multiplication optimizations for machine learning applications. Turicreate offers more accessibility for general ML tasks, whereas gemmlowp provides fine-grained control for performance-critical operations.
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
Pros of ONNX Runtime
- Broader scope: Supports multiple ML frameworks and hardware accelerators
- Active development: Frequent updates and large community support
- Cross-platform: Works on various operating systems and devices
Cons of ONNX Runtime
- Larger footprint: More complex and resource-intensive than gemmlowp
- Steeper learning curve: Requires more setup and configuration
Code Comparison
ONNX Runtime (C++):
Ort::Env env;
Ort::Session session(env, model_path, Ort::SessionOptions{});
auto output_tensors = session.Run(Ort::RunOptions{nullptr}, input_names, &input_tensor, 1, output_names, 1);
gemmlowp (C++):
gemmlowp::GemmContext gemm_context;
// An output pipeline is a std::tuple of output stages.
const auto output_pipeline = std::make_tuple(
    gemmlowp::OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint(),
    gemmlowp::OutputStageSaturatingCastToUint8());
gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::uint8_t, gemmlowp::L8R8WithLhsNonzeroBitDepthParams>(
    &gemm_context, lhs, rhs, &result, lhs_offset, rhs_offset, output_pipeline);
ONNX Runtime offers a higher-level API for running ML models, while gemmlowp provides low-level matrix multiplication optimizations. ONNX Runtime is more versatile but complex, whereas gemmlowp is focused on efficient GEMM operations for low-precision arithmetic.
Tensors and Dynamic neural networks in Python with strong GPU acceleration
Pros of PyTorch
- Broader scope and functionality, supporting a wide range of deep learning tasks
- Large and active community, with extensive documentation and resources
- Flexible and intuitive API, making it easier for researchers and developers to experiment
Cons of PyTorch
- Larger codebase and more complex architecture, potentially harder to contribute to
- Higher resource requirements for installation and usage
- May have slower performance for specific low-level operations compared to gemmlowp
Code Comparison
PyTorch example (tensor creation and basic operation):
import torch
x = torch.tensor([1, 2, 3])
y = torch.tensor([4, 5, 6])
z = x + y
print(z)
gemmlowp example (matrix multiplication):
#include "gemmlowp/public/gemmlowp.h"
gemmlowp::GemmContext context;
gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::RowMajor> lhs(lhs_data, m, k, k);
gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::ColMajor> rhs(rhs_data, k, n, n);
gemmlowp::MatrixMap<std::int32_t, gemmlowp::MapOrder::RowMajor> result(result_data, m, n, n);
gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::int32_t, gemmlowp::DefaultL8R8BitDepthParams>(
&context, lhs, rhs, &result, -lhs_offset, -rhs_offset, pipeline);
OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
Pros of OpenBLAS
- Broader scope: Covers a wide range of linear algebra operations beyond matrix multiplication
- Highly optimized for various CPU architectures, including x86, ARM, and POWER
- Extensive community support and regular updates
Cons of OpenBLAS
- Larger library size and potentially higher memory footprint
- May have more complexity for simple matrix multiplication tasks
- Not specifically optimized for low-precision or quantized operations
Code Comparison
OpenBLAS (C):
#include <cblas.h>
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
M, N, K, alpha, A, K, B, N, beta, C, N);
gemmlowp (C++):
#include "gemmlowp/public/gemmlowp.h"
gemmlowp::GemmContext context;
// Matrix dimensions are carried by the MatrixMap arguments, not passed explicitly.
gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::uint8_t, gemmlowp::L8R8WithLhsNonzeroBitDepthParams>(
    &context, lhs, rhs, &result, lhs_offset, rhs_offset, output_pipeline);
Note: gemmlowp focuses on low-precision matrix multiplication, while OpenBLAS provides a more general-purpose BLAS implementation. The code examples show basic matrix multiplication calls, highlighting the different APIs and use cases for each library.
oneAPI Deep Neural Network Library (oneDNN)
Pros of oneDNN
- Broader scope: Supports a wide range of deep learning operations beyond matrix multiplication
- Cross-platform: Works on CPUs, GPUs, and FPGAs, offering more flexibility
- Active development: Regularly updated with new features and optimizations
Cons of oneDNN
- Steeper learning curve: More complex API due to its broader scope
- Larger footprint: Requires more resources due to its comprehensive feature set
Code Comparison
gemmlowp:
#include "gemmlowp/public/gemmlowp.h"
typedef gemmlowp::MatrixMap<const uint8_t, gemmlowp::MapOrder::RowMajor> InputMap;
typedef gemmlowp::MatrixMap<uint8_t, gemmlowp::MapOrder::RowMajor> OutputMap;
gemmlowp::GemmContext context;
gemmlowp::GemmWithOutputPipeline<uint8_t, uint8_t, gemmlowp::DefaultL8R8BitDepthParams>(
&context, lhs, rhs, &result, -128, 127, output_pipeline);
oneDNN:
#include "dnnl.hpp"
dnnl::engine eng(dnnl::engine::kind::cpu, 0);
dnnl::memory::dims src_dims = {2, 2, 3, 3};
auto src_md = dnnl::memory::desc(src_dims, dnnl::memory::data_type::f32, dnnl::memory::format_tag::nchw);
auto src_mem = dnnl::memory(src_md, eng);
Both libraries offer optimized implementations for matrix operations, but oneDNN provides a more comprehensive set of deep learning primitives, while gemmlowp focuses specifically on low-precision matrix multiplication.
README
gemmlowp: a small self-contained low-precision GEMM library
This is not a full linear algebra library, only a GEMM library: it only does general matrix multiplication ("GEMM").
The meaning of "low precision" is detailed in this document: doc/low-precision.md
Some of the general design is explained in doc/design.md.
Warning: This library goes very slow if compiled incorrectly; see below.
Disclaimer
This is not an official Google product (experimental or otherwise), it is just code that happens to be owned by Google.
Mailing list
gemmlowp-related discussion, about either development or usage, is welcome on this Google Group (mailing list / forum):
https://groups.google.com/forum/#!forum/gemmlowp
Portability, target platforms/architectures
Should be portable to any platform with some C++11 and POSIX support, while we have optional optimized code paths for specific architectures.
Required:
- C++11 (a small conservative subset of it)
Required for some features:
- Some POSIX interfaces:
- pthreads (for multi-threaded operation and for profiling).
- sysconf (for multi-threaded operation to detect number of cores; may be bypassed).
Optional:
- Architecture-specific code paths use intrinsics or inline assembly. See "Architecture-specific optimized code paths" below.
Architecture-specific optimized code paths
We have some optimized code paths for specific instruction sets. Some are written in inline assembly, some are written in C++ using intrinsics. Both GCC and Clang are supported.
Current optimized code paths:
- ARM with NEON (both 32bit and 64bit).
- Intel x86 with SSE 4.1 (both 32bit and 64bit).
When building for x86, it's very important to pass -msse4.1 to the compiler, otherwise gemmlowp will use slow reference code. Bazel users can compile by running bazel build --copt=-msse4.1 //gemmlowp:all. The compiled binary should work on all Intel CPUs since 2008 (including low power microarchitectures) as well as AMD CPUs since 2011.
Please note that when compiling binaries that don't need to be distributed, it's generally a better idea to pass -march=native to the compiler. That flag implies -msse4.1, along with other flags that might be helpful. This of course assumes the host machine supports those instructions. Bazel users should prefer to run bazel build --config=opt //gemmlowp:all instead.
Details of what it takes to make an efficient port of gemmlowp, namely writing a suitable GEMM kernel and accompanying packing code, are explained in this file: doc/kernel.md.
Public interfaces
The gemmlowp public interface
gemmlowp's main public interface is in the public/ subdirectory. This is a headers-only library, so there is nothing to link to.
Usage documentation, and comments on the deprecation status of each public entry point, may be found in doc/public.md.
A full, self-contained usage example, showing how to quantize float matrices and perform a quantized matrix multiplication approximating a float matrix multiplication, is given in doc/quantization_example.cc.
Old EightBitIntGemm legacy deprecated interface
The eight_bit_int_gemm/ subdirectory contains an alternate interface that should be considered purely legacy, deprecated, and due to be removed at some point in the future.
Building
Building by manually invoking your compiler
Because gemmlowp is so simple, working with it involves only single-command-line compiler invocations. Therefore we expect that most people working with gemmlowp will either manually invoke their compiler, or write their own rules for their own preferred build system.
Keep in mind (previous section) that gemmlowp itself is a pure-headers-only library so there is nothing to build.
For an Android gemmlowp development workflow, the scripts/ directory contains a script to build and run a program on an Android device: scripts/test-android.sh
Building using Bazel
That being said, we also maintain a Bazel BUILD system as part of gemmlowp. Its usage is not mandatory at all and is only one possible way that gemmlowp libraries and tests may be built. If you are interested, Bazel's home page is http://bazel.build/. You can get started with using Bazel to build gemmlowp targets by first creating an empty WORKSPACE file in a parent directory, for instance:
$ cd gemmlowp/.. # change to parent directory containing gemmlowp/
$ touch WORKSPACE # declare that to be our workspace root
$ bazel build gemmlowp:all
Building using vcpkg
You can download and install gemmlowp using the vcpkg dependency manager:
git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
./vcpkg integrate install
./vcpkg install gemmlowp
The gemmlowp port in vcpkg is kept up to date by Microsoft team members and community contributors. If the version is out of date, please create an issue or pull request on the vcpkg repository.
Testing
Testing by manually building and running tests
The test/ directory contains unit tests. The primary unit test is test/test.cc. Since it also covers the EightBitIntGemm interface, it needs to be linked against eight_bit_int_gemm/eight_bit_int_gemm.cc. It also uses realistic data captured from a neural network run, in test/test_data.cc.
Thus you'll want to pass the following list of source files to your compiler/linker:
test/test.cc
eight_bit_int_gemm/eight_bit_int_gemm.cc
test/test_data.cc
The scripts/ directory contains a script to build and run a program on an Android device: scripts/test-android.sh
It expects the CXX environment variable to point to an Android toolchain's C++ compiler, and expects source files (and optionally, cflags) as command-line parameters. To build and run the above-mentioned main unit test, first set CXX, e.g.:
$ export CXX=/some/toolchains/arm-linux-androideabi-4.8/bin/arm-linux-androideabi-g++
Then run:
$ ./scripts/test-android.sh \
test/test.cc \
eight_bit_int_gemm/eight_bit_int_gemm.cc \
test/test_data.cc
Testing using Bazel
Alternatively, you can use Bazel to build and run tests. See the Bazel instructions in the above section on building. Once your Bazel workspace is set up, you can for instance do:
$ bazel test gemmlowp:all
Troubleshooting Compilation
If you're having trouble finding the compiler, follow these instructions to build a standalone toolchain: https://developer.android.com/ndk/guides/standalone_toolchain.html
Here's an example of setting up Clang 3.5:
$ export INSTALL_DIR=~/toolchains/clang-21-stl-gnu
$ $NDK/build/tools/make-standalone-toolchain.sh \
--toolchain=arm-linux-androideabi-clang3.5 --platform=android-21 \
--install-dir=$INSTALL_DIR
$ export CXX="$INSTALL_DIR/bin/arm-linux-androideabi-g++ \
--sysroot=$INSTALL_DIR/sysroot"
Some compilers (e.g. the default clang++ in the same bin directory) don't support NEON assembly. The benchmark build process will issue a warning if support isn't detected, and you should make sure you're using a compiler like arm-linux-androideabi-g++ that does include NEON.
Benchmarking
The main benchmark is test/benchmark.cc. It doesn't need to be linked to any other source file. We recommend building with assertions disabled (-DNDEBUG).
For example, the benchmark can be built and run on an Android device by doing:
$ ./scripts/test-android.sh test/benchmark.cc -DNDEBUG
If GEMMLOWP_TEST_PROFILE is defined then the benchmark will be built with profiling instrumentation (which makes it slower) and will dump profiles. See the next section on profiling.
Profiling
The profiling/ subdirectory offers a very simple, naive, inaccurate, non-interrupting sampling profiler that only requires pthreads (no signals). It relies on source code being instrumented with pseudo-stack labels; see profiling/instrumentation.h. A full example of using this profiler is given in the top comment of profiling/profiler.h.
Contributing
Contribution-related discussion is always welcome on the gemmlowp mailing list (see above).
We try to keep a current list of TODO items in the todo/ directory. Prospective contributors are welcome to pick one to work on, and communicate about it on the gemmlowp mailing list.
Details of the contributing process, including legalese, are in CONTRIBUTING.
Performance goals
Our performance goals differ from typical GEMM performance goals in the following ways:
- We care not only about speed, but also about minimizing power usage. We specifically care about charge usage in mobile/embedded devices. This implies that we care doubly about minimizing memory bandwidth usage: we care about it, like any GEMM, because of the impact on speed, and we also care about it because it is a key factor of power usage.
- Most GEMMs are optimized primarily for large dense matrix sizes (>= 1000). We do care about large sizes, but we also care specifically about the typically smaller matrix sizes encountered in various mobile applications. This means that we have to optimize for all sizes, not just for large enough sizes.