Top Related Projects
- Triton: Development repository for the Triton language and compiler
- DeepSpeed: A deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
- PyTorch: Tensors and Dynamic neural networks in Python with strong GPU acceleration
- JAX: Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
- TensorFlow: An Open Source Machine Learning Framework for Everyone
- Faiss: A library for efficient similarity search and clustering of dense vectors.
Quick Overview
GGML (Georgi Gerganov Machine Learning) is a tensor library for machine learning, focusing on efficient CPU and GPU inference. It's designed to be lightweight, portable, and easy to integrate into existing projects, with a particular emphasis on running large language models on consumer hardware.
Pros
- Highly optimized for CPU and GPU performance
- Supports quantization for reduced memory usage and faster inference
- Easy to integrate into existing C/C++ projects
- Actively maintained and rapidly evolving
Cons
- Limited documentation and examples compared to more established libraries
- Primarily focused on inference, not training
- Steeper learning curve for those not familiar with low-level C programming
- May require manual optimization for specific hardware configurations
Code Examples
- Creating tensors and defining a matrix multiplication (a minimal sketch; the exact ggml_init_params fields may vary slightly between ggml versions):
#include "ggml.h"

int main(void) {
    // ggml_init takes a params struct describing the context's working buffer
    struct ggml_init_params params = { .mem_size = 16*1024*1024, .mem_buffer = NULL, .no_alloc = false };
    struct ggml_context * ctx = ggml_init(params);
    // ggml_mul_mat requires both operands to share ne[0] (the inner dimension)
    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 3); // 4 x 3
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 2); // 4 x 2
    struct ggml_tensor * c = ggml_mul_mat(ctx, a, b);                      // defines a 3 x 2 result node
    (void) c; // computing c requires building a graph (see the next example)
    ggml_free(ctx);
    return 0;
}
- Defining and computing a small graph (an inference-style sketch; no real model weights are loaded here, and the graph-API names follow recent ggml versions):
#include <stdio.h>
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = { .mem_size = 64*1024*1024, .mem_buffer = NULL, .no_alloc = false };
    struct ggml_context * ctx = ggml_init(params);
    struct ggml_tensor * input   = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 768);
    struct ggml_tensor * weights = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 768, 512);
    struct ggml_tensor * output  = ggml_mul_mat(ctx, weights, input); // 512-element result
    // in a real program, fill input and weights with data first (e.g. via ggml_set_f32)
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, output);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads =*/ 4);
    printf("output[0] = %f\n", (double) ggml_get_f32_1d(output, 0));
    ggml_free(ctx);
    return 0;
}
- Using quantized tensors for reduced memory usage (a sketch; in practice quantized weights come from model-conversion scripts rather than being allocated empty):
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = { .mem_size = 16*1024*1024, .mem_buffer = NULL, .no_alloc = false };
    struct ggml_context * ctx = ggml_init(params);
    // 4-bit quantized weights (Q4_0) use roughly 8x less memory than F32
    struct ggml_tensor * w = ggml_new_tensor_2d(ctx, GGML_TYPE_Q4_0, 1024, 1024);
    struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1024);
    // quantized weights are consumed directly by ggml_mul_mat, which dequantizes blocks on the fly
    struct ggml_tensor * y = ggml_mul_mat(ctx, w, x);
    (void) y; // compute via a graph as in the previous example
    ggml_free(ctx);
    return 0;
}
Getting Started
To use GGML in your project (a minimal example file is sketched after these steps):
1. Clone the repository:
git clone https://github.com/ggerganov/ggml.git
2. Include GGML in your source file:
#include "ggml.h"
3. Compile your file and link it against the GGML library (built beforehand, see the Build section below):
gcc -I./ggml/include -c your_file.c
gcc -o your_program your_file.o ./ggml/build/libggml.a -lm
4. Run your program:
./your_program
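For a quick smoke test, your_file.c can be as small as the following sketch (the file name and tensor shape are just placeholders):
#include <stdio.h>
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = { .mem_size = 1024*1024, .mem_buffer = NULL, .no_alloc = false };
    struct ggml_context * ctx = ggml_init(params);
    struct ggml_tensor * t = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 2);
    printf("created a %lld x %lld tensor\n", (long long) t->ne[0], (long long) t->ne[1]);
    ggml_free(ctx);
    return 0;
}
If this compiles, links, and prints the tensor shape, the include and library paths from step 3 are set up correctly.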
Competitor Comparisons
Development repository for the Triton language and compiler
Pros of Triton
- Designed for GPU programming with a focus on high-performance computing
- Offers a Python-based domain-specific language for easier GPU kernel development
- Provides automatic optimization and code generation for different GPU architectures
Cons of Triton
- Steeper learning curve due to its specialized nature and GPU-specific concepts
- More limited in scope, primarily focused on GPU acceleration rather than general-purpose machine learning
Code Comparison
GGML example (matrix multiplication, public API declaration):
// records a matrix-multiplication node on the compute graph
struct ggml_tensor * ggml_mul_mat(
    struct ggml_context * ctx,
    struct ggml_tensor  * a,
    struct ggml_tensor  * b);
Triton example (matrix multiplication):
@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K, BLOCK_SIZE: tl.constexpr):
# Kernel implementation details
Summary
GGML is a general-purpose machine learning library with a focus on efficiency and portability, while Triton is specifically designed for GPU programming and optimization. GGML offers broader applicability across different hardware, whereas Triton excels in GPU-specific tasks and provides a more accessible way to write high-performance GPU kernels.
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Pros of DeepSpeed
- Optimized for distributed training across multiple GPUs and nodes
- Supports a wide range of AI models and frameworks (PyTorch, TensorFlow, etc.)
- Offers advanced features like ZeRO optimizer and pipeline parallelism
Cons of DeepSpeed
- More complex setup and configuration compared to GGML
- Primarily focused on large-scale training, may be overkill for smaller projects
- Steeper learning curve for beginners
Code Comparison
GGML (simple model evaluation):
struct ggml_context * ctx = ggml_init(params);
struct ggml_tensor * input = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 28, 28);
// model_eval is a user-defined function that builds and runs the model's graph
struct ggml_tensor * output = model_eval(ctx, input);
DeepSpeed (model initialization with ZeRO):
model_engine, optimizer, _, _ = deepspeed.initialize(
args=args,
model=model,
model_parameters=model.parameters(),
config=ds_config
)
Both libraries aim to optimize AI model performance, but DeepSpeed focuses on distributed training for large-scale models, while GGML emphasizes efficiency for smaller devices and inference. DeepSpeed offers more advanced features but requires more setup, whereas GGML provides a simpler interface for quick integration in C/C++ projects.
Tensors and Dynamic neural networks in Python with strong GPU acceleration
Pros of PyTorch
- Extensive ecosystem with a wide range of pre-built models and tools
- Strong community support and frequent updates
- Seamless integration with CUDA for GPU acceleration
Cons of PyTorch
- Larger memory footprint and slower inference times
- Steeper learning curve for beginners
- More complex setup and deployment process
Code Comparison
PyTorch example:
import torch
x = torch.tensor([1, 2, 3])
y = torch.tensor([4, 5, 6])
z = torch.matmul(x, y)
GGML example:
#include "ggml.h"
// ctx is a ggml_context created earlier with ggml_init()
struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 3);
struct ggml_tensor * y = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 3);
struct ggml_tensor * z = ggml_mul_mat(ctx, x, y); // 1-element result: the dot product of x and y
Summary
PyTorch offers a comprehensive deep learning framework with extensive features and community support, while GGML focuses on efficient inference for smaller models. PyTorch excels in research and large-scale projects, whereas GGML is better suited for lightweight applications and embedded systems. The choice between the two depends on the specific requirements of your project, such as model size, deployment environment, and performance constraints.
Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
Pros of JAX
- Supports automatic differentiation and GPU/TPU acceleration
- Offers a more comprehensive machine learning ecosystem
- Provides better integration with popular ML frameworks like TensorFlow and PyTorch
Cons of JAX
- Higher learning curve and complexity compared to GGML
- Larger codebase and potentially slower compilation times
- May be overkill for simpler machine learning tasks
Code Comparison
GGML (simple matrix multiplication):
struct ggml_tensor * result = ggml_mul_mat(ctx, A, B);
JAX (simple matrix multiplication):
import jax.numpy as jnp
result = jnp.dot(A, B)
GGML focuses on simplicity and efficiency for specific tasks, while JAX offers a more comprehensive set of tools for machine learning and scientific computing. GGML is written in C and designed for performance, especially in resource-constrained environments. JAX, on the other hand, is Python-based and provides a wider range of features, including automatic differentiation and hardware acceleration.
GGML is better suited for projects requiring lightweight, efficient implementations of specific ML algorithms, particularly in embedded systems or applications with limited resources. JAX is more appropriate for complex machine learning projects, research, and applications that benefit from its extensive ecosystem and integration with other ML frameworks.
An Open Source Machine Learning Framework for Everyone
Pros of TensorFlow
- Extensive ecosystem with robust tools, libraries, and community support
- Highly scalable for large-scale machine learning and deep learning projects
- Supports multiple programming languages and platforms
Cons of TensorFlow
- Steeper learning curve, especially for beginners
- Can be resource-intensive and slower for smaller projects
- More complex setup and configuration process
Code Comparison
GGML (simple matrix multiplication):
struct ggml_tensor * result = ggml_mul_mat(ctx, A, B);
TensorFlow (equivalent operation):
import tensorflow as tf
result = tf.matmul(A, B)
GGML focuses on simplicity and efficiency for smaller-scale projects, particularly in C/C++ environments. It's designed for lightweight implementations and easy integration into existing codebases.
TensorFlow, on the other hand, offers a comprehensive framework for building and deploying machine learning models at scale. It provides a wide range of tools and libraries for various ML tasks, making it suitable for complex, large-scale projects across different domains.
While GGML excels in simplicity and performance for specific use cases, TensorFlow's versatility and extensive ecosystem make it a popular choice for diverse machine learning applications.
A library for efficient similarity search and clustering of dense vectors.
Pros of Faiss
- Highly optimized for similarity search and clustering of dense vectors
- Supports GPU acceleration for faster processing
- Extensive documentation and well-established community support
Cons of Faiss
- Primarily focused on vector similarity search, less versatile for general machine learning tasks
- Steeper learning curve due to its specialized nature
- Larger codebase and dependencies compared to GGML
Code Comparison
GGML (matrix multiplication, public API declaration):
struct ggml_tensor * ggml_mul_mat(
    struct ggml_context * ctx,
    struct ggml_tensor  * a,
    struct ggml_tensor  * b);
Faiss (vector indexing):
faiss::IndexFlatL2 index(d);
index.add(n, xb);
index.search(nq, xq, k, distances, labels);
GGML focuses on efficient tensor operations for machine learning, while Faiss specializes in vector similarity search and clustering. GGML's codebase is more compact and versatile, suitable for various ML tasks. Faiss excels in high-performance vector operations, particularly for large-scale similarity search applications. The code snippets highlight their different focuses: GGML on matrix operations and Faiss on vector indexing and search.
README
ggml
Tensor library for machine learning
Note that this project is under active development.
Some of the development is currently happening in the llama.cpp and whisper.cpp repos.
Features
- Low-level cross-platform implementation
- Integer quantization support
- Broad hardware support
- Automatic differentiation (a graph-building sketch follows this list)
- ADAM and L-BFGS optimizers
- No third-party dependencies
- Zero memory allocations during runtime
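To give a flavor of the core API, below is a sketch of defining and computing f(x) = a*x*x + b with a compute graph; it assumes the in-context graph API, and exact function names and signatures vary between ggml versions:
#include <stdio.h>
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = { .mem_size = 16*1024*1024, .mem_buffer = NULL, .no_alloc = false };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
    ggml_set_param(ctx, x); // mark x as an input variable for the automatic differentiation machinery

    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
    struct ggml_tensor * f = ggml_add(ctx, ggml_mul(ctx, a, ggml_mul(ctx, x, x)), b);

    // build the forward compute graph and evaluate it
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, f);

    ggml_set_f32(x, 2.0f);
    ggml_set_f32(a, 3.0f);
    ggml_set_f32(b, 4.0f);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads =*/ 1);

    printf("f(2) = %f\n", (double) ggml_get_f32_1d(f, 0)); // 3*2*2 + 4 = 16
    ggml_free(ctx);
    return 0;
}
Because x is marked as a parameter, a backward graph can be built on top of gf to obtain df/dx automatically; the exact backward-graph call has changed across versions, so check ggml.h for the one matching your checkout.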
Build
git clone https://github.com/ggml-org/ggml
cd ggml
# install python dependencies in a virtual environment
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# build the examples
mkdir build && cd build
cmake ..
cmake --build . --config Release -j 8
GPT inference (example)
# run the GPT-2 small 117M model
../examples/gpt-2/download-ggml-model.sh 117M
./bin/gpt-2-backend -m models/gpt-2-117M/ggml-model.bin -p "This is an example"
For more information, check out the corresponding programs in the examples folder.
Using CUDA
# fix the path to point to your CUDA compiler
cmake -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.1/bin/nvcc ..
Using hipBLAS
cmake -DCMAKE_C_COMPILER="$(hipconfig -l)/clang" -DCMAKE_CXX_COMPILER="$(hipconfig -l)/clang++" -DGGML_HIP=ON ..
Using SYCL
# linux
source /opt/intel/oneapi/setvars.sh
cmake -G "Ninja" -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_SYCL=ON ..
# windows
"C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
cmake -G "Ninja" -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx -DGGML_SYCL=ON ..
Compiling for Android
Download and unzip the Android NDK from the official NDK download page. Set the NDK_ROOT_PATH environment variable or pass the absolute path to CMAKE_ANDROID_NDK in the command below.
cmake .. \
-DCMAKE_SYSTEM_NAME=Android \
-DCMAKE_SYSTEM_VERSION=33 \
-DCMAKE_ANDROID_ARCH_ABI=arm64-v8a \
-DCMAKE_ANDROID_NDK=$NDK_ROOT_PATH \
-DCMAKE_ANDROID_STL_TYPE=c++_shared
# create directories
adb shell 'mkdir /data/local/tmp/bin'
adb shell 'mkdir /data/local/tmp/models'
# push the compiled binaries to the folder
adb push bin/* /data/local/tmp/bin/
# push the ggml library
adb push src/libggml.so /data/local/tmp/
# push model files
adb push models/gpt-2-117M/ggml-model.bin /data/local/tmp/models/
adb shell
cd /data/local/tmp
export LD_LIBRARY_PATH=/data/local/tmp
./bin/gpt-2-backend -m models/ggml-model.bin -p "this is an example"
Resources
Top Related Projects
Development repository for the Triton language and compiler
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Tensors and Dynamic neural networks in Python with strong GPU acceleration
Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
An Open Source Machine Learning Framework for Everyone
A library for efficient similarity search and clustering of dense vectors.
Convert
designs to code with AI
Introducing Visual Copilot: A new AI model to turn Figma designs to high quality code using your components.
Try Visual Copilot