TensorComprehensions
A domain-specific language for expressing machine learning workloads.
Top Related Projects
Tensors and Dynamic neural networks in Python with strong GPU acceleration
An Open Source Machine Learning Framework for Everyone
Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
Development repository for the Triton language and compiler
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
Quick Overview
TensorComprehensions (TC) is a library from Facebook Research that provides a domain-specific language (DSL) for expressing tensor computations. From a TC definition it automatically synthesizes and autotunes efficient low-level code (e.g., CUDA kernels), and it ships with PyTorch and Caffe2 integration.
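To give a flavor of the notation before the full examples below, here is a minimal sketch of a single comprehension as it is typically embedded in Python (the variable name MATMUL_TC is illustrative). In the TC language, uppercase names are symbolic sizes, lowercase names are index variables, and +=! denotes a reduction whose accumulator is initialized before the first update; indices that appear only on the right-hand side (here k) are reduced over.
# A single TC comprehension for matrix multiplication; the string is what
# tensor_comprehensions.define() JIT-compiles into a CUDA kernel.
MATMUL_TC = """
def matmul(float(M, K) A, float(K, N) B) -> (C) {
    C(m, n) +=! A(m, k) * B(k, n)
}
"""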
Pros
- Expressive DSL: The TensorComprehensions DSL allows for concise and intuitive expression of complex tensor operations, making it easier to develop and maintain deep learning models.
- Automatic Optimization: The library automatically generates optimized low-level code (e.g., CUDA kernels) for the specified tensor computations, improving performance without requiring manual optimization.
- Portability: TensorComprehensions supports multiple hardware backends, including CPUs and GPUs, allowing for portable and efficient deep learning deployments.
- Integration with Deep Learning Frameworks: The library provides bindings for PyTorch and Caffe2, enabling adoption in existing projects built on those frameworks.
Cons
- Limited Adoption: TensorComprehensions is a relatively new library, and its adoption within the deep learning community may be slower compared to more established frameworks.
- Steep Learning Curve: The TensorComprehensions DSL and its optimization process can have a steep learning curve, especially for developers unfamiliar with domain-specific languages and low-level code generation.
- Dependency on Compiler Infrastructure: The performance of TensorComprehensions is heavily dependent on the underlying compiler infrastructure, which may introduce additional complexity and potential issues.
- Ongoing Development: As an active research project, TensorComprehensions may have a higher rate of changes and updates, which could impact the stability and long-term support of the library.
Code Examples
Example 1: Matrix Multiplication
import tensor_comprehensions as tc
import torch

# TC kernels run on the GPU, so inputs must be CUDA tensors.
A = torch.randn(128, 256).cuda()
B = torch.randn(256, 512).cuda()

# The output C is named in the signature and its shape is inferred;
# "k" appears only on the right-hand side, so it is reduced over, and
# "+=!" initializes the accumulator before reducing.
matmul = tc.define("""
def matmul(float(M, K) A, float(K, N) B) -> (C) {
    C(i, j) +=! A(i, k) * B(k, j)
}
""", name="matmul")

result = matmul(A, B)
This example demonstrates how to use the TensorComprehensions DSL to define and execute a matrix multiplication operation.
Example 2: Convolution
import tensor_comprehensions as tc
import torch

input = torch.randn(1, 3, 224, 224).cuda()
weight = torch.randn(64, 3, 3, 3).cuda()

# Valid (no-padding) 2D convolution: the output's spatial extent is
# inferred from the index expressions, so it is smaller than the input.
conv2d = tc.define("""
def conv2d(float(N, C, H, W) input, float(O, C, KH, KW) weight) -> (output) {
    output(n, o, h, w) +=! input(n, c, h + kh, w + kw) * weight(o, c, kh, kw)
}
""", name="conv2d")

result = conv2d(input, weight)
This example shows how to use TensorComprehensions to define and execute a 2D convolution operation.
Example 3: Reduction
import tensor_comprehensions as tc
import torch

input = torch.randn(128, 256).cuda()

# Sum over the first dimension: "i" appears only on the right-hand side,
# so it is the reduction index.
sum_reduction = tc.define("""
def sum_reduction(float(M, N) input) -> (output) {
    output(j) +=! input(i, j)
}
""", name="sum_reduction")

result = sum_reduction(input)
This example demonstrates how to use the TensorComprehensions DSL to define and execute a sum reduction operation on a 2D tensor.
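The automatic optimization highlighted under Pros is driven by an autotuner. A minimal sketch, reusing the matmul TC and tensors from Example 1 and following the same autotune/options pattern as the README example later on this page:
# Search for good mapping options for these specific input sizes, then run
# the kernel compiled with the best options found; cache=True stores the
# search result so later runs can reuse it.
best_options = matmul.autotune(A, B, cache=True)
result = matmul(A, B, options=best_options)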
Getting Started
To get started with TensorComprehensions, follow these steps:
- Install the required dependencies. TC provides PyTorch (and Caffe2) integration and is distributed as a conda package rather than on PyPI; at the time of release the documented command was:
conda install -y -c pytorch -c tensorcomp tensor_comprehensions
- Import the tensor_comprehensions module and define your tensor operations using the TC DSL, as in the minimal sketch below:
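A minimal end-to-end sketch, assuming a CUDA-capable GPU (fmax is one of the pointwise builtins provided by the TC language):
import tensor_comprehensions as tc
import torch

# Define a TC; it is JIT-compiled for the actual input sizes on first use.
relu = tc.define("""
def relu(float(M, N) I) -> (O) {
    O(m, n) = fmax(I(m, n), 0)
}
""", name="relu")

x = torch.randn(64, 128).cuda()
y = relu(x)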
Competitor Comparisons
Tensors and Dynamic neural networks in Python with strong GPU acceleration
Pros of PyTorch
- Larger and more active community, with more resources and support available
- Extensive documentation and tutorials, making it easier to get started
- Seamless integration with other popular libraries and tools in the machine learning ecosystem
Cons of PyTorch
- Slower performance compared to TensorFlow for certain workloads
- Slightly more complex to set up and configure for large-scale deployments
- Limited support for graph-mode export and deployment in earlier versions (prior to TorchScript)
Code Comparison
PyTorch:
import torch
import torch.nn as nn
class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.fc = nn.Linear(10, 5)

    def forward(self, x):
        return self.fc(x)
TensorComprehensions:
import tensor_comprehensions as tc
import torch

# A linear layer expressed as a TC; the weight is an explicit input and
# the output O is inferred from the index expressions.
linear = tc.define("""
def linear(float(N, K) X, float(K, P) W) -> (O) {
    O(n, p) +=! X(n, k) * W(k, p)
}
""", name="linear")

x = torch.randn(32, 10).cuda()
w = torch.randn(10, 5).cuda()
y = linear(x, w)
An Open Source Machine Learning Framework for Everyone
Pros of TensorFlow
- Extensive community support and documentation
- Wide range of pre-built models and tools
- Supports a variety of hardware accelerators (CPUs, GPUs, TPUs)
Cons of TensorFlow
- Steeper learning curve compared to TensorComprehensions
- Larger codebase and dependencies
- Can be more complex to configure and deploy
Code Comparison
TensorFlow:
import tensorflow as tf
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
y = tf.constant([[5.0, 6.0], [7.0, 8.0]])
z = tf.matmul(x, y)
print(z)
TensorComprehensions:
import tensor_comprehensions as tc
import torch

# TC has no eager tensor type; it compiles a TC definition and runs it on CUDA torch tensors.
matmul = tc.define("""
def matmul(float(M, K) A, float(K, N) B) -> (C) { C(i, j) +=! A(i, k) * B(k, j) }
""", name="matmul")
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]]).cuda()
y = torch.tensor([[5.0, 6.0], [7.0, 8.0]]).cuda()
print(matmul(x, y))
Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
Pros of JAX
- More general-purpose and flexible for various machine learning tasks
- Better integration with Python ecosystem and NumPy-like syntax
- Active development and broader community support
Cons of JAX
- Steeper learning curve for users new to functional programming concepts
- Less focus on automatic code generation for specific hardware
Code Comparison
TensorComprehensions:
def mm(float(M,K) A, float(K,N) B) -> (C) {
C(m,n) +=! A(m,k) * B(k,n)
}
JAX:
import jax.numpy as jnp

def mm(A, B):
    return jnp.dot(A, B)
Key Differences
- TensorComprehensions focuses on generating optimized code for specific hardware
- JAX provides a more general-purpose framework for machine learning and scientific computing
- TensorComprehensions uses a domain-specific language, while JAX extends Python with functional programming concepts
Use Cases
- TensorComprehensions: Optimizing specific tensor operations for hardware acceleration
- JAX: Developing and researching machine learning models, especially those requiring automatic differentiation and GPU acceleration
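To make the composability point concrete, here is a small illustrative sketch (the loss function and shapes are made up) combining jit, vmap, and grad:
import jax
import jax.numpy as jnp

# A scalar loss; jax.grad differentiates it with respect to w, jax.vmap maps
# it over a batch of inputs, and jax.jit compiles the composition for CPU/GPU/TPU.
def loss(w, x):
    return jnp.sum((x @ w) ** 2)

per_example_grads = jax.jit(jax.vmap(jax.grad(loss), in_axes=(None, 0)))
w = jnp.ones((4, 3))
xs = jnp.ones((8, 4))                  # batch of 8 inputs
grads = per_example_grads(w, xs)       # shape (8, 4, 3)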
Development repository for the Triton language and compiler
Pros of Triton
- Optimized for GPU Performance: Triton is designed to leverage the power of GPUs, providing efficient code generation and execution for GPU-accelerated workloads.
- Flexible Language: Triton offers a flexible and expressive language that allows developers to write complex tensor operations with ease.
- Extensibility: Triton's modular design makes it easy to extend and integrate with other libraries and frameworks.
Cons of Triton
- Steeper Learning Curve: Triton's specialized language and tooling may require more time and effort to learn compared to more widely-used frameworks like TensorFlow or PyTorch.
- Limited Ecosystem: Triton has a smaller community and ecosystem compared to more established deep learning frameworks, which may limit the availability of pre-built models and integrations.
Code Comparison
Tensor Comprehensions (TC):
import tensor_comprehensions as tc

matmul = tc.define("""
def matmul(float(M, K) A, float(K, N) B) -> (C) { C(i, j) +=! A(i, k) * B(k, j) }
""", name="matmul")
C = matmul(A, B)
Triton:
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(A, B, C, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn, BLOCK_K: tl.constexpr):
    # One program instance per output element (i, j); in this simplified
    # sketch BLOCK_K is a power of two >= K, so no loop over K is needed.
    i, j = tl.program_id(0), tl.program_id(1)
    k = tl.arange(0, BLOCK_K)
    a = tl.load(A + i * stride_am + k * stride_ak, mask=k < K, other=0.0)
    b = tl.load(B + k * stride_bk + j * stride_bn, mask=k < K, other=0.0)
    tl.store(C + i * stride_cm + j * stride_cn, tl.sum(a * b, axis=0))
Both examples implement a simple matrix multiplication operation, but the Triton version provides a more low-level and explicit approach, while the Tensor Comprehensions version uses a more concise and declarative syntax.
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
Pros of ONNX Runtime
- Cross-platform Compatibility: ONNX Runtime supports a wide range of platforms, including Windows, Linux, and macOS, making it a versatile choice for deployment across different environments.
- Performance Optimization: ONNX Runtime includes various optimization techniques, such as graph optimization and hardware acceleration, which can improve the inference performance of your models.
- Extensive Model Support: ONNX Runtime supports a large number of pre-trained models, including popular models like BERT, ResNet, and YOLOv5, making it easier to integrate these models into your applications.
Cons of ONNX Runtime
- Limited Customization: ONNX Runtime is primarily focused on running exported models, so it offers little room for expressing new custom operators, whereas Tensor Comprehensions is designed to synthesize new kernels from short declarative definitions.
- Dependency on ONNX Format: ONNX Runtime requires your models to be in the ONNX format, which may require additional conversion steps if your models are in a different format, such as TensorFlow or PyTorch.
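For the PyTorch case, that conversion is typically a single export call; a minimal sketch (the model and file name are illustrative):
import torch
import torch.nn as nn

# Any traceable nn.Module can be exported the same way.
model = nn.Sequential(nn.Conv2d(3, 64, 3), nn.ReLU())
dummy_input = torch.randn(1, 3, 224, 224)

# Export to ONNX so it can be loaded by onnxruntime.InferenceSession below.
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["input"], output_names=["output"])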
Code Comparison
ONNX Runtime:
import numpy as np
import onnxruntime as ort

# Load the ONNX model (selecting an execution provider explicitly)
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# Define the input data
input_data = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Run the inference
output = session.run(None, {"input": input_data})
Tensor Comprehensions:
import tensor_comprehensions as tc
import torch

# Define the TC program; the body must use the declared tensor names
lang = """
def matmul(float(M, N) A, float(N, P) B) -> (C) {
    C(m, p) +=! A(m, n) * B(n, p)
}
"""

# Compile the TC program
matmul = tc.define(lang, name="matmul")

# Run the TC program on CUDA tensors
A = torch.randn(128, 256).cuda()
B = torch.randn(256, 512).cuda()
C = matmul(A, B)
README
Tensor Comprehensions (TC) is a fully-functional C++ library to automatically synthesize high-performance machine learning kernels using Halide, ISL and NVRTC or LLVM. TC additionally provides basic integration with Caffe2 and PyTorch. We provide more details in our paper on arXiv.
This library is designed to be highly portable, machine-learning-framework agnostic and only requires a simple tensor library with memory allocation, offloading and synchronization capabilities.
For now, we have integrated TC with Caffe2 and PyTorch.
A simple example
The following illustrates a short but powerful feature of the library: the capacity to JIT-compile high-performance machine learning kernels on demand, for specific sizes.
import tensor_comprehensions as tc
import torch
lang = """
def tensordot(float(N, C1, C2, H, W) I0, float(N, C2, C3, H, W) I1) -> (O) {
O(n, c1, c3, h, w) +=! I0(n, c1, c2, h, w) * I1(n, c2, c3, h, w)
}
"""
N, C1, C2, C3, H, W = 32, 512, 8, 2, 28, 28
tensordot = tc.define(lang, name="tensordot")
I0, I1 = torch.randn(N, C1, C2, H, W).cuda(), torch.randn(N, C2, C3, H, W).cuda()
best_options = tensordot.autotune(I0, I1, cache=True)
out = tensordot(I0, I1, options=best_options)
After a few generations of autotuning on a 2-GPU P100 system, we see results resembling the autotuner output shown further below.
In C++ a minimal autotuning example resembles the following:
TEST(TensorDot, SimpleAutotune) {
// 1. Define and setup the TC compilation unit with CUDA memory
// management backed by ATen tensors.
std::string tc = R"TC(
def tensordot(float(N, C1, C2, H, W) I0,
float(N, C2, C3, H, W) I1) -> (O)
{
O(n, c1, c3, h, w) +=! I0(n, c1, r_c2, h, w) * I1(n, r_c2, c3, h, w)
}
)TC";
// 2. Allocate tensors with random data.
at::Tensor I0 = at::CUDA(at::kFloat).rand({32, 8, 16, 17, 25});
at::Tensor I1 = at::CUDA(at::kFloat).rand({32, 16, 2, 17, 25});
// 3. Run autotuning with evolutionary search starting from a naive option.
auto naiveOptions = Backend::MappingOptionsType::makeNaiveMappingOptions();
tc::aten::ATenAutotuner<tc::CudaBackend, tc::autotune::GeneticSearch>
geneticAutotuneATen(tc);
auto bestOption =
geneticAutotuneATen.tune("tensordot", {I0, I1}, {naiveOptions});
// 4. Compile and run the TC with the best option after allocating output
// tensors.
auto pExecutor =
tc::aten::compile<Backend>(tc, "tensordot", {I0, I1}, bestOption[0]);
auto outputs = tc::aten::prepareOutputs(tc, "tensordot", {I0, I1});
auto timings = tc::aten::profile(*pExecutor, {I0, I1}, outputs);
std::cout << "tensordot size I0: " << I0.sizes() << ", "
<< "size I1: " << I1.sizes()
<< " ran in: " << timings.kernelRuntime.toMicroSeconds() << "us\n";
}
Note that we only need to autotune a TC once to obtain reasonable mapping options; those options can then be reused for other problem sizes of the same TC, as the following snippet illustrates:
// 5. Reuse bestOptions from autotuning on another kernel
for (auto sizes : std::vector<std::pair<at::IntList, at::IntList>>{
{{4, 9, 7, 16, 14}, {4, 7, 3, 16, 14}},
{{8, 5, 11, 10, 10}, {8, 11, 16, 10, 10}},
}) {
at::Tensor I0 = makeATenTensor<Backend>(sizes.first);
at::Tensor I1 = makeATenTensor<Backend>(sizes.second);
auto pExecutor =
tc::aten::compile<Backend>(tc, "tensordot", {I0, I1}, bestOption[0]);
auto outputs = tc::aten::prepareOutputs(tc, "tensordot", {I0, I1});
auto timings = tc::aten::profile(*pExecutor, {I0, I1}, outputs);
std::cout << "tensordot size I0: " << I0.sizes() << ", "
<< "size I1: " << I1.sizes()
<< " ran in: " << timings.kernelRuntime.toMicroSeconds()
<< "us\n";
}
Putting it all together, one may see:
> build$ ./examples/example_simple
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from TensorDot
[ RUN ] TensorDot.SimpleAutotune
Generation 0 Jobs(Compiled, GPU)/total (10, 10)/10 (best/median/worst)us: 226/4238/7345
Generation 1 Jobs(Compiled, GPU)/total (10, 10)/10 (best/median/worst)us: 220/221/233
Generation 2 Jobs(Compiled, GPU)/total (10, 10)/10 (best/median/worst)us: 220/221/234
tensordot size I0: [16, 8, 16, 17, 25], size I1: [16, 16, 2, 17, 25] ran in: 239us
tensordot size I0: [4, 9, 7, 16, 14], size I1: [4, 7, 3, 16, 14] ran in: 56us
tensordot size I0: [8, 5, 11, 10, 10], size I1: [8, 11, 16, 10, 10] ran in: 210us
[ OK ] TensorDot.SimpleAutotune (27812 ms)
[----------] 1 test from TensorDot (27812 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (27812 ms total)
[ PASSED ] 1 test.
We have not yet characterized the precise fraction of peak performance we obtain, but it is not uncommon to reach 80%+ of peak shared-memory bandwidth after autotuning. Solid register-level optimizations are still in the works, but TC in its current form already addresses the productivity gap between the needs of research and the needs of production, which is why we are excited to share it with the entire community and bring this collaborative effort into the open.
Documentation
General: Detailed information about Tensor Comprehensions is available in the project documentation.
C++ API: Documentation for our C++ API is also provided.
Installation
Binaries
We provide a conda package to make it easy to install and use the TC binaries. Please refer to our documentation for instructions.
From Source
The documentation contains instructions for building TC via Docker, via conda packages, or in a non-conda environment.
Communication
- Email: tensorcomp@fb.com
- GitHub issues: bug reports, feature requests, install issues, RFCs, thoughts, etc.
Code of Conduct
See the CODE_OF_CONDUCT.md file for more details.
License
Tensor Comprehensions is distributed under a permissive Apache v2.0 license, see the LICENSE file for more details.
Contributing
See the CONTRIBUTING.md file for more details.