Salesforce / CodeT5

Home of CodeT5: Open Code LLMs for Code Understanding and Generation

Top Related Projects

CodeBERT

CodeXGLUE

Google Research

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

Quick Overview

CodeT5 is an open-source project by Salesforce that introduces a family of encoder-decoder models pre-trained on both natural language and programming language data. It is designed to support various code intelligence tasks, such as code generation, translation, and summarization, across multiple programming languages.

Pros

  • Supports multiple programming languages and code-related tasks
  • Pre-trained on a large corpus of code and natural language data
  • Achieves state-of-the-art performance on various code intelligence benchmarks
  • Easily fine-tunable for specific tasks or domains

Cons

  • Requires significant computational resources for training and fine-tuning
  • May struggle with very complex or domain-specific code tasks
  • Limited to the programming languages it was trained on
  • Potential biases in generated code based on training data

Code Examples

  1. Code Generation:
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Load the CodeT5-base checkpoint. Note that codet5-base is a pre-trained
# (not instruction-tuned) model, so for dependable results on a specific task
# you would typically fine-tune it or use a task-specific checkpoint.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

# Encode a natural language prompt and decode the generated code
input_text = "Generate a Python function to calculate the factorial of a number"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

generated_ids = model.generate(input_ids, max_length=128)
generated_code = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(generated_code)
  2. Code Summarization:
# Reuses the tokenizer and model loaded in the code generation example above.
# The fine-tuned checkpoint Salesforce/codet5-base-multi-sum generally gives
# better summaries than the base model.
input_text = "def factorial(n):\n    if n == 0:\n        return 1\n    else:\n        return n * factorial(n-1)"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

generated_ids = model.generate(input_ids, max_length=50)
summary = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(summary)
  3. Code Translation (Java to Python):
# Reuses the tokenizer and model from the first example. As with the other
# tasks, a checkpoint fine-tuned on a translation dataset will perform best.
input_text = "public static int factorial(int n) {\n    if (n == 0) return 1;\n    return n * factorial(n - 1);\n}"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

generated_ids = model.generate(input_ids, max_length=128)
translated_code = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(translated_code)

Getting Started

To get started with CodeT5, follow these steps:

  1. Install the required libraries:
pip install transformers torch
  2. Load the pre-trained model and tokenizer:
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")
  3. Use the model for your desired code intelligence task (e.g., code generation, summarization, or translation) as shown in the code examples above; a quick end-to-end check is sketched below.
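
As a quick check that the setup works, a single inference call can be run with the tokenizer and model loaded in step 2 (the prompt and generation length here are arbitrary):

# Assumes `tokenizer` and `model` from step 2 are in scope
input_ids = tokenizer("def hello_world():", return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=32)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))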

Competitor Comparisons

CodeBERT

Pros of CodeBERT

  • Larger pre-training dataset, including both natural language and programming language data
  • Supports a wider range of programming languages (6 languages)
  • Better performance on code search and code documentation generation tasks

Cons of CodeBERT

  • Limited to encoder-only architecture, which may restrict its capabilities in certain tasks
  • Less flexible for fine-tuning on downstream tasks compared to CodeT5's encoder-decoder architecture
  • Smaller model size, potentially limiting its ability to capture complex code patterns

Code Comparison

CodeBERT example:

from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")

CodeT5 example:

from transformers import T5ForConditionalGeneration, RobertaTokenizer
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")
tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")

Both repositories provide pre-trained models for code understanding and generation tasks. CodeBERT offers a more extensive pre-training dataset and supports more programming languages, while CodeT5 provides a more flexible encoder-decoder architecture that can be adapted to a wider range of downstream tasks. The choice between the two depends on the specific requirements of the project and the target programming languages.
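
To illustrate the architectural difference, here is a minimal sketch of the retrieval-style usage that CodeBERT's encoder-only design is suited to; the snippet and the first-token pooling choice are illustrative assumptions, not code from either repository:

import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")

code_snippet = "def add(a, b):\n    return a + b"
inputs = tokenizer(code_snippet, return_tensors="pt")

# Encoder-only models produce contextual embeddings rather than generated text;
# a single vector per snippet (here, the first token's hidden state) is a common
# choice for code search and similarity tasks.
with torch.no_grad():
    outputs = model(**inputs)
code_embedding = outputs.last_hidden_state[:, 0, :]
print(code_embedding.shape)  # e.g. torch.Size([1, 768])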

CodeXGLUE

Pros of CodeXGLUE

  • Comprehensive benchmark suite for code intelligence tasks
  • Includes a wide range of programming languages and tasks
  • Well-documented datasets and evaluation metrics

Cons of CodeXGLUE

  • Primarily focused on benchmarking, not providing pre-trained models
  • May require more setup and preprocessing for specific tasks
  • Less emphasis on fine-tuning and transfer learning

Code Comparison

CodeXGLUE sample input (code summarization task):

def calculate_average(numbers):
    total = sum(numbers)
    count = len(numbers)
    return total / count if count > 0 else 0

CodeT5 example (code generation task):

# Generate a function to calculate the average of a list of numbers
def calculate_average(numbers):
    if not numbers:
        return 0
    return sum(numbers) / len(numbers)

While CodeXGLUE focuses on providing datasets and evaluation metrics for various code-related tasks, CodeT5 offers pre-trained models and fine-tuning capabilities for code understanding and generation. CodeXGLUE is better suited for benchmarking and comparing different models, while CodeT5 provides a more ready-to-use solution for code-related tasks.
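
The two can also be combined: below is a hedged sketch of running a CodeT5 summarization checkpoint over a few CodeXGLUE code-to-text examples, assuming the dataset is mirrored on the Hugging Face Hub under the id code_x_glue_ct_code_to_text (the dataset id and field names are assumptions to verify against the CodeXGLUE documentation):

from datasets import load_dataset
from transformers import RobertaTokenizer, T5ForConditionalGeneration

# Assumed dataset id/config; check the CodeXGLUE repository for the authoritative source
dataset = load_dataset("code_x_glue_ct_code_to_text", "python", split="test[:5]")

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base-multi-sum")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base-multi-sum")

for example in dataset:
    input_ids = tokenizer(example["code"], return_tensors="pt",
                          truncation=True, max_length=512).input_ids
    summary_ids = model.generate(input_ids, max_length=30)
    print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))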

Google Research

Pros of google-research

  • Broader scope, covering various AI and ML research areas
  • More frequent updates and contributions from a larger team
  • Extensive documentation and research papers accompanying projects

Cons of google-research

  • Less focused on a specific domain, potentially overwhelming for users
  • May require more effort to navigate and find relevant projects
  • Some projects might be experimental or not production-ready

Code comparison

CodeT5:

from transformers import T5ForConditionalGeneration, RobertaTokenizer

model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")
tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")

google-research:

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
  layers.Dense(64, activation='relu'),
  layers.Dense(10, activation='softmax')
])

The code snippets demonstrate the different focus areas of the repositories. CodeT5 is purpose-built for code-related tasks on top of pre-trained Transformer models, while google-research is a broad collection of research code rather than a single framework. CodeT5 builds on the Transformers library, whereas google-research projects commonly use TensorFlow, among other frameworks.

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

Pros of transformers

  • Broader scope: Supports a wide range of NLP tasks and models
  • Larger community: More contributors and extensive documentation
  • Frequent updates: Regular releases with new features and improvements

Cons of transformers

  • Steeper learning curve: More complex due to its extensive functionality
  • Potentially slower: May have higher overhead for simple tasks
  • Less specialized: Not specifically optimized for code-related tasks

Code comparison

CodeT5:

from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')

transformers:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

The code snippets show that both repositories use similar patterns for loading models and tokenizers. However, CodeT5 is specifically designed for code-related tasks, while transformers provides a more general-purpose approach with its Auto classes.
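
Because CodeT5 checkpoints use a standard T5 architecture, the same Auto-class pattern should also load them directly (a minimal sketch, mirroring the Quick Overview example above):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# The Auto classes resolve the RoBERTa-style tokenizer and T5 model
# from the checkpoint's configuration files.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-base")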

README

CodeT5 and CodeT5+

Official research release for the CodeT5 and CodeT5+ models for Code Understanding and Generation from Salesforce Research, introduced in the following papers:

Title: CodeT5+: Open Code Large Language Models for Code Understanding and Generation

Authors: Yue Wang*, Hung Le*, Akhilesh Deepak Gotmare, Nghi D.Q. Bui, Junnan Li, Steven C.H. Hoi (* indicates equal contribution)

Title: CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

Authors: Yue Wang, Weishi Wang, Shafiq Joty, Steven C.H. Hoi

In practice, CodeT5 and CodeT5+ models can be deployed as an AI-powered coding assistant to boost the productivity of software developers. At Salesforce, we built an AI coding assistant demo using CodeT5 as a VS Code plugin that provides three capabilities:

  • Text-to-code generation: generate code from a natural language description.
  • Code autocompletion: complete the whole function given the target function name.
  • Code summarization: generate a natural language summary of a function (a minimal usage sketch follows this list).
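
A minimal sketch of the summarization capability, using the fine-tuned Salesforce/codet5-base-multi-sum checkpoint released with this repository (the example function and generation length are illustrative):

from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base-multi-sum")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base-multi-sum")

code = "def greet(name):\n    return 'Hello, ' + name"
input_ids = tokenizer(code, return_tensors="pt").input_ids

# Generate a short natural language summary of the function
summary_ids = model.generate(input_ids, max_length=30)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))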

CodeT5 demo

What's New: 🎉

May 2023

CodeT5+ paper and models are released!🔥
paper | code | model | blog

Sep 2022

Our CodeRL paper has been accepted to NeurIPS 2022!
paper | code | blog

July 2022

We release two large-sized CodeT5 checkpoints on Hugging Face, Salesforce/codet5-large and Salesforce/codet5-large-ntp-py, which are introduced in the CodeRL paper.

Oct 2021

We release fine-tuned checkpoints for all the downstream tasks covered in the paper. In addition, we release a CodeT5-base fine-tuned checkpoint (Salesforce/codet5-base-multi-sum) for multilingual code summarization.

Sep 2021

CodeT5 paper accepted to EMNLP 2021 and models are released!
paper | code | model | model card | blog

Citation

If you find this code to be useful for your research, please consider citing:

@inproceedings{
    wang2021codet5,
    title={CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation}, 
    author={Wang, Yue and Wang, Weishi and Joty, Shafiq and Hoi, Steven C.H.},
    booktitle={EMNLP},
    year={2021},
}

@inproceedings{
    le2022coderl,
    title={CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
    author={Le, Hung and Wang, Yue and Gotmare, Akhilesh Deepak and Savarese, Silvio and Hoi, Steven C. H.},
    booktitle={NeurIPS},
    year={2022}
}

@article{
    wang2023codet5plus,
    title={CodeT5+: Open Code Large Language Models for Code Understanding and Generation},
    author={Wang, Yue and Le, Hung and Gotmare, Akhilesh Deepak and Bui, Nghi D.Q. and Li, Junnan and Hoi, Steven C. H.},
    journal={arXiv preprint},
    year={2023}
}

License

The code is released under the BSD-3 License (see LICENSE.txt for details), but we also ask that users respect the following:

This software should not be used to promote or profit from:

  • violence, hate, and division,
  • environmental destruction,
  • abuse of human rights, or
  • the destruction of people's physical and mental health.

We encourage users of this software to tell us about the applications in which they are putting it to use by emailing codeT5@salesforce.com, and to use appropriate documentation when developing high-stakes applications of this model.

Get Involved

Please create a GitHub issue if you have any questions, suggestions, requests, or bug reports. We welcome PRs!