Top Related Projects
CodeBERT
CodeXGLUE
Google Research
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Quick Overview
CodeT5 is an open-source project by Salesforce that introduces a family of encoder-decoder models pre-trained on both natural language and programming language data. It is designed to support various code intelligence tasks, such as code generation, translation, and summarization, across multiple programming languages.
Pros
- Supports multiple programming languages and code-related tasks
- Pre-trained on a large corpus of code and natural language data
- Achieves state-of-the-art performance on various code intelligence benchmarks
- Easily fine-tunable for specific tasks or domains
Cons
- Requires significant computational resources for training and fine-tuning
- May struggle with very complex or domain-specific code tasks
- Limited to the programming languages it was trained on
- Potential biases in generated code based on training data
Code Examples
- Code Generation:
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Load the pre-trained CodeT5-base checkpoint and its tokenizer
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

# Note: codet5-base is a pre-trained checkpoint, not one fine-tuned to follow
# natural language instructions, so free-form prompts like this can produce
# rough output; task-specific fine-tuned checkpoints give better results.
input_text = "Generate a Python function to calculate the factorial of a number"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=128)
generated_code = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(generated_code)
- Code Summarization:
# Reuses the tokenizer and model loaded above. For summarization, the fine-tuned
# Salesforce/codet5-base-multi-sum checkpoint (see What's New below) is better
# suited than the raw pre-trained model.
input_text = "def factorial(n):\n    if n == 0:\n        return 1\n    else:\n        return n * factorial(n-1)"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=50)
summary = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(summary)
- Code Translation (Java to Python):
# The raw pre-trained checkpoint is not fine-tuned for translation, so output
# quality for arbitrary language pairs may be limited; the repo also releases
# checkpoints fine-tuned on the downstream tasks from the paper (see What's New below).
input_text = "public static int factorial(int n) {\n    if (n == 0) return 1;\n    return n * factorial(n - 1);\n}"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=128)
translated_code = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(translated_code)
Getting Started
To get started with CodeT5, follow these steps:
- Install the required libraries:
pip install transformers torch
- Load the pre-trained model and tokenizer:
from transformers import AutoTokenizer, T5ForConditionalGeneration
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")
- Use the model for your desired code intelligence task (e.g., code generation, summarization, or translation) as shown in the code examples above. If you want to adapt the model to your own data, a minimal fine-tuning sketch follows below.
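Because CodeT5 is a standard T5-style model in the Transformers library, fine-tuning follows the usual seq2seq recipe. Below is a minimal, hedged sketch of a single supervised training step in PyTorch; the toy source/target pair, sequence lengths, and learning rate are illustrative placeholders, not the repository's official fine-tuning setup (see the repo's fine-tuning scripts for that).
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")
model.train()

# Toy code/summary pair for illustration only
source = "def add(a, b):\n    return a + b"
target = "Add two numbers and return the result."

inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=256)
labels = tokenizer(target, return_tensors="pt", truncation=True, max_length=64).input_ids

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# One training step: a forward pass with labels returns the cross-entropy loss
outputs = model(input_ids=inputs.input_ids,
                attention_mask=inputs.attention_mask,
                labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
In practice you would wrap this in a dataloader loop (or use the Transformers Trainer) over a real dataset such as the ones referenced in the repository.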
Competitor Comparisons
CodeBERT
Pros of CodeBERT
- Earlier, widely adopted baseline for code understanding with many follow-up works building on it
- Pre-trained on bimodal natural language-code pairs from the CodeSearchNet corpus (six programming languages)
- Strong performance on retrieval-style tasks such as natural language code search
Cons of CodeBERT
- Encoder-only architecture, so it cannot generate code or text on its own
- Generation tasks (e.g., summarization, translation) require pairing it with a separately trained decoder, unlike CodeT5's built-in encoder-decoder setup
- Smaller model size, potentially limiting its ability to capture complex code patterns
Code Comparison
CodeBERT example:
from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
CodeT5 example:
from transformers import T5ForConditionalGeneration, RobertaTokenizer
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")
tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
Both repositories provide pre-trained models for code intelligence tasks. CodeBERT offers a strong encoder for retrieval- and classification-style tasks such as code search, while CodeT5's encoder-decoder architecture additionally covers generation tasks such as summarization, translation, and code generation out of the box. The choice between the two depends on the specific requirements of the project and the target programming languages.
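To make the architectural difference concrete, here is a hedged sketch of how the two models are typically used: CodeBERT's encoder produces contextual embeddings that downstream retrieval or classification heads consume, while CodeT5 can decode an output sequence directly. The input string is a made-up example and the snippet is only meant to illustrate the usage pattern, not any particular benchmark setup.
from transformers import RobertaTokenizer, RobertaModel, T5ForConditionalGeneration

code = "def add(a, b): return a + b"

# CodeBERT: encoder-only, returns hidden states (embeddings), no generate()
cb_tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
cb_model = RobertaModel.from_pretrained("microsoft/codebert-base")
cb_inputs = cb_tokenizer(code, return_tensors="pt")
embeddings = cb_model(**cb_inputs).last_hidden_state  # shape: (1, seq_len, hidden_size)

# CodeT5: encoder-decoder, can generate an output sequence directly
ct_tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
ct_model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")
ct_inputs = ct_tokenizer(code, return_tensors="pt")
output_ids = ct_model.generate(ct_inputs.input_ids, max_length=32)
print(ct_tokenizer.decode(output_ids[0], skip_special_tokens=True))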
CodeXGLUE
Pros of CodeXGLUE
- Comprehensive benchmark suite for code intelligence tasks
- Includes a wide range of programming languages and tasks
- Well-documented datasets and evaluation metrics
Cons of CodeXGLUE
- Primarily focused on benchmarking, not providing pre-trained models
- May require more setup and preprocessing for specific tasks
- Less emphasis on fine-tuning and transfer learning
Code Comparison
CodeXGLUE example (an input function from a code summarization task):
def calculate_average(numbers):
    total = sum(numbers)
    count = len(numbers)
    return total / count if count > 0 else 0
CodeT5 example (output for a code generation prompt):
# Generate a function to calculate the average of a list of numbers
def calculate_average(numbers):
    if not numbers:
        return 0
    return sum(numbers) / len(numbers)
While CodeXGLUE focuses on providing datasets and evaluation metrics for various code-related tasks, CodeT5 offers pre-trained models and fine-tuning capabilities for code understanding and generation. CodeXGLUE is better suited for benchmarking and comparing different models, while CodeT5 provides a more ready-to-use solution for code-related tasks.
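Since CodeXGLUE is primarily about datasets and metrics, a typical workflow is to generate outputs with a model such as CodeT5 and score them against reference solutions. The sketch below uses NLTK's sentence-level BLEU purely as a stand-in for CodeXGLUE's official evaluation scripts (which use smoothed BLEU and CodeBLEU); the reference and hypothesis strings are made up for illustration.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Made-up reference summary and model output, for illustration only
reference = "return the average of a list of numbers".split()
hypothesis = "compute the average of a list of numbers".split()

smoother = SmoothingFunction().method1
score = sentence_bleu([reference], hypothesis, smoothing_function=smoother)
print(f"BLEU: {score:.3f}")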
Google Research
Pros of google-research
- Broader scope, covering various AI and ML research areas
- More frequent updates and contributions from a larger team
- Extensive documentation and research papers accompanying projects
Cons of google-research
- Less focused on a specific domain, potentially overwhelming for users
- May require more effort to navigate and find relevant projects
- Some projects might be experimental or not production-ready
Code comparison
CodeT5:
from transformers import T5ForConditionalGeneration, RobertaTokenizer
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")
tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
google-research:
import tensorflow as tf
from tensorflow.keras import layers
model = tf.keras.Sequential([
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
])
The code snippets reflect the different focus of the two repositories. CodeT5 provides pre-trained models for code-related tasks loaded through the Transformers library, while google-research is a broad collection of research codebases, much of it implemented in TensorFlow or JAX. The Keras snippet above is only representative; each google-research project ships its own code.
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Pros of transformers
- Broader scope: Supports a wide range of NLP tasks and models
- Larger community: More contributors and extensive documentation
- Frequent updates: Regular releases with new features and improvements
Cons of transformers
- Steeper learning curve: More complex due to its extensive functionality
- Potentially slower: May have higher overhead for simple tasks
- Less specialized: Not specifically optimized for code-related tasks
Code comparison
CodeT5:
from transformers import RobertaTokenizer, T5ForConditionalGeneration
tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')
transformers:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
The code snippets show that both follow the same loading pattern; in fact, CodeT5 checkpoints are distributed on the Hugging Face Hub and loaded through the transformers library itself. CodeT5 supplies models specialized for code-related tasks, while transformers provides the general-purpose framework and Auto classes that can load many model families, including CodeT5.
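The two styles are largely interchangeable: because CodeT5 checkpoints are stored in standard T5 format on the Hub, they can also be loaded through the Auto classes, as this small sketch shows (assuming network access to the Hub).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# The Auto classes resolve to the appropriate tokenizer/model classes for CodeT5
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-base")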
README
CodeT5 and CodeT5+
Official research release for CodeT5 and CodeT5+ models for Code Understanding and Generation from Salesforce Research, which are introduced by the following papers:
Title: CodeT5+: Open Code Large Language Models for Code Understanding and Generation
Authors: Yue Wang*, Hung Le*, Akhilesh Deepak Gotmare, Nghi D.Q. Bui, Junnan Li, Steven C.H. Hoi (* indicates equal contribution)
Title: CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
Authors: Yue Wang, Weishi Wang, Shafiq Joty, Steven C.H. Hoi
In practice, CodeT5 and CodeT5+ models can be deployed as an AI-powered coding assistant to boost the productivity of software developers. At Salesforce, we build an AI coding assistant demo using CodeT5 as a VS Code plugin to provide three capabilities:
- Text-to-code generation: generate code based on a natural language description.
- Code autocompletion: complete a whole function given the target function name (a rough sketch of this capability follows below).
- Code summarization: generate a natural language summary of a function.
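As a rough illustration of the autocompletion capability, the snippet below asks a CodeT5 checkpoint to continue a function given its signature. This is only a sketch of the idea using the publicly released Salesforce/codet5-large-ntp-py checkpoint (mentioned in What's New below); it is not the code behind the VS Code plugin demo, and the prompt is a made-up example.
from transformers import AutoTokenizer, T5ForConditionalGeneration

# codet5-large-ntp-py (released with the CodeRL work) targets Python generation
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-large-ntp-py")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-large-ntp-py")

prompt = "def fibonacci(n):"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_length=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))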
What's New: 🔥
May 2023
CodeT5+ paper and models are released! 🔥
paper | code | model | blog
Sep 2022
Our CodeRL paper has been accepted to NeurIPS 2022!
paper | code | blog
July 2022
We release two large-sized CodeT5 checkpoints at HuggingFace: Salesforce/codet5-large and Salesforce/codet5-large-ntp-py, which are introduced by the CodeRL paper.
Oct 2021
We release fine-tuned checkpoints for all the downstream tasks covered in the paper. Besides, we release a CodeT5-base fine-tuned checkpoint (Salesforce/codet5-base-multi-sum) for multilingual code summarization (a usage sketch follows at the end of this section).
Sep 2021
CodeT5 paper accepted to EMNLP 2021 and models are released!
paper | code | model | model card | blog
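For code summarization specifically, the multilingual checkpoint mentioned above can be used out of the box. A minimal sketch, where the example function is a placeholder:
from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base-multi-sum")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base-multi-sum")

code = "def greet(name):\n    return 'Hello, ' + name"
input_ids = tokenizer(code, return_tensors="pt").input_ids
summary_ids = model.generate(input_ids, max_length=30)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))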
Citation
If you find this code to be useful for your research, please consider citing:
@inproceedings{wang2021codet5,
  title={CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation},
  author={Wang, Yue and Wang, Weishi and Joty, Shafiq and Hoi, Steven C.H.},
  booktitle={EMNLP},
  year={2021},
}

@inproceedings{le2022coderl,
  title={CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
  author={Le, Hung and Wang, Yue and Gotmare, Akhilesh Deepak and Savarese, Silvio and Hoi, Steven C. H.},
  booktitle={NeurIPS},
  year={2022}
}

@article{wang2023codet5plus,
  title={CodeT5+: Open Code Large Language Models for Code Understanding and Generation},
  author={Wang, Yue and Le, Hung and Gotmare, Akhilesh Deepak and Bui, Nghi D.Q. and Li, Junnan and Hoi, Steven C. H.},
  journal={arXiv preprint},
  year={2023}
}
License
The code is released under the BSD-3 License (see LICENSE.txt for details), but we also ask that users respect the following:
This software should not be used to promote or profit from:
- violence, hate, and division,
- environmental destruction,
- abuse of human rights, or
- the destruction of people's physical and mental health.
We encourage users of this software to tell us about the applications in which they are putting it to use by emailing codeT5@salesforce.com, and to use appropriate documentation when developing high-stakes applications of this model.
Get Involved
Please create a GitHub issue if you have any questions, suggestions, requests or bug-reports. We welcome PRs!