Top Related Projects
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Home of CodeT5: Open Code LLMs for Code Understanding and Generation
CodeXGLUE
Google Research
Quick Overview
CodeBERT is a pre-trained language model for programming languages developed by Microsoft. It is designed to understand and generate code across multiple programming languages, making it useful for various code-related tasks such as code search, code completion, and code translation.
Pros
- Supports multiple programming languages, including Python, Java, JavaScript, PHP, Ruby, and Go
- Pre-trained on a large corpus of code, allowing for transfer learning to various downstream tasks
- Achieves state-of-the-art performance on many code-related benchmarks
- Can be fine-tuned for specific tasks or domains
Cons
- Requires significant computational resources for training and fine-tuning
- May struggle with understanding complex code structures or domain-specific implementations
- Limited to the programming languages it was trained on
- Potential privacy concerns when processing proprietary or sensitive code
Code Examples
- Loading CodeBERT for code classification:
from transformers import RobertaTokenizer, RobertaForSequenceClassification
import torch
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaForSequenceClassification.from_pretrained("microsoft/codebert-base")
code = "def hello_world():\n print('Hello, World!')"
inputs = tokenizer(code, return_tensors="pt")
outputs = model(**inputs)
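Note that the base checkpoint does not ship a trained classification head, so the head above is randomly initialized and its scores are only meaningful after fine-tuning; a minimal sketch of turning the logits into class probabilities:
probs = torch.softmax(outputs.logits, dim=-1)   # shape [1, num_labels]; arbitrary until fine-tuned
predicted_class = probs.argmax(dim=-1).item()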
- Using CodeBERT for code search:
from transformers import RobertaTokenizer, RobertaModel
import torch
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
query = "How to read a file in Python"
code_snippet = "with open('file.txt', 'r') as f:\n content = f.read()"
# Encode the query and the code snippet separately, then compare their <s> (CLS) embeddings
query_emb = model(**tokenizer(query, return_tensors="pt")).last_hidden_state[:, 0]
code_emb = model(**tokenizer(code_snippet, return_tensors="pt")).last_hidden_state[:, 0]
similarity = torch.nn.functional.cosine_similarity(query_emb, code_emb)
- Generating a code completion with CodeBERT (note: the base checkpoint is not trained for causal generation, so output quality is limited without fine-tuning):
from transformers import RobertaTokenizer, RobertaForCausalLM
import torch
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
# is_decoder=True is required to run the model autoregressively
model = RobertaForCausalLM.from_pretrained("microsoft/codebert-base", is_decoder=True)
code_prefix = "def factorial(n):\n if n == 0:\n return 1\n else:\n return"
inputs = tokenizer(code_prefix, return_tensors="pt")
outputs = model.generate(**inputs, max_length=100, do_sample=True, temperature=0.7)
completed_code = tokenizer.decode(outputs[0], skip_special_tokens=True)
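Since the base CodeBERT checkpoint was not pre-trained with a causal objective, a more dependable completion-style use is single-token mask filling with the MLM variant (a minimal sketch; see also the Probing section of the README below):
from transformers import pipeline
fill_mask = pipeline("fill-mask", model="microsoft/codebert-base-mlm")
print(fill_mask("def factorial(n):\n    if n == <mask>:\n        return 1")[0]["token_str"])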
Getting Started
To get started with CodeBERT, follow these steps:
- Install the required libraries:
pip install transformers torch
- Import the necessary modules and load the pre-trained model:
from transformers import RobertaTokenizer, RobertaModel
import torch
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
- Prepare your input and run it through the model:
code = "print('Hello, World!')"
inputs = tokenizer(code, return_tensors="pt")
outputs = model(**inputs)
- Process the outputs based on your specific task (e.g., classification, code search, or generation); a common starting point is sketched below.
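For embedding-style tasks, a minimal sketch (not specific to any one downstream task) is to take the hidden state at the first token as a pooled representation of the input, continuing from the snippet above:
code_embedding = outputs.last_hidden_state[:, 0]   # <s> (CLS) position, shape [1, 768]
print(code_embedding.shape)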
Competitor Comparisons
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Pros of Transformers
- Broader scope: Supports a wide range of NLP tasks and models beyond code-related tasks
- Larger community: More contributors and frequent updates
- Extensive documentation and examples for various use cases
Cons of Transformers
- Less specialized for code-related tasks
- Potentially more complex to use for specific code understanding tasks
- Larger library size, which may impact performance in some scenarios
Code Comparison
CodeBERT:
# The CodeBERT repo does not ship its own Python package; its README loads the checkpoint via transformers' RoBERTa classes
from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
Transformers:
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("microsoft/codebert-base")
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
Both repositories provide pre-trained models for code understanding tasks. CodeBERT focuses specifically on code-related tasks, while Transformers offers a more general-purpose library for various NLP tasks. CodeBERT may be more suitable for specialized code understanding projects, while Transformers provides greater flexibility and a wider range of models for diverse NLP applications.
Home of CodeT5: Open Code LLMs for Code Understanding and Generation
Pros of CodeT5
- Supports both understanding and generation tasks, while CodeBERT focuses primarily on understanding
- Utilizes a unified encoder-decoder architecture, potentially offering more flexibility in task adaptation
- Incorporates task-specific prefixes, allowing for better multi-task learning and transfer
Cons of CodeT5
- May require more computational resources due to its larger model size and encoder-decoder architecture
- Could have a steeper learning curve for implementation compared to CodeBERT's simpler structure
- Potentially slower inference time for certain tasks due to the decoder component
Code Comparison
CodeT5 example:
from transformers import RobertaTokenizer, T5ForConditionalGeneration
tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")
CodeBERT example:
from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
The main difference in usage is that CodeT5 uses a T5-based model, while CodeBERT uses a RoBERTa-based model, reflecting their architectural differences.
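To illustrate the generation side that the encoder-decoder design enables, here is a minimal sketch using the Salesforce/codet5-base-multi-sum summarization checkpoint (the checkpoint choice is an assumption; any CodeT5 seq2seq checkpoint follows the same pattern):
from transformers import RobertaTokenizer, T5ForConditionalGeneration
tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base-multi-sum")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base-multi-sum")
code = "def greet(name):\n    return 'Hello, ' + name"
input_ids = tokenizer(code, return_tensors="pt").input_ids
summary_ids = model.generate(input_ids, max_length=20)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))  # a short natural-language summary of the function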
CodeXGLUE
Pros of CodeXGLUE
- Offers a comprehensive benchmark suite for code intelligence tasks
- Provides a wider range of datasets and tasks compared to CodeBERT
- Includes evaluation metrics and leaderboards for easier comparison of models
Cons of CodeXGLUE
- Primarily focused on benchmarking rather than providing a pre-trained model
- May require more setup and configuration for specific tasks
- Less suitable for direct use in code-related applications
Code Comparison
CodeBERT:
from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
CodeXGLUE:
from datasets import load_dataset
dataset = load_dataset("code_x_glue_ct_code_to_text", "python")
CodeBERT is a pre-trained model for programming language understanding, while CodeXGLUE is a benchmark collection for code intelligence. CodeBERT provides a ready-to-use model for various code-related tasks, making it easier to integrate into projects. CodeXGLUE, on the other hand, offers a standardized way to evaluate and compare different models across multiple code intelligence tasks, which is valuable for researchers and developers working on improving code understanding and generation models.
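Continuing the CodeXGLUE snippet above, a quick way to see what a benchmark sample looks like before wiring up a model (a minimal sketch; field names vary by task, so inspect the keys rather than assuming them):
from datasets import load_dataset
dataset = load_dataset("code_x_glue_ct_code_to_text", "python", split="test")
example = dataset[0]
print(example.keys())   # inspect the available fields for this task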
Google Research
Pros of google-research
- Broader scope, covering various AI/ML research areas beyond just code-related tasks
- More frequent updates and contributions from a larger research community
- Includes implementations of cutting-edge algorithms and techniques across multiple domains
Cons of google-research
- Less focused on code-specific tasks and models compared to CodeBERT
- May be more challenging to navigate and find specific code-related resources
- Potentially steeper learning curve due to the diverse range of topics covered
Code Comparison
CodeBERT:
from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
google-research:
import tensorflow as tf
from bert import modeling
bert_config = modeling.BertConfig.from_json_file("bert_config.json")
model = modeling.BertModel(config=bert_config, is_training=False)
While CodeBERT provides a more streamlined approach for code-related tasks, google-research offers a wider range of models and implementations across various AI/ML domains. CodeBERT's focus on code understanding makes it more suitable for specific programming-related applications, whereas google-research caters to a broader audience of researchers and practitioners in the field of artificial intelligence and machine learning.
Code Pretraining Models
This repo contains code pretraining models in the CodeBERT series from Microsoft, including six models as of June 2023.
- CodeBERT (EMNLP 2020)
- GraphCodeBERT (ICLR 2021)
- UniXcoder (ACL 2022)
- CodeReviewer (ESEC/FSE 2022)
- CodeExecutor (ACL 2023)
- LongCoder (ICML 2023)
CodeBERT
This repo provides the code for reproducing the experiments in CodeBERT: A Pre-Trained Model for Programming and Natural Languages. CodeBERT is a multi-programming-lingual model pre-trained on NL-PL pairs in six programming languages (Python, Java, JavaScript, PHP, Ruby, Go).
Dependency
- pip install torch
- pip install transformers
Quick Tour
We use the huggingface/transformers framework to train the model. You can use our model in the same way as the pre-trained RoBERTa base model. Below, we give an example of how to load the model.
import torch
from transformers import RobertaTokenizer, RobertaConfig, RobertaModel
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
model.to(device)
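As a quick check that the model loaded correctly, you can run a forward pass on the chosen device (a minimal sketch; the input string is arbitrary):
code = "def add(a, b): return a + b"
inputs = tokenizer(code, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # [1, sequence_length, 768]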
NL-PL Embeddings
Here, we give an example of how to obtain embeddings from CodeBERT.
>>> from transformers import AutoTokenizer, AutoModel
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
>>> model = AutoModel.from_pretrained("microsoft/codebert-base")
>>> nl_tokens=tokenizer.tokenize("return maximum value")
['return', 'Ġmaximum', 'Ġvalue']
>>> code_tokens=tokenizer.tokenize("def max(a,b): if a>b: return a else return b")
['def', 'Ġmax', '(', 'a', ',', 'b', '):', 'Ġif', 'Ġa', '>', 'b', ':', 'Ġreturn', 'Ġa', 'Ġelse', 'Ġreturn', 'Ġb']
>>> tokens=[tokenizer.cls_token]+nl_tokens+[tokenizer.sep_token]+code_tokens+[tokenizer.eos_token]
['<s>', 'return', 'Ġmaximum', 'Ġvalue', '</s>', 'def', 'Ġmax', '(', 'a', ',', 'b', '):', 'Ġif', 'Ġa', '>', 'b', ':', 'Ġreturn', 'Ġa', 'Ġelse', 'Ġreturn', 'Ġb', '</s>']
>>> tokens_ids=tokenizer.convert_tokens_to_ids(tokens)
[0, 30921, 4532, 923, 2, 9232, 19220, 1640, 102, 6, 428, 3256, 114, 10, 15698, 428, 35, 671, 10, 1493, 671, 741, 2]
>>> context_embeddings=model(torch.tensor(tokens_ids)[None,:])[0]
torch.Size([1, 23, 768])
tensor([[-0.1423, 0.3766, 0.0443, ..., -0.2513, -0.3099, 0.3183],
[-0.5739, 0.1333, 0.2314, ..., -0.1240, -0.1219, 0.2033],
[-0.1579, 0.1335, 0.0291, ..., 0.2340, -0.8801, 0.6216],
...,
[-0.4042, 0.2284, 0.5241, ..., -0.2046, -0.2419, 0.7031],
[-0.3894, 0.4603, 0.4797, ..., -0.3335, -0.6049, 0.4730],
[-0.1433, 0.3785, 0.0450, ..., -0.2527, -0.3121, 0.3207]],
grad_fn=<SelectBackward>)
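A common follow-up, not shown above, is to take the vector at the <s> (CLS) position as a pooled NL-PL representation; continuing from context_embeddings:
>>> cls_embedding = context_embeddings[0, 0]   # vector at the <s> position
>>> cls_embedding.shape
torch.Size([768])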
Probing
As stated in the paper, CodeBERT is not suitable for the mask prediction task, while CodeBERT (MLM), which is pre-trained with the masked language modeling objective, is.
We give an example of how to use CodeBERT (MLM) for the mask prediction task.
from transformers import RobertaTokenizer, RobertaForMaskedLM, pipeline
model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base-mlm")
CODE = "if (x is not None) <mask> (x>1)"
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
outputs = fill_mask(CODE)
print(outputs)
Results
'and', 'or', 'if', 'then', 'AND'
The detailed outputs are as follows:
{'sequence': '<s> if (x is not None) and (x>1)</s>', 'score': 0.6049249172210693, 'token': 8}
{'sequence': '<s> if (x is not None) or (x>1)</s>', 'score': 0.30680200457572937, 'token': 50}
{'sequence': '<s> if (x is not None) if (x>1)</s>', 'score': 0.02133703976869583, 'token': 114}
{'sequence': '<s> if (x is not None) then (x>1)</s>', 'score': 0.018607674166560173, 'token': 172}
{'sequence': '<s> if (x is not None) AND (x>1)</s>', 'score': 0.007619690150022507, 'token': 4248}
Downstream Tasks
For Code Search and Code Documentation Generation tasks, please refer to the CodeBERT folder.
GraphCodeBERT
This repo also provides the code for reproducing the experiments in GraphCodeBERT: Pre-training Code Representations with Data Flow. GraphCodeBERT is a pre-trained model for programming languages that considers the inherent structure of code, i.e. data flow; it is a multi-programming-lingual model pre-trained on NL-PL pairs in six programming languages (Python, Java, JavaScript, PHP, Ruby, Go).
For downstream tasks like code search, clone detection, code refinement and code translation, please refer to the GraphCodeBERT folder.
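For a quick start, the released checkpoint can be loaded directly from the Hugging Face hub (a minimal sketch; the data-flow preprocessing used in the GraphCodeBERT folder is omitted here):
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
model = AutoModel.from_pretrained("microsoft/graphcodebert-base")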
UniXcoder
This repo provides the code for reproducing the experiments in UniXcoder: Unified Cross-Modal Pre-training for Code Representation. UniXcoder is a unified cross-modal pre-trained model for programming languages that supports both code-related understanding and generation tasks.
Please refer to the UniXcoder folder for tutorials and downstream tasks.
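A minimal loading sketch (the microsoft/unixcoder-base checkpoint name is taken from the Hugging Face hub; the UniXcoder folder provides its own utilities for the encoder-only, decoder-only, and encoder-decoder modes, which are not shown here):
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
model = AutoModel.from_pretrained("microsoft/unixcoder-base")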
CodeReviewer
This repo also provides the code for reproducing the experiments in CodeReviewer: Pre-Training for Automating Code Review Activities. CodeReviewer is a model pre-trained with code change and code review data to support code review tasks.
Please refer to the CodeReviewer folder for tutorials and downstream tasks.
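A minimal loading sketch, assuming the public microsoft/codereviewer checkpoint on the Hugging Face hub (CodeReviewer is T5-based, so a seq2seq class is used; the CodeReviewer folder's own scripts may wrap this differently):
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("microsoft/codereviewer")
model = AutoModelForSeq2SeqLM.from_pretrained("microsoft/codereviewer")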
CodeExecutor
This repo provides the code for reproducing the experiments in Code Execution with Pre-trained Language Models. CodeExecutor is a pre-trained model that learns to predict the execution traces using a code execution pre-training task and curriculum learning.
Please refer to the CodeExecutor folder for details.
LongCoder
This repo provides the code for reproducing the experiments on the LCC datasets in LongCoder: A Long-Range Pre-trained Language Model for Code Completion. LongCoder is a sparse and efficient pre-trained Transformer model for long code modeling.
Please refer to the LongCoder folder for details.
Contact
Feel free to contact Daya Guo (guody5@mail2.sysu.edu.cn), Shuai Lu (shuailu@microsoft.com) and Nan Duan (nanduan@microsoft.com) if you have any further questions.
Contributing
We appreciate all contributions and thank all the contributors!