CodeBERT

Top Related Projects

  • 🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX
  • CodeT5: Open Code LLMs for Code Understanding and Generation
  • CodeXGLUE: a benchmark suite for code intelligence tasks
  • Google Research (google-research)

Quick Overview

CodeBERT is a pre-trained language model for programming languages developed by Microsoft. It is designed to understand and generate code across multiple programming languages, making it useful for various code-related tasks such as code search, code completion, and code translation.

Pros

  • Supports multiple programming languages, including Python, Java, JavaScript, PHP, Ruby, and Go
  • Pre-trained on a large corpus of code, allowing for transfer learning to various downstream tasks
  • Achieves state-of-the-art performance on many code-related benchmarks
  • Can be fine-tuned for specific tasks or domains

Cons

  • Requires significant computational resources for training and fine-tuning
  • May struggle with understanding complex code structures or domain-specific implementations
  • Limited to the programming languages it was trained on
  • Potential privacy concerns when processing proprietary or sensitive code

Code Examples

  1. Loading CodeBERT for code classification:
from transformers import RobertaTokenizer, RobertaForSequenceClassification
import torch

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
# The classification head on top of the encoder is newly initialized and
# must be fine-tuned on labeled data before its predictions are meaningful.
model = RobertaForSequenceClassification.from_pretrained("microsoft/codebert-base")

code = "def hello_world():\n    print('Hello, World!')"
inputs = tokenizer(code, return_tensors="pt")
outputs = model(**inputs)  # outputs.logits holds the raw classification scores
  2. Using CodeBERT for code search:
from transformers import RobertaTokenizer, RobertaModel
import torch

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")

query = "How to read a file in Python"
code_snippet = "with open('file.txt', 'r') as f:\n    content = f.read()"

# Encode the query and the code snippet separately and compare their first-token embeddings.
query_emb = model(**tokenizer(query, return_tensors="pt")).last_hidden_state[:, 0, :]
code_emb = model(**tokenizer(code_snippet, return_tensors="pt")).last_hidden_state[:, 0, :]
similarity = torch.nn.functional.cosine_similarity(query_emb, code_emb)
print(similarity.item())
  3. Completing a masked token in code with CodeBERT (MLM):
from transformers import RobertaTokenizer, RobertaForMaskedLM, pipeline

# CodeBERT is an encoder and was not pre-trained for left-to-right generation,
# so the MLM checkpoint is used here to fill in a masked token
# (see the Probing section of the README below).
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base-mlm")
model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
code_with_mask = "def factorial(n):\n    if n == 0:\n        <mask> 1"
print(fill_mask(code_with_mask))
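
The examples above only run inference. Since CodeBERT is typically adapted to a downstream task by fine-tuning, here is a minimal, illustrative sketch of fine-tuning the classification head with the transformers Trainer API; the two-snippet dataset, labels, output directory, and hyperparameters are placeholders for illustration, not part of the CodeBERT repo:

from transformers import RobertaTokenizer, RobertaForSequenceClassification, Trainer, TrainingArguments
import torch

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaForSequenceClassification.from_pretrained("microsoft/codebert-base", num_labels=2)

# Toy dataset: does the snippet read a file? (snippets and labels are placeholders)
snippets = ["with open('f.txt') as f: f.read()", "print('hello')"]
labels = [1, 0]
encodings = tokenizer(snippets, truncation=True, padding=True)

class ToyDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in encodings.items()}
        item["labels"] = torch.tensor(labels[idx])
        return item

training_args = TrainingArguments(output_dir="codebert-finetuned", num_train_epochs=1, per_device_train_batch_size=2)
trainer = Trainer(model=model, args=training_args, train_dataset=ToyDataset())
trainer.train()

The same pattern scales to real datasets; the repo's CodeBERT folder contains the full code search and documentation generation pipelines.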

Getting Started

To get started with CodeBERT, follow these steps:

  1. Install the required libraries:
pip install transformers torch
  2. Import the necessary modules and load the pre-trained model:
from transformers import RobertaTokenizer, RobertaModel
import torch

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
  3. Prepare your input and run it through the model:
code = "print('Hello, World!')"
inputs = tokenizer(code, return_tensors="pt")
outputs = model(**inputs)
  4. Process the outputs based on your specific task (e.g., classification, code search, or generation); one pooling sketch follows below.
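
As one concrete instance of step 4, a single vector per input can be obtained by pooling the last hidden state, for example by taking the first ([CLS]) token's embedding or by mean-pooling over tokens; the pooling strategy below is a common convention rather than something CodeBERT prescribes:

# Continues from step 3: reduce the (batch, seq_len, 768) hidden states to one vector per input.
cls_embedding = outputs.last_hidden_state[:, 0, :]      # first-token ([CLS]/<s>) pooling
mean_embedding = outputs.last_hidden_state.mean(dim=1)  # mean pooling over tokens
print(cls_embedding.shape, mean_embedding.shape)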

Competitor Comparisons

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

Pros of Transformers

  • Broader scope: Supports a wide range of NLP tasks and models beyond code-related tasks
  • Larger community: More contributors and frequent updates
  • Extensive documentation and examples for various use cases

Cons of Transformers

  • Less specialized for code-related tasks
  • Potentially more complex to use for specific code understanding tasks
  • Larger library size, which may impact performance in some scenarios

Code Comparison

CodeBERT:

from transformers import RobertaTokenizer, RobertaModel

# CodeBERT ships as a checkpoint loaded through transformers; there is no standalone "codebert" package.
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")

Transformers:

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("microsoft/codebert-base")
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

Both repositories provide pre-trained models for code understanding tasks. CodeBERT focuses specifically on code-related tasks, while Transformers offers a more general-purpose library for various NLP tasks. CodeBERT may be more suitable for specialized code understanding projects, while Transformers provides greater flexibility and a wider range of models for diverse NLP applications.

Home of CodeT5: Open Code LLMs for Code Understanding and Generation

Pros of CodeT5

  • Supports both understanding and generation tasks, while CodeBERT focuses primarily on understanding
  • Utilizes a unified encoder-decoder architecture, potentially offering more flexibility in task adaptation
  • Incorporates task-specific prefixes, allowing for better multi-task learning and transfer

Cons of CodeT5

  • May require more computational resources due to its larger model size and encoder-decoder architecture
  • Could have a steeper learning curve for implementation compared to CodeBERT's simpler structure
  • Potentially slower inference time for certain tasks due to the decoder component

Code Comparison

CodeT5 example:

from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

CodeBERT example:

from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")

The main difference in usage is that CodeT5 uses a T5-based model, while CodeBERT uses a RoBERTa-based model, reflecting their architectural differences.
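
To make the architectural difference concrete, here is a minimal sketch that runs the CodeT5 encoder-decoder through generate(); with the base (not task-fine-tuned) checkpoint the generated text is not expected to be a meaningful summary, the point is only to show the seq2seq interface that a plain RoBERTa encoder like CodeBERT does not expose:

from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, return_tensors="pt")
# Encoder-decoder generation; RobertaModel (CodeBERT) has no decoder to generate from.
generated_ids = model.generate(**inputs, max_length=20)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))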

CodeXGLUE

Pros of CodeXGLUE

  • Offers a comprehensive benchmark suite for code intelligence tasks
  • Provides a wider range of datasets and tasks compared to CodeBERT
  • Includes evaluation metrics and leaderboards for easier comparison of models

Cons of CodeXGLUE

  • Primarily focused on benchmarking rather than providing a pre-trained model
  • May require more setup and configuration for specific tasks
  • Less suitable for direct use in code-related applications

Code Comparison

CodeBERT:

from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")

CodeXGLUE:

from datasets import load_dataset
dataset = load_dataset("code_x_glue_ct_code_to_text", "python")

CodeBERT is a pre-trained model for programming language understanding, while CodeXGLUE is a benchmark collection for code intelligence. CodeBERT provides a ready-to-use model for various code-related tasks, making it easier to integrate into projects. CodeXGLUE, on the other hand, offers a standardized way to evaluate and compare different models across multiple code intelligence tasks, which is valuable for researchers and developers working on improving code understanding and generation models.
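
The two are complementary: an example from a CodeXGLUE dataset can be fed directly through CodeBERT. A minimal sketch, assuming the code_x_glue_ct_code_to_text examples expose a "code" field (the field name is an assumption here; check the dataset card):

from datasets import load_dataset
from transformers import RobertaTokenizer, RobertaModel

dataset = load_dataset("code_x_glue_ct_code_to_text", "python", split="validation[:1]")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")

# "code" is assumed to be the source-code field of each example.
inputs = tokenizer(dataset[0]["code"], return_tensors="pt", truncation=True, max_length=512)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)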

Google Research

Pros of google-research

  • Broader scope, covering various AI/ML research areas beyond just code-related tasks
  • More frequent updates and contributions from a larger research community
  • Includes implementations of cutting-edge algorithms and techniques across multiple domains

Cons of google-research

  • Less focused on code-specific tasks and models compared to CodeBERT
  • May be more challenging to navigate and find specific code-related resources
  • Potentially steeper learning curve due to the diverse range of topics covered

Code Comparison

CodeBERT:

from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")

google-research:

import tensorflow as tf
from bert import modeling  # TF1-era BERT implementation from the google-research/bert codebase

bert_config = modeling.BertConfig.from_json_file("bert_config.json")
input_ids = tf.placeholder(tf.int32, shape=[None, 128])  # BertModel requires token ids at construction time
model = modeling.BertModel(config=bert_config, is_training=False, input_ids=input_ids)

While CodeBERT provides a more streamlined approach for code-related tasks, google-research offers a wider range of models and implementations across various AI/ML domains. CodeBERT's focus on code understanding makes it more suitable for specific programming-related applications, whereas google-research caters to a broader audience of researchers and practitioners in the field of artificial intelligence and machine learning.

README

Code Pretraining Models

This repo contains code pretraining models in the CodeBERT series from Microsoft, including six models as of June 2023.

  • CodeBERT (EMNLP 2020)
  • GraphCodeBERT (ICLR 2021)
  • UniXcoder (ACL 2022)
  • CodeReviewer (ESEC/FSE 2022)
  • CodeExecutor (ACL 2023)
  • LongCoder (ICML 2023)

CodeBERT

This repo provides the code for reproducing the experiments in CodeBERT: A Pre-Trained Model for Programming and Natural Languages. CodeBERT is a pre-trained model for programming and natural languages, pre-trained on NL-PL pairs across six programming languages (Python, Java, JavaScript, PHP, Ruby, Go).

Dependency

  • pip install torch
  • pip install transformers

Quick Tour

We use the huggingface/transformers framework to train the model. You can use our model like a pre-trained RoBERTa base model. Here is an example of how to load the model.

import torch
from transformers import RobertaTokenizer, RobertaConfig, RobertaModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
model.to(device)

NL-PL Embeddings

Here is an example of obtaining embeddings from CodeBERT.

>>> from transformers import AutoTokenizer, AutoModel
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
>>> model = AutoModel.from_pretrained("microsoft/codebert-base")
>>> nl_tokens=tokenizer.tokenize("return maximum value")
['return', 'Ġmaximum', 'Ġvalue']
>>> code_tokens=tokenizer.tokenize("def max(a,b): if a>b: return a else return b")
['def', 'Ġmax', '(', 'a', ',', 'b', '):', 'Ġif', 'Ġa', '>', 'b', ':', 'Ġreturn', 'Ġa', 'Ġelse', 'Ġreturn', 'Ġb']
>>> tokens=[tokenizer.cls_token]+nl_tokens+[tokenizer.sep_token]+code_tokens+[tokenizer.eos_token]
['<s>', 'return', 'Ġmaximum', 'Ġvalue', '</s>', 'def', 'Ġmax', '(', 'a', ',', 'b', '):', 'Ġif', 'Ġa', '>', 'b', ':', 'Ġreturn', 'Ġa', 'Ġelse', 'Ġreturn', 'Ġb', '</s>']
>>> tokens_ids=tokenizer.convert_tokens_to_ids(tokens)
[0, 30921, 4532, 923, 2, 9232, 19220, 1640, 102, 6, 428, 3256, 114, 10, 15698, 428, 35, 671, 10, 1493, 671, 741, 2]
>>> context_embeddings=model(torch.tensor(tokens_ids)[None,:])[0]
torch.Size([1, 23, 768])
tensor([[-0.1423,  0.3766,  0.0443,  ..., -0.2513, -0.3099,  0.3183],
        [-0.5739,  0.1333,  0.2314,  ..., -0.1240, -0.1219,  0.2033],
        [-0.1579,  0.1335,  0.0291,  ...,  0.2340, -0.8801,  0.6216],
        ...,
        [-0.4042,  0.2284,  0.5241,  ..., -0.2046, -0.2419,  0.7031],
        [-0.3894,  0.4603,  0.4797,  ..., -0.3335, -0.6049,  0.4730],
        [-0.1433,  0.3785,  0.0450,  ..., -0.2527, -0.3121,  0.3207]],
       grad_fn=<SelectBackward>)
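
The per-token embeddings above can be reduced to a single vector per input when, for example, a query-to-code similarity score is needed. The sketch below is one simple convention (first-token pooling plus cosine similarity) reusing the model and tokenizer loaded above; the repo's CodeBERT folder contains models fine-tuned specifically for code search.

import torch

def embed(text):
    # Helper defined here for illustration: encode a string and take the first token's hidden state.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        return model(**inputs).last_hidden_state[:, 0, :]

query_vec = embed("return maximum value")
code_vec = embed("def max(a,b): if a>b: return a else return b")
print(torch.nn.functional.cosine_similarity(query_vec, code_vec).item())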

Probing

As stated in the paper, CodeBERT is not suitable for the mask prediction task, while CodeBERT (MLM) is.

Here is an example of how to use CodeBERT (MLM) for mask prediction.

from transformers import RobertaConfig, RobertaTokenizer, RobertaForMaskedLM, pipeline

model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base-mlm")

CODE = "if (x is not None) <mask> (x>1)"
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

outputs = fill_mask(CODE)
print(outputs)

Results

'and', 'or', 'if', 'then', 'AND'

The detailed outputs are as follows:

{'sequence': '<s> if (x is not None) and (x>1)</s>', 'score': 0.6049249172210693, 'token': 8}
{'sequence': '<s> if (x is not None) or (x>1)</s>', 'score': 0.30680200457572937, 'token': 50}
{'sequence': '<s> if (x is not None) if (x>1)</s>', 'score': 0.02133703976869583, 'token': 114}
{'sequence': '<s> if (x is not None) then (x>1)</s>', 'score': 0.018607674166560173, 'token': 172}
{'sequence': '<s> if (x is not None) AND (x>1)</s>', 'score': 0.007619690150022507, 'token': 4248}

Downstream Tasks

For Code Search and Code Documentation Generation tasks, please refer to the CodeBERT folder.

GraphCodeBERT

This repo also provides the code for reproducing the experiments in GraphCodeBERT: Pre-training Code Representations with Data Flow. GraphCodeBERT is a pre-trained model for programming languages that considers the inherent structure of code, i.e., data flow; like CodeBERT, it is pre-trained on NL-PL pairs across six programming languages (Python, Java, JavaScript, PHP, Ruby, Go).

For downstream tasks like code search, clone detection, code refinement and code translation, please refer to the GraphCodeBERT folder.
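
For quick experimentation, the GraphCodeBERT checkpoint can be loaded through the same transformers interface as CodeBERT; this sketch assumes the microsoft/graphcodebert-base model id on the Hugging Face Hub and skips the data-flow extraction used in the paper (see the GraphCodeBERT folder for that):

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
model = AutoModel.from_pretrained("microsoft/graphcodebert-base")

inputs = tokenizer("def max(a, b): return a if a > b else b", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)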

UniXcoder

This repo will provide the code for reproducing the experiments in UniXcoder: Unified Cross-Modal Pre-training for Code Representation. UniXcoder is a unified cross-modal pre-trained model for programming languages to support both code-related understanding and generation tasks.

Please refer to the UniXcoder folder for tutorials and downstream tasks.

CodeReviewer

This repo also provides the code for reproducing the experiments in CodeReviewer: Pre-Training for Automating Code Review Activities. CodeReviewer is a model pre-trained with code change and code review data to support code review tasks.

Please refer to the CodeReviewer folder for tutorials and downstream tasks.

CodeExecutor

This repo provides the code for reproducing the experiments in Code Execution with Pre-trained Language Models. CodeExecutor is a pre-trained model that learns to predict the execution traces using a code execution pre-training task and curriculum learning.

Please refer to the CodeExecutor folder for details.

LongCoder

This repo will provide the code for reproducing the experiments on LCC datasets in LongCoder: A Long-Range Pre-trained Language Model for Code Completion. LongCoder is a sparse and efficient pre-trained Transformer model for long code modeling.

Please refer to the LongCoder folder for details.

Contact

Feel free to contact Daya Guo (guody5@mail2.sysu.edu.cn), Shuai Lu (shuailu@microsoft.com) and Nan Duan (nanduan@microsoft.com) if you have any further questions.

Contributing

We appreciate all contributions and thank all the contributors!