Top Related Projects
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Home of CodeT5: Open Code LLMs for Code Understanding and Generation
CodeXGLUE
Google Research
Quick Overview
CodeBERT is a pre-trained language model for programming languages developed by Microsoft. It is designed to understand and generate code across multiple programming languages, making it useful for various code-related tasks such as code search, code completion, and code translation.
Pros
- Supports multiple programming languages, including Python, Java, JavaScript, PHP, Ruby, and Go
- Pre-trained on a large corpus of code, allowing for transfer learning to various downstream tasks
- Achieves state-of-the-art performance on many code-related benchmarks
- Can be fine-tuned for specific tasks or domains
Cons
- Requires significant computational resources for training and fine-tuning
- May struggle with understanding complex code structures or domain-specific implementations
- Limited to the programming languages it was trained on
- Potential privacy concerns when processing proprietary or sensitive code
Code Examples
- Loading CodeBERT for code classification:
from transformers import RobertaTokenizer, RobertaForSequenceClassification
import torch
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaForSequenceClassification.from_pretrained("microsoft/codebert-base")
code = "def hello_world():\n print('Hello, World!')"
inputs = tokenizer(code, return_tensors="pt")
outputs = model(**inputs)
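Note that the base checkpoint does not ship a trained classification head, so the head above is randomly initialized and its scores are only meaningful after fine-tuning; a minimal sketch of turning the logits into class probabilities:
probs = torch.softmax(outputs.logits, dim=-1)   # shape [1, num_labels]; arbitrary until fine-tuned
predicted_class = probs.argmax(dim=-1).item()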
- Using CodeBERT for code search:
from transformers import RobertaTokenizer, RobertaModel
import torch
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
query = "How to read a file in Python"
code_snippet = "with open('file.txt', 'r') as f:\n content = f.read()"
# Encode the query and the code snippet separately, then compare their <s> (CLS) embeddings
query_emb = model(**tokenizer(query, return_tensors="pt")).last_hidden_state[:, 0]
code_emb = model(**tokenizer(code_snippet, return_tensors="pt")).last_hidden_state[:, 0]
similarity = torch.nn.functional.cosine_similarity(query_emb, code_emb)
- Generating a code completion with CodeBERT (note: the base checkpoint is not trained for causal generation, so output quality is limited without fine-tuning):
from transformers import RobertaTokenizer, RobertaForCausalLM
import torch
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
# is_decoder=True is required to run the model autoregressively
model = RobertaForCausalLM.from_pretrained("microsoft/codebert-base", is_decoder=True)
code_prefix = "def factorial(n):\n if n == 0:\n return 1\n else:\n return"
inputs = tokenizer(code_prefix, return_tensors="pt")
outputs = model.generate(**inputs, max_length=100, do_sample=True, temperature=0.7)
completed_code = tokenizer.decode(outputs[0], skip_special_tokens=True)
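Since the base CodeBERT checkpoint was not pre-trained with a causal objective, a more dependable completion-style use is single-token mask filling with the MLM variant (a minimal sketch; see also the Probing section of the README below):
from transformers import pipeline
fill_mask = pipeline("fill-mask", model="microsoft/codebert-base-mlm")
print(fill_mask("def factorial(n):\n    if n == <mask>:\n        return 1")[0]["token_str"])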
Getting Started
To get started with CodeBERT, follow these steps:
- Install the required libraries:
pip install transformers torch
- Import the necessary modules and load the pre-trained model:
from transformers import RobertaTokenizer, RobertaModel
import torch
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
- Prepare your input and run it through the model:
code = "print('Hello, World!')"
inputs = tokenizer(code, return_tensors="pt")
outputs = model(**inputs)
- Process the outputs based on your specific task (e.g., classification, code search, or generation); a common starting point is sketched below.
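For embedding-style tasks, a minimal sketch (not specific to any one downstream task) is to take the hidden state at the first token as a pooled representation of the input, continuing from the snippet above:
code_embedding = outputs.last_hidden_state[:, 0]   # <s> (CLS) position, shape [1, 768]
print(code_embedding.shape)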
Competitor Comparisons
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Pros of Transformers
- Broader scope: Supports a wide range of NLP tasks and models beyond code-related tasks
- Larger community: More contributors and frequent updates
- Extensive documentation and examples for various use cases
Cons of Transformers
- Less specialized for code-related tasks
- Potentially more complex to use for specific code understanding tasks
- Larger library size, which may impact performance in some scenarios
Code Comparison
CodeBERT:
# The CodeBERT repo does not ship its own Python package; its README loads the checkpoint via transformers' RoBERTa classes
from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
Transformers:
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("microsoft/codebert-base")
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
Both repositories provide pre-trained models for code understanding tasks. CodeBERT focuses specifically on code-related tasks, while Transformers offers a more general-purpose library for various NLP tasks. CodeBERT may be more suitable for specialized code understanding projects, while Transformers provides greater flexibility and a wider range of models for diverse NLP applications.
Home of CodeT5: Open Code LLMs for Code Understanding and Generation
Pros of CodeT5
- Supports both understanding and generation tasks, while CodeBERT focuses primarily on understanding
- Utilizes a unified encoder-decoder architecture, potentially offering more flexibility in task adaptation
- Incorporates task-specific prefixes, allowing for better multi-task learning and transfer
Cons of CodeT5
- May require more computational resources due to its larger model size and encoder-decoder architecture
- Could have a steeper learning curve for implementation compared to CodeBERT's simpler structure
- Potentially slower inference time for certain tasks due to the decoder component
Code Comparison
CodeT5 example:
from transformers import RobertaTokenizer, T5ForConditionalGeneration
tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")
CodeBERT example:
from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
The main difference in usage is that CodeT5 uses a T5-based model, while CodeBERT uses a RoBERTa-based model, reflecting their architectural differences.
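To illustrate the generation side that the encoder-decoder design enables, here is a minimal sketch using the Salesforce/codet5-base-multi-sum summarization checkpoint (the checkpoint choice is an assumption; any CodeT5 seq2seq checkpoint follows the same pattern):
from transformers import RobertaTokenizer, T5ForConditionalGeneration
tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base-multi-sum")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base-multi-sum")
code = "def greet(name):\n    return 'Hello, ' + name"
input_ids = tokenizer(code, return_tensors="pt").input_ids
summary_ids = model.generate(input_ids, max_length=20)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))  # a short natural-language summary of the function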
CodeXGLUE
Pros of CodeXGLUE
- Offers a comprehensive benchmark suite for code intelligence tasks
- Provides a wider range of datasets and tasks compared to CodeBERT
- Includes evaluation metrics and leaderboards for easier comparison of models
Cons of CodeXGLUE
- Primarily focused on benchmarking rather than providing a pre-trained model
- May require more setup and configuration for specific tasks
- Less suitable for direct use in code-related applications
Code Comparison
CodeBERT:
from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
CodeXGLUE:
from datasets import load_dataset
dataset = load_dataset("code_x_glue_ct_code_to_text", "python")
CodeBERT is a pre-trained model for programming language understanding, while CodeXGLUE is a benchmark collection for code intelligence. CodeBERT provides a ready-to-use model for various code-related tasks, making it easier to integrate into projects. CodeXGLUE, on the other hand, offers a standardized way to evaluate and compare different models across multiple code intelligence tasks, which is valuable for researchers and developers working on improving code understanding and generation models.
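Continuing the CodeXGLUE snippet above, a quick way to see what a benchmark sample looks like before wiring up a model (a minimal sketch; field names vary by task, so inspect the keys rather than assuming them):
from datasets import load_dataset
dataset = load_dataset("code_x_glue_ct_code_to_text", "python", split="test")
example = dataset[0]
print(example.keys())   # inspect the available fields for this task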
Google Research
Pros of google-research
- Broader scope, covering various AI/ML research areas beyond just code-related tasks
- More frequent updates and contributions from a larger research community
- Includes implementations of cutting-edge algorithms and techniques across multiple domains
Cons of google-research
- Less focused on code-specific tasks and models compared to CodeBERT
- May be more challenging to navigate and find specific code-related resources
- Potentially steeper learning curve due to the diverse range of topics covered
Code Comparison
CodeBERT:
from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
google-research:
import tensorflow as tf
from bert import modeling
bert_config = modeling.BertConfig.from_json_file("bert_config.json")
model = modeling.BertModel(config=bert_config, is_training=False)
While CodeBERT provides a more streamlined approach for code-related tasks, google-research offers a wider range of models and implementations across various AI/ML domains. CodeBERT's focus on code understanding makes it more suitable for specific programming-related applications, whereas google-research caters to a broader audience of researchers and practitioners in the field of artificial intelligence and machine learning.
Code Pretraining Models
This repo contains code pretraining models in the CodeBERT series from Microsoft, including six models as of June 2023.
- CodeBERT (EMNLP 2020)
- GraphCodeBERT (ICLR 2021)
- UniXcoder (ACL 2022)
- CodeReviewer (ESEC/FSE 2022)
- CodeExecutor (ACL 2023)
- LongCoder (ICML 2023)
CodeBERT
This repo provides the code for reproducing the experiments in CodeBERT: A Pre-Trained Model for Programming and Natural Languages. CodeBERT is a multi-programming-lingual model pre-trained on NL-PL pairs in six programming languages (Python, Java, JavaScript, PHP, Ruby, Go).
Dependency
- pip install torch
- pip install transformers
Quick Tour
We use the huggingface/transformers framework to train the model. You can use our model in the same way as the pre-trained RoBERTa base model. Below, we give an example of how to load the model.
import torch
from transformers import RobertaTokenizer, RobertaConfig, RobertaModel
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
model.to(device)
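As a quick check that the model loaded correctly, you can run a forward pass on the chosen device (a minimal sketch; the input string is arbitrary):
code = "def add(a, b): return a + b"
inputs = tokenizer(code, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # [1, sequence_length, 768]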
NL-PL Embeddings
Here, we give an example of how to obtain embeddings from CodeBERT.
>>> from transformers import AutoTokenizer, AutoModel
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
>>> model = AutoModel.from_pretrained("microsoft/codebert-base")
>>> nl_tokens=tokenizer.tokenize("return maximum value")
['return', 'Ġmaximum', 'Ġvalue']
>>> code_tokens=tokenizer.tokenize("def max(a,b): if a>b: return a else return b")
['def', 'Ġmax', '(', 'a', ',', 'b', '):', 'Ġif', 'Ġa', '>', 'b', ':', 'Ġreturn', 'Ġa', 'Ġelse', 'Ġreturn', 'Ġb']
>>> tokens=[tokenizer.cls_token]+nl_tokens+[tokenizer.sep_token]+code_tokens+[tokenizer.eos_token]
['<s>', 'return', 'Ġmaximum', 'Ġvalue', '</s>', 'def', 'Ġmax', '(', 'a', ',', 'b', '):', 'Ġif', 'Ġa', '>', 'b', ':', 'Ġreturn', 'Ġa', 'Ġelse', 'Ġreturn', 'Ġb', '</s>']
>>> tokens_ids=tokenizer.convert_tokens_to_ids(tokens)
[0, 30921, 4532, 923, 2, 9232, 19220, 1640, 102, 6, 428, 3256, 114, 10, 15698, 428, 35, 671, 10, 1493, 671, 741, 2]
>>> context_embeddings=model(torch.tensor(tokens_ids)[None,:])[0]
torch.Size([1, 23, 768])
tensor([[-0.1423, 0.3766, 0.0443, ..., -0.2513, -0.3099, 0.3183],
[-0.5739, 0.1333, 0.2314, ..., -0.1240, -0.1219, 0.2033],
[-0.1579, 0.1335, 0.0291, ..., 0.2340, -0.8801, 0.6216],
...,
[-0.4042, 0.2284, 0.5241, ..., -0.2046, -0.2419, 0.7031],
[-0.3894, 0.4603, 0.4797, ..., -0.3335, -0.6049, 0.4730],
[-0.1433, 0.3785, 0.0450, ..., -0.2527, -0.3121, 0.3207]],
grad_fn=<SelectBackward>)
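A common follow-up, not shown above, is to take the vector at the <s> (CLS) position as a pooled NL-PL representation; continuing from context_embeddings:
>>> cls_embedding = context_embeddings[0, 0]   # vector at the <s> position
>>> cls_embedding.shape
torch.Size([768])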
Probing
As stated in the paper, CodeBERT is not suitable for the mask prediction task, while CodeBERT (MLM), which is pre-trained with the masked language modeling objective, is.
We give an example of how to use CodeBERT (MLM) for the mask prediction task.
from transformers import RobertaTokenizer, RobertaForMaskedLM, pipeline
model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base-mlm")
CODE = "if (x is not None) <mask> (x>1)"
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
outputs = fill_mask(CODE)
print(outputs)
Results
'and', 'or', 'if', 'then', 'AND'
The detailed outputs are as follows:
{'sequence': '<s> if (x is not None) and (x>1)</s>', 'score': 0.6049249172210693, 'token': 8}
{'sequence': '<s> if (x is not None) or (x>1)</s>', 'score': 0.30680200457572937, 'token': 50}
{'sequence': '<s> if (x is not None) if (x>1)</s>', 'score': 0.02133703976869583, 'token': 114}
{'sequence': '<s> if (x is not None) then (x>1)</s>', 'score': 0.018607674166560173, 'token': 172}
{'sequence': '<s> if (x is not None) AND (x>1)</s>', 'score': 0.007619690150022507, 'token': 4248}
Downstream Tasks
For Code Search and Code Documentation Generation tasks, please refer to the CodeBERT folder.
GraphCodeBERT
This repo also provides the code for reproducing the experiments in GraphCodeBERT: Pre-training Code Representations with Data Flow. GraphCodeBERT is a pre-trained model for programming languages that considers the inherent structure of code, i.e. data flow; it is a multi-programming-lingual model pre-trained on NL-PL pairs in six programming languages (Python, Java, JavaScript, PHP, Ruby, Go).
For downstream tasks like code search, clone detection, code refinement and code translation, please refer to the GraphCodeBERT folder.
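For a quick start, the released checkpoint can be loaded directly from the Hugging Face hub (a minimal sketch; the data-flow preprocessing used in the GraphCodeBERT folder is omitted here):
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
model = AutoModel.from_pretrained("microsoft/graphcodebert-base")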
UniXcoder
This repo provides the code for reproducing the experiments in UniXcoder: Unified Cross-Modal Pre-training for Code Representation. UniXcoder is a unified cross-modal pre-trained model for programming languages that supports both code-related understanding and generation tasks.
Please refer to the UniXcoder folder for tutorials and downstream tasks.
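A minimal loading sketch (the microsoft/unixcoder-base checkpoint name is taken from the Hugging Face hub; the UniXcoder folder provides its own utilities for the encoder-only, decoder-only, and encoder-decoder modes, which are not shown here):
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
model = AutoModel.from_pretrained("microsoft/unixcoder-base")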
CodeReviewer
This repo also provides the code for reproducing the experiments in CodeReviewer: Pre-Training for Automating Code Review Activities. CodeReviewer is a model pre-trained with code change and code review data to support code review tasks.
Please refer to the CodeReviewer folder for tutorials and downstream tasks.
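A minimal loading sketch, assuming the public microsoft/codereviewer checkpoint on the Hugging Face hub (CodeReviewer is T5-based, so a seq2seq class is used; the CodeReviewer folder's own scripts may wrap this differently):
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("microsoft/codereviewer")
model = AutoModelForSeq2SeqLM.from_pretrained("microsoft/codereviewer")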
CodeExecutor
This repo provides the code for reproducing the experiments in Code Execution with Pre-trained Language Models. CodeExecutor is a pre-trained model that learns to predict the execution traces using a code execution pre-training task and curriculum learning.
Please refer to the CodeExecutor folder for details.
LongCoder
This repo provides the code for reproducing the experiments on the LCC datasets in LongCoder: A Long-Range Pre-trained Language Model for Code Completion. LongCoder is a sparse and efficient pre-trained Transformer model for long code modeling.
Please refer to the LongCoder folder for details.
Contact
Feel free to contact Daya Guo (guody5@mail2.sysu.edu.cn), Shuai Lu (shuailu@microsoft.com) and Nan Duan (nanduan@microsoft.com) if you have any further questions.
Contributing
We appreciate all contributions and thank all the contributors!