Top Related Projects
Evolutionary Scale Modeling (esm): Pretrained language models for proteins
TensorFlow code and pre-trained models for BERT
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
A BERT model for scientific text.
LAVIS - A One-stop Library for Language-Vision Intelligence
Quick Overview
BioGPT is an open-source large language model developed by Microsoft, specifically trained on biomedical literature. It aims to assist researchers and professionals in the biomedical field by providing domain-specific knowledge and capabilities for various biomedical natural language processing tasks.
Pros
- Specialized in biomedical domain, offering more accurate and relevant responses for biomedical queries
- Pre-trained on a vast corpus of biomedical literature, providing up-to-date knowledge
- Supports various downstream tasks such as question answering, text generation, and named entity recognition
- Open-source, allowing for community contributions and customization
Cons
- Requires significant computational resources for fine-tuning and deployment
- May have limitations in understanding or generating content outside the biomedical domain
- Potential for biases present in the training data to be reflected in the model's outputs
- Requires careful prompt engineering and fine-tuning for optimal performance in specific use cases
Code Examples
# Load the BioGPT model and tokenizer
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")
# Generate text based on a prompt
prompt = "The BRCA1 gene is associated with"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_length=100, num_return_sequences=1)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
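For less deterministic continuations, standard transformers generation arguments such as do_sample, top_k, and num_return_sequences can be adjusted; the following is a minimal sketch reusing the model and tokenizer loaded above:
# Sampling-based generation for more varied continuations
output = model.generate(
    input_ids,
    max_length=100,
    do_sample=True,            # sample instead of greedy decoding
    top_k=50,                  # restrict sampling to the 50 most likely tokens
    num_return_sequences=3,    # return several candidate continuations
)
for seq in output:
    print(tokenizer.decode(seq, skip_special_tokens=True))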
# Perform question answering by prompting the model
# (BioGPT is a causal language model, so answers are generated rather than
# extracted via start/end span logits)
question = "What is the function of the p53 protein?"
context = "The p53 protein is a tumor suppressor that plays a crucial role in preventing cancer development."
prompt = f"question: {question} context: {context} answer:"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_length=150, num_return_sequences=1)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Getting Started
To get started with BioGPT, follow these steps:
- Install the required dependencies:
pip install transformers torch
- Load the model and tokenizer:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")
- Use the model for your desired task, such as text generation or question answering, as shown in the code examples above.
Competitor Comparisons
Evolutionary Scale Modeling (esm): Pretrained language models for proteins
Pros of ESM
- Focuses on protein language models, offering pre-trained models for various protein-related tasks
- Provides extensive documentation and examples for using the models in different scenarios
- Actively maintained with frequent updates and contributions from the research community
Cons of ESM
- Limited to protein-specific tasks, unlike BioGPT's broader biomedical text generation capabilities
- May require more domain expertise to utilize effectively compared to BioGPT's more general-purpose approach
- Larger model sizes can be computationally demanding for some users
Code Comparison
ESM:
import torch
from esm import pretrained
# Load a pre-trained ESM-2 protein language model and its tokenization alphabet
model, alphabet = pretrained.load_model_and_alphabet("esm2_t33_650M_UR50D")
# The batch converter turns (label, sequence) pairs into model-ready token batches
batch_converter = alphabet.get_batch_converter()
BioGPT:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")
TensorFlow code and pre-trained models for BERT
Pros of BERT
- Widely adopted and extensively studied in the NLP community
- Pre-trained on a large corpus of general text, making it versatile for various NLP tasks
- Supports multiple languages and has numerous fine-tuned variants available
Cons of BERT
- Not specifically designed for biomedical text, potentially limiting its performance in specialized domains
- Requires fine-tuning for specific tasks, which can be computationally expensive
- May struggle with long-range dependencies due to its fixed input length
Code Comparison
BERT:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
BioGPT:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")
Both repositories use the Hugging Face Transformers library, but BioGPT is specifically designed for biomedical text generation tasks, while BERT is a more general-purpose language model. BioGPT may offer better performance in biomedical domains, but BERT's versatility and extensive ecosystem make it a strong choice for many NLP applications.
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Pros of transformers
- Broader scope, covering a wide range of NLP tasks and models
- Larger community and more frequent updates
- Extensive documentation and examples
Cons of transformers
- Can be overwhelming for beginners due to its vast scope
- May require more setup and configuration for specific tasks
Code comparison
BioGPT (fairseq-based repository, as documented in the README below):
from fairseq.models.transformer_lm import TransformerLanguageModel
m = TransformerLanguageModel.from_pretrained(
    "checkpoints/Pre-trained-BioGPT", "checkpoint.pt", "data",
    tokenizer='moses', bpe='fastbpe', bpe_codes="data/bpecodes")
src_tokens = m.encode("Describe the function of insulin:")
output = m.decode(m.generate([src_tokens], beam=5)[0][0]["tokens"])
transformers:
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")
tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
inputs = tokenizer("Describe the function of insulin:", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Summary
transformers offers a comprehensive toolkit for a wide range of NLP tasks and benefits from a larger community and more frequent updates, though its breadth can be overwhelming for beginners. The BioGPT repository focuses specifically on biomedical text generation and mining, shipping task-specific checkpoints and fairseq-based training and inference scripts. Because BioGPT is also integrated into transformers, the two can be combined: the Hugging Face checkpoint gives access to the broader transformers tooling while keeping BioGPT's biomedical specialization.
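As a concrete illustration of the transformers-side workflow, the same BioGPT checkpoint can be driven through the high-level pipeline API; this is a minimal sketch that mirrors the Hugging Face usage shown later on this page:
from transformers import pipeline, set_seed
# Text-generation pipeline backed by the BioGPT checkpoint on the Hugging Face Hub
generator = pipeline("text-generation", model="microsoft/biogpt")
set_seed(42)
print(generator("Describe the function of insulin:", max_length=50, num_return_sequences=1))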
A BERT model for scientific text.
Pros of SciBERT
- Pre-trained on a large corpus of scientific text, making it well-suited for scientific and biomedical NLP tasks
- Offers both cased and uncased models, providing flexibility for different use cases
- Widely adopted in the scientific community with extensive documentation and examples
Cons of SciBERT
- Limited to BERT architecture, which may not be as advanced as more recent language models
- Focused primarily on scientific text, potentially less versatile for general-purpose tasks
- Requires fine-tuning for specific downstream tasks, which can be computationally expensive
Code Comparison
SciBERT:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')
BioGPT:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")
The main differences in the code snippets are:
- SciBERT uses AutoModel, while BioGPT uses AutoModelForCausalLM
- The model names and paths differ based on their respective repositories
- BioGPT is designed for causal language modeling, while SciBERT is a bidirectional encoder (see the sketch below)
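To make the encoder/decoder distinction concrete, the following minimal sketch (using the same public checkpoints as above) contrasts the two usage patterns: SciBERT returns contextual embeddings, while BioGPT generates text autoregressively.
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM
# SciBERT: bidirectional encoder -> contextual embeddings (no generate())
sci_tok = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
sci_enc = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
with torch.no_grad():
    emb = sci_enc(**sci_tok("Aspirin inhibits platelet aggregation.", return_tensors="pt")).last_hidden_state
print(emb.shape)  # (batch, sequence_length, hidden_size)
# BioGPT: causal decoder -> autoregressive text generation
bio_tok = AutoTokenizer.from_pretrained("microsoft/biogpt")
bio_lm = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")
out = bio_lm.generate(**bio_tok("Aspirin is", return_tensors="pt"), max_length=40)
print(bio_tok.decode(out[0], skip_special_tokens=True))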
LAVIS - A One-stop Library for Language-Vision Intelligence
Pros of LAVIS
- Broader scope: Supports multiple vision-language tasks beyond just text generation
- More extensive documentation and examples for various use cases
- Active development with frequent updates and community contributions
Cons of LAVIS
- Larger codebase and potentially more complex setup
- May require more computational resources due to its multi-modal nature
- Less specialized for biomedical domain-specific tasks
Code Comparison
LAVIS example:
from PIL import Image
from lavis.models import load_model_and_preprocess
model, vis_processors, txt_processors = load_model_and_preprocess(name="blip_caption", model_type="large_coco", is_eval=True)
raw_image = Image.open("path/to/image.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0)
caption = model.generate({"image": image})
BioGPT example:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")
inputs = tokenizer("Generate a summary of the following medical text:", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
README
BioGPT
This repository contains the implementation of BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining, by Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon and Tie-Yan Liu.
Requirements and Installation
- PyTorch version == 1.12.0
- Python version == 3.10
- fairseq version == 0.12.0:
git clone https://github.com/pytorch/fairseq
cd fairseq
git checkout v0.12.0
pip install .
python setup.py build_ext --inplace
cd ..
- Moses
git clone https://github.com/moses-smt/mosesdecoder.git
export MOSES=${PWD}/mosesdecoder
- fastBPE
git clone https://github.com/glample/fastBPE.git
export FASTBPE=${PWD}/fastBPE
cd fastBPE
g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast
- sacremoses
pip install sacremoses
- sklearn
pip install scikit-learn
Remember to set the environment variables MOSES and FASTBPE to the paths of Moses and fastBPE respectively, as they will be required later.
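Once the dependencies are installed, a quick sanity check is to confirm that the Python-side packages import and report the expected versions (a minimal sketch, not required by the installation steps above):
# Minimal sanity check: the Python-side dependencies should import cleanly
import torch, fairseq, sacremoses, sklearn
print("torch", torch.__version__)      # expected 1.12.0 per the requirements above
print("fairseq", fairseq.__version__)  # expected 0.12.0 per the requirements above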
Getting Started
Pre-trained models
We provide our pre-trained BioGPT model checkpoints along with fine-tuned checkpoints for downstream tasks, available both via direct URL download and through the Hugging Face 🤗 Hub.
Model | Description | URL | 🤗 Hub |
---|---|---|---|
BioGPT | Pre-trained BioGPT model checkpoint | link | link |
BioGPT-Large | Pre-trained BioGPT-Large model checkpoint | link | link |
BioGPT-QA-PubMedQA-BioGPT | Fine-tuned BioGPT for question answering task on PubMedQA | link | |
BioGPT-QA-PubMedQA-BioGPT-Large | Fine-tuned BioGPT-Large for question answering task on PubMedQA | link | |
BioGPT-RE-BC5CDR | Fine-tuned BioGPT for relation extraction task on BC5CDR | link | |
BioGPT-RE-DDI | Fine-tuned BioGPT for relation extraction task on DDI | link | |
BioGPT-RE-DTI | Fine-tuned BioGPT for relation extraction task on KD-DTI | link | |
BioGPT-DC-HoC | Fine-tuned BioGPT for document classification task on HoC | link |
Download them and extract them to the checkpoints folder of this project.
For example:
mkdir checkpoints
cd checkpoints
wget "https://msralaphilly2.blob.core.windows.net/release/BioGPT/checkpoints/Pre-trained-BioGPT.tgz?sp=r&st=2023-11-13T15:37:35Z&se=2099-12-30T23:37:35Z&spr=https&sv=2022-11-02&sr=b&sig=3CcG1TOhqJPBhkVutvVn3PtUq0vPyLBgwggUfojypfY%3D"
tar -zxvf Pre-trained-BioGPT.tgz
Example Usage
Use pre-trained BioGPT model in your code:
import torch
from fairseq.models.transformer_lm import TransformerLanguageModel
m = TransformerLanguageModel.from_pretrained(
"checkpoints/Pre-trained-BioGPT",
"checkpoint.pt",
"data",
tokenizer='moses',
bpe='fastbpe',
bpe_codes="data/bpecodes",
min_len=100,
max_len_b=1024)
m.cuda()
src_tokens = m.encode("COVID-19 is")
generate = m.generate([src_tokens], beam=5)[0]
output = m.decode(generate[0]["tokens"])
print(output)
Use fine-tuned BioGPT model on KD-DTI for drug-target-interaction in your code:
import torch
from src.transformer_lm_prompt import TransformerLanguageModelPrompt
m = TransformerLanguageModelPrompt.from_pretrained(
"checkpoints/RE-DTI-BioGPT",
"checkpoint_avg.pt",
"data/KD-DTI/relis-bin",
tokenizer='moses',
bpe='fastbpe',
bpe_codes="data/bpecodes",
max_len_b=1024,
beam=1)
m.cuda()
src_text = ""  # input text, e.g., a PubMed abstract
src_tokens = m.encode(src_text)
generate = m.generate([src_tokens], beam=1)[0]  # beam width matches the beam=1 set above
output = m.decode(generate[0]["tokens"])
print(output)
For more downstream tasks, please see below.
Downstream tasks
See corresponding folder in examples:
Relation Extraction on BC5CDR
Relation Extraction on KD-DTI
Relation Extraction on DDI
Document Classification on HoC
Question Answering on PubMedQA
Text Generation
Hugging Face 🤗 Usage
BioGPT has also been integrated into the Hugging Face transformers
library, and model checkpoints are available on the Hugging Face Hub.
You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:
from transformers import pipeline, set_seed
from transformers import BioGptTokenizer, BioGptForCausalLM
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")
tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
set_seed(42)
generator("COVID-19 is", max_length=20, num_return_sequences=5, do_sample=True)
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import BioGptTokenizer, BioGptForCausalLM
tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
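If you want hidden-state features rather than next-token logits, they can be requested explicitly (a minimal sketch using the standard output_hidden_states argument):
output = model(**encoded_input, output_hidden_states=True)
features = output.hidden_states[-1]  # last-layer features: (batch, sequence_length, hidden_size)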
Beam-search decoding:
import torch
from transformers import BioGptTokenizer, BioGptForCausalLM, set_seed
tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")
sentence = "COVID-19 is"
inputs = tokenizer(sentence, return_tensors="pt")
set_seed(42)
with torch.no_grad():
beam_output = model.generate(**inputs,
min_length=100,
max_length=1024,
num_beams=5,
early_stopping=True
)
tokenizer.decode(beam_output[0], skip_special_tokens=True)
For more information, please see the documentation on the Hugging Face website.
Demos
Check out these demos on Hugging Face Spaces:
License
BioGPT is MIT-licensed. The license applies to the pre-trained models as well.
Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.