Top Related Projects
Evolutionary Scale Modeling (esm): Pretrained language models for proteins
TensorFlow code and pre-trained models for BERT
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
A BERT model for scientific text.
LAVIS - A One-stop Library for Language-Vision Intelligence
Quick Overview
BioGPT is an open-source large language model developed by Microsoft, specifically trained on biomedical literature. It aims to assist researchers and professionals in the biomedical field by providing domain-specific knowledge and capabilities for various biomedical natural language processing tasks.
Pros
- Specialized in biomedical domain, offering more accurate and relevant responses for biomedical queries
- Pre-trained on a vast corpus of biomedical literature, providing up-to-date knowledge
- Supports various downstream tasks such as question answering, text generation, and named entity recognition
- Open-source, allowing for community contributions and customization
Cons
- Requires significant computational resources for fine-tuning and deployment
- May have limitations in understanding or generating content outside the biomedical domain
- Potential for biases present in the training data to be reflected in the model's outputs
- Requires careful prompt engineering and fine-tuning for optimal performance in specific use cases
Code Examples
# Load the BioGPT model and tokenizer
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")
# Generate text based on a prompt
prompt = "The BRCA1 gene is associated with"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_length=100, num_return_sequences=1)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
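For less deterministic continuations, standard transformers generation arguments such as do_sample, top_k, and num_return_sequences can be adjusted; the following is a minimal sketch reusing the model and tokenizer loaded above:
# Sampling-based generation for more varied continuations
output = model.generate(
    input_ids,
    max_length=100,
    do_sample=True,            # sample instead of greedy decoding
    top_k=50,                  # restrict sampling to the 50 most likely tokens
    num_return_sequences=3,    # return several candidate continuations
)
for seq in output:
    print(tokenizer.decode(seq, skip_special_tokens=True))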
# Perform question answering by prompting the model
# (BioGPT is a causal language model, so answers are generated rather than
# extracted via start/end span logits)
question = "What is the function of the p53 protein?"
context = "The p53 protein is a tumor suppressor that plays a crucial role in preventing cancer development."
prompt = f"question: {question} context: {context} answer:"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_length=150, num_return_sequences=1)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Getting Started
To get started with BioGPT, follow these steps:
- Install the required dependencies:
pip install transformers torch
- Load the model and tokenizer:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")
- Use the model for your desired task, such as text generation or question answering, as shown in the code examples above.
Competitor Comparisons
Evolutionary Scale Modeling (esm): Pretrained language models for proteins
Pros of ESM
- Focuses on protein language models, offering pre-trained models for various protein-related tasks
- Provides extensive documentation and examples for using the models in different scenarios
- Actively maintained with frequent updates and contributions from the research community
Cons of ESM
- Limited to protein-specific tasks, unlike BioGPT's broader biomedical text generation capabilities
- May require more domain expertise to utilize effectively compared to BioGPT's more general-purpose approach
- Larger model sizes can be computationally demanding for some users
Code Comparison
ESM:
import torch
from esm import pretrained
# Load a pre-trained ESM-2 protein language model and its tokenization alphabet
model, alphabet = pretrained.load_model_and_alphabet("esm2_t33_650M_UR50D")
# The batch converter turns (label, sequence) pairs into model-ready token batches
batch_converter = alphabet.get_batch_converter()
BioGPT:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")
TensorFlow code and pre-trained models for BERT
Pros of BERT
- Widely adopted and extensively studied in the NLP community
- Pre-trained on a large corpus of general text, making it versatile for various NLP tasks
- Supports multiple languages and has numerous fine-tuned variants available
Cons of BERT
- Not specifically designed for biomedical text, potentially limiting its performance in specialized domains
- Requires fine-tuning for specific tasks, which can be computationally expensive
- May struggle with long-range dependencies due to its fixed input length
Code Comparison
BERT:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
BioGPT:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")
Both repositories use the Hugging Face Transformers library, but BioGPT is specifically designed for biomedical text generation tasks, while BERT is a more general-purpose language model. BioGPT may offer better performance in biomedical domains, but BERT's versatility and extensive ecosystem make it a strong choice for many NLP applications.
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Pros of transformers
- Broader scope, covering a wide range of NLP tasks and models
- Larger community and more frequent updates
- Extensive documentation and examples
Cons of transformers
- Can be overwhelming for beginners due to its vast scope
- May require more setup and configuration for specific tasks
Code comparison
BioGPT (fairseq-based repository, as documented in the README below):
from fairseq.models.transformer_lm import TransformerLanguageModel
m = TransformerLanguageModel.from_pretrained(
    "checkpoints/Pre-trained-BioGPT", "checkpoint.pt", "data",
    tokenizer='moses', bpe='fastbpe', bpe_codes="data/bpecodes")
src_tokens = m.encode("Describe the function of insulin:")
output = m.decode(m.generate([src_tokens], beam=5)[0][0]["tokens"])
transformers:
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")
tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
inputs = tokenizer("Describe the function of insulin:", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Summary
transformers offers a comprehensive toolkit for a wide range of NLP tasks and benefits from a larger community and more frequent updates, though its breadth can be overwhelming for beginners. The BioGPT repository focuses specifically on biomedical text generation and mining, shipping task-specific checkpoints and fairseq-based training and inference scripts. Because BioGPT is also integrated into transformers, the two can be combined: the Hugging Face checkpoint gives access to the broader transformers tooling while keeping BioGPT's biomedical specialization.
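As a concrete illustration of the transformers-side workflow, the same BioGPT checkpoint can be driven through the high-level pipeline API; this is a minimal sketch that mirrors the Hugging Face usage shown later on this page:
from transformers import pipeline, set_seed
# Text-generation pipeline backed by the BioGPT checkpoint on the Hugging Face Hub
generator = pipeline("text-generation", model="microsoft/biogpt")
set_seed(42)
print(generator("Describe the function of insulin:", max_length=50, num_return_sequences=1))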
A BERT model for scientific text.
Pros of SciBERT
- Pre-trained on a large corpus of scientific text, making it well-suited for scientific and biomedical NLP tasks
- Offers both cased and uncased models, providing flexibility for different use cases
- Widely adopted in the scientific community with extensive documentation and examples
Cons of SciBERT
- Limited to BERT architecture, which may not be as advanced as more recent language models
- Focused primarily on scientific text, potentially less versatile for general-purpose tasks
- Requires fine-tuning for specific downstream tasks, which can be computationally expensive
Code Comparison
SciBERT:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')
BioGPT:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")
The main differences in the code snippets are:
- SciBERT uses AutoModel, while BioGPT uses AutoModelForCausalLM
- The model names and paths differ based on their respective repositories
- BioGPT is designed for causal language modeling, while SciBERT is a bidirectional encoder (see the sketch below)
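To make the encoder/decoder distinction concrete, the following minimal sketch (using the same public checkpoints as above) contrasts the two usage patterns: SciBERT returns contextual embeddings, while BioGPT generates text autoregressively.
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM
# SciBERT: bidirectional encoder -> contextual embeddings (no generate())
sci_tok = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
sci_enc = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
with torch.no_grad():
    emb = sci_enc(**sci_tok("Aspirin inhibits platelet aggregation.", return_tensors="pt")).last_hidden_state
print(emb.shape)  # (batch, sequence_length, hidden_size)
# BioGPT: causal decoder -> autoregressive text generation
bio_tok = AutoTokenizer.from_pretrained("microsoft/biogpt")
bio_lm = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")
out = bio_lm.generate(**bio_tok("Aspirin is", return_tensors="pt"), max_length=40)
print(bio_tok.decode(out[0], skip_special_tokens=True))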
LAVIS - A One-stop Library for Language-Vision Intelligence
Pros of LAVIS
- Broader scope: Supports multiple vision-language tasks beyond just text generation
- More extensive documentation and examples for various use cases
- Active development with frequent updates and community contributions
Cons of LAVIS
- Larger codebase and potentially more complex setup
- May require more computational resources due to its multi-modal nature
- Less specialized for biomedical domain-specific tasks
Code Comparison
LAVIS example:
from PIL import Image
from lavis.models import load_model_and_preprocess
model, vis_processors, txt_processors = load_model_and_preprocess(name="blip_caption", model_type="large_coco", is_eval=True)
raw_image = Image.open("path/to/image.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0)
caption = model.generate({"image": image})
BioGPT example:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")
inputs = tokenizer("Generate a summary of the following medical text:", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
README
BioGPT
This repository contains the implementation of BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining, by Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon and Tie-Yan Liu.
Requirements and Installation
- PyTorch version == 1.12.0
- Python version == 3.10
- fairseq version == 0.12.0:
git clone https://github.com/pytorch/fairseq
cd fairseq
git checkout v0.12.0
pip install .
python setup.py build_ext --inplace
cd ..
- Moses
git clone https://github.com/moses-smt/mosesdecoder.git
export MOSES=${PWD}/mosesdecoder
- fastBPE
git clone https://github.com/glample/fastBPE.git
export FASTBPE=${PWD}/fastBPE
cd fastBPE
g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast
- sacremoses
pip install sacremoses
- sklearn
pip install scikit-learn
Remember to set the environment variables MOSES and FASTBPE to the paths of Moses and fastBPE respectively, as they will be required later.
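Once the dependencies are installed, a quick sanity check is to confirm that the Python-side packages import and report the expected versions (a minimal sketch, not required by the installation steps above):
# Minimal sanity check: the Python-side dependencies should import cleanly
import torch, fairseq, sacremoses, sklearn
print("torch", torch.__version__)      # expected 1.12.0 per the requirements above
print("fairseq", fairseq.__version__)  # expected 0.12.0 per the requirements above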
Getting Started
Pre-trained models
We provide our pre-trained BioGPT model checkpoints along with fine-tuned checkpoints for downstream tasks, available both via direct URL download and through the Hugging Face 🤗 Hub.
Model | Description | URL | 🤗 Hub |
---|---|---|---|
BioGPT | Pre-trained BioGPT model checkpoint | link | link |
BioGPT-Large | Pre-trained BioGPT-Large model checkpoint | link | link |
BioGPT-QA-PubMedQA-BioGPT | Fine-tuned BioGPT for question answering task on PubMedQA | link | |
BioGPT-QA-PubMedQA-BioGPT-Large | Fine-tuned BioGPT-Large for question answering task on PubMedQA | link | |
BioGPT-RE-BC5CDR | Fine-tuned BioGPT for relation extraction task on BC5CDR | link | |
BioGPT-RE-DDI | Fine-tuned BioGPT for relation extraction task on DDI | link | |
BioGPT-RE-DTI | Fine-tuned BioGPT for relation extraction task on KD-DTI | link | |
BioGPT-DC-HoC | Fine-tuned BioGPT for document classification task on HoC | link |
Download them and extract them to the checkpoints folder of this project.
For example:
mkdir checkpoints
cd checkpoints
wget "https://msralaphilly2.blob.core.windows.net/release/BioGPT/checkpoints/Pre-trained-BioGPT.tgz?sp=r&st=2023-11-13T15:37:35Z&se=2099-12-30T23:37:35Z&spr=https&sv=2022-11-02&sr=b&sig=3CcG1TOhqJPBhkVutvVn3PtUq0vPyLBgwggUfojypfY%3D"
tar -zxvf Pre-trained-BioGPT.tgz
Example Usage
Use pre-trained BioGPT model in your code:
import torch
from fairseq.models.transformer_lm import TransformerLanguageModel
m = TransformerLanguageModel.from_pretrained(
"checkpoints/Pre-trained-BioGPT",
"checkpoint.pt",
"data",
tokenizer='moses',
bpe='fastbpe',
bpe_codes="data/bpecodes",
min_len=100,
max_len_b=1024)
m.cuda()
src_tokens = m.encode("COVID-19 is")
generate = m.generate([src_tokens], beam=5)[0]
output = m.decode(generate[0]["tokens"])
print(output)
Use fine-tuned BioGPT model on KD-DTI for drug-target-interaction in your code:
import torch
from src.transformer_lm_prompt import TransformerLanguageModelPrompt
m = TransformerLanguageModelPrompt.from_pretrained(
"checkpoints/RE-DTI-BioGPT",
"checkpoint_avg.pt",
"data/KD-DTI/relis-bin",
tokenizer='moses',
bpe='fastbpe',
bpe_codes="data/bpecodes",
max_len_b=1024,
beam=1)
m.cuda()
src_text = ""  # input text, e.g., a PubMed abstract
src_tokens = m.encode(src_text)
generate = m.generate([src_tokens], beam=1)[0]  # beam width matches the beam=1 set above
output = m.decode(generate[0]["tokens"])
print(output)
For more downstream tasks, please see below.
Downstream tasks
See corresponding folder in examples:
Relation Extraction on BC5CDR
Relation Extraction on KD-DTI
Relation Extraction on DDI
Document Classification on HoC
Question Answering on PubMedQA
Text Generation
Hugging Face 🤗 Usage
BioGPT has also been integrated into the Hugging Face transformers
library, and model checkpoints are available on the Hugging Face Hub.
You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:
from transformers import pipeline, set_seed
from transformers import BioGptTokenizer, BioGptForCausalLM
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")
tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
set_seed(42)
generator("COVID-19 is", max_length=20, num_return_sequences=5, do_sample=True)
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import BioGptTokenizer, BioGptForCausalLM
tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
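If you want hidden-state features rather than next-token logits, they can be requested explicitly (a minimal sketch using the standard output_hidden_states argument):
output = model(**encoded_input, output_hidden_states=True)
features = output.hidden_states[-1]  # last-layer features: (batch, sequence_length, hidden_size)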
Beam-search decoding:
import torch
from transformers import BioGptTokenizer, BioGptForCausalLM, set_seed
tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")
sentence = "COVID-19 is"
inputs = tokenizer(sentence, return_tensors="pt")
set_seed(42)
with torch.no_grad():
beam_output = model.generate(**inputs,
min_length=100,
max_length=1024,
num_beams=5,
early_stopping=True
)
tokenizer.decode(beam_output[0], skip_special_tokens=True)
For more information, please see the documentation on the Hugging Face website.
Demos
Check out these demos on Hugging Face Spaces:
License
BioGPT is MIT-licensed. The license applies to the pre-trained models as well.
Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.