microsoft/BioGPT


Top Related Projects

  • Evolutionary Scale Modeling (esm): Pretrained language models for proteins
  • TensorFlow code and pre-trained models for BERT
  • 🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX
  • SciBERT: A BERT model for scientific text
  • LAVIS: A One-stop Library for Language-Vision Intelligence

Quick Overview

BioGPT is an open-source generative language model from Microsoft Research, pre-trained from scratch on large-scale biomedical literature. It is designed to assist researchers and practitioners with biomedical natural language processing tasks such as relation extraction, question answering, document classification, and text generation.

Pros

  • Specialized in the biomedical domain, offering more accurate and relevant responses for biomedical queries
  • Pre-trained on a large corpus of PubMed abstracts, giving it broad coverage of biomedical terminology and concepts
  • Supports downstream tasks such as question answering, relation extraction, document classification, and text generation
  • Open-source, allowing for community contributions and customization

Cons

  • Requires significant computational resources for fine-tuning and deployment
  • May have limitations in understanding or generating content outside the biomedical domain
  • Potential for biases present in the training data to be reflected in the model's outputs
  • Requires careful prompt engineering and fine-tuning for optimal performance in specific use cases

Code Examples

# Load the BioGPT model and tokenizer
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")
# Generate text based on a prompt
prompt = "The BRCA1 gene is associated with"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_length=100, num_return_sequences=1)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
# Answer a question by prompting the model (BioGPT is a decoder-only
# causal LM, so question answering is done generatively, not with
# start/end span logits as in extractive QA models)
import torch

question = "What is the function of the p53 protein?"
prompt = f"question: {question} answer:"  # simple illustrative format; the fine-tuned PubMedQA checkpoints use their own task-specific prompts
input_ids = tokenizer.encode(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=50, num_beams=5, early_stopping=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Getting Started

To get started with BioGPT, follow these steps:

  1. Install the required dependencies:

    pip install transformers torch
    
  2. Load the model and tokenizer:

    from transformers import AutoTokenizer, AutoModelForCausalLM
    
    tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
    model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")
    
  3. Use the model for your desired task, such as text generation or question answering, as shown in the code examples above.
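As a quick smoke test, the high-level pipeline API (covered in more detail in the Hugging Face 🤗 Usage section below) wraps both steps; a minimal sketch:

from transformers import pipeline, set_seed

# Build a text-generation pipeline around BioGPT; the checkpoint is
# downloaded from the Hugging Face Hub on first use
generator = pipeline("text-generation", model="microsoft/biogpt")
set_seed(42)  # sampling is stochastic; fix the seed for reproducibility
print(generator("The BRCA1 gene is associated with", max_length=30, do_sample=True))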

Competitor Comparisons

Evolutionary Scale Modeling (esm): Pretrained language models for proteins

Pros of ESM

  • Focuses on protein language models, offering pre-trained models for various protein-related tasks
  • Provides extensive documentation and examples for using the models in different scenarios
  • Actively maintained with frequent updates and contributions from the research community

Cons of ESM

  • Limited to protein-specific tasks, unlike BioGPT's broader biomedical text generation capabilities
  • May require more domain expertise to utilize effectively compared to BioGPT's more general-purpose approach
  • Larger model sizes can be computationally demanding for some users

Code Comparison

ESM:

import torch
from esm import pretrained

# Load a pre-trained ESM-2 model together with its alphabet
# (the alphabet plays the role a tokenizer does in transformers)
model, alphabet = pretrained.load_model_and_alphabet("esm2_t33_650M_UR50D")
batch_converter = alphabet.get_batch_converter()
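The loaded model can then embed protein sequences; a brief sketch continuing the snippet above (the sequence is a placeholder, and repr_layers=[33] selects the final layer of this 33-layer ESM-2 model):

data = [("protein1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
labels, strs, batch_tokens = batch_converter(data)
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[33])
per_residue = results["representations"][33]  # (batch, tokens, hidden)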

BioGPT:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")

TensorFlow code and pre-trained models for BERT

Pros of BERT

  • Widely adopted and extensively studied in the NLP community
  • Pre-trained on a large corpus of general text, making it versatile for various NLP tasks
  • Supports multiple languages and has numerous fine-tuned variants available

Cons of BERT

  • Not specifically designed for biomedical text, potentially limiting its performance in specialized domains
  • Requires fine-tuning for specific tasks, which can be computationally expensive
  • May struggle with long-range dependencies due to its fixed input length

Code Comparison

BERT:

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

BioGPT:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")

Both repositories use the Hugging Face Transformers library, but BioGPT is specifically designed for biomedical text generation tasks, while BERT is a more general-purpose language model. BioGPT may offer better performance in biomedical domains, but BERT's versatility and extensive ecosystem make it a strong choice for many NLP applications.
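To make the architectural difference concrete, here is a minimal sketch (using checkpoints available on the Hugging Face Hub): BERT fills in masked tokens using bidirectional context, while BioGPT continues text left to right.

from transformers import pipeline

# BERT: bidirectional encoder, a natural fit for fill-in-the-blank
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("The p53 protein is a tumor [MASK]."))

# BioGPT: causal decoder, a natural fit for left-to-right generation
gen = pipeline("text-generation", model="microsoft/biogpt")
print(gen("The p53 protein is a tumor", max_length=20))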

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.

Pros of transformers

  • Broader scope, covering a wide range of NLP tasks and models
  • Larger community and more frequent updates
  • Extensive documentation and examples

Cons of transformers

  • Can be overwhelming for beginners due to its vast scope
  • May require more setup and configuration for specific tasks

Code comparison

BioGPT (via fairseq, as in the official repository; assumes the pre-trained checkpoint and data files have been downloaded as described in the README below):

from fairseq.models.transformer_lm import TransformerLanguageModel

m = TransformerLanguageModel.from_pretrained(
    "checkpoints/Pre-trained-BioGPT", "checkpoint.pt", "data",
    tokenizer='moses', bpe='fastbpe', bpe_codes="data/bpecodes")
src_tokens = m.encode("Describe the function of insulin:")
print(m.decode(m.generate([src_tokens], beam=5)[0][0]["tokens"]))

transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")
tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
inputs = tokenizer("Describe the function of insulin:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Summary

transformers offers a comprehensive toolkit for many NLP tasks and now ships BioGPT-specific classes (BioGptTokenizer, BioGptForCausalLM), while the BioGPT repository focuses on fairseq-based training and inference recipes for biomedical tasks. transformers benefits from a larger community and more frequent updates, but can be more complex for beginners; the BioGPT repository is narrower in scope but tailored to biomedical text generation and mining.

A BERT model for scientific text.

Pros of SciBERT

  • Pre-trained on a large corpus of scientific text, making it well-suited for scientific and biomedical NLP tasks
  • Offers both cased and uncased models, providing flexibility for different use cases
  • Widely adopted in the scientific community with extensive documentation and examples

Cons of SciBERT

  • Limited to BERT architecture, which may not be as advanced as more recent language models
  • Focused primarily on scientific text, potentially less versatile for general-purpose tasks
  • Requires fine-tuning for specific downstream tasks, which can be computationally expensive

Code Comparison

SciBERT:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')

BioGPT:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")

The main differences in the code snippets are:

  1. SciBERT uses AutoModel, while BioGPT uses AutoModelForCausalLM
  2. The model names and paths differ based on their respective repositories
  3. BioGPT is designed for causal language modeling, while SciBERT is a bidirectional encoder (see the encoder-usage sketch below)
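Because SciBERT is an encoder, it is typically used to produce contextual embeddings rather than to generate text. A minimal sketch (mean pooling is one simple choice among several):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

inputs = tokenizer("The BRCA1 gene is associated with breast cancer.", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
sentence_embedding = hidden.mean(dim=1)  # simple mean-pooled sentence vector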

LAVIS - A One-stop Library for Language-Vision Intelligence

Pros of LAVIS

  • Broader scope: Supports multiple vision-language tasks beyond just text generation
  • More extensive documentation and examples for various use cases
  • Active development with frequent updates and community contributions

Cons of LAVIS

  • Larger codebase and potentially more complex setup
  • May require more computational resources due to its multi-modal nature
  • Less specialized for biomedical domain-specific tasks

Code Comparison

LAVIS example:

from PIL import Image
from lavis.models import load_model_and_preprocess

# Load a BLIP captioning model plus its image/text preprocessors
# (for blip_caption, model_type options include "base_coco" and "large_coco")
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True)
raw_image = Image.open("path/to/image.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0)
caption = model.generate({"image": image})

BioGPT example:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")
# BioGPT is a plain language model (not instruction-tuned), so prompts
# work best as biomedical text for the model to continue
inputs = tokenizer("COVID-19 is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


README

BioGPT

This repository contains the implementation of BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining, by Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon and Tie-Yan Liu.

Requirements and Installation

  • PyTorch version == 1.12.0
  • Python version == 3.10
  • fairseq version == 0.12.0:

    git clone https://github.com/pytorch/fairseq
    cd fairseq
    git checkout v0.12.0
    pip install .
    python setup.py build_ext --inplace
    cd ..

  • Moses

    git clone https://github.com/moses-smt/mosesdecoder.git
    export MOSES=${PWD}/mosesdecoder

  • fastBPE

    git clone https://github.com/glample/fastBPE.git
    export FASTBPE=${PWD}/fastBPE
    cd fastBPE
    g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast

  • sacremoses

    pip install sacremoses

  • scikit-learn

    pip install scikit-learn

Remember to set the environment variables MOSES and FASTBPE to the paths of Moses and fastBPE respectively, as they will be required later.

Getting Started

Pre-trained models

We provide our pre-trained BioGPT model checkpoints along with fine-tuned checkpoints for downstream tasks, available both via direct URL download and on the Hugging Face 🤗 Hub.

| Model | Description | URL | 🤗 Hub |
|-------|-------------|-----|--------|
| BioGPT | Pre-trained BioGPT model checkpoint | link | link |
| BioGPT-Large | Pre-trained BioGPT-Large model checkpoint | link | link |
| BioGPT-QA-PubMedQA-BioGPT | Fine-tuned BioGPT for question answering task on PubMedQA | link | |
| BioGPT-QA-PubMedQA-BioGPT-Large | Fine-tuned BioGPT-Large for question answering task on PubMedQA | link | |
| BioGPT-RE-BC5CDR | Fine-tuned BioGPT for relation extraction task on BC5CDR | link | |
| BioGPT-RE-DDI | Fine-tuned BioGPT for relation extraction task on DDI | link | |
| BioGPT-RE-DTI | Fine-tuned BioGPT for relation extraction task on KD-DTI | link | |
| BioGPT-DC-HoC | Fine-tuned BioGPT for document classification task on HoC | link | |

Download them and extract them to the checkpoints folder of this project.

For example:

mkdir checkpoints
cd checkpoints
# quote the URL: the unquoted & characters would otherwise be
# interpreted by the shell, and -O keeps a clean output filename
wget -O Pre-trained-BioGPT.tgz "https://msralaphilly2.blob.core.windows.net/release/BioGPT/checkpoints/Pre-trained-BioGPT.tgz?sp=r&st=2023-11-13T15:37:35Z&se=2099-12-30T23:37:35Z&spr=https&sv=2022-11-02&sr=b&sig=3CcG1TOhqJPBhkVutvVn3PtUq0vPyLBgwggUfojypfY%3D"
tar -zxvf Pre-trained-BioGPT.tgz

Example Usage

Use the pre-trained BioGPT model in your code:

import torch
from fairseq.models.transformer_lm import TransformerLanguageModel
m = TransformerLanguageModel.from_pretrained(
        "checkpoints/Pre-trained-BioGPT", 
        "checkpoint.pt", 
        "data",
        tokenizer='moses', 
        bpe='fastbpe', 
        bpe_codes="data/bpecodes",
        min_len=100,
        max_len_b=1024)
m.cuda()
src_tokens = m.encode("COVID-19 is")
generate = m.generate([src_tokens], beam=5)[0]
output = m.decode(generate[0]["tokens"])
print(output)
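Note that m.cuda() assumes a CUDA-capable GPU is available; omit that line to run inference (more slowly) on CPU.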

Use the BioGPT model fine-tuned on KD-DTI for drug-target interaction extraction in your code:

import torch
from src.transformer_lm_prompt import TransformerLanguageModelPrompt
m = TransformerLanguageModelPrompt.from_pretrained(
        "checkpoints/RE-DTI-BioGPT", 
        "checkpoint_avg.pt", 
        "data/KD-DTI/relis-bin",
        tokenizer='moses', 
        bpe='fastbpe', 
        bpe_codes="data/bpecodes",
        max_len_b=1024,
        beam=1)
m.cuda()
src_text="" # input text, e.g., a PubMed abstract
src_tokens = m.encode(src_text)
generate = m.generate([src_tokens], beam=args.beam)[0]
output = m.decode(generate[0]["tokens"])
print(output)

For more downstream tasks, please see below.

Downstream tasks

See the corresponding folders under examples:

Relation Extraction on BC5CDR

Relation Extraction on KD-DTI

Relation Extraction on DDI

Document Classification on HoC

Question Answering on PubMedQA

Text Generation

Hugging Face 🤗 Usage

BioGPT has also been integrated into the Hugging Face transformers library, and model checkpoints are available on the Hugging Face Hub.

You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:

from transformers import pipeline, set_seed
from transformers import BioGptTokenizer, BioGptForCausalLM
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")
tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
set_seed(42)
generator("COVID-19 is", max_length=20, num_return_sequences=5, do_sample=True)

Here is how to use this model to get the features of a given text in PyTorch:

from transformers import BioGptTokenizer, BioGptForCausalLM
tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
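# output.logits holds next-token scores of shape (batch, seq_len, vocab_size);
# pass output_hidden_states=True to the call above to also get hidden features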

Beam-search decoding:

import torch
from transformers import BioGptTokenizer, BioGptForCausalLM, set_seed

tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")

sentence = "COVID-19 is"
inputs = tokenizer(sentence, return_tensors="pt")

set_seed(42)

with torch.no_grad():
    beam_output = model.generate(**inputs,
                                 min_length=100,
                                 max_length=1024,
                                 num_beams=5,
                                 early_stopping=True
                                )
tokenizer.decode(beam_output[0], skip_special_tokens=True)

For more information, please see the documentation on the Hugging Face website.

Demos

Check out these demos on Hugging Face Spaces:

License

BioGPT is MIT-licensed. The license applies to the pre-trained models as well.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.