Top Related Projects
TensorFlow code and pre-trained models for BERT
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
Models and examples built with TensorFlow
Quick Overview
SciBERT is a pretrained language model based on BERT, specifically designed for scientific text. It's trained on a large corpus of scientific publications to better understand and process scientific language. SciBERT aims to improve performance on scientific and biomedical NLP tasks.
Pros
- Specialized for scientific and biomedical text, potentially outperforming general-purpose models on domain-specific tasks
- Pretrained on a large corpus of scientific publications, capturing domain-specific vocabulary and language patterns
- Compatible with existing BERT-based architectures and tools, allowing easy integration into existing workflows
- Offers both cased and uncased versions, as well as versions with custom and original BERT vocabularies
Cons
- May not perform as well on general-domain text compared to models trained on more diverse corpora
- Requires significant computational resources for fine-tuning and inference, similar to other BERT-based models
- Limited to the scientific knowledge available up to its training cutoff date
- Potential bias towards certain scientific fields or publication types represented in the training data
Code Examples
- Loading SciBERT model and tokenizer:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')
- Encoding text with SciBERT:
text = "The mitochondria is the powerhouse of the cell."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
- Fine-tuning SciBERT for text classification (see the dataset sketch after this list for how train_dataset and eval_dataset might be built):
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
model = AutoModelForSequenceClassification.from_pretrained('allenai/scibert_scivocab_uncased', num_labels=2)
# train_dataset and eval_dataset are assumed to be tokenized datasets prepared beforehand
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./results", num_train_epochs=3, per_device_train_batch_size=16),
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)
trainer.train()
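The fine-tuning example above leaves train_dataset and eval_dataset undefined. Below is a minimal sketch of how such datasets could be built with the SciBERT tokenizer; the SciTextDataset class and the toy texts and labels are hypothetical stand-ins, not part of the SciBERT repo.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')

class SciTextDataset(torch.utils.data.Dataset):
    """Wraps tokenized texts and integer labels for use with Trainer."""
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True, max_length=128)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# Hypothetical toy data; replace with your own labeled scientific text.
train_dataset = SciTextDataset(["The enzyme catalyzes hydrolysis.", "We report a new alloy."], [0, 1])
eval_dataset = SciTextDataset(["Protein folding was observed."], [0])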
Getting Started
To use SciBERT in your project:
- Install the required libraries:
pip install transformers torch
- Import and load the model:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')
- Use the model for your specific task (e.g., text classification, named entity recognition) by fine-tuning it or by using it as a feature extractor; a minimal feature-extraction sketch follows below.
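As a rough illustration of the feature-extractor route, the sketch below mean-pools SciBERT's final hidden states into a fixed-size sentence vector; the example sentence and the pooling choice are assumptions, not something prescribed by the SciBERT repo.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')
model.eval()

text = "Graphene exhibits remarkable electrical conductivity."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings, ignoring padding (none here, but kept for generality).
mask = inputs["attention_mask"].unsqueeze(-1)
sentence_embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])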
Competitor Comparisons
TensorFlow code and pre-trained models for BERT
Pros of BERT
- Broader application across various domains
- Larger community support and more extensive documentation
- More pre-trained models available for different languages and tasks
Cons of BERT
- Less specialized for scientific and biomedical text
- May require more fine-tuning for domain-specific tasks
- Potentially lower performance on scientific literature compared to SciBERT
Code Comparison
SciBERT:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')
BERT:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
Both repositories use similar code structures for loading models and tokenizers. The main difference lies in the specific model being loaded. SciBERT uses a specialized scientific vocabulary, while BERT uses a general-purpose vocabulary. SciBERT is optimized for scientific text, potentially offering better performance in that domain, while BERT provides a more versatile foundation for various NLP tasks across different fields.
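One way to see the vocabulary difference is to tokenize the same sentence with both tokenizers. A small sketch follows; the example sentence is arbitrary, and the exact subword splits depend on the tokenizer versions.
from transformers import AutoTokenizer

scibert_tok = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
bert_tok = AutoTokenizer.from_pretrained('bert-base-uncased')

sentence = "The patient was treated with acetylcholinesterase inhibitors."
print(scibert_tok.tokenize(sentence))  # scivocab tends to keep domain terms in fewer pieces
print(bert_tok.tokenize(sentence))     # the general-purpose vocabulary splits them more aggressively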
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Pros of transformers
- Broader scope: Supports a wide range of NLP tasks and models beyond just scientific text
- Active development: Frequently updated with new models and features
- Extensive documentation and community support
Cons of transformers
- Larger codebase: May be more complex to navigate for specific use cases
- Potentially higher resource requirements due to its comprehensive nature
Code comparison
SciBERT (the released checkpoints are themselves loaded through the transformers library, as shown in the SciBERT README):
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')
transformers (the same Auto API loads any model on the Hugging Face Hub):
from transformers import AutoModel
model = AutoModel.from_pretrained('bert-base-uncased')
Summary
While SciBERT focuses specifically on scientific text processing, transformers offers a more versatile toolkit for various NLP tasks. SciBERT may be more straightforward for users working exclusively with scientific literature, but transformers provides greater flexibility and ongoing development. The code usage is similar, with transformers offering a more unified API across different models. Choose based on your specific needs: SciBERT for targeted scientific text processing, or transformers for a comprehensive NLP solution.
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
Pros of fairseq
- Broader scope: Supports a wide range of sequence-to-sequence tasks beyond just scientific text
- More active development: Regularly updated with new features and models
- Extensive documentation and examples for various use cases
Cons of fairseq
- Steeper learning curve due to its broader scope and more complex architecture
- May be overkill for projects focused solely on scientific text processing
- Requires more computational resources for training and inference
Code Comparison
SciBERT:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')
fairseq:
from fairseq.models.roberta import RobertaModel
roberta = RobertaModel.from_pretrained('path/to/roberta/model', checkpoint_file='model.pt')
roberta.eval() # disable dropout
Both repositories provide pre-trained models for natural language processing tasks. SciBERT focuses specifically on scientific text, while fairseq offers a more versatile toolkit for various sequence-to-sequence tasks. SciBERT is easier to use for scientific text processing, while fairseq provides more flexibility and options for advanced users working on diverse NLP projects.
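For a rough sense of how feature extraction looks on the fairseq side, the snippet below follows the pattern documented in fairseq's RoBERTa examples; the checkpoint path is a placeholder and the input sentence is arbitrary.
from fairseq.models.roberta import RobertaModel

# Placeholder path: point this at a downloaded RoBERTa checkpoint directory.
roberta = RobertaModel.from_pretrained('path/to/roberta/model', checkpoint_file='model.pt')
roberta.eval()  # disable dropout for deterministic features

tokens = roberta.encode('Graphene exhibits remarkable electrical conductivity.')
features = roberta.extract_features(tokens)  # last-layer hidden states, shape (1, seq_len, hidden)
print(features.shape)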
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
Pros of UniLM
- Supports a wider range of NLP tasks, including text generation and understanding
- Utilizes a unified pre-training approach for multiple downstream tasks
- Offers better performance on certain benchmarks like SQuAD and GLUE
Cons of UniLM
- May require more computational resources due to its larger model size
- Less specialized for scientific and biomedical text compared to SciBERT
- Potentially more complex to fine-tune for specific domain tasks
Code Comparison
SciBERT:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
UniLM:
# UniLM checkpoints are released through the microsoft/unilm repository;
# the task-specific fine-tuning code in that repo defines the model and
# tokenizer classes used to load them, rather than dedicated UniLM classes
# in the transformers library.
Both repositories provide pre-trained models for natural language processing tasks. SciBERT focuses on scientific text and loads directly through the Transformers library, while UniLM offers a more versatile approach for various NLP applications and ships its own loading and fine-tuning code, highlighting the different model architectures and their intended use cases.
Models and examples built with TensorFlow
Pros of models
- Broader scope, covering various machine learning tasks and architectures
- Extensive documentation and tutorials for different models
- Active community with frequent updates and contributions
Cons of models
- Larger repository size, potentially overwhelming for newcomers
- Focused primarily on TensorFlow, limiting flexibility for other frameworks
- May require more setup and configuration for specific tasks
Code Comparison
models (illustrative; the exact BERT classes in the TF Model Garden sit under the official.nlp packages and their import paths have changed across releases):
import tensorflow as tf
from official.nlp import bert
# bert_config, input_ids and input_mask are assumed to be defined elsewhere.
model = bert.BertModel(config=bert_config)
outputs = model(input_ids, attention_mask=input_mask)
SciBERT:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')
Summary
While models offers a comprehensive collection of machine learning models and examples, SciBERT focuses specifically on scientific text processing. models provides a wider range of applications but may be more complex to navigate, whereas SciBERT offers a streamlined solution for scientific NLP tasks with easier integration through the Transformers library.
README
SciBERT
SciBERT is a BERT model trained on scientific text.
- SciBERT is trained on papers from the corpus of semanticscholar.org. Corpus size is 1.14M papers, 3.1B tokens. We use the full text of the papers in training, not just abstracts.
- SciBERT has its own vocabulary (scivocab) that's built to best match the training corpus. We trained cased and uncased versions. We also include models trained on the original BERT vocabulary (basevocab) for comparison.
- It results in state-of-the-art performance on a wide range of scientific domain NLP tasks. The details of the evaluation are in the paper. Evaluation code and data are included in this repo.
Downloading Trained Models
Update! SciBERT models are now installable directly within Hugging Face's framework under the allenai org:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_cased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_cased')
We release the TensorFlow and the PyTorch versions of the trained models. The TensorFlow version is compatible with code that works with the model from Google Research. The PyTorch version is created using the Hugging Face library, and this repo shows how to use it in AllenNLP. All combinations of scivocab and basevocab, cased and uncased models are available below. Our evaluation shows that scivocab-uncased usually gives the best results.
Tensorflow Models
- scibert-scivocab-uncased (Recommended)
- scibert-scivocab-cased
- scibert-basevocab-uncased
- scibert-basevocab-cased
PyTorch AllenNLP Models
- scibert-scivocab-uncased (Recommended)
- scibert-scivocab-cased
- scibert-basevocab-uncased
- scibert-basevocab-cased
PyTorch HuggingFace Models
- scibert-scivocab-uncased (Recommended)
- scibert-scivocab-cased
- scibert-basevocab-uncased
- scibert-basevocab-cased
Using SciBERT in your own model
SciBERT models include all necessary files to be plugged into your own model and are in the same format as BERT. If you are using TensorFlow, refer to Google's BERT repo, and if you use PyTorch, refer to Hugging Face's repo, where detailed instructions on using BERT models are provided.
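As a concrete illustration of plugging SciBERT into your own PyTorch model, here is a minimal sketch of a classification head on top of the encoder; the SciClassifier class, the [CLS] pooling choice, and the example batch are illustrative assumptions, not part of this repo.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SciClassifier(nn.Module):
    """A hypothetical classifier that uses SciBERT as its encoder."""
    def __init__(self, num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_vector = outputs.last_hidden_state[:, 0]  # [CLS] token representation
        return self.classifier(cls_vector)

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = SciClassifier(num_labels=2)
batch = tokenizer(["We measured the band gap of MoS2."], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])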
Training new models using AllenNLP
To run experiments on different tasks and reproduce our results in the paper, you first need to set up the Python 3.6 environment:
pip install -r requirements.txt
which will install dependencies like AllenNLP.
Use the scibert/scripts/train_allennlp_local.sh script as an example of how to run an experiment (you'll need to modify paths and variable names like TASK and DATASET).
We include a broad set of scientific NLP datasets under the data/ directory across the following tasks. Each task has a sub-directory of available datasets.
├── ner
│   ├── JNLPBA
│   ├── NCBI-disease
│   ├── bc5cdr
│   └── sciie
├── parsing
│   └── genia
├── pico
│   └── ebmnlp
└── text_classification
    ├── chemprot
    ├── citation_intent
    ├── mag
    ├── rct-20k
    ├── sci-cite
    └── sciie-relation-extraction
For example, to run the model on the Named Entity Recognition (NER) task and on the BC5CDR dataset (BioCreative V CDR), modify the scibert/scripts/train_allennlp_local.sh script according to:
DATASET='bc5cdr'
TASK='ner'
...
Decompress the PyTorch model that you downloaded using:
tar -xvf scibert_scivocab_uncased.tar
The results will be in the scibert_scivocab_uncased directory containing two files: a vocabulary file (vocab.txt) and a weights file (weights.tar.gz).
Copy the files to your desired location and then set correct paths for BERT_WEIGHTS and BERT_VOCAB in the script:
export BERT_VOCAB=path-to/scibert_scivocab_uncased.vocab
export BERT_WEIGHTS=path-to/scibert_scivocab_uncased.tar.gz
Finally run the script:
./scibert/scripts/train_allennlp_local.sh [serialization-directory]
where [serialization-directory] is the path to an output directory where the model files will be stored.
Citing
If you use SciBERT in your research, please cite SciBERT: Pretrained Language Model for Scientific Text.
@inproceedings{Beltagy2019SciBERT,
title={SciBERT: Pretrained Language Model for Scientific Text},
author={Iz Beltagy and Kyle Lo and Arman Cohan},
year={2019},
booktitle={EMNLP},
Eprint={arXiv:1903.10676}
}
SciBERT is an open-source project developed by the Allen Institute for Artificial Intelligence (AI2).
AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.