RoseTTAFold

This package contains deep learning models and related scripts for RoseTTAFold

Quick Overview

RoseTTAFold is an open-source protein structure prediction program developed by the Baker lab at the University of Washington. It uses a three-track neural network that processes 1D sequence information, 2D residue-pair (distance map) information, and 3D coordinates in parallel, exchanging information between the tracks to produce a 3D structure prediction. RoseTTAFold is designed to be faster and more memory-efficient than AlphaFold2 while aiming for comparable accuracy.

Pros

High accuracy in protein structure prediction, comparable to AlphaFold2
Faster runtime and lower memory requirements than AlphaFold2
Open-source code under the MIT License; trained weights are available for non-commercial use only under the Rosetta-DL Software license
Supports both single-chain and complex structure prediction

Cons

Requires significant computational resources, including GPUs
Installation and setup can be complex for non-expert users
Limited documentation and user support compared to some commercial alternatives
May have lower accuracy for some challenging protein targets compared to the latest AlphaFold versions

Code Examples

# Note: the repository does not ship an importable `rosettafold` Python package;
# predictions are run through the shell scripts described in the README.
# These examples assume they are run from the repository's example/ directory.

# Example 1: Predicting a protein structure (end-to-end version)
import subprocess

# Writes t000_.e2e.pdb into the output directory (here, the current directory)
subprocess.run(["../run_e2e_ver.sh", "input.fa", "."], check=True)

# Example 2: Visualizing the predicted structure (in a Jupyter notebook)
import py3Dmol

with open("t000_.e2e.pdb") as fh:
    pdb_text = fh.read()

view = py3Dmol.view()
view.addModel(pdb_text, "pdb")
view.setStyle({'cartoon': {'color': 'spectrum'}})
view.zoomTo()
view.show()

# Example 3: Evaluating prediction confidence
# The end-to-end output stores the estimated per-residue CA-lDDT in the B-factor column
import matplotlib.pyplot as plt

confidence_scores = [
    float(line[60:66])  # B-factor field of the fixed-width PDB ATOM record
    for line in pdb_text.splitlines()
    if line.startswith("ATOM") and line[12:16].strip() == "CA"
]

plt.plot(confidence_scores)
plt.xlabel("Residue")
plt.ylabel("Estimated CA-lDDT")
plt.title("Prediction Confidence")
plt.show()

Getting Started

Install RoseTTAFold and its dependencies:

git clone https://github.com/RosettaCommons/RoseTTAFold.git
cd RoseTTAFold
conda env create -f RoseTTAFold-linux.yml
# only needed for the PyRosetta version
conda env create -f folding-linux.yml

Download pre-trained weights:
```
wget https://files.ipd.uw.edu/pub/RoseTTAFold/weights.tar.gz
tar xfz weights.tar.gz
```

Run a prediction:

cd example
../run_e2e_ver.sh input.fa .

For more detailed instructions and advanced usage, refer to the official RoseTTAFold documentation.

Competitor Comparisons

ColabFold

2,355

Making Protein folding accessible to all!

Pros of ColabFold

Easier to use with Google Colab integration, making it more accessible for users without local high-performance computing resources
Faster prediction times due to optimized implementation and use of MMseqs2 for sequence searches
More frequent updates and active community support

Cons of ColabFold

Less customizable than RoseTTAFold for advanced users who need fine-grained control over the prediction process
May have limitations on input size and runtime when using free Google Colab resources

Code Comparison

RoseTTAFold:

msa, deletion_matrix = parsers.parse_a3m(a3m_file)
N, L = msa.shape
if N > args.nseq:
    idx = np.argsort(np.sum(msa==20,axis=1))[:args.nseq]
    msa = msa[idx]
    deletion_matrix = deletion_matrix[idx]

ColabFold:

a3m_lines = input_file.split("\n")
msa, ins = parse_a3m(a3m_lines)
N, L = msa.shape
if N > args.max_msa_clusters:
    msa = msa[:args.max_msa_clusters]
    ins = ins[:args.max_msa_clusters]

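Both snippets parse a multiple sequence alignment (MSA) and cap the number of sequences. The RoseTTAFold snippet keeps the sequences with the fewest positions equal to 20 (presumably the gap code), while the ColabFold snippet simply truncates the alignment to its first max_msa_clusters rows, a simpler selection geared toward its AlphaFold2-based pipeline.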
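
The gap-based selection in the RoseTTAFold snippet above can be tried on a toy array. This is a minimal sketch, assuming the MSA is an integer array of shape (N, L) and that the value 20 encodes a gap; the function name keep_least_gapped and the toy data are illustrative and not taken from either repository.

```python
import numpy as np

GAP = 20  # assumed integer code for the gap character


def keep_least_gapped(msa, nseq):
    """Return row indices of the `nseq` sequences with the fewest gaps."""
    gap_counts = np.sum(msa == GAP, axis=1)
    # A stable sort keeps the original MSA order among sequences with equal gap counts
    order = np.argsort(gap_counts, kind="stable")
    return order[:nseq]


msa = np.array([
    [0, 1, 2, 3],      # query, no gaps
    [0, 20, 20, 3],    # two gaps
    [0, 1, 20, 3],     # one gap
    [20, 20, 20, 3],   # three gaps
])
idx = keep_least_gapped(msa, 2)
subsampled = msa[idx]
print(idx, subsampled.shape)
```

Any per-sequence array that accompanies the MSA (such as a deletion matrix) would need to be indexed with the same idx so that rows stay aligned.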

alphafold

13,724

Open source code for AlphaFold 2.

Pros of AlphaFold

Higher accuracy in protein structure prediction
More extensive training on protein databases
Better handling of multi-chain complexes

Cons of AlphaFold

Computationally intensive, requiring significant resources
Less flexibility in customizing the prediction process
Longer runtime for predictions compared to RoseTTAFold

Code Comparison

RoseTTAFold:

def predict_structure(sequence):
    msa = generate_msa(sequence)
    features = extract_features(msa)
    model = load_model("rosettafold_model.pkl")
    return model.predict(features)

AlphaFold:

def predict_structure(sequence):
    features = pipeline.process(protein=protein.from_sequence(sequence))
    model = alphafold.model.Model(config)
    prediction = model.predict(features)
    return prediction.unrelaxed_protein

Both repositories aim to predict protein structures. AlphaFold generally achieves higher accuracy at a higher computational cost, while RoseTTAFold offers a faster, more lightweight alternative with slightly lower accuracy. The snippets above are simplified pseudocode rather than code from either repository: in practice, both pipelines build an MSA and derive input features from it before running the network.

esm

3,719

Evolutionary Scale Modeling (esm): Pretrained language models for proteins

Pros of ESM

Broader scope: ESM is a general-purpose protein language model, applicable to various protein-related tasks
Extensive pre-training: Trained on a vast amount of protein sequence data, enabling robust performance across different applications
Easier integration: Provides pre-trained models and tools for easy incorporation into various workflows

Cons of ESM

Less specialized: May not perform as well as RoseTTAFold for specific protein structure prediction tasks
Higher computational requirements: Large language models often require more computational resources to run effectively
Limited structural information: Primarily focuses on sequence information, potentially missing some structural nuances

Code Comparison

RoseTTAFold:

msa, xyz, mask = self.forward(msa, xyz, seq1hot, idx, t1d)
return msa, xyz, mask

ESM:

results = model(batch_tokens)
representations = results["representations"][33]

The code snippets show that RoseTTAFold focuses on processing multiple sequence alignments (MSA) and structural data, while ESM operates on tokenized sequences and produces representations. This reflects their different approaches to protein analysis.

google-research

36,128

Google Research

Pros of google-research

Broader scope, covering various AI and machine learning topics
More frequent updates and contributions from a larger team
Extensive documentation and examples for multiple projects

Cons of google-research

Less focused on a specific problem domain
May be more challenging to navigate due to its size and diversity
Potentially steeper learning curve for newcomers

Code Comparison

RoseTTAFold:

def predict_structure(sequence):
    model = load_model("rosettafold_model.pth")
    return model.predict(sequence)

google-research:

def run_bert(input_text):
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')
    return model(tokenizer(input_text, return_tensors='pt'))

The snippets above are illustrative pseudocode, not code from either repository. They show the difference in focus: RoseTTAFold is specialized for protein structure prediction, while google-research covers a wider range of topics, such as natural language processing with BERT in this example.

README

RoseTTAFold

This package contains deep learning models and related scripts to run RoseTTAFold.
This repository is the official implementation of RoseTTAFold: Accurate prediction of protein structures and interactions using a 3-track network.

Installation

Clone the package

git clone https://github.com/RosettaCommons/RoseTTAFold.git
cd RoseTTAFold

Create conda environments using the RoseTTAFold-linux.yml and folding-linux.yml files. The latter is required only for the PyRosetta version (run_pyrosetta_ver.sh).

# create conda environment for RoseTTAFold
#   If your NVIDIA driver is compatible with cuda11
conda env create -f RoseTTAFold-linux.yml
#   If not (but compatible with cuda10)
conda env create -f RoseTTAFold-linux-cu101.yml

# create conda environment for pyRosetta folding & running DeepAccNet
conda env create -f folding-linux.yml
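
The choice between the two RoseTTAFold environment files can be scripted. The sketch below is illustrative only: it assumes the CUDA version is read from the "CUDA Version: X.Y" field that nvidia-smi prints in its header, and that the cu101 file corresponds to CUDA 10.1 or later 10.x releases. The function names and the sample header string are hypothetical.

```python
import re


def cuda_version_from_smi(smi_output):
    """Extract (major, minor) from nvidia-smi header text, or None if absent."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def pick_env_file(version):
    """Map the highest CUDA version the driver supports to an environment file."""
    if version is not None and version >= (11, 0):
        return "RoseTTAFold-linux.yml"
    if version is not None and version >= (10, 1):
        return "RoseTTAFold-linux-cu101.yml"
    raise ValueError("no matching environment file for this driver")


# Sample header line; on a real machine this text would come from running nvidia-smi
header = "| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |"
print(pick_env_file(cuda_version_from_smi(header)))
```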

Download network weights (under Rosetta-DL Software license -- please see below)
While the code is licensed under the MIT License, the trained weights and data for RoseTTAFold are made available for non-commercial use only under the terms of the Rosetta-DL Software license. You can find details at https://files.ipd.uw.edu/pub/RoseTTAFold/Rosetta-DL_LICENSE.txt

[Update Nov/02/2021] The archive now includes the weights (RF2t.pt) for the RoseTTAFold-2track model used for yeast PPI screening. To use it, please re-download the weights. The original RoseTTAFold weights are unchanged.

wget https://files.ipd.uw.edu/pub/RoseTTAFold/weights.tar.gz
tar xfz weights.tar.gz

Download and install third-party software.

./install_dependencies.sh

Download sequence and structure databases

# uniref30 [46G]
wget http://wwwuser.gwdg.de/~compbiol/uniclust/2020_06/UniRef30_2020_06_hhsuite.tar.gz
mkdir -p UniRef30_2020_06
tar xfz UniRef30_2020_06_hhsuite.tar.gz -C ./UniRef30_2020_06

# BFD [272G]
wget https://bfd.mmseqs.com/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz
mkdir -p bfd
tar xfz bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz -C ./bfd

# structure templates (including *_a3m.ffdata, *_a3m.ffindex) [over 100G]
wget https://files.ipd.uw.edu/pub/RoseTTAFold/pdb100_2021Mar03.tar.gz
tar xfz pdb100_2021Mar03.tar.gz
# for CASP14 benchmarks, we used this one: https://files.ipd.uw.edu/pub/RoseTTAFold/pdb100_2020Mar11.tar.gz
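
Because these downloads are large, it can help to check free disk space first. The sketch below uses the sizes quoted in the comments above (100 GB is only a lower bound for the template set) and an assumed headroom factor of 2 to hold each tarball alongside its extracted contents; that factor and the function names are assumptions, not figures from the project.

```python
import shutil

# Approximate sizes in GB, taken from the comments in the commands above
DATABASE_SIZES_GB = {"UniRef30": 46, "BFD": 272, "pdb100 templates": 100}


def required_gb(sizes=DATABASE_SIZES_GB, headroom=2.0):
    """Estimate the space needed; `headroom` is an assumed safety factor."""
    return sum(sizes.values()) * headroom


def has_enough_space(path=".", needed_gb=None):
    """Compare free space on the filesystem holding `path` with the estimate."""
    needed_gb = required_gb() if needed_gb is None else needed_gb
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb


print(f"Need roughly {required_gb():.0f} GB; enough free here: {has_enough_space()}")
```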

Obtain a PyRosetta license and install the package in the newly created folding conda environment.

Usage

# For monomer structure prediction
cd example
../run_[pyrosetta, e2e]_ver.sh input.fa .

# For complex modeling
# please see README file under example/complex_modeling/README for details.
python network/predict_complex.py -i paired.a3m -o complex -Ls 218 310 

# For PPI screening using faster 2-track version (example input and output are at example/complex_2track)
python network_2track/predict_msa.py -msa [paired MSA file in a3m format] -npz [output npz file name] -L1 [Length of first chain]
# e.g.
python network_2track/predict_msa.py -msa input.a3m -npz complex.npz -L1 218
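
Before launching complex modeling, a quick sanity check on the chain lengths can catch mismatched inputs. This sketch assumes that the first record of the paired a3m file is the concatenated query and that the values given to -Ls should add up to its length; both are assumptions about the input format, and the helper names are hypothetical.

```python
def first_sequence(a3m_text):
    """Return the first (query) sequence of an a3m alignment as one string."""
    seq_lines = []
    seen_header = False
    for line in a3m_text.splitlines():
        if line.startswith(">"):
            if seen_header:
                break  # a second record starts, so the query is complete
            seen_header = True
        elif seen_header:
            seq_lines.append(line.strip())
    return "".join(seq_lines)


def check_chain_lengths(a3m_text, lengths):
    """True if the chain lengths add up to the length of the paired query."""
    return len(first_sequence(a3m_text)) == sum(lengths)


example = ">query\nMKV\nLA\n>hit1\nMKVLA\n"
print(check_chain_lengths(example, [3, 2]))  # True
print(check_chain_lengths(example, [3, 3]))  # False
```

For the command shown above, the equivalent check would be whether the query in paired.a3m has 218 + 310 = 528 residues.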

Expected outputs

For the PyRosetta version, the user gets five final models with the estimated CA RMS error in the B-factor column (model/model_[1-5].crderr.pdb).
For the end-to-end version, there is a single PDB output with the estimated residue-wise CA-lDDT in the B-factor column (t000_.e2e.pdb).
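
Since both outputs store their per-residue estimate in the B-factor column, the values can be read back with a few lines of Python. The sketch below relies on the standard fixed-width PDB ATOM layout (atom name in columns 13-16, B-factor in columns 61-66) and ranks models by their mean CA value; the helper names and the synthetic records are illustrative, not part of the repository.

```python
from statistics import mean


def ca_bfactors(pdb_text):
    """Return the B-factor column value of every CA atom, in file order."""
    values = []
    for line in pdb_text.splitlines():
        # PDB fixed-width columns (1-based): atom name 13-16, B-factor 61-66
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            values.append(float(line[60:66]))
    return values


def rank_by_mean(models):
    """Sort (name, pdb_text) pairs by mean CA B-factor, lowest first."""
    return sorted(models, key=lambda item: mean(ca_bfactors(item[1])))


def ca_line(serial, resseq, bfactor):
    """Build a synthetic CA ATOM record (toy coordinates) for the demo below."""
    return (f"ATOM  {serial:5d}  CA  ALA A{resseq:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{bfactor:6.2f}")


model_1 = "\n".join([ca_line(1, 1, 1.5), ca_line(2, 2, 2.5)])
model_2 = "\n".join([ca_line(1, 1, 0.5), ca_line(2, 2, 1.0)])
ranked = rank_by_mean([("model_1", model_1), ("model_2", model_2)])
print([name for name, _ in ranked])
```

Lowest-first ordering suits the PyRosetta models, where the column holds an estimated error. For the end-to-end output, where the column holds CA-lDDT, higher values indicate more confident residues.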

FAQ

Segmentation fault while running hhblits/hhsearch
For easy installation, we used a statically compiled version of hhsuite (installed through conda). We are not yet sure what exactly causes the segmentation fault in some cases, but it may be resolved by compiling hhsuite from source and using that build instead of the conda version. For details, please see the hhsuite installation instructions.
Submitting jobs to computing nodes
The modeling pipeline provided here (run_pyrosetta_ver.sh/run_e2e_ver.sh) is a guideline that shows how RoseTTAFold works. For more efficient use of computing resources, you may want to modify the provided bash scripts to submit a separate job for each step, with proper dependencies between them (more CPUs/memory for hhblits/hhsearch, GPUs only for running the networks, etc.).
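
One way to split the pipeline into dependent jobs is sketched below, assuming a SLURM cluster (the README does not name a scheduler). The stage script names and resource numbers are hypothetical placeholders. The code only composes sbatch command lines using the standard --cpus-per-task, --mem, --gres, and --dependency=afterok options; it does not submit anything.

```python
def sbatch_command(script, cpus=1, mem_gb=4, gpus=0, after=None):
    """Compose an sbatch command line as a list of arguments (nothing is run)."""
    cmd = ["sbatch", "--parsable", f"--cpus-per-task={cpus}", f"--mem={mem_gb}G"]
    if gpus:
        cmd.append(f"--gres=gpu:{gpus}")
    if after is not None:
        # Start only after the named job has finished successfully
        cmd.append(f"--dependency=afterok:{after}")
    cmd.append(script)
    return cmd


# Hypothetical per-stage scripts split out of the provided run script
stages = [
    ("stage_msa.sh", dict(cpus=8, mem_gb=64)),        # hhblits: CPU and memory heavy
    ("stage_templates.sh", dict(cpus=8, mem_gb=64)),  # hhsearch
    ("stage_network.sh", dict(cpus=2, mem_gb=32, gpus=1)),
]

previous = None
for number, (script, resources) in enumerate(stages):
    print(" ".join(sbatch_command(script, after=previous, **resources)))
    # Placeholder: a real driver would capture the job ID that sbatch prints
    previous = f"<job-id-{number}>"
```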

Credit to performer-pytorch and SE(3)-Transformer codes

The code in network/performer_pytorch.py is strongly based on the performer-pytorch repo, a PyTorch implementation of the Performer architecture. The code in network/equivariant_attention is from the original SE(3)-Transformer repo, which accompanies the paper 'SE(3)-Transformers: 3D Roto-Translation Equivariant Attention Networks' by Fuchs et al.

References

M. Baek, et al., Accurate prediction of protein structures and interactions using a three-track neural network, Science (2021).

I.R. Humphreys, J. Pei, M. Baek, A. Krishnakumar, et al., Computed structures of core eukaryotic protein complexes, Science (2021).
