
RosettaCommons / RoseTTAFold

This package contains deep learning models and related scripts for RoseTTAFold


Top Related Projects

ColabFold: Making protein folding accessible to all!

AlphaFold: Open source code for AlphaFold.

esm: Evolutionary Scale Modeling (ESM), pretrained language models for proteins

google-research: Google Research

Quick Overview

RoseTTAFold is an open-source protein structure prediction software developed by the Baker lab at the University of Washington. It uses a three-track neural network to simultaneously process multiple representations of protein sequences and integrates them to generate accurate 3D structure predictions. RoseTTAFold is designed to be faster and more memory-efficient than AlphaFold2 while maintaining comparable accuracy.
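To make the three-track idea concrete, here is a minimal, illustrative PyTorch sketch of one block in which a 1D sequence/MSA track, a 2D residue-pair track, and a 3D coordinate track update in parallel and exchange information. This is not the actual RoseTTAFold code: the module names, dimensions, and update rules are invented for illustration (the real network uses attention-based updates and SE(3)-equivariant coordinate refinement).

import torch
import torch.nn as nn

class ThreeTrackBlock(nn.Module):
    def __init__(self, d1d=64, d2d=32):
        super().__init__()
        self.seq_update = nn.Linear(d1d + d2d, d1d)       # 1D track, informed by 2D
        self.pair_update = nn.Linear(d2d + 2 * d1d, d2d)  # 2D track, informed by 1D
        self.coord_update = nn.Linear(d2d, 3)             # 3D track, driven by 2D

    def forward(self, seq, pair, xyz):
        # seq: (L, d1d) per-residue features; pair: (L, L, d2d); xyz: (L, 3)
        L = seq.shape[0]
        # 1D track reads a summary of the 2D track
        seq = torch.relu(self.seq_update(torch.cat([seq, pair.mean(dim=1)], dim=-1)))
        # 2D track reads an outer concatenation of the 1D track
        outer = torch.cat([seq.unsqueeze(1).expand(L, L, -1),
                           seq.unsqueeze(0).expand(L, L, -1)], dim=-1)
        pair = torch.relu(self.pair_update(torch.cat([pair, outer], dim=-1)))
        # 3D track nudges coordinates using pair-derived features (residual update)
        xyz = xyz + self.coord_update(pair.mean(dim=1))
        return seq, pair, xyz

# one forward pass over a toy 10-residue protein:
# seq, pair, xyz = ThreeTrackBlock()(torch.randn(10, 64), torch.randn(10, 10, 32), torch.zeros(10, 3))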

Pros

  • High accuracy in protein structure prediction, comparable to AlphaFold2
  • Faster runtime and lower memory requirements than AlphaFold2
  • Open-source and freely available for academic and non-commercial use
  • Supports both single-chain and complex structure prediction

Cons

  • Requires significant computational resources, including GPUs
  • Installation and setup can be complex for non-expert users
  • Limited documentation and user support compared to some commercial alternatives
  • May have lower accuracy for some challenging protein targets compared to the latest AlphaFold versions

Code Examples

# Example 1: Predicting a protein structure
# (illustrative API: RoseTTAFold ships as research scripts, not a
# pip-installable `rosettafold` package, so this wrapper is hypothetical)
from rosettafold import RoseTTAFold

model = RoseTTAFold()
sequence = "MVKVGVNGFGRIGRLVTRAAFNSGKVDIVAINDPFIDLNYMVYMFQYDSTHGKFHGTVKAENGKLVINGNPITIFQERDPSKIKWGDAGAEYVVESTGVFTTMEKAGAHLQGGAKRVIISAPSADAPMFVMGVNHEKYDNSLKIISNASCTTNCLAPLAKVIHDNFGIVEGLMTTVHAITATQKTVDGPSGKLWRDGRGALQNIIPASTGAAKAVGKVIPELDGKLTGMAFRVPTANVSVVDLTCRLEKPAKYDDIKKVVKQASEGPLKGILGYTEHQVVSSDFNSDTHSSTFDAGAGIALNDHFVKLISWYDNEFGYSNRVVDLMAHMASKE"
predicted_structure = model.predict(sequence)

# Example 2: Visualizing the predicted structure
import py3Dmol

view = py3Dmol.view()
view.addModel(predicted_structure.to_pdb(), "pdb")  # to_pdb() assumed to return PDB-format text
view.setStyle({'cartoon': {'color': 'spectrum'}})
view.zoomTo()
view.show()

# Example 3: Plotting per-residue prediction confidence
import matplotlib.pyplot as plt

confidence_scores = predicted_structure.get_plddt()  # hypothetical accessor for per-residue pLDDT
plt.plot(confidence_scores)
plt.xlabel("Residue")
plt.ylabel("pLDDT Score")
plt.title("Prediction Confidence")
plt.show()

Getting Started

  1. Clone RoseTTAFold and create its conda environment:

    git clone https://github.com/RosettaCommons/RoseTTAFold.git
    cd RoseTTAFold
    conda env create -f RoseTTAFold-linux.yml
    conda activate RoseTTAFold
    
  2. Download the pre-trained network weights:

    wget https://files.ipd.uw.edu/pub/RoseTTAFold/weights.tar.gz
    tar xfz weights.tar.gz
    
  3. Run an end-to-end prediction (after installing the third-party dependencies and sequence databases described in the README):

    ./run_e2e_ver.sh input.fa ./output_dir
    

For more detailed instructions and advanced usage, refer to the official RoseTTAFold documentation.

Competitor Comparisons

Making Protein folding accessible to all!

Pros of ColabFold

  • Easier to use with Google Colab integration, making it more accessible for users without local high-performance computing resources
  • Faster prediction times due to optimized implementation and use of MMseqs2 for sequence searches
  • More frequent updates and active community support

Cons of ColabFold

  • Less customizable than RoseTTAFold for advanced users who need fine-grained control over the prediction process
  • May have limitations on input size and runtime when using free Google Colab resources
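On the accessibility point, a local ColabFold install is normally driven through its colabfold_batch command line; a minimal sketch of invoking it from Python (file names here are placeholders):

import subprocess

# Run ColabFold on a FASTA file; predicted structures, plots, and logs
# are written to output_dir
subprocess.run(["colabfold_batch", "input.fasta", "output_dir"], check=True)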

Code Comparison

RoseTTAFold:

# Parse the MSA and, if it exceeds the sequence budget, keep the nseq
# sequences with the fewest gaps (token 20 encodes a gap here)
msa, deletion_matrix = parsers.parse_a3m(a3m_file)
N, L = msa.shape
if N > args.nseq:
    idx = np.argsort(np.sum(msa==20,axis=1))[:args.nseq]
    msa = msa[idx]
    deletion_matrix = deletion_matrix[idx]

ColabFold:

# Keep at most max_msa_clusters alignment rows (ins tracks insertions)
a3m_lines = input_file.split("\n")
msa, ins = parse_a3m(a3m_lines)
N, L = msa.shape
if N > args.max_msa_clusters:
    msa = msa[:args.max_msa_clusters]
    ins = ins[:args.max_msa_clusters]

Both repositories handle multiple sequence alignments (MSA) parsing and limiting the number of sequences, but ColabFold's implementation is more streamlined and optimized for use with AlphaFold2.
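For reference, the a3m format both snippets consume is FASTA with lowercase letters marking insertions relative to the query; a toy reader (simplified relative to the real parsers above, which also encode residues numerically):

def read_a3m(path):
    """Toy a3m reader: returns (names, aligned_sequences)."""
    names, seqs = [], []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                names.append(line[1:])
                seqs.append("")
            elif line:
                # lowercase letters mark insertions relative to the query
                seqs[-1] += line
    return names, seqs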


Open source code for AlphaFold.

Pros of AlphaFold

  • Higher accuracy in protein structure prediction
  • More extensive training on protein databases
  • Better handling of multi-chain complexes

Cons of AlphaFold

  • Computationally intensive, requiring significant resources
  • Less flexibility in customizing the prediction process
  • Longer runtime for predictions compared to RoseTTAFold

Code Comparison

RoseTTAFold:

# Simplified, illustrative pseudocode of the pipeline-style flow
def predict_structure(sequence):
    msa = generate_msa(sequence)
    features = extract_features(msa)
    model = load_model("rosettafold_model.pkl")
    return model.predict(features)

AlphaFold:

# Simplified, illustrative pseudocode (not the verbatim AlphaFold API)
def predict_structure(sequence):
    features = pipeline.process(protein=protein.from_sequence(sequence))
    model = alphafold.model.Model(config)
    prediction = model.predict(features)
    return prediction.unrelaxed_protein

Both repositories aim to predict protein structures, but AlphaFold generally achieves higher accuracy at the cost of computational resources. RoseTTAFold offers a faster, more lightweight alternative with slightly lower accuracy. The code snippets illustrate the different approaches: RoseTTAFold uses a more traditional pipeline with MSA generation and feature extraction, while AlphaFold employs a more integrated end-to-end approach.


Evolutionary Scale Modeling (esm): Pretrained language models for proteins

Pros of ESM

  • Broader scope: ESM is a general-purpose protein language model, applicable to various protein-related tasks
  • Extensive pre-training: Trained on a vast amount of protein sequence data, enabling robust performance across different applications
  • Easier integration: Provides pre-trained models and tools for easy incorporation into various workflows

Cons of ESM

  • Less specialized: May not perform as well as RoseTTAFold for specific protein structure prediction tasks
  • Higher computational requirements: Large language models often require more computational resources to run effectively
  • Limited structural information: Primarily focuses on sequence information, potentially missing some structural nuances

Code Comparison

RoseTTAFold:

# One three-track iteration: MSA features, coordinates, and mask are updated together
msa, xyz, mask = self.forward(msa, xyz, seq1hot, idx, t1d)
return msa, xyz, mask

ESM:

results = model(batch_tokens)
# per-residue embeddings from the final (33rd) transformer layer
representations = results["representations"][33]

The code snippets show that RoseTTAFold focuses on processing multiple sequence alignments (MSA) and structural data, while ESM operates on tokenized sequences and produces representations. This reflects their different approaches to protein analysis.
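For a fuller picture, here is a self-contained embedding-extraction sketch following the pattern documented in the esm repository (the checkpoint name is one of its published ESM-2 models; assumes `pip install fair-esm`):

import torch
import esm

# Load a pretrained ESM-2 model and its batch converter
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("protein1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    results = model(tokens, repr_layers=[33])
embeddings = results["representations"][33]  # (batch, seq_len, hidden)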

Google Research

Pros of google-research

  • Broader scope, covering various AI and machine learning topics
  • More frequent updates and contributions from a larger team
  • Extensive documentation and examples for multiple projects

Cons of google-research

  • Less focused on a specific problem domain
  • May be more challenging to navigate due to its size and diversity
  • Potentially steeper learning curve for newcomers

Code Comparison

RoseTTAFold:

# Simplified, illustrative pseudocode
def predict_structure(sequence):
    model = load_model("rosettafold_model.pth")
    return model.predict(sequence)

google-research:

from transformers import BertTokenizer, BertModel

def run_bert(input_text):
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')
    return model(**tokenizer(input_text, return_tensors='pt'))

The code snippets illustrate the difference in focus: RoseTTAFold is specialized for protein structure prediction, while google-research covers a wider range of topics, such as natural language processing with BERT in this example.


README

RoseTTAFold

This package contains deep learning models and related scripts to run RoseTTAFold.
This repository is the official implementation of RoseTTAFold: Accurate prediction of protein structures and interactions using a 3-track network.

Installation

  1. Clone the package
git clone https://github.com/RosettaCommons/RoseTTAFold.git
cd RoseTTAFold
  2. Create conda environments using the RoseTTAFold-linux.yml and folding-linux.yml files. The latter is required only for the pyrosetta version (run_pyrosetta_ver.sh).
# create conda environment for RoseTTAFold
#   If your NVIDIA driver is compatible with cuda11
conda env create -f RoseTTAFold-linux.yml
#   If not (but compatible with cuda10)
conda env create -f RoseTTAFold-linux-cu101.yml

# create conda environment for pyRosetta folding & running DeepAccNet
conda env create -f folding-linux.yml
  3. Download the network weights (under the Rosetta-DL Software license -- please see below)
    While the code is licensed under the MIT License, the trained weights and data for RoseTTAFold are made available for non-commercial use only under the terms of the Rosetta-DL Software license. You can find details at https://files.ipd.uw.edu/pub/RoseTTAFold/Rosetta-DL_LICENSE.txt

[Update Nov/02/2021] The archive now includes the weights (RF2t.pt) for the RoseTTAFold 2-track model used for yeast PPI screening. If you want to use it, please re-download the weights; the original RoseTTAFold weights are unchanged.

wget https://files.ipd.uw.edu/pub/RoseTTAFold/weights.tar.gz
tar xfz weights.tar.gz
  4. Download and install third-party software.
./install_dependencies.sh
  5. Download sequence and structure databases
# uniref30 [46G]
wget http://wwwuser.gwdg.de/~compbiol/uniclust/2020_06/UniRef30_2020_06_hhsuite.tar.gz
mkdir -p UniRef30_2020_06
tar xfz UniRef30_2020_06_hhsuite.tar.gz -C ./UniRef30_2020_06

# BFD [272G]
wget https://bfd.mmseqs.com/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz
mkdir -p bfd
tar xfz bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz -C ./bfd

# structure templates (including *_a3m.ffdata, *_a3m.ffindex) [over 100G]
wget https://files.ipd.uw.edu/pub/RoseTTAFold/pdb100_2021Mar03.tar.gz
tar xfz pdb100_2021Mar03.tar.gz
# for CASP14 benchmarks, we used this one: https://files.ipd.uw.edu/pub/RoseTTAFold/pdb100_2020Mar11.tar.gz
  6. Obtain a PyRosetta license and install the package in the newly created folding conda environment (link).

Usage

# For monomer structure prediction
cd example
../run_[pyrosetta, e2e]_ver.sh input.fa .

# For complex modeling
# please see the README file under example/complex_modeling for details.
python network/predict_complex.py -i paired.a3m -o complex -Ls 218 310 

# For PPI screening using faster 2-track version (example input and output are at example/complex_2track)
python network_2track/predict_msa.py -msa [paired MSA file in a3m format] -npz [output npz file name] -L1 [Length of first chain]
e.g. python network_2track/predict_msa.py -msa input.a3m -npz complex.npz -L1 218
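The 2-track screening step writes its predictions to an npz archive; a quick, version-agnostic way to inspect what it contains (array names may vary between releases):

import numpy as np

data = np.load("complex.npz")
for key in data.files:
    print(key, data[key].shape)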

Expected outputs

For the pyrosetta version, the user will get five final models with the estimated CA RMS error in the B-factor column (model/model_[1-5].crderr.pdb).
For the end-to-end version, there will be a single PDB output with the estimated residue-wise CA-lDDT in the B-factor column (t000_.e2e.pdb).
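Because the end-to-end model stores its per-residue confidence in the B-factor column, any PDB parser can recover it; a minimal sketch assuming Biopython is installed:

from Bio.PDB import PDBParser

parser = PDBParser(QUIET=True)
structure = parser.get_structure("model", "t000_.e2e.pdb")
# B-factors on CA atoms hold the estimated per-residue CA-lDDT
plddt = [atom.get_bfactor()
         for atom in structure.get_atoms()
         if atom.get_name() == "CA"]
print(f"mean estimated CA-lDDT: {sum(plddt) / len(plddt):.2f}")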

FAQ

  1. Segmentation fault while running hhblits/hhsearch
    For ease of installation, we used a statically compiled version of hhsuite (installed through conda). We're not yet sure what exactly causes the segmentation fault in some cases, but we found that it can be resolved by compiling hhsuite from source and using that build instead of the conda version. For hhsuite installation instructions, please see here.

  2. Submitting jobs to computing nodes
    The modeling pipeline provided here (run_pyrosetta_ver.sh/run_e2e_ver.sh) is a guideline showing how RoseTTAFold works. For more efficient use of computing resources, you might want to modify the provided bash script to submit separate jobs with proper dependencies for each step (more CPUs/memory for hhblits/hhsearch, GPUs only for running the networks, etc.).


Credit to performer-pytorch and SE(3)-Transformer codes

The code in network/performer_pytorch.py is strongly based on this repo, a PyTorch implementation of the Performer architecture. The code in network/equivariant_attention is from the original SE(3)-Transformer repo, which accompanies the paper 'SE(3)-Transformers: 3D Roto-Translation Equivariant Attention Networks' by Fuchs et al.

References

M. Baek et al., Accurate prediction of protein structures and interactions using a three-track neural network, Science (2021). link

I.R. Humphreys, J. Pei, M. Baek, A. Krishnakumar, et al., Computed structures of core eukaryotic protein complexes, Science (2021). link