RoseTTAFold
This package contains deep learning models and related scripts for RoseTTAFold
Quick Overview
RoseTTAFold is open-source protein structure prediction software developed by the Baker lab at the University of Washington. It uses a three-track neural network that processes sequence (1D), residue-pair (2D), and coordinate (3D) information simultaneously and exchanges information between the tracks to produce 3D structure predictions. It is generally described as lighter-weight than AlphaFold2, with accuracy that approaches but typically does not match it.
Pros
- High accuracy in protein structure prediction, comparable to AlphaFold2
- Faster runtime and lower memory requirements than AlphaFold2
- Open-source code under the MIT License; the trained weights are available for non-commercial use only (Rosetta-DL Software license)
- Supports both single-chain and complex structure prediction
Cons
- Requires significant computational resources, including GPUs
- Installation and setup can be complex for non-expert users
- Limited documentation and user support compared to some commercial alternatives
- May have lower accuracy for some challenging protein targets compared to the latest AlphaFold versions
Code Examples
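# NOTE: illustrative pseudocode only. The README documents shell scripts (see Usage), not a Python API;
# the `rosettafold` module and the names RoseTTAFold, predict, to_pdb, and get_plddt are hypothetical.
# Example 1: Predicting a protein structure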
from rosettafold import RoseTTAFold
model = RoseTTAFold()
sequence = "MVKVGVNGFGRIGRLVTRAAFNSGKVDIVAINDPFIDLNYMVYMFQYDSTHGKFHGTVKAENGKLVINGNPITIFQERDPSKIKWGDAGAEYVVESTGVFTTMEKAGAHLQGGAKRVIISAPSADAPMFVMGVNHEKYDNSLKIISNASCTTNCLAPLAKVIHDNFGIVEGLMTTVHAITATQKTVDGPSGKLWRDGRGALQNIIPASTGAAKAVGKVIPELDGKLTGMAFRVPTANVSVVDLTCRLEKPAKYDDIKKVVKQASEGPLKGILGYTEHQVVSSDFNSDTHSSTFDAGAGIALNDHFVKLISWYDNEFGYSNRVVDLMAHMASKE"
predicted_structure = model.predict(sequence)
# Example 2: Visualizing the predicted structure
import py3Dmol
view = py3Dmol.view()
view.addModel(predicted_structure.to_pdb(), "pdb")
view.setStyle({'cartoon': {'color': 'spectrum'}})
view.zoomTo()
view.show()
# Example 3: Evaluating prediction confidence
confidence_scores = predicted_structure.get_plddt()
import matplotlib.pyplot as plt
plt.plot(confidence_scores)
plt.xlabel("Residue")
plt.ylabel("pLDDT Score")
plt.title("Prediction Confidence")
plt.show()
Getting Started
- Install RoseTTAFold and its dependencies:
git clone https://github.com/RosettaCommons/RoseTTAFold.git
cd RoseTTAFold
conda env create -f RoseTTAFold-linux.yml
conda env create -f folding-linux.yml
./install_dependencies.sh
- Download pre-trained weights:
wget https://files.ipd.uw.edu/pub/RoseTTAFold/weights.tar.gz
tar xfz weights.tar.gz
- Run a prediction:
cd example
../run_e2e_ver.sh input.fa .
A prediction also requires the sequence and structure databases. For these, the PyRosetta setup, and advanced usage, see the Installation and Usage sections of the README below.
Competitor Comparisons
Making Protein folding accessible to all!
Pros of ColabFold
- Easier to use with Google Colab integration, making it more accessible for users without local high-performance computing resources
- Faster prediction times due to optimized implementation and use of MMseqs2 for sequence searches
- More frequent updates and active community support
Cons of ColabFold
- Less customizable than RoseTTAFold for advanced users who need fine-grained control over the prediction process
- May have limitations on input size and runtime when using free Google Colab resources
Code Comparison
RoseTTAFold:
msa, deletion_matrix = parsers.parse_a3m(a3m_file)
N, L = msa.shape
if N > args.nseq:
    idx = np.argsort(np.sum(msa==20,axis=1))[:args.nseq]
    msa = msa[idx]
    deletion_matrix = deletion_matrix[idx]
ColabFold:
a3m_lines = input_file.split("\n")
msa, ins = parse_a3m(a3m_lines)
N, L = msa.shape
if N > args.max_msa_clusters:
    msa = msa[:args.max_msa_clusters]
    ins = ins[:args.max_msa_clusters]
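Both snippets parse a multiple sequence alignment (MSA) and cap the number of sequences. They differ in which sequences survive: the RoseTTAFold snippet keeps the args.nseq rows with the fewest positions equal to 20 (presumably the gap token), while the ColabFold snippet simply keeps the first args.max_msa_clusters rows.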
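The "fewest gaps first" selection in the RoseTTAFold snippet can be sketched in plain Python. This is a minimal illustration, not code from either repository: it assumes the MSA is a list of equal-length aligned strings with `-` as the gap character, and `subsample_msa` and its parameters are hypothetical names.

```python
def subsample_msa(msa, nseq, gap="-"):
    """Keep the nseq aligned sequences that contain the fewest gap characters.

    Hypothetical helper for illustration; it works on strings rather than the
    integer-encoded arrays used in the snippet above.
    """
    if len(msa) <= nseq:
        return list(msa)
    # sorted() is stable, so sequences with equal gap counts keep their input order.
    order = sorted(range(len(msa)), key=lambda i: msa[i].count(gap))
    return [msa[i] for i in order[:nseq]]


# Toy alignment: the query comes first and has no gaps.
msa = ["MKVLA", "MK-LA", "M---A", "MKV-A", "-----"]
print(subsample_msa(msa, 3))
```

Because the query normally has no gaps, a stable sort keeps it at the top of the reduced alignment.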
Open source code for AlphaFold.
Pros of AlphaFold
- Higher accuracy in protein structure prediction
- More extensive training on protein databases
- Better handling of multi-chain complexes
Cons of AlphaFold
- Computationally intensive, requiring significant resources
- Less flexibility in customizing the prediction process
- Longer runtime for predictions compared to RoseTTAFold
Code Comparison
RoseTTAFold:
def predict_structure(sequence):
    msa = generate_msa(sequence)
    features = extract_features(msa)
    model = load_model("rosettafold_model.pkl")
    return model.predict(features)
AlphaFold:
def predict_structure(sequence):
    features = pipeline.process(protein=protein.from_sequence(sequence))
    model = alphafold.model.Model(config)
    prediction = model.predict(features)
    return prediction.unrelaxed_protein
Both repositories predict protein structures. AlphaFold generally achieves higher accuracy at a higher computational cost, while RoseTTAFold is a lighter alternative with somewhat lower accuracy. The snippets are simplified pseudocode rather than excerpts from either code base; in practice both methods build their input features from an MSA before running the network.
Evolutionary Scale Modeling (esm): Pretrained language models for proteins
Pros of ESM
- Broader scope: ESM is a general-purpose protein language model, applicable to various protein-related tasks
- Extensive pre-training: Trained on a vast amount of protein sequence data, enabling robust performance across different applications
- Easier integration: Provides pre-trained models and tools for easy incorporation into various workflows
Cons of ESM
- Less specialized: May not perform as well as RoseTTAFold for specific protein structure prediction tasks
- Higher computational requirements: Large language models often require more computational resources to run effectively
- Limited structural information: Primarily focuses on sequence information, potentially missing some structural nuances
Code Comparison
RoseTTAFold:
msa, xyz, mask = self.forward(msa, xyz, seq1hot, idx, t1d)
return msa, xyz, mask
ESM:
results = model(batch_tokens)
representations = results["representations"][33]
The code snippets show that RoseTTAFold focuses on processing multiple sequence alignments (MSA) and structural data, while ESM operates on tokenized sequences and produces representations. This reflects their different approaches to protein analysis.
Google Research
Pros of google-research
- Broader scope, covering various AI and machine learning topics
- More frequent updates and contributions from a larger team
- Extensive documentation and examples for multiple projects
Cons of google-research
- Less focused on a specific problem domain
- May be more challenging to navigate due to its size and diversity
- Potentially steeper learning curve for newcomers
Code Comparison
RoseTTAFold:
def predict_structure(sequence):
    model = load_model("rosettafold_model.pth")
    return model.predict(sequence)
google-research:
from transformers import BertModel, BertTokenizer

def run_bert(input_text):
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')
    # The tokenizer returns a dict-like encoding, so unpack it into keyword arguments.
    return model(**tokenizer(input_text, return_tensors='pt'))
The code snippets illustrate the difference in focus: RoseTTAFold is specialized for protein structure prediction, while google-research covers a wider range of topics, such as natural language processing with BERT in this example.
README
RoseTTAFold
This package contains deep learning models and related scripts to run RoseTTAFold.
This repository is the official implementation of RoseTTAFold: Accurate prediction of protein structures and interactions using a 3-track network.
Installation
- Clone the package
git clone https://github.com/RosettaCommons/RoseTTAFold.git
cd RoseTTAFold
- Create conda environments using the RoseTTAFold-linux.yml and folding-linux.yml files. The latter is needed only for the PyRosetta version (run_pyrosetta_ver.sh).
# create conda environment for RoseTTAFold
# If your NVIDIA driver is compatible with CUDA 11
conda env create -f RoseTTAFold-linux.yml
# If not (but it is compatible with CUDA 10)
conda env create -f RoseTTAFold-linux-cu101.yml
# create conda environment for PyRosetta folding & running DeepAccNet
conda env create -f folding-linux.yml
- Download network weights (under Rosetta-DL Software license -- please see below)
While the code is licensed under the MIT License, the trained weights and data for RoseTTAFold are made available for non-commercial use only under the terms of the Rosetta-DL Software license. You can find details at https://files.ipd.uw.edu/pub/RoseTTAFold/Rosetta-DL_LICENSE.txt
[Update Nov/02/2021] The archive now also includes the weights (RF2t.pt) for the RoseTTAFold 2-track model used for yeast PPI screening. To use it, please re-download the weights. The original RoseTTAFold weights are unchanged.
wget https://files.ipd.uw.edu/pub/RoseTTAFold/weights.tar.gz
tar xfz weights.tar.gz
- Download and install third-party software.
./install_dependencies.sh
- Download sequence and structure databases
# uniref30 [46G]
wget http://wwwuser.gwdg.de/~compbiol/uniclust/2020_06/UniRef30_2020_06_hhsuite.tar.gz
mkdir -p UniRef30_2020_06
tar xfz UniRef30_2020_06_hhsuite.tar.gz -C ./UniRef30_2020_06
# BFD [272G]
wget https://bfd.mmseqs.com/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz
mkdir -p bfd
tar xfz bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz -C ./bfd
# structure templates (including *_a3m.ffdata, *_a3m.ffindex) [over 100G]
wget https://files.ipd.uw.edu/pub/RoseTTAFold/pdb100_2021Mar03.tar.gz
tar xfz pdb100_2021Mar03.tar.gz
# for CASP14 benchmarks, we used this one: https://files.ipd.uw.edu/pub/RoseTTAFold/pdb100_2020Mar11.tar.gz
- Obtain a PyRosetta licence and install the package in the newly created folding conda environment (link).
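Before starting the installation steps above, two quick checks can save a failed download: which environment file matches the driver, and whether the target disk can hold the databases. The sketch below is a hedged illustration, not part of the repository. The function names are hypothetical, the sizes are the bracketed figures from the download step treated as lower bounds (the listing says "over 100G" for the templates), and the `nvidia-smi` header line is a made-up sample in that tool's usual layout.

```python
import re
import shutil

# Sizes quoted in the download step, in gigabytes (lower bounds, not measured).
DATABASE_SIZES_GB = {"UniRef30_2020_06": 46, "bfd": 272, "pdb100_2021Mar03": 100}


def pick_env_file(cuda_major):
    """Choose the conda environment file from the CUDA major version the driver supports."""
    if cuda_major >= 11:
        return "RoseTTAFold-linux.yml"
    if cuda_major == 10:
        return "RoseTTAFold-linux-cu101.yml"
    raise ValueError(f"no environment file listed for CUDA {cuda_major}")


def cuda_major_from_smi(text):
    """Extract the major CUDA version from nvidia-smi header text, or None if absent."""
    match = re.search(r"CUDA Version:\s*(\d+)\.", text)
    return int(match.group(1)) if match else None


def missing_space_gb(path=".", sizes=DATABASE_SIZES_GB):
    """Return how many GB are lacking at `path` for all databases (0.0 if enough)."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return max(0.0, sum(sizes.values()) - free_gb)


# Sample header line; on a real machine this text would come from running nvidia-smi.
header = "| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |"
print(pick_env_file(cuda_major_from_smi(header)))
print(f"databases need at least {sum(DATABASE_SIZES_GB.values())} GB")
print(f"short by {missing_space_gb('.'):.0f} GB in the current directory")
```

The space estimate ignores the room needed to hold an archive and its extracted contents at the same time, so the real requirement is likely higher.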
Usage
# For monomer structure prediction
cd example
../run_[pyrosetta, e2e]_ver.sh input.fa .
# For complex modeling
# please see README file under example/complex_modeling/README for details.
python network/predict_complex.py -i paired.a3m -o complex -Ls 218 310
# For PPI screening using the faster 2-track version (example input and output are in example/complex_2track)
python network_2track/predict_msa.py -msa [paired MSA file in a3m format] -npz [output npz file name] -L1 [length of first chain]
# e.g.
python network_2track/predict_msa.py -msa input.a3m -npz complex.npz -L1 218
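For complex modeling, a mismatch between the paired MSA and the chain lengths is an easy mistake to make. The sketch below checks this before building the `predict_complex.py` command from the usage listing. It assumes that the values passed to `-Ls` are the chain lengths and that they sum to the number of alignment columns in the paired a3m file; that reading is an inference from the example, not something the README states. The helper names are hypothetical.

```python
def a3m_match_length(a3m_text):
    """Count alignment columns in the first (query) record of a3m text.

    In a3m, lowercase letters mark insertions and do not count as columns.
    """
    seq_lines = []
    for line in a3m_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and optional comment lines
        if line.startswith(">"):
            if seq_lines:
                break  # the first record is complete
            continue
        seq_lines.append(line)
    query = "".join(seq_lines)
    return sum(1 for ch in query if not ch.islower())


def complex_command(a3m_path, a3m_text, out_prefix, chain_lengths):
    """Build the predict_complex.py argument list after a length sanity check."""
    total = sum(chain_lengths)
    columns = a3m_match_length(a3m_text)
    if columns != total:
        raise ValueError(f"paired MSA has {columns} columns, chains sum to {total}")
    return ["python", "network/predict_complex.py", "-i", a3m_path,
            "-o", out_prefix, "-Ls", *[str(n) for n in chain_lengths]]


# Toy paired alignment: 8 columns, split here as chains of length 5 and 3.
toy = ">query\nMKVLAGHT\n>hit\nMK-LAaGHT\n"
print(" ".join(complex_command("paired.a3m", toy, "complex", [5, 3])))
```

The returned list can be passed to `subprocess.run` from the repository root once the real file and chain lengths are substituted.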
Expected outputs
For the PyRosetta version, you will get five final models with the estimated CA RMS error in the B-factor column (model/model_[1-5].crderr.pdb).
For the end-to-end version, there is a single PDB output with the estimated residue-wise CA-lDDT in the B-factor column (t000_.e2e.pdb).
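Since both pipelines store their per-residue estimate in the B-factor column, it can be read back with fixed-column parsing of the ATOM records. The sketch below is a minimal, hedged example: `ca_bfactors` and `atom_line` are hypothetical helpers, the sample coordinates and scores are made up, and a single-chain file is assumed (residue numbers would collide across chains).

```python
def atom_line(serial, name, res_name, chain, res_seq, x, y, z, bfactor):
    """Format one ATOM record using the standard PDB column layout (for the demo only)."""
    return (f"ATOM  {serial:5d} {name:<4s} {res_name:>3s} {chain}{res_seq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{bfactor:6.2f}")


def ca_bfactors(pdb_text):
    """Map residue number -> B-factor column value for CA atoms of ATOM records."""
    values = {}
    for line in pdb_text.splitlines():
        # Columns 13-16 hold the atom name, 23-26 the residue number,
        # and 61-66 the B-factor (1-based, per the PDB format).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            values[int(line[22:26])] = float(line[60:66])
    return values


# Made-up two-residue sample; a real run would read t000_.e2e.pdb instead.
sample_pdb = "\n".join([
    atom_line(1, " N  ", "MET", "A", 1, 11.104, 6.134, -6.504, 0.85),
    atom_line(2, " CA ", "MET", "A", 1, 11.639, 6.071, -5.147, 0.85),
    atom_line(3, " CA ", "LYS", "A", 2, 13.559, 8.611, -3.095, 0.62),
])
print(ca_bfactors(sample_pdb))
```

The same function applies to the model_[1-5].crderr.pdb files, where the values are estimated CA errors rather than lDDT scores, so lower is better there.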
FAQ
- Segmentation fault while running hhblits/hhsearch
For easy installation, we used a statically compiled version of hhsuite (installed through conda). We are not yet sure what exactly causes the segmentation fault in some cases, but it may be resolved by compiling hhsuite from source and using that build instead of the conda version. For installation of hhsuite, please see here.
- Submitting jobs to computing nodes
The modeling pipeline provided here (run_pyrosetta_ver.sh/run_e2e_ver.sh) is a guideline showing how RoseTTAFold works. For more efficient use of computing resources, you may want to modify the provided bash script to submit separate jobs with proper dependencies for each step (more CPUs/memory for hhblits/hhsearch, GPUs only for running the networks, etc.).
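One way to split the pipeline into dependent jobs is sketched below. It assumes a SLURM cluster, which the README does not mention, and uses only standard `sbatch` options. The stage scripts `make_msa.sh` and `run_network.sh`, the resource numbers, and the job ID are hypothetical placeholders, not files or values from the repository.

```python
def sbatch_command(script, cpus=1, mem_gb=4, gpus=0, after=None):
    """Build an sbatch argument list; `after` is a job ID this job must wait for."""
    cmd = ["sbatch", "--parsable", f"--cpus-per-task={cpus}", f"--mem={mem_gb}G"]
    if gpus:
        cmd.append(f"--gres=gpu:{gpus}")
    if after is not None:
        # afterok: start only if the earlier job finished successfully.
        cmd.append(f"--dependency=afterok:{after}")
    cmd.append(script)
    return cmd


# CPU-heavy search stage first, then a GPU stage that waits for it.
msa_job = sbatch_command("make_msa.sh", cpus=8, mem_gb=64)
net_job = sbatch_command("run_network.sh", cpus=2, mem_gb=32, gpus=1, after="12345")
print(" ".join(msa_job))
print(" ".join(net_job))
```

In a real workflow the job ID would come from the output of the first submission (`--parsable` makes `sbatch` print it) rather than being hard-coded.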
Links:
- Robetta server (RoseTTAFold option)
- RoseTTAFold models for CASP14 targets [input MSA and hhsearch files are included]
Credit to performer-pytorch and SE(3)-Transformer codes
The code in network/performer_pytorch.py is strongly based on this repo, which is a PyTorch implementation of the Performer architecture. The code in network/equivariant_attention is from the original SE(3)-Transformer repo, which accompanies the paper 'SE(3)-Transformers: 3D Roto-Translation Equivariant Attention Networks' by Fuchs et al.
References
M. Baek, et al., Accurate prediction of protein structures and interactions using a three-track neural network, Science (2021). link
I.R. Humphreys, J. Pei, M. Baek, A. Krishnakumar, et al, Computed structures of core eukaryotic protein complexes, Science (2021). link