ColabFold

Making Protein folding accessible to all!

2,355


Top Related Projects

RoseTTAFold

2,171

This package contains deep learning models and related scripts for RoseTTAFold

esm

3,719

Evolutionary Scale Modeling (esm): Pretrained language models for proteins

Quick Overview

ColabFold is a project that brings protein structure prediction to Google Colab, making it accessible to researchers without the need for powerful local hardware. It integrates AlphaFold2 and RoseTTAFold, allowing users to predict protein structures and complexes using a user-friendly interface in a cloud environment.

Pros

Democratizes access to state-of-the-art protein structure prediction tools
Runs on Google Colab, eliminating the need for local high-performance computing resources
Provides an easy-to-use interface for both AlphaFold2 and RoseTTAFold
Regularly updated to incorporate the latest improvements in protein structure prediction

Cons

Dependent on Google Colab's availability and resource limitations
May experience slower performance compared to local high-performance setups
Limited customization options compared to running the tools locally
Requires an internet connection and Google account to use

Code Examples

Running AlphaFold2 prediction:

from colabfold import run_alphafold2

sequence = "MVKVGVNGFGRIGRLVTRAAFNSGKVDIVAINDPFIDLNYMVYMFQYDSTHGKFHGTVKAENGKLVINGNPITIFQERDPSKIKWGDAGAEYVVESTGVFTTMEKAGAHLQGGAKRVIISAPSADAPMFVMGVNHEKYDNSLKIISNASCTTNCLAPLAKVIHDNFGIVEGLMTTVHAITATQKTVDGPSGKLWRDGRGALQNIIPASTGAAKAVGKVIPELDGKLTGMAFRVPTANVSVVDLTCRLEKPAKYDDIKKVVKQASEGPLKGILGYTEHQVVSSDFNSDTHSSTFDAGAGIALNDHFVKLISWYDNEFGYSNRVVDLMAHMASKE"
output = run_alphafold2(sequence)
print(output)

Predicting protein complex structure:

from colabfold import predict_complex

sequences = ["MVKVGVNGFGRIGRLVTRAAFNSGKVDIVAINDPFIDLNYMVYMFQYDSTHGKFHGTVKAENGKLVINGNPITIFQERDPSKIKWGDAGAEYVVESTGVFTTMEKAGAHLQGGAKRVIISAPSADAPMFVMGVNHEKYDNSLKIISNASCTTNCLAPLAKVIHDNFGIVEGLMTTVHAITATQKTVDGPSGKLWRDGRGALQNIIPASTGAAKAVGKVIPELDGKLTGMAFRVPTANVSVVDLTCRLEKPAKYDDIKKVVKQASEGPLKGILGYTEHQVVSSDFNSDTHSSTFDAGAGIALNDHFVKLISWYDNEFGYSNRVVDLMAHMASKE",
             "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK"]
output = predict_complex(sequences)
print(output)

Visualizing predicted structure:

from colabfold import visualize_structure

pdb_file = "predicted_structure.pdb"
visualize_structure(pdb_file)

Getting Started

To use ColabFold, follow these steps:

Open Google Colab (https://colab.research.google.com/)
Create a new notebook
Install ColabFold by running:
```
!pip install colabfold
```
Import the necessary modules:

Competitor Comparisons

alphafold

13,724

Open source code for AlphaFold 2.

Pros of AlphaFold

Original implementation by DeepMind, offering the complete, official codebase
Extensive documentation and detailed explanations of the algorithm
Highly optimized for performance on powerful hardware

Cons of AlphaFold

Requires significant computational resources and expertise to set up and run
Less user-friendly for researchers without extensive computational background
Limited flexibility for customization or integration with other tools

Code Comparison

AlphaFold:

def predict_structure(
    fasta_path: str,
    output_dir: str,
    data_pipeline: pipeline.DataPipeline,
    model_runners: Dict[str, model.RunModel],
    amber_relaxer: relax.AmberRelaxation,
    benchmark: bool,
    random_seed: int,
    models_to_relax: ModelsToRelax):
  """Predicts structure using AlphaFold for the given sequence."""
  # Implementation details...

ColabFold:

def predict_structure(sequence, jobname='test', num_recycle=3):
    """Predicts protein structure using ColabFold."""
    results = []
    for model_name in model_names:
        model = load_model(model_name)
        pred = model.predict(sequence, num_recycle=num_recycle)
        results.append(pred)
    return results

The code snippets illustrate the difference in complexity and abstraction level between the two implementations. AlphaFold's code is more detailed and parameterized, while ColabFold offers a simpler interface for quick predictions.

RoseTTAFold

2,171

This package contains deep learning models and related scripts for RoseTTAFold

Pros of RoseTTAFold

More comprehensive and flexible protein structure prediction pipeline
Integrates Rosetta energy functions for refinement and scoring
Supports additional features like complex modeling and design

Cons of RoseTTAFold

Steeper learning curve and more complex setup
Requires more computational resources
Less user-friendly for beginners or those without extensive bioinformatics experience

Code Comparison

RoseTTAFold:

# Example of running RoseTTAFold
from pyrosetta import *
from pyrosetta.rosetta.protocols.rosetta_scripts import XmlObjects
xml = XmlObjects.create_from_file("rosettafold.xml")
pose = pose_from_sequence("ACDEFGHIKLMNPQRSTVWY")
xml.get_mover("rosettafold").apply(pose)

ColabFold:

# Example of running ColabFold
from colabfold import batch
batch.predict("input.fasta", "output_dir", use_templates=True)

ColabFold offers a more streamlined and user-friendly approach, making it easier for researchers to quickly predict protein structures. RoseTTAFold, while more complex, provides greater flexibility and integration with the Rosetta suite of tools, allowing for more advanced modeling and design capabilities. The choice between the two depends on the user's specific needs, expertise, and available computational resources.

esm

3,719

Evolutionary Scale Modeling (esm): Pretrained language models for proteins

Pros of ESM

Broader scope: Focuses on protein language models and sequence-based predictions
More extensive documentation and examples for various use cases
Larger community and more frequent updates

Cons of ESM

Less specialized for protein structure prediction
Requires more setup and configuration for specific tasks
May be more complex for beginners to use effectively

Code Comparison

ESM:

import torch
from esm import pretrained

model, alphabet = pretrained.load_model_and_alphabet("esm2_t33_650M_UR50D")
batch_converter = alphabet.get_batch_converter()
model.eval()  # disables dropout for deterministic results

ColabFold:

from colabfold.batch import predict_structure_batch
from colabfold.download import default_data_dir
from colabfold.utils import setup_logging

predict_structure_batch(
    "sequence.fasta",
    "output_dir",
    data_dir=default_data_dir,
    num_recycle=3
)

The code snippets demonstrate the different focus areas of each project. ESM provides a more general-purpose protein language model, while ColabFold offers a streamlined interface for structure prediction.

google-research

36,128

Google Research

Pros of google-research

Broader scope, covering various research areas beyond protein folding
Larger community and more frequent updates
Official repository from Google, potentially more stable and well-maintained

Cons of google-research

Less focused on protein structure prediction specifically
May be more complex to navigate and use for specific tasks
Potentially steeper learning curve for newcomers to the field

Code comparison

ColabFold:

def run_mmseqs2(x, prefix, use_env=True, use_filter=True):
    return_value = os.system(f"mmseqs easy-search {x} {DB} {prefix} {TMP_DIR} \
                   --format-output query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits,tcov,qcov \
                   -s 7.5 --alignment-mode 3 --slice-search")

google-research:

def run_alphafold(fasta_path, output_dir, max_template_date=None):
    model_runners = {}
    for model_name in config.MODEL_PRESETS['alphafold2_ptm']:
        model_config = config.model_config(model_name)
        model_runner = model.RunModel(model_config, model_params)
        model_runners[model_name] = model_runner


README

ColabFold - v1.5.5

For details of what was changed in v1.5, see change log!

Note (04 Aug 2025): We changed the taxonomy/pairing files for the UniRef100 database. This might affect multimer predictions. Check the wiki entry for details.

Making Protein folding accessible to all via Google Colab!

Notebooks	monomers	complexes	mmseqs2	jackhmmer	templates
AlphaFold2_mmseqs2	Yes	Yes	Yes	No	Yes
AlphaFold2_batch	Yes	Yes	Yes	No	Yes
AlphaFold2 (from Deepmind)	Yes	Yes	No	Yes	No
relax_amber (relax input structure)
ESMFold	Yes	Maybe	No	No	No

BETA (in development) notebooks
RoseTTAFold2	Yes	Yes	Yes	No	WIP
Boltz	Yes	Yes	Yes	No	No
BioEmu	Yes	No	Yes	No	No
OmegaFold	Yes	Maybe	No	No	No
AlphaFold2_advanced_v2 (new experimental notebook)	Yes	Yes	Yes	No	Yes

Check the wiki page old retired notebooks for unsupported notebooks.

FAQ

Where can I chat with other ColabFold users?
- See our Discord channel!
Can I use the models for Molecular Replacement?
- Yes, but be CAREFUL: the B-factor column is populated with pLDDT confidence values (higher = better). Phenix.phaser expects a "real" B-factor (lower = better). See post from Claudia Millán.
What is the maximum length?
- Limits depend on the free GPU provided by Google Colab (fingers crossed)
- For GPU: Tesla T4 with ~16G the max length is ~2000
- To check what GPU you got, open a new code cell and type !nvidia-smi
Is it okay to use the MMseqs2 MSA server on a local computer?
- You can access the server from a local computer if your queries are serial from a single IP. Please do not use multiple computers to query the server.
Where can I download the databases used by ColabFold?
- The databases are available at colabfold.mmseqs.com
I want to render my own images of the predicted structures, how do I color by pLDDT?
- In pymol for AlphaFold structures: spectrum b, red_yellow_green_cyan_blue, minimum=50, maximum=90
- If you want to use AlphaFold Colours (credit: Konstantin Korotkov)
```
set_color n0, [0.051, 0.341, 0.827]
set_color n1, [0.416, 0.796, 0.945]
set_color n2, [0.996, 0.851, 0.212]
set_color n3, [0.992, 0.490, 0.302]
color n0, b < 100; color n1, b < 90
color n2, b < 70;  color n3, b < 50
```
- In pymol for RoseTTAFold structures: spectrum b, red_yellow_green_cyan_blue, minimum=0.5, maximum=0.9
What is the difference between the AlphaFold2_advanced and AlphaFold2_mmseqs2 (_batch) notebook for complex prediction?
- We currently have two different ways to predict protein complexes: (1) using the AlphaFold2 model with residue index jump and (2) using the AlphaFold2-multimer model. AlphaFold2_advanced supports (1), and AlphaFold2_mmseqs2 (_batch) supports (2).
What is the difference between localcolabfold and the pip installable colabfold_batch?
- LocalColabFold is an installer script designed to make ColabFold functionality available on local users' machines. It supports a wide range of operating systems, such as Windows 10 or later (using Windows Subsystem for Linux 2), macOS, and Linux.
Is there a way to amber-relax structures without having to rerun alphafold/colabfold from scratch?
- Yes, see this notebook.
Where can I find the old notebooks that were previously developed and are now retired?
- You can find the list of retired notebooks in the old retired notebooks wiki page.
Where can I find the history of MSA Server Databases used in ColabFold?
- You can view the database version history on the MSA Server Database History wiki page.
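
The pLDDT-related answers above can be illustrated with a short Python sketch. It is not part of ColabFold: it assumes fixed-width PDB ATOM records, reads the B-factor field (columns 61-66) of each CA atom as pLDDT, and assigns each value to one of the four colour bands that result from running the PyMOL commands above in order. The names `atom_line`, `plddt_per_residue` and `colour_band` are hypothetical helpers introduced for this example.

```python
from typing import Dict, Tuple


def atom_line(serial: int, name: str, res: str, chain: str,
              resnum: int, plddt: float) -> str:
    """Build a fixed-width PDB ATOM record (example data only)."""
    return (f"ATOM  {serial:5d} {name:<4s} {res:3s} {chain}{resnum:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}")


def plddt_per_residue(pdb_text: str) -> Dict[Tuple[str, int], float]:
    """Return {(chain, residue number): pLDDT} read from CA atoms."""
    scores: Dict[Tuple[str, int], float] = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain = line[21]
            resnum = int(line[22:26])
            # The B-factor field holds pLDDT (higher = better), not a real B-factor.
            scores[(chain, resnum)] = float(line[60:66])
    return scores


def colour_band(plddt: float) -> str:
    """Map a pLDDT value to the colour name it ends up with in PyMOL.

    The commands are applied in order, so the lowest matching threshold wins.
    A value of exactly 100 is not matched by `b < 100`; it is treated as n0 here.
    """
    if plddt < 50:
        return "n3"
    if plddt < 70:
        return "n2"
    if plddt < 90:
        return "n1"
    return "n0"


if __name__ == "__main__":
    example = "\n".join([
        atom_line(1, " CA", "MET", "A", 1, 92.5),
        atom_line(2, " CA", "LYS", "A", 2, 45.0),
    ])
    for key, value in plddt_per_residue(example).items():
        print(key, value, colour_band(value))
```

Before molecular replacement, the same per-residue values could be used to decide which low-confidence regions to trim; how to convert pLDDT into a B-factor estimate is outside the scope of this sketch.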

Running locally

For instructions on how to install ColabFold locally refer to localcolabfold or see our wiki on how to run ColabFold within Docker.

Generating MSAs for small scale local structure/complex predictions using the MSA server

When you pass a FASTA or CSV file containing your sequences to colabfold_batch it will automatically query the public MSA server to generate MSAs. You might want to split this into two steps for better GPU resource utilization:

# Query the MSA server and predict the structure on local GPU in one go:
colabfold_batch input_sequences.fasta out_dir

# Split querying MSA server and GPU predictions into two steps
colabfold_batch input_sequences.fasta out_dir --msa-only
colabfold_batch input_sequences.fasta out_dir
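
When the two-step variant above is driven from a script, the commands can be assembled as argument lists. The following is a minimal sketch that only uses the `colabfold_batch` arguments shown above; the function name `msa_then_predict_commands` is hypothetical, and the actual `subprocess.run` call is left commented out so the snippet has no side effects.

```python
import shlex
from typing import List


def msa_then_predict_commands(fasta: str, out_dir: str) -> List[List[str]]:
    """Build the two colabfold_batch invocations of the split workflow."""
    msa_step = ["colabfold_batch", fasta, out_dir, "--msa-only"]  # queries the MSA server only
    predict_step = ["colabfold_batch", fasta, out_dir]            # runs the prediction
    return [msa_step, predict_step]


if __name__ == "__main__":
    for cmd in msa_then_predict_commands("input_sequences.fasta", "out_dir"):
        print(shlex.join(cmd))
        # import subprocess; subprocess.run(cmd, check=True)  # uncomment to execute
```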

Generating MSAs for large scale structure/complex predictions

First create a directory for the databases on a disk with sufficient storage (940GB (!)). Depending on where you are, this will take a couple of hours:

Note: MMseqs2 Release 18 is used to create the databases and perform sequence searches in the ColabFold MSA server. Please use this version if you want to obtain the same MSAs as the server.

MMSEQS_NO_INDEX=1 ./setup_databases.sh /path/to/db_folder

If MMseqs2 is not installed in your PATH, pass --mmseqs <path to mmseqs> to colabfold_search:

# This needs a lot of CPU
colabfold_search --mmseqs /path/to/bin/mmseqs input_sequences.fasta /path/to/db_folder msas
# This needs a GPU
colabfold_batch msas predictions

This will create an intermediate folder msas that contains all input multiple sequence alignments formatted as a3m files, and a predictions folder with all predicted pdb, json and png files.
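
To sanity-check the a3m files in the msas folder, a small reader is often enough. The sketch below is an illustration rather than ColabFold code: it assumes the usual a3m conventions (header lines start with `>`, lowercase letters mark insertions relative to the query, and an optional leading `#` line is skipped). The function name `read_a3m` and the example alignment are hypothetical.

```python
from typing import List, Tuple


def read_a3m(a3m_text: str) -> List[Tuple[str, str]]:
    """Return (header, aligned sequence) pairs with insertions removed."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for line in a3m_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the optional '#' info line
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            # Lowercase letters are insertions; dropping them restores query-length rows.
            chunks.append("".join(c for c in line if not c.islower()))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = "#5\t1\n>query\nMKVLA\n>hit1\nMKaVLA\n>hit2\nM-VLA\n"
    for name, seq in read_a3m(example):
        print(name, seq, len(seq))
```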

The procedure above disables MMseqs2 preindexing of the various ColabFold databases by setting the MMSEQS_NO_INDEX=1 environment variable before calling the database setup script. For most use cases of colabfold_search, precomputing the index is not required and might hurt search speed. The precomputed index is necessary for fast response times of the ColabFold server, where the whole database is permanently kept in memory. In any case, the batch searches will require a machine with about 128GB RAM or, if the databases are to be kept permanently in RAM, with over 1TB RAM.

In some cases a precomputed index can still be useful. For the following cases, call the setup_databases.sh script without the MMSEQS_NO_INDEX environment variable:

(0) As mentioned above, if you want to set up a server.

(1) If the precomputed index is stored on a very fast storage system (e.g., NVMe SSDs), it might be faster to read the index from disk than to compute it on the fly. In this case, the search should be performed on the same machine that called setup_databases.sh, since the precomputed index is created to fit within the given main memory size. Additionally, pass the --db-load-mode 0 option to make sure the database is read once from the storage system before use.

(2) Fast single-query searches require the full index (the .idx files) to be kept in memory. This can be done, e.g., with vmtouch. This type of search therefore requires a machine with at least 768GB to 1TB RAM for the ColabfoldDB. If the index is present in memory, use the --db-load-mode 2 parameter in colabfold_search to avoid index loading overhead.

If no index was created (MMSEQS_NO_INDEX=1 was set), then --db-load-mode does not do anything and can be ignored.
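
The cases above reduce to a small decision rule for the `--db-load-mode` option. The sketch below encodes only what the text states; the function name `db_load_mode_args` and its two boolean parameters are hypothetical, and the returned list is meant to be appended to a colabfold_search command line.

```python
from typing import List


def db_load_mode_args(index_built: bool, index_in_memory: bool) -> List[str]:
    """Pick the --db-load-mode arguments described in the text.

    index_built:     setup_databases.sh was run without MMSEQS_NO_INDEX=1
    index_in_memory: the .idx files are already held in RAM (e.g. via vmtouch)
    """
    if not index_built:
        return []  # without an index the option has no effect and can be omitted
    if index_in_memory:
        return ["--db-load-mode", "2"]  # case (2): avoid index loading overhead
    return ["--db-load-mode", "0"]      # case (1): read the index once from fast storage


if __name__ == "__main__":
    base = ["colabfold_search", "input_sequences.fasta", "/path/to/db_folder", "msas"]
    print(" ".join(base + db_load_mode_args(index_built=True, index_in_memory=False)))
```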

Saving MSAs in AlphaFold3-compatible JSON format

You can export MSAs into a json format compatible with AlphaFold3 input using the --af3-json option.

With colabfold_search:

If you are using the local database setup with colabfold_search, you can add the --af3-json option to save the MSAs as AlphaFold3 input json:

colabfold_search --mmseqs /path/to/bin/mmseqs input_sequences.fasta /path/to/db_folder msas --af3-json

This will create a json file in the msas folder, using the same name as the a3m file.

With colabfold_batch:

If you are using the MSA server via colabfold_batch, you can also use the --af3-json option. However, structure prediction will be skipped, and only the json file will be generated.

colabfold_batch input_sequences.fasta out_dir --af3-json

Including non-protein molecules in FASTA

AlphaFold3 supports non-protein components such as ligands and nucleic acids in input complexes. To include these in the generated json file, you can specify them directly in your FASTA input using the following format: molecule type|sequence|(copies). The allowed molecule types are dna, rna, ccd and smiles.

Substitute aromatic bonds in SMILES: If your SMILES string contains aromatic bonds (:), please replace them with semicolons (;) to avoid internal parsing issues.

Examples
- For DNA: dna|ATCG
- For RNA: rna|AUGC
- For ligands:
  - SMILES string: smiles|C1=NC(=C2C(=N1)N(C=N2)[C@H]3[C@@H]([C@@H]([C@H](O3)COP(=O)(O)OP(=O)(O)OP(=O)(O)O)O)O)N
  - CCD code: ccd|ATP
- To specify multiple copies of a molecule, you can add a number after the sequence, e.g. ccd|ATP|2 or dna|ATCG|2.
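
The entry format and the SMILES substitution rule can be captured in a small helper. This is a sketch under the rules stated above, not a ColabFold function: `fasta_entry` is a hypothetical name, and the only behaviour it encodes is the `molecule type|sequence|(copies)` layout, the four allowed molecule types, and the replacement of `:` by `;` in SMILES strings.

```python
ALLOWED_TYPES = {"dna", "rna", "ccd", "smiles"}


def fasta_entry(molecule_type: str, value: str, copies: int = 1) -> str:
    """Format one non-protein component as molecule type|sequence|(copies)."""
    if molecule_type not in ALLOWED_TYPES:
        raise ValueError(f"unsupported molecule type: {molecule_type}")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    if molecule_type == "smiles":
        # ':' separates components in the FASTA line, so aromatic bonds become ';'.
        value = value.replace(":", ";")
    entry = f"{molecule_type}|{value}"
    if copies > 1:
        entry += f"|{copies}"  # the copies field is optional and omitted for 1
    return entry


if __name__ == "__main__":
    print(fasta_entry("ccd", "ATP", 2))
    print(fasta_entry("dna", "ATCG"))
    print(fasta_entry("smiles", "c1:c:c:c:c:c:1"))
```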

Here is an example of a biological complex with 2 proteins and 2 ATP ligands:

>Complex1|Prot1:Prot2:Lig
FIRSTPROTEIN:SECONDPROTEIN:ccd|ATP|2
>Complex2|Prot1:Prot2:Lig
FIRSTPROTEIN:SECONDPROTEIN:ccd|ATP:ccd|ATP

Because the copies field is optional, Complex1 and Complex2 result in identical json input.
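
The equivalence of the two records can be checked by expanding each sequence line into its individual components. The sketch below is an illustrative parser, not the one ColabFold uses: it assumes components are separated by `:` and fields within a component by `|`, as in the example above. The name `expand_components` is hypothetical.

```python
from typing import List


def expand_components(sequence_line: str) -> List[str]:
    """Expand a ':'-separated sequence line, repeating entries with a copies field."""
    components: List[str] = []
    for part in sequence_line.strip().split(":"):
        fields = part.split("|")
        if len(fields) == 1:
            components.append(fields[0])  # plain protein sequence
            continue
        mol_type, value = fields[0], fields[1]
        copies = int(fields[2]) if len(fields) > 2 else 1
        components.extend([f"{mol_type}|{value}"] * copies)
    return components


if __name__ == "__main__":
    complex1 = expand_components("FIRSTPROTEIN:SECONDPROTEIN:ccd|ATP|2")
    complex2 = expand_components("FIRSTPROTEIN:SECONDPROTEIN:ccd|ATP:ccd|ATP")
    print(complex1)
    print(complex1 == complex2)
```

Splitting on `:` in this way also shows why an unescaped aromatic bond in a SMILES string would be misread as a component boundary.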

Note that MMseqs2-based MSAs are only generated for the protein sequences. RNA entries will not have unpaired MSAs in the json file. However, the field is marked as null so that AlphaFold3 can generate MSAs for them.
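
To see which entries were left for AlphaFold3 to fill in, the generated json can be scanned for null MSA fields. The sketch below is hedged: the key names (`sequences`, `unpairedMsa`, `id`) follow the AlphaFold3 input format as commonly documented and are an assumption here, so check them against a file produced by `--af3-json`. The function name and the example document are hypothetical.

```python
import json
from typing import List


def entries_without_unpaired_msa(af3_json_text: str) -> List[str]:
    """List 'type:id' for entries whose unpairedMsa field is present but null."""
    data = json.loads(af3_json_text)
    missing: List[str] = []
    for entry in data.get("sequences", []):
        for mol_type, body in entry.items():
            # Ligand entries have no MSA field at all, so only explicit nulls count.
            if "unpairedMsa" in body and body["unpairedMsa"] is None:
                missing.append(f"{mol_type}:{body.get('id')}")
    return missing


if __name__ == "__main__":
    example = json.dumps({
        "name": "Complex1",
        "sequences": [
            {"protein": {"id": "A", "sequence": "MKV", "unpairedMsa": ">query\nMKV\n"}},
            {"rna": {"id": "B", "sequence": "AUGC", "unpairedMsa": None}},
            {"ligand": {"id": "C", "ccdCodes": ["ATP"]}},
        ],
    })
    print(entries_without_unpaired_msa(example))
```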

GPU-accelerated search with `colabfold_search`

ColabFold supports GPU-accelerated MSA searches through MMseqs2-GPU.

GPU database setup

To set up the GPU databases, you will need to run the setup_databases.sh command with GPU=1 as an environment variable:

GPU=1 ./setup_databases.sh /path/to/db_folder

This will download and set up the GPU databases in the specified folder. Note that here we do not set MMSEQS_NO_INDEX=1, since the indices are useful in the GPU search: they are kept in GPU memory.

GPU search

By default, running colabfold_search with the --gpu 1 option uses all available GPUs for its search.

colabfold_search --mmseqs /path/to/bin/mmseqs input_sequences.fasta /path/to/db_folder msas --gpu 1

To select specific GPUs, set the CUDA_VISIBLE_DEVICES environment variable:

CUDA_VISIBLE_DEVICES=0,1 colabfold_search --mmseqs /path/to/bin/mmseqs input_sequences.fasta /path/to/db_folder msas --gpu 1

Optional GPU server for enhanced performance:

For frequent searches or to achieve minimal latency, you can run a dedicated GPU server. This server holds databases permanently in GPU memory, largely eliminating search overhead:

Start the GPU server(s):

mmseqs gpuserver /path/to/db_folder/colabfold_envdb_202108_db --max-seqs 10000 --db-load-mode 0 --prefilter-mode 1 &
PID1=$!
mmseqs gpuserver /path/to/db_folder/uniref30_2302_db --max-seqs 10000 --db-load-mode 0 --prefilter-mode 1 &
PID2=$!

By default, the GPU server distributes the database evenly across all visible GPUs. You can limit GPU usage by setting the CUDA_VISIBLE_DEVICES environment variable (e.g., CUDA_VISIBLE_DEVICES=0,1). Important: Ensure that the CUDA_VISIBLE_DEVICES environment variable is set consistently for both gpuserver and colabfold_search; otherwise colabfold_search will wait for the gpuserver to appear until a set timeout (by default 5 minutes). If your database exceeds GPU memory capacity, the GPU server efficiently streams data between host and GPU memory using asynchronous CUDA streams.

Run searches using the GPU server:

colabfold_search --mmseqs /path/to/bin/mmseqs input_sequences.fasta /path/to/db_folder msas --gpu 1 --gpu-server 1

To stop the server(s) when done:

kill $PID1
kill $PID2

For more details, see GPU-accelerated search.
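
One way to keep CUDA_VISIBLE_DEVICES consistent between the servers and the search is to build every command from a single shared environment. The sketch below only assembles the commands shown above and does not start any process; `gpu_server_plan` is a hypothetical helper, and the commented `subprocess` calls indicate where the commands would be launched.

```python
import os
from typing import Dict, List, Tuple


def gpu_server_plan(mmseqs: str, db_folder: str, fasta: str, out_dir: str,
                    gpus: str) -> Tuple[Dict[str, str], List[List[str]], List[str]]:
    """Return (environment, gpuserver commands, colabfold_search command)."""
    # One environment for all processes, so the server and client see the same GPUs.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpus)
    servers = [
        [mmseqs, "gpuserver", f"{db_folder}/{db}", "--max-seqs", "10000",
         "--db-load-mode", "0", "--prefilter-mode", "1"]
        for db in ("colabfold_envdb_202108_db", "uniref30_2302_db")
    ]
    search = ["colabfold_search", "--mmseqs", mmseqs, fasta, db_folder, out_dir,
              "--gpu", "1", "--gpu-server", "1"]
    return env, servers, search


if __name__ == "__main__":
    env, servers, search = gpu_server_plan(
        "/path/to/bin/mmseqs", "/path/to/db_folder",
        "input_sequences.fasta", "msas", gpus="0,1")
    for cmd in servers + [search]:
        print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"], " ".join(cmd))
    # procs = [subprocess.Popen(cmd, env=env) for cmd in servers]
    # subprocess.run(search, env=env, check=True)
    # for p in procs: p.terminate()
```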

Tutorials & Presentations

ColabFold Tutorial presented at the Boston Protein Design and Modeling Club. [video] [slides].

Projects based on ColabFold or helpers

Run ColabFold on your local computer by Yoshitaka Moriwaki
ColabFold/AlphaFold2 for protein structure predictions for Discoba species by Richard John Wheeler
Cloud-based molecular simulations for everyone by Pablo R. Arantes, Marcelo D. Polêto, Conrado Pedebos and Rodrigo Ligabue-Braun
getmoonbear is a webserver to predict protein structures by Stephanie Zhang and Neil Deshmukh
ColabFold/AlphaFold2 IDR complex prediction by Balint Meszaros
ColabFold/AlphaFold2 (Phenix version) for macromolecular structure determination by Tom Terwilliger
AlphaPickle: making AlphaFold2/ColabFold outputs interpretable by Matt Arnold

Acknowledgments

We would like to thank the RoseTTAFold and AlphaFold teams for doing an excellent job open sourcing the software.
Also credit to David Koes for his awesome py3Dmol plugin, without whom these notebooks would be quite boring!
A colab by Sergey Ovchinnikov (@sokrypton), Milot Mirdita (@milot_mirdita) and Martin Steinegger (@thesteinegger).

How do I reference this work?

Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S and Steinegger M. ColabFold: Making protein folding accessible to all.
Nature Methods (2022) doi: 10.1038/s41592-022-01488-1
If you're using AlphaFold, please also cite:
Jumper et al. "Highly accurate protein structure prediction with AlphaFold."
Nature (2021) doi: 10.1038/s41586-021-03819-2
If you're using AlphaFold-multimer, please also cite:
Evans et al. "Protein complex prediction with AlphaFold-Multimer."
biorxiv (2021) doi: 10.1101/2021.10.04.463034v1
If you are using RoseTTAFold, please also cite:
Baek et al. "Accurate prediction of protein structures and interactions using a three-track neural network."
Science (2021) doi: 10.1126/science.abj8754
