seqtk

Toolkit for processing sequences in FASTA/Q formats

1,472

314

Top Related Projects

samtools

1,757

Tools (written in C using htslib) for manipulating next-generation sequencing data

gatk

1,827

Official code repository for GATK versions 4 and up

seqkit

1,438

A cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Quick Overview

Seqtk is a fast and lightweight tool for processing sequences in FASTA or FASTQ format. It provides various utilities for manipulating sequence data, including trimming, subsampling, and format conversion. Seqtk is designed to be efficient and easy to use, making it a valuable tool for bioinformatics workflows.

Pros

Fast and memory-efficient processing of large sequence files
Versatile functionality for common sequence manipulation tasks
Simple command-line interface for easy integration into pipelines
Written in C, allowing for high performance and portability

Cons

Limited to basic sequence manipulation tasks
Lacks advanced features found in more comprehensive bioinformatics suites
No graphical interface, which may be challenging for users unfamiliar with command-line tools
Documentation could be more extensive for some advanced use cases

Code Examples

Subsampling sequences:

seqtk sample -s100 input.fastq 10000 > output.fastq

This command subsamples 10,000 reads from the input FASTQ file using a seed of 100.

Converting FASTQ to FASTA:

seqtk seq -a input.fastq > output.fasta

This command converts a FASTQ file to FASTA format.
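
To make the conversion concrete, here is a minimal awk sketch of what a FASTQ-to-FASTA conversion does. The `fq2fa` helper is hypothetical and not part of seqtk; it assumes strict 4-line, uncompressed FASTQ, whereas seqtk itself also handles multi-line and gzip-compressed input.

```shell
# Hypothetical helper (not part of seqtk): strict 4-line FASTQ on stdin -> FASTA on stdout.
fq2fa() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }   # header line: replace the leading "@" with ">"
         NR % 4 == 2 { print }                      # sequence line is copied unchanged
    '                                               # "+" and quality lines are dropped
}

# Example with an inline record
printf '@r1 sample\nACGT\n+\nIIII\n' | fq2fa
```

The quality line is simply discarded, which is why the conversion cannot be reversed without the original FASTQ file.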

Trimming sequences:

seqtk trimfq -b 5 -e 10 input.fastq > trimmed_output.fastq

This command trims 5 bases from the beginning and 10 bases from the end of each sequence in the input FASTQ file.
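
As a rough illustration of fixed-length trimming, the sketch below removes a given number of characters from both ends of the sequence and quality lines. The `trim_fixed` helper is hypothetical; it assumes strict 4-line FASTQ, and emitting an empty line for reads shorter than the combined trim length is an assumption of this sketch, not documented seqtk behavior.

```shell
# Hypothetical helper (not part of seqtk). Usage: trim_fixed LEFT RIGHT < in.fq
trim_fixed() {
    awk -v b="$1" -v e="$2" '
        NR % 4 == 2 || NR % 4 == 0 {              # sequence and quality lines
            n = length($0) - b - e                # number of characters that remain
            print (n > 0 ? substr($0, b + 1, n) : "")
            next
        }
        { print }                                 # header and "+" lines pass through
    '
}

# Example: trim 2 characters from the left and 3 from the right
printf '@r1\nACGTACGTAC\n+\nABCDEFGHIJ\n' | trim_fixed 2 3
```

Sequence and quality strings are trimmed identically so that they stay the same length.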

Getting Started

To get started with Seqtk, follow these steps:

Clone the repository:

git clone https://github.com/lh3/seqtk.git

Compile the source code:
```
cd seqtk
make
```
Add the Seqtk directory to your PATH or move the executable to a directory in your PATH:
```
export PATH=$PATH:/path/to/seqtk
```
Run Seqtk commands:
```
seqtk
```
This will display the available commands and options.
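
If the last step fails, a quick check of whether the executable is actually reachable can help. The `check_seqtk` function below is a hypothetical convenience wrapper around the standard `command -v` lookup.

```shell
# Hypothetical helper: report whether a seqtk executable is reachable through PATH.
check_seqtk() {
    if command -v seqtk >/dev/null 2>&1; then
        echo "seqtk found: $(command -v seqtk)"
    else
        echo "seqtk not found on PATH" >&2
        return 1
    fi
}

check_seqtk || true
```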

Competitor Comparisons

samtools

1,757

Tools (written in C using htslib) for manipulating next-generation sequencing data

Pros of samtools

More comprehensive suite of tools for manipulating SAM/BAM/CRAM files
Supports advanced operations such as pileup generation, with variant calling handled by the companion bcftools
Widely used in bioinformatics pipelines and well-documented

Cons of samtools

Larger and more complex codebase, potentially harder to contribute to
May have a steeper learning curve for beginners
Slower for simple operations compared to seqtk's lightweight approach

Code comparison

seqtk:

while ((l = kseq_read(seq)) >= 0) {
    if (seq->qual.l)
        printf("@%s\n%s\n+\n%s\n", seq->name.s, seq->seq.s, seq->qual.s);
    else
        printf(">%s\n%s\n", seq->name.s, seq->seq.s);
}

samtools:

while (sam_read1(in, header, b) >= 0) {
    if (bam_write1(out, b) < 0) {
        fprintf(stderr, "Error writing output.\n");
        return 1;
    }
}

Both tools use C for core functionality, but samtools has a more complex API for handling SAM/BAM files. seqtk focuses on simpler sequence manipulation tasks with a more straightforward approach. samtools offers more advanced features but requires more setup and understanding of file formats.

gatk

1,827

Official code repository for GATK versions 4 and up

Pros of GATK

Comprehensive suite of tools for variant discovery and genotyping
Actively maintained with regular updates and improvements
Extensive documentation and community support

Cons of GATK

More complex to use and requires more computational resources
Steeper learning curve for beginners
Slower processing speed for large datasets

Code Comparison

seqtk:

seqtk seq -a in.fq.gz > out.fa

GATK:

gatk HaplotypeCaller \
  -R reference.fasta \
  -I input.bam \
  -O output.vcf

Key Differences

seqtk is a lightweight, fast toolkit for processing sequences in FASTA/FASTQ format
GATK is a comprehensive suite for variant discovery and genotyping
seqtk focuses on basic sequence manipulation tasks, while GATK specializes in advanced genomic analysis
seqtk is simpler to use and faster for basic tasks, while GATK offers more advanced features but with increased complexity

Use Cases

seqtk: Quick sequence format conversions, subsampling, and basic manipulations
GATK: Advanced variant calling, genotyping, and complex genomic analyses in research or clinical settings

seqkit

1,438

A cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Pros of seqkit

More comprehensive feature set, including advanced sequence manipulation and analysis tools
Better documentation and user guides
Actively maintained with frequent updates and bug fixes

Cons of seqkit

Slower performance for some basic operations compared to seqtk
Larger binary size and memory footprint

Code comparison

seqkit:

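seqkit seq -n input.fa > output.txt
seqkit stats input.fa
seqkit grep -s -p ATCG input.fa > matched.fa

seqtk:

# seqtk has no name-only output; take the first column of comp instead
seqtk comp input.fa | cut -f1 > output.txt
seqtk comp input.fa | awk '{print $1"\t"$2}'
# list names of sequences containing ATCG (forward strand only), then extract them by name
seqtk seq -l0 input.fa | awk '/^>/{n=substr($1,2);next} /ATCG/{print n}' > name.lst
seqtk subseq input.fa name.lst > matched.fa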

Both tools provide similar basic functionality, but seqkit offers more built-in features and a more user-friendly interface. seqtk often requires combining with other command-line tools for complex operations, while seqkit has many of these features integrated. However, seqtk's simplicity and speed make it preferable for certain basic tasks and in resource-constrained environments.

README

Introduction

Seqtk is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format. It seamlessly parses both FASTA and FASTQ files, either of which may optionally be gzip-compressed. To install seqtk,

git clone https://github.com/lh3/seqtk.git;
cd seqtk; make

The only library dependency is zlib.

Seqtk Examples

Convert FASTQ to FASTA:
```
  seqtk seq -a in.fq.gz > out.fa
```
Convert Illumina 1.3+ FASTQ to FASTA and mask bases with quality lower than 20 to lowercase (the 1st command line) or to N (the 2nd):
```
  seqtk seq -aQ64 -q20 in.fq > out.fa
  seqtk seq -aQ64 -q20 -n N in.fq > out.fa
```
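
The masking rule can be sketched in plain awk as follows. The `mask_lowq` helper is hypothetical and assumes strict 4-line FASTQ; the offset argument plays the role of `-Q` (64 for Illumina 1.3+ encoding), the threshold that of `-q`, and the optional mask character that of `-n`.

```shell
# Hypothetical helper (not part of seqtk). Usage: mask_lowq OFFSET THRESHOLD [MASKCHAR] < in.fq
mask_lowq() {
    awk -v off="$1" -v thr="$2" -v mc="${3:-}" '
        BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }   # character -> ASCII code
        NR % 4 == 1 { print ">" substr($0, 2) }                          # FASTA header
        NR % 4 == 2 { seq = $0 }                                         # hold the sequence until the quality line
        NR % 4 == 0 {
            out = ""
            for (i = 1; i <= length(seq); i++) {
                base = substr(seq, i, 1)
                q = ord[substr($0, i, 1)] - off                          # decoded quality score
                if (q < thr) base = (mc != "" ? mc : tolower(base))      # mask bases below the threshold
                out = out base
            }
            print out
        }'
}

# Example: qualities decode to 40, 19, 20, 2 with offset 64
printf '@r1\nACGT\n+\nhSTB\n' | mask_lowq 64 20
```

With offset 64 and threshold 20, the second and fourth bases in the example fall below the threshold and are lowercased.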
Fold long FASTA/Q lines and remove FASTA/Q comments:
```
  seqtk seq -Cl60 in.fa > out.fa
```
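
For readers who want to see what folding and comment removal amount to, here is a small awk sketch for FASTA input. The `fold_fasta` helper is hypothetical; it treats everything after the first whitespace in a header as the comment.

```shell
# Hypothetical helper (not part of seqtk). Usage: fold_fasta WIDTH < in.fa
fold_fasta() {
    awk -v w="$1" '
        function emit(   i) {                       # print the buffered sequence in lines of width w
            for (i = 1; i <= length(seq); i += w) print substr(seq, i, w)
            seq = ""
        }
        /^>/ { emit(); print $1; next }             # $1 keeps the name and drops the comment
        { seq = seq $0 }                            # join the existing sequence lines
        END { emit() }'
}

# Example: refold an 8-base sequence to width 3
printf '>s1 some comment\nACGTAC\nGT\n' | fold_fasta 3
```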
Convert multi-line FASTQ to 4-line FASTQ:
```
  seqtk seq -l0 in.fq > out.fq
```
Reverse complement FASTA/Q:
```
  seqtk seq -r in.fq > out.fq
```
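
Reverse complementing a single sequence string can be sketched as below. The `revcomp` helper is hypothetical and only maps A, C, G, T and N in either case, leaving other characters unchanged. For FASTQ input the quality string also has to be reversed, which this sketch does not do.

```shell
# Hypothetical helper (not part of seqtk). Usage: echo ACCGT | revcomp
revcomp() {
    awk 'BEGIN {
             # build the complement table from base/complement pairs
             split("A T C G G C T A a t c g g c t a N N n n", p, " ")
             for (i = 1; i < 20; i += 2) comp[p[i]] = p[i + 1]
         }
         {
             out = ""
             for (i = length($0); i >= 1; i--) {     # walk the sequence from right to left
                 c = substr($0, i, 1)
                 out = out ((c in comp) ? comp[c] : c)
             }
             print out
         }'
}

echo ACCGT | revcomp
```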
Extract sequences with names in file name.lst, one sequence name per line:
```
  seqtk subseq in.fq name.lst > out.fq
```
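
The name-based selection can be pictured with the following awk sketch. The `pick_by_name` helper is hypothetical; it assumes strict 4-line FASTQ and compares the read name up to the first whitespace against the lines of the list file.

```shell
# Hypothetical helper (not part of seqtk). Usage: pick_by_name name.lst < in.fq
pick_by_name() {
    awk -v list="$1" '
        BEGIN { while ((getline line < list) > 0) want[line] = 1 }    # load the wanted names
        NR % 4 == 1 { name = substr($1, 2); keep = (name in want) }   # decide once per record
        keep { print }                                                # print all four lines of kept records
    '
}
```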
Extract sequences in regions contained in file reg.bed:
```
  seqtk subseq in.fa reg.bed > out.fa
```
Mask regions in reg.bed to lowercase:
```
  seqtk seq -M reg.bed in.fa > out.fa
```
Subsample 10000 read pairs from two large paired FASTQ files (remember to use the same random seed to keep pairing):
```
  seqtk sample -s100 read1.fq 10000 > sub1.fq
  seqtk sample -s100 read2.fq 10000 > sub2.fq
```
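
Why the same seed keeps the pairing can be illustrated with a small reservoir-sampling sketch in awk. The `sample_fq` helper is hypothetical and is not claimed to reproduce seqtk's internals or its exact selection; it assumes strict 4-line FASTQ and that both files list the pairs in the same order.

```shell
# Hypothetical helper (not part of seqtk). Usage: sample_fq SEED K < in.fq
sample_fq() {
    awk -v seed="$1" -v k="$2" '
        BEGIN { srand(seed) }                       # same seed -> same sequence of random draws
        { rec = rec $0 "\n" }                       # accumulate the four lines of a record
        NR % 4 == 0 {
            n++
            if (n <= k) res[n] = rec                # fill the reservoir with the first k records
            else {
                j = int(rand() * n) + 1             # pick a slot in 1..n
                if (j <= k) res[j] = rec            # replace a reservoir entry with probability k/n
            }
            rec = ""
        }
        END { for (i = 1; i <= k && i <= n; i++) printf "%s", res[i] }'
}

# Usage on paired files (same seed for both):
#   sample_fq 100 10000 < read1.fq > sub1.fq
#   sample_fq 100 10000 < read2.fq > sub2.fq
```

Because the draws depend only on the seed and the record index, the same record positions are chosen in both files, so mates stay together.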
Trim low-quality bases from both ends using the Phred algorithm:
```
  seqtk trimfq in.fq > out.fq
```
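
The idea behind Phred-style (modified Mott) trimming is to keep the stretch of the read where an error threshold minus the per-base error probability sums to the highest value. The sketch below is a simplified illustration, not seqtk's implementation: the `trim_mott` helper is hypothetical, the default threshold of 0.05 and the Phred+33 encoding are assumptions of this sketch, and no minimum read length is enforced.

```shell
# Hypothetical helper (not part of seqtk). Usage: trim_mott [ERR_THRESHOLD] < in.fq
trim_mott() {
    awk -v thr="${1:-0.05}" '
        BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }   # character -> ASCII code
        NR % 4 == 1 { head = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 0 {
            len = length($0); start = 0; stop = len; tmp = 0; s = 0; max = 0
            for (i = 0; i < len; i++) {
                q = ord[substr($0, i + 1, 1)] - 33          # Phred score, assuming Phred+33
                s += thr - exp(-q / 10 * log(10))           # threshold minus error probability 10^(-q/10)
                if (s > max) { max = s; start = tmp; stop = i + 1 }   # best segment so far
                if (s < 0)   { s = 0; tmp = i + 1 }         # restart after a low-quality stretch
            }
            print head; print substr(seq, start + 1, stop - start)
            print "+";  print substr($0, start + 1, stop - start)
        }'
}

# Example: two low-quality bases (#) on each side of four high-quality bases (I)
printf '@r1\nAACGTACC\n+\n##IIII##\n' | trim_mott
```

In the example, only the four high-quality bases in the middle contribute positive scores, so the read is trimmed to that segment. A read with no positive-scoring segment is left unchanged in this sketch.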
Trim 5bp from the left end of each read and 10bp from the right end:
```
  seqtk trimfq -b 5 -e 10 in.fa > out.fa
```

Find telomere (TTAGGG)n repeats:

  seqtk telo seq.fa > telo.bed 2> telo.count
