seqkit

A cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Top Related Projects

seqtk

1,369

Toolkit for processing sequences in FASTA/Q formats

samtools

1,606

Tools (written in C using htslib) for manipulating next-generation sequencing data

Quick Overview

SeqKit is a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. It provides a comprehensive set of subcommands for processing sequences in bioinformatics workflows, and its parsing code can also be used from Go programs.

Pros

Extremely fast performance, often outperforming similar tools
Comprehensive set of sequence manipulation functions
Cross-platform compatibility (Windows, macOS, Linux)
Well-documented with extensive user guide and examples

Cons

Requires familiarity with the command line
Some advanced features (regular expressions, region syntax) have a learning curve
No graphical user interface

Code Examples

Reading and filtering sequences:

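package main

import (
    "io"
    "log"
    "os"

    // Assumption: the FASTA/Q parsing packages SeqKit builds on are
    // published as github.com/shenwei356/bio, not inside seqkit itself.
    "github.com/shenwei356/bio/seqio/fastx"
)

func main() {
    // The reader detects FASTA vs. FASTQ from the file content.
    reader, err := fastx.NewDefaultReader("input.fasta")
    if err != nil {
        log.Fatal(err)
    }

    for {
        record, err := reader.Read()
        if err != nil {
            if err == io.EOF {
                break // end of input
            }
            log.Fatal(err)
        }
        // Keep only records longer than 1000 residues.
        if len(record.Seq.Seq) > 1000 {
            os.Stdout.Write(record.Format(60)) // wrap lines at 60 columns
        }
    }
}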

Translating DNA to protein:

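package main

import (
    "fmt"
    "log"

    // Assumption: package path from the github.com/shenwei356/bio module.
    "github.com/shenwei356/bio/seq"
)

func main() {
    dna, err := seq.NewSeq(seq.DNA, []byte("ATGGCCATTA"))
    if err != nil {
        log.Fatal(err)
    }
    // Assumed argument order: codon table, frame, trim, clean,
    // allowUnknownCodon, markInitCodonAsM. Check the package docs.
    protein, err := dna.Translate(1, 1, false, false, true, false)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(string(protein.Seq))
}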

Reverse complementing a sequence:

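package main

import (
    "fmt"
    "log"

    // Assumption: package path from the github.com/shenwei356/bio module.
    "github.com/shenwei356/bio/seq"
)

func main() {
    // NewSeq validates the letters against the DNA alphabet.
    s, err := seq.NewSeq(seq.DNA, []byte("ATGCATGC"))
    if err != nil {
        log.Fatal(err)
    }
    rc := s.RevCom() // returns a new sequence; s is left unchanged
    fmt.Println(string(rc.Seq))
}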

Getting Started

To use SeqKit as a command-line tool:

Download the pre-compiled binary from the GitHub releases page.
Add the binary to your system PATH.
Run SeqKit commands, e.g., seqkit seq -n input.fasta

To use SeqKit's sequence-handling code from Go:

Install Go on your system.
Run go get github.com/shenwei356/bio (the FASTA/Q parsing packages are published in this companion module rather than in the seqkit repository itself)
Import the packages in your Go code:

import (
    "github.com/shenwei356/bio/seq"
    "github.com/shenwei356/bio/seqio/fastx"
)

Use the packages in your Go programs, following the pattern of the code examples above.

Competitor Comparisons

seqtk

1,369

Toolkit for processing sequences in FASTA/Q formats

Pros of seqtk

Lightweight and fast, with minimal dependencies
Written in C, offering excellent performance for large datasets
Simple command-line interface for basic sequence manipulation tasks

Cons of seqtk

Limited functionality compared to SeqKit
Less active development and fewer updates
Lacks some advanced features like sequence pattern matching and statistics

Code comparison

SeqKit:

seqkit stats input.fastq
seqkit seq -n input.fasta
seqkit grep -p "ATCG" input.fasta

seqtk:

seqtk comp input.fastq
seqtk seq -A input.fastq > output.fasta
seqtk subseq input.fasta regions.bed

Both tools offer basic sequence manipulation, but SeqKit provides more advanced features and a wider range of subcommands. seqtk focuses on core functionalities with a simpler interface, while SeqKit offers more comprehensive sequence analysis capabilities.

SeqKit is written in Go, providing cross-platform compatibility and easier installation, whereas seqtk is written in C, potentially offering better performance for certain operations. SeqKit's development is more active, with frequent updates and new features, while seqtk has a more stable but less frequently updated codebase.

samtools

1,606

Tools (written in C using htslib) for manipulating next-generation sequencing data

Pros of samtools

Widely adopted industry standard for manipulating SAM/BAM files
Extensive functionality for alignment processing (viewing, sorting, indexing, filtering, pileup)
Robust C implementation for high performance

Cons of samtools

Steeper learning curve for beginners
Limited support for FASTQ and other sequence formats
Requires separate installation of dependencies

Code comparison

samtools:

samtools view -bS input.sam > output.bam
samtools sort output.bam -o sorted.bam
samtools index sorted.bam

seqkit:

seqkit stats input.fastq
seqkit seq -n input.fasta
seqkit fq2fa input.fastq -o output.fasta

Key differences

seqkit focuses on FASTA/FASTQ manipulation, while samtools specializes in SAM/BAM processing
seqkit offers a more user-friendly interface for basic sequence operations
samtools provides more advanced genomic analysis tools
seqkit is written in Go, making it easier to install and use across platforms
samtools has a larger community and more extensive documentation

Both tools serve different purposes in bioinformatics workflows, with seqkit excelling at quick sequence manipulations and samtools being essential for alignment-based analyses.

README

SeqKit - a cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Documents: http://bioinf.shenwei.me/seqkit (Usage, FAQs, Tutorial, and Benchmark)
Source code: https://github.com/shenwei356/seqkit
Please cite: see the Citation section below

Features

Easy to install (download)
- Providing statically linked executable binaries for multiple platforms (Linux/Windows/macOS, amd64/arm64)
- Lightweight and out-of-the-box: no dependencies, no compilation, no configuration
- conda install -c bioconda seqkit
Easy to use
- Ultrafast (see technical-details and benchmark)
- Seamlessly parsing both FASTA and FASTQ formats
- Supporting (gzip/xz/zstd/bzip2 compressed) STDIN/STDOUT and input/output files, easily integrated into pipelines
- Reproducible results (configurable random seed in sample and shuffle)
- Supporting custom sequence ID via regular expression
- Supporting Bash/Zsh autocompletion
Versatile commands (usages and examples)
- Practical functions supported by 38 subcommands
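The reproducibility point above rests on a simple idea: a pseudo-random generator started from the same seed yields the same order every time. The following is a minimal sketch of that idea using only the Go standard library. It is not SeqKit's implementation; the function name shuffleIDs and the seed value are hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
)

// shuffleIDs returns a shuffled copy of ids. The order depends only on
// the seed, so the same seed always reproduces the same result.
// (Hypothetical helper for illustration, not part of SeqKit.)
func shuffleIDs(ids []string, seed int64) []string {
	out := append([]string(nil), ids...) // copy; leave the input untouched
	r := rand.New(rand.NewSource(seed))
	r.Shuffle(len(out), func(i, j int) { out[i], out[j] = out[j], out[i] })
	return out
}

func main() {
	ids := []string{"seq1", "seq2", "seq3", "seq4", "seq5"}
	fmt.Println(shuffleIDs(ids, 11)) // same seed ...
	fmt.Println(shuffleIDs(ids, 11)) // ... same order
	fmt.Println(shuffleIDs(ids, 12)) // a different seed may give a different order
}
```

Recording the seed alongside the output is what makes a sampling or shuffling step repeatable later.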

Installation

Go to Download Page for more download options and changelogs, or install via conda:

conda install -c bioconda seqkit

Subcommands

Category	Command	Function	Input	Strand-sensitivity	Multi-threads
Basic operation	seq	Transform sequences: extract ID/seq, filter by length/quality, remove gaps…	FASTA/Q
	stats	Simple statistics: #seqs, min/max_len, N50, Q20%, Q30%…	FASTA/Q		✓
	subseq	Get subsequences by region/gtf/bed, including flanking sequences	FASTA/Q	+ or/and -
	sliding	Extract subsequences in sliding windows	FASTA/Q	+ only
	faidx	Create the FASTA index file and extract subsequences (with more features than samtools faidx)	FASTA	+ or/and -
	translate	Translate DNA/RNA to protein sequence	FASTA/Q	+ or/and -
	watch	Monitoring and online histograms of sequence features	FASTA/Q
	scat	Real time concatenation and streaming of fastx files	FASTA/Q		✓
Format conversion	fq2fa	Convert FASTQ to FASTA format	FASTQ
	fx2tab	Convert FASTA/Q to tabular format	FASTA/Q
	fa2fq	Retrieve corresponding FASTQ records by a FASTA file	FASTA/Q	+ only
	tab2fx	Convert tabular format to FASTA/Q format	TSV
	convert	Convert FASTQ quality encoding between Sanger, Solexa and Illumina	FASTA/Q
Searching	grep	Search sequences by ID/name/sequence/sequence motifs, mismatch allowed	FASTA/Q	+ and -	partly, -m
	locate	Locate subsequences/motifs, mismatch allowed	FASTA/Q	+ and -	partly, -m
	amplicon	Extract amplicon (or specific region around it), mismatch allowed	FASTA/Q	+ and -	partly, -m
	fish	Look for short sequences in larger sequences	FASTA/Q	+ and -
Set operation	sample	Sample sequences by number or proportion	FASTA/Q
	rmdup	Remove duplicated sequences by ID/name/sequence	FASTA/Q	+ and -
	common	Find common sequences of multiple files by id/name/sequence	FASTA/Q	+ and -
	duplicate	Duplicate sequences N times	FASTA/Q
	split	Split sequences into files by id/seq region/size/parts (mainly for FASTA)	FASTA preferred
	split2	Split sequences into files by size/parts (FASTA, PE/SE FASTQ)	FASTA/Q
	head	Print first N FASTA/Q records	FASTA/Q
	head-genome	Print sequences of the first genome with common prefixes in name	FASTA/Q
	range	Print FASTA/Q records in a range (start:end)	FASTA/Q
	pair	Patch up paired-end reads from two fastq files	FASTA/Q
Edit	replace	Replace name/sequence by regular expression	FASTA/Q	+ only
	rename	Rename duplicated IDs	FASTA/Q
	concat	Concatenate sequences with same ID from multiple files	FASTA/Q	+ only
	restart	Reset start position for circular genome	FASTA/Q	+ only
	mutate	Edit sequence (point mutation, insertion, deletion)	FASTA/Q	+ only
	sana	Sanitize broken single line FASTQ files	FASTQ
Ordering	sort	Sort sequences by id/name/sequence/length	FASTA preferred
	shuffle	Shuffle sequences	FASTA preferred
BAM processing	bam	Monitoring and online histograms of BAM record features	BAM
Miscellaneous	sum	Compute message digest for all sequences in FASTA/Q files	FASTA/Q		✓
	merge-slides	Merge sliding windows generated from seqkit sliding	TSV
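The stats row lists N50 among its outputs. As a rough illustration of what that number means, the sketch below computes N50 with the common definition: the length L such that sequences of length L or longer cover at least half of all bases. Whether SeqKit uses exactly this definition and tie-breaking is an assumption; the function name n50 is hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// n50 returns the largest length L such that sequences of length >= L
// together contain at least half of the total bases.
// (Illustrative helper, not SeqKit's code.)
func n50(lengths []int) int {
	if len(lengths) == 0 {
		return 0
	}
	sorted := append([]int(nil), lengths...)
	sort.Sort(sort.Reverse(sort.IntSlice(sorted))) // longest first

	total := 0
	for _, l := range sorted {
		total += l
	}

	cum := 0
	for _, l := range sorted {
		cum += l
		if 2*cum >= total { // reached half of all bases
			return l
		}
	}
	return 0
}

func main() {
	// Total is 54 bases; 10 + 9 + 8 = 27 reaches half, so N50 is 8.
	fmt.Println(n50([]int{2, 3, 4, 5, 6, 7, 8, 9, 10}))
}
```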

Notes:

Strand-sensitivity:
- + only: only processing on the positive/forward strand.
- + and -: searching on both strands.
- + or/and -: depends on users' flags/options/arguments.
Multi-threads: the default of 4 threads is fast enough for most commands; some commands can benefit from extra threads.
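To make the "+ and -" strand note concrete, here is a minimal sketch of a both-strand search: a pattern is looked up in the sequence as given (plus strand) and in its reverse complement (minus strand). This only illustrates the concept and is not how SeqKit implements grep or locate; the names revCom and matchStrands are hypothetical, and only unambiguous upper-case DNA letters are handled.

```go
package main

import (
	"bytes"
	"fmt"
)

// complement maps an upper-case DNA base to its complement.
// Anything else becomes 'N' (simplification for this sketch).
func complement(b byte) byte {
	switch b {
	case 'A':
		return 'T'
	case 'T':
		return 'A'
	case 'C':
		return 'G'
	case 'G':
		return 'C'
	default:
		return 'N'
	}
}

// revCom returns the reverse complement of s as a new slice.
func revCom(s []byte) []byte {
	out := make([]byte, len(s))
	for i, b := range s {
		out[len(s)-1-i] = complement(b)
	}
	return out
}

// matchStrands reports whether pattern occurs on the plus strand
// (seq as given) and on the minus strand (its reverse complement).
func matchStrands(seq, pattern []byte) (plus, minus bool) {
	plus = bytes.Contains(seq, pattern)
	minus = bytes.Contains(revCom(seq), pattern)
	return plus, minus
}

func main() {
	plus, minus := matchStrands([]byte("AAACCC"), []byte("GGG"))
	fmt.Println(plus, minus) // GGG is found only on the minus strand
}
```

A "+ only" command corresponds to checking just the first of the two results.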

Citation

Wei Shen*, Botond Sipos, and Liuyang Zhao. 2024. SeqKit2: A Swiss Army Knife for Sequence and Alignment Processing. iMeta e191. doi:10.1002/imt2.191.
Wei Shen, Shuai Le, Yan Li*, and Fuquan Hu*. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLOS ONE. doi:10.1371/journal.pone.0163962.

Contributors

Wei Shen
Botond Sipos: bam, scat, fish, sana, watch.
others

Acknowledgements

We thank all users for their valuable feedback and suggestions. We thank all contributors for improving the code and documentation.

We appreciate Klaus Post for his fantastic packages (compress and pgzip), which accelerate gzip file reading and writing.

Contact

Create an issue to report bugs, propose new functions or ask for help.

License

MIT License
