Quick Overview
SeqKit is a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. It provides a comprehensive set of tools for processing sequences in bioinformatics workflows, offering both a command-line interface and a Go API.
Pros
- Extremely fast performance, often outperforming similar tools
- Comprehensive set of sequence manipulation functions
- Cross-platform compatibility (Windows, macOS, Linux)
- Well-documented with extensive user guide and examples
Cons
- Requires command-line familiarity, which can be a barrier for non-programmers
- Some advanced features may have a learning curve
- Limited graphical user interface options
Code Examples
- Reading and filtering sequences:
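    package main

    import (
        "io"
        "os"

        // Package path and constructor names are reproduced from this overview
        // and may not match the published library (the author's sequence
        // packages live in github.com/shenwei356/bio); verify before use.
        "github.com/shenwei356/seqkit/seqio"
    )

    func main() {
        reader, err := seqio.NewAutoReader("input.fasta")
        if err != nil {
            panic(err)
        }
        writer := seqio.NewFastaWriter(os.Stdout)
        defer writer.Flush()
        for {
            record, err := reader.Read()
            if err == io.EOF {
                break // end of input
            }
            if err != nil {
                panic(err)
            }
            // Keep only records longer than 1000 residues.
            if len(record.Seq.Seq) > 1000 {
                writer.Write(record)
            }
        }
    }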
- Translating DNA to protein:
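    package main

    import (
        "fmt"

        "github.com/shenwei356/seqkit/seq"
    )

    func main() {
        // Calls are as given in this overview; check signatures against the installed library.
        dna := seq.NewSeq(seq.DNA, []byte("ATGGCCATTA"))
        // Translate in frame 1 with the standard genetic code.
        protein := dna.Translate(1, seq.TranslTableStandard)
        fmt.Println(string(protein.Seq))
    }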
- Reverse complementing a sequence:
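    package main

    import (
        "fmt"

        "github.com/shenwei356/seqkit/seq"
    )

    func main() {
        // Calls are as given in this overview; check signatures against the installed library.
        s := seq.NewSeq(seq.DNA, []byte("ATGCATGC"))
        // Reverse the sequence and complement each base.
        rc := s.RevCom()
        fmt.Println(string(rc.Seq))
    }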
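The first example above depends on SeqKit's Go packages. As a dependency-free illustration of the same length filter, the following is a minimal sketch that uses only the Go standard library. The `record` type and the `parseFasta` and `filterByMinLen` helpers are hypothetical names introduced here, not part of SeqKit, and the parser handles only plain, uncompressed FASTA text held in memory.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// record is a hypothetical minimal FASTA record (not a SeqKit type).
type record struct {
	Name string
	Seq  string
}

// parseFasta splits FASTA text into records, joining wrapped sequence lines.
// Lines that appear before the first ">" header are ignored.
func parseFasta(text string) []record {
	var recs []record
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		switch {
		case line == "":
			continue
		case strings.HasPrefix(line, ">"):
			recs = append(recs, record{Name: line[1:]})
		case len(recs) > 0:
			recs[len(recs)-1].Seq += line
		}
	}
	return recs
}

// filterByMinLen keeps records whose sequence is longer than minLen.
func filterByMinLen(recs []record, minLen int) []record {
	var out []record
	for _, rec := range recs {
		if len(rec.Seq) > minLen {
			out = append(out, rec)
		}
	}
	return out
}

func main() {
	// Read the whole input from standard input; fine for a sketch,
	// but a streaming reader is preferable for large files.
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, rec := range filterByMinLen(parseFasta(string(data)), 1000) {
		fmt.Printf(">%s\n%s\n", rec.Name, rec.Seq)
	}
}
```

Run it as `go run main.go < input.fasta`. For real workloads the SeqKit command line or its library remains the better choice, since this sketch does no validation and no compressed-input handling.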
Getting Started
To use SeqKit as a command-line tool:
- Download the pre-compiled binary from the GitHub releases page.
- Add the binary to your system PATH.
- Run SeqKit commands, e.g.:
    seqkit seq -n input.fasta
To use SeqKit as a Go library:
- Install Go on your system.
- Run:
    go get github.com/shenwei356/seqkit
- Import SeqKit packages in your Go code:
    import (
        "github.com/shenwei356/seqkit/seqio"
        "github.com/shenwei356/seqkit/seq"
    )
- Use SeqKit functions in your Go programs as shown in the code examples above.
Competitor Comparisons
seqtk: Toolkit for processing sequences in FASTA/Q formats
Pros of seqtk
- Lightweight and fast, with minimal dependencies
- Written in C, offering excellent performance for large datasets
- Simple command-line interface for basic sequence manipulation tasks
Cons of seqtk
- Limited functionality compared to SeqKit
- Less active development and fewer updates
- Lacks some advanced features like sequence pattern matching and statistics
Code comparison
SeqKit:
    seqkit stats input.fastq
    seqkit seq -n input.fasta
    seqkit grep -s -p "ATCG" input.fasta
seqtk:
    seqtk comp input.fastq
    seqtk seq -a input.fastq > output.fasta
    seqtk subseq input.fasta regions.bed
Both tools offer basic sequence manipulation, but SeqKit provides more advanced features and a wider range of subcommands. seqtk focuses on core functionalities with a simpler interface, while SeqKit offers more comprehensive sequence analysis capabilities.
SeqKit is written in Go, providing cross-platform compatibility and easier installation, whereas seqtk is written in C, potentially offering better performance for certain operations. SeqKit's development is more active, with frequent updates and new features, while seqtk has a more stable but less frequently updated codebase.
samtools: Tools (written in C using htslib) for manipulating next-generation sequencing data
Pros of samtools
- Widely adopted industry standard for manipulating SAM/BAM files
- Extensive functionality for variant calling and genomic analysis
- Robust C implementation for high performance
Cons of samtools
- Steeper learning curve for beginners
- Limited support for FASTQ and other sequence formats
- Requires separate installation of dependencies
Code comparison
samtools:
    samtools view -bS input.sam > output.bam
    samtools sort output.bam -o sorted.bam
    samtools index sorted.bam
seqkit:
    seqkit stats input.fastq
    seqkit seq -n input.fasta
    seqkit fq2fa input.fastq -o output.fasta
Key differences
- seqkit focuses on FASTA/FASTQ manipulation, while samtools specializes in SAM/BAM processing
- seqkit offers a more user-friendly interface for basic sequence operations
- samtools provides more advanced genomic analysis tools
- seqkit is written in Go, making it easier to install and use across platforms
- samtools has a larger community and more extensive documentation
Both tools serve different purposes in bioinformatics workflows, with seqkit excelling at quick sequence manipulations and samtools being essential for alignment-based analyses.
README
SeqKit - a cross-platform and ultrafast toolkit for FASTA/Q file manipulation
- Try SeqKit in your browser (Tutorials and Exercises provided by sandbox.bio)
- Documents: http://bioinf.shenwei.me/seqkit (Usage, FAQs, Tutorial, and Benchmark)
- Source code: https://github.com/shenwei356/seqkit
Features
- Easy to install (download)
  - Providing statically linked executable binaries for multiple platforms (Linux/Windows/macOS, amd64/arm64)
  - Light weight and out-of-the-box, no dependencies, no compilation, no configuration
  - `conda install -c bioconda seqkit`
- Easy to use
  - Ultrafast (see technical-details and benchmark)
  - Seamlessly parsing both FASTA and FASTQ formats
  - Supporting (`gzip`/`xz`/`zstd`/`bzip2` compressed) STDIN/STDOUT and input/output file, easily integrated in pipe
  - Reproducible results (configurable rand seed in `sample` and `shuffle`)
  - Supporting custom sequence ID via regular expression
  - Supporting Bash/Zsh autocompletion
- Versatile commands (usages and examples)
  - Practical functions supported by 38 subcommands
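The "reproducible results" feature listed above rests on a general idea: when random selection is driven by a generator with a fixed seed, repeated runs pick the same records. The sketch below illustrates that idea with the Go standard library only. It is not SeqKit's implementation, and `sampleByProportion` is a hypothetical helper name introduced for this example.

```go
package main

import (
	"fmt"
	"math/rand"
)

// sampleByProportion keeps each ID with probability p. The generator is
// seeded explicitly, so the same seed and input always give the same output.
func sampleByProportion(ids []string, p float64, seed int64) []string {
	rng := rand.New(rand.NewSource(seed))
	var kept []string
	for _, id := range ids {
		// Float64 returns a value in [0, 1), so p = 0 keeps nothing
		// and p = 1 keeps everything.
		if rng.Float64() < p {
			kept = append(kept, id)
		}
	}
	return kept
}

func main() {
	ids := []string{"seq1", "seq2", "seq3", "seq4", "seq5", "seq6"}
	first := sampleByProportion(ids, 0.5, 11)
	second := sampleByProportion(ids, 0.5, 11)
	fmt.Println(first)
	fmt.Println(second) // identical to the line above: same seed, same selection
}
```

Changing the seed changes which records are chosen while keeping the expected proportion the same, which is why recording the seed alongside the command is enough to reproduce a subsample.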
Installation
Go to Download Page for more download options and changelogs, or install via conda:
    conda install -c bioconda seqkit
Subcommands
| Category | Command | Function | Input | Strand-sensitivity | Multi-threads |
|---|---|---|---|---|---|
| Basic operation | seq | Transform sequences: extract ID/seq, filter by length/quality, remove gaps… | FASTA/Q | | |
| | stats | Simple statistics: #seqs, min/max_len, N50, Q20%, Q30%… | FASTA/Q | | ✓ |
| | subseq | Get subsequences by region/gtf/bed, including flanking sequences | FASTA/Q | + or/and - | |
| | sliding | Extract subsequences in sliding windows | FASTA/Q | + only | |
| | faidx | Create the FASTA index file and extract subsequences (with more features than samtools faidx) | FASTA | + or/and - | |
| | translate | Translate DNA/RNA to protein sequence | FASTA/Q | + or/and - | |
| | watch | Monitoring and online histograms of sequence features | FASTA/Q | | |
| | scat | Real time concatenation and streaming of fastx files | FASTA/Q | | ✓ |
| Format conversion | fq2fa | Convert FASTQ to FASTA format | FASTQ | | |
| | fx2tab | Convert FASTA/Q to tabular format | FASTA/Q | | |
| | fa2fq | Retrieve corresponding FASTQ records by a FASTA file | FASTA/Q | + only | |
| | tab2fx | Convert tabular format to FASTA/Q format | TSV | | |
| | convert | Convert FASTQ quality encoding between Sanger, Solexa and Illumina | FASTA/Q | | |
| Searching | grep | Search sequences by ID/name/sequence/sequence motifs, mismatch allowed | FASTA/Q | + and - | partly, -m |
| | locate | Locate subsequences/motifs, mismatch allowed | FASTA/Q | + and - | partly, -m |
| | amplicon | Extract amplicon (or specific region around it), mismatch allowed | FASTA/Q | + and - | partly, -m |
| | fish | Look for short sequences in larger sequences | FASTA/Q | + and - | |
| Set operation | sample | Sample sequences by number or proportion | FASTA/Q | | |
| | rmdup | Remove duplicated sequences by ID/name/sequence | FASTA/Q | + and - | |
| | common | Find common sequences of multiple files by id/name/sequence | FASTA/Q | + and - | |
| | duplicate | Duplicate sequences N times | FASTA/Q | | |
| | split | Split sequences into files by id/seq region/size/parts (mainly for FASTA) | FASTA preferred | | |
| | split2 | Split sequences into files by size/parts (FASTA, PE/SE FASTQ) | FASTA/Q | | |
| | head | Print first N FASTA/Q records | FASTA/Q | | |
| | head-genome | Print sequences of the first genome with common prefixes in name | FASTA/Q | | |
| | range | Print FASTA/Q records in a range (start:end) | FASTA/Q | | |
| | pair | Patch up paired-end reads from two fastq files | FASTA/Q | | |
| Edit | replace | Replace name/sequence by regular expression | FASTA/Q | + only | |
| | rename | Rename duplicated IDs | FASTA/Q | | |
| | concat | Concatenate sequences with same ID from multiple files | FASTA/Q | + only | |
| | restart | Reset start position for circular genome | FASTA/Q | + only | |
| | mutate | Edit sequence (point mutation, insertion, deletion) | FASTA/Q | + only | |
| | sana | Sanitize broken single line FASTQ files | FASTQ | | |
| Ordering | sort | Sort sequences by id/name/sequence/length | FASTA preferred | | |
| | shuffle | Shuffle sequences | FASTA preferred | | |
| BAM processing | bam | Monitoring and online histograms of BAM record features | BAM | | |
| Miscellaneous | sum | Compute message digest for all sequences in FASTA/Q files | FASTA/Q | | ✓ |
| | merge-slides | Merge sliding windows generated from seqkit sliding | TSV | | |
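
Among the statistics listed for `stats`, N50 is the least self-explanatory. Under the conventional definition, it is the length L such that sequences of length L or longer together cover at least half of all bases. The sketch below computes it with the Go standard library. The `n50` function is a hypothetical helper written for this illustration, and SeqKit's own handling of edge cases may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// n50 returns the conventional N50 of a set of sequence lengths:
// the length at which the running total, taken from the longest
// sequence downwards, first reaches half of the total bases.
// It returns 0 for an empty input.
func n50(lengths []int) int {
	// Sort a copy in descending order so the caller's slice is untouched.
	sorted := append([]int(nil), lengths...)
	sort.Sort(sort.Reverse(sort.IntSlice(sorted)))

	total := 0
	for _, l := range sorted {
		total += l
	}

	running := 0
	for _, l := range sorted {
		running += l
		if 2*running >= total {
			return l
		}
	}
	return 0
}

func main() {
	// Total is 54 bases; 10 + 9 + 8 = 27 reaches half of that.
	fmt.Println(n50([]int{2, 3, 4, 5, 6, 7, 8, 9, 10})) // prints 8
}
```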
Notes:
- Strand-sensitivity:
  - `+ only`: only processing on the positive/forward strand.
  - `+ and -`: searching on both strands.
  - `+ or/and -`: depends on users' flags/options/arguments.
- Multi-threads: Using the default 4 threads is fast enough for most commands; some commands can benefit from extra threads.
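
To make the `+ and -` note concrete: searching both strands of a DNA sequence amounts to searching for the pattern and for its reverse complement in the forward sequence. The following is a minimal sketch of that idea using only the Go standard library. The `revCom`, `findAll` and `findBothStrands` helpers are hypothetical names for this example and do not reflect how SeqKit implements its search, which also supports mismatches and motifs.

```go
package main

import (
	"fmt"
	"strings"
)

// revCom returns the reverse complement of an upper-case DNA string.
// Bases other than A, C, G and T are returned as N.
func revCom(s string) string {
	comp := map[byte]byte{'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		c, ok := comp[s[i]]
		if !ok {
			c = 'N'
		}
		out[len(s)-1-i] = c
	}
	return string(out)
}

// findAll returns the 0-based start positions of every (possibly
// overlapping) exact occurrence of pattern in seq.
func findAll(seq, pattern string) []int {
	var pos []int
	if pattern == "" {
		return pos
	}
	for i := 0; i+len(pattern) <= len(seq); i++ {
		if strings.HasPrefix(seq[i:], pattern) {
			pos = append(pos, i)
		}
	}
	return pos
}

// findBothStrands reports matches on the forward strand (plus) and matches
// of the reverse-complemented pattern (minus), both in forward coordinates.
func findBothStrands(seq, pattern string) (plus, minus []int) {
	plus = findAll(seq, pattern)
	minus = findAll(seq, revCom(pattern))
	return plus, minus
}

func main() {
	plus, minus := findBothStrands("ATGCCGCAT", "ATG")
	fmt.Println(plus, minus) // prints [0] [6]
}
```

A pattern that equals its own reverse complement, such as `GATC`, is reported on both strands at the same position, so callers may want to deduplicate such hits.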
Citation
- Wei Shen*, Botond Sipos, and Liuyang Zhao. 2024. SeqKit2: A Swiss Army Knife for Sequence and Alignment Processing. iMeta e191. doi:10.1002/imt2.191.
- Wei Shen, Shuai Le, Yan Li*, and Fuquan Hu*. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLOS ONE. doi:10.1371/journal.pone.0163962.
Contributors
- Wei Shen
- Botond Sipos: `bam`, `scat`, `fish`, `sana`, `watch`.
- others
Acknowledgements
We thank all users for their valuable feedback and suggestions. We thank all contributors for improving the code and documentation.
We appreciate Klaus Post for his fantastic packages (compress and pgzip) which accelerate gzip file reading and writing.
Contact
Create an issue to report bugs, propose new functions or ask for help.
License