shenwei356/seqkit

A cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Quick Overview

SeqKit is a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. It provides a comprehensive set of tools for processing sequences in bioinformatics workflows, offering both a command-line interface and a Go API.

Pros

  • Extremely fast performance, often outperforming similar tools
  • Comprehensive set of sequence manipulation functions
  • Cross-platform compatibility (Windows, macOS, Linux)
  • Well-documented with extensive user guide and examples

Cons

  • Requires command-line familiarity for non-programmers
  • Some advanced features may have a learning curve
  • Limited graphical user interface options

Code Examples

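  1. Reading and filtering sequences:
package main

import (
    "io"
    "log"
    "os"

    // Assumption: the FASTA/Q parsing packages are published in the
    // companion module github.com/shenwei356/bio; verify against its docs.
    "github.com/shenwei356/bio/seqio/fastx"
)

func main() {
    reader, err := fastx.NewDefaultReader("input.fasta")
    if err != nil {
        log.Fatal(err)
    }

    for {
        record, err := reader.Read()
        if err == io.EOF {
            break // end of input
        }
        if err != nil {
            log.Fatal(err)
        }
        // Keep only sequences longer than 1000 bases.
        if len(record.Seq.Seq) > 1000 {
            os.Stdout.Write(record.Format(0)) // 0 disables line wrapping
        }
    }
}
  2. Translating DNA to protein:
package main

import (
    "fmt"
    "log"

    "github.com/shenwei356/bio/seq"
)

func main() {
    dna, err := seq.NewSeq(seq.DNA, []byte("ATGGCCATTA"))
    if err != nil {
        log.Fatal(err)
    }
    // Assumed argument order (check your installed version of the module):
    // translation table 1 (standard), frame 1, then the flags
    // trim, clean, allowUnknownCodon, markInitCodonAsM.
    protein, err := dna.Translate(1, 1, false, false, true, false)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(string(protein.Seq))
}
  3. Reverse complementing a sequence:
package main

import (
    "fmt"
    "log"

    "github.com/shenwei356/bio/seq"
)

func main() {
    s, err := seq.NewSeq(seq.DNA, []byte("ATGCATGC"))
    if err != nil {
        log.Fatal(err)
    }
    rc := s.RevCom() // returns a new sequence; s is left unchanged
    fmt.Println(string(rc.Seq))
}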

Getting Started

To use SeqKit as a command-line tool:

  1. Download the pre-compiled binary from the GitHub releases page.
  2. Add the binary to your system PATH.
  3. Run SeqKit commands, e.g., seqkit seq -n input.fasta

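To use SeqKit's sequence handling from Go code:

  1. Install Go on your system.
  2. Run go get github.com/shenwei356/bio (the sequence packages used by SeqKit are published in this companion module rather than in the seqkit repository itself; verify the path against the project documentation).
  3. Import the packages in your Go code:
import (
    "github.com/shenwei356/bio/seq"
    "github.com/shenwei356/bio/seqio/fastx"
)
  4. Call the package functions from your Go programs, as in the code examples above.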

Competitor Comparisons

seqtk: Toolkit for processing sequences in FASTA/Q formats

Pros of seqtk

  • Lightweight and fast, with minimal dependencies
  • Written in C, offering excellent performance for large datasets
  • Simple command-line interface for basic sequence manipulation tasks

Cons of seqtk

  • Limited functionality compared to SeqKit
  • Less active development and fewer updates
  • Lacks some advanced features like sequence pattern matching and statistics

Code comparison

SeqKit:

seqkit stats input.fastq
seqkit seq -n input.fasta
seqkit grep -s -p "ATCG" input.fasta

seqtk:

seqtk comp input.fastq
seqtk seq -a input.fastq > output.fasta
seqtk subseq input.fasta regions.bed

Both tools offer basic sequence manipulation, but SeqKit provides more advanced features and a wider range of subcommands. seqtk focuses on core functionalities with a simpler interface, while SeqKit offers more comprehensive sequence analysis capabilities.

SeqKit is written in Go, providing cross-platform compatibility and easier installation, whereas seqtk is written in C, potentially offering better performance for certain operations. SeqKit's development is more active, with frequent updates and new features, while seqtk has a more stable but less frequently updated codebase.

samtools: Tools (written in C using htslib) for manipulating next-generation sequencing data

Pros of samtools

  • Widely adopted industry standard for manipulating SAM/BAM files
  • Extensive functionality for alignment processing (sorting, indexing, pileup), with variant calling handled by the companion bcftools
  • Robust C implementation for high performance

Cons of samtools

  • Steeper learning curve for beginners
  • Limited support for FASTQ and other sequence formats
  • Requires separate installation of dependencies

Code comparison

samtools:

samtools view -bS input.sam > output.bam
samtools sort output.bam -o sorted.bam
samtools index sorted.bam

seqkit:

seqkit stat input.fastq
seqkit seq -n input.fasta
seqkit fq2fa input.fastq -o output.fasta

Key differences

  • seqkit focuses on FASTA/FASTQ manipulation, while samtools specializes in SAM/BAM processing
  • seqkit offers a more user-friendly interface for basic sequence operations
  • samtools provides more advanced genomic analysis tools
  • seqkit is written in Go, making it easier to install and use across platforms
  • samtools has a larger community and more extensive documentation

Both tools serve different purposes in bioinformatics workflows, with seqkit excelling at quick sequence manipulations and samtools being essential for alignment-based analyses.

README

SeqKit - a cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Features

  • Easy to install (download)
    • Providing statically linked executable binaries for multiple platforms (Linux/Windows/macOS, amd64/arm64)
    • Lightweight and out-of-the-box, no dependencies, no compilation, no configuration
    • conda install -c bioconda seqkit
  • Easy to use
    • Ultrafast (see technical-details and benchmark)
    • Seamlessly parsing both FASTA and FASTQ formats
    • Supporting (gzip/xz/zstd/bzip2 compressed) STDIN/STDOUT and input/output files, easily integrated into pipelines
    • Reproducible results (configurable rand seed in sample and shuffle)
    • Supporting custom sequence ID via regular expression
    • Supporting Bash/Zsh autocompletion
  • Versatile commands (usages and examples)

Installation

Go to Download Page for more download options and changelogs, or install via conda:

conda install -c bioconda seqkit

Subcommands

| Category | Command | Function | Input | Strand-sensitivity | Multi-threads |
|---|---|---|---|---|---|
| Basic operation | seq | Transform sequences: extract ID/seq, filter by length/quality, remove gaps… | FASTA/Q | | |
| | stats | Simple statistics: #seqs, min/max_len, N50, Q20%, Q30%… | FASTA/Q | | ✓ |
| | subseq | Get subsequences by region/gtf/bed, including flanking sequences | FASTA/Q | + or/and - | |
| | sliding | Extract subsequences in sliding windows | FASTA/Q | + only | |
| | faidx | Create the FASTA index file and extract subsequences (with more features than samtools faidx) | FASTA | + or/and - | |
| | translate | Translate DNA/RNA to protein sequence | FASTA/Q | + or/and - | |
| | watch | Monitoring and online histograms of sequence features | FASTA/Q | | |
| | scat | Real time concatenation and streaming of fastx files | FASTA/Q | | ✓ |
| Format conversion | fq2fa | Convert FASTQ to FASTA format | FASTQ | | |
| | fx2tab | Convert FASTA/Q to tabular format | FASTA/Q | | |
| | fa2fq | Retrieve corresponding FASTQ records by a FASTA file | FASTA/Q | + only | |
| | tab2fx | Convert tabular format to FASTA/Q format | TSV | | |
| | convert | Convert FASTQ quality encoding between Sanger, Solexa and Illumina | FASTA/Q | | |
| Searching | grep | Search sequences by ID/name/sequence/sequence motifs, mismatch allowed | FASTA/Q | + and - | partly, -m |
| | locate | Locate subsequences/motifs, mismatch allowed | FASTA/Q | + and - | partly, -m |
| | amplicon | Extract amplicon (or specific region around it), mismatch allowed | FASTA/Q | + and - | partly, -m |
| | fish | Look for short sequences in larger sequences | FASTA/Q | + and - | |
| Set operation | sample | Sample sequences by number or proportion | FASTA/Q | | |
| | rmdup | Remove duplicated sequences by ID/name/sequence | FASTA/Q | + and - | |
| | common | Find common sequences of multiple files by id/name/sequence | FASTA/Q | + and - | |
| | duplicate | Duplicate sequences N times | FASTA/Q | | |
| | split | Split sequences into files by id/seq region/size/parts (mainly for FASTA) | FASTA preferred | | |
| | split2 | Split sequences into files by size/parts (FASTA, PE/SE FASTQ) | FASTA/Q | | |
| | head | Print first N FASTA/Q records | FASTA/Q | | |
| | head-genome | Print sequences of the first genome with common prefixes in name | FASTA/Q | | |
| | range | Print FASTA/Q records in a range (start:end) | FASTA/Q | | |
| | pair | Patch up paired-end reads from two fastq files | FASTA/Q | | |
| Edit | replace | Replace name/sequence by regular expression | FASTA/Q | + only | |
| | rename | Rename duplicated IDs | FASTA/Q | | |
| | concat | Concatenate sequences with same ID from multiple files | FASTA/Q | + only | |
| | restart | Reset start position for circular genome | FASTA/Q | + only | |
| | mutate | Edit sequence (point mutation, insertion, deletion) | FASTA/Q | + only | |
| | sana | Sanitize broken single line FASTQ files | FASTQ | | |
| Ordering | sort | Sort sequences by id/name/sequence/length | FASTA preferred | | |
| | shuffle | Shuffle sequences | FASTA preferred | | |
| BAM processing | bam | Monitoring and online histograms of BAM record features | BAM | | |
| Miscellaneous | sum | Compute message digest for all sequences in FASTA/Q files | FASTA/Q | | ✓ |
| | merge-slides | Merge sliding windows generated from seqkit sliding | TSV | | |
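
Among the statistics listed for the stats command, N50 is the least self-explanatory. The sketch below shows one common definition in plain Go: the length L such that sequences of length L or longer together contain at least half of all bases. The function name n50 is illustrative, and SeqKit's own calculation may differ in edge cases.

```go
package main

import (
	"fmt"
	"sort"
)

// n50 returns the largest length L such that sequences of length >= L
// account for at least half of the total number of bases.
// It returns 0 for an empty input.
func n50(lengths []int) int {
	if len(lengths) == 0 {
		return 0
	}
	sorted := append([]int(nil), lengths...) // copy so the caller's slice is untouched
	sort.Sort(sort.Reverse(sort.IntSlice(sorted)))

	total := 0
	for _, l := range sorted {
		total += l
	}

	running := 0
	for _, l := range sorted {
		running += l
		if 2*running >= total { // reached half of all bases
			return l
		}
	}
	return 0
}

func main() {
	// Total is 54 bases; 10 + 9 + 8 = 27 reaches half, so N50 is 8.
	fmt.Println(n50([]int{2, 3, 4, 5, 6, 7, 8, 9, 10}))
}
```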

Notes:

  • Strand-sensitivity:
    • + only: only processing on the positive/forward strand.
    • + and -: searching on both strands.
    • + or/and -: depends on users' flags/options/arguments.
  • Multi-threads: the default of 4 threads is fast enough for most commands; some commands can benefit from extra threads.
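
The strand-sensitivity notes above distinguish commands that search only the forward strand from those that search both. A minimal Go sketch of the idea, using only the standard library: a match on the minus strand is found by searching the forward sequence for the reverse complement of the pattern. The helper names revCom and findAll are illustrative and not part of SeqKit.

```go
package main

import (
	"bytes"
	"fmt"
)

// revCom returns the reverse complement of a DNA sequence.
// Bases other than A, C, G, T (upper case) are mapped to 'N'.
func revCom(s []byte) []byte {
	rc := make([]byte, len(s))
	for i, b := range s {
		var c byte
		switch b {
		case 'A':
			c = 'T'
		case 'C':
			c = 'G'
		case 'G':
			c = 'C'
		case 'T':
			c = 'A'
		default:
			c = 'N'
		}
		rc[len(s)-1-i] = c
	}
	return rc
}

// findAll returns the 0-based start positions of every (possibly
// overlapping) exact occurrence of pat in seq.
func findAll(seq, pat []byte) []int {
	var hits []int
	if len(pat) == 0 {
		return hits
	}
	for start := 0; ; {
		i := bytes.Index(seq[start:], pat)
		if i < 0 {
			break
		}
		hits = append(hits, start+i)
		start += i + 1 // step by one to allow overlapping matches
	}
	return hits
}

func main() {
	seq := []byte("ACGTTTGCAAACG")
	pat := []byte("AAAC")
	fmt.Println("+ strand:", findAll(seq, pat))         // pattern as given
	fmt.Println("- strand:", findAll(seq, revCom(pat))) // reverse complement, GTTT
}
```

Positions for the minus-strand hits are reported in forward-strand coordinates here; a real tool also has to decide how to report strand and orientation, and how to handle mismatches and degenerate bases.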

Citation

  1. Wei Shen*, Botond Sipos, and Liuyang Zhao. 2024. SeqKit2: A Swiss Army Knife for Sequence and Alignment Processing. iMeta e191. doi:10.1002/imt2.191.
  2. Wei Shen, Shuai Le, Yan Li*, and Fuquan Hu*. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLOS ONE. doi:10.1371/journal.pone.0163962.

Acknowledgements

We thank all users for their valuable feedback and suggestions. We thank all contributors for improving the code and documentation.

We appreciate Klaus Post for his fantastic packages (compress and pgzip), which accelerate gzip file reading and writing.

Contact

Create an issue to report bugs, propose new functions or ask for help.

License

MIT License
