lh3/seqtk

Toolkit for processing sequences in FASTA/Q formats

Top Related Projects

  • samtools: Tools (written in C using htslib) for manipulating next-generation sequencing data
  • GATK: Official code repository for GATK versions 4 and up
  • seqkit: A cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Quick Overview

Seqtk is a fast and lightweight tool for processing sequences in FASTA or FASTQ format. It provides various utilities for manipulating sequence data, including trimming, subsampling, and format conversion. Seqtk is designed to be efficient and easy to use, making it a valuable tool for bioinformatics workflows.

Pros

  • Fast and memory-efficient processing of large sequence files
  • Versatile functionality for common sequence manipulation tasks
  • Simple command-line interface for easy integration into pipelines
  • Written in C, allowing for high performance and portability

Cons

  • Limited to basic sequence manipulation tasks
  • Lacks advanced features found in more comprehensive bioinformatics suites
  • No graphical interface, which may be challenging for users unfamiliar with command-line tools
  • Documentation could be more extensive for some advanced use cases

Code Examples

  1. Subsampling sequences:

    seqtk sample -s100 input.fastq 10000 > output.fastq

    This command subsamples 10,000 reads from the input FASTQ file using a random seed of 100.

  2. Converting FASTQ to FASTA:

    seqtk seq -a input.fastq > output.fasta

    This command converts a FASTQ file to FASTA format, discarding the quality scores.

  3. Trimming sequences:

    seqtk trimfq -b 5 -e 10 input.fastq > trimmed_output.fastq

    This command trims 5 bases from the beginning and 10 bases from the end of each sequence in the input FASTQ file.
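
To make the fixed-length trimming in the last example concrete, here is a minimal Python sketch of the same idea applied to a single FASTQ record. The function name `trim_record` and its defaults are hypothetical and chosen only to mirror `-b 5 -e 10`; this is not seqtk's implementation, and how seqtk treats reads shorter than the total trim is not covered here.

```python
from typing import Tuple


def trim_record(seq: str, qual: str, left: int = 5, right: int = 10) -> Tuple[str, str]:
    """Drop `left` bases from the start and `right` bases from the end.

    The quality string is cut at the same positions so that every
    remaining base keeps its own quality character.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have equal length")
    # Clamp the end position so it never falls before the start position;
    # reads shorter than left + right simply come back empty in this sketch.
    end = max(len(seq) - right, left)
    return seq[left:end], qual[left:end]


if __name__ == "__main__":
    print(trim_record("ACGTACGTACGTACGTACGT", "I" * 20))
```

The point of the sketch is that sequence and quality must always be trimmed together; trimming only one of them would produce an invalid FASTQ record.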

Getting Started

To get started with Seqtk, follow these steps:

  1. Clone the repository:

    git clone https://github.com/lh3/seqtk.git
    
  2. Compile the source code:

    cd seqtk
    make
    
  3. Add the Seqtk directory to your PATH or move the executable to a directory in your PATH:

    export PATH=$PATH:/path/to/seqtk
    
  4. Run Seqtk commands:

    seqtk
    

    This will display the available commands and options.

Competitor Comparisons

samtools: Tools (written in C using htslib) for manipulating next-generation sequencing data

Pros of samtools

  • More comprehensive suite of tools for manipulating SAM/BAM/CRAM files
  • Supports alignment-level operations such as sorting, indexing, and pileup generation (variant calling itself is handled by the companion tool bcftools)
  • Widely used in bioinformatics pipelines and well-documented

Cons of samtools

  • Larger and more complex codebase, potentially harder to contribute to
  • May have a steeper learning curve for beginners
  • Heavier than seqtk for simple FASTA/FASTQ tasks such as format conversion or subsampling

Code comparison

seqtk:

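/* kseq_read() returns the sequence length, or a negative value at end of file or on error */
while ((l = kseq_read(seq)) >= 0) {
    if (seq->qual.l)  /* quality string present: emit a FASTQ record */
        printf("@%s\n%s\n+\n%s\n", seq->name.s, seq->seq.s, seq->qual.s);
    else              /* no quality string: emit a FASTA record */
        printf(">%s\n%s\n", seq->name.s, seq->seq.s);
}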

samtools:

while (sam_read1(in, header, b) >= 0) {
    if (bam_write1(out, b) < 0) {
        fprintf(stderr, "Error writing output.\n");
        return 1;
    }
}

Both tools use C for core functionality, but samtools has a more complex API for handling SAM/BAM files. seqtk focuses on simpler sequence manipulation tasks with a more straightforward approach. samtools offers more advanced features but requires more setup and understanding of file formats.

GATK: Official code repository for GATK versions 4 and up

Pros of GATK

  • Comprehensive suite of tools for variant discovery and genotyping
  • Actively maintained with regular updates and improvements
  • Extensive documentation and community support

Cons of GATK

  • More complex to use and requires more computational resources
  • Steeper learning curve for beginners
  • Much longer run times on large datasets, since variant calling is far more computationally demanding than sequence-level manipulation

Code comparison

seqtk:

seqtk seq -a in.fq.gz > out.fa

GATK:

gatk HaplotypeCaller \
  -R reference.fasta \
  -I input.bam \
  -O output.vcf

Key Differences

  • seqtk is a lightweight, fast toolkit for processing sequences in FASTA/FASTQ format
  • GATK is a comprehensive suite for variant discovery and genotyping
  • seqtk focuses on basic sequence manipulation tasks, while GATK specializes in advanced genomic analysis
  • seqtk is simpler to use and faster for basic tasks, while GATK offers more advanced features but with increased complexity

Use Cases

  • seqtk: Quick sequence format conversions, subsampling, and basic manipulations
  • GATK: Advanced variant calling, genotyping, and complex genomic analyses in research or clinical settings

seqkit: A cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Pros of seqkit

  • More comprehensive feature set, including advanced sequence manipulation and analysis tools
  • Better documentation and user guides
  • Actively maintained with frequent updates and bug fixes

Cons of seqkit

  • Slower performance for some basic operations compared to seqtk
  • Larger binary size and memory footprint

Code comparison

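seqkit:

seqkit seq -n input.fa > output.txt
seqkit stats input.fa
seqkit grep -s -p ATCG input.fa > matched.fa

seqtk:

seqtk comp input.fa | cut -f1 > output.txt   # names only; `seqtk seq -n` sets the masking character instead
seqtk comp input.fa | cut -f1,2              # per-sequence name and length
seqtk seq -l0 input.fa | paste - - | awk -F'\t' '$2 ~ /ATCG/ {sub(/^>/, "", $1); split($1, f, " "); print f[1]}' > names.lst
seqtk subseq input.fa names.lst > matched.fa   # extract the sequences whose names were listed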

Both tools provide similar basic functionality, but seqkit offers more built-in features and a more user-friendly interface. seqtk often requires combining with other command-line tools for complex operations, while seqkit has many of these features integrated. However, seqtk's simplicity and speed make it preferable for certain basic tasks and in resource-constrained environments.
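
As an illustration of the kind of glue logic that has to live outside seqtk, the following Python sketch filters FASTA records by a sequence motif. It is only a sketch: `read_fasta` and `filter_by_motif` are hypothetical helper names, the matching ignores case by choice of this example, and neither tool's actual implementation is reproduced.

```python
import io
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, joining multi-line sequences."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_by_motif(lines: Iterable[str], motif: str) -> Iterator[Tuple[str, str]]:
    """Keep only the records whose sequence contains `motif` (case-insensitive)."""
    motif = motif.upper()
    for header, seq in read_fasta(lines):
        if motif in seq.upper():
            yield header, seq


if __name__ == "__main__":
    demo = io.StringIO(">a first\nAAAT\nCGAA\n>b\nGGGG\n")
    for header, seq in filter_by_motif(demo, "ATCG"):
        print(f">{header}\n{seq}")
```

Joining the sequence lines before searching matters: in the demo input the motif spans a line break, so a plain line-by-line `grep` on the file would miss it.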

README

Introduction

Seqtk is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format. It seamlessly parses both FASTA and FASTQ files which can also be optionally compressed by gzip. To install seqtk,

git clone https://github.com/lh3/seqtk.git;
cd seqtk; make

The only library dependency is zlib.
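
To illustrate what parsing both formats with optional gzip compression involves, here is a minimal Python sketch. It is not seqtk's parser: the helper names `open_maybe_gzip` and `read_records` are hypothetical, and the sketch assumes 4-line FASTQ records, whereas seqtk also accepts multi-line FASTQ.

```python
import gzip
from typing import IO, Iterator, Optional, Tuple


def open_maybe_gzip(path: str) -> IO[str]:
    """Open a text file, transparently decompressing it if it is gzipped."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_records(handle: IO[str]) -> Iterator[Tuple[str, str, Optional[str]]]:
    """Yield (header, sequence, quality) tuples; quality is None for FASTA."""
    line = handle.readline()
    while line:
        line = line.rstrip("\r\n")
        if line.startswith(">"):  # FASTA: sequence may span several lines
            header, chunks = line[1:], []
            line = handle.readline()
            while line and not line.startswith((">", "@")):
                chunks.append(line.strip())
                line = handle.readline()
            yield header, "".join(chunks), None
        elif line.startswith("@"):  # FASTQ: assumed to be exactly 4 lines
            header = line[1:]
            seq = handle.readline().strip()
            handle.readline()  # '+' separator line, ignored
            qual = handle.readline().strip()
            yield header, seq, qual
            line = handle.readline()
        else:  # skip blank or unexpected lines
            line = handle.readline()
```

A caller would iterate with `with open_maybe_gzip(path) as fh: for header, seq, qual in read_records(fh): ...`. The header is returned whole, including any comment after the first space.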

Seqtk Examples

  • Convert FASTQ to FASTA:

      seqtk seq -a in.fq.gz > out.fa
    
  • Convert Illumina 1.3+ FASTQ (Phred+64 quality encoding, hence -Q64) to FASTA and mask bases with quality lower than 20 to lowercase (the 1st command line) or to N (the 2nd):

      seqtk seq -aQ64 -q20 in.fq > out.fa
      seqtk seq -aQ64 -q20 -n N in.fq > out.fa
    
  • Fold long FASTA/Q lines and remove FASTA/Q comments:

      seqtk seq -Cl60 in.fa > out.fa
    
  • Convert multi-line FASTQ to 4-line FASTQ:

      seqtk seq -l0 in.fq > out.fq
    
  • Reverse complement FASTA/Q:

      seqtk seq -r in.fq > out.fq
    
  • Extract sequences with names in file name.lst, one sequence name per line:

      seqtk subseq in.fq name.lst > out.fq
    
  • Extract sequences in regions contained in file reg.bed:

      seqtk subseq in.fa reg.bed > out.fa
    
  • Mask regions in reg.bed to lowercase:

      seqtk seq -M reg.bed in.fa > out.fa
    
  • Subsample 10000 read pairs from two large paired FASTQ files (remember to use the same random seed to keep pairing):

      seqtk sample -s100 read1.fq 10000 > sub1.fq
      seqtk sample -s100 read2.fq 10000 > sub2.fq
    
  • Trim low-quality bases from both ends using the Phred algorithm:

      seqtk trimfq in.fq > out.fq
    
  • Trim 5bp from the left end of each read and 10bp from the right end:

      seqtk trimfq -b 5 -e 10 in.fa > out.fa
    
  • Find telomere (TTAGGG)n repeats:

      seqtk telo seq.fa > telo.bed 2> telo.count
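
The paired-subsampling example above depends on both runs making the same random choices. The Python sketch below shows why a shared seed keeps mates together, using reservoir sampling as one common way to draw a fixed-size sample in a single pass. The function name `reservoir_sample` is hypothetical, and seqtk's own random number generator and exact procedure are not reproduced here.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int, seed: int) -> List[T]:
    """Return k records chosen uniformly at random in one pass over the input."""
    rng = random.Random(seed)  # private generator, so the seed fully fixes the choices
    reservoir: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            reservoir.append(record)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record  # replace a previously kept record
    return reservoir


if __name__ == "__main__":
    read1 = [f"read{i}/1" for i in range(1000)]
    read2 = [f"read{i}/2" for i in range(1000)]
    sub1 = reservoir_sample(read1, 10, seed=100)
    sub2 = reservoir_sample(read2, 10, seed=100)
    # Same seed and same number of records: the same positions are kept in both files.
    print([name[:-2] for name in sub1] == [name[:-2] for name in sub2])
```

The decisions depend only on the seed and on each record's position in the file, never on its content, so two files with the same number of reads in the same order yield matching mates. With different seeds, or with files of different length, the pairing would be lost.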