csvtk

A cross-platform, efficient and practical CSV/TSV toolkit in Golang

1,089

View on GitHub

Top Related Projects

xsv

10,670

A fast CSV command line toolkit written in Rust.

tsv-utils

1,447

eBay's TSV Utilities: Command line tools for large, tabular data files. Filtering, statistics, sampling, joins and more.

csvkit

6,202

A suite of utilities for converting to and working with CSV, the king of tabular file formats.

miller

9,335

Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON

Quick Overview

csvtk is a cross-platform, efficient, and practical CSV/TSV toolkit written in Golang. It provides a set of command-line tools for handling CSV/TSV files, offering fast processing and various operations such as filtering, sorting, joining, and transforming data.

Pros

High performance due to Golang implementation and concurrent processing
Wide range of operations for CSV/TSV manipulation
Cross-platform compatibility (Windows, macOS, Linux)
Easy installation with pre-compiled binaries or via package managers

Cons

Command-line interface may have a learning curve for non-technical users
Limited GUI options for those who prefer graphical interfaces
Some advanced features may require familiarity with regular expressions
Documentation could be more comprehensive for complex use cases

Code Examples

Filtering rows based on a condition:

csvtk filter -f "age>30" input.csv > output.csv

Joining two CSV files:

csvtk join -f "id" file1.csv file2.csv > joined.csv

Converting CSV to markdown table:

csvtk csv2md input.csv > output.md

Sorting a CSV file by multiple columns:

csvtk sort -k "name,age" input.csv > sorted.csv

Getting Started

Install csvtk:
- Using homebrew (macOS/Linux):
```
brew install csvtk
```
- Download pre-compiled binary from GitHub releases

Basic usage:

csvtk -h  # Show help information
csvtk stat input.csv  # Get statistics of a CSV file
csvtk head -n 5 input.csv  # View first 5 rows of a CSV file

For more advanced usage, refer to the documentation on the GitHub repository.

Competitor Comparisons

xsv

10,670

A fast CSV command line toolkit written in Rust.

Pros of xsv

Written in Rust, offering excellent performance and memory efficiency
Provides a wide range of CSV manipulation commands, including advanced operations like joins and sampling
Supports parallel processing for faster execution on multi-core systems

Cons of xsv

Limited to CSV format only, while csvtk supports both CSV and TSV
Lacks some specialized bioinformatics-related functions found in csvtk
May have a steeper learning curve for users unfamiliar with command-line tools

Code Comparison

xsv:

xsv select name,age data.csv | xsv sort -R | xsv head -n 5

csvtk:

csvtk cut -f name,age data.csv | csvtk sort -r | csvtk head -n 5

Both tools offer similar functionality for basic CSV operations, but their syntax and command structure differ slightly. xsv uses subcommands for different operations, while csvtk uses separate commands with flags.

xsv excels in performance and advanced CSV manipulations, making it ideal for large datasets and complex operations. csvtk, on the other hand, offers broader file format support and specialized functions for bioinformatics, making it more versatile for certain use cases.

tsv-utils

1,447

eBay's TSV Utilities: Command line tools for large, tabular data files. Filtering, statistics, sampling, joins and more.

Pros of tsv-utils

Written in D, offering high performance for large datasets
Focused specifically on TSV files, providing optimized tools for this format
Includes unique utilities like tsv-append and tsv-sample

Cons of tsv-utils

Limited to TSV format, less flexible than csvtk for other delimiters
Fewer data manipulation features compared to csvtk's extensive toolkit
Requires D compiler for installation, potentially less accessible

Code comparison

csvtk:

csvtk cut -f 1,3 input.csv > output.csv
csvtk filter -f "age>30" input.csv > filtered.csv

tsv-utils:

tsv-select -f 1,3 input.tsv > output.tsv
tsv-filter --gt 3:30 input.tsv > filtered.tsv

Summary

csvtk offers broader file format support and more extensive data manipulation features, while tsv-utils excels in performance for TSV-specific tasks. csvtk is more versatile but may be slower for large datasets, whereas tsv-utils is faster but limited to TSV format. The choice between them depends on specific project requirements, file formats, and performance needs.

csvkit

6,202

A suite of utilities for converting to and working with CSV, the king of tabular file formats.

Pros of csvkit

Written in Python, making it more accessible for users familiar with the language
Extensive documentation and tutorials available
Wider range of tools and functionalities for CSV manipulation

Cons of csvkit

Generally slower performance compared to csvtk
Requires Python installation and potential dependency management
Less suitable for large-scale data processing tasks

Code comparison

csvkit example:

csvcut -c 1,3 input.csv | csvgrep -c 1 -r "^A" | csvsort -c 3 > output.csv

csvtk example:

csvtk cut -f 1,3 input.csv | csvtk grep -f 1 -r "^A" | csvtk sort -k 3 > output.csv

Both tools offer similar command-line interfaces for CSV manipulation, but csvtk generally provides more concise commands and faster execution. csvkit offers a broader range of functionalities, while csvtk focuses on performance and efficiency for common CSV operations.

csvtk is written in Go, resulting in better performance for large datasets and easier distribution as a single binary. It's more suitable for users prioritizing speed and simplicity in CSV processing tasks.

csvkit, being Python-based, integrates well with other Python tools and libraries, making it a good choice for users already working in Python environments or requiring more complex data manipulation capabilities.

miller

9,335

Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON

Pros of Miller

Supports a wider range of data formats (CSV, TSV, JSON, JSONL)
More powerful query language with awk-like syntax
Better performance for large datasets

Cons of Miller

Steeper learning curve due to its more complex syntax
Less focused on CSV-specific operations
Fewer built-in functions for common CSV tasks

Code Comparison

Miller:

mlr --csv filter '$age > 30' \
    then sort -f last_name \
    then cut -f first_name,last_name,age \
    input.csv

csvtk:

csvtk filter -f "age>30" \
    | csvtk sort -k last_name \
    | csvtk cut -f first_name,last_name,age \
    input.csv

Summary

Miller and csvtk are both powerful command-line tools for processing structured data. Miller offers more versatility in terms of supported formats and query capabilities, making it suitable for complex data manipulation tasks across various file types. However, this comes at the cost of a steeper learning curve.

csvtk, on the other hand, is more focused on CSV-specific operations and provides a simpler syntax for common tasks. It may be more approachable for users primarily working with CSV files and requiring straightforward manipulations.

The choice between the two depends on the specific use case, the variety of file formats involved, and the complexity of the required data transformations.

csvq

1,577

SQL-like query language for csv

Pros of csvq

SQL-like query language for CSV manipulation
Supports complex operations like JOIN, GROUP BY, and subqueries
Ability to create and manipulate temporary tables

Cons of csvq

Steeper learning curve due to SQL-like syntax
Less suitable for quick, simple CSV operations
Slower performance for basic tasks compared to csvtk

Code comparison

csvq:

SELECT name, age FROM data.csv WHERE age > 30 ORDER BY age DESC

csvtk:

csvtk filter -f "age>30" data.csv | csvtk sort -k age:nr | csvtk cut -f name,age

Key differences

csvq offers a more powerful, database-like approach to CSV manipulation, making it ideal for complex operations and users familiar with SQL. It excels in scenarios requiring joins, aggregations, and advanced filtering.

csvtk provides a simpler, command-line oriented toolkit for CSV processing. It's more suitable for quick, straightforward tasks and offers better performance for basic operations. Its syntax is more accessible for users without SQL knowledge.

Choose csvq for complex data analysis and transformation tasks, especially when working with multiple CSV files. Opt for csvtk when you need fast, simple CSV manipulations or when integrating CSV processing into shell scripts.

Convert designs to code with AI

Introducing Visual Copilot: A new AI model to turn Figma designs to high quality code using your components.

Try Visual Copilot

README

csvtk - a cross-platform, efficient and practical CSV/TSV toolkit

Documents: http://bioinf.shenwei.me/csvtk ( Usage, Tutorial and FAQs). ä¸æä»ç»
Source code: https://github.com/shenwei356/csvtk
Latest version:

Introduction

Similar to FASTA/Q format in field of Bioinformatics, CSV/TSV formats are basic and ubiquitous file formats in both Bioinformatics and data science.

People usually use spreadsheet software like MS Excel to process table data. However this is all by clicking and typing, which is not automated and is time-consuming to repeat, especially when you want to apply similar operations with different datasets or purposes.

You can also accomplish some CSV/TSV manipulations using shell commands, but more code is needed to handle the header line. Shell commands do not support selecting columns with column names either.

csvtk is convenient for rapid data investigation and also easy to integrate into analysis pipelines. It could save you lots of time in (not) writing Python/R scripts.

Features
Subcommands
Installation
Command-line completion
Compared to csvkit
Examples
Acknowledgements
Contact
License
Starchart

Features

Cross-platform (Linux/Windows/Mac OS X/OpenBSD/FreeBSD)
Light weight and out-of-the-box, no dependencies, no compilation, no configuration
Fast, multiple-CPUs supported (some commands)
Practical functions provided by N subcommands
Support STDIN and gzipped input/output file, easy being used in pipe
Most of the subcommands support unselecting fields and fuzzy fields, e.g. -f "-id,-name" for all fields except "id" and "name", -F -f "a.*" for all fields with prefix "a.".
Support some common plots (see usage)
~~Seamless support for data with meta line (e.g., sep=,) of separator declaration used by MS Excel~~

Subcommands

56 subcommands in total.

Information

headers: prints headers
dim: dimensions of CSV file
nrow: print number of records
ncol: print number of columns
summary: summary statistics of selected numeric or text fields (groupby group fields)
watch: online monitoring and histogram of selected field
corr: calculate Pearson correlation between numeric columns

Format conversion

pretty: converts CSV to a readable aligned table
csv2tab: converts CSV to tabular format
tab2csv: converts tabular format to CSV
space2tab: converts space delimited format to TSV
csv2md: converts CSV to markdown format
csv2rst: converts CSV to reStructuredText format
csv2json: converts CSV to JSON format
csv2xlsx: converts CSV/TSV files to XLSX file
xlsx2csv: converts XLSX to CSV format

Set operations

head: prints first N records
concat: concatenates CSV/TSV files by rows
sample: sampling by proportion
cut: select and arrange fields
grep: greps data by selected fields with patterns/regular expressions
uniq: unique data without sorting
freq: frequencies of selected fields
inter: intersection of multiple files
filter: filters rows by values of selected fields with arithmetic expression
filter2: filters rows by awk-like arithmetic/string expressions
join: join files by selected fields (inner, left and outer join)
split splits CSV/TSV into multiple files according to column values
splitxlsx: splits XLSX sheet into multiple sheets according to column values
comb: compute combinations of items at every row

Edit

fix: fix CSV/TSV with different numbers of columns in rows
fix-quotes: fix malformed CSV/TSV caused by double-quotes
del-quotes: remove extra double-quotes added by fix-quotes
add-header: add column names
del-header: delete column names
rename: renames column names with new names
rename2: renames column names by regular expression
replace: replaces data of selected fields by regular expression
round: round float to n decimal places
comma: make numbers more readable by adding commas
mutate: creates new columns from selected fields by regular expression
mutate2: creates a new column from selected fields by awk-like arithmetic/string expressions
mutate3: create a new column from selected fields with Go-like expressions
fmtdate: format date of selected fields

Transform

transpose: transposes CSV data
sep: separate column into multiple columns
gather: gather columns into key-value pairs, like tidyr::gather/pivot_longer
spread: spread a key-value pair across multiple columns, like tidyr::spread/pivot_wider
unfold: unfold multiple values in cells of a field
fold: fold multiple values of a field into cells of groups

Ordering

sort: sorts by selected fields

Ploting

plot see usage
- plot hist histogram
- plot box boxplot
- plot line line plot and scatter plot
- plot bar plot bar chart

Misc

cat stream file and report progress
version print version information and check for update
genautocomplete generate shell autocompletion script (bash|zsh|fish|powershell)

Installation

Download Page

csvtk is implemented in Go programming language, executable binary files for most popular operating systems are freely available in release page.

Method 1: Download binaries (latest stable/dev version)

Just download compressed executable file of your operating system, and decompress it with tar -zxvf *.tar.gz command or other tools. And then:

For Linux-like systems
1. If you have root privilege simply copy it to /usr/local/bin:
```
 sudo cp csvtk /usr/local/bin/
```
2. Or copy to anywhere in the environment variable PATH:
```
 mkdir -p $HOME/bin/; cp csvtk $HOME/bin/
```
For windows, just copy csvtk.exe to C:\WINDOWS\system32.

Method 2: Install via Pixi

pixi global install csvtk

Method 3: Install via conda (latest stable version)

# >= v0.31.0
conda install -c conda-forge csvtk

# <= v0.31.0
conda install -c bioconda csvtk

Method 4: Install via homebrew

brew install csvtk

Method 5: For Go developer (latest stable/dev version)

go install github.com/shenwei356/csvtk/csvtk@latest

Method 6: For ArchLinux AUR users (may be not the latest)

yaourt -S csvtk

Command-line completion

Bash:

# generate completion shell
csvtk genautocomplete --shell bash

# configure if never did.
# install bash-completion if the "complete" command is not found.
echo "for bcfile in ~/.bash_completion.d/* ; do source \$bcfile; done" >> ~/.bash_completion
echo "source ~/.bash_completion" >> ~/.bashrc

Zsh:

# generate completion shell
csvtk genautocomplete --shell zsh --file ~/.zfunc/_csvtk

# configure if never did
echo 'fpath=( ~/.zfunc "${fpath[@]}" )' >> ~/.zshrc
echo "autoload -U compinit; compinit" >> ~/.zshrc

fish:

csvtk genautocomplete --shell fish --file ~/.config/fish/completions/csvtk.fish

Compared to `csvkit`

csvkit, attention: this table wasn't updated for many years.

Features	csvtk	csvkit	Note
Read Gzip	Yes	Yes	read gzip files
Fields ranges	Yes	Yes	e.g. `-f 1-4,6`
Unselect fields	Yes	--	e.g. `-1` for excluding first column
Fuzzy fields	Yes	--	e.g. `ab*` for columns with name prefix "ab"
Reorder fields	Yes	Yes	it means `-f 1,2` is different from `-f 2,1`
Rename columns	Yes	--	rename with new name(s) or from existed names
Sort by multiple keys	Yes	Yes	bash sort like operations
Sort by number	Yes	--	e.g. `-k 1:n`
Multiple sort	Yes	--	e.g. `-k 2:r -k 1:nr`
Pretty output	Yes	Yes	convert CSV to readable aligned table
Unique data	Yes	--	unique data of selected fields
frequency	Yes	--	frequencies of selected fields
Sampling	Yes	--	sampling by proportion
Mutate fields	Yes	--	create new columns from selected fields
Replace	Yes	--	replace data of selected fields

Similar tools:

csvkit - A suite of utilities for converting to and working with CSV, the king of tabular file formats. http://csvkit.rtfd.org/
xsv - A fast CSV toolkit written in Rust.
miller - Miller is like sed, awk, cut, join, and sort for name-indexed data such as CSV and tabular JSON http://johnkerl.org/miller
tsv-utils - Command line utilities for tab-separated value files written in the D programming language.

Examples

More examples and tutorial.

Attention

By default, csvtk assumes input files have header row, if not, switch flag -H on.
By default, csvtk handles CSV files, use flag -t for tab-delimited files.
Column names should be unique.
By default, lines starting with # will be ignored, if the header row starts with #, please assign flag -C another rare symbol, e.g. $.
Do not mix use field (column) numbers and names to specify columns to operate.
The CSV parser requires all the lines have same numbers of fields/columns. Even lines with spaces will cause error. Use -I/--ignore-illegal-row to skip these lines if neccessary. You can also use "csvtk fix" to fix files with different numbers of columns in rows.
If double-quotes exist in fields not enclosed with double-quotes, e.g.,
```
 x,a "b" c,1
```
It would report error:
```
 bare `"` in non-quoted-field.
```
Please switch on the flag -l or use csvtk fix-quotes to fix it.
If somes fields have only a double-quote either in the beginning or in the end, e.g.,
```
 x,d "e","a" b c,1
```
It would report an error:
```
 extraneous or missing " in quoted-field
```
Please use csvtk fix-quotes to fix it, and use csvtk del-quotes to reset to the original format as needed.

Examples

Pretty result

 $ csvtk pretty names.csv
 id   first_name   last_name   username
 --   ----------   ---------   --------
 11   Rob          Pike        rob
 2    Ken          Thompson    ken
 4    Robert       Griesemer   gri
 1    Robert       Thompson    abc
 NA   Robert       Abel        123

 $ csvtk pretty names.csv -S 3line
 ââââââââââââââââââââââââââââââââââââââââ
  id   first_name   last_name   username
 ----------------------------------------
  11   Rob          Pike        rob
  2    Ken          Thompson    ken
  4    Robert       Griesemer   gri
  1    Robert       Thompson    abc
  NA   Robert       Abel        123
 ââââââââââââââââââââââââââââââââââââââââ

 $ csvtk pretty names.csv -S bold -w 5 -m 1-
 âââââââââ³âââââââââââââ³ââââââââââââ³âââââââââââ
 â  id   â first_name â last_name â username â
 â£ââââââââââââââââââââââââââââââââââââââââââââ«
 â  11   â    Rob     â   Pike    â   rob    â
 â£ââââââââââââââââââââââââââââââââââââââââââââ«
 â   2   â    Ken     â Thompson  â   ken    â
 â£ââââââââââââââââââââââââââââââââââââââââââââ«
 â   4   â   Robert   â Griesemer â   gri    â
 â£ââââââââââââââââââââââââââââââââââââââââââââ«
 â   1   â   Robert   â Thompson  â   abc    â
 â£ââââââââââââââââââââââââââââââââââââââââââââ«
 â  NA   â   Robert   â   Abel    â   123    â
 âââââââââ»âââââââââââââ»ââââââââââââ»âââââââââââ

Summary of selected numeric fields, supporting "group-by"

 $ cat testdata/digitals2.csv \
     | csvtk summary -i -f f4:sum,f5:sum -g f1,f2 \
     | csvtk pretty
 f1    f2     f4:sum   f5:sum
 bar   xyz    7.00     106.00
 bar   xyz2   4.00     4.00
 foo   bar    6.00     3.00
 foo   bar2   4.50     5.00

Select fields/columns (cut)
- By index: csvtk cut -f 1,2
- By names: csvtk cut -f first_name,username
- Unselect: csvtk cut -f -1,-2 or csvtk cut -f -first_name
- Fuzzy fields: csvtk cut -F -f "*_name,username"
- Field ranges: csvtk cut -f 2-4 for column 2,3,4 or csvtk cut -f -3--1 for discarding column 1,2,3
- All fields: csvtk cut -f 1- or csvtk cut -F -f "*"
Search by selected fields (grep) (matched parts will be highlighted as red)
- By exactly matching: csvtk grep -f first_name -p Robert -p Rob
- By regular expression: csvtk grep -f first_name -r -p Rob
- By pattern list: csvtk grep -f first_name -P name_list.txt
- Remore rows containing missing data (NA): csvtk grep -F -f "*" -r -p "^$" -v
Rename column names (rename and rename2)
- Setting new names: csvtk rename -f A,B -n a,b or csvtk rename -f 1-3 -n a,b,c
- Replacing with original names by regular express: csvtk rename2 -f 1- -p "(.*)" -r 'prefix_$1' for adding prefix to all column names.
Edit data with regular expression (replace)
- Remove Chinese charactors: csvtk replace -F -f "*_name" -p "\p{Han}+" -r ""
Create new column from selected fields by regular expression (mutate)
- In default, copy a column: csvtk mutate -f id
- Extract prefix of data as group name (get "A" from "A.1" as group name): csvtk mutate -f sample -n group -p "^(.+?)\." --after sample
Sort by multiple keys (sort)
- By single column : csvtk sort -k 1 or csvtk sort -k last_name
- By multiple columns: csvtk sort -k 1,2 or csvtk sort -k 1 -k 2 or csvtk sort -k last_name,age
- Sort by number: csvtk sort -k 1:n or csvtk sort -k 1:nr for reverse number
- Complex sort: csvtk sort -k region -k age:n -k id:nr
- In natural order: csvtk sort -k chr:N
Join multiple files by keys (join)
- All files have same key column: csvtk join -f id file1.csv file2.csv
- Files have different key columns: csvtk join -f "username;username;name" names.csv phone.csv adress.csv -k
Filter by numbers (filter)
- Single field: csvtk filter -f "id>0"
- Multiple fields: csvtk filter -f "1-3>0"
- Using --any to print record if any of the field satisfy the condition: csvtk filter -f "1-3>0" --any
- fuzzy fields: csvtk filter -F -f "A*!=0"
Filter rows by awk-like arithmetic/string expressions (filter2)
- Using field index: csvtk filter2 -f '$3>0'
- Using column names: csvtk filter2 -f '$id > 0'
- Both arithmetic and string expressions: csvtk filter2 -f '$id > 3 || $username=="ken"'
- More complicated: csvtk filter2 -H -t -f '$1 > 2 && $2 % 2 == 0'

Plotting

plot histogram with data of the second column:

  csvtk -t plot hist testdata/grouped_data.tsv.gz -f 2 | display

plot boxplot with data of the "GC Content" (third) column, group information is the "Group" column.

  csvtk -t plot box testdata/grouped_data.tsv.gz -g "Group" \
      -f "GC Content" --width 3 --title "Box plot" | display

plot horiz boxplot with data of the "Length" (second) column, group information is the "Group" column.

 csvtk -t plot box testdata/grouped_data.tsv.gz -g "Group" -f "Length"  \
     --height 3 --width 5 --horiz --title "Horiz box plot" | display

plot line plot with X-Y data

  csvtk -t plot line testdata/xy.tsv -x X -y Y -g Group | display

plot line plot with X-Y data (where X values are dates or other sortable values)

  cat testdata/date2value.csv \
      | csvtk gather -f 2- -k type -v count \
      | csvtk plot line --group-field type -x date -y count --data-field-x-nominal \
          -o testdata/figures/line_plot_date.png

plot scatter plot with X-Y data

  csvtk -t plot line testdata/xy.tsv -x X -y Y -g Group --scatter | display

Acknowledgements

We are grateful to Zhiluo Deng and Li Peng for suggesting features and reporting bugs.

Thanks Albert Vilella for feature suggestions, which makes csvtk feature-richã

Contact

Create an issue to report bugs, propose new functions or ask for help.

Or leave a comment.

License

MIT License

Starchart

Top Related Projects

Convert designs to code with AI

Introducing Visual Copilot: A new AI model to turn Figma designs to high quality code using your components.

Try Visual Copilot

Top Related Projects

Quick Overview

Pros

Cons

Code Examples

Getting Started

Competitor Comparisons

Pros of xsv

Cons of xsv

Code Comparison

Pros of tsv-utils

Cons of tsv-utils

Code comparison

Summary

Pros of csvkit

Cons of csvkit

Code comparison

Pros of Miller

Cons of Miller

Code Comparison

Summary

Pros of csvq

Cons of csvq

Code comparison

Key differences

Convert designs to code with AI

README

csvtk - a cross-platform, efficient and practical CSV/TSV toolkit

Introduction

Table of Contents

Features

Subcommands

Installation

Method 1: Download binaries (latest stable/dev version)

Method 2: Install via Pixi

Method 3: Install via conda (latest stable version)

Method 4: Install via homebrew

Method 5: For Go developer (latest stable/dev version)

Method 6: For ArchLinux AUR users (may be not the latest)

Command-line completion

Compared to csvkit

Examples

Acknowledgements

Contact

License

Starchart

Top Related Projects

Convert designs to code with AI

Compared to `csvkit`