Top Related Projects
- tsv-utils (eBay's TSV Utilities): Command line tools for large, tabular data files. Filtering, statistics, sampling, joins and more.
- VisiData: A terminal spreadsheet multitool for discovering and arranging data
- csvkit: A suite of utilities for converting to and working with CSV, the king of tabular file formats.
- PapaParse: Fast and powerful CSV (delimited text) parser that gracefully handles large files and malformed input
- Miller: Like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
Quick Overview
BurntSushi/xsv is a fast, cross-platform CSV command-line toolkit written in Rust. It provides a set of utilities for working with CSV files, including sorting, indexing, slicing, and performing various statistical operations. The project aims to handle large CSV files efficiently and offer a user-friendly interface for data manipulation tasks.
Pros
- High performance and efficient handling of large CSV files
- Cross-platform compatibility (Windows, macOS, Linux)
- Rich set of features for CSV manipulation and analysis
- Written in Rust, ensuring memory safety and thread safety
Cons
- Limited to command-line interface, which may not be suitable for all users
- Requires learning specific command syntax for different operations
- May have a steeper learning curve compared to GUI-based CSV tools
- Limited support for complex data transformations compared to full-fledged data processing libraries
Code Examples
Since xsv is a command-line tool rather than a code library, the examples below are shell commands rather than API calls:
# Count the number of records in a CSV file
xsv count data.csv
# Select specific columns from a CSV file
xsv select name,age data.csv > filtered.csv
# Sort a CSV file by a specific column
xsv sort -s age data.csv > sorted.csv
Getting Started
To get started with xsv, follow these steps:
1. Install xsv:
   - On macOS with Homebrew:
     brew install xsv
   - On other platforms, download the latest release from the GitHub repository
2. Basic usage:
   # View the first 5 rows of a CSV file
   xsv slice -l 5 data.csv
   # Get summary statistics for numeric columns
   xsv stats data.csv
   # Join two CSV files based on a common column
   xsv join --left id file1.csv id file2.csv > joined.csv
For more detailed information and advanced usage, refer to the project's documentation on GitHub.
Competitor Comparisons
eBay's TSV Utilities: Command line tools for large, tabular data files. Filtering, statistics, sampling, joins and more.
Pros of tsv-utils
- Written in D, offering high performance and speed
- Focuses specifically on TSV files, providing specialized tools
- Includes unique utilities like tsv-append and tsv-sample
Cons of tsv-utils
- Limited to TSV format, less flexible than xsv for other delimiters
- Smaller community and ecosystem compared to xsv
- Fewer general-purpose CSV/TSV manipulation features
Code Comparison
xsv:
xsv select 1,2,3 input.csv > output.csv
tsv-utils:
tsv-select -f 1,2,3 input.tsv > output.tsv
Summary
xsv is a versatile CSV manipulation tool written in Rust, offering broad functionality for various delimiter-separated formats. It has a larger community and more general-purpose features.
tsv-utils, written in D, specializes in TSV files and offers high performance. It includes unique utilities for specific TSV operations but is less flexible for other formats.
Both tools provide command-line interfaces for data manipulation, with similar syntax for basic operations. The choice between them depends on specific needs, such as format requirements, performance priorities, and desired features.
A terminal spreadsheet multitool for discovering and arranging data
Pros of VisiData
- Interactive TUI for data exploration and manipulation
- Supports a wide variety of file formats beyond CSV
- Powerful data analysis features like pivot tables and frequency charts
Cons of VisiData
- Steeper learning curve due to its extensive feature set
- May be slower for processing very large datasets compared to xsv
- Requires more system resources for its interactive interface
Code Comparison
VisiData example (interactive, launched from a shell):
# Open a CSV in the VisiData terminal UI
vd example.csv
xsv example (non-interactive, composable):
# Select the first two columns
xsv select 1,2 example.csv
VisiData offers a more interactive approach with its TUI, while xsv provides a command-line interface for CSV operations. VisiData is more versatile in terms of supported file formats and analysis capabilities, but xsv excels in speed and efficiency for large CSV files. The choice between the two depends on the specific use case, with VisiData being better for exploratory data analysis and xsv for quick, scriptable CSV operations.
A suite of utilities for converting to and working with CSV, the king of tabular file formats.
Pros of csvkit
- Written in Python, making it more accessible for scripting and integration with other Python tools
- Offers a wider range of CSV manipulation tools, including SQL-like operations
- Provides more extensive documentation and tutorials
Cons of csvkit
- Generally slower performance compared to xsv, especially for large datasets
- Requires Python installation and potential dependency management
- Less memory-efficient for processing very large CSV files
Code Comparison
csvkit example:
csvcut -c 1,3 data.csv | csvgrep -c 1 -m "pattern" | csvsort -c 3
xsv example:
xsv select 1,3 data.csv | xsv search -s 1 "pattern" | xsv sort -s 3
Both tools offer similar functionality for basic CSV operations, but csvkit provides more complex operations out of the box, while xsv focuses on speed and efficiency.
csvkit is better suited for users who prioritize a wide range of features and Python integration, while xsv is ideal for those who need high-performance CSV processing, especially for large datasets. The choice between the two depends on specific project requirements, performance needs, and the user's familiarity with Python or Rust ecosystems.
Fast and powerful CSV (delimited text) parser that gracefully handles large files and malformed input
Pros of PapaParse
- Browser-based parsing, suitable for client-side applications
- Supports streaming large files for memory efficiency
- More flexible with various input formats (strings, files, blobs)
Cons of PapaParse
- JavaScript-only, limiting use in other environments
- Generally slower performance compared to xsv
- Less feature-rich for complex CSV operations
Code Comparison
PapaParse (JavaScript):
Papa.parse(file, {
  complete: function(results) {
    console.log(results.data);
  }
});
xsv (Command-line):
xsv select 1,2 input.csv > output.csv
Key Differences
- xsv is a command-line tool written in Rust, focusing on performance and system-level operations
- PapaParse is a JavaScript library, ideal for web applications and in-browser CSV parsing
- xsv offers more advanced CSV manipulation features, while PapaParse provides simpler parsing with browser integration
Use Cases
- Choose PapaParse for web applications requiring client-side CSV parsing
- Opt for xsv when working with large datasets or need for high-performance CSV operations on the command line
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
Pros of Miller
- Supports multiple data formats (CSV, TSV, JSON, etc.) vs. xsv's CSV-only focus
- Offers more advanced data manipulation capabilities (e.g., complex transformations, regex)
- Provides a domain-specific language for data processing
Cons of Miller
- Generally slower performance compared to xsv for large datasets
- Steeper learning curve due to its more complex syntax and features
- Less memory-efficient for very large files
Code Comparison
Miller:
mlr --csv filter '$age > 30' \
then sort -f last_name \
then cut -f first_name,last_name,age \
input.csv
xsv:
# xsv has no numeric filter expression; the regex below only approximates
# "age > 30" (it matches any age whose first digit is 3-9)
xsv select first_name,last_name,age input.csv \
| xsv search -s age '^[3-9]' \
| xsv sort -s last_name
Both tools can perform similar operations, but Miller's syntax is more expressive for complex transformations. xsv's commands are simpler and more focused on CSV-specific operations. Miller offers more flexibility in data manipulation, while xsv excels in performance and simplicity for CSV processing tasks.
README
xsv is a command line program for indexing, slicing, analyzing, splitting and joining CSV files. Commands should be simple, fast and composable:
- Simple tasks should be easy.
- Performance trade offs should be exposed in the CLI interface.
- Composition should not come at the expense of performance.
This README contains information on how to install xsv, in addition to a quick tour of several commands.
Dual-licensed under MIT or the UNLICENSE.
Available commands
- cat - Concatenate CSV files by row or by column.
- count - Count the rows in a CSV file. (Instantaneous with an index.)
- fixlengths - Force a CSV file to have same-length records by either padding or truncating them.
- flatten - A flattened view of CSV records. Useful for viewing one record at a time. e.g., xsv slice -i 5 data.csv | xsv flatten.
- fmt - Reformat CSV data with different delimiters, record terminators or quoting rules. (Supports ASCII delimited data.)
- frequency - Build frequency tables of each column in CSV data. (Uses parallelism to go faster if an index is present.)
- headers - Show the headers of CSV data. Or show the intersection of all headers between many CSV files.
- index - Create an index for a CSV file. This is very quick and provides constant time indexing into the CSV file.
- input - Read CSV data with exotic quoting/escaping rules.
- join - Inner, outer and cross joins. Uses a simple hash index to make it fast.
- partition - Partition CSV data based on a column value.
- sample - Randomly draw rows from CSV data using reservoir sampling (i.e., use memory proportional to the size of the sample).
- reverse - Reverse order of rows in CSV data.
- search - Run a regex over CSV data. Applies the regex to each field individually and shows only matching rows.
- select - Select or re-order columns from CSV data.
- slice - Slice rows from any part of a CSV file. When an index is present, this only has to parse the rows in the slice (instead of all rows leading up to the start of the slice).
- sort - Sort CSV data.
- split - Split one CSV file into many CSV files of N chunks. (split and partition are sketched right after this list.)
- stats - Show basic types and statistics of each column in the CSV file. (i.e., mean, standard deviation, median, range, etc.)
- table - Show aligned output of any CSV data using elastic tabstops.
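As a quick taste of two commands that the tour below doesn't cover, here is a minimal sketch of split and partition. The out/ directory and the chunk size are illustrative, and the -s/--size flag and the <column> <outdir> argument order reflect my reading of the commands' built-in help:
# Split data.csv into files of 10,000 records each, written to out/
xsv split -s 10000 out/ data.csv
# Write one CSV per distinct value of the Country column
xsv partition Country out/ data.csv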
A whirlwind tour
Let's say you're playing with some of the data from the Data Science Toolkit, which contains several CSV files. Maybe you're interested in the population counts of each city in the world. So grab the data and start examining it:
$ curl -LO https://burntsushi.net/stuff/worldcitiespop.csv
$ xsv headers worldcitiespop.csv
1 Country
2 City
3 AccentCity
4 Region
5 Population
6 Latitude
7 Longitude
The next thing you might want to do is get an overview of the kind of data that appears in each column. The stats command will do this for you:
$ xsv stats worldcitiespop.csv --everything | xsv table
field       type     min           max            min_length  max_length  mean          stddev         median     mode         cardinality
Country     Unicode  ad            zw             2           2                                                   cn           234
City        Unicode  bab el ahmar  Þykkvibaer     1           91                                                  san jose     2351892
AccentCity  Unicode  Bâb el Ahmar  ïn Bou Chella  1           91                                                  San Antonio  2375760
Region      Unicode  00            Z9             0           2                                        13         04           397
Population  Integer  7             31480498       0           8           47719.570634  302885.559204  10779                   28754
Latitude    Float    -54.933333    82.483333      1           12          27.188166     21.952614      32.497222  51.15        1038349
Longitude   Float    -179.983333   180            1           14          37.08886      63.22301       35.28      23.8         1167162
The xsv table command takes any CSV data and formats it into aligned columns using elastic tabstops. You'll notice that it even gets alignment right with respect to Unicode characters.
So, this command takes about 12 seconds to run on my machine, but we can speed it up by creating an index and re-running the command:
$ xsv index worldcitiespop.csv
$ xsv stats worldcitiespop.csv --everything | xsv table
...
Which cuts it down to about 8 seconds on my machine. (And creating the index takes less than 2 seconds.)
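The index is a small sidecar file stored next to the CSV. The .idx suffix below is xsv's default naming, as far as I know:
$ ls -1 worldcitiespop.csv*
worldcitiespop.csv
worldcitiespop.csv.idx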
Notably, the same type of "statistics" command in another CSV command line toolkit takes about 2 minutes to produce similar statistics on the same data set.
Creating an index gives us more than just faster statistics gathering. It also makes slice operations extremely fast because only the sliced portion has to be parsed. For example, let's say you wanted to grab the last 10 records:
$ xsv count worldcitiespop.csv
3173958
$ xsv slice worldcitiespop.csv -s 3173948 | xsv table
Country City AccentCity Region Population Latitude Longitude
zw zibalonkwe Zibalonkwe 06 -19.8333333 27.4666667
zw zibunkululu Zibunkululu 06 -19.6666667 27.6166667
zw ziga Ziga 06 -19.2166667 27.4833333
zw zikamanas village Zikamanas Village 00 -18.2166667 27.95
zw zimbabwe Zimbabwe 07 -20.2666667 30.9166667
zw zimre park Zimre Park 04 -17.8661111 31.2136111
zw ziyakamanas Ziyakamanas 00 -18.2166667 27.95
zw zizalisari Zizalisari 04 -17.7588889 31.0105556
zw zuzumba Zuzumba 06 -20.0333333 27.9333333
zw zvishavane Zvishavane 07 79876 -20.3333333 30.0333333
These commands are instantaneous because they run in time and memory proportional to the size of the slice (which means they will scale to arbitrarily large CSV data).
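The index speeds up single-record lookups in the same way. A minimal sketch, reusing the slice -i and flatten commands listed earlier (the record offset is arbitrary):
# Parse only the record at index 1000000 and print it one field per line
$ xsv slice -i 1000000 worldcitiespop.csv | xsv flatten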
Switching gears a little bit, you might not always want to see every column in the CSV data. In this case, maybe we only care about the country, city and population. So let's take a look at 10 random rows:
$ xsv select Country,AccentCity,Population worldcitiespop.csv \
| xsv sample 10 \
| xsv table
Country AccentCity Population
cn Guankoushang
za Klipdrift
ma Ouled Hammou
fr Les Gravues
la Ban Phadèng
de Lüdenscheid 80045
qa Umm ash Shubrum
bd Panditgoan
us Appleton
ua Lukashenkivske
Whoops! It seems some cities don't have population counts. How pervasive is that?
$ xsv frequency worldcitiespop.csv --limit 5
field,value,count
Country,cn,238985
Country,ru,215938
Country,id,176546
Country,us,141989
Country,ir,123872
City,san jose,328
City,san antonio,320
City,santa rosa,296
City,santa cruz,282
City,san juan,255
AccentCity,San Antonio,317
AccentCity,Santa Rosa,296
AccentCity,Santa Cruz,281
AccentCity,San Juan,254
AccentCity,San Miguel,254
Region,04,159916
Region,02,142158
Region,07,126867
Region,03,122161
Region,05,118441
Population,(NULL),3125978
Population,2310,12
Population,3097,11
Population,983,11
Population,2684,11
Latitude,51.15,777
Latitude,51.083333,772
Latitude,50.933333,769
Latitude,51.116667,769
Latitude,51.133333,767
Longitude,23.8,484
Longitude,23.2,477
Longitude,23.05,476
Longitude,25.3,474
Longitude,23.1,459
(The xsv frequency command builds a frequency table for each column in the CSV data. This one only took 5 seconds.)
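If you only care about certain columns, the frequency table can be narrowed with the usual column selection. A small sketch, assuming frequency accepts the same -s/--select flag as the other commands:
$ xsv frequency -s Population --limit 3 worldcitiespop.csv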
So it seems that most cities do not have a population count associated with them at all. No matter; we can adjust our previous command so that it only shows rows with a population count:
$ xsv search -s Population '[0-9]' worldcitiespop.csv \
| xsv select Country,AccentCity,Population \
| xsv sample 10 \
| xsv table
Country AccentCity Population
es Barañáin 22264
es Puerto Real 36946
at Moosburg 4602
hu Hejobaba 1949
ru Polyarnyye Zori 15092
gr Kandíla 1245
is Ólafsvík 992
hu Decs 4210
bg Sliven 94252
gb Leatherhead 43544
Erk. Which country is at? No clue, but the Data Science Toolkit has a CSV file called countrynames.csv. Let's grab it and do a join so we can see which countries these are:
$ curl -LO https://gist.githubusercontent.com/anonymous/063cb470e56e64e98cf1/raw/98e2589b801f6ca3ff900b01a87fbb7452eb35c7/countrynames.csv
$ xsv headers countrynames.csv
1 Abbrev
2 Country
$ xsv join --no-case Country sample.csv Abbrev countrynames.csv | xsv table
Country AccentCity Population Abbrev Country
es Barañáin 22264 ES Spain
es Puerto Real 36946 ES Spain
at Moosburg 4602 AT Austria
hu Hejobaba 1949 HU Hungary
ru Polyarnyye Zori 15092 RU Russian Federation | Russia
gr Kandíla 1245 GR Greece
is Ólafsvík 992 IS Iceland
hu Decs 4210 HU Hungary
bg Sliven 94252 BG Bulgaria
gb Leatherhead 43544 GB Great Britain | UK | England | Scotland | Wales | Northern Ireland | United Kingdom
Whoops, now we have two columns called Country and an Abbrev column that we no longer need. This is easy to fix by re-ordering columns with the xsv select command:
$ xsv join --no-case Country sample.csv Abbrev countrynames.csv \
| xsv select 'Country[1],AccentCity,Population' \
| xsv table
Country AccentCity Population
Spain Barañáin 22264
Spain Puerto Real 36946
Austria Moosburg 4602
Hungary Hejobaba 1949
Russian Federation | Russia Polyarnyye Zori 15092
Greece Kandíla 1245
Iceland Ólafsvík 992
Hungary Decs 4210
Bulgaria Sliven 94252
Great Britain | UK | England | Scotland | Wales | Northern Ireland | United Kingdom Leatherhead 43544
Perhaps we can do this with the original CSV data? Indeed we can, because joins in xsv are fast.
$ xsv join --no-case Abbrev countrynames.csv Country worldcitiespop.csv \
| xsv select '!Abbrev,Country[1]' \
> worldcitiespop_countrynames.csv
$ xsv sample 10 worldcitiespop_countrynames.csv | xsv table
Country City AccentCity Region Population Latitude Longitude
Sri Lanka miriswatte Miriswatte 36 7.2333333 79.9
Romania livezile Livezile 26 1985 44.512222 22.863333
Indonesia tawainalu Tawainalu 22 -4.0225 121.9273
Russian Federation | Russia otar Otar 45 56.975278 48.305278
France le breuil-bois robert le Breuil-Bois Robert A8 48.945567 1.717026
France lissac Lissac B1 45.103094 1.464927
Albania lumalasi Lumalasi 46 40.6586111 20.7363889
China motzushih Motzushih 11 27.65 111.966667
Russian Federation | Russia svakino Svakino 69 55.60211 34.559785
Romania tirgu pancesti Tirgu Pancesti 38 46.216667 27.1
The !Abbrev,Country[1] syntax means, "remove the Abbrev column and remove the second occurrence of the Country column." Since we joined with countrynames.csv first, the first Country name (fully expanded) is now included in the CSV data.
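Besides !name and name[i], select also accepts positional indices and ranges. A couple of quick sketches:
# Keep only the first three columns
$ xsv select 1-3 worldcitiespop.csv | xsv headers
# Drop both coordinate columns by name
$ xsv select '!Latitude,Longitude' worldcitiespop.csv | xsv headers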
The xsv join command above takes about 7 seconds on my machine. The performance comes from constructing a very simple hash index of one of the CSV data files given. The join command does an inner join by default, but it also has left, right and full outer join support too.
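For instance, a left outer join keeps every row from the first file even when nothing matches in the second. A sketch using the --left flag (the output file name is just illustrative):
$ xsv join --left --no-case Country worldcitiespop.csv Abbrev countrynames.csv \
  > worldcitiespop_left.csv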
Installation
Binaries for Windows, Linux and macOS are available from GitHub.
If you're a macOS Homebrew user, then you can install xsv from homebrew-core:
$ brew install xsv
If you're a macOS MacPorts user, then you can install xsv from the official ports:
$ sudo port install xsv
If you're a Nix/NixOS user, you can install xsv from nixpkgs:
$ nix-env -i xsv
Alternatively, you can compile from source by installing Cargo (Rust's package manager) and installing xsv using Cargo:
$ cargo install xsv
Compiling from this repository also works similarly:
$ git clone git://github.com/BurntSushi/xsv
$ cd xsv
$ cargo build --release
Compilation will probably take a few minutes depending on your machine. The binary will end up in ./target/release/xsv.
Benchmarks
I've compiled some very rough benchmarks of various xsv commands.
Motivation
Here are several valid criticisms of this project:
- You shouldn't be working with CSV data because CSV is a terrible format.
- If your data is gigabytes in size, then CSV is the wrong storage type.
- Various SQL databases provide all of the operations available in xsv with more sophisticated indexing support. And the performance is a zillion times better.
I'm sure there are more criticisms, but the impetus for this project was a 40GB CSV file that was handed to me. I was tasked with figuring out the shape of the data inside of it and coming up with a way to integrate it into our existing system. It was then that I realized that every single CSV tool I knew about was woefully inadequate. They were just too slow or didn't provide enough flexibility. (Another project I had involved a few dozen CSV files. They were smaller than 40GB, but they were each supposed to represent the same kind of data. But they all had different column orderings and unintuitive column names. Useful CSV inspection tools were critical here, and they had to be reasonably fast.)
The key ingredients for helping me with my task were indexing, random sampling, searching, slicing and selecting columns. All of these things made dealing with 40GB of CSV data a bit more manageable (or dozens of CSV files).
Getting handed a large CSV file once was enough to launch me on this quest. From conversations I've had with others, CSV data files this large don't seem to be a rare event. Therefore, I believe there is room for a tool that has a hope of dealing with data that large.
Naming collision
This project is unrelated to another similar project with the same name: https://mj.ucw.cz/sw/xsv/