Data Organization Documentation

This document explains how to use the data organization commands in tva: sort, reverse, join, append, and split. These commands allow you to rearrange, combine, and split your data.

Introduction

Data organization involves sorting rows, combining multiple datasets, or splitting data into multiple files. These operations are essential for data preparation and pipeline construction.

Sorting & Reversing:
- sort: Sorts rows based on one or more key fields.
- reverse: Reverses the order of lines (like tac), optionally keeping the header at the top.
Combining:
- join: Joins two files based on common keys.
- append: Concatenates multiple TSV files, handling headers correctly.
Splitting:
- split: Splits a file into multiple files (by size, key, or random).

`sort` (External Sort)

The sort command sorts the lines of a TSV file based on the values in specified columns. It supports both lexicographic (string) and numeric sorting.

Basic Usage

tva sort [input_files...] [options]

--key / -k: Specify the field(s) to use as the sort key. You can use 1-based indices ( e.g., 1, 2) or ranges (e.g., 2,4-5).
--numeric / -n: Compare the key fields numerically instead of lexicographically.
--reverse / -r: Reverse the sort result (descending order).

Examples

1. Sort by a single column (Lexicographic)

Sort docs/data/us_rent_income.tsv by the NAME column (column 2):

tva sort docs/data/us_rent_income.tsv -k 2

Output (first 5 lines):

01	Alabama	income	24476	136
01	Alabama	rent	747	3
02	Alaska	income	32940	508
02	Alaska	rent	1200	13
04	Arizona	income	27517	148

2. Sort numerically

Sort docs/data/us_rent_income.tsv by the estimate column (column 4) numerically:

tva sort docs/data/us_rent_income.tsv -k 4 -n

Output (first 5 lines):

GEOID	NAME	variable	estimate	moe
05	Arkansas	rent	709	5
01	Alabama	rent	747	3
04	Arizona	rent	972	4
02	Alaska	rent	1200	13

3. Sort by multiple columns

Sort first by GEOID (column 1), then by NAME (column 2):

tva sort docs/data/us_rent_income.tsv -k 1,2

`reverse` (Reverse Lines)

The reverse command reverses the order of lines in the input. This is similar to the Unix tac command but includes features specifically for tabular data, such as header preservation.

Basic Usage

tva reverse [input_files...] [options]

--header / -H: Treat the first line as a header and keep it at the top of the output.

Examples

Reverse a file keeping the header

Reverse docs/data/us_rent_income.tsv but keep the header line at the top:

tva reverse docs/data/us_rent_income.tsv --header

Output (first 5 lines):

GEOID	NAME	variable	estimate	moe
06	California	rent	1358	3
06	California	income	29454	109
05	Arkansas	rent	709	5
05	Arkansas	income	23789	165

tva join -H --filter-file docs/data/who.tsv --key-fields iso3 --data-fields country docs/data/world_bank_pop.tsv

Output:

country	indicator	2000	2001
AFG	SP.URB.TOTL	4436311	4648139
AFG	SP.URB.GROW	3.91	4.66

2. Append fields from the filter file

To add the year column from who.tsv to the output:

tva join -H --filter-file docs/data/who.tsv -k iso3 -d country --append-fields year docs/data/world_bank_pop.tsv

Output:

country	indicator	2000	2001	year
AFG	SP.URB.TOTL	4436311	4648139	1980
AFG	SP.URB.GROW	3.91	4.66	1980

tva append -H docs/data/world_bank_pop.tsv docs/data/world_bank_pop.tsv

Output:

country	indicator	2000	2001
ABW	SP.URB.TOTL	42444	43048
ABW	SP.URB.GROW	1.18	1.41
AFG	SP.URB.TOTL	4436311	4648139
AFG	SP.URB.GROW	3.91	4.66
ABW	SP.URB.TOTL	42444	43048
ABW	SP.URB.GROW	1.18	1.41
AFG	SP.URB.TOTL	4436311	4648139
AFG	SP.URB.GROW	3.91	4.66

2. Track source file

Add a column indicating the source file:

tva append -H --track-source docs/data/world_bank_pop.tsv

Output:

file	country	indicator	2000	2001
world_bank_pop	ABW	SP.URB.TOTL	42444	43048
world_bank_pop	ABW	SP.URB.GROW	1.18	1.41
...

`split`

Splits TSV rows into multiple output files.

Usage

Split file.tsv into multiple files with 1000 lines each:

tva split --lines-per-file 1000 --header-in-out file.tsv

This will create files like file_0001.tsv, file_0002.tsv, etc., each containing up to 1000 data rows (plus the header in each file if --header-in-out is used).

tva Documentation

Data Organization Documentation

Introduction

`sort` (External Sort)

Basic Usage

Examples

1. Sort by a single column (Lexicographic)

2. Sort numerically

3. Sort by multiple columns

`reverse` (Reverse Lines)

Basic Usage

Examples

Reverse a file keeping the header

`join`

Examples

1. Join two files by a common key

2. Append fields from the filter file

`append`

Examples

1. Concatenate files with headers

2. Track source file

`split`

Usage

Keyboard shortcuts

tva Documentation