Selection & Filtering Documentation
This document explains how to use the selection, filtering, and sampling commands in tva:
select, filter, slice, and sample. These commands allow you to subset your
data based on structure (columns), values (rows), position (index), or randomly.
Introduction
Data analysis often begins with selecting the relevant subset of data:
select: Selects and reorders columns (e.g., “keep onlynameandemail”).filter: Selects rows where a condition is true (e.g., “keep rows whereage > 30”).slice: Selects rows by their position (index) in the file (e.g., “keep rows 10-20”).sample: Randomly selects a subset of rows.
Field Syntax
All tools use a unified syntax to identify fields (columns). See Field Syntax Documentation for details.
select (Column Selection)
The select command allows you to keep only specific columns and reorder them.
Basic Usage
tva select [input_files...] --fields <columns>
--fields/-f: Comma-separated list of columns to select.- Names:
name,email - Indices:
1,3(1-based) - Ranges:
1-3,start_col-end_col - Wildcards:
user_*,*_id
- Names:
Examples
1. Select by Name and Index
Consider the dataset docs/data/us_rent_income.tsv:
GEOID NAME variable estimate moe
01 Alabama income 24476 136
01 Alabama rent 747 3
02 Alaska income 32940 508
...
To keep only the state name (NAME) and the estimate value (estimate):
tva select docs/data/us_rent_income.tsv -f NAME,estimate
Output:
NAME estimate
Alabama 24476
Alabama 747
Alaska 32940
...
2. Reorder Columns
You can change the order of columns. Let’s move variable to the first column:
tva select docs/data/us_rent_income.tsv -f variable,estimate,NAME
Output:
variable estimate NAME
income 24476 Alabama
rent 747 Alabama
income 32940 Alaska
...
3. Select by Range and Wildcard
Consider docs/data/billboard.tsv which has many week columns (wk1, wk2, wk3…):
artist track wk1 wk2 wk3
2 Pac Baby Don't Cry 87 82 72
2Ge+her The Hardest Part 91 87 92
To select the artist, track, and all week columns:
tva select docs/data/billboard.tsv -f artist,track,wk*
Or using a range (if you know the indices):
tva select docs/data/billboard.tsv -f 1-2,3-5
filter (Row Filtering)
The filter command selects rows where a condition is true. It supports field-based tests,
expressions, empty/blank checks, and field-to-field comparisons.
Basic Usage
tva filter [input_files...] [options]
Filter tests can be combined (default is AND logic, use --or for OR logic).
Filter Types
1. Expression Filter
Use the -E option to filter with an expression:
tva filter docs/data/us_rent_income.tsv -H -E '@estimate > 30000'
2. Empty/Blank Checks
--empty <field>: True if the field is empty (no characters)--not-empty <field>: True if the field is not empty--blank <field>: True if the field is empty or all whitespace--not-blank <field>: True if the field contains a non-whitespace character
tva filter docs/data/us_rent_income.tsv --not-empty NAME
3. Numeric Comparison
Format: --<op> <field>:<value>
--eq,--ne,--gt,--ge,--lt,--le
tva filter docs/data/us_rent_income.tsv --gt estimate:30000
Output:
GEOID NAME variable estimate moe
02 Alaska income 32940 508
04 Arizona income 31614 242
06 California income 33095 172
...
4. String Comparison
--str-eq,--str-ne: String equality/inequality--str-gt,--str-ge,--str-lt,--str-le: String ordering--istr-eq,--istr-ne: Case-insensitive string comparison--str-in-fld,--str-not-in-fld: Substring test--istr-in-fld,--istr-not-in-fld: Case-insensitive substring test
tva filter docs/data/us_rent_income.tsv --str-eq variable:rent
Output:
GEOID NAME variable estimate moe
01 Alabama rent 747 3
02 Alaska rent 1200 13
04 Arizona rent 976 4
...
5. Regular Expression
--regex <field>:<pattern>: Field matches regex--iregex <field>:<pattern>: Case-insensitive regex match--not-regex <field>:<pattern>: Field does not match regex--not-iregex <field>:<pattern>: Case-insensitive non-match
tva filter docs/data/billboard.tsv --regex track:"Baby"
Output:
artist track wk1 wk2 wk3
2 Pac Baby Don't Cry 87 82 72
Beenie Man Girls Dem Sugar 87 70 63
...
6. Length Comparison
--char-len-eq,--char-len-ne,--char-len-gt,--char-len-ge,--char-len-lt,--char-len-le: Character length--byte-len-eq,--byte-len-ne,--byte-len-gt,--byte-len-ge,--byte-len-lt,--byte-len-le: Byte length
tva filter docs/data/billboard.tsv --char-len-gt track:10
7. Field Type Checks
--is-numeric <field>: True if field can be parsed as a number--is-finite <field>: True if field is numeric and finite--is-nan <field>: True if field is NaN--is-infinity <field>: True if field is positive or negative infinity
tva filter docs/data/us_rent_income.tsv --is-numeric estimate
8. Field-to-Field Comparison
--ff-eq,--ff-ne,--ff-lt,--ff-le,--ff-gt,--ff-ge: Numeric field-to-field--ff-str-eq,--ff-str-ne: String field-to-field--ff-istr-eq,--ff-istr-ne: Case-insensitive string field-to-field--ff-absdiff-le <f1>:<f2>:<num>: Absolute difference <= NUM--ff-absdiff-gt <f1>:<f2>:<num>: Absolute difference > NUM--ff-reldiff-le <f1>:<f2>:<num>: Relative difference <= NUM--ff-reldiff-gt <f1>:<f2>:<num>: Relative difference > NUM
tva filter docs/data/us_rent_income.tsv --ff-gt estimate:moe
Common Options
--or: Evaluate tests as OR instead of AND-v,--invert: Invert the filter, selecting non-matching rows-c,--count: Print only the count of matching data rows--label <header>: Label matched records instead of filtering (outputs 1/0)--label-values <pass:fail>: Custom values for –label (default: 1:0)
slice (Row Selection by Index)
The slice command selects rows based on their integer index (position). Indices are 1-based.
Basic Usage
tva slice [input_files...] --rows <range> [options]
--rows/-r: The range of rows to keep (e.g.,1-10,5,100-). Can be specified multiple times.--invert/-v: Invert selection (drop the specified rows).--header/-H: Always preserve the first row (header).
Examples
1. Keep Specific Range (Head/Body)
To inspect the first 5 rows of docs/data/billboard.tsv (including header):
tva slice docs/data/billboard.tsv -r 1-5
Output:
artist track wk1 wk2 wk3
2 Pac Baby Don't Cry 87 82 72
2Ge+her The Hardest Part 91 87 92
...
2. Drop Header (Data Only)
Sometimes you want to process data without the header. You can drop the first row using --invert:
tva slice docs/data/billboard.tsv -r 1 --invert
Output:
2 Pac Baby Don't Cry 87 82 72
2Ge+her The Hardest Part 91 87 92
...
3. Keep Header and Specific Data Rows
To keep the header (row 1) and a slice of data from the middle (rows 10-15), use the -H flag:
tva slice docs/data/us_rent_income.tsv -H -r 10-15
This ensures the first line is always printed, even if it’s not in the range 10-15.
sample (Random Sampling)
The sample command randomly selects a subset of rows. This is useful for exploring large datasets.
Basic Usage
tva sample [input_files...] [options]
--rate/-r: Sampling rate (probability 0.0-1.0). (Bernoulli sampling)--n/-n: Exact number of rows to sample. (Reservoir sampling)--seed/-s: Random seed for reproducibility.
Examples
1. Sample by Rate
To keep approximately 10% of the rows from docs/data/us_rent_income.tsv:
tva sample docs/data/us_rent_income.tsv -r 0.1
2. Sample Exact Number
To pick exactly 5 random rows for inspection:
tva sample docs/data/us_rent_income.tsv -n 5
Output (example):
GEOID NAME variable estimate moe
35 New Mexico rent 809 11
55 Wisconsin income 32018 247
18 Indiana rent 782 5
...