uniq
Deduplicates TSV rows from one or more files without sorting.
Behavior:
- Keeps a 64-bit hash for each unique key; ~8 bytes of memory per unique row.
- Only the first occurrence of each key is kept by default.
- Use --repeated/-r to output only lines that are repeated.
- Use --at-least/-a to output only lines repeated at least N times.
- Use --max/-m to limit the number of occurrences output per key.
- Use --equiv/-e to append equivalence class IDs.
- Use --number/-z to append occurrence numbers for each key.
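The hash-based approach described above can be sketched in Python. This is an illustration under stated assumptions, not tva's implementation: the hash function (BLAKE2b truncated to 64 bits), the function names, and the exact emit rules for repeated rows and annotations are all hypothetical. In particular, the sketch assumes a repeated row is emitted once, at the occurrence where its count reaches the threshold, and that annotation passes every row through.

```python
import hashlib


def key_hash(key: str) -> int:
    """64-bit hash of a key (BLAKE2b truncated to 8 bytes; an illustrative choice)."""
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def dedup(rows):
    """Yield the first occurrence of each row, preserving input order."""
    seen = set()  # one hash per unique key; the rows themselves are not stored
    for row in rows:
        h = key_hash(row)
        if h not in seen:
            seen.add(h)
            yield row


def repeated(rows, at_least=2):
    """Yield a row once, when its occurrence count reaches at_least (assumed rule)."""
    counts = {}
    for row in rows:
        h = key_hash(row)
        counts[h] = counts.get(h, 0) + 1
        if counts[h] == at_least:
            yield row


def annotate(rows):
    """Append an equivalence class ID and an occurrence number to every row."""
    class_ids = {}  # hash -> class ID, assigned in first-seen order starting at 1
    counts = {}     # hash -> occurrences seen so far
    for row in rows:
        h = key_hash(row)
        class_id = class_ids.setdefault(h, len(class_ids) + 1)
        counts[h] = counts.get(h, 0) + 1
        yield f"{row}\t{class_id}\t{counts[h]}"
```

Because only hashes are kept, no sort is needed and memory grows with the number of unique keys rather than with input size. The trade-off of any hash-only scheme is that two distinct keys with the same 64-bit hash would be treated as duplicates.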
Input:
- Reads from files or standard input.
- Files ending in .gz are transparently decompressed.
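Transparent decompression of this kind usually comes down to choosing the opener from the file extension. The following is a minimal sketch of that idea, not tva's code; the helper name open_input and the use of None to mean standard input are assumptions made for the example.

```python
import gzip
import sys


def open_input(path=None):
    """Return a text stream for a TSV input.

    path=None means standard input (a convention assumed for this sketch).
    Paths ending in .gz are decompressed on the fly; others are read as-is.
    """
    if path is None:
        return sys.stdin
    if path.endswith(".gz"):
        # "rt" makes gzip.open return decoded text lines instead of bytes
        return gzip.open(path, "rt", encoding="utf-8", newline="")
    return open(path, "r", encoding="utf-8", newline="")
```

Callers can then iterate over lines the same way for both kinds of file, for example `with open_input("data.tsv.gz") as f: ...`.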
Output:
- By default, output is written to standard output.
- Use --outfile/-o to write to a file instead.
Header behavior:
- Supports --header/-H and --header-hash1 modes.
- When using header mode with multiple files, only the header from the first file is written; headers from subsequent files are skipped.
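The multi-file header rule can be illustrated with a short Python sketch. It assumes each input is an iterable of lines whose first line is the header. The function name is hypothetical, and the sketch covers only the plain header case, not --header-hash1.

```python
def merge_with_header(files):
    """Yield the first file's header once, then the data rows of every file.

    `files` is an iterable of line iterables, one per input file.
    """
    header_written = False
    for lines in files:
        it = iter(lines)
        header = next(it, None)
        if header is None:
            continue  # empty input: no header and no rows
        if not header_written:
            yield header  # only the first header reaches the output
            header_written = True
        yield from it  # data rows; later headers were consumed and dropped
```

In a full pipeline the header line would bypass deduplication and only the data rows would be checked against the set of seen keys.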
Field syntax:
- Use --fields/-f to specify columns to use as the deduplication key.
- Use 0 to indicate the entire line should be used as the key (default behavior).
- Field lists support 1-based indices, ranges (1-3,5-7), header names, name ranges (run-user_time), and wildcards (*_time).
- Run tva --help-fields for a full description shared across tva commands.
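To make the field syntax concrete, here is a rough Python sketch of how a field list might be resolved into column positions and used to build a key. It is an approximation under assumptions, not tva's parser: the function names are hypothetical, name ranges such as run-user_time are deliberately left out, and wildcard matching is delegated to Python's fnmatch rules, which may differ from tva's.

```python
from fnmatch import fnmatchcase


def resolve_fields(spec, header=None):
    """Resolve a field list to zero-based column indices.

    Returns None for "0", meaning the whole line is the key.
    Handles 1-based indices, numeric ranges, header names, and wildcards.
    """
    if spec.strip() == "0":
        return None
    indices = []
    for part in (p.strip() for p in spec.split(",")):
        lo, sep, hi = part.partition("-")
        if part.isdigit():
            if int(part) == 0:
                raise ValueError("0 cannot be combined with other fields")
            indices.append(int(part) - 1)
        elif sep and lo.isdigit() and hi.isdigit():
            # inclusive 1-based range "lo-hi" -> zero-based indices
            indices.extend(range(int(lo) - 1, int(hi)))
        elif header is not None:
            matches = [i for i, name in enumerate(header) if fnmatchcase(name, part)]
            if not matches:
                raise ValueError(f"no column matches {part!r}")
            indices.extend(matches)
        else:
            raise ValueError(f"{part!r} needs a header to resolve")
    return indices


def extract_key(line, indices):
    """Build the deduplication key from the selected columns of a TSV line."""
    if indices is None:
        return line
    cols = line.split("\t")
    return "\t".join(cols[i] for i in indices)
```

The resulting key, rather than the full line, is what would be hashed and checked for duplicates.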
Examples:
- Deduplicate whole rows:
  tva uniq data.tsv
- Deduplicate by column 2:
  tva uniq data.tsv -f 2
- Deduplicate with header using named fields:
  tva uniq --header -f name,age data.tsv
- Output only repeated lines:
  tva uniq --repeated data.tsv
- Output lines repeated at least 3 times:
  tva uniq --at-least 3 data.tsv
- Output with equivalence class IDs and occurrence numbers:
  tva uniq --header -f 1 --equiv --number data.tsv
- Deduplicate multiple files with header:
  tva uniq --header file1.tsv file2.tsv file3.tsv
- Ignore case when comparing:
  tva uniq --ignore-case data.tsv