tva Design
This document outlines the core design decisions behind tva, drawing inspiration from the
original TSV Utilities by eBay.
Why TSV?
The Tab-Separated Values (TSV) format is chosen over Comma-Separated Values (CSV) for several key reasons, especially in data mining and large-scale data processing contexts:
1. No Escapes = Reliability & Speed
- CSV Complexity: CSV uses escape characters (usually quotes) to handle delimiters (commas) and newlines within fields. Parsing this requires a state machine, which is slower and prone to errors in ad-hoc scripts.
- TSV Simplicity: TSV disallows tabs and newlines within fields. This means:
- Parsing is trivial:
split('\t')works reliably. - Record boundaries are clear: Every newline is a record separator.
- Performance: Highly optimized routines can be used to find delimiters.
- Robustness: No “malformed CSV” errors due to incorrect quoting.
- Parsing is trivial:
2. Unix Tool Compatibility
- Traditional Unix tools (
cut,awk,sort,join,uniq) work seamlessly with TSV files by specifying the delimiter (e.g.,cut -f1). - The CSV Problem: Standard Unix tools fail on CSV files with quoted fields or newlines. This
forces CSV toolkits (like
xsv) to re-implement standard operations (sorting, joining) just to handle parsing correctly. - The TSV Advantage:
tvaleverages the simplicity of TSV. Whiletvaprovides its ownsortandjoinfor header awareness and Windows support, the underlying data remains compatible with the vast ecosystem of standard Unix text processing tools.
Why Rust?
tva is implemented in Rust, differing from the original TSV Utilities (written in D).
1. Safety & Performance
- Memory Safety: Rust’s ownership model ensures memory safety without a garbage collector, crucial for high-performance data processing tools.
- Zero-Cost Abstractions: High-level constructs (iterators, closures) compile down to efficient machine code, often matching or beating C/C++.
- Predictable Performance: No GC pauses means consistent throughput for large datasets.
2. Cross-Platform & Deployment
- Single Binary: Rust compiles to a static binary with no runtime dependencies (unlike Python or Java).
- Windows Support: Rust has first-class support for Windows, making
tvaeasily deployable on non-Unix environments (a key differentiator from many Unix-centric tools).
Design Goals
1. Unix Philosophy
- Do One Thing Well: Each subcommand (
filter,select,stats) focuses on a specific task. - Pipeable: Tools read from stdin and write to stdout by default, enabling powerful pipelines:
tva filter --gt score:0.9 data.tsv | tva select name,score | tva sort -k score - Streaming: Stateless where possible to support infinite streams and large files.
2. Header Awareness
- Unlike generic Unix tools,
tvais aware of headers. - Field Selection: Columns can be selected by name (
--fields user_id) rather than just index. - Header Preservation: Operations like
filterorsampleautomatically preserve the header row.
3. TSV-first
- Default separator is TAB.
- Processing revolves around the “Row + Field” model.
- CSV is treated as an import format (
from-csv), but core logic is TSV-centric.
4. Explicit CLI & Fail-fast
- Options should be explicit (no “magic” behavior).
- Strict error handling: mismatched field counts or broken headers result in immediate error exit ( stderr + non-zero status), rather than silent truncation.
5. High Performance
- Aim for single-pass processing.
- Avoid unnecessary allocations and sorting.
6. Single-Threaded by Default
Core Philosophy: Single-threaded extreme performance + external parallel tools
tva adopts a single-threaded model for most data processing scenarios. This is not a technical
limitation, but an active choice based on Unix philosophy:
- Do One Thing Well:
tvafocuses on streaming data parsing, transformation, and statistics, leaving parallel scheduling complexity to specialized tools (like GNU Parallel). - Don’t Reinvent the Wheel: GNU Parallel is already a mature, powerful parallel task scheduler.
Rather than implementing complex thread pools and task distribution inside
tva, it’s better to maketvathe best partner for Parallel. - Determinism and Simplicity: Single-threaded models make data processing order naturally deterministic, debugging easier, and greatly reduce memory management complexity and overhead ( lock-free, zero-copy easier to achieve).
Implementation Details
tva adopts several optimization strategies similar to tsv-utils to ensure high performance:
1. Buffered I/O
- Input: Uses
std::io::BufReaderto minimize system calls when reading large files. Transparently handles.gzfiles (viaflate2). - Output: Uses
std::io::BufWriterto batch writes, significantly improving throughput for commands that produce large output.
2. Zero-Copy & Re-use
- String Reuse: Where possible,
tvareuses allocated string buffers (e.g., viaread_lineinto a cleared String) to avoid the overhead of repeated memory allocation and deallocation. - Iterator-Based Processing: Leverages Rust’s iterator lazy evaluation to process data line-by-line without loading entire files into memory, enabling processing of datasets larger than RAM.
Performance Architecture & Benchmarks
tva is built on a philosophy of “Zero-Copy” and “SIMD-First”. We continuously benchmark different
parsing strategies to ensure tva remains the fastest tool for TSV processing.
Parsing Strategy Evolution
We compared multiple parsing strategies to find the optimal balance between speed and correctness. The evolution shows a clear progression from naive implementations to hand-optimized SIMD:
- Naive Split → Memchr-based → Single-Pass SIMD → Hand-written SIMD
- Each step eliminates overhead: allocation, function calls, or redundant scanning.
Latest Benchmark Results
Test Data 1: Short Fields, Few Columns (5 cols, ~8 bytes/field)
| Implementation | Time | Throughput | Notes |
|---|---|---|---|
| TVA for_each_row (Single-Pass) | 43 µs | 1.63 GiB/s | Current: Hand-written SIMD (SSE2), true single-pass |
| simd-csv | 80 µs | 905 MiB/s | Hybrid SIMD state machine, previous ceiling |
| TVA for_each_line + memchr | 87 µs | 830 MiB/s | Two-pass: SIMD for lines, memchr for fields |
| Memchr Reused Buffer | 113 µs | 639 MiB/s | Line-by-line memchr, limited by function call overhead |
| csv crate | 130 µs | 556 MiB/s | Classic DFA state machine, correctness baseline |
| Naive Split | 562 µs | 129 MiB/s | Original implementation, slowest |
Test Data 2: Wide Rows, Many Columns (20 cols, ~6 bytes/field)
| Implementation | Time | Throughput | Notes |
|---|---|---|---|
| TVA for_each_row (Single-Pass) | 128 µs | 896 MiB/s | Current: Hand-written SIMD (SSE2), true single-pass |
| simd-csv | 180 µs | 635 MiB/s | Hybrid SIMD state machine |
| TVA for_each_line + memchr | 247 µs | 463 MiB/s | Two-pass: SIMD for lines, memchr for fields |
| Memchr Reused Buffer | 344 µs | 333 MiB/s | Line-by-line memchr |
| csv crate | 320 µs | 358 MiB/s | Classic DFA state machine |
| Naive Split | 1167 µs | 98 MiB/s | Original implementation |
Key Findings:
- Performance Leap:
for_each_rowachieves 1.63 GiB/s on short fields—1.8x faster thansimd-csvand 12.6x faster than naive split. On wide rows, it maintains 896 MiB/s, demonstrating consistent advantage across data shapes. - Single-Pass Wins: True single-pass scanning outperforms two-pass approaches by ~95% regardless of row width, as more delimiter searches are eliminated.
- Scalability: All implementations show expected throughput decrease on wide rows (more delimiters to process), but TVA’s single-pass approach maintains the lead.
TSV Parser Design
This section details the design of tva’s custom TSV parser, which leverages the simplicity of the
TSV format to achieve high performance.
Format Differences: CSV vs TSV
| Feature | CSV (RFC 4180) | TSV (Simple) | Impact |
|---|---|---|---|
| Delimiter | , (variable) | \t (fixed) | TSV can hardcode delimiter, enabling SIMD optimization. |
| Quotes | Supports " wrapping | Not supported | TSV eliminates “in_quote” state machine, removing branch misprediction. |
| Escapes | "" escapes quotes | None | TSV supports true zero-copy slicing without rewriting. |
| Newlines | Allowed in fields | Not allowed | TSV guarantees \n always means record end, enabling parallel chunking. |
Implementation
Architecture:
src/libs/tsv/simd/
├── mod.rs - DelimiterSearcher trait, platform abstraction
├── sse2.rs - x86_64 SSE2 implementation (128-bit vectors)
└── neon.rs - aarch64 NEON implementation (128-bit vectors)
Key Design Decisions:
-
Hand-written SIMD: Platform-specific searchers simultaneously scan for
\tand\n, eliminating generic library overhead. -
Single-Pass Scanning: All delimiter positions are found in one pass, storing field boundaries in a pre-allocated array. This eliminates the ~95% overhead of two-pass approaches.
-
Unified CR Handling: Only
\tand\nare searched during SIMD scan. When\nis found, we check if the preceding byte is\r. This reduces register pressure compared to searching for three characters simultaneously. -
Zero-Copy API:
TsvRowstructs yield borrowed slices into the internal buffer, eliminating per-row allocation.
Platform Support:
- x86_64: SSE2 intrinsics (baseline for all x86_64 CPUs)
- aarch64: NEON intrinsics (baseline for all ARM64 CPUs)
- Fallback:
memchr2for other platforms
Performance Validation
| Metric | Target | Achieved | Status |
|---|---|---|---|
| Throughput (short fields) | 2-3 GiB/s | 1.63 GiB/s | ✅ Near theoretical limit |
Speedup vs simd-csv | 1.5-2x | 1.8x | ✅ Exceeded target |
| Speedup vs memchr2 | 1.5-2x | 2.0x | ✅ Achieved target |
Key Insights:
- SSE2 over AVX2: 128-bit SSE2 outperformed 256-bit AVX2. Wider registers added overhead without proportional gains for TSV’s simple structure.
- Single-Pass Architecture: The dominant performance factor, providing ~95% improvement over two-pass approaches regardless of data shape.
Common Behavior & Syntax
tva tools share a consistent set of behaviors and syntax conventions, making them easy to learn
and combine.
Field Syntax
All tools use a unified syntax to identify fields (columns). See Field Syntax Documentation for details.
- Index:
1(first column),2(second column). - Range:
1-3(columns 1, 2, 3). - List:
1,3,5. - Name:
user_id(requires--header). - Wildcard:
user_*(matchesuser_id,user_name, etc.). - Exclusion:
--exclude 1,2(select all except 1 and 2).
Header Processing
- Input: Most tools accept a
--header(or-H) flag to indicate the first line of input is a header. This enables field selection by name.- Note: The
longerandwidercommands assume a header by default.
- Note: The
- Output: When
--headeris used,tvaensures the header is preserved in the output (unless explicitly suppressed). - No Header: Without this flag, the first row is treated as data. Field selection is limited to indices (no names).
- Multiple Files: If processing multiple files with
--header:- The header from the first file is written to output.
- Headers from subsequent files are skipped (assumed to be identical to the first).
- Validation: Field counts must be consistent;
tvafails immediately on jagged rows.
Multiple Files & Standard Input
- Standard Input: If no files are provided, or if
-is used as a filename,tvareads from standard input (stdin). - Concatenation: When multiple files are provided,
tvaprocesses them sequentially as a single continuous stream of data (logical concatenation).- Example:
tva filter --gt value:10 file1.tsv file2.tsvprocesses both files.
- Example:
Comparison with Other Tools
tva is designed to coexist with and complement other excellent open-source tools for tabular data.
It combines the strict, high-performance nature of tsv-utils with the cross-platform accessibility
and modern ecosystem of Rust.
| Feature | tva (Rust) | tsv-utils (D) | xsv / qsv (Rust) | datamash (C) |
|---|---|---|---|---|
| Primary Format | TSV (Strict) | TSV (Strict) | CSV (Flexible) | TSV (Default) |
| Escapes | No | No | Yes | No |
| Header Aware | Yes | Yes | Yes | Partial |
| Field Syntax | Names & Indices | Names & Indices | Names & Indices | Indices |
| Platform | Cross-platform | Unix-focused | Cross-platform | Unix-focused |
| Performance | High | High | High (CSV cost) | High |
Detailed Breakdown
-
tsv-utils (D):
- The direct inspiration for
tva.tvaaims to be a Rust-based alternative that is easier to install (no D compiler needed) and extends functionality (e.g.,sample,slice).
- The direct inspiration for
-
- The premier tools for CSV processing.
- Because they must handle CSV escapes, they are inherently more complex than TSV-only tools.
- Use these if you must work with CSVs directly; use
tvaif you can convert to TSV for faster, simpler processing.
-
GNU Datamash (C):
- Excellent for statistical operations (groupby, pivot) on TSV files.
tva statsis similar but adds header awareness and named field selection, making it friendlier for interactive use.
-
Miller (mlr) (C):
- A powerful “awk for CSV/TSV/JSON”. Supports many formats and complex transformations.
- Miller is a DSL (Domain Specific Language);
tvafollows the “do one thing well” Unix philosophy with separate subcommands.
-
csvkit (Python):
- Very feature-rich but slower due to Python overhead. Great for converting obscure formats ( XLSX, DBF) to CSV/TSV.
-
GNU shuf (C):
- Standard tool for random permutations.
tva sampleadds specific data science sampling methods: weighted sampling (by column value) and Bernoulli sampling.
Aggregation Architecture
This section provides a deep dive into the architectural differences between tva and other tools
like xan (Rust) and tsv-utils (D Language) in their aggregation module designs.
tva: Runtime Polymorphism with SoA Memory Layout
Design: Hybrid Struct-of-Arrays (SoA). The Schema (StatsProcessor) builds the computation
graph at runtime, while the State (Aggregator) uses compact columnar storage (Vec<f64>,
Vec<String>). Computation logic is dynamically dispatched via Box<dyn Calculator> trait objects.
Advantages:
- Memory Efficient: Even with millions of groups, each group’s
Aggregatoroverhead is minimal (only a fewVecheaders). - Modular: Adding new operators only requires implementing the
Calculatortrait, completely decoupled from existing code. - Fast Compilation: Compared to generic/template bloat,
dyn Traitsignificantly reduces compile times and binary size. - Deterministic: Uses
IndexMapto guarantee that GroupBy output order matches the first-occurrence order in the input.
Trade-offs: Virtual function calls (vtable) have a tiny overhead compared to inlined code in
extremely high-frequency loops (e.g., 10 calls per row), but this is usually negligible in I/O-bound
CLI tools.
Other Tools
xan: Uses enum dispatch (enum Agg { Sum(SumState), ... }) to avoid heap allocation, but
requires modifying core enum definitions to add new operators.
tsv-utils (D): Uses compile-time template specialization for extreme performance, but has long compile times and high code complexity.
datamash (C): Uses sort-based grouping with O(1) memory, but requires pre-sorted input.
dplyr (R): Uses vectorized mask evaluation, but depends on columnar storage and is unsuitable for streaming.
Expr Language
TVA’s Expr language is designed for concise, shell-friendly data processing:
Source → Pest Parser → AST (Expr) → Direct Interpretation (eval)
↑______________________________↓
(Parse Cache)
Design Principles
- Conciseness: Short syntax for common operations (e.g.,
@1,@namefor column references). - Shell-friendly: Avoids conflicts with Shell special characters (
$,`,!). - Streaming: Row-by-row evaluation with no global state, suitable for big data.
- Type-aware: Recognizes numbers/dates when needed, treats data as strings by default for speed.
- Error Handling: Defaults to permissive mode (invalid operations return
null). - Consistency: Similar to jq/xan to reduce learning costs.
Expr Engine Optimizations
| Optimization | Technique | Speedup |
|---|---|---|
| Global Function Registry | OnceLock static registry | 35-57x |
| Parse Cache | HashMap<String, Expr> caching | 12x |
| Column Name Resolution | Compile-time name→index conversion | 3x |
| Constant Folding | Compile-time constant evaluation | 10x |
| HashMap (ahash) | Faster HashMap implementation | 6% |
Details:
- Parse caching: Expressions are parsed once and cached for all rows. Identical expressions reuse the cached AST.
- Column name resolution: When headers are available,
@namereferences are resolved to@indexat parse time for O(1) access. - Constant folding: Constant sub-expressions (e.g.,
2 + 3 * 4) are pre-computed during parsing. - Function registry: Built-in functions are looked up once and cached, avoiding repeated hash map lookups.
- Hash algorithm: Uses
ahashfor faster hash map operations.
For best performance, use column indices (@1, @2) instead of names.