from html
Extracts data from HTML files using CSS selectors.
Behavior:
This command converts HTML content into TSV format using three different modes:
- Query Mode: For quick extraction of specific elements.
- Table Mode: For automatically converting HTML tables (<table>).
- Struct Mode: For extracting lists of objects into rows and columns.
Input:
- Reads from standard input if no input file is given or if the input file is 'stdin'.
- Supports plain text HTML files.
Output:
- Writes to standard output by default.
- Use --outfile/-o to write to a file ([stdout] for screen).
Query Mode:
- Activated by the --query/-q flag.
- Syntax: selector [display_function]
- Selectors: Standard CSS selectors (e.g., div.content, #main a).
- Display Functions:
  - text{} or text(): Print the text content of the selected elements.
  - attr{name} or attr("name"): Print the value of the specified attribute.
  - If omitted, prints the full HTML of selected elements.
- Empty results are kept by default (prints blank lines for empty text or missing attributes).
- For advanced CSS selector reference, see: docs/selectors.md.
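
To make the query syntax concrete, here is a minimal Python sketch that splits a query string into its selector and display function. It is not tva's parser: the function name parse_query, the regular expression, and the "html" label for the no-function case are all assumptions made for illustration.

```python
import re

# Trailing display function: text{} / text() / attr{name} / attr("name").
# It must be separated from the selector by whitespace.
_DISPLAY = re.compile(r'\s+(text|attr)(?:\{([^{}]*)\}|\(\s*"?([^"()]*)"?\s*\))\s*$')


def parse_query(query):
    """Split 'selector [display_function]' into (selector, function, argument)."""
    m = _DISPLAY.search(query)
    if m is None:
        # No display function: the whole string is the selector and the
        # matched elements would be printed as full HTML.
        return query.strip(), "html", None
    selector = query[: m.start()].strip()
    # Group 2 holds the brace form, group 3 the parenthesis form.
    arg = m.group(2) if m.group(2) is not None else m.group(3)
    return selector, m.group(1), (arg.strip() or None)


print(parse_query("a attr{href}"))
print(parse_query("#main a text()"))
print(parse_query("div.content"))
```

Only the trailing token is treated as a display function, so selectors that contain spaces (such as #main a) stay intact.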
Table Mode:
- Activated by the --table flag.
- Extracts data from HTML <table> elements.
- Use --index N to select the N-th matched table (1-based). Implies --table.
- Use --table=<css> to target specific tables (e.g., div.result table).
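
As a rough illustration of what a table-to-TSV conversion with a 1-based table index involves, the following Python sketch uses only the standard library. It is an assumption-laden stand-in, not tva's implementation: the names TableCollector and table_to_tsv are hypothetical, whitespace inside cells is collapsed by choice, and nested tables, colspan/rowspan, and omitted closing tags are not handled.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell texts."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table>
        self._row = None   # cells of the <tr> currently being read
        self._cell = None  # text fragments of the <td>/<th> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse whitespace so a cell never contains a tab or newline.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def table_to_tsv(html, index=1):
    """Return the index-th table (1-based) as TSV text."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    rows = parser.tables[index - 1]
    return "\n".join("\t".join(row) for row in rows)


sample = (
    "<table><tr><th>Name</th><th>Qty</th></tr>"
    "<tr><td>apple</td><td>3</td></tr></table>"
)
print(table_to_tsv(sample))
```

The index is converted to a zero-based list position inside table_to_tsv, mirroring the 1-based numbering that --index uses.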
Struct Mode (List Extraction):
- Activated by using the --row and --col flags.
- Designed to extract repetitive structures (like cards, list items) into a TSV table.
- --row <selector>: Defines the container for each record (e.g., div.product, li).
- --col "Name:Selector [Function]": Defines a column in the output TSV.
  - Name: The column header name.
  - Selector: CSS selector relative to the row element.
  - Function: text{} (default) or attr{name}.
  - Example: --col "Link:a.title attr{href}"
- Missing elements or attributes result in empty TSV cells.
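
The shape of the Struct Mode output can be sketched in Python as follows. This is a hypothetical illustration rather than tva's code: parse_col and to_tsv are invented names, splitting the column spec at the first colon is an assumption (it keeps selectors such as a:first-child intact), and the records stand in for values that the CSS selectors would have extracted from each row element.

```python
def parse_col(spec):
    """Split 'Name:Selector [Function]' into (name, selector_and_function)."""
    # Assumption: only the first colon separates the name from the selector.
    name, sep, rest = spec.partition(":")
    if not sep or not name.strip() or not rest.strip():
        raise ValueError(f"expected 'Name:Selector [Function]', got {spec!r}")
    return name.strip(), rest.strip()


def to_tsv(columns, records):
    """Render a header row plus one TSV line per record."""
    lines = ["\t".join(name for name, _ in columns)]
    for record in records:
        # A missing element or attribute becomes an empty cell.
        lines.append("\t".join(record.get(name) or "" for name, _ in columns))
    return "\n".join(lines)


columns = [
    parse_col(spec)
    for spec in ("Title: h2.title text{}", "Price: .price", "URL: a.link attr{href}")
]
# Hypothetical extracted values; the second record has no price element.
records = [
    {"Title": "Lamp", "Price": "$20", "URL": "/lamp"},
    {"Title": "Desk", "URL": "/desk"},
]
print(to_tsv(columns, records))
```

Each --col name becomes a header cell, and the second record keeps its column position by emitting an empty Price cell.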
Input:
- Reads from files or standard input.
- Use stdin or omit the file argument to read from standard input.
Output:
- By default, output is written to standard output.
- Use --outfile to write to a file instead.
Examples:
- Extract all links (Query Mode)
    tva from html -q "a attr{href}" index.html
- Extract the first table (Table Mode)
    tva from html --table data.html
- Extract product list (Struct Mode)
    tva from html --row "div.product-card" \
        --col "Title: h2.title text{}" \
        --col "Price: .price" \
        --col "URL: a.link attr{href}" \
        products.html