CSS Selectors Reference

tva from html uses the scraper crate, which implements a robust subset of CSS selectors. This document provides a comprehensive reference and examples, inspired by pup.

Basic Selectors

| Selector | Description | Example | Matches |
| --- | --- | --- | --- |
| `tag` | Selects elements by tag name. | `div` | `<div>...</div>` |
| `.class` | Selects elements by class. | `.content` | `<div class="content">` |
| `#id` | Selects elements by ID. | `#header` | `<div id="header">` |
| `*` | Universal selector, matches everything. | `*` | Any element |

Combinators

Combinators allow you to select elements based on their relationship to other elements.

| Selector | Name | Description | Example |
| --- | --- | --- | --- |
| `A B` | Descendant | Selects B inside A (any depth). | `div p` (paragraphs inside divs) |
| `A > B` | Child | Selects B directly inside A. | `ul > li` (direct children list items) |
| `A + B` | Adjacent Sibling | Selects B immediately after A. | `h1 + p` (paragraph right after h1) |
| `A ~ B` | General Sibling | Selects B after A (same parent). | `h1 ~ p` (all paragraphs after h1) |
| `A, B` | Grouping | Selects both A and B. | `h1, h2` (all h1 and h2 headers) |
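The difference between the four relationship combinators is easier to see on a concrete tree. The following is a minimal Python sketch, not how scraper implements matching: the `Node` class, the `walk` and `select` helpers, and the sample page are all hypothetical and handle tag names only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical element: a tag name, a label for printing, and child elements."""
    tag: str
    label: str = ""
    children: list["Node"] = field(default_factory=list)


def walk(node, ancestors=(), earlier=()):
    """Yield (node, ancestors, earlier_siblings) in document order."""
    yield node, ancestors, earlier
    for i, child in enumerate(node.children):
        yield from walk(child, ancestors + (node,), tuple(node.children[:i]))


def select(root, a, combinator, b):
    """Return nodes with tag `b` that relate to a node with tag `a` via `combinator`."""
    out = []
    for node, ancestors, earlier in walk(root):
        if node.tag != b:
            continue
        if combinator == " ":      # descendant: A is any ancestor
            ok = any(x.tag == a for x in ancestors)
        elif combinator == ">":    # child: A is the direct parent
            ok = bool(ancestors) and ancestors[-1].tag == a
        elif combinator == "+":    # adjacent sibling: A is the previous sibling
            ok = bool(earlier) and earlier[-1].tag == a
        elif combinator == "~":    # general sibling: A is any earlier sibling
            ok = any(x.tag == a for x in earlier)
        else:
            raise ValueError(f"unknown combinator: {combinator!r}")
        if ok:
            out.append(node)
    return out


page = Node("body", children=[
    Node("h1"),
    Node("p", "intro"),
    Node("div", children=[
        Node("p", "nested"),
        Node("ul", children=[Node("li", children=[Node("p", "deep")])]),
    ]),
    Node("p", "outro"),
])

print([n.label for n in select(page, "div", " ", "p")])  # ['nested', 'deep']
print([n.label for n in select(page, "div", ">", "p")])  # ['nested']
print([n.label for n in select(page, "h1", "+", "p")])   # ['intro']
print([n.label for n in select(page, "h1", "~", "p")])   # ['intro', 'outro']
```

Grouping (`A, B`) is not a relationship: it is simply the union of what `A` and `B` match on their own.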

Attribute Selectors

Filter elements based on their attributes.

| Selector | Description | Example |
| --- | --- | --- |
| `[attr]` | Has attribute attr. | `[href]` |
| `[attr="val"]` | Attribute exactly equals val. | `[type="text"]` |
| `[attr~="val"]` | Attribute contains word val (space separated). | `[class~="btn"]` |
| `[attr\|="val"]` | Attribute is exactly val or starts with val followed by a hyphen. | `[lang\|="en"]` |
| `[attr^="val"]` | Attribute starts with val. | `[href^="https"]` |
| `[attr$="val"]` | Attribute ends with val. | `[href$=".pdf"]` |
| `[attr*="val"]` | Attribute contains substring val. | `[href*="google"]` |
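The operators differ only in how the attribute value is compared with `val`. A minimal Python sketch of those comparisons follows, based on the standard CSS definitions rather than on scraper's source. The function name `attr_matches` is hypothetical, and the sketch assumes case-sensitive comparison.

```python
def attr_matches(value: str, op: str, val: str) -> bool:
    """Compare an attribute `value` against `val` using a CSS attribute operator."""
    if op == "=":    # exact match
        return value == val
    if op == "~=":   # one of the whitespace-separated words equals val
        return val in value.split()
    if op == "|=":   # exactly val, or val followed by a hyphen
        return value == val or value.startswith(val + "-")
    if op == "^=":   # prefix match (an empty val matches nothing)
        return val != "" and value.startswith(val)
    if op == "$=":   # suffix match
        return val != "" and value.endswith(val)
    if op == "*=":   # substring match
        return val != "" and val in value
    raise ValueError(f"unknown operator: {op!r}")


print(attr_matches("btn btn-primary", "~=", "btn"))  # True
print(attr_matches("btn-primary", "~=", "btn"))      # False
print(attr_matches("en-US", "|=", "en"))             # True
print(attr_matches("english", "|=", "en"))           # False
print(attr_matches("report.pdf", "$=", ".pdf"))      # True
```

Note the contrast between `~=` and `*=`: `[class~="btn"]` does not match `class="btn-primary"`, while `[class*="btn"]` does.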

Pseudo-classes

Pseudo-classes select elements based on their state or position in the document tree.

Structural & Position

| Selector | Description | Example |
| --- | --- | --- |
| `:first-child` | First child of its parent. | `li:first-child` |
| `:last-child` | Last child of its parent. | `li:last-child` |
| `:only-child` | Elements that are the only child. | `p:only-child` |
| `:first-of-type` | First element of its type among siblings. | `p:first-of-type` |
| `:last-of-type` | Last element of its type among siblings. | `p:last-of-type` |
| `:only-of-type` | Only element of its type among siblings. | `img:only-of-type` |
| `:nth-child(n)` | Selects the n-th child (1-based). | `tr:nth-child(2)` |
| `:nth-last-child(n)` | n-th child from end. | `li:nth-last-child(1)` |
| `:nth-of-type(n)` | n-th element of its type. | `p:nth-of-type(2)` |
| `:nth-last-of-type(n)` | n-th element of its type from end. | `tr:nth-last-of-type(2)` |
| `:empty` | Elements with no children (including text). | `td:empty` |

Note on nth-child arguments:

  • 2: The 2nd child.
  • odd: 1st, 3rd, 5th…
  • even: 2nd, 4th, 6th…
  • 2n+1: Every 2nd child starting from 1 (1, 3, 5…).
  • 3n: Every 3rd child (3, 6, 9…).
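Every argument form above reduces to the pattern `a*n + b`, where `n` counts up from 0: `odd` is `2n+1`, `even` is `2n`, and a plain number such as `2` is `0n+2`. The Python sketch below shows that arithmetic. The helper name `matches_nth` is hypothetical, and this illustrates the rule rather than scraper's own code.

```python
def matches_nth(index: int, a: int, b: int) -> bool:
    """True if the 1-based position `index` equals a*n + b for some integer n >= 0."""
    if a == 0:
        return index == b
    diff = index - b
    # n = diff / a must be a whole number and must not be negative
    return diff % a == 0 and diff // a >= 0


positions = range(1, 10)
print([i for i in positions if matches_nth(i, 0, 2)])   # 2     -> [2]
print([i for i in positions if matches_nth(i, 2, 1)])   # 2n+1  -> [1, 3, 5, 7, 9]
print([i for i in positions if matches_nth(i, 2, 0)])   # even  -> [2, 4, 6, 8]
print([i for i in positions if matches_nth(i, 3, 0)])   # 3n    -> [3, 6, 9]
print([i for i in positions if matches_nth(i, 1, 2)])   # n+2   -> [2, 3, 4, 5, 6, 7, 8, 9]
```

The last line is the pattern used to skip a first row: `n+2` matches every position from the second onward.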

Logic & Content

| Selector | Description | Example |
| --- | --- | --- |
| `:not(selector)` | Elements that do NOT match the selector. | `input:not([type="submit"])` |
| `:is(selector)` | Matches any of the selectors in the list. | `:is(header, footer) a` |
| `:where(selector)` | Same as `:is` but with 0 specificity. | `:where(section, article)` |
| `:has(selector)` | (Experimental) Elements containing specific descendants. | `div:has(img)` |
| `:contains("text")` | Not supported by scraper. | (Use `text{}` and filter downstream.) |
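Filtering downstream means selecting a broader set of elements, printing their text, and then keeping only the lines that contain the wanted string. A minimal Python sketch of that last step follows. The helper `keep_containing` and the sample lines are hypothetical, and the sketch assumes the extracted text arrives as one line per element, which this document does not confirm.

```python
def keep_containing(lines, needle):
    """Keep only the lines that contain `needle` (case-sensitive substring match)."""
    return [line for line in lines if needle in line]


# Hypothetical text output, one line per selected element
extracted = [
    "Shipping and returns",
    "Contact us",
    "Returns policy for EU customers",
]

for line in keep_containing(extracted, "eturns"):
    print(line)
# Shipping and returns
# Returns policy for EU customers
```

Any line-oriented text filter would serve the same purpose; the point is that the substring test happens after extraction, not inside the selector.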

Display Functions

When using tva from html -q, you can append a display function to format the output. If omitted, the full HTML of selected elements is printed.

| Function | Description | Example Output |
| --- | --- | --- |
| `text{}` | Prints text content of element and children. | Hello World |
| `attr{name}` | Prints value of attribute name. | https://example.com |
| `json{}` | (Not yet implemented) Output as JSON structure. | N/A |

Note: pup supports json{}, but tva currently focuses on TSV/Text extraction. Use Struct Mode (--row/--col) for structured data extraction.
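To make the two implemented display functions concrete, here is a Python sketch of what they conceptually return for a link element: the concatenated text of the element and its children, and the value of one named attribute. It uses the standard library's `html.parser`, not tva or scraper. The `LinkExtractor` class and `extract_links` helper are hypothetical, and tva's exact whitespace handling is not specified here.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect, for each <a> element, its text content and its attributes."""

    def __init__(self):
        super().__init__()
        self.links = []       # one (text, attrs) pair per <a> element
        self._attrs = None    # attributes of the <a> currently open, if any
        self._parts = []      # text fragments seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self._attrs is None:
            self._attrs = dict(attrs)
            self._parts = []

    def handle_data(self, data):
        if self._attrs is not None:
            self._parts.append(data)   # includes text of nested children

    def handle_endtag(self, tag):
        if tag == "a" and self._attrs is not None:
            self.links.append(("".join(self._parts), self._attrs))
            self._attrs = None


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


sample = '<p><a href="https://example.com">Hello <b>World</b></a></p>'
text, attrs = extract_links(sample)[0]
print(text)            # Hello World          (analogous to text{})
print(attrs["href"])   # https://example.com  (analogous to attr{href})
```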

Known Limitations

The following features from pup are not planned for implementation:

  • json{} output mode (use text{} or attr{} with TSV output).
  • pup-specific pseudo-classes (e.g., :parent-of).
  • :contains() selector (not supported by the underlying scraper engine).

Examples

Basic Filtering

Extract page title:

tva from html -q "title text{}" index.html

Extract all links from a specific list:

tva from html -q "ul#menu > li > a attr{href}" index.html

Advanced Filtering

Extract rows from the second table among its siblings (typically the second table on the page), skipping the header row:

tva from html -q "table:nth-of-type(2) tr:nth-child(n+2)" index.html

Find all images that are NOT icons:

tva from html -q "img:not(.icon) attr{src}" index.html

Extract meta description:

tva from html -q "meta[name='description'] attr{content}" index.html