nwr - NCBI taxonomy/assembly WRangler

Install

Current release: 0.9.0

cargo install nwr

# or
cargo install --path . --force # --offline

Or install the pre-compiled binary via the cross-platform package manager cbp (supports older Linux systems with glibc 2.17+):

cbp install nwr

You can also download the pre-compiled binaries from the Releases page.

nwr help

$ nwr help
`nwr` is a command line **N**CBI taxonomy and assembly **WR**angler.

Usage: nwr [COMMAND]

Commands:
  download     Download the latest releases of `taxdump` and assembly reports
  txdb         Init the taxonomy database
  ardb         Init the assembly database
  info         Information of Taxonomy ID(s) or scientific name(s)
  lineage      Output the lineage of the term
  member       List members (of certain ranks) under ancestral term(s)
  append       Append fields of higher ranks to a TSV file
  restrict     Restrict taxonomy terms to ancestral descendants
  common       Output the common tree of terms
  template     Create dirs, data and scripts for a phylogenomic research
  kb           Prints docs (knowledge bases)
  seqdb        Init the seq database
  help         Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version

Subcommand groups:

* Database
    * download / txdb / ardb
* Taxonomy
    * info / lineage / member / append / restrict / common
* Assembly
    * template / kb / seqdb

Examples

Initiate local databases

nwr download was last run on Sun Apr 5 15:59:45 UTC 2026 (timestamp from date --utc).

The database doesn't need frequent updates. In our lab, we update it approximately once a year. For reproducibility, database files for the above date are provided on the Releases page.

cbp install nwr

nwr download
nwr txdb

nwr ardb
nwr ardb --genbank

cd $HOME/.nwr
tar cvfz ncbi.$(date +"%Y%m%d").tar.gz \
    taxdump.tar.gz \
    taxdump.tar.gz.md5 \
    assembly_summary_genbank.txt \
    assembly_summary_refseq.txt

rm \
    *.dmp \
    taxdump.tar.gz \
    taxdump.tar.gz.md5 \
    assembly_summary_genbank.txt \
    assembly_summary_refseq.txt

Usage of each command

For practical uses of nwr and other awesome companions, follow this page.

# nwr download

# nwr txdb

nwr info "Homo sapiens" 4932

nwr lineage "Homo sapiens"
nwr lineage 4932

nwr restrict "Vertebrata" -c 2 -f tests/nwr/taxon.tsv
##sci_name       tax_id
#Human   9606

nwr member "Homo"

nwr append tests/nwr/taxon.tsv -c 2 -r species -r family --id

# nwr ardb
# nwr ardb --genbank

nwr common "Escherichia coli" 4932 Drosophila_melanogaster 9606 Mus_musculus

seqdb

export SPECIES="$HOME/data/Archaea/Protein/Sulfolobus_acidocaldarius"

cargo run --bin nwr seqdb -d ${SPECIES} --init --strain

cargo run --bin nwr seqdb -d ${SPECIES} \
    --size <(
        hnsm size ${SPECIES}/pro.fa.gz
    ) \
    --clust

cargo run --bin nwr seqdb -d ${SPECIES} \
    --anno <(
        gzip -dcf "${SPECIES}"/anno.tsv.gz
    ) \
    --asmseq <(
        gzip -dcf "${SPECIES}"/asmseq.tsv.gz
    )

cargo run --bin nwr seqdb -d ${SPECIES} --rep f1="${SPECIES}"/fam88_cluster.tsv

echo "
    SELECT
        *
    FROM asm
    WHERE 1=1
    " |
    sqlite3 -tabs ${SPECIES}/seq.sqlite

echo "
    SELECT
        COUNT(distinct asm_seq.asm_id)
    FROM asm_seq
    WHERE 1=1
    " |
    sqlite3 -tabs ${SPECIES}/seq.sqlite

echo "
.header ON
    SELECT
        'species' AS species,
        COUNT(distinct asm_seq.asm_id) AS strain,
        COUNT(*) AS total,
        COUNT(distinct rep_seq.seq_id) AS dedup,
        COUNT(distinct rep_seq.rep_id) AS rep
    FROM asm_seq
    JOIN rep_seq ON asm_seq.seq_id = rep_seq.seq_id
    WHERE 1=1
    " |
    sqlite3 -tabs ${SPECIES}/seq.sqlite

NCBI Assembly Reports

Preparations

cbp install nwr
cbp install sqlite3
cbp install tva

SQLite 3.34 or above is required. The sqlite3 bundled with macOS does not work.
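
One way to check the version requirement before building the databases is sketched below. The helper names version_ge and check_sqlite are hypothetical, and the comparison assumes a sort that supports -V (GNU coreutils).

```shell
# version_ge A B: succeed when version A >= version B (relies on `sort -V`).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Report whether the sqlite3 found on PATH meets the 3.34 minimum.
check_sqlite() {
    local found
    # `sqlite3 --version` prints the version number as its first field.
    found=$(sqlite3 --version 2>/dev/null | cut -d ' ' -f 1) || true
    if [ -z "$found" ]; then
        echo "sqlite3 not found"
    elif version_ge "$found" "3.34"; then
        echo "sqlite3 $found is OK"
    else
        echo "sqlite3 $found is too old"
    fi
}

check_sqlite
```

If the check reports an old version, installing sqlite3 through cbp as shown above and putting it first on PATH should be enough.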

NCBI Taxonomy Statistics

curl -L "https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=statistics&?&unclassified=hide&uncultured=hide" |
    tva from html -q 'table[bgcolor="#CCCCFF"] table[bgcolor="#FFFFFF"] tr td text{}' |
    grep '\S' |
    paste -d $'\t' - - - - - - |
    tva to md --right 2-6
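| Ranks:        | higher taxa |   genus | species | lower taxa |     total |
|:--------------|------------:|--------:|--------:|-----------:|----------:|
| Archaea       |           0 |     340 |   1,200 |      2,290 |     2,290 |
| Bacteria      |           0 |   5,782 |  33,615 |     90,218 |    90,218 |
| Eukaryota     |           0 | 104,261 | 631,437 |    804,447 |   804,447 |
| Fungi         |           0 |   8,095 |  74,507 |     88,460 |    88,460 |
| Metazoa       |           0 |  75,546 | 340,416 |    453,240 |   453,240 |
| Viridiplantae |           0 |  16,338 | 198,532 |    237,280 |   237,280 |
| Viruses       |          36 |   3,493 |  14,612 |    200,795 |   201,328 |
| All taxa      |          54 | 113,878 | 700,762 |  1,097,758 | 1,118,224 |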

NCBI ASSEMBLY

  • assembly_level
for C in refseq genbank; do
    cat ~/.nwr/assembly_summary_${C}.txt |
        sed '1d' |
        tva stats -H -g assembly_level,genome_rep --count |
        tva keep-header -- sort |
        tva to md --fmt

    echo -e "\nTable: ${C}\n\n"
done
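| assembly_level  | genome_rep |   count |
|:----------------|:-----------|--------:|
| Chromosome      | Full       |   8,629 |
| Chromosome      | Partial    |     355 |
| Complete Genome | Full       |  76,533 |
| Complete Genome | Partial    |       7 |
| Contig          | Full       | 280,107 |
| Contig          | Partial    |      30 |
| Scaffold        | Full       | 158,032 |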

Table: refseq

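| assembly_level  | genome_rep |     count |
|:----------------|:-----------|----------:|
| Chromosome      | Full       |    44,020 |
| Chromosome      | Partial    |     1,196 |
| Complete Genome | Full       |   309,100 |
| Complete Genome | Partial    |       131 |
| Contig          | Full       | 2,549,556 |
| Contig          | Partial    |       933 |
| Scaffold        | Full       |   515,294 |
| Scaffold        | Partial    |       363 |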

Table: genbank

Example 1: count qualified assemblies of Eukaryote groups

ARRAY=(
    # Animals - Metazoa - kingdom
    'Flatworms::Platyhelminthes' # phylum
    'Roundworms::Nematoda'
    'Insects::Hexapoda' # subphylum
    'Reptiles::Testudines' # order
    'Reptiles::Lepidosauria'
    'Reptiles::Crocodylia'
    'Fishes::Chondrichthyes' # class
    'Fishes::Dipnoi'
    'Fishes::Actinopterygii'
    'Fishes::Hyperotreti'
    'Fishes::Hyperoartia'
    'Fishes::Coelacanthimorpha'
    'Mammals::Mammalia'
    'Birds::Aves'
    'Amphibians::Amphibia'
    # Fungi - kingdom
    'Ascomycetes::Ascomycota' # phylum
    'Basidiomycetes::Basidiomycota'
    # Plants - Viridiplantae
    'Green Plants::Viridiplantae'
    'Land Plants::Embryophyta'
    # Protists
    'Apicomplexans::Apicomplexa'
    'Kinetoplasts::Kinetoplastida'
)

echo -e "GROUP_NAME\tSCI_NAME\tComplete Genome\tChromosome\tScaffold\tContig" \
    > groups.tsv

for item in "${ARRAY[@]}" ; do
    GROUP_NAME="${item%%::*}"
    SCI_NAME="${item##*::}"

    GENUS=$(
        nwr member ${SCI_NAME} -r genus |
            grep -v -i "Candidatus " |
            grep -v -i "candidate " |
            sed '1d' |
            cut -f 1 |
            tr "\n" "," |
            sed 's/,$/\)/' |
            sed 's/^/\(/'
    )

    printf "$GROUP_NAME\t$SCI_NAME\t"

    for L in 'Complete Genome' 'Chromosome' 'Scaffold' 'Contig'; do
        echo "
            SELECT
                COUNT(*)
            FROM ar
            WHERE 1=1
                AND genus_id IN $GENUS
                AND assembly_level IN ('$L')
            " |
            sqlite3 -tabs ~/.nwr/ar_refseq.sqlite
    done |
    tr "\n" "\t" |
    sed 's/\t$//'

    echo;
done \
    >> groups.tsv

cat groups.tsv |
    tva to md --num

GROUP_NAMESCI_NAMEComplete GenomeChromosomeScaffoldContig
FlatwormsPlatyhelminthes0250
RoundwormsNematoda1430
InsectsHexapoda120810530
ReptilesTestudines01711
ReptilesLepidosauria02591
ReptilesCrocodylia0160
FishesChondrichthyes02610
FishesDipnoi0100
FishesActinopterygii1225399
FishesHyperotreti0100
FishesHyperoartia0400
FishesCoelacanthimorpha0100
MammalsMammalia4173897
BirdsAves1106545
AmphibiansAmphibia02931
AscomycetesAscomycota4749276162
BasidiomycetesBasidiomycota27184832
Green PlantsViridiplantae9155589
Land PlantsEmbryophyta7152538
ApicomplexansApicomplexa225393
KinetoplastsKinetoplastida11373

Table: refseq - Eukaryotes

GROUP_NAMESCI_NAMEComplete GenomeChromosomeScaffoldContig
FlatwormsPlatyhelminthes0478920
RoundwormsNematoda4157348218
InsectsHexapoda21351333892573
ReptilesTestudines1595010
ReptilesLepidosauria011728130
ReptilesCrocodylia05140
FishesChondrichthyes056606
FishesDipnoi0402
FishesActinopterygii3111112107320
FishesHyperotreti0430
FishesHyperoartia07144
FishesCoelacanthimorpha0130
MammalsMammalia2514712280973
BirdsAves34472191330
AmphibiansAmphibia09318612
AscomycetesAscomycota4681312108726713
BasidiomycetesBasidiomycota12718817461247
Green PlantsViridiplantae252420328951261
Land PlantsEmbryophyta220413226881024
ApicomplexansApicomplexa2013219989
KinetoplastsKinetoplastida1672119104

Table: genbank - Eukaryotes

Example 2: count qualified assemblies of Prokaryote groups

echo -e "GROUP_NAME\tComplete Genome\tChromosome\tScaffold\tContig" \
    > groups.tsv

for item in Bacteria Archaea ; do
    PHYLUM=$(
        nwr member ${item} -r phylum |
            grep -v -i "Candidatus " |
            grep -v -i "candidate " |
            sed '1d' |
            cut -f 2 |
            sort
    )

    echo -e "$item\t\t\t\t"

    for P in $PHYLUM; do
        GENUS=$(
            nwr member ${P} -r genus |
                grep -v -i "Candidatus " |
                grep -v -i "candidate " |
                sed '1d' |
                cut -f 1 |
                tr "\n" "," |
                sed 's/,$/\)/' |
                sed 's/^/\(/'
        )

        if [[ ${#GENUS} -lt 3 ]]; then
            >&2 echo $P has no genera
            continue
        fi

        printf "$P\t"

        for L in 'Complete Genome' 'Chromosome' 'Scaffold' 'Contig'; do
            echo "
                SELECT
                    COUNT(*)
                FROM ar
                WHERE 1=1
                    AND genus_id IN $GENUS
                    AND assembly_level IN ('$L')
                " |
                sqlite3 -tabs ~/.nwr/ar_refseq.sqlite
        done |
        tr "\n" "\t" |
        sed 's/\t$//'

        echo;
    done
done  \
    >> groups.tsv

cat groups.tsv |
    tva to md --right 2-5

GROUP_NAMEComplete GenomeChromosomeScaffoldContig
Bacteria
Abditibacteriota1001
Acidobacteriota47113867
Actinomycetota60509762612420668
Aquificota2522667
Armatimonadota3448
Atribacterota3012
Bacillota1411516024296669650
Bacteroidota1997284792810745
Balneolota311539
Bdellovibrionota49104844
Caldisericota1092
Calditrichota1103
Campylobacterota148211625848148
Chlamydiota3039054193
Chlorobiota161936
Chloroflexota54163109
Chrysiogenota3050
Coprothermobacterota1012
Cyanobacteriota416448031331
Deferribacterota90922
Deinococcota1135142234
Dictyoglomota7061
Elusimicrobiota4001
Fibrobacterota202360
Fidelibacterota1000
Fusobacteriota2629211472
Gemmatimonadota101948
Ignavibacteriota30512
Kiritimatiellota2006
Lentisphaerota20123
Minisyncoccota1000
Mycoplasmatota953713821135
Myxococcota131937148
Nitrospinota10110
Nitrospirota2401924
Planctomycetota863061117
Pseudomonadota33304359771037157832
Rhodothermota1934199
Spirochaetota4672843731411
Synergistota12449110
Thermodesulfobacteriota18612279487
Thermodesulfobiota2002
Thermomicrobiota2039
Thermosulfidibacterota1000
Thermotogota61110599
Verrucomicrobiota1499237272
Vulcanimicrobiota1000
Zhurongbacterota1000
Archaea
Methanobacteriota523215471147
Microcaldota0000
Nanobdellota1000
Nitrososphaerota2131226
Promethearchaeota1000
Thermoplasmatota160975
Thermoproteota1336117127

Table: refseq - Prokaryotes

GROUP_NAMEComplete GenomeChromosomeScaffoldContig
Bacteria
Abditibacteriota11511
Acidobacteriota5613160612
Actinomycetota63718453327633974
Aquificota22282172
Armatimonadota413057
Atribacterota3057
Bacillota16847190687271465348
Bacteroidota21433141709630441
Balneolota1354396
Bdellovibrionota5310147223
Caldisericota10204
Calditrichota11742
Campylobacterota27991626074157198
Chlamydiota40879118225
Chlorobiota1713067
Chloroflexota571286375
Chrysiogenota3020
Coprothermobacterota101410
Cyanobacteriota4688114663874
Deferribacterota70520266
Deinococcota1185193282
Dictyoglomota70155
Elusimicrobiota40145
Fibrobacterota20109199
Fidelibacterota1000
Fusobacteriota29314258906
Gemmatimonadota8133167
Ignavibacteriota316245
Kiritimatiellota201348
Lentisphaerota201255
Minisyncoccota1001
Mycoplasmatota11322624471561
Myxococcota1371078351
Nitrospinota101367
Nitrospirota355307456
Planctomycetota10533172699
Pseudomonadota4202546391227191437144
Rhodothermota20352260
Spirochaetota5787136772696
Synergistota144127239
Thermodesulfobacteriota189116871767
Thermodesulfobiota2056
Thermomicrobiota20834
Thermosulfidibacterota1013
Thermotogota561232219
Verrucomicrobiota1641114322010
Vulcanimicrobiota1000
Zhurongbacterota1000
Archaea
Methanobacteriota5432512962504
Microcaldota0000
Nanobdellota2001
Nitrososphaerota4322200653
Promethearchaeota1006
Thermoplasmatota18046213
Thermoproteota1376476436

Table: genbank - Prokaryotes

Example 3: find accessions of a species

Staphylococcus capitis (taxonomy ID 29388)

nwr info "Staphylococcus capitis"

nwr member 29388

echo '
.headers ON
    SELECT
        organism_name,
        species,
        genus,
        ftp_path,
        assembly_level
    FROM ar
    WHERE 1=1
        AND tax_id != species_id    -- with strain ID
        AND species_id IN (29388)
    ' |
    sqlite3 -tabs ~/.nwr/ar_refseq.sqlite \
    > Scap.assembly.tsv

echo '
    SELECT
        species || " " || REPLACE(assembly_accession, ".", "_") AS organism_name,
        species,
        genus,
        ftp_path,
        assembly_level
    FROM ar
    WHERE 1=1
        AND tax_id = species_id     -- no strain ID
        AND assembly_level IN ("Chromosome", "Complete Genome")
        AND species_id IN (29388)
    ' |
    sqlite3 -tabs ~/.nwr/ar_refseq.sqlite \
    >> Scap.assembly.tsv

Example 4: find model organisms in a family

echo "
.headers ON
    SELECT
        tax_id,
        organism_name
    FROM ar
    WHERE 1=1
        AND family IN ('Enterobacteriaceae')
        AND refseq_category IN ('reference genome')
    " |
    sqlite3 -tabs ~/.nwr/ar_refseq.sqlite |
    sed '1s/^/#/' |
    tva to md

| #tax_id | organism_name                                                    |
|--------:|:-----------------------------------------------------------------|
|  511145 | Escherichia coli str. K-12 substr. MG1655                        |
|  198214 | Shigella flexneri 2a str. 301                                    |
|   99287 | Salmonella enterica subsp. enterica serovar Typhimurium str. LT2 |
|  386585 | Escherichia coli O157:H7 str. Sakai                              |
| 1125630 | Klebsiella pneumoniae subsp. pneumoniae HS11286                  |

Download files from NCBI Assembly

Strain info

cat ~/.nwr/assembly_summary_refseq.txt |
    sed '1d' |
    tva stats -H --missing-count infraspecific_name,isolate,biosample |
    tva to md --fmt

# infraspecific_name
cat ~/.nwr/assembly_summary_refseq.txt |
    sed '1d' |
    tva select -H -f infraspecific_name |
    perl -nla -F"=" -e 'print $F[0]' |
    tva keep-header -- sort |
    uniq -c
#       1 infraspecific_name
#      38 breed
#     109 cultivar
#      82 ecotype
#   67134 na
#  456330 strain

cat ~/.nwr/assembly_summary_genbank.txt |
    sed '1d' |
    tva select -H -f infraspecific_name |
    perl -nla -F"=" -e 'print $F[0]' |
    tva keep-header -- sort |
    uniq -c
#       1 infraspecific_name
#     565 breed
#    2587 cultivar
#    1098 ecotype
# 1264802 na
# 2151541 strain

cat ~/.nwr/assembly_summary_refseq.txt ~/.nwr/assembly_summary_genbank.txt |
    grep -v "^#" |
    tva select -f 9 | # infraspecific_name
    perl -nla -F"=" -e 'print $F[1]' |
    tva keep-header -- sort |
    uniq -c |
    sort -nr |
    head
# 1331958
#    3722 GPSC3
#    2491 Human
#    2451 clinical isolate of L. monocytogenes
#    2360 MSSA
#    1612 ExPEC
#    1377 MRSA
#    1285 GPSC12
#     898 GPSC16
#     854 GPSC55

# String length
cat ~/.nwr/assembly_summary_refseq.txt |
    sed '1d' |
    tva select -H -f organism_name,infraspecific_name,asm_name,ftp_path |
    sed '1d' |
    perl -nla -F"\t" -e 'print join qq(\t), map {length} @F ;' |
    tva stats --exclude-missing --max 1,2,3,4
#91      88      92      166

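| infraspecific_name_missing_count | isolate_missing_count | biosample_missing_count |
|---------------------------------:|----------------------:|------------------------:|
|                                0 |                     0 |                       0 |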

Reference genomes

cd ~/Scripts/nwr/docs/

nwr member Bacteria Archaea -r family |
    grep -v -i "Candidatus " |
    grep -v -i "candidate " |
    grep -v " sp." |
    grep -v " spp." |
    sed '1d' |
    sort -n -k1,1 \
    > family.list.tsv

wc -l family.list.tsv
#707 family.list.tsv

FAMILY=$(
    cat family.list.tsv |
        cut -f 1 |
        tr "\n" "," |
        sed 's/,$//'
)

echo "
.headers ON
    SELECT
        *
    FROM ar
    WHERE 1=1
        AND family_id IN ($FAMILY)
        AND refseq_category IN ('reference genome')
    " |
    sqlite3 -tabs ~/.nwr/ar_refseq.sqlite \
    > reference.tsv

cat reference.tsv |
    tsv-select -H -f organism_name,species,genus,ftp_path,assembly_level \
    > raw.tsv

cat raw.tsv |
    grep -v '^#' |
    rgr dedup stdin |
    perl ~/Scripts/withncbi/taxon/abbr_name.pl -c "1,2,3" -s '\t' -m 3 --shortsub |
    (echo -e '#name\tftp_path\torganism\tassembly_level' && cat ) |
    perl -nl -a -F"\t" -e '
        BEGIN{my %seen};
        /^#/ and print and next;
        /^organism_name/i and next;
        $seen{$F[3]}++; # ftp_path
        $seen{$F[3]} > 1 and next;
        $seen{$F[5]}++; # abbr_name
        $seen{$F[5]} > 1 and next;
        printf qq{%s\t%s\t%s\t%s\n}, $F[5], $F[3], $F[1], $F[4];
        ' |
    keep-header -- sort -k3,3 -k1,1 \
    > Bacteria.assembly.tsv

File format: .assembly.tsv

A TAB-delimited file for downloading assembly files.

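| Col | Type   | Description                                              |
|----:|:-------|:---------------------------------------------------------|
|   1 | string | #name: species + infraspecific_name + assembly_accession |
|   2 | string | ftp_path                                                 |
|   3 | string | biosample                                                |
|   4 | string | species                                                  |
|   5 | string | assembly_level                                           |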
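
A quick sanity check for this layout could look like the sketch below. The function name check_assembly_tsv is hypothetical, and the demo rows are placeholders rather than real NCBI records; the check only counts TAB-separated fields and does not validate their contents.

```shell
# check_assembly_tsv FILE: succeed when every non-comment line has exactly
# five TAB-separated fields; report offending lines otherwise.
check_assembly_tsv() {
    awk -F '\t' '
        /^#/    { next }    # skip the "#name ..." header line
        NF != 5 { printf "line %d has %d fields\n", NR, NF; bad = 1 }
        END     { exit (bad ? 1 : 0) }
    ' "$1"
}

# Placeholder rows for illustration only; these are not real NCBI records.
demo_tsv=$(mktemp)
printf '#name\tftp_path\tbiosample\tspecies\tassembly_level\n' > "$demo_tsv"
printf 'G_spe_demo\thttps://example.org/demo\tSAMN00000000\tGenus species\tContig\n' >> "$demo_tsv"

check_assembly_tsv "$demo_tsv" && echo "five columns on every line"
```

Running such a check before nwr template may catch rows where a stray TAB or a missing field shifted the columns.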

download

Behavior:

  • Downloads the latest releases of taxdump and assembly reports from NCBI.
  • Automatically verifies MD5 checksum for taxdump.
  • Extracts taxdump.tar.gz to the NWR directory.
  • Skips downloading if files already exist.
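
The checksum step can also be reproduced by hand, which is useful after a manual download. The sketch below assumes that the .md5 file uses the "hash  filename" layout understood by md5sum -c and that GNU md5sum is available. The function name verify_md5 is hypothetical, and the demo runs on a throwaway file instead of the real taxdump.tar.gz.

```shell
# verify_md5 DIR FILE: check DIR/FILE against the checksum stored in DIR/FILE.md5.
# md5sum -c resolves file names relative to the current directory, hence the cd.
verify_md5() {
    ( cd "$1" && md5sum -c "$2.md5" >/dev/null 2>&1 )
}

# Self-contained demonstration on a throwaway file.
demo_dir=$(mktemp -d)
printf 'placeholder payload\n' > "$demo_dir/demo.tar.gz"
( cd "$demo_dir" && md5sum demo.tar.gz > demo.tar.gz.md5 )

if verify_md5 "$demo_dir" demo.tar.gz; then
    echo "checksum OK"
else
    echo "checksum mismatch"
fi
```

For the real files the call would be verify_md5 ~/.nwr taxdump.tar.gz.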

Manual Download:

You can also download the files manually:

mkdir -p ~/.nwr

# taxdump
wget -N -P ~/.nwr https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
wget -N -P ~/.nwr https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz.md5

# assembly reports
wget -N -P ~/.nwr https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt
wget -N -P ~/.nwr https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt

# with aria2
cat <<EOF > download.txt
https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz.md5
https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt
https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt

EOF

aria2c -x 4 -s 2 -c -d ~/.nwr -i download.txt

Examples:

  1. Download with default settings
     nwr download

  2. Use a different FTP host
     nwr download --host ftp.ncbi.nih.gov:21

  3. Custom paths
     nwr download --tx /pub/taxonomy --ar /genomes/ASSEMBLY_REPORTS

txdb

Behavior:

  • Initializes the taxonomy database from taxdump.tar.gz.
  • Creates a SQLite database at ~/.nwr/taxonomy.sqlite.
  • Loads data from division.dmp, names.dmp, and nodes.dmp.
  • Creates indexes for efficient querying.

Database Location:

~/.nwr/taxonomy.sqlite

The DDL:

DROP TABLE IF EXISTS division;
DROP TABLE IF EXISTS node;
DROP TABLE IF EXISTS name;

CREATE TABLE IF NOT EXISTS division (
    id       INTEGER      NOT NULL
                          PRIMARY KEY,
    division VARCHAR (50) NOT NULL
);

CREATE TABLE IF NOT EXISTS node (
    tax_id        INTEGER      NOT NULL
                               PRIMARY KEY,
    parent_tax_id INTEGER,
    rank          VARCHAR (25) NOT NULL,
    division_id   INTEGER      NOT NULL,
    comment       TEXT,
    FOREIGN KEY (
        division_id
    )
    REFERENCES division (id)
);

CREATE TABLE IF NOT EXISTS name (
    id         INTEGER      NOT NULL
                            PRIMARY KEY,
    tax_id     INTEGER      NOT NULL,
    name       VARCHAR (50) NOT NULL,
    name_class VARCHAR (50) NOT NULL
);

Query the database:

echo "
    SELECT sql
    FROM sqlite_master
    WHERE type='table'
    ORDER BY name;
    " |
    sqlite3 -tabs ~/.nwr/taxonomy.sqlite

Examples:

  1. Initialize the taxonomy database
     nwr txdb

  2. Use a custom directory
     nwr txdb --dir /path/to/nwr

ardb

Behavior:

  • Initializes the assembly database from assembly summary files.
  • Creates SQLite databases at ~/.nwr/ar_refseq.sqlite and ~/.nwr/ar_genbank.sqlite.
  • Loads data from assembly_summary_refseq.txt or assembly_summary_genbank.txt.
  • Appends taxonomic lineage information (species, genus, family).
  • Filters out unqualified entries (uncultured, unidentified, etc.).

Database Location:

~/.nwr/ar_refseq.sqlite
~/.nwr/ar_genbank.sqlite

Input Columns:

  • assembly_summary_*.txt have 23 tab-delimited columns.

  • The first number is the 0-based column index in the input file. Fields followed by a second number are loaded into the database, and that number is their position in the ar table.

     0  assembly_accession          6
     1  bioproject                  4
     2  biosample                   5
     3  wgs_master
     4  refseq_category             7
     5  taxid AS tax_id             1
     6  species_taxid
     7  organism_name               2
     8  infraspecific_name          3
     9  isolate
    10  version_status
    11  assembly_level              8
    12  release_type
    13  genome_rep                  9
    14  seq_rel_date                10
    15  asm_name                    11
    16  submitter
    17  gbrs_paired_asm             12
    18  paired_asm_comp
    19  ftp_path                    13
    20  excluded_from_refseq
    21  relation_to_type_material
    22  asm_not_live_date
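
The column mapping can be tried out with awk, as in the sketch below. It mirrors only the selection and ordering listed here, not the rest of what nwr ardb does (lineage lookup, filtering). The function name reorder_ar_columns is hypothetical, and the demo feeds the 23 field names themselves through the function instead of real records.

```shell
# reorder_ar_columns: read assembly_summary-style lines on stdin and print the
# 13 selected fields in ar-table order. awk fields are 1-based, so each index
# is the 0-based input column number plus one.
reorder_ar_columns() {
    awk -F '\t' -v OFS='\t' '
        { print $6, $8, $9, $2, $3, $1, $5, $12, $14, $15, $16, $18, $20 }
    '
}

# The 23 input field names, in file order.
FIELDS="assembly_accession bioproject biosample wgs_master refseq_category taxid
species_taxid organism_name infraspecific_name isolate version_status
assembly_level release_type genome_rep seq_rel_date asm_name submitter
gbrs_paired_asm paired_asm_comp ftp_path excluded_from_refseq
relation_to_type_material asm_not_live_date"

# Unquoted on purpose: bash word splitting joins the names with single spaces.
echo $FIELDS | tr ' ' '\t' | reorder_ar_columns
```

The output starts with taxid and ends with ftp_path, matching positions 1 to 13 of the ar table.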

Appended Columns:

14  species
15  species_id
16  genus
17  genus_id
18  family
19  family_id

Filtered Strains:

Entries whose organism_name matches any of the following regexes are removed:

\bCandidatus\b
\bcandidate\b
\buncultured\b
\bunidentified\b
\bbacterium\b
\barchaeon\b
\bmetagenome\b
virus\b
phage\b
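
To see which names these patterns catch, they can be combined into a single alternation and tried with grep. This is only an illustration: it assumes a grep that understands \b (GNU grep does) and case-sensitive matching as the patterns are written, and the function name is_filtered is hypothetical. It does not reproduce how nwr ardb applies the filter internally.

```shell
# is_filtered NAME: succeed when NAME matches one of the patterns listed above.
is_filtered() {
    printf '%s\n' "$1" | grep -q -E \
        '\bCandidatus\b|\bcandidate\b|\buncultured\b|\bunidentified\b|\bbacterium\b|\barchaeon\b|\bmetagenome\b|virus\b|phage\b'
}

for name in \
    "Escherichia coli str. K-12 substr. MG1655" \
    "Candidatus Pelagibacter ubique" \
    "uncultured bacterium" \
    "Escherichia phage T4"
do
    if is_filtered "$name"; then
        echo "removed: $name"
    else
        echo "kept:    $name"
    fi
done
```

Note that virus\b and phage\b have no leading word boundary, so they also match longer words that end in "virus" or "phage".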

Requirements:

  • Assemblies with an assembly_level of Scaffold or Contig must have a genome_rep of Full.
  • Requires SQLite version 3.34 or above.

Query the database:

echo "
    SELECT
        COUNT(*)
    FROM ar
    WHERE 1=1
        AND genus IN ('Pseudomonas')
        AND assembly_level IN ('Complete Genome', 'Chromosome')
    " |
    sqlite3 -tabs ~/.nwr/ar_refseq.sqlite

The DDL:

DROP TABLE IF EXISTS ar;

CREATE TABLE IF NOT EXISTS ar (
    tax_id             INTEGER,
    organism_name      VARCHAR (200),
    infraspecific_name VARCHAR (200),
    bioproject         VARCHAR (50),
    biosample          VARCHAR (50),
    assembly_accession VARCHAR (50),
    refseq_category    VARCHAR (50),
    assembly_level     VARCHAR (50),
    genome_rep         VARCHAR (50),
    seq_rel_date       DATE,
    asm_name           VARCHAR (200),
    gbrs_paired_asm    VARCHAR (200),
    ftp_path           VARCHAR (200),
    species            VARCHAR (50),
    species_id         INTEGER,
    genus              VARCHAR (50),
    genus_id           INTEGER,
    family             VARCHAR (50),
    family_id          INTEGER
);

Examples:

  1. Initialize the RefSeq assembly database
     nwr ardb

  2. Initialize the GenBank assembly database
     nwr ardb --genbank

  3. Use a custom directory
     nwr ardb --dir /path/to/nwr

append

Behavior:

  • Retrieves taxonomic information from the local taxonomy database.
  • Appends scientific names and/or taxon IDs of specified ranks to each row.
  • If --rank is not specified, appends the scientific name of the input taxon.
  • Header lines (starting with “#”) are processed to append appropriate column names.

Valid ranks:

  • species, genus, family, order, class, phylum, kingdom
  • Other ranks (e.g., clade, no rank) may work but are not officially supported.

Input:

  • Accepts one or more TSV files as input.
  • Reads from standard input if “stdin” is specified.
  • The input file should contain taxon IDs or scientific names in a specific column.

Output:

  • Tab-separated values with appended rank columns.
  • By default, output is written to standard output.
  • Use --outfile to write to a file instead.

Examples:

  1. Append scientific names for specified ranks
     nwr append input.tsv --rank genus --rank family

  2. Append both names and IDs
     nwr append input.tsv --rank species --id

  3. Read from stdin, append genus information
     cat input.tsv | nwr append stdin --rank genus

  4. Specify column and output file
     nwr append input.tsv -c 2 --rank kingdom -o output.tsv

common

Behavior:

  • Outputs the common tree of terms in Newick format.
  • Finds the most recent common ancestor of all input terms.
  • Constructs a phylogenetic tree showing the relationship.
  • Ancestral terms can be Taxonomy IDs or scientific names.

Input:

  • Accepts two or more Taxonomy IDs or scientific names.
  • Terms are provided as positional arguments.

Output:

  • Newick format tree string.
  • Tree includes scientific names as node labels.
  • By default, output is written to standard output.
  • Use --outfile to write to a file instead.

Examples:

  1. Find common ancestor of two species
     nwr common 9606 10090

  2. Find common ancestor of multiple taxa
     nwr common "Homo sapiens" "Mus musculus" "Danio rerio"

  3. Write to file
     nwr common 9606 10090 -o tree.nwk

  4. Use taxonomy IDs
     nwr common 9605 10090 10116

info

Behavior:

  • Retrieves taxonomic information from the local taxonomy database.
  • Accepts Taxonomy IDs or scientific names as input.
  • By default, outputs detailed information in a custom format.
  • Use --tsv to output results as tab-separated values.

Input:

  • Accepts one or more Taxonomy IDs or scientific names.
  • Terms can be provided as positional arguments.

Output:

  • Default format shows detailed taxonomic information.
  • TSV output includes: tax_id, sci_name, rank, division.
  • By default, output is written to standard output.
  • Use --outfile to write to a file instead.

Examples:

  1. Get information for a single taxon
     nwr info 9606

  2. Get information for multiple taxa
     nwr info 9606 10090 10116

  3. Output as TSV
     nwr info Homo_sapiens --tsv

  4. Use scientific names
     nwr info "Homo sapiens" "Mus musculus"

lineage

Behavior:

  • Retrieves the lineage of a taxon from root to the specified term.
  • Returns the full taxonomic hierarchy including all ranks.
  • By default, outputs rank, scientific name, and taxonomy ID for each level.

Input:

  • Accepts a single Taxonomy ID or scientific name.
  • Use --tsv for tab-separated output format.

Output:

  • Default output: rank, scientific_name, tax_id (tab-separated)
  • TSV output: rank, scientific_name, tax_id (tab-separated)
  • By default, output is written to standard output.
  • Use --outfile to write to a file instead.

Examples:

  1. Get lineage for a species
     nwr lineage 9606

  2. Get lineage using scientific name
     nwr lineage "Homo sapiens"

  3. Output as TSV
     nwr lineage 9606 --tsv

  4. Write to file
     nwr lineage 9606 -o lineage.txt

member

Behavior:

  • Lists members (of certain ranks) under ancestral term(s).
  • Retrieves taxonomic information from the local taxonomy database.
  • Ancestral terms can be Taxonomy IDs or scientific names.
  • By default, excludes “Environmental samples” division.
  • The output file is in the same TSV format as nwr info --tsv.

Valid ranks:

  • species, genus, family, order, class, phylum, kingdom
  • Other ranks (e.g., clade, no rank) may work but are not officially supported.

Input:

  • Accepts one or more ancestral Taxonomy IDs or scientific names.
  • Optionally filter results by rank using --rank.

Output:

  • TSV output includes: tax_id, sci_name, rank, division.
  • By default, output is written to standard output.
  • Use --outfile to write to a file instead.

Examples:

  1. List all members under a genus
     nwr member 9605

  2. List only species under a genus
     nwr member Homo --rank species

  3. Include environmental samples
     nwr member 4751 --env

  4. Multiple ancestors with rank filter
     nwr member Homo Pan --rank genus

restrict

Behavior:

  • Restricts taxonomy terms to descendants of specified ancestor(s).
  • Terms can be Taxonomy IDs or scientific names.
  • Use --exclude to invert the filter (exclude matching lines).
  • Header lines (starting with "#") are always written to the output.

Input:

  • Accepts one or more TSV files via --file option.
  • Reads from standard input by default.
  • The input file should contain taxon IDs or scientific names in a specific column.

Output:

  • Filtered tab-separated values.
  • By default, output is written to standard output.
  • Use --outfile to write to a file instead.

Examples:

  1. Restrict to descendants of a specific genus
     nwr restrict "Homo" --file input.tsv

  2. Restrict using taxonomy ID
     nwr restrict 9605 --file input.tsv

  3. Exclude descendants (inverse filter)
     nwr restrict "Bacteria" --file input.tsv --exclude

  4. Specify column and output file
     nwr restrict "Mammalia" --file input.tsv -c 2 -o output.tsv

  5. Multiple ancestors
     nwr restrict "Homo" "Pan" --file input.tsv

template

Behavior:

  • Creates directories, data files, and scripts for phylogenomic research.
  • Generates materials for ASSEMBLY, BioSample, MinHash, Count, and Protein steps.
  • Uses Tera templates to generate Bash scripts.

Input File Format:

.assembly.tsv is a TAB-delimited file to guide downloading and processing:

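| Col | Type   | Description                                              |
|----:|:-------|:---------------------------------------------------------|
|   1 | string | #name: species + infraspecific_name + assembly_accession |
|   2 | string | ftp_path                                                 |
|   3 | string | biosample                                                |
|   4 | string | species                                                  |
|   5 | string | assembly_level                                           |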

Generated Materials:

  • --ass: ASSEMBLY/

    • One TSV file: url.tsv
    • Five Bash scripts: rsync.sh, check.sh, n50.sh, collect.sh, finish.sh
  • --bs: BioSample/

    • One TSV file: sample.tsv
    • Two Bash scripts: download.sh, collect.sh
  • --mh: MinHash/

    • One TSV file: species.tsv
    • Five Bash scripts: compute.sh, species.sh, abnormal.sh, nr.sh, dist.sh
  • --count: Count/

    • One TSV file: species.tsv
    • Three Bash scripts: strains.sh, rank.sh, lineage.sh
  • --pro: Protein/

    • One TSV file: species.tsv
    • Three Bash scripts: collect.sh, info.sh, count.sh

Examples:

  1. Generate ASSEMBLY materials
     nwr template input.assembly.tsv --ass

  2. Generate all materials
     nwr template input.assembly.tsv --ass --bs --mh --count --pro

  3. Specify output directory
     nwr template input.assembly.tsv --ass -o output_dir

  4. Use parallel processing
     nwr template input.assembly.tsv --mh --parallel 16

abbr

Behavior:

  • Abbreviates strain scientific names to unique short identifiers.
  • Generates abbreviations for genus, species, and strain parts.
  • Handles special cases like Candidatus and subspecies names.
  • Ensures uniqueness of abbreviations across all input names.

Input:

  • Accepts a TSV/CSV file or standard input.
  • Each row should contain strain, species, and genus names in separate columns.
  • Use --column to specify which columns contain these names (default: 1,2,3).
  • Common column patterns:
    • 1,2,3 - strain in column 1, species in 2, genus in 3
    • 1,1,2 - no strain: strain and species both in column 1, genus in 2
    • 2,2,3 - don’t need strain part: strain and species in 2, genus in 3
    • 1,1,1 - only strain: all three in column 1

Output:

  • Original line followed by a tab and the generated abbreviation.
  • Abbreviation format:
    • Normal mode: Genus_Species_Strain (e.g., H_sapiens_sapiens)
    • Tight mode (--tight): GenusSpecies_Strain (e.g., Hsapiens_sapiens)
  • Special handling:
    • Candidatus is abbreviated to C
    • Non-alphanumeric characters are replaced with underscores
    • Consecutive underscores are collapsed
    • Leading and trailing underscores are removed
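
The three clean-up rules at the end of this list can be imitated with sed, as sketched below. This covers only the character clean-up, not the abbreviation logic of nwr abbr. The function name clean_label is hypothetical, and "alphanumeric" is assumed to mean ASCII letters and digits.

```shell
# clean_label TEXT: apply the clean-up rules to one string.
#   1. replace every non-alphanumeric character with an underscore
#   2. collapse consecutive underscores
#   3. strip leading and trailing underscores
clean_label() {
    printf '%s\n' "$1" |
        sed -E 's/[^A-Za-z0-9]/_/g; s/_+/_/g; s/^_//; s/_$//'
}

clean_label "str. K-12 substr. MG1655"    # prints str_K_12_substr_MG1655
```

Applying the rules in this order matters: collapsing before trimming guarantees that at most one underscore is left at either end to strip.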

Examples:

  1. Basic usage (species in column 1, genus in column 2)
     echo -e 'Homo sapiens,Homo\nHomo erectus,Homo' | nwr abbr -s ',' -c "1,1,2"

  2. Tight mode (no underscore between genus and species)
     echo -e 'Homo sapiens,Homo\nHomo erectus,Homo' | nwr abbr -s ',' -c "1,1,2" --tight

  3. Clean subspecies names
     echo 'Legionella pneumophila subsp. pneumophila' | nwr abbr --shortsub

  4. Process a file
     nwr abbr names.tsv -o abbreviated.tsv

  5. Custom separator and columns
     nwr abbr data.csv -s ',' -c "1,2,3" -o output.tsv

kb

Behavior:

  • Prints embedded documentation and knowledge bases.
  • Extracts built-in files to stdout or a specified output directory.

Available Documents:

  • bac120 - 120 bacterial marker genes (tar.gz archive)
  • ar53 - 53 archaeal marker genes (tar.gz archive)

Output:

  • Archive files (bac120, ar53) are extracted to the specified directory.
  • By default, output is written to standard output.
  • Use --outfile to specify an output file or directory.

Examples:

  1. Extract bacterial marker genes
     nwr kb bac120 -o marker_genes/

  2. Extract archaeal marker genes
     nwr kb ar53 -o marker_genes/

seqdb

Behavior:

  • Initializes the sequence database for protein sequence information.
  • Creates a SQLite database at ./seq.sqlite.
  • Loads data from various TSV files into appropriate tables.
  • Supports loading strains, sizes, clusters, annotations, and assembly sequences.

Database Location:

./seq.sqlite

The DDL:

CREATE TABLE rank (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name VARCHAR NOT NULL UNIQUE
);
-- assembly
CREATE TABLE asm (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name VARCHAR NOT NULL UNIQUE,
    rank_id INTEGER NOT NULL,
    FOREIGN KEY (rank_id) REFERENCES rank(id)
);
-- sequence
CREATE TABLE seq (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name VARCHAR NOT NULL UNIQUE,
    size INTEGER,
    anno TEXT
);
-- representative
CREATE TABLE rep (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name VARCHAR NOT NULL UNIQUE,
    f1 TEXT,
    f2 TEXT,
    f3 TEXT,
    f4 TEXT,
    f5 TEXT,
    f6 TEXT,
    f7 TEXT,
    f8 TEXT
);
-- Junction table to associate rep with seq
CREATE TABLE rep_seq (
    rep_id INTEGER NOT NULL,
    seq_id INTEGER NOT NULL,
    PRIMARY KEY (rep_id, seq_id),
    FOREIGN KEY (rep_id) REFERENCES rep(id),
    FOREIGN KEY (seq_id) REFERENCES seq(id)
);
-- Junction table to associate asm with seq
CREATE TABLE asm_seq (
    asm_id INTEGER NOT NULL,
    seq_id INTEGER NOT NULL,
    PRIMARY KEY (asm_id, seq_id),
    FOREIGN KEY (asm_id) REFERENCES asm(id),
    FOREIGN KEY (seq_id) REFERENCES seq(id)
);
-- Regular indices
CREATE INDEX rep_idx_f1 ON rep(f1);
CREATE INDEX rep_idx_f2 ON rep(f2);
CREATE INDEX rep_idx_f3 ON rep(f3);
CREATE INDEX rep_idx_f4 ON rep(f4);
CREATE INDEX rep_idx_f5 ON rep(f5);
CREATE INDEX rep_idx_f6 ON rep(f6);
CREATE INDEX rep_idx_f7 ON rep(f7);
CREATE INDEX rep_idx_f8 ON rep(f8);
-- Case-insensitive indices for `like`
CREATE INDEX seq_idx_anno ON seq(anno COLLATE NOCASE);

Notes:

  • If --strain is called without specifying a path, it will load the default file under --dir.
  • --rep requires a key-value pair in the format --rep f1=file.
  • Valid fields for --rep are: f1, f2, f3, f4, f5, f6, f7, f8.
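
The field=file format expected by --rep can be illustrated with plain shell parameter expansion. The sketch below is not nwr's own parser; the function name parse_rep_arg is hypothetical, and it only shows how a value such as f1=file splits and which field names pass the check.

```shell
# parse_rep_arg ARG: split "field=file" and accept only the fields f1..f8.
# Prints "field<TAB>file" on success; returns non-zero otherwise.
parse_rep_arg() {
    local arg="$1"
    local field="${arg%%=*}"   # text before the first "="
    local file="${arg#*=}"     # text after the first "="

    case "$arg" in
        *=*) ;;
        *) echo "expected field=file, got: $arg" >&2; return 1 ;;
    esac

    case "$field" in
        f[1-8]) printf '%s\t%s\n' "$field" "$file" ;;
        *) echo "invalid field: $field" >&2; return 1 ;;
    esac
}

parse_rep_arg "f1=fam88_cluster.tsv"
```

Splitting at the first "=" keeps file names that themselves contain "=" intact.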

Examples:

  1. Initialize the database
     nwr seqdb --init

  2. Load strain information
     nwr seqdb --strain strains.tsv

  3. Load multiple data types
     nwr seqdb --strain --size --clust

  4. Load features into rep table
     nwr seqdb --rep f1=features.tsv

taxonomy.sqlite

Tables

| Name     | Columns | Comment | Type  |
|:---------|--------:|:--------|:------|
| division |       2 |         | table |
| node     |       5 |         | table |
| name     |       4 |         | table |

Relations

[Figure: ER diagram of taxonomy.sqlite]


Generated by tbls

ar_refseq.sqlite

Tables

| Name | Columns | Comment | Type  |
|:-----|--------:|:--------|:------|
| ar   |      17 |         | table |

Relations

[Figure: ER diagram of ar_refseq.sqlite]


Generated by tbls