nwr - NCBI taxonomy/assembly WRangler

Install

Current release: 0.9.0

cargo install nwr

# or
cargo install --path . --force # --offline

Or install the pre-compiled binary via the cross-platform package manager cbp (supports older Linux systems with glibc 2.17+):

cbp install nwr

You can also download the pre-compiled binaries from the Releases page.

`nwr help`

$ nwr help
`nwr` is a command line **N**CBI taxonomy and assembly **WR**angler.

Usage: nwr [COMMAND]

Commands:
  download     Download the latest releases of `taxdump` and assembly reports
  txdb         Init the taxonomy database
  ardb         Init the assembly database
  info         Information of Taxonomy ID(s) or scientific name(s)
  lineage      Output the lineage of the term
  member       List members (of certain ranks) under ancestral term(s)
  append       Append fields of higher ranks to a TSV file
  restrict     Restrict taxonomy terms to ancestral descendants
  common       Output the common tree of terms
  template     Create dirs, data and scripts for a phylogenomic research
  kb           Prints docs (knowledge bases)
  seqdb        Init the seq database
  help         Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version

Subcommand groups:

* Database
    * download / txdb / ardb
* Taxonomy
    * info / lineage / member / append / restrict / common
* Assembly
    * template / kb / seqdb

Examples

Initiate local databases

nwr download was last run on Sun Apr 5 15:59:45 UTC 2026 (as reported by date --utc).

The databases don't need frequent updates; in our lab, we update them about once a year. For reproducibility, the database files for the above date are available on the Releases page.

cbp install nwr

nwr download
nwr txdb

nwr ardb
nwr ardb --genbank

cd $HOME/.nwr
tar cvfz ncbi.$(date +"%Y%m%d").tar.gz \
    taxdump.tar.gz \
    taxdump.tar.gz.md5 \
    assembly_summary_genbank.txt \
    assembly_summary_refseq.txt

rm \
    *.dmp \
    taxdump.tar.gz \
    taxdump.tar.gz.md5 \
    assembly_summary_genbank.txt \
    assembly_summary_refseq.txt

Usage of each command

For practical uses of nwr and other awesome companions, follow this page.

# nwr download

# nwr txdb

nwr info "Homo sapiens" 4932

nwr lineage "Homo sapiens"
nwr lineage 4932

nwr restrict "Vertebrata" -c 2 -f tests/nwr/taxon.tsv
##sci_name       tax_id
#Human   9606

nwr member "Homo"

nwr append tests/nwr/taxon.tsv -c 2 -r species -r family --id

# nwr ardb
# nwr ardb --genbank

nwr common "Escherichia coli" 4932 Drosophila_melanogaster 9606 Mus_musculus

seqdb

export SPECIES="$HOME/data/Archaea/Protein/Sulfolobus_acidocaldarius"

cargo run --bin nwr seqdb -d ${SPECIES} --init --strain

cargo run --bin nwr seqdb -d ${SPECIES} \
    --size <(
        pgr fa size ${SPECIES}/pro.fa.gz
    ) \
    --clust

cargo run --bin nwr seqdb -d ${SPECIES} \
    --anno <(
        gzip -dcf "${SPECIES}"/anno.tsv.gz
    ) \
    --asmseq <(
        gzip -dcf "${SPECIES}"/asmseq.tsv.gz
    )

cargo run --bin nwr seqdb -d ${SPECIES} --rep f1="${SPECIES}"/fam88_cluster.tsv

echo "
    SELECT
        *
    FROM asm
    WHERE 1=1
    " |
    sqlite3 -tabs ${SPECIES}/seq.sqlite

echo "
    SELECT
        COUNT(distinct asm_seq.asm_id)
    FROM asm_seq
    WHERE 1=1
    " |
    sqlite3 -tabs ${SPECIES}/seq.sqlite

echo "
.header ON
    SELECT
        'species' AS species,
        COUNT(distinct asm_seq.asm_id) AS strain,
        COUNT(*) AS total,
        COUNT(distinct rep_seq.seq_id) AS dedup,
        COUNT(distinct rep_seq.rep_id) AS rep
    FROM asm_seq
    JOIN rep_seq ON asm_seq.seq_id = rep_seq.seq_id
    WHERE 1=1
    " |
    sqlite3 -tabs ${SPECIES}/seq.sqlite

NCBI Assembly Reports

Preparations

cbp install nwr
cbp install sqlite3
cbp install tva

Requires SQLite version 3.34 or above. The sqlite3 bundled with macOS does not work.
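
One way to check the version requirement is to compare version strings with sort -V. The sketch below is only illustrative: version_ge is a hypothetical helper (not part of nwr), it assumes a sort that supports -V (GNU coreutils or a recent BSD), and CURRENT is hard-coded so the example runs without sqlite3 installed.

```shell
# version_ge A B: succeed when version A >= version B
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

REQUIRED="3.34"
# Example value; in practice something like:
#   CURRENT=$(sqlite3 --version | cut -d ' ' -f 1)
CURRENT="3.45.1"

if version_ge "$CURRENT" "$REQUIRED"; then
    echo "sqlite3 $CURRENT is new enough"
else
    echo "sqlite3 $CURRENT is older than $REQUIRED" >&2
fi
```

A plain string comparison would rank 3.9 above 3.34, which is why the sketch relies on version-aware sorting.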

NCBI Taxonomy Statistics

curl -L "https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=statistics&?&unclassified=hide&uncultured=hide" |
    tva from html -q 'table[bgcolor="#CCCCFF"] table[bgcolor="#FFFFFF"] tr td text{}' |
    grep '\S' |
    paste -d $'\t' - - - - - - |
    tva to md --right 2-6

Ranks:	higher taxa	genus	species	lower taxa	total
Archaea	0	340	1,200	2,290	2,290
Bacteria	0	5,782	33,615	90,218	90,218
Eukaryota	0	104,261	631,437	804,447	804,447
Fungi	0	8,095	74,507	88,460	88,460
Metazoa	0	75,546	340,416	453,240	453,240
Viridiplantae	0	16,338	198,532	237,280	237,280
Viruses	36	3,493	14,612	200,795	201,328
All taxa	54	113,878	700,762	1,097,758	1,118,224

NCBI ASSEMBLY

assembly_level

for C in refseq genbank; do
    cat ~/.nwr/assembly_summary_${C}.txt |
        sed '1d' |
        tva stats -H -g assembly_level,genome_rep --count |
        tva keep-header -- sort |
        tva to md --fmt

    echo -e "\nTable: ${C}\n\n"
done

assembly_level	genome_rep	count
Chromosome	Full	8,629
Chromosome	Partial	355
Complete Genome	Full	76,533
Complete Genome	Partial	7
Contig	Full	280,107
Contig	Partial	30
Scaffold	Full	158,032

Table: refseq

assembly_level	genome_rep	count
Chromosome	Full	44,020
Chromosome	Partial	1,196
Complete Genome	Full	309,100
Complete Genome	Partial	131
Contig	Full	2,549,556
Contig	Partial	933
Scaffold	Full	515,294
Scaffold	Partial	363

Table: genbank
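
The grouping step above counts rows per (assembly_level, genome_rep) pair. As a rough sketch of the same idea in plain awk, assuming the 23-column layout of assembly_summary_*.txt with assembly_level in column 12 and genome_rep in column 14 (1-based); the row helper and its rows are toy data, not real assemblies:

```shell
# Emit one 23-column, tab-separated row; only assembly_level (column 12)
# and genome_rep (column 14) are filled in. Toy data for illustration.
row() {
    awk -v lvl="$1" -v rep="$2" \
        'BEGIN { OFS = "\t"; $23 = ""; $12 = lvl; $14 = rep; print }'
}

COUNTS=$(
    {
        row 'Complete Genome' Full
        row Contig Full
        row Contig Full
        row Chromosome Partial
    } |
        # Count each (assembly_level, genome_rep) combination
        awk -F'\t' '{ n[$12 "\t" $14]++ } END { for (k in n) print k "\t" n[k] }' |
        sort
)
echo "$COUNTS"
```

This is not how tva stats is implemented; it only shows what the grouped count computes.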

Example 1: count qualified assemblies of Eukaryote groups

ARRAY=(
    # Animals - Metazoa - kingdom
    'Flatworms::Platyhelminthes' # phylum
    'Roundworms::Nematoda'
    'Insects::Hexapoda' # subphylum
    'Reptiles::Testudines' # order
    'Reptiles::Lepidosauria'
    'Reptiles::Crocodylia'
    'Fishes::Chondrichthyes' # class
    'Fishes::Dipnoi'
    'Fishes::Actinopterygii'
    'Fishes::Hyperotreti'
    'Fishes::Hyperoartia'
    'Fishes::Coelacanthimorpha'
    'Mammals::Mammalia'
    'Birds::Aves'
    'Amphibians::Amphibia'
    # Fungi - kingdom
    'Ascomycetes::Ascomycota' # phylum
    'Basidiomycetes::Basidiomycota'
    # Plants - Viridiplantae
    'Green Plants::Viridiplantae'
    'Land Plants::Embryophyta'
    # Protists
    'Apicomplexans::Apicomplexa'
    'Kinetoplasts::Kinetoplastida'
)

echo -e "GROUP_NAME\tSCI_NAME\tComplete Genome\tChromosome\tScaffold\tContig" \
    > groups.tsv

for item in "${ARRAY[@]}" ; do
    GROUP_NAME="${item%%::*}"
    SCI_NAME="${item##*::}"

    GENUS=$(
        nwr member ${SCI_NAME} -r genus |
            grep -v -i "Candidatus " |
            grep -v -i "candidate " |
            sed '1d' |
            cut -f 1 |
            tr "\n" "," |
            sed 's/,$/\)/' |
            sed 's/^/\(/'
    )

    printf "$GROUP_NAME\t$SCI_NAME\t"

    for L in 'Complete Genome' 'Chromosome' 'Scaffold' 'Contig'; do
        echo "
            SELECT
                COUNT(*)
            FROM ar
            WHERE 1=1
                AND genus_id IN $GENUS
                AND assembly_level IN ('$L')
            " |
            sqlite3 -tabs ~/.nwr/ar_refseq.sqlite
    done |
    tr "\n" "\t" |
    sed 's/\t$//'

    echo;
done \
    >> groups.tsv

cat groups.tsv |
    tva to md --num

GROUP_NAME	SCI_NAME	Complete Genome	Chromosome	Scaffold	Contig
Flatworms	Platyhelminthes	0	2	5	0
Roundworms	Nematoda	1	4	3	0
Insects	Hexapoda	1	208	105	30
Reptiles	Testudines	0	17	1	1
Reptiles	Lepidosauria	0	25	9	1
Reptiles	Crocodylia	0	1	6	0
Fishes	Chondrichthyes	0	26	1	0
Fishes	Dipnoi	0	1	0	0
Fishes	Actinopterygii	1	225	39	9
Fishes	Hyperotreti	0	1	0	0
Fishes	Hyperoartia	0	4	0	0
Fishes	Coelacanthimorpha	0	1	0	0
Mammals	Mammalia	4	173	89	7
Birds	Aves	1	106	54	5
Amphibians	Amphibia	0	29	3	1
Ascomycetes	Ascomycota	47	49	276	162
Basidiomycetes	Basidiomycota	27	18	48	32
Green Plants	Viridiplantae	9	155	58	9
Land Plants	Embryophyta	7	152	53	8
Apicomplexans	Apicomplexa	2	25	39	3
Kinetoplasts	Kinetoplastida	1	13	7	3

Table: refseq - Eukaryotes

GROUP_NAME	SCI_NAME	Complete Genome	Chromosome	Scaffold	Contig
Flatworms	Platyhelminthes	0	47	89	20
Roundworms	Nematoda	4	157	348	218
Insects	Hexapoda	21	3513	3389	2573
Reptiles	Testudines	1	59	50	10
Reptiles	Lepidosauria	0	117	281	30
Reptiles	Crocodylia	0	5	14	0
Fishes	Chondrichthyes	0	56	60	6
Fishes	Dipnoi	0	4	0	2
Fishes	Actinopterygii	31	1111	2107	320
Fishes	Hyperotreti	0	4	3	0
Fishes	Hyperoartia	0	7	14	4
Fishes	Coelacanthimorpha	0	1	3	0
Mammals	Mammalia	25	1471	2280	973
Birds	Aves	3	447	2191	330
Amphibians	Amphibia	0	93	186	12
Ascomycetes	Ascomycota	468	1312	10872	6713
Basidiomycetes	Basidiomycota	127	188	1746	1247
Green Plants	Viridiplantae	252	4203	2895	1261
Land Plants	Embryophyta	220	4132	2688	1024
Apicomplexans	Apicomplexa	20	132	199	89
Kinetoplasts	Kinetoplastida	16	72	119	104

Table: genbank - Eukaryotes
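
Example 1 stores each group as GROUP_NAME::SCI_NAME and splits it with shell parameter expansion. A minimal sketch of just that step, using two entries from the array above:

```shell
for item in 'Green Plants::Viridiplantae' 'Fishes::Dipnoi'; do
    GROUP_NAME="${item%%::*}"   # drop the longest suffix matching '::*'
    SCI_NAME="${item##*::}"     # drop the longest prefix matching '*::'
    # '%s' keeps the values out of the printf format string
    printf '%s\t%s\n' "$GROUP_NAME" "$SCI_NAME"
done
```

Passing the names through '%s' rather than embedding them in the format string avoids surprises if a name ever contains a percent sign or a backslash.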

Example 2: count qualified assemblies of Prokaryote groups

echo -e "GROUP_NAME\tComplete Genome\tChromosome\tScaffold\tContig" \
    > groups.tsv

for item in Bacteria Archaea ; do
    PHYLUM=$(
        nwr member ${item} -r phylum |
            grep -v -i "Candidatus " |
            grep -v -i "candidate " |
            sed '1d' |
            cut -f 2 |
            sort
    )

    echo -e "$item\t\t\t\t"

    for P in $PHYLUM; do
        GENUS=$(
            nwr member ${P} -r genus |
                grep -v -i "Candidatus " |
                grep -v -i "candidate " |
                sed '1d' |
                cut -f 1 |
                tr "\n" "," |
                sed 's/,$/\)/' |
                sed 's/^/\(/'
        )

        if [[ ${#GENUS} -lt 3 ]]; then
            >&2 echo $P has no genera
            continue
        fi

        printf "$P\t"

        for L in 'Complete Genome' 'Chromosome' 'Scaffold' 'Contig'; do
            echo "
                SELECT
                    COUNT(*)
                FROM ar
                WHERE 1=1
                    AND genus_id IN $GENUS
                    AND assembly_level IN ('$L')
                " |
                sqlite3 -tabs ~/.nwr/ar_refseq.sqlite
        done |
        tr "\n" "\t" |
        sed 's/\t$//'

        echo;
    done
done  \
    >> groups.tsv

cat groups.tsv |
    tva to md --right 2-5

GROUP_NAME	Complete Genome	Chromosome	Scaffold	Contig
Bacteria
Abditibacteriota	1	0	0	1
Acidobacteriota	47	11	38	67
Actinomycetota	6050	976	26124	20668
Aquificota	25	2	26	67
Armatimonadota	3	4	4	8
Atribacterota	3	0	1	2
Bacillota	14115	1602	42966	69650
Bacteroidota	1997	284	7928	10745
Balneolota	3	1	15	39
Bdellovibrionota	49	10	48	44
Caldisericota	1	0	9	2
Calditrichota	1	1	0	3
Campylobacterota	1482	116	2584	8148
Chlamydiota	303	90	54	193
Chlorobiota	16	1	9	36
Chloroflexota	54	1	63	109
Chrysiogenota	3	0	5	0
Coprothermobacterota	1	0	1	2
Cyanobacteriota	416	44	803	1331
Deferribacterota	9	0	9	22
Deinococcota	113	5	142	234
Dictyoglomota	7	0	6	1
Elusimicrobiota	4	0	0	1
Fibrobacterota	2	0	23	60
Fidelibacterota	1	0	0	0
Fusobacteriota	262	9	211	472
Gemmatimonadota	10	1	9	48
Ignavibacteriota	3	0	5	12
Kiritimatiellota	2	0	0	6
Lentisphaerota	2	0	1	23
Minisyncoccota	1	0	0	0
Mycoplasmatota	953	71	382	1135
Myxococcota	131	9	37	148
Nitrospinota	1	0	1	10
Nitrospirota	24	0	19	24
Planctomycetota	86	30	61	117
Pseudomonadota	33304	3597	71037	157832
Rhodothermota	19	3	41	99
Spirochaetota	467	284	373	1411
Synergistota	12	4	49	110
Thermodesulfobacteriota	186	12	279	487
Thermodesulfobiota	2	0	0	2
Thermomicrobiota	2	0	3	9
Thermosulfidibacterota	1	0	0	0
Thermotogota	61	1	105	99
Verrucomicrobiota	149	9	237	272
Vulcanimicrobiota	1	0	0	0
Zhurongbacterota	1	0	0	0
Archaea
Methanobacteriota	523	21	547	1147
Microcaldota	0	0	0	0
Nanobdellota	1	0	0	0
Nitrososphaerota	21	3	12	26
Promethearchaeota	1	0	0	0
Thermoplasmatota	16	0	9	75
Thermoproteota	133	6	117	127

Table: refseq - Prokaryotes

GROUP_NAME	Complete Genome	Chromosome	Scaffold	Contig
Bacteria
Abditibacteriota	1	1	5	11
Acidobacteriota	56	13	160	612
Actinomycetota	6371	845	33276	33974
Aquificota	22	2	82	172
Armatimonadota	4	1	30	57
Atribacterota	3	0	5	7
Bacillota	16847	1906	87271	465348
Bacteroidota	2143	314	17096	30441
Balneolota	13	5	43	96
Bdellovibrionota	53	10	147	223
Caldisericota	1	0	20	4
Calditrichota	1	1	7	42
Campylobacterota	2799	162	6074	157198
Chlamydiota	408	79	118	225
Chlorobiota	17	1	30	67
Chloroflexota	57	1	286	375
Chrysiogenota	3	0	2	0
Coprothermobacterota	1	0	14	10
Cyanobacteriota	468	81	1466	3874
Deferribacterota	7	0	520	266
Deinococcota	118	5	193	282
Dictyoglomota	7	0	15	5
Elusimicrobiota	4	0	1	45
Fibrobacterota	2	0	109	199
Fidelibacterota	1	0	0	0
Fusobacteriota	293	14	258	906
Gemmatimonadota	8	1	33	167
Ignavibacteriota	3	1	62	45
Kiritimatiellota	2	0	13	48
Lentisphaerota	2	0	12	55
Minisyncoccota	1	0	0	1
Mycoplasmatota	1132	262	447	1561
Myxococcota	137	10	78	351
Nitrospinota	1	0	13	67
Nitrospirota	35	5	307	456
Planctomycetota	105	33	172	699
Pseudomonadota	42025	4639	122719	1437144
Rhodothermota	20	3	52	260
Spirochaetota	578	713	677	2696
Synergistota	14	4	127	239
Thermodesulfobacteriota	189	11	687	1767
Thermodesulfobiota	2	0	5	6
Thermomicrobiota	2	0	8	34
Thermosulfidibacterota	1	0	1	3
Thermotogota	56	1	232	219
Verrucomicrobiota	164	11	1432	2010
Vulcanimicrobiota	1	0	0	0
Zhurongbacterota	1	0	0	0
Archaea
Methanobacteriota	543	25	1296	2504
Microcaldota	0	0	0	0
Nanobdellota	2	0	0	1
Nitrososphaerota	43	22	200	653
Promethearchaeota	1	0	0	6
Thermoplasmatota	18	0	46	213
Thermoproteota	137	6	476	436

Table: genbank - Prokaryotes
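
The length test on GENUS in Example 2 works because the tr/sed pipeline yields an empty string when a phylum has no genera, while the shortest valid list, such as (1), is already three characters long. A small sketch of that behavior; to_in_list is a hypothetical helper wrapping the same pipeline, and 561 and 570 are just example IDs:

```shell
# Turn IDs on stdin (one per line) into "(id1,id2,...)"
to_in_list() {
    tr "\n" "," | sed 's/,$/)/' | sed 's/^/(/'
}

WITH_GENERA=$(printf '%s\n' 561 570 | to_in_list)
NO_GENERA=$(printf '' | to_in_list)   # empty input gives an empty string

for G in "$WITH_GENERA" "$NO_GENERA"; do
    if [ "${#G}" -lt 3 ]; then
        echo "skip: no genera"
    else
        echo "query: genus_id IN $G"
    fi
done
```

Without the guard, an empty list would produce the invalid SQL fragment "genus_id IN" followed by nothing.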

Example 3: find accessions of a species

Staphylococcus capitis - 29388 (Chinese name: 头状葡萄球菌)

nwr info "Staphylococcus capitis"

nwr member 29388

echo '
.headers ON
    SELECT
        organism_name,
        species,
        genus,
        ftp_path,
        assembly_level
    FROM ar
    WHERE 1=1
        AND tax_id != species_id    -- with strain ID
        AND species_id IN (29388)
    ' |
    sqlite3 -tabs ~/.nwr/ar_refseq.sqlite \
    > Scap.assembly.tsv

echo "
    SELECT
        species || ' ' || REPLACE(assembly_accession, '.', '_') AS organism_name,
        species,
        genus,
        ftp_path,
        assembly_level
    FROM ar
    WHERE 1=1
        AND tax_id = species_id     -- no strain ID
        AND assembly_level IN ('Chromosome', 'Complete Genome')
        AND species_id IN (29388)
    " |
    sqlite3 -tabs ~/.nwr/ar_refseq.sqlite \
    >> Scap.assembly.tsv

Example 4: find model organisms in a family

echo "
.headers ON
    SELECT
        tax_id,
        organism_name
    FROM ar
    WHERE 1=1
        AND family IN ('Enterobacteriaceae')
        AND refseq_category IN ('reference genome')
    " |
    sqlite3 -tabs ~/.nwr/ar_refseq.sqlite |
    sed '1s/^/#/' |
    tva to md

#tax_id	organism_name
511145	Escherichia coli str. K-12 substr. MG1655
198214	Shigella flexneri 2a str. 301
99287	Salmonella enterica subsp. enterica serovar Typhimurium str. LT2
386585	Escherichia coli O157:H7 str. Sakai
1125630	Klebsiella pneumoniae subsp. pneumoniae HS11286

Download files from NCBI Assembly

Strain info

cat ~/.nwr/assembly_summary_refseq.txt |
    sed '1d' |
    tva stats -H --missing-count infraspecific_name,isolate,biosample |
    tva to md --fmt

# infraspecific_name
cat ~/.nwr/assembly_summary_refseq.txt |
    sed '1d' |
    tva select -H -f infraspecific_name |
    perl -nla -F"=" -e 'print $F[0]' |
    tva keep-header -- sort |
    uniq -c
#       1 infraspecific_name
#      38 breed
#     109 cultivar
#      82 ecotype
#   67134 na
#  456330 strain

cat ~/.nwr/assembly_summary_genbank.txt |
    sed '1d' |
    tva select -H -f infraspecific_name |
    perl -nla -F"=" -e 'print $F[0]' |
    tva keep-header -- sort |
    uniq -c
#       1 infraspecific_name
#     565 breed
#    2587 cultivar
#    1098 ecotype
# 1264802 na
# 2151541 strain

cat ~/.nwr/assembly_summary_refseq.txt ~/.nwr/assembly_summary_genbank.txt |
    grep -v "^#" |
    tva select -f 9 | # infraspecific_name
    perl -nla -F"=" -e 'print $F[1]' |
    tva keep-header -- sort |
    uniq -c |
    sort -nr |
    head
# 1331958
#    3722 GPSC3
#    2491 Human
#    2451 clinical isolate of L. monocytogenes
#    2360 MSSA
#    1612 ExPEC
#    1377 MRSA
#    1285 GPSC12
#     898 GPSC16
#     854 GPSC55

# String length
cat ~/.nwr/assembly_summary_refseq.txt |
    sed '1d' |
    tva select -H -f organism_name,infraspecific_name,asm_name,ftp_path |
    sed '1d' |
    perl -nla -F"\t" -e 'print join qq(\t), map {length} @F ;' |
    tva stats --exclude-missing --max 1,2,3,4
#91      88      92      166

infraspecific_name_missing_count	isolate_missing_count	biosample_missing_count
0	0	0
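
The prefix tallies shown in the comments above come from splitting each infraspecific_name value at "=" and counting the part before it. A minimal awk sketch of that idea on a few made-up values (the strain and cultivar names are placeholders chosen for illustration):

```shell
TALLY=$(
    printf '%s\n' 'strain=K-12' 'na' 'cultivar=Nipponbare' 'strain=PAO1' 'ecotype=Col-0' |
        # Field 1 is the text before "="; values without "=" (such as na) stay whole
        awk -F'=' '{ count[$1]++ } END { for (k in count) print count[k] "\t" k }' |
        sort -k2,2
)
echo "$TALLY"
```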

Reference genomes

cd ~/Scripts/nwr/docs/

nwr member Bacteria Archaea -r family |
    grep -v -i "Candidatus " |
    grep -v -i "candidate " |
    grep -v " sp." |
    grep -v " spp." |
    sed '1d' |
    sort -n -k1,1 \
    > family.list.tsv

wc -l family.list.tsv
#707 family.list.tsv

FAMILY=$(
    cat family.list.tsv |
        cut -f 1 |
        tr "\n" "," |
        sed 's/,$//'
)

echo "
.headers ON
    SELECT
        *
    FROM ar
    WHERE 1=1
        AND family_id IN ($FAMILY)
        AND refseq_category IN ('reference genome')
    " |
    sqlite3 -tabs ~/.nwr/ar_refseq.sqlite \
    > reference.tsv

cat reference.tsv |
    tva select -H -f organism_name,species,genus,ftp_path,assembly_level \
    > raw.tsv

cat raw.tsv |
    grep -v '^#' |
    rgr dedup stdin |
    perl ~/Scripts/withncbi/taxon/abbr_name.pl -c "1,2,3" -s '\t' -m 3 --shortsub |
    (echo -e '#name\tftp_path\torganism\tassembly_level' && cat ) |
    perl -nl -a -F"," -e '
        BEGIN{my %seen};
        /^#/ and print and next;
        /^organism_name/i and next;
        $seen{$F[3]}++; # ftp_path
        $seen{$F[3]} > 1 and next;
        $seen{$F[5]}++; # abbr_name
        $seen{$F[5]} > 1 and next;
        printf qq{%s\t%s\t%s\t%s\n}, $F[5], $F[3], $F[1], $F[4];
        ' |
    tva keep-header -- sort -k3,3 -k1,1 \
    > Bacteria.assembly.tsv

File format: .assembly.tsv

A TAB-delimited file for downloading assembly files.

Col	Type	Description
1	string	#name: species + infraspecific_name + assembly_accession
2	string	ftp_path
3	string	biosample
4	string	species
5	string	assembly_level
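
Before using an .assembly.tsv file, it can help to confirm that every data row has the five columns listed above. The following is a sketch of one possible check; the file contents, the accession-like name, and the URL are hypothetical placeholders, and the third row is deliberately malformed:

```shell
TMP=$(mktemp)

# Header line plus one well-formed and one malformed row (placeholder values)
printf '#name\tftp_path\tbiosample\tspecies\tassembly_level\n' > "$TMP"
printf 'S_cap_X\thttps://example.org/hypothetical/path\tSAMN00000000\tStaphylococcus capitis\tComplete Genome\n' >> "$TMP"
printf 'broken_row\tonly_two_fields\n' >> "$TMP"

# Report line numbers of non-header rows that do not have exactly 5 fields
BAD=$(awk -F'\t' '!/^#/ && NF != 5 { print NR }' "$TMP")
echo "rows with a wrong field count: ${BAD:-none}"

rm -f "$TMP"
```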

download

Behavior:

Downloads the latest releases of taxdump and assembly reports from NCBI.
Automatically verifies MD5 checksum for taxdump.
Extracts taxdump.tar.gz to the NWR directory.
Skips downloading if files already exist.

Manual Download:

You can also download the files manually:

mkdir -p ~/.nwr

# taxdump
wget -N -P ~/.nwr https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
wget -N -P ~/.nwr https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz.md5

# assembly reports
wget -N -P ~/.nwr https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt
wget -N -P ~/.nwr https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt

# with aria2
cat <<EOF > download.txt
https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz.md5
https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt
https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt

EOF

aria2c -x 4 -s 2 -c -d ~/.nwr -i download.txt
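
After a manual download, the checksum that nwr download verifies automatically can be checked by hand. The sketch below assumes GNU md5sum and that the .md5 file uses md5sum's usual "hash, two spaces, file name" format; it works on stand-in files in a temporary directory rather than on the real archive:

```shell
WORK=$(mktemp -d)

# Stand-in for the downloaded archive (not a real taxdump)
printf 'stand-in for taxdump.tar.gz\n' > "$WORK/taxdump.tar.gz"

# Stand-in for the checksum file, written in md5sum's own format
(cd "$WORK" && md5sum taxdump.tar.gz > taxdump.tar.gz.md5)

# md5sum -c recomputes the hash and compares it with the recorded one
if (cd "$WORK" && md5sum -c taxdump.tar.gz.md5 > /dev/null 2>&1); then
    STATUS="ok"
else
    STATUS="mismatch"
fi
echo "checksum: $STATUS"

# A modified archive no longer matches the recorded hash
printf 'corrupted\n' >> "$WORK/taxdump.tar.gz"
if (cd "$WORK" && md5sum -c taxdump.tar.gz.md5 > /dev/null 2>&1); then
    STATUS_AFTER="ok"
else
    STATUS_AFTER="mismatch"
fi
echo "after modification: $STATUS_AFTER"

rm -rf "$WORK"
```

With the real files, the equivalent check would be run from inside ~/.nwr, since md5sum -c resolves file names relative to the current directory.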

Examples:

Download with default settings: nwr download
Use a different FTP host: nwr download --host ftp.ncbi.nih.gov:21
Custom paths: nwr download --tx /pub/taxonomy --ar /genomes/ASSEMBLY_REPORTS

txdb

Behavior:

Initializes the taxonomy database from taxdump.tar.gz.
Creates a SQLite database at ~/.nwr/taxonomy.sqlite.
Loads data from division.dmp, names.dmp, and nodes.dmp.
Creates indexes for efficient querying.

Database Location:

~/.nwr/taxonomy.sqlite

The DDL:

DROP TABLE IF EXISTS division;
DROP TABLE IF EXISTS node;
DROP TABLE IF EXISTS name;

CREATE TABLE IF NOT EXISTS division (
    id       INTEGER      NOT NULL
                          PRIMARY KEY,
    division VARCHAR (50) NOT NULL
);

CREATE TABLE IF NOT EXISTS node (
    tax_id        INTEGER      NOT NULL
                               PRIMARY KEY,
    parent_tax_id INTEGER,
    rank          VARCHAR (25) NOT NULL,
    division_id   INTEGER      NOT NULL,
    comment       TEXT,
    FOREIGN KEY (
        division_id
    )
    REFERENCES division (id)
);

CREATE TABLE IF NOT EXISTS name (
    id         INTEGER      NOT NULL
                            PRIMARY KEY,
    tax_id     INTEGER      NOT NULL,
    name       VARCHAR (50) NOT NULL,
    name_class VARCHAR (50) NOT NULL
);

Query the database:

echo "
    SELECT sql
    FROM sqlite_master
    WHERE type='table'
    ORDER BY name;
    " |
    sqlite3 -tabs ~/.nwr/taxonomy.sqlite

Examples:

Initialize the taxonomy database: nwr txdb
Use a custom directory: nwr txdb --dir /path/to/nwr

ardb

Behavior:

Initializes the assembly database from assembly summary files.
Creates SQLite databases at ~/.nwr/ar_refseq.sqlite and ~/.nwr/ar_genbank.sqlite.
Loads data from assembly_summary_refseq.txt or assembly_summary_genbank.txt.
Appends taxonomic lineage information (species, genus, family).
Filters out unqualified strains (uncultured, unidentified, etc.).

Database Location:

~/.nwr/ar_refseq.sqlite
~/.nwr/ar_genbank.sqlite

Input Columns:

assembly_summary_*.txt have 23 tab-delimited columns.
Fields with numbers are used in the database.

Index	Field	DB column
0	assembly_accession	6
1	bioproject	4
2	biosample	5
3	wgs_master	-
4	refseq_category	7
5	taxid AS tax_id	1
6	species_taxid	-
7	organism_name	2
8	infraspecific_name	3
9	isolate	-
10	version_status	-
11	assembly_level	8
12	release_type	-
13	genome_rep	9
14	seq_rel_date	10
15	asm_name	11
16	submitter	-
17	gbrs_paired_asm	12
18	paired_asm_comp	-
19	ftp_path	13
20	excluded_from_refseq	-
21	relation_to_type_material	-
22	asm_not_live_date	-

Appended Columns:

14  species
15  species_id
16  genus
17  genus_id
18  family
19  family_id

Filtered Strains:

Unqualified strains whose organism_name matches any of the following regexes are removed:

\bCandidatus\b
\bcandidate\b
\buncultured\b
\bunidentified\b
\bbacterium\b
\barchaeon\b
\bmetagenome\b
virus\b
phage\b
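
To see what these patterns catch, they can be tried against a few names with grep. This is only a sketch: it assumes GNU grep (for \b in extended regexes) and case-sensitive matching, which may differ from what nwr does internally, and the organism names are illustrative inputs.

```shell
PATTERN='\bCandidatus\b|\bcandidate\b|\buncultured\b|\bunidentified\b|\bbacterium\b|\barchaeon\b|\bmetagenome\b|virus\b|phage\b'

# Keep only the names that match none of the patterns
KEPT=$(
    printf '%s\n' \
        'Escherichia coli str. K-12 substr. MG1655' \
        'Candidatus Pelagibacter ubique HTCC1062' \
        'uncultured bacterium' \
        'Escherichia phage T4' \
        'Tobacco mosaic virus' \
        'marine metagenome' \
        'Pseudomonas aeruginosa PAO1' |
        grep -v -E "$PATTERN"
)
echo "$KEPT"
```

Note that virus\b and phage\b have no leading word boundary, so they also match names where the word is a suffix, such as a name ending in "coronavirus".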

Requirements:

Strains with an assembly_level of Scaffold or Contig should have a genome_rep of Full.
Requires SQLite version 3.34 or above.

Query the database:

echo "
    SELECT
        COUNT(*)
    FROM ar
    WHERE 1=1
        AND genus IN ('Pseudomonas')
        AND assembly_level IN ('Complete Genome', 'Chromosome')
    " |
    sqlite3 -tabs ~/.nwr/ar_refseq.sqlite

The DDL:

DROP TABLE IF EXISTS ar;

CREATE TABLE IF NOT EXISTS ar (
    tax_id             INTEGER,
    organism_name      VARCHAR (200),
    infraspecific_name VARCHAR (200),
    bioproject         VARCHAR (50),
    biosample          VARCHAR (50),
    assembly_accession VARCHAR (50),
    refseq_category    VARCHAR (50),
    assembly_level     VARCHAR (50),
    genome_rep         VARCHAR (50),
    seq_rel_date       DATE,
    asm_name           VARCHAR (200),
    gbrs_paired_asm    VARCHAR (200),
    ftp_path           VARCHAR (200),
    species            VARCHAR (50),
    species_id         INTEGER,
    genus              VARCHAR (50),
    genus_id           INTEGER,
    family             VARCHAR (50),
    family_id          INTEGER
);

Examples:

Initialize the RefSeq assembly database: nwr ardb
Initialize the GenBank assembly database: nwr ardb --genbank
Use a custom directory: nwr ardb --dir /path/to/nwr

append

Behavior:

Retrieves taxonomic information from the local taxonomy database.
Appends scientific names and/or taxon IDs of specified ranks to each row.
If --rank is not specified, appends the scientific name of the input taxon.
Header lines (starting with “#”) are processed to append appropriate column names.

Valid ranks:

species, genus, family, order, class, phylum, kingdom
Other ranks (e.g., clade, no rank) may work but are not officially supported.

Input:

Accepts one or more TSV files as input.
Reads from standard input if “stdin” is specified.
The input file should contain taxon IDs or scientific names in a specific column.

Output:

Tab-separated values with appended rank columns.
By default, output is written to standard output.
Use --outfile to write to a file instead.

Examples:

Append scientific names for specified ranks: nwr append input.tsv --rank genus --rank family
Append both names and IDs: nwr append input.tsv --rank species --id
Read from stdin, append genus information: cat input.tsv | nwr append stdin --rank genus
Specify column and output file: nwr append input.tsv -c 2 --rank kingdom -o output.tsv

common

Behavior:

Outputs the common tree of terms in Newick format.
Finds the most recent common ancestor of all input terms.
Constructs a phylogenetic tree showing the relationship.
Ancestral terms can be Taxonomy IDs or scientific names.

Input:

Accepts two or more Taxonomy IDs or scientific names.
Terms are provided as positional arguments.

Output:

Newick format tree string.
Tree includes scientific names as node labels.
By default, output is written to standard output.
Use --outfile to write to a file instead.

Examples:

Find common ancestor of two species: nwr common 9606 10090
Find common ancestor of multiple taxa: nwr common "Homo sapiens" "Mus musculus" "Danio rerio"
Write to file: nwr common 9606 10090 -o tree.nwk
Use taxonomy IDs: nwr common 9605 10090 10116

info

Behavior:

Retrieves taxonomic information from the local taxonomy database.
Accepts Taxonomy IDs or scientific names as input.
By default, outputs detailed information in a custom format.
Use --tsv to output results as tab-separated values.

Input:

Accepts one or more Taxonomy IDs or scientific names.
Terms can be provided as positional arguments.

Output:

Default format shows detailed taxonomic information.
TSV output includes: tax_id, sci_name, rank, division.
By default, output is written to standard output.
Use --outfile to write to a file instead.

Examples:

Get information for a single taxon: nwr info 9606
Get information for multiple taxa: nwr info 9606 10090 10116
Output as TSV: nwr info Homo_sapiens --tsv
Use scientific names: nwr info "Homo sapiens" "Mus musculus"

lineage

Behavior:

Retrieves the lineage of a taxon from root to the specified term.
Returns the full taxonomic hierarchy including all ranks.
By default, outputs rank, scientific name, and taxonomy ID for each level.

Input:

Accepts a single Taxonomy ID or scientific name.
Use --tsv for tab-separated output format.

Output:

Default output: rank, scientific_name, tax_id (tab-separated)
TSV output: rank, scientific_name, tax_id (tab-separated)
By default, output is written to standard output.
Use --outfile to write to a file instead.

Examples:

Get lineage for a species: nwr lineage 9606
Get lineage using scientific name: nwr lineage "Homo sapiens"
Output as TSV: nwr lineage 9606 --tsv
Write to file: nwr lineage 9606 -o lineage.txt
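
Conceptually, a lineage is found by following parent_tax_id links in the node table until the root is reached, then printing the path root-first. The sketch below illustrates that walk with awk on a toy table. It is not nwr's implementation (nwr reads from the SQLite database), and all IDs and names in NODES are hypothetical.

```shell
# Toy node table: tax_id, parent_tax_id, rank, name (hypothetical rows)
NODES=$(
    printf '1\t1\tno rank\troot\n'
    printf '10\t1\tfamily\tFamilyA\n'
    printf '100\t10\tgenus\tGenusA\n'
    printf '1000\t100\tspecies\tGenusA speciesA\n'
)

# lineage ID: print rank, name and tax_id from the root down to ID
lineage() {
    echo "$NODES" | awk -F'\t' -v id="$1" '
        { parent[$1] = $2; rank[$1] = $3; name[$1] = $4 }
        END {
            # Climb from the term towards the root, recording the path
            n = 0
            while (id in parent) {
                path[++n] = id
                if (parent[id] == id) break   # the root is its own parent
                id = parent[id]
            }
            # Print in root-first order
            for (i = n; i >= 1; i--)
                print rank[path[i]] "\t" name[path[i]] "\t" path[i]
        }'
}

lineage 1000
```

The root check mirrors the NCBI convention that tax_id 1 lists itself as its parent, which would otherwise make the loop run forever.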

member

Behavior:

Lists members (of certain ranks) under ancestral term(s).
Retrieves taxonomic information from the local taxonomy database.
Ancestral terms can be Taxonomy IDs or scientific names.
By default, excludes “Environmental samples” division.
The output file is in the same TSV format as nwr info --tsv.

Valid ranks:

species, genus, family, order, class, phylum, kingdom
Other ranks (e.g., clade, no rank) may work but are not officially supported.

Input:

Accepts one or more ancestral Taxonomy IDs or scientific names.
Optionally filter results by rank using --rank.

Output:

TSV output includes: tax_id, sci_name, rank, division.
By default, output is written to standard output.
Use --outfile to write to a file instead.

Examples:

List all members under a genus: nwr member 9605
List only species under a genus: nwr member Homo --rank species
Include environmental samples: nwr member 4751 --env
Multiple ancestors with rank filter: nwr member Homo Pan --rank genus

restrict

Behavior:

Restricts taxonomy terms to descendants of specified ancestor(s).
Terms can be Taxonomy IDs or scientific names.
Use --exclude to invert the filter (exclude matching lines).
Header lines (starting with “#”) are always kept in the output.

Input:

Accepts one or more TSV files via --file option.
Reads from standard input by default.
The input file should contain taxon IDs or scientific names in a specific column.

Output:

Filtered tab-separated values.
By default, output is written to standard output.
Use --outfile to write to a file instead.

Examples:

Restrict to descendants of a specific genus: nwr restrict "Homo" --file input.tsv
Restrict using taxonomy ID: nwr restrict 9605 --file input.tsv
Exclude descendants (inverse filter): nwr restrict "Bacteria" --file input.tsv --exclude
Specify column and output file: nwr restrict "Mammalia" --file input.tsv -c 2 -o output.tsv
Multiple ancestors: nwr restrict "Homo" "Pan" --file input.tsv

template

Behavior:

Creates directories, data files, and scripts for phylogenomic research.
Generates materials for ASSEMBLY, BioSample, MinHash, Count, and Protein steps.
Uses Tera templates to generate Bash scripts.

Input File Format:

.assembly.tsv is a TAB-delimited file to guide downloading and processing:

Col	Type	Description
1	string	#name: species + infraspecific_name + assembly_accession
2	string	ftp_path
3	string	biosample
4	string	species
5	string	assembly_level

Generated Materials:

--ass: ASSEMBLY/
- One TSV file: url.tsv
- Five Bash scripts: rsync.sh, check.sh, n50.sh, collect.sh, finish.sh
--bs: BioSample/
- One TSV file: sample.tsv
- Two Bash scripts: download.sh, collect.sh
--mh: MinHash/
- One TSV file: species.tsv
- Five Bash scripts: compute.sh, species.sh, abnormal.sh, nr.sh, dist.sh
--count: Count/
- One TSV file: species.tsv
- Three Bash scripts: strains.sh, rank.sh, lineage.sh
--pro: Protein/
- One TSV file: species.tsv
- Bash scripts: collect.sh, info.sh, count.sh

Examples:

Generate ASSEMBLY materials: nwr template input.assembly.tsv --ass
Generate all materials: nwr template input.assembly.tsv --ass --bs --mh --count --pro
Specify output directory: nwr template input.assembly.tsv --ass -o output_dir
Use parallel processing: nwr template input.assembly.tsv --mh --parallel 16

abbr

Behavior:

Abbreviates strain scientific names to unique short identifiers.
Generates abbreviations for genus, species, and strain parts.
Handles special cases like Candidatus and subspecies names.
Ensures uniqueness of abbreviations across all input names.

Input:

Accepts a TSV/CSV file or standard input.
Each row should contain strain, species, and genus names in separate columns.
Use --column to specify which columns contain these names (default: 1,2,3).
Common column patterns:
- 1,2,3 - strain in column 1, species in 2, genus in 3
- 1,1,2 - no strain: strain and species both in column 1, genus in 2
- 2,2,3 - don’t need strain part: strain and species in 2, genus in 3
- 1,1,1 - only strain: all three in column 1

Output:

Original line followed by a tab and the generated abbreviation.
Abbreviation format:
- Normal mode: Genus_Species_Strain (e.g., H_sapiens_sapiens)
- Tight mode (--tight): GenusSpecies_Strain (e.g., Hsapiens_sapiens)
Special handling:
- Candidatus is abbreviated to C
- Non-alphanumeric characters are replaced with underscores
- Consecutive underscores are collapsed
- Leading and trailing underscores are removed

Examples:

Basic usage (species in column 1, genus in column 2): echo -e 'Homo sapiens,Homo\nHomo erectus,Homo' | nwr abbr -s ',' -c "1,1,2"
Tight mode (no underscore between genus and species): echo -e 'Homo sapiens,Homo\nHomo erectus,Homo' | nwr abbr -s ',' -c "1,1,2" --tight
Clean subspecies names: echo 'Legionella pneumophila subsp. pneumophila' | nwr abbr --shortsub
Process a file: nwr abbr names.tsv -o abbreviated.tsv
Custom separator and columns: nwr abbr data.csv -s ',' -c "1,2,3" -o output.tsv
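
The character cleanup listed under "Special handling" can be approximated with sed. The sketch below covers only those three cleanup rules (replace, collapse, trim), not the abbreviation or uniqueness logic; clean_part is a hypothetical helper, not an nwr command, and the inputs are made-up strain strings.

```shell
clean_part() {
    printf '%s\n' "$1" |
        # 1) non-alphanumeric characters become underscores
        # 2) runs of underscores collapse to one
        # 3) leading and trailing underscores are dropped
        sed -E 's/[^A-Za-z0-9]/_/g; s/_+/_/g; s/^_//; s/_$//'
}

A=$(clean_part 'K-12 substr. MG1655')
B=$(clean_part ' (ATCC 27840) ')
echo "$A"   # K_12_substr_MG1655
echo "$B"   # ATCC_27840
```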

kb

Behavior:

Prints embedded documentation and knowledge bases.
Extracts built-in files to stdout or a specified output directory.

Available Documents:

bac120 - 120 bacterial marker genes (tar.gz archive)
ar53 - 53 archaeal marker genes (tar.gz archive)

Output:

Archive files (bac120, ar53) are extracted to the specified directory.
By default, output is written to standard output.
Use --outfile to specify an output file or directory.

Examples:

Extract bacterial marker genes: nwr kb bac120 -o marker_genes/
Extract archaeal marker genes: nwr kb ar53 -o marker_genes/

seqdb

Behavior:

Initializes the sequence database for protein sequence information.
Creates a SQLite database at ./seq.sqlite.
Loads data from various TSV files into appropriate tables.
Supports loading strains, sizes, clusters, annotations, and assembly sequences.

Database Location:

./seq.sqlite

The DDL:

CREATE TABLE rank (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name VARCHAR NOT NULL UNIQUE
);
-- assembly
CREATE TABLE asm (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name VARCHAR NOT NULL UNIQUE,
    rank_id INTEGER NOT NULL,
    FOREIGN KEY (rank_id) REFERENCES rank(id)
);
-- sequence
CREATE TABLE seq (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name VARCHAR NOT NULL UNIQUE,
    size INTEGER,
    anno TEXT
);
-- representative
CREATE TABLE rep (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name VARCHAR NOT NULL UNIQUE,
    f1 TEXT,
    f2 TEXT,
    f3 TEXT,
    f4 TEXT,
    f5 TEXT,
    f6 TEXT,
    f7 TEXT,
    f8 TEXT
);
-- Junction table to associate rep with seq
CREATE TABLE rep_seq (
    rep_id INTEGER NOT NULL,
    seq_id INTEGER NOT NULL,
    PRIMARY KEY (rep_id, seq_id),
    FOREIGN KEY (rep_id) REFERENCES rep(id),
    FOREIGN KEY (seq_id) REFERENCES seq(id)
);
-- Junction table to associate asm with seq
CREATE TABLE asm_seq (
    asm_id INTEGER NOT NULL,
    seq_id INTEGER NOT NULL,
    PRIMARY KEY (asm_id, seq_id),
    FOREIGN KEY (asm_id) REFERENCES asm(id),
    FOREIGN KEY (seq_id) REFERENCES seq(id)
);
-- Regular indices
CREATE INDEX rep_idx_f1 ON rep(f1);
CREATE INDEX rep_idx_f2 ON rep(f2);
CREATE INDEX rep_idx_f3 ON rep(f3);
CREATE INDEX rep_idx_f4 ON rep(f4);
CREATE INDEX rep_idx_f5 ON rep(f5);
CREATE INDEX rep_idx_f6 ON rep(f6);
CREATE INDEX rep_idx_f7 ON rep(f7);
CREATE INDEX rep_idx_f8 ON rep(f8);
-- Case-insensitive indices for `like`
CREATE INDEX seq_idx_anno ON seq(anno COLLATE NOCASE);

Notes:

If --strain is called without specifying a path, it will load the default file under --dir.
--rep requires a key-value pair in the format --rep f1=file.
Valid fields for --rep are: f1, f2, f3, f4, f5, f6, f7, f8.

Examples:

Initialize the database: nwr seqdb --init
Load strain information: nwr seqdb --strain strains.tsv
Load multiple data types: nwr seqdb --strain --size --clust
Load features into rep table: nwr seqdb --rep f1=features.tsv

taxonomy.sqlite

Tables

Name	Columns	Type
division	2	table
node	5	table
name	4	table

Relations

Generated by tbls

ar_refseq.sqlite

Tables

Name	Columns	Comment	Type
ar	19		table

Relations

Generated by tbls
