nwr - NCBI taxonomy/assembly WRangler
Install
Current release: 0.9.0
cargo install nwr
# or
cargo install --path . --force # --offline
Or install the pre-compiled binary via the cross-platform package manager cbp (supports older Linux systems with glibc 2.17+):
cbp install nwr
You can also download the pre-compiled binaries from the Releases page.
nwr help
$ nwr help
`nwr` is a command line **N**CBI taxonomy and assembly **WR**angler.
Usage: nwr [COMMAND]
Commands:
download Download the latest releases of `taxdump` and assembly reports
txdb Init the taxonomy database
ardb Init the assembly database
info Information of Taxonomy ID(s) or scientific name(s)
lineage Output the lineage of the term
member List members (of certain ranks) under ancestral term(s)
append Append fields of higher ranks to a TSV file
restrict Restrict taxonomy terms to ancestral descendants
common Output the common tree of terms
template Create dirs, data and scripts for a phylogenomic research
kb Prints docs (knowledge bases)
seqdb Init the seq database
help Print this message or the help of the given subcommand(s)
Options:
-h, --help Print help
-V, --version Print version
Subcommand groups:
* Database
    * download / txdb / ardb
* Taxonomy
    * info / lineage / member / append / restrict / common
* Assembly
    * template / kb / seqdb
Examples
Initiate local databases
`nwr download` was last run on Sun Apr 5 15:59:45 UTC 2026 (as reported by `date --utc`).
The database doesn’t need frequent updates; in our lab, we update it approximately once a year. For reproducibility, database files for the above date are provided on the Releases page.
cbp install nwr
nwr download
nwr txdb
nwr ardb
nwr ardb --genbank
cd $HOME/.nwr
tar cvfz ncbi.$(date +"%Y%m%d").tar.gz \
taxdump.tar.gz \
taxdump.tar.gz.md5 \
assembly_summary_genbank.txt \
assembly_summary_refseq.txt
rm \
*.dmp \
taxdump.tar.gz \
taxdump.tar.gz.md5 \
assembly_summary_genbank.txt \
assembly_summary_refseq.txt
Usage of each command
For practical uses of nwr and other awesome companions, follow this page.
# nwr download
# nwr txdb
nwr info "Homo sapiens" 4932
nwr lineage "Homo sapiens"
nwr lineage 4932
nwr restrict "Vertebrata" -c 2 -f tests/nwr/taxon.tsv
##sci_name tax_id
#Human 9606
nwr member "Homo"
nwr append tests/nwr/taxon.tsv -c 2 -r species -r family --id
# nwr ardb
# nwr ardb --genbank
nwr common "Escherichia coli" 4932 Drosophila_melanogaster 9606 Mus_musculus
seqdb
export SPECIES="$HOME/data/Archaea/Protein/Sulfolobus_acidocaldarius"
export SEQ_DIR="${SPECIES}" # assumption: seq.sqlite is created in the directory passed to -d
cargo run --bin nwr seqdb -d ${SPECIES} --init --strain
cargo run --bin nwr seqdb -d ${SPECIES} \
--size <(
hnsm size ${SPECIES}/pro.fa.gz
) \
--clust
cargo run --bin nwr seqdb -d ${SPECIES} \
--anno <(
gzip -dcf "${SPECIES}"/anno.tsv.gz
) \
--asmseq <(
gzip -dcf "${SPECIES}"/asmseq.tsv.gz
)
cargo run --bin nwr seqdb -d ${SPECIES} --rep f1="${SPECIES}"/fam88_cluster.tsv
echo "
SELECT
*
FROM asm
WHERE 1=1
" |
sqlite3 -tabs ${SEQ_DIR}/seq.sqlite
echo "
SELECT
COUNT(distinct asm_seq.asm_id)
FROM asm_seq
WHERE 1=1
" |
sqlite3 -tabs ${SEQ_DIR}/seq.sqlite
echo "
.header ON
SELECT
'species' AS species,
COUNT(distinct asm_seq.asm_id) AS strain,
COUNT(*) AS total,
COUNT(distinct rep_seq.seq_id) AS dedup,
COUNT(distinct rep_seq.rep_id) AS rep
FROM asm_seq
JOIN rep_seq ON asm_seq.seq_id = rep_seq.seq_id
WHERE 1=1
" |
sqlite3 -tabs ${SEQ_DIR}/seq.sqlite
NCBI Assembly Reports
Preparations
cbp install nwr
cbp install sqlite3
cbp install tva
Requires SQLite version 3.34 or above. The sqlite3 that ships with macOS does not work.
NCBI Taxonomy Statistics
curl -L "https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=statistics&?&unclassified=hide&uncultured=hide" |
tva from html -q 'table[bgcolor="#CCCCFF"] table[bgcolor="#FFFFFF"] tr td text{}' |
grep '\S' |
paste -d $'\t' - - - - - - |
tva to md --right 2-6
| Ranks: | higher taxa | genus | species | lower taxa | total |
|---|---|---|---|---|---|
| Archaea | 0 | 340 | 1,200 | 2,290 | 2,290 |
| Bacteria | 0 | 5,782 | 33,615 | 90,218 | 90,218 |
| Eukaryota | 0 | 104,261 | 631,437 | 804,447 | 804,447 |
| Fungi | 0 | 8,095 | 74,507 | 88,460 | 88,460 |
| Metazoa | 0 | 75,546 | 340,416 | 453,240 | 453,240 |
| Viridiplantae | 0 | 16,338 | 198,532 | 237,280 | 237,280 |
| Viruses | 36 | 3,493 | 14,612 | 200,795 | 201,328 |
| All taxa | 54 | 113,878 | 700,762 | 1,097,758 | 1,118,224 |
NCBI ASSEMBLY
- assembly_level
for C in refseq genbank; do
cat ~/.nwr/assembly_summary_${C}.txt |
sed '1d' |
tva stats -H -g assembly_level,genome_rep --count |
tva keep-header -- sort |
tva to md --fmt
echo -e "\nTable: ${C}\n\n"
done
| assembly_level | genome_rep | count |
|---|---|---|
| Chromosome | Full | 8,629 |
| Chromosome | Partial | 355 |
| Complete Genome | Full | 76,533 |
| Complete Genome | Partial | 7 |
| Contig | Full | 280,107 |
| Contig | Partial | 30 |
| Scaffold | Full | 158,032 |
Table: refseq
| assembly_level | genome_rep | count |
|---|---|---|
| Chromosome | Full | 44,020 |
| Chromosome | Partial | 1,196 |
| Complete Genome | Full | 309,100 |
| Complete Genome | Partial | 131 |
| Contig | Full | 2,549,556 |
| Contig | Partial | 933 |
| Scaffold | Full | 515,294 |
| Scaffold | Partial | 363 |
Table: genbank
Example 1: count qualified assemblies of Eukaryote groups
ARRAY=(
# Animals - Metazoa - kingdom
'Flatworms::Platyhelminthes' # phylum
'Roundworms::Nematoda'
'Insects::Hexapoda' # subphylum
'Reptiles::Testudines' # order
'Reptiles::Lepidosauria'
'Reptiles::Crocodylia'
'Fishes::Chondrichthyes' # class
'Fishes::Dipnoi'
'Fishes::Actinopterygii'
'Fishes::Hyperotreti'
'Fishes::Hyperoartia'
'Fishes::Coelacanthimorpha'
'Mammals::Mammalia'
'Birds::Aves'
'Amphibians::Amphibia'
# Fungi - kingdom
'Ascomycetes::Ascomycota' # phylum
'Basidiomycetes::Basidiomycota'
# Plants - Viridiplantae
'Green Plants::Viridiplantae'
'Land Plants::Embryophyta'
# Protists
'Apicomplexans::Apicomplexa'
'Kinetoplasts::Kinetoplastida'
)
echo -e "GROUP_NAME\tSCI_NAME\tComplete Genome\tChromosome\tScaffold\tContig" \
> groups.tsv
for item in "${ARRAY[@]}" ; do
GROUP_NAME="${item%%::*}"
SCI_NAME="${item##*::}"
GENUS=$(
nwr member ${SCI_NAME} -r genus |
grep -v -i "Candidatus " |
grep -v -i "candidate " |
sed '1d' |
cut -f 1 |
tr "\n" "," |
sed 's/,$/\)/' |
sed 's/^/\(/'
)
printf "$GROUP_NAME\t$SCI_NAME\t"
for L in 'Complete Genome' 'Chromosome' 'Scaffold' 'Contig'; do
echo "
SELECT
COUNT(*)
FROM ar
WHERE 1=1
AND genus_id IN $GENUS
AND assembly_level IN ('$L')
" |
sqlite3 -tabs ~/.nwr/ar_refseq.sqlite
done |
tr "\n" "\t" |
sed 's/\t$//'
echo;
done \
>> groups.tsv
cat groups.tsv |
tva to md --num
| GROUP_NAME | SCI_NAME | Complete Genome | Chromosome | Scaffold | Contig |
|---|---|---|---|---|---|
| Flatworms | Platyhelminthes | 0 | 2 | 5 | 0 |
| Roundworms | Nematoda | 1 | 4 | 3 | 0 |
| Insects | Hexapoda | 1 | 208 | 105 | 30 |
| Reptiles | Testudines | 0 | 17 | 1 | 1 |
| Reptiles | Lepidosauria | 0 | 25 | 9 | 1 |
| Reptiles | Crocodylia | 0 | 1 | 6 | 0 |
| Fishes | Chondrichthyes | 0 | 26 | 1 | 0 |
| Fishes | Dipnoi | 0 | 1 | 0 | 0 |
| Fishes | Actinopterygii | 1 | 225 | 39 | 9 |
| Fishes | Hyperotreti | 0 | 1 | 0 | 0 |
| Fishes | Hyperoartia | 0 | 4 | 0 | 0 |
| Fishes | Coelacanthimorpha | 0 | 1 | 0 | 0 |
| Mammals | Mammalia | 4 | 173 | 89 | 7 |
| Birds | Aves | 1 | 106 | 54 | 5 |
| Amphibians | Amphibia | 0 | 29 | 3 | 1 |
| Ascomycetes | Ascomycota | 47 | 49 | 276 | 162 |
| Basidiomycetes | Basidiomycota | 27 | 18 | 48 | 32 |
| Green Plants | Viridiplantae | 9 | 155 | 58 | 9 |
| Land Plants | Embryophyta | 7 | 152 | 53 | 8 |
| Apicomplexans | Apicomplexa | 2 | 25 | 39 | 3 |
| Kinetoplasts | Kinetoplastida | 1 | 13 | 7 | 3 |
Table: refseq - Eukaryotes
| GROUP_NAME | SCI_NAME | Complete Genome | Chromosome | Scaffold | Contig |
|---|---|---|---|---|---|
| Flatworms | Platyhelminthes | 0 | 47 | 89 | 20 |
| Roundworms | Nematoda | 4 | 157 | 348 | 218 |
| Insects | Hexapoda | 21 | 3513 | 3389 | 2573 |
| Reptiles | Testudines | 1 | 59 | 50 | 10 |
| Reptiles | Lepidosauria | 0 | 117 | 281 | 30 |
| Reptiles | Crocodylia | 0 | 5 | 14 | 0 |
| Fishes | Chondrichthyes | 0 | 56 | 60 | 6 |
| Fishes | Dipnoi | 0 | 4 | 0 | 2 |
| Fishes | Actinopterygii | 31 | 1111 | 2107 | 320 |
| Fishes | Hyperotreti | 0 | 4 | 3 | 0 |
| Fishes | Hyperoartia | 0 | 7 | 14 | 4 |
| Fishes | Coelacanthimorpha | 0 | 1 | 3 | 0 |
| Mammals | Mammalia | 25 | 1471 | 2280 | 973 |
| Birds | Aves | 3 | 447 | 2191 | 330 |
| Amphibians | Amphibia | 0 | 93 | 186 | 12 |
| Ascomycetes | Ascomycota | 468 | 1312 | 10872 | 6713 |
| Basidiomycetes | Basidiomycota | 127 | 188 | 1746 | 1247 |
| Green Plants | Viridiplantae | 252 | 4203 | 2895 | 1261 |
| Land Plants | Embryophyta | 220 | 4132 | 2688 | 1024 |
| Apicomplexans | Apicomplexa | 20 | 132 | 199 | 89 |
| Kinetoplasts | Kinetoplastida | 16 | 72 | 119 | 104 |
Table: genbank - Eukaryotes
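The genus ID list for the `IN` clause in Example 1 is built with a `tr`/`sed` chain. The same step can be sketched as a small reusable function. This is only an illustration: `tsv_to_in_list` is a hypothetical name, and it assumes `nwr member`-style TSV input (one header line, tax_id in column 1).

```shell
# Hypothetical helper (not part of nwr): turn column 1 of a TSV with one
# header line into a parenthesized, comma-separated list for "IN (...)".
tsv_to_in_list() {
    local ids
    ids=$(
        sed '1d' |                                        # drop the header line
            grep -v -i -e 'Candidatus ' -e 'candidate ' | # same name filters as Example 1
            cut -f 1 |                                    # keep the tax_id column
            paste -s -d ',' -                             # join all IDs with commas
    )
    printf '(%s)\n' "${ids}"
}

# Illustrative input only; real input would come from `nwr member ... -r genus`.
printf '#tax_id\tsci_name\trank\tdivision\n9605\tHomo\tgenus\tPrimates\n9596\tPan\tgenus\tPrimates\n' |
    tsv_to_in_list
```

An input with no data rows yields `()`, which is shorter than three characters and can be detected the same way Example 2 does with `${#GENUS}`.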
Example 2: count qualified assemblies of Prokaryote groups
echo -e "GROUP_NAME\tComplete Genome\tChromosome\tScaffold\tContig" \
> groups.tsv
for item in Bacteria Archaea ; do
PHYLUM=$(
nwr member ${item} -r phylum |
grep -v -i "Candidatus " |
grep -v -i "candidate " |
sed '1d' |
cut -f 2 |
sort
)
echo -e "$item\t\t\t\t"
for P in $PHYLUM; do
GENUS=$(
nwr member ${P} -r genus |
grep -v -i "Candidatus " |
grep -v -i "candidate " |
sed '1d' |
cut -f 1 |
tr "\n" "," |
sed 's/,$/\)/' |
sed 's/^/\(/'
)
if [[ ${#GENUS} -lt 3 ]]; then
>&2 echo $P has no genera
continue
fi
printf "$P\t"
for L in 'Complete Genome' 'Chromosome' 'Scaffold' 'Contig'; do
echo "
SELECT
COUNT(*)
FROM ar
WHERE 1=1
AND genus_id IN $GENUS
AND assembly_level IN ('$L')
" |
sqlite3 -tabs ~/.nwr/ar_refseq.sqlite
done |
tr "\n" "\t" |
sed 's/\t$//'
echo;
done
done \
>> groups.tsv
cat groups.tsv |
tva to md --right 2-5
| GROUP_NAME | Complete Genome | Chromosome | Scaffold | Contig |
|---|---|---|---|---|
| Bacteria | ||||
| Abditibacteriota | 1 | 0 | 0 | 1 |
| Acidobacteriota | 47 | 11 | 38 | 67 |
| Actinomycetota | 6050 | 976 | 26124 | 20668 |
| Aquificota | 25 | 2 | 26 | 67 |
| Armatimonadota | 3 | 4 | 4 | 8 |
| Atribacterota | 3 | 0 | 1 | 2 |
| Bacillota | 14115 | 1602 | 42966 | 69650 |
| Bacteroidota | 1997 | 284 | 7928 | 10745 |
| Balneolota | 3 | 1 | 15 | 39 |
| Bdellovibrionota | 49 | 10 | 48 | 44 |
| Caldisericota | 1 | 0 | 9 | 2 |
| Calditrichota | 1 | 1 | 0 | 3 |
| Campylobacterota | 1482 | 116 | 2584 | 8148 |
| Chlamydiota | 303 | 90 | 54 | 193 |
| Chlorobiota | 16 | 1 | 9 | 36 |
| Chloroflexota | 54 | 1 | 63 | 109 |
| Chrysiogenota | 3 | 0 | 5 | 0 |
| Coprothermobacterota | 1 | 0 | 1 | 2 |
| Cyanobacteriota | 416 | 44 | 803 | 1331 |
| Deferribacterota | 9 | 0 | 9 | 22 |
| Deinococcota | 113 | 5 | 142 | 234 |
| Dictyoglomota | 7 | 0 | 6 | 1 |
| Elusimicrobiota | 4 | 0 | 0 | 1 |
| Fibrobacterota | 2 | 0 | 23 | 60 |
| Fidelibacterota | 1 | 0 | 0 | 0 |
| Fusobacteriota | 262 | 9 | 211 | 472 |
| Gemmatimonadota | 10 | 1 | 9 | 48 |
| Ignavibacteriota | 3 | 0 | 5 | 12 |
| Kiritimatiellota | 2 | 0 | 0 | 6 |
| Lentisphaerota | 2 | 0 | 1 | 23 |
| Minisyncoccota | 1 | 0 | 0 | 0 |
| Mycoplasmatota | 953 | 71 | 382 | 1135 |
| Myxococcota | 131 | 9 | 37 | 148 |
| Nitrospinota | 1 | 0 | 1 | 10 |
| Nitrospirota | 24 | 0 | 19 | 24 |
| Planctomycetota | 86 | 30 | 61 | 117 |
| Pseudomonadota | 33304 | 3597 | 71037 | 157832 |
| Rhodothermota | 19 | 3 | 41 | 99 |
| Spirochaetota | 467 | 284 | 373 | 1411 |
| Synergistota | 12 | 4 | 49 | 110 |
| Thermodesulfobacteriota | 186 | 12 | 279 | 487 |
| Thermodesulfobiota | 2 | 0 | 0 | 2 |
| Thermomicrobiota | 2 | 0 | 3 | 9 |
| Thermosulfidibacterota | 1 | 0 | 0 | 0 |
| Thermotogota | 61 | 1 | 105 | 99 |
| Verrucomicrobiota | 149 | 9 | 237 | 272 |
| Vulcanimicrobiota | 1 | 0 | 0 | 0 |
| Zhurongbacterota | 1 | 0 | 0 | 0 |
| Archaea | ||||
| Methanobacteriota | 523 | 21 | 547 | 1147 |
| Microcaldota | 0 | 0 | 0 | 0 |
| Nanobdellota | 1 | 0 | 0 | 0 |
| Nitrososphaerota | 21 | 3 | 12 | 26 |
| Promethearchaeota | 1 | 0 | 0 | 0 |
| Thermoplasmatota | 16 | 0 | 9 | 75 |
| Thermoproteota | 133 | 6 | 117 | 127 |
Table: refseq - Prokaryotes
| GROUP_NAME | Complete Genome | Chromosome | Scaffold | Contig |
|---|---|---|---|---|
| Bacteria | ||||
| Abditibacteriota | 1 | 1 | 5 | 11 |
| Acidobacteriota | 56 | 13 | 160 | 612 |
| Actinomycetota | 6371 | 845 | 33276 | 33974 |
| Aquificota | 22 | 2 | 82 | 172 |
| Armatimonadota | 4 | 1 | 30 | 57 |
| Atribacterota | 3 | 0 | 5 | 7 |
| Bacillota | 16847 | 1906 | 87271 | 465348 |
| Bacteroidota | 2143 | 314 | 17096 | 30441 |
| Balneolota | 13 | 5 | 43 | 96 |
| Bdellovibrionota | 53 | 10 | 147 | 223 |
| Caldisericota | 1 | 0 | 20 | 4 |
| Calditrichota | 1 | 1 | 7 | 42 |
| Campylobacterota | 2799 | 162 | 6074 | 157198 |
| Chlamydiota | 408 | 79 | 118 | 225 |
| Chlorobiota | 17 | 1 | 30 | 67 |
| Chloroflexota | 57 | 1 | 286 | 375 |
| Chrysiogenota | 3 | 0 | 2 | 0 |
| Coprothermobacterota | 1 | 0 | 14 | 10 |
| Cyanobacteriota | 468 | 81 | 1466 | 3874 |
| Deferribacterota | 7 | 0 | 520 | 266 |
| Deinococcota | 118 | 5 | 193 | 282 |
| Dictyoglomota | 7 | 0 | 15 | 5 |
| Elusimicrobiota | 4 | 0 | 1 | 45 |
| Fibrobacterota | 2 | 0 | 109 | 199 |
| Fidelibacterota | 1 | 0 | 0 | 0 |
| Fusobacteriota | 293 | 14 | 258 | 906 |
| Gemmatimonadota | 8 | 1 | 33 | 167 |
| Ignavibacteriota | 3 | 1 | 62 | 45 |
| Kiritimatiellota | 2 | 0 | 13 | 48 |
| Lentisphaerota | 2 | 0 | 12 | 55 |
| Minisyncoccota | 1 | 0 | 0 | 1 |
| Mycoplasmatota | 1132 | 262 | 447 | 1561 |
| Myxococcota | 137 | 10 | 78 | 351 |
| Nitrospinota | 1 | 0 | 13 | 67 |
| Nitrospirota | 35 | 5 | 307 | 456 |
| Planctomycetota | 105 | 33 | 172 | 699 |
| Pseudomonadota | 42025 | 4639 | 122719 | 1437144 |
| Rhodothermota | 20 | 3 | 52 | 260 |
| Spirochaetota | 578 | 713 | 677 | 2696 |
| Synergistota | 14 | 4 | 127 | 239 |
| Thermodesulfobacteriota | 189 | 11 | 687 | 1767 |
| Thermodesulfobiota | 2 | 0 | 5 | 6 |
| Thermomicrobiota | 2 | 0 | 8 | 34 |
| Thermosulfidibacterota | 1 | 0 | 1 | 3 |
| Thermotogota | 56 | 1 | 232 | 219 |
| Verrucomicrobiota | 164 | 11 | 1432 | 2010 |
| Vulcanimicrobiota | 1 | 0 | 0 | 0 |
| Zhurongbacterota | 1 | 0 | 0 | 0 |
| Archaea | ||||
| Methanobacteriota | 543 | 25 | 1296 | 2504 |
| Microcaldota | 0 | 0 | 0 | 0 |
| Nanobdellota | 2 | 0 | 0 | 1 |
| Nitrososphaerota | 43 | 22 | 200 | 653 |
| Promethearchaeota | 1 | 0 | 0 | 6 |
| Thermoplasmatota | 18 | 0 | 46 | 213 |
| Thermoproteota | 137 | 6 | 476 | 436 |
Table: genbank - Prokaryotes
Example 3: find accessions of a species
Staphylococcus capitis (Taxonomy ID: 29388)
nwr info "Staphylococcus capitis"
nwr member 29388
echo '
.headers ON
SELECT
organism_name,
species,
genus,
ftp_path,
assembly_level
FROM ar
WHERE 1=1
AND tax_id != species_id -- with strain ID
AND species_id IN (29388)
' |
sqlite3 -tabs ~/.nwr/ar_refseq.sqlite \
> Scap.assembly.tsv
echo '
SELECT
species || " " || REPLACE(assembly_accession, ".", "_") AS organism_name,
species,
genus,
ftp_path,
assembly_level
FROM ar
WHERE 1=1
AND tax_id = species_id -- no strain ID
AND assembly_level IN ("Chromosome", "Complete Genome")
AND species_id IN (29388)
' |
sqlite3 -tabs ~/.nwr/ar_refseq.sqlite \
>> Scap.assembly.tsv
Example 4: find model organisms in a family
echo "
.headers ON
SELECT
tax_id,
organism_name
FROM ar
WHERE 1=1
AND family IN ('Enterobacteriaceae')
AND refseq_category IN ('reference genome')
" |
sqlite3 -tabs ~/.nwr/ar_refseq.sqlite |
sed '1s/^/#/' |
tva to md
| #tax_id | organism_name |
|---|---|
| 511145 | Escherichia coli str. K-12 substr. MG1655 |
| 198214 | Shigella flexneri 2a str. 301 |
| 99287 | Salmonella enterica subsp. enterica serovar Typhimurium str. LT2 |
| 386585 | Escherichia coli O157:H7 str. Sakai |
| 1125630 | Klebsiella pneumoniae subsp. pneumoniae HS11286 |
Download files from NCBI Assembly
Strain info
cat ~/.nwr/assembly_summary_refseq.txt |
sed '1d' |
tva stats -H --missing-count infraspecific_name,isolate,biosample |
tva to md --fmt
# infraspecific_name
cat ~/.nwr/assembly_summary_refseq.txt |
sed '1d' |
tva select -H -f infraspecific_name |
perl -nla -F"=" -e 'print $F[0]' |
tva keep-header -- sort |
uniq -c
# 1 infraspecific_name
# 38 breed
# 109 cultivar
# 82 ecotype
# 67134 na
# 456330 strain
cat ~/.nwr/assembly_summary_genbank.txt |
sed '1d' |
tva select -H -f infraspecific_name |
perl -nla -F"=" -e 'print $F[0]' |
tva keep-header -- sort |
uniq -c
# 1 infraspecific_name
# 565 breed
# 2587 cultivar
# 1098 ecotype
# 1264802 na
# 2151541 strain
cat ~/.nwr/assembly_summary_refseq.txt ~/.nwr/assembly_summary_genbank.txt |
grep -v "^#" |
tva select -f 9 | # infraspecific_name
perl -nla -F"=" -e 'print $F[1]' |
tva keep-header -- sort |
uniq -c |
sort -nr |
head
# 1331958
# 3722 GPSC3
# 2491 Human
# 2451 clinical isolate of L. monocytogenes
# 2360 MSSA
# 1612 ExPEC
# 1377 MRSA
# 1285 GPSC12
# 898 GPSC16
# 854 GPSC55
# String length
cat ~/.nwr/assembly_summary_refseq.txt |
sed '1d' |
tva select -H -f organism_name,infraspecific_name,asm_name,ftp_path |
sed '1d' |
perl -nla -F"\t" -e 'print join qq(\t), map {length} @F ;' |
tva stats --exclude-missing --max 1,2,3,4
#91 88 92 166
| infraspecific_name_missing_count | isolate_missing_count | biosample_missing_count |
|---|---|---|
| 0 | 0 | 0 |
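The `perl -F"="` one-liners above split `infraspecific_name` values such as `strain=K-12` into a key and a value. A minimal awk sketch of the same idea follows; it splits on the first `=` only, so values that themselves contain `=` stay intact. The function name `split_infraspecific` is hypothetical.

```shell
# Sketch: split "key=value" on the first "=" only.
# Lines without "=" (such as "na") are printed as the key with an empty value.
split_infraspecific() {
    awk '{
        pos = index($0, "=")
        if (pos == 0) { printf "%s\t\n", $0; next }
        printf "%s\t%s\n", substr($0, 1, pos - 1), substr($0, pos + 1)
    }'
}

# Illustrative values, one per line
printf 'strain=K-12\nna\ncultivar=a=b\n' | split_infraspecific
```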
Reference genomes
cd ~/Scripts/nwr/docs/
nwr member Bacteria Archaea -r family |
grep -v -i "Candidatus " |
grep -v -i "candidate " |
grep -v " sp." |
grep -v " spp." |
sed '1d' |
sort -n -k1,1 \
> family.list.tsv
wc -l family.list.tsv
#707 family.list.tsv
FAMILY=$(
cat family.list.tsv |
cut -f 1 |
tr "\n" "," |
sed 's/,$//'
)
echo "
.headers ON
SELECT
*
FROM ar
WHERE 1=1
AND family_id IN ($FAMILY)
AND refseq_category IN ('reference genome')
" |
sqlite3 -tabs ~/.nwr/ar_refseq.sqlite \
> reference.tsv
cat reference.tsv |
tsv-select -H -f organism_name,species,genus,ftp_path,assembly_level \
> raw.tsv
cat raw.tsv |
grep -v '^#' |
rgr dedup stdin |
perl ~/Scripts/withncbi/taxon/abbr_name.pl -c "1,2,3" -s '\t' -m 3 --shortsub |
(echo -e '#name\tftp_path\torganism\tassembly_level' && cat ) |
perl -nl -a -F"\t" -e '
BEGIN{my %seen};
/^#/ and print and next;
/^organism_name/i and next;
$seen{$F[3]}++; # ftp_path
$seen{$F[3]} > 1 and next;
$seen{$F[5]}++; # abbr_name
$seen{$F[5]} > 1 and next;
printf qq{%s\t%s\t%s\t%s\n}, $F[5], $F[3], $F[1], $F[4];
' |
keep-header -- sort -k3,3 -k1,1 \
> Bacteria.assembly.tsv
File format: .assembly.tsv
A TAB-delimited file for downloading assembly files.
| Col | Type | Description |
|---|---|---|
| 1 | string | #name: species + infraspecific_name + assembly_accession |
| 2 | string | ftp_path |
| 3 | string | biosample |
| 4 | string | species |
| 5 | string | assembly_level |
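A quick sanity check for a `.assembly.tsv` file can be sketched in shell. This is not part of nwr: `check_assembly_tsv` is a hypothetical helper that only assumes the five-column layout in the table above, and treats lines starting with `#` as headers.

```shell
# Hypothetical helper (not part of nwr): report rows that do not have
# exactly five non-empty TAB-separated fields. Reads files or stdin.
check_assembly_tsv() {
    awk -F '\t' '
        /^#/ { next }                         # skip header/comment lines
        {
            bad = (NF != 5)
            for (i = 1; i <= NF; i++) if ($i == "") bad = 1
            if (bad) { printf "line %d: malformed row\n", NR; err = 1 }
        }
        END { exit err }                      # non-zero status if any row failed
    ' "$@"
}
```

For example, `check_assembly_tsv Bacteria.assembly.tsv && echo OK` would print `OK` only when every data row passes.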
download
Behavior:
- Downloads the latest releases of `taxdump` and assembly reports from NCBI.
- Automatically verifies MD5 checksum for taxdump.
- Extracts taxdump.tar.gz to the NWR directory.
- Skips downloading if files already exist.
Manual Download:
You can also download the files manually:
mkdir -p ~/.nwr
# taxdump
wget -N -P ~/.nwr https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
wget -N -P ~/.nwr https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz.md5
# assembly reports
wget -N -P ~/.nwr https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt
wget -N -P ~/.nwr https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt
# with aria2
cat <<EOF > download.txt
https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz.md5
https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt
https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt
EOF
aria2c -x 4 -s 2 -c -d ~/.nwr -i download.txt
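After a manual download, the checksum can be verified by hand. The sketch below assumes GNU `md5sum` is available and that the `.md5` file starts with the hex digest (the layout `md5sum` itself writes); `verify_md5` is a hypothetical helper, not an nwr command.

```shell
# Hypothetical helper: succeed only if FILE matches the digest stored as the
# first whitespace-separated field of FILE.md5.
verify_md5() {
    local file="$1" expected actual
    expected=$(awk '{ print $1; exit }' "${file}.md5")
    actual=$(md5sum "${file}" | awk '{ print $1 }')
    [ -n "${expected}" ] && [ "${expected}" = "${actual}" ]
}

# Usage after a manual download (left as a comment so the block has no side effects):
# verify_md5 ~/.nwr/taxdump.tar.gz && echo "taxdump.tar.gz OK"
```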
Examples:
- Download with default settings: `nwr download`
- Use a different FTP host: `nwr download --host ftp.ncbi.nih.gov:21`
- Custom paths: `nwr download --tx /pub/taxonomy --ar /genomes/ASSEMBLY_REPORTS`
txdb
Behavior:
- Initializes the taxonomy database from `taxdump.tar.gz`.
- Creates a SQLite database at `~/.nwr/taxonomy.sqlite`.
- Loads data from `division.dmp`, `names.dmp`, and `nodes.dmp`.
- Creates indexes for efficient querying.
Database Location:
~/.nwr/taxonomy.sqlite
The DDL:
DROP TABLE IF EXISTS division;
DROP TABLE IF EXISTS node;
DROP TABLE IF EXISTS name;
CREATE TABLE IF NOT EXISTS division (
id INTEGER NOT NULL
PRIMARY KEY,
division VARCHAR (50) NOT NULL
);
CREATE TABLE IF NOT EXISTS node (
tax_id INTEGER NOT NULL
PRIMARY KEY,
parent_tax_id INTEGER,
rank VARCHAR (25) NOT NULL,
division_id INTEGER NOT NULL,
comment TEXT,
FOREIGN KEY (
division_id
)
REFERENCES division (id)
);
CREATE TABLE IF NOT EXISTS name (
id INTEGER NOT NULL
PRIMARY KEY,
tax_id INTEGER NOT NULL,
name VARCHAR (50) NOT NULL,
name_class VARCHAR (50) NOT NULL
);
Query the database:
echo "
SELECT sql
FROM sqlite_master
WHERE type='table'
ORDER BY name;
" |
sqlite3 -tabs ~/.nwr/taxonomy.sqlite
Examples:
- Initialize the taxonomy database: `nwr txdb`
- Use a custom directory: `nwr txdb --dir /path/to/nwr`
ardb
Behavior:
- Initializes the assembly database from assembly summary files.
- Creates SQLite databases at `~/.nwr/ar_refseq.sqlite` and `~/.nwr/ar_genbank.sqlite`.
- Loads data from `assembly_summary_refseq.txt` or `assembly_summary_genbank.txt`.
- Appends taxonomic lineage information (species, genus, family).
- Filters out incompetent strains (uncultured, unidentified, etc.).
Database Location:
~/.nwr/ar_refseq.sqlite
~/.nwr/ar_genbank.sqlite
Input Columns:
- `assembly_summary_*.txt` have 23 tab-delimited columns.
- Fields with numbers are used in the database.

| Index | Column | DB field |
|---|---|---|
| 0 | assembly_accession | 6 |
| 1 | bioproject | 4 |
| 2 | biosample | 5 |
| 3 | wgs_master | |
| 4 | refseq_category | 7 |
| 5 | taxid AS tax_id | 1 |
| 6 | species_taxid | |
| 7 | organism_name | 2 |
| 8 | infraspecific_name | 3 |
| 9 | isolate | |
| 10 | version_status | |
| 11 | assembly_level | 8 |
| 12 | release_type | |
| 13 | genome_rep | 9 |
| 14 | seq_rel_date | 10 |
| 15 | asm_name | 11 |
| 16 | submitter | |
| 17 | gbrs_paired_asm | 12 |
| 18 | paired_asm_comp | |
| 19 | ftp_path | 13 |
| 20 | excluded_from_refseq | |
| 21 | relation_to_type_material | |
| 22 | asm_not_live_date | |
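The column mapping above can be illustrated with awk. This is a sketch of the selection and reordering only, not the loader nwr actually uses; `select_ar_fields` is a hypothetical name. awk fields are 1-based, so input index N in the listing corresponds to `$(N + 1)`.

```shell
# Sketch: reorder a 23-column assembly_summary row into the 13 fields kept
# in the database, in database field order (tax_id first, ftp_path last).
select_ar_fields() {
    awk -F '\t' -v OFS='\t' '
        /^#/ { next }   # skip comment/header lines of assembly_summary_*.txt
        { print $6, $8, $9, $2, $3, $1, $5, $12, $14, $15, $16, $18, $20 }
    ' "$@"
}

# Usage (left as a comment; needs the downloaded file):
# select_ar_fields ~/.nwr/assembly_summary_refseq.txt | head -n 3
```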
Appended Columns:
14 species
15 species_id
16 genus
17 genus_id
18 family
19 family_id
Filtered Strains:
Incompetent strains matching the following regexes in their organism_name are removed:
\bCandidatus\b
\bcandidate\b
\buncultured\b
\bunidentified\b
\bbacterium\b
\barchaeon\b
\bmetagenome\b
virus\b
phage\b
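To preview which names these patterns would remove, the list can be applied with `grep`. This is an approximation rather than nwr's own code: it assumes GNU grep (for `\b`), matches case-sensitively, and expects one organism name per line. `filter_organism_names` is a hypothetical name.

```shell
# Sketch: drop lines matching any of the regexes listed above.
filter_organism_names() {
    grep -E -v \
        -e '\bCandidatus\b' -e '\bcandidate\b' \
        -e '\buncultured\b' -e '\bunidentified\b' \
        -e '\bbacterium\b'  -e '\barchaeon\b' \
        -e '\bmetagenome\b' -e 'virus\b' -e 'phage\b'
}

# Illustrative names only
printf '%s\n' \
    'Escherichia coli str. K-12 substr. MG1655' \
    'uncultured bacterium' \
    'Escherichia phage T4' \
    'Bacillus subtilis' |
    filter_organism_names
```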
Requirements:
- Strains with `assembly_level` of Scaffold or Contig should have a `genome_rep` of `full`.
- Requires SQLite version 3.34 or above.
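The first requirement can be checked directly on an assembly summary file. The awk sketch below is an illustration under the column order listed under Input Columns (assembly_level is field 12 and genome_rep is field 14 when counted from 1); `keep_full_drafts` is a hypothetical name and the comparison is made case-insensitive as a precaution.

```shell
# Sketch: drop Scaffold/Contig rows whose genome_rep is not "Full".
# Rows at other assembly levels pass through unchanged.
keep_full_drafts() {
    awk -F '\t' '
        /^#/ { next }   # skip comment/header lines
        ($12 == "Scaffold" || $12 == "Contig") && tolower($14) != "full" { next }
        { print }
    ' "$@"
}

# Usage (left as a comment; needs the downloaded file):
# keep_full_drafts ~/.nwr/assembly_summary_refseq.txt | wc -l
```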
Query the database:
echo "
SELECT
COUNT(*)
FROM ar
WHERE 1=1
AND genus IN ('Pseudomonas')
AND assembly_level IN ('Complete Genome', 'Chromosome')
" |
sqlite3 -tabs ~/.nwr/ar_refseq.sqlite
The DDL:
DROP TABLE IF EXISTS ar;
CREATE TABLE IF NOT EXISTS ar (
tax_id INTEGER,
organism_name VARCHAR (200),
infraspecific_name VARCHAR (200),
bioproject VARCHAR (50),
biosample VARCHAR (50),
assembly_accession VARCHAR (50),
refseq_category VARCHAR (50),
assembly_level VARCHAR (50),
genome_rep VARCHAR (50),
seq_rel_date DATE,
asm_name VARCHAR (200),
gbrs_paired_asm VARCHAR (200),
ftp_path VARCHAR (200),
species VARCHAR (50),
species_id INTEGER,
genus VARCHAR (50),
genus_id INTEGER,
family VARCHAR (50),
family_id INTEGER
);
Examples:
- Initialize the RefSeq assembly database: `nwr ardb`
- Initialize the GenBank assembly database: `nwr ardb --genbank`
- Use a custom directory: `nwr ardb --dir /path/to/nwr`
append
Behavior:
- Retrieves taxonomic information from the local taxonomy database.
- Appends scientific names and/or taxon IDs of specified ranks to each row.
- If `--rank` is not specified, appends the scientific name of the input taxon.
- Header lines (starting with “#”) are processed to append appropriate column names.
Valid ranks:
- species, genus, family, order, class, phylum, kingdom
- Other ranks (e.g., clade, no rank) may work but are not officially supported.
Input:
- Accepts one or more TSV files as input.
- Reads from standard input if “stdin” is specified.
- The input file should contain taxon IDs or scientific names in a specific column.
Output:
- Tab-separated values with appended rank columns.
- By default, output is written to standard output.
- Use `--outfile` to write to a file instead.
Examples:
- Append scientific names for specified ranks: `nwr append input.tsv --rank genus --rank family`
- Append both names and IDs: `nwr append input.tsv --rank species --id`
- Read from stdin, append genus information: `cat input.tsv | nwr append stdin --rank genus`
- Specify column and output file: `nwr append input.tsv -c 2 --rank kingdom -o output.tsv`
common
Behavior:
- Outputs the common tree of terms in Newick format.
- Finds the most recent common ancestor of all input terms.
- Constructs a phylogenetic tree showing the relationship.
- Ancestral terms can be Taxonomy IDs or scientific names.
Input:
- Accepts two or more Taxonomy IDs or scientific names.
- Terms are provided as positional arguments.
Output:
- Newick format tree string.
- Tree includes scientific names as node labels.
- By default, output is written to standard output.
- Use `--outfile` to write to a file instead.
Examples:
- Find common ancestor of two species: `nwr common 9606 10090`
- Find common ancestor of multiple taxa: `nwr common "Homo sapiens" "Mus musculus" "Danio rerio"`
- Write to file: `nwr common 9606 10090 -o tree.nwk`
- Use taxonomy IDs: `nwr common 9605 10090 10116`
info
Behavior:
- Retrieves taxonomic information from the local taxonomy database.
- Accepts Taxonomy IDs or scientific names as input.
- By default, outputs detailed information in a custom format.
- Use `--tsv` to output results as tab-separated values.
Input:
- Accepts one or more Taxonomy IDs or scientific names.
- Terms can be provided as positional arguments.
Output:
- Default format shows detailed taxonomic information.
- TSV output includes: tax_id, sci_name, rank, division.
- By default, output is written to standard output.
- Use `--outfile` to write to a file instead.
Examples:
- Get information for a single taxon: `nwr info 9606`
- Get information for multiple taxa: `nwr info 9606 10090 10116`
- Output as TSV: `nwr info Homo_sapiens --tsv`
- Use scientific names: `nwr info "Homo sapiens" "Mus musculus"`
lineage
Behavior:
- Retrieves the lineage of a taxon from root to the specified term.
- Returns the full taxonomic hierarchy including all ranks.
- By default, outputs rank, scientific name, and taxonomy ID for each level.
Input:
- Accepts a single Taxonomy ID or scientific name.
- Use `--tsv` for tab-separated output format.
Output:
- Default output: rank, scientific_name, tax_id (tab-separated)
- TSV output: rank, scientific_name, tax_id (tab-separated)
- By default, output is written to standard output.
- Use `--outfile` to write to a file instead.
Examples:
- Get lineage for a species: `nwr lineage 9606`
- Get lineage using scientific name: `nwr lineage "Homo sapiens"`
- Output as TSV: `nwr lineage 9606 --tsv`
- Write to file: `nwr lineage 9606 -o lineage.txt`
member
Behavior:
- Lists members (of certain ranks) under ancestral term(s).
- Retrieves taxonomic information from the local taxonomy database.
- Ancestral terms can be Taxonomy IDs or scientific names.
- By default, excludes “Environmental samples” division.
- The output file is in the same TSV format as `nwr info --tsv`.
Valid ranks:
- species, genus, family, order, class, phylum, kingdom
- Other ranks (e.g., clade, no rank) may work but are not officially supported.
Input:
- Accepts one or more ancestral Taxonomy IDs or scientific names.
- Optionally filter results by rank using `--rank`.
Output:
- TSV output includes: tax_id, sci_name, rank, division.
- By default, output is written to standard output.
- Use `--outfile` to write to a file instead.
Examples:
- List all members under a genus: `nwr member 9605`
- List only species under a genus: `nwr member Homo --rank species`
- Include environmental samples: `nwr member 4751 --env`
- Multiple ancestors with rank filter: `nwr member Homo Pan --rank genus`
restrict
Behavior:
- Restricts taxonomy terms to descendants of specified ancestor(s).
- Terms can be Taxonomy IDs or scientific names.
- Use `--exclude` to invert the filter (exclude matching lines).
- Header lines (starting with “#”) are always output.
Input:
- Accepts one or more TSV files via the `--file` option.
- Reads from standard input by default.
- The input file should contain taxon IDs or scientific names in a specific column.
Output:
- Filtered tab-separated values.
- By default, output is written to standard output.
- Use `--outfile` to write to a file instead.
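The filtering behavior described above (header pass-through, column selection, optional inversion) can be pictured with a rough awk analogue. It is not how nwr is implemented: it assumes the set of descendant taxon IDs is already available as a plain file with one ID per line, whereas `nwr restrict` resolves descendants from the taxonomy database. `filter_by_ids` is a hypothetical name.

```shell
# Rough analogue of the filtering step only.
# Usage: filter_by_ids IDS_FILE COLUMN [exclude] < input.tsv
filter_by_ids() {
    awk -F '\t' -v col="$2" -v exclude="${3:-}" '
        FILENAME == ARGV[1] { keep[$1] = 1; next }   # first file: allowed IDs
        /^#/ { print; next }                         # header lines always pass through
        {
            key = $col
            hit = (key in keep)
            if ((exclude == "" && hit) || (exclude != "" && !hit)) print
        }
    ' "$1" -
}
```

Passing any non-empty third argument inverts the filter, mirroring the idea of `--exclude`.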
Examples:
- Restrict to descendants of a specific genus: `nwr restrict "Homo" --file input.tsv`
- Restrict using taxonomy ID: `nwr restrict 9605 --file input.tsv`
- Exclude descendants (inverse filter): `nwr restrict "Bacteria" --file input.tsv --exclude`
- Specify column and output file: `nwr restrict "Mammalia" --file input.tsv -c 2 -o output.tsv`
- Multiple ancestors: `nwr restrict "Homo" "Pan" --file input.tsv`
template
Behavior:
- Creates directories, data files, and scripts for phylogenomic research.
- Generates materials for ASSEMBLY, BioSample, MinHash, Count, and Protein steps.
- Uses Tera templates to generate Bash scripts.
Input File Format:
.assembly.tsv is a TAB-delimited file to guide downloading and processing:
| Col | Type | Description |
|---|---|---|
| 1 | string | #name: species + infraspecific_name + assembly_accession |
| 2 | string | ftp_path |
| 3 | string | biosample |
| 4 | string | species |
| 5 | string | assembly_level |
Generated Materials:
- `--ass`: ASSEMBLY/
    - One TSV file: url.tsv
    - Five Bash scripts: rsync.sh, check.sh, n50.sh, collect.sh, finish.sh
- `--bs`: BioSample/
    - One TSV file: sample.tsv
    - Two Bash scripts: download.sh, collect.sh
- `--mh`: MinHash/
    - One TSV file: species.tsv
    - Five Bash scripts: compute.sh, species.sh, abnormal.sh, nr.sh, dist.sh
- `--count`: Count/
    - One TSV file: species.tsv
    - Three Bash scripts: strains.sh, rank.sh, lineage.sh
- `--pro`: Protein/
    - One TSV file: species.tsv
    - Bash scripts: collect.sh, info.sh, count.sh
Examples:
- Generate ASSEMBLY materials: `nwr template input.assembly.tsv --ass`
- Generate all materials: `nwr template input.assembly.tsv --ass --bs --mh --count --pro`
- Specify output directory: `nwr template input.assembly.tsv --ass -o output_dir`
- Use parallel processing: `nwr template input.assembly.tsv --mh --parallel 16`
abbr
Behavior:
- Abbreviates strain scientific names to unique short identifiers.
- Generates abbreviations for genus, species, and strain parts.
- Handles special cases like Candidatus and subspecies names.
- Ensures uniqueness of abbreviations across all input names.
Input:
- Accepts a TSV/CSV file or standard input.
- Each row should contain strain, species, and genus names in separate columns.
- Use `--column` to specify which columns contain these names (default: 1,2,3).
- Common column patterns:
    - `1,2,3` - strain in column 1, species in 2, genus in 3
    - `1,1,2` - no strain: strain and species both in column 1, genus in 2
    - `2,2,3` - don’t need strain part: strain and species in 2, genus in 3
    - `1,1,1` - only strain: all three in column 1
Output:
- Original line followed by a tab and the generated abbreviation.
- Abbreviation format:
    - Normal mode: `Genus_Species_Strain` (e.g., H_sapiens_sapiens)
    - Tight mode (`--tight`): `GenusSpecies_Strain` (e.g., Hsapiens_sapiens)
- Special handling:
    - Candidatus is abbreviated to C
    - Non-alphanumeric characters are replaced with underscores
    - Consecutive underscores are collapsed
    - Leading and trailing underscores are removed
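The last three cleanup rules can be sketched with a single `sed` call. This covers only the character cleanup, not the abbreviation or uniqueness logic, and `sanitize_name` is a hypothetical helper rather than part of `nwr abbr`.

```shell
# Sketch of the cleanup rules only: each run of non-alphanumeric characters
# becomes one "_" (which also collapses repeats), then leading and trailing
# underscores are removed.
sanitize_name() {
    sed -E 's/[^A-Za-z0-9]+/_/g; s/^_+//; s/_+$//'
}

# Illustrative input
echo 'K-12 substr. MG1655' | sanitize_name
```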
Examples:
- Basic usage with default columns: `echo -e 'Homo sapiens,Homo\nHomo erectus,Homo' | nwr abbr -s ',' -c "1,1,2"`
- Tight mode (no underscore between genus and species): `echo -e 'Homo sapiens,Homo\nHomo erectus,Homo' | nwr abbr -s ',' -c "1,1,2" --tight`
- Clean subspecies names: `echo 'Legionella pneumophila subsp. pneumophila' | nwr abbr --shortsub`
- Process a file: `nwr abbr names.tsv -o abbreviated.tsv`
- Custom separator and columns: `nwr abbr data.csv -s ',' -c "1,2,3" -o output.tsv`
kb
Behavior:
- Prints embedded documentation and knowledge bases.
- Extracts built-in files to stdout or a specified output directory.
Available Documents:
- `bac120` - 120 bacterial marker genes (tar.gz archive)
- `ar53` - 53 archaeal marker genes (tar.gz archive)
Output:
- Archive files (bac120, ar53) are extracted to the specified directory.
- By default, output is written to standard output.
- Use `--outfile` to specify an output file or directory.
Examples:
- Extract bacterial marker genes: `nwr kb bac120 -o marker_genes/`
- Extract archaeal marker genes: `nwr kb ar53 -o marker_genes/`
seqdb
Behavior:
- Initializes the sequence database for protein sequence information.
- Creates a SQLite database at `./seq.sqlite`.
- Loads data from various TSV files into appropriate tables.
- Supports loading strains, sizes, clusters, annotations, and assembly sequences.
Database Location:
./seq.sqlite
The DDL:
CREATE TABLE rank (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name VARCHAR NOT NULL UNIQUE
);
-- assembly
CREATE TABLE asm (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name VARCHAR NOT NULL UNIQUE,
rank_id INTEGER NOT NULL,
FOREIGN KEY (rank_id) REFERENCES rank(id)
);
-- sequence
CREATE TABLE seq (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name VARCHAR NOT NULL UNIQUE,
size INTEGER,
anno TEXT
);
-- representative
CREATE TABLE rep (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name VARCHAR NOT NULL UNIQUE,
f1 TEXT,
f2 TEXT,
f3 TEXT,
f4 TEXT,
f5 TEXT,
f6 TEXT,
f7 TEXT,
f8 TEXT
);
-- Junction table to associate rep with seq
CREATE TABLE rep_seq (
rep_id INTEGER NOT NULL,
seq_id INTEGER NOT NULL,
PRIMARY KEY (rep_id, seq_id),
FOREIGN KEY (rep_id) REFERENCES rep(id),
FOREIGN KEY (seq_id) REFERENCES seq(id)
);
-- Junction table to associate asm with seq
CREATE TABLE asm_seq (
asm_id INTEGER NOT NULL,
seq_id INTEGER NOT NULL,
PRIMARY KEY (asm_id, seq_id),
FOREIGN KEY (asm_id) REFERENCES asm(id),
FOREIGN KEY (seq_id) REFERENCES seq(id)
);
-- Regular indices
CREATE INDEX rep_idx_f1 ON rep(f1);
CREATE INDEX rep_idx_f2 ON rep(f2);
CREATE INDEX rep_idx_f3 ON rep(f3);
CREATE INDEX rep_idx_f4 ON rep(f4);
CREATE INDEX rep_idx_f5 ON rep(f5);
CREATE INDEX rep_idx_f6 ON rep(f6);
CREATE INDEX rep_idx_f7 ON rep(f7);
CREATE INDEX rep_idx_f8 ON rep(f8);
-- Case-insensitive indices for `like`
CREATE INDEX seq_idx_anno ON seq(anno COLLATE NOCASE);
Notes:
- If `--strain` is called without specifying a path, it will load the default file under `--dir`.
- `--rep` requires a key-value pair in the format `--rep f1=file`.
- Valid fields for `--rep` are: f1, f2, f3, f4, f5, f6, f7, f8.
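When generating `--rep` arguments from a script, the `field=file` form can be validated up front. The sketch below uses the same parameter-expansion idiom as `${item%%::*}` in Example 1; `parse_rep_arg` is a hypothetical helper, not something nwr provides.

```shell
# Sketch: split a "field=file" argument and check that the field is f1..f8.
# Prints "field<TAB>file" on success, returns non-zero otherwise.
parse_rep_arg() {
    local arg="$1"
    local field="${arg%%=*}"   # text before the first "="
    local file="${arg#*=}"     # text after the first "="
    case "${field}" in
        f[1-8]) ;;
        *) echo "invalid field: ${field}" >&2; return 1 ;;
    esac
    [ "${arg}" != "${field}" ] || return 1   # no "=" present
    printf '%s\t%s\n' "${field}" "${file}"
}

parse_rep_arg "f1=features.tsv"
```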
Examples:
- Initialize the database: `nwr seqdb --init`
- Load strain information: `nwr seqdb --strain strains.tsv`
- Load multiple data types: `nwr seqdb --strain --size --clust`
- Load features into rep table: `nwr seqdb --rep f1=features.tsv`
taxonomy.sqlite
Tables
Relations
Generated by tbls
ar_refseq.sqlite
Tables
| Name | Columns | Comment | Type |
|---|---|---|---|
| ar | 17 | | table |
Relations
Generated by tbls