genome_updater is a bash script that downloads and updates (non-redundant) snapshots of the NCBI Genomes repository (RefSeq/GenBank) [1] with advanced filters, detailed logs and reports, file integrity checks (MD5, gzip), NCBI taxonomy and GTDB [2] integration and support for parallel [3] downloads. genome_updater uses the assembly_summary.txt to retrieve data.
# Download script, allow execution
wget --quiet --show-progress https://raw.githubusercontent.com/pirovc/genome_updater/master/genome_updater.sh
chmod +x genome_updater.sh
# Download archaeal complete genome sequences from RefSeq (-t parallel downloads, -G check gz integrity)
./genome_updater.sh -o "arc_refseq_cg" -d "refseq" -g "archaea" -l "complete genome" -f "genomic.fna.gz" -t 12 -GSome time later, get a up-to-date version from NCBI, downloading only newly added files:
# Check if updates are available with a dry-run (-k)
./genome_updater.sh -o "arc_refseq_cg" -k
# Execute update
./genome_updater.sh -o "arc_refseq_cg" -G- boolean flags (e.g.
-G) are not tracked between versions and have to be repeated on the update command to be included.
conda install -c bioconda genome_updater wget https://raw.githubusercontent.com/pirovc/genome_updater/master/genome_updater.sh
chmod +x genome_updater.shbash>=4.*or above is required.- this script attempts to be portable and only depends on GNU Core Utilities + few additional tools (
awkbcfindfmtgzipjoinmd5sumparallelsedtarwgetand optionallycurl) which are commonly available and installed in most distributions. genome_updater.sh -Zwill show more info about dependencies in your system and report if someting is missing- To test if all genome_updater functions are running properly on your system:
git clone --recurse-submodules https://github.com/pirovc/genome_updater.git
cd genome_updater
tests/test.sh./genome_updater -h
┌─┐┌─┐┌┐┌┌─┐┌┬┐┌─┐ ┬ ┬┌─┐┌┬┐┌─┐┌┬┐┌─┐┬─┐
│ ┬├┤ ││││ ││││├┤ │ │├─┘ ││├─┤ │ ├┤ ├┬┘
└─┘└─┘┘└┘└─┘┴ ┴└─┘────└─┘┴ ─┴┘┴ ┴ ┴ └─┘┴└─
v0.8.0
Source:
-d Database(s) (comma-separated, mandatory)
Options: "genbank, refseq"
Default: ""
-f File type(s) to download (comma-separated, mandatory)
Options: "genomic.fna.gz, assembly_report.txt, protein.faa.gz, genomic.gbff.gz, ..." all available formats
are described at https://ftp.ncbi.nlm.nih.gov/genomes/all/README.txt
Default: "assembly_report.txt"
Organism/Taxa:
-g Organism group(s) (comma-separated, empty for all)
Options: "archaea, bacteria, fungi, human, invertebrate, metagenomes, other, plant, protozoa,
vertebrate_mammalian, vertebrate_other, viral"
Default: ""
-T Taxonomic group(s) (comma-separated, empty for all)
Optional negation using the ^ prefix.
Example: "543,^562" (for -M ncbi) or "f__Enterobacteriaceae,^s__Escherichia coli" (for -M gtdb)
Default: ""
Filter:
-c RefSeq category (comma-separated, empty for all)
Options: "reference genome, na"
Default: ""
-l Assembly level (comma-separated, empty for all)
Options: "Complete Genome, Chromosome, Scaffold, Contig"
Default: ""
-D Start date (empty for no filter)
Keep assemblies with sequence release date greater then or equal (>=) to value. Format YYYYMMDD.
Default: ""
-E End date (empty for no filter)
Keep assemblies with sequence release date less then or equal (<=) to value. Format YYYYMMDD.
Default: ""
-F Custom assembly summary filter (empty for no filter)
Use awk syntax, e.g.: $ for column index, || "or", && "and", ! "not", parentheses for nesting. Case
sensitive. Columns info at https://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt
Examples:
Single: -F '$14 == "Full"'
Multi: -F '($2 == "PRJNA12377" || $2 == "PRJNA670754") && $4 != "Partial"'
Regex: -F '$8 ~ /bacterium/'
Whole-file: -F '$0 ~ "plasmid"'
Default: ""
Taxonomy:
-M Taxonomy
"gtdb[-*]" filters assemblies present in the GTDB version, which contains archaea and bacteria only. "gtdb"
uses latest GTDB release. "ncbi" filters latest assemblies (version_status=latest). This option changes
the behavior of -T -A -a.
Options: "ncbi, gtdb, gtdb-80, gtdb-83, gtdb-89, gtdb-232, gtdb-214.1, gtdb-207, gtdb-202, gtdb-86.2,
gtdb-95, gtdb-226, gtdb-220"
Default: "ncbi"
-A Top assemblies (0 for all)
Option to keep a limited number of assemblies for each taxa leaf nodes. Selection by tax. ranks are supported
in the format "rank:number", e.g.: "genus:3" to keep only 3 assemblies for each genus. Top choice based
on sorted fields: RefSeq Category, Assembly level, Relation to type material, Date (most recent).
Options (ranks): "species, genus, family, order, class, phylum, domain"
Default: 0
-a (boolean flag)
Download and keep taxonomy database files in the output folder
Run:
-k Dry-run mode
Only checks for possible actions, no real data is downloaded, deleted or updated
-i Fix mode
Re-download incomplete or failed data from a previous run. Can also be used to change files (-f).
-t Threads
Number of processes to parallelize downloads and some file operations
Default: 1
-L Downloader program
Options: "wget, curl"
Default: "wget"
-G gzip check (boolean flag)
Check integrity of downloaded gzipped files with "gzip -t". Downloaded files are removed if test fail.
-m MD5 check (boolean flag)
Download, compute and check the MD5 checksum for all downloaded files. Downloaded files are removed if
checksum can be downloaded but does not match.
Output:
-o Output directory
Default: "./tmp.XXXXXXXXXX" (random folder)
-b Version label
Name for the downloaded version. Will generate a directory inside the output directory (-o).
Default: "YYYY-MM-DD_HH-MM-SS" (current timestamp)
-N Files directory structure
The "split" structure store files in sub-directories based on the assembly accession, e.g.:
files/GCF/000/499/605/GCF_000499605.1_genomic.fna.gz. The "flat" will store everything under one dir,
e.g.: files/GCF_000499605.1_genomic.fna.gz
Options: "split, flat"
Default: "split"
Report:
-u Assembly accession report (boolean flag)
Generate a report (*_assembly_accession.txt) with updated assembly accessions with the fields (tab-separated):
Added/Removed, assembly accession, url
-r Sequence accession report (boolean flag)
Generate a report (*_sequence_accession.txt) with updated sequence accessions with the fields (tab-separated):
Added/Removed, assembly accession, genbank accession, refseq accession, sequence length, taxid. Only
available when file format (-f) "assembly_report.txt" is selected and successfully downloaded.
-p URL report (boolean flag)
Generate two files with successful and failed URLs (url_downloaded.txt, url_failed.txt)
Misc.:
-e Local assembly_summary.txt
Use provided "assembly_summary.txt" instead of downloading. Mutually exclusive with -d and -g
Default: ""
-B Alternative version label
Use a previous version label instead of the latest as base version. Can be also used to rollback to an
older version or to create multiple branches from a base version. Mutually exclusive with -i.
Default: ""
-H Link mode
Change link type for files kept between versions. Hard links save inodes (useful on HPC systems) and allow
version deletion.
Options: "hard, soft"
Default: "hard"
-R Retry batches
Number of attempts to retry failed downloads in batches.
Default: "5"
-n Conditional exit status
Change exit code based on number of failures accepted, otherwise will Exit Code = 1. For example: -n 10
will exit code 1 if 10 or more files failed to download
Options: integer for file number, float for percentage, 0 = off
Default: "0"
-x Delete extra files (boolean flag)
Search and delete files that do not belong to the current version inside "files/" directory.
-s Silent output
-w Silent output with download progress
-V Verbose log
-Z Print debug information and run in debug mode
When updating an existing local repository:
- Newly added sequences will be downloaded, creating a new version.
-bcan be used to change the version name, otherwise will use the current timestamp by default.
- Removed or old sequences will be retained, but not transferred to the new version.
- Repeated/unchanged files are linked to the new version.
- Hard links are used by default, soft links can be enforced with
-H soft.
- Hard links are used by default, soft links can be enforced with
- Arguments can be added to or changed.
- Based on the quick start example, use the command
./genome_updater.sh -o "arc_refseq_cg" -t 2to specify a different number of threads, or./genome_updater.sh -o "arc_refseq_cg" -l ""to remove thecomplete genomefilter.
- Based on the quick start example, use the command
- The file
history.tsvwill be created in the output folder (-o), tracking the versions and arguments used. Please note that boolean flags/arguments are not tracked (e.g.-m).- The
history.tsvhas the columns (tab-separated):current_label: empty for new downloads, or refering to previous versionnew_label: the current label. It is empty for fix runs-itimestamp: time and date of the executionassembly_summary_entries: number of assemblies for thenew_labelversionarguments: used arguments without boolean flags
- The
# Download (-m to check integrity of downloaded files)
./genome_updater.sh -d "refseq" -g "archaea,bacteria,fungi,viral" -f "genomic.fna.gz" -o "arc_bac_fun_vir_refseq_cg" -t 12 -m
# Update (e.g. some days later)
./genome_updater.sh -o "arc_bac_fun_vir_refseq_cg" -m# -t 12 for using 12 threads to download in parallel
./genome_updater.sh -d "refseq" -T "2559587" -f "genomic.fna.gz" -o "all_rna_virus" -t 12 -m./genome_updater.sh -d "genbank" -g "bacteria" -f "genomic.fna.gz" -o "top1_bacteria_genbank" -A 1 -t 12 -m ./genome_updater.sh -d "genbank" -g "bacteria" -f "genomic.fna.gz" -o "top1species_bacteria_genbank" -A "species:1" -t 12 -m ./genome_updater.sh -d "refseq,genbank" -g "archaea,bacteria" -f "genomic.fna.gz" -o "GTDB_complete" -M "gtdb" -t 12 -m./genome_updater.sh -d "refseq,genbank" -g "archaea,bacteria" -f "genomic.fna.gz" -o "GTDB_R220_complete" -M "gtdb-220" -t 12 -m./genome_updater.sh -d "refseq,genbank" -g "archaea,bacteria" -f "genomic.fna.gz" -o "GTDB_top2genus" -M "gtdb" -A "genus:2" -t 12 -m./genome_updater.sh -d "refseq,genbank" -g "archaea,bacteria" -f "genomic.fna.gz" -o "GTDB_family_Gastranaerophilaceae" -M "gtdb" -T "f__Gastranaerophilaceae" -t 12 -m./genome_updater.sh -d "refseq,genbank" -g "archaea,bacteria" -f "genomic.fna.gz" -o "GTDB_Mycobacteriacea_minus_Mycobacterium" -M "gtdb" -T "f__Mycobacteriacea,^g__Mycobacterium" -t 12 -m# Dry-run to check files available
./genome_updater.sh -d "refseq" -g "archaea,bacteria" -l "complete genome" -f "genomic.fna.gz" -k
# Download (-o output folder, -t threads, -m checking md5, -u extended assembly accession report)
./genome_updater.sh -d "refseq" -g "archaea,bacteria" -l "complete genome" -f "genomic.fna.gz" -o "arc_bac_refseq_cg" -t 12 -u -m
# Downloading additional .gbff files for the current snapshot (adding genomic.gbff.gz to -f , -i to just add files and not update)
./genome_updater.sh -f "genomic.fna.gz,genomic.gbff.gz" -o "arc_bac_refseq_cg" -i
# Some days later, just check for updates but do not update
./genome_updater.sh -o "arc_bac_refseq_cg" -k
# Perform update
./genome_updater.sh -o "arc_bac_refseq_cg" -u -m# Download the complete bacterial refseq
./genome_updater.sh -d "refseq" -g "bacteria" -f "genomic.fna.gz" -o "bac_refseq" -t 12 -m -b "all"
# Branch the main files into two sub-versions (no new files will be downloaded or copied)
./genome_updater.sh -o "bac_refseq" -B "all" -b "complete" -l "complete genome"
./genome_updater.sh -o "bac_refseq" -B "all" -b "reference" -c "reference genome"./genome_updater.sh -d "refseq" -g "fungi" -f "assembly_report.txt" -o "fungi" -t 12 -rpu./genome_updater.sh -e /my/path/assembly_summary.txt -f "genomic.fna.gz" -o "recovered_sequences"# Defaults: retries=3 timeout=120
retries=10 timeout=600 ./genome_updater.sh -g "fungi" -o fungi -t 12 -f "genomic.fna.gz,assembly_report.txt" -L curl -R 10new_taxdump_file="my/local/new_taxdump.tar.gz" ./genome_updater.sh -T 562 -o 562assemblies -t 12- Note that the new_taxdump is required, not the more common
taxdump.tar.gz.
# NCBI
ncbi_base_url="https://ftp.ncbi.nih.gov/" ./genome_updater.sh -d refseq -g bacteria
# GTDB
gtdb_base_url="https://data.ace.uq.edu.au/public/gtdb/data/releases/" ./genome_updater.sh -d refseq,genbank -g bacteria,archaeaThe -u parameter activates the output of a list of updated assembly accessions for entries where all files have been successfully downloaded. The file {timestamp}_assembly_accession.txt contains the following tab-separated fields: Added/Removed [A/R], assembly accession, url
Example:
A GCF_000146045.2 ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64
A GCF_000002515.2 ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/515/GCF_000002515.2_ASM251v1
R GCF_000091025.4 ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/091/025/GCF_000091025.4_ASM9102v4The -r parameter activates the output of a list of updated sequence accessions for entries for which all files have been successfully downloaded. This option is only available when the file type contains assembly_report.txt . The file {timestamp}_sequence_accession.txt contains the following tab-separated fields: Added/Removed [A/R], assembly accession, genbank accession, refseq accession, sequence length, taxonomic id
Example:
A GCA_000243255.1 CM001436.1 NZ_CM001436.1 3200946 937775
R GCA_000275865.1 CM001555.1 NZ_CM001555.1 2475100 28892- Note: if the run breaks or does not finish successfuly, some files may be missing from the assembly and sequence accession reports.
The -p parameter activates the output of a list of failed and successfully downloaded URLs to the files {timestamp}_url_downloaded.txt and {timestamp}_url_failed.txt. The failed list will only be complete if the command runs to completion without errors or interruptions.
The -A option will select the 'best' assemblies for each taxonomic node (leaf or specific rank) according to four categories (A–D), in order of importance:
A) refseq Category:
1) reference genome
2) na
B) Assembly level:
3) Complete Genome
4) Chromosome
5) Scaffold
6) Contig
C) Relation to type material:
7) assembly from type material
8) assembly from synonym type material
9) assembly from pathotype material
10) assembly designated as neotype
11) assembly designated as reftype
12) ICTV species exemplar
13) ICTV additional isolate
D) Date:
14) Most recent first
[1] https://ftp.ncbi.nlm.nih.gov/genomes/
[2] https://gtdb.ecogenomic.org/
[3] O. Tange (2018): GNU Parallel 2018, March 2018, https://doi.org/10.5281/zenodo.1146014.