GovInfo Downloader

This task downloads documents from the GovInfo.gov website of the Government Publishing Office (GPO). It reads GovInfo's sitemaps to determine which files need to be downloaded or updated, so the same command works for both an initial bulk download and regular updates.

Other tasks, such as bills and statutes, often depend on the files this task downloads, so it usually needs to run first.

Collections vs. Bulk Data

GovInfo provides data in two main ways, and this task can handle both:

  1. Collections (--collections): These are standard document collections, often containing multiple formats for each item (PDF, HTML, XML). Examples include BILLS (for bill text) and STATUTE (for Statutes at Large).
  2. Bulk Data (--bulkdata): These are collections specifically designed for bulk download, often organized by Congress. The most important one is BILLSTATUS, which contains the detailed metadata processed by the bills task.

Usage

Downloading a Standard Collection

Use the --collections flag with a comma-separated list of collection codes.

# Download bill text and Statutes at Large for the year 2022
usc-run govinfo --collections=BILLS,STATUTE --years=2022
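
To cover a range of years, the comma-separated value for --years can be built in the shell. The sketch below is only an illustration: year_list is a hypothetical helper, not part of usc-run, and the final line prints the command for review instead of running it.

```shell
# Hypothetical helper: print every year from $1 through $2 as a comma-separated list.
year_list() {
    first=$1
    last=$2
    out=""
    y=$first
    while [ "$y" -le "$last" ]; do
        # Add a comma before each year except the first.
        out="${out:+$out,}$y"
        y=$((y + 1))
    done
    printf '%s\n' "$out"
}

# Print the resulting command so it can be checked before it is run.
echo "usc-run govinfo --collections=BILLS,STATUTE --years=$(year_list 2019 2022)"
```

Once the printed command looks right, it can be pasted into the terminal as-is.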

Downloading Bulk Data

Use the --bulkdata flag with a comma-separated list of collection codes.

# Download bill status information for the 117th Congress
usc-run govinfo --bulkdata=BILLSTATUS --congress=117
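
The two usage examples select by different units: calendar year for collections, Congress number for bulk data. The sketch below converts one to the other, assuming the usual numbering in which the 1st Congress convened in 1789 and each Congress spans two years. congress_for_year is a hypothetical helper, not part of usc-run.

```shell
# Hypothetical helper: print the number of the Congress in session for most of year $1.
# Assumes the 1st Congress began in 1789 and each Congress lasts two years.
congress_for_year() {
    echo $(( ($1 - 1789) / 2 + 1 ))
}

congress_for_year 2022   # prints 117
congress_for_year 2023   # prints 118
```

A modern Congress begins on January 3 of an odd-numbered year, so the first days of January in such a year still belong to the previous Congress.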

Options

  • --collections=<CODES>: A comma-separated list of standard collection codes to download (e.g., BILLS, STATUTE, CRPT).
  • --bulkdata=<CODES>: A comma-separated list of bulk data collection codes to download (e.g., BILLSTATUS, FR).
  • --years=<YYYY,YYYY>: Restricts downloads to specific years. Applies to collections organized by year.
  • --congress=<NUM,NUM>: Restricts downloads to specific congresses. Applies to collections like BILLSTATUS.
  • --type=<type,type>: Restricts downloads to specific bill types (e.g., hr, s). Applies to BILLSTATUS.
  • --extract=<formats>: After downloading a package ZIP file from a standard collection, extract the specified internal files. Common formats are mods, pdf, text, xml, premis.

    Example:

    # Download bill text packages and extract the PDF and MODS XML files
    usc-run govinfo --collections=BILLS --extract=pdf,mods
    
  • --filter="<regex>": Only downloads files where the package name (for collections) or file path (for bulk data) matches the provided regular expression.

  • --cached: Forces the use of the cache, preventing any network requests.
  • --force: Forces re-download of all files, ignoring the cache.
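
Because a bulk run can take a long time, it may help to try a --filter pattern against a few sample names before using it. The sketch below uses grep -E as a stand-in for the task's own matching, which is an assumption: the task's regular-expression dialect and anchoring may differ, so the pattern is anchored explicitly with ^ to reduce ambiguity. preview_filter is a hypothetical helper, and the package names are illustrative samples in the BILLS-<congress><type><number><version> style.

```shell
# Hypothetical helper: read candidate names on stdin and print those the pattern selects.
# grep -E approximates the task's matching; this is an assumption, not its actual code.
preview_filter() {
    grep -E -- "$1"
}

# Sample package names, one per line, piped through the candidate pattern.
printf '%s\n' \
    BILLS-117hr1319enr \
    BILLS-117s2938is \
    BILLS-116hr748enr |
    preview_filter '^BILLS-117hr'
# prints BILLS-117hr1319enr
```

If the preview selects the intended names, the same pattern can then be passed as --filter="^BILLS-117hr".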