GovInfo Downloader
This task downloads documents from GPO's GovInfo.gov website. It uses GovInfo's sitemaps to determine what needs to be downloaded or updated, which makes it suitable for both initial bulk downloads and regular updates.
This task is often a prerequisite for other tasks, such as bills and statutes.
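The sitemap-based update check described above can be sketched roughly as follows. This is a minimal illustration, not the task's actual implementation: it assumes a sitemap in the standard sitemaps.org format (`<url>` entries with `<loc>` and `<lastmod>`), and the function name `entries_needing_download`, the sample URLs, and the cache layout (a dictionary of URL to last-seen `lastmod`) are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by the standard sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content; the URLs are placeholders, not real GovInfo paths.
SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.invalid/pkg/A</loc>
    <lastmod>2022-01-05T10:00:00Z</lastmod>
  </url>
  <url>
    <loc>https://example.invalid/pkg/B</loc>
    <lastmod>2022-03-01T08:30:00Z</lastmod>
  </url>
  <url>
    <loc>https://example.invalid/pkg/C</loc>
    <lastmod>2022-03-02T12:00:00Z</lastmod>
  </url>
</urlset>
""".strip()


def entries_needing_download(sitemap_xml, cached_lastmod):
    """Return URLs whose lastmod differs from, or is missing in, the cache."""
    root = ET.fromstring(sitemap_xml)
    stale = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        # A new entry (not in the cache) or a changed timestamp means a download is due.
        if cached_lastmod.get(loc) != lastmod:
            stale.append(loc)
    return stale


# Assumed cache state: A is current, B was seen with an older timestamp, C is unknown.
cache = {
    "https://example.invalid/pkg/A": "2022-01-05T10:00:00Z",
    "https://example.invalid/pkg/B": "2022-02-01T00:00:00Z",
}
to_fetch = entries_needing_download(SAMPLE_SITEMAP, cache)
```

With this cache, only entries B and C would be fetched. This is the property that makes a sitemap-driven approach workable for regular updates.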
Collections vs. Bulk Data
GovInfo provides data in two main ways, and this task can handle both:
- Collections (--collections): These are standard document collections, often containing multiple formats for each item (PDF, HTML, XML). Examples include BILLS (for bill text) and STATUTE (for Statutes at Large).
- Bulk Data (--bulkdata): These are collections specifically designed for bulk download, often organized by Congress. The most important one is BILLSTATUS, which contains the detailed metadata processed by the bills task.
Usage
Downloading a Standard Collection
Use the --collections flag with a comma-separated list of collection codes.
# Download bill text and Statutes at Large for the year 2022
usc-run govinfo --collections=BILLS,STATUTE --years=2022
Downloading Bulk Data
Use the --bulkdata flag with a comma-separated list of collection codes.
# Download bill status information for the 117th Congress
usc-run govinfo --bulkdata=BILLSTATUS --congress=117
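For regular updates driven from a script, the two invocations above could be assembled programmatically. The sketch below is only an illustration: the helper name `build_govinfo_args` is hypothetical, it uses only the flags shown in this document, and it assumes `usc-run` is on the PATH when the resulting list is eventually passed to something like `subprocess.run`.

```python
def build_govinfo_args(mode, codes, years=None, congress=None):
    """Build an argument list for `usc-run govinfo` (hypothetical helper).

    mode is "collections" or "bulkdata"; codes is a list of collection codes.
    """
    if mode not in ("collections", "bulkdata"):
        raise ValueError("mode must be 'collections' or 'bulkdata'")
    if not codes:
        raise ValueError("at least one collection code is required")

    # Each flag takes a comma-separated list, as in the examples above.
    args = ["usc-run", "govinfo", "--{}={}".format(mode, ",".join(codes))]
    if years:
        args.append("--years=" + ",".join(str(y) for y in years))
    if congress:
        args.append("--congress=" + ",".join(str(c) for c in congress))
    return args


# Equivalent to the two command lines shown above.
collection_args = build_govinfo_args("collections", ["BILLS", "STATUTE"], years=[2022])
bulkdata_args = build_govinfo_args("bulkdata", ["BILLSTATUS"], congress=[117])
```

Passing the list form to `subprocess.run(collection_args, check=True)` avoids shell quoting issues. The sketch does not run the command itself.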
Options
- --collections=<CODES>: A comma-separated list of standard collection codes to download (e.g., BILLS, STATUTE, CRPT).
- --bulkdata=<CODES>: A comma-separated list of bulk data collection codes to download (e.g., BILLSTATUS, FR).
- --years=<YYYY,YYYY>: Restricts downloads to specific years. Applies to collections organized by year.
- --congress=<NUM,NUM>: Restricts downloads to specific congresses. Applies to collections like BILLSTATUS.
- --type=<type,type>: Restricts downloads to specific bill types (e.g., hr, s). Applies to BILLSTATUS.
.-
--extract=<formats>
: After downloading a package ZIP file from a standard collection, extract the specified internal files. Common formats aremods
,pdf
,text
,xml
,premis
.Example:
// Download bill text packages and extract the PDF and MODS XML files usc-run govinfo --collections=BILLS --extract=pdf,mods
- --filter="<regex>": Only downloads files where the package name (for collections) or file path (for bulk data) matches the provided regular expression.
- --cached: Forces the use of the cache, preventing any network requests.
- --force: Forces re-download of all files, ignoring the cache.
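To get a feel for how a --filter expression selects items, a pattern can be tried against sample names before running a download. The sketch below is an approximation under stated assumptions: the package names and bulk data paths are illustrative examples, the helper `apply_filter` is hypothetical, and it assumes the pattern may match anywhere in the name (`re.search` semantics), which this document does not confirm.

```python
import re


def apply_filter(pattern, names):
    """Return the names in which the regular expression matches anywhere."""
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]


# Illustrative package names, as they might appear in a standard collection.
package_names = ["BILLS-117hr1ih", "BILLS-117s25is", "BILLS-117hr2471enr"]

# Illustrative file paths, as they might appear in a bulk data collection.
bulk_paths = [
    "BILLSTATUS/117/hr/BILLSTATUS-117hr1.xml",
    "BILLSTATUS/117/s/BILLSTATUS-117s25.xml",
]

# Keep only House bills from the 117th Congress.
house_packages = apply_filter(r"-117hr\d+", package_names)

# Keep only files under the Senate bill directory.
senate_paths = apply_filter(r"/s/", bulk_paths)
```

If the task anchors the pattern at the start of the name instead, the same patterns would need a leading `.*` to behave as shown here.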