Data & Commands Overview

This project's functionality is organized into tasks, each responsible for collecting a specific type of congressional data. These tasks are executed via the usc-run command-line utility.

Basic Usage

The general syntax for running a task is:

usc-run <data-type> [options]

Here, <data-type> names one of the available scrapers. This section describes each primary data type and its specific options.

Common Options

Several options are common to most or all commands:

  • --log=<level>: Sets the logging verbosity. Accepted values are debug, info, warn, and error; the default is warn.
  • --debug: A shortcut for --log=debug.
  • --force: Re-downloads all network resources, ignoring anything already in the cache directory.
  • --timestamps: Adds timestamps to all log output.
  • --govtrack: When outputting XML, this flag ensures the output uses GovTrack legislator IDs for full backward compatibility with legacy data.
  • --congress=<number>: Restricts the scrape to a specific Congress (e.g., --congress=117). Many commands also accept a comma-separated list to cover multiple Congresses.
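
As a rough illustration of how these options combine on one command line, the following Python sketch assembles a usc-run invocation. The helper name build_usc_command and the data type "votes" are assumptions made for this example; only the flags listed above are used, and the sketch builds the argument list without running anything.

```python
import shlex

# Log levels documented for --log above.
VALID_LOG_LEVELS = ("debug", "info", "warn", "error")


def build_usc_command(data_type, congress=None, log=None,
                      force=False, timestamps=False, govtrack=False):
    """Return an argument list for usc-run (hypothetical helper, not part of the project)."""
    args = ["usc-run", data_type]
    if congress is not None:
        args.append(f"--congress={congress}")
    if log is not None:
        if log not in VALID_LOG_LEVELS:
            raise ValueError(f"unknown log level: {log!r}")
        args.append(f"--log={log}")
    if force:
        args.append("--force")
    if timestamps:
        args.append("--timestamps")
    if govtrack:
        args.append("--govtrack")
    return args


if __name__ == "__main__":
    # "votes" is an assumed data type, used only for illustration.
    cmd = build_usc_command("votes", congress=117, log="info", force=True)
    print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run if you want to launch the task from a script rather than from a shell.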

Output Structure

By default, all tasks write their output to two top-level directories:

  • cache/: Stores raw downloaded files from the internet. Subsequent runs will use these files instead of re-downloading them unless you use the --force flag.
  • data/: Stores the final, processed, structured data. The internal structure of this directory depends on the data type being collected.
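
The interaction between cache/ and --force can be pictured with a small Python sketch. This is not the project's actual implementation; the function fetch_with_cache, its parameters, and the cache key layout are hypothetical and only illustrate the "reuse the cached file unless forced" behavior described above.

```python
from pathlib import Path


def fetch_with_cache(cache_key, download, cache_dir="cache", force=False):
    """Return the bytes for cache_key, downloading only when needed.

    cache_key is a relative path under cache_dir (hypothetical layout);
    download is a zero-argument callable that returns the raw bytes.
    """
    path = Path(cache_dir) / cache_key
    if path.exists() and not force:
        # A previous run already stored this resource: reuse it.
        return path.read_bytes()
    # Either nothing is cached yet or the caller asked to force a refresh.
    body = download()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(body)
    return body
```

With force left at False, a second call for the same key reads the file from disk and never invokes download; with force=True the download always runs and overwrites the cached copy.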

For most data objects (like a bill or vote), two files are generated:

  • data.json: A detailed JSON representation of the data.
  • data.xml: An XML version that maintains backward compatibility with the format historically provided by GovTrack.us.
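
Because each data object gets its own data.json, one simple way to consume the output is to walk data/ and load every such file. The sketch below assumes only that the files are named data.json and contain valid JSON; the function name iter_data_json is hypothetical, and nothing is assumed about the directory layout or the fields inside each file.

```python
import json
from pathlib import Path


def iter_data_json(data_dir="data"):
    """Yield (path, parsed_json) for every data.json found under data_dir."""
    # rglob searches all subdirectories, so no particular layout is required.
    for path in sorted(Path(data_dir).rglob("data.json")):
        with path.open(encoding="utf-8") as fh:
            yield path, json.load(fh)


if __name__ == "__main__":
    for path, record in iter_data_json():
        # Print only the location and the top-level type of each record.
        print(path, type(record).__name__)
```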