Configuration

While the scrapers can run without any configuration, you can customize their behavior using a config.yml file and various command-line options.

Configuration File (config.yml)

To configure the project, copy the example configuration file and edit the copy to suit your needs.

cp config.yml.example config.yml

The script will automatically detect and use this file if it exists.

Output Directories

You can specify custom paths for the cache and data output directories.

# output directories
output:
  cache: /path/to/your/cache
  data: /path/to/your/data
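
Once parsed, the settings above form a nested mapping. The following is a minimal sketch of how such a mapping could be resolved to absolute paths. It is not the project's actual code: resolve_output_dirs and DEFAULT_DIRS are hypothetical names, and the relative cache and data fallbacks are assumptions made for illustration.

```python
import os

# Assumed fallbacks for illustration only; the project's real defaults may differ.
DEFAULT_DIRS = {"cache": "cache", "data": "data"}


def resolve_output_dirs(config):
    """Return absolute cache/data paths, using the fallbacks for missing keys."""
    output = (config or {}).get("output") or {}
    return {
        name: os.path.abspath(os.path.expanduser(output.get(name) or default))
        for name, default in DEFAULT_DIRS.items()
    }


# The dictionary a YAML parser would produce for the snippet above.
config = {"output": {"cache": "/path/to/your/cache", "data": "/path/to/your/data"}}
dirs = resolve_output_dirs(config)
```

Calling resolve_output_dirs with an empty or missing configuration returns the fallback directories relative to the current working directory.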

Email Notifications

The script can send email notifications when a parsing or execution error occurs. This is useful for monitoring long-running scraping tasks. Fill in your SMTP server details to enable this feature.

# email settings
email:

  # smtp details
  hostname: smtp.example.com
  port: 587
  user_name: your_username
  password: your_password
  starttls: true

  # email content
  subject: "[CONGRESS] Scraper Notice"
  from: scraper@example.com
  from_name: Congress Scraper
  to: your_email@example.com
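
To show how these settings fit together, here is a rough sketch that builds and sends a notice with Python's standard smtplib and email modules. It illustrates the general technique and is not the scraper's own implementation: build_notice and send_notice are hypothetical helpers, and the settings dictionary simply mirrors the keys shown above.

```python
import smtplib
from email.message import EmailMessage
from email.utils import formataddr


def build_notice(settings, body):
    """Assemble the notification message from the email settings."""
    msg = EmailMessage()
    msg["Subject"] = settings["subject"]
    msg["From"] = formataddr((settings["from_name"], settings["from"]))
    msg["To"] = settings["to"]
    msg.set_content(body)
    return msg


def send_notice(settings, body):
    """Send the notice over SMTP, upgrading to TLS when starttls is set."""
    msg = build_notice(settings, body)
    with smtplib.SMTP(settings["hostname"], settings["port"], timeout=30) as server:
        if settings.get("starttls"):
            server.starttls()
        if settings.get("user_name"):
            server.login(settings["user_name"], settings["password"])
        server.send_message(msg)


# Placeholder values copied from the configuration snippet above.
settings = {
    "hostname": "smtp.example.com",
    "port": 587,
    "user_name": "your_username",
    "password": "your_password",
    "starttls": True,
    "subject": "[CONGRESS] Scraper Notice",
    "from": "scraper@example.com",
    "from_name": "Congress Scraper",
    "to": "your_email@example.com",
}
notice = build_notice(settings, "Example parse error in the bills scraper")
```

Only build_notice runs here; send_notice needs a reachable SMTP server and real credentials.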

Command-Line Options

Several common options can be passed to the usc-run command to control logging and caching behavior.

  • --log=<level>: Control the logging verbosity. By default, only warnings and errors are shown (--log=warn).
    • --log=debug: Show all messages, including detailed debugging information.
    • --log=info: Show informational messages, such as which files are being processed.
    • --log=warn: Show warnings and errors. This is the default level.
    • --log=error: Show only errors.
  • --debug: A shortcut for --log=debug.
  • --force: Force a re-download of all network resources, ignoring the local cache.
  • --timestamps: Add timestamps to log messages.
  • --patch=<module>: Apply a monkey-patch from the specified Python module to extend or modify the scraper's behavior. See the Advanced Topics section for an example.

Example usage:

# Run the bills scraper with verbose logging and force a re-download of all data
usc-run bills --debug --force
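
For readers who want a mental model of how the logging options interact, the sketch below parses a subset of them with argparse and maps the result onto Python's logging levels. It is an assumption-laden illustration, not how usc-run is actually implemented: parse_common_options, configure_logging, and the LEVELS table are hypothetical, and --patch is left out.

```python
import argparse
import logging

# Hypothetical mapping from the documented level names to logging constants.
LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warn": logging.WARNING,
    "error": logging.ERROR,
}


def parse_common_options(argv):
    """Parse a task name plus the common options described above."""
    parser = argparse.ArgumentParser(prog="usc-run")
    parser.add_argument("task")
    parser.add_argument("--log", choices=sorted(LEVELS), default="warn")
    parser.add_argument("--debug", action="store_true")
    parser.add_argument("--force", action="store_true")
    parser.add_argument("--timestamps", action="store_true")
    opts = parser.parse_args(argv)
    if opts.debug:
        # --debug is treated as a shortcut for --log=debug.
        opts.log = "debug"
    return opts


def configure_logging(opts):
    """Apply the chosen verbosity and optional timestamps to the root logger."""
    fmt = "%(asctime)s %(message)s" if opts.timestamps else "%(message)s"
    logging.basicConfig(level=LEVELS[opts.log], format=fmt, force=True)


opts = parse_common_options(["bills", "--debug", "--force"])
configure_logging(opts)
```

In this sketch --debug takes precedence over any explicit --log value, and opts.force would be passed to the download layer to bypass the cache.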