## Configuration
While the scrapers can run without any configuration, you can customize their behavior using a `config.yml` file and various command-line options.
### Configuration File (`config.yml`)
To configure the project, copy the example configuration file and edit it to your needs.
```bash
cp config.yml.example config.yml
```
The script will automatically detect and use this file if it exists.
### Output Directories
You can specify custom paths for the `cache` and `data` output directories.
```yaml
# output directories
output:
  cache: /path/to/your/cache
  data: /path/to/your/data
```
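With these paths set, raw downloads are cached under the configured cache path and parsed results are written under the data path. For example (a sketch; the exact subdirectory layout depends on which scraper task you run):

```bash
# Run a scraper task (bills shown here); downloads are cached,
# parsed output is written to the data directory
usc-run bills

# Inspect the results (layout is illustrative and task-dependent)
ls /path/to/your/cache
ls /path/to/your/data
```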
### Email Notifications
The script can send email notifications when a parsing or execution error occurs. This is useful for monitoring long-running scraping tasks. Fill in your SMTP server details to enable this feature.
```yaml
# email settings
email:
  # smtp details
  hostname: smtp.example.com
  port: 587
  user_name: your_username
  password: your_password
  starttls: true

  # email content
  subject: "[CONGRESS] Scraper Notice"
  from: scraper@example.com
  from_name: Congress Scraper
  to: your_email@example.com
```
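Before relying on these notifications, you may want to confirm that the SMTP details actually work. One way to do that (a standalone sketch using the example values above, not a feature of the scraper itself) is to send a one-off test message with curl:

```bash
# Send a test message through the configured SMTP server;
# --ssl-reqd corresponds to `starttls: true` above
printf 'Subject: [CONGRESS] test\n\nSMTP settings look good.\n' | curl --ssl-reqd \
  --url 'smtp://smtp.example.com:587' \
  --user 'your_username:your_password' \
  --mail-from 'scraper@example.com' \
  --mail-rcpt 'your_email@example.com' \
  --upload-file -
```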
### Command-Line Options
Several common options can be passed to the `usc-run` command to control logging and caching behavior.
- `--log=<level>`: Control the logging verbosity. By default, only warnings and errors are shown (`--log=warn`).
  - `--log=debug`: Show all messages, including detailed debugging information.
  - `--log=info`: Show informational messages, such as which files are being processed.
  - `--log=warn`: The default level.
  - `--log=error`: Show only critical errors.
- `--debug`: A shortcut for `--log=debug`.
- `--force`: Forces the re-downloading of all network resources, ignoring the local cache.
- `--timestamps`: Adds timestamps to log messages.
- `--patch=<module>`: Applies a monkey-patch from a specified Python module to extend or modify the scraper's behavior. See the Advanced Topics section for an example.
Example usage:
```bash
# Run the bills scraper with verbose logging and force a re-download of all data
usc-run bills --debug --force
```
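The flags combine freely. For instance, to see which files are being processed and prefix each log line with a timestamp:

```bash
# Informational logging with timestamped messages
usc-run bills --log=info --timestamps
```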