Extending with Beanstalkd
This project includes a --patch option that lets you modify the scraper's behavior at runtime by applying a monkey-patch. One example is the included contrib.beanstalkd module, which integrates the scraper with the Beanstalkd work queue.
Purpose
When enabled, this patch will push the ID of a bill, amendment, or vote onto a Beanstalkd queue immediately after its data file has been written to disk. This allows you to build a real-time processing pipeline where other services can listen to the queue and react instantly to new or updated data.
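As a rough illustration of the listening side of such a pipeline, the sketch below drains a queue of IDs and acknowledges each one after handling it. It is not part of this project: InMemoryTube is a hypothetical stand-in so the example runs without a Beanstalkd server, its reserve() and delete() methods are assumptions modeled loosely on what queue clients tend to offer, and the IDs are illustrative values only.

```python
from collections import deque


class InMemoryTube:
    """Hypothetical stand-in for a Beanstalkd tube; not a real client."""

    def __init__(self, bodies):
        self._jobs = deque(bodies)
        self.deleted = []

    def reserve(self):
        # Return the next job body, or None when the tube is empty.
        return self._jobs.popleft() if self._jobs else None

    def delete(self, body):
        # Record that the job was acknowledged.
        self.deleted.append(body)


def drain(tube, handle):
    """Reserve jobs until the tube is empty, handling then deleting each."""
    handled = 0
    while True:
        body = tube.reserve()
        if body is None:
            break
        handle(body)
        tube.delete(body)  # acknowledge only after the handler returns
        handled += 1
    return handled


seen = []
tube = InMemoryTube(["hr1234-113", "s47-113"])  # illustrative IDs
count = drain(tube, seen.append)
```

A real consumer would replace InMemoryTube with a Beanstalkd client watching one of the configured tubes, and would typically block waiting for new jobs rather than stop when the tube is empty.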
Usage
To enable the patch, add the --patch=contrib.beanstalkd flag to your usc-run command.
usc-run bills --patch=contrib.beanstalkd
This command will run the bills scraper as usual, but after each bill is processed, it will send the bill ID to the configured Beanstalkd tube.
Configuration
To use this feature, you must have a config.yml file with a beanstalk section specifying your queue connection details and tube names.
Example config.yml:
beanstalk:
  connection:
    host: 'localhost'
    port: 11300
  tubes:
    bills: 'us_bills'
    amendments: 'us_amendments'
    votes: 'us_votes'
The script requires unique names for each tube to avoid ambiguity.
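The uniqueness requirement can be sketched as a small check over the parsed configuration. This is an assumption about how such validation might look, not the module's actual code: tube_names is a hypothetical helper, and the config is written as a plain dict mirroring the example above so the sketch runs without a YAML parser.

```python
REQUIRED_TUBES = ("bills", "amendments", "votes")


def tube_names(config):
    """Return the tube mapping, raising ValueError if it is unusable.

    Hypothetical helper; the real module may validate differently.
    """
    tubes = config["beanstalk"]["tubes"]
    for kind in REQUIRED_TUBES:
        if kind not in tubes:
            raise ValueError(f"missing tube name for {kind!r}")
    names = [tubes[kind] for kind in REQUIRED_TUBES]
    # A set drops duplicates, so a shorter set means two tubes share a name.
    if len(set(names)) != len(names):
        raise ValueError("tube names must be unique")
    return tubes


# Plain-dict equivalent of the example config.yml.
config = {
    "beanstalk": {
        "connection": {"host": "localhost", "port": 11300},
        "tubes": {
            "bills": "us_bills",
            "amendments": "us_amendments",
            "votes": "us_votes",
        },
    }
}

tubes = tube_names(config)
```

If two entries shared a name, for example bills and votes both set to 'us_data', a consumer reading that tube could not tell which kind of ID it had received, which is the ambiguity the requirement avoids.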
How It Works
The congress/contrib/beanstalkd.py module contains a patch() function. When usc-run is invoked with --patch=contrib.beanstalkd, it imports this module and calls patch(). This function replaces the standard process_bill, process_amendment, and output_vote functions with wrapped versions that add the queueing logic. Because the functions are replaced at runtime, the scrapers can be integrated into larger systems without changes to the scraper code itself.
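The wrapping technique can be sketched in a few lines. Everything here is illustrative rather than the module's real code: the bills namespace stands in for the scraper module, the (bill_id, options) signature and the result dict are assumptions, enqueue_after is a hypothetical helper, and a plain list stands in for the Beanstalkd tube.

```python
import functools
import types

# Hypothetical stand-in for the scraper module that owns process_bill.
bills = types.SimpleNamespace()


def process_bill(bill_id, options):
    # Pretend the data file was written and report success.
    return {"ok": True, "saved": True}


bills.process_bill = process_bill

queued = []  # stand-in for a Beanstalkd tube


def enqueue_after(func, put):
    """Wrap func so the item ID is queued once the original call succeeds."""

    @functools.wraps(func)
    def wrapper(item_id, options):
        result = func(item_id, options)
        # Assumed design choice: only queue items that processed successfully.
        if result.get("ok"):
            put(item_id)
        return result

    return wrapper


def patch():
    # Rebind the module attribute so later callers get the wrapped version.
    bills.process_bill = enqueue_after(bills.process_bill, queued.append)


patch()
result = bills.process_bill("hr1234-113", {})  # illustrative ID
```

The wrapper returns the original result unchanged, so code that calls process_bill does not need to know the patch is active. The same wrapper could be applied to the amendment and vote functions with a different put callable for each tube.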