Deployment with Docker
For users who prefer containerized environments, this project includes a Dockerfile
to create a minimal, isolated environment for running the scrapers. This is an excellent way to ensure consistency between development, testing, and production environments.
Building the Docker Image
To build the Docker image, navigate to the root of the project repository and run the docker build command. We recommend tagging the image for easier reference:
docker build --rm -t unitedstates/congress .
This command will:
1. Start from a debian:bullseye base image.
2. Install all necessary system dependencies, such as python3, git, and libxml2.
3. Copy the project source code into the image.
4. Install the Python dependencies using pip.
5. Set up the environment for the scrapers to run.
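As a rough sketch, the build can be wrapped in a small shell function that also confirms the tagged image exists afterwards. The function name build_and_check and the DRY_RUN switch are hypothetical helpers introduced for illustration; they are not part of the project. With DRY_RUN=1 the function only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical helper (not part of the project): build the image, then
# confirm that the tag exists locally.
IMAGE="unitedstates/congress"   # tag used in the build command above

build_and_check() {
    # With DRY_RUN=1, print the commands instead of invoking Docker.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "docker build --rm -t $IMAGE ."
        echo "docker image inspect $IMAGE"
        return 0
    fi

    # Must be run from the root of the project repository.
    docker build --rm -t "$IMAGE" . || return 1

    # docker image inspect exits non-zero when the image is not found.
    docker image inspect "$IMAGE" >/dev/null
}
```

Running build_and_check from the repository root should return success only if both the build and the lookup of the tag succeed.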
Running the Scrapers in Docker
To run a scraper command inside the Docker container, use docker run. The key is to mount a host directory to the /congress volume inside the container. This ensures that the data generated by the scraper is saved to your local machine.
Example
- Define an Output Directory:
  First, choose or create a directory on your host machine where you want the data and cache folders to be written.
  export CONGRESS_OUTPUT_DIR=/tmp/congress_data
  mkdir -p $CONGRESS_OUTPUT_DIR
- Run a Scraper Command:
  Now, execute the docker run command. The arguments after the image name (unitedstates/congress) are passed directly to the usc-run script inside the container. The example below runs the bills scraper.
  docker run \
    -t --rm \
    -v ${CONGRESS_OUTPUT_DIR}:/congress \
    unitedstates/congress \
    bills --congress=117
  Let's break down the docker run options:
  - -t: Allocates a pseudo-TTY, so the output is formatted as it would be in an interactive terminal.
  - --rm: Automatically removes the container when it exits.
  - -v ${CONGRESS_OUTPUT_DIR}:/congress: This is the crucial part. It mounts the directory specified by CONGRESS_OUTPUT_DIR on your host to the /congress directory inside the container. The scraper is configured to write its output to /congress, so all data and cache will appear in your host directory.
  - unitedstates/congress: The name of the image to run.
  - bills --congress=117: The arguments passed to the usc-run command inside the container.
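The two steps above can be combined into a small wrapper, shown here as a minimal sketch rather than a script shipped with the project. The function name run_scraper, the DRY_RUN switch, and the fallback to /tmp/congress_data when CONGRESS_OUTPUT_DIR is unset are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of the project) around the docker run
# invocation described above.
IMAGE="unitedstates/congress"

run_scraper() {
    # Fall back to the example path if CONGRESS_OUTPUT_DIR is unset or empty.
    out_dir="${CONGRESS_OUTPUT_DIR:-/tmp/congress_data}"
    mkdir -p "$out_dir" || return 1

    # With DRY_RUN=1, print the command instead of starting a container.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "docker run -t --rm -v $out_dir:/congress $IMAGE $*"
        return 0
    fi

    # Everything after the image name is passed through to the container.
    docker run -t --rm -v "$out_dir:/congress" "$IMAGE" "$@"
}
```

With this in place, run_scraper bills --congress=117 would issue the same command as the example, and DRY_RUN=1 run_scraper bills --congress=117 would only print it.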
After the command finishes, the scraped data for the 117th Congress will be available in the ${CONGRESS_OUTPUT_DIR} directory on your host machine.
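To confirm that the mount worked, one option is to check for the data and cache folders on the host. The sketch below uses a hypothetical function, check_output, that is not part of the project. It only looks for the two top-level folders named above and makes no assumption about the layout beneath them. Whether a given scraper run creates both folders is also an assumption.

```shell
#!/bin/sh
# Hypothetical check (not part of the project): report whether the data
# and cache folders exist in the host output directory.
check_output() {
    # Use the first argument if given, else CONGRESS_OUTPUT_DIR, else the
    # example path.
    out_dir="${1:-${CONGRESS_OUTPUT_DIR:-/tmp/congress_data}}"
    status=0
    for name in data cache; do
        if [ -d "$out_dir/$name" ]; then
            echo "found: $out_dir/$name"
        else
            echo "missing: $out_dir/$name"
            status=1
        fi
    done
    # Non-zero if either folder is missing.
    return "$status"
}
```

A missing folder after a successful run may suggest that the -v path did not point where expected.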