Deployment with Docker

For users who prefer containerized environments, this project includes a Dockerfile to create a minimal, isolated environment for running the scrapers. This is an excellent way to ensure consistency between development, testing, and production environments.

Building the Docker Image

To build the Docker image, navigate to the root of the project repository and run the docker build command. We recommend tagging the image for easier reference.

docker build --rm -t unitedstates/congress .

This command will:

  1. Start from a debian:bullseye base image.
  2. Install the necessary system dependencies, such as python3, git, and libxml2.
  3. Copy the project source code into the image.
  4. Install the Python dependencies using pip.
  5. Set up the environment for the scrapers to run.
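The build steps above correspond to a Dockerfile along these lines. This is an illustrative sketch only, not the repository's actual Dockerfile; the package list, paths, and install command are assumptions, so consult the Dockerfile in the repository root for the authoritative version.

```dockerfile
# Illustrative sketch; the real Dockerfile in the repository may differ.
FROM debian:bullseye

# System dependencies for the scrapers (exact package list is an assumption)
RUN apt-get update && apt-get install -y \
    python3 python3-pip git libxml2 \
    && rm -rf /var/lib/apt/lists/*

# Copy the project source into the image (path is illustrative)
COPY . /opt/congress
WORKDIR /opt/congress

# Install the Python dependencies
RUN pip3 install .

# Scraper output is written to the mounted /congress volume
VOLUME /congress
WORKDIR /congress
ENTRYPOINT ["usc-run"]
```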

Running the Scrapers in Docker

To run a scraper command inside the Docker container, use docker run. The key is to mount a host directory at the /congress volume inside the container, so that the data generated by the scraper is saved to your local machine rather than lost when the container exits.

Example

  1. Define an Output Directory:

    First, choose or create a directory on your host machine where you want the data and cache folders to be written.

    export CONGRESS_OUTPUT_DIR=/tmp/congress_data
    mkdir -p $CONGRESS_OUTPUT_DIR
  2. Run a Scraper Command:

    Now, execute the docker run command. The arguments after the image name (unitedstates/congress) are passed directly to the usc-run script inside the container.

    The example below runs the bills scraper.

    docker run \
      -t --rm \
      -v ${CONGRESS_OUTPUT_DIR}:/congress \
      unitedstates/congress \
      bills --congress=117

    Let's break down the docker run options:

    • -t: Allocates a pseudo-TTY, so the scraper's progress output is displayed correctly in your terminal.
    • --rm: Automatically removes the container when it exits.
    • -v ${CONGRESS_OUTPUT_DIR}:/congress: This is the crucial part. It mounts the directory specified by CONGRESS_OUTPUT_DIR on your host to the /congress directory inside the container. The scraper is configured to write its output to /congress, so all data and cache will appear in your host directory.
    • unitedstates/congress: The name of the image to run.
    • bills --congress=117: The arguments passed to the usc-run command inside the container.
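If you run scrapers this way often, the invocation can be wrapped in a small helper. The usc_docker_cmd function below is a hypothetical convenience, not part of the project; it simply assembles the same docker run command shown above so you can inspect it before executing:

```shell
# Hypothetical helper (not part of the project): builds the docker run
# command shown above from CONGRESS_OUTPUT_DIR and the scraper arguments.
usc_docker_cmd() {
    # Forward all arguments ($@) to usc-run inside the container;
    # fall back to /tmp/congress_data if CONGRESS_OUTPUT_DIR is unset.
    printf 'docker run -t --rm -v %s:/congress unitedstates/congress %s\n' \
        "${CONGRESS_OUTPUT_DIR:-/tmp/congress_data}" "$*"
}

# Print the command for the bills scraper; pipe it to `sh` to execute it.
usc_docker_cmd bills --congress=117
```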

After the command finishes, the scraped data for the 117th Congress will be available in the ${CONGRESS_OUTPUT_DIR} directory on your host machine.
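A quick way to confirm the run produced output is to check for the data folder mentioned in step 1 (alongside it, the cache folder holds downloaded files). This is a minimal sketch assuming those folder names:

```shell
# Sanity check: the scraper writes its results under the mounted directory,
# in the data folder (with downloads cached in the cache folder).
OUT="${CONGRESS_OUTPUT_DIR:-/tmp/congress_data}"
if [ -d "$OUT/data" ]; then
    echo "output present in $OUT/data"
else
    echo "no output found in $OUT"
fi
```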