High Availability with RedBeat

RedBeat is designed to support a highly available Celery Beat setup out of the box. It uses a distributed lock in Redis so that you can run multiple Beat instances in an active-passive configuration while guarding against duplicate task scheduling.

The Distributed Lock

The core of the HA mechanism is a Redis key that acts as a distributed lock. Only one Beat instance can hold this lock at any given time, and only the instance holding the lock will schedule tasks.

How It Works

  1. Acquisition: When a RedBeat scheduler starts, it attempts to acquire the lock by setting the Redis key named by redbeat_lock_key. The operation is atomic, so only one instance can succeed.
  2. Ownership: The instance that successfully acquires the lock becomes the active scheduler. It begins its normal ticking process of scheduling tasks.
  3. Refreshing (Heartbeat): On every tick, the active scheduler refreshes the lock by extending its Time-To-Live (TTL). This serves as a heartbeat, signaling that the instance is still alive and functioning.
  4. Standby: You can run one or more additional Beat instances as passive standbys. These instances also attempt to acquire the lock on startup. While another instance holds it, the attempt fails and they enter a waiting loop, periodically re-checking the lock's status.
  5. Takeover: If the active instance crashes, stops responding, or loses network connectivity to Redis, it can no longer refresh the lock. Once redbeat_lock_timeout seconds pass without a refresh, Redis expires the lock key automatically. The next time a standby instance checks the lock, it finds it available, acquires it, and becomes the new active scheduler.
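
The lifecycle above can be sketched with a small in-memory simulation. This is an illustration under stated assumptions, not RedBeat's implementation: FakeRedis, BeatNode, and their methods are hypothetical names, the clock is advanced by hand, and set_nx_ex only mimics what a Redis SET with the NX and EX options would do.

```python
class FakeRedis:
    """Hypothetical in-memory stand-in for a Redis key with a TTL."""

    def __init__(self):
        self.now = 0        # simulated clock in seconds, advanced by hand
        self._data = {}     # key -> (owner, expires_at)

    def _purge(self, key):
        # Mimic Redis expiry: drop the key once its TTL has elapsed.
        entry = self._data.get(key)
        if entry is not None and entry[1] <= self.now:
            del self._data[key]

    def set_nx_ex(self, key, owner, ttl):
        """Create the key only if it is absent (like SET ... NX EX ttl)."""
        self._purge(key)
        if key in self._data:
            return False
        self._data[key] = (owner, self.now + ttl)
        return True

    def extend_if_owner(self, key, owner, ttl):
        """Reset the TTL, but only for the instance that holds the key."""
        self._purge(key)
        entry = self._data.get(key)
        if entry is None or entry[0] != owner:
            return False
        self._data[key] = (owner, self.now + ttl)
        return True


class BeatNode:
    """Hypothetical Beat instance that schedules only while it holds the lock."""

    def __init__(self, name, store, key="redbeat:lock", timeout=300):
        self.name, self.store, self.key, self.timeout = name, store, key, timeout
        self.active = False

    def tick(self):
        if self.active:
            # Heartbeat: refresh the lock; a failed refresh demotes this node.
            self.active = self.store.extend_if_owner(self.key, self.name, self.timeout)
        if not self.active:
            # Acquisition or standby retry: succeeds only if the key is free.
            self.active = self.store.set_nx_ex(self.key, self.name, self.timeout)
        return self.active


store = FakeRedis()
a, b = BeatNode("beat-a", store), BeatNode("beat-b", store)

history = []
for second in range(0, 700, 100):
    store.now = second
    a_alive = second < 200                  # beat-a stops ticking after t=100
    a_active = a.tick() if a_alive else False
    b_active = b.tick()
    history.append((second, a_active, b_active))

for row in history:
    print(row)
```

In this run beat-a holds the lock until it stops ticking after t=100. Its last refresh keeps the key alive until t=400, at which point beat-b's next attempt succeeds and it becomes the active scheduler.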

Configuration

HA is configured with two main settings:

  • redbeat_lock_key: The Redis key to use for the lock. By default, this is 'redbeat:lock'. This must be the same for all Beat instances in your cluster.

  • redbeat_lock_timeout: The lock's expiration time in seconds. This setting determines the failover time: a standby can take over only after the lock has gone unrefreshed for this long. A shorter timeout gives faster failover but raises the risk of a split-brain scenario, in which a node that is merely slow (not dead) loses the lock while it is still scheduling. The default is beat_max_loop_interval * 5, which is 1500 seconds (25 minutes) with a 300-second loop interval.

# Example HA-focused configuration

# Ensure all beat nodes use the same lock key
REDBEAT_LOCK_KEY = 'production:celery:beat:lock'

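# Expire the lock 5 minutes after its last refresh (the failover time).
# The lock is refreshed once per tick, so beat_max_loop_interval must stay well below this value.
REDBEAT_LOCK_TIMEOUT = 300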
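
The timing trade-off can be made concrete with a little arithmetic. The helpers below are a hypothetical sketch, not part of RedBeat; in particular, standby_poll_interval is an assumed name for how often a standby re-checks the lock, which this section does not specify.

```python
def default_lock_timeout(max_loop_interval=300):
    """Default described above: five times beat_max_loop_interval."""
    return max_loop_interval * 5


def refresh_margin(lock_timeout, max_loop_interval):
    """Slack, in seconds, between the slowest possible tick and lock expiry.

    A margin of zero or less means the lock can expire before a healthy
    active node gets a chance to refresh it.
    """
    return lock_timeout - max_loop_interval


def worst_case_failover(lock_timeout, standby_poll_interval):
    """Upper bound on the delay before a standby takes over.

    The lock can outlive the active node by up to lock_timeout, and a standby
    may need one more poll (an assumed interval) to notice the key is gone.
    """
    return lock_timeout + standby_poll_interval


print(default_lock_timeout())                               # 300 * 5
print(refresh_margin(lock_timeout=300, max_loop_interval=300))
print(worst_case_failover(lock_timeout=300, standby_poll_interval=5))
```

With a 300-second timeout and a 300-second loop interval the margin is zero, which is why a shortened timeout is usually paired with a shorter loop interval.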

Disabling the Lock

To disable the locking mechanism entirely, set redbeat_lock_key to None. This is not recommended for production: without a lock, every running Beat instance schedules tasks independently, so each task is sent once per instance.

# Disables the distributed lock
REDBEAT_LOCK_KEY = None
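
The effect of running without the lock can be illustrated with a toy model. The function below is a hypothetical sketch that only counts dispatches in a single scheduling round; it does not call RedBeat or Celery, and the instance and task names are made up.

```python
from collections import Counter


def dispatch_counts(instances, due_tasks, lock_enabled):
    """Count how many times each due task is sent in one scheduling round."""
    sent = Counter()
    lock_holder = None
    for instance in instances:
        if lock_enabled:
            # The first instance to ask gets the lock; the rest stay passive.
            if lock_holder is None:
                lock_holder = instance
            if instance != lock_holder:
                continue
        for task in due_tasks:
            sent[task] += 1
    return dict(sent)


nodes = ["beat-a", "beat-b", "beat-c"]
tasks = ["send_report", "cleanup"]

with_lock = dispatch_counts(nodes, tasks, lock_enabled=True)
without_lock = dispatch_counts(nodes, tasks, lock_enabled=False)
print(with_lock)
print(without_lock)
```

With the lock enabled each task is sent once; with it disabled each task is sent once per running instance.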