High Availability with RedBeat
RedBeat is designed to support a highly available Celery Beat setup out of the box. By leveraging a distributed lock in Redis, you can run multiple Beat instances in an active-passive configuration without the risk of duplicate task scheduling.
The Distributed Lock
The core of the HA mechanism is a Redis key that acts as a distributed lock. Only one Beat instance can hold this lock at any given time, and only the instance holding the lock will schedule tasks.
How It Works
- Acquisition: When a RedBeat scheduler starts, it attempts to acquire the lock by setting a specific Redis key (redbeat_lock_key). This operation is atomic, ensuring only one instance succeeds.
- Ownership: The instance that successfully acquires the lock becomes the active scheduler. It begins its normal ticking process of scheduling tasks.
- Refreshing (Heartbeat): On every tick, the active scheduler refreshes the lock by extending its Time-To-Live (TTL). This serves as a heartbeat, signaling that the instance is still alive and functioning.
- Failover: You can run one or more additional Beat instances as passive standbys. These instances also attempt to acquire the lock on startup. While the active instance holds the lock, that attempt fails, and the standbys enter a waiting loop, periodically re-checking the lock's status.
- Takeover: If the active instance crashes, stops responding, or loses network connectivity to Redis, it will fail to refresh the lock. After the redbeat_lock_timeout period expires, the lock key is automatically deleted by Redis. The next time a standby instance checks for the lock, it will find it available, acquire it, and become the new active scheduler.
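The steps above can be traced with a small in-memory simulation. This is only a sketch of the general pattern (an atomic set-if-absent with a TTL, refreshed by its owner), not RedBeat's actual implementation. The FakeRedis and BeatInstance classes, the instance names, and the simulated clock are all hypothetical and exist only for illustration; no real Redis server is involved.

```python
class FakeRedis:
    """Hypothetical in-memory stand-in for a key with a TTL (not a real Redis client)."""

    def __init__(self):
        self.now = 0.0      # simulated clock, in seconds
        self._data = {}     # key -> (owner, expires_at)

    def _purge(self, key):
        # Emulate automatic deletion of an expired key.
        entry = self._data.get(key)
        if entry is not None and entry[1] <= self.now:
            del self._data[key]

    def set_if_absent(self, key, owner, ttl):
        # Atomic "set only if the key does not exist, with an expiry".
        self._purge(key)
        if key in self._data:
            return False
        self._data[key] = (owner, self.now + ttl)
        return True

    def extend(self, key, owner, ttl):
        # Reset the TTL, but only if the caller still owns the key.
        self._purge(key)
        entry = self._data.get(key)
        if entry is None or entry[0] != owner:
            return False
        self._data[key] = (owner, self.now + ttl)
        return True


LOCK_KEY = "redbeat:lock"   # same key on every instance
LOCK_TIMEOUT = 300          # seconds, illustrative value


class BeatInstance:
    """Hypothetical scheduler that only schedules while it holds the lock."""

    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.active = False

    def tick(self):
        if self.active:
            # Heartbeat: refresh the lock on every tick.
            self.active = self.store.extend(LOCK_KEY, self.name, LOCK_TIMEOUT)
        else:
            # Standby (or fresh start): try to take the lock.
            self.active = self.store.set_if_absent(LOCK_KEY, self.name, LOCK_TIMEOUT)
        return self.active


store = FakeRedis()
primary = BeatInstance("beat-a", store)
standby = BeatInstance("beat-b", store)

events = []
events.append(("t=0 primary", primary.tick()))      # acquisition succeeds
events.append(("t=0 standby", standby.tick()))      # lock is taken, standby waits

store.now = 100
events.append(("t=100 primary", primary.tick()))    # refresh, lock now expires at t=400

# The primary crashes here and never ticks again.
store.now = 399
events.append(("t=399 standby", standby.tick()))    # lock has not expired yet

store.now = 400
events.append(("t=400 standby", standby.tick()))    # lock expired, standby takes over

for label, result in events:
    print(label, result)
```

In this trace the standby only becomes active once the last refresh plus the timeout has elapsed, which is why the timeout value governs how long a failover takes.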
Configuration
HA is configured with two main settings:
- redbeat_lock_key: The Redis key to use for the lock. By default, this is 'redbeat:lock'. This must be the same for all Beat instances in your cluster.
- redbeat_lock_timeout: The lock's expiration time in seconds. This is a critical setting. It determines the failover time. A shorter timeout allows for faster failover but may increase the risk of a split-brain scenario if the active node is merely slow, not dead. The default is beat_max_loop_interval * 5 (typically 1500 seconds or 25 minutes).
# Example HA-focused configuration
# Ensure all beat nodes use the same lock key
REDBEAT_LOCK_KEY = 'production:celery:beat:lock'
# Set a failover time of 5 minutes
REDBEAT_LOCK_TIMEOUT = 300
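To make the timeout arithmetic concrete, the following sketch works through the default and a rough upper bound on failover delay. The helper names are hypothetical and are not part of RedBeat. The 300-second loop interval is inferred from the 1500-second figure quoted above, and the standby polling interval is an assumed parameter, since how often a standby re-checks the lock is not specified here.

```python
# Assumed value: 300 s is the loop interval implied by "5 x interval = 1500 s".
ASSUMED_MAX_LOOP_INTERVAL = 300


def resolve_lock_timeout(configured=None, max_loop_interval=ASSUMED_MAX_LOOP_INTERVAL):
    """Return the explicit timeout if one is set, else five times the loop interval."""
    if configured is not None:
        return configured
    return max_loop_interval * 5


def worst_case_failover(lock_timeout, standby_poll_interval):
    """Rough upper bound on failover delay, in seconds.

    Worst case: the active instance dies immediately after a refresh, so the
    lock lives for the full timeout, and the standby notices one polling
    interval (a hypothetical parameter) after that.
    """
    return lock_timeout + standby_poll_interval


default_timeout = resolve_lock_timeout()          # no explicit setting
tuned_timeout = resolve_lock_timeout(300)         # REDBEAT_LOCK_TIMEOUT = 300

print(default_timeout, default_timeout / 60)      # seconds, minutes
print(worst_case_failover(tuned_timeout, standby_poll_interval=5))
```

Under these assumptions, lowering the timeout from the default to 300 seconds shortens the worst-case gap in scheduling from about 25 minutes to about 5 minutes, at the cost of less tolerance for a slow but healthy active node.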
Disabling the Lock
To disable the locking mechanism entirely, set redbeat_lock_key to None. This is not recommended for production, as running multiple Beat instances without a lock will cause every task to be scheduled multiple times.
# Disables the distributed lock
REDBEAT_LOCK_KEY = None
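The duplicate-scheduling effect can be illustrated with a toy model. This is a simplified sketch, not RedBeat code: schedule_round, the instance names, and the task names are hypothetical, and the lock is modeled as "the first instance wins" rather than as a real Redis key.

```python
from collections import Counter


def schedule_round(instances, due_tasks, use_lock):
    """Return (instance, task) pairs sent during one scheduling round."""
    sent = []
    lock_holder = None
    for name in instances:
        if use_lock:
            if lock_holder is None:
                lock_holder = name      # first instance to ask gets the lock
            if name != lock_holder:
                continue                # standbys do not schedule anything
        # Without a lock, every instance schedules every due task.
        sent.extend((name, task) for task in due_tasks)
    return sent


instances = ["beat-a", "beat-b", "beat-c"]
due_tasks = ["send_report", "cleanup"]

with_lock = Counter(task for _, task in schedule_round(instances, due_tasks, use_lock=True))
without_lock = Counter(task for _, task in schedule_round(instances, due_tasks, use_lock=False))

print(dict(with_lock))
print(dict(without_lock))
```

With three instances and no lock, each due task is sent three times in this model, which is the failure mode the lock is meant to prevent.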