
Schedule Runbooks

Verify Dual-Write Consistency

Check that every pending or active row in Postgres has a corresponding BullMQ entry in Redis.

Postgres side — list jobs that should be active:

```sql
SELECT id, topic, kind, status, run_at, cron_pattern, interval_ms, attempts, max_attempts
FROM schedule.scheduled_jobs
WHERE status IN ('pending', 'active')
ORDER BY created_at DESC
LIMIT 50;
```

Redis side — inspect delayed (one-shot) and repeatable jobs:

```bash
# List all delayed BullMQ jobs for the schedule queue
redis-cli ZRANGE bull:schedule:delayed 0 -1 WITHSCORES

# List all repeatable job keys
# (KEYS is O(N) and blocks Redis; prefer `redis-cli --scan --pattern` in production)
redis-cli KEYS "bull:schedule:repeat:*"

# Inspect a specific job by id (replace <job-id>)
redis-cli HGETALL "bull:schedule:<job-id>"
```

If a row exists in Postgres but has no Redis entry, the reconciler will fix it on the next restart (see below). You can also trigger recovery manually by restarting the API process.
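The comparison itself is a set difference. A minimal TypeScript sketch, assuming you have already collected the job ids from both sides (`findOrphanedJobs` is an illustrative name, not part of the codebase):

```typescript
// Given job ids from the Postgres query and members of the BullMQ
// delayed set, report rows that have no corresponding Redis entry.
// findOrphanedJobs is a hypothetical helper, not real project code.
function findOrphanedJobs(pgJobIds: string[], redisJobIds: string[]): string[] {
  const inRedis = new Set(redisJobIds);
  return pgJobIds.filter((id) => !inRedis.has(id));
}

// Example: job "b" exists in Postgres but is missing from Redis.
const orphaned = findOrphanedJobs(["a", "b"], ["a"]);
console.log(orphaned); // ["b"]
```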

Inspect Redis Keys

```bash
# All schedule queue keys
# (KEYS is O(N) and blocks Redis; prefer `redis-cli --scan --pattern` in production)
redis-cli KEYS "bull:schedule:*"

# Jobs waiting in the delayed set (sorted by score = fire timestamp ms)
redis-cli ZRANGE bull:schedule:delayed 0 -1 WITHSCORES

# Active (currently processing) jobs
redis-cli LRANGE bull:schedule:active 0 -1

# Failed jobs
redis-cli LRANGE bull:schedule:failed 0 -1

# Count jobs by state
redis-cli LLEN bull:schedule:wait
redis-cli ZCARD bull:schedule:delayed
redis-cli LLEN bull:schedule:active
redis-cli LLEN bull:schedule:failed
```
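`ZRANGE ... WITHSCORES` prints alternating member/score lines. A TypeScript sketch of pairing them back up into readable entries, assuming the score is the fire time in ms as described above (adjust if your BullMQ version encodes the score differently):

```typescript
// Pair the alternating member/score lines that
// `redis-cli ZRANGE ... WITHSCORES` prints into (jobId, firesAt) tuples.
// parseWithScores is a hypothetical helper for ad-hoc inspection.
interface DelayedEntry { jobId: string; firesAt: Date; }

function parseWithScores(lines: string[]): DelayedEntry[] {
  const entries: DelayedEntry[] = [];
  for (let i = 0; i + 1 < lines.length; i += 2) {
    entries.push({ jobId: lines[i], firesAt: new Date(Number(lines[i + 1])) });
  }
  return entries;
}

// Example: job "42" scheduled for 2025-01-01T00:00:00Z.
const entries = parseWithScores(["42", "1735689600000"]);
console.log(entries[0].jobId, entries[0].firesAt.toISOString());
```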

Run the Migration

Do not run the migration command yourself — provide it to the developer or DBA who runs migrations for this project.

The migration file is located at:

apps/api/src/modules/schedule/infrastructure/migrations/<timestamp>-add_scheduled_jobs.ts

The command to run migrations follows the project convention:

```bash
pnpm --filter api migration:schedule:run
```

The migration creates the schedule Postgres schema and the schedule.scheduled_jobs table with all constraints and indexes.

Recover from Redis Flush

If Redis is flushed or restarted with data loss, the boot reconciler automatically recovers all pending and active jobs from Postgres when the API process restarts.

Steps:

  1. Confirm the API is stopped or restarting.
  2. Start (or restart) the API process.
  3. During OnApplicationBootstrap, ScheduleReconcilerService re-enqueues every pending/active row. Jobs with runAt in the past fire immediately (delay = 0).
  4. Check the logs for schedule.reconciler.summary to confirm healed count.

Note: Jobs with status = 'active' at the time of the flush were in-flight. They will be re-enqueued and may fire again; listeners must handle this duplicate delivery idempotently.
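The delay rule from step 3 is simple enough to state as code. A sketch, where `computeDelay` is a hypothetical name rather than the reconciler's actual API:

```typescript
// Re-enqueue delay rule: future runAt waits for the remaining time,
// past runAt fires immediately (delay = 0).
// computeDelay is an illustrative helper, not real project code.
function computeDelay(runAt: Date, now: Date): number {
  return Math.max(0, runAt.getTime() - now.getTime());
}

const now = new Date("2025-06-01T12:00:00Z");
const pastDelay = computeDelay(new Date("2025-06-01T11:00:00Z"), now);
const futureDelay = computeDelay(new Date("2025-06-01T12:00:30Z"), now);
console.log(pastDelay, futureDelay); // 0 30000
```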

Manually Cancel a Stuck Job

If a job is stuck in active status (e.g., worker crashed after DB update but before BullMQ ack), update the row directly and optionally clean Redis:

```sql
-- Mark the job cancelled in Postgres
UPDATE schedule.scheduled_jobs
SET status = 'cancelled', cancelled_at = now()
WHERE id = '<job-id>';
```

```bash
# Remove the job's hash from Redis if still present.
# DEL removes the whole key; HDEL would require individual field names.
redis-cli DEL "bull:schedule:<job-id>"

# If the job still sits in the delayed set, remove it there as well
redis-cli ZREM bull:schedule:delayed "<job-id>"
```

After the next restart, the reconciler will not re-enqueue rows with status = 'cancelled'.
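The reconciler's selection rule can be sketched as a status filter; `shouldReenqueue` is an illustrative name, and your schema may define further statuses beyond the three shown here:

```typescript
// Only 'pending' and 'active' rows are picked up by the boot reconciler;
// 'cancelled' rows are skipped. shouldReenqueue is a hypothetical helper.
type JobStatus = "pending" | "active" | "cancelled";

function shouldReenqueue(status: JobStatus): boolean {
  return status === "pending" || status === "active";
}

const reenqueuePending = shouldReenqueue("pending");     // true
const reenqueueCancelled = shouldReenqueue("cancelled"); // false
console.log(reenqueuePending, reenqueueCancelled);
```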

Check No-Listener Jobs

Jobs that fire but have no registered @OnEvent listener are logged as WARN and tracked by the app_schedule_job_no_listener_total metric. To find them in logs:

```bash
grep "schedule.no_listener" <log-file>
```

Or query in Grafana Loki:

```
{app="api"} |= "schedule.no_listener"
```

This is not an error, but it indicates a topic with no subscriber: most likely a misconfigured listener or a topic mismatch.
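The underlying check is a set lookup of the fired topic against the registered listener topics. A sketch with hypothetical topic names:

```typescript
// A topic fired with no registered @OnEvent subscriber should trigger
// the WARN described above. hasListener and the topic names here are
// illustrative, not taken from the codebase.
function hasListener(topic: string, registered: Set<string>): boolean {
  return registered.has(topic);
}

const registered = new Set(["billing.invoice.due"]);
const ok = hasListener("billing.invoice.due", registered);
const mismatch = hasListener("billing.invoce.due", registered); // typo in topic
console.log(ok, mismatch); // true false — the mismatch would log schedule.no_listener
```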