Docs
Troubleshooting (self-hosted)

Troubleshooting (self-hosted)

Operational playbooks—failed tasks, stalled queues, Redis pressure, webhook retries, DLQ handling.

Assume Linux containers or equivalent process supervision. Tail logs with workflow run IDs correlated to task attempts.

Symptoms → checks

Failed jobs / red tasks

  1. Pull the workflow run timeline plus task stderr payloads from storage/logs.
  2. Validate timeouts versus external latency—upstream systems often exceed defaults (retries & timeouts).
  3. Compare enqueue-time proofs vs live definitions for drift after hotfixes landed mid-run (GET /api/proofs/verify-definition/{runId}) (drift detection).

Stuck queues / no progress

Concrete checks:

  • Confirm workers subscribed to Redis streams/queues backing your deployment (queue workers).
  • Inspect heartbeat / stale consumer metrics—Hung tasks sometimes indicate DB connection saturation.
  • Evaluate exclusive locks preventing parallel incompatible tasks.

Redis issues

Indicators: elevated latency spikes, eviction warnings, persistence failures.

Mitigations:

  • Ensure bounded memory, correct maxmemory-policy, and backups per platform guidance.
  • Never share one Redis ACL between unrelated fleets—risk noisy neighbour starvation.
  • If using Redis clustering, verify cascaded clients respect routing + timeout knobs in environment bundles relevant to Cascades deployments.

Replaying failures

  1. Freeze context: screenshot enqueue snapshot hash, task IDs, and integration versions.
  2. Determine whether rerun should be idempotent replay versus repair-run (failure recovery).
  3. For vendor outages, enqueue follow-up remediation once SLAs stabilize—preserve original execution proofs untouched for audit timelines.

Dead-letter queues (DLQ)

When tasks exhaust retries:

  • Items land DLQ-compatible stores for analysts—do not silently delete.
  • Replay only after RCA; attach ticketing references to follow-up automation changes.

Webhooks from GitHub/GitLab never arrive

Walk the integration-specific debugging in integrations and webhooks guides—signature mismatch, rotated secrets, VPC egress.

CommunityReport issue / Discuss(tags: Cascades, workflows)