Automates Databricks incident response with real-time triage scripts, recovery procedures, and postmortem documentation.
This skill provides a framework for managing Databricks outages and pipeline failures from within Claude Code. It gives on-call engineers automated triage scripts that assess platform status, decision trees for troubleshooting, and remediation steps for common issues such as cluster startup failures, code errors, and data corruption. Beyond immediate mitigation, it supports incident communication and automated evidence collection, which feed into postmortems and long-term preventive actions.
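The troubleshooting flow described above can be pictured as a small decision tree. The following is a minimal sketch only, not the skill's actual logic: the `Symptoms` fields, the order of checks, and the returned step names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Symptoms:
    """Hypothetical observations an on-call engineer gathers during triage."""
    api_reachable: bool
    cluster_started: bool
    job_succeeded: bool
    data_looks_valid: bool


def next_step(s: Symptoms) -> str:
    """Walk from platform-level checks down to data-level checks.

    The ordering (API, then cluster, then job, then data) is an assumption:
    a failure at an earlier level usually explains failures at later ones.
    """
    if not s.api_reachable:
        return "check platform status and credentials"
    if not s.cluster_started:
        return "investigate cluster startup failure"
    if not s.job_succeeded:
        return "inspect job error and repair the run"
    if not s.data_looks_valid:
        return "restore the affected Delta table"
    return "no action: keep monitoring"
```

For example, `next_step(Symptoms(True, False, False, True))` stops at the cluster check and never evaluates the job or data branches.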
Key Features
01 Step-by-step decision tree for identifying root causes of pipeline failures
02 Real-time health triage for clusters, jobs, and API connectivity
03 Structured incident communication and evidence collection tools
04 Delta Lake restoration templates for data quality issues
05 1,887 GitHub stars
06 Automated recovery scripts for cluster restarts and job repairs
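To make the health triage feature concrete, here is a hedged sketch of how cluster and job-run records might be summarized once they have been fetched. It is not the skill's own script. It assumes the records are dictionaries shaped like Databricks REST API responses (a `state` string per cluster, and `state.result_state` per job run); the `triage` function, the state groupings, and the output keys are hypothetical. Fetching the records is deliberately left out.

```python
from collections import Counter

# Cluster `state` values treated as unhealthy in this sketch (an assumption).
UNHEALTHY_CLUSTER_STATES = {"ERROR", "UNKNOWN"}
# Job run `result_state` values treated as failures in this sketch (an assumption).
FAILED_RESULT_STATES = {"FAILED", "TIMEDOUT", "CANCELED"}


def triage(clusters, runs):
    """Summarize already-fetched cluster and job-run records.

    clusters: list of dicts with "cluster_id" and "state"
    runs:     list of dicts with "run_id" and a nested "state" dict
    """
    # A record with no "state" key is counted as UNKNOWN rather than skipped.
    cluster_counts = Counter(c.get("state", "UNKNOWN") for c in clusters)
    bad_clusters = [
        c["cluster_id"]
        for c in clusters
        if c.get("state", "UNKNOWN") in UNHEALTHY_CLUSTER_STATES
    ]
    failed_runs = [
        r["run_id"]
        for r in runs
        if r.get("state", {}).get("result_state") in FAILED_RESULT_STATES
    ]
    return {
        "cluster_counts": dict(cluster_counts),
        "unhealthy_clusters": bad_clusters,
        "failed_runs": failed_runs,
        "needs_attention": bool(bad_clusters or failed_runs),
    }
```

Keeping the summary a pure function of its inputs means it can be exercised against saved API responses during a postmortem, without touching the workspace.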
Use Cases
01 Troubleshooting Databricks cluster startup and cloud provider errors
02 Generating professional postmortem reports and incident timelines
03 Responding to critical production pipeline outages or job failures
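As an illustration of the incident timeline use case, the sketch below orders collected events and renders them as plain text for a postmortem. The `render_timeline` helper and its output format are hypothetical, not taken from the skill. It assumes every timestamp is timezone-aware.

```python
from datetime import datetime, timezone


def render_timeline(events):
    """Render (timestamp, description) pairs as a chronological UTC timeline.

    Timestamps must be timezone-aware; naive datetimes would be interpreted
    as local time by astimezone(), which can silently skew the timeline.
    """
    lines = []
    for ts, description in sorted(events, key=lambda e: e[0]):
        utc_ts = ts.astimezone(timezone.utc)
        lines.append(f"{utc_ts:%Y-%m-%d %H:%M} UTC  {description}")
    return "\n".join(lines)
```

Events can be appended in whatever order evidence is gathered, since sorting happens at render time.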