Does the skill automate evidence collection for postmortems?

Yes, the skill includes a bash script that bundles run logs, task outputs, and cluster events into a compressed archive for thorough root cause analysis.
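
The skill's own script is not reproduced on this page. As a rough sketch of the bundling idea only, the following assumes an authenticated Databricks CLI whose subcommands are named `jobs get-run`, `jobs get-run-output`, and `clusters events` (names vary between CLI versions). The function names `bundle_evidence` and `collect_evidence` and the file layout are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the skill's actual script.

# Pack an evidence directory into <dir>.tar.gz and print the archive path.
bundle_evidence() {
  local dir="$1"
  local archive="${dir%/}.tar.gz"
  tar -czf "$archive" -C "$(dirname "$dir")" "$(basename "$dir")"
  echo "$archive"
}

# Gather run details, run output, and cluster events for one failed run.
# Assumption: the subcommand names below match your Databricks CLI version.
# Each call is allowed to fail so that partial evidence is still archived.
collect_evidence() {
  local run_id="$1" cluster_id="$2"
  local dir
  dir="incident-$(date -u +%Y%m%dT%H%M%SZ)"
  mkdir -p "$dir"
  databricks jobs get-run "$run_id" \
    > "$dir/run.json" 2> "$dir/run.err" || true
  databricks jobs get-run-output "$run_id" \
    > "$dir/run-output.json" 2> "$dir/run-output.err" || true
  databricks clusters events "$cluster_id" \
    > "$dir/cluster-events.json" 2> "$dir/cluster-events.err" || true
  bundle_evidence "$dir"
}
```

Calling `collect_evidence <run-id> <cluster-id>` would leave a timestamped `incident-*.tar.gz` in the current directory. Errors from each CLI call are kept in `.err` files alongside the output, so a failed lookup is itself recorded as evidence.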

How does the triage script work?

The triage script uses the Databricks CLI and curl to check platform status via status.databricks.com, verify API reachability, and list recent job failures and cluster health in a single operation.
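
A minimal sketch of such a one-shot triage might look like the following. It is not the skill's script. It assumes `DATABRICKS_HOST` and `DATABRICKS_TOKEN` are set, that `/api/2.0/clusters/list` is an acceptable reachability probe, and that your CLI version provides `jobs list-runs --limit` and `clusters list`. The helper names `classify_http`, `http_code`, and `triage` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the skill's actual triage script.

# Map an HTTP status code to a short triage label.
classify_http() {
  case "$1" in
    2??|3??) echo "reachable" ;;
    401|403) echo "reachable-but-unauthorized" ;;
    000)     echo "unreachable" ;;
    *)       echo "degraded" ;;
  esac
}

# Print only the HTTP status code for a request.
# curl writes 000 when it cannot connect at all.
http_code() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$@" || true
}

# Run all checks in one pass and print a short report.
triage() {
  local status_code api_code
  status_code="$(http_code https://status.databricks.com)"
  api_code="$(http_code \
    -H "Authorization: Bearer ${DATABRICKS_TOKEN:-}" \
    "${DATABRICKS_HOST:-}/api/2.0/clusters/list")"
  echo "Status page:   $(classify_http "$status_code")"
  echo "Workspace API: $(classify_http "$api_code")"
  echo "--- Recent job runs ---"
  # Assumption: these subcommands and the --limit flag exist in your CLI.
  databricks jobs list-runs --limit 20 || echo "could not list runs"
  echo "--- Clusters ---"
  databricks clusters list || echo "could not list clusters"
}
```

Separating the status page check from the workspace API check helps distinguish a platform-wide outage from a problem with one workspace or its credentials.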

How does it handle communication during an incident?

It provides pre-formatted templates for internal Slack updates and external status pages, ensuring consistent and professional communication with stakeholders throughout the incident lifecycle.

Is the Databricks CLI required?

Yes, this skill relies on the Databricks CLI and Bash tools to interact with your Databricks workspace and execute commands.

Can this skill help with data corruption?

Yes, it includes SQL templates to perform data sanity checks and commands to restore Delta tables to specific versions or timestamps using the RESTORE command.
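
As an illustration of the two RESTORE forms, the sketch below generates the statement for a given table and target. The helper name `build_restore_sql` and the table name `main.sales.orders` are hypothetical, and the skill's own templates may be structured differently. The emitted SQL uses the standard Delta Lake `RESTORE TABLE ... TO VERSION AS OF` and `TO TIMESTAMP AS OF` syntax.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the skill's actual template.

# Build a Delta Lake RESTORE statement for a table and a target.
# A purely numeric target is treated as a version number;
# anything else is treated as a timestamp string.
build_restore_sql() {
  local table="$1" target="$2"
  case "$target" in
    '')
      echo "usage: build_restore_sql <table> <version|timestamp>" >&2
      return 1
      ;;
    *[!0-9]*)
      echo "RESTORE TABLE ${table} TO TIMESTAMP AS OF '${target}';"
      ;;
    *)
      echo "RESTORE TABLE ${table} TO VERSION AS OF ${target};"
      ;;
  esac
}

# Example: print the statements for review before running them.
# DESCRIBE HISTORY lists the versions available to restore to.
echo "DESCRIBE HISTORY main.sales.orders;"
build_restore_sql main.sales.orders 42
build_restore_sql main.sales.orders '2024-05-01 00:00:00'
```

Printing the statement rather than executing it leaves room for a human to confirm the target version against the table history before anything is rolled back.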

Databricks Incident Runbook

Name: Databricks Incident Runbook
Author: jeremylongshore
GitHub stars: 1,887
Category: Cloud Infrastructure

Automates Databricks incident response with real-time triage scripts, recovery procedures, and postmortem documentation.

This skill provides a comprehensive framework for managing Databricks-related outages and pipeline failures directly within the Claude Code environment. It equips on-call engineers with automated triage scripts to assess platform status, interactive decision trees for rapid troubleshooting, and specific remediation steps for common issues like cluster startup failures, code errors, and data corruption. Beyond immediate mitigation, it facilitates professional incident communication and automated evidence collection to streamline the creation of high-quality postmortems and long-term preventive actions.

Key Features

01 Step-by-step decision tree for identifying root causes of pipeline failures

02 Real-time health triage for clusters, jobs, and API connectivity

03 Structured incident communication and evidence collection tools

04 Delta Lake restoration templates for data quality issues

05 1,887 GitHub stars

06 Automated recovery scripts for cluster restarts and job repairs

Use Cases

01 Troubleshooting Databricks cluster startup and cloud provider errors

02 Generating professional postmortem reports and incident timelines

03 Responding to critical production pipeline outages or job failures
