# ETL Troubleshooting Checklist (Sample)

## 1) Intake and scope
- Confirm the failed component: SQL job, ADF activity, Databricks task, or downstream consumer.
- Record run ID, pipeline/job name, environment, and failure timestamp (UTC).
- Capture business impact and SLA risk.

## 2) SQL checks
- Validate source row counts against baseline.
- Check recent schema changes and data type drift.
- Re-run the staging validation query and inspect error rows.

## 3) ADF checks
- Review activity output JSON and the dependency chain.
- Confirm linked service credentials and integration runtime health.
- Verify retry policy, timeout settings, and parameter values.

## 4) Databricks checks
- Review cluster state, job run logs, and notebook errors.
- Validate mount/path access and input dataset availability.
- Route malformed records to quarantine and re-run the curated write.

## 5) Recovery and escalation
- Apply the contained fix in non-prod first.
- Reprocess only the impacted window/partition where possible.
- Escalate when data integrity risk or SLA breach exceeds the agreed threshold.

## 6) Closure
- Document root cause, fix, and prevention action.
- Attach run evidence and update release/runbook notes.
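The row-count validation in step 2 can be sketched as a small tolerance check. This is a minimal illustration, not any team's actual validation job; the function name, the 5% default tolerance, and the counts are all assumptions for the example.

```python
def row_count_drift(source_count: int, baseline_count: int,
                    tolerance: float = 0.05) -> bool:
    """Return True when the source count deviates from the baseline
    by more than `tolerance` (a fraction, e.g. 0.05 = 5%).

    Hypothetical helper for illustration; thresholds should come from
    the pipeline's own baseline history.
    """
    if baseline_count == 0:
        # No baseline rows: any nonzero source count is unexpected.
        return source_count != 0
    return abs(source_count - baseline_count) / baseline_count > tolerance


# A 3% dip stays within a 5% tolerance; a 20% dip gets flagged.
print(row_count_drift(970, 1000))   # within tolerance
print(row_count_drift(800, 1000))   # drift flagged
```

In practice the baseline would be a rolling aggregate (e.g. the median of the last N successful runs) rather than a single fixed number, so that seasonal volume swings do not trigger false alarms.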
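The "route malformed records to quarantine" step in step 4 amounts to partitioning input on a validation predicate before the curated write. A minimal sketch, assuming a plain-Python record list and an illustrative `amount`-is-present rule (in a real Databricks job this would typically be a DataFrame filter or a bad-records path instead):

```python
def split_quarantine(records, is_valid):
    """Partition records into (curated, quarantined) lists.

    `is_valid` is a caller-supplied predicate; the rule below is a
    made-up example, not a real schema contract.
    """
    curated, quarantined = [], []
    for rec in records:
        (curated if is_valid(rec) else quarantined).append(rec)
    return curated, quarantined


rows = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": None}]
good, bad = split_quarantine(rows, lambda r: r["amount"] is not None)
# `good` proceeds to the curated write; `bad` lands in quarantine
# with enough context (run ID, source file) to investigate later.
```

Keeping the quarantine write before the curated re-run means the re-run operates only on records already known to be clean, which matches the ordering of the checklist bullet.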