# ETL Troubleshooting Checklist (Sample)

## 1) Intake and scope
- Confirm the failed component: SQL job, ADF activity, Databricks task, or downstream consumer.
- Record run ID, pipeline/job name, environment, and failure timestamp (UTC).
- Capture business impact and SLA risk.

## 2) SQL checks
- Validate source row counts against baseline.
- Check recent schema changes and data type drift.
- Re-run the staging validation query and inspect error rows.

## 3) ADF checks
- Review activity output JSON and the dependency chain.
- Confirm linked service credentials and integration runtime health.
- Verify retry policy, timeout settings, and parameter values.

## 4) Databricks checks
- Review cluster state, job run logs, and notebook errors.
- Validate mount/path access and input dataset availability.
- Route malformed records to quarantine and re-run the curated write.

## 5) Recovery and escalation
- Apply the contained fix in non-prod first.
- Reprocess only the impacted window/partition where possible.
- Escalate when data integrity risk or SLA breach exceeds the agreed threshold.

## 6) Closure
- Document root cause, fix, and prevention action.
- Attach run evidence and update release/runbook notes.
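The row-count validation in step 2 can be sketched as a small tolerance check. This is a minimal illustration, not any team's actual validation job; the function name, the 5% default tolerance, and the counts are all assumptions for the example.

```python
def row_count_drift(source_count: int, baseline_count: int,
                    tolerance: float = 0.05) -> bool:
    """Return True when the source count deviates from the baseline
    by more than `tolerance` (a fraction, e.g. 0.05 = 5%).

    Hypothetical helper for illustration; thresholds should come from
    the pipeline's own baseline history.
    """
    if baseline_count == 0:
        # No baseline rows: any nonzero source count is unexpected.
        return source_count != 0
    return abs(source_count - baseline_count) / baseline_count > tolerance


# A 3% dip stays within a 5% tolerance; a 20% dip gets flagged.
print(row_count_drift(970, 1000))   # within tolerance
print(row_count_drift(800, 1000))   # drift flagged
```

In practice the baseline would be a rolling aggregate (e.g. the median of the last N successful runs) rather than a single fixed number, so that seasonal volume swings do not trigger false alarms.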
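The "route malformed records to quarantine" step in step 4 amounts to partitioning input on a validation predicate before the curated write. A minimal sketch, assuming a plain-Python record list and an illustrative `amount`-is-present rule (in a real Databricks job this would typically be a DataFrame filter or a bad-records path instead):

```python
def split_quarantine(records, is_valid):
    """Partition records into (curated, quarantined) lists.

    `is_valid` is a caller-supplied predicate; the rule below is a
    made-up example, not a real schema contract.
    """
    curated, quarantined = [], []
    for rec in records:
        (curated if is_valid(rec) else quarantined).append(rec)
    return curated, quarantined


rows = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": None}]
good, bad = split_quarantine(rows, lambda r: r["amount"] is not None)
# `good` proceeds to the curated write; `bad` lands in quarantine
# with enough context (run ID, source file) to investigate later.
```

Keeping the quarantine write before the curated re-run means the re-run operates only on records already known to be clean, which matches the ordering of the checklist bullet.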