Glossary

MTTR

Mean time to recovery — the average elapsed time from an incident’s start to its resolution. A core reliability KPI.

MTTR (mean time to recovery, sometimes “repair”) is the average elapsed time between the start of an incident and its resolution. It captures how quickly the team detects, responds, and mitigates.

MTTR is most useful as a trend: an organisation whose MTTR is falling over quarters is getting better at incident response — through better runbooks, better alerting, faster rollback tools, or more on-call practice. MTTR is most misleading as a single-number SLO: a single very-long incident can dominate the average for an entire quarter.

Decomposing MTTR helps: time to detect (was monitoring fast enough?), time to acknowledge (did the page reach a human?), time to mitigate (how long to apply the workaround?), and time to resolve (how long to remove the workaround). Each component has a different fix.

MTTR

MTBF

incident

SLA