MTTR
Mean time to recovery — the average elapsed time from an incident’s start to its resolution. A core reliability KPI.
MTTR (mean time to recovery, sometimes “repair”) is the average elapsed time between the start of an incident and its resolution. It captures how quickly the team detects, responds to, and mitigates incidents.
MTTR is most useful as a trend: an organisation whose MTTR falls from quarter to quarter is getting better at incident response, whether through better runbooks, better alerting, faster rollback tools, or more on-call practice. MTTR is most misleading as a single-number SLO: because it is a mean, one very long incident can dominate the average for an entire quarter.
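To make the outlier effect concrete, here is a minimal sketch in Python. The incident durations are invented for illustration, and the helper name `mttr_minutes` is hypothetical, not part of any tool. The median appears only as a point of comparison.

```python
from statistics import mean, median

def mttr_minutes(durations_min):
    """Return the mean incident duration in minutes (hypothetical helper)."""
    return mean(durations_min)

# Made-up quarter: five short incidents and one 12-hour outage.
durations_min = [12, 18, 25, 30, 15, 720]

quarter_mttr = mttr_minutes(durations_min)           # mean over all six incidents
without_outlier = mttr_minutes(durations_min[:-1])   # mean of the five short ones
typical = median(durations_min)                      # far less sensitive to the outlier
outlier_share = max(durations_min) / sum(durations_min)

print(f"MTTR with outlier:    {quarter_mttr:.1f} min")
print(f"MTTR without outlier: {without_outlier:.1f} min")
print(f"Median duration:      {typical:.1f} min")
print(f"Outlier share of total incident time: {outlier_share:.0%}")
```

In this made-up quarter the single long outage accounts for most of the total incident time, so the quarterly average says more about that one event than about typical response speed.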
Decomposing MTTR helps: time to detect (was monitoring fast enough?), time to acknowledge (did the page reach a human?), time to mitigate (how long to apply the workaround?), and time to resolve (how long until the workaround can be removed?). Each component has a different fix.
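The decomposition above can be sketched as a small Python example. It assumes each incident records five timestamps. The `IncidentTimeline` class, its field names, and the sample times are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class IncidentTimeline:
    """Hypothetical record of the key timestamps for one incident."""
    started: datetime       # when the service first became unavailable or degraded
    detected: datetime      # when monitoring raised the alert
    acknowledged: datetime  # when a human accepted the page
    mitigated: datetime     # when the workaround was in place
    resolved: datetime      # when the workaround was removed

def decompose(t: IncidentTimeline) -> dict[str, timedelta]:
    """Split one incident's total duration into the four components."""
    return {
        "time_to_detect": t.detected - t.started,
        "time_to_acknowledge": t.acknowledged - t.detected,
        "time_to_mitigate": t.mitigated - t.acknowledged,
        "time_to_resolve": t.resolved - t.mitigated,
    }

# Invented example incident.
example = IncidentTimeline(
    started=datetime(2024, 1, 15, 10, 0),
    detected=datetime(2024, 1, 15, 10, 7),
    acknowledged=datetime(2024, 1, 15, 10, 10),
    mitigated=datetime(2024, 1, 15, 10, 40),
    resolved=datetime(2024, 1, 15, 12, 0),
)

parts = decompose(example)
for name, delta in parts.items():
    print(f"{name}: {delta}")

# The components add up to the full incident duration.
total = sum(parts.values(), timedelta())
```

Averaging each component separately across incidents shows which stage contributes most to MTTR, and therefore which fix is likely to matter.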
MTBF
Mean time between failures — the average uptime between consecutive incidents. A measure of underlying reliability.
incident
A discrete event during which a service was unavailable or degraded, with a defined start, updates, and resolution.
SLA
Service-level agreement — a contractual promise of a target metric, often availability, with consequences for missing it.