MTBF
Mean time between failures — the average uptime between consecutive incidents. A measure of underlying reliability.
MTBF (mean time between failures) is the average uptime between consecutive incidents. It’s a measure of how often the system breaks, independent of how quickly the team recovers from each break.
MTBF and MTTR together describe a service’s reliability profile. High MTBF and low MTTR mean a service that rarely breaks and recovers quickly when it does; low MTBF and high MTTR mean a service that breaks often and stays broken for a long time. Most real services sit somewhere in between, and the goal of a reliability practice is to raise MTBF and lower MTTR.
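As a rough illustration, both numbers can be derived from a log of incident start and resolution times. The sketch below follows the definitions in this glossary: MTTR averages each incident's duration, and MTBF averages the uptime gap between one incident's resolution and the next incident's start. The `incidents` list, its timestamps, and the function names are hypothetical; the code assumes at least two non-overlapping incidents sorted by start time.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (start, resolved) pairs, sorted by start time.
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 11, 10, 30), datetime(2024, 1, 11, 12, 0)),
    (datetime(2024, 1, 31, 12, 0), datetime(2024, 1, 31, 13, 0)),
]


def mttr(incidents):
    """Average elapsed time from an incident's start to its resolution."""
    durations = [resolved - start for start, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


def mtbf(incidents):
    """Average uptime between one incident's resolution and the next start."""
    gaps = [
        next_start - prev_resolved
        for (_, prev_resolved), (next_start, _) in zip(incidents, incidents[1:])
    ]
    return sum(gaps, timedelta()) / len(gaps)


print("MTTR:", mttr(incidents))  # 1:00:00
print("MTBF:", mtbf(incidents))  # 15 days, 0:00:00
```

Other conventions exist, for example dividing total uptime in an observation window by the number of incidents. The gap-based version is used here only because it matches the wording of the definition above.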
MTTR
Mean time to recovery — the average elapsed time from an incident’s start to its resolution. A core reliability KPI.
incident
A discrete event during which a service was unavailable or degraded, with a defined start, a resolution, and updates in between.
uptime
The proportion of time a service is reachable and responding correctly, usually expressed as a percentage over a window.
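A minimal sketch of the percentage calculation, assuming downtime is known as a list of non-overlapping (start, resolved) incident pairs and that degraded time counts as downtime. The function name `uptime_percent` and the sample window are hypothetical.

```python
from datetime import datetime, timedelta


def uptime_percent(incidents, window_start, window_end):
    """Percentage of the window not covered by any incident."""
    window = window_end - window_start
    downtime = timedelta()
    for start, resolved in incidents:
        # Count only the part of the incident that falls inside the window.
        overlap = min(resolved, window_end) - max(start, window_start)
        if overlap > timedelta():
            downtime += overlap
    return 100.0 * (1 - downtime / window)


# Hypothetical 30-day window containing a single 3-hour incident.
window_start = datetime(2024, 1, 1)
window_end = window_start + timedelta(days=30)
outages = [(datetime(2024, 1, 10, 9, 0), datetime(2024, 1, 10, 12, 0))]

pct = uptime_percent(outages, window_start, window_end)
print(f"{pct:.3f}%")  # 99.583%
```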