Glossary

MTBF

Mean time between failures — the average uptime between consecutive incidents. A measure of underlying reliability.

MTBF (mean time between failures) is the average uptime between consecutive incidents. It’s a measure of how often the system breaks, independent of how quickly the team recovers from each break.
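
As a minimal sketch of the arithmetic, MTBF can be taken as total uptime in an observation window divided by the number of incidents in that window. This is one common convention, and teams differ on the details. The function name `mtbf_hours` and the (start, end) incident layout are assumptions made for illustration, not part of any particular tool.

```python
from datetime import datetime, timedelta

def mtbf_hours(window_start, window_end, incidents):
    """Return total uptime in the window divided by the incident count, in hours.

    incidents: list of (start, end) datetime pairs, assumed non-overlapping
    and fully inside the window. Returns None when there are no incidents,
    because MTBF is undefined without at least one failure.
    """
    if not incidents:
        return None
    # Sum the duration of every incident to get total downtime.
    downtime = sum((end - start for start, end in incidents), timedelta())
    uptime = (window_end - window_start) - downtime
    return uptime.total_seconds() / 3600 / len(incidents)

# Hypothetical data: a 30-day window with three 2-hour incidents.
window_start = datetime(2024, 1, 1)
window_end = window_start + timedelta(days=30)
incidents = [
    (datetime(2024, 1, 5, 10), datetime(2024, 1, 5, 12)),
    (datetime(2024, 1, 14, 3), datetime(2024, 1, 14, 5)),
    (datetime(2024, 1, 25, 20), datetime(2024, 1, 25, 22)),
]
print(mtbf_hours(window_start, window_end, incidents))
```

With this made-up data the window is 720 hours, downtime is 6 hours, and the result is 714 / 3 = 238 hours. How long each incident lasted changes the result only through the uptime total, which is why MTBF says little about recovery speed.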

MTBF and MTTR (mean time to recovery) together describe a service’s reliability profile: high MTBF and low MTTR means a service that rarely breaks and recovers quickly when it does; low MTBF and high MTTR means a service that breaks often and stays broken for a long time. Most real services sit somewhere in between, and the goal of a reliability practice is to raise MTBF and lower MTTR.
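
One way to see how the two numbers combine is the standard steady-state availability relationship, availability = MTBF / (MTBF + MTTR). The sketch below assumes both values are in the same unit and takes MTTR as mean time to recovery. The function name `availability` is hypothetical, and the formula is a general reliability-engineering identity rather than something this glossary defines.

```python
def availability(mtbf, mttr):
    """Steady-state availability as the fraction of time the service is up.

    mtbf and mttr must use the same unit (for example, hours).
    """
    if mtbf <= 0 or mttr < 0:
        raise ValueError("mtbf must be positive and mttr non-negative")
    return mtbf / (mtbf + mttr)

# Hypothetical figures: fails every 238 hours on average, recovers in 2 hours.
baseline = availability(238, 2)
# Halving MTTR and doubling MTBF both raise availability.
faster_recovery = availability(238, 1)
fewer_failures = availability(476, 2)
print(round(baseline, 4), faster_recovery > baseline, fewer_failures > baseline)
```

With these illustrative numbers the baseline is about 0.9917, and either improvement pushes it higher, which is the sense in which a reliability practice works on both numbers at once.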