Reliability is one of those properties that everyone agrees is important and almost nobody invests in until it is too late. The teams that ship reliable software are not heroically more talented. They are operationally more disciplined. They write down what they care about, they put a number on it, and they treat the number as a constraint that decisions get made against. The whole practice fits in a few primitives: performance budgets, error budgets, and the operating habits that keep both honest.
The SRE Frame Still Holds Up
Google's Site Reliability Engineering book codified the modern reliability vocabulary of service level indicators (SLIs), service level objectives (SLOs), and error budgets, and a decade later the underlying frame still holds up. The job of the engineering organization is to decide what reliability levels matter for which user journeys, quantify the gap between those targets and what the system actually delivers, and run the rest of the operation against that quantification. Most reliability programs that fail are programs that skipped the first step.
Performance Budgets For The Frontend
Performance budgets are the equivalent discipline for the frontend. The web.dev team's Performance Budgets 101 walks through the canonical pattern: pick a small number of metrics that map to user-felt performance, set numeric ceilings on each, and wire CI to fail builds that breach them. The specifics matter less than the discipline. A team with a 500 KB JavaScript ceiling that it actually enforces will outship a team with an aspirational performance goal that nothing enforces.
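As a minimal sketch of the CI gate described above, the following script sums asset sizes in a build directory and reports any category over its ceiling. The directory layout, the CSS ceiling, and the function names are assumptions made for illustration, and the check measures uncompressed on-disk bytes, which may not be the number a given team budgets against.

```python
from pathlib import Path

# Hypothetical budgets: glob pattern -> maximum total size in bytes.
BUDGETS = {
    "*.js": 500 * 1024,   # the 500 KB JavaScript ceiling used as an example
    "*.css": 100 * 1024,  # assumed value, for illustration only
}


def check_budgets(build_dir, budgets=BUDGETS):
    """Return one message per breached budget (empty list if all hold)."""
    breaches = []
    for pattern, ceiling in budgets.items():
        # Sum uncompressed on-disk sizes of every matching file.
        total = sum(
            path.stat().st_size
            for path in Path(build_dir).rglob(pattern)
            if path.is_file()
        )
        if total > ceiling:
            breaches.append(
                f"{pattern}: {total} bytes exceeds budget of {ceiling} bytes"
            )
    return breaches


def main(argv):
    """Print breaches and return a process exit code (non-zero fails CI)."""
    build_dir = argv[1] if len(argv) > 1 else "dist"
    problems = check_budgets(build_dir)
    for line in problems:
        print(line)
    return 1 if problems else 0
```

A CI step could call `sys.exit(main(sys.argv))` after the build so that a breach fails the pipeline instead of producing a warning that nobody reads.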
Core Web Vitals Are The Public Floor
Google's Core Web Vitals are the public, accountable floor for marketing-site performance, and they doubled as a search ranking signal once the page-experience update landed. INP replaced FID as a Core Web Vital in March 2024, and its threshold is genuinely harder to clear than teams that built FID-era infrastructure expect: FID measured only the input delay of the first interaction, while INP reflects the latency of interactions across the whole page visit. Treat Core Web Vitals as the public metric and use tighter internal budgets to stay clear of the thresholds.
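One way to picture "tighter internal budgets" is to assess field data against two lines at once. The sketch below uses the published "good" thresholds for Core Web Vitals (LCP 2.5 s, INP 200 ms, CLS 0.1, assessed at the 75th percentile); the 80% internal margin, the metric keys, and the function names are assumptions for illustration, not a recommended standard.

```python
import math

# Published "good" thresholds for Core Web Vitals (75th percentile).
PUBLIC_GOOD = {"LCP_ms": 2500, "INP_ms": 200, "CLS": 0.1}

# Assumed margin: internal budgets sit at 80% of the public threshold.
INTERNAL_MARGIN = 0.8


def internal_budget(metric):
    """Tighter internal ceiling derived from the public threshold."""
    return PUBLIC_GOOD[metric] * INTERNAL_MARGIN


def p75(samples):
    """Nearest-rank 75th percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(0.75 * len(ordered))
    return ordered[rank - 1]


def assess(metric, samples):
    """Return (p75 value, status) for one metric's field samples."""
    value = p75(samples)
    if value > PUBLIC_GOOD[metric]:
        status = "fails public threshold"
    elif value > internal_budget(metric):
        status = "over internal budget"
    else:
        status = "ok"
    return value, status
```

The middle status is the useful one: it flags a regression while the public number still looks healthy, which is the point of keeping a margin.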
Error Budgets For The Backend
The error budget pattern, the amount of unreliability an SLO allows per window, is operationally underrated. The CNCF's cloud-native operations writing has documented the pattern across hundreds of teams: the budget governs how aggressively the team can ship. Plenty of budget remaining in a 30-day window means the team can take risks on velocity. Burning the budget early means the team freezes deploys until reliability is restored. The mechanism is what makes velocity-vs-reliability a decision instead of an argument.
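The arithmetic behind the budget is small enough to sketch. Assuming a time-based availability SLO, a 99.9% target over a 30-day window allows about 43.2 minutes of unavailability. The function names and the simple "freeze when exhausted" rule below are illustrative assumptions; real policies often add thresholds and exceptions.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed unavailability for an SLO such as 0.999."""
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo, bad_minutes, window_days=30):
    """Fraction of the budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - bad_minutes) / budget


def can_deploy(slo, bad_minutes, window_days=30):
    """Assumed policy: deploys continue only while budget remains."""
    return budget_remaining(slo, bad_minutes, window_days) > 0
```

For example, `can_deploy(0.999, 10)` is true because 10 of the roughly 43.2 allowed minutes are spent, while `can_deploy(0.999, 50)` is false because the window's budget is already overdrawn.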
Observability That Funds The Discipline
The discipline runs on observability. The OpenTelemetry project's instrumentation guidance is the right canonical reference. The principle: every user-felt SLI is backed by instrumentation on the request path it measures, every call to a dependency propagates trace context in its headers, and every dashboard shows the budget alongside the raw metric. Teams that skip the budget overlay treat dashboards as decoration. Teams that include it use dashboards to make decisions.
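A budget overlay can be as simple as computing two extra numbers next to the raw error rate. The sketch below assumes a request-based SLI with counts taken over the full SLO window and over a shorter recent lookback; the function name, field names, and the choice of lookback are hypothetical, and in practice the counts would come from whatever metrics pipeline the team already runs.

```python
def dashboard_row(window_total, window_failed, recent_total, recent_failed, slo):
    """Pair the raw error rate with budget figures for one SLI.

    window_* are request counts over the whole SLO window;
    recent_* are counts over a shorter lookback (for example, one hour).
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    if window_total <= 0 or recent_total <= 0:
        raise ValueError("request counts must be positive")

    allowed_error_rate = 1 - slo
    window_error_rate = window_failed / window_total
    recent_error_rate = recent_failed / recent_total

    return {
        # The raw metric most dashboards already show.
        "error_rate": window_error_rate,
        # Share of the window's error budget already spent (1.0 = exhausted).
        "budget_consumed": window_error_rate / allowed_error_rate,
        # Burn rate: 1.0 means spending exactly on pace with the SLO;
        # 2.0 means the budget would be gone in half the window.
        "burn_rate": recent_error_rate / allowed_error_rate,
    }
```

Showing `budget_consumed` and `burn_rate` beside `error_rate` is what lets a reader of the dashboard answer "can we keep shipping?" rather than only "what is the error rate?".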
Operating Rituals That Keep The Budgets Honest
A weekly fifteen-minute reliability review. A monthly budget-review meeting where the burn rate is the agenda. A clear escalation path when the budget is exhausted. None of this is exotic. All of it is what separates teams that say they care about reliability from teams that ship it.
Key Takeaways
- Reliability is operational discipline, not heroics — pick numbers, then defend them
- SLIs, SLOs, and error budgets are the durable backend vocabulary
- Performance budgets are the equivalent for the frontend — wire them into CI
- Core Web Vitals are the public floor; use tighter internal budgets to stay clear
- Error budgets turn velocity-vs-reliability into a decision instead of an argument
- Observability dashboards have to show budgets alongside raw metrics to be useful
