confessional

When the System Breaks

5 min · Eitan Gorodetsky

I told them it was stable.

In the quarterly review, I used the words "robust" and "reliable." I pointed at uptime charts. I referenced the incident count -- down 40% year-over-year. The leadership team nodded. Someone said "great work." I believed it.

Two weeks later, the system collapsed on a Tuesday morning.

Not a partial outage. A full stop. The kind where people stand up at their desks and look around the room like the lights just went out.

What I Missed

The uptime charts were accurate. The incident count was real. But I had confused stability with resilience. The system was stable the way a house of cards is stable -- perfectly fine until someone opens a window.

What I hadn't measured was the depth of our dependencies. One service going down triggered a cascade across four others. The failover hadn't been tested in eleven months. The runbook referenced a team that had been reorganized twice since it was written.

I'd been reporting on the surface while the foundation was rotting.

What I Learned

The hardest part wasn't fixing the system. It was walking into the next leadership meeting and saying: "The metrics I showed you were true but misleading. I gave you confidence where I should have given you caution."

Nobody celebrated that presentation. But it was the most important one I ever gave.