NOC to SRE transition: what cultural shift was harder than the toolchain?
We bought the shiny observability stack but still firefight the same incidents.
What actually moved reliability culture for your network org?
15 replies
Blameless postmortems with customer impact minutes on the slide changed exec patience for root fixes vs hotfixes.
SLOs on provisioning APIs mattered as much as packet loss — internal customers felt reliability gains first.
Rotating devs into NOC shifts for a quarter broke knowledge silos faster than any wiki sprint.
We stopped rewarding fastest ticket close and started rewarding recurrence reduction — metrics drive behaviour.
Game days with finance watching dialled down cowboy changes during quarter end — empathy works both ways.
Runbooks in git with PR review sounded bureaucratic until onboarding time halved.
Psychological safety to say 'I do not know' on a bridge call reduced wild guesses that extended outages.
Chaos tests on lab mirrors found dependency assumptions nobody documented — spreadsheets lie, packets do not.
Career ladders now value automation commits alongside traditional escalation heroics.
We hired a technical writer part-time — best ROI of the year for runbook quality.
Pager fatigue dropped when we capped consecutive nights and enforced real handoffs.
Customer success joined postmortems for top ten accounts — priorities realigned quickly after that.
Toolchain is easy compared to convincing middle managers that toil reduction is real work.
Celebrating prevented incidents quietly improved morale more than loud firefighting praise ever did.
Biggest shift: measuring customer-observed downtime, not just green lights inside our NOC wallboards.
Join the conversation.
Log in to reply