BGP mistakes that looked like DDoS — how did you prove it was config drift?
We chased a traffic shift for hours before noticing a stale prefix advertisement from a secondary router.
What forensic steps saved you the most time?
Collecting RIB snapshots on a schedule gave us a before picture — diffing against live state pinpointed the leak.
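A minimal sketch of that diff step, assuming each snapshot is a JSON map of prefix to attributes; the export format, file names, and field layout are all placeholders:

```python
#!/usr/bin/env python3
"""Diff a scheduled RIB snapshot against live state.

Sketch only: assumes each snapshot is a JSON file mapping
prefix -> {"next_hop": ..., "as_path": [...]}; adapt the loader
to whatever your router export actually produces.
"""
import json
import sys

def load(path):
    with open(path) as f:
        return json.load(f)

def diff(before, after):
    gone = set(before) - set(after)      # prefixes that disappeared from live state
    new = set(after) - set(before)       # prefixes that appeared (possible leak)
    changed = {p for p in set(before) & set(after) if before[p] != after[p]}
    return gone, new, changed

if __name__ == "__main__":
    before, after = load(sys.argv[1]), load(sys.argv[2])
    gone, new, changed = diff(before, after)
    for label, prefixes in (("WITHDRAWN", gone), ("NEW", new), ("CHANGED", changed)):
        for p in sorted(prefixes):
            print(f"{label}\t{p}")
```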
NetFlow alone misled us; only after we joined it with BGP update logs from the peer edge did the traffic shift line up with a specific announcement.
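Roughly the join we ended up doing, as a sketch; the record shapes here are invented and real exports vary by collector:

```python
import ipaddress
from datetime import datetime, timedelta

# Hypothetical record shapes; substitute your collector's export.
flows = [    # (timestamp, dst_ip, bytes)
    (datetime(2024, 5, 1, 3, 12), "203.0.113.7", 9_000_000),
]
updates = [  # (timestamp, prefix, kind) from the peer-edge BGP logs
    (datetime(2024, 5, 1, 3, 10), "203.0.113.0/24", "announce"),
]

WINDOW = timedelta(minutes=5)

def correlate(flows, updates):
    """Yield (flow, update) pairs where a traffic spike follows a
    BGP update covering the flow's destination within WINDOW."""
    for f_ts, dst, nbytes in flows:
        ip = ipaddress.ip_address(dst)
        for u_ts, prefix, kind in updates:
            if ip in ipaddress.ip_network(prefix) and timedelta(0) <= f_ts - u_ts <= WINDOW:
                yield (f_ts, dst, nbytes), (u_ts, prefix, kind)

for flow, update in correlate(flows, updates):
    print("flow", flow, "follows", update)
```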
We now require maintenance windows to include explicit withdraw checks, not just 'ping looks fine'.
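For the withdraw check itself, polling an external view like RIPEstat works as a sketch; the routing-status field names below are from memory, so verify against the current API docs before relying on them:

```python
import json
import urllib.request

RIPESTAT = "https://stat.ripe.net/data/routing-status/data.json?resource={prefix}"

def still_visible(prefix: str) -> bool:
    """Return True if RIPE RIS peers still see the prefix announced.
    Field names ('visibility', 'v4', 'ris_peers_seeing') are from
    memory of the routing-status call; check the docs. Assumes an
    IPv4 prefix."""
    with urllib.request.urlopen(RIPESTAT.format(prefix=prefix), timeout=10) as r:
        data = json.load(r)["data"]
    return data["visibility"]["v4"]["ris_peers_seeing"] > 0

# Prefixes the maintenance window was supposed to withdraw (placeholders).
for prefix in ["198.51.100.0/24"]:
    if still_visible(prefix):
        print(f"FAIL: {prefix} still visible in RIS after withdraw")
```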
Automated validation of IRR objects against what we actually announce caught a typo that a human reviewer had skimmed past twice.
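One way to wire that up is expanding the as-set with bgpq4 and diffing against what the edge announces; a sketch, assuming bgpq4 is installed and recalling its JSON output shape (AS-EXAMPLE and the announced set are placeholders):

```python
import json
import subprocess

def irr_prefixes(as_set: str) -> set[str]:
    """Expand an IRR as-set into prefixes with bgpq4.
    Assumes bgpq4 is installed; -j requests JSON and 'NN' is the
    label bgpq4 attaches to the generated list. Output shape is
    from memory, so verify on your version."""
    out = subprocess.run(["bgpq4", "-j", "-l", "NN", as_set],
                         capture_output=True, text=True, check=True).stdout
    return {e["prefix"] for e in json.loads(out)["NN"]}

def announced_prefixes() -> set[str]:
    # Stand-in: in practice, pull this from your route collector or a
    # BGP table export; hard-coded here for the sketch.
    return {"192.0.2.0/24", "198.51.100.0/24"}

missing = announced_prefixes() - irr_prefixes("AS-EXAMPLE")
for p in sorted(missing):
    print(f"announced but not registered in IRR: {p}")
```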
Escalation playbook starts with 'who changed what in the last two hours' — boring but effective.
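Ours is literally a query over the change-audit log; a sketch assuming a JSON-lines log with UTC-offset timestamps (the path and record fields are made up):

```python
import json
from datetime import datetime, timedelta, timezone

AUDIT_LOG = "/var/log/netops/changes.jsonl"   # hypothetical path and format

def recent_changes(hours=2):
    """Yield change records newer than the cutoff. Expects records
    like {"ts": "2024-05-01T03:10:00+00:00", "who": ..., "device": ...,
    "summary": ...}; timestamps must carry an offset."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=hours)
    with open(AUDIT_LOG) as f:
        for line in f:
            rec = json.loads(line)
            if datetime.fromisoformat(rec["ts"]) >= cutoff:
                yield rec

for rec in recent_changes():
    print(rec["ts"], rec["who"], rec["device"], rec["summary"])
```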
Some CDNs mask origin shifts, so we tag synthetic probes with a custom header that our own edge logging captures, which separates our test traffic from provider noise.
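The probe side is trivial; a sketch with stdlib HTTP, where the header name and value scheme are just our own convention and the URL is a placeholder:

```python
import urllib.request

# Header name/value are an internal convention; pick anything your
# logging pipeline is configured to capture and index on.
PROBE_HEADER = ("X-Synthetic-Probe", "netops-bgp-drill-42")

def probe(url: str) -> int:
    """Send one tagged synthetic probe and return the HTTP status."""
    req = urllib.request.Request(url, headers=dict([PROBE_HEADER]))
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status

print(probe("https://example.com/healthz"))
```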
Training NOC staff to read AS paths dramatically cut our 'mean time to innocent', i.e. how long it takes to clear an upstream partner of blame.
We learned the hard way that rollback scripts must include community strings, not just interface toggles.
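A sketch of what "rollback includes communities" means in practice: emit the community set alongside the route-map restore, not just the shut/no-shut. The syntax generated below is IOS-flavored and purely illustrative; adapt to your platform:

```python
def rollback_config(prefix: str, route_map: str, communities: list[str]) -> str:
    """Emit an IOS-style snippet that restores communities together
    with the route-map. Names and numbering are illustrative."""
    lines = [
        f"route-map {route_map} permit 10",
        f" match ip address prefix-list PL-{prefix.replace('/', '_')}",
        f" set community {' '.join(communities)} additive",
    ]
    return "\n".join(lines)

print(rollback_config("192.0.2.0/24", "RM-ROLLBACK", ["65001:100", "65001:120"]))
```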
Internal chatbots are cute; a single Grafana board with last-known-good metrics saved us more than any bot.
Peering coordinator now gets paged on unexpected path prepends — political and technical problem in one.
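The prepend alert itself is simple once you have a collector feed; a sketch where the allow-list and the sample path are invented values:

```python
# ASNs allowed to prepend per our peering agreements (example values).
EXPECTED_PREPENDS = {65010}

def prepends(as_path: list[int]) -> dict[int, int]:
    """Count consecutive repeats per ASN in an AS path."""
    counts = {}
    for prev, cur in zip(as_path, as_path[1:]):
        if prev == cur:
            counts[cur] = counts.get(cur, 1) + 1
    return counts

def unexpected(as_path: list[int]) -> dict[int, int]:
    return {asn: n for asn, n in prepends(as_path).items()
            if asn not in EXPECTED_PREPENDS}

path = [65001, 65001, 65001, 65010, 64512]   # sample path from a collector feed
if bad := unexpected(path):
    print(f"page peering coordinator: unexpected prepends {bad} in {path}")
```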
Packet captures at two hops minimum — single-point capture lied about directionality once.
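To compare directionality across capture points, we reduce each pcap to src/dst pairs first; a sketch assuming files produced with `tshark -r cap.pcap -T fields -E separator=, -e ip.src -e ip.dst`, with placeholder host addresses:

```python
import csv
import sys
from collections import Counter

def direction_counts(path, a="10.0.0.5", b="203.0.113.7"):
    """Tally packets each way between hosts a and b from a two-column
    src,dst file. Host addresses are placeholders for the sketch."""
    c = Counter()
    with open(path) as f:
        for row in csv.reader(f):
            if len(row) != 2:
                continue
            src, dst = row
            if (src, dst) == (a, b):
                c["a->b"] += 1
            elif (src, dst) == (b, a):
                c["b->a"] += 1
    return c

# Pass one reduced file per capture point; mismatched ratios between
# points are the directionality lie a single capture can't reveal.
for capture in sys.argv[1:]:
    print(capture, direction_counts(capture))
```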
Documentation debt hurt us — no diagram meant the new hire guessed wrong during failover practice.
Vendor TAC was faster when we sent them concise timeline tables instead of hundred-meg PCAPs upfront.
We simulate fat-finger events quarterly in lab — cheaper than another 3 a.m. war room.
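Our lab fat-finger is just an ExaBGP process that announces a wrong more-specific on a lab session and withdraws it later, so the monitoring and the runbook both get exercised; a sketch, with lab-only prefix and next-hop:

```python
#!/usr/bin/env python3
"""Run from an ExaBGP 'process' stanza against a lab peering session.
Announces a deliberately wrong more-specific, waits, then withdraws.
Prefix and next-hop below are lab placeholders."""
import sys
import time

BOGUS = "announce route 198.51.100.128/25 next-hop 10.0.0.1"

sys.stdout.write(BOGUS + "\n")   # ExaBGP reads commands from our stdout
sys.stdout.flush()
time.sleep(300)                  # give alerting five minutes to notice
sys.stdout.write(BOGUS.replace("announce", "withdraw") + "\n")
sys.stdout.flush()
```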
Honest postmortems without blame made people report near misses earlier — prevention beats heroics.