Core router refresh: how did you phase traffic without trusting a single maintenance window?
At tens of terabits, 'we will be quick' feels naive. Interested in canary flows, parallel fabrics, or other phasing patterns.
15 replies
We moved low-risk transit peers first while monitoring error counters per line card — incremental courage.
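A rough sketch of that per-card counter watch (names and the threshold are illustrative, not our actual tooling; real snapshots would come from SNMP or gNMI polls):

```python
# Compare two error-counter snapshots and flag line cards whose
# delta exceeds a threshold, so a peer migration can be paused early.
def flag_linecards(prev, curr, threshold=100):
    """Return (card, delta) pairs whose input-error delta exceeds threshold."""
    flagged = []
    for card, errs in curr.items():
        delta = errs - prev.get(card, 0)
        if delta > threshold:
            flagged.append((card, delta))
    return sorted(flagged)

prev = {"lc1": 10, "lc2": 5}
curr = {"lc1": 12, "lc2": 400}
print(flag_linecards(prev, curr))  # only lc2 trips the threshold
```

The point is the delta between polls, not the absolute count: a card can carry old errors forever without being a problem.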
A parallel fabric meant a capex hit upfront, but it removed the knife-edge cutover everyone feared.
Automated rollback on BFD session loss saved us during a bad optics batch — invest in fast detection.
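The detection side can be as simple as a sliding-window count of BFD down events; this is a minimal sketch of that decision (window and loss count are made up, and the actual rollback call to the router is out of scope here):

```python
# Decide whether to roll back: trigger if max_losses BFD 'down' events
# land within a window_s-second sliding window.
def should_rollback(bfd_events, window_s=30, max_losses=3):
    """bfd_events: iterable of (timestamp_seconds, state) tuples."""
    downs = sorted(t for t, state in bfd_events if state == "down")
    for i in range(len(downs) - max_losses + 1):
        # earliest and latest of max_losses consecutive downs
        if downs[i + max_losses - 1] - downs[i] <= window_s:
            return True
    return False
```

Fast detection matters more than a clever response: with a bad optics batch, the sessions drop in a burst, and a burst is exactly what this catches.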
Traffic engineering labels let us drain specific communities without touching unrelated wholesale customers.
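A toy version of that selection step, assuming routes are tagged with BGP communities (community values and prefixes here are invented): pick only what carries the drain tag, leave everything else alone.

```python
# Select only routes carrying the drain community; unrelated
# (e.g. wholesale) prefixes are never touched.
def select_for_drain(routes, drain_community):
    """routes: {prefix: set_of_communities}. Returns prefixes to drain."""
    return [prefix for prefix, comms in routes.items()
            if drain_community in comms]

routes = {
    "10.0.0.0/24": {"65000:100"},   # tagged for drain
    "10.0.1.0/24": {"65000:200"},   # wholesale, untouched
}
```

The real work is in the policy that applies the drain, but keeping selection community-driven is what makes it safe to script.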
Lab traffic never matched production entropy — we replayed sampled production headers in test.
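One way to get an unbiased header sample out of a production stream without storing it all is classic reservoir sampling (Algorithm R); a sketch, with the replay transport itself out of scope:

```python
import random

# Reservoir sampling (Algorithm R): keep a uniform random sample of k
# headers from a stream of unknown length, in O(k) memory.
def reservoir_sample(stream, k, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducible test runs
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # each item kept with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample
```

Replaying sampled real headers preserves the address/port entropy that synthetic generators flatten out.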
Vendor TAC recommended steps that assumed default queue settings — our QoS maps differed silently.
Optical layer validation caught a dirty connector a software health check swore was fine.
We rehearsed verbal comms scripts — silly until a misheard VLAN number nearly caused a loop.
Post-refresh we left old cards racked but powered for a week as cold spares — saved a weekend once.
Telemetry cardinality exploded after the upgrade — next time we'll budget time to tune exporters beforehand.
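A quick way to find where the explosion lives is to count distinct label-sets per metric before pointing the new exporters at production; a sketch (metric names and labels are illustrative):

```python
from collections import defaultdict

# Count distinct label combinations per metric name; the metrics with
# the biggest counts are the cardinality hot spots to tune first.
def cardinality_report(samples):
    """samples: iterable of (metric_name, labels_dict) pairs."""
    seen = defaultdict(set)
    for name, labels in samples:
        seen[name].add(tuple(sorted(labels.items())))
    return {name: len(label_sets) for name, label_sets in seen.items()}
```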
Some peers accepted graceful shutdown notices; others needed manual calls — relationship map helped.
We documented every knob changed from factory default — future you will thank present you.
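That documentation step is scriptable if you can dump both the running config and the factory defaults as key/value pairs; a minimal sketch (knob names are invented):

```python
# Diff running config against factory defaults: everything returned
# is a knob someone deliberately changed, so record it.
def nondefault_knobs(running, defaults):
    """Return {knob: value} for settings that differ from defaults."""
    return {k: v for k, v in running.items() if defaults.get(k) != v}

running = {"mtu": 9000, "bfd": "on"}
defaults = {"mtu": 1500, "bfd": "on"}
```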
Canary AS prepends let us measure latency shift before committing full weight.
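The measurement half of that canary is just comparing latency distributions before and after the prepend; a sketch using the median so single outlier probes don't dominate:

```python
from statistics import median

# Median latency shift (ms) after a canary AS-prepend: positive means
# the rerouted path is slower, so don't commit full weight yet.
def latency_shift(before_ms, after_ms):
    return median(after_ms) - median(before_ms)

# e.g. probes before: 10, 11, 12 ms; after prepend: 13, 14, 15 ms
```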
Power sequencing mistakes are embarrassingly common — checklist taped to the rack now.
Honest takeaway: parallel capacity plus boring monitoring beats clever single-window heroics.