You likely heard that Internet registrar GoDaddy.com had a six-hour network outage this week, and most early speculation pointed to a hacker or DDoS attack. However, CEO Scott Wagner stated the cause was actually “due to a corruption of router tables.” According to the post-mortem analysis, the issue that brought down the network appears closely tied to changes made to core devices such as routers and switches. It is another high-profile network problem that appears to be caused by the number one culprit of performance issues: network change.
While GoDaddy’s outage made worldwide headlines because of its size and the number of users affected, errors like this happen across the world every hour of every day for several key reasons:
- Manual change processes via CLI and/or custom scripts are time-consuming and error-prone
- Changes are made outside maintenance windows and aren’t properly reviewed
- Changes are not completely documented, and
- Overburdened staff overlook the change’s impact on other devices
Now, if your organization had unlimited time, people and router/switch expertise, these problems likely wouldn’t affect overall network health. But if you’re like the other 99.99% of IT teams, you are already overburdened with day-to-day issues and firefights, and you’re doing two, three or more things at once.
If your organization cannot risk a six-hour outage, or you’re still using CLI and Perl scripts for network changes, network automation can be a huge benefit. The power of automation is its ability to handle the tedious, time-consuming, unglamorous tasks that we know should be done continuously but lack the time, people or resources to do in our current environment. For example, with Infoblox’s NetMRI platform, users get:
- Automatic detection and notification of changes for better documentation and no surprises
- Change automation to reduce manual time and eliminate risk of human error
- Configuration analysis to detect poor configuration settings often before end users are impacted
- Rollback options to reduce the time of impact by going back to a previous state, and
- Network discovery, switch port management and compliance management.
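To make the idea of automatic change detection concrete, here is a minimal sketch of comparing a saved configuration snapshot against the current one. It is not how NetMRI works internally; the function name `detect_config_change`, the device name and the configuration lines are all hypothetical, and a real tool would also need to collect the configurations from the devices.

```python
import difflib


def detect_config_change(previous: str, current: str, device: str = "router1") -> list[str]:
    """Return unified-diff lines between two config snapshots (empty if unchanged)."""
    diff = difflib.unified_diff(
        previous.splitlines(),
        current.splitlines(),
        fromfile=f"{device}/previous",
        tofile=f"{device}/current",
        lineterm="",
    )
    return list(diff)


# Hypothetical snapshots: the saved baseline and the config currently running.
baseline = """hostname edge-rtr-01
ip route 0.0.0.0 0.0.0.0 192.0.2.1
"""
running = """hostname edge-rtr-01
ip route 0.0.0.0 0.0.0.0 192.0.2.254
"""

changes = detect_config_change(baseline, running, device="edge-rtr-01")
if changes:
    # In practice this would notify staff; the saved baseline is the rollback target.
    print("Change detected on edge-rtr-01:")
    print("\n".join(changes))
```

Run on a schedule against every device, a comparison like this documents each change as it happens and keeps the last known-good configuration on hand for rollback.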
We may never know if this problem was an inadvertent mistake, a rogue internal employee or an external hacker making unplanned changes to the network devices. However, with NetMRI’s automatic detection of changes, the IT staff would have had a leg up on identifying the problem – whether it was an accident or malicious activity. To learn more, go to http://www.infoblox.com/en/products/netmri.html