Why One Bug Caused Widespread Internet Chaos

Anybody familiar with the concept of the butterfly effect is likely to have some sympathy for the Amazon Web Services engineer who inadvertently caused an Internet meltdown on February 28, 2017.

The cloud infrastructure provider suffered an hours-long outage that crippled a significant portion of the Internet, stopping a number of high-profile websites and apps in their tracks. The outage was centered in the Northern Virginia region and left AWS unable to service requests for around five hours.

Although there was no evidence that this was linked to a malicious attack, people (unsurprisingly) wanted to know what was going on.

And it turns out that human error was to blame. To be more specific, a member of the Simple Storage Service (S3) team entered an incorrect input to a command while debugging an issue that was slowing down the S3 billing system.

Yes, a typo brought parts of the Internet to their knees.

According to Amazon Web Services, the debugging was only supposed to affect a limited number of servers. The incorrect command not only took out a larger set of servers than intended in the US-EAST-1 region (a huge location that is also Amazon’s oldest, ZDNet reported) but also removed servers that supported two other S3 subsystems: the index subsystem and the placement subsystem. Both of these subsystems required a full restart and safety checks that took “longer than expected.”

“At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process,” said AWS in its summary of the incident. “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”

AWS has published its full explanation of the outage.

Never Hurts To Plan For An Unexpected Event

The outage lasted only a few hours, but it highlights the ripples that can be created when the human factor comes into play.

AWS said that it builds its systems with the assumption that things might fail but admitted that it had not restarted either of these subsystems in its larger regions for years. S3 has experienced massive growth, which means that any outage or bug is always going to have a significant effect on the digital world.

To be fair to AWS, it has already made several changes to its operational practices as a result of this unforeseen event.

These changes include modifying the tool used to remove capacity so that it works more slowly and adding safeguards to ensure that other subsystems are not affected by an incorrect input. AWS has also begun auditing its other operational tools for similar safety checks, is working to speed up recovery time by splitting services into smaller partitions, or cells, and is reducing the dependency that the AWS Service Health Dashboard (which was also taken out by the incorrect command) has on S3.
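To make the idea of a safeguard against an incorrect input more concrete, here is a minimal Python sketch of a pre-flight check that a capacity-removal tool might run. It is an illustration only, not AWS's actual tooling: the `Subsystem` class, the `check_removal` function, the `min_required` floor and the `max_per_step` limit are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass


class UnsafeRemovalError(Exception):
    """Raised when a removal request would violate a capacity safeguard."""


@dataclass(frozen=True)
class Subsystem:
    # All fields are hypothetical; a real tool would read them from live inventory.
    name: str
    active_servers: int
    min_required: int  # assumed floor the subsystem needs to keep serving requests


def check_removal(subsystem: Subsystem, requested: int, max_per_step: int = 5) -> int:
    """Return the server count left after removal, or raise if the request is unsafe."""
    if requested <= 0:
        raise UnsafeRemovalError("requested count must be positive")
    # Cap how much capacity one command can remove, so a mistyped
    # number cannot take out a large set of servers in a single step.
    if requested > max_per_step:
        raise UnsafeRemovalError(
            f"{requested} servers exceeds the per-step limit of {max_per_step}"
        )
    # Refuse any removal that would push the subsystem below its floor.
    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise UnsafeRemovalError(
            f"{subsystem.name} would drop to {remaining} servers, "
            f"below its minimum of {subsystem.min_required}"
        )
    return remaining
```

With a check like this in front of the removal command, asking to remove 30 servers instead of 3 would be rejected before anything was taken offline, and the operator would have to repeat small, deliberate steps to remove more.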

AWS has apologized for the impact that this event had on its thousands of customers, and it acknowledged that one small bug caused a whole heap of problems. With that in mind, this widespread outage caused by human error should be a wake-up call for any company that tests for bugs only infrequently.

Dev & QA Trends

David Bolton

Former ARC Writer

Published: March 3, 2017

Reading Time: 3 min
