Lessons Learned from a Web Outage

By Carmen Carey

We hear more and more often about high-profile web outages and the impact they have on business. One of the most significant this year was the Amazon AWS outage, which disrupted a major portion of internet traffic when an update went awry. More typically, however, it’s the failure of businesses to properly monitor and test their websites that causes unplanned and unacceptable downtime. Other major outages we’ve seen this year hit social media platforms, including WhatsApp, Facebook, Twitter, and Reddit; major travel sites Ryanair and British Airways; and retail sites such as Nordstrom.

Many companies aren’t aware of how much traffic their sites can handle while still maintaining a positive user experience, how to prepare for busy periods, or how to respond to an unexpected surge in users. What all of these outages have in common is that they result in unhappy customers, reputation damage, and lost revenue.

Important Considerations
As any company that has suffered an outage will know, downtime can lead to significant consequences. To maintain a competitive edge and keep customers happy, organizations can learn from these high-profile web outages. Three lessons stand out: realize your limits, test early and often, and communicate proactively if you do have an outage.

Realize Your Limits
It’s important to determine what will cause your website or application to crash. How many concurrent users can it handle, and where are the breaking points? Website traffic is not going to be the same day in and day out: user behaviors change, and a new peak time can emerge. In many instances, websites crash because growth and peak behaviors have not been taken into consideration and acted upon. Monitor, track, and understand your traffic trends, and prepare accordingly.
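As one illustration of tracking peak behavior, the sketch below derives the highest number of simultaneous sessions from a list of session start and end times. It assumes you can export such pairs from your own analytics or access logs; the function name `peak_concurrency` and the input shape are hypothetical, not taken from any particular monitoring product.

```python
def peak_concurrency(sessions):
    """Return (peak_count, time_of_peak) for a list of (start, end) pairs.

    Start and end can be any comparable timestamps (numbers or datetimes).
    """
    events = []
    for start, end in sessions:
        events.append((start, 1))   # a session opens
        events.append((end, -1))    # a session closes

    # Sort by time; at identical times, closes (-1) are processed before
    # opens (+1) so back-to-back sessions are not counted as overlapping.
    events.sort(key=lambda event: (event[0], event[1]))

    current = 0
    peak = 0
    peak_time = None
    for timestamp, delta in events:
        current += delta
        if current > peak:
            peak, peak_time = current, timestamp
    return peak, peak_time


# Hypothetical sessions, expressed as minutes since midnight.
sample_sessions = [(0, 10), (5, 15), (10, 20), (12, 14)]
peak, when = peak_concurrency(sample_sessions)
```

Running this over each day's sessions and comparing the results week over week is one simple way to notice when a new peak time is emerging.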

Test Early and Often
Integrating performance testing into the software development lifecycle can help organizations develop higher-quality software in less time while reducing development cost. The longer you wait to conduct performance tests, the more expensive it becomes to incorporate changes. Check that you have accurate load predictions, dynamic distribution, and scaling in place to optimize infrastructure and the user experience, and validate that load can be moved around seamlessly without negative impact on the system.

Proactively Communicate if You Do Have an Outage
In the event of an outage, organizations must be transparent with customers and let them know what’s going on. Acknowledge the outage, explain the reason for it, apologize for the inconvenience, and, if possible, provide a realistic timescale for returning to business as usual. A key to success here is having an incident or status page that is hosted by a third party rather than by your own organization. That way you can keep communicating through an interface that should stay up during your own outage. Many vendors offer incident status pages as a service.

Learn from Mistakes
If you do suffer an outage, make sure you learn from it and put measures in place to keep it from happening again. One way to do that is to invest in proactive monitoring and testing services. But one size does not fit all. Monitoring and testing should be tailored to the needs of the business.

Three points to keep in mind when testing and monitoring your website are your audience and purpose, busy periods, and the unexpected.

Audience and Purpose
Every website has a specific audience and purpose. For example, a website that sells concert tickets and a narrowly focused blog attract very different volumes of users. The ticket website will need to cope with large bursts of traffic, whereas the blog will have much smaller peaks. Stress testing lets you see how much your site or application can handle before falling over and whether you need to increase your capacity. By comparing the absolute bottlenecks with typical traffic, you can stay on top of potential failures.
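A stress test of this kind can be sketched as a stepped ramp: raise the simulated user count until the error rate crosses a threshold, then compare the last healthy level with typical peak traffic. Everything here is an assumption for illustration: `run_step` stands in for whatever load tool you use, and the step size, the 1% error threshold, and the stand-in `fake_site` are hypothetical.

```python
def find_breaking_point(run_step, start=50, step=50, max_users=5000,
                        max_error_rate=0.01):
    """Ramp load in steps and return the highest user count that stayed healthy.

    run_step(users) must run one load step and return the observed error
    rate (0.0 to 1.0). Returns None if even the first step fails.
    """
    last_healthy = None
    users = start
    while users <= max_users:
        error_rate = run_step(users)
        if error_rate > max_error_rate:
            break  # the site has fallen over at this load level
        last_healthy = users
        users += step
    return last_healthy


def headroom(breaking_point, typical_peak):
    """Ratio of measured capacity to typical peak traffic."""
    if typical_peak <= 0:
        raise ValueError("typical_peak must be positive")
    return breaking_point / typical_peak


# Stand-in for a real load step: this pretend site degrades above 400 users.
def fake_site(users):
    return 0.0 if users <= 400 else 0.2


limit = find_breaking_point(fake_site)
ratio = headroom(limit, typical_peak=160)
```

A ratio close to 1 suggests that ordinary growth alone could reach the bottleneck. How much headroom is enough depends on the audience: a ticketing site would likely want far more than a small blog.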

Busy Periods
Some websites can predict their busy periods. Amazon and other internet retailers know they will be inundated on Black Friday and Cyber Monday. Networks and content providers expect peak traffic when a new episode of the latest hit show or a premiere event airs. Traffic on these types of sites therefore follows a known calendar. Concurrency testing allows businesses to see how many users their site can handle at any one time, so that when the site is needed most, it can continue to function at its best.
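As a rough illustration of a concurrency test, the sketch below fires a fixed number of simultaneous calls and summarizes failures and the slowest response. It uses only the Python standard library. `request_fn` is a hypothetical callable that you would replace with a real request against a test environment; here it is a stand-in that just sleeps.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_concurrent(request_fn, concurrent_users):
    """Call request_fn once per simulated user, all in parallel threads."""

    def timed_call(_):
        started = time.perf_counter()
        try:
            request_fn()
            ok = True
        except Exception:
            ok = False  # any exception counts as a failed request
        return ok, time.perf_counter() - started

    # One worker thread per simulated user so the calls overlap in time.
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        results = list(pool.map(timed_call, range(concurrent_users)))

    durations = [duration for _, duration in results]
    failures = sum(1 for ok, _ in results if not ok)
    return {
        "users": concurrent_users,
        "failures": failures,
        "slowest_seconds": max(durations),
    }


# Stand-in request: pretend each page load takes about 10 ms.
def fake_request():
    time.sleep(0.01)


summary = run_concurrent(fake_request, concurrent_users=20)
```

Running the same check at the user counts expected for a known busy date, well ahead of that date, leaves time to add capacity if failures or slow responses show up. Threads on a single machine only approximate real traffic, so a dedicated load testing tool remains the better fit for large volumes.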

The Unexpected
Unexpected surges in traffic are, by their nature, difficult to predict. Influencers and promoted material can seriously affect site and application demand. Recently, the FCC site went down when John Oliver asked viewers to visit its website in support of net neutrality. In another example, Instagram crashed when President Trump directed people to his Instagram feed. If a company feels its website could be particularly vulnerable to this sort of “flash traffic,” then sound dynamic scaling practices, a well-tuned disaster recovery process for the worst-case scenario, and performance testing for extreme peaks are all smart choices.

Testing and monitoring are most effective when they are tailored to the specific needs of a website or application. It is important to have the right tools in place to monitor website traffic and test your applications so any issues that arise can be proactively addressed in the development cycle and rectified quickly and effectively in production scenarios.

Carmen Carey is the CEO of Apica, an application performance company. Apica’s Load Test and Synthetic Monitoring solutions give the world’s leading brands confidence in their applications.

Nov 8, 2017 | Olivia Cahoon

