April 1st Service Disruption Postmortem
On Tuesday, April 1st, Pertino experienced a service disruption. From 3:40pm to 4:27pm PDT, almost all customer networks experienced some level of impact. Four percent of customers continued to have some network connectivity issues until 9:30pm PDT. I would like to apologize to our customers for any interruption to your daily operations that may have resulted from this event. We've invested a lot of time and money to ensure the high availability of our cloud networking service. However, in this case we fell short of both our customers' expectations and our own. For that, I am truly sorry.
Following is an unvarnished account of what happened and why, and what we’re doing to prevent it in the future.
When developing a cloud-based service, the common wisdom is to “embrace failure.” Specifically, cloud computing requires architecting for individual failures and unreliable connectivity. Pertino’s Cloud Network Engine (CNE) is designed to do exactly that, thanks to an engineering team that possesses hundreds of years of collective network and cloud expertise from industry-leading companies.
Pertino’s CNE consists of two primary tiers: the control plane and the data plane. The control plane is in charge of authentication, policies, traffic routing configuration, and monitoring, and is generally the “brains” of the service. The data plane is where customer virtual networks reside and network traffic flows.
Both tiers are architected to be highly available and reliable. The control plane is deployed across multiple cloud datacenters. It uses redundant services, stateless transaction processing, and transient state representation to scale horizontally. Any element of the control plane can crash with no effect on the overall service.
The data plane consists of virtual switches (v-switches) running in multiple cloud datacenters around the world, located near our customers. When an individual v-switch or even an entire cloud datacenter inevitably fails, the control plane normally migrates affected customer networks to an alternate healthy v-switch, and the whole process is imperceptible to users.
By “embracing failure” and leveraging cloud scale and economics, Pertino can deliver a highly available cloud networking service that is very cost effective for customers to deploy and scales linearly as their users, resources, and traffic grow.
On any given day, there are intermittent disruptions around the world that occur within the Internet and the cloud datacenters that underpin our CNE service. Pertino’s technology and talent work together seamlessly to ensure customers don’t feel a thing. On Tuesday, however, one such failure caused our CNE platform to respond in unexpected ways that resulted in a customer-affecting service disruption.
On the afternoon of April 1st, there was a network outage between a cluster of data plane v-switches and the control plane, which was located in a different cloud datacenter. The disruption was brief, but lasted long enough for the control plane to consider those v-switches at risk. So, the control plane began migrating customers to other v-switches.
However, due to a new bug in the data plane, too many messages were sent from the data plane v-switches to the control plane, increasingly loading it with requests. While the CNE control plane didn’t fall over, it did begin to slow down. The slowdown automatically rate limited the incoming control plane traffic. That rate limiting triggered the bug on more v-switches, making more requests from data plane to control plane, and thus more migrations, more control plane slowdown, and so on.
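The feedback loop described above can be illustrated with a toy model. This is only a sketch: the function name simulate_feedback and every number in it (capacity, baseline load, extra requests per affected v-switch, spread factor) are made-up assumptions chosen to show the shape of the loop, not measurements from the incident.

```python
def simulate_feedback(steps, capacity=100, base_load=60, initial_affected=5,
                      total_switches=50, extra_per_affected=10, spread_factor=0.5):
    """Toy model of the loop: affected v-switches add load, overload affects more.

    All parameters are illustrative assumptions, not values from the incident.
    Returns one (affected_switches, offered_load) tuple per step.
    """
    affected = initial_affected
    history = []
    for _ in range(steps):
        # Each affected v-switch sends extra requests on top of normal traffic.
        load = base_load + affected * extra_per_affected
        history.append((affected, load))
        # Requests beyond capacity get rate limited. We assume that rate
        # limiting trips the bug on a proportional number of further v-switches.
        overload = max(0, load - capacity)
        newly_affected = int(overload * spread_factor)
        affected = min(total_switches, affected + newly_affected)
    return history


# Five affected v-switches push load just past capacity, and it snowballs.
print(simulate_feedback(4))                      # [(5, 110), (10, 160), (40, 460), (50, 560)]
# Three affected v-switches stay under capacity, so nothing spreads.
print(simulate_feedback(3, initial_affected=3))  # [(3, 90), (3, 90), (3, 90)]
```

In this model the system is stable until load first crosses capacity. After that point, every step makes the next one worse until all v-switches are affected.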
Like a snowball rolling down a cartoon mountain, the load grew and grew, leading to cascading data plane failures across the entire CNE platform.
When we realized what was happening, we blocked even more traffic to the control plane so it could recover. Simultaneously, we performed a mass reset of the data plane. Within 30 seconds, 96% of customer device client connections were restored. The remaining 4% were recovered manually over the next few hours, as some v-switches had completely lost their ability to synchronize the state of customer device clients and had to be restarted.
What We Learned
While we can rate-limit traffic to the control plane, we need finer-grained control of that traffic. Specific classes of traffic must be guaranteed control plane capacity. In the case of a large traffic spike, the control plane must be able to shed or reschedule those events. Having the ability to selectively allocate control plane capacity will give us the most flexibility in dealing with future challenges.
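One way to picture guaranteed capacity per traffic class is a small admission gate that gives each class its own budget per time window and sheds anything beyond it. The sketch below is an assumption-laden illustration, not Pertino's implementation: the class name ClassAdmission and the traffic class names "auth" and "state_sync" are hypothetical.

```python
from collections import defaultdict


class ClassAdmission:
    """Per-class admission control over a fixed window (illustrative only)."""

    def __init__(self, reserved):
        # reserved maps a traffic class name to the number of requests
        # guaranteed to that class in each window.
        self.reserved = dict(reserved)
        self.used = defaultdict(int)

    def admit(self, traffic_class):
        """Return True if the request fits its class budget, else False (shed)."""
        budget = self.reserved.get(traffic_class, 0)
        if self.used[traffic_class] < budget:
            self.used[traffic_class] += 1
            return True
        return False

    def start_new_window(self):
        """Reset all counters; a real system would call this on a timer."""
        self.used.clear()


# Hypothetical classes: a burst of state-sync traffic cannot starve auth.
gate = ClassAdmission({"auth": 2, "state_sync": 1})
sync_results = [gate.admit("state_sync") for _ in range(3)]
auth_result = gate.admit("auth")
print(sync_results, auth_result)  # [True, False, False] True
```

Because each class draws only on its own budget, a spike in one class is shed without consuming the capacity reserved for the others. Rescheduling shed requests, rather than dropping them, would need a queue on top of this.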
And of course, we learned we had a bug synchronizing the state between data and control planes.
How We Are Fixing It
First and foremost, we are repairing the bug.
Second, we are adding a “traffic cop” to police the flow of data plane traffic to the control plane. We’ll expand the capacity of the control plane and use the “traffic cop” to route each class of traffic to its own dedicated capacity. The next versions of the traffic cop will allow us to enforce SLAs on all control plane services. The general nature of the traffic control will also allow us to bring more business logic and defensive capacity to the control plane.
Third, we are adding more negative testing to our release process. We currently have a “Chaos Monkey” test à la Netflix’s Simian Army. Our Chaos Monkey tests complete loss of the control plane. We need to expand that testing to cover a control plane that is present but running at diminished capacity.
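A test for a degraded control plane, as opposed to an absent one, might look roughly like the sketch below. Everything here is hypothetical: FakeControlPlane, DegradedControlPlane, and send_with_bounded_retries are stand-ins invented for illustration, and the degradation is modeled simply as accepting one request in every few.

```python
class FakeControlPlane:
    """Stand-in for the real control plane; counts requests it actually serves."""

    def __init__(self):
        self.requests_seen = 0

    def handle(self, request):
        self.requests_seen += 1
        return "ok"


class DegradedControlPlane:
    """Wraps a control plane and serves only one in every `accept_every` requests."""

    def __init__(self, inner, accept_every):
        self.inner = inner
        self.accept_every = accept_every
        self.attempts = 0

    def handle(self, request):
        self.attempts += 1
        if self.attempts % self.accept_every != 0:
            # Model a slow control plane as a timed-out request.
            raise TimeoutError("control plane is slow")
        return self.inner.handle(request)


def send_with_bounded_retries(control_plane, request, max_attempts):
    """Return the attempt number that succeeded, or None after giving up."""
    for attempt in range(1, max_attempts + 1):
        try:
            control_plane.handle(request)
            return attempt
        except TimeoutError:
            continue
    return None


def test_client_stays_bounded_under_degradation():
    degraded = DegradedControlPlane(FakeControlPlane(), accept_every=4)
    # The client gives up after 3 attempts instead of adding load forever.
    assert send_with_bounded_retries(degraded, "sync", max_attempts=3) is None
    assert degraded.attempts == 3
    # A later request lands on the accepted slot and succeeds at once.
    assert send_with_bounded_retries(degraded, "sync", max_attempts=3) == 1
    assert degraded.inner.requests_seen == 1


test_client_stays_bounded_under_degradation()
```

The property checked is that a client facing a slow control plane generates a bounded number of requests. Unbounded request growth is the behavior that amplified this incident.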
There will be additional items that come out of this effort and learning. I look forward to sharing them here with all of you.
Pertino is proud to be a vital service for our customers and we take that responsibility very seriously. I want to apologize once again for any impact our service disruption may have had on your operations. Our goal is to be an exemplar of quality service. Occasionally we run into problems like this one. We strive to be as candid as possible and use these issues to drive improvement. We are continuously endeavoring to solidify our service and earn the trust you have placed in us.
Please let us know if you have any questions or comments by sending them to firstname.lastname@example.org. Thank you for choosing Pertino as your cloud networking partner.
Ronn Brashear - VP Engineering