Expedia’s Journey toward Site Resiliency: Embracing Chaos Testing in Dev and Production at QCon SF

At QCon SF, Sahar Samiei and Willie Wheeler presented “Expedia’s Journey Toward Site Resiliency”, and discussed building a community of practice around resilience testing within Expedia. The results have generally been positive: Netflix’s Chaos Monkey has been running daily in production since May 15th; resilience tests have been added to four Tier 1 service pipelines; and there is greater organisational awareness of the value of building resilient services.

Samiei, senior product manager at Expedia, began the talk by stating that in 2016 Expedia was the 11th largest Internet company by revenue, at $8.77B. A “back of the envelope calculation” of revenue loss due to unplanned site unavailability shows that moving from 99% availability ($87.7M potential loss) to 99.9% availability ($8.77M potential loss) makes a difference of ~$80M:

Keeping [the Expedia site] up protects tens of millions of dollars of revenue per year.
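The back-of-the-envelope figures above can be reproduced with a few lines of Python. This is a minimal sketch that assumes revenue accrues evenly across the year, so that every unit of downtime costs the same; the function and variable names are illustrative and not from the talk.

```python
ANNUAL_REVENUE_USD = 8.77e9  # 2016 revenue figure quoted in the talk


def downtime_revenue_loss(annual_revenue: float, availability: float) -> float:
    """Revenue at risk per year, assuming revenue is spread evenly over time."""
    return annual_revenue * (1.0 - availability)


loss_two_nines = downtime_revenue_loss(ANNUAL_REVENUE_USD, 0.99)
loss_three_nines = downtime_revenue_loss(ANNUAL_REVENUE_USD, 0.999)
difference = loss_two_nines - loss_three_nines

print(f"99% availability:   ${loss_two_nines / 1e6:.2f}M at risk")
print(f"99.9% availability: ${loss_three_nines / 1e6:.2f}M at risk")
print(f"Difference:         ${difference / 1e6:.2f}M")
```

The difference works out to roughly $78.9M, which the talk rounds to ~$80M.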

Expedia has a “test-and-learn” culture, and innovation is about constantly iterating products and features. Resilience is not always treated as a first-class citizen: there are often too many competing priorities, there are major misconceptions about resilience, and team autonomy can make it challenging to diffuse learnings and tooling effectively.

To address these issues, Wheeler, principal application engineer at Expedia, discussed how a shared learning space was created within Expedia, which facilitated the sharing of information around resilience, and led to the creation of “resilience champions”. Much effort was made to collect and present baseline resiliency data, in order to allow teams to track improvements.

A large organisation such as Expedia has a plethora of tooling and platforms in use, and it can be a challenge to steer adoption. Wheeler discussed how a focus on core principles was more valuable than a focus on individual tooling, and shared how his team defined a “resilience engineering lifecycle”:

  1. Prioritise services that will benefit from improved resilience
  2. Investigate vulnerabilities
  3. Apply resilience patterns
  4. Conduct resilience experiments in test
  5. Conduct resilience experiments in production (increasingly referred to as “chaos testing”)

Services within Expedia are classified as Tier 1 (essential), Tier 2 (important) and Tier 3 (nice to have). Scorecards and reporting were used to share and highlight information around a service’s resilience, such as the number of incidents and current availability. This combination of tiered service classification and scorecard data enabled the prioritisation of resilience testing in order to get the biggest return on investment.
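As an illustration of how tier and scorecard data might drive prioritisation, the sketch below orders services by tier, then by incident count, then by availability. The ordering rule, the field names and the example services are all assumptions made for illustration; the talk did not describe Expedia’s actual scoring.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Scorecard:
    name: str
    tier: int            # 1 = essential, 2 = important, 3 = nice to have
    incidents: int       # incidents in the reporting period (hypothetical field)
    availability: float  # fraction of time available, e.g. 0.999


def prioritise(scorecards: List[Scorecard]) -> List[Scorecard]:
    """Most critical tier first, then most incidents, then lowest availability."""
    return sorted(scorecards, key=lambda s: (s.tier, -s.incidents, s.availability))


# Hypothetical services and numbers, for illustration only.
services = [
    Scorecard("search", tier=1, incidents=2, availability=0.9990),
    Scorecard("reviews", tier=3, incidents=9, availability=0.9800),
    Scorecard("checkout", tier=1, incidents=5, availability=0.9950),
    Scorecard("recommendations", tier=2, incidents=4, availability=0.9900),
]

ranked = [s.name for s in prioritise(services)]
print(ranked)
```

Under this rule a Tier 3 service with many incidents still ranks below every Tier 1 service, reflecting the idea that the biggest return comes from hardening essential services first.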

Resilience testing in dev, test and production

Vulnerabilities were investigated with interactive experiments (for example, using the Gremlin chaos testing toolset), and a service’s resilience was defined within a maturity model: survive instance loss; survive dependency loss; survive AZ loss; survive region loss; and so forth. Once the vulnerabilities were identified and understood, the team applied a series of resiliency patterns to address them:

  • Autoscaling
  • Rate limiting
  • Circuit-breaking – for example, protecting services with Netflix’s Hystrix
  • Bulkheads – as popularised in Michael Nygard’s book “Release It!”
  • Multi-geographic deployment – for example, multi-zone and multi-region
  • Database failover
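To make one of the patterns above concrete, here is a minimal sketch of a circuit breaker in Python. It is not Hystrix and does not reflect Expedia’s implementation; the class name, thresholds and the injectable clock are assumptions chosen to keep the example small and testable.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and reset the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller wraps each dependency call, for example `breaker.call(fetch_prices, hotel_id)`, where `fetch_prices` is a hypothetical function. Once the dependency has failed repeatedly, callers get an immediate `CircuitOpenError` instead of waiting on a service that is known to be down.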

Resilience experiments were conducted in test as an additional stage in the continuous delivery pipeline. Production experiments were conducted using Netflix’s Simian Army and Chaos Monkey. Due to Expedia’s core value of autonomy, and the resilience team wanting to champion improvements (and not simply break things), each service owner could “opt in” to a resilience testing whitelist. Each service exposed core health checks and metrics, and these were examined before, during, and after each attack.
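The opt-in and health-check flow described above could be sketched as follows. This is an assumption-laden outline rather than Expedia’s tooling: the function name and the `health_check`, `inject_failure` and `revert_failure` callables are hypothetical stand-ins for whatever a real chaos tool and monitoring stack provide.

```python
from typing import Callable, Dict, Set


def run_resilience_experiment(
    service: str,
    opted_in: Set[str],
    health_check: Callable[[str], bool],
    inject_failure: Callable[[str], None],
    revert_failure: Callable[[str], None],
) -> Dict[str, bool]:
    """Run one attack against an opted-in service, recording health at each phase."""
    if service not in opted_in:
        raise PermissionError(f"{service} has not opted in to resilience testing")

    results = {"pre": health_check(service)}
    if not results["pre"]:
        return results  # do not attack a service that is already unhealthy

    inject_failure(service)
    try:
        results["during"] = health_check(service)
    finally:
        revert_failure(service)  # always undo the attack, even if the check raises
    results["post"] = health_check(service)
    return results
```

Checking health before the attack gives a baseline and a safety gate, while the post-attack check confirms that the service recovered once the failure was removed.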

Anatomy of a resilience test

The results of resilience testing have generally been positive: Chaos Monkey has been running daily in production since May 15th; resilience tests have been added to four Tier 1 service pipelines; there has been an increase in organisational awareness; and a resilience community of practice has been established with 65+ active members. As for the challenges, engaging development teams has been a struggle due to limited team capacity, and the drive to improve Expedia’s products is currently still greater than the perceived need for improved resilience.

Samiei and Wheeler concluded the talk by noting that the focus of resilience engineering at Expedia for 2018 will be automation, specifically: service mesh/proxy-based resilience testing enablement (e.g. via Linkerd or Envoy); testing via service discovery; and increased observability. The primary goal is to reduce the cost of resilience engineering through automation.

The slides for Sahar Samiei’s and Willie Wheeler’s “Expedia’s Journey Toward Site Resiliency” (PPTX, 25MB) talk can be found on the QCon SF website. The video for this and all QCon SF talks will be made available over the coming months on InfoQ.