Chaos and Reliability: A Surprising Friendship in the Enterprise

Chaos Engineering is often characterized as “breaking things in production,” which lends it an air of something only feasible for technologically elite or sophisticated organizations. In practice, it’s been a key element in digital transformation from the ground up for a number of companies ranging from pre-streaming Netflix to those in highly regulated industries like healthcare, telecommunications, and financial services.


Many enterprises are grappling with application modernization at an ever-increasing scale, and leveraging chaos-informed experimentation as a facet of their SRE practices can help them get their arms around the complexity of their systems. Understanding the complexity of distributed systems is both foundational and critical to true observability. These practices lead naturally to clarity in metrics like SLOs, grounded in reality instead of guesswork.


In this talk, Troy Koss (Director of SRE at Capital One) joins Courtney Nash (researcher at Verica) to explore some of the myths of Chaos Engineering and how he's put it into practice at multiple enterprise companies to foster a culture focused on reliability. Join them to learn how un-chaotic adopting chaos engineering can be, and how effective it can be at accelerating your SRE journey. You might be surprised to find out how close you already are to getting started...


Troy Koss

Director, Site Reliability Engineering (SRE), Capital One


Courtney Nash

Senior Research Analyst, Verica

Transcript

00:00:12

Hi, I'm so excited that everyone from DevOps Enterprise Summit Virtual is joining us today. I'm Courtney Nash. I am a researcher at a company called Verica, and today I'm joined by Troy, who will introduce himself in a second, to talk about chaos engineering and reliability in the enterprise, two unexpected friends. And I'll let Troy introduce himself.

00:00:38

Yeah, I'm excited to be here. I'm currently at Capital One, where I lead SRE organizations. And we are excited to share with you the experiences that I've personally been through, and some of the myths that are part of chaos engineering, and hopefully get you comfortable with embarking on your own. Cool.

00:00:59

So this is when I do the share-screen part, and everybody gets to enjoy the awkward presentation stuff. Okay, that's us. We just talked about ourselves, so we don't have to hang out on that screen for very long. So, a couple of myths before we get started, because the very name is a bit much. Chaos engineering does not sound like something that most people want to get going with, much less in the enterprise, but really, if you think of it in terms of experimentation, then it becomes a much more approachable thing to be considering. And it's really practical experimentation that helps you get your arms around your systems. Most of us, I would assume, are here because in part we're building and maintaining and operating very complex systems with high business and production pressures, and no one person can get their arms around how that all works.

00:01:50

Within an organization that is, you know, decomposing a monolith into microservices and has upstream providers and cloud providers and all kinds of things, it's just too much complexity to be able to hold it all in your head. And so this process of experimentation tells you, ideally, sort of what the boundaries are: where's the cliff, or where are the cliffs, multiple cliffs? Are you driving at them at 90 miles per hour, or are you slowly wandering towards them? It's really hard to know, a lot of the time. And so the goal of this is to just get you comfortable with experimenting within those systems in safe ways. So, myth-wise, that name gives us a couple of things that we wanted to cover. And along the way, Troy and I discovered that we're both giant plant nerds, and like, you know, flowers and stuff like that.

00:02:37

So you get plant metaphors and maybe some goats. The first myth of chaos engineering is that it's some kind of mythical, advanced capability, Netflix, Amazon, big organizations, you've got to be at that scale to be able to do this. And it turns out it's actually the other way around, or rather, that's how they got there. Chaos engineering, the discipline, was born out of Netflix's transformation from the data center to the cloud. And I don't know if people remember this, but there was a period of time when Netflix's availability reputation was not good, and things were falling down a lot, and people were pretty mad. They couldn't watch their movies and their other things. And so folks at Netflix started hunting around, like, we've got to do something, how are we going to be able to do this?

00:03:26

And that process of figuring out how to experiment safely on their systems, and then expanding the size and the scale and the sophistication of those experiments, is why Netflix is Netflix now. And so it's often viewed in the wrong order. They really started there and got to the reliability they have now by experimenting on their systems, and other companies have started to do this as a part of digital transformation journeys, which is why we're here talking to you all today, including those even in highly regulated industries like healthcare and finance and banking. And sometimes when I say that to people, they don't believe me, which is why Troy is here, because he has done it, which is super cool. So he's going to talk a little bit about his experience with this first myth, and I'll let him take it away.

00:04:15

Yeah, well, you definitely don't have to be a master gardener. And one industry, too, that you didn't mention was telecom. I was at Verizon, in the telecommunications industry. We were growing rapidly, and still are growing as a company, and modernizing our applications, you know, the monolith-to-microservice architectures, the data-center-to-cloud journeys. And one of the ways we dealt with that complexity was grounding ourselves in a site reliability engineering program. It was something I was fortunate enough to be a part of kicking off and getting started at Verizon. And we really used it as a practice for us to try and change the way we think and ensure we have the reliability that we were known for as Verizon, America's most reliable network, right?

00:05:01

We needed the most reliable culture and the most reliable practices to embrace. And it's really that shift of moving from a reactive state, where we're on, and always on, and everyone's happy, to we're off, there's an incident, and how many incidents, and how fast do we get back up, and then it goes back to, okay, we're working, but really, how well are we working? That's the proactive shift in understanding our systems, getting ahead of that, and measuring proactively, like, how well are things going? And as we started embarking on that, if you can pull up the next slide real quick, we noticed that even that was a hard journey, like adopting SLOs alone and getting going. And we'll talk a little bit about that a little later.

00:05:44

But chaos engineering became kind of a quick and, dare I say, easy way for us to embark on this real system dependency understanding and comprehension. You probably find out that you don't know all of the edge cases and how your system works. So what better way to do that than to run verifications to see how things behave? When we were in the Kubernetes space and moving to containers and Kubernetes, the thing that we focused on was: how can we build a reliable, core cluster configuration for teams that meets our standards and needs? And are we actually doing what we think we're doing? Verifying, and running tests and verifications and hypotheses on that consistently. We found that, in some cases, with the pull time for your images, you'd expect a small image to pull faster than a large image.

00:06:39

And then when you run verifications to test that, you find that that's not the case, and you see that the pull times are sporadic and different. And then you find out that there are some network configurations happening, and you're going across different VPCs. So it's such a good learning experience, and it's pretty rich in that regard. There's also a space there for understanding whether you're secure and safe and reliable, as a part of reliability engineering, right? Looking at the images that you're deploying into your clusters and seeing if they're vulnerable, whether the vulnerability detection that you have in place is actually working: deploying known, controlled, vulnerable images into your clusters and seeing that you have the knobs turned right and the thresholds set. And oftentimes we found that we didn't, and that's okay. That's what we learned, and we got ahead of it, again in that whole shift of proactive versus reactive.
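To make that pull-time verification a bit more concrete, here is a minimal sketch in Python of the kind of experiment Troy describes, assuming a cluster reachable with kubectl; the image names, namespace, and timeout are hypothetical placeholders, not the tooling actually used at Verizon.

import json
import subprocess
import time

# Hypothesis: a small image should become Ready noticeably faster than a large one.
# Hypothetical images and namespace; swap in whatever your clusters actually run.
IMAGES = {"small": "busybox:1.36", "large": "tensorflow/tensorflow:latest"}
NAMESPACE = "chaos-verification"
READY_TIMEOUT = "120s"

def time_until_ready(name: str, image: str) -> float:
    """Start a pod with the given image and return seconds until it reports Ready."""
    start = time.monotonic()
    subprocess.run(
        ["kubectl", "run", name, f"--image={image}", "-n", NAMESPACE,
         "--restart=Never", "--command", "--", "sleep", "60"],
        check=True,
    )
    subprocess.run(
        ["kubectl", "wait", f"pod/{name}", "-n", NAMESPACE,
         "--for=condition=Ready", f"--timeout={READY_TIMEOUT}"],
        check=True,
    )
    elapsed = time.monotonic() - start
    subprocess.run(
        ["kubectl", "delete", "pod", name, "-n", NAMESPACE, "--wait=false"],
        check=True,
    )
    return elapsed

if __name__ == "__main__":
    results = {label: time_until_ready(f"pull-test-{label}", image)
               for label, image in IMAGES.items()}
    print(json.dumps(results, indent=2))
    # The interesting outcome is when this fails: sporadic pull times often point
    # at node placement, VPC routing, or registry configuration, as Troy found.
    assert results["small"] < results["large"], "hypothesis falsified; check the network path"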

00:07:33

And I want to ask you a question, which is, you mentioned you were running Kubernetes at Verizon. What would you say was, either individually, or team- or organization-wise, the sort of maturity with that particular technology at the time?

00:07:51

Yeah, I'd say it was probably at the more introductory, novice-to-intermediate level. Getting to that advanced level in containers, and Kubernetes in particular, orchestrating that, comes with a lot of time and experience and really working on the systems. And it's a skill set that's highly sought after, and people are evolving and growing into it. But we were definitely early on, and I think it's about understanding how things work, like what happens when a node goes down, does my application scale the way it's supposed to when we move to containers, did we get all the stateful things out, things like that, that you discover in moving. But yeah, to answer your question directly, we were pretty early on in the journey. And I think a lot of places are; everyone I've seen is pretty early on.

00:08:41

It turns out I've seen the same thing. I ran a small survey a few months back of about 50 organizations that we'd had some contact with, or had reached out to, that were running either Kubernetes or Kafka, trying to understand, again, the maturity with which people are dealing with this. And I was really surprised to find, first of all, that one of the biggest chunks of organizations we talked to were really big, you know, 10,000-person, basically enterprise types of organizations. We had a pretty good range of roles, but you see the folks that you'd expect to see in there. But the thing that really surprised me was how early on people were in their experience with these kinds of really complex technologies that they're using in full-scale production systems at 10,000-person companies.

00:09:35

And I was like, wow. So, some people are probably like, oh, terrifying, but I mean, that's just the state of the industry right now. We are trying to grapple with the complexity, and we're using tools that both help us do that and add to it at the same time. And so, hopefully I don't completely belabor this point to where people are sick and tired of it, but you just don't have to be a master gardener. Most people who are looking at doing this are really starting at it pretty early on, which I would argue is the better way to go. And speaking of going, or goats, that's what we get next. That's myth number two, which is that chaos engineering, as I just said, right, we have pretty complicated, sometimes chaotic systems, so why add more?

00:10:23

That seems like a terrible idea, don't do that. And so this point is really about how chaos engineering isn't about adding chaos; it's just seeing the chaos that's already in your systems. It's letting the goats run wild, but like in a pen where you can see them, they're in a pen. Okay, I'm done with this metaphor, I'm sorry, you all are probably really sick of it. So I will turn it over to Troy to talk about his experience with chaos in systems he worked on.

00:10:55

Yeah, yeah, definitely. And you know, that notion of, we're going to introduce more chaos, why do we need more chaos, we need less chaos. Really, you look at the definition and what we're trying to do: you're preventing the chaos, you're trying to get ahead of chaos, you're trying to get ahead of the unknown. And one of the methods and means, as a part of the SRE journey that many teams are on, or digital transformations, is the adoption of SLOs, not to be confused with SLAs, and having those as a consistent way to measure our systems and our services, to be able to know the bounds of what we can experiment with and what we can't, and whether we are meeting customer expectations. And, you know, we actually didn't even have formal SLOs at the time, when I was at Verizon and adopting chaos engineering.

00:11:44

And in fact, we used chaos engineering to help us look at SLOs and understand them a little more closely, running verifications to find out where things should be set in terms of thresholds. Should it be 200 milliseconds, 250? That's what everyone unfortunately tends to gravitate towards, nice even numbers. But maybe it's 186 or 187. That's one thing we discovered in doing that: what are the appropriate ones, finding SLOs that are set, as we were chatting about, grounded in reality, right? What does the SLO need to be, rather than just guesses, a SWAG to say the least. At Capital One here, we're developing a lot of tooling to put SLOs in place, agnostic tooling that can handle the ever-changing tool dilemma of whatever tool we're using today, the flavor of the week for APM. Building tooling so we have SLOs in place means we can start embarking on a lot of these things, like running chaos verifications and experiments, to understand our systems and learn from them. And having consistent measures in place ensures that the myth you just spoke about, about introducing more chaos, doesn't happen; in fact, we're within our bounds, we're within a safety net, a responsible place to be. It's pretty good to start adopting SLOs as a measure to help with that.
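As a rough illustration of what "grounding the SLO in reality rather than in a round number" can look like, here is a small Python sketch that picks a latency threshold from observed samples; the numbers are entirely synthetic, not Capital One's or Verizon's data.

import random
import statistics

# Pretend these are request latencies in milliseconds captured while a chaos
# verification injected a modest amount of latency downstream. Entirely synthetic.
random.seed(7)
latencies_ms = [random.gauss(mu=165, sigma=18) for _ in range(5000)]

# Instead of reaching for a nice round 200 ms, let the observed distribution
# suggest the threshold, for example the 99th percentile of today's behavior.
p99 = statistics.quantiles(latencies_ms, n=100)[98]
print(f"observed p99 under induced latency: {p99:.0f} ms")

# An SLO statement grounded in that measurement (maybe it lands at 186, not 200):
slo_threshold_ms = round(p99)
slo_target = 0.999  # 99.9% of requests should complete under the threshold
within = sum(1 for x in latencies_ms if x <= slo_threshold_ms) / len(latencies_ms)
print(f"proposed SLO: {slo_target:.1%} of requests <= {slo_threshold_ms} ms "
      f"(currently meeting {within:.2%})")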

00:13:10

Yeah. And I mean, I think the phrase you used when we first talked about this was that it allows you to experiment safely, which I really like. And so to that end, to begin experimenting, we'll stop with the myths and start with: what do you really need in place? Because I get asked this a lot, and a lot of the things you've explained to me in your journey, Troy, I feel really hit these particular points. So the first one is instrumentation, to be able to detect some sort of degradation, or lack thereof, in your system. And I think a lot of times this hints at the first myth, which is that people think they need to have really sophisticated observability systems or whatnot. Use what you've got. And even to that point, Troy just said that at Verizon they didn't have SLOs yet.

00:13:59

And so, you know, you don't even have to be at that level, and maybe a lot of organizations aren't necessarily there yet at having those set. So just use what you've got, whatever kind of logging, tracing, what have you, use that. And you'll refine it as you go; as Troy said, you'll find out what's working there and what's not working there as well. So for the next few prerequisites, we're going to do a little more of this back and forth. And on the second one, we get to have a little bit of a chat, and more goats, because really, who doesn't love goats? So beyond instrumentation, you need social awareness, which this one particular goat definitely lacks. It's really important to be explicit with everyone who might be involved in terms of what you're doing, to what end, the expectations, and the outcomes.

00:14:47

Chaos engineering sounds scary. If you're already bought in on this, great, but let's say you might not be; Troy wasn't. Then you're going to run into some resistance, like with any big change, but this one sounds particularly nerve-wracking. And so not telling people is sometimes tempting, right? Like, I'll just go run some experiments, and then it'll be great, and then I'll show people the results. Except it might not be great, because you don't know; that's the point, right? So you really do have to build the beginnings of people willing to go on this journey with you. And it's really easy to talk about that in the abstract and for me to be like, yay, you know, do this, but people are often like, no way, I don't get it, how? And so I really want to hear from you, Troy: I know you had your own personal trepidation about this, but then organizationally, how did you all actually take that first step, and what did that look like for you?

00:15:45

Yeah, definitely. And I echo the sentiment that you're giving off, which is, you don't want to be that goat that you're showing on the screen there, that bad one that's just nuking things and leaving craters. But there are a few things to keep in mind. One is, you don't have to start in production. Understand that that's an evolution you get to; you don't start there and just start doing things there, running your verifications and your hypotheses there. You want to keep your scope small, you want to keep it in a limited fashion. As I mentioned, we focused on the Kubernetes platform itself, the underlying infrastructure, and the orchestration of the clusters. That was a small-scope place where we had dependent parties involved, but it was smaller.

00:16:29

And we were able to articulate the blast radius and contain the blast radius as well. Run simulations, testing some of your hypotheses out with fake data and other systems, just to prove it out and understand it. And one thing that I do want to hit on is, while I say keep things small and contained, you definitely want to hit things that are effective, and Courtney will talk about that in a second here. Make sure that the work you are doing is something that's meaningful and that you're working on systems that matter. To get that buy-in and to get that value: the low-hanging fruit are fun and easy to get, and those are nice, but make sure that you address some of the things that matter most to the enterprise, things that have that value tied to them.
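One way to make "small scope, contained blast radius, people informed" concrete is to write the experiment down before running anything. A minimal Python sketch follows; the fields, names, and thresholds are illustrative, not a real chaos tool's schema or anything Troy's teams used.

# A sketch of how a first, deliberately small experiment might be written down
# before anyone runs anything. Field names and values are illustrative only.
first_experiment = {
    "title": "Killing one pod does not breach the checkout latency SLO",
    "environment": "staging",             # not production on day one
    "scope": {
        "namespace": "checkout-staging",  # hypothetical namespace
        "max_pods_affected": 1,           # blast radius: one pod at a time
    },
    "steady_state": "p99 latency < 250 ms and error rate < 0.1% for 10 minutes",
    "method": "delete one pod and observe rescheduling and latency",
    "abort_conditions": [
        "error rate > 1%",
        "any paging alert fires",
    ],
    "people_informed": ["checkout team", "platform on-call"],  # the buy-in part
}

print(first_experiment["title"])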

00:17:19

Yeah, that's the perfect segue, thank you, sir. So I wanted to take a minute to talk about hypotheses, because it's easy to throw these words around, you know, verifications or hypotheses or all these words, but there's some science to this, and the notion of experimentation is grounded in that longstanding tradition. And, in my opinion, the really key part is what that hypothesis is, right? So you have a control state, you have some perturbation you're going to introduce, and you have a hypothesis about what's going to happen. If your hypothesis is that broken things are going to break, then that doesn't really help you. The point of this is to understand your systems better; if you already understand that about your system, then you just spent a bunch of people's time confirming something you already knew.

00:18:09

I understand sometimes you want to do that, so you can get buy-in or budget or whatever to fix it. I can totally relate to that. You'll get there. But I'd say you'll get there by showing people things they didn't know about how their systems work. And so those hypotheses should be things that uphold expectations, because then, when those turn out not to be right, the light bulb goes off for people, right? So if you do have SLOs, you might have a hypothesis statement along the lines of: this service will meet X, Y, Z SLO even under conditions of high latency, like in the data layer, whatever. And that should be contextual to your business, to your customers; that should make sense, right? And then if it does, great, and if it doesn't, then you're really learning. That's why, if you have those SLOs, it's a great space to play, because SLOs are directly about those kinds of business-critical outcomes.
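For readers who like to see that spelled out, here is a small Python sketch of turning such a statement into something falsifiable; the service name, condition, and numbers are hypothetical examples, not anyone's actual SLO.

from dataclasses import dataclass

@dataclass
class Hypothesis:
    service: str
    condition: str        # the perturbation you introduce
    threshold_ms: float   # the latency bound the SLO promises
    target_ratio: float   # fraction of requests that must stay under the bound

    def holds(self, latencies_ms: list) -> bool:
        """True if the observed latencies still satisfy the SLO expectation."""
        ok = sum(1 for x in latencies_ms if x <= self.threshold_ms)
        return ok / len(latencies_ms) >= self.target_ratio

# Hypothetical example: "payment-api will meet its 250 ms / 99.9% SLO even with
# 100 ms of extra latency injected toward the data layer."
h = Hypothesis(service="payment-api",
               condition="+100 ms injected latency on database calls",
               threshold_ms=250.0,
               target_ratio=0.999)

# Feed in latencies observed while the condition was applied; a False result is
# the interesting one, because it is something you did not know about the system.
print(h.holds([180.0, 210.5, 190.2, 240.0, 260.3, 199.9]))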

00:19:06

Right. And so I think that's a really nice alignment: if you're playing around in SLO space, then you're really doing something, like Troy said, that's actually meaningful to the business. So have really well-formed hypotheses that are meaningful and contextually relevant. That also means, like I said, I want to go back to the bit about not breaking already-broken things, and instead finding things you didn't know about your system. Troy has a good story about that, so I'm going to pass it back over to him now.

00:19:40

Yeah, definitely. And just to reiterate one more time, because the third time's a charm: you don't need the SLOs to get started, and in fact, like I said, chaos engineering can help you get there. But they're definitely a good enabler, to Courtney's point. And another point that you made earlier, Courtney, was that you really don't need a lot. You have metrics, you have logs, you have alerts in place; teams are trying to adopt some sense of observability for their systems, respectively. But sometimes when you run the hypothesis, like I mentioned earlier about a vulnerable image, you're like, well, no, we know we stop all vulnerable images, and then you put a known vulnerable image out there and it actually deploys.

00:20:14

And you're like, actually, we didn't. So you find these things, and there's true value there, especially in the security domain too. But one thing that can also come as a byproduct of it is, you run your hypothesis, and you think that you have the necessary alerting and safeguards in place, the instrumentation that you've always had, like, you know, we have our alert policy and it will go off when bad things happen to our system. And when you start running chaos experiments and verifications, you soon learn that sometimes your alerts weren't set correctly and they don't go off like you think they do when things happen, as you're running these verifications. And it's a good thing, it's a good thing to find out that those things are out of place in the controlled environment where you're running these experiments, rather than when you actually have a production outage and you don't know you have a production outage, because the alerts don't go off, and then your MTTD and your MTTR become chaotic.

00:21:03

And then everyone's scrambling to get a resolution. Finding these things out in controlled environments is a super great place to be. You get two takeaways: A, you learn about how your system actually responds during that verification, whether it's, like you mentioned, injecting latency into your requests, or taking down nodes and seeing how things respond and how long it takes for applications to redeploy, et cetera. You find that out, and then you also find out that your alerts weren't good, so your observability, as a byproduct, becomes enhanced and enriched. So it all ties into that whole culture of reevaluating and constantly being able to assess your system. And, as I mentioned at the beginning, it's that shift from reactive to proactive, being able to get ahead of when that event happens that we are so fearful of. But yeah, I echo your remarks.
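A rough sketch, in Python, of the "did the alert actually fire" half of that verification, assuming a Prometheus server with an alert rule already defined; the URL, alert name, and timings are placeholders, not a description of any particular team's setup.

import json
import time
import urllib.request

# While the fault is applied, poll Prometheus and verify that the alert you
# believe covers this failure actually reaches the firing state.
# URL, alert name, and timings are placeholders for whatever your environment uses.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"
EXPECTED_ALERT = "CheckoutHighErrorRate"
DEADLINE_SECONDS = 300

def alert_is_firing(name: str) -> bool:
    with urllib.request.urlopen(f"{PROMETHEUS_URL}/api/v1/alerts", timeout=10) as resp:
        alerts = json.load(resp)["data"]["alerts"]
    return any(a["labels"].get("alertname") == name and a["state"] == "firing"
               for a in alerts)

deadline = time.monotonic() + DEADLINE_SECONDS
while time.monotonic() < deadline:
    if alert_is_firing(EXPECTED_ALERT):
        print("alert fired as expected; the safety net is verified")
        break
    time.sleep(15)
else:
    # The "good thing to find out in a controlled environment" case: the alert
    # never fired, which would have inflated MTTD and MTTR in a real outage.
    print(f"{EXPECTED_ALERT} did not fire within {DEADLINE_SECONDS}s; fix the alert rule")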

00:22:00

So this is that reactive-to-proactive dynamic that you're talking about, that shift. The last prerequisite for chaos engineering is being able to, well, ideally you're proactive, but at some point you've got to react to the experiments and the results of those. And this is the flip side of the coin of buy-in: the more work you put in upfront on the buy-in front, the more likely you are to have alignment to actually respond. This may sound obvious, but this can be where chaos engineering efforts die on the vine, because teams are busy, we have a lot of work to do, and the thing you did might actually have impacted some other team or some downstream thing, and now you've got to get those folks on board. So I feel like it's an almost obvious but critical prerequisite, and it's a big cultural change.

00:22:55

Which most of y'all here should be pretty familiar with trying to make happen. I think the thing that's really great about this one is, like Troy said, if you can start small and limit the scope and the blast radius and everything, you get a good virtuous cycle going, right, where people see the benefit of the experimentation, they put the changes in, and ultimately you move on to bigger and thornier things; that's really how that works. At that point, hopefully you've basically put that cultural infrastructure in place where everyone is actually excited about this stuff instead of terrified by it, and sees the benefit of that experimentation. So I like to refer to this as your cultural infrastructure. We like to talk about our cloud and other kinds of infrastructure a lot, but this one is also incredibly important. So nurture it, and don't forget that other people are going to have to get involved in the implementation side of it, and be prepared to help them do that. And those are our prerequisites and our myths. So I will hand it over to Troy to close things out, with some final thoughts on chaos engineering in the enterprise.

00:24:07

Yeah, definitely. And the whole cultural piece that you just hit on, again, that is what SRE is. So yes, there are the practices, and there's toil and terms and buzzwords and all the good things that come with it, but it's a culture, it's an approach. It's, how do we address these problems in a consistent way? And all of the things we just discussed about chaos engineering, the kinds of experiments you can run and all those different sorts of things, it's really just part of it. It's really part of SRE in my mind, at least how I've defined it. In that, you have to understand your systems, your vulnerabilities, have resilient architecture and all these patterns that we're following as our critical intents as a part of our SRE program here at Capital One, and chaos engineering is a part of that; it's one of those intents.

00:24:50

There are teams that are going to have different levels of maturity as you embark on your SRE journey, and you can think of it like a menu. I always joke about it that way, probably because I like food, but there's a large menu of items that you can dive into on your SRE journey, and you pick what actually works for the teams, and chaos engineering should be one of those, because there are different levels of maturity and there isn't one particular way to do it. But if you have set out the items on the menu, start early, and don't wait to understand your systems. You ultimately want to be providing a better experience for your customers, a more reliable experience, which is most important.

00:25:28

Well, thank you so much for joining me today, Troy, and thank you, everyone, for joining us and for putting up with goat jokes and plants. Not plant jokes, just plants; plants are great, goats are jerks. And that's it for us. I hope you enjoy the rest of the conference. Thanks.