Virtual US 2022

Why Does Capital One Test in Production?

In the IT industry, testing in production has always been considered an anti-pattern. However, at Capital One we have been successfully testing our critical, digital, customer-facing applications in production for over 18 months now! Why do we do it?



The answer is simple: the alternative to proactive chaos engineering is reactive crisis management.



Chaos engineering is not new to the industry or to Capital One. What's new is the scale of experiments being executed on a regular basis and the set of closely integrated software solutions we are utilizing to make it successful.


Bryan Pinos

Sr. Director, Software Engineering, Capital One


Yar Savchenko

Director, Software Engineering, Capital One

Transcript

00:00:06

Hi everyone, my name is Bryan Pinos. I'm a senior director of software engineering at Capital One. In my role, I'm responsible for keeping Capital One's banking and credit card services always on for our customers. I do that by enabling teams to proactively find latent defects in their applications and infrastructure through chaos experimentation, and we develop a suite of in-house tools that enable our teams to remediate problems in production through automation. With me today is Yar Savchenko. Yar, would you like to introduce yourself?

00:00:38

Definitely. Thank you, Bryan. Hello everybody. My name is Yar Savchenko, and I happen to work for the same company as Bryan, Capital One. I am a director of software engineering. I work in the same area as Bryan, and we provide site availability engineering and operational support for all of our critical applications. Not only do we do that, but we also try to find those latent defects that Bryan mentioned, and this presentation is all about how we're doing it.

00:01:11

So first, a little bit about Capital One and who we are. When most people think of Capital One, they either think of the Visigoths getting ready to rampage a credit card customer until they find out that they're a Capital One customer, or they think about Jennifer Garner asking what's in your wallet. What most people don't know, though, is that we're the first bank running entirely on the public cloud. We're pretty proud of that. We're a 25-year-old company, and to think about that, we started out in data centers as a traditional legacy bank, then we migrated to the cloud through a transformation that took several years, and now we're a hundred percent in the cloud. All along the way, we've been led by our founder, Rich Fairbank. We're also one of the top 10 banks and credit card issuers, and we're the second largest auto loan originator. We have a hundred million customers and 50,000 associates. Many of those associates are technology associates, and many of those are software engineers at Capital One. We don't say that we're just a bank; we like to say that we are a tech company that does banking. All right, thank you, Yar.

00:02:24

Thank you, Bryan. So now we get to the main portion of our presentation and, hopefully, the title that intrigued you all: Capital One tests in production, right? If you have been in tech for any amount of time, you know that testing in production is considered a bad word. It means that you did not fully test your code, you deployed it to production, something went horribly wrong, and you negatively impacted your customers. That is obviously something that every company out there is trying to avoid. However, at Capital One, we do test in production, and not only do we do it every once in a while, we do this on a regular basis, both planned and no-notice events. And you know what? We're not afraid to admit it. So the question that we get asked a lot is, why do we do it? Why do we test in production?

00:03:13

Well, to be totally honest with everybody attending the session, it's because maintaining consistency between a lower QA environment and production is extremely difficult. There are numerous reasons for that. Some of it comes down to cost: spending millions, sometimes hundreds of millions of dollars to make your QA environment just like production is not cost effective for a lot of organizations. In addition to that, there are so many changes happening in a lower environment at all times, so bringing some level of consistency to make sure that your code base is the same across the production environment and QA is almost impossible. There are ways you can enforce it, with a sim-prod environment and others, but it's extremely difficult and sometimes isn't worth the trouble. Going down the list, one of the things we deal with is how to generate the load that we see in production. If you have an application that consists of a number of microservices, and the customer can use some of those microservices depending on their transaction, how do we mimic this in a QA environment?

00:04:25

We can do that with some tools that are available on the market. We can develop our own tools, but the complexity of truly mimicking what a customer does with your application, regardless of whether it's a mobile or a web application, is very, very challenging. Again, just like mimicking your production environment in QA, it cannot be done absolutely, and the cost and the effort to do it sometimes just aren't justified. So what are we doing about it? Well, we're conducting chaos experiments on a regular basis, as I mentioned before, both planned and no notice. The way we're able to do this is by utilizing industry tools such as the AWS Fault Injection Simulator and others, as well as some internally developed Capital One tools, which allow us to test various failure scenarios in a controlled way in production. Along with those tools, we also have a program wrapped around it. One of our biggest programs, which started several years ago, is game days. As part of the game day program, we're able to consistently, on a monthly basis, conduct chaos exercises on our most critical applications in production. And let me tell you, we have been very successful so far.
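
To make the traffic-mix problem Yar describes concrete, here is a minimal sketch of a weighted load generator that replays a production-like mix of calls against a QA environment. The base URL, endpoint paths, weights, and rates are illustrative assumptions, not Capital One's actual services or tooling.

```python
# Minimal sketch: replay a production-like traffic mix against a QA environment.
# Base URL, endpoint paths, weights, and rates are illustrative assumptions.
import random
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "https://qa.example.internal"          # hypothetical QA endpoint
TRAFFIC_MIX = [                                   # (path, observed share of prod traffic)
    ("/accounts/balance", 0.55),
    ("/payments/submit", 0.25),
    ("/transactions/recent", 0.15),
    ("/profile/settings", 0.05),
]
REQUESTS_PER_SECOND = 20
DURATION_SECONDS = 60

def call(path: str) -> int:
    """Issue one request and return the HTTP status (0 on connection failure)."""
    try:
        with urllib.request.urlopen(BASE_URL + path, timeout=5) as resp:
            return resp.status
    except Exception:
        return 0

def run() -> None:
    paths = [p for p, _ in TRAFFIC_MIX]
    weights = [w for _, w in TRAFFIC_MIX]
    with ThreadPoolExecutor(max_workers=50) as pool:
        end = time.time() + DURATION_SECONDS
        while time.time() < end:
            # Sample this second's batch according to the observed production mix.
            batch = random.choices(paths, weights=weights, k=REQUESTS_PER_SECOND)
            statuses = [f.result() for f in [pool.submit(call, p) for p in batch]]
            errors = sum(1 for s in statuses if s == 0 or s >= 500)
            print(f"sent={len(statuses)} errors={errors}")
            time.sleep(1)

if __name__ == "__main__":
    run()
```

Even a simple generator like this surfaces the core difficulty Yar raises: the weights have to be derived from real production telemetry, and they drift as customer behavior changes.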

00:05:48

So how do we do this at scale? That's what this is really all about. If you have a simple application running on a few EC2 instances and a database, it's probably somewhat easy for the engineers to really understand the full dependency map and the complexity of it. But when we start scaling up from a few EC2 instances to thousands upon thousands of EC2 instances and an entire ecosystem of microservices that all depend on one another, it gets too complex for the engineers and the production support folks to really keep in their heads. So we look to tooling. As we instrumented and automated our tooling to implement chaos experiments, we looked at what was out in the marketplace and decided to leverage AWS's Fault Injection Simulator. In addition, we also leveraged AWS Systems Manager, but then we paired that with some of our internal tools, like Cloud Doctor, to manage our entire environment, to understand the complexity of our environment, and to tailor those tools to how we would use them in our environment.
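
As a rough illustration of what driving the AWS Fault Injection Simulator from code can look like, here is a minimal boto3 sketch that defines and starts a simple stop-EC2-instances experiment gated by a CloudWatch alarm. The role ARN, alarm ARN, tag values, and percentages are placeholders, and this is not Capital One's actual tooling.

```python
# Minimal sketch: define and start an AWS FIS experiment with boto3.
# Role ARN, alarm ARN, tag values, and percentages are placeholders.
import uuid
import boto3

fis = boto3.client("fis")

template = fis.create_experiment_template(
    clientToken=str(uuid.uuid4()),
    description="Stop a slice of tagged EC2 instances for 10 minutes",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",   # placeholder
    stopConditions=[{
        # Abort automatically if the guardrail alarm fires.
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:checkout-error-rate",
    }],
    targets={
        "TaggedInstances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"chaos-ready": "true"},   # placeholder tag
            "selectionMode": "PERCENT(25)",            # only a slice of production
        }
    },
    actions={
        "StopInstances": {
            "actionId": "aws:ec2:stop-instances",
            "parameters": {"startInstancesAfterDuration": "PT10M"},
            "targets": {"Instances": "TaggedInstances"},
        }
    },
)

experiment = fis.start_experiment(
    clientToken=str(uuid.uuid4()),
    experimentTemplateId=template["experimentTemplate"]["id"],
)
print("experiment id:", experiment["experiment"]["id"])
```

The internal tooling Bryan mentions would sit on top of building blocks like this, choosing targets from a dependency map rather than a hard-coded tag.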

00:06:59

And we do all of this to try to simulate app layer failures, right? Because when an app layer failure occurs naturally in the production environment, we always have to do a postmortem to figure out what happened, and customers are impacted. But if we can simulate app layer failures in a small percentage of the production environment, we can see what happens and then make the application more resilient against it. That's one aspect of our chaos experimentation. Another aspect is how we look at our infrastructure and our environments and how they fail. AWS will tell you in their best practices that you have to be resilient against failures, especially within a region, and that's why they have provided multiple availability zones. So at Capital One, we like to simulate availability zone failures to show that our applications, our APIs, and our microservices can withstand a single availability zone failure.

00:07:53

Is there enough capacity in the other availability zones to handle it? When a failure occurs, does the primary node of the database automatically move to one of the other availability zones? Do all the instances, or the containers, reconnect to it automatically? These are things that we like to simulate in a controlled environment so that we understand what could happen in the real environment in the middle of the night, when everybody's sleeping and can't react really quickly. We want it to automatically heal and ensure that we protected the customer's experience. Additionally, it doesn't happen often, but there have been times when there are regional failures, when an entire region of AWS might go down for one reason or another. For those occurrences, at Capital One we like to simulate what would happen and ensure that we have the capability to run out of a single region.
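
A crude way to approximate the availability zone failure Bryan describes is to stop every tagged instance in one zone and then confirm the database writer has moved out of that zone. The sketch below does that for EC2 plus an Aurora cluster using boto3; the tag, cluster identifier, and zone names are illustrative assumptions, and a real game day would use purpose-built tooling such as FIS with guardrails rather than a raw script.

```python
# Minimal sketch: crudely simulate an AZ failure by stopping tagged instances
# in one zone, then verify the Aurora writer moved out of that zone.
# Tag, cluster identifier, and zone names are illustrative assumptions.
import boto3

TARGET_AZ = "us-east-1a"
CLUSTER_ID = "customer-db-cluster"   # hypothetical Aurora cluster

ec2 = boto3.client("ec2")
rds = boto3.client("rds")

def stop_instances_in_az(az: str) -> list:
    """Stop all running, chaos-tagged instances in the target AZ."""
    instance_ids = []
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate(Filters=[
        {"Name": "availability-zone", "Values": [az]},
        {"Name": "instance-state-name", "Values": ["running"]},
        {"Name": "tag:chaos-ready", "Values": ["true"]},      # placeholder tag
    ]):
        for reservation in page["Reservations"]:
            instance_ids.extend(i["InstanceId"] for i in reservation["Instances"])
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
    return instance_ids

def current_writer() -> dict:
    """Return the Aurora cluster member currently acting as the writer."""
    cluster = rds.describe_db_clusters(DBClusterIdentifier=CLUSTER_ID)["DBClusters"][0]
    writer_id = next(m["DBInstanceIdentifier"]
                     for m in cluster["DBClusterMembers"] if m["IsClusterWriter"])
    instance = rds.describe_db_instances(DBInstanceIdentifier=writer_id)["DBInstances"][0]
    return {"id": writer_id, "az": instance["AvailabilityZone"]}

if __name__ == "__main__":
    stopped = stop_instances_in_az(TARGET_AZ)
    print(f"stopped {len(stopped)} instances in {TARGET_AZ}")
    writer = current_writer()
    print("writer:", writer)
    if writer["az"] == TARGET_AZ:
        print("writer is still in the impaired AZ -- failover did not happen")
```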

00:08:45

And so we set up large testing events to do this, and we use tooling and automation to simulate those types of events. There are some scenarios that we just can't actually instrument and simulate, and in those cases we host exercises internally where we talk about what we would do. This is a little old school, but in some cases it's still required. Let's say all of AWS went down: what would Capital One do? Those are the types of scenarios that we still have to think about. Looking at this in layers, we consider that the least technical. As we move backwards, we can go through and instrument and simulate different layers. And I think Yar mentioned it: we have game days. We also do specific chaos exercises that we leverage to test against specific architectural standards at Capital One.

00:09:36

And then we also do, like we were mentioning before, regional isolation, where we isolate all of our services in a single region and make sure that we can operate in that one region. But in order for all of this to work, you have to standardize deployments, and in order to standardize deployments, you have to embrace infrastructure as code. Without that, you end up having uniqueness, or things change because of the human aspect of going out and building them by hand. In order to understand the complexity of all the dependencies, the cloud, and the call flow, you have to invest in tooling. In a lot of cases some of this is available in the marketplace, and some of it you might have to build internally. And then lastly, you've got to get rid of manual intervention, and you do that through targeted exercises.

00:10:26

So with the exercises we can run, like the no-notice exercises Yar mentioned, we don't want to give notice that we're going to perform a chaos experiment. We want to just go in, do the experiment, and see how the application naturally reacted and self-healed. That's how you get there, so that your engineers aren't getting paged at two in the morning, or on Christmas Day, or on New Year's Day, and they're able to enjoy their time with their families, because the system is able to self-heal and continue to provide that customer experience.

00:11:03

Alright, now let's talk a little bit about benefits. I think both Bryan and I mentioned that it does take some time and investment to get to the point where you are consistently doing chaos exercises. So what are the benefits? Well, we'll use Capital One as a case study and talk about the benefits that we have realized. By conducting numerous chaos exercises, both planned and unplanned, we have identified latency issues. Latency is the bane of existence for anybody who has ever troubleshot a network problem. We can't beat the speed of light, so the more data you have to push through the wire, and the farther apart your data centers or regions are, the more latency you'll see. And in cases where you have a transaction that gets bounced multiple times between two separate data centers, or regions if you're in the cloud, you see a latency increase with each bounce back and forth.

00:12:01

Sometimes that can still be within acceptable parameters. However, if something changes, one of your components fails or doesn't have enough capacity, or the primary database that you're writing to moves from one data center to another, or one region to another, your latency could exceed the timeout threshold, and then you're starting to negatively impact your customers. Or it could get to the point where your timeout is set to 30 seconds, and no customer is going to wait 30 seconds for your application to load, because let's be honest, people are not willing to wait for something to load, right? We have been using Netflix, we have been using all of those tools that are available to us immediately at the click of a button. So by conducting these exercises, we have found a number of cases where moving components away from each other introduced latency.
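
To put numbers on the point about cross-region bounces, here is a minimal sketch of a latency budget check: given an assumed per-hop round trip and the number of times a transaction crosses regions, it flags call paths that blow past a timeout threshold. The hop counts, per-hop latencies, and thresholds are made-up illustrations, not measured Capital One figures.

```python
# Minimal sketch: flag call paths whose cross-region bounces exceed a timeout budget.
# Hop counts, per-hop latencies, and thresholds are made-up illustrations.
CROSS_REGION_RTT_MS = 70          # assumed round trip between two regions
TIMEOUT_BUDGET_MS = 2000          # assumed end-to-end timeout for the transaction

# (call path, local processing ms, number of cross-region bounces)
CALL_PATHS = [
    ("login -> profile -> balance", 450, 1),
    ("payment -> fraud-check -> ledger -> notification", 1800, 4),
    ("statement -> archive", 300, 0),
]

def total_latency_ms(processing_ms: int, bounces: int) -> int:
    """End-to-end latency: local work plus one round trip per cross-region bounce."""
    return processing_ms + bounces * CROSS_REGION_RTT_MS

for name, processing_ms, bounces in CALL_PATHS:
    total = total_latency_ms(processing_ms, bounces)
    headroom = TIMEOUT_BUDGET_MS - total
    status = "OK" if headroom > 0 else "EXCEEDS TIMEOUT"
    print(f"{name}: {total} ms (headroom {headroom} ms) {status}")
```

In this toy example the payment path only exceeds the budget once its components end up in different regions, which is exactly the kind of finding the exercises surface.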

00:12:55

We were able to resolve those, and we'll talk a little bit more about how we address them. Sometimes it means re-architecting the whole application, which does take a while. Sometimes it's a quick fix, sometimes it's moving one component, or maybe sizing it correctly. And speaking of sizing, I will move to the next most common finding we have had: capacity. And again, it doesn't matter if you are hosted in on-prem data centers, in the cloud, or hybrid, where you have some components in the cloud and some in your on-prem data centers. Being able to size your resources correctly is very, very important. The benefit of the cloud is that you can scale up pretty much without limit. AWS, Google, Microsoft, they all have almost unlimited resources, and you can scale up your servers to pretty much any reasonable number.

00:13:50

There is a cost to that, but taking cost out of the equation, sometimes you don't size your cloud resources correctly. So when the traffic shifts from one region to another, or from one data center to another, you just don't have enough computing power to process all of the requests, and you start dropping customer transactions, negatively impacting them. And again, that is something we're trying to avoid. So in conducting chaos exercises at Capital One, we have found a number of cases where our critical applications just weren't sized correctly for a spike in user access. Those spikes sometimes happen for reasons that we know and expect, such as when people get paychecks, and sometimes they happen for reasons that we don't expect or can't predict. The other aspect that you have to take into account is how your critical systems will perform under extreme load.

00:14:48

If you're running in two data centers and you're splitting your traffic 50/50, then your systems are running at 50 percent utilization. Well, what happens if one of those regions or data centers fails and the single remaining one has to service a hundred percent of the traffic? Are your gateway and your load balancer sized appropriately to handle that extreme load? Those are all great questions that you will hopefully answer by conducting chaos experiments. Now, whenever you conduct chaos experiments and you have findings, it's very important to act on them. And again, if you have been in the IT industry, you know nobody likes process, because process does add some red tape and slows things down. But wrapping a process around addressing the findings that come out of the chaos exercises is extremely important. At Capital One, most of the high severity findings, I wouldn't say all, have been resolved in 30 days or less. As I mentioned, there are some cases where whole applications have to be redesigned or rebuilt from scratch, and that does take months, sometimes years. However, if something is not sized correctly, that can be addressed fairly quickly, addressing those latent defects that we talked about and making your environment a lot more resilient.
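
As a back-of-the-envelope illustration of Yar's 50/50 sizing point, here is a small sketch that, given the number of active regions and a per-region capacity, reports whether the surviving regions can absorb the full load when one region is lost. The traffic and capacity figures are invented for illustration.

```python
# Minimal sketch: can the surviving regions absorb the full load if one region is lost?
# Traffic and capacity figures are invented for illustration.
PEAK_LOAD_TPS = 12000        # assumed total peak transactions per second
REGIONS = 2                  # active-active regions splitting traffic evenly
REGION_CAPACITY_TPS = 9000   # assumed max TPS a single region can serve

steady_state_per_region = PEAK_LOAD_TPS / REGIONS
utilization = steady_state_per_region / REGION_CAPACITY_TPS

# If one region fails, the remaining regions share the entire peak load.
failover_per_region = PEAK_LOAD_TPS / (REGIONS - 1)
failover_utilization = failover_per_region / REGION_CAPACITY_TPS

print(f"steady state: {steady_state_per_region:.0f} TPS/region "
      f"({utilization:.0%} utilization)")
print(f"after losing a region: {failover_per_region:.0f} TPS/region "
      f"({failover_utilization:.0%} utilization)")
if failover_utilization > 1:
    shortfall = failover_per_region - REGION_CAPACITY_TPS
    print(f"NOT ENOUGH CAPACITY: short by {shortfall:.0f} TPS per surviving region")
```

With these made-up numbers, a region that looks comfortably sized at 67 percent utilization in steady state is 33 percent over capacity after a failover, which is the latent defect a chaos exercise is designed to expose before a real outage does.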

00:16:03

So are there risks? Well, of course there are risks to testing in production, but does the value of those tests outweigh those risks? We think yes. There are things you've got to watch out for, though. Sometimes there are unexpected impacts. Yar talked a lot about latency, and latency sometimes shows up where you weren't thinking about how the call traffic bounces between regions, so you eliminate or move one component to another region and you see a bunch of latency. It could be that, or it could also be actual failures impacting customers: things aren't set up to work the way we thought they would, and we see failures. All these things can happen. We could even have a real incident at the same time we're doing a chaos test in production; that's possible too. So you have to be prepared, and I think the key to managing risk is mitigation.

00:17:00

And so we have some things that we've learned along the way that we think are important. One, you have to have a playbook upfront that defines what's agreed upon. Work with your business stakeholders, your product stakeholders, and your tech teams to really understand and say, okay, if this happens, if this number of customers are impacted, if things get this bad, we'll abort the test, and establish that upfront. That way it's not a line-of-scrimmage call that you're making in the heat of the moment; everybody understands the rules they're playing by. And ensure that whatever you do, you can undo in under five minutes. This ensures that you have complete control to eliminate the impact that you might be causing on your customers during a chaos experiment or game day event that's happening in production.

00:17:54

But really, the key is that you have to know there's a problem. So real-time monitoring of all your critical systems and transactions during and after the exercise is of the utmost importance. You have to understand what your steady state is. If you normally have a half-percent error rate, you need to know that before you start your test, so that when you start looking during your test, you're not asking, oh, there's a half-percent error rate, is this because of the test or is it always there? We need to understand that upfront. So understand what your steady state is before the test, what things look like during the test, and then confirm you return to steady state after the test. That's also very important, because if you inject latency somewhere into the call path and you don't verify after the test that the latency goes away, you could have a very big problem.
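
One way to turn those abort criteria and steady-state checks into something executable is a small guardrail loop: capture the baseline error rate before the experiment, watch it during the run, and stop the FIS experiment if it drifts past the agreed threshold. The sketch below uses boto3 with CloudWatch ALB metrics; the load balancer dimension, threshold, and experiment ID are placeholders, and this is only an illustration of the pattern, not Capital One's monitoring stack.

```python
# Minimal sketch: guardrail loop that aborts a FIS experiment if the error rate
# drifts too far from the pre-test baseline. Metric dimensions, threshold, and
# experiment ID are placeholders for illustration.
import time
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
fis = boto3.client("fis")

LOAD_BALANCER = "app/customer-web/abc123"   # placeholder ALB dimension value
EXPERIMENT_ID = "EXP1234567890abcdef"       # placeholder running experiment
ABORT_THRESHOLD = 0.02                      # agreed-upon abort criterion (2% errors)

def error_rate(minutes: int) -> float:
    """5XX responses divided by total requests over the last N minutes."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    def metric_sum(name: str) -> float:
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/ApplicationELB",
            MetricName=name,
            Dimensions=[{"Name": "LoadBalancer", "Value": LOAD_BALANCER}],
            StartTime=start, EndTime=end, Period=60, Statistics=["Sum"],
        )
        return sum(dp["Sum"] for dp in stats["Datapoints"])
    requests = metric_sum("RequestCount")
    errors = metric_sum("HTTPCode_Target_5XX_Count")
    return errors / requests if requests else 0.0

baseline = error_rate(minutes=30)           # steady state before the test
print(f"baseline error rate: {baseline:.3%}")

for _ in range(60):                         # check once a minute for an hour
    current = error_rate(minutes=5)
    print(f"current error rate: {current:.3%}")
    if current - baseline > ABORT_THRESHOLD:
        print("abort criterion hit -- stopping experiment")
        fis.stop_experiment(id=EXPERIMENT_ID)
        break
    time.sleep(60)
```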

00:18:48

Another thing that we've done to mitigate the risks associated with testing in production is, maybe the first time you do the test, you don't do it at your peak production load. Obviously, I hope you would do it in a non-prod environment first, although we know it doesn't match. But at least you can prove out the test, then take it to a production environment, and maybe start in the middle of the night. I know nobody wants to be up all night, but start in the middle of the night, prove it out, show how it might work, and then bring it more and more into production times. We've found that this also helps bring our business and our product stakeholders along, because if you go to them and say, hey, our peak time is X, Y, Z and I want to test at that time, they're going to look at you like you've lost your mind.

00:19:34

So bring them along, because it's not just the tech stakeholders you bring along, you also bring along your business and your product stakeholders. Lastly, for some events we like to do what we call no notice, and that is, we're going to go and break something and see if the automation fixes it, or the alerting works and the engineers hop on a bridge and fix it. So we're testing not just the technology, but also the people and the processes that go along with that technology. But for some tests you also want to have your dedicated and experienced SREs on standby and ready, because if something breaks, if something goes sideways in a way you didn't anticipate, you want to make sure you can fix it so that you're not impacting customers for a prolonged amount of time. That's obviously the whole point of this exercise: how do we protect the customer experience, and how do we ensure that we're always on when those customers need our services? If our testing introduces more problems for our customers than it helps us avoid in the future, then we're not really getting the value out of the testing that we need. So it's important that we do all we can to mitigate the risk to our customer experience along the way.

00:20:53

All right, so now we get to the what's-next portion of the presentation. We talked about some of the great things we have accomplished with chaos engineering and chaos exercises at Capital One, the benefits that you could realize from conducting these types of exercises, and some of the potential risks and mitigation techniques. But what do we plan to do next at Capital One? Because, again, you should never stop growing. The status quo is the worst thing that you can achieve in the IT industry, and we want to make sure that we can continue to push the envelope and test different scenarios. What we plan to do next year is to bring all our critical applications into the scope of chaos exercises. Right now we primarily test our customer-facing applications on the digital side, both the website and the mobile app.

00:21:45

However, we plan to expand that to other parts of our environment, such as call centers and IVR systems. Those have always been considered separate from the IT domains, but chaos could really be a beneficial tool to test the resiliency of those systems as well, to provide a better customer experience. And as Bryan mentioned, in addition to that, there are third party vendors. I would say most companies out there utilize third party vendors to some extent, and chaos will allow us to test and proactively identify gaps that we might have in utilizing those third party vendors. It could range from a connection point to a specific vendor, to what happens if the traffic you get from that vendor slows down, back to the latency that we brought up across several topics of this presentation. So being able to effectively test with third party vendors is very, very important to ensure that your environment is fully resilient.
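
One simple way to rehearse a slow third-party dependency, before purpose-built tooling is in place, is to wrap the vendor call with an injectable delay so you can watch how the calling service degrades against its own deadline. The sketch below is a hypothetical illustration; the vendor URL, delay, injection rate, and timeout are assumptions, not any real integration.

```python
# Minimal sketch: wrap a third-party call with injectable delay to rehearse a
# slow vendor. Vendor URL, delay, injection rate, and timeout are hypothetical.
import random
import time
import urllib.request
from urllib.error import URLError

VENDOR_URL = "https://vendor.example.com/score"   # hypothetical third party
INJECTED_DELAY_SECONDS = 3.0                      # chaos parameter
INJECTION_RATE = 0.5                              # fraction of calls slowed down
CALL_TIMEOUT_SECONDS = 2.0                        # what the calling service allows

def call_vendor_with_chaos() -> str:
    """Call the vendor, sometimes injecting extra latency, and check the deadline."""
    start = time.monotonic()
    if random.random() < INJECTION_RATE:
        time.sleep(INJECTED_DELAY_SECONDS)        # simulate a slow vendor link
    try:
        with urllib.request.urlopen(VENDOR_URL, timeout=CALL_TIMEOUT_SECONDS) as resp:
            status = resp.status
    except URLError:
        status = None
    elapsed = time.monotonic() - start
    if status is None or elapsed > CALL_TIMEOUT_SECONDS:
        # The interesting branch: does the caller degrade gracefully, serve a
        # cached answer, or fail the whole customer transaction?
        return f"degraded path after {elapsed:.1f}s -- falling back to cached decision"
    return f"vendor responded {status} in {elapsed:.1f}s"

if __name__ == "__main__":
    for _ in range(5):
        print(call_vendor_with_chaos())
```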

00:22:45

In addition to testing with third party vendors, we want to be able to test on the highest volume days with no advance notice. As Bryan mentioned, if you're just starting on a chaos journey, obviously don't do this as your first exercise. But at Capital One we are getting to the maturity level where we can do a chaos exercise on double payday Friday, one of our highest volume days, when our customers are logging in to check their account balances, pay their bills, and do a lot of various financial transactions. We want to introduce chaos on that day to see how resilient our environments are without letting anybody know, and that is the key, because when a real incident happens in production, usually nobody is aware that it's going to happen and there's not enough time to prepare. So doing this in a controlled environment is the key to building that muscle memory across all of the engineering teams.

00:23:41

In addition to that, we want to integrate what we call planned failures during the chaos experiments. There are a number of tools that engineers love using, for example Zoom, which is what we're recording this call on. In addition to that, there's Slack, Splunk, New Relic, a number of monitoring tools. Part of the experiment that we do when conducting game days is asking the engineers not to use a specific tool. That gives them a chance to ensure they're not tied to a single monitoring tool or a single communication tool, so that if that tool goes away during a production incident, they can easily switch to something else and continue to troubleshoot and resolve problems. What we are also working on is generation of fake HTTP error codes. I think we all know and love the 4xx errors.

00:24:32

You know, 404 is a common error, and you know the 500 error codes. The ability to inject those codes into the application stack will allow us to determine how downstream applications react, and do so in a very controlled manner, because we can control the number of errors we inject. The last thing is validating automated recovery techniques, and I think Bryan really hit on this key point: we want the system to self-heal. We want any incident that occurs in production to be self-resolved so the engineers don't have to get called in the middle of the night. Introducing chaos in our production environment allows us to test those recovery techniques and make sure that your triggers are set correctly, that you are moving traffic between regions or data centers at the appropriate time, and that you are not causing yourself more latency or more issues. Now, our end goal is to ensure that the game day exercises can be executed on the highest volume day, when we have the highest number of customers logging in, unannounced, when nobody outside of a single team is aware that they're happening, and with a single click. We don't want to make it very complicated. We don't want somebody to have to sit down and manage and operate it through multiple tools. We want a single tool with one click that allows you to start the experiment and, if something goes wrong, to roll back. That is our ultimate goal. Once we're able to achieve all that, we'll achieve a higher level of resiliency, and I think that is the goal for everybody attending this conference. With that said, I'm going to pass it on to Bryan for closing comments.
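
For the fake HTTP error code idea, a common pattern is a small piece of middleware that turns a configurable fraction of responses into 4xx or 5xx errors so downstream behavior can be observed at a controlled rate. Here is a minimal, self-contained WSGI sketch of that pattern; the injection rate, error codes, and sample app are illustrative and not Capital One's implementation.

```python
# Minimal sketch: WSGI middleware that converts a configurable fraction of
# responses into injected 4xx/5xx errors. Rates and codes are illustrative.
import random
from wsgiref.simple_server import make_server

INJECTION_RATE = 0.10                       # 10% of responses become errors
INJECTED_CODES = ["404 Not Found", "500 Internal Server Error",
                  "503 Service Unavailable"]

def app(environ, start_response):
    """Stand-in upstream application that normally answers 200."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]

def error_injecting(wrapped_app):
    """Wrap a WSGI app so a fraction of responses are replaced with errors."""
    def middleware(environ, start_response):
        if random.random() < INJECTION_RATE:
            status = random.choice(INJECTED_CODES)
            start_response(status, [("Content-Type", "text/plain")])
            return [b"injected fault"]
        return wrapped_app(environ, start_response)
    return middleware

if __name__ == "__main__":
    # Downstream callers hit this port and should handle the injected errors.
    with make_server("127.0.0.1", 8080, error_injecting(app)) as server:
        print("fault-injecting server on http://127.0.0.1:8080")
        server.serve_forever()
```

Because the injection rate is a single knob, the blast radius stays controlled, which is the property Yar emphasizes for running this kind of fault injection against production traffic.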

00:26:15

All right, well, thank you everybody for attending our presentation today. We hope that you found the information useful, and we look forward to answering your questions in the presentation Slack channel. Thanks.

00:26:27

Thank you.