How Many Nines Are Enough?

In this talk, Gremlin CEO Kolton Andrus shares insights from his years at Amazon and Netflix, and from now working with a wide array of customers across various disciplines and industries.


He'll describe what each level of availability looks like, the challenges faced at each stage, and the trade-offs required to achieve the next nine of uptime.


Kolton Andrus

CEO, Co-Founder, Gremlin

Transcript

00:00:08

Hello. My name is Kolton Andrus. I'm honored to be able to speak at the DevOps Enterprise Summit London. Today I'm going to talk to you about how many nines of availability is the right number for your team and your organization. I'd like to begin by giving public examples of why this is important. As we've seen the shift to COVID and online systems becoming paramount over the last few months, we've seen an increase in load and stress on these systems. And as a result, we've seen more outages related to them, whether it's Zoom, our ability to game online, our ability to make online trades, or our opportunity to interact with these businesses. Our online systems are paramount now, more so than ever. Why are we seeing more of this failure? Well, the answer is that a lot of the design decisions we've made over the last 10 to 15 years come at a cost.

00:01:10

We've prioritized decoupling our systems. We've prioritized speed of innovation and being able to enable teams to move quickly. But with this has come a cost in complexity. No longer can an architect hold the entire system in their head. Now there are many, many moving pieces, and they're changing often, if not daily or hourly. What this results in is what I lovingly refer to as the microservice death star. These are examples from amazon.com in 2009 and Netflix in 2012, and just looking at this picture illustrates the point. It's hard to comprehend it. It's hard to make sense of it. There's an inherent complexity in our systems now. This is the chaos that we have to deal with, the chaos that we're here to contain. The old-world approach of testing is no longer sufficient. We used to focus primarily on our code, unit testing and integration testing what we'd written, but in modern distributed systems, the dependencies that we take and their health have a big impact on whether a system behaves correctly.

00:02:21

The configuration needed to run a system within production is also paramount: the timeouts, the thread pools, the security groups, the auto-scaling. Our infrastructure is more ephemeral and can come and go at any point. And our people and our processes are critical to being able to operate our systems well. Many engineers are now finding themselves as SREs or on call and may not have had the experience and the opportunity to practice operating these systems at scale. What we're seeing is a trade-off: this ability to move quickly comes at the cost of reliability. And what we really want is to be able to move quickly and maintain that reliability. To do that, we must shift the curve. We can't simply take the same approach that we have before. It requires a new way of thinking about things, a new approach. From my time as an engineer on call at Amazon and Netflix, and my time building and operating systems such as these, the best answer that I've found is chaos engineering.

00:03:31

Now, chaos engineering means different things to different people. Some people believe it means we're going to randomly cause failures and see how the system and the people respond. And while there's truth to that, I think the definition that's best is around thoughtful, controlled experiments designed to reveal weakness in our system, akin to the scientific method: we're going to go out and test a hypothesis. And because we're here to prevent outages and to build reliability, we're going to be very thoughtful about how we cause those failures so we don't inadvertently make things worse. Now, some people feel like this is a little counterintuitive, and when I'm home for the holidays or hanging out with my family, the vaccine analogy has been one of the best ways to explain this. We're going to inject a little bit of harm into our systems, the same way that we might inject a little bit of harm into our bodies, but this is so our systems and the people that operate them have an opportunity to respond, to learn, and to build an immunity to those types of failures. Once we've seen a failure and we've mastered it, we'll be much better prepared to handle the next failure.

00:04:46

As I mentioned, we never want to cause a failure that results in an outage or customer pain as part of this. And so we're doing this by being thoughtful of what the blast radius is: what is the potential impact or side effect from an experiment that we're going to run? We always want to start with the smallest experiment that will teach us something and slowly grow it as we build trust and confidence in our systems. We might begin in development or staging by testing a single host or three hosts. And if we find a critical error, great, we've mitigated risk and we're able to fix it much earlier in the process. But if it behaves how we expect, we continue to scale until we've tested all of our staging environment. When we're comfortable there, we're going to move into production, but we're going to reset that blast radius back down to the smallest piece, a single device, a single user, and then grow it again. It's important that we're testing both at the small scale and the large scale. At the small scale, we might be catching a null pointer exception or a failure condition we hadn't tested for. At the large scale, we're testing how our system handles duress. Do we shed load as appropriate? Do we back off of downstream dependencies? Do we have enough capacity to handle that influx when things begin to slow down or degrade?
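To make that blast-radius progression concrete, here is a minimal sketch of a progressively scaled experiment. The inject_fault, measure_error_rate, and halt hooks, the step sizes, and the abort threshold are all assumptions for illustration, not any specific vendor's API.

```python
# Sketch of a progressively scaled chaos experiment: start small, observe,
# clean up, and only grow the blast radius if the system stays healthy.
import time

BLAST_RADIUS_STEPS = [1, 3, 10, 50]           # hosts per stage: start small, grow slowly

def healthy(error_rate: float, threshold: float = 0.01) -> bool:
    """Abort criterion: stop if customer-facing errors exceed 1%."""
    return error_rate < threshold

def run_experiment(inject_fault, measure_error_rate, halt) -> bool:
    for hosts in BLAST_RADIUS_STEPS:
        inject_fault(num_hosts=hosts)         # e.g., add latency or kill a process
        time.sleep(60)                        # observe for one minute
        rate = measure_error_rate()
        halt()                                # always clean up between stages
        if not healthy(rate):
            print(f"Stopping: error rate {rate:.2%} at {hosts} host(s)")
            return False                      # we learned something; fix it before scaling
        print(f"{hosts} host(s) OK, growing the blast radius")
    return True
```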

00:06:08

So I want to step back and give a little bit of context. As someone that served as a call leader or an incident commander for 10 years, I have a couple of tips and tricks about how to manage an incident and what's important when an incident occurs.

00:06:22

First, let's talk vocabulary and define some terms. First of all, we need a good metric to track customer health and behavior. At Netflix, we used stream starts per minute; at Amazon, we used orders per minute. This is some measure of whether customers can use our system. Now, what happens when bad things occur? Well, we can break an incident into a couple of pieces. How long does it take us to understand that there's a serious impact? That's the time to detection, and this usually takes a couple of moments, as we're comparing week-over-week data or we're waiting for a threshold to be hit. Once we've detected it, we're going to page and alert the people to respond and resolve this incident. This is the time to engagement: how long does it take for those people, from the time that they're paged, to get on the call and start working on the issue at hand? In my experience, this can range anywhere from a couple of minutes to 10 or 15 if someone is caught in an odd situation or is out.

00:07:27

One time, I was the call leader driving home on my motorcycle when I got paged, and I had to pull over on the side of the freeway and manage that incident from the shoulder. That time to engagement was a little longer than normal, but faster than had I continued my journey home and then joined. And then, once we're on the call and addressing it, how quickly can we resolve it? That's the time to resolution, and really it's time to mitigation: we're in triage mode. We may not fully correct the issue, but we want to restore service to our customers and ensure that the system is operating as well as possible. And then, once we fully resolve that failure, how long does it take until another failure occurs? Each of these is a metric and a measurement that is good to know and good to measure, because if we want to improve our reactive approach, if we want to improve how quickly we respond and fix an issue, each of these will need to be optimal. So, as I mentioned, we want to think about what these metrics are that really capture the customer experience and the value that our platform is providing, that give us a good signal of what is healthy. And we want to be able to tune our thresholds, our alerts, and our SLOs so that they meet those goals.
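As a quick illustration, here is a minimal sketch of computing those four measurements from incident timestamps. The field names and times are hypothetical; most incident-management tools record equivalents.

```python
# Compute time to detection, engagement, mitigation, and time between
# failures from the timestamps of a single (hypothetical) incident.
from datetime import datetime

incident = {
    "impact_start": datetime(2020, 6, 1, 2, 0),   # customer impact begins
    "detected":     datetime(2020, 6, 1, 2, 7),   # alert fires
    "engaged":      datetime(2020, 6, 1, 2, 15),  # responder on the call
    "mitigated":    datetime(2020, 6, 1, 2, 48),  # customer impact restored
}
previous_incident_mitigated = datetime(2020, 5, 18, 11, 30)

time_to_detection     = incident["detected"]  - incident["impact_start"]
time_to_engagement    = incident["engaged"]   - incident["detected"]
time_to_mitigation    = incident["mitigated"] - incident["engaged"]
time_between_failures = incident["impact_start"] - previous_incident_mitigated

print(time_to_detection, time_to_engagement, time_to_mitigation, time_between_failures)
```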

00:08:47

Now, maybe you're on call for the first time, or maybe you've been on call for 10 years, but I have a couple of tips and tricks for when you're on call, on that conference bridge, and managing an incident. First of all, I'm a big believer that you need one person with the authority to make decisions. This is a key part of being a call leader: the judgment to, in the face of a lack of information, piece together your best guess and make a good call. This is often debatable, but in the heat of the moment, when you're dealing with incomplete information, you need someone familiar with the context of the system and how it fits together to be able to guide the actions being taken. Along those lines, I'm a believer that we don't want to be changing many things at the same time.

00:09:33

We want to coordinate our efforts across our teams, because if we make a change and it fixes things, we want to know what we did. And if we're making three changes in parallel, we may be unsure which one actually resolved the issue. It's also good to ensure that the team is acting together. We don't want an individual off on their own making changes, potentially improving things, but potentially making them worse without the knowledge of the group as a whole. Now, when I first join these calls, for the first five minutes I'm giving a status update every 30 seconds, or every time two or three people join. It's important for them to be able to know what's going on, what actions we've taken, and what we would like them to do. Typically, when a service owner or someone else joins an incident call, the ask is: go look at your dashboards, go look at your service, and try to determine if you're a part of this or if we can exclude you. And it's important to know who's not involved, because these are people that have been woken up in the middle of the night or people that have been disrupted from their day jobs. And if they're not playing a part and we don't need them to do work, then we want to be able to excuse them to go back to sleep or back to their jobs so that they can focus on the other things at hand.

00:10:51

So let's talk about what the right number of nines is for you. The short answer is not everyone is Netflix, and there's a cost-benefit analysis to how much we invest and what we're able to achieve. So I want to provide a little context about what the world looks like at each of these high-level numbers of nines: two nines, three nines, four nines. What could we be doing to improve and get better, and what are some of the costs of that improvement? We'll start in the two nines world. In this world, we're having three to four days of outage over the course of a year. This is the floor, in my opinion. Things are failing, and they're failing often. Our customers likely have a perception that our service is broken or has some issues. We probably haven't invested in having the company focus on this effort, or even a team focused on this effort. So likely this is one person's job: one person that holds a lot of the tribal knowledge, that is responsible when things go wrong, and that steps up to help fix them.
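The downtime budgets behind those levels are just arithmetic. Here is a quick sketch that prints the allowed downtime per year for two through five nines; two nines works out to roughly 3.65 days, matching the three-to-four-days figure above.

```python
# Allowed downtime per year at each availability level ("number of nines").
HOURS_PER_YEAR = 365 * 24

for nines in range(2, 6):
    availability = 1 - 10 ** -nines              # 0.99, 0.999, 0.9999, 0.99999
    downtime_hours = HOURS_PER_YEAR * (1 - availability)
    print(f"{nines} nines ({availability:.5f}): "
          f"{downtime_hours:8.2f} hours/year (~{downtime_hours * 60:.0f} minutes)")
```

Running this shows about 87.6 hours for two nines, 8.8 hours for three nines, 53 minutes for four nines, and a little over 5 minutes for five nines, which is the budget referenced later in the talk.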

00:12:00

For those of you that are familiar with The Phoenix Project, and if not, I'd highly recommend it: this is Brent. Brent is the bottleneck here. He's the one who is reached out to when things go wrong. He's the one that is burdened with keeping the system up and alive, and it can be taxing and exhausting for a single person to carry this. In the two nines world, we may not yet have monitoring and alerting. We may have very basic logging. We probably have some unit and integration tests, but we haven't yet gotten into the world of a good deployment pipeline or more sophisticated tests. We probably don't have an incident management process; it's probably whatever folks feel is best to handle the issues at hand. And we may not have designed or built a lot of redundancy into our system. We may be running with a very bare-bones approach.

00:12:53

And so the good news is there are easy steps that we can take to improve. We need monitoring and alerting; the analogy I draw is to flying an aircraft without instruments. There's no way that we would do that. We need to understand how things are responding and how things are operating. Obviously there's a lot of good advice behind building deployment pipelines, making it easy to follow a set process, to iterate often, and to ship our code often. This will help us improve quality and catch issues earlier in the process. It's important for us to have an incident management program. Now, this may be very lightweight, but what is the process when things go wrong, how do we go about addressing it, and who's in charge? And it's important for us to have redundant capacity, whether it's at the zone level or the host level. If something goes wrong, and something will always go wrong, what is our backup plan?

00:13:54

Now, this is where I'm a believer that fire drills are an important piece to help train teams and prepare them. My on-call training in my career has sadly amounted to: here's a pager, you're smart, good luck, you'll figure it out. And I think that we as an industry can do much better about training this next generation of SREs and operations folks to know how to handle these incidents. There's a reason that many of us have grown up running fire drills, and that's because when a fire breaks out, we need people to not panic. We want them to respond safely, calmly, and thoughtfully. The same applies when we have a major outage in our system. It's a stressful event. You might have a VP or a C-level on the call. You know that customers are being impacted, and it's all hands on deck to fix it as quickly as possible. You may not know what action to take or exactly what's happened, and that puts a lot of stress on individuals. An opportunity to practice and prepare ahead of time allows us to role-play, to ask questions, and to build some comfort with an uncomfortable situation.

00:15:02

Now, from the chaos engineering world, there are a few things that we can take and apply here to make our lives better. We want to go understand whether our alerts and our monitoring have been set up and tuned correctly. Is there a lot of signal or is there a lot of noise? As someone who's operated these systems, I can tell you that if there are 300 alerts and they all make noise constantly, your engineers will quickly tune them out and stop listening. And so we need the smallest set of alerts and monitors that provide us insight and value without taxing us. And we need an opportunity to practice. We need to let people pretend that there's a real outage: they join a call, they get engaged, they look at their dashboards, they log into their hosts. There are a lot of little details there.

00:15:51

These are things that could go wrong and delay an incident response, and by going through them in advance, we can really prepare. And this type of investment doesn't take that much time. This could be a monthly exercise with our teams, or it could be a quarterly exercise where we get the whole company together. With just a few hours of investment and a little bit of tooling, we can really improve from that two nines world into a three nines world. And so what does the world look like when we arrive at three nines? Well, failures are happening less often. We've moved from days of failure to hours of failure over the course of a year. In this case, when failure does occur, it's likely our customers are annoyed, and if they have an option, they may choose to go to one of our competitors or another service if they're experiencing failure.

00:16:40

But overall, the system feels like it's working correctly, and most of the time people are able to have a good experience. We've moved from the world of this being one person's problem to where the teams owning critical services are now part and parcel of ensuring the system is operating well. We likely have a set of tier one services that are the pieces we know cannot fail, and a set of tier two or other services where failure is a little more tolerable. And at this point, we're beginning to capture the learnings. We have an opportunity to review our incidents, to talk about what we can do better, and to begin to share those learnings amongst our teams. So this is really where we've arrived at an SRE team or operations teams, and it's no longer just Brent's burden to figure it out and fix it.

00:17:35

So in this world, we likely have logging and monitoring, but it still might be noisy and scattered. And so this is the opportunity for us to come in and tune those thresholds, tune those alerts, and make sure they're actionable, so that when they go off, it's something we can act upon. We're building and deploying more often, and now we're seeing code changes coming across teams more frequently. So this is where it makes sense to start layering in things like canary deploys and failure testing: more in-depth insight in our pipeline that tells us if we've addressed past issues that have occurred and if our system is really ready to be deployed and run in production. In the three nines world, incident reviews might be happening, but we may not be doing a great job capturing those or sharing them. And this is really what helps inform a company's best practices and what the best approach is to improve their overall system.
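Returning to the canary deploys mentioned above, here is a rough illustration of the kind of gate a canary adds to a pipeline. The metric, tolerance, and function names are assumptions, not a particular tool's API.

```python
# Minimal sketch of a canary gate: compare the canary's error rate against the
# stable baseline before promoting a deploy. The 1.5x tolerance is illustrative.
def canary_passes(baseline_error_rate: float,
                  canary_error_rate: float,
                  tolerance: float = 1.5) -> bool:
    """Promote only if the canary is no worse than 1.5x the baseline."""
    if baseline_error_rate == 0.0:
        return canary_error_rate == 0.0      # a spotless baseline demands a spotless canary
    return canary_error_rate <= baseline_error_rate * tolerance

# Example: baseline at 0.2% errors, canary at 0.8% errors -> fail, roll back.
print(canary_passes(0.002, 0.008))           # False
```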

00:18:32

And so we can now begin to share those learnings and teach the rest of the company what we're learning when things go wrong, because failure is really an opportunity to learn and improve. In this world, we might have redundancy, but it may be at the zone level; maybe we haven't moved into regional redundancy. And there's an important aspect that comes with this. Once we begin to run in multiple regions, we've gained a safety mechanism: if things go wrong, we can shift traffic to another region. But with that safety mechanism comes a responsibility to test it and ensure that it behaves the way we expect. If we don't, we have what Adrian Cockcroft calls availability theater: we think that we're protected, but in reality we aren't. And so it's important to be able to exercise these types of failure modes often. At Netflix, we performed region evacuations every other week.

00:19:28

In the beginning, they were slow and they were painful, and they took a lot of time and a lot of teams' effort to get everyone together. But each time we got better and better, to the point that it became a five-minute automated process where only the core SRE teams needed to be involved. The other reason it's important to exercise these often is that the system changes. There might be a new scaling boundary in one of the regions that our service will hit if we fail traffic over to it. We may find that our proxy code, or the way in which we're shifting traffic, has a bug in it or has changed. And so by testing it on a regular basis, we can have the confidence that when we need it to save us, it will, instead of resulting in two outages at the same time.
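As a sense of what automating such a drill might involve, here is a hedged sketch of a gradual region evacuation. The traffic_client and error_rate arguments, the region names, the step sizes, and the abort threshold are hypothetical stand-ins for your actual routing layer and monitoring, not any specific vendor's API.

```python
# Sketch of a gradual region evacuation drill: shift traffic in steps,
# watch the target region's health, and roll back if it degrades.
import time

def evacuate_region(traffic_client, error_rate,
                    source="region-a", target="region-b"):
    for source_weight in (75, 50, 25, 0):            # shift in steps, never all at once
        traffic_client.set_weights({source: source_weight,
                                    target: 100 - source_weight})
        time.sleep(120)                              # give the target region time to scale up
        if error_rate(target) > 0.01:                # abort criterion: more than 1% errors
            traffic_client.set_weights({source: 100, target: 0})
            raise RuntimeError("drill aborted: target region unhealthy")
    print(f"{source} fully evacuated to {target}")
```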

00:20:17

There's one other piece in distributed systems that's key, and if there's one thing you take away from this talk, it's this: we have to go test the failures of our dependencies. We're building distributed systems, and as the adage goes, someone else's computer, out of my control, can cause a failure. The number of outages that follow this pattern is plentiful. And so, as a service, we want to go out and ask: what are my critical dependencies? What are my non-critical dependencies? And we want to go through and carefully fail each one of those. If I'm unable to load data from S3, can I continue operating my service? If I'm unable to reach my database or my internal identity service, do I have a fallback in a cookie or another mechanism that I can operate with? By going through and thinking through these scenarios and testing them, number one, we can sift out what's critical and what's non-critical. For the non-critical failures, we can ensure that we gracefully degrade and that they don't become a customer-facing issue. And for those that are critical, we can be aware of the rough edges and where we need to invest to improve.
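To illustrate the graceful-degradation half of that, here is a minimal sketch of a fallback around a non-critical dependency. The client, its fetch_recommendations call, and the default list are hypothetical; a dependency-failure experiment would add latency or errors to that call and verify the fallback path actually gets exercised.

```python
# Graceful degradation for a non-critical dependency: if the call fails or
# times out, serve a static default instead of surfacing the error.
import logging

DEFAULT_RECOMMENDATIONS = ["popular-title-1", "popular-title-2"]  # safe static fallback

def get_recommendations(user_id, client, timeout_s=0.2):
    """Call the (non-critical) recommendation service with a tight timeout."""
    try:
        return client.fetch_recommendations(user_id, timeout=timeout_s)
    except Exception as exc:                  # dependency down, slow, or erroring
        logging.warning("recommendations degraded for %s: %s", user_id, exc)
        return DEFAULT_RECOMMENDATIONS        # degrade gracefully instead of failing the request
```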

00:21:25

So, great, this helps us get into the four nines world. It's a sweet place; I've lived there for a time. It's less stressful, you're getting paged less often, and you're feeling better about the quality of your system. But in this world, when a failure occurs, what we're seeing is that a lot of the low-hanging fruit has been picked. And so when a failure occurs, it might be a nasty failure. It's often two or three different things going wrong at the same time. It takes more time to diagnose and to understand, and it can be a little trickier. In general, in this world, customers aren't noticing failure. We've moved from hours of failure to less than an hour of failure, and hopefully when these failures are occurring, they're brief and they last only moments at a time. If the system is able to self-heal and customers don't notice, great. We always want to ensure that customers have a great experience and that we're winning those moments of truth.

00:22:24

In this world, it's more than just the critical teams that are firefighting. Learnings and best practices are shared across the company. More and more teams have had an opportunity to prepare, to practice, and to understand what occurs, and by preparing and practicing upfront, they're getting paged less. So ultimately they're spending less time on operations work, because they're able to do it upfront and amortize that cost as opposed to paying for it when things go wrong. As a quick aside, an outage is more expensive than just the revenue you lose and the customers that are unhappy during the time of that outage. There are often dozens of engineers involved, and lots of work after the outage has happened to understand the contributing factors and all of the things that influenced that outage. We often need to meet as a group and discuss how we can improve, and from that will come a set of action items that we need to go fix to ensure that these failures won't occur again. All of that becomes a very time-intensive and expensive process, but like the part of an iceberg beneath the water, it's one we may not think about often as we're prioritizing features and customer-facing work. And so prudence says that by investing in this upfront, we're actually saving ourselves a lot of time and pain, in addition to the revenue loss and brand impact.

00:23:46

This is really what we're striving for: a culture of resilience, a culture of learning and sharing, of practicing, of acknowledging that failure occurs and helping ourselves through it, so that we realize that this is a team effort and that we can help each other improve, and as a result, build higher quality software and have a better customer experience. So in the four nines world, observability is ubiquitous. We have everything everywhere we need; we know what's occurring. And now we can start to layer in things like anomaly detection, better analysis, and predictive analytics, so that we can see failures coming earlier and more easily, and when they occur, we can act on them more quickly. We're doing unit tests, integration tests, performance tests, and failure tests as part of our pipeline. But how we deploy and roll out changes, whether it's software or data, becomes important. I've seen systems that do regional deployments for their software but global deployments for their data, and they have been bitten by seeing the results of a failure hit worldwide instead of being isolated to a region.

00:24:58

And so by being thoughtful about how we deploy things and how we roll them out slowly, we're able to learn more about our systems and provide a better experience. That canary experience, to me, is key. We can do artificial testing and synthetic testing all we want, but there's nothing, in my opinion, that's a real replacement for the diversity of customer traffic in production, the load that it provides, and the edge cases that we might hit. Now, in this world, we're doing game days, we're doing blameless postmortems, we're doing trainings. But one of the dangers, ironically, is that we become complacent. When failure is occurring less and less often, we may take our eye off the ball or begin to focus on other things, and this can have a pendulum effect where we feel really good and then we regress back into an earlier stage.

00:25:51

And so by being vigilant, by practicing, and by making sure we're thinking and talking about these issues that can occur, we can maintain that high level of availability and ensure we remain strong. Being region-redundant and running an active-active architecture may no longer be sufficient in this world. We may need multiple cloud providers or multiple infrastructure providers so that we can mitigate those black swan events, where one cloud provider has a major outage but we're able to continue operating on another cloud provider. While those events might be rare, they could be very impactful to our business.

00:26:30

And so this is where we're really stressing things at a deeper level. What happens when there's packet loss between our data centers? How do we handle peak traffic spikes? What happens when something key like DNS fails and we're no longer able to route customers or traffic within our systems? And so in this world, again, we're testing more frequently, but it's just part of the process. We don't need to invest a vast amount of time: if a team is spending an hour a week or an hour every other week thinking about this, practicing, and preparing, we can really save a lot of pain and time spent in outages and really improve that customer experience. And so this is the world we want to get to. Candidly, I haven't worked for a software company that's hit five nines yet, and I think the bar here is extremely high for us to hit.

00:27:22

But in this world, we're gracefully degrading and we're self-healing. The system is able to correct a lot of the issues without the intervention of people, and due to that, engagement time becomes paramount. We don't have five minutes; we have five minutes for the year, so we can't wait five minutes for one person to join a call. We need as many things as possible to be hidden from the customer and for the system to continue operating well in the face of that failure. What we're really striving for is being like a utility at this point. When you turn on the water, it works. When you flip the power switch, it works. And when you pull out your phone and you log into a browser to transfer money or to buy something, it just works. And that's what people are going to expect more and more as time goes on. To me, this is really just an efficient, quality engineering culture, where everyone is responsible for the availability, the performance, and the efficiency of their code.

00:28:20

Those are core engineering tenets that we're thoughtful of, and whenever we find opportunities to improve, we do. And we're sharing these learnings not just with our team, perhaps not just with the company, but with the community at large. We're able to talk about the failures that we've encountered and the lessons that we've learned, and really help lift up those around us by teaching them the hard-fought lessons we've learned at two in the morning, when the coffee hasn't kicked in and when the system has been in a state of duress. This is really the future, the world that we want to live in: the world where software just works. So I hope you found this useful and engaging. We'd love to have you come participate in our community so that you can learn more, so that you can share these stories, and so that we can help improve the reliability of the internet overall and all of our systems. Thank you very much.