Getting Back Up When You’ve Been Knocked Down: How We Turned Our Company’s Worst Outage into a Powerful Learning Opportunity (Las Vegas 2020)

In February 2019, we experienced the worst outage in company history. This outage was the result of a complex system failure that pushed us beyond the limits of our current response systems, processes and culture. In the face of this adversity, we were able to find opportunity in failure. This resulted in improving how we understand incidents, respond to them, and prevent them in the first place. Now, we have a stronger organizational ability to perform incident management, we’ve reinforced and broadened culture norms around safety, and most impactfully, we have implemented an incident management system that has changed how we run outage calls. Despite how it can feel at the time, failure does not have to be permanently catastrophic. It’s how you respond to that failure that will ultimately shape your organization. Learn how you can apply what we’ve learned to make the most of outages and drive improvement within your organization.


Erica Morrison

Vice President, Software Engineering, CSG

TRANSCRIPT

00:00:07

Welcome back from the break. I hope you found the networking time valuable and that the formats we created were helpful in creating useful and meaningful interactions. If you have any feedback, please put it in the channel you attended, such as birds of a feather, lean coffee, or snack club, or just put it in the general channel. One of my favorite quotes in The Phoenix Project is "Feedback is love." The opposite of love isn't hate, it's apathy. All right. So the first speaker we have this afternoon is Erica Morrison, who was recently promoted to VP of software engineering at CSG. She has presented at DevOps Enterprise four times, and she is someone whose achievements and abilities I genuinely admire. I suspect you have not heard a presentation quite like this before. And if you're like me, you will be blown away by this presentation. When she first told this story to a group of us last year, you could've heard a pin drop. It was so riveting and heart-wrenching, and it lays bare problems that almost all of us have faced in our careers. She provides lessons learned and teachings that will get the attention of anyone who has had to fix production incidents under extreme pressure. I trust that after watching this presentation, you will feel compelled to explore how incident command might help your own organization. Here's Erica.

00:01:35

Hi everyone. I'm here today to talk about getting back up when you've been knocked down and how we turned our company's worst outage into a powerful learning opportunity. Briefly, before I get into that, a quick background on CSG. We're North America's largest SaaS-based customer care and billing provider. We do work in the revenue management and digital monetization space, supporting customers such as those you see on the top of the slide here. We support over 65 million subscribers with a tech stack that really runs the gamut, everything from JavaScript to mainframe. We've been fortunate enough to get to share our DevOps journey over the last several years at DevOps Enterprise Summit. In 2015, Scott Prugh and I talked about how we were reducing batch sizes and applying agile and lean. In 2016, we underwent a major organizational transformation where we brought a development organization and an ops organization together and put developers and operational engineers on the same teams.

00:02:33

In 2017, we talked about spreading culture, investing in engineering and shifting ops left. And then in 2018, I presented with Joe Wilson. We talked about adding more automation and shifting security left. And then finally, in 2019, Scott shared our story on continuing product modernization. So let's talk about the outage. I want to walk you through our outage story today. This story started on February 4th, 2019, in what became our company's worst outage. If you say 2/4, as this outage came to be known internally, people know exactly what you're talking about. So we wanted to respond to this incident differently. We took a number of steps, and I will detail each of those with you today. They included incident analysis, a rollout of the incident management system, and a number of other operational improvements, which resulted in a lot of learning for our organization. And we also learned that, despite how it feels at the time, failure like this doesn't have to be permanently catastrophic. It's really how you respond to that failure that will ultimately shape your organization.

00:03:39

So as for the outage itself, it ended up being 13 hours in duration when all was said and done. It started abruptly with little to no warning, and large portions of our product were unavailable during this entire time. I remember getting paged in the middle of the night, and as the early troubleshooting started, I remember thinking we still have a couple hours until the start of business on the U.S. East Coast, which is really when traffic starts to substantially ramp up and the pain to our customers increases. Little did I know that not only would we not have service restored by the start of business, we would struggle to get restored by the end of the business day. Troubleshooting was particularly interesting on this call. We were largely troubleshooting blind. We had problems accessing the tools that we would normally use to troubleshoot this sort of issue.

00:04:27

Things like our system health monitoring information and server access, all of those were hampered by the exact same issue that was affecting our production services. Every outage call tends to be a little bit chaotic. This one was particularly so with the number of vendors and customers involved. At one point we had six different bridges going on trying to resolve this issue. And then as the day went on, we would come up with different theories. We would have to work really hard to implement them because of all the tool access problems. And then we would actually see a little bit of relief. Things would look a little bit better for a few minutes, only to have them start really not working again. And so you'd get your hopes up each time and then you'd have them crushed each time. And as the hours started to pass, this just really started to result in a feeling of helplessness.

00:05:18

So as we went through the day, we started taking more and more drastic action. Obviously we were able to eventually resolve this. We did this by killing VLAN by VLAN. And when we killed one particular VLAN, pretty much instantaneously traffic patterns started looking normal, and we knew we were onto something. It would actually take us a couple days to understand it. We were very fortunate that we were able to reproduce this in a lab, and that allowed us to understand what had actually happened here. This all started with some routine server maintenance on an OS that's a different OS type than most of the servers that we run. When that server rebooted, it put an LLDP packet out on the network, and then, due to a bug, our network software picked this up and interpreted it as spanning tree. And so it broadcast this out to the network, and then it was picked up by our load balancer, and due to a misconfiguration on our load balancer,

00:06:12

this got rebroadcast back to the network, basically creating a network loop, creating a network storm and taking our network down. So we would learn later that this was a great example of complex system failure. We had multiple failures in the system that had to happen. We had latent failures. In fact, these configurations had been in the system for months. And then the failures were changing throughout the day. And just to give you an idea of some of the chaos and the challenges with the troubleshooting, we had actually looked at this particular maintenance and we had said, hey, this timing sure seems coincidental with when this outage started. But when we troubleshot it and looked into it, we said, you know what, no, it's a victim of what's going on in the network, and it's not the cause of that. It would take this reproduction later in the lab for us to fully understand this.

00:07:04

So the aftermath of this issue was quite severe. You know, in the heat of battle, everyone really pulled together, but after the dust settled, things weren't so pretty. First of all, we had very angry customers, as you can imagine. We had damaged their business, which as a result damaged our company reputation. And so there were many onsite meetings, many conference calls, emails, write-ups, et cetera. This required quite a leadership focus, pivoting from what they were doing, things like strategic initiatives, operational improvements, et cetera, and instead focusing largely on this outage. CSG takes great pride in the services that we provide our customers. So to have failed them in such grand fashion really led to a sense of loss and heartbreak. You could feel it walking through the halls, going to meetings, just the sense of absolutely crushed morale.

00:07:57

So because of this, as you can imagine, there were a lot of open wounds and strong emotions. You know, hurtful things were said, things like "DevOps doesn't work." So within this backdrop, we knew we wanted to respond to this incident differently. Really, a terrible thing had happened. We wanted to maximize our learnings from that and also figure out how we could reduce the likelihood of something like this occurring again. So the first step that we took here was around incident analysis. Many of you are probably familiar, but just some brief context: incident analysis is a structured process to understand what happened in an incident and identify opportunities for improvement. Key components here include looking at the timeline and asking a series of questions, things like: What happened? How can we detect this sooner? How can we recover sooner? What went well? What didn't go so well? Oftentimes we get a better understanding of our system behavior as a result of this.

00:08:54

And then all of this is in a blameless culture. We avoid finger pointing. We avoid things like "human error" and "try harder." So incident analysis was already part of CSG culture prior to 2/4. We did incident analysis on all of our major incidents. However, in this case, we really wanted to up our game. So we engaged with some experts. We reached out to Dr. Richard Cook and John Allspaw at Adaptive Capacity Labs. They came in and did two weeks of intense interviews and research. They spent a lot of time investigating overall what had happened with this outage, and this led to a more thorough understanding of events and different perspectives. So it was really eye-opening to me, as someone who had been on the front lines for pretty much the entire duration of the outage itself and the aftermath, the different things that they learned that I didn't know.

00:09:47

And then also just these different perspectives: people who had sat on that same outage bridge walked away with a different understanding of what had happened. And so I mentioned, we learned about complex system failure. With this, we learned that those different perspectives are actually not unusual, and we're probably not going to change the perspectives on this particular issue, but we can change them going forward, and we can effect cultural change. We also walked away with a better understanding of our incident response state and a series of recommendations of changes to implement.

00:10:24

So out of this, we wanted to take these recommendations and create a structure to make sure that they got prioritized and implemented. So we created an operational improvements program, and we basically bucketed the work into four categories: incident response, tool reliability, data center/platform resiliency, and application reliability. I want to focus for a while on the incident response piece here, because this has been so impactful to CSG. I want to give an overview of the incident management system and then talk about the rollout at CSG and the impacts that this has had for us. So again, I'm not going to go into a ton of detail on the incident management system. There are some great presentations out there. In fact, Brent Chapman presented at DevOps Enterprise Summit on this topic a couple of years ago. There's also lots of free material out on the internet. But just to give an overview:

00:11:19

It has been the national standard for managing all-hazard and all-risk incidents in the U.S. for the last 40 years. So if you want to talk about high-pressure, highly chaotic situations, things like terrorist attacks and forest fires, this is a system used to manage those. It also has a number of key components. First of all, we have a clear set of established roles. We talk about peacetime and wartime. Your peacetime title does not matter. An outage call is wartime, and you're going to fulfill the role that you're playing on that bridge. So, first of all, we've got the incident commander. This is the person, the boss. They're making sure you're following the incident management system. They're directing traffic. They're the decision-making authority. You've also got the scribe. This is the person that is taking notes and basically keeping a timeline of what's going on.

00:12:07

You also have the LNO, or liaison officer. This person makes sure that your key stakeholders are getting updated with the information they need. They might jump to a separate bridge to talk to some customers, for instance. And then finally you have one or more subject matter experts. Another key component here is clear communication that happens on a regular cadence and with a pre-agreed-upon format. You also have a set of expected behaviors for participants, common terminology, and then finally management by objectives. The call is only focused on restoring service. We're not going to talk about things like root cause or what we're going to do tomorrow. We are solely focused on one thing and one thing only, and that is, again, restoring service. So in order to roll out IMS and incident command at CSG, we reached out to another group of experts. We engaged with Ron, Rob and Chris at Blackrock 3.

00:13:02

They came in, they interviewed people, and they trained 130 people as incident commanders through a series of training sessions. With this, we wanted to target a broad swath of people. So we took a look and we said, who's going to be our incident commanders? Let's make sure and get them in there. But we also wanted to get other people that were going to be on these calls as well. We targeted executive leadership, some of our internal client representatives, some of our customers themselves, and then some technicians that were going to be on the bridge. And to give you an idea of the quality of this particular training, several senior leaders said that this was the best training they had been through in their entire career. So after we got the training, we knew we wanted to start with a pilot group, and we ended up deciding on a group of 14 incident commanders.

00:13:51

There was actually a lot of talk about the size here, and it ended up being a controversial decision, but we did move forward with starting that small. We wanted to get a small group of people who could collect some experience, get some lessons learned and iterate on this. So what this looked like is basically we had a 14-week time period where each person would be on call for a week at a time, where they had a mean time to assemble. Basically they were expected to be on the bridge within five minutes for that entire week. And we used a buddy system. So there were two people on call at any given time, and basically whoever responded first got that call. And then, you know, if you had joined the bridge and your buddy was already on there, you would get the next one.

00:14:34

We targeted a group of people to be these incident commanders who were already really good at running outage bridges. We thought that they would be strong change agents. They were already well versed in the nuances of outages, and they would best recognize the improvements that a framework like this would lead to. So the reason that this was a controversial decision is because the group of people that we tapped into here also happened to be some of our technical leaders across the organization, who have very, very full plates. Some companies implement incident command as a separate role. We were asking people to do this on top of their existing role, and it was kind of a blessing and a curse, right? So congratulations, you're so good at running outage calls that now you get to be on call more often. You're already on call 24/7 for your products.

00:15:25

Now we're asking you to be on call for a week at a time, able to respond within five minutes for all products. So we were really worried about the toll that this was going to take on the group. It ended up being not an issue, and I will talk more about that in a few minutes. After we got this pilot group together, we knew we needed to train the larger organization in incident command. So what we did is we put a much smaller training together, about 30 minutes long, and we trained the entire organization, 1,600 people, in incident command. And then we started using it. Then we held two-week checkpoints with our incident commanders and several key leaders to talk about what was going well with this and what were the things that we needed to tweak, and really just to share our experiences and learn from one another.

00:16:14

Another key change that we made was to update our whiteboard tooling. When we went through the Blackrock 3 training, it was apparent that we needed some work on our whiteboards. We looked at several different options and ended up going with just a low-cost option with Microsoft Excel. But now when you join a bridge, you can see the whiteboard, and it's got who's filling each of the roles. It's got the current status report, all previous status reports and a timeline of what's going on. So you really concisely have all the information that you'd want when you join that bridge. After we did all these changes, we completed the pilot. Then we extended to more incident commanders. So we now have far more than 14. You have to have taken the training, and you basically have to have leadership approval to do this.

00:17:03

So it was really interesting. As we rolled out incident command and we started running our calls this way, the improvements were seen almost immediately. On the first two calls that I incident commanded, both times afterwards I had people come and talk to me and say, wow, this is so much better. They hadn't even been through the training yet. They didn't even know what they were seeing. They just knew that it was better. And so it was just such an observable difference in how an outage call is run. I think there were a couple key components to that. First of all, clutter on the call was removed. Our calls tended to have lots of different things going on at the same time, lots of different people talking. You know, we'd go in this direction and then we'd go over here. Now there's better behavior from call participants, which reduces a lot of that noise.

00:17:52

People understand more about when it's appropriate to talk, what sorts of things should be taken offline, and the need to ask for permission to speak on that bridge. So we really can make sure we're not getting distracted by a bunch of noise. The singular focus on resolving the outage has also helped a lot. I previously mentioned that you're not going to talk about root cause or what we're doing tomorrow. Those were frequently conversations that we had on our bridges. Now we just stop those as they start and get back to talking about restoring service. The status reports and the LNO role have also been really big for us. A lot of our bridges before tended to talk about status. Now we have a regular cadence: every 30 minutes we're giving the current conditions, actions and needs. So if we have someone jumping on the bridge needing a status, they can see that status from within the last 30 minutes.
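The status cadence described here (conditions, actions and needs, every 30 minutes, visible on the whiteboard) can be sketched roughly as follows. This is a minimal illustration only: the class and field names are hypothetical, and the talk says CSG's actual whiteboard is an Excel sheet, not code.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List, Optional

# 30-minute cadence as mentioned in the talk; everything else is assumed.
STATUS_INTERVAL = timedelta(minutes=30)


@dataclass
class StatusReport:
    """One conditions / actions / needs report (hypothetical structure)."""
    issued_at: datetime
    conditions: str  # what is currently happening
    actions: str     # what is being done about it
    needs: str       # what the bridge needs from others


@dataclass
class Whiteboard:
    """Hypothetical stand-in for the shared incident whiteboard."""
    roles: Dict[str, str] = field(default_factory=dict)
    reports: List[StatusReport] = field(default_factory=list)

    def post(self, report: StatusReport) -> None:
        # Earlier reports are kept so late joiners can read the history.
        self.reports.append(report)

    def current(self) -> Optional[StatusReport]:
        return self.reports[-1] if self.reports else None

    def next_report_due(self) -> Optional[datetime]:
        latest = self.current()
        return latest.issued_at + STATUS_INTERVAL if latest else None
```

Someone joining the bridge late would read `current()` and `next_report_due()` instead of interrupting the call to ask for a status.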

00:18:45

And they know another one is coming. They're not going to ask for status information. Similarly, if someone got paged to join the bridge and help, the way that this used to look for us almost every time is someone would join and say, hi, this is so-and-so, I got paged, what's going on? And so we would repeat ourselves time and again, wasting valuable time on that call. Now they have everything that they need. And then the LNO role, that liaison officer, has really been critical as well. On a larger outage, customers want to talk to us. They want to know what's going on. They want to talk to someone who can give them some technical details. The way that we solved this before was that we would take the person running the outage bridge, who knew the most about what was going on,

00:19:30

and they would jump over and talk to the customers, which was highly disruptive to the actual call restoring service. And now we have someone else go over there. So this was one area that was interesting to me. I had taken the training, but we hadn't rolled out incident command yet. And we had an outage bridge I was on where we were getting close to restoring service when a senior leader joined. He didn't have the information that he needed to do his job. He didn't have the current status. He didn't have all the different things that we had tried. So for the next 30 to 45 minutes, the call basically pivoted and answered his questions instead of focusing on resolving the issue. Not his fault. He wasn't getting what he needed. We didn't have the right format to get him what he needed on some sort of regular basis.

00:20:16

And we realized how frequently this was happening, and that with just some small tweaks, we could completely address this issue. Another thing that we got is a sense of control over chaos. I was one of the people who ran the outage bridges before, and there was no playbook. You kind of never did the same thing twice, and you're trying to make it up as you go along during an active outage, a very stressful crisis time, not when you want to be doing things on the fly. Now we've got a predictable cadence and pattern to follow that makes these outages so much easier to run. Activities can run in parallel. This is a biggie too. Before, it was really difficult to do a bunch of stuff at once. Now we have a pattern where, when we figure out a step that we're going to take, we ask for someone to take that assignment.

00:21:00

We ask when they can give us a status update. They're going to go, and we're not going to bug them. We're not going to talk to them until that time is up or until they have an update. We used to ping them all the time. You know, do you have an update? Do you have an update? Which meant they couldn't be working. And also we were interrupting the bridge again with that status info. The decision-making process is clear. The 2/4 outage is a great case of this. We needed to make some hard decisions, but there was no decision-making authority. And so we largely had to get to consensus, which is very slow, and in an outage you want to make calls quickly and move on. Now the incident commander makes sure we have all the pertinent information and the risks, and then we ask for strong objections and we move on. Similarly, clear authority is established with the person who's running the call, the incident commander. It's clear to everyone that that's the authority.

00:21:50

So I reached out to a couple of our senior leaders because I wanted to give you their feedback on what they've seen with us going through this transformation. Ken Kennedy, the president of technology and product, had this to say: "The Blackrock 3 incident management system enabled CSG to reduce customer time by 30% last year. This framework outlines critical roles, rules of engagement and means of communication. This framework enabled us to work cooperatively to quickly resolve incidents when every minute matters." I also reached out to Doran Stienike, our CIO. He had this to say: "My favorite outcomes of the incident command training are that everyone now understands their role during an incident, and the flow of communications to the company and to our customers is so much better. The first 20 minutes of an outage used to be spent just figuring out who was in charge, and no one organized customer communications until it was a crisis."

00:22:49

So as we went from the training to the pilot to actually rolling this out, we learned a lot. You know, can you do this without experts? You probably can. You can expect this to be a disruptive change, and you need to dedicate full-time people to it. But I do recommend just going with the experts. We learned that with the IMS framework, almost all elements of this framework held true. As far as pulling together, everyone pulls together and does what's needed in a time of crisis. I mentioned we were worried about these 14 incident commanders and the toll that it would take on them. What actually ended up happening is that in most cases, when there were two people paged, they both joined the bridge. The buddy would stay on. And in fact, many times two, three, four other incident commanders would join the bridge to see how they could help and how they could learn.

00:23:37

Your first IC call is definitely different, so basically just prep yourself for that. It gets much easier. Spelling it out: we wondered how much of this framework we needed to spell out after we took the training versus letting people figure some of this out. A great example of this: we had an outage about a month into starting our pilot, and it ended up running over 24 hours. The actual outage wasn't that long, but we were doing a lot of preventative work. We hadn't figured out how to do this, but we always had multiple incident commanders on the bridge, communicating behind the scenes, coordinating with each other, and we got through it. And we actually reduced the duration of the outage substantially. There were some assumptions challenged. "Any prolonged issue of significance can be run by any incident commander" was not proven valid for us. I'd say the training disagrees with us.

00:24:24

I've seen many tweets out there that also disagree with us. We just haven't had good luck here. We've had some outage bridges go pretty poorly with someone who didn't have the domain expertise. And I think some of being a strong incident commander is confidence, and that's knowing that we're asking the right questions and we've got the right people engaged. That's just what's worked for us. Our leaders do tend to be somewhat technical in nature. We may bleed into that SME role a little bit. We're definitely not troubleshooting, but we do know some of those technical questions. Again, it's what's worked for us. We'll continue to keep an eye on it. Along those same lines, mean time to assemble: we found our greatest opportunity to reduce MTTR wasn't getting an incident commander on the bridge as fast as possible, but getting the right one. So because of that, we actually changed how we staff our on-call.

00:25:09

We went away from an official on-call schedule, and we said, if you're involved, be involved in IC duties. What that means is, we already had triage lists prior to 2/4 that would page out to large groups of people in a major issue. We have enough incident commanders that someone getting paged is an incident commander. Get on that bridge. If your stuff is involved, be prepared to step into that incident commander role. You can hand it off if it starts going a certain direction. We had started with "if you break it, use it," but we felt that this basically focused too much on a blame culture, and we wanted to stay away from that. This was another controversial decision, and there was concern that we wouldn't get the right person on the bridge fast enough, but that ended up basically being a non-issue.

00:25:54

But we did create a pool of senior incident commanders to help in any sort of atypical scenario. This is about 10 people. These are basically our strongest incident commanders. If for some reason you're not getting an incident commander on the bridge, or if you're having a particularly hard outage and maybe someone's struggling, anything like that, there needs to be a safety valve, and you can reach out to this group at any point in time. I just want to briefly touch on the other operational improvements that we made as well. So number one, we reduced failure group size with our load balancers. We virtualized our load balancers and we went to more, smaller instances. We're improving our network monitoring by engaging with a third party. We're continuing our public cloud exploration and growth. We're leveraging group chat for incidents. We're making it safe to talk about failure. So we now have had sessions where all we've done is talk about some of our epic failures from the past. Instead of sweeping this under the rug, we're making sure that we're sharing these lessons learned.

00:26:51

And then finally, we're implementing an out-of-band network, again engaging with a third party. So as I look back on the time since the outage, I really feel like I've learned so much, and so has CSG. We've talked about some of this stuff today with incident analysis, the incident management system, and other changes to make our system more resilient. But I want to talk about a few other things just for a minute. So, complex system failure: we learned a lot more about this, and once you see it, you can't unsee it. Now we see it everywhere. A great example of this is the HBO miniseries on the Chernobyl disaster. My peers, my boss and I were talking about this, and most of us had independently watched it. And we all came to the same conclusion: wow, this is a great example of complex system failure.

00:27:34

Again, you just can't unsee it, and it looks the same in every industry. We think we're unique in software, but the reality is that complex system failure looks the same everywhere. Dr. Richard Cook wrote a great paper called "How Complex Systems Fail." I highly recommend you go take a look. You're going to think it was written for the software industry, but it wasn't, which is quite interesting. There are some common reactions to failure. First of all, a major outage situation like we had is going to create a lot of stress. It's going to cause people to behave differently. There's a tendency to look at things with hindsight bias. With all the information that we have at the end, we look back and say, why didn't you do this? Or why didn't you think of this? But the reality is it's not that clear in the fog of war. There can be a tendency to blame or to punish.

00:28:22

So we definitely need to fight against that. And then also human error gets blamed. In fact, human error gets blamed 85% of the time in accidents, where in reality, humans are what's keeping these systems from failing more often. We learned that avoiding failure requires failure. There are just some things that you can't learn except by doing. And finally, we learned that we can come back from this. When this happened, we didn't know how we were going to get ourselves back up. We didn't even know if we could. What we have in impact minutes is a great metric that shows us this. This is an internal measure where basically we're trying to measure the pain that we cause our customers. It's a formula that basically boils down to the duration of an outage and the percent of a product or products that are down during this time.
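A minimal sketch of the impact-minutes idea, assuming the simplest reading of the description above: outage duration multiplied by the fraction of the product that is down. The talk does not give CSG's actual formula, so the function names, the budget helper and the numbers in the usage note are hypothetical.

```python
from typing import Iterable, Tuple


def impact_minutes(duration_minutes: float, fraction_down: float) -> float:
    """Assumed form of the metric: duration weighted by how much was down."""
    if duration_minutes < 0:
        raise ValueError("duration_minutes must be non-negative")
    if not 0.0 <= fraction_down <= 1.0:
        raise ValueError("fraction_down must be between 0 and 1")
    return duration_minutes * fraction_down


def remaining_budget(
    annual_budget: float, incidents: Iterable[Tuple[float, float]]
) -> float:
    """Subtract each incident's impact minutes from a yearly budget.

    Each incident is a (duration_minutes, fraction_down) pair.
    """
    used = sum(impact_minutes(d, f) for d, f in incidents)
    return annual_budget - used
```

For example, under this assumed formula a 60-minute outage with half of a product unavailable counts as 30 impact minutes, so `remaining_budget(100, [(60, 0.5)])` leaves 70 of a made-up 100-minute budget.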

00:29:07

And after 2/4, we had eaten up almost our entire 2019 budget. We have a budget for each year, reduced from the year before. We looked at it and we said, we don't know if we can get home on this. This is going to be really, really challenging. But when the end of the year came, we did meet our goal for that year, which I think is a great testament to all the improvements we've put in place over the last several years in the DevOps space and operational stability, and then, layered on top of that, our learnings from our 2/4 outage. Quick recap and recommendations. You know, when my people have a particularly nasty outage or they're going through a hard time, I tell them that you're going to look back on this time as a blessing. It's not going to feel like it at the time, but this is a time of intense learning.

00:29:51

When I had to take my own advice with the 2/4 outage, it wasn't so fun. And this is a day that will stick with me for my entire career. CSG overall has grown so much from this. I hope you've learned from this as well. You know, we've gotten to meet a lot of great people along our journey who've imparted their wisdom, shaped our path and helped cause us to think differently. So some recommendations that I have for you are to find the opportunity in failure, engage with experts, make post-incident analysis part of your culture, implement incident command, and then finally, always keep learning. Along those lines, I want to keep learning in this space as well. The help that I'm looking for is understanding and hearing more about your stories. So please share them. I'd love to hear: how did you fail in grand fashion, and how have you implemented incident command? Thank you.