Las Vegas 2020

Getting Back Up When You’ve Been Knocked Down: How We Turned Our Company’s Worst Outage into a Powerful Learning Opportunity

In February 2019, we experienced the worst outage in company history.


This outage was the result of a complex system failure that pushed us beyond the limits of our current response systems, processes and culture. In the face of this adversity, we were able to find opportunity in failure. This resulted in improving how we understand incidents, respond to them, and prevent them in the first place.


Now, we have a stronger organizational ability to perform incident management, we’ve reinforced and broadened cultural norms around safety, and most impactfully, we have implemented an incident management system that has changed how we run outage calls. Despite how it can feel at the time, failure does not have to be permanently catastrophic. It’s how you respond to that failure that will ultimately shape your organization.


Learn how you can apply what we’ve learned to make the most of outages and drive improvement within your organization.


Erica Morrison

Vice President, Software Engineering, CSG

Transcript

00:00:07

Welcome back from the break. I hope you found the networking time valuable and that the formats we created were helpful in creating useful and meaningful interactions. If you have any feedback, please put it in the channel you attended, such as Birds of a Feather, Lean Coffee, or Snack Club, or just put it in the general channel. One of my favorite quotes in The Phoenix Project is "feedback is love; the opposite of love isn't hate, it's apathy." Alright, so the first speaker we have this afternoon is Erica Morrison, who was recently promoted to VP of Software Engineering at CSG. She has presented at DevOps Enterprise four times and is someone whose achievements and abilities I genuinely admire. I suspect you have not heard a presentation quite like this before, and if you're like me, you'll be blown away by this presentation. When she first told this story to a group of us last year, you could have heard a pin drop. It was riveting and heart wrenching, and it lays bare problems that almost all of us have faced in our careers. She provides lessons learned and teachings that will get the attention of anyone who has had to fix production incidents under extreme pressure. I trust that after watching this presentation, you'll feel compelled to explore how incident command might help your own organization. Here's Erica.

00:01:35

Hi everyone. I'm here today to talk about getting back up when you've been knocked down and how we turned our company's worst outage into a powerful learning opportunity. Real briefly before I get into that, a quick background on CSG. We're North America's largest SaaS-based customer care and billing provider. We do work in the revenue management and digital monetization space, supporting customers such as those you see on the top of the slide here. We support over 65 million subscribers with a tech stack that really runs the gamut, everything from JavaScript to mainframe. We've been fortunate enough to get to share our DevOps journey over the last several years at DevOps Enterprise Summit. In 2015, Scott Prugh and I talked about how we were reducing batch sizes and applying Agile and Lean. In 2016, we underwent a major organizational transformation where we brought a development organization and an ops organization together and put developers and operations engineers on the same teams.

00:02:33

In 2017, we talked about spreading culture, investing in engineering, and shifting ops left. Then in 2018, I presented with Joe Wilson, and we talked about adding more automation and shifting security left. And finally, in 2019, Scott shared our story on continuing product modernization. So let's talk about the outage. I wanna walk you through our outage story today. This story started on February 4th, 2019, with what became our company's worst outage. If you say "two-four," as this outage came to be known internally, people know exactly what you're talking about. We wanted to respond to this incident differently. We took a number of steps, and I'll detail each of those for you today. They included incident analysis, the rollout of an incident management system, and a number of other operational improvements, which resulted in a lot of learning for our organization. We also learned that despite how it feels at the time, failure like this doesn't have to be permanently catastrophic. It's really how you respond to that failure that will ultimately shape your organization.

00:03:39

So as for the outage itself, it ended up being 13 hours in duration when all was said and done. It started abruptly with little to no warning, and large portions of our product were unavailable during this entire time. I remember getting paged in the middle of the night, and as the early troubleshooting started, I remember thinking, we still have a couple hours until the start of business on the US East Coast, which is really when traffic starts to substantially ramp up and the pain to our customers increases. Little did I know that not only would we not have service restored by the start of business, we would struggle to get it restored by the end of the business day. Troubleshooting was particularly interesting on this call: we were largely troubleshooting blind. We had problems accessing the tools that we would normally use to troubleshoot this sort of issue.

00:04:27

Things like our system health monitoring information and server access were all hampered by the exact same issue that was affecting our production services. Every outage call tends to be a little bit chaotic. This one was particularly so with the number of vendors and customers involved. At one point we had six different bridges going trying to resolve this issue. And then as the day went on, we would come up with different theories. We would have to work really hard to implement them because of all the tool access problems, and then we would actually see a little bit of relief. Things would look a little bit better for a few minutes, only to start really not working again. So you'd get your hopes up each time, and then have them crushed each time. And as the hours started to pass, this really started to result in a feeling of helplessness.

00:05:18

So as we went through the day, we started taking more and more drastic action. Obviously, we were able to eventually resolve this. We did this by killing VLANs, VLAN by VLAN, and when we killed one particular VLAN, pretty much instantaneously traffic patterns started looking normal and we knew we were onto something. It would actually take us a couple of days, but we were very fortunate that we were able to reproduce this in a lab, and that allowed us to understand what had actually happened here. This all started with some routine server maintenance on an OS that's different from the one most of our servers run. When that server rebooted, it put an LLDP packet out on the network, and then, due to a bug, our network software picked this up and interpreted it as spanning tree. So it broadcast this out to the network, and then it was picked up by our load balancer, and due to a misconfiguration in our load balancer, this got rebroadcast back to the network, basically creating a network loop, creating a network storm, and taking our network down.
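To make that failure mode concrete, here is a minimal, purely illustrative Python sketch of how a rebroadcast misconfiguration turns one stray frame into a storm. It is a toy model, not CSG's network or any real protocol implementation; the point is only the shape of the growth once two devices keep reflecting traffic back onto the same segment.

# Illustrative only: a toy model of a Layer 2 broadcast loop, not a real
# network simulator. Two misbehaving devices each reflect every broadcast
# frame they see back onto the shared segment, so one LLDP frame snowballs.

def simulate_broadcast_storm(reflecting_devices: int, hops: int) -> list[int]:
    """Return the number of frames on the segment after each hop."""
    frames = 1  # the single LLDP packet emitted by the rebooting server
    history = []
    for _ in range(hops):
        # every reflecting device re-emits each frame it hears
        frames += frames * reflecting_devices
        history.append(frames)
    return history

if __name__ == "__main__":
    # With just two reflecting devices (say, a buggy switch fabric and a
    # misconfigured load balancer), traffic grows geometrically.
    print(simulate_broadcast_storm(reflecting_devices=2, hops=10))
    # -> [3, 9, 27, ...]: the segment is overwhelmed within a few hops.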

00:06:19

We would learn later that this was a great example of complex system failure. We had multiple failures in the system that had to happen. We had latent failures; in fact, these configurations had been in the system for months. And then the failures were changing throughout the day. And just to give you an idea of some of the chaos and challenges with the troubleshooting, we had actually looked at this particular maintenance and we had said, hey, this timing sure seems coincidental with when this outage started. But when we troubleshot it and looked into it, we said, you know what? No, it's a victim of what's going on in the network, and it's not the cause of it. It would take this reproduction later in the lab for us to fully understand this.

00:07:04

So the aftermath of this issue was quite severe. In the heat of battle, everyone really pulled together, but after the dust settled, things weren't so pretty. First of all, we had very angry customers, as you can imagine. We had damaged their business, which as a result damaged our company's reputation. And so there were many onsite meetings, many conference calls, emails, write-ups, et cetera. This required quite a leadership focus, pivoting from what they were doing, things like strategic initiatives, operational improvements, et cetera, and instead focusing largely on this outage. CSG takes great pride in the services that we provide our customers, so to have failed them in such grand fashion really led to a sense of loss and heartbreak. You could feel it walking through the halls, going to meetings, just the sense of absolutely crushed morale. Because of this, as you can imagine, there were a lot of open wounds and strong emotions.

00:08:02

Hurtful things were said, things like "DevOps doesn't work." So with this backdrop, we knew we wanted to respond to this incident differently. A terrible thing had happened; we wanted to maximize our learnings from it and also figure out how we could reduce the likelihood of something like this occurring again. The first step that we took here was around incident analysis. Many of you are probably familiar, but just some brief context: incident analysis is a structured process to understand what happened in an incident and identify opportunities for improvement. Some key components here include looking at the timeline and asking a series of questions, things like: What happened? How can we detect this sooner? How can we recover sooner? What went well? What didn't go so well? Oftentimes we get a better understanding of our system behavior as a result of this.

00:08:54

And then underpinning all of this is a blameless culture: we avoid finger pointing, and we avoid explanations like "human error" and "try harder." Incident analysis was already part of CSG's culture. Prior to two-four, we did incident analysis on all of our major incidents. However, in this case, we really wanted to up our game, so we engaged with some experts. We reached out to Dr. Richard Cook and John Allspaw at Adaptive Capacity Labs. They came in and did two weeks of intense interviews and research, and they spent a lot of time investigating overall what had happened with this outage. This led to a more thorough understanding of events and different perspectives. It was really eye-opening to me, as someone who had been on the front lines for pretty much the entire duration of the outage itself and the aftermath, how many things they learned that I didn't know.

00:09:46

And then also just these different perspectives: people who had sat on that same outage bridge walked away with a different understanding of what had happened. I mentioned we learned about complex system failure. With this, we learned that those different perspectives are actually not unusual. We're probably not going to change the perspectives on this particular issue, but we can change them going forward, and we can effect cultural change. We also walked away with a better understanding of our incident response state and a series of recommendations of changes to implement.

00:10:24

Out of this, we wanted to take these recommendations and create a structure to make sure that they got prioritized and implemented. So we created an operational improvements program, and we basically bucketed the work into four categories: incident response, tool reliability, data center platform resiliency, and application reliability. I wanna focus for a while on the incident response piece here because it has been so impactful to CSG. I wanna give an overview of the incident management system and then talk about the rollout at CSG and the impact it has had for us. Again, I'm not gonna go into a ton of detail on the incident management system. There are some great presentations out there; in fact, Brent Chapman presented at DevOps Enterprise Summit on this topic a couple years ago. There's also lots of free material out on the internet. But just to give an overview, it's been a national standard for managing all-hazards incidents in the US for the last 40 years.

00:11:25

So if you wanna talk about high-pressure, highly chaotic situations, things like terrorist attacks and forest fires, this is the system used to manage those. It has a number of key components. First of all, we have a clear set of established roles. We talk about peacetime and wartime: your peacetime title does not matter, an outage call is wartime, and you're gonna fulfill the role that you're playing on that bridge. First of all, we've got the incident commander. This person is the boss. They're making sure you're following the incident management system, they're directing traffic, and they're the decision-making authority. You've also got the scribe. This is the person that is taking notes and basically keeping a timeline of what's going on. You also have the LNO, or liaison officer. This person makes sure that your key stakeholders are getting updated with the information they need.

00:12:14

They might jump to a separate bridge to talk to some customers, for instance. And then finally, you have one or more subject matter experts. Another key component here is clear communication that happens on a regular cadence and with a pre-agreed-upon format. You also have a set of expected behaviors for participants and common terminology. And then finally, management by objective: the call is only focused on restoring service. We're not gonna talk about things like root cause or what we're gonna do tomorrow. We are solely focused on one thing and one thing only, and that is, again, restoring service.
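As a rough way to picture the structure just described, here is a minimal sketch in Python; the field and role names are illustrative, not an official IMS or CSG schema.

# A hedged, minimal sketch of the call structure described above.
# Names and fields are illustrative, not an official IMS or CSG artifact.
from dataclasses import dataclass, field

@dataclass
class IncidentCall:
    incident_commander: str          # decision-making authority on the bridge
    scribe: str                      # keeps the notes and the timeline
    liaison_officer: str             # keeps stakeholders and customers updated
    subject_matter_experts: list[str] = field(default_factory=list)
    objective: str = "Restore service"  # management by objective: nothing else

    def is_in_scope(self, topic: str) -> bool:
        """Root cause and 'what we do tomorrow' are out of scope on the call."""
        out_of_scope = {"root cause", "follow-up work", "blame"}
        return topic.lower() not in out_of_scope

call = IncidentCall(
    incident_commander="on-call IC",
    scribe="on-call scribe",
    liaison_officer="LNO",
    subject_matter_experts=["network SME", "load balancer SME"],
)
assert call.is_in_scope("restoring service")
assert not call.is_in_scope("root cause")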

00:12:51

So in order to roll out IMS and incident command at CSG, we reached out to another group of experts. We engaged with Ron, Rob, and Chris at Blackrock 3. They came in, they interviewed people, and they trained 130 people as incident commanders through a series of training sessions. With this, we wanted to target a broad swath of people. So we took a look and we said, who's going to be our incident commanders? Let's make sure to get them in there. But we also wanted to get other people that were going to be on these calls as well. We targeted executive leadership, some of our internal client representatives, some of our customers themselves, and then some technicians that were gonna be on the bridge. And to give you an idea of the quality of this particular training, several senior leaders said that this was the best training they had been through in their entire careers.

00:13:44

So after we got the training, we knew we wanted to start with a pilot group, and we ended up deciding on a group of 14 incident commanders. There was actually a lot of talk about the size here, and it ended up being a controversial decision, but we did move forward to start that small. We wanted to get a small group of people who could collect some experience, get some lessons learned, and iterate on this. What this looked like is basically a 14-week time period where each person would be on call for a week at a time, with a mean-time-to-assemble target: basically, they were expected to be on the bridge within five minutes for that entire week. And we used a buddy system, so there were two people on call at any given time, and basically whoever responded first got that call.

00:14:29

And then if you hopped on the bridge and your buddy was already on there, you would get the next one. We targeted a group of people to be these incident commanders who were already really good at running outage bridges. We thought that they would be strong change agents, they were already well versed in the nuances of outages, and they would best recognize the improvements that a framework like this would lead to. The reason that this was a controversial decision is because the group of people that we tapped here also happened to be some of our technical leaders across the organization, who have very, very full plates. Some companies implement incident command as a separate role; we were asking people to do this on top of their existing role, and it was <laugh> kind of a blessing and a curse, right?
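Here is a minimal sketch of that buddy-system rotation. The 14-person pool and the five-minute response expectation come from the talk; the naming, ordering, and helper functions are illustrative assumptions.

# Illustrative sketch of the pilot's buddy-system rotation: 14 ICs, one pair
# per week, and whoever acknowledges the page first runs the call while the
# buddy takes the next one. Details are assumptions for the example.

INCIDENT_COMMANDERS = [f"ic-{n:02d}" for n in range(1, 15)]  # the 14-person pilot pool

def on_call_pair(week: int) -> tuple[str, str]:
    """Return (primary, buddy) for a given week of the 14-week rotation."""
    primary = INCIDENT_COMMANDERS[week % len(INCIDENT_COMMANDERS)]
    buddy = INCIDENT_COMMANDERS[(week + 1) % len(INCIDENT_COMMANDERS)]
    return primary, buddy

def assign_call(week: int, first_to_acknowledge: str) -> str:
    """Whichever of the pair responds first within five minutes takes the call."""
    pair = on_call_pair(week)
    return first_to_acknowledge if first_to_acknowledge in pair else pair[0]

print(on_call_pair(0))          # ('ic-01', 'ic-02')
print(assign_call(0, "ic-02"))  # the buddy answered first, so they run it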

00:15:16

Congratulations, you're so good at running outage calls that now you get to be on call more often. You're already on call 24/7 for your products; now we're asking you to be on call for a week at a time, able to respond within five minutes, for all products. So we were really worried about the toll that this was gonna take on the group. It ended up not being an issue, and I'll talk more about that in a few minutes. After we got this pilot group together, we knew we needed to train the larger organization in incident command. So we put a much smaller training together, about 30 minutes, and we trained the entire organization, 1,600 people, in incident command. And then we started using it. We held two-week checkpoints with our incident commanders and several key leaders to talk about what was going well with this and what things we needed to tweak, and really just to share our experiences and learn from one another.

00:16:14

Another key change that we made was to update our whiteboard tooling. When we went through the Blackrock 3 training, it was apparent that we needed some work on our whiteboards. We looked at several different options and ended up going with a low-cost option: Microsoft Excel. But now when you join a bridge, you can see the whiteboard, and it's got who's filling each of the roles, the current status report, all previous status reports, and a timeline of what's going on. So you really concisely have all the information that you'd want when you join that bridge. After we did all these changes and completed the pilot, we extended to more incident commanders, so we now have far more than 14. You have to have taken the training, and you basically have to have leadership approval to do this.
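A simple way to picture what that whiteboard carries is sketched below; the field names and helper functions are illustrative, not our actual Excel layout.

# Illustrative sketch of the shared "whiteboard": role assignments, the latest
# and prior status reports, and a running timeline. Field names are assumptions.
from datetime import datetime

whiteboard = {
    "roles": {"IC": "on-call IC", "Scribe": "scribe", "LNO": "liaison officer"},
    "current_status": "Conditions: API errors. Actions: isolating VLANs. Needs: network SME.",
    "previous_statuses": [],
    "timeline": [],
}

def log_event(event: str) -> None:
    whiteboard["timeline"].append((datetime.now().isoformat(timespec="minutes"), event))

def post_status(report: str) -> None:
    # Keep the history so anyone joining the bridge can catch up silently.
    whiteboard["previous_statuses"].append(whiteboard["current_status"])
    whiteboard["current_status"] = report
    log_event("Status report posted")

post_status("Conditions: errors dropping. Actions: monitoring. Needs: none.")
print(whiteboard["current_status"])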

00:17:03

It was really interesting as we rolled out incident command and started running our calls this way: the improvements were seen almost immediately. <laugh> On the first two calls that I incident-commanded, both times afterwards I had people come and talk to me and say, wow, this is so much better. They hadn't even been through the training yet. They didn't even know what they were seeing; they just knew that it was better. It was just such an observable difference in how an outage call is run. I think there were a couple key components to this. First of all, clutter on the call was removed. Our calls tended to have lots of different things going on at the same time, lots of different people talking; we'd go in this direction and then we'd go over here. Now there's better behavior from call participants, which reduces a lot of that noise.

00:17:52

People understand more about when it's appropriate to talk and what sorts of things should be taken offline, and they need to ask for permission to speak on that bridge, so we really can make sure we're not getting distracted by a bunch of noise. The singular focus on resolving the outage has also helped a lot. I previously mentioned that you're not gonna talk about root cause or what we're doing tomorrow. Those were frequently conversations that we had on our bridges; now we just stop those as they start and get back to talking about restoring service. The status reports and the LNO role have also been really big for us. A lot of our bridges before tended to spend time talking about status. Now we have a regular cadence: every 30 minutes we're giving the current conditions, actions, and needs. So if we have someone jumping on the bridge needing a status, they can see the status from within the last 30 minutes, and they know another one is coming, so they're not gonna ask for status information.
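A rough sketch of that status-report pattern follows: the conditions/actions/needs format and the 30-minute cadence come from the talk, while the helper names are illustrative.

# Sketch of the conditions/actions/needs status report on a fixed cadence.
# The 30-minute interval matches the talk; everything else is illustrative.
from datetime import datetime, timedelta

CADENCE = timedelta(minutes=30)

def format_can(conditions: str, actions: str, needs: str) -> str:
    return f"CONDITIONS: {conditions}\nACTIONS: {actions}\nNEEDS: {needs}"

def next_report_due(last_report_at: datetime) -> datetime:
    return last_report_at + CADENCE

last = datetime(2019, 2, 4, 6, 0)
print(format_can(
    conditions="Large portions of the product unavailable",
    actions="Isolating VLANs one at a time",
    needs="Load balancer SME on the bridge",
))
print("Next status report due at", next_report_due(last))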

00:18:50

Similarly, if someone got paged to join the bridge and help, the way that this used to look for us, almost every time, is someone would join and say, hi, this is so-and-so, I got paged, what's going on? And so we would repeat ourselves time and again, wasting valuable time on that call. Now they have everything that they need. And then that LNO role, the liaison officer, has really been critical as well. In a larger outage, customers wanna talk to us, they wanna know what's going on, they wanna talk to someone who can give them some technical details. The way that we solved this before was that we would take the person running the outage bridge, who knew the most about what was going on, and they would jump over and talk to the customers. Highly disruptive to the actual call restoring service.

00:19:36

Now we take someone else and have them go over there. This was one area that was interesting to me. I had taken the training, but we hadn't rolled out incident command yet, and we had an outage bridge I was on. We were getting close to restoring service when a senior leader joined. He didn't have the information that he needed to do his job; he didn't have the current status, he didn't know all the different things that we had tried. So for the next 30 to 45 minutes, the call basically pivoted and answered his questions instead of focusing on resolving the issue. Not his fault: he wasn't getting what he needed, and we didn't have the right format to get him what he needed on some sort of regular basis. And we realized how frequently this was happening, and that with some small tweaks we could completely address this issue.

00:20:22

Another thing that we got is a sense of control over chaos. I was one of the people who ran the outage bridges before, and there was no playbook. You kind of never did the same thing twice, and you're trying to make it up as you go along during an active outage, a very stressful crisis time, not when you want to be doing things on the fly. Now we've got a predictable cadence and pattern that we follow, which makes these outages so much easier to run. Activities can run in parallel. This is a biggie too. Before, it was really difficult to do a bunch of stuff at once. Now we have a pattern where, when we figure out a step that we're going to take, we ask for someone to take that assignment and we ask them when they can give us a status update.

00:21:02

They're gonna go do it, we're not gonna bug them, we're not gonna talk to them until that time is up or until they have an update. We used to ping them all the time: do you have an update? Do you have an update? Which meant they couldn't be working, and also we were interrupting the bridge again with that status info. The decision-making process is also clear now. The two-four outage is a great case of this: we needed to make some hard decisions, but there was no decision-making authority, so we largely had to get to consensus, which was very slow, and in an outage you want to make calls quickly and move on. Now the incident commander asks to make sure we have all the pertinent information and the risks, and then we ask for strong objections and we move on. Similarly, clear authority is established. The person who's running the call, the incident commander, it's clear to everyone that that's the authority.
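A minimal sketch of that assignment pattern, agree on a check-in time and then leave the owner alone until it passes; the tasks and intervals shown are illustrative.

# Illustrative sketch of parallel assignments: each task gets an owner and an
# agreed check-in time, and the bridge doesn't ping the owner before then.
from datetime import datetime, timedelta

class Assignment:
    def __init__(self, task: str, owner: str, check_in_minutes: int):
        self.task = task
        self.owner = owner
        self.check_in_at = datetime.now() + timedelta(minutes=check_in_minutes)

    def may_ping(self, now: datetime) -> bool:
        """Only interrupt the owner once the agreed check-in time has passed."""
        return now >= self.check_in_at

work = [
    Assignment("Disable suspect VLAN", "network SME", check_in_minutes=15),
    Assignment("Validate load balancer config", "LB SME", check_in_minutes=20),
]
for a in work:
    print(a.task, "->", a.owner, "check-in at", a.check_in_at.strftime("%H:%M"))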

00:21:50

I reached out to a couple of our senior leaders because I wanted to give you their feedback on what they've seen as we've gone through this transformation. Ken Kennedy, the president of technology and product, had this to say: "The Blackrock 3 incident management system enabled CSG to reduce customer impact time by 30% last year. This framework outlines critical roles, rules of engagement, and means of communication. This framework enabled us to work cooperatively to quickly resolve incidents when every minute matters." I also reached out to Dorn Steiny, our CIO. He had this to say: "Two of my favorite outcomes of the incident command training are that everyone now understands their role during an incident, and the flow of communications to the company and to our customers is so much better. The first 20 minutes of an <inaudible> used to be spent just figuring out who was in charge, and no one organized customer communications until it was a crisis."

00:22:49

As we went from the training to the pilot to actually rolling this out, we learned a lot. Can you do this without experts? You probably can, but you can expect it's a disruptive change and you need to dedicate full-time people to it. I do recommend just going with the experts. With the IMS framework, we learned that almost all elements of the framework held true. As far as pulling together: everyone pulls together and does what's needed in a time of crisis. I mentioned we were worried about these 14 incident commanders and the toll that it would take on them. What actually ended up happening is that in most cases, when there were two people paged, they would both join the bridge, the buddy would stay on, and in fact many times two, three, or four other incident commanders would join the bridge to see how they could help and how they could learn.

00:23:37

Your first IC is definitely different; basically, just prep yourself for that. It gets much easier. Spelling it out: we wondered how much of this framework we needed to spell out after we took the training versus letting people figure some of this out. A great example of this: we had an outage about a month into starting our pilot, and it ended up running over 24 hours. The actual outage wasn't that long, but we were doing a lot of preventative work. We hadn't figured out how to do this yet, but we always had multiple incident commanders on the bridge communicating behind the scenes, coordinating with each other, and we got through it, and we actually reduced the duration of the outage substantially. There were some assumptions challenged. "Any prolonged issue of significance can be run by any incident commander" was proven invalid for us. I'd say the training disagrees with us.

00:24:24

I've seen many tweets out there that also disagree with us. We just haven't had good luck here. We've had some outage bridges go pretty poorly with someone who didn't have the domain expertise. I think some of being a strong incident commander is confidence, and that's knowing that we're asking the right questions and knowing we've got the right people engaged. That's just what's worked for us. Our leaders do tend to be somewhat technical in nature, so we may bleed into that SME role a little bit. We're definitely not troubleshooting, but we do know some of those technical questions. Again, it's what's worked for us; we'll continue to keep an eye on it. Along those same lines, mean time to assemble: we found our greatest opportunity to reduce MTTR wasn't getting an incident commander on the bridge as fast as possible, but getting the right one.

00:25:05

So because of that, we actually changed how we staff our on-call. We went away from an official on-call schedule, and we said: if you're involved, be involved in IC duties. What that means is we already had lists, prior to two-four, that would page out to large groups of people in a major issue. We have enough incident commanders that someone who is an incident commander is getting paged. Get on that bridge; if your stuff is involved, be prepared to step into that incident commander role. You can hand it off if it starts going a certain direction. We had started with "if you break it, you IC it," but we felt that this focused too much on a blame culture, and we wanted to stay away from that. This was another controversial decision, where there was concern that we wouldn't get the right person on the bridge fast enough, but that ended up basically being a non-issue.

00:25:53

But we did create a pool of senior incident commanders to help in any sort of atypical scenario. This is about 10 people, basically our strongest incident commanders. If for some reason you're not getting an incident commander on the bridge, or if you're having a particularly hard outage, maybe someone's struggling, anything like that, there's a safety valve: you can page this group at any point in time. I just wanna briefly touch on the other operational improvements that we made as well. Number one, we reduced failure group size with our load balancers: we virtualized our load balancers and went to more, smaller instances. We're improving our network monitoring by engaging with a third party. We're continuing our public cloud exploration and growth. We're leveraging group chat for incidents. And we're making it safe to talk about failure: we have now had several sessions where all we've done is talk about some of our epic failures from the past.

00:26:47

Instead of sweeping this under the rug, we're making sure that we're sharing these lessons learned. And then finally, we're implementing an out-of-band network, again engaging with a third party. So as I look back on the time since the outage, I really feel like I've learned so much, and so has CSG. We've talked about some of this today with incident analysis, the incident management system, and other changes to make our system more resilient. But I wanna talk about a few other things just for a minute. Complex system failure: we learned a lot more about this, and once you see it, you can't unsee it. Now we see it everywhere. A great example of this is the HBO miniseries on the Chernobyl disaster. My peers, my boss, and I were talking about this, and most of us had independently watched it, and we all came to the same conclusion:

00:27:31

Wow, this is a great example of complex system failure. Again, you just can't unsee it, and it looks the same in every industry. We think we're unique in software, but the reality is that complex system failure looks the same everywhere. Dr. Richard Cook wrote a great paper called "How Complex Systems Fail." I highly recommend you go take a look. You're gonna think it was written for the software industry, but it wasn't, which is quite interesting. There are some common reactions to failure. First of all, a major outage situation like we had is going to create a lot of stress, and it's going to cause people to behave differently. There's a tendency to look at things with hindsight bias: with all the information that we have at the end, we look back and say, why didn't you do this, or why didn't you think of this? But the reality is it's not that clear in the fog of war. There can be a tendency to blame or to punish.

00:28:22

So we definitely need to fight against that. And then also, human error gets blamed; in fact, human error gets blamed 85% of the time in accidents, where in reality humans are what keep these systems from failing more often. We learned that avoiding failure requires failure; there are just some things that you can't learn except by doing. And finally, we learned that we can come back from this. When this happened, we didn't know how we were gonna get ourselves back up. We didn't even know if we could, but we have. And impact minutes is a great metric that shows this. This is an internal measure where we're basically trying to measure the pain that we cause our customers. It's a formula that basically boils down to the duration of an outage and the percent of a product or products that are down during that time. And after two-four, we ate up almost our entire 2019 budget.
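As a rough illustration of the impact-minutes idea, here is a toy version of the calculation; CSG's actual internal formula and weighting are not public, so this only shows the shape of it.

# Hedged sketch of an "impact minutes" style metric: outage duration weighted
# by the fraction of the product that was down. The real formula is internal;
# this just illustrates the calculation described in the talk.

def impact_minutes(duration_minutes: float, percent_of_product_down: float) -> float:
    """Duration of the outage times the fraction of the product impacted."""
    return duration_minutes * (percent_of_product_down / 100.0)

# A 13-hour outage with (say) 80% of the product unavailable:
print(impact_minutes(13 * 60, 80))  # 624.0 impact minutes under this toy weighting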

00:29:11

We have a budget each year to reduce from the year before. We looked at it and said, we don't know if we can get home on this; this is going to be really, really challenging. But when the end of the year came, we did meet our goal for that year, which I think is a great testament to all the improvements we've put in place over the last several years in the DevOps space and operational stability, and then, layering on top of that, our learnings from our two-four outage. A quick recap and recommendations. When my people have a particularly nasty outage or they're going through a hard time, I tell them that they're gonna look back on this time as a blessing. It's not gonna feel like it at the time, but it's a time of intense learning. When I had to take my own advice with the two-four outage, it wasn't so fun <laugh>.

00:29:55

And this is a day that will stick with me for my entire career. CSG overall has grown so much from this, and I hope you've learned from this as well. We've gotten to meet a lot of great people along our journey who've imparted their wisdom, shaped our path, and helped cause us to think differently. So some recommendations that I have for you: find the opportunity in failure, engage with experts, make post-incident analysis part of your culture, implement incident command, and then finally, always keep learning. Along those lines, I wanna keep learning in this space as well. The help that I'm looking for is understanding and hearing more about your stories, so please share them. I'd love to hear how you failed in grand fashion and how you've implemented incident command. Thank you.