Las Vegas 2018

Mastering Outages with Incident Command for DevOps: Learning from the Fire Department

Leading companies such as Google, PagerDuty, and Atlassian have developed successful major incident management practices based on the Incident Command System (ICS), which was first developed by fire departments. We can learn from these organizations, where managing emergencies is a core capability.


Brent Chapman is an expert in emergency management and in guiding organizations to prepare for and learn from emergencies, with a strong background in IT infrastructure and site reliability engineering (SRE).


As a leader in Google’s legendary SRE organization, Brent convinced senior management of the need to strengthen and standardize the company’s incident management practices, and created the Incident Management at Google (IMAG) system that is now used throughout the company. He also helped refine the Postmortems at Google (PMAG) system that the company uses to learn from incidents large and small.


Brent brings a unique perspective to his work in IT, as a former air search and rescue pilot and incident commander, an emergency dispatcher and dispatch supervisor for major art & music festivals and events such as Burning Man, and a Community Emergency Response Team (CERT) member and instructor.


Throughout his career, Brent has designed, built, managed, and scaled IT infrastructure and teams for everything from embryonic startups to giants such as Google, Apple, and Netflix. He is the coauthor of the highly regarded O'Reilly book Building Internet Firewalls, the developer of widely used open source software, and a popular speaker at conferences worldwide. He has worked with dozens of organizations in Silicon Valley and around the world, as well as with a variety of non-profit and government entities.


Brent has a rare combination of experience as an emergency manager, technology manager, people manager, software developer, network/systems engineer, and educator. Now, he shares that expertise worldwide with clients as the founder and principal of Great Circle Associates, Inc.


Brent Chapman

Principal, Great Circle Associates

Transcript

00:00:05

My name is Brent Chapman. I am with Great Circle Associates for another week or so; Great Circle is my consulting firm. Starting on Monday, though, I'm taking a full-time position with Slack, so this is my last gig as an independent for a couple of years while I go work with Slack. A little bit about my background: I'm a systems admin, networking engineer, programmer, architect, et cetera. I've been in Silicon Valley for about 30 years and worked at places like Xerox PARC, Telebit, Silicon Graphics, Covad, and Tellme Networks. I spent six years at Google as an SRE (site reliability engineer) and SRE manager, managing the Google Fiber SRE team. One of the other things I did at Google was develop their incident management protocol and practice, which is basically what we're gonna talk about here today.

00:00:57

Throughout that time, I've also worked as a consultant with my own company, Great Circle. I wrote some software called Majordomo that some of you may have used back in the day, mailing list management software. I wrote the O'Reilly firewalls book, and some other open source software called Netomata, which is a network automation and configuration generation platform. But alongside all of that work in my professional day job, I've always worked on the side as a volunteer in emergency services. I started off as a search and rescue pilot with Civil Air Patrol, which is the civilian auxiliary of the US Air Force. By the time I left that organization, about 10 or 15 years later, I had worked my way up to being one of about 30 people in the state of California considered fully qualified to manage a search for a missing aircraft.

00:01:44

So, an incident commander for an aircraft search. I got involved with Community Emergency Response Teams in various cities in California: Mountain View, where I lived and worked for many years; San Francisco; Alameda. These days, most of my emergency volunteer work is with the Black Rock City Emergency Services Department. Black Rock City is Burning Man. I go out there a month before the event and help build, essentially, the 911 call center for a city of 75,000 people that only exists for a month a year, in the middle of a dry lake bed in the middle of a desert, 120 miles from the nearest major city. We handle fire, medical, and mental health response for Black Rock City. We handle about as many calls per year as any other 70,000-person city in Nevada does; we just do it all within a two-week period. When the event is fully up and running, at our busiest time, we're running at about the same calls per hour as someplace like the city of Boston. So it's a very busy place. Once I get the 911 center built each year, I switch over to being one of the 911 supervisors. And this year, for the first time, I was also honored to serve as the battalion chief for one of our shifts, essentially the operational lead of a 150-to-200-person department: firefighters, paramedics, and so on. So I still do emergency services. This is what I do for fun.


00:03:13

So what we're here to talk about today are some lessons from the fire department and the emergency services world that we can apply in the tech world. The first one I wanna leave you with is that incident response is a critical capability. We have building codes, we have sprinkler systems, we have fire alarm systems, we have fire escape ladders, we have building inspectors, and yet we still need a fire department to respond to the unexpected, to respond when things get beyond what those automated systems can do. Well, it's the same thing in the tech world. We still need somebody on call for when things go wrong, because things are still going to go wrong. No matter how much we automate, no matter how much we apply DevOps principles and SRE principles and so forth, that just means that what goes wrong is gonna be that much more complicated and that much trickier to deal with. So we need a capability for responding to emergencies. Now, let's think about your typical incident. Something happens, something goes wrong. It takes you a little bit of time after that to detect that something has happened. It takes you a little bit of time after detecting it to decide that it's really happening, that it's not just a glitch, not just a blip, and that you're gonna raise an alert about it. It takes a little bit of time after that for someone to respond to the alert: to get the page, acknowledge the page, get out their laptop, and so on.

00:04:42

Then it takes more time from when they respond to the alert until they've mitigated the problem. Mitigation means it's no longer a problem from the end user's point of view. We may still be dealing with it as an emergency behind the scenes, but it's no longer a problem from the end user's point of view. And then finally, you repair the problem. You return the system to some normal operational state, maybe the same state it was in before the problem, maybe a different state, but things are back to normal and the emergency is over. So, a bunch of things I want to point out here. First, your customers, your end users, whether they're internal or external, see the duration of the impact as running from the time the problem occurs, if they're one of the affected users, all the way through to mitigation. That's their perception of this outage, of this problem.

00:05:35

You, on the other hand, responding to it, don't even know there is a problem until somebody gets alerted, until there's an alert on the screen or a page goes off or whatever. So there's already a mismatch between your perception of the duration of the emergency and the customer's perception. There's a mismatch on the back end as well. From the user's point of view, the problem is over when you've mitigated it. But like I said, you may still be responding to it as an emergency for a much longer time, until you've managed to repair it. So there's this mismatch to be aware of between your perception and your customer's or your user's perception. Now, that first period, occurrence, detection, alert, respond, that's the realm of monitoring systems. Some of you were just in this room hearing Charity Majors talk about observability.

00:06:27

This is VictorOps, this is PagerDuty; this is your alerting and monitoring systems and your response protocols within your organization. Those are all very important things, but they're not what I'm here to talk about today. What I'm here to talk about is what happens from when you respond until when you resolve the incident. That's when incident management comes into play. And the reason I'm here to talk about that, and the reason I think it's important, is that if you do incident management well, you can bring down those mitigation and resolution times. You can make the outage shorter and less impactful, both from your users' and customers' point of view and from your own point of view, in terms of the impact on your own team. That right there is why incident management matters. It's why it's worth paying attention to.
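To make that timeline concrete, here is a minimal sketch in Python of how you might compute the phase durations and the perception mismatch described above. The field names, structure, and timestamps are illustrative assumptions, not part of any particular monitoring or incident tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class IncidentTimeline:
    """Key timestamps in the occur -> detect -> alert -> respond -> mitigate -> repair model."""
    occurred: datetime    # when the problem actually started
    detected: datetime    # when monitoring noticed something was off
    alerted: datetime     # when an alert/page was actually raised
    responded: datetime   # when a human acknowledged and started working
    mitigated: datetime   # when users stopped feeling the impact
    repaired: datetime    # when the system was back to a normal state

    def user_perceived_outage(self) -> timedelta:
        # Users feel the impact from occurrence all the way to mitigation.
        return self.mitigated - self.occurred

    def responder_perceived_incident(self) -> timedelta:
        # Responders don't know there's a problem until the alert fires,
        # and they keep treating it as an emergency until repair.
        return self.repaired - self.alerted

    def phase_durations(self) -> dict:
        return {
            "time_to_detect": self.detected - self.occurred,
            "time_to_alert": self.alerted - self.detected,
            "time_to_respond": self.responded - self.alerted,
            "time_to_mitigate": self.mitigated - self.responded,
            "time_to_repair": self.repaired - self.mitigated,
        }

# Made-up example: good incident management mostly shrinks the
# respond -> mitigate and mitigate -> repair phases.
t0 = datetime(2018, 10, 22, 14, 0)
timeline = IncidentTimeline(
    occurred=t0,
    detected=t0 + timedelta(minutes=4),
    alerted=t0 + timedelta(minutes=6),
    responded=t0 + timedelta(minutes=11),
    mitigated=t0 + timedelta(minutes=55),
    repaired=t0 + timedelta(hours=3),
)
print(timeline.user_perceived_outage())        # 0:55:00
print(timeline.responder_perceived_incident()) # 2:54:00
```

Note that the two printed durations differ on both ends, which is exactly the mismatch between the customer's view of the outage and the responders' view.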

00:07:21

So, the next lesson to draw from the fire department: draw a distinction between normal operations and emergency operations. Make it clear when you are dealing with an emergency and following emergency procedures and emergency rules, clear not just to yourselves but also to anybody you might normally interact with. Think about this: if you're in your city and you see a fire truck at the grocery store, and the firefighters are in there getting groceries for dinner, it's perfectly okay to walk up and chat and say, hey, can my kid sit in the fire engine? They love that stuff. That's great, that's fantastic. On the other hand, if you see them somewhere and the red lights are on, and the jackets and the helmets are on, they're working, right?

00:08:12

And 99.9% of the general public knows better than to bother them then, because they're dealing with an emergency. So they have that clear visual distinction for people between when they're dealing with an emergency and when they're not. They also get to follow different rules during an emergency. They've got the lights and sirens on, they get to blow through traffic, and everybody else has to pull over and get out of the way. We can benefit in the IT world from adopting similar practices, similar distinctions between emergencies and non-emergencies, and from doing things differently in an emergency in order to get through it as quickly and effectively as possible and get back to our normal way of doing things. And I'm not criticizing our normal way of doing things. The normal way we do things, with matrix management and project teams and all of that, is great for day-to-day work.

00:09:03

It allows our companies to do the amazing things we do, but it doesn't work well during an emergency. There are different, better ways of working together in an emergency, and that's a little of what I'm here to introduce you to today: ways that will get you through that emergency more quickly and with less impact, and get you back to your normal way of operating. So there's something developed by the fire departments called the Incident Command System. The Incident Command System was developed in the late 1960s and early 1970s in Southern California. Now, one thing to understand, especially for folks coming from outside the US: in the United States, fire departments are typically a local government function. In many other countries, they are a state- or national-level function, but in the US they're a local function.

00:09:53

Every city has its own fire department, with its own budget, its own politics, its own policies, its own practices, its own terminology, its own sets of equipment, all totally incompatible with each other. Or at least they used to be; we're getting better about it, and the Incident Command System is a very large reason why. So in the 1960s, all of these cities in Southern California around Los Angeles and San Diego, the hundred-plus cities in that area, would have to come together to fight wildfires every summer and every fall. And they realized they were not doing it as effectively as they could. Some of it was terminology. What one department would call a truck, another department would call an engine, another would call a pumper; they didn't even have compatible terminology for the types of equipment.

00:10:46

And that's a problem when you call over the radio and say, hey, I need a truck, emergency at Fourth and Main, expecting to get an engine, something with hoses and a water pump and a water tank, and instead what you get is a hook and ladder truck, or a pickup truck, because the department that answered the call didn't use the same terminology you did. This is one of the problems they addressed with the Incident Command System, but there were many other problems. So the Incident Command System is a set of principles that these departments worked out so that they could work together better when they needed to in an emergency. It is modular and it's scalable, and that's the other very important point about the Incident Command System, one that applies very much to what we do.

00:11:35

It's scalable. You often don't know at the start of an incident how big that incident is gonna get: how many teams are gonna be involved, how many different experts, how many different specialties, and so forth. So you need a way of managing that incident that will scale as your understanding of the incident unfolds and as the resources assigned to the incident grow and shrink over time. You need a flexible mechanism for managing all of those people, and the Incident Command System gives you that. Now, going through all of these points, the ICS principles, is roughly a half-day to day-long class, so we're not gonna go through it all right now, but you can read more about it, and I've got some pointers in the last slides. There are a few tweaks to the public safety world's Incident Command System that we can adopt to make it work even better for our types of incidents, for outages and so on.

00:12:28

So, start with the standard ICS-style org chart as applied to IT incidents. You start with two people: the incident commander, or IC, and the tech lead. Some organizations call the latter the ops lead or whatever. The tech lead's job is to solve the problem at hand. The tech lead is usually the on-call engineer who got paged when the incident occurred. They do a little bit of investigation, they decide it's more than they can deal with on their own and that they need to launch a full-blown incident response, and at that point they become the tech lead. They page someone else who takes over as the IC, the incident commander, and the incident commander's job is to deal with everything else so that the tech lead can focus on solving the problem. Getting more help, informing the executive team of what's going on, whatever else needs to be done, that's the IC's problem,

00:13:24

so that the tech lead can just focus, focus, focus. Now, when the tech lead themselves needs more help, whether that's just more hands on keyboards or more specialized knowledge, they pull in a series of subject matter experts: your database team, your app team, your networking team, or just more people from their own team if that's what they need. So you pull in more engineers, subject matter experts, to work for the tech lead. One of the IC's responsibilities is communicating with the rest of the organization and, if necessary, the rest of the world. That can take on a life of its own; it can turn into a big job. So the incident commander may designate a communications lead to help them with that communication. Typically, the responders communicate pretty well with each other, and there are ways to make that even better, but they often communicate poorly with the rest of the organization while the incident is in progress.

00:14:24

So having somebody designated to handle that communication can help there. Another role you may need to add is what some organizations call a scribe: not necessarily somebody who's taking notes, but somebody who's gathering all of the documents and the data, making sure the recording is turned on in the Slack channel, and generally collecting all of the artifacts that other people involved in the response are producing. And finally, another type of role that's often helpful in these incidents is the liaison. A liaison is a representative of some other group that's affected by the response but not necessarily a part of the response. For example, your call center or your executive team. These are people who want to know what's happening with the response; they're impacted by it, they have input to it, but they're not part of the response itself.

00:15:15

So the liaison is their representative. It's typically someone from that outside group, and they are the representative of that group within the response. And it's a two-way street: they're carrying information from the responders to that group, the exec team, the call center, whatever, but they're also taking information from their group and feeding it into the response as appropriate. Now, following the principles and guidelines of unity of command, span of control, and all those other things I said we weren't gonna get into in the very short amount of time we have together today, you can just keep scaling this up as the incident unfolds and as you get more people, to where you've got multiple DBAs, multiple networking teams (both local and wide area), storage, customer care, whatever. You develop an org chart on the fly for that particular incident, based on the situation at hand and the resources available at that time.
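As a rough illustration of how modular and scalable that structure is, here is a minimal Python sketch of an incident org chart you could grow or shrink on the fly. The role names and people are invented for illustration; this is an assumption about how you might model it, not a prescribed tool.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Role:
    name: str                        # e.g. "incident commander", "tech lead", "SME: database"
    person: str                      # whoever is filling the role on *this* incident
    reports_to: Optional["Role"] = None
    team: list["Role"] = field(default_factory=list)

    def add(self, role: "Role") -> "Role":
        # Attach a new role under this one; returns it so callers can keep building.
        role.reports_to = self
        self.team.append(role)
        return role

# Start small: just an IC and a tech lead.
ic = Role("incident commander", "alex")
tech_lead = ic.add(Role("tech lead", "sam"))       # usually the on-call engineer who got paged

# Scale up only as the incident demands it.
tech_lead.add(Role("SME: database", "priya"))
tech_lead.add(Role("SME: networking", "jordan"))
ic.add(Role("communications lead", "casey"))
ic.add(Role("liaison: call center", "morgan"))

def print_chart(role: Role, depth: int = 0) -> None:
    print("  " * depth + f"{role.name}: {role.person}")
    for member in role.team:
        print_chart(member, depth + 1)

print_chart(ic)
```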

00:16:14

Every incident is gonna have a different org chart. People may play a different role on one incident than they do on the next. Just because I was the incident commander yesterday, I may be one of the subject matter experts today, and people have to be clear on that. So it's important, in developing your incident management protocol this way, to focus on the roles, not on specific individuals. You focus on the roles, and on developing a pool of people who can step into each of those roles. But you need to train everybody, and I mean not just the responders but also anybody who wants to interact with the response, to talk to the role, not to the individual. Talk to whoever today's incident commander is, not to Joe just because Joe was the incident commander on the last incident.

00:17:04

Joe may not be the incident commander today; she may be the comms lead or the ops lead or something else entirely. You have to get people into this habit. You have to get it across to people that the role varies from incident to incident. The other thing about this is that your role on a particular incident may have only a very loose relationship to your everyday status in the organization. Your incident commander may be a mid-level engineer or a project manager; the comms lead may be the senior vice president. So everybody needs to know that all of our day-to-day organizational structure and ranks go out the window for the purposes of an incident response, and we deal with that response independently of them. Again, people need to get comfortable with that idea, and in some organizations that takes a lot of doing.

00:18:00

The next lesson from the fire department: practice, practice, practice, and then practice some more. This is a foreign way of working for most people in most organizations. It takes some getting used to; the terminology is different, the principles are different, the protocols are different. It's a little uncomfortable, and the first few times you go through it, it's awkward. So the best thing you can do to be prepared is to practice. If you can do an actual live drill or something like that, beautiful, great, but work your way up to that. One of my favorite ways of practicing is to take some planned event that's not an emergency, like, say, a building move, or a data center bring-up, or a quarterly deployment of new software, and organize it as if it were an emergency. Use the org chart, use the terminology, use the communication tools that you would use in an emergency, the Slack channels and so on, and just go through the process as if it were an emergency, to get people used to these concepts, the terminology, and the tools.
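As a sketch of what such a practice run might look like written down, here is an illustrative Python structure for a planned change run as a drill. Every name, channel, and interval here is invented; the point is only that the plan assigns the same roles and uses the same communication channels you would use in a real incident.

```python
# A planned, non-emergency change run as an incident-management drill.
# All names, channels, and times are made up for illustration.
practice_incident = {
    "name": "Q4 data center bring-up (practice incident)",
    "is_drill": True,
    "roles": {
        "incident commander": "alex",
        "tech lead": "sam",
        "communications lead": "casey",
        "scribe": "morgan",
    },
    "comms": {
        "responder_channel": "#incident-drill-dc-bringup",  # same tooling as a real incident
        "status_updates": "#eng-announce",
        "update_interval_minutes": 30,
    },
    "debrief": "hold a post-incident review, exactly as you would for a real outage",
}

for role, person in practice_incident["roles"].items():
    print(f"{role}: {person}")
```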

00:19:21

One last point I wanna leave you with, and it's a very important one. It's really easy for senior managers, directors, VPs, et cetera, to totally disrupt an incident response in progress, totally accidentally. They don't mean to, they don't realize they're doing it, but just by showing up, and especially just by asking questions ("all we're doing is asking questions, I just wanna know what's going on"), depending on what you ask, who you ask, and who sees you asking, who all is in the room or on the channel when you ask, you can totally disrupt a response in progress. Because if the senior vice president shows up on the Slack channel that the responders are all using to talk to each other about what's going on, who's troubleshooting what, what are we finding, we think maybe it's this problem over there, and that SVP pops up and says, hey, what's going on?

00:20:13

All that other debugging activity grinds to a halt while everybody pivots to this very important person to try and figure out what they want, because they're the one who sets the bonuses and the performance reviews, right? And you've just totally derailed that incident response in progress. So my advice to people who are senior managers and executives is: if you want to interact with an incident response in progress, do it behind the scenes. Do it directly with the incident commander, and do it in a private channel, a direct message, or something like that, not in the main comms channel or phone bridge or whatever your organization uses for incident response. Things will go much, much smoother, and you'll get better responses out of it.

00:21:01

That's all the prepared material I have. Like I said, there's a lot more to learn about this stuff, and I've put a few resources up here. In particular, PagerDuty, at response.pagerduty.com, has published basically a sanitized version of their own internal incident response protocols and practices. It's pretty good, and the source for it, in Markdown, is available through GitHub, so you can take it, copy it, make your own changes, and use it yourselves. It's released under an open source license, so they encourage you to do that. There are also a couple of good chapters on incident management in the SRE (Site Reliability Engineering) books from Google and O'Reilly, so those are a good starting point as well. And feel free to visit that last URL on the slide (greatcircle jmp slash greatcircle-18-10-24; I'd intended to do this talk on Wednesday, not today), and you can get a copy of these slides. Does anybody have any questions for me? Yeah.

00:22:08

Out of curiosity, how come the liaison comes in on the main channel as opposed to all the [inaudible]?

00:22:14

How come the liaison comes in on the main branch, reporting to the incident commander, rather than off of the comms manager? Mostly because if a group is important enough for you to have a liaison to, such as your exec team or your customer care center, your call center, or whatever, they're probably gonna be demanding the attention of the IC rather than having that funneled through the communications person. But one of the hallmarks of this program is flexibility. If it makes more sense in your situation, in your circumstances on that particular incident, to have the liaisons report to the comms lead, nothing says you can't. Do what works. Other questions? Yes.

00:22:58

Any advice or experiences on transitioning after the incident over to the RCA?

00:23:03

Any advice on transitioning after the incident to the RCA, the root cause analysis? Yes. First off, I hate the term RCA; I'm part of the camp that says calling it "root cause" is misleading and causes all sorts of other problems. But putting that aside for the moment, I believe that any incident big enough to require an organized response like this is also big enough to justify a post-incident review, postmortem, RCA, whatever you wanna call it. And I think it's most effective for that post-incident review process to be led by the incident commander of the incident; if they're not leading that process, they're certainly a key contributor to it. So I think it's best if it just naturally flows from one into the next. In many cases, when I was working on incidents at Google, I would grab the postmortem template

00:24:02

('cause at Google, they call 'em postmortems) and start writing the postmortem before we had even wrapped up the incident. You're closing things out, you know what direction things are gonna go, you're starting to do cleanup, so you can start working on it then. One of the corollaries of that, one of the consequences, is that you have to give people time for it. If somebody is going to be an incident commander, it's not just that they're gonna spend an hour or two dealing with the incident; they're gonna spend an hour or two dealing with the incident and then another ten hours over the next week or two dealing with the postmortem and the follow-up. Management, and the people in the incident commander rotation, have to understand that that's part of the job, and allocate their time accordingly. Make sense? Yes.
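For what it's worth, here is a minimal sketch of that "grab the template and start writing before the incident is even closed" habit. The template and field names below are my own invention for illustration; this is not Google's or PagerDuty's actual postmortem format, just the idea of pre-filling a skeleton from the incident's metadata.

```python
from datetime import datetime

# A deliberately simple, invented postmortem skeleton; real templates are richer.
TEMPLATE = """# Postmortem: {title}
Date: {date}
Incident commander: {ic}
Status: DRAFT (started before the incident was fully closed)

## Summary
{summary}

## Impact
TODO: user-visible impact, duration, number of users/customers affected.

## Timeline
{timeline}

## What went well / what went poorly
TODO

## Action items
TODO: each with an owner and a due date.
"""

def start_postmortem(title: str, ic: str, summary: str,
                     timeline_events: list[tuple[str, str]]) -> str:
    timeline = "\n".join(f"- {ts}: {event}" for ts, event in timeline_events)
    return TEMPLATE.format(
        title=title,
        date=datetime.now().date().isoformat(),
        ic=ic,
        summary=summary,
        timeline=timeline,
    )

draft = start_postmortem(
    title="Checkout latency spike",
    ic="alex",
    summary="Checkout p99 latency exceeded 5s for ~40 minutes; mitigated by rolling back release 142.",
    timeline_events=[
        ("14:06", "paging alert fired"),
        ("14:11", "on-call acknowledged"),
        ("14:55", "rollback complete, latency recovered"),
    ],
)
print(draft)
```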

00:24:53

So I'm used to organizations where it's always the same person. This is their job, right? They carry the

00:25:01

Pager, yeah.

00:25:02

Whatever it is today. And it sounds like, from your presentation, that this is not the case.

00:25:07

I don't like that model. So let me repeat the statement: many organizations are set up such that they have one person who is the incident manager, and they're the incident manager for every incident; that's their job. I don't like that model for a bunch of reasons. One, it makes your organization very dependent upon that one person. If they're out sick, or they're on vacation, or they're already dealing with another incident, what then? Who's their backup? Who else can handle this? It's also a huge burden on that one person. They're getting called day and night, at off hours and so on; the on-call burden is not trivial. So I think it's better to have a rotation of incident managers on call. If you're informal, you can do it as just "page everybody and see who's most available right now," but that's not my preferred way to do it. I prefer to have incident manager be just another on-call rotation, just like the engineer on call for a given service. But that means you have to have a set of people who are trained and qualified for that role. Exactly. Yeah. Yes.
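A minimal sketch of treating "incident commander" as just another on-call rotation, and of paging the role rather than a named person. The schedule format, names, and page_role placeholder are assumptions for illustration, not any particular paging product's API.

```python
from datetime import date
from typing import Optional

# A simple weekly rotation through a pool of trained, qualified ICs.
# Names and the rotation start date are invented.
IC_ROTATION = ["alex", "priya", "jordan", "casey"]
ROTATION_START = date(2018, 10, 1)   # a Monday; each person takes one week

def current_incident_commander(today: Optional[date] = None) -> str:
    today = today or date.today()
    weeks_elapsed = (today - ROTATION_START).days // 7
    return IC_ROTATION[weeks_elapsed % len(IC_ROTATION)]

def page_role(role: str, person: str, message: str) -> None:
    # Placeholder: in practice this would call your paging tool.
    print(f"PAGE [{role}] -> {person}: {message}")

# Callers page "the incident commander", not a specific individual.
page_role(
    "incident commander",
    current_incident_commander(date(2018, 10, 24)),
    "Checkout latency incident declared; please join #incident-checkout",
)
```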

00:26:17

From your experience, can you give an example of one incident management process that worked extremely well, and another where you would say, do not follow this?

00:26:29

Can I give two examples, one of an incident management process that worked very well and another that I would say do not follow at all costs? Let me answer that second one first, which is the example we were just discussing: having pre-planned not just what the roles are but who is going to fill those roles. I think that leads to brittle emergency response, and those sorts of plans shatter when they impact with reality. So that's the what-not-to-do. The thing that I think is really worth doing, especially for those of us in the tech world, is to use text-based, channel-oriented communications mechanisms. And you know, I told you I'm going to work for Slack on Monday.

00:27:14

Well, okay: Slack, HipChat, IRC, things like that, and channel-focused, channel-based, so that the channel has a life of its own independent of who's in the conversation at the moment. With simple group text chats, the problem is: how do you get a recording of the whole thing? If the person who created it leaves the channel, or gets dropped because their phone loses service, do you lose the whole channel and the whole history? How do you find that thing? How do you join that conversation? You have to wait for somebody to invite you into it. So I'm also a huge proponent, for tech companies, of text channels over phone bridges. I hate phone bridges for incident response, but I don't think we have enough time to go through why. (Laughs.) Other questions? I think we have time for one more, but not that one; I'm happy to talk about that one at one of the lean coffee sessions or something and tell you everything I hate about phone bridges.
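As one example of the channel-oriented approach, here is a hedged sketch using the slack_sdk Python client to spin up a dedicated, persistent channel per incident and post a kickoff message. It assumes a bot token with the appropriate scopes (e.g. channels:manage and chat:write); the channel naming convention, incident IDs, and message text are my own invention.

```python
import os
from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError

# Assumes SLACK_BOT_TOKEN is set and the bot has channels:manage and chat:write scopes.
client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

def open_incident_channel(incident_id: str, summary: str, commander: str) -> str:
    """Create a dedicated channel for the incident and post a kickoff message.

    The channel outlives any individual participant, so the history lives in one
    findable place instead of a private group text or a phone bridge.
    """
    try:
        channel = client.conversations_create(name=f"incident-{incident_id}")
        channel_id = channel["channel"]["id"]
        client.conversations_setTopic(
            channel=channel_id,
            topic=f"IC: {commander} | {summary}",
        )
        client.chat_postMessage(
            channel=channel_id,
            text=(
                f"Incident {incident_id} declared: {summary}\n"
                f"Incident commander: {commander}. Responders coordinate here; "
                "execs, please DM the IC rather than posting questions in this channel."
            ),
        )
        return channel_id
    except SlackApiError as err:
        raise RuntimeError(f"Failed to set up incident channel: {err.response['error']}") from err

# Example with a hypothetical incident:
# open_incident_channel("2018-10-24-checkout", "Checkout latency spike", "alex")
```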

00:28:13

Alright. Thank you all very much today. Have a great conference.