Findings From The Field: Two Years of Studying Incidents Closely (London 2020)

In the past two years, we have had the opportunity to observe and explore the real nitty-gritty of how organizations handle, perceive, value, and treat the incidents that they experience. While the size, type, and character of these companies vary wildly, we have observed some common patterns across them. This talk will outline these patterns we've discovered and explored, some of which run counter to traditional beliefs in the industry, some of which pose dilemmas for businesses, and others that point to undiscovered competitive advantages that are being "left on the table." These patterns include:

- A mismatch between leadership's views on incidents and the lived experience that engineers have with them.
- Learning from incidents is given low organizational priority, and the quality and depth of the results reflect this.
- "Fixing" rather than "learning": the main focus of post-incident activity is repair.
- Engineers learn from incidents in unconventional ways that are largely hidden from management (and therefore unsupported).


John Allspaw

Principal/Founder, Adaptive Capacity Labs, LLC



Without doubt, almost everyone in the DevOps community knows the name John Allspaw. In fact, if there was a starting gun for the DevOps movement, it was certainly the famous talk at the Velocity conference in 2009, when John Allspaw and Paul Hammond talked about doing ten deploys a day at Flickr. I finally got to meet John Allspaw in person at DevOpsDays in Mountain View in 2010; it's a meeting I'll never forget, and over the years I've learned so much from him. He got his master's degree from Lund University. Back when he was the CTO of Etsy, his advisors included Dr. Richard Cook and Dr. Sidney Dekker, famous for their contributions to the safety and resilience engineering community. He now works with both of them as partners at Adaptive Capacity Labs. He told me a couple of months ago just how much he's learned working with them, applying his colleagues' expertise to study incidents deeply, such as what Erica just shared. What dazzles me about John's work is that he believes that by understanding how organizations handle incidents, we get incredible clues about how organizations learn. I can't think of a better presentation to follow Erica's than one from John Allspaw, who has contributed so much to this space. Please welcome John.


Before I start, I want to make a few things clear. The first is that these are only a few of the most common patterns across the organizations we've come into contact with; that means both clients and non-clients. The second is that these are not judgments or comments on any single organization, client or non-client. I just want to make sure we're clear that these are observations that seem to have some valid support. So here's the summary slide, the bottom line up front. First: the state of maturity in the industry on learning from incidents is low.


Really low. I've been studying this for a number of years with my colleagues, and I can say that we, for the most part (a blanket generalization about the industry), think that this is a solved problem and that we're quite good at it. This could not be further from the truth. I would say that a significant part of tech leadership in the industry, as I'll go into later, fundamentally misunderstands what it means to effectively learn from incidents; practitioners do as well, but there's a gap there. The second point is that this gap exists: a gap between how technology leaders and hands-on practitioners understand incidents, understand what they mean, and understand how to learn from them in a progressive and effective way.


The third point is that learning from incidents, as you might guess at this point, is given low priority relative to all of the other things that businesses are doing. This results in a really narrow focus on fixing rather than learning, and that has a number of follow-on cascade effects that tend to be, well, not good for business. And lastly, there's a real overconfidence in what these shallow incident metrics (as we'll call them) mean, and a great deal of energy wasted on tabulating them and producing reports on them. These metrics are frequently gamed. They don't have any predictive value; they're about as useful as counting lines of code. When it comes to shallow incident metrics, we think we're doing astronomy, but we end up doing astrology instead. So let's talk about this gap. This gap between technology leaders and hands-on practitioners is one of the more prevalent, and more concerning, of these patterns.


What is actually learned versus what we think we learn; how learning actually takes place versus how we think learning is supposed to happen, and all of the programs we might put in place to support that; and what the incident actually means for the business, not just what it means right now, but what it means for anticipating future incidents and future vulnerabilities. I want to be really clear on this, so let me describe what I mean. When I say technology leaders, I mean people who are at the upper echelons, hierarchically, of an organization; hands-on practitioners are those who have their hands in the guts of the technology. The reason I'm making this distinction between tech leaders and practitioners is to draw attention to something that I think is so fundamental that we often don't acknowledge it. In the research in human factors and resilience engineering, technology leaders have a particular name.


That is to say, those in positions to provide resources, policies, and rules of compliance would be known as the "blunt end." This is in contrast to what is known as the "sharp end": the people whose day-to-day work is about evolving and maintaining this technology. So here we have practitioners who are charged with the responsibility of designing and evolving this technology, and leaders who are charged with providing resources and policies that are supposed to at least support and enable practitioners to live up to that responsibility. But as you can see, they're quite distant. When it comes to incidents, technology leaders tend much more often to look at summaries, simplifications, and abstractions, certainly lots of statistics about incidents, and very rarely at what makes incidents difficult. The people furthest away from the day-to-day details need to understand what it means to cope with the complexity at that sharp end, at least enough to support those hands-on folks effectively.


Isn't that a funny thing? There's somewhat of a dilemma there. When it comes to incidents, and to having a close understanding of what's going on and how all of the different parts, components, subsystems, and behaviors interact, hands-on practitioners understand, because it's part of their daily work. They might be on call; they might be charged with fixing bugs or anticipating new ones. They have a much more palpable sense of the health of all that myriad stuff, in a much more organic, living fashion. Technology leaders, to some extent, might have a bit of an idea of what is going on in the stuff down below, but for the most part what they see are simplifications. These wipe away a lot of what makes incidents difficult so that it can all fit into a neat story.


As a result, they tend to want to retreat back into a world that's nice and linear, that falls into columns and fits in an Excel spreadsheet. So I'm going to address technology leaders and some of these patterns for this next bit, and I'll talk about practitioners after that. The first pattern we see is that, like I mentioned before, leaders are typically far away from the messy details of incidents. They may have begun their careers as hands-on practitioners, but when that's the case it actually does the opposite of what you would want: it frequently brings them to overestimate their ability to understand the real details. Another pattern is that technology leaders frequently believe that their presence and participation in incident response channels, bridges, chats, conference calls, and that sort of thing, has a positive influence.


And I can tell you, this is absolutely not the case. It's certainly not guaranteed to be negative, but nowhere have we seen it turn out to be uniformly positive. Technology leaders also typically believe that incidents are adverse events that exist in an otherwise quiet and healthy reality, meaning that if you just don't touch the system, it'll be fine; that incidents are epiphenomena. But incidents, as we know, represent very important signals to pay attention to, and dismissing them as defects, as some anomalous situation where normal is zero, is a mistake: there is no "Vision Zero" in modern software systems. Technology leaders also typically fear how incidents reflect poorly on their performance as leaders more than they fear practitioners not learning effectively from them. There are a number of reasons for this; certainly incentive structures can encourage leaders to put much more priority on how they look from a political standpoint in the organization than on supporting effective learning at the sharp end, the end that's more distant from them.


Technology leaders believe these abstract incident metrics tell enough of a story for them to understand the state of the system when, as it turns out, they don't. They typically believe that these metrics reflect more about their teams' performance than about the complexity those teams have to cope with. If there's been a rash of incidents, one of the common questions is: are they all happening within this team? Is there a common individual, a common group, or a common part of the organization? That centers the question on people's performance, not on the system's complexity. Finally, I would say that technology leaders typically believe that all of the above observations don't apply to them. So let's talk a little bit about these abstract incident metrics, what we call shallow metrics. What I mean by those are all of the ones that you've heard of:


The typical, conventional mean time to resolve, mean time to detect, mean time to acknowledge, mean time to something; the frequency of incidents; the severity; the customer impact. This is all data, and it's fine; it's quite common to collect it. But when it comes to learning from incidents, this data doesn't have any predictive value going forward. There is no such thing as trending, despite popular belief, with this data. Here I'm showing incident data from a well-known cloud provider. There's no predictive value looking forward at this data, and there's no explanatory value looking backward. These are just numbers, completely divorced from the substance. Quite often, people will say, "Well, I get that, John. We understand this doesn't provide a lot of insight into the incidents, but it helps us ask deeper questions; it points us in some directions." What I would say is: you don't need this chart to ask deeper questions about incidents.


Just ask the questions. You either have an incident or you don't. Ask the questions. More importantly, from a leader's standpoint: you certainly could ask questions, but you're not the most important person to ask them. The most important people to ask questions are the people who are going to ultimately be responsible, the people who are getting up in the middle of the night. By the way, if you're going to ask the questions, then you ought to consider recording those questions and the answers to them, the multiple answers, so that others can find them in the future. This is what supporting learning looks like. Unless leaders encourage and support capturing the difficulties of handling an incident, all they'll have is the shallow data. My gut tells me, and our experience bears out, that it's not that people fight tooth and nail to have this shallow data; it's that they don't have anything else. So I'm going to put this question to you, and it's meant to spur some thinking: how can you tell the difference between a difficult incident handled well and a straightforward incident handled poorly?


Of course, the thing that jumps out at you is difficulty. How would you know whether one incident is more difficult than others, and in what ways?


We already have the consequences or impact of the incident; incident metrics have that handled: customer impact, effects on SLAs and contractual obligations. But what's missing here is the quality and difficulty of handling the incident, and the performance of the teams who are handling it. Without these two, you can't understand what these incidents mean in context. This topic is beyond the scope of this talk, but it's worth mentioning. I'm going to put up a slide here to help convince you that incident metrics do not do what you think they do: a couple of books and a link to a blog post that talk a little bit more about this. I very much urge you to take a look at these.


So let's turn our attention to hands-on practitioners. One of the common patterns is that they typically view post-incident activities as a check-the-box chore. Certainly in some cases that does happen, but more often than not, practitioners don't look forward with excitement and enthusiasm to attending a group incident review meeting. They typically believe in a future world where automation will make incidents magically disappear. To some extent, this is understandable, especially when it comes to engineers: engineers want to fix things, and incidents are difficulties they want to engineer away. They also typically do not capture what made an incident difficult, but more often only what the technical solution for it was. They would like to cut to the chase, zip past the story, and get to the thing, and that's for a number of different reasons.


But what that does is shortchange, really short-circuit, what sort of learning could be had, especially from informal captures of post-incident narratives. They also typically don't write the post-incident write-up for readers beyond their local team; if they write it for anybody, they're going to write it for their local team more often than not. And as a result, they typically don't read post-incident review write-ups from other teams. Why? Because it's a lot of work: the write-ups usually have lots of jargon and things that are very difficult to understand unless you are familiar with the technology. That limits how many people might be able to learn something about how stuff works in the organization. Practitioners also typically fear what leadership thinks of incident metrics more than they fear misunderstanding the origins and sources of the incident themselves. There's some variation in this; there are certainly incentives and motivators that would lead you to fear what leadership thinks more or less, depending on the organization. But it is a thing. Practitioners typically have to exercise significant restraint to keep from immediately jumping to fixes before understanding an incident beyond a surface level. Quite often you'll hear that the point of having a post-incident review meeting is to develop action items, and that if you don't develop action items, well then, what are we doing here? Learning is the possibility; from that learning might come some decent things to do about the future.


And of course, hands-on practitioners, when we talk to them, typically believe that the above observations also don't apply to them. So I want to return to something I said earlier and put some attention on it: learning is not the same as fixing. The danger here is not that coming up with action items, follow-on remediations, or things to help in the future is a bad thing to do. In fact, it's an amazing thing to do. The question isn't whether you're doing them or not; the question is how good they are and what understanding they are based upon. If you understand an incident only at a surface level, you run the risk of generating follow-on action items that are either too large, in a boil-the-ocean sort of way, or completely off the mark, not addressing what the incident actually was.


Again, this is somewhat beyond the scope of the talk; there are some materials here that I will throw at you to read more about it. So you could say: "Okay, I understand, John. We're not doing very well out in the world. Then what do you have to give us?" There are a couple of things, beyond getting a master's degree or doing the work that we focus on every day. Aimed at technology leaders, I would say this: learning from incidents effectively requires skill and expertise that most do not have. And by most, I mean all. I don't want to sugarcoat it. I would also say that these are skills that can be learned and improved. You can imagine this: you don't become an NTSB accident investigator without having learned some new skills. The same applies here: prioritize learning this, prioritize doing it well, and do it when things are going well. Don't wait until you've had "the big one" to pound the table and say we have to get better at incident analysis.


There's no better time to fix the roof than when the sun is shining. I will tell you this: it will accelerate the expertise in your organization. And with the amount of time, effort, and money you're spending on retaining talent, you could spend just a little bit more investing in accelerating the expertise that you already have on staff. I'm going to link you here to the Learning From Incidents site. It's not just Adaptive Capacity Labs that's paying a lot of attention to this; there is also a community of people from across the industry who are looking at what deep analysis looks like and how it differs from your typical template-driven postmortems. Some other suggestions for technology leaders: focus less on incident metrics and more on signals that people are learning.


How could you tell that someone is learning? Here are some suggestions. Try to get some analytics on how often an incident write-up is being read. Try to find out who is actually reading these write-ups of their own volition. Where are these write-ups being linked from? It may surprise you to know that we've seen organizations that link to individual incidents not only from code comments in the code base, but also from their architecture diagrams and their roadmaps. I know of at least a handful of organizations that use incident analyses in new-hire onboarding and orientation, because they're such good tours of how things work. Support incident review meetings being optional, and track their attendance. I can say this with some confidence: engineers do not go to optional meetings unless they are very interested. And if they're very interested, there's something there.
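This kind of read tracking can start as a very small script. Here is a minimal sketch, assuming you can export read events (write-up id, reader, reader's team) from your wiki or docs tool; the record shape, team names, and function name are all hypothetical illustrations, not a real tool's API:

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical read-event records, e.g. exported from a wiki's access log.
@dataclass(frozen=True)
class ReadEvent:
    doc_id: str   # incident write-up identifier
    reader: str   # username
    team: str     # reader's team

def outside_team_reads(events, owning_teams):
    """Count distinct readers per write-up who are NOT on a team closest
    to the incident -- a rough proxy for voluntary, cross-team reading."""
    seen = set()
    counts = Counter()
    for e in events:
        if e.team not in owning_teams.get(e.doc_id, set()):
            if (e.doc_id, e.reader) not in seen:
                seen.add((e.doc_id, e.reader))
                counts[e.doc_id] += 1
    return counts

events = [
    ReadEvent("INC-101", "ana", "payments"),
    ReadEvent("INC-101", "bo", "search"),
    ReadEvent("INC-101", "bo", "search"),   # repeat read, counted once
    ReadEvent("INC-102", "cy", "payments"),
]
owning = {"INC-101": {"payments"}, "INC-102": {"payments"}}
print(outside_team_reads(events, owning))  # Counter({'INC-101': 1})
```

The point is not the script; it's that the signal (are people outside the closest teams choosing to read this?) is cheap to compute once the events are available.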


I can tell you that I can't learn anything if I don't go to the place where I can learn it. You want this number to go up. Track which write-ups link to prior relevant incident write-ups. Seeing how elements, aspects, and facets of your architecture and your organization change over time can be done with cross-incident analysis, and the only way you can do cross-incident analysis is if each of the individual analyses is rich enough for you to see these meta-patterns over time. There's more about this here. So now I'm going to turn my attention to practitioners. Don't place all the burden on a group review meeting. The conventional view is: you have an incident, you maybe make a timeline, and then you have a big meeting. Instead, use this meeting to present and discuss analysis that's already been done.
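The cross-incident link tracking mentioned above can also start very simply. As a minimal sketch, assuming write-ups reference prior incidents by an id like INC-123 (the id format, sample corpus, and function name are hypothetical):

```python
import re

# A hypothetical corpus of write-up bodies keyed by incident id.
writeups = {
    "INC-201": "Retry storm in the queue worker. See also INC-105 and INC-118.",
    "INC-202": "Certificate expiry; no prior related incidents found.",
}

INCIDENT_REF = re.compile(r"\bINC-\d+\b")

def prior_incident_links(writeups):
    """Map each write-up to the other incident write-ups it references --
    the raw material for cross-incident analysis."""
    return {
        doc_id: sorted(set(INCIDENT_REF.findall(body)) - {doc_id})
        for doc_id, body in writeups.items()
    }

print(prior_incident_links(writeups))
# {'INC-201': ['INC-105', 'INC-118'], 'INC-202': []}
```

A write-up that links to prior incidents is evidence that someone went looking for patterns; a corpus where almost nothing links to anything is a signal worth noticing in itself.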


There are too many potential pitfalls to bet everything on a single meeting. There's the highest-paid person's opinion dominating the conversation. There's groupthink, which is a dynamic that can happen. People can go off on tangents. There are various political redirections: "Oh no, it wasn't about this thing, it was about that thing (and by the way, that's the thing I really want everybody to talk about, because it furthers my agenda)." Elephants in the room, getting down in the weeds: these are all pitfalls. This is an important meeting, so prepare for it like it's expensive, because it is very expensive. Perhaps controversial: as a practitioner doing incident analysis, consider yourself not to be a stakeholder. Your role is not to tell the one true story of what happened. Your role is not to dictate or suggest what to do.


Maintaining a non-stakeholder stance signals to others that you're willing to hear a minority viewpoint. Minority viewpoints are part of getting multiple diverse perspectives, which is the fuel that makes a story compelling, the fuel that makes something interesting. You want these interesting stories, because without them people won't learn; they won't pay any attention. One way of thinking about this: half of your job, a full 50% of your job as an incident analyst, is to get people to genuinely look forward to and participate in the next incident analysis, and having a neutral, certainly a non-stakeholder, stance is a way of getting that done. Perhaps even more controversial: separate generating action items from the group review meeting. Separate them in time. Make the group review meeting about understanding the incident. You're not going to construct a timeline; you're going to explore one.


One that's already been constructed. And you're going to talk about themes that have been synthesized from the data you've already analyzed. This makes the meeting focused, and it doesn't burden the meeting with coming up with action items. I'm sure lots of people have experience with being in a post-incident review meeting when, at some point, someone says, "Well, we've only got ten more minutes; we'd better come up with action items. So let's come up with some." If you separate the generation of action items in time, with what we call soak time, even just a day later (and it can be done asynchronously), the action items will necessarily be better. This is the same dynamic behind what most senior engineers know to be their greatest trick when banging your head against the wall on a really tricky bug:


Stop doing what you're doing and go for a walk, go to the gym, take a shower. It's when you're not programming that the answers will come to you. Same dynamic. So I'm going to put forth a challenge and leave you with this. I dare you all to take me up on it, and come back to tell me what happened. For technology leaders: I dare you to start tracking how often incident write-ups are voluntarily (not mandatorily) read by people outside of the teams closest to the incident. Also start tracking how often incident review meetings are voluntarily attended by people outside of the teams closest to the incident. Do this for one quarter; have somebody on your staff do it. Don't publicize or advertise that this is what you're doing. How often people read and how often people attend is your starting point for making progress on learning.


If people don't read or attend, there's no learning happening. That may be obvious, but I figured I'd spell it out. Practitioners, I have a dare for you too. For every incident that has a red herring episode, a wild goose chase, a rabbit hole: capture the story of that red herring in detail in the incident write-up. Put it in its own section; do whatever you need to do to capture it. Write it down, ask the people who were there what it was, and most importantly, what made following the rabbit hole seem reasonable at the time. Anybody can say, "You shouldn't have fallen for that red herring; next time, don't follow it." But red herrings and rabbit holes and goose chases exist because they almost always work, and we're convinced that they are working; only hindsight tells us that they don't. That's my dare.


As you can imagine, I'm interested in incidents. Dr. Richard Cook, Dr. David Woods, myself, and all of the people in the learning-from-incidents community, which is growing very quickly, are interested in hearing your stories. We want to hear cases. There are academic works being written; there are degrees, doctorates, being achieved on this topic. It's time for decades of research in human factors, resilience engineering, and cognitive systems engineering to reach the technology industry. Things are too important to screw this up. So bring your stories and reach out to us. Thank you for listening. I very much care about this topic; these issues are so important. It's not just about the incidents; it's about how an organization learns and achieves any goals it sets out to achieve. I'm very excited that I was able to speak right after Erica Morrison, who spoke pretty brilliantly about how these things, in a real case, can make a difference for your organization and your customers. If you have any questions for me, or if there's anything I can do to help you venture down this path, let me know. I've learned a lot over the past number of years, and the benefits are so much larger than we think. You will find me in the Slack instance for the conference, the DevOps Enterprise Summit, and you can always reach me at @allspaw on Twitter. Thanks for listening.