Las Vegas 2020

Findings From The Field: Two Years of Studying Incidents Closely

In the past two years, we have had the opportunity to observe and explore the real nitty-gritty of how organizations handle, perceive, value, and treat the incidents that they experience. While the size, type, and character of these companies vary wildly, we have observed some common patterns across them.


This talk will outline these patterns we've discovered and explored, some of which run counter to traditional beliefs in the industry, some of which pose dilemmas for businesses, and others that point to undiscovered competitive advantages that are being "left on the table."


These patterns include:

- A mismatch between leadership’s views on incidents and the lived experience that engineers have with them.

- Learning from incidents is given low organizational priority, and the quality and depth of the results reflect this.

- “Fixing” rather than “learning”: the main focus of post-incident activity is repair.

- Engineers learn from incidents in unconventional ways that are largely hidden from management (and therefore unsupported).


John Allspaw

Principal/Founder, Adaptive Capacity Labs

Transcript

00:00:06

Thank you, Erica. All right. For the next speaker, certainly almost everyone in the DevOps community knows the name John Allspaw. In fact, if there were a starting gun for the DevOps movement, it was the famous talk he gave at the Velocity conference in 2009, where he, as VP of Operations at Flickr, co-presented with Paul Hammond, a Director of Development in the Yahoo mothership, about doing 10 deploys a day, every day. I finally got to meet John Allspaw in person at DevOpsDays in Mountain View in 2010. It's a meeting I'll never forget, and over the years I've learned so much from him. He got his master's degree from Lund University. Back when he was CTO of Etsy, his advisors included Dr. Richard Cook, Dr. Sidney Dekker, and Dr. David Woods, all famous for their contributions to the safety and resilience engineering community.

00:01:01

He now works with Dr. Cook and Dr. Woods as partners at Adaptive Capacity Labs. He told me a couple of months ago about how much he's learned working with them, drawing on his colleagues' expertise to study incidents deeply, just as Erica shared. What dazzles me so much about John's work is that he believes that by understanding how organizations handle incidents, we can gain incredible clues about how those organizations learn. I can't think of a better presentation to follow Erica's amazing presentation than John Allspaw, who has contributed so much to this space. Please welcome John.

00:01:48

Oh, thanks, Gene. Thanks for that introduction. Today I'm going to talk about some of the patterns that we've seen emerge in our work at Adaptive Capacity Labs over about the past three years or so. During this time, we've had the opportunity to observe and explore the real nitty-gritty, the actual messy details, of how organizations learn from incidents, handle incidents, perceive what they mean, and the value they think they might get from their experience. The size and the type of these companies vary quite a bit: everything from 100-person to 30,000-person companies, from established, decades-old B2B SaaS companies to consumer-facing startups. Even so, we've seen some common patterns. I'm going to describe some of these patterns, some of which will probably run counter to typical beliefs in the industry, and some of which might pose dilemmas for businesses.

00:02:56

And finally, others that might point to what I would call undiscovered competitive advantages that are being left on the table. I won't spend too much time on me; as Gene mentioned, my master's is in human factors and system safety, from Lund University. My career has mostly been in software; prior to this I was CTO at a company called Etsy, and I've written a couple of books that you can see here. Before I start, I want to make a couple of things clear. The first is that these are only a few of the most common patterns that we've seen. I have a short period of time to describe them in detail, so this isn't going to be comprehensive. And these are drawn from across organizations that we've come into contact with, both clients and non-clients.

00:03:53

The second is that these reflections, these patterns that I'm going to describe to you, aren't a judgment on any single organization. These are common themes, if you will, that we've seen. So here's the summary slide; if there were any slide you wanted to read, it would be this one. Here's what we've observed across the industry. The first is that the state of maturity in the industry on learning from incidents is quite low, even though we've been doing this for some years now. I think there's a huge gap between where businesses believe they are, maturity-wise, in their ability to learn from incidents and what they're actually doing. I would say that tech leadership in particular fundamentally misunderstands what it means to effectively learn from incidents versus fixing things quickly.

00:04:59

We'll talk more about that. The second is that there's a significant gap between how technology leaders and hands-on practitioners view what it means to learn from incidents, and we find these miscalibrations between leadership and hands-on staff in a couple of different ways that I'll describe. The third is that learning from incidents is typically given low priority. Given what we've just said in numbers one and two, this would make sense: the fix-it-quickly approach can lead to rushed fixes, which might actually, paradoxically, make things more vulnerable to future outages rather than making things safer.

00:05:54

And the last is overconfidence in what the typical, historical, shallow incident metrics mean. A significant amount of energy is wasted in tabulating them; they're frequently gamed, and they don't have much predictive value. As a signal, they're about as strong as lines of code. This pattern right here is perhaps the most prevalent that we've seen, as well as the most concerning: there's a gap between tech leadership and the people who have the everyday lived experience of designing preventative measures and responding to incidents. When an incident happens, there's a gap between what's actually learned about the incident and how that learning takes place in the organization, versus how leadership imagines it takes place and what the incident actually means. Remember, incidents aren't singular, atomic singletons with no relationships to each other. They are informed, both positively and negatively, by past experience.

00:07:17

The reason why I want to make this distinction between technology leaders and hands-on practitioners is to draw attention to something that's so fundamental we often don't acknowledge it. Practitioners are the people charged with the responsibility of designing and evolving technology in the organization, and leaders, sometimes known as the blunt end, are charged with providing the resources and the policies that enable practitioners to live up to that responsibility. Put another way, the people furthest from the day-to-day details need to understand what it means to cope with that complexity, at least enough to support those hands-on folks effectively. As a result of this distance from the preponderance of details and nuances around incidents, leaders tend to oversimplify via summaries, abstractions, or statistics, while the sharp-end practitioners, the people who are on call, the people who are trying to work out what is likely to break next, what is happening now, and what we might do about it, are mired in details that have a great deal of nuance, sometimes changing from second to second.

00:08:45

Okay.

00:08:48

So if we were to look at how hands-on practitioners see an incident from multiple different perspectives, it might look like this: incidents are complex events with multiple parallel and orthogonal interactions between components and people. Making sense of what to do now, versus what we've done in the past and how that has affected where we are now, is a story that fits in people's minds. Leaders typically don't have this detailed understanding, for many different reasons; they're further distanced. The result, like I said before, is a squashing of this complexity into narrow, shallow, quite often numerical terms, which ends up wiping away the details that really matter to the people working on this technical complexity and putting them into nice, neat, tabulated boxes. Now, here's something I'm going to spend some time on: what we've been able to glean as themes when we speak to technology leaders.

00:10:16

So there are a couple of patterns here. The first is that they are typically far away from the messy details of incidents. They might have begun their careers as hands-on practitioners, but that can frequently lead them to overestimate their ability to understand the real details. They frequently believe that their presence and participation in incident response, in the trenches as the situation is unfolding (and by that I mean chat transcripts, chat rooms, and bridges), has a positive influence. I think we're quite confident now that it doesn't. There's certainly a gap between what leaders believe they are doing by responding actively in an incident and how hands-on practitioners understand that influence to be. Leaders typically believe that incidents are adverse events in an otherwise quiet and healthy reality. Of course, that's not the case. Incidents represent very important signals to pay attention to, not defects to dismiss. There is no Vision Zero in modern software.

00:11:37

They typically fear how incidents reflect poorly on their performance more than they fear practitioners not learning effectively from them. Incentive structures can encourage leaders to put more priority on how they look politically than on supporting effective learning by others, the people who are responsible for managing and handling the technical system. Technology leaders also typically believe in abstract incident metrics, and this is for a good number of reasons. For some period of time, these abstract incident metrics, many of which come from manufacturing, have been purported to tell enough of a story for leaders to understand the state of the system. That's not the case. They typically believe that these abstract metrics reflect more about their teams' performance than about the complexity those teams have to cope with. We're going to touch on this in a minute.

00:12:44

And finally, leaders also typically believe that the above observations don't apply to them. When I say abstract incident metrics, I mean the typical, conventional ones, such as mean time to restore, mean time to detect, mean time to know, and all the various statistical measures: frequency of incidents, severity of incidents, customer impact. The issue with these, when it comes to learning from incidents, is that they don't have any predictive value going forward. You can't take the raw data, in this case in this chart, and use it to make a prediction with any amount of certainty about what's going to happen in the future. It also doesn't provide any explanatory value looking backwards; this is the raw data, and aggregating it into a mean or median or mode doesn't give it this ability either. Now, while some people we've spoken with acknowledge that these metrics don't provide much insight into incidents,

00:14:08

the response we get quite often is, "Well, they can help us ask deeper questions, right? They can point us to where we would otherwise ask questions." The response I would give is that you don't need that chart to ask deeper questions about incidents. Incidents tend to be, for the most part and at some point, unambiguous events; how they're tabulated, recorded, and communicated is a different story, but you don't have to use this raw data to ask deeper questions. The questions you might ask about an incident that took a long time may be very different from the ones about an incident that took a very short period of time, based on the content, on what was in there. Just ask the questions; you don't need the chart to do that. And while I'm at it, I would tell you to record both the questions that were asked and the answers, and capture them so that others can find them in the future. So unless technical leaders encourage capturing the difficulties of handling an incident, all they'll have is shallow data. Here's one way my colleague has been able to put it: how can you tell the difference between a difficult case handled well and a straightforward case handled poorly? Both can last an hour. The difference between them, and how that's characterized, is what engineers remember. It's what engineers have greater confidence around, and it makes for a story. Stories will fit in people's lasting memories longer, and learning requires remembering.
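To put toy numbers on that point about aggregates: here is a small illustration, not from the talk, of how a mean such as MTTR can wash out exactly the distinctions that matter. The durations below are invented for the example.

```python
# Two very different months of incidents can produce an identical
# "mean time to restore". All numbers here are made up.
from statistics import mean

# Durations in minutes for each incident in a month.
month_a = [60, 60, 60]   # e.g. three difficult cases handled well
month_b = [5, 5, 170]    # e.g. two trivial blips and one painful outage

print(mean(month_a))  # 60.0
print(mean(month_b))  # 60.0 -- same MTTR, very different stories to learn from
```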

00:15:58

So one way to think about this is in terms of three areas: the consequences or impact of the incident (incident metrics can often signal those), the difficulty in handling the incident, and the performance in handling the incident. These are three qualities of an incident that have significant importance when it comes to learning from them. Without difficulty and performance, you can't understand what incidents mean in context. The summary on this particular topic is that incident metrics don't do what you think they do. I've placed covers of a couple of books here that may be of interest for those who want to read a bit more. So I'm going to move on to hands-on practitioners. What we find, across our experience over the last three years, is that hands-on practitioners typically view post-incident activities as a check-the-box chore. They typically believe in a future world where automation will make incidents disappear. They typically don't capture what made an incident difficult, only what technical solution there was for it. On this particular topic, there are a number of organizations we've spoken with where our dialogue made it quite clear that they are writing incident reports and write-ups with the specific audience of leadership in mind.

00:17:44

Putting the truth in the write-up, which is "we don't actually know what happened here, we don't actually know how this came about, and even after some days of looking at it we still aren't really clear," is less palatable to some tech leaders. And so the write-up serves to provide some comfort for technical leaders, and that is an illusion of comfort. They typically don't write the post-incident write-up for readers beyond their local team. This isn't to say that local teams and the people who were in the incident aren't learning from the incident; it's just not happening in a formal, captured way. It's happening in the informal social exchanges, the ad hoc conversations about "this thing over here was weird," that sort of thing. These aren't what get captured in write-ups of incidents, which therefore truncates what's known about the incident to this local group. They typically don't read post-incident review write-ups from other teams, and who can blame them, because they're not very useful. They typically fear what leadership thinks of incident metrics more than they fear misunderstanding the origins and sources of the incident.

00:19:12

They also typically have to exercise significant restraint to keep from immediately jumping to fixes before understanding an incident beyond the surface level. To some extent, this is very difficult to avoid. It's what engineers do. It's what we do. We like to fix things. We want to do that so much that as soon as we have a minimum viable guess about what happened, which may not be an accurate characterization at all, we will reach for fixes. And they also typically believe the above observations don't apply to them. So this idea that learning is not the same as fixing is an important one. We've written a little bit about this and given some talks on it. So at this point you might think, okay, look, understood:

00:20:18

you've described a somewhat bleak situation in the world, given what you've been able to see. What are some solutions? Well, there's a handful of things I would put to you. For technology leaders: learning from incidents effectively requires skill and expertise that most folks do not have. Think of incident analysis as a set of skills, no different from software development. You wouldn't ask somebody who's never written a line of code in their life to build an application and then be upset when it doesn't go well. You wouldn't ask somebody who's just compiled their first "hello world" to get into the esoteric details of the Linux kernel the next day. These are skills that can be learned and improved. If you prioritize this when things are going well, it will accelerate the expertise in your organization.

00:21:19

The expertise in your organization is coming from inside the house. The question is, can you assist people in helping each other gain this new expertise and skill? There's a fast-growing community around this, Learning From Incidents (learningfromincidents.io). You have to focus less on incident metrics and more on signals that people are learning. What does this look like? It means a collection of signals, such as analytics on how often incident write-ups are being read; when they're not being read, they're usually written just to be filed, which makes them pretty uninteresting to read. Analytics on who's reading these write-ups, and on where the write-ups are being linked from; the open web means that referrers can sometimes be included along with the request. In organizations that do this really well, we've seen references to incident write-ups from code comments, from new-hire onboarding, from code reviews, from product roadmaps. Support making group incident review meetings optional, and then track their attendance.

00:22:48

When attendance is going up, things are going well; organizations that do this really well have engineers who report to us that they go to those meetings because they can learn things there that they can't learn anywhere else in the organization. Track which write-ups link to prior relevant incident write-ups. We are already capturing this, at least in a rule-of-thumb, heuristic sense: quite often we'll hear the frustrations about "repeat incidents" or "similar incidents." Show me the similarity. What is the similarity? Is the similarity seen universally? There's more about this in a blog post. So, some advice for practitioners: don't place all the burden on a group review meeting. Use this meeting to present and discuss analysis that's already been done. Quite often, these meetings put a lot of people who get paid a lot of money in one room.
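As a rough sketch of the "track which write-ups link to prior relevant incident write-ups" idea above, here is one way it might be done, not prescribed in the talk: it assumes write-ups live as Markdown files in a single directory and reference each other with ordinary Markdown links. The directory name, regex, and output format are hypothetical stand-ins for wherever your write-ups actually live.

```python
# Sketch: scan a folder of incident write-ups and report which ones link to
# other write-ups, and which are never referenced by any other write-up.
import re
from pathlib import Path
from collections import defaultdict

WRITEUP_DIR = Path("incident-writeups")   # hypothetical: one .md file per incident
LINK_RE = re.compile(r"\(([^)]+\.md)\)")  # Markdown-style links to other .md files

def build_link_graph(directory):
    """Map each write-up filename to the set of other write-ups it links to."""
    known = {p.name for p in directory.glob("*.md")}
    links = defaultdict(set)
    for path in directory.glob("*.md"):
        for target in LINK_RE.findall(path.read_text(encoding="utf-8")):
            name = Path(target).name
            if name in known and name != path.name:
                links[path.name].add(name)
    return links

if __name__ == "__main__":
    graph = build_link_graph(WRITEUP_DIR)
    referenced = {t for targets in graph.values() for t in targets}
    for path in sorted(WRITEUP_DIR.glob("*.md")):
        print(f"{path.name}: links to {sorted(graph.get(path.name, [])) or 'nothing'}")
    orphans = sorted({p.name for p in WRITEUP_DIR.glob("*.md")} - referenced)
    print("never referenced by another write-up:", orphans)
```

In a real setup the same idea applies to a wiki or ticketing system; the only requirement is being able to enumerate write-ups and the links between them.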

00:23:59

When we've heard people describe their post-incident review process, quite often what they tell us about is the group meeting; the process of incident analysis, in their world, is the meeting. This is a missed opportunity, because there are too many pitfalls to bet everything on a single meeting: the highest-paid person's opinion can drown out minority viewpoints, or an esoteric detail that turns out to be critically important isn't recognized because somebody else is too loud; groupthink, tangents, redirections, elephants in the room, being down in the weeds. These meetings are expensive and need to be prepared for, guided, and facilitated, which is a real skill. That facilitation needs to come from the analysis that's already been done. You ought not to be constructing a timeline in the meeting; you should be exploring a timeline that's already been constructed. Prepare for it like it's expensive, because it is. Incident analysts should not be stakeholders.

00:25:11

And yes, this means that you can't effectively be a manager of a local team and be an incident analyst; at least, you can't do it without a significant number of barriers. Your role is not to tell the one true story of what happened. It's not for you to understand the incident; it's for you to understand how others understood how the incident came to be, what made it difficult, how the mechanism happened and how people understood that mechanism over time, and what it means. Your role isn't to dictate or suggest what to do. You shouldn't be the owner of action items; action items ought to come from the people who are responsible and involved. If you can maintain a non-stakeholder stance, it signals to others that you have no agenda, no horse in this race, other than understanding how they understood the event. It means you don't have a predilection, a leaning toward "I really want to get this legacy thing replaced." If you're the analyst, you have a significant amount of power, and you need to be relating what you have heard from others. One way to think about this is that half your job is to get people to genuinely look forward to, and participate in, the next incident analysis.

00:26:38

A perhaps controversial idea is separating the generation of action items, follow-ups, and tasks from the group review meeting. If you can separate them in time, even by a day, it allows people to let things soak, to develop ideas, to simulate in their minds what might be worthwhile actions to take later. This is a sign of an experienced engineer; I think everybody in the audience understands that when you are banging your head against the wall on a particularly tough problem, one of the most mature things you can do is step away from the keyboard and go for a walk, go to the gym, take a shower, spend time with friends. It's then that the answers to really tricky problems come to you. So here's a challenge; I'm just going to leave this out here for technology leaders and see if anybody can take me up on it: start tracking how often post-incident write-ups are voluntarily read by people outside of the teams closest to the incident.
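One possible, minimal way to take up that challenge, sketched here rather than taken from the talk: assume you can export page-view events for write-up pages as a CSV with user and page columns, and that you can list who was on the team closest to each incident. Every filename, page path, and name below is a made-up placeholder.

```python
# Sketch: count distinct readers of each incident write-up who are not on the
# team closest to that incident, from a hypothetical page-view export.
import csv
from collections import defaultdict

PAGE_VIEWS_CSV = "writeup_page_views.csv"   # hypothetical export, columns: user,page
CLOSEST_TEAM = {                            # write-up page -> members of the closest team
    "incidents/2020-08-14-checkout.md": {"alice", "bob"},
    "incidents/2020-09-02-search-latency.md": {"carol", "dave"},
}

def outside_readers(views_csv, closest_team):
    """For each write-up, collect the distinct readers who are not on the closest team."""
    readers = defaultdict(set)
    with open(views_csv, newline="") as f:
        for row in csv.DictReader(f):
            page, user = row["page"], row["user"]
            if page in closest_team and user not in closest_team[page]:
                readers[page].add(user)
    return readers

if __name__ == "__main__":
    for page, users in sorted(outside_readers(PAGE_VIEWS_CSV, CLOSEST_TEAM).items()):
        print(f"{page}: {len(users)} voluntary readers outside the closest team")
```

The page-view export would come from whatever wiki or analytics tooling you already have; the point is only that, once the data is accessible, counting voluntary readers outside the closest team is a small amount of code.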

00:27:57

Now, this is a bit of a cheeky way of pointing out that default installs of Confluence don't track page views, but you can work it out; tracking reads and accesses of web pages has been a solved problem for a long time. Also start tracking how often incident review meetings are voluntarily attended by people outside of the teams closest to the incident. For practitioners, one very specific suggestion, my challenge: for every incident that has a red herring episode, where you've gone down a rabbit hole only to find out that it wasn't what you were actually handling, capture the details. Capture the red herring part of the story, in detail, in the write-up, especially what made following that rabbit hole seem reasonable at the time. People don't follow red herrings because they're barely believable; they follow them because they believe them.

00:29:06

You want to know what leads them to that. So here's the help that I'm looking for: discuss this talk and these slides with people inside your organization. What matters is how your organization treats and understands some of these topics that I'm bringing forward. Share with us, and with those in the industry, what you find when you have these discussions. Is there something that I've pointed out that you're very confident doesn't take place in your organization? Excellent. I'm @allspaw on Twitter, or allspaw@adaptivecapacitylabs.com; feel free to reach out. We will treat all information as confidential. We want to hear about this. The world needs better cases for us to make progress on what learning from incidents actually looks like. Thank you.