Learning Effectively From Incidents: The Messy Details

Much has been written about "organizational learning" and "learning organizations." The continued and growing attention to these topics in the software world is encouraging and warranted! However, creating conditions for people to genuinely and effectively learn from the incidents they experience is difficult to do, never mind sustain over time. The frequency, severity, and even absence of these events do not represent what is learned, who has learned what, or how learning might be taking place in an organization. This talk is about the "messy" and practical realities of learning effectively from incidents, including a number of paradoxes and ironies that technology leaders face as they work to make progress in their organization.


John Allspaw

Founder and Principal, Adaptive Capacity Labs



Thank you, Shelby and Liz. In my opinion, the person who is leading the charge in redefining the mental models we need in order to design and operate the massively complex sociotechnical systems that we all live in every day is John Allspaw. Certainly almost everyone in the DevOps community is familiar with his decades of work. In fact, if there were a starting gun to the DevOps movement, it was likely the famous talk that he gave at the Velocity Conference in 2009 with Paul Hammond, where they talked about doing 10 deploys a day, every day, as a part of their work at Flickr. I finally got to meet John Allspaw in person at DevOpsDays Mountain View in 2010; it's a meeting that I'll never forget, and over the years I've learned so much from him. He got his master's degree from Lund University. Back when he was CTO at Etsy, his advisors included the famous Dr. Richard Cook, Dr. Sidney Dekker, and Dr. David Woods, famous for their contributions to the safety and resilience engineering community. We all talk about our desire to create learning organizations, and that is why I find John's work so intriguing: he believes that by observing how organizations respond to incidents, we can gain incredible clues on whether, and to what degree, organizations are actually learning. Here's John.


Thanks. So my goal today is to describe one of the most effective ways we can learn from incidents. It might not be intuitive, but at the same time, it might be very intuitive, just unclear how to do. Before we jump into it, I want to start out by giving you the summary; I want to give you the conclusion of this talk before I start. So here's a too-long-didn't-watch slide. The main gist of what I want to get across: first, learning is never not happening. It is what humans do. It's an integral human activity. Second is that it requires remembering; learning and remembering are inextricably linked. That means that learning from incidents effectively means discovering and highlighting aspects and qualities of the story of an incident that make it more likely to be remembered. So what are those aspects?


Well, there are elements of surprise and difficulty, misunderstandings, dilemmas, paradoxes. This is what makes for good stories. This is what makes for stories that can be remembered. If you can remember, it's likely that learning is going on. So something that I want to mention is that I'm using the phrase "messy details" in the title of this talk very deliberately. You may have heard me reference this phrase before. It's a reference to this paper, and it captures quite eloquently, in my opinion, that when it comes to work in complex domains, the details of what people do and how they do it are what matter almost more than anything else. These details are easy to miss, and they're not often looked at closely. So that's what I mean when I say the messy details. If you haven't taken a look at this paper, I can't recommend it highly enough.


Just because the topic surrounds the domain of healthcare doesn't mean that it's not applicable to other domains. So I want to first start with a story from my time at Etsy, a company here in Brooklyn, New York, where I am right now, and where I worked for a number of years. The story of this incident is that an engineer on the job for no more than a couple of weeks made a change that brought all of etsy.com down, and the site wasn't just slow or degraded. It was down hard. This is sort of what a typical write-up about incidents looks like: recently hired engineer made a change to production, blah, blah, blah. It took about an hour and 10 minutes for them to figure out what was going on and what to do. So nothing catastrophic, but not nothing, either. So I'm going to park that for a second. We'll come back to this story.


Let's talk about learning in general and learning from incidents. As I mentioned before, people are always learning; in fact, it's difficult to prevent people from learning. The question isn't whether they're learning or not learning. The question is what they are learning, and how useful or how productive what they're learning is going to be in helping them do their work in the future. The challenge isn't getting people to learn. It's about creating conditions where a couple of things can happen. We want to create conditions where people at every level of the organization have opportunities to discover new things they didn't know, or to revisit things that they thought they already knew but were either wrong or dated in their knowledge, that sort of thing. It's also about creating conditions where experts are supported in describing and teaching others, telling stories about what they know and how they know it.


This is actually a lot more difficult than just getting something on the calendar and asking somebody, "Tell me what you know." What we know from studies of expertise is that experts are not necessarily expert at describing what makes them an expert. But rich stories are valuable. You want to create conditions where they're viewed as assets, just like any other asset valuable to the success of a business. Something I've mentioned before in some other talks is that learning is not the same as fixing. Often, especially in the industry at the moment, they seem to be confused or swapped for one another. One way of saying what's important about learning: if you can't remember something, you can't say you've learned it.


So analyzing incidents therefore means finding what made the incident surprising or difficult. These are what make for memorable stories. What if incident analysis is less about solving the problem that the incident responders responded to, and more about understanding how the incident responders understood and experienced handling and working through the incident? So we work with a lot of different organizations, and one of the first questions that we always ask when we first talk to them is: do you have any stories about incidents? And that's it, that's the prompt. That's the prompt that we give them. We don't give them anything else. Do any incidents come to mind? Can you tell us a story? A couple of things show up for us that are always true. First is they're really enthusiastic. They always respond, "Oh yeah, oh God, let me tell you about this one." They use their hands.


They sometimes almost flip into a different mood when they tell the story. They tell the story in suspenseful ways, right? They know what's going to happen; that's how they have the story and we don't. And whether they know it or not, they lay out the telling of the story using what scholars in narrative composition call suspense structures. Even if they don't know what a suspense structure is, they include what was surprising. They include what was weird, strange, difficult. They give us the backdrop: "Oh, so you've got to remember, this was the day... right about this point was when our CEO took the stage," because it was an in-person conference, or something along those lines. They're giving context for when in time, and sometimes where in space, the story took place. And they can tell it in detail. And this is one of the most gratifying and fascinating things for us.


Even if it's been years since it happened, they can come up with all sorts of esoteric details. They could even write on the whiteboard what this little piece of code looked like. So after they're done telling the stories, we always ask them, "Hey, this is an amazing story. Is there a place where I could read about this? Or when somebody joins the company, could they go read about it?" And they always respond with some form of, "Oh yeah, well, we've got a post-mortem document; hold on, let me see if I can get it, and we'll take a look at it." And here's what's interesting to us: the story that they tell is always different than the official write-up. We wonder why that is. Does it have to be that way?


So I'm going to give you an example of how the telling of a story can or cannot reflect the richness of an event or series of events. Here it goes, one sentence: a high school senior in Illinois led their classmates on an 11-hour crime spree, committing fraud, grand theft auto, and cybercrimes. That's the story, right? It's unclear whether you picked up that what I've described is Ferris Bueller's Day Off. And to be fair, it's not wrong. All of the facts in this sentence, all of the statements, are true. It's just incomplete. And not only that, it may take some liberties in some of the descriptions. It can be true and also be pretty anemic as far as stories are concerned. If I give you that sentence and you go see the movie, you'll see that they're quite different.


This is that richness and those messy details that I want to get across, such as Abe Froman, for those who've seen the movie. So let's revisit this incident that I was telling you about before. September 2012, an afternoon. This is a tweet from the Etsy status account saying that there's an issue on the site. To give you a little bit of flavor of what was going on in chat: people said, "Oh, the site's down." People start noticing that the site is down. A couple of observations: there were out-of-memory errors all over the place. More observations: hmm, signals that there's something about memory going on.


It seems like some templates were rebuilt in the last deploy. Interestingly, there was a deploy, but it was actually spaced in time. Usually, at least back then, if a deploy had an issue, some sort of bug or anomalous behavior, it would not take very long to show up: as soon as the code was out there, you'd see it; it wasn't five minutes. And here, for roughly about five minutes, things seemed fine. So that was of interest. Anyway, whatever was in the deploy still wasn't clear. People said, "Oh, well, maybe there was some sort of template-related thing." And people were saying, "Well, it looks like we need to actually do a restart." And somebody said it's really hard to even connect to some of the web servers. Meanwhile, people making the changes, trying to work out what was going on, said, "Oh, well, we can deploy this."


"We can deploy that." And people said, "Well, actually, it's going to be hard to even deploy, because we can't even get to the servers." And people said, "We can barely get them to respond to a ping. We're going to have to get people on the console, the integrated lights-out, for hard reboots." And people even said, "Well, because we're talking about hundreds of web servers, could anything be faster? We're going to have to power cycle these." This was a big deal here. So whatever it was in the deploy that caused the issue, it made hundreds of web servers completely hung, completely unavailable.


People said that deploying, even if we knew what was going on, is going to be pretty hard to do until we can power cycle everything. Somebody pointed out, "Well, we're going to have to actually disable the load balancer, or disable traffic coming in, because we don't want the servers to come back up after we power cycle them; they're still going to have the code, whatever it is that's going on, and it's only going to happen again." So: block all the traffic, reboot all the boxes, deploy the change, whatever it is; we don't even know what that is yet. But we have hundreds of web servers. So people were fanning out: "Oh, you get this number, and you get this number. I'll get web one through 10, you get 11 through 21," and so on and so on. They were in a reboot fest. At some point they got to a spot where they had walked through all of those steps. A lot of people ran a lot of commands in a very short period of time to get these boxes up and running.
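The fan-out the responders described, splitting a big fleet into fixed-size chunks so each person can power cycle a batch in parallel, can be sketched like this. The host names and batch size here are illustrative, not Etsy's actual tooling:

```python
def assign_reboot_batches(hosts, batch_size=10):
    """Split a host list into fixed-size batches so each responder
    can take a chunk and power cycle it in parallel."""
    return [hosts[i:i + batch_size] for i in range(0, len(hosts), batch_size)]

# Illustrative fleet of 200 web servers named web001..web200.
fleet = ["web%03d" % n for n in range(1, 201)]
batches = assign_reboot_batches(fleet)

print(len(batches))                   # 20 batches of 10 hosts
print(batches[0][0], batches[0][-1])  # web001 web010
print(batches[1][0], batches[1][-1])  # web011 web020
```

The point of the batching is exactly what the chat log shows: turning one impossibly large serial job into many small parallel ones that individual responders can claim.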


They finally got it up. What's interesting about this? Well, let's go back to one of the changes. It would seem that there was something about templates. What they had worked out afterwards: there was a ticket. One of the tickets was for this newly hired engineer, who was in a bootcamp at Etsy. You would start in your first week, you'd spend a week with this team, then you'd spend a week with another team, and a week with another team. We called it bootcamp; lots of organizations do this. And then you'd finally land at the team that you're going to be part of more permanently. It's like getting a bit of a tour. And one of the tasks was with the performance team, and the issue was old browsers. You always have these workarounds, because the internet didn't fulfill the promise of standards.


So: let's get rid of the support for IE version seven and older. Let's get rid of all the random stuff. And so now, let's see. If you don't know, Etsy was written in PHP, and might still be. We used a templating engine called Smarty to help compose pages. And in this case we had this base template, used by, as far as we knew, everything, and this little header dash-E dot CSS file; this was where the extra workarounds lived. And so the idea was: let's remove all the references to this CSS file in this base template, and we'll remove the CSS file. And this had been tested and reviewed by multiple people. It's not all that big a deal of a change, which is why it was a task slated for the next person who came through bootcamp in the performance team.


So they made this change. And like I said, some time passed before what would happen, happened. What they figured out later: a request would come in for something that wasn't there, 404s happen all the time, and the server would say, "Well, I don't have that, so I'm going to give you a 404 page. And so then I've got to go construct this 404 page. Huh, but it includes this reference to the CSS file, which isn't there, which means I have to serve a 404 page." You might see where I'm going: the 404 page fires a 404, which fires a 404 page, which fires a 404. Pretty soon, all of the 404s are keeping all of the Apache processes across hundreds of servers hung. Nothing could be done. The team looked at how many servers they had, and when they split up, it became clear that they had to power cycle. They'd take 10 at a time so that lots of folks could reboot them quicker, in parallel.
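That failure mode can be sketched minimally: a 404 handler whose error page still references an asset that no longer exists, so rendering each error page triggers another miss. The asset names and the recursion cap are illustrative; the real servers had no such cap, which is why the Apache processes simply hung:

```python
MISSING_ASSET = "header-e.css"           # the deleted stylesheet (illustrative name)
EXISTING_ASSETS = {"logo.png", "main.css"}

def handle_request(path, depth=0, max_depth=50):
    """Serve a path; return how many nested 404 renders it caused.
    The 404 page's template still references MISSING_ASSET, so every
    404 render requests it again and triggers another 404."""
    if path in EXISTING_ASSETS:
        return depth  # asset found, no error page needed
    if depth >= max_depth:
        return depth  # cap added for this sketch; production had none
    # Rendering the 404 page requests the missing CSS file again.
    return handle_request(MISSING_ASSET, depth + 1, max_depth)

print(handle_request("logo.png"))      # 0: asset exists, no 404 chain
print(handle_request("header-e.css"))  # 50: recursion runs until the cap
```

Note that any missing path, not just the CSS file itself, starts the chain, which is why ordinary everyday 404 traffic was enough to hang every worker process.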


So that's a little bit more of the story than what I first gave you, right? I just want to be clear on something. With this story, I'm hoping that many people who worked at that site at the time see this talk: one CSS change could not only break etsy.com, but break it so spectacularly that its entire fleet of web servers required hard power cycling. I'm going to go out on a limb: it's one that I don't believe anyone who was there will ever forget. It is very memorable. A side note on this particular case: this is the case that led us to build an award that we gave every year, called the Three-Armed Sweater Award. I'll leave that for a different talk; there are other talks about it. So what I'm trying to get across here in this story is that we need to make the effort to highlight these messy details. What was difficult for people to understand? What was surprising for people about the incident?


How did people understand the origins of the incident? When the people in the CSS case first went looking, they dismissed the change that had just been made as being relevant, because some time had passed. And that was a very reasonable thing to do. What mysteries still remained for people? There are some details of the story that, and I was there, I'm still not clear on. But the goal of effective incident analysis is to capture the richest understanding of the event, represented for the broadest audience possible. This means multiple tradeoffs at different levels. You don't want to capture in written form something that's so technically detailed that you've lost a whole bunch of readers. You also don't want it to be so vague and hand-wavy as to basically tell you nothing; like that first slide that I showed you of the CSS case, it didn't really say much. Just one quick note on what I would say is the toughest of the many barriers, the many challenges, to getting this done well. First is hindsight and the hindsight bias, or as Baruch Fischhoff called it, the "knew-it-all-along" effect. This is a tendency to simplify the complex, messy details of the event down to the one true story.


As a result, this tendency can basically produce a story where all of these details, these multiple perspectives, all get sort of wiped away in favor of a story that makes sense to me, the person looking back. We want it to be efficient and crisp, but that's lossy. It means smoothing out this messiness and boiling it down to how long the incident took. An hour and 10 minutes? Is an hour and 10 minutes the most interesting part of that story? As for what you want to do in capturing this, you'll note that I haven't told you how to do it, because that's much more than a talk can cover. You want to support the reader, regardless of how you do it. You want to write incident descriptions to be read, and not just to be filed. You want to describe the data that you relied on in your analysis.


Was it just Joe who responded to the incident and, I don't know, took 10 or 15 minutes filling out a template? I don't know; maybe there's more than just Joe's view on it. You want to make it easy for readers to understand terms or acronyms that they've not seen before. And you could use, and this is a proprietary knowledge trick here, hypertext linking technology. Look it up; it's amazing. You want to have connections. Incidents are not these extra side distractions; they are a part of the work you are doing. Remember, you're preventing them all the time. You want to use diagrams or other graphics to describe complex phenomena. Don't be afraid of using pictures. Make it easy for others to link to the write-up document. So how can you know if you're making progress? Well, I described some of these before.


Here are some signals that can tell you that you're making progress in the right direction. More people will actually read post-incident write-ups, because you're tracking that, right? More people will voluntarily attend post-incident group review meetings, and they'll participate. They'll talk about their view, their perspective: what happened for them, what was surprising for them. More people will link to these write-ups from code comments and commit messages, architecture diagrams, other related incident write-ups, new hire onboarding materials. I can say now, after working with a number of organizations for a couple of years, this happens. There are companies where this happens voluntarily: I know of one organization where 80 engineers voluntarily showed up to a group review meeting, and a huge majority of them added to, calibrated, and helped modify their collective understanding of the incident. Months after an incident write-up has been written, still people are commenting on it, still people are linking to it, still people are reading it. Tens of people a day are reading it and sharing it with their colleagues.


I mean, this is difficult, and organizations that we know of are doing it. I will say this: your competitors are hoping that you won't pay attention to any of it. These markers are progress. I just want to point out here, I literally asked you, and challenged you, to pay attention to these things last year in the talk that I gave at the DevOps Enterprise Summit. So my snarky response now is: how's that going? So here's the help that I would like, in the conference Slack channel. One, I want people to offer up their stories. I want people to challenge me on things that I've said in this talk. I want people to keep the conversation about these messy details alive and moving and evolving forward. This is how we become better at learning from incidents. Thanks very much. My name is John Allspaw, and I appreciate your attention.