Incident Analysis – Your Organization's Secret Weapon

Nora is the co-founder and CEO of Jeli. She is a dedicated and driven technology leader and software engineer with a passion for the intersection of how people and software work in practice in distributed systems. In November 2017 she keynoted at AWS re:Invent before an audience of ~40,000 people, sharing her experiences helping organizations large and small reach crucial availability and helping kick off the Chaos Engineering movement we see today. She created and founded the www.learningfromincidents.io movement to develop and open-source cross-organization learnings and analyses of reliability incidents across various organizations, and the business impacts of doing so.

Nora Jones

Founder & CEO, Jeli

Transcript

00:00:13

I hope you're having an amazing and exothermic day at this conference, and we have two great talks coming up next. I've been following the work of the now many cohorts of graduate students who chose to enroll in the Human Factors and System Safety program at Lund University. Among the first of these was John Allspaw, famous for his 2009 talk about doing ten-plus deploys a day at Flickr. But I've been particularly interested in the work of Nora Jones, because her work is informed by so much of her firsthand experience at some of the most famous properties in the world, such as being head of Chaos Engineering at Slack and being involved in so many aspects of chaos engineering at both Netflix and at Jet.com, which was acquired by Walmart. I love her stories because they help bridge the worlds of theory and practice, underscoring why it's so important to deeply learn from incidents. In other words, how organizations prepare for and learn from incidents is so critical; incidents have famously been called an investment that was made on your behalf, but without your consent. I believe that doing this well is one of the key hallmarks of dynamic learning organizations. Nora is currently founder and CEO of Jeli.io. Okay, here's Nora.

00:01:36

Well, everyone, we've all had incidents. They're unexpected, they're stressful. And sometimes in management, there are inevitable questions that creep up: What can we do to prevent this from ever happening again? What caused this? Why did this take so long to fix? The organizations I've worked in, and the research that my team and I have done in this space, have shown the following responses to the question of why we do postmortems: I'm honestly not sure. Management wants us to. It gives the engineers space to vent. I think people would be mad if we didn't. We have obligations to customers. We have tracking purposes. We want to see if we're getting better. We want to have the answers to the board's questions. I think we all know that some form of post-incident review is important, but we don't all agree on why it's important.

00:02:29

You know, we want to make efforts to improve. We want to show that we're improving, but we're spinning our wheels in a lot of ways, because we're not actually making efforts to improve the post-incident reviews themselves. We're making efforts to try to stop incidents, but without making efforts to improve the incident reviews, we're not going to improve incidents on any level. The good news is incident analysis can be trained and aided, but it has to be trained and aided to be improved upon. Earlier at this very conference, John Allspaw talked to us about how the metrics we are tracking today, like MTTR and MTTD and the number of incidents, are actually shallow metrics. I get why we're tracking those things. It's an emotional release; it's something that can make us feel better. But he posed an open question and challenge to the audience.

00:03:21

He said, where are the people in this tracking? And where are you? We haven't changed much as an industry in this regard. Gathering useful data about incidents does not come for free. You need time and space to determine it. And today I'm going to talk to you about why giving this time and space to your engineers and to your organizations to improve post-incident reviews can actually work in your favor. It can give you the ROI you're looking for and level up your entire organization. And I'm going to tell you about this through multiple stories that I've experienced myself, new paths on how you can do this in ways that are not disruptive to your business, and next steps for you to embark on. Spoiler alert: sometimes a thorough analysis or incident review actually reveals things that we're not ready to see or change yet.

00:04:11

So as leaders, we have to be open to hearing some of these things. I'm Nora Jones. I've seen this on the front lines as a software engineer, as a manager, and now as I'm running my own organization. In 2017, I keynoted at AWS re:Invent to an audience of around 50,000 people about the benefits of chaos engineering (purposely injecting failure in production) and my experiences implementing it at Jet.com, which is now Walmart, and at Netflix. And most recently I started my own company, based on a need I saw: the importance and value add to the whole business of a good post-incident review. However, I saw how high the barrier to entry was to getting folks to work on that. I started an online community called Learning From Incidents in Software. This community is full of over 300 people in the software industry, and we're sharing our experiences with incidents, we're sharing our experiences with incident review, and we have folks from all over the industry. That led me to starting my own organization, Jeli, to help companies get more ROI from post-incident reviews. At Jeli, we like this equation, from a book called Seeing What Others Don't by Gary Klein. Gary Klein is a cognitive psychologist who studies experts and expertise in organizations.

00:05:25

The metric he came up with is this: performance improvement is the combination of error reduction plus insight generation. You can't have one without the other. We focus as an industry way too much on the error reduction piece and not on the insight generation piece, but we're not actually going to improve the performance of our organizations if we're only focusing on the error reduction piece. And I get it, that is an easy thing to measure. As software engineers, we're taught to look for technical errors. We're taught to look for some of these things. We're not so much taught to generate insights, not so much taught to disseminate insights, and we don't get celebrated for it. That's something that we can do as leaders: we can actually celebrate the insight generation and dissemination and training materials produced by folks in our organization. Today, I'm going to tell you three different stories about the value incident analysis brought about in different organizations.
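Stated as a simple relationship (a paraphrase of Klein's formulation rather than his exact notation):

\[
\text{Performance Improvement} = \text{Error Reduction} + \text{Insight Generation}
\]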

00:06:16

These are based on true events I have witnessed or been a part of, but names and details have been changed. When I was at Netflix, I was on a team with three other amazing software engineers. We'd spent years building a platform to safely inject failure in production, to help engineers understand and ask more questions about areas in their system that behaved unexpectedly when presented with the turbulent conditions we see in everyday engineering, like injected failure and latency. It was amazing, and we were happy to be working on such an interesting problem that could ultimately help the business understand its weak spots. But there was actually a problem with the way that we implemented the tooling and the way it was being used. And when I took a good look at it, I realized that most of the time the four of us were actually the ones using the tooling.

00:07:02

We were using the tooling to create chaos experiments, to run chaos experiments, to analyze the results. Which meant, what were the teams doing? Well, they were receiving our results, and sometimes they were fixing them, and sometimes they weren't. I'm sure all of us have been part of an incident where the action items don't get completed. It was a similar situation, and it wasn't the teams' fault, right? So why is this a problem? Well, it was a problem because we were the ones doing most of the experimentation and generating the results, and we were refining our mental models, but we weren't the ones on the teams the chaos experiments were being run for. We weren't on the search team. We weren't on the bookmarks team. But we were running experiments for them. We weren't the ones whose mental models needed refining or understanding, but we were the ones getting that refinement and understanding, which actually didn't provide much benefit to the organization.

00:07:52

You know, we were leading this horse to water, but we were also pretending to be the horse. We were also drinking the water. And sometimes teams would use the tooling, but that would actually last only for a couple of weeks, and then we'd have to remind them to use it again. So we approached this problem like any good software engineer would approach it and started trying to automate away the steps that people weren't taking, to give them easier access to the harder parts of the tooling. But that part isn't what this talk is about. It's about one of the other things we did. We wanted to give teams more context on how important a particular vulnerability that we found with the chaos tooling was, or wasn't, to fix. So to know if something was or wasn't important to fix, I started looking at previous incidents. I started digging through some of them to try to find patterns: patterns of systems that were underwater, or incidents that involved a ton of people, or incidents that cost a lot of money, so that we could help prioritize the results we were finding with these chaos experiments.

00:08:48

I wanted to use this information to feed back into the chaos tooling, to help improve the usage of the tooling. But I found something that was much greater: incident analysis had a much greater power in the organization than just helping us create chaos experiments and prioritize the results better, and spending time on it opened my eyes up to so many more things that could help the business, far beyond the technical. And so here's the secret I've found: incident analysis is not actually about the incident. It's this opportunity we have to see the delta between how we think our organization works and how it actually works, given that most of the time we're not good at exposing that delta. It's a catalyst to understanding how your org is structured in theory versus how it's structured in practice. It's a catalyst to understanding where you actually need to improve the socio part of your socio-technical system: how you're organizing teams, how people in different time zones are working together, how many people you need on each team, how folks are dealing with their OKRs given all the technical debt they're working through as well. An incident is a catalyst to showing you what your organization is good at and what actually needs improvement.

00:10:01

This reminds me of a separate story. I was at an organization where an incident had occurred at 3:00 AM, because that's when all the bad incidents occur, right? I came into the office the next day, and I was tasked to lead the investigation of this highly visible incident after the fact. This was something that made the news. But a senior engineering leader pulled me aside in the office the next morning and said something along the lines of, you know, I don't know if this incident is actually all that interesting for you to analyze. I feel like maybe we should just move on. I asked why. And they said, well, you know, I'm not supposed to say this, but it was human error. Kieran didn't know what he was doing. He wasn't prepared to own the system. He didn't need to respond to that alert at three in the morning. It could have waited until he was in the office and he could have gotten help with it.

00:10:51

I was shocked. This was an organization that thought they were practicing blamelessness. We've all heard about blameless postmortems, but yet we all use the term a little bit incorrectly. They thought they were practicing it without a deep understanding of it. And when something like this happens, when a Kieran makes an error, it's usually met with instituting a new rule or process within the organization. Without anyone publicly saying that they thought it was Kieran's fault, everyone, including Kieran, knows that folks think that. That's still blameful, right? It's not only unproductive, it is actually hurting your organization's ability to generate those new insights from the equation we looked at earlier and to build expertise after incidents. And so you're actually harming your organization's ability to improve its performance. I get it. It's easier to add new rules and procedures. It's easier to add in gates.

00:11:40

It's easier to update a runbook and just move on. It allows us to emotionally move on, and we need that as humans; we need to feel like we're done with the thing. But these implementations of new rules and procedures usually don't come from the folks on the front line, either. And that's because it's much easier to spot errors in hindsight, especially from a management perspective; it's much more difficult as leaders to encourage insights. Unfortunately, adding in these new rules and procedures actually diminishes the ability to glean new insights from these incidents. You're not giving people the space and time they need to glean these new insights, because what Kieran did, someone else is going to do in the future, even if you put those guardrails up. So despite all that, I still decided I wanted to talk to Kieran, and I wanted to figure out what happened.

00:12:28

So according to the organization, Kieran had received an alert at 3:00 AM that, had he spent more time studying the system he was on call for, he would have known could have waited until business hours to fix. I came into the conversation with Kieran completely blank, and I asked him to tell me about what happened. Well, he said, I was debugging a Chef issue that started at 10:00 PM, and we finally got it stabilized. I went to bed at around 1:30 AM. At 3:00 AM, I received an alert about a Kafka broker being borked. Interesting finding number one: Kieran was already awake, tired, and on call from debugging a completely separate issue. That's interesting to me. I wonder why we have people on call like that for two systems in the middle of the night, and we're not keeping an eye on them. I asked him what made him investigate the Kafka broker issue.

00:13:21

He said, well, I had just gotten paged; my team had just been transferred the on-call rotation for this Kafka broker about a month ago. I asked if he had been alerted for it before. He said no, but I knew this broker had some tricky nuances. That led me to interesting finding number two: Kieran's team had not previously owned this Kafka broker. And I wondered, at this organization, why did they get transferred the on-call for this Kafka broker? How do on-call transfers of expertise work? Who originally held the expertise for this Kafka broker, if not this team? I then asked him how long he had been at this organization. He said five months. Interesting finding number three: Kieran was pretty new to the organization, and we had him on call for something like this, for two separate systems, in the middle of the night. I don't really feel like this is Kieran's fault so much anymore.

00:14:09

And I'm starting to think that this really wasn't human error. If I was in Kieran's shoes, I would have absolutely answered this alert at three in the morning: I'm new to the organization, it's a new rotation that my team is on call for, and I know this broker has tricky nuances. It makes sense. But if we hadn't surfaced all these things, and we hadn't had the opportunity to have a good incident review with Kieran, we wouldn't have surfaced this. We would have kept repeating those hacky on-call transfers. We would have kept putting new employees on call when they maybe weren't ready yet, or when we hadn't trained them yet. And so by digging into this a little further, we were able to surface these things. But if we had just implemented a new rule or procedure, this kind of stuff would just get repeated again, maybe not with this Kafka broker, but with another on-call system in this org.

00:14:58

So let's go back to this point. Reviews are important, but they're not good. And what's worse is that when an incident or event is deemed to have a higher severity, we actually end up giving our engineers even less time to figure out what happened. Sometimes that's due to SLAs that we have with customers, but it's important that we give the time and space, after that customer SLA is met, to come up with actually good action items, to come up with the how of how things got the way they are. Give your engineers space to work through them, especially if it was an emotionally charged incident. When you do an incident analysis, whether of incident Slack channels or Zoom transcripts or by chatting with people, you can talk to people one-on-one like I did with Kieran. We call this an interview, or a casual chat.

00:15:45

And these individual interviews, prior to the bigger incident review, can determine what someone's understanding of the event was, what stood out for them as important, what stood out for them as confusing or ambiguous or unclear, and what they believe they knew about the event and how things work that they believe others don't. Especially with emotionally charged incidents, we should set up some one-on-one individual chats like this. If I had asked Kieran the questions from that chat in the incident review meeting itself, it probably wouldn't have revealed all the things that he revealed to me in that one-on-one chat. Now, there are certain ways we can ask questions, and we call these cognitive questioning or cognitive interviews. Knowledge and perspective gleaned in these early interviews, and the way we ask these questions, can point to new topics to continue exploring.

00:16:36

They can point to relevant ongoing projects. They can point to past incidents. They can point to past experiences that are important for the organization, important historical context to know to help level everyone else up. There are a bunch of sources of data that we can use to inform this incident review, and we can iteratively inform and contrast the results of cognitive interviews with these other sources of data, like pull requests and how they're being reviewed, or how the Slack transcripts read, or docs and architecture diagrams, or even the JIRA tickets where the project got created. Now, my last story is one that we might all be familiar with a little bit as a software industry. I was in an organization where promotion packets were due. Now, promotion packets in this organization consisted of an engineering manager putting together a little packet for someone on their team that they thought deserved to be promoted.

00:17:28

As this organization grew larger and larger, it became harder to read all the packets. And so they became very number driven: did this person complete the things that they said they were going to complete at the beginning of the quarter? That's mostly what promotions were driven off of, whether they had completed those things. And so people were losing promotions when they hadn't completed the things they committed to at the beginning of the quarter. But I know we've all been at organizations where we've committed to something at the beginning of the quarter, we get midway through the quarter, and we realize that it's not the most important thing anymore. But yet, this is what we were judging people on. So what do y'all think happened?

00:18:09

Well, people would commit to things at the beginning of the quarter, realize they weren't relevant anymore, but know that that's what they were getting judged on for their promotions. And so they'd rush to complete those things just before promotion packets were due. Now, we saw certain upticks in incidents in this organization during the year. And as I was analyzing the incidents for this organization, I was analyzing individual incidents, but I was also analyzing historic themes and whether we could correlate them with certain events, traffic spikes, big uses of the application. And I saw spikes in incidents around the time promotion packets were due, and just a few weeks after, because we would see an uptick in things getting merged to production, maybe things that weren't ready. And I would sit in some of these incident reviews, and engineers would say, yeah, I wasn't going to get promoted unless I pushed this in.
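As a rough illustration of that kind of thematic analysis, here is a minimal sketch in Python that checks how many incidents started shortly after promotion-packet deadlines. The data, field names, and three-week window are all hypothetical; in practice the dates would come from your incident tracker and your internal calendar.

from datetime import date

# Hypothetical example data: when incidents started and when promotion packets were due.
incident_starts = [date(2021, 3, 8), date(2021, 3, 15), date(2021, 7, 2), date(2021, 9, 13)]
promo_deadlines = [date(2021, 3, 1), date(2021, 9, 1)]

def incidents_within_weeks_after(incidents, events, weeks=3):
    """Count incidents that began within `weeks` after any of the given events."""
    window = weeks * 7
    return sum(
        1 for start in incidents
        if any(0 <= (start - event).days <= window for event in events)
    )

near = incidents_within_weeks_after(incident_starts, promo_deadlines)
print(f"{near} of {len(incident_starts)} incidents began within 3 weeks of a promo deadline")

The same shape works for any organizational event worth testing against: earnings calls, big demos, or quarter boundaries.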

00:19:01

And so this engineering organization thought they were incentivizing the right things, but they actually ended up creating poor incentive structures. This was the organization they were creating. But without actually looking into incidents, and without actually doing incident analysis, they weren't able to figure out that this is what was happening. And this is why this kind of stuff is important: it can help you structure your organization better. A good incident analysis should tell you where to look. And I mentioned this before: we're not trained as software engineers to analyze incidents. We're trained in different pieces of software and distributed systems, and we can figure out technically what happened, but we're not really trained to figure out socially what happened. And it can be kind of awkward sometimes, right? Figuring out what questions to ask, figuring out which people to talk to.

00:19:52

But as leaders, we can help not make it awkward, and we can help make it psychologically safer. Now, I mentioned a good incident analysis should tell you where to look. This is really hard with some of the tools on the market today. And I want to show you a quick screenshot of Jeli, which is the tool my company is working on. We're not GA yet, but I wanted to give you a little teaser today, just so you can see how incident analysis can show you where to look. So where would one look here? Well, you can see a heat map of all the chatter on the team. You can see a heat map of when the Slack conversations were going off, or when PagerDuty alerts were alert storming, or when certain pull requests were going through. You might be interested in the absence of chatter early Saturday morning, where it looks like management was the only one online.

00:20:35

Maybe that's a sign of actually good management taking one for the team there. You might be interested in the fact that customer service seemed to be the only ones online late Friday night; I wonder if they were getting supported. You might be interested in the tenure of folks on the team: are we relying solely on folks that have been here for a while? What about folks that are fully vested? Are we relying on them a little bit too much? What happens when they leave? You might be interested in whether we relied on folks that weren't actually on call; that can tell us if we need to unlock tribal knowledge, if we have knowledge islands in the organization. You might be interested in whether people were on call for the first time ever, and how we're supporting them. A good incident analysis should tell you where to look, but it can also help you with a number of things.

00:21:25

It can help you with headcount, right? If you're always relying on people from a certain team, or on people that weren't on call, that can help you understand whether you actually need to spin up a team there, or whether you need to spin up training there. It can help you with planning promotion cycles, as we talked about earlier, with quarterly planning, and with unlocking that tribal knowledge, figuring out what people know. I was in an organization once where every time a certain guy came into the incident channel, everyone would react with the Batman emoji in Slack. And he was amazing, but it was actually a poor thing for this organization, because we relied on him a little bit too much. Those engineers are expensive, and they usually leave organizations quickly because they burn out, and they take all that knowledge with them. This can help you see how you're actually supporting that.

00:22:13

You can see how much coordination efforts are costing you during incidents. As an industry, we pay a lot of attention to the customer cost of incidents and the repercussions of incidents. We don't pay a lot of attention to our coordination costs: if we're working with a team we've never worked with before, if we're working with people we've never worked with before in the midst of an incident. And it can help you understand your bottlenecks, not just in your technical system, but in your people system. Now, you're probably thinking, I can't do one-on-one interviews for every incident, I don't have time for this, right? And I want to go back to my earlier point. A lot of the reason that you don't have time is because the incident reviews today are not that great, and it feels like, why should we spend more time on something that's not that great?

00:22:55

But you can make it better. Now, there are some starting points, like deciding which kinds of incidents should be given more time and space to analyze. It doesn't have to be every incident, and it doesn't have to be only the incidents that caused customer impact or, you know, hit Twitter big time. There are certain signals that you can use to see which incidents should be given more time and space. Like if there were more than two teams involved, especially if they had never worked together before. Or if it involved an engineering team and a non-engineering team, like customer service or PR or marketing, working together; that's a good indication that more time and space should be given. Or if it involved a misuse of something that seemed trivial, like expired certs. In every single organization I've been in, someone from leadership has asked, why are we having all these expired cert incidents?

00:23:45

Let's look into them a little bit more. Usually, when something seemingly trivial is triggering a lot of incidents, that is actually an indication of a deeper organizational problem, not of someone not knowing how expired certs work. If the incident was almost really bad, if we found ourselves going, I'm so glad no one noticed that, that's usually an indication that we can dig into it deeper and that we have a lot to learn from it, in a nice way that gives us time and space. If it took place during a big event, like an earnings call, or if the CEO was doing something within the organization, or if we had a big demo, or if promotion packets were due, or if everyone was out of the office, those are usually indications as well. If a new service or a new interaction between services was involved. If more people joined the incident channel than usual: are you tracking how often there are lurkers as compared to actual participants in the channel?

00:24:42

There are usually a lot of people wanting answers, but sometimes there are only three or four people actually debugging the incident. That ratio of lurkers to actual participants can tell you a lot about the incident as well, and it usually indicates there's more to dig into there (the signals above are recapped in the sketch below). So when are we ready for incident analysis? When are we ready to level up our postmortems and not just have the standard RCA doc, and not just have a meeting that people kind of feel like they wasted time at? You're ready now. Having customers means you're ready to benefit from incident analysis in some form, and the earlier you start, the better; the earlier you can ingrain this in your organization, the better. So what can you do today to improve incident analysis? You can give folks more time and space to come up with better analysis, and this can be trained and aided. Use incidents that were not high profile, that didn't have a lot of emotional stakes, and give people a couple of weeks to look at them in addition to their regular work.
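As an illustration only, the signals above could be folded into a simple triage checklist that flags which incidents deserve more time and space. The field names and thresholds here are hypothetical, and human judgment should still override the number.

# Hypothetical incident record; field names are illustrative.
incident = {
    "teams_involved": 3,
    "teams_first_time_working_together": True,
    "non_engineering_teams_involved": True,   # e.g. customer service, PR, marketing
    "trivial_trigger": True,                  # e.g. an expired cert
    "near_miss": False,                       # "I'm so glad no one noticed that"
    "during_big_event": False,                # earnings call, big demo, promo packets due
    "new_service_or_interaction": True,
    "channel_joins_vs_usual": 2.5,            # people who joined vs. a typical incident
}

def deeper_review_signals(i):
    """Count the signals suggesting this incident deserves a deeper review."""
    score = 0
    score += i["teams_involved"] > 2
    score += i["teams_first_time_working_together"]
    score += i["non_engineering_teams_involved"]
    score += i["trivial_trigger"]
    score += i["near_miss"]
    score += i["during_big_event"]
    score += i["new_service_or_interaction"]
    score += i["channel_joins_vs_usual"] > 1.5
    return score

print("signals present:", deeper_review_signals(incident))  # more signals, more time and space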

00:25:34

It doesn't need to be something that they drop everything and work on, but you can get a lot of value out of giving them some time and space to actually review the incident under a different lens. Come up with some different metrics. Look at the people. Don't just have MTTR and MTTD and error counts. Look at the teams: look at whether they've worked together before, look at whether they were pushing out a new service to production, look at how many people they have on their team. Look at how often we're relying on people that were not on call. Look at the ratio of lurkers to actual incident responders. Look at the coordination cost of the incidents. You can do investigator on-call rotations; treat this like you would incident response. Have folks that were not involved in the incident do the incident review, because you get that unbiased perspective.
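Here is a minimal sketch of two of those people-focused metrics, computed from incident channel participation. It assumes you can pull the channel member list, the people who actually sent debugging messages, and the on-call schedule; all the names are made up.

# Hypothetical data pulled from the incident channel and the on-call schedule.
channel_members = {"ana", "bo", "chris", "dee", "eli", "fran", "gus"}
active_responders = {"ana", "bo", "chris"}      # the people who actually debugged
on_call_at_the_time = {"ana", "dee"}

# Lurker-to-responder ratio: many people wanting answers, few people debugging.
lurkers = channel_members - active_responders
lurker_ratio = len(lurkers) / max(len(active_responders), 1)

# Reliance on people who were not on call: a hint at knowledge islands.
off_call = active_responders - on_call_at_the_time
off_call_reliance = len(off_call) / max(len(active_responders), 1)

print(f"lurkers per active responder: {lurker_ratio:.1f}")
print(f"share of responders who were not on call: {off_call_reliance:.0%}")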

00:26:27

You get someone who can ask Kieran those questions without Kieran feeling like they're blaming him. Having folks that weren't involved in the incident do the incident reviews actually levels up your entire organization, because now they're learning about a system from an incident they didn't participate in, and that expertise is amazing to see. And allow investigation for the big ones. You need time for this, and I know you're getting asked for answers by your boards, I know you're getting asked for answers by your C-suites, but giving people time and space is actually going to help with these big ones over time, and they're not going to seem as big over time. My company actually offers a couple of things to help with this, too. We have a Move Fast and Learn From Incidents workshop, where we give you two fake incidents so that you can practice some of this without using one of your real ones.

00:27:17

And we also have a product available that's in closed beta today; I'll give you some more information so you can reach out afterwards. So how do you know if it's working? There are more folks attending the incident reviews and more folks reading them, not because they're being asked to, not because they're required to, but because they want to. This is an indication that they're actually learning something. I actually saw folks get promoted because of what they were learning in these incident reviews, at an organization where they really invested the time to level up their people and level up their incident reviews. You're not seeing the same folks pop into every incident; you're not having to react with that Batman emoji anymore. And folks are feeling more confident about their on-call rotations. They're not hesitant about ignoring an alert or responding to an alert; they're feeling better about it.

00:28:09

Teams are collaborating more. You're not seeing as high coordination costs in your incidents. And there's a better shared understanding of the definition of an incident. Something I challenge you to do is ask a few different folks in your organization what an incident is, and see how many answers you get without them needing to pull up your severity doc guide. Lots of different answers is also usually an indication that your coordination costs might be quite high. I want to share some testimonials from people that improved incident reviews in their organizations and spent the time and space to do this. Someone said, I just changed the way I was proposing to use this part of the system in a design that I was working on, as a result of reading this incident review document. They were working on a completely separate project and were able to learn about how a piece of technology got implemented because they read an incident review.

00:28:58

That's what incident reviews should be for. They don't need to just focus on the socio or on the technical; they're a training mechanism. I had someone say, never have I seen such an in-depth analysis of any software system that I've ever had the pleasure of working with. He was saying that folks who read this document come out with a better and more informed understanding of services that started out with just one or two people understanding them. These end up being educational pieces that people pull up later in the organization. I've seen it: the incident review gets published and people are still pulling it up months later, not during incidents or anything, but as part of implementation, as part of onboarding guides, as part of getting ramped up on a team. They can be beautiful living documents. There are a few components I recommend as parts of a strong post-incident process: an incident occurs, and we assign it an impartial investigator.

00:29:56

The initial analysis is done by the investigator, enough to identify whether there are people we need to talk to one-on-one. Then they do an analysis of the disparate sources involved, like the Slack transcripts, the Zoom transcripts, the PRs, the tickets. Then they might do some individual chats before the incident review. Then we might want to align and collaborate on something together, facilitate the meeting, and output the report. And then, after some soak time, after a day or so, come up with action items. I promise your action items are going to be so much better if you don't write them right away, and you'll actually see people getting them done because they're inspired to, not just because they feel like they have to. I realize this might feel like a lot for every incident, so think about the signals and metrics I gave you earlier.
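For reference, here are the stages just described, written down as an ordered checklist; the wording is mine, not Jeli's product terminology.

# The post-incident process stages described above, as a simple ordered checklist.
POST_INCIDENT_STAGES = [
    "assign an impartial investigator",
    "initial analysis: is there anyone we need to talk to one-on-one?",
    "analyze disparate sources (Slack transcripts, Zoom transcripts, PRs, tickets)",
    "individual chats / cognitive interviews before the review",
    "align and collaborate on the findings",
    "facilitate the incident review meeting",
    "output the written report",
    "soak time (a day or so)",
    "derive action items last, once the learning has settled",
]

for step, stage in enumerate(POST_INCIDENT_STAGES, start=1):
    print(f"{step}. {stage}")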

00:30:44

For certain incidents you should apply some of this, and it can be condensed and consolidated for other incidents. And if you're interested in further resources on incident analysis, the learningfromincidents.io community open-sources a lot of our learnings. We write about how we're doing this inside organizations: actual chop-wood-and-carry-water stories, not so much the theory, but how it's working in practice. And if you're interested in a little bit more on the error-counting mechanisms I brought up earlier and why they can actually hurt us sometimes, there's a very quick paper, about two pages, called The Error of Counting Errors by Robert L. Wears. It's taken from another industry. Software has a lot to learn from other industries, like medicine and aviation and maritime, on how we look at accident investigation. We don't need to reinvent the wheel there.

00:31:34

And it's a really great paper to look at. Gene asked me to include this slide at the end on the help I'm looking for. As I mentioned before, I'm building a company around these capabilities, which are exactly the tools that I wanted to have when I was an engineer at previous organizations, so I took the time to build them. If you have any interest in how we're thinking about this kind of work, or the problem and solution, you can reach out to me via my email here or through the contact-us form on our website. And I would love to talk. Thank you so much.