Speeding to Resolution with Human-in-the-Loop Automation (Las Vegas 2020)

Today more than ever, downtime has an unprecedented impact on businesses and end users. To be effective at modern-day operations, we need to focus on resilience. But the operational expertise and human intuition necessary to troubleshoot an incident are hard to train and currently live as institutional knowledge within a small subset of most organizations. In this talk, Transposit founder and CTO Tina Huang will discuss how you can use a data-driven approach with human-in-the-loop automation to democratize the skill set and knowledge necessary for modern-day operations across a broader range of people, enabling SREs to be more effective while also empowering more people to take on that reliability role. This session is presented by Transposit.

Tina Huang

Founder and CTO, Transposit

TRANSCRIPT

00:00:12

Hi, I'm Tina, founder and CTO of Transposit. But before I started Transposit, I was a software engineer, where I had to be on call for lots of different production services. I started my career in 2001 at Apple, working on the application frameworks team, following my first true passion: building for developers. I was then fortunate enough to join Google in 2005, where I got to witness the rise of SRE. Back in those days, SRE was a really limited commodity, so you had to earn your SRE through runbook coverage. And then at Twitter, I built and ran a number of high-scale production services. It's crazy how we all have these war stories from being on call. I still remember one of the first times I was on call for Blogger. Back then, Blogger ran on Windows machines, and yet our development was all done on Macs.

00:01:11

And so it wasn't until the first time that pager went off that I realized I didn't even know how to open up a Windows command line. And then another time I was on call, I checked to see that you could view the blogs, but I forgot to double-check that you could actually post updates. The next morning I went into the office and my boss was livid. Blog posting was down all night long, and yet there was almost no documentation telling you how you were supposed to verify that the site was up. You were just expected to know. He pretty much brought me to tears. All of these experiences really fuel my passion for building systems that reduce the need to rely purely on human intuition. At Transposit, we're attacking this problem, but doing it with a very, very holistic approach. At the core, we evangelize something very simple.

00:02:09

How do you turn all of that institutional knowledge into codified processes? And the specific philosophy that we take is human-in-the-loop automation. But first I'd like to set the stage with what it means to have downtime in 2020. As we're all painfully aware, COVID has changed our lives in almost every way. As we hunkered down, businesses all had to turn digital for the first time. We basically took five years of digital transformation and compressed it into about six months. But as demand for online services boomed, so did incidents. One sector hit hard was education. Students and teachers were struggling with how to start remote learning, and they turned to Zoom to be their remote bridge to each other. But one Monday morning in August, Zoom went down, and it didn't come back online for almost four hours. Almost all tech companies, including mine, have gone fully remote during the pandemic.

00:03:18

This means we are more dependent than ever on platforms like Slack and Microsoft Teams. But in May, Slack stopped sending messages for almost three hours, destroying tons of productivity. Then in August, for more than four hours, Gmail users couldn't send emails, and many other apps in G Suite were also inaccessible. Only a month later, it went down again. And just a few days ago, Microsoft 365, including Outlook mail, Word, Excel, and Teams, was hit with a massive outage. And the list goes on. These services are more critical to our functioning than ever, causing higher demand, which is a good thing. But this increased usage often leads to more incidents. The next major outage isn't a question of if, but when. In the rest of this presentation, I'm going to walk through a bit of history of how agile and the rise of SRE methodologies have changed the landscape of what it means to have an incident.

00:04:24

Then, the fundamental challenges of running a modern incident management process. We'll talk about some of the shortcomings of the traditional approach to automation, and finally, what I call the data-driven, human-in-the-loop approach. So how did we get here? From the development side of things, the last decade has been all about agile and the shift to continuous integration and deployment. But as we mature in that practice, we are realizing that it brings with it new challenges on the operations and process side. I was recently talking to an analyst at Gartner who commented that most companies are still riding the high of continuous integration, but they're about to feel this massive pain because they haven't yet created processes to make observability and reliability sustainable. A lot of people focus on SRE as the answer, but really SRE is a specific title. It doesn't matter whether you have an SRE org, you believe SRE is a role, or you just believe that reliability and resilience are important.

00:05:30

What I'm about to describe is relevant to you. I come from a software engineering background, so when I first approached reliability engineering, I really thought of it as engineers who just specialize in infrastructure. But then I stepped back and thought about what it really takes to be a good SRE. The first is lots of domain knowledge about infrastructure, but they also need lots of ability to automate and build tools. They need to understand what it takes to create good processes. But most importantly, they need great intuition. They need to have the full context of their production environments all in their head, and not just the production environments, but organizational context: who do I go to, and in what circumstances? And they also need generally good intuition about risks and trade-offs. And when we think about the impacts of agile, it's easy to focus on development and ops.

00:06:34

But when we talk to customers, the problem is intricately tied to customers and customer support. When I was talking to my CEO, Divanny, about this a while back, I was reflecting on my experience at Apple. At Apple, we shipped boxed software. But what does that mean? It meant we had months between releases. That meant we had long, big periods for software hardening. We had feature freezes, UI freezes, code freezes, and strict change management towards the end. And then Divanny pointed out something that I had never considered: the old service desk model was really built around support that was mostly about debugging user error. With these long cycles, there was no ability for engineering to roll out a change quickly. So if it wasn't a user error, you had to just figure out how to help the user work around the problem. With agile, we are now testing code in prod.

00:07:33

I don't care how many automated tests you have; customers are still uncovering bugs in your systems. Incidents often begin with customer support and then flow to engineering. Agile means that customer support is part of reliability. Agile has fundamentally changed what incident management looks like. It has increased the frequency of incidents at the same time that it's broadened their impact, all at a time when service uptime is more critical than ever. Not a great combo. So when we think about incident management, we need to focus on: how do I resolve my incidents faster, and how do I learn from them and have my organization get better over time? I'm known for my controversial opinions, so here are two of them. One, a fully automated world is not the panacea we think it is. And two, postmortems alone aren't the answer to continuous learning when it comes to incidents.

00:08:35

So let's start with automation. 100%, we need more automation, and not just automation of technical tasks like cherry-picking a change or restarting a server, but automation of the basic incident process itself, like assigning a commander, filing Jiras, and starting Zoom bridges. But most teams today are still relying on manual and chaotic processes. Teams that we've talked to overwhelmingly report that they have little to no automation around incident response. What automation they do have is fragile and creates risk. So why is this? What I hear from our customers boils down to three things. First, automation is scary. Second, many processes have too much nuance to automate. And third, talking to all the services you need to automate is hard. So let's go into each one of these, one at a time. Automation is scary. Automation is scary because you need to have a lot of trust.

00:09:39

You need to have full trust in this automated script that you may or may not understand. And that trepidation extends to how to run the script and the environment in which you're running it. For me, this pain point is always felt most acutely with Python scripts. I can never remember what version of Python I'm running or how to toggle between them. And then if something goes wrong when you run it, what do you do? How comfortable are you debugging it? It works until it doesn't. Automation is about more than just the script. It's about the ecosystem where the script lives. It's about discovery: how do members of your team learn that the script exists? Documentation: how do you run the script? And maintenance: who owns the script, and who can help troubleshoot it?

00:10:36

The second piece we hear from customers is that their processes and systems have too much complexity and nuance to automate. Another way of putting it is: you can't automate the unknown. If you look at the landscape of automation tooling we have today, you have things like GitHub Actions, AWS Lambdas, and Jira workflows. All of these tools share one thing in common: they are trigger-based. Think about it. They all have some machine-detectable trigger that spawns a headless process. That means you need to have a reliable machine-detectable event, and that this process can't take any human input. If something goes wrong, you need a way to notify a person to step in, and debugging what went wrong is hard. Engineers often like to say to me, "Hey, if Transposit is successful and helps me automate away all of my incidents, won't you be out of business?"

00:11:39

I find this so adorably engineering-minded. Incidents are unfortunately a result of forward movement and innovation. As you introduce new systems, you will find that they interact in unexpected ways, and those interactions can cascade into more incidents. Unless your product has been put into maintenance mode, you will have incidents. And finally, automation is hard because APIs are hard. The connection between APIs and automation might not be obvious at first, but as our toolchain has moved to SaaS and the cloud, automation now requires coding against APIs. Here I have a map of continuous delivery tools. Whenever we show this map to our customers, they typically laugh and say it looks so simple and organized compared to what they have. And this is just delivery. Most of our tools these days are these powerful SaaS platforms, and that means automating around them requires APIs.

00:12:44

Even our cloud infrastructure, like AWS, is accessed via APIs. And this brings us back to: APIs are hard. Each one is a unique snowflake, calling them needs to be secure, and finally, reliably scripting against them is complex. This effectively means that to build automation in this world, developers need to have some general competency around distributed systems and security. For instance, they need to understand how to code against unreliable networks. They need their code to be resilient to the long latencies of network requests, and they need to know how to write code that throttles requests appropriately so as not to overwhelm the underlying systems and trigger rate limits. To lower the barrier to automation, we need to make it easier to build on top of APIs. So this brings us to: we need to rethink automation. When we talk to engineering leaders, they often start with automation, but what they quickly get to is that they want to take knowledge that is buried inside individuals' heads and turn that into a repeatable process.
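[Editor's sketch] The point about unreliable networks, long latencies, and rate limits can be illustrated with a small retry helper. This is a minimal sketch, not anything shown in the talk: the names `call_with_backoff` and `RateLimited` are hypothetical, and `request` stands in for any zero-argument callable that wraps a real API call.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error a client raises when an API reports too many requests."""


def call_with_backoff(request, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call request() and retry transient failures with exponential backoff.

    request is any zero-argument callable. sleep is injectable so the
    helper can be exercised without actually waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except (RateLimited, TimeoutError, ConnectionError):
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to a human
            # Double the wait on every retry, capped at max_delay, and add
            # jitter so that many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
```

Backing off between attempts is also a simple form of throttling: a script that hits a rate limit slows down instead of hammering the service. When the attempts run out, the error is re-raised rather than swallowed, so a person can step in.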

00:13:51

One that anyone can do. When it comes to reliability, what we really care about isn't automation, it's codified processes. Codified machine processes are automation, but codified human processes are runbooks. The ideal world is about being able to seamlessly bind these two together, something I call human-in-the-loop automation. When you look at other mission-critical industries, like pilots or medical doctors, they often employ checklists. Human-in-the-loop automation is basically checklists on steroids. Think of a checklist that documents the pieces that need to be done, but with buttons that let you easily run automated pieces. Keeping humans in the loop surprisingly means more automation and more robust automation. That's because you can let human judgment step in where the logic to automate would end up being so complex as to cause more problems than it solves, or where the logic would be so closely tied to the specifics of a fast-evolving product that it would need to be updated almost as frequently as the product itself.
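[Editor's sketch] The "checklist with buttons" idea can be modeled as a list of steps, some purely manual and some backed by an automated action that runs only when a person approves it. This is an illustrative sketch under assumed names (`Step`, `Runbook`, `confirm`), not Transposit's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    description: str
    # None means a purely manual step; otherwise a callable that performs
    # the automated piece and returns a short result string.
    action: Optional[Callable[[], str]] = None


@dataclass
class Runbook:
    title: str
    steps: List[Step]
    log: List[str] = field(default_factory=list)

    def run(self, confirm: Callable[[str], bool]) -> List[str]:
        """Walk the checklist; run an automated step only if a human approves it."""
        for step in self.steps:
            if step.action is None:
                self.log.append(f"manual: {step.description}")
            elif confirm(step.description):
                self.log.append(f"ran: {step.description} -> {step.action()}")
            else:
                self.log.append(f"skipped: {step.description}")
        return self.log
```

In a terminal, `confirm` could be something like `lambda d: input(f"Run '{d}'? [y/N] ").strip().lower() == "y"`. The decision of if and when to run each automated piece stays with the on-call engineer, and the log records what was actually done.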

00:14:59

So let's go back to one of our earlier observations: what makes a great SRE or reliability engineer? One, having lots of operational expertise. Two, lots of domain knowledge mapped in their heads. And three, great intuition around how to debug incidents when they occur. Now I'm going to walk you through how we take that great intuition and turn it into a documented process that others can follow, how to make that context available to a wider range of engineers to enable them to use good judgment to make sound decisions, and then how to continuously collect new learnings to improve your reliability processes over time.

00:15:44

There's a customer that we were working with that talked about how oftentimes incidents are caused by long-running queries in their Mongo database. There are a few problems with this. One, most engineers on their team don't know to look at MongoDB for runaway queries. Second, most engineers don't actually know how to kill a query. And third, most are afraid of killing a query on a production database. What we worked with them on implementing is really simple. Start with a high-level, basic incident management process. For us, it's "go to Transposit when something is wrong." The important thing here is that you can only expect all on-call engineers to remember something this basic. Everything else should flow from there. Whether it's filing a Jira, starting a Zoom, or looking at a specific graph, you can now organically introduce new processes, including driving engineers to better utilize runbooks. Then, build runbooks to disseminate important information.

00:16:52

Like, "when you see this alert, check MongoDB." Finally, incorporate automated workflows into those runbooks to make it less intimidating and less error-prone to run tasks like killing MongoDB queries. It's important to realize that a lot of what we need to do during an incident, we do infrequently. SREs have the advantage of focusing on reliability a hundred percent of the time. But as one customer told me, their on-call rotation is every 24 days, so between tours of duty, they often forget what to do. Process is key. So let's go back to my second controversial opinion: postmortems alone aren't enough for continuous learning. Something that never stops surprising me is why incident management, which is so closely aligned with engineering, isn't more data-driven. The other day, I was reading an article in The New Yorker about the Epic medical record system. When electronic medical records first came out, the promise was to have all this data available to help diagnose and treat patients during the course of hospital care. But in practice, the electronic medical records and the systems around them were limited to billing and accountability.

00:18:05

As I was reflecting on this anecdote, it occurred to me that incident management has exactly the same problems. We focus on root cause analysis as part of the postmortem process, but our ability to make use of that data is no better than with electronic medical records. We do postmortem analysis to understand what happened after the fact, and we sometimes file tickets to track areas of improvement. But more often than not, postmortems are used for documentation and accountability. That data is inaccessible to on-call engineers at the time of the next incident. Almost all customers I've talked to use something like Jira or Google Docs for their postmortems, and all of them say that the information is not easily searchable during the next incident. The way we think about incident management is fundamentally broken, and the data we collect around it is sparse or missing. Most incident management platforms treat incidents as discrete events, and they effectively track a single metric, MTTR. But this is broken for a few different reasons.

00:19:13

Incidents don't start with the alert. They start when someone did something like commit a bad line of code or make a bad infrastructure change. Incidents don't end when the ticket is marked resolved; that usually just indicates that the service has been restored. And then everything that happens during an incident is effectively treated as one big black box. We need to be more data-driven. We need to collect more granular data on what happens at the time of an incident: things like who is involved, what runbooks people are using, and what actions people are taking. If we had this kind of data, we could start learning in a whole different way. We could answer questions like: where are we missing runbook coverage? What are common tasks that take up a lot of time and would most benefit from automation? Who is being pulled into every incident and is likely to hit burnout?
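[Editor's sketch] The three questions above become simple queries once incident activity is captured as structured records. The schema below is an assumption made for illustration (the `IncidentEvent` fields and the function names are hypothetical), not a format described in the talk.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentEvent:
    incident_id: str
    actor: str      # who was involved
    kind: str       # "runbook" or "action" in this sketch
    name: str       # which runbook was used or which action was taken
    minutes: float  # time spent


def incidents_without_runbook(events):
    """Where are we missing runbook coverage?"""
    all_ids = {e.incident_id for e in events}
    covered = {e.incident_id for e in events if e.kind == "runbook"}
    return sorted(all_ids - covered)


def most_time_consuming_actions(events, top=3):
    """Which actions take the most total time and might benefit from automation?"""
    total = Counter()
    for e in events:
        if e.kind == "action":
            total[e.name] += e.minutes
    return total.most_common(top)


def people_in_every_incident(events):
    """Who is being pulled into every incident?"""
    incidents = {e.incident_id for e in events}
    seen = {}
    for e in events:
        seen.setdefault(e.actor, set()).add(e.incident_id)
    return sorted(actor for actor, ids in seen.items() if ids == incidents)
```

None of these queries is possible when the same information only exists as free text in a chat archive or a postmortem document.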

00:20:07

We need to start thinking about incidents as part of an operations continuum, and we need to start capturing more data across the entire spectrum of human and machine events. Let's stop and think about where all the data around incidents currently lives. Most teams use a number of tools like Jira, Google Docs, and Zendesk to manage communications and process. And with communication, there's the big elephant in the room: chat clients like Slack. I can't tell you how many times I hear customers complain about how they can't stop incident communications from moving to chat, and yet they have no way of capturing the data there. There are a number of system tools, like APMs and continuous deployment, that all live in their own silos. And then there's external communication through status pages, blog posts, and email. But in the middle of an incident, the last thing on anyone's mind is capturing data.

00:21:03

Today, we rely on manual capture, done primarily when building postmortem timelines. And this process is far from ideal. Because it's manual, it's time-consuming; I can't think of an engineer who doesn't complain about having to write postmortems. It's also error-prone: it requires you to sift through chat logs and go back and do screen grabs. It's unstructured, so it's hard to access and use during the course of the next incident, and it relies on humans to analyze it and learn from it. So how can we eliminate the burden of having to manually capture all of this data, especially messy human data, and capture it in a structured way? I began my career working in consumer. When I worked on Google News and Twitter, we instrumented our interfaces to automatically gather data and help us improve the service. So it occurred to me: let's do that for incident management. As we discussed, we want a single funnel point for driving an incident response process.

00:22:05

Now I'm going to add to that. You probably want more than just a framework; you want an actual platform. And that's because this platform will allow you to automate and control how you capture data. In the ideal world, this platform is easy to use and makes on-call life easier. We often use the slogan "beach ops." The dream is to have an on-call process so easy that you can manage an incident from your phone while drinking a margarita on the beach. Having a platform that everyone wants to use that also captures the data your organization needs is a win-win. This means you can stop using a stick to beat process and data capture into your team. But having the data is not enough. We need to make that data accessible to on-call engineers during the course of the next incident. There are basically two parts to making this happen: capturing the right data with the right structure.

00:23:00

If your data is unstructured text, such as chat archives, it's going to be hard to make use of this data. And then, create usable tools to access this data. Luckily, once you've adopted an incident management platform that automates the structured data capture, you're halfway there. So what are the ways of making this data more accessible? I'll describe two of them, again taken from the consumer world: recommendations and search. We can use that data to recommend things. When a particular alert comes in, suggest people to invite to the incident, suggest runbooks that might help with investigating and resolving the issue, and suggest workflows that are commonly used. And now that we have the structured data, you can enable much more robust search that allows you to answer things like: during the hour before the incident, what changes occurred in my environment? Last time this alert went off, what was the cause? When the last person ran this workflow, what parameters did they use? And everyone loves machine learning. Luckily, once you've got structured data, you can employ all kinds of existing machine learning techniques to help further drive improvements.
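[Editor's sketch] Both ideas, recommendations and search, can be approximated with very little code once the data is structured. The sketch below uses a simple frequency count as a stand-in for a real recommender, and the record shapes (pairs of alert and runbook, pairs of timestamp and change description) are assumptions made for illustration.

```python
from collections import Counter
from datetime import timedelta


def recommend_runbooks(history, alert, top=2):
    """Suggest the runbooks most often used for this alert in past incidents.

    history is a list of (alert_name, runbook_name) pairs.
    """
    counts = Counter(runbook for past_alert, runbook in history
                     if past_alert == alert)
    return [runbook for runbook, _ in counts.most_common(top)]


def changes_before(changes, incident_start, window=timedelta(hours=1)):
    """Answer: during the hour before the incident, what changed?

    changes is a list of (timestamp, description) pairs.
    """
    return [description for timestamp, description in sorted(changes)
            if incident_start - window <= timestamp <= incident_start]
```

The suggestions are only suggestions: in a human-in-the-loop setup, the on-call engineer still decides whether to follow them.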

00:24:13

We've talked to hundreds of customers about their challenges with driving resilience into their organizations, and we've employed all of the principles that we've just discussed to help them turn all of that institutional knowledge into codified processes. Look, this isn't meant to be a sales pitch. You can employ these techniques in a number of different ways, but I hope these examples from Transposit will help you incorporate these principles into your organization. We've built something we call an interactive runbook as our solution to human-in-the-loop automation. Here you can see what it looks like to have a runbook that codifies human processes around something like debugging an alert for high 500s. And you can see how the runbook seamlessly integrates with a choice of workflows that spawn automated processes. But the on-call engineer is in control of choosing if and when to run them.

00:25:05

Here's an example of what I mean by capturing semi-structured data around human processes. This screenshot is the automatic documentation that Transposit collects. You can see we've captured elements like what runbook an on-call engineer is using and what actions the engineer is taking. You can see that we suggest runbooks and actions to help with a given alert, but the on-call engineer can choose whether or not to follow these suggestions. And here is what it means to make that data accessible during the course of an incident: we've built something we call knowledge streams that lets the on-call engineer do sophisticated searches over previously recorded data. And finally, what it means to abstract away the complexities of APIs. Here you see the developer platform we've built at Transposit. We support SQL, Python, and JavaScript, but I think SQL is the best example of this abstraction. You can see I have a simple SQL statement, so that the developer just expresses what data they'd like to fetch, but they don't need to think about all the complexities of APIs, such as authentication and pagination. In today's world, reliability and resilience are key. To achieve uptime, we need to move from institutional knowledge to codified processes. And to do this, we need to take a data-driven, human-in-the-loop approach to automation. While many companies aim to solve technical problems with technology, at Transposit we aim to solve human problems with technology. Thank you very much for watching. I hope I've opened your mind on your approach to resilience.

00:26:45

If you have any questions or want to talk, join me at Transposit's Slack Q&A and visit our booth for a demo, live chat, and cool prizes. We'll be hosting two special happy hour events. Check out our booth for details.