Las Vegas 2020

How to Crush Major Incidents with DevOps Agility

The key to DevOps success is being prepared for incidents, responding quickly and ultimately getting services back up and running. In this session, we’ll explore how, using Atlassian tools, Dev and Ops teams can work together to do just that: respond quickly to incidents, troubleshoot their cause and restore services as fast as possible. We’ll also take a close look at how teams are improving communication and collaboration across their entire organization, minimizing the business impact of unplanned service disruptions.


Darren leads the product marketing team for Atlassian’s ITSM and ITOM Products. He previously held executive positions at Opsgenie, Onshape, InVue, and DS SolidWorks and has a passion for exceptional technology. He enjoys spending time with family, camping, and more recently, woodworking. Darren holds a degree in mechanical engineering from the University of Florida.


This session is presented by Atlassian.


Darren Henry

Head of Product Marketing, ITSM & ITOM, Atlassian

Transcript

00:00:17

Hi, my name is Darren Henry, and I lead the product marketing team for the ITOM and ITSM products at Atlassian. Thank you for joining my virtual session. It's entitled "Crushing Incidents with DevOps Agility." I'm not going to keep on my webcam during the presentation; I don't think my face adds a lot to the content, but I wanted to introduce myself and kick off the session. At Atlassian, we believe that incidents are inevitable, and as a practitioner and early adopter of DevOps and agile methodologies, we think there's a lot you can do to accelerate your time to resolution. That's what this presentation is all about. I hope to not only teach you a little bit about our vision and what we're investing in in our products, but give you some ideas that will help you reduce the time it takes to resolve incidents. So let's get started.

00:01:14

I think a good place for me to start is to define what I mean when I say incident: it's an event that causes a disruption or a reduction in the quality of a service, and it requires an emergency response. And believe me, incidents are going to happen. If you Google major service disruptions, you'll see there are many every month. Just this week, the Dunkin' Donuts mobile app was having issues; that's a major catastrophe in the Boston area where I live. One analysis calculated that the daily cost of incidents is about $4 billion, and Gartner stated it was even higher. Any way you cut it, incidents suck and they're expensive. Now, we believe that the operations teams that deal with incidents are either IT or dev teams, but they look at incidents through slightly different lenses. The IT teams are responsible for numerous services, and they're often made aware of incidents by people reporting issues.

00:02:14

And in the case of major incidents, several teams may need to be involved to get to a resolution. Dev teams are focused on the services they deploy. Real-time monitoring is critical because incidents often arise due to the high frequency of change; most incidents have to do with either a code change or a third-party service. Atlassian is investing in and focused on incident management. We not only develop but use our own products: Opsgenie to manage alerts and on-call schedules, Statuspage to communicate outages to customers and stakeholders, and Jira Software as well as Jira Service Desk to manage service requests and also the work associated with restoring services and fixing the underlying problems. We are focused on empowering Dev and Ops teams to respond to, resolve, and learn from every incident, and in each of these areas we think there are big opportunities for improvement.

00:03:13

To be clear, all of this is even more important when you have embraced an agile mentality. We place the whole monitoring and incident management practice firmly on the right side of the DevOps infinity loop, and we recognize that delays here often kill your entire DevOps momentum. So let's talk about practical and tactical ways to improve how you respond to, resolve, and learn from every incident. We'll start with respond. Over the last two years, most companies we've talked to have automated their incident response. They started programmatically managing alerts and on-call schedules. If you haven't done this yet, it's the first step you should take to really gain efficiency in incident management. Now, we strive to make this easy with Opsgenie. Opsgenie integrates with all monitoring tools and manages your on-call schedules. A great tactic to improve your team's understanding of alerts is to normalize the way they're displayed to your teams.

00:04:14

Here you can see an alert from AWS CloudWatch. It may be a little confusing, and I recognize it's hard to see in this slide, but it has to do with a Lambda execution failure. Well, we can take this alert and reformat it with Opsgenie so that the information is clear. We can specify the source, the issue, and the region right within the name of the alert. What's really nice is you can normalize all the alerts from other tools like Datadog, New Relic, and Splunk so that they appear in a similar fashion. Responders see the information in a consistent way, and they spend less time deciphering the alert.
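
As a rough illustration of what that normalization step could look like if you were pushing alerts into Opsgenie yourself through its Alert API, here is a minimal Python sketch. The CloudWatch event shape, the API key placeholder, and the "[source] issue (region)" naming convention are assumptions for the example, not the exact out-of-the-box integration.

```python
import requests

OPSGENIE_ALERT_URL = "https://api.opsgenie.com/v2/alerts"  # Opsgenie Alert API endpoint
API_KEY = "YOUR_OPSGENIE_API_KEY"  # hypothetical placeholder

# Hypothetical shape of an incoming CloudWatch alarm; real events carry more fields.
cloudwatch_event = {
    "AlarmName": "lambda-errors",
    "Region": "us-east-1",
    "NewStateReason": "Lambda execution failure rate above threshold",
}

def normalized_message(source: str, issue: str, region: str) -> str:
    """Build a consistent '[source] issue (region)' alert name for every monitoring tool."""
    return f"[{source}] {issue} ({region})"

payload = {
    "message": normalized_message("CloudWatch", cloudwatch_event["AlarmName"], cloudwatch_event["Region"]),
    "description": cloudwatch_event["NewStateReason"],
    "priority": "P1",
    "tags": ["lambda", "aws"],
}

# Opsgenie authenticates API requests with a GenieKey header.
response = requests.post(
    OPSGENIE_ALERT_URL,
    json=payload,
    headers={"Authorization": f"GenieKey {API_KEY}"},
)
response.raise_for_status()
```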

00:04:57

When an alert is of high priority, we can write a rule that will escalate it to a major incident so that response teams are alerted. And if you're not familiar with Opsgenie, it will notify response teams using on-call schedules and escalations. In fact, we use different notification channels, including push notifications, SMS, voice calls, emails, and chat, so no matter where your team is, critical alerts are never missed. We have some practical advice regarding resolving incidents as well. Collaboration is essential, so we recommend you tie your incident management tools to your favorite collaboration methods. Now, Opsgenie has an incident command center. It has built-in video conferencing, chat, and incident timelines. It can spawn a virtual war room immediately when an incident occurs, and it can direct teams to that room right within their notifications. We recognize that many of our customers prefer Zoom for video conferencing.

00:05:55

So we enabled Zoom to be used in the command center, and that link can be included in the notifications. We also saw that people prefer Slack and Microsoft Teams, so we built strong integrations with those products. Here you can see individual Slack channels can be used for each incident. Our tools invite all responders, post the critical information in the header of the channel, and enable responders to take action. Finally, any message that occurs in Slack can be easily recorded back in the incident timeline to keep track of every important action that was taken. So the practical advice is to utilize the collaboration processes and tools that your response teams like, and be very flexible in allowing that; you'll get the most efficient collaboration. We recognize that quickly investigating the root cause also drastically accelerates incident resolution. And I mentioned earlier that in DevOps environments, many incidents are caused by code deployments. To determine if this is the case, we recommend you look for ways to correlate your incidents to your deployments.
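
Opsgenie's built-in Slack integration handles the channel setup and invites for you, but as a rough sketch of the core idea, posting an incident summary into a dedicated channel, here is what a hand-rolled version using a Slack incoming webhook could look like. The webhook URL and the incident fields are placeholders.

```python
import requests

# Hypothetical incoming-webhook URL for the incident's dedicated Slack channel.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"

incident = {  # illustrative incident record
    "id": "INC-1234",
    "summary": "Checkout service returning 500s",
    "priority": "P1",
    "status_page": "https://example.statuspage.io",
}

message = {
    "text": (
        f":rotating_light: *{incident['id']} ({incident['priority']})* "
        f"{incident['summary']}\n"
        f"Status page: {incident['status_page']}"
    )
}

# Slack incoming webhooks accept a simple JSON body with a "text" field.
requests.post(SLACK_WEBHOOK_URL, json=message).raise_for_status()
```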

00:07:04

And here's an example of how you can do that with Opsgenie and Bitbucket. Using a service list, we're able to relate incidents to the services that are disrupted, and we can also map our code repositories to the services that they control. So when an incident occurs, you can use Opsgenie's incident investigation to get a quick understanding of the services that are disrupted, including the service dependencies, and you can see the deployments related to the services. So here we're looking at a very visual indication of the last 24 hours, and we can see clearly successful deployments, failed deployments, and past and ongoing incidents in one place. The halos represent the number of file changes, and if there was a deployment near the time of the incident, it can be tagged as a likely cause. By surfacing this information, you can see the developers that were involved, include them in the response team, and strategize a fix. This might be rolling back the deployment, turning off a feature flag, or maybe creating a hotfix. And by the way, you can use our action channels to run diagnostic tools or even take remediation actions as well, right from your mobile device. So, for example, you could run an EC2 rescue playbook or an EC2 restart using AWS Systems Manager, all with the tap of your thumb. In summary, you want to use automation and correlation to find fast ways to troubleshoot and remediate incidents.
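
The incident investigation view does this correlation for you once services are mapped to repositories, but the underlying logic is simple enough to sketch: flag any deployment to the affected service that landed shortly before the incident started. The data shapes and the two-hour window below are illustrative assumptions, not Opsgenie's actual heuristics.

```python
from datetime import datetime, timedelta

# Illustrative records; in practice these would come from your CI/CD and incident tools.
deployments = [
    {"service": "checkout", "sha": "a1b2c3d", "deployed_at": datetime(2020, 4, 1, 13, 40)},
    {"service": "search", "sha": "9f8e7d6", "deployed_at": datetime(2020, 4, 1, 9, 5)},
]
incident = {"service": "checkout", "started_at": datetime(2020, 4, 1, 14, 2)}

def likely_causes(incident, deployments, window=timedelta(hours=2)):
    """Return deployments to the affected service that landed shortly before the incident."""
    return [
        d for d in deployments
        if d["service"] == incident["service"]
        and incident["started_at"] - window <= d["deployed_at"] <= incident["started_at"]
    ]

for d in likely_causes(incident, deployments):
    print(f"Possible cause: deployment {d['sha']} to {d['service']} at {d['deployed_at']:%H:%M}")
```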

00:08:36

Now, another great way to crush your incidents is to proactively communicate with users and stakeholders during the incident. This not only builds trust and preserves your reputation, but selfishly, it minimizes distraction by deflecting redundant reporting of incidents and issues and minimizes people asking you for status updates. Now, you should set up notifications for stakeholders similar to alert notifications, but a great way to further your communication is with status pages. Right from your incident management tool, you can spawn a public-facing status page when appropriate. We use Status Embed widgets to also add messages to our web pages, our help portals, and even our applications. We believe another great way to proactively communicate is to surface major incident information directly within ITSM and help desk tools. Here you can see our Jira Service Desk offering, and over on the left you may notice we've added visibility into major incidents with a seamless integration with Opsgenie. Now, as requests come in, agents can quickly link them to incidents. By defining this relationship, everyone wins: responders get a sense of the blast radius of the incident and can change priority as needed, and support agents see the status of incidents and can respond faster to help seekers. It's a complete crush. If a major incident is human reported, the support agents can even create a major incident directly from the ITSM tool and get the teams forming a response.
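
When an incident management tool spawns a status page entry for you, it is doing something along the lines of the sketch below against Statuspage's REST API. The page ID, API key, and incident text are placeholders, and the exact fields your workflow needs may differ.

```python
import requests

PAGE_ID = "YOUR_PAGE_ID"          # placeholder Statuspage page id
API_KEY = "YOUR_STATUSPAGE_KEY"   # placeholder API key

url = f"https://api.statuspage.io/v1/pages/{PAGE_ID}/incidents"

payload = {
    "incident": {
        "name": "Degraded performance on checkout",
        "status": "investigating",  # later updates: identified, monitoring, resolved
        "body": "We are investigating elevated error rates and will post updates here.",
    }
}

# Statuspage authenticates with an OAuth-style API key header.
response = requests.post(url, json=payload, headers={"Authorization": f"OAuth {API_KEY}"})
response.raise_for_status()
```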

00:10:13

Okay, the last section of my presentation is a fast one, but it's also very important, and it has to do with the learning stage of incident management. We believe it's important to track your progress and look for ways to improve, so you should run reports and discuss the trends as a team. Our most popular reporting and analytics include measuring the mean time to assemble and the mean time to resolution. Many people ignore mean time to assemble, but the manner in which you get the right people to start taking action is usually a great place to start the improvement process. You should also look for ways to analyze which notification channels work best for your teams. Compare the on-call responsibilities and work distribution, especially after hours. Examine which teams were notified and which teams resolved the issue; sometimes you're notifying the wrong team for a type of issue. And always look for the sources of the most common incidents. We've prebuilt a ton of reports to help you gain insight from these metrics.
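
The prebuilt reports compute these numbers for you, but the arithmetic behind them is straightforward. Here is a minimal sketch that derives a first-response metric (mean time to acknowledge, analogous to the mean time to assemble mentioned above) and mean time to resolution from raw alert timestamps; the alert records are made up for the example.

```python
from datetime import datetime
from statistics import mean

# Illustrative alert records with created/acknowledged/closed timestamps.
alerts = [
    {"created": datetime(2020, 4, 1, 14, 0), "acked": datetime(2020, 4, 1, 14, 6), "closed": datetime(2020, 4, 1, 15, 10)},
    {"created": datetime(2020, 4, 2, 2, 30), "acked": datetime(2020, 4, 2, 2, 50), "closed": datetime(2020, 4, 2, 4, 0)},
]

# Average minutes from alert creation to acknowledgement and to closure.
mtta = mean((a["acked"] - a["created"]).total_seconds() / 60 for a in alerts)
mttr = mean((a["closed"] - a["created"]).total_seconds() / 60 for a in alerts)

print(f"Mean time to acknowledge: {mtta:.0f} min")
print(f"Mean time to resolution:  {mttr:.0f} min")
```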

00:11:20

And hey, we're talking about DevOps environments, so let me talk quickly about how you can measure DevOps performance. Remember how I talked about relating services to code repositories and CI/CD tools? Well, if you do that, you can quickly understand three of the four DORA metrics and trend them over time. Look at this report that Opsgenie can generate in real time. When you share the list of services with Bitbucket and Bitbucket Pipelines, you can see we provide deployment frequency and change failure rate, as well as time to resolution. We can even trend this data over a period of time so you can see how you're doing. We'll tell you clearly the number of deployments, incidents, and alerts, and then map that to deployments versus failures, deployments versus file changes, and overall service health versus team reaction. By aggregating data across systems, you get a clear picture of what the hell is going on and how to improve. Now, my final advice on crushing incidents is to document and share the knowledge.
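
Opsgenie builds this report for you, but as a back-of-the-envelope sketch of two of those DORA metrics, here is how deployment frequency and change failure rate fall out of a list of deployments flagged as incident-linked. The data and the seven-day window are illustrative.

```python
from datetime import date

# Illustrative 7-day window of deployments; "failed" marks deployments linked to an incident.
deployments = [
    {"day": date(2020, 4, 1), "failed": False},
    {"day": date(2020, 4, 1), "failed": True},
    {"day": date(2020, 4, 2), "failed": False},
    {"day": date(2020, 4, 4), "failed": False},
]
days_in_window = 7

deployment_frequency = len(deployments) / days_in_window              # deployments per day
change_failure_rate = sum(d["failed"] for d in deployments) / len(deployments)

print(f"Deployment frequency: {deployment_frequency:.2f}/day")
print(f"Change failure rate:  {change_failure_rate:.0%}")
```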

00:12:26

It goes without saying that every incident should have a post-mortem report, but many people fail at documenting what transpired or how the incident was resolved. At Atlassian, we invested in making this easy. When the incident is resolved, you can easily create a post-mortem report because Opsgenie records everything that transpired. It populates a template, and then the incident commander or the response team can add commentary. A key point is that the report needs to be shared, so we added the ability to export the report to Confluence, where it can easily be distributed across the organization. This will help speed resolution of similar incidents and help teams avoid common pitfalls. So how do you crush incidents with DevOps agility? Well, here's a summary of my practical and tactical advice. Centralize the alerts and then normalize their format for easy reading, and route them to the right teams with strong, redundant notifications. To resolve incidents faster, be flexible in the ways teams collaborate, and start connecting systems like CI/CD tools and ITSM tools to troubleshoot at lightning speed. Finally, track and trend data using strong reporting and analytics, and find the areas of success and the opportunities to improve. Atlassian wants to help: all the tools I mentioned are available in free versions.
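
On the Confluence export step mentioned a moment ago: Opsgenie handles it with a click, but if you wanted to script something similar yourself, a minimal sketch against Confluence Cloud's content REST API could look like the following. The site URL, space key, credentials, and page body are all placeholders.

```python
import requests

# Placeholders: your Confluence Cloud site, space key, and API credentials.
BASE_URL = "https://your-domain.atlassian.net/wiki"
SPACE_KEY = "OPS"
AUTH = ("you@example.com", "YOUR_API_TOKEN")

postmortem_html = "<h2>Summary</h2><p>Checkout outage, 42 minutes.</p><h2>Timeline</h2><p>...</p>"

page = {
    "type": "page",
    "title": "Postmortem: INC-1234 checkout outage",
    "space": {"key": SPACE_KEY},
    "body": {"storage": {"value": postmortem_html, "representation": "storage"}},
}

# Confluence Cloud's content REST API creates the page from a storage-format body.
response = requests.post(f"{BASE_URL}/rest/api/content", json=page, auth=AUTH)
response.raise_for_status()
print("Created page id:", response.json()["id"])
```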

00:13:53

And we have trials of all the advanced plans. We also have some killer resources that you should check out like our incident management handbook that you can download for free or our incident management website that is chock full of best practices. This concludes my presentation. Thanks for spending time with me today. I hope you found this session helpful and I hope your major incidents are few and easy to smack down when they arise.