DevOps Metrics We Love

Our Vice President came to us asking if we could build a dashboard to measure team velocity. We came back to him with a list of metrics we would rather measure, instead.


We wanted to improve DevOps practices and behaviors across multiple squads that are building and operating a global marketplace. There are hundreds of things we could measure. We seriously considered about 50 metrics that we could track (many from DevOps Enterprise Summit talks, or Accelerate) and chose to surface a dozen or so that would give us the most insights into agile delivery, the flow of work, availability, and code health.

This talk is about the metrics.


Assumption: you have a culture that will visualize metrics without fear.


How do we justify the value of each metric when challenged?


How do we set our thresholds for "good", "acceptable", and "poor" scores?


Which metrics are related and grouped together on the dashboard?


Which metrics do we intentionally not visualize and why?


Which metrics do we wish we had, that we don't?

Why we made it difficult to compare data across teams.


Craig Cook

DevOps Coach, IBM


Ann Marie Fred

DevOps and Security Lead, IBM

Transcript

00:00:07

Hi everyone. I'm Ann Marie Fred from IBM, here with Craig Cook, and we will be talking about DevOps metrics we love and our experience with using these metrics to drive measurable change within our teams. In this presentation, we have two key takeaways. First: why should you care about DevOps metrics? What can they do for you? And second: how can you incentivize the right behaviors without encouraging people to game the system? But first, a little about ourselves. I'm the DevOps and security lead for our commerce platform. That's the part of ibm.com that enables online sales. It includes things like the checkout process, provisioning, My IBM (where you manage your subscription and billing information), and the product catalog. I have more than 15 years of software development experience, including nine years of working in a DevOps environment where test automation and continuous delivery are a way of life, and sharing the benefits of that with others. I've spent the last five years on call for various production applications. I spent three years as a development manager, then switched back to a technical role to start a formal DevSecOps program. And I've been our privacy and security compliance lead as well for the past year. Over to you, Craig.

00:01:26

Thanks, Ann Marie. Most of my background has been in operations and infrastructure. The last few years have been spent helping internal IBM teams improve their DevOps practices. Not having direct authority over any squads makes things very interesting. I have to understand where their pain is. Every squad is at a different stage on their journey, and I have to deliver the value and explain why they should implement these ideas. Ultimately, my goal is to help squads get value into production faster with higher quality. Today I started a new role. Shout out to all the DevOpsDays organizers worldwide; I help with Raleigh. Ann Marie and myself do not speak for IBM. What you're going to hear today are our own opinions. IBM has some fantastic lawyers, and I really do not want to end up there again.

00:02:30

Ann Marie and myself are individual contributors; as you can see, there are a few layers of management above us. This talk comes from the last five years of helping build IBM's global marketplace. When we started our organization, we started with the Spotify squad model, and we modified it a few times. Our squad was part of the testing and operations squad, and our goal was to improve the DevOps practices of other squads. When we first started, one of the first things we did was create an availability dashboard to understand the uptime of the services in the marketplace. You'll hear more about this later.

00:03:17

Around September of 2018, our vice president came to us with a request to build a velocity dashboard. He believed his squads were performing well, and he wanted data to show it. The problem is, we already knew a couple of things from experience. The first is that story point numbers are not consistent across squads. And the second is that if you ask teams to increase their velocity, they'll probably just start padding their story point estimates. What we really wanted were metrics that would help teams improve their agility and efficiency while maintaining quality. We had a gut feeling that certain things were making feature delivery slower, and we chose an initial set of metrics that focused on code reviews and the code delivery process. The screenshot on this page is of our first prototype. For the record, we eventually threw this version away and started over again, but we got some good feedback from it. This talk is in four parts: we'll cover the discovery phase, prioritizing the metrics, the gory technical details of those metrics, and the outcomes, meaning what changed on our squads as a result. Craig's going to talk about some of the first decisions we made.

00:04:36

We really did not want to make our own custom dashboard. We'd already done that, and it was a lot of work. The year prior to this effort, we had evaluated an open-source project that did something similar. It didn't have support for Travis, which is the primary CI/CD platform that we use. We spent a few weeks trying to add support and gave up in frustration. When this initiative came up again, we evaluated that project once more; it still lacked Travis support, so it got discarded. Our VP had been contacted by a vendor with a commercial product that does something similar, and we were encouraged to evaluate the tool. Once we dug in, we discovered there would be a lot of custom work to get our data sent to this tool. It would also be very expensive to send data at the scale we wanted. That left us the custom option. We'd done it before; how hard could it be? Once we chose the custom option, we then had to decide what we were going to put on it.

00:05:53

We often get pushback on the metrics we measure from teams that get a poor score, so we needed research backing up the predictive value of each metric and how each of the metrics ties back to business results. Our first resource was flow metrics from some of Dominica DeGrandis's talks. We like the theory behind these, but some are easier to measure than others. Next, we used the Accelerate book by Dr. Nicole Forsgren, Jez Humble, and Gene Kim. Accelerate highlights four key metrics: lead time, deployment frequency, mean time to restore, and change fail percentage. Some of the metrics that we use have been around for so long that we didn't know where they came from, such as the availability or uptime of a service. We also considered a couple of metrics we'd heard of from Microsoft. We have a SonarQube server, which gives us scores for security, code quality, and test coverage. And finally, we invented a couple of our own, including deployment stability and the overall DevOps score.

00:06:59

We already had this availability dashboard that we use in quarterly reviews with our general manager. This visibility into availability puts pressure on the low-performing teams, but also gives them the air cover they need to fix availability problems at the cost of feature work. The goal of our minimum viable product, or MVP, was that it should be small, useful, and fully functional, so we could get feedback on our ideas. As you saw from the prototype screenshot earlier, we started with just a few metrics we could pull together in one month and deployed the app to gauge interest. First, we learned that everyone appreciated the way we had pulled together several different types of metrics into one dashboard. We had availability, build and deployment time, code review time, and batch size in one place, and we had weighted the scores to come up with an overall DevOps score. However, a majority of the people were afraid that the metrics would be used against them, or that they would get uncomfortable questions from executives about any red tiles on their dashboard. We realized from the beginning that it is critical to optimize the dashboard for squads using it themselves, not for somebody looking over their shoulder and telling them what to do. Before going any further with development, we trained our executives that they were not allowed to question any metrics directly with the squad.

00:08:29

Now let's talk about how we chose metrics. What metrics did we want to see? We already had a dashboard that showed availability data, and our VP wanted to see velocity metrics, and there are various ways to interpret that. Quality is important: do you have good development and operational practices? Agile is good: how do you know you're building the right thing? How do you get those metrics? Our candidate list had 34. Just because we could visualize something doesn't mean we should. We wanted a dashboard to drive best practices that would raise the quality of all services. When developers are on call and woken up for their own services, they get very interested in highly available architecture. We knew from the Accelerate data that high-performing squads deployed at least daily, so we wanted to see deployment speed. Work in progress is a Kanban metric; it's better to complete one item than have 10 almost done. Test coverage gives you confidence that your code is working as intended. Quality and security are hard to do well, and we wanted to visualize those metrics as well. We knew story points could easily be gamed; it's easier to count the number of stories instead. We got that idea from the Everyday Kanban website.

00:10:14

Metrics that were not automated or were too hard to get were thrown out. Human-created metrics could create variability and cause pain, and we don't want to cause people pain when we're trying to get them to adopt a new service. How would you visualize lead time with your teams? How many work items are nearly completed?

00:10:44

Our independent squads use different task tracking tools, and it became very complicated to try and tap into each of them and get that data out. We did a quick review of our own data and discovered that most stories were implemented and completed quickly, or not at all. Defects, stories, and unplanned work are all shown on the "what completed" graph, not as separate graphs or ratios. Something like defects per developer could create the wrong impression and let people jump to conclusions; we don't want to encourage bad behavior. Defects outside of SLA are handled through a different process. Lastly, we don't enjoy herding cats, affectionately known as squads, and asking them to change their workflow was going to cause trouble. Now that we had a list, we plotted each item's importance on one axis and feasibility on the other. That made it easy to see where we should start writing stories on gathering these metrics.

00:11:58

Now we'll get into the technical details of the metrics that we chose. If you're a metrics geek, this is the part of the talk you've been waiting for. We set our own thresholds for good, acceptable, and low scores based on our experience of what high-performing teams do in our organization. To get a green or good score, you're going to have to be best of breed, not just barely acceptable. We have high standards and we don't believe in grade inflation. We showed the calculations we used right on the dashboard so people can see them for themselves and to invite debate.

00:12:33

The first metric is availability, and it's measured relative to your service level objective, or SLO. Most of our applications and web services agree to a 99.95% SLO. Our corporate-wide authentication service has a 99.9% SLO, so squads depending on it can't be higher than that, and they commit to a lower 99.85% SLO instead. The availability score is based on the uptime of the deployed application relative to its service level objective over the last 30 days. A score of 100 is given for 100% uptime, decreasing to 80 for just meeting the SLO, and zero at four times below your SLO. So if you're allowed to have 0.05% downtime, and you have 0.2% downtime or worse, you will get a score of zero. It's hard to get a high score, but our goal is that the site should never be down. The Accelerate book focuses on mean time to restore as a key metric. Availability is a closely related metric, but it's easier for us to measure directly than MTTR.
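As an illustration, here is a minimal sketch of how a score like that could be computed. The talk only gives the anchor points (100 at perfect uptime, 80 at the SLO, 0 at four times the allowed downtime); the linear falloff between them is an assumption, not the speakers' actual formula.

```python
def availability_score(uptime_pct: float, slo_pct: float) -> float:
    """Hypothetical availability score: 100 for perfect uptime, 80 for
    exactly meeting the SLO, 0 once downtime reaches 4x the SLO allowance.
    Linear interpolation between anchors is assumed."""
    allowed_downtime = 100.0 - slo_pct          # e.g. 0.05 for a 99.95% SLO
    actual_downtime = 100.0 - uptime_pct

    if actual_downtime <= 0:
        return 100.0
    if actual_downtime <= allowed_downtime:
        # Between perfect uptime and just meeting the SLO: 100 -> 80
        return 100.0 - 20.0 * (actual_downtime / allowed_downtime)
    if actual_downtime >= 4 * allowed_downtime:
        return 0.0
    # Between meeting the SLO and 4x the allowed downtime: 80 -> 0
    over = (actual_downtime - allowed_downtime) / (3 * allowed_downtime)
    return 80.0 * (1.0 - over)


print(availability_score(99.95, 99.95))  # 80.0: just meets the SLO
print(availability_score(99.80, 99.95))  # 0.0: 4x the allowed downtime
```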

00:13:43

The next metric is deployment frequency, which is based on the number of successful deployments over the last 30 days. This is inspired by the deployment frequency metric in Accelerate and other DevOps studies. If you've deployed zero or one times in the last month, your tile is red; two or three deployments will get you a yellow tile; and four or more deployments is a green. In short, you must consistently deploy at least once per week to get the highest score. Teams that do well on this metric make small, frequent, and low-risk changes. They use continuous delivery, they fully automate their tests, and they never change or reconfigure servers running in production. Instead, they change infrastructure code in GitHub and redeploy. They also patch their systems frequently.
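The banding described above is simple enough to show directly; this sketch just encodes the stated thresholds as a tile color.

```python
def deployment_frequency_tile(deploys_last_30_days: int) -> str:
    """Red/yellow/green banding for deployment frequency, per the talk."""
    if deploys_last_30_days <= 1:
        return "red"      # zero or one deployment in the last month
    if deploys_last_30_days <= 3:
        return "yellow"   # two or three deployments
    return "green"        # four or more: roughly weekly or better


print(deployment_frequency_tile(5))  # "green"
```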

00:14:30

Deployment stability is one we invented, and it's the percentage of time when the most recent build was successful. If the build is successful less than half the time, the tile will be red. If 50 to 90% of the builds succeed, the tile will be yellow. And if more than 90% of the builds succeed, it will be green. Teams that do well on this metric find errors earlier, on developer workstations instead of on the build servers, and they fix chronic build and deployment issues in order to make their developers more productive. This maps to the change failure rate in Accelerate. When your change process is fully automated, it can be measured directly like this.
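A simplified sketch of the banding: the real metric is time-weighted ("percentage of time the most recent build was successful"), but this approximation just uses the fraction of successful builds over the window, which is an assumption on our part.

```python
from typing import List


def deployment_stability_tile(build_succeeded: List[bool]) -> str:
    """Approximate deployment stability: fraction of successful builds
    over the last 30 days, mapped to the red/yellow/green thresholds."""
    if not build_succeeded:
        return "no data"
    pct = 100.0 * sum(build_succeeded) / len(build_succeeded)
    if pct < 50:
        return "red"
    if pct < 90:
        return "yellow"
    return "green"


print(deployment_stability_tile([True, True, False, True, True]))  # "yellow"
```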

00:15:12

Deployment speed is based on the amount of time needed to deploy changes to production from the time when a developer merges a change. A perfect score is given for build times under 20 minutes and decreases to zero above 90 minutes. Teams that do well on this metric use build parallelization to speed up their builds. Their fast deployments improve the mean time to recovery whenever redeployments are needed to fix problems, and because their deployments are easy and fast, these teams also deploy more often and get faster feedback. The repository speed score is inspired by Dominica DeGrandis's flow metrics. It's based on the time from pull request submission to merge, which is effectively the code review duration, of GitHub pull requests over the last 30 days. A perfect score is given when the average life of a pull request is zero to two weekdays, decreasing to a score of zero at five weekdays. Teams that do well on this metric don't neglect or ignore pull requests. They quickly review pull requests and help other developers who are stuck. Each unmerged pull request represents hours or days of a developer's work that hasn't delivered value yet. These teams also reduce their work in progress by finishing what they've started before moving on to the next feature.
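Both of these scores follow the same "perfect below a floor, zero above a ceiling" pattern, so a single helper can illustrate them. The linear falloff between the two anchors is assumed; the talk only states the endpoints.

```python
def ramp_score(value: float, best: float, worst: float) -> float:
    """Illustrative scoring ramp: 100 at or below `best`, 0 at or above
    `worst`, with an assumed linear falloff in between."""
    if value <= best:
        return 100.0
    if value >= worst:
        return 0.0
    return 100.0 * (worst - value) / (worst - best)


# Deployment speed: minutes from merge to production (20 min / 90 min anchors).
print(ramp_score(35, best=20, worst=90))   # ~78.6

# Repository speed: average pull request life in weekdays (2 / 5 anchors).
print(ramp_score(3, best=2, worst=5))      # ~66.7
```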

00:16:37

The repository efficiency score is based on added lines of code in GitHub pull requests over the last 30 days. A score of 100 is given to pull requests with fewer than 150 lines added, and it decreases to zero for pull requests with over 500 lines added. The importance of small batch sizes comes from the Toyota lean production system, and its impact was illustrated in books like The Goal by Eliyahu Goldratt and The Phoenix Project by Gene Kim. Teams that do well on this metric keep their changes small and low risk. They are able to review small changes more carefully, and they approve changes more quickly. Back to you, Craig.
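The batch-size score fits the same ramp shape; here it is spelled out with the 150- and 500-line anchors from the talk (the linear falloff is again an assumption).

```python
def repository_efficiency(lines_added: int) -> float:
    """Batch-size score sketch: 100 under 150 added lines per pull request,
    falling to 0 above 500 added lines (linear falloff assumed)."""
    if lines_added < 150:
        return 100.0
    if lines_added > 500:
        return 0.0
    return 100.0 * (500 - lines_added) / (500 - 150)


print(repository_efficiency(120))  # 100.0
print(repository_efficiency(400))  # ~28.6
```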

00:17:21

Thanks, Ann Marie. The next three metrics come from SonarQube. The security metric is the code vulnerabilities rating. We average the bugs and code smells metrics to get code quality. The test coverage percentage is used in the last column. The overall score is a weighted average of the others; if data is not available for a score, it is omitted from the calculation. This gives the squads a quick, high-level view of how they're doing. Each squad comes up with at least three epics for the next quarter, and these are reviewed with our executives at the start of the quarter. It can be easy to lose track of these over time, so we wanted an easy way to visualize how we're progressing with them. Instead of changing thresholds for each squad, we created a squad comments feature. It's a way to document reasons for some metrics, visible right on the dashboard. If you have a red tile, it's not necessarily a bad thing, but you need to know why it's red, and the squad comments are a way to document that. Most of our squads are using some version of Scrum, Kanban, or Scrumban, which is a combination of both of them. These are generic-level metrics that work for all of them.
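A minimal sketch of the overall score as described: a weighted average that simply leaves out any metric with no data. The weight values here are illustrative, not the ones used on the real dashboard.

```python
from typing import Dict, Optional


def overall_devops_score(scores: Dict[str, Optional[float]],
                         weights: Dict[str, float]) -> Optional[float]:
    """Weighted average of metric scores; metrics with no data are omitted."""
    total = 0.0
    weight_sum = 0.0
    for name, score in scores.items():
        if score is None:          # no data: leave it out of the average
            continue
        w = weights.get(name, 1.0)
        total += w * score
        weight_sum += w
    return total / weight_sum if weight_sum else None


# Example with made-up weights and one missing metric.
print(overall_devops_score(
    {"availability": 80, "deployment_frequency": 100, "test_coverage": None},
    {"availability": 2.0, "deployment_frequency": 1.0, "test_coverage": 1.0},
))  # (2*80 + 1*100) / 3 ~= 86.7
```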

00:18:55

Work in progress is a count of items in the in-progress or review/QA stages of the workflow. A point-in-time snapshot is taken each week. "What completed" is a count of work items finished in the sprint: blue is planned items, green is tasks, red is defects, orange is unplanned work, and gray is other. The aging metric comes from Dominica DeGrandis's flow metrics talks: work in progress items with no updates in the last 10 days. If you haven't touched it in 10 days, why is it still there? A VP owns each platform, and we created an easy way for them to see how their squads were performing. Some companies can tell you how many deployments per day they can do. I don't know about all of IBM, but we can see that for our area: on average, we do about 200 deployments per day.
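A small sketch of the aging metric just described: from a weekly snapshot of work items, pick out the in-flight ones that haven't been updated in the last 10 days. The item fields and status names are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class WorkItem(NamedTuple):
    title: str
    status: str            # e.g. "in progress", "review/QA", "done"
    last_updated: datetime


def aging_work_in_progress(items: List[WorkItem],
                           now: datetime,
                           stale_days: int = 10) -> List[WorkItem]:
    """In-flight items with no updates in the last `stale_days` days."""
    cutoff = now - timedelta(days=stale_days)
    return [item for item in items
            if item.status in ("in progress", "review/QA")
            and item.last_updated < cutoff]
```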

00:19:58

Now, there are several metrics where we see the value of the metric, but we haven't invested the extra work needed to measure them. If you remember, as Craig mentioned earlier, we took a look at the lead time metric for our squads and found that it was not as long or as variable as we would have expected. What we found instead is that most stories that were opened were either added to a sprint plan within a couple of weeks or never implemented at all. Or to put it another way, stories and defects in the top 10 of a squad's backlog are normally implemented quickly, while the rest slowly accumulate until someone decides to clean up the backlog. So instead of measuring that, we would like to track development lead time: the time from when a story is committed to the development backlog, or that top 10, to the time when it's done and in production. Flow efficiency is another metric that we like from Dominica DeGrandis. It measures the amount of time the work is waiting for something, whether that's resources or deployment or people. We would love to be able to quantify this, but our squads' existing workflows weren't set up with wait states in them.

00:21:03

And we haven't been able to muster support for imposing a new workflow on our squads; our squads are protective of their workflows once they're happy with them. We also already collect squad health metrics, as described by Spotify on the website here, for each squad in our business group on a quarterly basis. These are valuable because of the discussions they provoke within a squad, and those discussions usually lead to positive change. We haven't pulled them into this dashboard, though, because we've been collecting the answers using spreadsheets, and they're not in a database that we could pull from. Craig is part of a volunteer team working on a squad health app, so once that work is done we could use APIs to get the data. IBM also collects employee engagement metrics on a regular basis. The screenshot on the right is from our employee engagement website, which provides guidance for managers and individuals on how to make use of the survey results to improve employee engagement. It would be nice to make those visible as happiness metrics on our dashboard, but we don't have API access to that data. Let's talk about grades, numbers, and colors. Numbers are objective, but without context they can be confusing. For example, 95% availability is a poor score indeed, but if you got a 95% on a test, it would be an A. Also, letter grades are more powerful than colors. People really hate seeing a D, but they might tolerate an orange tile. Our general manager actually asked us to remove the letter grades to soften the blow. We decided to make that a feature flag, showing only colors by default.

00:22:45

Every squad is different. Squads are autonomous and independent with different goals, so we intentionally made it difficult to compare squads using the dashboard. There's no easy way to see the overall score for each squad; you have to drill down into each section. Feedback is a gift. Some squads were upset with their grades, and they let us know. This led to conversations about good practices and the definition of what is good enough. That also prompted us to add the squad comments feature. We refused to set different thresholds for different squads. Everyone needs to be held to the same thresholds; metrics have to be consistent. That's the reason why we created the squad comments feature. We have many examples of where the metrics, and the conversations around them, changed business outcomes and behaviors. After a set of squads adopted SonarQube, they saw poor metrics for vulnerabilities and code smells, and they started to work to address those issues. Some of them went further and integrated SonarQube into their IDE to catch issues even sooner, shifting development left. One squad was upset with their PR creation-to-merge time. After discussing it with them, I suggested mob programming as an experiment, and they adopted it with great success. For nine months they have created high-quality services, and they became the happiest squad, as shown in the squad health surveys.

00:24:38

Some squads deployed every two weeks. That was a sign that they might be behind on their patching. After discussing how to do daily deployments, we helped them get their JavaScript repos doing that daily. They now pull in the latest versions of packages every day, and they use a tool called npm audit in the JavaScript world that tells you if your packages are vulnerable. We open sourced a script that helps you do that throughout your CI/CD pipeline.
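The open-sourced script itself isn't shown in the talk, but a minimal CI step along these lines could illustrate the idea: run npm audit in the pipeline and fail the build if it reports vulnerable packages. This sketch assumes npm is on the PATH and the job runs in the repository root; it is not the speakers' actual script.

```python
import subprocess
import sys

# Run npm audit in the current repository and surface its report.
# npm audit exits with a non-zero status when vulnerabilities are found.
result = subprocess.run(["npm", "audit"], capture_output=True, text=True)
print(result.stdout)

if result.returncode != 0:
    print("npm audit found vulnerable packages; failing the pipeline.")
    sys.exit(1)
```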

00:25:16

Our own squad used the dashboard every day in our daily standup meeting, so you could consider us expert users. We saw correlations between high work in progress and things not being completed, so we reduced our work in progress; as a result, basically, we committed to fewer stories each sprint. The pull requests list also highlights work in progress. It makes blocked or stuck work in progress more visible and helps us reduce that and deliver value faster. We added deployment stability when our developers were complaining about spending too much time fixing broken builds. With data and red tiles to back that up, our team agreed to invest more time, a few weeks, in just improving the builds. And deployment frequency showed us where something hadn't been deployed in a while; if an app hasn't been updated in months, it probably needs security patches. We would review the pull request list at the end of each daily standup meeting. As a result, pull requests were no longer lost or forgotten because nobody thought to check one of the dozens of repos our squad owned. It's especially helpful when we have bots creating automated pull requests with security patches.

00:26:33

This has been a whirlwind tour, but hopefully you've learned why you should care about DevOps metrics, and how you can tailor metrics to your own organization and incentivize the behaviors you care about. Thank you so much. Feel free to reach out if you'd like to discuss our work in more detail. Craig, any final thoughts?

00:26:57

Yes, thanks, Ann Marie. I would like to invite anyone to discuss these metrics with us. It's been challenging to create these and influence squads to adopt them, but we've seen some great benefits. Let us know if you try it. Thanks.