Measuring for DevOps Success (US 2021)

When introducing DevOps for one or two teams, there was no need to provide evidence of its effectiveness or to implement organizational measures for improving our DevOps approach. Since rolling out DevOps to our entire IT organization, both have changed: we need to make sure that we are on the right track in developing our organization, and we need to demonstrate this. Drawing on the metrics of the State of DevOps Report 2019, we introduced KPIs without requiring much additional tooling. Working in an environment that is not used to making KPIs visible, it took some time to get accustomed to them. I will show how we did this and how we are now using our metrics to improve our processes. The question, however, is whether process performance really matters. I will dig into this and correlate it with result orientation.

us, breakout, las vegas, vegas2021

Stephan Stapel

Head of Development, Hermes Germany GmbH



Hello. My name is Stephan. I'm from Germany and I work for a company called Hermes. At Hermes, we introduced DevOps about four years ago, and we are quite happy with the path we took. Today I'd like to share with you some insights from our journey. Part of our journey was to measure the success of DevOps. I'd like to start with a short introduction and then discuss what success means in our environment. After that I'd like to introduce the metrics we are using and conclude with some takeaways, hoping to inspire you for your own journey.


Let's dive into the introduction. Hermes Group is the largest post-independent parcel company in Europe. We have subsidiaries in multiple countries, and Hermes Germany alone delivers about 500 million parcels per year. Besides private customers, our typical clients are medium and large e-retailers. In this environment, we have a market that is growing by five to 10% per year. That means that we need to cope with an ever-increasing number of parcels. This in turn means that we need to automate everything possible, and that is key to our business. Our customers and business clients expect digital innovations, whether to be happier customers or to foster their own businesses.


Having worked with Hermes for 10 years now, I can clearly say there's no business without technology. And in this environment, the question is: what is success? The first factor clearly is to be able to focus on bringing value. That sounds simple and obvious, but those of you working in larger enterprises with lots of different interests competing with each other know what I'm talking about. Another factor is the ability to provide value faster, to remove technical and organizational burdens, to have everything in place that we need to provide value. And another success factor is, as a tech organization, to be a reliable partner in the company, to be someone to trust.


I have an example I want to introduce to you. We had two projects on the diversion of parcels, that is, redirecting parcels if you're not at home: to your garage, to a neighbor, or to the parcel shop. Those two projects had similar sizes, similar complexity, similar stakeholders, and even a similar topic to deal with. The first project took place seven years ago in the old working system. We had Scrum by then, but I would probably not call it agile, at least not today, and we had no idea of pipelining or of automation. This project took us nine months, if not longer.


The second project, which we conducted last year with the current working system, with a good understanding of automation, of delivery pipelines, and of fast flow of work with feedback mechanisms, took us four months. We probably cannot speed it up much further, at least not by that degree. So if we spoke again in seven years, it probably wouldn't be two months. But what you can see is that the effort to enhance the working system pays off really quickly. It's important to know, though, that enhancing the working system is not only about speed. That's not only because value is more important than speed, but also because delivering tech products is no sprint that is finished after 100 meters. It's rather an endless marathon, running kilometer after kilometer after kilometer.


This is why I like this agile principle, which was written more than 20 years ago. It says: keep a constant pace indefinitely. That means listen to yourself, listen to your organization, find out how fast you can work, but don't work faster. Find a pace you can keep, one you can best work at. By aiming at that, we found two problems in our organization that we wanted to work on. First of all, we didn't know how long a larger piece of software, a larger piece of work, would take. So can we improve our estimates? Can we generate more reliable estimates? Can we improve ourselves? The second question is: we put lots of effort into introducing continuous delivery, so can we prove that it really paid off?


With these two questions, you have to understand that DevOps itself moved from a grassroots movement to a general direction for our tech organization. We are frequently asked by our top management whether all the effort is really worth it, whether it really needs this modern way of working. So we regularly need to prove that we are on the right track, and we sometimes even need to defend it. We then decided to shine a light on the system of work to better understand what is going on. It was really good to make the situation transparent and to share this transparency with everyone in the organization, because that helped us to feel the pain together. So we shone the light on the work system.


We took a look at these four key metrics, and I like these four key metrics as they were introduced in the State of DevOps Report and the Accelerate book from Nicole Forsgren. We now make use of most of them in a way that is achievable for us. So let's dive right in: measuring lead time. To understand what we are measuring here, you first have to understand how we're working in our organization. The way we coordinate the work is based on the Flight Levels model from Klaus Leopold. This model basically consists of Kanban boards on three levels of management, three levels of abstraction, and each of these levels is aligned with the others.


At the top we have the strategic level, used to manage large company-wide initiatives, making sure that there's strategic fit and that there are valid business cases. On the second level, the coordination level, we align teams if they need to work together, for example on a particular feature. Level one, the team level, is where each team plans and manages its work, for example using Scrum or Kanban or whatever might be appropriate for a particular team. We decided as a first step that we want to measure the lead time on the coordination level, measuring the time it takes to work on a particular feature.


The question now is: what is a feature? Sometimes features are referred to as epics. For us, the effort of such a feature should be two to three months. Not because we are calculating based on time, but because we believe that we want to coordinate something which really has an impact, for example happier customers or higher profit. It might also be an experiment where we are aiming for learning. Some real-life examples are the introduction of electronic payment to our customer website, the implementation of a new newsletter, or adding certain countries that our customers can send their parcels to.


These features consist of a number of stories, eventually implemented by multiple teams, and those stories are then managed on the level-one board. Our goal is that all features on the coordination level should have similar effort, similar size. This allows for better estimates and eases coordination and prioritization, because you can compare the features with each other, at least in effort or size. The effort that I mentioned is typically called lead time, which is the time from starting to work on a feature until this feature is available to the user, in our case the customer. It might or might not involve multiple teams. In this case, we have team A on the top and team B on the bottom, both involved in this particular feature and this process.


If we have multiple teams, the goal of the coordination board is not pitting teams against each other. Instead, we observe the collaboration and help the teams to align and collaborate to get the feature done. The organization I'm responsible for consists of 14 such product teams with approximately 100 people in total. In this organization, we currently deliver 100 to 120 such features per year. What we want to avoid is a U-shaped statistic with lots of features finishing quickly, lots of features taking literally forever, and just a few features of the desired size. We are aiming for quite the opposite type of curve.


And this is what the lead times currently look like. What you can see here is the lead time medians of the features calculated per month, along with the all-time median of 78 days. What I did here is smooth the values, at least to some degree, to even out some outliers. I took the all-time median of 78 days, which is approximately two months, and then used a two-month rolling median to calculate the median for each month, for each bar which you see here. We still have some varying lead times, so there is clearly room for improvement. We might even someday aim to lower the lead times a bit, but that's not the goal for now. Shortening the feature-level lead time is not a good goal, because it would not improve the work system at all: people would just start to cut the features in half. It's just a technical measure. The features would be finished quicker, goal achieved, but nothing improved. So the goal for now is to be more consistent.
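The smoothing described here can be sketched in a few lines of Python. This is only an illustration under assumptions, not the tooling used at Hermes: the sample dates are invented, each feature is assigned to the month in which it was released, and the "two-month rolling median" is read as the median over all features released in a month and in the month before it.

```python
from datetime import date
from statistics import median

# Invented sample data: (work started, available to the customer) per feature.
FEATURES = [
    (date(2021, 1, 1), date(2021, 3, 1)),
    (date(2021, 1, 1), date(2021, 3, 31)),
    (date(2021, 2, 1), date(2021, 4, 20)),
    (date(2021, 3, 1), date(2021, 4, 10)),
]


def lead_time_days(started, released):
    """Lead time: days from starting work until the feature reaches the user."""
    return (released - started).days


def previous_month(key):
    """Return the (year, month) key of the month before `key`."""
    year, month = key
    return (year - 1, 12) if month == 1 else (year, month - 1)


def rolling_median_lead_times(features, window=2):
    """Median lead time per release month, pooled over `window` months."""
    by_month = {}
    for started, released in features:
        key = (released.year, released.month)
        by_month.setdefault(key, []).append(lead_time_days(started, released))

    result = {}
    for key in sorted(by_month):
        pooled, current = [], key
        # Pool the current month with the preceding window - 1 months.
        for _ in range(window):
            pooled.extend(by_month.get(current, []))
            current = previous_month(current)
        result[key] = median(pooled)
    return result


if __name__ == "__main__":
    for (year, month), value in rolling_median_lead_times(FEATURES).items():
        print(f"{year}-{month:02d}: {value} days")
```

Pooling two months gives each bar more data points, which dampens single outliers without hiding a sustained shift such as the one that follows a change of topics.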


One side note: if you take a closer look, you see two sections in this diagram, and the explanation is quite simple. Until April of this year, we worked on normal business features to make our customers happier. Then in May our cloud migration project kicked off. So we had completely different topics that we needed to start working on, with little expertise on these topics, and new risk was introduced. What we saw then is that the lead times jumped up, and I am now eager to see what happens when the migration is finished in October of this year, whether we will return to the old level of lead time. Next, we are measuring deployment frequency and failure rate.


Let's take a look at the information we needed to calculate those. During an earlier ITIL implementation, a change advisory board (CAB) was created, and when introducing continuous delivery, it was like putting a horse in front of a racing car. A CAB probably never works in companies with lots of software change going on. We changed the game for some time, but when we streamlined our processes during the introduction of continuous delivery, we needed to wipe out the CAB, and we are happy that we succeeded with that. But even though we wiped out the CAB, we still need to document the changes to give transparency about what is happening.


So what we did is automate the change documentation within our deployment pipelines. That gave us good acceptance by the teams, so everyone added it to their deployment pipelines. As a result, we have a comprehensive database of changes, and this database is now used to generate the metrics. Measuring the deployment frequency is quite simple: from the change database, we just count the number of deployments per time period and per solution. Easy. What is really important, before showing you some numbers: we are not comparing teams and we are not comparing their performance. It should never be a race for a number of deployments. But what we can do is compare the trends and work with the teams to find a good frequency that fits their skills, the topics they are working on, and also the maturity of their solution.
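Counting deployments per time period and per solution from such a change database could look roughly like the sketch below. The record layout, the field names `solution` and `deployed_at`, and the solution names are all hypothetical; the talk does not describe the actual schema. The time period is assumed to be a calendar month.

```python
from collections import Counter
from datetime import datetime

# Invented change records, standing in for rows of the change database.
CHANGES = [
    {"solution": "label-api", "deployed_at": datetime(2020, 11, 3, 10, 0)},
    {"solution": "label-api", "deployed_at": datetime(2020, 11, 17, 14, 30)},
    {"solution": "label-api", "deployed_at": datetime(2020, 12, 1, 9, 15)},
    {"solution": "customer-portal", "deployed_at": datetime(2020, 11, 5, 16, 45)},
]


def deployment_frequency(changes):
    """Count deployments per (solution, year, month)."""
    counts = Counter()
    for change in changes:
        deployed_at = change["deployed_at"]
        counts[(change["solution"], deployed_at.year, deployed_at.month)] += 1
    return counts


if __name__ == "__main__":
    for (solution, year, month), n in sorted(deployment_frequency(CHANGES).items()):
        print(f"{solution} {year}-{month:02d}: {n} deployments")
```

Keeping the solution in the key makes it natural to look at one team's trend over time rather than ranking teams against each other.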


Deployment frequency directly says something about the health of our continuous delivery pipelines. Those automated pipelines make the delivery process safe for everyone to use, so there's no anxiety about deploying to production. That is what results from Jez Humble's quote, which I really like: if it hurts, do it more frequently and bring the pain forward. Secondly, deployment frequency is a proxy metric for batch size. The higher the deployment frequency, the smaller the batch size must be, and a smaller batch size in the context of software delivery means that we have better control of what is going into production. That in turn means that we have better control of quality and risks.


Here's an example I brought for you: the deployment frequency of one of our teams. This team is responsible for the API that generates labels for all of our private customers. You see the months along the x-axis and the number of deployments displayed as bars. I'd like to share two observations with you. First, at least from the context I have at Hermes, we are traditionally very cautious with deployments during peak, which is November and December of each year. As you can see here, we just rewrote this rule: we continued working and bringing stuff into production with proper pipelines, with proper automated tests, and we saw no service degradation. To get back to Jez Humble: we brought the pain forward.


On the other hand, what you can see here is what results from going on holiday during the Christmas season, with five deployments in January. What am I learning? My observation is that a team with at least five deployments per month, so about one deployment per week, is generally doing fine. If a team has fewer deployments than once per week, it's a good chance to get into discussion with each other, to find out if we can do anything to help them, for example if the application or the pipeline needs some re-engineering to ease deployment. What we also found is that solutions that get more mature have lower deployment rates. That is because the teams can focus on outcome, on value, instead of delivering a pure amount of features.


Also, taking these measures is good for bringing teams together, to let them inspire each other and help each other get better pipelines, enhance them, and find cooler solutions. From a manager's perspective, it is a good tool to better understand the teams and offer them support. Second metric: failure rate. In contrast to the original metrics, I believe that there is no such thing as a change failure rate, because with automated pipelines, deployments will almost always work. The boundary condition, of course, is that we have comprehensive pipelines containing automated tests, code analysis, and the like. Obviously, despite all the effort we take, we sometimes send bugs into production. This remains true even with continuous delivery.


So we asked ourselves: how can we get transparency about such events from the data that we have? We decided to start with a statistical approach, because a bug in the context of DevOps is usually detected by monitoring data or by feedback from users or customers. Such feedback usually comes in quite quickly if you have internal users. When developing software for customers, for the public, such feedback comes in much slower, so in this case telemetry is king. With these mechanisms in place, we can assume that we quickly find such bugs. And because the change was small, we can assume that we are generally able to fix it quickly, at least if such bugs occur during office hours.


So we looked at the change data for a second deployment coming in quickly after the initial deployment. By looking at the data and discussing with people, I found that the typical fix is delivered within three hours after bringing out the initial faulty deployment. Using this statistical approach, we found the following: you see again the same bars as before, and you see that the team had one fix-forward in November and a second fix-forward in March. The precondition is, of course, that quality checks prevent the deployment of bad code, and those checks need to be baked into the deployment pipeline. All in all, we found that the fix-forward rate across all my 14 teams is less than 1%. For me, that is good proof of the benefits of continuous delivery.
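The heuristic described above, treating a deployment that follows the previous one within three hours as a fix-forward, can be sketched as follows. This is a minimal illustration with invented timestamps for a single solution. It also assumes that the rate is the number of fix-forwards divided by all deployments, which the talk does not spell out.

```python
from datetime import datetime, timedelta

# Three-hour threshold mentioned in the talk; meant to be observed and adapted.
FIX_WINDOW = timedelta(hours=3)

# Invented deployment timestamps of one solution.
DEPLOYS = [
    datetime(2020, 11, 3, 10, 0),
    datetime(2020, 11, 3, 11, 30),  # follows the previous one within 3 hours
    datetime(2020, 11, 10, 9, 0),
    datetime(2020, 11, 17, 14, 0),
]


def count_fix_forwards(deploy_times, window=FIX_WINDOW):
    """Count deployments that follow their predecessor within `window`."""
    ordered = sorted(deploy_times)
    return sum(
        1
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier <= window
    )


def fix_forward_rate(deploy_times, window=FIX_WINDOW):
    """Share of deployments classified as fix-forwards (assumed definition)."""
    if not deploy_times:
        return 0.0
    return count_fix_forwards(deploy_times, window) / len(deploy_times)


if __name__ == "__main__":
    print(count_fix_forwards(DEPLOYS), fix_forward_rate(DEPLOYS))
```

A purely time-based rule like this will also count two unrelated deployments that happen to land close together, which is one reason the threshold needs to be verified with the teams.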


What are my observations? We were glad that we didn't have to introduce an additional measurement, which allowed us to keep the effort small. Of course, this approach certainly is a compromise, and that compromise comes with some additional work: we need to get into discussion with the teams to verify the approach, and we need to observe and adapt the three-hour threshold. The questions we had are answered. We can prove that the quality mechanisms are working, and we can prove that continuous delivery does not yield instability. In the future, it would be interesting to learn about the detection times of bugs, the reaction times, and the times to fix, so to cut the metric into smaller parts.


So there's room to improve. Let me now conclude with some takeaways. We've discussed some numbers, some statistics, but the question is: how happy are the people? We are not measuring happiness yet. Instead, we are getting into conversation with each other and regularly getting feedback on the measures we are taking. I'd like to share with you two quotes from two colleagues who have also been with us for more than 10 years now. The first quote is from Kirsten, who is the product owner for myHermes, our private customer portal. She said: "We are faster." Good, fair enough. "But even more important, the quality improved tremendously. We bring small changes into production and are monitoring them with the entire team." The entire team looks at what's going on in production.


The second quote is from Michael. Michael is a QA expert in one of our API teams. He said: "Formerly, I could easily make dev teams sweat by imagining weird test cases. Nowadays, everything is so transparent and streamlined in pipelines that I, as a QA engineer, can barely break anything at all." I'd like to conclude that implementing these measures was really worth it for us, and I think they are worth it for everybody. We are now able to work on consistent lead times, and we can now make sure that we have small batch sizes, introducing little risk to production, if any risk at all.


But to be honest, with most successes comes some vulnerability: we made one big mistake. We did not collect the information before we started our DevOps transformation, and you can only prove your success if you know how bad it was when you started. So I'd like to encourage you to do it better. If you have not yet started with the DevOps approach, with continuous delivery, please collect the before state. Please collect how bad, or maybe even how good, it is now in your environment. What I can say is that we now have information on where we can improve. We can now improve from here, at least.


And besides all the technical discussions and debates about continuous delivery, in my leadership role these analyses now allow me to connect better with my teams. They allow me to better offer support where it might be necessary. We can now connect the teams better. We can foster collaboration between the teams and encourage them to share their learnings from their own continuous delivery journey. With this, I'd like to conclude. If you have some feedback or want to get into discussion, this is my contact information. I'd be more than happy if you want to connect. Thanks a lot for listening.