Avoiding Goodhart’s Law - Use SLOs as Tools, Not Cudgels

The concepts of SLI, SLO and error budget exist to balance risk (rates of change) and reward (business contentment). Using such metrics as red lines to punish teams, or to force the business to accept risk, misses the point. My experience with SLAs in service contracts for hospitals informs this conversation: SLIs, SLOs and error budgets work better as a basis for conversations about the stress an application can withstand, and about the three dimensions the measures should cover. This session takes Goodhart’s law from economic policy as a frame for reconsidering SLIs and SLOs, and offers a few hints for approaching the negotiation meetings. Leave this session inspired to approach your SLO negotiations in the best possible way.

Marco Coulter

Technical Evangelist, Tech-Whisperer

Transcript

00:00:13

G'day, and welcome to Avoiding Goodhart's Law: Using SLOs as Tools, Not Cudgels. The concepts of SLIs, SLOs, and error budgets are there to balance risk and reward: risk around the acceptable rate of change, and reward being business success and customer contentment. Using such metrics to punish teams for exceeding budgets, or to force acceptance of change within the business, is a path to failure. This session is going to give you a few hints for success, and I'd like to thank my hosts at DevOps Enterprise Summit for giving me the chance to share some knowledge in this way. First, let me introduce myself. G'day, I'm Marco. I am an ex-CTO who has worked for one of the top 50 international banks. I've supported data centers for hospitals and service providers, and worked for some of the industry's largest vendors. I've lived in three countries and managed teams across 13 countries.

00:01:06

I also spent five years as an industry analyst, running the data science team at 451 Research. Seeing technology from every side, as an operator, developer, analyst, vendor, buyer, and CTO, gives me a unique view on technology. You can read some of my writing or interviews in the publications on the left here, or on my own personal website, tech-whisperer.com. So, enough about me. It's good to have targets, right? Think of Robin Hood, the story where he places the child against the tree, and then he loads up his arrow and he aims. And then we're told an apple was a target on the child's head. Well, with that, the story becomes a story of a skilled archer. But without that target apple, well, it's just the story of a dangerous guy shooting arrows at children. So it's good to have targets, as long as you use them correctly.

00:01:54

Now, today's session is going to come in three chapters. I'll talk about how I experienced Goodhart's law before I even knew it existed. Then we will think about SLIs in a better way, across dimensions. And finally, I will throw in a few hints about negotiating the SLAs and give you some further reading. So let's get going. I'll get to Goodhart's law in a moment; I want to share an experience with you first. Now, depending on your personality, you will either pause, relax, and enjoy my story, or you might be already searching online for Goodhart on Wikipedia, and that's gaming the system. Some folks define gaming the system as a smart play. Others equate the phrase to cheating. I guess I'm more in the second group. For me, gaming the system means taking the rules that are created to protect the system.

00:02:37

And instead manipulating the system towards a desired outcome or goal. Here's an example. In a prior life back in Australia, I worked for a service provider that supported all of the hospitals in the state. Now, in hospitals, nurses take lab samples and they get sent to the labs. They are processed, and the results get transmitted back to the patient record, where the nurse back in the ward can immediately look them up. Pretty simple, right? By the way, the wards are often on the 16th floor somewhere while the labs are in the basement of another building on campus, so it can be quite some distance between them. And technically, it looked a little like this: messages from the labs' Unix systems would be sent to message queues, and the queues then fed the lab updates into the mainframe system that held all the patient records.

00:03:28

Now, everything allegedly spoke a common HL7 standard, so there's never going to be any problems, right? Some of you already see where this is going. Different vendors had slightly different interpretations of the HL7 standard, and malformed messages would get stuck in the queue. We would get phone calls from hospitals that they had to go to manual procedures. Now, the backup procedure, if the message queues got stuck or if the data didn't appear in the patient record, was for the nurses to physically run from the ward down to the labs to get results. This was not optimal, as the patient's health was at risk both from the delay and from the nurse being absent from the ward. Now, being techies, we thought we would take care of the situation by setting an SLA that said, if the message queues get higher than a hundred, we, the service provider, had to refund money back.

00:04:20

That sort of addresses things, right? I coded a monitoring bash script (you know, it didn't have any internal monitors), so that when the queue length approached a hundred, alerts would start to go off and monitor icons would turn from green to yellow to red. And as technicians, we focused on that measure as the target, as the goal. You know, we even built capacity plans around making sure that the queue processing system got all the power it needed. And you might think: great result, Marco, that's top-notch, the lab results are getting to the ward on time, right? Well, the only problem was we would still get those pesky phone calls from the nurses in the ward saying the system sucked. And nurses don't bite their tongue when they're telling you you're not helping them. They were always having to run down and manually collect results.

00:05:02

And we looked, and the message queues were empty. Now the problem was that transactions were silently timing out before hitting the message queue. We were managing the capacity plan, in fact the whole application, to the metric and not the outcome. Now, years later, I learned that this procedure had a name, and yes, we're finally going to get to Goodhart. Now, Goodhart was an economist in the UK who in 1975 stated, let me read this: "Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes." Now, I know that's kind of wordy. He was English, he was a politician; wordy is what you get. Here's what he meant. Basically, the law says that when a measure becomes a target, it ceases to be a good measure, because people will game the system. Queue length was a good measure of queue function.

00:05:54

We were managing to queue length as a target for success, instead of successful laboratory transactions in patient records, which is what the nurses needed. So how do we avoid Goodhart's law? Well, we need to let SLIs be measures and SLOs be goals. So what should we measure? The key is to stand in other people's shoes, to see everything from a few different angles. So, hence, three dimensions. Remember, DevOps is about balancing the risk of availability against rapid innovation and efficient operation. We embrace that risk by giving it a value through well-defined and governed service level indicators, objectives, and agreements. As we're identifying SLIs, we need measures that see the whole picture in the three key dimensions of code, infrastructure and customer experience. So let me step through a simple example here, based on that hospital environment. The example is not going to give you specific SLOs that you can apply in your environment.

00:06:49

It's not meant to; it's intended to share the thought process. Now, first, let's quickly review and agree the SLI, SLO, SLA model. SLIs are numbers, and they work better when they're percentiles, so avoid averages. In mature environments, SLIs will be nested: they will combine SLIs that sit up against the code and the technology, up to SLIs that sit next to the customer. They are a defined quantitative measure of a metric. You're then going to set limits on the SLI, say an upper, or maybe an upper and a lower, and that's going to give you the SLO, the objective. Now, SLOs should capture the performance and availability levels that, if barely met, would keep your typical customer happy. That's generally a target (failures will be less than X) or a range (responses will be between X and Y). Generally, I like to translate the SLOs into periodic budgets so that I can track, you know, weekly. Whether weekly works for you will depend on your release cycles and so on.
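
A minimal sketch of the two ideas in this passage: why a percentile says more than an average, and how an objective turns into a periodic budget. The sample response times, the 99% target, and the function names `percentile`, `error_budget` and `budget_remaining` are all hypothetical; the talk does not give these numbers.

```python
import statistics
from fractions import Fraction


def percentile(values, pct):
    """Nearest-rank percentile for a whole-number pct (1-100)."""
    ordered = sorted(values)
    # Integer ceiling of pct% of the sample count, avoiding float rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def error_budget(slo_target, total_requests):
    """Requests allowed to miss the objective in one period (for example a week)."""
    allowed_fraction = 1 - Fraction(str(slo_target))
    return int(allowed_fraction * total_requests)


def budget_remaining(budget, failures_so_far):
    """How much of the period's budget is left; negative means overspent."""
    return budget - failures_so_far


# Hypothetical response times in seconds: most are fast, a few are very slow.
response_times_s = [2] * 95 + [120] * 5

print("mean:", statistics.mean(response_times_s))
print("p50:", percentile(response_times_s, 50))
print("p99:", percentile(response_times_s, 99))

weekly_budget = error_budget(0.99, 10_000)
print("weekly budget:", weekly_budget)
print("left after 30 failures:", budget_remaining(weekly_budget, 30))
```

The mean blends the slow tail into a single unremarkable number, while the median and the 99th percentile show both the typical case and the outliers that users actually notice.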

00:07:47

Then you hit the SLAs. They define the actions that are acceptable once the budgets are used up, and defining these ahead of time is critical. One additional thing: it's not just about full outages these days, so you need to focus on slowdowns as well. Rather than traditional uptime availability, try to focus on the customer domain and their experience. Use successful customer requests, instead of technology, to capture the overall environment. The SLIs supporting the CX ones should cover each of the following three dimensions. First, let's start with code. From our code, we want functional code that does not fail, and also feature additions and write-downs of technical debt. You're going to be dealing in this dimension with multiple languages, so you want to be sure that the metrics will work across them generically. Where you can, avoid metrics that only apply to a specific language.

00:08:41

It adds too much extra handling. Also, it's not going to be limited to applications you built in-house. So sometimes you're going to have to deal with things like SAP and the ABAP language, or a SaaS application like Salesforce. Maybe what you're managing is environmental, like a Kubernetes environment or something, and then you're not just watching for code transaction errors; you're looking at configurations as well. So some of the nested SLIs around code might end up being YAML code accuracy, or your OpenAPI definitions. What you want to avoid here is silos of data. You don't want the business team working off one source of metrics while the development team works off a different source. You need a single source of truth there as part of deciding what you base your SLIs on. I'll give you the examples; we'll show you that now. For the SLI here, for the sample indicator, let's embed a few clarifications.

00:09:32

Our first step would be to focus on well-formed updates, and that specifies the transaction. We want to update the patient record and we want to acknowledge completion, and that specifies the reaction. And then, in this case, we want to measure it somehow; we agreed on APM as the source. I'm familiar with AppDynamics, but you could use Datadog or New Relic. And in fact, because we're only concerned about code here, you might prefer some of the observability offerings like Honeycomb or Instana. Now, some books recommend good-over-bad ratios for SLIs. If you can control all the code, that can work. But in this example, I am avoiding the fail ratio, as too many elements are out of our control. The HL7 update transactions were coming out of purchased lab software. We had no way of fixing that code; we had to wait for patches. The queuing systems were third-party software.

00:10:21

We couldn't tweak them to respond better to malformed event entries; we had to wait for patches. So we couldn't be certain we'd reach a point where the HL7 outputs were always well-formed coming into our code. So that basic code SLI needs to be focused on well-formed updates getting to the code that we were writing and controlling. Now, for the code SLO, the objective, we apply a goal to that indicator. We're already assuming well-formed records, so we can set this fairly high. Again, we've been clear about the transaction, the reaction and the source. And it's easy to set this goal too high. You know, people will generally say we want a hundred percent, but that means there's no experimentation, there's no innovation. It needs instead to be set just over the level where you will keep customers happy. Too high, and you're incurring opportunity costs.

00:11:13

And for the SLA here, we apply an outcome. Note that our SLA goal allows some wiggle room against the SLO. We added a time range here: it must be met over a sliding range, in this case of 28 days. The SLA should specify what happens when the SLA is missed. Does one department or the other pay a refund, or does the software release cycle get automatically frozen for the next 28 days to return to stability? See, the SLA part is the part that's negotiated. In a perfect world, this is defined by the business or customer, but in reality, it's a conversation. Normally I wouldn't put technical phrases like "well-formed HL7" in the SLA; it would be a customer outcome. But I'll come to that later in the session. So, okay, you have some SLIs and SLOs around code, and hopefully that gives developers a sense of balance.
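
A minimal sketch of an SLA checked over a sliding 28-day range, with a release freeze as the agreed action on a miss. The 99.5% target, the daily counts, and the names `SlidingWindowSLI` and `release_action` are hypothetical; only the 28-day window and the freeze-or-refund choice come from the talk.

```python
from collections import deque


class SlidingWindowSLI:
    """Keeps per-day (good, total) counts for the most recent window_days days."""

    def __init__(self, window_days=28):
        # A bounded deque drops the oldest day automatically as new days arrive.
        self.days = deque(maxlen=window_days)

    def record_day(self, good, total):
        self.days.append((good, total))

    def ratio(self):
        """Share of well-formed updates completed across the whole window."""
        good = sum(g for g, _ in self.days)
        total = sum(t for _, t in self.days)
        return good / total if total else 1.0


def release_action(ratio, sla_target):
    """The action agreed ahead of time for a missed SLA (here, a freeze)."""
    return "freeze releases" if ratio < sla_target else "continue releases"


sli = SlidingWindowSLI(window_days=28)
for _ in range(27):
    sli.record_day(good=1000, total=1000)
sli.record_day(good=800, total=1000)  # one bad day inside the window

print(release_action(sli.ratio(), sla_target=0.995))
```

Because the window slides, one bad day keeps counting against the agreement until 28 further days have passed, which is what makes a 28-day freeze a natural way to return to stability.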

00:12:02

Something to measure: opportunities to add features, innovate, and clear technical debt, against business impact. Now, code, though, runs on infrastructure, and of course that can have customer experience impacts as well. You have availability concerns about how to support updates on the infrastructure. You know, you need time to update operating systems, or move to different clouds or different network providers, or maybe it's just adding new locations to better support remote customers. Those risks to availability and performance, well, they need to be balanced as well. So infrastructure is the second dimension. And as we add it, things are going to get more complicated. You'll be dealing with a full stack, and often multiple stacks in multiple locations. In our hospitals, we had pretty much everything, from Windows clients to Unix lab systems, to different Unix queue systems and an IBM mainframe patient record system.

00:12:55

And this was all scattered across a state that's physically one third the size of mainland USA. So it was not close to any cloud providers, and networks mattered. Now, some of the nested SLIs around infrastructure might be inherited from your other providers: from your cloud, your network, your service providers. And like with code, you want to avoid silos of data. It's best to have a single source of SLO truth for infrastructure, if you can. What you're looking for is the infrastructure's ability to support load and deliver predictable latency. Now, we need to include the impact of all of these infrastructure components in the SLI. We look at the total transaction time as a way of doing that. That would certainly have nested SLIs for each piece of the puzzle. So maybe you'd have an SLI for the lab's update leaving the lab hardware, and for the message being added to and leaving the queue.

00:13:44

And an SLI for the lab's update arriving at the mainframe, and then an SLI for traversing the networks, an SLI for adding it to the patient record, or maybe even an SLI for the database inserts on the patient record system. Now, even with that little list, I'm getting a little crazy. You can get too crazy with SLIs. So define SLIs either at system boundaries or at team boundaries. The strength of system boundaries is that they're less likely to change, but they might be too detailed on the technology level. The strength of team boundaries is that people can sort of self-identify with the measurement and work towards it. So for SLOs on infrastructure, you may want to express the SLO in the shape of a performance curve. Here we expect the bulk to occur normally within 30 seconds. This was well within the capabilities of our infrastructure, but some may take longer, you know, when there's a high system load or something like that.

00:14:36

And you will see that we included a long tail in here of five minutes at the top of the curve. As we moved to the SLA and negotiated with the customer, we see a big jump. You see that we are only committing to the five-minute time. Now, this actually came from their side after we had conversations with the ward nurses. What we did was, you know, we realized after the first debacle that we weren't understanding their needs, so we went out to the wards and we watched them and talked to them. And their view was different. They realized it takes time for the samples to get from the ward and be delivered to the labs. And so for them, the timeframe was about beating the time it took a nurse to run from the ward to the lab when the system was down. The results had better come back sooner than that.
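
A minimal sketch of an SLO expressed as a shape rather than a single number: the bulk of transactions within 30 seconds, and a long tail bounded at five minutes. The two time limits come from the talk; the 90% "bulk" share, the all-within-five-minutes tail target, the sample durations, and the function names are hypothetical.

```python
def profile_report(durations_s, bulk_limit_s=30, tail_limit_s=300):
    """Share of end-to-end transactions inside the bulk and tail limits."""
    if not durations_s:
        raise ValueError("need at least one measured transaction")
    n = len(durations_s)
    within_bulk = sum(1 for d in durations_s if d <= bulk_limit_s)
    within_tail = sum(1 for d in durations_s if d <= tail_limit_s)
    return {"bulk_fraction": within_bulk / n, "tail_fraction": within_tail / n}


def meets_profile(report, bulk_target=0.90, tail_target=1.0):
    """True when both the bulk share and the long-tail share hit their targets."""
    return (report["bulk_fraction"] >= bulk_target
            and report["tail_fraction"] >= tail_target)


# Hypothetical lab-result delivery times in seconds, measured ward to record.
durations_s = [12] * 18 + [45, 240]

report = profile_report(durations_s)
print(report, meets_profile(report))
```

The SLA in the talk commits only to the five-minute figure, so in this sketch it would correspond to checking `tail_fraction` alone, leaving the 30-second bulk as the team's internal objective.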

00:15:16

So a doctor wouldn't dispatch them. That run takes about 10 minutes in this environment, so when we offered them five minutes, they were really happy with that. So it doesn't always have to be as fast as possible, just as fast as necessary. Now, of course, when I worked for banks on stock trading systems, that was a whole different world. The processing time was a competitive differentiator for the traders, so "fast as possible, and never mind the cost" was the approach. Very different. The dimensions here of code and infrastructure, though, that's not the full picture. So I saved the best for last. Our third dimension is the business or, if you're a nonprofit or government, the customer experience. This is about the revenue or service production capabilities of the application. As we add in the business dimension, it can be difficult to measure the full experience.

00:16:03

So you might have to get out to the customer interface, and that might require some browser integration or some mobile platform agents. And to track availability and predictability of response times, you may even want to look at synthetic testing tools that run synthetic transactions. I'm going to keep it a little simpler in our hospital example. So here in this example, again, with the SLI, we're looking at the doctor and nurse experience. Now, we had built and owned the patient record application, so we knew we could add our own specific measure in there. And from hanging out with the nurses in the wards, we worked out that the nurses had an instinctive expectation of when the labs would be coming back, and that's when they'd start looking at the record. Now, if the update was not there, they'd do something else and come back in a few minutes.

00:16:46

And we worked out that repeated record lookups were our sign that we weren't meeting their instinctive expectations, and soon they would be calling us to complain. So we coded a repeat counter into the patient record application, on the refresh button. Now, as you look at the numbers here, why have we said "beyond 10 seconds"? You know, we had five minutes, of course, but why beyond 10 seconds? Well, at first we just had five minutes, but we kept missing the target because there were just one or two nurses in one or two of our hospitals who wouldn't wait at all. They'd just sit there hitting the refresh button again and again and again. So we stuck that "beyond 10 seconds" line in there to avoid those impatient ones. Now, if we look at the SLO here, you know, we're seeking a low number. You might have expected a tiny percentage, but the SLO now, because we're looking at the whole thing, includes all those malformed transactions coming out of the crappy lab system.
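
A minimal sketch of the repeat-lookup indicator, under one plausible reading of the slide: a lookup counts as a repeat when the same person reopens the same record more than 10 seconds but no more than five minutes after their previous lookup. The event layout `(user, record, timestamp_seconds)` and the function name `count_repeat_lookups` are hypothetical; the real counter lived inside the patient record application.

```python
def count_repeat_lookups(events, min_gap_s=10, max_gap_s=300):
    """Count repeat lookups that suggest an expected lab result was missing.

    events: iterable of (user, record, timestamp_seconds) tuples.
    Gaps of min_gap_s or less are ignored so that rapid refresh-button
    mashing does not inflate the count.
    """
    last_seen = {}
    repeats = 0
    for user, record, ts in sorted(events, key=lambda e: e[2]):
        key = (user, record)
        previous = last_seen.get(key)
        if previous is not None and min_gap_s < ts - previous <= max_gap_s:
            repeats += 1
        last_seen[key] = ts
    return repeats


events = [
    ("nurse_a", "patient_1", 0),
    ("nurse_a", "patient_1", 120),   # came back two minutes later: a repeat
    ("nurse_a", "patient_1", 125),   # quick refresh, ignored
    ("nurse_b", "patient_2", 0),
    ("nurse_b", "patient_2", 3),     # refresh mashing, ignored
    ("nurse_b", "patient_2", 6),
    ("nurse_a", "patient_1", 1000),  # much later, a new visit, ignored
]
print(count_repeat_lookups(events))
```

The lower bound is what keeps one or two impatient users from dominating the measure, which is the adjustment described in the talk.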

00:17:38

So we needed to be realistic with them. And in fact, this SLO nests everything within the system. When we hit the SLA, again, we gave ourselves some reaction room against the SLO. Now, the eight-hour timeframe here actually came from the nurses, who think in terms of their shifts. So, you know, if they saw it once within a shift, okay, but if they saw it twice within a shift, you know, they're phoning us. So you've got to remember what you're working towards. You know, it's not a metric contract. Use it as a tool, not to beat each other up, but to understand and capture assumptions. So you need to consider all three dimensions for success. The SLAs are not there to beat each other up. They're there to capture the mutual understanding.

00:18:23

You reach that mutual understanding through negotiation. So SLIs, SLOs, SLAs, error budgets: they are the tools to support negotiations. Now, negotiating is a key skill for any DevOps professional. There are some great books out there, though the books can sometimes be contradictory, you know, getting to yes, getting to no, and pretty much all of them are aimed at sales folks. For me, I'm a win-win negotiator. I want everybody to feel like they won something, and that's not always possible. So here are a few quick thoughts based on my experiences, to save you reading all the books that I've read. Now, a few quick steps. "Know thyself" is not a new thing. It's carved in the Temple of Apollo in Greece, from the fifth century BC. So definitely not a this-century thing, but starting here is great.

00:19:15

How much can you control? What risks can you absorb and keep your job? Is the risk spread evenly through the year, or do you have peak periods like a Black Friday or a Super Bowl or a New Year's Eve? Are you in a period of significant transformation? If that applies, will things be the same in 12 months? Or if you set SLIs, will they be a bit meaningless in 12 months, depending on the nature of the transformation? So use all this to gather your needs. You probably have a feeling for the expectations that the business will have. So what do you need to deliver that? Could you accept tougher SLOs if you could grow your team or purchase supporting tools? When consulting, I try and brainstorm this a bit to identify where my outer boundaries are: what would be unacceptable, or be considered too easy. Preparing to engage is about gathering the information and building it into a strategic model.

00:20:07

There's a people factor here. As you're gathering information, gather opinions as well about the people you will be negotiating with. What are their goals, their attitude to risk or innovation? Even subtle things, like what time of day are they more open to ideas or in a better mood? Are they happier if there are donuts in the room? Are they happier if there's coffee or not? If you will be the facilitator, then read up on this and learn it ahead of time. Facilitation is a very specific skill. Consider bringing in a contractor or an outsider. And actually, it can be great to grab a leader from a different part of the organization who, you know, is a natural facilitator, who can park their own ego and needs and draw input from everybody in the meeting. It can give that facilitator a career profile inside your organization, and they learn about another part of the company.

00:20:54

So it's win-win. I told you, I like win-win. But then it's time to get the negotiation on, to schedule things and set up the meetings. So in a negotiating meeting, you want the warm-up, and this is a discussion of the application, the dimensions and the scope of the meeting. Be brief. You don't want them zoning out. They're all experts in some aspect here, or you wouldn't have them in the room. You don't want to turn over every stone yet. Get everyone to talk: ask them to spend one or two minutes describing their aspect of it. The nominated facilitator should politely close anyone down if they start to exceed a brief introduction. Your next step is to test-drive something: give them a shot at one of the indicators under consideration. You're testing the water here, so try to make it something that could live on as a final agreement, and make it something real. Then you hit assess.

00:21:44

Now they have something to talk about. Assess the business value. Is this the best place to start? Will pursuing this give ROI? Do you need to balance innovation and risk for this application? What actions will be effective for missed SLAs? Will you just freeze changes until you return to stability? And if so, how long should that freeze be? An important part of this phase is extracting and capturing assumptions. Clarifying assumptions is why footnotes in SLAs can sometimes be longer than the actual SLA. The next step is to propose. So you've gathered information, and everyone's turned over a few stones. And a quick hint here: predictability is often more important than speed. The higher the variance in response times, the more user experience is affected, and you lose their trust. So avoid spreads greater than, say, six standard deviations. If you're more than six SDs, then you've got low process capability.
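
One possible reading of the "six standard deviations" hint is the process capability index from statistical process control, Cp = (upper limit - lower limit) / (6 x standard deviation), where a value below 1 means the natural spread of the process is wider than the agreed limits. That interpretation is an assumption, not something the talk spells out, and the limits (0 to 300 seconds), the sample data and the function name are hypothetical.

```python
import statistics


def process_capability(samples, lower_limit, upper_limit):
    """Cp index: agreed range divided by six sample standard deviations."""
    sigma = statistics.stdev(samples)
    if sigma == 0:
        raise ValueError("no variation in samples; Cp is undefined")
    return (upper_limit - lower_limit) / (6 * sigma)


# Hypothetical response times in seconds against a 0-300 second agreement.
steady_s = [10, 20, 30, 40, 50]
erratic_s = [5, 60, 150, 240, 295]

for name, samples in (("steady", steady_s), ("erratic", erratic_s)):
    cp = process_capability(samples, lower_limit=0, upper_limit=300)
    verdict = "capable" if cp >= 1 else "low capability"
    print(f"{name}: Cp={cp:.2f} ({verdict})")
```

Both sample sets stay inside the five-minute limit, yet only the steady one is predictable enough that users could learn when to expect a result.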

00:22:36

So it's not good. So you propose, and you get reactions. Now you recur: you assess the new proposal, and expect to iterate through this stage many times, because this is where the real negotiation occurs. Now, you might have to schedule follow-up meetings. For each meeting, make sure to take a few minutes to revisit the warm-up: restate the scope and goal, recap the conversations to date, and try to acknowledge something from everybody in the room. You want them participating and feeling respected, right? But don't talk long enough to zone them out. Finally, you get to agree. This is the final presentation of a finished agreement for sign-off. Now, it doesn't have to be a physical signature, but it's worth saying that you will need everyone to confirm by email. They need to commit on the record. If they're reluctant, you missed something during the assess phase.

00:23:25

So return to that and try again. There's some assumption or hindrance that they have. Okay, so I went through that pretty quickly, but the slides kind of cover it. Here's what I want you to do. Learn from my experience. Don't manage on the metrics; focus on the outcomes: the full transaction, the complete process, the overall experience. Don't use service levels to beat each other up. Use them to become preemptive; use them so that you can offer more services ahead of time. And then, when you build out your service levels, remember to assess them against the three dimensions: code, infrastructure, and customer experience (CX). Are you seeing the full picture? Is some critical aspect being overlooked or assumed? One of the most critical things around customer experience, of course, is predictability. The higher the variance in response times, the more the user will lose trust in you and the application.

00:24:20

And it's an indication of low process capability. So don't fall into that trap. Keep your eyes on the transactions that are outliers; outliers annoy the crap out of users. So, you know, watch those. And finally, realize that if you want to be great at DevOps, then you need negotiation skills. Negotiating is good for life, and it's good for DevOps. Now, as promised, here are some links to further research. This is a time to switch back to this screen and take a quick snapshot, although I'm sure the PDF will be available online. You can catch up on most of my thoughts on my website, tech-whisperer.com. If you found this session interesting, then please connect with me on LinkedIn or Twitter. I might have some more interesting thoughts tomorrow, you never know. And feedback on this presentation would be fantastic. What caught your attention? What was important that I missed? That way this becomes stronger and stronger as we get more input. So with that, I want to thank Jane and Anne and the DevOps Enterprise Summit team for their support and help during this event, in particular Anne's patience. And I want to thank you for your time today, and I will see you in the chat rooms. Again, feel free to bring questions to me there. I look forward to meeting you and catching up.