Las Vegas 2023

Not dead, just resting! How To Win at Maintenance Mode

How do you do maintenance mode in a DevSecOps world, and why is that even a question?Maintenance mode. Keeping the lights on. BAU support. Evergreening. Whatever you call it, it happens when there’s no more funding for feature development in live digital services and data pipelines. There’s a need to resize teams to cut costs, to reassign people to start on new propositions… but those live services aren’t dead, they’re just resting! Who does the library upgrades, security patches and defect fixes? Who preserves availability targets when there’s no more money to look after the live services that are making money?

I’ve spent time with many different scaleups and enterprise organisations. Too many have used their operations team as a maintenance mode solution, and wondered afterwards why technical quality, reliability, and employee retention all took a hit. And if you think transitioning a live service into an operations team is hard, you should see the reverse when it’s back to the developers for more features.

I’ll cover the pros and cons from three different maintenance mode solutions - delivery teams, the operations team, and multi-product teams. I’ll share the results from a maintenance mode survey of 40 different organisations worldwide. And I’ll explain why multi-product teams as an extension of the You Build It You Run It operating model can give you a truly magnificent maintenance mode.

Steve Smith

Head of Scale, Equal Experts

Chapters

Full transcript

The complete talk — auto-generated from the talk's captions.

Okay, the next speaker is Steve Smith, head of Scale at Equal Experts. The title of his talk is How to Win at Maintenance Mode, which describes the challenges of what one does when the business critical applications are no longer the top priority of the organization. We all know the poor outcomes that result when an application is put into maintenance mode, but what are the alternatives? How can we be better prepared when those applications actually need to be updated?

Again, Steve and I have been talking about this over the last year and I'm so excited by how he thinks about the problems and what we can do about it. And by the way, this talk report is one of my favorite, uh, talks, which was The Leftovers Dilemma, which uh, George Crass from Discover Financial Services described so well last year and he will be, uh, providing a continuation of his experience report on day three. So here's Steve. Good morning everyone.

Ron, Ron, where are the slides? Oh, there we go. Thanks Ron. Hi, I'm Steve Smith.

I'm gonna talk about the maintenance mode, scaling problem, the kind of unglamorous inevitable technology problem that Gene Kim and I just love talking about. I phoned you a year ago and I said, gene, I'd like to talk in Vegas. And he was like, what about? And I said, A topic so boring gene that you'll instantly regret ever having me.

And he was like, oh, that sounds amazing. Tell me more. Somebody last night in the restaurant said to me, how do I get to know Jean? And I said, the problems that you're bringing him are not boring enough.

So we'll cover what the problem is. We'll cover three different solutions and a survey of 40 organizations that Gene made me do as the price of coming here. Uh, who am I? Let's go.

Yes, gene, come on. No, no clicker Ron. Really? You know, I hung out with Ron out back, but clearly, oh, see Gene's gonna go back and turn off Ron.

There we go. Okay. Oh, there we go. No, Ron.

No. Okay, thanks Ron. Alright, so this is why you always get to know the AV people. Who am I?

I'm an engineer who likes to advise other engineering leaders. It runs in the family. My dad was an engineer, my dad's dad was an engineer. Uh, I'm the head of the scale service at Equal Experts.

We're a global technology consultancy of around 3000 people. I've been with the company for uh, 10 years now and it's my job to collaborate with our customers to identify and solve scaling problems. That's something that slows you down or harms you when you're trying to scale your teams up or down. Before that I worked L max, a financial startup where we built an amazing state-of-the-art low latency exchange.

I worked with a guy called Dave Farley. I dunno what ever happened to him. Never heard of the guy ever again. Um, just kidding.

He's on YouTube now. My daughter's really impressed by that. And yes, it's true. I'm British.

Somehow I snuck onto the program. Uh, you can tell I'm British because my humor is quirky, yet ultimately somehow lovable. Um, I've never spoken to a lawyer in my life, which over here is a kind of a big deal. I know.

And in my household of four people and one puppy, I have one car that is the correct ratio in Britain. But don't worry, uh, equal experts has a North American office full of lovely North Americans like ssid who's down here somewhere in front. Uh, there he is. Hey Sid.

So Sid's, our S V P and he is from Nebraska. He lives in Florida and in his household of four people, excuse me, three people in his household, he has four cars and one plane that is the correct American ratio. So what the hell is this maintenance mode problem? You see, SSID told me last night to take out a joke about communism and I swapped it out for a joke about him.

And now that's gone down really well. My wife said not to do politics in the us she was right. So what the hell is this thing? Anyway, we're behind times speed up already.

So last year George cra Otis spoke about the leftovers. He talked about how to manage business critical services when they lack funding or clear business ownership. Thanks George for sharing your story and the request of G and I'm gonna try and build on that a little bit. Um, and we're gonna try and hang out tonight.

Me and George, he's been ducking me for a year. Now we're ready. I see this as the maintenance mode scaling problem. You might know it as business as usual or harvest mode or keeping the lights on.

Let's say that you've got a hodgepodge of monoliths and microservices or a zoo of data pipelines. Those are the correct collective nouns for them. The majority of them will be non-differentiating services. You'll build them because you have to, not because you want to.

And your teams work hard, right? They continuously deliver planned features to your users and then once they've got live traffic and demand slows down, deploy, slow down and they reach a zero demand, that means zero funding remains for feature development. The day-to-day work becomes maintenance tasks, uh, upgrading libraries, security fixes, all that good stuff. And the problem you have here is who manages these zero demand services at scale?

Because they're like the Norwegian Blue Parrot in Monty Python, they're not dead, they're just resting, right? And if you have annual funding, this problem is coming your way because, because you'll have pressure to reduce CapEx spend, you can listen out for this problem. You can hear people say things like, we need to increase capacity for more services. We need to reduce costs.

Or maybe you're measuring unplanned work rate for your teams and it's slowly starting to creep up. To solve this problem, you need to create an ownership model for zero demand services that allows for teams to be reassigned, resized or retired. That's gonna cover those two new outcomes that you want more capacity and lower costs, but you also have some existing outcomes you want to protect. You want to protect your current levels of live services, reliability, job satisfaction, and future feature delivery.

Alright? What solutions do we have? How effective are they? What problems do they cause themselves?

Because scaling problems are hard like that, you know, and we can't all be Spotify can we just throwing Swedish money and Swedish fish at our challenges. We've got to actually figure them out for ourselves. So they're not gonna hire me. It's fine.

The first solution is to have your delivery teams maintain their own zero demand services in the background. Here's a composite organization synthesized from many equal experts customers over the years and I can't remember what any of them are, which is good. Let's pretend it's an American retailer. It's not really American, but I do like America.

The first time I ever came to America, I was made to sit in a glass room at L A X for three hours with all the other passengers while a tan are yelled at us that we should not speak to any walkup lawyers. I've never been so scared in all my life. We've got 12 delivery teams here. There's a product details page and an online checkout team.

They're building differentiators, they're innovating, they've got long-term funding, forget about them already. I'm much more interested today, at least in the 10 other teams, the furniture team, the returns orchestrator team, all of the non-glamorous stuff I warned you about. They're being built for parity, right? They're going to reach a zero demand Once they're into live traffic with this maintenance mode solution, what happens is those services move into the background of the delivery teams so that they can pick up more work.

That's what the blue line signify here. Just shove it into the background and here's the end state of this solution. Orange means a team is maintaining a live service in the background while they're building a new service in the foreground. You can resize teams but you can't really retire them.

And the operations team does all the life support for you. Fax ops, not that they chose that life. What's good with this solution is there's more capacity and there's a similar level of services reliability to before because the maintainers haven't changed, they're the creators. But what's bad is future feature delivery is much slower than before.

Okay? Because a team will have two different business owners for their two different services. So prioritization becomes painful or as I like to call it a bun fight. And what will happen is you'll probably allocate a dedicated amount of time each week to maintaining the background service.

But then teams will come under pressure to squeeze that time down or to shove in large unfunded features and it's pretty stressful. What else is bad? Well, there's little sense of mission or purpose for your teams and it's not easy to retire teams. You can't actually reduce costs with this solution.

That's why this maintenance mode solution kind of happens by default until someone turns up and says we need to save some money around here. So to test our beliefs, we surveyed 40 organizations across the equal Experts network, uh, telcos, financial services, retailers, the other ones, the columns here are the new outcomes that we're trying to achieve and the existing outcomes that we're trying to protect. And the row is going to be the maintenance mode solutions that we're looking at. Green means we think that the outcome can be achieved, red means it can't deal with it.

And the percentages are our actual customer experiences For this delivery team solution, 58% of responders agreed that there was an increase in capacity and 75% agreed that they could protect services reliability to a similar standard as before. But 100% the responders said there were no cost reductions and 52% agreed feature delivery was slower than before. One person said to us, we were still seeing you feature requests, high value stuff that could make a lot of money, but it managers would say, uh, Uhuh, we can't change that. It's in maintenance mode.

And 33% saw a drop in staff happiness. One person said to me, whenever we dial up maintenance tasks, I can see it have a real impact on team morale. Alright, our second solution is probably the most popular one out there. Your operations team are already doing all of the live support, so why not make them do all the maintenance tasks as well, right?

This could be a move from the first solution or it could be your starting point. So what does that look like? Here's our American retailer. Again, it's not really American, but I do like America.

You know, the first time I went to New York I got into an awful lot of trouble in a bagel shop. I asked for a British sized bagel. The waitress said, what is that? I said, your food sizes are insane in this country.

Give me a half size bagel and I'll pay full price. She got angry, she threw me out and now I can't get a bagel anywhere on the lower east side. The sad thing about that is that story is entirely true. We've got the same 12 teams as before, the same 10 non differentiators that were interested in.

Now when they reach zero demand, they are transitioned into the operations team who will do all the maintenance tasks as well as life support. And the end state is the operations team maintaining everything and supporting everything. And now you have the flexibility to resize teams, retire teams, allocate services to different teams. It's entirely up to you what you do with this.

Listen to your heart. But whatever you do, your operations team is going to be awfully busy. So what's good here is you do achieve greater capacity and lower costs, which in this economy is super important. But there's a lot of problems here.

Number one is reliability takes a big hit, okay? Because you entirely blameless operations team are affected by two things. One, their cognitive load is gonna be really high. There's no limit to the number of zero demand services that can be imposed upon them.

They have to work long hours, their work in progress is really high. Mistakes are gonna creep in. And number two, it's unlikely they'll be given the time and space necessary to acquire the technical skills and domain knowledge to rapidly complete maintenance tasks to a similar standard as when services were with the delivery teams. Future feature delivery is another big problem here, right?

That I'm sure people have seen before. One operations team owning many zero demand services means one of two things. Zero business owners or many business owners. So prioritization then is really hard and endless BUN fight over an ever-growing support backlog.

And there's a lack of job satisfaction here because developers feel that they're in a never ending feature factory. And your operations team feel that they're in a never ending dumping ground and they're both right. There's a countermeasure to the problems caused by this maintenance mode solution. You can do a reverse transition, right?

You can suck a zero demand service back out of the operations team into a short-lived, newly funded delivery team who will complete a small backlog of stability enhancements or long planned features and then it can go back into the operations team. Here's our American retailer again, you can see in the middle there's a payment service that's being sucked back outta the operations team into a payments V two team because of course the payments V one team is long gone. I've seen this solution a bunch of times in the wild and this countermeasure and it's flawed. Your new delivery team will lack the domain knowledge necessary to rapidly complete their work.

And a reverse transition out of an operations team is like an order of magnitude more painful for everyone than a transition into your operations team. I know a company where there's a 300 question spreadsheet before an operations team will take on maintenance tasks as well as life support. And to do a reverse transition back out of the operations team, there's a separate 100 question spreadsheet where every question is a cunningly worded variant of could you please do a better job than last time? And yes, of course when the developers are finished doing a better job than last time when they finished implementing those stability enhancements or those long planned features, it's back into the operations team and let's do that 300 question spreadsheet all over again.

But this time your operations team are really waiting for you. It's a time consuming wasteful merry-go-round. Back to our survey results for the operations team solution. 66% of responders said yes, we had increased capacity and 55% said yes, we saw reduced costs.

That's good, that means you've achieved those new outcomes. But 55% agreed that reliability took a really big hit. One responder said we had a lovely time once by which I know they are British and it was not a lovely time. They spent 20 hours fixing a broken deploy because their blameless operations team had been unable to complete maintenance tasks for a year.

So there was no rollback pathway at all and 55% felt that tive was more difficult than before. Somebody very angrily wrote to me saying the ability to handle maintenance tasks is a very painful concept. Here we have outsourced operations contracts with no time dedicated for all uh, upgrade work. Everything has to be a purchase order and a change request.

The amount of waste in this process is criminal. Don't worry, I did actually phone the police about that one. They told me to stop phoning them. 88% of responders saw a drop in staff happiness.

One developer said to me, the operations team here is permanently unhappy looking after services they didn't build and trying to upgrade them all the time. And I feel bad for them. So what we see with this solution is it achieves those new outcomes in the short term that probably explains why it's so popular, right? But in the medium to long term, it kills your existing outcomes that you're trying to protect.

Maybe that explains why I've seen the solution cause so many problems. Alright, our final maintenance mode solution is to form multi-product teams. It's a solution we've designed with some of our customers. This could be a move away from either of those prior solutions that we've seen or maybe it's your starting point.

Either way the you build it, you run. Operating model is a prerequisite for multi-product teams. So we'll need to talk about that first. So at the Europe 2021 event, I spoke with Simon Skelton from John Lewis and partners a UK retailer similar to Nordstrom over here.

He shared how we introduced the you build it, you run it operating model to 30 teams and 40 digital services as part of their transformation journey. You build it, you run. It is an operating model in which teams build and run their own digital services. It means product managers are incentivized to prioritize reliability alongside functionality because they're accountable for both.

And it means that developers are incentivized for the long term to continually build reliability into everything that they do. Think of it as better insurance for your most valuable business outcomes. This new operating model gave John Lewis and Partners an annual deploy count that was 26 times higher than before. A time to repair that was twice as fast and revenue protection was four times higher.

Um, I co-authored a print book on You Build It, you Run it a year ago now I've got some copies in the back. Come and find me afterwards if you'd like one. It's no problem at all. So the obvious follow on question we have here is what would happen if we applied?

You build it, you run it. Principles of outcome oriented teams, zero handoffs and clear incentives to zero demand services. Here's our American retailer one final time. It's not really American but I do like America.

I went to a Walmart in Orlando's last year and I bought a giant bag of Cheetos. It was the largest bag of crisps I have ever seen. That was lunch and dinner all sorted. I did have some very strange dreams that night though.

So we have the same teams here, but now they're organized into product families. You might know 'em as verticals and those teams are also allocated into different domains. We have our teams building and running their own services and when they reach a zero demand, they're transferred into a multi-product team dedicated exclusively to that family. A multi-product team is a team of engineers who are accountable for reliability outcomes for all zero demand services in that family.

They don't have a product manager that would wrongly imply the presence of a product backlog, but they do report into a product lead who owns that entire family. So what happens is that you transition those services into your multi-product team and then you can resize or retire teams or reassign people as you saw fit with the operations team solution. Again, numbers will vary here. It depends on your organizational context.

What's important here is the idea of you build it, you run it being applied beyond a single product team. So what's good with this solution is your new outcomes are achieved, you've got more capacity and lower costs and you also can protect your existing outcomes. Multi-product teams have these skills and domain knowledge necessary to preserve a level of reliability similar to before. And their cognitive load is much more manageable because the number of zero demand services that can be imposed upon them is limited to the number in their product family Future feature delivery can happen at pace when necessary.

When a multi-product team identifies an improvement opportunity, they can go to their product lead who's an Apex decision maker for that entire family. They can make a quick prioritization call across all of the services that they own without any drama. And the engineers in a multi-product team can have have job satisfaction. They are responsible for end-to-end outcomes.

What they're doing is making a difference. They're empowered to act. What's bad here is you do need to spend time setting up the right guardrails to counter any uh, dumping ground cultural factors that are like lurking in your organization. So you need an organization wide definition of zero demand.

You need a strong identity for multi-product teams and some really great product leadership. And you do need to find funding for one multi-product team per family, which means you need a good answer for wouldn't one operations team be cheaper than all of this madness, Steve? And the way to do that is to flip that cost-based conversation into a value-based conversation. You need to understand the kind of outcomes that you're trying to achieve together.

And then you can look at the business metrics that are necessary to show if you're actually um, successfully protecting outcomes and achieving your new outcomes. Alright, what have we learned when we uh, go to break? There are like to remember three things to take away from here. Number one, if you have annual funding, zero demand services are inevitable, it's gonna happen to you.

So plan accordingly for how you're gonna approach maintenance mode. Number two, don't chuck all of your maintenance mode work blindly over the wall to your operations team. You are accidentally sacrificing important outcomes in the medium term. And number three, if you move to you build it, you run it an empowered product teams, moving to multi-product teams for maintenance mode is gonna be a natural fit for you.

So thanks to Jean for inviting me to speak here. Thanks to Anne for all the logistical help. If you'd like to talk about any kind of scathing problems that are on your mind, I'm around, uh, all week. I'm at the conference.

I'm the one that's easy to find in a crowd or you can, uh, get in touch with me on LinkedIn afterwards. And thank you to equal experts North America for looking after me this week. Remember, SSID is here if you want to talk to a North American, someone who thinks like me, but sounds like you. Thank you.