Las Vegas 2020

My Ops Team Can't Keep Up with My Dev Team: Creating Strategic Differentiation in Ops

In meetings with clients and prospective clients, by far the most common frustration expressed by C-level executives is the inability of Ops to keep up with Dev in today's fast-paced cloud model. A number of general principles are often the solution to many of these problems.


In this talk we discuss different DevOps techniques used to transform medium-sized enterprises, resulting in Ops changing from a drag on the velocity of the organization to a strategic differentiator for the business.


Dave Mangot

Principal, Mangoteque

Transcript

00:00:13

Hi, my name is Dave Mangot, and today we are going to talk about strategic differentiation in operations. I've been in operations for about 25 years now, and I've seen a lot of things at a lot of different companies, and I'm lucky that I now get to work with companies on leveling up their operations teams. That's my job, and it's great; I really enjoy it. One of the most common things I hear when I talk to CEOs, CTOs, and CIOs is: my ops team can't keep up with my dev teams. And there are a lot of reasons for this, right? There are containers, there's cloud, there are all kinds of other things happening now that are brand new and that enable developers to go faster. But there are also things that operations organizations do to themselves that cause them to not be able to keep up with the development teams.

00:01:10

And so today we're going to talk about some of the things I've seen in organizations that, if you recognize them in your organization, can help you accelerate your operations team. We're also going to talk about what a high-performing operations team looks like. For me, having done this for so many years, I really don't like it when I hear "my ops team can't keep up with my dev team," because I've been doing this long enough that I remember when operations was just considered a cost center, right? The value in the organization is in development; it doesn't happen in operations, because you just keep the site running. So I love, love that SRE and all these things have come along and people recognize that operations is actually a really valuable part of the organization. And I've been lucky enough to be on teams, and to lead teams, where we've created a strategic differentiation.

00:02:11

So there was this company I was working with, and we had Cassandra there. For people who don't know, Cassandra is a distributed database designed for very high throughput and very large amounts of data, terabytes and terabytes of data. It operates in a ring topology, so the data is replicated throughout the ring; that's how you can lose a node without losing data. It also allows you to talk to different nodes if you want to get data or store data, and Cassandra takes all that data and does the right stuff with it, which is great. So we were running Cassandra, and it runs on the JVM, so there were a lot of things you had to know about, but we were having a lot of trouble with Cassandra.

00:03:02

And so we were running at about 70% utilization. On your MySQL database, maybe that's not so bad, but with Cassandra, that's really pushing the envelope, because Cassandra will do things called compactions, which go through and reorganize the data into more efficient storage, basically. And when a compaction kicks off, it uses a whole bunch of CPU, so that could be a problem. We had a lot of outages because we were running so close to the edge, and when we had an outage, the recovery could take days. This is because there's a lot of data, right? Let's say we had a replication factor of three, and then we lost one of those nodes. Then the other two nodes that were holding onto that data would have to stream the data across the network and rehydrate the replacement node with all the data.
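
To make the replication factor idea concrete, here is a minimal sketch using the Python cassandra-driver; the contact points and keyspace name are hypothetical, not taken from the talk. With a replication factor of three, every row lives on three nodes, which is why losing one node loses no data but forces the survivors to stream it to a replacement.

```python
# A minimal sketch of the replication factor idea using the Python cassandra-driver.
# The contact points and keyspace name are hypothetical, not from the talk.
from cassandra.cluster import Cluster

# Any node in the ring can serve as a contact point.
cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect()

# With replication_factor 3, every row is stored on three nodes: losing one node
# loses no data, but the surviving replicas must stream it to the replacement.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS customer_events
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
```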

00:03:59

And it was pretty complicated and pretty fragile. You had to have expert-level knowledge of Cassandra to be able to do these kinds of recoveries, or even these kinds of configurations. And so we didn't like the situation we were in. It wasn't good for customers; customers don't like it when it's hard to retrieve their data, and customers in this case wanted more or less real-time storage of their data so they could use it. So we had to do something about it. So what does everybody in engineering do when there's some problem they're in and they need to get themselves out of it? Yeah, okay, I hear you saying that you write it yourself, but we didn't do that until later. We upgrade, right? Upgrading solves all problems. I was looking for a rainbow-and-unicorn slide; I found this one with the dogs and the balloons, and I just had to use it. So, you know, upgrading solves all problems. Of course.

00:05:02

Well, the problem was, upgrading didn't solve the problem. In fact, upgrading made things worse. We dark launched a new Cassandra ring, we were sending data over there, and we were getting these massive, massive timeouts, timeouts that made the ring unusable. You could not store data. We could not put customer data on there; it was just not an option, not something we could do. So we were left with the unenviable task of figuring out where this problem was introduced, because the version of Cassandra that we were on didn't have the problem, but the version of Cassandra that we wanted to go to did. And because Cassandra is an open source project, it shouldn't be too hard, right? We just have to go and do some testing and figure out where the problem was introduced between the two versions.

00:05:52

So we went to GitHub, looked at the Cassandra project, compared the version we were on to the version we wanted to go to, and discovered that there were 5,827 commits between them. Okay, so all we have to do is find out which commit, out of almost 6,000 commits, was the one causing the problem. How do you do something like that? Well, it turns out we were actually able to find the actual problem, and we used a technique called git bisect. The way git bisect works is: we have 6,000 commits, so you go to commit 3,000. If the problem still exists at commit 3,000, you know the problem was introduced at or before commit 3,000. And if the problem does not exist there, then you know the problem was introduced after commit 3,000.
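
As a concrete illustration of the workflow he describes, here is a sketch of driving git bisect from Python. The repository path, the version tags, and the test script name are hypothetical stand-ins, not the actual revisions or tooling from the talk.

```python
# A sketch of driving `git bisect` from a Python helper via subprocess.
# The repo path, version tags, and test script name are hypothetical stand-ins.
import subprocess

def git(*args: str) -> None:
    subprocess.run(["git", *args], cwd="cassandra", check=True)

git("bisect", "start")
git("bisect", "bad", "cassandra-2.1.0")   # the version with the massive timeouts
git("bisect", "good", "cassandra-2.0.0")  # the version we were running happily
# `git bisect run` checks out the midpoint commit, runs the test command, and
# uses its exit code (0 = good, non-zero = bad) to halve the remaining range.
# With 5,827 commits that is at most ceil(log2(5827)) = 13 test runs.
git("bisect", "run", "./test-ring.sh")
git("bisect", "reset")                    # return to the original checkout when done
```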

00:06:53

So say the problem exists before that; then you go to commit 1,500, and you keep doing this binary search until you find the problem. In order to do this kind of search, you need to be able to stand up a Cassandra ring. But we just said that these Cassandra rings could take days to rehydrate if there was a problem, and that the configuration was very complicated and hard to get right. And so what we were able to do as an SRE team is create the ability to stand up a Cassandra ring in 20 minutes using SaltStack and its event-driven infrastructure. What would happen is the ring would stand up, and then we would get an understanding within the ring that all the nodes were there, and that would create the actual ring.
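
Here is a sketch of what the per-commit test that `git bisect run` invokes might look like, under the assumptions that Cassandra is built with `ant jar` and the ring is stood up through a Salt orchestration state; the state name, the workload script, and the timeout threshold are all illustrative, not taken from the talk.

```python
#!/usr/bin/env python3
# Hypothetical shape of the per-commit test that `git bisect run` could invoke.
# The build target, Salt orchestration state, workload script, and timeout
# threshold are illustrative assumptions, not the actual automation.
import subprocess
import sys
import time

# Build Cassandra at whatever commit bisect has checked out (Cassandra builds with ant).
subprocess.run(["ant", "jar"], check=True)

# Stand up a fresh test ring via Salt's orchestration runner (~20 minutes in the talk).
subprocess.run(["salt-run", "state.orchestrate", "orch.cassandra_test_ring"], check=True)

# Send the write workload that reproduced the timeouts and time how long it takes.
start = time.time()
result = subprocess.run(["./write-workload.sh"])
elapsed = time.time() - start

# Exit 0 marks this commit "good" for bisect, exit 1 marks it "bad".
sys.exit(0 if result.returncode == 0 and elapsed < 300 else 1)
```

At roughly 20 minutes to stand up a ring plus the workload run itself, thirteen rounds of this is consistent with the two-and-a-half-day figure that comes up next.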

00:07:52

And then from there we could run our tests. And so we were able to isolate the exact commit that was causing the problem, out of those 6,000 commits, in two and a half days. Everybody always wants to know, okay, what was the actual problem? There was some code that was subtracting nanoseconds from milliseconds. So we were able to isolate that, we wrote a patch, it fixed the problem, we submitted the patch upstream, they gladly accepted it, all kinds of fun, happy stuff happened at the end. But how many organizations can really do something like this? So I went and talked to the CTO about it, and I said, what would you have done if we hadn't developed this ability to stand up a full Cassandra ring in 20 minutes?
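
The actual patch isn't shown in the talk, so the snippet below is only an illustration of the class of bug he describes: subtracting a nanosecond quantity from a millisecond quantity without converting units, which makes timeout calculations go haywire.

```python
# Illustrative only -- not the actual Cassandra patch. This is the class of bug
# described: subtracting a nanosecond value from a millisecond value without
# converting units, so almost any elapsed time appears to blow the timeout.
def remaining_timeout_ms(timeout_ms: int, elapsed_nanos: int) -> int:
    # Buggy version (units mixed): return timeout_ms - elapsed_nanos
    # Fixed version: convert nanoseconds to milliseconds before subtracting.
    return timeout_ms - elapsed_nanos // 1_000_000

# 250 ms of elapsed work against a 10 s timeout leaves 9,750 ms, as expected.
print(remaining_timeout_ms(10_000, 250_000_000))
```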

00:08:43

And he said, well, probably one of three things. One, maybe we never would have tried. We would have just said, you know what, the version of Cassandra that we're on, we're going to have to live with it, and we're going to have to figure out some ways of making life better both for the operations folks who are getting paged and for the customers. Maybe that's what would have happened. Maybe we would have hired some consultants, some expensive consultants, and who knows how long it would have taken them to find it; maybe they never would have found it, we don't really know. Or maybe we would have done something else completely, like use a different piece of software and abandon Cassandra. Who knows what we would have done. But the ability to stand this thing up, isolate down to the problem, and fix the problem in cooperation with the developers?

00:09:34

This was a pretty awesome thing. And for me, this is the essence of working in this DevOps mindset: everybody coming together to solve the problems that the business has. But there were also knock-on effects from being able to do this kind of thing. Because we were able to stand this stuff up, and because we worked out techniques for automating a lot of it, our ring recovery time went from a maximum of days down to a maximum of minutes. This is pretty great for operators, for everybody in the business, because now failure is not such a problem. On top of that, in the course of doing all this, we were moving from EC2-Classic to VPC, and we reduced our costs through the ability to do capacity planning

00:10:28

and right-sizing the nodes we were using to the workload we were running, and the storage to what we were actually storing. All of that enabled us to reduce our costs by 45%, just in that migration. On top of that, we used this ability to stand up rings and work with engineering such that we were able to create a solution that mostly replaced Cassandra entirely. It was a purpose-built solution that we built with the developers. We gave them the ability to stand up test instances, we talked with them about how we were going to do resilience, or I guess Allspaw would say robustness, and we talked with them about ways to design the system. Ultimately the conservative estimate was that we saved 70% in costs over what we would have spent running everything on Cassandra, which is pretty impressive.

00:11:27

And if you want to know more about how we did some of this, there's a great talk about Cassandra that we presented at AWS re:Invent a number of years ago, and it goes into much more detail about the actual Cassandra work. But the question then is, well, how does my ops team do that, right? So we're going to talk about a couple of things that I see a lot when I'm working with organizations, and ways that we can get out of them. The first obstacle, really, is figuring out what kind of SRE organization you're going to be; there are a bunch of different models, and we're going to talk about that. Then we're going to talk about empathy, we're going to talk about ticket systems, and then we're going to talk about alignment.

00:12:10

And after this talk, you won't have to hire me; you'll know everything you need to know, and I'll have put myself out of a job. But my hope really is that as we talk about some of these things, you will recognize them in your own organization, and you'll be able to take them back and use them to make the operations component of your organization that much better. So when I work with SREs, I have two rules for them. These are the overarching rules for when they have to make a decision about something, the things they have to keep in mind, and they are in order. Number one is: keep the site up. It seems kind of obvious, we're in operations, of course we keep the site up, but we want the site to be available.

00:12:59

That's what the business is paying us for, so we have to keep the site up. And the second thing is: keep the developers moving as fast as possible. We're going to talk about how these rules relate to a high-performing SRE or operations organization, and how they related to what we were doing on that Cassandra ring. You can already see it, right? Keep the site up: obviously, if you're having all these outages, we need to take care of that and make sure it's not a problem. And keep the developers moving as fast as possible: we already alluded to that when we talked about how we gave the developers the ability to stand up their own nodes and try things out and all kinds of other stuff like that. So, in order to know where you're going, you have to know where you are.

00:13:46

And so there are two SRE models that I look at as the opposite ends of a continuum. If you really want to dig into these, I highly recommend the O'Reilly book Seeking SRE, edited by David Blank-Edelman; there's all kinds of stuff in there that digs into these different SRE models. For me, looking at these different models, I look at them as a continuum of "keep the site up," right? The Google model of SRE is a very active model of keeping the site up. Google has their SREs, they're on call, and they have standards for the developers. If the developers don't meet those availability standards or performance standards or whatever, then it becomes the developers' problem again; the Google SREs don't run the site at that point. But it's a very active model.

00:14:38

They're very actively participating; they're the ones who get paged, all that kind of fun stuff. On the other side, we have the Netflix model, and that's a very supportive "keep the site up" model. Netflix defines your availability numbers, your performance numbers, all those kinds of things. Ultimately, you are responsible for running your own service, but if you're having trouble, if you're not meeting your numbers, then the SRE organization is there as a very consultative organization, someone who will come and work with you to help you achieve those numbers. And obviously, if you're not achieving those numbers, you're going to keep hearing from them; there's a whole bunch of stuff about getting reports from the Netflix SRE team and things like that. But it's the very supportive "keep the site up" model as opposed to the very active "keep the site up" model. And in your organization, it's good to know which model you're leaning towards, which model you want to get to, because without a clear definition of which model you're operating under, it's going to be very hard to be a high-performing SRE organization.

00:15:50

But I think the interesting thing about this idea of "keep the site up" and "keep the developers moving as fast as possible" is that we like to talk a lot in DevOps about empathy. This is a model that was proposed and advanced by teams at Stanford and UCLA: a three-component model of empathy. It's a little more advanced than our dictionary definition of empathy, where we just feel what other people are feeling. There is that in this model; it's called experience sharing, and there are a bunch of other names for it. If you're really interested in this DevOps-and-empathy stuff, I'll put a link in the Slack room to my DevOpsDays Vancouver talk about the cognitive neuroscience of empathy.

00:16:41

You're a DevOps natural. But we're going to touch on this a little bit in this talk. So we already talked about this idea of experience sharing. There's this other idea of mentalizing, which is about recognizing a mind in other people; I'm not going to dig into it, but it's super fascinating. The part I want to focus on here is the part on the bottom: prosocial concern. The interesting thing about prosocial concern is that it's the feeling we get when we recognize that somebody else is in a situation where we have the ability to help out. That is prosocial concern, and it motivates us to want to go and help those people who are in that situation. And where does this often come into play outside of operations? Obviously with healthcare workers, right?

00:17:29

If a nurse is working in palliative care or something like that, you don't want to feel depressed at the end of every day because of what you're seeing; you'd be a wreck. So what we try to work with healthcare workers on is this idea of prosocial concern, this idea that you have the ability to help, and that ability to help is the best thing you can do. If you rely on that part of empathy, then you are not a wreck; in fact, you are actually empowered to go help other people. And this is, I think, the most important part for operations teams to take into the work they do when they're working with developers. So we said this idea of empathy includes prosocial concern, and we also talked just a few slides ago about keeping the site up and keeping the developers moving as fast as possible.

00:18:26

And so I will assert that operations teams who really care about the customer and about making sure the customer has a good experience, which obviously has great effects for the business, are exercising their prosocial concern to make sure the customers have the best experience possible. And if the customers are not having a good experience because the site is down or whatever, then they are exercising their ability to help and get that site back up. The other part of this is keeping the developers moving as fast as possible. I think that operations teams that are really concerned about that, that know they have the ability to help developers and don't leave developers waiting around for things or having to stop their work because they're waiting for operations, are exercising their prosocial concern to keep the developers moving as fast as possible.

00:19:24

So empathy is a good thing, right? And having those developers sitting around is why I do not like ticketing systems. When I talk about ticketing systems, I want to make clear I'm not talking about agile boards; I'm not talking about Kanban or Scrum. I'm talking about those ticketing systems where, if I want a new instance, or a new SQS queue, or whatever the equivalent is in Azure, I have to open up a ticket and wait for operations to do it. That's not keeping the developers moving as fast as possible. And there are a number of problems with ticketing systems. Number one: in Lean, they're waste, right? This is where work goes to wait. It's a handoff, and we don't want to create handoffs in the flow of work through the system.

00:20:13

That's not DevOps, right? This is the DevOps Enterprise Summit. The second thing is that they're exceptions. If I'm writing code and my code doesn't know what to do at some point, it needs human help: it throws an exception. It says, I cannot continue to work, I am stuck until a human comes and helps me. A ticketing system is a way of codifying that type of work. We're basically saying, we need to get something done, we're stuck, and I need a human to come and help me. And we don't want humans to have to come and help us, right? You've heard about this a lot: as operations teams we need to be empowering, and we need to do things that enable work to flow through the system, but in a self-service manner. We'll talk about that more in a second. And the last thing I don't like about ticketing systems, and why you should burn your ticketing system to the ground, is that they are tracking toil.

00:21:11

That's what they're doing: tracking toil. When I say toil, I'm talking about it from the SRE perspective. If you've ever read the Google SRE book, this should be familiar to you. In one of the chapters, Vivek Rau says toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as the service grows. We don't want manual, repetitive work; we want self-service. We want to empower developers to be able to do things themselves, in a very safe, guardrailed manner. You've heard Damon Edwards talk about how operations provides the platform, right? And when we were giving the developers the ability to create that system that replaced most of Cassandra, the one that saved 70% in costs on top of the 45% reduction in the COGS of running the Cassandra service, we gave them the ability to launch nodes themselves so they could test things out.
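
To show what "self-service with guardrails" can look like in practice, here is a minimal sketch using boto3. This is not the platform described in the talk; the allowed instance types, AMI ID, and tags are hypothetical.

```python
# A minimal sketch of "self-service with guardrails" using boto3 -- not the actual
# platform from the talk. The allowed instance types, AMI ID, and tags are
# hypothetical; credentials and region come from the normal AWS environment.
import boto3

ALLOWED_TYPES = {"m5.large", "m5.xlarge"}  # guardrail: only right-sized choices

def launch_test_node(owner: str, instance_type: str = "m5.large") -> str:
    """Let a developer launch their own test node without opening a ticket."""
    if instance_type not in ALLOWED_TYPES:
        raise ValueError(f"{instance_type} is not an approved instance type")
    ec2 = boto3.client("ec2")
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # hypothetical hardened base image
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [
                {"Key": "owner", "Value": owner},
                {"Key": "purpose", "Value": "test"},
            ],
        }],
    )
    return resp["Instances"][0]["InstanceId"]
```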

00:22:16

And I actually have a problem with Vivek's quote here, because I don't believe that toil scales linearly as a service grows. I think that toil scales superlinearly, because there are all kinds of coordination costs and other things involved. And the problem is, as it scales superlinearly, we're actually going to get worse the more success the business has. What do I mean by that? Well, if the business is more successful and we are generating more toil because toil scales as the service grows, then the service is growing, the business is doing better, and the toil is growing faster than linearly. The business is actually going to slow down the more success it has, and we don't want that. So we do not want to operate in this ticketing manner that is tracking toil.

00:23:09

And if you really want to dig into this idea of toil and the ticketing system, I wrote an article for TechBeacon in the run-up to the DevOps Enterprise Summit about this specifically; definitely go check it out, and we can put the link in Slack as well. So one problem here is that we've got this toil work, which we treat as a special kind of thing: open up a ticket and get this thing done. But there's another problem I've seen, where people treat work as special, and that is remediation work. This is sort of the opposite of toil, where we elevate remediation work to be some special kind of work.

00:23:59

The problem with this is that failure is not exceptional; failure is the result of regular work. Jessica DeVita has some great stuff to say about this, and you should definitely go follow her on Twitter if you want to know more about that kind of thing. Allspaw teaches us that work is not linear, and you can't just action-item your way out of a problem. So if we take these action items and turn them into some special kind of thing, that's a problem. What I've seen organizations do is say, hey, we're going to track our regular work over here, but remediation work, well, we don't want to have an outage again, so we're going to track that in some other special way. And that's not okay. Remediation work is just work.

00:24:43

And we know about cognitive biases, and we know that recency bias is not a great way to do prioritization. So what can we do about this work? Well, we treat it like regular work, right? People have seen Dr. Kersten's presentation from a few years ago at the DevOps Enterprise Summit, "From Project to Product." Dr. Kersten says there are four types of flow items: features, risks, debts, and defects, and remediation work is just part of those flow items. Maybe it's risk, maybe it's debt, maybe it's a defect; it obviously depends on the situation. But remediation work is just work, and we shouldn't treat it any differently. And Dr. Kersten says that in different parts of the life cycle, different flow items are going to be highlighted: maybe at one time in the year we're going to work more on risk, and at another time we're going to work more on features. That's how we have to prioritize remediation work: we have to fit it into what the business needs at the time.

00:25:51

And this is why frontline managers are so important. They need to work with the engineers and with the business and understand what proportion of these things we want to allocate, so that the remediation work does get done, but gets done within the context of what the business is trying to accomplish. And the only way the business can get those kinds of things communicated is if there is alignment. A lot of times when I'm working with companies, I see organizational debt that manifests as tech debt, right? Managers need to understand what the business priorities are, but the organization also needs to be structured in such a way that, number one, that stuff can be communicated, and two, it flows correctly. Anyone who's ever heard about DevOps knows silos are bad.

00:26:44

Silos are bad, but this is why silos are bad: they do not allow us to have organizational alignment. And what winds up happening when we don't have this alignment is that it manifests itself as tech debt. Tech debt is not inherently an awful thing, but it is an awful thing if it comes out of the fact that we don't have our organization structured in the right way and that then manifests itself as tech debt. If I have two competing silos, and they're each doing the things that are best for their own silo, a lot of times that looks like tech debt. So what I advise people is that tech debt should be a conscious choice. We can have tech debt where we say, for the next six months we're going to accept this problem, and that's okay, because at the end of it we're going to be able to wipe out this entire class of problem.

00:27:35

But what we don't want is organizational debt manifesting itself as tech debt; that's a problem. So how do we get this alignment? Obviously, that's going to be the next question, right? We can do this through various methods: OKRs, or V2MOM, which is what we used to use at Salesforce. But the idea is that leadership communicates what is important. What is the most important thing? The second most important thing? What are the things we're going to be looking at? What are the things we're going to be trying to accomplish? And then the people successively below them in the hierarchy, hopefully operating in a very Westrum-style generative manner, are going to align with those things. This is why I always say it's important to have operations and engineering under the same leader, because that allows you to have this strong alignment.

00:28:28

And if you have those individual silos, then it's going to manifest itself in all kinds of ways, and tech debt is one of them, as a result of your organizational debt. People ask me, okay, Dave, this sounds great; you want to put this group under this leader or that group under that one, but where do you actually draw the line between development and operations? And my answer is: it's a fuzzy line. It's a fuzzy line on purpose; that is by design. Because that place in the middle where it's fuzzy, where those two things overlap, that's where the DevOps happens. That's where the DevOps is, because our customers don't care, right? They don't care if it's a development problem or an operations problem, or whether the ops people don't like this or the development

00:29:16

folks don't do that, or whatever. They don't care about any of those internal things; that's not a concern for them. What they want is a high-quality product delivered to them, and where the overlap is between development and operations, that's where the DevOps happens, because there's only one product, right? By having those teams overlap and work together and try to deliver the best possible thing they can, that's how we're going to get the best results. That's where we're going to be able to achieve a 70% cost reduction. That's where we're going to be able to reduce our recovery time from days down to minutes. And that's a really powerful thing. That's when operations is enabling the business: they're keeping the site up and they're keeping the developers moving as quickly as possible. That's when they become strategic differentiators. Please DM me, please talk to me in Slack, please communicate. I'll be around for the entire conference, and I will be in the Slack even afterwards. Thanks very much.