The Business Benefits of GitOps (Europe 2021)

As the Chief Technology Officer at Weaveworks, Cornelia Davis is responsible for the company’s technology strategy, inclusive of open source projects, commercial products and services offerings. She is driven by the desire to help enterprises transform their business through the leverage of cloud platforms. She cut her teeth in the space of modern application platforms at Pivotal, where she was on teams that brought Pivotal Cloud Foundry (Pivotal’s PaaS) and Pivotal Container Service (Pivotal’s Kubernetes Service) to market. Cornelia currently serves on the Technical Oversight Committee of the Cloud Native Computing Foundation. She is the author of the book Cloud Native Patterns: Designing Change-tolerant Software. An industry veteran with almost three decades of experience in image processing, scientific visualization, distributed systems, web application architectures, and cloud-native platforms, Cornelia holds B.S. and M.S. degrees in Computer Science from California State University, Northridge, and further studied theory of computing and programming languages at Indiana University.

breakoutlondoneurope2021

Cornelia Davis

Chief Technology Officer, Weaveworks

TRANSCRIPT

00:00:14

Hello, my name is Cornelia Davis, and thank you for joining me here today to talk about the business benefits of GitOps. Before jumping into the content, let me tell you a little bit more about myself. I've been in this industry for quite some time, about 30 years, maybe even a little bit more. I've always been a developer. I didn't come from the operations side, but it says that I wasn't ops, and that is in the past tense. The reason that I now consider myself somewhat proficient in ops is that I've spent the last 10 years or so working on developer platforms, initially on Cloud Foundry and then later on Kubernetes. All told, I've been working in the Kubernetes space for about five years, which doesn't make me the longest veteran, but it doesn't make me a newbie either. I've been working in web architectures and indeed cloud native.

00:01:01

And what I mean by cloud native is this world where the environment that we're running our software in is constantly changing and highly distributed. I've been working in that space for nearly a decade, even though we didn't call it cloud native at the time. I've been part of the DevOps Enterprise Summit programming committee for quite a number of years, and for the last year and a half or so I have been the CTO at Weaveworks. Now, at Weaveworks we kind of consider ourselves the GitOps company, which is why I'm talking to you about GitOps today. So I've been spending quite a bit of time in this space. I'm also, as you can see on the slide, the author of a book called Cloud Native Patterns, which is targeted at the application developer and architect and teaches the software architecture patterns that support software running well in this highly distributed, constantly changing world.

00:01:50

That is the cloud. So why am I talking about cloud native? Well, I believe that fundamentally GitOps is the thing that takes cloud native all the way to operations. To drive this point home, I'd like to start with our very own Jonathan Smart. Jonathan Smart, as many of you know, has been part of this DevOps Enterprise Summit community for quite a number of years. He's given some great talks at DevOps Enterprise, including some of the funniest and most wonderful lightning talks. I encourage you to look them up online. They're really fantastic. This is a picture of him giving a talk quite a number of years ago, when he was still at Barclays Bank. This picture is interesting in a number of different ways, but the real point that I want to tee off of is the main statement that he makes across the top of the slide, which says, "We are so freaking agile."

00:02:46

Yay. Now, he builds the slide up, and I'm not going to spend as much time on it, but the point that he's driving home here is that we've gotten really very good in the industry at applying agile practices, short feedback loops, short cycles and things like that to the early stages of the development process. So dev is really great. But then, when you look at the whole broad spectrum of things, you look at the left side, which is all of the stuff that happens before we go into development, and you'll notice that it has a very different cadence. It has an annual, quarterly and monthly cadence. I'm not going to talk about that side. Where we're really going to talk is on the right side of dev, which is when we ask: after I've done the development, what are the things that need to happen to get this thing all the way out into production?

00:03:38

And there again, you can see that there is a cadence of monthly and quarterly. So imagine you have software ready to test against customer expectations. You want to get some feedback, and you've got it ready to go, but it takes you an entire quarter before you can get it in front of customers and get the feedback. Well, GitOps is something that can help you shorten that cycle significantly, and that is ultimately the major business value. So before I jump into the business values, let me explain a little bit about what I mean by GitOps. What is it? What we're talking about here, as I just teed up, is running software in production. So I've got some type of runtime environment that I'm going to run that software in. Now, that runtime environment needn't be just Kubernetes. It could be any type of runtime environment.

00:04:37

I'm using Kubernetes here just as an example, because many of you are using Kubernetes or are planning on using Kubernetes in production. And in fact, it gives us a leg up if we were to look at things under the covers, but there is that asterisk that says the runtime environment could really be anything. And then, of course, we've got some human beings who are responsible for managing that application in production. Now, those individuals are increasingly the application team, and I'll talk more about that in just a moment. Nevertheless, I've got some humans who are ultimately accountable and responsible for getting this thing running and keeping it running while in production. Of course, you can see there's a whole bunch of white space on this slide. So I'm going to fill it in with the techniques and mechanisms that we believe are making it more and more effective for these human beings to manage these applications in production.

00:05:34

So what's the first thing? Well, as you can imagine, since I'm talking about GitOps, the first thing I want to start talking about is Git. Now, with Git, of course, many of you are probably thinking, "Well, I've been using Git in my infrastructure as code for some time. I'm storing things in Git. Does that mean I'm doing GitOps?" Well, I'm hoping to explain to you that in fact you might not be. There are a few other essential elements to this. Now, the first element that you see here is that yes, we are storing some of the operational artifacts in Git. And then, and this is key point number one, we are using Git and the Git applications, things like GitHub, GitLab and Bitbucket, as the interface for operations.

00:06:30

This is really important because it drives home the point that I am not just storing things in Git, then using something else to exercise those operational tasks, then using yet another interface to do auditing against that, and another interface to do this or that. I am using this as the interface for operations. That's really very significant. No longer am I using the vSphere console or the AWS console; I am using Git as the interface for operations. So point number one is that it's more than just storing things in Git. I'm actually using it as the interface for operations. Now, of course, when I do things in Git, I want there to be as much automation as possible, and that's the second element. I know some of you are thinking, "Well, I've already got triggers, so that when I make a change to one of my scripts in Git, it automatically reruns that script for me." You're of course familiar with automation in CI processes.

00:07:37

That is the ability for your developers to check in their source code and have their Jenkins pipeline automatically run, or their Tekton pipeline automatically run. But there's a little icon showing right next to those gears. The gears represent the automation, and the icon, which has a little circular arrow, points to the second element that is really important about GitOps: that automation is convergent. One of the things that is super critical in this cloud native space is, remember, that it's a constantly changing environment, and those changes can be coming from a number of different vectors. What this automation is designed to do is to constantly adapt to those constant changes. Now, there are a couple of other words on the slide already that give you a hint as to that convergent automation.

00:08:32

And that is when we say that in the runtime environment there's an actual state of the system, and what we're storing in Git is a declarative desired state of that runtime environment. That convergent automation is in the business of making sure that those two things are aligned. Now, so far you see the arrow going from left to right, only in one direction. So if I change my desired state, then that automation is going to converge the runtime environment to be in alignment with the change I just made in Git. But there's an equally important thing, which is to recognize that sometimes changes happen in the runtime system. They might be intentional, they might be automated. There might be a break-glass scenario, for example, where you have to fix something as rapidly as possible and you're not sure exactly what it is.

00:09:27

So you have to try a few things directly in the runtime environment. Well, we want to automate that feedback loop back in, so that if I make a change in that environment, I don't end up with a mismatch between the actual state and the desired state that is in Git. As you'll see when we get into the business values, having Git actually represent what is running in production is one of the key enablers of some of these business values. So to sum that up, I want to emphasize again that these convergent loops are really central to GitOps. I won't go into the details here, but you'll see in a moment that there's a series of loops. There are convergent loops around continuous delivery as well as operations. Remember, I said that GitOps is about taking cloud native all the way to operations.
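The convergent loop described above can be sketched in a few lines. This is a minimal illustration, not how any particular GitOps tool is implemented: the `State` mapping, the resource names, and the `diff` and `reconcile` functions are all hypothetical, and a real reconciler would run continuously against a live runtime API rather than against in-memory dictionaries.

```python
from typing import Dict, List, Tuple

# Hypothetical model: resource name -> declared version of that resource.
State = Dict[str, str]


def diff(desired: State, actual: State) -> List[Tuple[str, str, str]]:
    """Return the actions needed to make `actual` match `desired`."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name, spec))
        elif actual[name] != spec:
            actions.append(("update", name, spec))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name, ""))
    return actions


def reconcile(desired: State, actual: State) -> State:
    """One pass of the loop: apply every action and return the new actual state."""
    converged = dict(actual)
    for action, name, spec in diff(desired, actual):
        if action == "delete":
            del converged[name]
        else:
            converged[name] = spec
    return converged


# Desired state as it might be declared in Git (illustrative values).
desired = {"web": "shop:1.4", "worker": "jobs:2.0"}
# Actual state has drifted: an old version and a manually added resource.
actual = {"web": "shop:1.3", "debug-pod": "busybox"}

actual = reconcile(desired, actual)
print(diff(desired, actual))  # no remaining actions once the states are aligned
```

Running the same pass on a schedule, or whenever either side changes, is what makes the automation convergent rather than a one-shot trigger.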

00:10:23

It's not cloud native delivery, it's cloud native operations, inclusive of delivery and ops. So, okay, that's fine. If you feel like I've been giving you some techno-speak, in a way I have, but I wanted you to understand some of these key principles, because it's only when you have those key principles that you get some of the business benefits that I'm talking about. I wanted you to have familiarity with those, and we're going to come back to some of them as we go along. But really, in order to talk about business benefits, we have to pose the question of what you are really trying to do. You're not just trying to do GitOps because it's the center square on the buzzword bingo card, right? Okay, maybe some of your folks are, but there's real business benefit.

00:11:10

And that's what I want to talk about. What we're really trying to do here, to bring it down to brass tacks, is get better at doing software. We just want to deliver more value through our software to our customers, more rapidly, more efficiently and more resiliently. And of course it would be great if we did so in a cost-efficient way as well. Now, how do we measure whether we're getting any better at doing software? Well, this is not work that I need to do. This is work that has been done really well and has been talked about at the DevOps Enterprise Summit many, many times, so I'm quite certain that the vast majority of you are familiar with it. This is the work of the DevOps Research and Assessment (DORA) organization. This of course is the organization that was formed by Nicole

00:12:02

Forsgren, Jez Humble and our very own Gene Kim. This is a chart that comes out of the State of DevOps Report from 2019, and it is also reflected in some of the details in the Accelerate book, but it really comes down to these four metrics. I'm going to draw us back to those four metrics over and over again throughout this presentation. For those of you who aren't familiar, in a nutshell, what this work does is establish a correlation between these four measurable things in IT, and we'll talk about what they are in just a moment, and the performance of a business based on business metrics. Now, those business metrics are, of course, the things that you would expect. Are you gaining market share? Are you profitable? Do your customers like you? Do you have a good net promoter score? The highest, most elite performers do things along these four axes in a particular way, and the lowest performers are the ones that are at risk of going out of business completely, of getting Blockbustered, right?

00:13:08

Those are the ones that are losing market share, have unhappy customers and so on. And it's really extraordinary. You can see the range of these different practices. So I'm going to start at the top. The first two are a little bit closer to the developers themselves. The first has to do with deployment frequency. It says: all right, I have an application. How frequently can I deploy it? Can I deploy it once a day or more than once a day, or am I deploying it every six months? You can see the range there. The lowest performers are not deploying very frequently. The highest performers are deploying very frequently. Similarly, lead time for changes: I've got that code ready to go out into production. How long does it take me to get there? For the highest performers, less than a day. For the lowest performers, it's what we just saw on Jonathan Smart's slide: months, quarters.

00:14:02

Then, as we move down into the blue category, we start talking a little bit more about not the software itself, but the stability of the software as it's running. So there's the notion of: if something goes wrong, how long does it take for me to recover? That's mean time to recovery. And then there's also the change failure rate. I think it's been shown quite a bit that a lot of failures come from changes, which is why we have change control bodies and things like that. So our aim here is to reduce both the mean time to recovery and the change failure rate. We want to get better. And as we reduce that change failure rate, we of course are going to feel more confident deploying more frequently. So how does GitOps support these things? Well, I'm going to start by focusing on the top three.
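To make the four metrics concrete, here is a rough sketch of how they could be computed from a list of deployment records. The `Deployment` record, its field names, and the sample data are assumptions made for illustration; they are not taken from the State of DevOps Report, which gathers these measures through surveys rather than from code like this.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Deployment:
    """Hypothetical record of one change going to production."""
    committed_at: datetime                  # when the change was ready to ship
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def four_key_metrics(deploys: List[Deployment], period_days: int) -> dict:
    """Compute the four metrics over a non-empty list of deployments."""
    if not deploys or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    failures = [d for d in deploys if d.failed]
    recoveries = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployment_frequency_per_day": len(deploys) / period_days,
        "mean_lead_time": sum(lead_times, timedelta()) / len(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_recovery": (
            sum(recoveries, timedelta()) / len(recoveries) if recoveries else None
        ),
    }


# Illustrative data: two deployments in two days, one of which failed.
t0 = datetime(2021, 5, 1, 9, 0)
sample = [
    Deployment(committed_at=t0, deployed_at=t0 + timedelta(hours=2)),
    Deployment(
        committed_at=t0 + timedelta(days=1),
        deployed_at=t0 + timedelta(days=1, hours=4),
        failed=True,
        restored_at=t0 + timedelta(days=1, hours=5),
    ),
]
metrics = four_key_metrics(sample, period_days=2)
print(metrics)
```

The first two values correspond to the throughput metrics at the top of the chart, and the last two to the stability metrics in the lower part.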

00:14:50

I'm going to really focus on that developer. They're the ones who are getting that code ready for deployment, and again, they want to be able to do that frequently and to shorten that timeline. So let's talk about that developer, or rather, let's actually call them the DevOps team. They're responsible for creating the software and bringing it to production. I know, I know, that "bringing it to production" part gets a little hairy, and we'll get to that in just a moment. But one of the first things that GitOps does to support the developer is that it allows them to use familiar tooling. So rather than introducing yet another tool for continuous delivery and yet another tool for some of the key operational characteristics, the first thing that we're going to do is say: you know and love Git. You've used it so effectively for the earlier parts of your software development life cycle, and you've built all sorts of automation around it.

00:15:52

What you see here is a screenshot that has all sorts of tags, and when you apply some of these tags, various automation kicks off. You also have the ability to have very collaborative cycles in there. So why not take those practices that we've been applying to source code evolution and also apply them to configuration evolution? That makes developers happy and it makes them efficient. So that's the first thing that GitOps does: it allows developers and DevOps teams to use familiar tools to get their job done. Now, the second thing is that we want to enable those teams, who again are responsible for operating their apps in production. We want to give them self-service capabilities. Now, what do I mean by self-service? I do not mean self-service infrastructure. I am not saying give them the ability to get their own infrastructure and then take on the burden of managing that infrastructure.

00:16:51

What I'm talking about here is self-service operations: let them operate their applications themselves. Remember, I emphasized that Git and the GitOps process are the interface to operations. So let's give them an interface to operations, and the capabilities and the access control to be able to do those operational things from within their Git environment. Now, about that self-service, I can hear many of your thoughts: "Oh, wait a minute. I have something to say about that, because as an enterprise I am also responsible for maintaining security, compliance, resilience and cost management. I have all of these enterprise concerns, and that's what's kept me from giving self-service access to these DevOps teams." Well, consider that emphasis I made a moment ago: don't make it self-service infra, make it self-service ops, make it self-service GitOps.

00:18:00

That is the difference between what we've done in the past and what we're doing now with platforms, and with GitOps-centric platforms. So again, I'm going to emphasize a few of those things. I'm going to emphasize the security and compliance concerns as well as the resilience concerns. So let's see how these things work. Now remember, I'm still talking about the two green bars at the top, which are about shortening the lead time to production and also increasing deployment frequency. So let's look first a little bit at security, and let's talk a little bit about the software supply chain. We've of course seen just recently, with the SolarWinds attack, that security around the software supply chain is absolutely critical. Now, one of the most common ways that CD is implemented today is at the tail end of CI. There are CD solutions that act as centralized solutions, so that when I'm ready to go out to dev, and then to staging, and then to prod, my CD server is going to push those things into the various environments.

00:19:08

And there's going to be an eventing system and an approval system and all of that stuff. Now, the challenge with this from a security perspective is that the centralized CD system provides an attack surface that puts all of these environments at risk, most notably your production environments. If your CD system is compromised, it's probably compromised for all of your applications and across all of your different stages. So one of the first things that we do with GitOps is turn that arrow around. We say: rather than having a centralized system pushing out to these environments, let's put the control for the delivery operation and the continuous operations into the cluster, into the target environment itself. And notice that little icon. It's the same icon that I had next to the automation. It's that reconciliation loop. So that's how we can turn it around.

00:20:07

And so now, if your dev environment is compromised, you aren't compromising your staging or your production environment. If you, God forbid, get your production environment for one application compromised, your other production environments are not compromised, because they're each doing their own pulling and they each have their own access control settings around that. So this is just one example. I'm not suggesting that every security concern is addressed by this, but it gives you an example of how GitOps supports at least one, and I can tell you that there are more security concerns that are addressed by GitOps. So what we're doing there, of course, is we're pulling, and, oh, there's my animation that shows that this is inherently more secure. Now, that's an example of security, but I also talked about compliance. Again, the notion here is that if we can bake security into the environments, then we're able to give a little bit more access to that DevOps team and allow them to self-serve deployments out into production, because some of the security concerns are automatically addressed as a part of the working procedures in the tooling that we have in place.
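The difference between the push and pull arrangements can be illustrated by modelling which component holds the credentials to change which environment. This is a deliberately simplified sketch: the environment names, the `cd-server` and `agent@...` identifiers, and the `blast_radius` function are hypothetical, and real deployments involve many more kinds of credentials than this model captures.

```python
from typing import Dict, List, Set

# Hypothetical environment names used only for this illustration.
ENVIRONMENTS: List[str] = ["dev", "staging", "prod-app-a", "prod-app-b"]


def push_model_credentials() -> Dict[str, Set[str]]:
    """A central CD server holds deploy credentials for every environment."""
    return {"cd-server": set(ENVIRONMENTS)}


def pull_model_credentials() -> Dict[str, Set[str]]:
    """Each environment runs its own agent, which can change only itself.

    The agent needs read access to the Git repository, not write access
    to any other environment.
    """
    return {f"agent@{env}": {env} for env in ENVIRONMENTS}


def blast_radius(credentials: Dict[str, Set[str]], compromised: str) -> Set[str]:
    """Environments an attacker can change after compromising one component."""
    return credentials.get(compromised, set())


push = push_model_credentials()
pull = pull_model_credentials()

print(sorted(blast_radius(push, "cd-server")))  # every environment is exposed
print(sorted(blast_radius(pull, "agent@dev")))  # only the dev environment is exposed
```

In this model, compromising the single push component exposes all environments, while compromising one pull agent exposes only the environment it runs in.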

00:21:26

Now, the second element that I talked about is compliance. Compliance, to a large extent, is about your auditors being able to go in and say, "Who did that? Who deployed this into production, and when?" Now, what I'm showing you here doesn't just support compliance. It actually supports operations as well, things like root cause analysis. But in this particular case, let's talk about compliance, and this is something that you get for free. If Git is the interface for operations, that means that every operational thing that you do is recorded in that Git repository. Now, what Git gives you, if you properly configure it, is immutability. You create the right access control settings in it so that only the right people can commit things into the various Git repositories, and all of that is inherently recorded in an immutable, versioned repository.

00:22:27

So what you see here in the black boxes is every single change that was made. You see who made it, and it can't be tampered with. These SHAs are a fingerprint of what the entire state of the system looked like when the change was applied. It's a fingerprint, so it cannot be tampered with, and this makes auditors very happy. So coming back here, you can see how we can give self-service access to those DevOps teams, because we're starting to bake security and compliance in via these GitOps processes and GitOps tooling. That makes it more and more possible to apply this self-service environment. Now, there are other parts of self-service. One objection is: I can't allow my DevOps teams to deploy into production, because I'm worried that it might cause some rippling effects in my production system.
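The fingerprint idea can be sketched with a simple hash chain. This is a simplification for illustration and is not Git's actual object format: here each hypothetical commit ID is a SHA-256 hash over the parent ID, the author, and the content, which is enough to show why changing any earlier entry changes every later fingerprint.

```python
import hashlib
from typing import List, Optional, Tuple


def commit_id(parent: Optional[str], author: str, content: str) -> str:
    """Hash the parent ID, author, and content into one fingerprint."""
    payload = "\n".join([parent or "", author, content]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def build_history(changes: List[Tuple[str, str]]) -> List[str]:
    """Return the chain of commit IDs for a list of (author, content) changes."""
    ids: List[str] = []
    parent: Optional[str] = None
    for author, content in changes:
        parent = commit_id(parent, author, content)
        ids.append(parent)
    return ids


# Illustrative history: two configuration changes by two people.
original = build_history([("alice", "replicas: 2"), ("bob", "replicas: 3")])

# Rewriting who made the first change alters that ID and every ID after it.
tampered = build_history([("mallory", "replicas: 2"), ("bob", "replicas: 3")])

print(original[-1] == tampered[-1])  # the latest fingerprints no longer match
```

An auditor who has recorded the latest fingerprint can therefore detect any rewrite of the earlier history, which is the property the speaker is pointing at.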

00:23:37

And that's where your change control body comes in. They're ultimately responsible for making sure that nothing bad happens in production. So what I'm talking about here is resilience. Now, how does GitOps support resilience? This starts to get us down into the lower parts of this chart, and I'm going to take them one by one. I first want to talk about how GitOps supports lowering the mean time to recovery, and the first place that I'm going to start is Git semantics. Remember, I said that Git in this chart, this dark box, is not just about auditing. It's not just about supporting compliance. It also has these other elements. Again, we've got everything versioned here. So what does versioning have to do with resilience? Well, you take that versioning and you bring it together with what I have on the right side of the slide here, which is to point out that every single one of these points in the Git history is a complete representation of the state of the system.

00:24:48

What that means is that I have version markers for every single state of the system. Say I start to roll something out in production and it looks fine for a little while, but then tomorrow, when everybody shows up at work, there's a spike of traffic and all of a sudden things start going a bit haywire. I can always say, "Oh, go back to the last version." Because I have a complete representation of the system, and I have automation that is all about convergence, once I've reverted my desired state, that automation automatically kicks in and brings the actual state into alignment with the version that we ran yesterday. That is a huge enabler of resilience. Now, if you've got that resilience, you can imagine that it has rippling effects back to the developer things I talked about earlier.
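Rolling back by reverting the desired state can be sketched as follows. The `ConfigRepo` class and `converge` function are hypothetical stand-ins for a Git repository and a reconciler; the point is only that the rollback is an ordinary new entry in the history, and the convergent automation does the rest.

```python
from typing import Dict, List

# Hypothetical model: resource name -> declared version.
State = Dict[str, str]


class ConfigRepo:
    """Toy stand-in for a Git repository holding complete desired states."""

    def __init__(self, initial: State) -> None:
        self.history: List[State] = [dict(initial)]

    def commit(self, state: State) -> None:
        self.history.append(dict(state))

    def head(self) -> State:
        return dict(self.history[-1])

    def revert_last(self) -> None:
        """Add a new commit that restores the previous state.

        History is appended to, never rewritten, so the rollback itself
        remains visible to anyone reading the log.
        """
        if len(self.history) < 2:
            raise ValueError("nothing to revert")
        self.commit(self.history[-2])


def converge(repo: ConfigRepo, actual: State) -> None:
    """Stand-in for the reconciler: make the runtime match the repo head."""
    actual.clear()
    actual.update(repo.head())


repo = ConfigRepo({"web": "shop:1.3"})
runtime: State = {}
converge(repo, runtime)

repo.commit({"web": "shop:1.4"})   # today's rollout
converge(repo, runtime)

repo.revert_last()                 # things go haywire: go back to yesterday
converge(repo, runtime)
print(runtime, len(repo.history))
```

Because every entry is a complete representation of the system, the rollback needs no knowledge of what changed between the two versions.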

00:25:48

It gives us more confidence to be able to release code more frequently, because I know that if something does go wrong, I can very quickly revert to a previous state. So these effects really ripple across the board when we start doing GitOps. Now, there's another element around resilience, and it builds on that practice I just talked about, where I can just go ahead and revert to the previous thing. Another scenario is, let's say, one region goes down and I quickly want to stand things up in a different region. I can use that same practice. I just point to my Git repository, and then the reconcilers will make it so. They'll stand up the new system, and nothing else is required. That only works if I haven't had drift in my environment, if I'm guaranteed that what's expressed in the Git repository is in fact exactly what is running in production.

00:26:54

Now, how might you have drift? Well, I call this the modern-day equivalent of SSH-ing into a box, making a change, and then forgetting to record that change somewhere else: I could do a kubectl apply. Well, remember that the cycle runs in both directions, so we can do a number of things. First of all, we recognize that there's been drift. And then, once that drift has happened, we can either revert it, notify on it, or record that change. That is something that we can express in policies. Now, I want to come slowly to a close here and go over the very last thing, which is to address this change failure rate. We want our changes to result in errors in the production system less and less frequently. Well, I'm going to go back to Git. One of the things that we do as a final gate before we go into production is a review, a change review board.
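The three responses to drift mentioned above (revert it, notify on it, or record it) can be sketched as a policy. The `DriftPolicy` enum and the `detect_drift` and `handle_drift` functions are hypothetical names chosen for this illustration and do not correspond to the configuration of any specific tool.

```python
from enum import Enum
from typing import Dict, List, Tuple

# Hypothetical model: resource name -> declared version.
State = Dict[str, str]


class DriftPolicy(Enum):
    REVERT = "revert"   # put the runtime back to what Git says
    NOTIFY = "notify"   # leave both sides alone and alert a human
    RECORD = "record"   # write the runtime change back into Git


def detect_drift(desired: State, actual: State) -> Dict[str, Tuple[str, str]]:
    """Map each drifted resource to its (desired, actual) values."""
    names = set(desired) | set(actual)
    return {
        n: (desired.get(n, "<absent>"), actual.get(n, "<absent>"))
        for n in names
        if desired.get(n) != actual.get(n)
    }


def handle_drift(
    policy: DriftPolicy, desired: State, actual: State, notifications: List[str]
) -> Tuple[State, State]:
    """Return the new (desired, actual) pair after applying the policy."""
    drift = detect_drift(desired, actual)
    if not drift:
        return dict(desired), dict(actual)
    if policy is DriftPolicy.REVERT:
        return dict(desired), dict(desired)
    if policy is DriftPolicy.RECORD:
        return dict(actual), dict(actual)
    notifications.append(f"drift detected in: {sorted(drift)}")
    return dict(desired), dict(actual)


git_state = {"web": "shop:1.4"}
cluster_state = {"web": "shop:1.4-hotfix"}  # e.g. a manual change in the cluster
alerts: List[str] = []

print(handle_drift(DriftPolicy.REVERT, git_state, cluster_state, alerts))
print(handle_drift(DriftPolicy.RECORD, git_state, cluster_state, alerts))
print(handle_drift(DriftPolicy.NOTIFY, git_state, cluster_state, alerts), alerts)
```

Which policy is appropriate is a per-environment decision; a break-glass fix might be recorded, while an unexplained change in production might be reverted.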

00:27:56

Well, what we've done here is we've taken that change review board and we've shifted it left. We've said: you know what? With GitOps, Git is a wonderful place where we can bring together a group of individuals to approve changes to production, with lots and lots of eyeballs on them. You're going to crowdsource that, so by its very nature it's going to reduce the frequency of disruptive changes. There's also a second element. In this picture, what we saw earlier is that the runtime environment is drawing changes from the Git repository. Well, without getting too techie, there are really two elements to that. There is a delivery element, so it's drawing those changes from Git into the environment, and it's drawing them into this internal store that Kubernetes has, and that is etcd.

00:28:54

So don't worry about that techno-speak, but then there's also a convergent loop on the right side of that, which says: okay, once I've brought that configuration into my internal state store in Kubernetes, how do I actually instantiate the running instances of my application? That's where, again, we have reconcilers, and we can have special-purpose reconcilers that do progressive delivery. They can do it in a canary style, where we do just a little bit and, if it looks okay, we roll out more. Or we roll out A and B versions side by side to see if they're working better or worse, or we do blue-green deployments, all these different deployment strategies. It takes us back to this picture, which is that I've got convergent loops. So, summing things up now, really what it comes down to is that there are two points that I want to emphasize.
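The canary style of progressive delivery described above can be sketched as a small loop: shift a little traffic, check a health signal, then either continue or roll back. The `canary_rollout` function, the traffic steps, and the error-rate threshold are illustrative assumptions, and the health check is passed in as a plain function rather than read from a real metrics system.

```python
from typing import Callable, List, Tuple


def canary_rollout(
    steps: List[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> Tuple[int, List[int]]:
    """Shift traffic to the new version step by step.

    `steps` are percentages of traffic on the new version, `error_rate_at`
    is a hypothetical probe returning the observed error rate at a given
    percentage. Returns the final percentage and the sequence of weights tried.
    """
    history: List[int] = []
    for weight in steps:
        history.append(weight)
        if error_rate_at(weight) > max_error_rate:
            history.append(0)  # roll back: all traffic returns to the old version
            return 0, history
    return (history[-1] if history else 0), history


# A healthy release is promoted through every step.
healthy = canary_rollout([5, 25, 50, 100], lambda w: 0.01, max_error_rate=0.05)

# A release that degrades under more traffic is rolled back early.
degrading = canary_rollout(
    [5, 25, 50, 100], lambda w: 0.20 if w >= 25 else 0.01, max_error_rate=0.05
)

print(healthy)
print(degrading)
```

Because only a fraction of traffic sees the new version before each check, a bad change is caught while its impact is still small, which is how this strategy supports a lower change failure rate.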

00:29:49

One is that GitOps is a combination of continuous delivery and continuous operations. If we come back to these metrics and fill out this chart, which I'm going to quickly build out, you can see things like familiar tools, and then a whole bunch of things around self-service, where the platform team has something to say about that self-service. These are all the attributes that line up directly with those characteristics. So finally, I want to share with you a thesis. The thesis is that GitOps supports the DevOps agenda in a particularly effective manner. I hope that today I have shown you how very specific GitOps capabilities support these business metrics. And with that, I thank you for your attention, and I hope you enjoy the rest of the conference.