The Business Benefits of GitOps

As the Chief Technology Officer at Weaveworks, Cornelia Davis is responsible for the company’s technology strategy, inclusive of open source projects, commercial products and services offerings. She is driven by the desire to help enterprises transform their business through the leverage of cloud platforms. She cut her teeth in the space of modern application platforms at Pivotal, where she was on teams that brought Pivotal Cloud Foundry (Pivotal’s PaaS) and Pivotal Container Service (Pivotal’s Kubernetes Service) to market. Cornelia currently serves on the Technical Oversight Committee of the Cloud Native Computing Foundation. She is the author of the book Cloud Native Patterns: Designing Change-tolerant Software.


An industry veteran with almost three decades of experience in image processing, scientific visualization, distributed systems, web application architectures, and cloud-native platforms, Cornelia holds a B.S. and an M.S. in Computer Science from California State University, Northridge, and further studied theory of computing and programming languages at Indiana University.

Cornelia Davis

Chief Technology Officer, Weaveworks

Transcript

00:00:00

<silence>

00:00:14

Hello, my name is Cornelia Davis, and thank you for joining me here today to talk about the business benefits of GitOps. Before jumping into the content, let me tell you a little bit more about myself. I've been in this industry for quite some time, about 30 years, maybe even a little bit more. I've always been a developer. I didn't come from the operations side, but it says that I "wasn't ops"; that was in the past tense. The reason that I now consider myself somewhat proficient in ops is that I've spent the last 10 years or so working on developer platforms, initially on Cloud Foundry and then later on Kubernetes. All told, I've been working in the Kubernetes space for about five years, which doesn't make me the longest veteran, but it doesn't make me a noob either. I've been working in web architectures and, indeed, cloud native.

00:01:01

And what I mean by cloud native is this world where the environment that we're running our software in is constantly changing and highly distributed. I've been working in that space for nearly a decade, even though we didn't call it cloud native at the time. I've been part of the DevOps Enterprise Summit Programming Committee for quite a number of years, and for the last year and a half or so, I have been the CTO at Weaveworks. Now, at Weaveworks, we consider ourselves the GitOps company, which is why I'm talking to you about GitOps today. So I've been spending quite a bit of time in this space. I'm also, as you can see on the slide, the author of a book called Cloud Native Patterns, which is targeted at the application developer and architect and teaches the software architecture patterns that support software running well in this highly distributed, constantly changing world that is the cloud.

00:01:52

So why am I talking about cloud native? Well, I believe that fundamentally GitOps is the thing that takes cloud native all the way to operations. And to drive this point home, I'd like to start with our very own Jonathan Smart. Jonathan Smart, as many of you know, has been part of this DevOps Enterprise Summit community for quite a number of years. He has given some great talks at DevOps Enterprise, including some of the funniest and most wonderful lightning talks. I encourage you to look them up online. They're really fantastic. This is a picture of him giving a talk quite a number of years ago when he was still at Barclays Bank. And this picture is interesting in a number of different ways, but the real point that I wanna drive home here, that I wanna tee off of, is the main statement that he makes across the top of the slide that says, "We are so freaking agile."

00:02:46

Yay. Now, he builds this slide up, and I'm not gonna spend as much time on it, but the point that he is driving home here is that we've gotten really very good in the industry at applying these agile practices, short feedback loops, short cycles and things like that to the early stages of the development process. So dev is really great. But then when you look at the whole broad spectrum of things, you look at the left side, which is all of the stuff that happens before we go into development, and you'll notice that it has a very different cadence. It has an annual, quarterly, and monthly cadence. And I'm not gonna talk about that side. Where we're really gonna talk is on the right side of that dev, which is when we say, after I've done the development, what are the things that need to happen to get this thing all the way out into production?

00:03:38

And there again, you can see that there is a cadence of monthly and quarterly. So imagine you have software ready to test against customer expectations. You wanna get some feedback, you've got it ready to go, but it takes you an entire quarter before you can get that in front of customers and get the feedback. Well, GitOps is something that can help you shorten that cycle significantly, and that is ultimately the major business value. So before I jump into the business values, let me explain a little bit about what I mean by GitOps. What is it? So what we're talking about here, as I just teed up, is running software in production. So I've got some type of runtime environment that I'm gonna run that software in. Now, that runtime environment needn't be just Kubernetes. It could be any type of runtime environment.

00:04:37

I'm using Kubernetes here just as an example, because many of you are using Kubernetes or planning on using Kubernetes in production. And in fact, it gives us a leg up if we were to look at things under the covers. But there is that asterisk that says the runtime environment could really be anything. And then of course, we've got some human beings that are responsible for managing that application in production. Now, those individuals are increasingly the application team, and I'll talk more about that in just a moment. But nevertheless, I've got some humans that are ultimately accountable and responsible for getting this thing running and keeping it running well in production. Well, of course, you can see there's a whole bunch of white space on this slide. So I'm gonna fill that in with the techniques and the mechanisms that we believe are making it more and more effective for these human beings to manage these applications in production.

00:05:34

So what's the first thing? Well, as you can imagine, since I'm talking about GitOps, the first thing I wanna start talking about is Git. Now, with Git, of course, many of you are probably thinking, well, I've been using Git in my infrastructure as code for some time. I'm storing things in Git. Does that mean I'm doing GitOps? Well, I'm hoping to explain to you that in fact you might not be. There are a few other essential elements of this. Now, the first element that you see here is, yes, we are storing some of the operational artifacts in Git. And then, and this is key point number one, we are using Git and the Git applications, things like GitHub, GitLab, Bitbucket, as the interface for operations.

00:06:30

This is really important because it drives home the point that I am not just storing things in Git and then using something else to exercise those operational tasks, and then using yet another interface to do auditing against that, and another interface to do this or that. I am using this as the interface for operations. That's really very significant. No longer am I using the vSphere console or the AWS console; I am using Git as the interface for operations. That's really point number one: it's more than just storing things in Git. I'm actually using it as the interface for operations. Now, of course, when I do things in Git, I want there to be as much automation as possible, and that's kind of the second element. Now, I know some of you are thinking, well, I've already got triggers so that when I do things in Git and I make a change in one of my scripts, it automatically reruns that script for me.

00:07:32

You're of course familiar with automation in CI processes: the ability for your developers to check in their source code and have their Jenkins pipeline or their Tekton pipeline automatically run. But there's a little icon showing right next to those gears. The gears represent the automation, but there's an icon that has a little circular arrow, and that is pointing to the second element that is really important about GitOps. And that is that the automation is convergent. One of the things that is super critical in this cloud native space is, remember I mentioned that it's a constantly changing environment, and those changes can be coming from a number of different vectors. And what this automation is designed to do is to constantly be adapting to those constant changes. Now, there are a couple of other words on the slide already that are giving you a hint as to that convergent automation.

00:08:31

And that is when we say that in the runtime environment there's an actual state of the system, and what we're storing in Git is a declarative desired state of that runtime environment. And that convergent automation is in the business of making sure that those two things are aligned. Now, so far, you see the arrow going from left to right, only in one direction. So if I change my desired state, then that automation is gonna converge the runtime environment to be in alignment with the change I just made in Git. But there's an equally important thing, which is to recognize that sometimes changes happen in the runtime system. They might be intentional, they might be automated, or there might be a break-glass scenario, for example, where you have to fix something as rapidly as possible and you're not sure exactly what it is.

00:09:27

So you have to try a few things directly in the runtime environment. Well, we wanna automate that feedback loop back in, so that if I make a change in that runtime environment, I don't end up with a mismatch between the actual state and the desired state that is in Git. Because you'll see, as we get into the business values, that having those two aligned, having Git actually represent what is running in production, is one of the key enablers of some of these business values. So to kind of sum that up, again, I wanna emphasize that these convergent loops are really central to GitOps. And you'll see in a moment, I won't go into the details here, that there's a series of loops. There are these convergent loops around continuous delivery as well as operations. Remember I said that GitOps is about taking cloud native all the way to operations.
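The convergent loop described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that both the desired state (what is stored in Git) and the actual state (what is running) are plain dictionaries keyed by resource name; the `reconcile` function and the resource names are hypothetical and do not belong to any particular GitOps tool.

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Converge `actual` toward `desired` in place and return the actions taken."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actual[name] = spec
            actions.append(f"create {name}")
        elif actual[name] != spec:
            actual[name] = spec
            actions.append(f"update {name}")
    # Anything running that is not declared in Git is removed.
    for name in list(actual):
        if name not in desired:
            del actual[name]
            actions.append(f"delete {name}")
    return actions


# Hypothetical example: Git declares web:1.4, the cluster still runs web:1.3
# plus a pod someone started by hand.
desired = {"web": {"image": "web:1.4", "replicas": 3}}
actual = {
    "web": {"image": "web:1.3", "replicas": 3},
    "debug-pod": {"image": "busybox", "replicas": 1},
}

print(reconcile(desired, actual))  # first pass makes changes
print(reconcile(desired, actual))  # second pass finds nothing left to do
```

Running the loop repeatedly is safe because a pass that finds the two states aligned takes no action, which is what lets such a loop run continuously.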

00:10:22

It's not cloud native delivery, it's cloud native operations, inclusive of delivery and ops. So, okay, that's fine. And if you feel like I've been giving you some techno speak, in a way I have, but I wanted you to understand some of these key principles, because it's only when you have those key principles that you get some of the business benefits that I'm talking about. I wanted you to have familiarity with those. So we're gonna come back to some of those things as we go along. But really, in order to talk about business benefits, we have to pose the question: what are you really trying to do? You're not just trying to do GitOps because it's the center square on the buzzword bingo card, right? Okay, maybe some of you are, or maybe some of your folks are, but there's real business benefit, and that's what I want to talk about.

00:11:11

And what we're really trying to do here, to bring it down to brass tacks, is get better at doing software. We just want to deliver more value through our software to our customers more rapidly, more efficiently, and more resiliently, and of course, it would be great if we did so in a cost-efficient way as well. Now, how do we measure if we're getting any better at doing software? This is not work that I need to do. This is work that has been done really well and has been talked about at the DevOps Enterprise Summit many, many times. So I'm quite certain that the vast majority of you are familiar with this work, and this is the work of the DevOps Research and Assessment (DORA) organization. This, of course, is the organization that was formed by Nicole Forsgren, Jez Humble, and our very own Gene Kim.

00:12:06

And this is a chart that comes out of the State of DevOps Report from 2019, and it is also reflected in some of the details in the Accelerate book. But it really comes down to these four metrics. And I'm gonna draw us back to those four metrics over and over again throughout this presentation. And what you can see here, for those of you who aren't familiar, in a nutshell, is that this established a correlation between these four measurable things in IT, and we'll talk about what they are in just a moment, and the performance of a business based on business metrics. Now, those business metrics are of course the things that you would expect. Are you gaining market share? Are you profitable? Do your customers like you? Do you have a good net promoter score? And the highest, the most elite performers do things along these four axes in a particular way.

00:13:00

And the lowest performers, the ones that are at risk of going out of business, of completely getting Blockbustered, right? Those are the ones that are losing market share, have unhappy customers and so on. And it's really extraordinary. You can see the range of these different practices. So I'm gonna start at the top. The first two are a little bit closer to the developers themselves. The first has to do with deployment frequency. So it says, alright, I have an application. How frequently can I deploy it? Can I deploy it once a day or more than once a day? Or am I deploying it every six months? You can see there the range: the lowest performers are not deploying very frequently. The highest performers are deploying very frequently. Similarly, lead time for changes. I've got that code ready to go out into production. How long does it take me to get there?

00:13:53

The highest performers, less than a day. The lowest performers, what we just saw in Jonathan Smart's slide: months, quarterly. Then as we move down into the blue category, we start talking a little bit more about not the software itself, but the stability of the software as it's running. So there's the notion of, if something goes wrong, how long does it take for me to recover? That's mean time to recovery. And then also the change failure rate. I think it's been shown quite a bit that a lot of those failures come from changes, which is why we have change control bodies and things like that. And so our aim here is to reduce both the mean time to recovery as well as the change failure rate. So we wanna get better, and as we reduce that change failure rate, we of course are gonna feel more confident deploying more frequently.
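To make the four metrics concrete, here is a small sketch of how they could be computed from deployment records. The `Deployment` record, its field names, and the use of simple means are assumptions made for illustration; the DORA research defines its own survey-based measurements, which this sketch does not reproduce.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was ready
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored


def four_key_metrics(deployments, period_days):
    """Return simplified versions of the four metrics for one time period."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    recoveries = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "mean_lead_time": sum(lead_times, timedelta()) / len(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_recovery": (
            sum(recoveries, timedelta()) / len(recoveries) if recoveries else None
        ),
    }


# Hypothetical two-day window with two deployments, one of which failed.
deploys = [
    Deployment(datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 13)),
    Deployment(datetime(2024, 1, 2, 9), datetime(2024, 1, 2, 11),
               failed=True, restored_at=datetime(2024, 1, 2, 12)),
]
metrics = four_key_metrics(deploys, period_days=2)
for name, value in metrics.items():
    print(name, value)
```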

00:14:44

So how does GitOps support these things? Well, I'm gonna start by focusing on the top two. I'm gonna really focus on that developer. They're the ones that are getting that code ready for deployment and, again, wanna be able to do that frequently and be able to shorten that timeline. So let's talk about that developer, or I'd rather actually call them the DevOps team. So they're responsible for creating the software and bringing it to production. I know, I know, bringing it to production, that gets a little hairy, and we'll get to that in just a moment. But one of the first things that GitOps does to support the developer is that it allows them to use familiar tooling. So rather than introducing yet another tool for continuous delivery and yet another tool for some of the operational characteristics, the first thing that we're gonna do is we're gonna say, you know what?

00:15:42

You know and love Git. You've used it so effectively for the earlier parts of your software development lifecycle. You've built all sorts of automation around it. What you see here is a screenshot that has all sorts of tags. And when you apply some of these tags, various automation kicks off. You also have the ability to have very collaborative cycles in there. So why apply those practices only to source code evolution? Let's also apply them to configuration evolution, and that makes developers happy and it makes them efficient. So that's the first thing that GitOps does: it allows developers and DevOps teams to use familiar tools to get their job done.

Now, the second thing is that we want to enable those teams; again, they're responsible for operating their apps in production. We want to give them self-service capabilities. Now, what do I mean by self-service? I do not mean self-service infrastructure. I am not saying give them the ability to get their own infrastructure and then take on the burden of managing that infrastructure. What I'm talking about here is self-service operations. Let them operate their applications themselves. So remember I emphasized that Git and the GitOps process is the interface to operations. So let's give them an interface to operations and the capabilities, the access control, to be able to do those operational things from within their Git environment.

00:17:20

Now, in order to do that self-service, though, I can hear many of your thoughts: oh, wait a minute, I have something to say about that. Because I, as an enterprise, am also responsible for maintaining security, compliance, resilience, cost management. I have all of these enterprise concerns, and that's what's kept me from giving self-service access to these DevOps teams. Well, that emphasis that I just made a moment ago, don't make it self-service infra, make it self-service ops, make it self-service GitOps: that is the difference between what we've done in the past and what we're doing now with platforms and with GitOps-centric platforms. So again, I'm gonna emphasize a few of those things. I'm gonna emphasize the security and compliance concerns as well as the resilience concerns. And so let's see how these things work. Now remember, I'm still talking about the two green bars at the top, which are around shortening the lead time to production and also increasing deployment frequency.

00:18:31

So let's look first a little bit at security. So let's talk a little bit about the software supply chain. We've of course seen just recently with the SolarWinds attack that security around the software supply chain is absolutely critical. Now, one of the most common ways that CD is implemented today is kind of at the tail end of CI. And there are CD solutions that act as kind of centralized solutions, so that when I'm ready to go out to dev, and then to staging, and then to prod, my CD server is going to push those things into the various environments, and there's gonna be an eventing system and an approval system and all of that stuff. Now, the challenge with this from a security perspective is that the centralized CD system provides an attack surface that puts at risk all of these environments, most notably your production environments.

00:19:27

So if you've compromised your CD system, you've compromised the CD system for probably all of your applications and across all of your different stages. So one of the first things that we do with GitOps is we turn that arrow around, and we say, rather than having a centralized system pushing out to these environments, let's put the control for the delivery operation and the continuous operations into the cluster, into the target environment itself. And notice that little icon. It's the same icon that I had next to the automation. It's that reconciliation loop. So that's how we can turn it around. And so now if your dev environment is compromised, you aren't compromising your staging or your production environment. If you, God forbid, get your production environment for one application compromised, your other production environments are not compromised, because they're each doing their own pulling and they each have their own access control settings around that.

00:20:32

So this is just one example. I'm not suggesting that every security concern is addressed by this, but this gives you an example of how GitOps supports at least one. And I can tell you that there are more security concerns that are addressed by GitOps. So what we're doing there, of course, is we are pulling, and oh, there's my animation that shows that this is inherently more secure. Now, that's an example of security, but I also talked about compliance. And so again, the notion here is that if we can bake security into the environments, then we're able to give a little bit more access to that DevOps team and allow them to self-serve deployments out into production, because some of the security concerns are automatically addressed as a part of the working procedures and the tooling that we have in place.

00:21:26

Now, the second element that I talked about is compliance. Now, compliance to a large extent is about your auditors being able to go in and say, who did that? Who deployed this into production? And when? Now, what I'm showing you here doesn't just support compliance, it actually supports operations as well, things like root cause analysis. But in this particular case, let's talk about compliance. And this is something that you get for free. If Git is the interface for operations, that means that every operational thing that you do is recorded in that Git repository. Now, with Git, if you properly configure it, you've configured it for immutability. You've created the right access control settings in it so that only the right people can commit things into the various Git repositories. And all of that is inherently recorded in an immutable, versioned repository.

00:22:27

So what you see here in the black boxes is every single change that was made. You see who made it, and it can't be tampered with. So these hashes are a fingerprint of what the entire state of the system looked like when this change was applied. It's a fingerprint, so it cannot be tampered with. This makes auditors very happy. So coming back here, you can see how, if we are giving self-service access to those DevOps teams, which we can do because we're starting to bake security and compliance in via these GitOps processes and GitOps tooling, that makes it more and more possible to apply this self-service environment. Now, there are other parts of self-service, and that is: I can't allow my DevOps teams to deploy into production because I'm worried that it might cause some rippling effects in my production system.
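The fingerprint idea above can be shown with a deliberately simplified hash chain. This is not Git's actual commit object format; `commit_id` and `build_history` are hypothetical helpers that only mimic the property that each identifier covers the content, the author, and the parent identifier, so altering any earlier change alters every later identifier.

```python
import hashlib


def commit_id(parent: str, author: str, desired_state: str) -> str:
    """Hash the parent id, author, and content together (simplified, not Git's format)."""
    payload = f"parent {parent}\nauthor {author}\n\n{desired_state}".encode()
    return hashlib.sha1(payload).hexdigest()


def build_history(changes):
    """Return the chain of ids for a list of (author, desired_state) changes."""
    ids, parent = [], ""
    for author, state in changes:
        parent = commit_id(parent, author, state)
        ids.append(parent)
    return ids


# Hypothetical audit trail: who changed the desired state, and to what.
changes = [("alice", "replicas: 2"), ("bob", "replicas: 3")]
original = build_history(changes)

# Quietly rewriting the first change produces a different id for the latest commit,
# so the altered history no longer matches the fingerprint an auditor recorded.
tampered = build_history([("alice", "replicas: 20"), ("bob", "replicas: 3")])
print(original[-1])
print(tampered[-1])
```

In a real repository, the protection also depends on the access controls and branch settings mentioned in the talk, since history can be rewritten where that is permitted.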

00:23:37

And that's where your change control body comes in. They're ultimately responsible for making sure that nothing bad happens in production. So what I'm talking about here is resilience. Now, how does GitOps support resilience? This starts to get us down into the lower parts of this chart, and I'm gonna take them one by one. I first wanna talk about how GitOps supports lowering the mean time to recovery. And the first place that I'm gonna start is Git semantics. Remember I said that Git, in this chart, this dark box, is not just about auditing, it's not just about supporting compliance. It also has these other elements. Well, again, we've got everything versioned here. So when we're talking about resilience, what does versioning have to do with resilience? Well, you take that versioning and you bring it together with what I have on the right side of the slide here, which is to point out that every single one of these points in the Git history is a complete representation of the state of the system.

00:24:48

What that means is that I have version markers for every single state of the system. Say I start to roll something out in production and it looks fine for a little while, but then tomorrow, when everybody shows up at work, there's a spike of traffic and all of a sudden things start going a bit haywire. I can always say, oh, go back to the last version. And because I have a complete representation of the system and I have automation that is all about convergence, once I've reverted my desired state, that automation automatically kicks in and brings the actual state into alignment with the version that we ran yesterday. And that is a huge enabler of resilience. Now, if you've got that resilience, you can imagine that it has rippling effects back to the developer things I talked about earlier.
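A minimal sketch of rollback by reverting desired state follows. The `GitOpsEnv` class is hypothetical: a list of desired states stands in for the Git history, and `converge` stands in for the convergent automation. The revert is recorded as a new entry, similar in spirit to `git revert HEAD`, so the history itself is never rewritten.

```python
class GitOpsEnv:
    """Toy environment: `history` plays the role of Git, `actual` the runtime."""

    def __init__(self):
        self.history = []  # desired states, oldest first
        self.actual = {}   # what is currently running

    def commit(self, desired: dict) -> None:
        self.history.append(dict(desired))
        self.converge()

    def revert(self) -> None:
        if len(self.history) < 2:
            raise RuntimeError("no earlier version to revert to")
        # Record a new version whose content equals the previous one.
        self.history.append(dict(self.history[-2]))
        self.converge()

    def converge(self) -> None:
        # Stand-in for the reconciler: make the runtime match the latest desired state.
        self.actual = dict(self.history[-1])


env = GitOpsEnv()
env.commit({"image": "web:1.3", "replicas": 3})  # yesterday's version
env.commit({"image": "web:1.4", "replicas": 3})  # today's rollout misbehaves
env.revert()                                     # go back to the last version
print(env.actual)
print(len(env.history))
```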

00:25:48

It gives us more confidence to be able to release code more frequently, because I know that if something does go wrong, I can very quickly revert back to a previous state. So really, these effects ripple across the board when we start doing GitOps. Now, there's another element around resilience, and it builds on that practice that I just talked about, where I can just go ahead and revert back to the previous thing. Another scenario is, let's say one region goes down and I quickly wanna stand things up in a different region. I can use that same practice. I just point it at my Git repository and then the reconciler will make it so; it'll stand up the new system. Nothing else is required. That only works if I haven't had drift in my runtime environment, if I'm guaranteed that what's expressed in the Git repository is in fact exactly what is running in production.

00:26:54

Now, how might you have drift? Well, I call this the modern-day equivalent of SSHing into a box and making a change and then forgetting to record that change somewhere else: I could do a kubectl apply. Well, because, remember, that cycle is in both directions, we can do a number of things. First of all, we recognize that there's been drift, and then once that drift has happened, we can either revert it, we can notify on it, or we can record that change. And that is something that we can express in policies. Now I want to come slowly to a close here and go over the very last thing, which is to address this change failure rate. We want our changes to less and less frequently result in errors in the production system. Well, I'm gonna go back to Git. So one of the things that we do as that kind of final gate before we go to production is a review, a change review board.
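The three responses to drift mentioned above (revert it, notify on it, or record it) can be sketched as a policy choice. The `DriftPolicy` enum and `handle_drift` function are hypothetical names for illustration only, and state is again modeled as plain dictionaries rather than real cluster resources.

```python
from enum import Enum


class DriftPolicy(Enum):
    REVERT = "revert"  # put the runtime back to what Git says
    NOTIFY = "notify"  # leave both sides alone, just report
    RECORD = "record"  # accept the runtime change as the new desired state


def find_drift(desired: dict, actual: dict) -> dict:
    """Map each differing key to its (desired, actual) pair of values."""
    keys = sorted(desired.keys() | actual.keys())
    return {k: (desired.get(k), actual.get(k))
            for k in keys if desired.get(k) != actual.get(k)}


def handle_drift(desired: dict, actual: dict, policy: DriftPolicy):
    """Return (new_desired, new_actual, messages) without mutating the inputs."""
    drift = find_drift(desired, actual)
    if not drift:
        return desired, actual, []
    messages = [f"drift in {k}: desired={d!r}, actual={a!r}"
                for k, (d, a) in drift.items()]
    if policy is DriftPolicy.REVERT:
        return desired, dict(desired), messages
    if policy is DriftPolicy.RECORD:
        return dict(actual), actual, messages  # in practice, a commit back to Git
    return desired, actual, messages


# Hypothetical drift: someone scaled the deployment by hand in the cluster.
desired = {"replicas": 3, "image": "web:1.4"}
actual = {"replicas": 5, "image": "web:1.4"}
for policy in DriftPolicy:
    new_desired, new_actual, messages = handle_drift(desired, actual, policy)
    print(policy.value, new_desired, new_actual, messages)
```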

00:27:56

Well, what we've done here is we've taken that change review board and we've shifted it left. We've said, you know what? With GitOps, Git is a wonderful place where we can bring together a group of individuals to approve changes in production. Lots and lots of eyeballs on this. And by its very nature, you're gonna crowdsource that. So by its very nature, it's gonna reduce the frequency of disruptive changes. There's also a second element, which is that in this picture, what we saw earlier is that the runtime environment is pulling changes from the Git repository. Well, without getting too techie, there are really two elements to that. There is a delivery element. So it's pulling those changes from Git into the environment, and it's pulling them into this internal store that Kubernetes has. And that is, you know, etcd.

00:28:54

So don't worry about that techno speak. But then there's also a convergent loop on the right side of that that says, okay, once I've brought that configuration into my internal state store, which is in Kubernetes, then how do I actually instantiate the running instances of my application? And that's where, again, we have reconcilers, and we can have special-purpose reconcilers that do progressive delivery, that do that in a canary style. We just do a little bit, and if it looks okay, then we roll out more. Or we roll out A and B versions side by side to see if they're working better or worse, or we do blue-green deployments. All these different deployment strategies. It takes us back to this picture, which is that I've got convergent loops. So summing things up now, really what it comes down to is that there are two points of emphasis that I wanna make.
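The canary style described above ("we just do a little bit, and if it looks okay, then we roll out more") can be sketched as follows. The `canary_rollout` function, the traffic percentages, and the `is_healthy` callback are all hypothetical; a real progressive delivery reconciler would evaluate live metrics rather than call a simple function.

```python
def canary_rollout(steps, is_healthy):
    """Shift traffic to the new version step by step.

    `steps` is a list of traffic percentages; `is_healthy(weight)` reports whether
    the new version looks okay at that weight. Returns (final_weight, log).
    """
    log = []
    for weight in steps:
        log.append(f"shift {weight}% of traffic to new version")
        if not is_healthy(weight):
            log.append("unhealthy: roll back to 0%")
            return 0, log
    log.append("promoted")
    return (steps[-1] if steps else 0), log


# Hypothetical rollout where the new version only copes with light traffic.
final_weight, log = canary_rollout([10, 50, 100], lambda weight: weight < 50)
print(final_weight)
for line in log:
    print(line)
```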

00:29:50

One is that GitOps is a combination of continuous delivery and continuous operations. If we come back to these metrics and we fill out this chart, I'm gonna quickly build this out to show you that things like familiar tools, and then a whole bunch of things around self-service where the platform team has something to say about that self-service, are all the attributes that line up directly with those characteristics. So finally, I wanna share with you a thesis. The thesis is that GitOps supports the DevOps agenda in a particularly effective manner. And I hope that today I have shown you how very specific GitOps capabilities support these business metrics. And with that, I thank you for your attention, and I hope you enjoy the rest of the conference.