The Business Benefits of GitOps - Weaveworks | Europe 2021




The Business Benefits of GitOps

As the Chief Technology Officer at Weaveworks, Cornelia Davis is responsible for the company’s technology strategy, inclusive of open source projects, commercial products and services offerings. She is driven by the desire to help enterprises transform their business by leveraging cloud platforms. She cut her teeth in the space of modern application platforms at Pivotal where she was on teams that brought Pivotal Cloud Foundry (Pivotal’s PaaS) and Pivotal Container Service (Pivotal’s Kubernetes Service) to market. Cornelia currently serves on the Technical Oversight Committee of the Cloud Native Computing Foundation. She is the author of the book Cloud Native Patterns: Designing Change-tolerant Software.

An industry veteran with almost three decades of experience in image processing, scientific visualization, distributed systems and web application architectures, and cloud-native platforms, Cornelia holds a B.S. and an M.S. in Computer Science from California State University, Northridge, and further studied theory of computing and programming languages at Indiana University.


Cornelia Davis

Chief Technology Officer, Weaveworks


Full transcript

The complete talk, organized by section.

Cornelia Davis

Hello, my name is Cornelia Davis, and thank you for joining me here today to talk about the business benefits of GitOps. Before jumping into the content, let me tell you a little bit more about myself. I've been in this industry for quite some time, about 30 years, maybe even a little bit more. I've always been a developer. I didn't come from the operations side. But it says that I wasn't ops, that was in past tense. The reason that I consider myself now somewhat proficient in ops is because I've spent the last 10 years or so working on developer platforms, initially working on Cloud Foundry, then later on Kubernetes. All told, I've been working in the Kubernetes space for about five years, which doesn't make me the longest veteran, but it doesn't make me a noob either. I've been working in web architectures and indeed, cloud native. And what I mean by cloud native is this world where the environment that we're running our software in is constantly changing, and it's highly distributed. I've been working in that space for nearly a decade, even though we didn't call it cloud native at the time. I've been part of the DevOps Enterprise Summit programming committee for quite a number of years. And for the last year and a half or so, I've been the CTO at Weaveworks. Now, at Weaveworks, we consider ourselves the GitOps company, which is why I'm talking to you about GitOps today. So I've been spending quite a bit of time in this space. I'm also, as you can see on this slide, the author of a book called "Cloud Native Patterns," which is a book that is targeted at the application developer and architect, and that teaches the software architecture patterns that support software running well in this highly distributed, constantly changing world that is the cloud.

So, why am I talking about cloud native? Well, I believe that fundamentally, GitOps is the thing that takes cloud native all the way to operations. And to drive this point home, I'd like to start with our very own Jonathan Smart. Jonathan Smart, many of you know, has been part of this DevOps Enterprise Summit community for quite a number of years. He's given some great talks at DevOps Enterprise, including some of the funniest and most wonderful lightning talks. I encourage you to look them up online. They're really fantastic. This is a picture of him giving a talk quite a number of years ago when he was still at Barclays Bank. And this picture is interesting in a number of different ways, but the real point that I want to drive home here, that I want to tee off of, is that main statement that he makes across the top of the slide that says, "We are so freaking agile, yay!" Now, he builds this slide up, and I'm not going to spend as much time on it, but the point that he's driving home here is that we've gotten really very good in the industry at applying these agile practices and short feedback loops and short cycles and things like that to the early stages of the development process. So dev is really great. But then when you look at the whole broad spectrum of things, you look on the left side, which is all of the stuff that happens before we go into development, you'll notice that it has a very different cadence. It has annual, quarterly, and monthly cadence. And I'm not going to talk about that side, but where we're really going to talk is on the right side of that dev, which is when we say, after I've done the development, what are the things that need to happen to get this thing all the way out into production? And there again, you can see that there is a cadence of monthly and quarterly. So imagine you have software ready to test against customer expectations. You want to get some feedback. 
You've got it ready to go, but it takes you an entire quarter before you can get that in front of customers and get the feedback. Well, GitOps is something that can help you shorten that cycle significantly, and that is ultimately the major business value.

So before I jump into the business values, let me explain a little bit about what I mean by GitOps. What is it? So what we're talking about here, as I just teed up, is we're talking about running software in production. So I've got some type of runtime environment that I'm going to run that software in. Now, that runtime environment needn't be just Kubernetes. It could be any type of runtime environment. I'm using Kubernetes here just as an example because many of you are using Kubernetes or planning on using Kubernetes in production. And in fact, it gives us a leg up if we were to look at things under the covers. But there is that asterisk that says the runtime environment could really be anything. And then, of course, we've got some human beings that are responsible for managing that application in production. Now, those individuals are increasingly the application team, and I'll talk more about that in just a moment. But nevertheless, I've got some humans that are ultimately accountable and responsible for getting this thing running and keeping it running well in production. Well, of course, you can see there's a whole bunch of white space on this slide, so I'm going to fill those things in with the techniques and the mechanisms that we can use that we believe are making it more and more effective for these human beings to manage these applications in production.

So, what's the first thing? Well, as you can imagine, since I'm talking about GitOps, the first thing I want to start talking about is Git. Now, with Git, of course, many of you are probably thinking, well, I've been using Git in my infrastructure as code for some time. I'm storing things in Git. Does that mean I'm doing GitOps? Well, I'm hoping to explain to you that, in fact, you might not be. There's a few other essential elements of this. Now, the first element that you see here is, yes, we are storing some of the operational artifacts in Git, and then, and this is key point number one, we are using Git and the Git applications, things like GitHub, GitLab, Bitbucket, as the interface for operations. This is really important because it drives home the point that I am not just storing things in Git and then using something else to exercise those operational tasks, and then using yet another interface to do auditing against that, and another interface to do this or that. I am using this as the interface for operations. That's really very significant. No longer am I using the vSphere console or the AWS console, I am using Git as the interface for operations. That's really point number one: it's more than just storing things in Git. I'm actually using it as the interface for operations.

Now, of course, when I do things in Git, I want there to be as much automation as possible, and that's kind of the second element. Now, I know some of you are thinking, "Well, I've already got triggers that are happening, so that when I do things in Git and I make a change in one of my scripts, it automatically reruns that script for me." You're of course familiar with automation in the CI processes, so the ability for your developers to check in their source code and have their Jenkins pipeline automatically run, or their Tekton pipeline automatically run. But there's a little icon that is showing right next to those gears. The gears represent the automation. But there's an icon that has a little circular arrow thing, and that is pointing to the second element that is really important about GitOps. And that is that that automation is convergent. One of the things that is super critical in this cloud-native space is, remember, I mentioned that it's a constantly changing environment, and those changes can be coming from a number of different vectors. And what this automation is designed to do is constantly adapt to those constant changes. Now, there's a couple of other words that are on the slide already that are giving you a hint as to that convergent automation, and that is when we say in the runtime environment, there's an actual state of the system, and what we're storing in Git is a declarative desired state of that runtime environment. And that convergent automation is in the business of making sure that those two things are aligned. Now, so far you see the arrow going from left to right, only in one direction. So if I change my desired state, then that automation is going to converge the runtime environment to be in alignment with the change I just made in Git. But there's an equally important thing, which is to recognize that sometimes changes happen in the runtime system. They might be intentional, they might be automated.
There might be a break glass scenario, for example, where you have to fix something as rapidly as possible, and you're not sure exactly what it is, so you have to try a few things directly in the runtime environment. Well, we want to automate that feedback loop back in so that if I make a change in that runtime environment, I don't end up with a mismatch between that actual state and the desired state that is in Git. Because you'll see as we get into the business values, that having Git actually represent what is running in production is one of the key enablers of some of these business values. So to kind of sum that up, again, I want to emphasize that these convergent loops are really central to GitOps. And you'll see in a moment, I won't go into the details here, but you'll see that there's a series of loops. There's these convergent loops that are around continuous delivery as well as operations. Remember I said that GitOps is about taking cloud native all the way to operations. It's not cloud native delivery, it's cloud native operations, inclusive of delivery and ops.
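As a rough illustration of the convergent automation described above, here is a minimal Python sketch of a single reconciliation pass. It is not how any particular GitOps tool is implemented; the `State` type, the `diff` and `reconcile` functions, and the resource names are all hypothetical, and the "cluster" is just a dictionary standing in for the actual state of a runtime environment.

```python
from typing import Dict, List, Tuple

# Hypothetical model: resource name -> version string.
State = Dict[str, str]


def diff(desired: State, actual: State) -> List[Tuple[str, str]]:
    """Return the (action, resource) steps needed to make actual match desired."""
    steps: List[Tuple[str, str]] = []
    for name, spec in desired.items():
        if name not in actual:
            steps.append(("create", name))
        elif actual[name] != spec:
            steps.append(("update", name))
    for name in actual:
        if name not in desired:
            steps.append(("delete", name))
    return steps


def reconcile(desired: State, actual: State) -> State:
    """One pass of the loop: apply every step so the result equals desired."""
    converged = dict(actual)
    for action, name in diff(desired, actual):
        if action == "delete":
            del converged[name]
        else:
            converged[name] = desired[name]
    return converged


# Made-up example states: what Git declares versus what is running.
desired_in_git = {"web": "v2", "worker": "v1"}
actual_in_cluster = {"web": "v1", "cache": "v1"}

print(diff(desired_in_git, actual_in_cluster))
actual_in_cluster = reconcile(desired_in_git, actual_in_cluster)
print(actual_in_cluster == desired_in_git)
```

A real reconciler would run this comparison continuously rather than once, which is what lets it respond both to new commits and to changes made directly in the runtime environment.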

So, okay, that's fine. And if you feel like I've been giving you some techno speak, in a way I have, but I wanted you to understand some of these key principles, because it's only when you have those key principles that you get some of the business benefits that I'm talking about. I wanted you to have familiarity with those. So we're going to come back to some of those things as we go along. But really, in order to talk about business benefits, we have to pose the question of what are you really trying to do? You're not just trying to do GitOps because it's the center square on the buzzword bingo card, right? Okay, maybe some of you are, maybe some of your folks are, but there's real business benefit, and that's what I want to talk about. And what we're really trying to do here, to bring it down to brass tacks, is get better at doing software. We just want to deliver more value through our software to our customers more rapidly, more efficiently, and more resiliently. And of course, it would be great if we did so in a cost-efficient way as well.

Now, how do we measure if we're getting any better at doing software? Well, this is not work that I need to do. This is work that has been done really well and has been talked about at the DevOps Enterprise Summit many, many times. So I'm quite certain that the vast majority of you are familiar with this work, and this is the work of the DevOps Research and Assessment organization. This, of course, is the organization that was formed by Nicole Forsgren, Jez Humble, and our very own Gene Kim. And this is a chart that comes out of the State of DevOps Report from 2019, and also is reflected in some of the details in the Accelerate book, but it really comes down to these four metrics. And I'm going to draw us back to those four metrics over and over again throughout this presentation. And what you can see here, for those of you who aren't familiar, in a nutshell, is that this work established a correlation between these four measurable things in IT, and we'll talk about what they are in just a moment, and performance of a business based on business metrics. Now, those business metrics are, of course, the things that you would expect. Are you gaining market share? Are you profitable? Do your customers like you? Do you have a good net promoter score? And the highest and the most elite performers do things along these four axes in a particular way, and the lowest performers, the ones that are at risk of going out of business completely, getting Blockbustered, right? Those are the ones that are losing market share, have unhappy customers, and so on. And it's really extraordinary. You can see the range of these different practices. So, I'm going to start at the top. The first two are a little bit closer to the developer themselves. So it has to do with deployment frequency. So it says, all right, I have an application, how frequently can I deploy it? Can I deploy it once a day or more than once a day, or am I deploying it every six months?
You can see there the range. The lowest performers are not deploying very frequently. The highest performers are deploying very frequently. Similarly, lead time for changes. I've got that code ready to go out into production.

How long does it take me to get there? The highest performers, less than a day. The lowest performers, what we just saw on Jonathan Smart's slide, months, quarterly. Then as we move down into the blue category, we start talking a little bit more about not the software itself, but the stability of the software as it's running. So there's the notion of if something goes wrong, how long does it take for me to recover that, mean time to recovery, and then also the change failure rate. I think it's been shown quite a bit that a lot of those failures come from changes, which is why we have change control bodies and things like that. And so our aim here is to reduce both the mean time to recovery as well as the change failure rate. So we want to get better, and as we reduce that change failure rate, we of course, are going to feel more confident deploying more frequently. So how does GitOps support these things? Well, I'm going to start by focusing on the top two. I'm going to really focus on that developer. They're the ones that are getting that code ready for deployment and want to be able to do that frequently and be able to shorten that timeline. So let's talk about that developer, or I'd rather actually call them the DevOps team. So they're responsible for creating the software and bringing it to production. You'll see, I know that bringing it to production, that gets a little hairy, and we'll get to that in just a moment. But one of the first things that GitOps does to support the developer is it allows them to use familiar tooling. So rather than introducing yet another tool for continuous delivery and yet another tool for some of the operational characteristics, the first thing that we're going to do is we're going to say, you know what? You know and love Git. You've used it so effectively for the earlier parts of your software development life cycle. You've built all sorts of automation around it.
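To make the four metrics discussed above concrete, here is a small Python sketch that computes them from a list of deployment records. The `Deployment` record, the `four_key_metrics` function, and the sample timestamps are all invented for illustration; this is one plausible way to compute the numbers, not the methodology used in the State of DevOps research, and it assumes at least one deployment in the period.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None


def four_key_metrics(deployments: List[Deployment], period_days: int) -> Dict[str, float]:
    """Compute the four metrics over a non-empty list of deployments."""
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    recoveries = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "mean_lead_time_hours": mean(t.total_seconds() for t in lead_times) / 3600,
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_recovery_hours": (
            mean(r.total_seconds() for r in recoveries) / 3600 if recoveries else 0.0
        ),
    }


# Made-up sample: four deployments over two days, one of which failed.
t0 = datetime(2021, 5, 17, 9, 0)
h = timedelta(hours=1)
sample = [
    Deployment(t0, t0 + 2 * h),
    Deployment(t0 + 3 * h, t0 + 7 * h),
    Deployment(t0 + 24 * h, t0 + 30 * h, failed=True, restored_at=t0 + 31 * h),
    Deployment(t0 + 32 * h, t0 + 36 * h),
]

for name, value in four_key_metrics(sample, period_days=2).items():
    print(f"{name}: {value}")
```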

What you see here is a screenshot that has all sorts of tags, and when you apply some of these tags, various automation kicks off. You also have the ability to have very collaborative cycles in there. So why not just take those practices that we've been applying to source code evolution? Let's also apply that to configuration evolution. And that makes developers happy and it makes them efficient. So that's the first thing that GitOps does, is it allows developers and DevOps teams to use familiar tools to get their job done. Now, the second thing is that we want to enable those teams, again, they're responsible for operating their apps in production. We want to give them self-service capabilities. Now, what do I mean by self-service? I do not mean self-service infrastructure. I am not saying give them the ability to get their own infrastructure and then take on the burden of managing that infrastructure. What I'm talking about here is self-service operations. Let them operate their applications themselves. So remember I emphasized that Git and the GitOps process is the interface to operations. So let's give them an interface to operations and the capabilities, the access control to be able to do those operational things from within their Git environment. Now, in order to do that self-service, though, I can hear many of you, many of your thoughts, like, "Oh, wait a minute. I have something to say about that because I also, as an enterprise, am responsible for maintaining security, compliance, resilience, cost management. I have all of these enterprise concerns, and that's what's kept me from giving self-service access to these DevOps teams." Well, that emphasis that I just made a moment ago about don't make it self-service infra, make it self-service ops, make it self-service GitOps, that is the difference between what we've done in the past and what we're doing now with platforms and with GitOps-centric platforms. So again, I'm going to emphasize a few of those things.
I'm going to emphasize the security and compliance concerns as well as the resilience concerns. And so let's see how these things... Now, remember, I'm still talking about the two green bars at the top, which is around shortening the lead time to production and also increasing deployment frequency. So let's look first a little bit at the security.

So let's talk a little bit about the software supply chain. We've, of course, seen just recently with the SolarWinds attacks, that security around the software supply chain is absolutely critical. Now, one of the most common ways that CD is implemented today is kind of at the tail end of CI. And there are CD solutions that act as kind of centralized solutions that when I'm ready to go out to dev and then to staging and then to prod, my CD server is going to push those things into the various environments, and there's going to be an eventing system, an approval system, and all of that stuff. Now, the challenge with this from a security perspective is that centralized CD system provides an attack surface that puts at risk all of these environments, most notably your production environments. So if you've compromised your CD system, you've compromised the CD system for probably all of your applications and across all of your different stages. So one of the first things that we do with GitOps is we turn that arrow around and we say, rather than having a centralized system pushing out to these environments, let's put the control for the delivery operation and the continuous operations into the cluster, into the target environment itself. And notice that little icon. It's the same icon that I had next to the automation. It's that reconciliation loop. So that's how we can turn it around. And so now, if your dev environment is compromised, you aren't compromising your staging or your production environment. If you, God forbid, get your production environment for one application compromised, your other production environments are not compromised because they're each doing their own pulling, and they each have their own access control settings around that. So this is just one example. 
I'm not suggesting that every security concern is addressed by this, but this gives you an example of how GitOps supports at least one, and I can tell you that there's more security concerns that are addressed by GitOps. So what we're doing there, of course, is we are pulling, and oh, there's my animation that shows that this is inherently more secure.

Now, that's an example of security, but I also talked about compliance. And so again, the notion here is if we can bake security into the environments, then what we're able to do is we're able to give a little bit more access to that DevOps team, allow them to self-serve deployments out into production because some of the security concerns are automatically addressed as a part of the working procedures and the tooling that we have in place. Now, the second element that I talked about is compliance. Compliance to a large extent is about your auditors. Your auditors being able to go in and say, "Who did that? Who deployed this into production and when?" Now, what I'm showing you here doesn't just support compliance, it actually supports operations as well, kind of finding root causes. But in this particular case, let's talk about compliance, and this is something that you get for free. If Git is the interface for operations, that means that every operational thing that you do is recorded in that Git repository. Now, if you properly configure Git, meaning you've configured it for immutability and you've created the right access control settings in it so that only the right people can commit things into the various Git repositories, then all of that is inherently recorded in an immutable versioned repository. So what you see here in the black box is, you see every single change that was made, you see who made it, and it can't be tampered with. So these SHAs are a fingerprint of what the entire state of the system looked like when this change was applied. It's a fingerprint, so it cannot be tampered with. This makes auditors very happy.
So coming back here then, again, we were talking about, you can see how if we are giving self-service access to those DevOps teams, which we can do because we're starting to bake in via these GitOps processes and GitOps tooling, baking security and baking compliance into that, that makes it more and more possible to apply this self-service environment.
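The fingerprint idea can be sketched in a few lines of Python. This is a deliberately simplified model and not Git's actual object format: the `commit_id` and `build_history` functions are hypothetical, the authors and changes are made up, and SHA-256 is used purely for illustration. What it shows is the chaining property: each identifier is derived from the change, its author, and the previous identifier, so altering any past entry changes every identifier after it, which makes tampering detectable.

```python
import hashlib
from typing import List, Tuple


def commit_id(parent_id: str, author: str, change: str) -> str:
    """Derive an identifier from the parent identifier plus this entry's content."""
    payload = f"{parent_id}\n{author}\n{change}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def build_history(entries: List[Tuple[str, str]]) -> List[str]:
    """Chain (author, change) entries together and return their identifiers in order."""
    ids: List[str] = []
    parent = ""
    for author, change in entries:
        parent = commit_id(parent, author, change)
        ids.append(parent)
    return ids


# Made-up audit trail, and a copy in which the author of the second entry is altered.
original = [
    ("alice", "deploy web v1"),
    ("bob", "scale web to 3 replicas"),
    ("alice", "deploy web v2"),
]
tampered = [
    ("alice", "deploy web v1"),
    ("mallory", "scale web to 3 replicas"),
    ("alice", "deploy web v2"),
]

ids_original = build_history(original)
ids_tampered = build_history(tampered)

print(ids_original[0] == ids_tampered[0])  # entry before the edit keeps its identifier
print(ids_original[1] != ids_tampered[1])  # the edited entry gets a new identifier
print(ids_original[2] != ids_tampered[2])  # and so does every entry after it
```

An auditor who holds a known-good identifier for the latest entry can therefore tell whether anything earlier in the history was rewritten.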

Now, there's another concern around self-service, and that is: I can't allow my DevOps teams to deploy into production because I'm worried that it might cause some rippling effects in my production system. That's where your change control body comes in; they're ultimately responsible for making sure that nothing bad happens in production. So what I'm talking about here is resilience. Now, how does GitOps support resilience? This starts to get us down into the lower parts of this chart, and I'm going to take them one by one. I first want to talk about how GitOps supports lowering the mean time to recovery.

The first place that I'm going to start is Git semantics. Remember I said that Git in this chart, this dark box, is not just about auditing, it's not just about supporting compliance, it also has these other elements. Well, again, we've got everything versioned here. So when we're talking about resilience, what does versioning have to do with resilience? Well, you take that versioned store and you bring it together with what I have on the right side of the slide here, which is to point out that every single one of these points in the Git history is a complete representation of the state of the system. What that means is that I have version markers for every single state of the system. If I start to roll something out in production and it looks fine for a little while, but then tomorrow, when everybody shows up at work, there's a spike of traffic and things start going a bit haywire, I can always say, "Oh, go back to the last version." And because I have a complete representation of the system and I have automation that is all about convergence, I've now just effectively reverted my desired state. That automation automatically kicks in and brings the actual state in alignment with the version that we ran yesterday, and that is a huge enabler of resilience. Now, if you've got that resilience, you can imagine that now has rippling effects back to the developer things I talked about earlier. It gives us more confidence to be able to release code more frequently, because I know that if something does go wrong, I can very quickly revert back to a previous state. So really, these effects ripple across the board when we start doing GitOps.
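The rollback described above can be sketched as follows. This is an assumption-laden toy: `DesiredStateRepo` is a hypothetical stand-in for a Git repository in which every entry is a complete description of the system, and `converge` stands in for the reconciler. The point it illustrates is that recovery is just another change to the desired state, after which the same automation does the rest.

```python
from typing import Dict, List

State = Dict[str, str]


class DesiredStateRepo:
    """Hypothetical stand-in for a Git repo: an append-only list of full system states."""

    def __init__(self, initial: State) -> None:
        self.history: List[State] = [dict(initial)]

    def commit(self, state: State) -> None:
        self.history.append(dict(state))

    def head(self) -> State:
        return dict(self.history[-1])

    def revert_last(self) -> None:
        """Record a new entry that restores the state from before the latest change."""
        if len(self.history) < 2:
            raise ValueError("nothing to revert")
        self.history.append(dict(self.history[-2]))


def converge(actual: State, desired: State) -> State:
    """Stand-in for the reconciler: the runtime ends up matching the desired state."""
    return dict(desired)


repo = DesiredStateRepo({"web": "v1"})
cluster = converge({}, repo.head())

repo.commit({"web": "v2"})            # roll out a new version
cluster = converge(cluster, repo.head())

repo.revert_last()                    # things go haywire: revert the desired state
cluster = converge(cluster, repo.head())

print(cluster)
print(len(repo.history))              # the revert is itself recorded in the history
```

Note that the revert adds to the history rather than erasing it, so the audit trail still shows both the rollout and the rollback.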

Now, there's another element around resilience, and that is that practice that I just talked about where I can just go ahead and revert back to the previous thing. Well, another scenario is, let's say one region goes down and I quickly want to stand things up in a different region. I can use that same practice. I just point into my Git repository and then the reconcilers will make it so. They'll stand up the new system. Nothing else is required. That only works if I haven't had drift in my runtime environment, if I'm guaranteed that what's expressed in the Git repository is in fact exactly what is running in production. Now, how might you have drift? Well, I call this the modern-day equivalent of SSH-ing into a box and making a change and then forgetting to record that change somewhere else. I could do a kubectl apply. Well, because, remember that cycle is in both directions, we can do a number of things. First of all, we recognize that there's been drift, and then once that drift has happened, we can either revert it, we can notify on it, or we can record that change. And that is something that we can express in policies.
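The three responses to drift mentioned above (revert it, notify on it, or record it) could be expressed as a policy along these lines. This Python sketch is illustrative only: `DriftPolicy`, `detect_drift`, and `handle_drift` are hypothetical names, and real tools express such policies in their own configuration formats.

```python
from enum import Enum
from typing import Dict, List, Optional, Tuple

State = Dict[str, str]
Drift = Dict[str, Tuple[Optional[str], Optional[str]]]


class DriftPolicy(Enum):
    REVERT = "revert"   # put the runtime back to what Git says
    NOTIFY = "notify"   # leave both as they are and only report
    RECORD = "record"   # write the runtime change back into Git


def detect_drift(desired: State, actual: State) -> Drift:
    """Map each drifted resource to its (desired, actual) pair."""
    names = sorted(set(desired) | set(actual))
    return {
        name: (desired.get(name), actual.get(name))
        for name in names
        if desired.get(name) != actual.get(name)
    }


def handle_drift(
    desired: State, actual: State, policy: DriftPolicy
) -> Tuple[State, State, List[str]]:
    """Return the new (desired, actual) states plus messages describing the drift."""
    drift = detect_drift(desired, actual)
    messages = [
        f"drift in {name}: desired={want!r} actual={have!r}"
        for name, (want, have) in drift.items()
    ]
    if drift and policy is DriftPolicy.REVERT:
        return dict(desired), dict(desired), messages
    if drift and policy is DriftPolicy.RECORD:
        return dict(actual), dict(actual), messages
    return dict(desired), dict(actual), messages


# Made-up example: someone applied a hotfix directly to the runtime environment.
git_state = {"web": "v2"}
cluster_state = {"web": "v2-hotfix"}

for policy in DriftPolicy:
    new_desired, new_actual, notes = handle_drift(git_state, cluster_state, policy)
    print(policy.value, new_desired, new_actual, notes)
```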

Now, I want to come slowly to a close here and go over the very last thing, which is to address this change failure rate. We want our changes to less and less frequently result in errors in the production system. Well, I'm going to go back to Git. So one of the things that we do with that kind of final gate before we go into production is we do a review, a change review board. Well, what we've done here is we've taken that change review board and we've shifted it left. We've said, "You know what? Git is a wonderful place where we can bring together a group of individuals to approve changes in production." Lots and lots of eyeballs on this. And you're going to crowdsource that, so by its very nature, it's going to reduce the frequency of disruptive changes. There's also a second element, which is that in this picture, what we saw earlier is that the runtime environment is drawing changes from the Git repository. Well, without getting too techy, there's really two elements to that. There is a delivery element, so it's drawing those changes from Git into the environment, and it's drawing it into this internal store that Kubernetes has, and that is etcd. So don't worry about that techno speak. But then there's also a convergent loop on the right side of that that says, "Okay, once I've brought that configuration into the internal state store in Kubernetes, then how do I actually instantiate the running instances of my application?" And that's where, again, we have reconcilers, and we can have special purpose reconcilers that do progressive delivery, that do that in a canary style. We just do a little bit, and if it looks okay, then we roll out more, or we roll out A and B versions side by side to see if they're working better or worse. Or we do blue-green deployments, all these different deployment strategies. It takes us back to this picture, which is that I've got convergent loops.
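The canary style of progressive delivery described above ("we just do a little bit, and if it looks okay, then we roll out more") can be sketched like this. The `canary_rollout` function, the traffic percentages, and the health checks are all hypothetical; a real progressive delivery reconciler would evaluate live metrics rather than call a simple function.

```python
from typing import Callable, List, Sequence


def canary_rollout(steps: Sequence[int], healthy: Callable[[int], bool]) -> List[int]:
    """Shift traffic to the new version step by step, rolling back on the first bad check.

    Returns the sequence of traffic percentages given to the new version.
    """
    applied: List[int] = []
    for weight in steps:
        applied.append(weight)
        if not healthy(weight):
            applied.append(0)  # send all traffic back to the old version
            return applied
    return applied


# Made-up health checks: one that always passes, one that fails at 50% traffic.
always_ok = lambda weight: True
breaks_under_load = lambda weight: weight < 50

print(canary_rollout([5, 25, 50, 100], always_ok))
print(canary_rollout([5, 25, 50, 100], breaks_under_load))
```

Because a bad change is caught while it serves only part of the traffic, fewer changes end up counting as production failures, which is the link to the change failure rate.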

So summing things up now, really what it comes down to is there are two points of emphasis that I want to make. One is that GitOps is a combination of continuous delivery and continuous operations. If we come back to these metrics and we fill out this chart, I'm going to quickly build this out to show you that things like familiar tools and then a whole bunch of things around self-service, where the platform team has something to say about that self-service, these are all the attributes that line up directly with those characteristics. So finally, I want to share with you a thesis. The thesis is that GitOps supports the DevOps agenda in a particularly effective manner. And I hope that today I have shown you how very specific GitOps capabilities support these business metrics. And with that, I thank you for your attention, and I hope you enjoy the rest of the conference.