Tales From the Branches - Why GitOps Matters For Your Business Success (US 2021)

The pipeline-as-code approach allows Git workflows to automate the deployment of CI/CD pipelines, turning code into features faster and more securely for the business. And this is where GitOps gets interesting for your business. The GitOps approach to continuous deployment enables developers to focus only on developing and contributing code as they always have, through Git repositories. Reconciliation loops in GitOps monitor the actual versus desired state of your running software and align infrastructure automatically. Traditional operations teams can now evolve into the SRE and DevOps roles that they aspired to with the introduction of DevOps in the first place. IT teams have now turned Kubernetes from a complex orchestration system into a platform that integrates all the tasks and tools a modern cloud native enterprise needs. And GitOps is the essential pattern for the highly distributed and constantly changing environment that makes up the cloud. In this session Steve will cover the key principles of GitOps and demonstrate the real business benefits a company of any size can experience. He ties GitOps practices back to the DORA IT metrics as measures of Software Delivery and Operational (SDO) performance, including frequent deployments, shorter lead time, mean time to recover, and change failure rate. He will also show how these techniques provide solutions to a number of use cases including drift detection, malware remediation, disaster recovery and more. This session is presented by Weaveworks.

us, las vegas, breakout, 2021



Steve George

COO, Weaveworks





Hi there. Welcome to this talk about why GitOps matters for business success. My name is Steve George. I'm the COO at Weaveworks, one of the companies that has been working in the cloud native space. We coined the term GitOps a few years ago, from helping teams build and operate cloud native applications and platforms that scale, and that's a term that's been taken up across the industry. So my goal today is to tell you what GitOps is, why it's important from a business perspective, and then, hopefully having convinced you that it's a great thing, how you can adopt GitOps. Let's jump right in. So the thesis for this talk is based on three things. GitOps supports the DevOps approach in a highly effective way. Specifically, what this means is that GitOps brings DevOps to cloud native software operations: it's taking the principles of DevOps and applying them in the specific cloud native sphere. And for that reason, because we know that the DevOps capabilities and principles help teams to operate more effectively, GitOps takes that mantle forward, and it can help to significantly improve the way that we operate cloud native software.


So GitOps, what is it? Well, at a simple level, it's very straightforward. We're thinking about Git for configuration management, and ops, the operations of software. So here we have our developers on the left. They're working on their code and their configuration, and they've got an application that they want to deploy into a runtime environment, a Kubernetes environment here on the right with two applications. They could, of course, just directly apply that into the environment, but they want to take advantage of automation. So using automation, what we think about here is the fact that on the left-hand side we have the desired state. This is all of the configuration, everything that is needed to deploy the actual service. And then on the right-hand side is the actual state. This is the running system. And they use automation to deploy that service, that capability, that application, whatever it is, with GitOps.


One interesting aspect here is that we can take advantage of a convergence loop. What I mean by that is we can take advantage of the declarative basis of cloud native technologies, Kubernetes specifically, and we can ask the system: what is the actual state? And so from that we can say, we have a desired state, what we want it to be; we have an actual state, what it actually is; and we can see if the two things are converging. If they are, we know that our deployment has been successful, whatever that is. And if they aren't, then we know there's a problem and we can react accordingly. And so ultimately that is the heart of what GitOps is: using this convergence loop to say, I want something, deploy it, did I get it, in a continuous loop. Now the aspect that's interesting here, of course, is that we're using Git. We're bringing Git as a single interface to the operations of software. So we're moving it from being just an interface for development, which development teams are very familiar with, across into that operational sphere, and making it the single point of collaboration and work that the teams use whenever they're deploying or operating their software.
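To make the convergence loop concrete, here is a minimal Python sketch of a single pass: compare desired and actual state, apply the difference, and check whether the two have converged. The `State` mapping of application name to version and the function names are assumptions made for illustration; a real agent would read the desired state from a Git repository and the actual state from the cluster.

```python
from typing import Dict, Optional, Tuple

State = Dict[str, str]  # hypothetical model: application name -> deployed version


def diff(desired: State, actual: State) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return every entry where the actual state differs from the desired state."""
    keys = set(desired) | set(actual)
    return {
        k: (desired.get(k), actual.get(k))
        for k in sorted(keys)
        if desired.get(k) != actual.get(k)
    }


def reconcile_once(desired: State, actual: State) -> bool:
    """One pass of the loop: apply the desired state and report whether anything changed."""
    drift = diff(desired, actual)
    for name, (want, _have) in drift.items():
        if want is None:
            del actual[name]      # not part of the desired state, so remove it
        else:
            actual[name] = want   # deploy or update to the desired version
    return bool(drift)


desired = {"app-a": "v2", "app-b": "v1"}
actual = {"app-a": "v1", "app-c": "v9"}

changed = reconcile_once(desired, actual)   # "I want something, deploy it"
converged = not diff(desired, actual)       # "did I get it?"
```

Running `reconcile_once` repeatedly on a timer gives the continuous loop: once the states match, each further pass finds nothing to do.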


So in the CNCF, GitOps is an open working group, and we've been working on the principles of GitOps for a while, as everyone is coming to understand how GitOps can be used to deploy and operate software in the cloud native sphere. And these principles are as follows: that the entire system is described declaratively. So here we're taking advantage of Kubernetes' declarative nature, and that of many of the other cloud native technologies, and we describe the entire system. What we mean by that is the Kubernetes platform itself, the services that it needs, and the workloads that you're intending to run on top of it. And then we have the canonical desired state versioned and stored in Git, so this is what we want, stored in that Git repository. We can then make sure that approved changes are automatically applied. So we're using that automation to apply those changes.


We'll talk about some cases here where we may not want to automatically apply them, but it's important that we use an agent, to take advantage of reducing the time of that loop, or increasing the speed at which we can do those deployments. And then, because we've got this software agent, which understands the actual running state within the runtime, we can check whether there has been any divergence. So when there is that divergence, we can take an action. If the runtime state diverges, we can choose to remove that bad change and automatically deploy a new one. And so ultimately what we have here is a closed-loop system.


When we talk about GitOps, one of the first questions that people have is: well, I already use my CI system to deploy applications. So are you telling me that I need to change my CI system, something that I've put a lot of effort into and that teams understand very well? And we're not. CI remains as it was before. All we're doing is removing the CD part from it, and we're taking that into its own separate area, where we do deployments with an agent, and that's its own system. So continuous delivery now works using a GitOps system. And then of course, because we live in a continuous world, we know that we need to continuously operate: build new versions, deploy new versions, scale our systems, deal with the latest changes, new sales, whatever it is that causes the system to have to respond. So the thing to bear in mind here is that we're not just talking about continuous delivery. We're also talking about continuous operations, all of the things that need to be done to operate software on a daily basis.


So that's what GitOps is, but why do we want to use it? Ultimately, we want to use it because we want to get better at doing software. In the DevOps world, we understand that there are key characteristics, key performance elements in the software delivery process which, when done well, help IT teams to deliver great business value. And this table is particularly famous and well known as a demonstration of how these particular metrics really apply at a business level. So ultimately, if we can deploy software more frequently, then we can make sure that the changes are smaller, and that will mean that we can reduce the lead time of those changes. In turn, we get those changes into production as quickly as possible, making us agile and responsive to business needs.


And of course, ultimately there are changes that happen but don't go well. And so we think about how we improve our reliability by ensuring that our time to restore service is as small as possible, and that for any changes that we do make, we do our level best to ensure that those changes have a very low failure rate. And so, really, to demonstrate to you why GitOps is relevant from a business perspective, I want to try and demonstrate how GitOps applies to and helps to support these key metrics, how it improves them, and how it helps businesses to support the way in which they deliver these elements.
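As a worked example of the four metrics just described (deployment frequency, lead time for changes, time to restore service, change failure rate), the sketch below computes them from a small list of deployment records. The `Deployment` record, its field names, and the sample numbers are invented for illustration and are not taken from the talk or from any published dataset.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class Deployment:
    lead_time_hours: float                 # time from commit to production
    failed: bool = False                   # did the change cause a failure in production?
    restore_hours: Optional[float] = None  # time to restore service (set when failed)


def four_key_metrics(deploys: List[Deployment], period_days: int) -> dict:
    """Compute the four delivery metrics over a period of `period_days` days."""
    failures = [d for d in deploys if d.failed]
    return {
        "deployment_frequency_per_day": len(deploys) / period_days,
        "mean_lead_time_hours": mean(d.lead_time_hours for d in deploys),
        "mean_time_to_restore_hours": (
            mean(d.restore_hours for d in failures) if failures else 0.0
        ),
        "change_failure_rate": len(failures) / len(deploys),
    }


# Four deployments over two days, one of which failed and took an hour to restore.
deploys = [
    Deployment(4.0),
    Deployment(6.0),
    Deployment(2.0, failed=True, restore_hours=1.0),
    Deployment(4.0),
]
metrics = four_key_metrics(deploys, period_days=2)
```

With these sample records the team deploys twice a day, with a mean lead time of four hours, a one-hour time to restore, and a change failure rate of 25%.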


So ultimately what we're looking at here is dev teams or DevOps teams who need to deploy applications. What they want to do is to be able to release more frequently, reduce the time to do those deployments, and then operate those applications as efficiently and as effectively as possible. So what we're talking about with GitOps here is the fact that it provides familiar tooling and enables a self-service approach. And then in many enterprises, and complex enterprises, of course, we have an underlying platform, and often a platform team whose job it is to create a platform that the applications will work on. They need to maintain the reliability of that system, its security and compliance, and think about doing that as efficiently as possible. And so I want to show you that GitOps helps to enable resilience, and that certain capabilities within GitOps deliver security and compliance.


So first of all, I'm going to focus on the key metrics which are really developer focused. These are the ones that are around deploying software as quickly as possible, and then making sure that, because you're incrementally deploying those changes quickly, the lead time to getting a deployment out there is as small as possible. So the first way in which GitOps really supports these two metrics is the fact that it's familiar tooling. Development teams understand Git, they know how to use it, it's often the center of their universe. And so what we're doing here is taking it from being just a development tool into being an operations tool. It's not necessarily part of Git, but it is inherent in the way that we use Git now that it's an extremely collaborative process. What we're looking at here is a pull request, a way to fix something. We've got various people talking about that, figuring out what the right approach is. And so Git is an extremely collaborative way of working, with all of the different Git services out there these days, and we can bring that collaboration into the operational universe, where we can ensure that there are many eyes working on a deployment or a particular operational change, and everybody can collaborate together using a well-known process and system and way of working.


And of course what we want to do, and here is the second part, is to ultimately provide a self-service platform. So how does GitOps enable a self-service platform where teams can deploy their applications and their services onto that platform? Now, we're not talking here about self-service infrastructure. We're not thinking about the underlying servers, but we are thinking about the cloud native environment, anything that is needed to run that environment, all of the services and applications that are needed on top, and any operational systems that are needed to look after those applications. And one of the things that you really notice with self-service is that, in order for platform teams to deliver it, it is something that they need to configure and set up. And you'll recall that earlier on I said that one of the things that's really interesting about GitOps is that it defines the entire platform.


So using a GitOps approach means that we can define a cloud native platform. We can say which version of Kubernetes is allowed, which ingress is allowed, which monitoring system is allowed, which kinds of different production elements. And then we can deliver that to our development teams as a platform, which they can then use in a self-service way. And then the next bit about GitOps, which is really important, is that it enables resilience. Where this is applicable, really, is that we can feel confident about going faster and putting more changes into production as quickly as possible if we know that we can return to a good state. So ultimately it's about taking the handbrake off, because we know that we can get back to a good position. And the way that we do that is by ensuring that we have a resilient platform.


So from a platform team perspective, and from that of the developers operating a service, they can go faster because they know that everything within the platform is defined and they can return to that previous known good state. Whether that was last Tuesday, or the week before, or a couple of hours before, they know that if the worst comes to the worst, they can get back to that known good state. And so what we're talking about here is the semantics, right? Because each version is a complete representation. It has everything within the platform, everything which has been recorded and defined, and then, using the benefits of cloud native, we can deploy that and rebuild the platform or service. And of course, because it's within Git, it's immutable, right? It's got a SHA, that SHA is known, it's completely immutable, and we can use that as a record of the changes.
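The idea that each version is a complete representation identified by a known hash can be sketched in a few lines of Python. This is only an illustration of content addressing: it hashes a canonical JSON form of a configuration, which is not Git's actual object format, and the `commit` helper and `history` store are hypothetical names.

```python
import copy
import hashlib
import json
from typing import Dict

history: Dict[str, dict] = {}  # hypothetical store: hash -> complete platform definition


def config_sha(config: dict) -> str:
    """Hash a canonical form of the whole configuration, so any change yields a new ID."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def commit(config: dict) -> str:
    """Record a complete snapshot under its hash and return that hash."""
    sha = config_sha(config)
    history[sha] = copy.deepcopy(config)
    return sha


good = {"kubernetes": "1.21", "apps": {"shop": "v7"}}
good_sha = commit(good)

# A later change produces a different identifier; the earlier snapshot is untouched.
changed = {"kubernetes": "1.21", "apps": {"shop": "v8"}}
changed_sha = commit(changed)

# Returning to the known good state means looking up the complete definition by its hash.
restored = history[good_sha]
```

Because the identifier is derived from the content, the same definition always maps to the same hash, and a recorded snapshot cannot be altered without its identifier changing.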


So in practice, the way to think about this: here's our pirate, and he's not using a configuration management approach and he's not using GitOps. He just likes to log in and do kubectl apply directly to the production cluster. And one of the things that's really interesting when you look at teams and the way that they operate production platforms is that this is very, very common. One of the effects that it has is that teams often think that their platforms have certain components, are running certain versions, are configured in certain ways. And then in reality somebody has done something and it hasn't been recorded, and therefore they don't know exactly all the versions and the capabilities and the configuration that they're running. But if you're using a GitOps approach, what happens if somebody tries to do a straight kubectl apply? Well, what happens here is that there will be an immediate alert back, because we have that definition of what the desired state should be.


And the agent will understand that the cluster, or the application or service, has now moved away from the desired state, and that the actual state in the runtime, the running system, is now different from what it should be. At that point it will send an alert, and we can choose what action we would like to take, and whether that's automatic. So in this case, what we might do is say: right, we want that change removed. So we're going to revert back to the desired state, to that known good deployment, and that will automatically happen, and the change will be wiped out. A side effect of the fact that everything is recorded within Git, and everything has a SHA, is that compliance teams really love GitOps, because it's an immutable record of everything which has happened within the system.
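The alert-then-choose behaviour described here could look roughly like the following Python sketch. The `handle_drift` function, the `auto_revert` flag, and the alert text are assumptions for illustration and do not represent the interface of any particular GitOps agent.

```python
from typing import Dict, List


def handle_drift(desired: Dict[str, str], actual: Dict[str, str], auto_revert: bool) -> List[str]:
    """Alert on every difference; optionally revert the running state to the desired state."""
    alerts = []
    for name in sorted(set(desired) | set(actual)):
        want, have = desired.get(name), actual.get(name)
        if want == have:
            continue
        alerts.append(f"drift: {name} is {have!r}, expected {want!r}")
        if auto_revert:
            if want is None:
                actual.pop(name)     # remove something that was never in Git
            else:
                actual[name] = want  # wipe out the manual change
    return alerts


desired = {"web": "v3"}
actual = {"web": "v3-hotfix"}  # someone applied a change directly to the cluster

alerts = handle_drift(desired, actual, auto_revert=True)
```

With `auto_revert=False` the same call only reports the drift and leaves the running state alone, which matches the case where a team prefers to decide on the action manually.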


So if we think about the fact that, ultimately, development teams want to be able to go faster, the most important aspects here that GitOps helps to provide are the fact that it supports deployment frequency, because it gives teams something where they can collaborate together using common tools, and the fact that, because it's more resilient and able to recover to previous states and points in time, you can take the handbrake off making those changes, because you always know you can get back. That helps to ensure that the amount of work in progress that you're holding back is relatively low, because you can make changes as quickly as possible, and hopefully that helps to reduce the lead time for changes. And then the other aspects here are the platform team, the resilience aspects: time to restore service, and reducing the change failure rate.


We've talked about the fact that, in order to restore services, you can do that using GitOps because you've got the definition stored within Git. Well, there are a couple of other ways in which you can ensure that you reduce the failure rate. I'm going to mention the fact that you can collaborate: you can inspect the changes. So here we have our developer, who's received a pull request. They're able to inspect that change and make sure that it's something they're happy with. And you can use automated guardrails. So here I'm showing some parts of the configuration being locked, where it's been decided that those configuration elements should not be changed. And so you can use automated guardrails, policy, security tooling. There's lots of tooling out there that works very well with Git pipelines, and all of that tooling can be brought into play here.
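A guardrail that locks parts of the configuration can be as simple as comparing the current and proposed versions and rejecting any change that touches a locked element. The sketch below assumes configuration is a nested dictionary and that locked elements are named by dotted paths to leaf values; the keys shown are invented for illustration.

```python
from typing import Dict, List, Set


def flatten(config: dict, prefix: str = "") -> Dict[str, object]:
    """Turn a nested config into {dotted.path: leaf value}."""
    flat: Dict[str, object] = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def locked_violations(current: dict, proposed: dict, locked: Set[str]) -> List[str]:
    """Return the locked paths whose value differs between current and proposed config."""
    before, after = flatten(current), flatten(proposed)
    return sorted(path for path in locked if before.get(path) != after.get(path))


current = {"image": "shop:1.4", "security": {"runAsRoot": False}}
proposed = {"image": "shop:1.5", "security": {"runAsRoot": True}}
locked = {"security.runAsRoot"}  # hypothetical element that must not change

violations = locked_violations(current, proposed, locked)
```

A check like this would typically run automatically against each pull request, so that a change touching a locked element is flagged before anyone is asked to approve it.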


And so really what you have is this control point, where you can make the decision about whether you want to deploy the new version. We said at the beginning that there are many situations where you may not want to do fully automated deployments. That's not a realistic expectation for many enterprises, in many situations and for many particular services, and there's nothing within GitOps that requires you to do full automation. You just want to automate as much as possible. And so in this diagram we've got a manual approval process, perhaps a manual sign-off process. Everybody's happy with the change, the guardrails say that it's all fine, and at that point we're ready to make that deployment. And then of course, if there is an alert, if that change doesn't go well, then we can revert. And that really gets me on to the next bit, which is progressive delivery.


So GitOps, using the cloud native technologies, is really fantastic for progressive delivery, because we can take that particular atomic change, deploy it into the cluster, and then see whether it deploys well: see if a certain amount of traffic goes to it well and everybody's happy with that, or if the runtime is showing some sort of performance issue over that deployment. And then if we're perfectly happy with that, we can continue with the deployment. If we're not satisfied, it's easy to revert back. And that atomic change, that unit of change for that service or system, is also something that we can then deploy into other environments. So imagine that you're running many, many clusters, many environments, many services. You can deploy the same versions to different environments, or you could deploy the same service into many different environments, for example, which is a common requirement and need.
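The progressive delivery flow described above (send some traffic to the new version, check how it behaves, then continue or revert) can be sketched as a small loop. The traffic steps, the error-rate threshold, and the `observe_error_rate` callback are all assumptions for illustration; a real setup would read such signals from the team's monitoring system.

```python
from typing import Callable, List, Tuple


def progressive_rollout(
    steps: List[int],
    observe_error_rate: Callable[[int], float],
    max_error_rate: float,
) -> Tuple[str, int]:
    """Shift traffic step by step; roll back at the first step that looks unhealthy."""
    for weight in steps:
        # Route `weight` percent of traffic to the new version, then check its health.
        if observe_error_rate(weight) > max_error_rate:
            return ("rolled_back", weight)
    return ("promoted", steps[-1])


steps = [10, 50, 100]  # percent of traffic sent to the new version at each step

# A healthy release stays under the threshold at every step.
healthy = progressive_rollout(steps, lambda weight: 0.01, max_error_rate=0.05)

# A release that misbehaves under heavier load is reverted at the 50% step.
unhealthy = progressive_rollout(
    steps, lambda weight: 0.01 if weight < 50 else 0.20, max_error_rate=0.05
)
```

Because the unit being rolled out is a single versioned change, the same function could be pointed at another cluster or environment to repeat the rollout there.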


So hopefully I've now shown you how GitOps supports the key DevOps metrics that we know help to improve the way in which IT can support a business: familiar tools and self-service for making sure that development teams are as effective as possible, and then a resilient platform, which is using a versioned, immutable store, making it simple to understand drift and get back to that previous known good state at any point in time. So we've talked about what GitOps is, and we've talked about why we would want to have GitOps. Hopefully I've convinced you that it's something that's worth your time and effort. And so now the question is: how can I adopt GitOps?


So GitOps itself is, obviously, an industry term, and I think it's really key that everybody understands what it means and how it works, and that we have different GitOps solutions out there which are all interoperable and work together. And so that's why the CNCF has a GitOps working group, where many vendors have come together across the space who are working to understand what GitOps is, how to use it, and to develop the way in which we can operate software using a GitOps approach. That's also a community where there are many open source users and end users who are really interested to discover the ways in which to use GitOps, and this community talks a lot about how people can adopt GitOps and use it within their environments. And we ourselves have worked with teams of all shapes and sizes over the last few years, helping them to take advantage of GitOps.


And in our opinion, there's a simple way to do this, and you can make it as straightforward as possible. The key prerequisites are that you obviously have to have a Kubernetes environment, and you need to have some container workloads that you want to deploy into those environments. The first step up this path is to do core GitOps, and here we recommend that people focus on deploying their applications, workloads, and services in a single-team environment: a single environment, a single workload, a single team. And my one little reference here will be to Weave GitOps Core, which is our open source, packaged version, which helps you to deploy applications. Beyond that, as you get into more levels of complexity, we're then looking at enterprise GitOps, where we can build a complete platform. And then of course scaled GitOps, where we go beyond this, where we're thinking about more advanced policy and security across many, many platforms. So thank you for your time. I hope I've demonstrated to you what GitOps is and why it will have an impact for your business, and interested you enough to try it out. If you'd like to find out more about it, I'd be very interested in your questions, or check out the Weaveworks site and get started with Weave GitOps. Cool. Thank you.