GitOps & SLO-Driven Automation Driving Faster and Better Releases! (US 2021)

Did you know that 90% of DevOps & SRE automation code that is currently developed must be rewritten or thrown away within the next 12 months? It’s because most automation frameworks treat GitOps and SLO-driven orchestration as an afterthought, resulting in high levels of code duplication and technical debt. This is especially true when implementing use cases like automatic monitoring configuration, deployment validation, or SLO management. If you are a DevOps, SRE, Performance, or Automation Engineer, then join this session where Andreas (Andi) Grabner, core maintainer of the Keptn CNCF open-source project, will show you how early adopters of the new data-driven cloud automation have increased speed of app delivery by 75%, improved deployment quality by 50%, and helped scale DevOps beyond lighthouse projects. And, do all of this without developing and maintaining custom pipelines or DevOps tool integrations! This session is presented by Dynatrace.

Tags: 2021, Breakout, US, Las Vegas

Andreas Grabner

DevOps Activist, Dynatrace

TRANSCRIPT

00:00:12

Hi there, folks at the summit. My name is Andy Grabner and I'm really pleased to be here with you, at least virtually, and give you some insights on a topic that I care about a lot. I'm here in my kitchen, but now let's go and start sharing the screen, because this is what I really want to show you. So let's get started. Today's topic is GitOps and SLO-driven automation driving faster and better releases. I'm Andy Grabner. I am a DevOps activist at Dynatrace, but also a DevRel for the open source project Keptn. We'll talk a lot about Keptn today, and therefore, if you're interested in learning more, please make sure to check out some of the links, follow us on Twitter, star us on GitHub, or join our Slack conversation. We are a CNCF project and really want to make the life of DevOps engineers

00:00:59

and SREs easier. All right, let me get started. I want to kick it off with a little, let's say, breaking news alert, because I'm pretty sure all of you are trying to get better at automating your delivery and automating your operations. You're moving to the cloud and moving to Kubernetes. Yet just moving to these new technologies doesn't necessarily give you all the stuff you need, right? Just moving to Kubernetes doesn't give you resiliency as a service. So that means we may all need to prepare for situations where systems go down, where systems don't act as expected, especially systems that we may not be in control of. In our case, at Dynatrace, we also run our systems in the clouds, and thanks to the way we are embracing DevOps, embracing SRE, embracing automation, and leveraging observability, we were able to withstand the

00:01:55

AWS two-hour outage in the Frankfurt region with zero impact for customers. I really love that. Thomas, who is leading our team, we call it the ACE team, the Autonomous Cloud Enablement team, they're responsible for running, operating, and deploying our software for our SaaS and managed customers, and he was sharing the story with me. If you want to read more, there's a blog post, but really what this is about, it's a sharing session: telling you how we are doing things internally, how that impacted what we've been giving back to the open source world as part of Keptn, and how we then also enable our Dynatrace customers to become better in their DevOps practices. So, first of all, to kind of remind ourselves, and I'm sure Gene and others have been talking about this for many, many years:

00:02:42

we, as DevOps and SRE folks, need to deliver faster and better, and we're measured against different dimensions and metrics. DevOps, on the one hand, is using automation to speed up delivery. We are measured against some of the DORA metrics, like deployment frequency and lead time for change. So, speeding up delivery. On the other side, we have SREs, or however you call them in your organization. I see them emerging out of operations, where they are now using automation to ensure resiliency of the environments that they are responsible for, measured against things like change failure rate or time to restore service in case something eventually goes wrong. Okay. So, speeding up delivery and also ensuring resiliency, both heavily relying on automation. And I think one of the things that connects them together are SLOs, service-level objectives, because in the end, whatever we do, however often we deploy, or whatever we do in production,

00:03:38

we always want to make sure that our services are available to our end users and our business stakeholders, based on what we have agreed to deliver. These are the classical SLAs, service-level agreements, but we now often measure them as SLOs, service-level objectives. So we need to do a lot of things to get there. I think DevOps and SREs must automate many tasks through their pipelines or their automation scripts. I just highlighted a couple here, and I'm pretty sure they're definitely not complete: whether it's about automated testing, automating security scans, automating your monitoring and observability, adding notifications, or doing more around what they call zero-downtime deployments, whether it's blue-green or canary. All these things we as SREs and DevOps folks need to figure out how to automate into our pipelines. Now, as we all know, there's clearly no shortage of, as I call them, do-it-yourself Swiss army knife tools or scripting.

00:04:36

I'm picking one of my favorite tools, Jenkins. Definitely, I can execute tests with Jenkins. I can add my test result analysis. I can add notifications to notify people about the results. I can integrate with my APM, with my observability platform. I can add an approval process. I can add chaos engineering, which is, I think, a very emerging new practice. I can add security scans, add the whole thing across multiple stages, and then also add these zero-downtime deployments, right? So nothing keeps me from doing this with the tools we have available, by doing a lot of scripting with these tools. The thing though is, if we do it with the existing tools, and if we are really proficient with writing automation scripts, then we may end up like Christian Heckelmann, a senior DevOps engineer who is responsible for almost a thousand CI/CD pipelines.

00:05:29

He's constantly reacting to "pipeline broken, please fix it" issues. That's because the pipelines that he built, his automation scripts that deploy, test, and then do some evaluation, ended up being very complex. This is just one of his pipelines, with more than a thousand lines of code, more complex than some of the microservices he's deploying and testing with it. And well, he started a while ago, and that escalated pretty quickly, right? Because it's hard, very hard, to maintain these pipelines. The next example is from Dieter, one of my colleagues at Dynatrace. He's responsible for kind of the new cloud-native workloads. He and his team also started using Jenkins pipelines; again, a well-known system, we know we can do magic things with it. One service was onboarded, more services got onboarded, and they each needed little specific things, like a different testing tool, a different type of notification, a different metric to pull in for the evaluation.

00:06:25

And this thing kind of exploded, or, as we call it, a snowflake effect: many different permutations of these pipelines. Dieter also then did an analysis of actual code duplication across the different automation scripts we use for deployment and for keeping things in production. And there we saw we are victim to the same thing that engineers who write business code always fall victim to, which is high technical debt, high code duplication, high code complexity. So we thought, how can we solve this? Because we as DevOps and SREs need to automate more, but we shouldn't be drowning in automation scripts, and we shouldn't need to individually maintain the tool integrations that make up the bulk of these scripts. So what can we do to help the industry? And this is where Keptn comes in. You can go to keptn.sh or to the Keptn Slack to find out more.

00:07:21

We really want to make automation easier, automation for DevOps and SREs, not just delivery automation, but also automation for operations. So how did we try to solve the problem? What was, again, the problem we saw? If we look at your classical automation script, whatever tool that is, you'll have hard-coded steps: you may prepare your system for monitoring, you then deploy, calling your deployment tool of choice, sometimes even having your Helm scripts or manifests, whatever it is, hard-coded in the pipeline. You then run your tests; again hard-coded, you have a hard-coded integration between the pipeline tool and the testing tool. Then you do some evaluation, where you're pulling back the log file from the testing tool, maybe pulling in some data from your monitoring tool through their API, and then you try to figure out whether this is a good build or not.

00:08:14

And then you are either promoting it to the next stage by calling another tool, or sending out notifications. Again, the challenge with this is, as we've seen, there's a lot of hard-coded integration between the process and the tooling. And there's also often configuration in these pipelines, whether it's test scripts, YAML files, or the metrics you want to analyze. So what we thought with Keptn is, first of all, we want to remove these hard-coded dependencies, which means we said, let's get rid of this whole thing that combines everything, process and tooling. Move the tooling to the right; on the left side, just keep the definition of the tasks you want to automate as part of a sequence, for instance: prepare, deploy, test, evaluate, promote. And on the right, you have some tools or capabilities that can then fulfill certain activities. Now the configuration: there should not be any tool-specific configuration in your pipeline definition or in your automation sequence definition.

00:09:17

This all moves to the right, into Git. And then we use eventing to connect everything together, because basically we broke apart process and tooling, and now we have loosely coupled process and tooling definitions, connected through events, just as we do in normal software engineering. All right. So let me give some examples; I've got three. First: automating performance sequences in staging, a very common use case. In Keptn, you would define a sequence of deploy, run a test, and then evaluation. Now, evaluation is highlighted here with a special color because SLO evaluation is core of what we do; it's part of every workflow. So now, if you have modeled that you want to deploy and test, and you provide this to, let's say, your engineers, an engineer can simply say: Keptn, I want to trigger the performance sequence in staging, and here's some additional information, like the image that I want to deploy.
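To make that concrete, such a sequence can be modeled in a Keptn shipyard file. This is only a sketch in the style of the shipyard v0.2 spec; the project name, stage name, and strategy values are illustrative, and the exact fields depend on your Keptn version:

```yaml
apiVersion: "spec.keptn.sh/0.2.0"
kind: "Shipyard"
metadata:
  name: "shipyard-demo"             # illustrative project name
spec:
  stages:
    - name: "staging"
      sequences:
        - name: "performance"       # triggered as "performance in staging"
          tasks:
            - name: "deployment"
              properties:
                deploymentstrategy: "blue_green_service"
            - name: "test"
              properties:
                teststrategy: "performance"
            - name: "evaluation"    # the SLO evaluation, part of every workflow
```

Note that the shipyard only names tasks; nothing here says which deployment or testing tool runs them, which is exactly the separation of process and tooling described above.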

00:10:13

So what Keptn does is it starts with sending out an event, because the first step is deploy. It sends out a so-called CloudEvent with the information about the deployment: what image, which stage, and maybe some additional information like, hey, this should be a blue-green deployment. And then you may have one or multiple tools subscribing to this event. One tool would obviously be the deployment tool. This could be Helm, this could be an existing Jenkins or GitLab pipeline, anything that you want to use to deploy an image with blue-green in staging. Additionally, we can have multiple tools subscribed to these events; maybe the notification tool is also subscribing, because you may want your Slack channel to be updated whenever any deployment happens. So this was deploy. Then test happens: Keptn triggers a test event, which is picked up by the testing tool.
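Such a deployment event could look roughly like this (shown as YAML for readability; the envelope fields follow the CloudEvents spec, while the payload shape and all names are an illustrative sketch, not the exact Keptn schema):

```yaml
specversion: "1.0"
type: "sh.keptn.event.deployment.triggered"   # "deploy" task kicked off
source: "shipyard-controller"                 # the process orchestrator
id: "2f9a9e8b-0000-0000-0000-000000000000"    # placeholder event id
contenttype: "application/json"
data:
  project: "demo"                             # illustrative names
  stage: "staging"
  service: "carts"
  configurationChange:
    values:
      image: "myrepo/carts:0.12.3"            # the image to deploy
  deployment:
    deploymentstrategy: "blue_green_service"  # requested strategy
```

Any tool subscribed to the `deployment.triggered` type receives this event, does its work, and reports back with corresponding started/finished events.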

00:11:00

Once the testing tool is done, it sends an event back. By the way, the deployment tool and the testing tool both get their configuration from Git; that is completely managed by our system. And then the last step is the evaluation, right? Evaluation means I want to pull in metrics from the monitoring tool, from the testing tools, from any type of tool. I want to get these details and then calculate whether everything is good or not good, depending on your SLO definition. And again, here notifications might be interesting; you may want to notify people in the end. So let me give you some terminology. We call the process definition, or let's say the definition of automation tasks and sequences, a shipyard. We call the tools or capabilities that take part in that workflow, that subscribe to these events, so-called Keptn services, and they are part of the Keptn uniform.

00:11:55

So that is kind of the uniform that the Keptn is wearing. What else do we have? For the configuration, we have configs in the config repo, and we have CloudEvents. This is a standard that we're also co-driving with the CDF, where task-specific metadata is sent by the orchestrator, the process orchestrator, to the individual tool that then picks it up. So this was automated performance sequences. Now let me give you a slightly more complex one: automating canary rollouts. First of all, the sequence has changed; this is now for production. We want to do a canary rollout, so you have some additional tasks, like prepare and release. On the right side, I removed some of the tools. We may stick with the same monitoring tool, but now we may have a different deployment tool that can do the canaries, and maybe a different notification tool for production, because different teams are interested in it.

00:12:48

But in the end, it's the same concept. You say: Keptn, trigger a certain sequence, in this case a canary rollout, for a particular stage with this particular metadata, right? The same thing happens again: a deployment event is sent out, and now maybe a different deployment tool, the one that has been officially assigned for production deployments, is picking it up. What about the SLO evaluation? Same thing: it pulls data from the monitoring tool. And then at the end, depending on the result, we may send the release event, where the deployment tool will say, okay, now we are scaling the canaries up to 50%, to a hundred percent. So you can see, it's again the process on the left to define your automation sequences, and then you have different tools on the right that are participating in that workflow by subscribing to these events.

00:13:45

And they are getting their configuration from the Git repository, which is also organized in stages: staging, production, whatever you have. Last example, because we want to make sure you understand this is not just another automation tool for delivery: production remediation. Here, in production remediation, I may have some additional tools, right? When problems come in from your monitoring tool, for instance, you may have some infrastructure automation, some ticketing system. So if your production monitoring system finds a problem, it can say: Keptn, trigger a remediation. It sends an event over to Keptn and says: I found a high failure rate problem in staging, and the root cause seems to be a disk latency issue. Now on the Keptn side, you can specify a so-called remediation action sequence, but most importantly, for remediation we have a special concept: we can also specify remediation actions in a so-called remediation YAML file, actions that you can specify per root cause.
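The remediation file mentioned here could look roughly like this. This is a sketch; the problem types, action names, and the exact `apiVersion` are illustrative and depend on your Keptn version and on which action-provider services you have installed:

```yaml
apiVersion: spec.keptn.sh/0.1.4
kind: Remediation
metadata:
  name: remediation-demo
spec:
  remediations:
    - problemType: "High failure rate"       # matched against the incoming problem
      actionsOnOpen:
        - name: "Clean disk"
          action: "cleandisk"                # hypothetical custom action; a tool
          description: "Free up disk space"  # subscribed to it executes the cleanup
    - problemType: "Response time degradation"
      actionsOnOpen:
        - name: "Scale up"
          action: "scaling"
          description: "Scale the workload up by one replica"
          value: "1"
```

Keptn walks through the actions for the matching problem type one at a time, running the SLO evaluation after each one, as described next.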

00:14:49

And then what Keptn does is it takes the first action, like clean the disk, and sends the event; same concept as before: the individual tools that need to participate, that can actually deliver that action, like cleaning the disk, pick it up and send back the status once done. Additionally, you may also want to send everything that is done by the automation to notifications, again, as shown here. Most important: after every action, Keptn again reaches out to the monitoring tool to do the evaluation. If it's good, thank you, the process stops, problem solved, everything good, no human interaction needed. If it's not good, if the evaluation fails and the system is still down, then it executes the next action, like a rollback. Same concept: an event is sent, tools are basically being pulled in, and so on and so forth. So after every remediation action: if it's good, everything is fine; if it's not good, it continues.

00:15:46

And the last step could potentially be, well, let's escalate this whole thing; maybe in this case we'll create a ticket so that somebody can really follow up. So, three examples: the first one on test automation, the second one on canary deployments, and now on auto-remediation. Why are we doing this again? Remember, I started off with the problem statement that many of you are building automation with your existing tools, which are perfectly fine, but don't build overly complex automation scripts that are maybe more complex than the microservices they deploy. We want to help you reduce that complexity with Keptn. We've seen 90% less automation code, thanks to a clean separation between process and tooling. GitOps is ingrained: all the configuration is in Git, and every time you make changes, you can trigger a new process. SLOs are core to the whole thing.

00:16:40

I'll show you that in a second. And most important: Keptn is not replacing the tools that you have made investments in. Keptn is connecting them, to really automate sequences for delivery and operational purposes. So this is typically the moment when I am on stage and say: please now take a picture. Because again, the left side is kind of where most people are currently heading, by building their own automation; the right side is what I want. Or, as I always say: friends don't let friends build their own automation. Friends suggest to their friends: please have a look at Keptn first, and how Keptn can take away a lot of the automation pain. Most important, again: leverage your existing tools. Take your tools that are maybe already deploying, and then just trigger a Keptn sequence that does the test and evaluation.

00:17:36

You decide how to bring in automation with Keptn. Very important: Keptn really, really does a great job in orchestrating all these tools, and in there, there's always the SLO evaluation. That means after every sequence, after every task, Keptn can reach out to the observability platform and ask: are we good to go, or are we not good to go? How did we do? Cool. Now, I typically get the question: how do people get started? Most adopters start with integrating the SLO validation, which is kind of the simplest process you can have with Keptn, into their existing pipelines. The typical use case: a lot of our users already have pipelines, already do some deployments, some testing, but then they manually sit there and validate that the deployment actually happened and how the test execution went. So they built dashboards in the most popular observability platforms.
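That SLO validation is driven by an slo.yaml file. A sketch of what one might look like follows; the SLI names, thresholds, and scores are illustrative, and the field names follow the Keptn quality-gate format as I understand it:

```yaml
spec_version: "1.0"
comparison:
  compare_with: "single_result"       # compare against the previous evaluation
  include_result_with_score: "pass"
  number_of_comparison_results: 1
objectives:
  - sli: "response_time_p95"
    pass:
      - criteria:
          - "<=+10%"                  # at most 10% slower than the last good run
          - "<600"                    # and below 600 ms absolute
    warning:
      - criteria:
          - "<=800"                   # between pass and this is a warning
  - sli: "error_rate"
    pass:
      - criteria:
          - "<=1"                     # at most 1% failed requests
total_score:
  pass: "90%"                         # overall weighted score needed to pass
  warning: "75%"
```

Each objective is scored individually and rolled up into the total score, which is the single pass/warn/fail number the engineer sees instead of a dashboard to eyeball.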

00:18:33

This is an example from Dynatrace. But what we are doing: instead of manually looking at these dashboards, however beautiful they may look, which takes a long, long time and takes people who need to be available to do the evaluation, we say, let's automate that. If you already have a dashboard you look at, it means you know which metrics, which SLIs, are important, and what the SLOs are that you are looking for. And then we can automate this completely with Keptn and bring this down to a fraction of the time. One of the examples, and I want to give you some adoption examples, is Mike Kobish, a performance engineer. You can see on the left, he's building these beautiful dashboards in Dynatrace where, after every test, or while a test is actually running, he's looking at key performance metrics from his performance tests that are monitored by his observability platform.

00:19:27

So he's building this dashboard, but he actually augmented it with SLO information. He's then checking in this dashboard, making it accessible to Keptn, and Keptn then completely automates the analysis of the dashboard for him, giving him an easy-to-understand score in the end. If you want to know more about this, watch the video. If you want to know more about the scoring, there's also more information out there on how we do the SLO scoring. So, one example: automating the validation of a deployment and its tests. The next great adoption example is from Raiffeisen Software. They're responsible for Austrian online banking; in case you wonder where my accent comes from, I'm also Austrian, and we work closely with them. They are triggering this from their Jenkins pipeline, where Jenkins is doing the deployment into the UAT environment and then running some tests.

00:20:16

Then, however, they have defined the relevant SLOs that should be analyzed fully automatically, and Keptn in the end calculates a complete total score. And as I said, everything is triggered in their case from Jenkins, with links back between the tools to navigate easily. The most important thing is: if everything is green, nobody needs to look at the data anymore. Keptn provides the release validation recommendation. Last example: we also see a lot of integrations with other CI/CD tools like Azure DevOps. We have a great partner, Realdolmen; they built the Azure DevOps integration. Same thing as you've seen before: people are using the automated SLO-based evaluation for deployments, and with a Belgian government agency they were able to speed up delivery and, especially, reduce manual work. So, those were some of the examples. Now, we also have some great testimonials, from people like Tara, a performance engineer at Facebook.

00:21:21

It's great to see comments like this on LinkedIn: Keptn feels like the reference implementation of Google's Site Reliability Engineering and the Site Reliability Workbook. A really great testimonial from somebody who knows SLOs, SRE, performance, and automation pretty well. So, to kind of close it up, let me talk a little more about Keptn. It's our open source project that we want to contribute back to the world. And the nice thing is, whether you are a DevOps or an SRE, you pick the use case that you want to automate. You can start with just the quality gate, the SLO evaluation, and go all the way to auto-remediation. Every use case needs some configuration for your specific tools. Most importantly, you connect your tools; we're not replacing your tools, we connect them. We are giving you the chance to not build and maintain your own tool integrations and orchestration.

00:22:15

So, therefore, Keptn automates monitoring, delivery, reliability, and remediation. We have our UI, we have our API, so you can automate everything. Most importantly, everything happens through event-driven orchestration. As you saw, SLOs are at the core; we always evaluate them. All the configuration is in Git, and everything is declarative. And on the bottom right, standards, very important: we're working with the CNCF, but also with the CDF, to standardize all the events we use for tool integration. Now, I talked a lot about Keptn; I also want to say a big thank you to Dynatrace, which makes all of this possible. Remember, in the very beginning I started off with: we are all DevOps folks and SREs, and we need to make sure that, even as we move to the cloud and believe that cloud technology can make us automatically more resilient, that's not the case. You need to invest a lot.

00:23:06

Like the example I brought up, where we were able to withstand a data center outage. We're using Dynatrace to monitor our systems, but we also use automation, and now more and more also the stuff from Keptn, which we also bring to our customers, because we want to help them speed up delivery. And we've seen this in our examples: 75% faster delivery. Improved quality, very important: you have to be confident in the stuff that you deploy; therefore, we are enforcing SLOs not just in production, but also as part of the delivery automation. And with the data bringing it all together, we'll increase collaboration between the DevOps and the SRE teams, so DevSecOps teams. And thanks to all the automation, and to the self-service that it enables, this can be scaled enterprise-wide. So hopefully you liked what you saw and there were certain things in there for you. If you want to follow up later on, then here again are all the details on how to get in touch with me, and most importantly, how to have a look at our open source project and everything we do on the Dynatrace side.

00:24:13

Thank you so much. As I said, I would love to be there with you face to face; I'm sure it's happening, not next week, but maybe next year. Okay. Bye-bye. See ya.