Las Vegas 2020

Our Journey Towards Progressive Delivery

This session will talk about how the IBM Kubernetes Service builds and deploys microservices to Kubernetes clusters across the globe.


Using standard tools and services such as GitHub, Travis CI, Kubernetes, and LaunchDarkly, IBM is able to deploy hundreds of code changes daily to thousands of clusters spread across the globe. By updating our development culture, adopting progressive delivery, and bringing our environmental configuration under control, see how we transformed our deployment pipeline from a slow monolith to a fast and agile set of microservices.


This session is presented by LaunchDarkly.


Michael McKay

Senior Development Manager, IBM

Transcript

00:00:13

Hello, my name is Michael McKay. I'm the delivery lead for the IBM Kubernetes Service. I want to thank you for attending my session here at the DevOps Enterprise Summit, and I'd also like to thank the organizers of this event, especially for having it in October, which gives me an opportunity to show off some of my Halloween decorations, as you can probably see in my background here. Just a little bit about who I am: I've been with IBM for 24 years. I verified that this morning on LinkedIn, so I'm pretty sure it's true. I started my career at IBM helping to manage the IBM internal SAP deployments, then I spent a sizable portion of my career building products for IBM Tivoli, and finally, for the past seven years, I've been building and helping to operate the IBM Cloud. What's cool about what I'm doing today is that it's really a combination of what I did early in my career: helping to manage those SAP environments, and building code and products as I did for IBM Tivoli. So basically, now I write code and I have to run and operate it as well. One other tidbit about myself is that I have four kids and three cats, so most of the time I find managing and juggling delivery at scale to be much, much easier than what I have to deal with at home, especially during these days with COVID.

00:01:44

So a bit about the IBM Kubernetes Service. We serve as the basis for most of the services in IBM Cloud. We're currently running in six regions across the globe in 35 data centers, and we have over 110 control plane clusters to run our service. Those 110 control plane clusters support well over 20,000 Kubernetes clusters for our customers. On each of the clusters that we operate and manage, including our control plane clusters, we have various pieces of code, microservices, and configurations that we have to ensure get pushed out and managed appropriately.

00:02:34

So just a little bit of background on how we got here. For one, lots and lots and lots of trial and error. Because of all that trial and error, we learned a lot of things: things we should do, a lot of things we shouldn't do, and — which is equally important — things that are just not important. Finally, being willing to take some risks and try something new has been very, very beneficial for us, and it's unfortunately something that can be more difficult in large corporations such as IBM.

00:03:13

So when we started off four years ago, we really just had a small team here in the U.S., and we had all of our code deployed in one data center in Dallas. At that point we just wanted to build something — something that worked. We didn't put much thought into scale or how we were going to operate this thing; just, basically, what was going to be running today, and can we make sure that same thing is running tomorrow with some added pieces to it? Most of the team had a development background, and because of that, we still treated what we were building as a product — a code delivery product, something akin to "I want to take this DVD, stamp it out, and send it to my customers once a quarter." We weren't really used to running a service, and because of that, we were used to delivering features over the course of months, not days or weeks like we do today.

00:04:16

I've actually updated this slide quite a few times in the past. I used to talk about culture change, and that's not really accurate. I now talk about cultural improvement, because we're not really changing our culture. What I mean is, we're not flat out ripping out the culture we have and replacing it with something new; we're really just improving upon what we had before, because, believe it or not, a lot of the things in our culture are equally as valuable today as they were when we first started this four years ago. When we talk about culture improvement, I used to have a big, long list of things that we did, but what I find most important is what I've been calling democratizing DevOps.

00:05:04

I'm sure a lot of you have been in a similar situation, where you've had a team that was really led by just one or two, or a handful, of folks who controlled everything: what we were working on, how it got built, which libraries we were using, who was working on what and when and how. Moving towards a "DevOps democracy," as I've been calling it, basically means that every engineer should be equally involved and have a say in how we code, how we build, and how we test — and not only that, but they should all be part of building and deploying code into production. This has done wonders for how we build and operate the IBM Kubernetes Service.

00:05:59

For one, it helped us to remove lots and lots of bottlenecks. For example, in the past, we just had a couple of guys who were solely responsible for pushing code out to production. This was not only slow and cumbersome, but it also meant that those two guys were tied up and not able to do anything else. The way we do this in today's world is that every engineer is part of a squad, and each squad is now responsible for building, deploying, and delivering their own microservice.

00:06:41

So the next step we took was really to change the way we think about continuous integration and continuous delivery. As we always say, it's really hard to do progressive delivery if you can only deliver once a week. Our previous approach to doing CI and CD was, like most enterprises': we had lots of Jenkins jobs, and lots and lots more Jenkins jobs. I don't really want to call it a Rube Goldberg machine, but we had Jenkins jobs calling other Jenkins jobs calling other Jenkins jobs, and believe it or not, I think we still have some residual Jenkins jobs left over from that. The jobs themselves were these huge monoliths, just like the code we were building and deploying. Those big monoliths would be responsible for the developer checking in code; that code would be built, it would get tested, it would get promoted to an environment, we'd run some more tests, it would get promoted to the next environment, and we'd run some additional tests. The whole process for building and deploying the IBM Kubernetes Service as a whole would take about three hours.

00:07:51

And because of that, we would only be able to deliver once or twice a week. So part of rethinking the CI/CD process was that we had to set a few ground rules: for one, it had to be fast; it had to be scalable; our intent was to reduce friction; and finally, we wanted to get the users involved in the process. In the past at IBM we've had a term: "visibility, control, and automation." I'm a huge fan of the visibility. I'm also a huge fan of the control. I'm not so much a fan of the automation. I think automation does have its place, but in our process, what we found is that too much automation took our developers out of the equation, and therefore they became less a part of the process and didn't understand what to fix, what had happened, or what to do when something went wrong.

00:08:52

So our new CI process is, like most other organizations' now: you have your code checked in to GitHub, and once it's checked in, it gets built by Travis CI. Travis will build the images, run tests, lint the code, et cetera. Then it uploads the images to our image registries, all of our related Kubernetes artifacts are uploaded to cloud object storage, and finally we update a feature flag service we use called LaunchDarkly. We then use LaunchDarkly to help deliver and push our code out to our environments. That moves us on to the next step, which is the continuous delivery portion of our overall pipeline, and this is where we have the most new ideas.
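As a rough sketch of that final publish step, the end of a CI build might assemble something like the following. Every concrete name here — the registry host, the object-storage key layout, the flag key — is an assumption for illustration, not taken from IBM's actual pipeline:

```python
# Hypothetical sketch of the publish step at the end of a CI build, per the
# flow described above: tag the container image, choose a cloud-object-storage
# key for the Kubernetes artifacts, and register a new variation on the
# service's deployment flag so operators can later select it for rollout.

def publish_build(service: str, commit_sha: str) -> dict:
    """Describe where a successful build publishes its outputs."""
    short_sha = commit_sha[:7]
    return {
        # container image pushed to the image registry
        "image": f"registry.example.com/iks/{service}:{short_sha}",
        # Kubernetes YAML uploaded to cloud object storage
        "artifact": f"artifacts/{service}/{short_sha}/resources.yaml",
        # new variation on the LaunchDarkly deployment flag for this service
        "flag_variation": {"flag": f"deploy-{service}", "version": short_sha},
    }
```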

00:09:52

So my team is actually called the Razee squad, and we've invented some new pieces of technology to help us deliver at scale. I've done some talks on this in the past, but I'll briefly cover what we do here today. On each of those 20,000-plus clusters, we have a small piece of code called the cluster updater. Our cluster updater code interacts and talks back to LaunchDarkly, so we can set rules in LaunchDarkly which define how and when code is rolled out to our environments. What happens is that users check in code, that code is built and uploaded to the appropriate repositories, and LaunchDarkly is updated. Then, when the users want to deploy, they go into LaunchDarkly and find the deployment flag for their particular microservice.

00:10:49

And then we have a set of rules for each of these microservices. Most of the rules are by region, so we can update a rule that says, "Hey, for AP South, I want to deploy version XYZ." As soon as you select that new version, the cluster updater, which contains a LaunchDarkly SDK, will be notified, pull that version down, and deploy it to however many clusters are in AP South. This process works the same whether we're pushing out code to one, ten, a hundred, or 10,000 clusters — we use the exact same process. On top of that, once the cluster updater has delivered that code, or delivered those new Kubernetes resources, to those clusters, it will also send up information about the current state of the cluster, including which deployments are running and which configurations are applied, to a service.
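A minimal sketch of that region-based rule evaluation might look like this. In the real system the evaluation is done by the LaunchDarkly SDK against rules defined in LaunchDarkly; the rule shape and region names below are invented for illustration:

```python
# Sketch of the pull model described above: each cluster's updater evaluates
# the deployment flag's rules against its own attributes and applies whatever
# version the rules select. First matching rule wins, like flag targeting.

def desired_version(rules: list, cluster: dict, default: str) -> str:
    """Return the version a given cluster should run."""
    for rule in rules:
        if cluster.get("region") == rule["region"]:
            return rule["version"]
    return default  # fallthrough variation when no rule matches

# Updating this one rule rolls version 2.4.1 to every ap-south cluster,
# whether that is one cluster or ten thousand.
rules = [{"region": "ap-south", "version": "2.4.1"}]
```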

00:11:42

We call this component Razeedash. With Razeedash, we can view the current state of the clusters and what's currently deployed. What's important here is that we show what's actually on the Kubernetes cluster, not just what we thought we deployed. We've had trouble in the past where we would have a job push code to a cluster, so we assumed that version XYZ was running, but when we actually looked on the cluster, we found that version 123 was running. For whatever reason — maybe the deployment failed, maybe someone logged directly into the cluster and updated the version — we didn't have an accurate representation of what was running on these clusters. So part of the cluster updater's job is to provide that visibility and an accurate inventory of what's running on all 20,000-plus clusters that we operate. The summary of this is that we've basically switched from a push model, where we had Jenkins jobs pushing code out to the environments, to a pull model, where we take advantage of technologies like cloud object storage and LaunchDarkly to help deliver our code at scale.
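The drift check that this reporting makes possible can be sketched as a simple comparison of desired versus reported state. The data shapes here are hypothetical — the real system reports full Kubernetes resource state, not just a version string:

```python
# Sketch of detecting drift between what the rules say a cluster should run
# (desired) and what the cluster actually reported back (reported).

def find_drift(desired: dict, reported: dict) -> dict:
    """Map cluster id -> (desired, actual) for clusters that disagree.

    A cluster that never reported shows up with actual = None, which covers
    the failed-deployment case; a mismatched version covers the case where
    someone changed the cluster by hand.
    """
    return {
        cluster: (want, reported.get(cluster))
        for cluster, want in desired.items()
        if reported.get(cluster) != want
    }
```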

00:13:02

So the next step that we took: okay, we've got this new toy called LaunchDarkly, and we were just thinking, what other ways could we use LaunchDarkly? We were already using LaunchDarkly in a way it wasn't quite intended — not as feature flags, but as deployment flags. As we scaled up the deployments, and we were able to deploy not a couple of times a week but 150 to 200 times a day, we found that operating that environment could become much easier if we could just tweak some knobs here and there. So the next thing we did was take LaunchDarkly and use it to provide some operational controls via these feature flags.

00:13:55

One great example of what we use LaunchDarkly for is to lock our clusters. At any time, we can use a LaunchDarkly feature flag to lock deployments to any set or subset of clusters. So we can say that on all clusters in US East, we want to prevent any new deployments; we just go update a rule that says "lock our clusters in US East." The second we set that in LaunchDarkly — literally moments later, if not in real time — those clusters will be prevented from doing any further deployments. We have several other examples of how we do this, but this was our next step towards progressive delivery, and really, part of this was just better understanding and learning how we can actually take advantage of these feature flags.
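The lock could be sketched as a check the cluster updater makes before applying anything new. The flag semantics and attribute names here are assumptions; the point is only that the gate is evaluated per cluster, in near real time, from a centrally edited rule:

```python
# Sketch of the cluster-lock control described above: before applying a new
# deployment, the updater consults a lock flag scoped by region. Flipping
# the rule in LaunchDarkly would change `locked_regions` for every cluster.

def deployments_allowed(locked_regions: set, cluster: dict) -> bool:
    """A cluster accepts new deployments only if its region is not locked."""
    return cluster.get("region") not in locked_regions

# e.g. set via a LaunchDarkly rule during an incident in US East
locked = {"us-east"}
```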

00:14:47

So the next step. We'd gotten to the point where we were delivering like gangbusters — 150 to 200 times per day, with several dozen different teams building and deploying code simultaneously across the environment. Because that felt a little bit like the wild, wild west, what we needed to do — and we consider this our growing-up or maturing phase — was to put some controls around how we deploy code out to these environments. This also helped us provide some focus and put more controls around how we update the environments. So what we did in this case is we built an application that sits on top of LaunchDarkly.

00:15:44

And this application allows the users, in a controlled environment, to select which flag they want to update, pull down the new variant associated with a specific rule in the flag, and then submit a request to make that change. That request will trigger a ServiceNow ticket to be opened up, which our operations team can then review. For example, if we open up a change for an EU-managed cluster, we can have a team in the EU data center actually approve that request. Once they've approved the request, it's still up to the developer to go through and click the button to apply the change. As I mentioned before, we still want the developers to have a hands-on experience, so we do allow the developer to control the rollout; there's no underlying automation automatically driving it through the environments.

00:16:44

And again, this is on purpose; this is by design. Once that ticket's been approved and the developer clicks the button to start the deployment, they can monitor logs and actually test the application in the environment they deployed to, to make sure nothing went wrong. The cool thing about this process is that the application we're using to update the rules and create these change tickets is intimately knowledgeable about what you're actually changing. It knows the Git commit of the version that's currently in the environment, and it knows the Git commit of the version that you're trying to push to the environment. Because of that, the ServiceNow ticket that gets opened has all kinds of data that the operations team can use to determine how risky the change is.
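A hedged sketch of the change request such an application could assemble is below. All field names are hypothetical — the real ServiceNow payload is not described in this talk — but it shows the key idea: the ticket carries both the running commit and the proposed commit, so an approver has real context:

```python
# Sketch of a change ticket built by a flag-update application that knows
# both the currently deployed commit and the one being proposed.

def build_change_ticket(service, env, current_commit, proposed_commit, requester):
    """Assemble the data an operations team would review before approving."""
    return {
        "summary": f"Deploy {service} to {env}",
        "current_commit": current_commit,    # what is running in the env now
        "proposed_commit": proposed_commit,  # what the developer wants to roll out
        "requested_by": requester,           # who changed it, for the audit trail
        # links to test results from the previous environment could go here
        "state": "awaiting_approval",        # developer applies only after approval
    }
```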

00:17:35

Additionally, in the future, when we want to go back and see what changes occurred during a specific timeframe, we can easily see exactly what changes went into the environment and at what time. So not only do we have what changed, but we know who changed it, and we can also provide links to things like the tests that were run in the previous environment at that time. Once the code is deployed to the environment, the user can click the final button to say whether it succeeded or failed, and as soon as they click "succeeded," that will complete the change and close the change ticket record. So again, the developer is still driving the code through — they're still able to control the deployment — but we've now put in some guardrails and additional sanity checks, just to make sure we have the appropriate audit trail and controls in place so that everything continues to run smoothly. One very interesting thing about this is that LaunchDarkly themselves have now incorporated this logic directly into their offering, so eventually we will sunset what we call our Razee Flags application and begin to use the integrations with ServiceNow that are built directly into LaunchDarkly.

00:19:01

So the next step of our journey here I'll call "feature flags, act one." We'd been using feature flags for about a year or so to control our deployments, and we were using some feature flags to make operational changes and updates as well. This is where we really started to realize: wow, there's a lot of power in these feature flags — what else can we use this stuff for? Which is ironic, because what we started using them for next is what LaunchDarkly was intended for all along; that's what the whole product is designed for. So the couple dozen teams that we have started doing feature flags. Each one — for example, our API, our UI, our billing — all these different microservices started to implement their own feature flags.

00:20:00

The problem was that a lot of these features were the same, so we ended up with lots and lots of duplicate flags that all had to be managed independently from one another. Because of that, there was some confusion about which flags should be updated when, who was added to which segments and flags, et cetera. Having said that, even with those drawbacks, we were still able to deliver and update several very large features through a progressive delivery model using these feature flags. This is something that, even today, honestly, I'm quite amazed we can do: at almost any time you look at our environments, we have beta code — even pre-beta code — running in our production environments. We continuously deliver new code to these environments and can continuously improve not just existing features but new features that we haven't yet rolled out to our customers. Since we started using this model, we've done a kind of dog and pony show across IBM to say, "Hey guys, this is really cool," talking about how I feel we are now doing progressive delivery — rolling out and actually managing features, rather than just pushing features out as really large code deployments.

00:21:36

So, finally, the next step is what I would call "feature flags done right." Just like our deployments early on, our feature flags themselves were getting out of control. As I mentioned before, all these teams had various different feature flags; again, it was kind of the wild, wild west, but this time with the user-facing flags. So what we ended up doing, for one, was to sit back and really adopt feature management and feature flags into our overall development process. What that means is that the moment we think of a new feature or capability that we want for our product, the first thing we do is go out and create that flag in LaunchDarkly. For example, we have a new offering called IBM Cloud Satellite.

00:22:32

It was announced earlier this year and is currently in beta. The interesting thing is that we've had Satellite code running in our production environments since back in May, since we first started thinking and talking about Satellite. And the cool thing about this is that we've had one flag to control access to Satellite, where in the past we'd probably have had half a dozen or more different flags, one for each of the various components of Satellite, to give users access to those pieces. Now, if we want a user to get access to our new Satellite offering, there's one place we can go in LaunchDarkly to add them and give them access to the capability across the board. The next part, again just like with the deliveries, is that we had to add a little bit of control around how we manage and update these feature flags.
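The "one flag for the whole offering" idea above can be sketched as every component sharing a single access check. The flag key and user representation here are invented for illustration; in practice each component would evaluate the same LaunchDarkly flag through an SDK rather than a shared set:

```python
# Sketch of a single offering-level access flag: the UI, API, CLI, and docs
# all call the same check, so enabling a user once enables the offering
# everywhere, instead of juggling half a dozen per-component flags.

SATELLITE_FLAG = "satellite-access"  # hypothetical flag key

def has_access(enabled_users: set, user: str) -> bool:
    """All components share this one gate for the offering."""
    return user in enabled_users

beta_users = {"alice@example.com"}  # stands in for the flag's targeting list
```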

00:23:28

So because of that, we've now added a change management requirement to update these flags. Before this, random members of the team would go through and update segments or change a flag, which could potentially expose a new feature to customers we didn't intend to, or even change the behavior of certain production environments in ways we no longer intended. An important part of this, though, is that if we were to ever go back and look at the history of a service, we weren't able to really tell who had access to what and when. So now we've also integrated our change management process around this. Albeit, this process is a little more manual than our deployment flags: we just require our users to go open up a change ticket manually before they make any changes in LaunchDarkly.

00:24:25

This is one of the areas where the new LaunchDarkly feature providing direct integration with ServiceNow will help us tremendously. And so, final thoughts — closing thoughts here. Just like every other project, we're not even close to being done yet. Again, this is a journey towards progressive delivery; we learn more every day about how to better ourselves and how to deliver the IBM Kubernetes Service more efficiently, quickly, and reliably. A few things that we've looked into and definitely want to do: for one, how can we give users access to the feature flags themselves? Today, that's a very manual process. If a customer wants access to some feature, they generally have to come to us directly and say, "Hey, can you give me access to this?"

00:25:24

At which point we will go into LaunchDarkly and update that feature flag to give that user the particular capability or new feature. A lot of other products and services have the notion of a "labs," where folks can go and say, "Oh, I'd really like to try this new feature," and the users themselves can opt into these new features, try them out, and give us feedback as well. This is one of the next steps we're looking at doing. One other piece: for the most part, most of our services are feature-flag enabled, or we're delivering them very progressively, but we still have a few pieces that aren't really in the mix yet. One of those, for example, is our documentation. So today, if you go to cloud.ibm.com, you may find some new features that you have access to, but you may not see the documentation for them, or vice versa.

00:26:23

Or, maybe even worse, you may see the documentation for a particular feature but not see the capability in your experience. So at least my nirvana is that we get to the point where we literally have one flag which controls everything — access to the UI, access to the API, access to the CLI, and even access to the documentation as well. So when you're visiting cloud.ibm.com, whether you're just browsing the documentation or actually using the tools and capabilities of the platform, it will be very seamless across the board. Having said all that, I very much appreciate you listening to my talk. If you're interested in what we're doing, or want more information about it, I'll provide a link to our open source project. One thing I failed to note — and I'm going to put my shameless plug in here — is that we've actually open-sourced our delivery process and our delivery model. We call it Razee, and you can go to razee.io to find more information.

00:27:34

If you have any more questions about how we built this or how we operate, I'm always happy to talk about it. Just reach out to me on Twitter, or contact me directly via email at mmckay@us.ibm.com. Again, thank you very much, thank you for attending, and enjoy the rest of your day.