A Layered Approach to Progressive Delivery (Europe 2021)

Progressive Delivery is the practice of decoupling deploy from release, allowing changes to be safely pushed all the way to production and verified there before releasing to users. Selectively dialing up and down the exposure of code in production without a new deploy, rollback, or hotfix is the foundation of Progressive Delivery, but the higher-level benefits of safety and fast feedback come from layering practices on top of that. Whether you are new to Progressive Delivery or are already practicing some aspect of it, you'll learn/refresh on the basics and then come away with a powerful model for layering higher-value benefits on top of that foundation. This session is presented by Split.


Dave Karow

CD Evangelist, Split



Good afternoon. I hope you're enjoying DevOps Enterprise Summit. My name is Dave, and I'm a continuous delivery evangelist at Split Software. Today I'm going to ask you three questions, and then we're going to talk about layering data-informed practices on top of progressive delivery. Let's jump right in. First question: do you do this when your team releases to production? Each time you're going out to release a feature, is there sort of an "Ooh, I hope it goes okay"? Question number two: can you remember a release night that went something like this? I'll give you a second to take in the meme here.


I've been through a few releases where we knew partway through, oh no, this is not going as planned. This is very, very bad. Not a lot of fun. And then finally, how do you respond when someone asks, "How successful was that release?" You kind of go, "Hmm." You might not say that to them, but you think to yourself, well, I don't know, we haven't really had any incidents. I'm not sure, but hopefully it went okay. Well, I don't think it has to be that way. What I've been up to for the last year or two is demystifying progressive delivery, especially the role of automated data attribution early on, as you're rolling things out. I got here because I've spent most of my career focused on developer tools, developer communities, and what I call sustainable software delivery practices, where the focus is delivering impact without burning out humans.


Early on, I did my share of burning myself out, and then I became a product manager and went on from there. I believe there are practices we can adopt which let us have greater impact without having to chew people up. That's what motivates me. So I'd like to present a layered approach to progressive delivery: how do we build our way up to these faster, safer, smarter releases? We'll spend a little bit of time on background. What is progressive delivery? Can we define it in a concise way? We'll look at a couple of role models: who's already doing this, has been doing it for quite a while, and what do they use it for? Then we'll lay the foundation. In order to use these practices, you first need to figure out how you're going to decouple deploy from release.


How can I push software all the way to production but have it effectively be off? And then the upper layers: these are the automated, data-informed practices that really deliver the value to your team and to your business. So, progressive delivery. What is it really? Let's talk for a second about the roots, because I think the roots of progressive delivery are important to understand. Is this just about gradually rolling out software, or is there something more going on here? You may know Sam Guckenheimer. He retired recently from Microsoft, where he was head of Azure DevOps, and he was having a conversation with James Governor, who goes by the handle monkchips on Twitter. This is what Sam said to James: when we're rolling out services, what we do is progressive experimentation, because what really matters is the blast radius.


How many people will be affected when we roll out that service, and what can we learn from them? I give this example because I think it's important to focus on what Sam said: progressive experimentation. He was talking about learning early in the process. He didn't just want to limit the blast radius and go out slowly; he actually wanted to learn as much as possible. And what happened is that James had sort of an aha moment. He thought, you know, I'm connecting a lot of dots here; I've seen a number of practices that are about how to change the way we roll all the way out to production. And he decided to coin this term, progressive delivery. He described it, as you see here, as a new basket of skills and technologies concerned with modern software development, testing, and deployment.


How do I roll out in an intelligent, effective way where I have smaller lead times, happier teams, more impact on the business? Carlos Sanchez, who's now a cloud software engineer at Adobe, wrote a blog post in which he described progressive delivery in a really clear, succinct way. I'm going to use it here: progressive delivery is the next step after continuous delivery, where new versions are deployed to a subset of users (and that's really important, a subset of users) and are evaluated in terms of correctness and performance. So I'm evaluating them as I go. And I'm not just evaluating the way QA does, where it either works or it fails. That might be what we consider correctness. But performance: is it performant in terms of speed and system resources? Does it also perform in terms of business impact? Is it having the impact I expect, before rolling it out to the totality of users?


So I'm actually going to learn things before I expose everybody to it, and you roll it back if it doesn't match some key metrics. Now, your culture may determine what your criteria are for rolling back, as opposed to just saying, okay, we learned something, let's try again. We'll talk a little bit about how that gets done. So let's switch gears and talk about a couple of role models: who's doing this already? The term progressive delivery was coined in 2018, but the practice has been going on for well over a decade. Let's look at these two. First of all, Walmart Expo. Walmart built their own platform for gradually rolling things out and learning as it's happening, because at the time they built their solution, there wasn't really a market of solutions that could do this for you.


So they had to build it from scratch themselves, and you can learn more about how they did the plumbing for that. Big company, a bunch of engineers, you can fill in the blanks. But I think it's really interesting if you look at the two reasons they use it. They call them test to learn and test to launch. Now, test to learn sounds a lot like A/B testing or experimentation, and you'd be right if that was your assumption. Test to launch is more what I'm talking about here in terms of layering practices on top of progressive delivery: they wanted to be able to effectively run A/B tests on the fly during partial rollouts, so that they could determine whether they're impacting the business before they roll something all the way out.


The second example is LinkedIn experimentation. The stuff on the left shows you roughly how this flows. The users make requests to the central service. The central service has a library it can call that tells it what it should expose to the user, and the user gets what they're supposed to get. You could decide, hey, students are going to get 50/50, job seekers are going to get 20/80, and everybody else is going to get none of it. What's interesting about this example, though, is on the right side of the screen. What does it say? It says LiX failed on site speed. I can be fairly sure this was not an experiment on what's going to be faster. This was probably an experiment on what will make people sign up for job offers, et cetera.


But this is an example of what we call a guardrail, which is that LinkedIn is always watching for certain things they've determined are always important, like errors and response time. In this case, the alert is firing that, hey, this thing you're rolling out is slowing things down by 50%, and that's not good. We'll get back to guardrails in just a bit. Okay, let's move on to the foundation. In order to do any of these practices, we need a way of decoupling deploy from release. It turns out there are a number of ways to decouple deploy from release, and how you roll does matter. Whether or not you've heard of blue-green deployments, canary releases, or feature flags, I'm going to show you how each of these exposes different benefits.


The benefits are down the left side. There's avoiding downtime. There's limiting the blast radius, which has two aspects: not hurting people for very long, and not hurting very many people. Then there's limiting work in progress and achieving flow. This is how we get shorter lead times: if we have smaller bits of work going through, and a smaller number of them, they can go all the way through safely without everything getting stuck in a logjam of testing and integration troubles. And then finally, learning during the process. This is what Sam Guckenheimer mentioned in that conversation. How do I learn during these partial deployments so I can get value out of them, not just limit my blast radius? So, blue-green deployment. This is the notion of having a copy of your production infrastructure, and this is obviously easier in the cloud.


If you have a copy of it, you can build the next release and take as much time as you want to get it all ready. So there's no downtime. There's no "we're down for maintenance" interval while we upgrade the solution. You can do all the work you need to do to get everything ready on the green deployment, and then just switch network traffic over from blue to green. Instantly people are in; there's no downtime. And because you have that ability to switch network traffic between blue and green, you can switch it back, which is why limiting the blast radius gets half credit here: I can decouple deploy from release, and if something goes wrong, I can flip it back really fast. Now, I might have exposed all my users to the problem, but not for very long.


And that's why you get 50%. The time aspect checks the box; the scope does not. And there's nothing really helping here with rolling out smaller pieces or learning inherently in the process. Then along come canary releases, where you use containers: you spawn a few extra containers and send maybe 2 or 3% of your network traffic through those containers, with the new release on them. Again, you can build those containers at your own leisure, so there's no downtime needed there. Limiting the blast radius gets 100% credit, because you're only exposing a small percentage of your users to it, and if something goes wrong, you can route the traffic right back through the other containers very quickly. It doesn't necessarily help you with limiting work in progress or achieving flow, because you're still pushing a whole release out on those canaries, unless you've achieved the microservices holy grail where a feature is on its own server. But most of us don't live in that world.
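To make the canary idea concrete, here is a minimal sketch of weighted traffic routing. It is an illustration only, not how any particular load balancer or service mesh does it; the names `pick_backend` and `simulate` and the 3% weight are assumptions chosen to match the "2 or 3%" figure in the talk.

```python
import random


def pick_backend(canary_weight: float, rng: random.Random) -> str:
    """Route one request: 'canary' with probability canary_weight, else 'stable'."""
    return "canary" if rng.random() < canary_weight else "stable"


def simulate(canary_weight: float, requests: int = 10_000, seed: int = 7) -> float:
    """Return the observed share of requests that reached the canary."""
    rng = random.Random(seed)
    hits = sum(pick_backend(canary_weight, rng) == "canary" for _ in range(requests))
    return hits / requests


# Hypothetical rollout: roughly 3% of traffic sees the new release.
canary_share = simulate(0.03)

# "Routing the traffic right back" is just setting the weight to zero.
after_rollback = simulate(0.0)

print(f"canary share during rollout: {canary_share:.3f}")
print(f"canary share after rollback: {after_rollback:.3f}")
```

In this framing, a blue-green switch is the special case where the weight jumps from 0 to 1 in one step, which is why it limits how long users are exposed but not how many.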


So the last one is learning during the process. It gets one-quarter credit, and that's because when you roll out this way, you can be hypervigilant about the very small number of servers you're running this on and try to notice how it's working. So you could try to learn quickly because it's all focused in one little place. But you're still effectively looking for a needle in the haystack of everything that's in that release. Things got a little different when feature flags became widely adopted. It's not a coincidence that the time when feature flags became widely adopted is the same time period in which the term progressive delivery was coined. And that's because of that third full circle there, limiting work in progress and achieving flow. Feature flags let you release not just a deployment, but literally a block of code within a deployment.


So any subset of your app can be wrapped in a feature flag and controlled remotely. That meant you could push much more stuff out and have much more nimble control of it. If one feature has an issue, you can turn that feature off without having to revert the whole release. The last circle is empty, though, because there's nothing built into feature flags themselves that lets you learn what's going on, and you might be releasing any number of things at roughly the same time, so time-based correlations aren't really going to help you. Again, you're searching for the needle in the haystack. Then along comes this approach, which is having feature flags and data integrated together. That's what you saw with the Walmart example and the LinkedIn example. They have telemetry associated with the deployment, such that they know which users are getting which experience.


And they know what the experience is for each of those groups of users. Therefore they can calculate this, and because they have a lot of data science built into automation, they can figure this out on the fly. They don't study this after the fact; they know about it as it's happening. That's the practice I'm talking about today, and it will be the upper layers of our pyramid. So, just a quick recap on feature flags. You place a function call in your code. You can create one version of your code that can be in different environments, rolling through the pipeline, and your ability to expose that code can be controlled remotely. It's like having a dimmer switch for changes in the cloud. You might roll all the way to production and have, as this first row shows here, 0% of your users exposed.


The reason you would do that is so that you can test in production, and because maybe the feature isn't even finished yet, but you wanted to push a component of it all the way out to production so it could be validated there, and move on to the next component before you ever expose it to users. Then, once you're ready, you can dial it up gradually. For those who like a little bit of code, this slide and the next one are the only code I'll be showing in today's presentation, so don't panic if code is not your thing. It's a function call that says, hey, I'm asking the flagging subsystem to tell me: should this particular user, at this particular moment in time, be shown the related posts? And this is evaluated a user at a time, a session at a time.


This is not a line in a config file for the whole server or the whole user population. It could be right down to Dave. If the treatment's on, show them the related posts; if it's not, skip it. And then here's a multivariate example: I may be trying out two new search algorithms, and I want to compare them to my legacy search algorithm on response time or on the value of the responses. I can divide my user population up. I might have 80% going through legacy, and 10% each going through V1 and V2, and then ramp it up to the point where it's one-third, one-third, one-third, and then compare the business outcomes and the system load. So again, that's the end of the code.
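The slides themselves are not part of this transcript, so here is a minimal sketch of what the two flag checks described above might look like. It is not Split's SDK or anyone's actual API: the function names `bucket` and `get_treatment`, the flag names, and the hash-based bucketing scheme are all assumptions made for illustration.

```python
import hashlib

Allocation = list[tuple[str, int]]  # ordered (treatment, percent) pairs


def bucket(user_id: str, flag_name: str) -> int:
    """Map a user to a stable bucket in 0..99 for a given flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def get_treatment(user_id: str, flag_name: str, allocation: Allocation,
                  default: str = "off") -> str:
    """Return the treatment whose percentage range contains the user's bucket."""
    user_bucket = bucket(user_id, flag_name)
    upper = 0
    for treatment, percent in allocation:
        upper += percent
        if user_bucket < upper:
            return treatment
    return default  # percentages summed to less than 100


# Hypothetical allocations; in a real system these live in the flag service,
# so they can change without a new deploy.
RELATED_POSTS: Allocation = [("on", 10), ("off", 90)]
SEARCH: Allocation = [("legacy", 80), ("v1", 10), ("v2", 10)]

# On/off flag, evaluated one user at a time.
if get_treatment("dave", "related-posts", RELATED_POSTS) == "on":
    print("show related posts")
else:
    print("skip related posts")

# Multivariate flag: legacy search versus two candidate algorithms.
print("search algorithm for dave:", get_treatment("dave", "search-algo", SEARCH))
```

Hashing the user ID together with the flag name keeps each user's assignment stable between requests while keeping different flags independent of each other. Ramping to one-third each would mean changing `SEARCH` to something like 34/33/33, again without touching the deployed code.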


So once you have feature flags in place to decouple deploy from release, you get things like the ability to do incremental feature development for flow. If you can build things a little bit at a time and roll them out as smaller packages that are easier to evaluate, you can get more done and achieve flow, as opposed to a logjam. Then there's testing in production, as I mentioned. That doesn't mean using your users to test. It can literally mean testing the code in the production environment with the production data, but not exposing the feature to users until you're happy with it. And then finally, and this is very popular with developers, there's the kill switch: a panic button, a big red button. The distinction here is that because you're using feature flags, if something goes wrong, you don't need a hotfix. You don't need a rollback. You don't need a war room of people deciding how we're going to fix this. You can just turn it off and then have your conversations about what went wrong and what should be done next.


So that's the foundation. Now come the data-informed practices that I promised, which are these upper layers. These are the practices that are a little less known out in the wild. Obviously, the companies that have been doing this for a while know about them, and we can talk more about who some of those other role models are. But I hesitate to name a lot of role models, because they're generally very large corporations with very large tooling budgets who've invested years and millions of dollars to accomplish this practice. My premise is that that's no longer necessary. There are ways to accomplish this now with off-the-shelf services. I want to talk about how we would, in a sane way, introduce these practices into our environments so we can start working a different way.


So, first of all, why would we automate data-informed practices? What's the main motivation? I believe it's because a different way to ship becomes possible, and I think most of you are here because you're trying to figure out different ways to ship, to deliver value, to deliver software. It starts with deploying with no exposure to the users: the code is deployed, but no one's seeing the code or being influenced by the outcome. Then we go through a step we call error mitigation. Here I'm testing in production, maybe with 0% of my users, but then I might roll it out to 1% or 2% or 5% of my users. Here I'm just trying to find bugs or crashes, things going wrong that I missed in my earlier testing. What can I catch before the big Twitter rant from people? How can I actually check for errors? Then I want to move over.


And this is where it's a little more exciting, which is I want to be able to measure the impact of the release. In this step, we're about halfway through, and frankly, that's why it says maximum power ramp: maximum power is achieved by pushing as many people as possible through both sides of this experience, the people who are getting the new thing and the people who are not, so that you have the greatest statistical power to determine whether it's actually delivering the impact you want. And then there's one last stop before we go all the way out, which is scale mitigation. Maybe you've had a release where everything seemed fine, but then along came your peak period, the big time when people are opening their messages or doing their thing, and things went wrong. So what if you could ramp to, say, 70 or 80 or 90%, whatever seems most reasonable in your environment, and ride through a peak period?


And only then decide: hey, yeah, during our peak period we saw the usual patterns. We didn't see any weird spike. We didn't see any race conditions or anything. It's all good. So then we release. Historically, we think of just these white ones, deploy and release. What we're proposing now is that there are these data-informed practices of error mitigation, measuring release impact, and scale mitigation that we can add, without actually asking your people to do more work, and you'll see why that's true. So first of all, can we just change things and monitor what happens? I came up at a point where we had our heroes, our amazing people who could dig through log files and do amazing feats of diagnosis when things were going wrong.
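The five steps just described (deploy, error mitigation, maximum power ramp, scale mitigation, release) can be written down as a simple ramp plan. The sketch below uses illustrative percentages from the ranges mentioned in the talk. It also shows why the middle step sits at 50/50, under the simplifying assumptions that total traffic is fixed and both groups have the same variance: the standard error of the difference between treatment and control is proportional to sqrt(1/p + 1/(1 - p)), which is smallest at p = 0.5. The names `RAMP_PLAN` and `relative_standard_error` are hypothetical.

```python
import math

# (stage name, percent of users exposed); the percentages are illustrative.
RAMP_PLAN = [
    ("deploy", 0),               # code in production, nobody exposed
    ("error mitigation", 5),     # look for bugs and crashes on a small slice
    ("maximum power ramp", 50),  # measure release impact
    ("scale mitigation", 80),    # ride through a peak period
    ("release", 100),            # everyone exposed
]


def relative_standard_error(exposed_fraction: float) -> float:
    """Standard error of (treatment mean - control mean), relative to a 50/50 split.

    Assumes fixed total traffic and equal variance in both groups.
    Returns infinity when one side has no users, since no comparison is possible.
    """
    p = exposed_fraction
    if not 0.0 < p < 1.0:
        return math.inf
    return math.sqrt(1.0 / p + 1.0 / (1.0 - p)) / 2.0


for stage, percent in RAMP_PLAN:
    rse = relative_standard_error(percent / 100)
    print(f"{stage:>20}: {percent:>3}% exposed, relative standard error {rse:.2f}")
```

Under these assumptions, a 5% exposure gives a standard error a bit more than twice that of a 50/50 split, which is one way to see why small ramps are suited to catching large errors and the half-and-half ramp is suited to measuring subtler impact.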


And you always wanted to have that person around if something did go wrong. But it's expensive to find these people, and it's expensive on them as human beings to always be in that mode. You can't always see what's going on and figure it out quickly, especially when you're making more changes more often. So this is the problem we need to solve: how do I separate signal from noise? There are other things going on in the world while we're doing our releases. There may be other product changes happening at the same time. There might be marketing campaigns: your company may be paying money to deliberately change the behavior of your users. You don't want to take credit, for good or for bad, for a change in user behavior.


Not if somebody is actually moving the needle through a marketing campaign. Another example: global pandemics. They change user behavior. So how do I separate that from the fact that I introduced a new feature? Is it because everyone's working from home, or is it because my feature is really meeting your needs? And then finally, something as simple as nice weather, depending on what you sell and who you sell it to, can have an influence on user behavior. If the weather changes, do you want that to throw off your ability to determine how well your release works? Do you want to be like, well, I don't know, it was a really sunny weekend? Then I'm going to be the guy from the first part of my deck. How do we make it so that those things don't throw us off? We need a way of separating the signal from the noise.


Ask yourself: what do you already have in your life that lets you separate signal from noise? We'll come back to that in just a second. Think about when we used to fly on planes a lot, and what you wanted to have to avoid the noise. So imagine this: you're rolling out a feature, and you go to 100%. That's the bar on the right, which is kind of traditional IT: we released it, and then, oh my God, response times went up, latency went up, and throughput went down. That's bad. But look, there's another bar to the left of that one, and it's from when the feature was rolled out to 5% of the users. So say you adopt this sort of progressive delivery, and you define it as just decoupling deploy from release, and you say, yeah, we're going to roll it to 5%.


You're looking at your usual graphs, you might see something like this, and you'd say, well, it looks like the ambient traffic on the graph. I don't see any anomaly, so we're good. And that's the problem: if you've rolled out to 5%, the problem would have to be 20 times bigger than normal for you to even see it. So inherently, it kind of breaks your typical way of looking at how we're doing today. But the good news is there's a way around this. The tip I gave earlier was: think about noise-canceling headphones. So how do we cancel out external influences? With a stats engine. It works much like noise-canceling headphones, but for your metrics. Noise-canceling headphones have microphones on the outside that are listening to the ambient noise.
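The "20 times bigger" remark is just dilution arithmetic, and a small worked example may help. The numbers below (a 200 ms baseline and a 50% slowdown for exposed users) are made up for illustration, and `blended_metric` is a hypothetical helper, not part of any monitoring tool.

```python
def blended_metric(baseline: float, treated: float, exposed_fraction: float) -> float:
    """Average metric over all users when only a fraction sees the new code."""
    return (1.0 - exposed_fraction) * baseline + exposed_fraction * treated


baseline_ms = 200.0  # assumed typical response time
treated_ms = 300.0   # assumed response time for exposed users (50% slower)

at_5_percent = blended_metric(baseline_ms, treated_ms, 0.05)
at_100_percent = blended_metric(baseline_ms, treated_ms, 1.0)

# The shift visible on the aggregate graph scales with the exposed fraction.
shift_at_5 = at_5_percent - baseline_ms
shift_at_100 = at_100_percent - baseline_ms

print(f"aggregate at 5% exposure:   {at_5_percent:.1f} ms (shift {shift_at_5:.1f} ms)")
print(f"aggregate at 100% exposure: {at_100_percent:.1f} ms (shift {shift_at_100:.1f} ms)")
print(f"ratio of shifts: {shift_at_100 / shift_at_5:.0f}x")
```

A regression that is obvious at full exposure moves the all-users graph by only one twentieth as much at 5%, which is easily lost in ordinary day-to-day variation. That is the motivation for comparing exposed and unexposed users directly instead of watching the aggregate.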


They then inject into your ears the inverse of that ambient noise, along with the music or the podcast. As a result, you hear the music or the podcast without the outside noise, because it's been canceled out. The first time I tried that on a plane, I literally laughed when I turned it off again and heard the difference. I was like, oh my God. So here's how this works with our metrics. Take half your users and send them through the new thing, and send the other half through the status quo: what a scientist would call the control for the status quo and the treatment for the new thing. Then compare the distributions of user behavior and system behavior for those two populations. If they overlap exactly, then the thing you did really didn't have any influence. It may be a busier day, and there may be more latency than normal.


But if they both show the same increase in latency, it's not the change that did it. Conversely, if these distributions are apart from each other, then you know you did something, and you can see where the metrics are different. That unlocks the upper layers of the pyramid. So how do I automate guardrails? Guardrails are this notion of, like, that site speed thing. How do I find a way to alert on exception and see these performance hits early in my rollout without toil, without people having to be hypervigilant? It's limiting the blast radius without manual heroics. If you can have a stats engine watching the things you generally care about (errors, response time, unsubscribes, whatever) and have it alert you, you can push 10 or 100 times a day, and you don't have people going crazy paying attention, because the system's doing it for you.
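Here is a minimal sketch of the control-versus-treatment comparison and a guardrail built on it. It uses Welch's t-statistic from the Python standard library as a stand-in; it is not a description of how Split's, LinkedIn's, or Walmart's stats engines work, and a production engine would also deal with things like repeated looks at the data and many metrics at once. The function names, the threshold of 3.0, and the simulated latencies are assumptions.

```python
import math
import random
import statistics


def welch_t(control: list[float], treatment: list[float]) -> float:
    """Welch's t-statistic for (treatment mean - control mean)."""
    mean_c, mean_t = statistics.fmean(control), statistics.fmean(treatment)
    var_c, var_t = statistics.variance(control), statistics.variance(treatment)
    return (mean_t - mean_c) / math.sqrt(var_c / len(control) + var_t / len(treatment))


def guardrail_breached(control: list[float], treatment: list[float],
                       threshold: float = 3.0) -> bool:
    """Alert when treatment is worse (higher) than control by more than the threshold.

    One-sided on purpose: for latency or error counts, only increases are bad.
    """
    return welch_t(control, treatment) > threshold


# Simulated response times in ms. A busy day adds the same ambient latency to
# both groups; only the treatment group also pays for the new feature.
rng = random.Random(42)
ambient_ms = 40.0
feature_cost_ms = 30.0
control = [rng.gauss(200.0 + ambient_ms, 20.0) for _ in range(500)]
treatment = [rng.gauss(200.0 + ambient_ms + feature_cost_ms, 20.0) for _ in range(500)]

print("t-statistic:", round(welch_t(control, treatment), 1))
print("guardrail breached:", guardrail_breached(control, treatment))
```

Because the ambient latency is added to both groups, it cancels out of the difference, which is the noise-canceling effect described above. The comparison reacts only to what differs between control and treatment.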


Then let's move on to measuring release impact. Here we want to be in a situation where we know whether the thing we did makes a difference. If you're iterating often, if you're achieving continuous delivery and using decoupling deploy from release to achieve flow, and you ship, ship, ship, ship, but you don't even know whether it's having an effect, it's very demoralizing. It's been called a feature factory. This is not a world we want to create. We don't want to be in a world where we're just making people move faster and we don't even know whether we're having an impact. When you have direct evidence of your efforts, you're more likely to get pride of ownership. Even if something goes wrong, you're more likely to have people say, hey, let's fix that, because I know it's not as good as it could be.


So direct evidence of your efforts leads to greater pride. It's also better for psychological safety, because we actually know whether we're hurting. It's not rumors; it's not someone's opinion. And then finally, test to learn, what you might traditionally think of as A/B testing. Here, what we want to accomplish is to be able to take bigger risks, but in a safe way. I'll give you an example of that in a second. And we want to be able to learn faster with less investment. One of the common misconceptions here, by the way, would be: oh, A/B testing, well, I don't want to build two of every feature. More than once, I've literally had a product manager say to me, well, I don't want to use two story points instead of one.


This is not what this is about. First of all, what you might be A/B testing is the status quo that you currently have against the question: should we add this new change? That's not two versions of the code that you built new. That's just one new thing, and seeing what its impact is. Second of all, there are ways to try different new things without having to create multiple versions of your software. One of them is called dynamic configuration. The idea here is that if you're using feature flags to decouple deploy from release, they can carry along a payload of parameters that are specific to the population that's getting that flag. So if I've already determined that I'm dividing my users into three populations, I can send each one its own parameters. I'll give you a specific example; it helps. Speedway Motors is a car parts site online, and they have a recommendation engine, a third-party system that takes input parameters to determine how it comes back with recommendations.


They wanted to see how these would behave in the real world. So they set up dynamic config to send different sets of parameters to different cohorts of users and then observe what happens. They could then iterate on that experiment without a new release, just by changing the values of those dynamic configuration parameters. Instantly, these user populations are on to a new experiment, which is that new set of parameters, and they're capturing data in sync with that. The other example I want to give is the painted door. This is kind of a usability hack: you can build the entrance to a feature without building all the complexity in the backend. When people click on it, you can either throw an error or, better, you can say, hey, thanks so much for your interest in this new feature. Like, an annual plan is of interest to you?
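As a rough sketch of dynamic configuration, the snippet below attaches a parameter payload to each cohort's treatment and merges it over defaults. The cohort names, the parameter names (`max_results`, `min_similarity`), and the `Treatment` class are invented for illustration; they are not Speedway Motors' actual setup or any vendor's API.

```python
from dataclasses import dataclass, field


@dataclass
class Treatment:
    """A flag treatment plus the parameter payload that travels with it."""
    name: str
    config: dict = field(default_factory=dict)


# Hypothetical defaults for a third-party recommendation engine.
DEFAULT_PARAMS = {"max_results": 6, "min_similarity": 0.5}

# Stands in for flag definitions held by a remote flag service.
FLAG_BY_COHORT = {
    "cohort_a": Treatment("variant_a", {"max_results": 4, "min_similarity": 0.6}),
    "cohort_b": Treatment("variant_b", {"max_results": 8}),
}


def recommendation_params(cohort: str) -> dict:
    """Return engine parameters for a cohort, falling back to the defaults."""
    treatment = FLAG_BY_COHORT.get(cohort)
    if treatment is None:
        return dict(DEFAULT_PARAMS)
    return {**DEFAULT_PARAMS, **treatment.config}


print(recommendation_params("cohort_a"))
print(recommendation_params("everyone_else"))

# Starting the next experiment is a data change, not a code change or release.
FLAG_BY_COHORT["cohort_a"].config["max_results"] = 10
print(recommendation_params("cohort_a"))
```

The application code that calls the engine never changes between experiments. Only the payload attached to the flag does, which is why each iteration needs no new release.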


We're working on that; we'll get back to you. The last thing I want to say while we're here on taking bigger risks safely is this. Imagine you're a food delivery service, and you know that if you ship the right food to people, they're more likely to buy more, like you, and stay. But you need to ask them more questions to make sure you know what they want. There's a company called Imperfect Foods in the United States. They had a signup flow, and they proposed adding more questions to the flow so that they could better cater to their customers. The concern was that people would drop off and not finish the signup. So what they did was create the traditional flow, a slightly longer flow, and a significantly longer flow, and test them out side by side.


They found out that the longer one yielded consistently better results. They were getting something like seven or nine US dollars more per order from the people who went through that flow. And so they fired it up. This was a great way to take a risk without having a big debate over whether we should do this or not and what will happen. So this is what sustainable software delivery looks like. I believe that this new way of progressive delivery, with automated data science in there, creates a much more sustainable flow. If you look here, traditionally we think of the deploy and the release. These yellow steps in the middle are only made possible by automating the ability to do the statistics as this rolls out. This is not about studying the data afterwards. It's not about begging a favor from somebody to look at some data.


This is something that happens automatically. That's the way we build pipelines: we want to automate things so that we can run them as many times as we want, and they do the same thing every time, to make sure we're operating consistently, effectively, and efficiently. So this is the only slide where I'm really going to talk about Split Software: who is Split Software, and what's our deal? Well, in-house progressive delivery platforms paved the way for Split. In fact, our founders came from multiple shops where they had built these large, complex systems, and when they went to other places that didn't have them, they missed them and wished they could just buy this. Which is what they set out to build. Companies have adopted parts of progressive delivery, like feature flags, and some have even added some sort of sensors and tried to do correlation, but only very few have figured out how to do the stats engine and the system of record for the alerting.


And those that did were spending tens of millions of dollars a year on making that happen. And they do it again and re-up because they love it; it's a competitive advantage. We don't all have the advantage of having 50 million or 30 million or 25 million dollars a year to allocate to this, or of spending two, three, four, or five years to build it. What the engineers who built Split set out to do was make this something you can just subscribe to as a SaaS. So let's move on to Q&A in Slack. I'm looking forward to your questions. I also want to invite you to come to our booth. We're doing kind of a fun thing we call Confessions and Redemptions in Continuous Delivery. Usually, when we have an in-person show, we have great conversations with people about what they're working on and how it's gone, and we're going to try to recreate that here. We're also holding a raffle to give away an Oculus Quest 2, a virtual reality headset. I hope you found this talk interesting and informative. I have a lot of vendor-neutral content we can talk about. This is not all about Split. Split is just one company that's leading the way in making this productized.