From Flagging Releases, to Flagged Releases - A Story of Release Acceleration from Vodafone UK

This session is presented by LaunchDarkly.

Robert Greville

Head of Web Engineering, Vodafone UK

David Ward

Head of Product Engineering, Vodafone UK

Natasha Wright

Engineering Senior Manager, Accenture

Transcript

00:00:13

Hey, and welcome to our talk, From Flagging Releases to Flagged Releases, a story of release acceleration from Vodafone UK. Welcome. My name is Robert Greville and I'm responsible for Vodafone UK web engineering. And I'm going to tell you a story, a story of wonderment, peril, danger, and delight, a story in three parts, a beginning, a middle, and an end, that takes us on a hero's journey from flagging releases to flagged releases. Before we continue, let's first just do some housekeeping. I'll be joined by two other intrepid adventurers. Firstly, David Ward, who's responsible for our platform team, agile practice, automation engineers, and MBAs. And I'll also be joined by Natasha Wright, who's been leading our CI/CD team from Accenture. Please join us on our quest.

00:01:17

Before we talk about our journey, let me first tell you about our humble beginnings. Vodafone was the first company to make a mobile phone call in the UK, the first company to introduce text messaging, and the first to introduce international roaming. We are a company of firsts, and today we deliver a plethora of products and services, from mobile devices to home broadband, to businesses and over 600 million customers around the globe, with 18 million of those customers here in the UK. We're a global leader in IoT, serving over 90 million connections. That's more than anybody else. We help our customers all around the globe stay connected. We passionately believe in the power of new communication networks and technologies to change our society for the better. How about I play this video that tells you a little bit about us?

00:03:24

This is not where our story starts. Once upon a time, we were fighting to make change. There was a call to adventure. The world was changing around us. The speed and rate of change was accelerating and we needed to catch up. We needed some help, something that could help us with our challenge on the road that lay ahead: our ability to release code with pace and with quality. In the beginning, Vodafone didn't really have much of an engineering team. It didn't have developers, guilds, tribes, or communities. We had isolated parts of our business making change and coming together all at once. Releasing was becoming harder and harder each and every time that we tried. As we began to grow and build our engineering squads, we began to swarm around ideas and tooling that would help us hit our objectives and solve those challenges. We use Azure DevOps, as it helps us with all of our agile needs. For our CI/CD practice it serves us really, really well: storing code, managing workflow, even documentation. And in the short term, we were using Azure DevOps to handle the turning on and off of code functionality through environment variables. But even that had its own challenges.
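To make the environment-variable approach concrete, here is a minimal Python sketch of what toggling functionality that way can look like. The helper and the variable name FEATURE_NEW_CHECKOUT are made up for illustration and are not taken from the talk.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean toggle from an environment variable (hypothetical helper)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def checkout_variant() -> str:
    # FEATURE_NEW_CHECKOUT is an invented name, used only for this sketch.
    if env_flag("FEATURE_NEW_CHECKOUT"):
        return "new checkout"
    return "old checkout"
```

Because the value is typically baked in when the pipeline deploys the service, changing it means running the pipeline again, which is the coupling between toggling and releasing that this part of the story is about.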

00:04:45

They were too coupled. All of our heroes needed to bundle work together and only go when all of them were ready. Getting work out became cumbersome: long, arduous, tiring days and nights for our engineering legends. Everything had to be done by a person deploying code, usually late into the night. I remember fondly being gathered around a conference table in our office with fingers hovering over buttons, ready to press. And then, once they were pressed, having to check everything manually, even placing orders in our shop ourselves to check that everything was working as we expected. Outside of engineering, even people that wanted to make changes, such as our product management guild, relied on engineering support to turn those features on and off or run beta tests. There was no interface. The system was really unusable for non-technical users. So it meant that only developers could make changes, and all the changes needed a release as well. Even something as simple as setting something to true required a new release, meaning more time being spent on administrative tasks rather than actually delivering customer value. So as the solution grew, we needed to deploy changes more quickly to production, but we were already getting further and further ahead of ourselves. We needed to get more releases out, but maybe, just maybe, in a way that our customers couldn't see those changes. So surely there must have been a better way. I'm going to hand over now for our middle section with David Ward.

00:06:23

So here we were, releasing once per quarter. We had a strategy for releasing in place, and that strategy was there to protect us. We'd been scared to release, frightened at the thought of causing an outage and having to spend endless hours in rooms of war debating where the issue had occurred, what the root cause was, and ultimately who was going to get the slap on the wrist this time for being the cause of the issue. Our strategy was built on blue-green deployments. We had two identical versions of our entire platform running in production, with only one of those receiving traffic from our customers at any point in time. A release would involve pushing changes to the dormant production environment, running sanity tests to check for stability, before then performing a flip of website traffic to the newly updated production environment. These flips had come to protect us.

00:07:25

If we ever had an issue with a release, we would simply flip back, and our chastening for causing a release outage would be significantly less. So the flip was there to protect us, and protect us it did. For a time there were successes: successes of decoupling that came with a new microservice architectural design, successes of improved availability at lower cost from using cloud hosting, successes of greater throughput and efficiency than ever from an improving agile delivery process, and successes of an improved release cadence. We were now able to release our platform every two weeks. But these successes would also prove to be the doom of our existing release process. We were delivering business value faster than ever, and as a result, we were being asked to scale more than ever. The flip release process that worked well for 10 teams would not work for 20 teams.
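The blue-green flip and flip-back described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class, the version strings, and the sanity_check callback are all hypothetical, and a real flip would move traffic at the load balancer or DNS level rather than by changing a field.

```python
from dataclasses import dataclass, field


@dataclass
class BlueGreen:
    """Two identical production stacks; only `live` receives customer traffic."""

    versions: dict = field(default_factory=lambda: {"blue": "v1", "green": "v1"})
    live: str = "blue"

    @property
    def dormant(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def release(self, version: str, sanity_check) -> bool:
        """Deploy to the dormant stack, sanity-test it, then flip traffic."""
        target = self.dormant
        self.versions[target] = version
        if not sanity_check(version):
            return False  # traffic never moved, so customers are unaffected
        self.live = target
        return True

    def flip_back(self) -> None:
        """Roll back by pointing traffic at the previous stack again."""
        self.live = self.dormant
```

Note that every change from every team rides on the same flip, which is why the approach gets harder to operate as the number of teams grows.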

00:08:30

We were now flipping every sprint, so every two weeks. That meant that within any given two-week period, if any team wanted to deploy any change to any one of their services, they needed to have completed their development work and deployed that change to a dedicated platform environment for regression testing by day seven of the sprint. Our regression testing would then have a few days to run full regression, triage, and resolve any defects with our development teams in order to ensure that the release was stable and could be flipped. With more teams, more services, and more demand to deliver, we were pushing more and more into each flip to be released. These flips were fast becoming unwieldy, with many, many changes across many different microservices all going out in a single flip. Our releases had become microlithic. The more we shoved into each flip, the harder it was for us to pass regression testing in the two-week period. Any issues with releases meant, yes, we could flip back, but this would only exacerbate the problem and build up more changes. With more changes in each release, it became harder and harder to triage issues as well. Our release cadence began to slow, and it became clear that we needed a new solution. The solution began with something small, a seed laid out in the form of an OKR for our development teams to be challenged by and, most importantly, to own for themselves.

00:10:18

Each team had to release just once in the next three months, independently from any other team. Some teams succeeded, some teams did not, but the important thing was that each team tried and found out for themselves where the friction in releasing independently existed for them. We really immersed ourselves as a department in this OKR and celebrated each and every independent release. But this isn't a story about OKRs. That's a tale for another time. We wanted to release smaller and smaller, faster and faster, but soon came to an issue that was common across much of our platform. Our automated testing was relatively immature. We didn't have our test suites running in an automated fashion that would give us the confidence that we weren't breaking production. We also had a multitude of different feature flagging solutions, making it very difficult not only to coordinate end-to-end integration testing of our larger deliverables, but also to release those in an independent manner.

00:11:37

We needed a solution that didn't involve putting the brakes on our business commitments and retrofitting automation everywhere. LaunchDarkly would be that solution for us. And we came up with a strategy to ensure that we weren't just going to add yet another feature flagging solution into the mix. We were going to ensure that this would be one feature flagging solution to rule them all. The first challenge we had to overcome was one many will have come across before: we had to deliver the integration and adoption of LaunchDarkly without impacting our current business commitments. In order to achieve this, our strategy was to spin up a standalone team, a fellowship if you like, that would not only be responsible for being our LaunchDarkly gurus, but would also be responsible for visiting each of our teams and migrating their existing feature flagging solutions over to LaunchDarkly, before pull-requesting the migration back into each team for them to own.

00:12:48

This worked really well for us and, as an unintentional benefit, helped us to discover a few of the pain points involved with external contributions to our various services. We continue to use these learnings as we try to drive a culture of inner sourcing. Before any pull requests could even begin to be thought about, our fellowship of LaunchDarkly gurus had to go on an epic journey themselves. They created wrapper libraries as an abstraction layer, ensured everything in LaunchDarkly was configured using Terraform, discovered and documented how we could organize our feature flags, from our naming conventions to parent-child flags, aligned and created our environment strategy in LaunchDarkly, and sorted out ACLs, resilience, flag hygiene, and maintenance. It was quite a journey, and one that we're just about to start reaping the benefits from. It's at this point in our story that I hand over to Tash to tell you a bit more about some of that journey, starting with the life of a flag.
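As an illustration of what a wrapper library acting as an abstraction layer might look like, here is a minimal Python sketch. It is not Vodafone's library: the FlagProvider shape, the in-memory stand-in, and the team-prefix naming convention are all assumptions made for the example.

```python
from typing import Any, Protocol


class FlagProvider(Protocol):
    """Whatever sits underneath: a vendor SDK, a config file, a test double."""

    def variation(self, key: str, context: dict, default: Any) -> Any: ...


class InMemoryProvider:
    """Stand-in provider so the sketch runs without any vendor SDK."""

    def __init__(self, values: dict):
        self._values = values

    def variation(self, key: str, context: dict, default: Any) -> Any:
        return self._values.get(key, default)


class FeatureFlags:
    """Thin wrapper so services never call the underlying provider directly."""

    def __init__(self, provider: FlagProvider, team: str):
        self._provider = provider
        self._team = team

    def is_enabled(self, name: str, context: dict, default: bool = False) -> bool:
        # Hypothetical naming convention: "<team>-<flag name>".
        key = f"{self._team}-{name}"
        try:
            return bool(self._provider.variation(key, context, default))
        except Exception:
            # Resilience: if the provider fails, fall back to the safe default.
            return default
```

A service would then ask `FeatureFlags(provider, team="checkout").is_enabled("open-banking", {"key": "user-1"})` and stay unaware of which flagging product answers the question, which is what makes a team-by-team migration possible.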

00:14:05

Hello everyone. My name is Natasha Wright and I work as part of an engineering team at Vodafone Digital looking after the LaunchDarkly platform, which we use as a feature flagging solution. Here at Vodafone, we adopt an everything-as-code and DevOps-centric approach in everything we do, and the LaunchDarkly platform is no exception. We are managing that platform using configuration as code at all layers of the stack. Whether that be the projects that we deploy to LaunchDarkly, the environments in LaunchDarkly, or even the flags themselves, we're trying to manage them with an everything-as-code approach. So I want to spend a bit of time today talking to you about that approach, how it's used, and how we structure it, and then talk a little bit about how we use the LaunchDarkly product here at Vodafone and how it integrates with our development practices and our operational tooling.

00:14:51

So first I'm going to talk about the configuration that we have for LaunchDarkly. As I mentioned, we're using an everything-as-code approach, and in this instance we're using Terraform to manage our LaunchDarkly platform. LaunchDarkly ships with a Terraform provider, so it seemed like the obvious choice for us to manage the product from the ground up using an everything-as-code approach. We're using Azure DevOps as our DevOps solution, so we use that for Git version control, CI, and CD. I'm just going to take you through our Terraform repository now. From a Terraform perspective, we manage our projects, our environments, user roles and permissions, API and access keys, and even the flags themselves using Terraform configuration. When we look at our projects, we're using Terraform maps to manage all of that information.

00:15:41

Our projects are logical divisions that we have here for development, and it means that we can assign environments to projects, whether those environments are test environments or development environments. Given that this is all in Terraform, we can continuously deploy this without impacting any of the work that's actually going on. So here, as you can see from my screen, we have a particular project called Vodafone Consumer, and then a number of environments underneath it. I'll just flip to my LaunchDarkly console now to show you what this looks like. Here on the side of my screen, we can see a number of projects that I've got set up, including the environments underneath each of these, and you can see these are all isolated entities within LaunchDarkly. So deploying a flag or turning a flag on and off in one environment in one project doesn't actually impact another one.
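The map-of-projects idea and the isolation between environments can be modelled with a small Python sketch. The real configuration lives in Terraform; the second project name and all environment names other than dev1 are invented here purely for illustration.

```python
# A map of projects to their environments, mirroring the shape of a Terraform map.
PROJECTS = {
    "vodafone-consumer": ["dev1", "test", "production"],
    "example-other-project": ["dev1", "production"],  # hypothetical project
}


class FlagStore:
    """Keeps flag state per (project, environment) so changes never leak across."""

    def __init__(self, projects: dict):
        self._state = {
            (project, env): {} for project, envs in projects.items() for env in envs
        }

    def set_flag(self, project: str, environment: str, flag_key: str, on: bool) -> None:
        # Raises KeyError for a project/environment pair that was never declared.
        self._state[(project, environment)][flag_key] = on

    def is_on(self, project: str, environment: str, flag_key: str) -> bool:
        return self._state[(project, environment)].get(flag_key, False)
```

Turning a flag on in vodafone-consumer/dev1 leaves the same flag key untouched in every other environment and project, which is the isolation shown in the console.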

00:16:29

We're also managing custom roles in LaunchDarkly, and we have a number of roles defined. LaunchDarkly ships with some roles by default. For our work here in our development and, more importantly, our release practices, we needed some more custom, fine-grained permissions, particularly to have a good separation, from a permissions perspective, of who can do what in production. So we actually created our own RBAC model in Terraform, which we define here in this repository. So we have different permissions for different roles. To take you through what this looks like in LaunchDarkly itself, I'm going to switch back to my LaunchDarkly screen. If I click on account settings and then roles, what we can see are the custom roles that we've created. So we have roles for our engineers and, more importantly, we have roles for our operational teams and our release managers.
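A role model of this kind boils down to a policy table plus a check. The Python sketch below is a simplified illustration, not the actual Terraform RBAC model: the role names follow the talk, but the environment names and the exact permissions granted to each role are assumptions.

```python
from enum import Enum


class Role(Enum):
    ENGINEER = "engineer"
    OPERATIONS = "operations"
    RELEASE_MANAGER = "release-manager"


# Hypothetical policy: environments in which each role may toggle flags.
# "*" stands for every environment.
TOGGLE_POLICY = {
    Role.ENGINEER: {"dev1", "test"},
    Role.OPERATIONS: {"production"},
    Role.RELEASE_MANAGER: {"*"},
}


def can_toggle(role: Role, environment: str) -> bool:
    """Least privilege: deny unless the policy explicitly allows it."""
    allowed = TOGGLE_POLICY.get(role, set())
    return "*" in allowed or environment in allowed
```

Keeping the policy as data makes it reviewable in a pull request like any other change.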

00:17:20

And these are the people that can flip the flags on and off for production, meaning that we do have that logical separation of least privilege between an engineer that's developing a new feature and our release manager or our business manager that wants to turn on a new feature in production for our consumers. Next I want to talk a little bit about the flags themselves. So we're actually managing all of our flags in Terraform. We've broken down our flags into separate Terraform configuration for each of our feature teams, which means that each feature team can build and develop its own flags for its own pipeline of work. I'm just going to click on one of these now and show you what this looks like. So we're actually using a map for all of the Terraform variables that we pass in.

00:18:05

Each of the feature flags that we create will have a unique key. We then look at the variation type. This basically dictates the type of flag that we're going to deploy, and whether it's going to have a true or false value. This is great when we have a block of code and we want to determine whether it should be executed, yes or no. But there are also other types of variation that we can leverage, and we are doing so here, in particular when we think about our front-end interfaces, where maybe we want to toggle the look and feel of a particular component or change the inputs going into it. So we actually have an example here of some flags which have a variation type of number. What this means is that when that flag is turned on or off, a different input is passed into it.
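A map of flag definitions keyed by unique flag key, each with a variation type, can be sketched in Python as follows. Only the open-banking key appears in the talk; the numeric flag, its values, and the simplification that the first variation is served when the flag is on and the second when it is off are assumptions for illustration.

```python
FLAGS = {
    "open-banking": {"variation_type": "boolean", "variations": [True, False]},
    # Hypothetical numeric flag feeding an input into a front-end component.
    "basket-max-items": {"variation_type": "number", "variations": [5, 10]},
}

PYTHON_TYPES = {"boolean": bool, "number": (int, float), "string": str}


def validate(flags: dict) -> None:
    """Check that every variation value matches its declared variation type."""
    for key, spec in flags.items():
        expected = PYTHON_TYPES[spec["variation_type"]]
        for value in spec["variations"]:
            # bool is a subclass of int in Python, so treat it separately.
            wrong_bool = isinstance(value, bool) != (expected is bool)
            if wrong_bool or not isinstance(value, expected):
                raise ValueError(f"{key}: {value!r} is not a {spec['variation_type']}")


def serve(flags: dict, key: str, on: bool):
    """Simplified evaluation: first variation when on, second when off."""
    on_value, off_value = flags[key]["variations"]
    return on_value if on else off_value
```

For a boolean flag the caller branches on the result; for a number flag the result is passed straight into the component as an input.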

00:18:51

One that's not true or false. So those are the roles, the flags themselves, and also our projects. But now, how do our applications actually pick up these flags for use? That's managed via SDK or API keys, and we actually handle all of those using Terraform as well. So when we run our pipelines to create new environments and projects in Terraform and deploy our flags, unique keys are generated, and these need to be injected into our applications. And herein we have a problem, because we have a piece of secret material which will allow our apps to access LaunchDarkly. So we use AWS solutions for all of these items of secret material. In this particular case, we're using AWS SSM Parameter Store to store these valuable pieces of secret material. And we actually use Terraform not only to grab those keys directly from LaunchDarkly, but also to persist them to AWS itself.
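The talk does this in Terraform, but the flow of taking each generated key and writing it to a predictable Parameter Store location can be sketched in Python. The path layout, the ParameterStore interface, and the in-memory stand-in are all hypothetical; with boto3 the write would roughly correspond to a call to the SSM client's put_parameter with Type="SecureString".

```python
from typing import Protocol


class ParameterStore(Protocol):
    def put(self, name: str, value: str, secure: bool) -> None: ...
    def get(self, name: str) -> str: ...


class InMemoryStore:
    """Stand-in for AWS SSM Parameter Store so the sketch runs offline."""

    def __init__(self):
        self._values = {}

    def put(self, name: str, value: str, secure: bool) -> None:
        self._values[name] = (value, secure)

    def get(self, name: str) -> str:
        return self._values[name][0]


def sdk_key_path(project: str, environment: str) -> str:
    # Hypothetical path convention that applications would read from.
    return f"/launchdarkly/{project}/{environment}/sdk-key"


def persist_sdk_keys(store: ParameterStore, keys: dict) -> list:
    """Write each (project, environment) key as a secure parameter."""
    paths = []
    for (project, environment), key in sorted(keys.items()):
        path = sdk_key_path(project, environment)
        store.put(path, key, secure=True)
        paths.append(path)
    return paths
```

Because the pipeline writes the value and the application reads it from a known path, no engineer needs to copy a key by hand.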

00:19:53

So we have a full loop here on the screen in front of you. Basically, what we do is grab every SDK or API key that LaunchDarkly generates, and we then persist that into AWS Parameter Store at a location that our applications can pick it up from. This is great because it means that none of our engineers ever actually have to manually handle secret material, and the pipelines will do it all for us. So that moves me quite nicely on to our pipelines. While we might have all of our code in Azure DevOps, this doesn't help us when we want to deploy it, because there are multiple Terraform phases that we need to look at here to deploy multiple different types of configuration. So thankfully we have an automated pipeline which we leverage to deploy all of these changes to our LaunchDarkly environment.

00:20:39

So in the first instance, we have a feature flags Terraform debug pipeline. In Azure, this is a pipeline that's run any time anybody makes a pull request or a merge into our master branch, so that we can determine that the changes that have been made are good changes. This is also a pipeline that can be invoked manually by our engineers when they want to test any changes that they're making. So what I'm going to do now is click run, run that pipeline, and show you what it does. In the first instance, the pipeline will grab all of our latest Terraform artifacts, bundle them up, and publish them to the pipeline itself. And then the second stage is what we call a debug stage. What this will actually do is execute a Terraform plan using those Terraform artifacts that have been downloaded from Git.

00:21:30

And then the pipeline itself will publish the plan files, so we can have that as a version-control-tracked artifact that comes out. So if we ever introduce any bad Terraform changes, we can look back through our plan files and understand exactly which debug run it was, or which version control change, or which commit ID, actually introduced that bad change. Given that I'm running this Terraform plan against our master branch, which is already deployed to our LaunchDarkly instance, I should see this run and tell me that no changes will be required. This job normally takes about 20 to 30 seconds to run, so it should be wrapping up any second now, and then we can actually see the output of running this job. Now, as I said, this is something that any of our engineers could run, and it also runs when we want to make a merge request into our master branch to ensure that the code changes are correct.
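One way a debug stage like this could decide whether a plan is clean is Terraform's -detailed-exitcode option, which makes terraform plan exit with 0 for no changes, 1 for an error, and 2 when changes are pending. The talk does not say the pipeline works this way; the Python sketch below is an assumption, and the function names and plan file name are illustrative.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_OUTCOMES = {0: "no-changes", 1: "error", 2: "changes-pending"}


def classify_plan(exit_code: int) -> str:
    """Map a plan exit code to an outcome, treating unknown codes as errors."""
    return PLAN_OUTCOMES.get(exit_code, "error")


def run_plan(workdir: str, plan_file: str = "tfplan") -> str:
    """Run a plan in `workdir` and save the plan file as a build artifact."""
    result = subprocess.run(
        ["terraform", "plan", "-input=false", "-detailed-exitcode", f"-out={plan_file}"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan(result.returncode)
```

Run against a branch that is already deployed, the expected outcome is "no-changes"; anything else on a pull request is a signal to read the published plan file before merging.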

00:22:24

So if I just click on Terraform plan here and go right down to the bottom, we can see it says: No changes. Infrastructure is up to date. Now if I go back to my pipelines, I can see that that's my debug pipeline, which is great, but I also have a DX feature flags Terraform pipeline, which is actually related to the code in my repository. This is the one that's actually going to deploy the code changes that I have there. So if I just click run pipeline now, this would deploy everything that's in my master branch. If I click run, this will kick off this pipeline. In terms of the stages, the first stage is very similar. It grabs all the latest artifacts and creates a build, which it then publishes. Now we have a slightly different second phase.

00:23:08

So it's a deploy LaunchDarkly phase rather than a plan. What the second phase actually does is connect to our Terraform backend for our particular LaunchDarkly instance. We have a state file, which we keep in an Amazon S3 bucket. What happens is our pipeline will connect to that and essentially do a comparison between what's in the S3 bucket and the Terraform code that we've got, and then execute the delta of anything that's required. What this will do is execute those Terraform files, and then it'll give us an output on screen to tell us any changes that are made. So this is great. We've now got two approaches. One, we give our developers the ability to make changes directly in the Terraform code for any new flags that they would like to create, and run a pre-pipeline so they can see any changes that might happen.
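The comparison between recorded state and the code, followed by applying only the delta, is the core idea behind this phase. The Python sketch below is a heavily simplified illustration of that idea with plain dictionaries; it is not how Terraform itself is implemented, and the resource names are made up.

```python
def compute_delta(current: dict, desired: dict) -> dict:
    """Compare recorded state with desired config and list what must change."""
    return {
        "create": sorted(k for k in desired if k not in current),
        "update": sorted(k for k in desired if k in current and desired[k] != current[k]),
        "delete": sorted(k for k in current if k not in desired),
    }


def apply_delta(current: dict, desired: dict) -> dict:
    """Return the new state after executing only the required changes."""
    delta = compute_delta(current, desired)
    new_state = dict(current)
    for key in delta["create"] + delta["update"]:
        new_state[key] = desired[key]
    for key in delta["delete"]:
        del new_state[key]
    return new_state
```

When state and code already agree, the delta is empty, which is the situation behind the "No changes" message in the plan output.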

00:24:00

Secondly, we have a deployment pipeline which allows us to push their changes in an automated way, so nobody has to make any manual interactions with our LaunchDarkly instance. And this is also great if we got a brand new LaunchDarkly instance: if we needed to create it from scratch, we've got all of our configuration as code, so we can simply execute it against that instance. So how do we actually use LaunchDarkly day to day from a development and operational perspective? We have a number of automated bots and apps that are deployed to Slack, which we use as a collaboration tool. In particular, we have a LaunchDarkly bot that's deployed to Slack, which allows our engineers and developers to manage not only LaunchDarkly, but the flags deployed to LaunchDarkly, through Slack. So if I go to a particular Slack channel here, I can execute slash LaunchDarkly.

00:24:52

And what that will do is give me a help message to direct me to the various commands and functionality that LaunchDarkly has through Slack. Our developers will often use this as a way to manage flags in the development environments in a programmatic way, to turn flags on and off. So if I wanted to have a look at a particular flag, I've just got a flag prepared, to save me having to type it. If I just put that here, here is a flag that I'm looking at. It's in one of our development environments. It has a particular name, this is the environment it relates to, and I can see that this flag is currently off. If I want to have a look at this flag in the LaunchDarkly console, I can just switch back to my LaunchDarkly console and flip here to feature flags, making sure that I've selected the right environment. So if I go to Vodafone Consumer and then dev one, and look for that same feature flag, open banking, I can see that it's currently tracking as off. Using the already deployed Slack bot, I can actually turn that flag on.

00:26:22

So if I click confirm, we can see that targeting for that flag has been changed to on. And if I now refresh my LaunchDarkly window, I can see that that flag is now on. This is great because it means that our developers can handle and manage flags all through their collaboration tools. And this is great because lots of our incident management tools are also plugged into Slack, meaning that we can acknowledge incidents and turn flags on and off, all using that single tool, which is both persistent and allows us some traceability over who's made what changes. To talk a little bit more about Slack and what we use it for, we also have a number of other integrations, so things like when flags are deployed across our environments. We've got an example here that shows that LaunchDarkly had a new flag created across all the environments.
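The traceability that comes from toggling through a single tool can be pictured as every toggle producing an audit entry. The Python sketch below is a toy model, not the Slack or LaunchDarkly integration itself; the class names, fields, and message format are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    user: str
    project: str
    environment: str
    flag_key: str
    on: bool
    at: datetime


class FlagToggler:
    """Changes flag state and records who changed what, where, and when."""

    def __init__(self):
        self.state = {}
        self.audit = []

    def toggle(self, user: str, project: str, environment: str, flag_key: str, on: bool) -> AuditEntry:
        self.state[(project, environment, flag_key)] = on
        entry = AuditEntry(user, project, environment, flag_key, on, datetime.now(timezone.utc))
        self.audit.append(entry)
        return entry


def describe(entry: AuditEntry) -> str:
    """Render an entry as a chat-style notification message."""
    action = "on" if entry.on else "off"
    return f"{entry.user} turned {entry.flag_key} {action} in {entry.project}/{entry.environment}"
```

Posting the rendered message to a shared channel gives a persistent record alongside incident activity.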

00:27:08

Similarly, we also have traceability of user changes that are made. We have a deployment alert here that shows me that a user actually turned a flag on in our production environment. This is great because we have this traceability and we can see when people are making changes. Additionally, we have other integrations which are set up with the LaunchDarkly platform, in particular with Datadog. We're using Datadog for all of our monitoring, data aggregation, custom metrics, and alerting, and we have this integration set up by default in the LaunchDarkly platform itself. So if I switch over to Datadog, we have a LaunchDarkly dashboard. If I click on that now,

00:28:00

I'll be taken to a dashboard which shows me a number of events that LaunchDarkly is automatically sending to Datadog. Now, this is using the default integration, and we've configured a custom policy to only send certain types of events. In particular, we have production events that we can see here, so we can see when people are changing flags on and off in production. Similarly, we can also see when people are changing flags on and off in development environments, and we can see the event here that I just generated by turning a flag on and off. I'd also like to draw your attention to some custom monitors that we have here. There are certain things about LaunchDarkly where we want to track when and why they're being changed, in particular if people are deleting projects or environments, or changing production values. So we have a number of monitors here, and this is set up in Datadog.
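The monitoring rule described here, alerting on deletions and on production changes, can be expressed as a small predicate. The Python sketch below is only an illustration: the event shape and field names are invented and do not reflect the actual LaunchDarkly or Datadog event formats.

```python
def is_alertable(event: dict) -> bool:
    """True for deletions of key resources and for any production flag change."""
    if event["action"] == "deleted" and event["resource"] in {"project", "environment", "flag"}:
        return True
    return event["action"] == "updated" and event.get("environment") == "production"


def monitor_status(events: list) -> str:
    """A monitor goes red as soon as one alertable event is seen."""
    return "red" if any(is_alertable(e) for e in events) else "green"
```

A development toggle stays green, while a production toggle or a deleted environment turns the monitor red.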

00:28:54

So it integrates with all of our existing operational processes and tooling and our incident management processes and tooling. This integration allows us to manage LaunchDarkly in the same way that we manage our applications and production environments. All of these alerts, again, are also piped through to Slack, where we have some further integration. We have a dedicated channel which we've set up, called LaunchDarkly notifications. What this tells us is any time one of these monitors changes from green to red, that is, it's gone from good to bad. So we can see when people are changing things in production, and we can see when flags are being deleted. This, again, allows us to build up a story of when we've got changes to our environment, all integrating with our existing development processes and all of our incident management processes. So with that being said, that's everything I really wanted to show you today about LaunchDarkly and how we're using it. I look forward to any questions. Thank you very much.