How Betway Tests in Production: Hypothesis-Driven Development

Betway has been following a “test in production” approach to building software for a few years. They test in production for two primary reasons: to validate business hypotheses and gain confidence in technical implementations.


In this session, Michael will share tips, experiences, and lessons learned from testing in production.


Some of the topics will include:


• A/B testing: how Betway implements A/B tests and gathers data to prove that a product change is adding value

• Trunk-based development: how Betway uses this technique to release incomplete, even broken, code into production so it can be continually tested and refined within the production environment

• Managing problems: how Betway resolves both technical and business problems that arise. Michael will also share how the company implemented key aspects of testing in production across several systems. Plus, he’ll share a framework they adopted to streamline the whole process.


This session is presented by LaunchDarkly.


Michael Gillett

Solutions Architect, Win Technologies

Transcript

00:00:08

Hi, I'm Michael. I'm a solutions architect at Betway in London, and I've been with the company for a number of years, working from the front end all the way through to the back end on a number of the different websites and services that we offer around the world. In this presentation, I'm going to take a look at the ways in which, over the last few years, we've really moved towards using testing in production as the way of doing releases, getting features out there, and doing more efficient testing. I'll also be sharing some of the tips and best practices that we've found over the past few years. So I'm going to share my screen and crack on.

00:00:45

So, just to set everyone's expectations: don't expect this to be a session where I give you a load of horror stories. I'll give you that disclaimer up front. What we've got today is about how you can avoid the things that make testing in production sound scary. There's nothing scary about testing in production, and sure, we've had a few things go wrong along the way, but towards the end of this I'm going to talk about the ideas and processes we've come up with to ensure that things don't go wrong, which is obviously what we all want. We want releases to go well. So the talk is going to be about how we do testing in production, but there's obviously that element of how we build things in a reliable way.

00:01:35

How are we making things that work, that do what we want them to do, and that do it in a way that won't give us headaches and problems as we're getting them out there? I'm going to take a step back from talking about software and look first at the wider picture, starting with craft. What I mean by this is how things used to be made hundreds and thousands of years ago. Everything was dealt with in a specific, bespoke way. Everything was handmade; it was a craft. Building pots, making anything, was done in this very particular way. Quality was often there, for sure, but it was a slow way of working. You couldn't scale it to the levels at which we often needed things, so it was quite inefficient, but it did often yield good-quality things. That then led, with the industrial revolution, to mass production, where suddenly you could produce things in vast quantities, far faster than humans could make them by hand.

00:02:45

And we had no end of things available to us. The problem, though, was that because everything was now pretty cheap to make and you could make it quickly, things were made without necessarily a need for them. Quality wasn't a concern either: it was cheap to make this stuff, so they just made it. If 90% of every run through the factory was good enough, then that was fine; you just made a bit more until you'd got what you needed. But obviously that was not a very efficient way of working, and so it was refined into what is perhaps best known as the Toyota way: this idea of lean mass production, where you don't want that element of waste. You want quality built into what you're doing. You don't want problems, you don't want quality issues. What you want is for everything that comes out of that factory to be good.

00:03:38

You want that quality baked in. And the way that was done was by understanding the pipeline: understanding the materials and the processes involved in producing these things. Through that understanding, constant refinements could be made and hypotheses could be created about better ways of working. The end goal was that you could validate whether a hypothesis was correct and, in doing so, bake that improvement in for the next run through the pipeline, or into the product you're shipping at the end of it, all with fairly low risk. What is produced results in very little waste throughout the process, and what rolls off the production floor at the end is of good quality.

00:04:27

Really, what that is, is employing the scientific method in the way things work. Moving on from craft, where everything is done in a bespoke way, suddenly you want to do things frequently, you want to make lots of the same thing again and again. You can do that, but you can't necessarily do it reliably. So then you bring in this idea of looking at the numbers, analyzing things, being able to test things, repeat things, and draw conclusions. That leads you to lean mass production and an extremely efficient way of working. Obviously we all benefit from that in our day-to-day lives; most of the stuff we actually use is made with that lean mass production technique. And perhaps the best example of the scientific method at work is NASA getting humans to the moon. They started the decade struggling to get anything into orbit and finished it putting humans on the moon and bringing them back. That wasn't necessarily lean mass production, but it was the scientific method applied to what they wanted to do.

00:05:33

They had an end goal, they had a hypothesis: can we get there? And everything they did in that time was about constant improvement and refinement: collecting data, analyzing that data, and making sure that the next mission was more successful than the last. That isn't to say that everything was successful. Sure, they had a lot of problems, and some very expensive rockets blew up along the course of getting to the moon. But the point is that they constantly refined, they constantly came up with ideas for improvements, they understood what it was they were dealing with, and through doing that they ensured they reached what they wanted to get to. And obviously we've got the modern-day equivalent with SpaceX. Only a couple of weeks ago they put humans into space, and here you can see them landing two rockets side by side at the same time, which had never been done before.

00:06:25

And again, just like NASA, SpaceX have had problems along the way. They've had rockets blow up, things go wrong, launches get scrubbed. But the point is that they're constantly learning from all of that and improving it for the next time. And NASA and SpaceX are effectively doing this stuff in production. Yes, they analyze things ahead of time, they run models, they check the data, they check that everything looks good, but ultimately when a rocket blows up while it's meant to be going into space, that is something going wrong in production for NASA or SpaceX. There are obviously lots of similarities to the stuff that we do, but rather than dwelling on SpaceX and rockets, however cool they are, let's bring it back to software.

00:07:17

How do we launch our stuff into production, not space? So what I'm going to be talking about, bearing all of that in mind, is hypothesis-driven development: the idea that we take our ideas, build things in accordance with those ideas, prove those ideas, implement them, and constantly refine our products and our workflows, in much the same way that a lean mass-production factory floor works, or that NASA and SpaceX get things into space. The one thing I question about this term, which is perhaps a niche term but is used within the industry, is the "development" part. Development maybe feels more like a craft than like lean mass production, whereas if we're employing the scientific method we're not really "developing" things, which might give the impression that every bit of software is unique and bespoke.

00:08:17

"No, you can't treat software in this standardized way; everything we do, every app we have, every system we have is a unique thing." I've certainly heard that. But maybe what we're really looking to do is bring in standards and processes that ensure quality throughout, even if there's an element of difference. So perhaps what we're really dealing with is hypothesis-driven engineering. But how do we do this at Betway? That's the point of the talk, right? That's what I've been doing for the last few years and what I'm going to be sharing with you. The first, and perhaps the most obvious, way of doing hypothesis-driven engineering is the business hypothesis, where we need to validate an idea that someone has had to improve our product. They understand the product, they understand the data, they understand what we're trying to do, and they've come up with something they think would make the product better.

00:09:10

So what do we call that? Well, that's probably an A/B test, right? A lot of you will have heard of the idea of an A/B test, where you've got version A of something and version B of something, and you want to run a test to see which one is better. Quite often we found we could just go down the route of implementing the new thing, but that doesn't mean it's better; it's just "I'm going to do a big bang release and it will be there." What we now do, and what I think is a better way of working, is to create a hypothesis first. Why should this thing be done? Why is it going to be better? What is the thinking behind it? Then you can split the traffic between your known state and the new state, and you can make sure everything in your environment is consistent.

00:09:51

And I don't just mean production; I mean the wider environment. You make sure everything is the same, and then you just run the test. You split your traffic equally between the two and see which one is better. Is it more clicks? Is it more conversions? Is it more revenue? Whatever that success metric is, you can understand and quantify it by logging everything of relevance to the experiment. Then you analyze that data, and when you do, you can see which one is better for you, and therefore you can validate the hypothesis. Was it correct or not? Has it done what the original idea was meant to do? You've done that in production, you've tested something, and it's no longer just an idea of thinking that it might be better or worse.
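To make that concrete, here is a minimal sketch, not Betway's actual tooling, of how such a hypothesis might be written down before any traffic is split, so the variants, the success metric, and the sample are agreed up front. All names and values are illustrative.

```typescript
// Illustrative only: pin the hypothesis, split, metric, and sample down in one place.
interface ExperimentDefinition {
  name: string;
  hypothesis: string;          // why the change should be better
  variants: string[];          // e.g. ["control", "single-join-button"]
  trafficSplit: number[];      // fraction of the enrolled sample per variant
  successMetric: string;       // the one number that decides the outcome
  sample: {                    // who is in the experiment at all
    countries: string[];
    languages: string[];
    percentage: number;        // share of eligible traffic enrolled
  };
}

const homepageTest: ExperimentDefinition = {
  name: "uk-homepage-single-cta",
  hypothesis:
    "A single, emphasised Join Now button will increase successful registrations.",
  variants: ["control", "single-join-button"],
  trafficSplit: [0.5, 0.5],
  successMetric: "successful_registration_rate",
  sample: { countries: ["GB"], languages: ["en"], percentage: 50 },
};
```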

00:10:40

You know whether it's better or worse. You've done it with your real use cases, your real users, in your real production environment, and you have categorically proven whether the hypothesis was true or not. I'm going to go through a scenario where we did exactly that, and it was the homepage. You'll notice I'm saying "was": this was the homepage in the United Kingdom. The crucial thing to note here is that we had multiple buttons all doing the same thing, and people wondered whether that was actually driving people to the right place. Those links all went to registration, but perhaps the page was so noisy that people were dropping off before they ever clicked any of those buttons. So a hypothesis was developed: let's have a single button. If we emphasize that single button, we should get more registrations.

00:11:33

And so we set about making this happen. The first thing we needed to do was design a new homepage. We wanted a slightly simpler one that could drive that message home a little better. The one you can see on the right is the new one, the hypothesis one. It has a single Join Now button rather than the multiple ones spread out on the left-hand side, and it still has a lot of what was in the original. Obviously, changing the homepage of a brand as big as Betway comes with a lot of vested interest from a lot of other parties: marketing, brand, and SEO, to name just a few. Plus, from a tech perspective, if we're doing a new homepage, is it an opportunity for a new app?

00:12:18

Do we want the performance to be the same? There's a lot to think about with this kind of thing, but the premise here is about that Join Now button, so let's just focus on that. So what did we do? Well, we built that new homepage, and what we ended up implementing was a redirect on the betway.com route, using LaunchDarkly to split the traffic. Most of the traffic goes to the old page, perfectly fine, but we can release changes to the new homepage and only let devs and QAs look at it. Is it working? Is the feature the dev is working on correct? Is everyone happy with it? Cool. Then we can roll it out to key members of the business and get some user acceptance testing happening.
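The talk doesn't show the implementation, but a minimal sketch of that kind of flag-driven redirect might look like the following, assuming an Express app and the LaunchDarkly Node server SDK. The flag key, header names, and URLs are illustrative, not Betway's actual values.

```typescript
import express from "express";
import * as LaunchDarkly from "launchdarkly-node-server-sdk";

const ldClient = LaunchDarkly.init(process.env.LD_SDK_KEY ?? "");
const app = express();

// The homepage route asks LaunchDarkly whether this visitor should see the
// new page. The targeting rules (devs/QA only, then internal users, then
// 5%/25%/50% of UK English-language traffic) live in the flag, so widening
// the rollout never needs a deployment.
app.get("/", async (req, res, next) => {
  const user = {
    key: req.header("x-device-id") ?? "anonymous",      // stable per-device key, however it is surfaced
    country: String(req.headers["cf-ipcountry"] ?? ""), // illustrative geo header
    custom: { language: req.acceptsLanguages()[0] ?? "unknown" },
  };

  const showNewHomepage = await ldClient.variation("new-homepage", user, false);

  if (showNewHomepage) {
    // The new homepage is a separately deployed app; URL is illustrative.
    return res.redirect(302, "https://new-homepage.example.com/");
  }
  return next(); // everyone else falls through to the existing homepage
});

app.listen(3000);
```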

00:13:02

Is everyone happy with the way it looks and works? Cool, awesome. If everyone's on board and all the boxes are ticked, we can now roll out to 5% of our real users. Again, there are still no deployments happening here. We've done our deployment; the new page is a separate app. We've still got our old app running as, effectively, the control, and now we've got our new B variant and we're just sending some traffic there. In this case, we sent 5% of all UK traffic whose browser is set to the English language. Cool, nice and safe. No big bang, no massive rollbacks, no panic. It's just a nice, small, incremental rollout. And if that's all looking good, the data is fine, and everyone's happy with it, then we roll it out a bit more.

00:13:48

We can roll it out to 25% of our UK, English-language cohort. If it continues to look good, and I'll come back to what "good" means in a minute, we can then roll out to 50% of our UK English speakers. And again, we don't have to do any releases for this to happen. At this point we're in full A/B test mode: we've got our old and new homepages, traffic is being split equally, and there are no other real variations involved now other than that design with the single, focused button. So this is what we're really testing: is this page better than what was there before? Well, what we found was that the new page actually resulted in a 25% increase in our successful registration rate. That is the hypothesis proven. Awesome, right? Really good.
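Comparing conversion between A and B relies on every registration being logged with the variation the visitor was in. A minimal, self-contained sketch of that logging, with illustrative names and a console stand-in for the real analytics pipeline, might look like this:

```typescript
// Illustrative only: tag each registration with the variation the device saw,
// so registration rates per variant can be compared afterwards.
interface ConversionEvent {
  experiment: string;
  variation: "control" | "single-join-button";
  deviceId: string;
  event: "registration_completed";
  timestamp: string;
}

function recordRegistration(
  deviceId: string,
  variation: ConversionEvent["variation"]
): void {
  const event: ConversionEvent = {
    experiment: "uk-homepage-single-cta",
    variation,
    deviceId,
    event: "registration_completed",
    timestamp: new Date().toISOString(),
  };
  // Wherever the analytics/logging pipeline lives; console.log stands in here.
  console.log(JSON.stringify(event));
}

// Registrations per variation divided by visitors per variation gives the
// registration rate for A and B; the talk reports a 25% uplift for B.
```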

00:14:42

No big bang, no scary stuff. And that leads into the technical accomplishments of the project. We had no rollbacks from a release point of view. We didn't have that crazy "we've gone live with everything, we need to bring it all back" moment; it just wasn't there. Sure, there were things that weren't quite right with the new homepage as we rolled it out and expanded the experiment. We may have found issues, but all we had to do was turn the toggle off and bring it back to our internal users. No real risk, no getting up at 2:00 AM to fix a problem. It's just easy. We didn't have any critical alerts either, because of the way we were able to roll this out slowly; nothing massively broke.

00:15:26

And again, if something had broken, we could just revert that toggle. What we found, because of those first two things and that ability to respond to issues in a calm way, was that the exceptions that were occurring actually decreased over time. You can see that in this graph, where each bar is a different day, each color is a different browser on a device, and the height is the number of exceptions we saw in that browser on those devices. You can see it decreases over those five or six weeks. What's really cool is that we could be very informed in the process: if we found an exception being thrown in a particular browser on a certain device, we could stop the new homepage being served there, but still keep it rolled out to everyone else who fell into that 5%, 25%, 50%, whatever it was.

00:16:22

We could also decide whether an exception warranted bringing back down the size of the audience we had rolled this out to. We can look at all of this data, because we need it to know how the experiment is ultimately performing, but suddenly it's also guiding who we're rolling this out to and how fast we're rolling it out. It's really powerful to be able to do this with real users, real devices, and real browsers. We're getting all of this information, which might just not have been visible to us in a testing environment before going live.
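The talk doesn't show how "turn it off for just this browser" was wired up, but the idea translates into something like the sketch below: if the flag evaluation context carries browser and device attributes, a single targeting rule can exclude the problematic combination while the percentage rollout keeps serving everyone else. The helpers and attribute names are illustrative.

```typescript
import type { Request } from "express";

// Very rough user-agent classification; real code would use a proper UA parser.
function browserFamily(ua: string): string {
  if (ua.includes("Firefox")) return "firefox";
  if (ua.includes("Edg")) return "edge";
  if (ua.includes("Chrome")) return "chrome";
  if (ua.includes("Safari")) return "safari";
  return "other";
}

function deviceType(ua: string): string {
  return /Mobi|Android|iPhone|iPad/.test(ua) ? "mobile" : "desktop";
}

// Build the flag evaluation context from the request. Because browser and
// device are attributes of the context, a dashboard rule such as
// "if browser is 'safari' and device is 'mobile', serve the old page"
// can be added with no code change or deployment.
export function buildFlagUser(req: Request, deviceId: string) {
  const ua = String(req.headers["user-agent"] ?? "");
  return {
    key: deviceId, // stable per-device key
    custom: {
      browser: browserFamily(ua),
      device: deviceType(ua),
    },
  };
}
```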

00:16:58

The second kind of hypothesis we can use here is a technical one. What we want to do in this scenario is validate an implementation. This is about rolling out perhaps a new feature, maybe a new authentication system or a performance improvement, which is potentially a very significant change. Again, you don't really want to do a big bang release with that kind of change. You want to put it out to a small subset of your users to validate that it's working as you expect. Is it bringing in the performance gains you wanted? So you can target just 5% of users. You could target just a country, like Canada. You could target device type, to make sure that the performance on mobile is as good as it is elsewhere, or as good as you wanted. But you can do more complicated things.

00:17:48

We've done things more complicated even than this, but it's a nice example: we can target a subdomain and users who were previously logged in, because there's a cookie set for that. Now we can start testing things in the returning-user journey. Are all the processes going on there all right? Can we improve the performance of the system there? Okay, cool, we've proven that it is actually faster, and we only had to target 5% of the users we were looking to improve things for. So there are some really powerful things you can do from a technical rollout perspective. We also found that technical testing opened up something we hadn't considered: wouldn't it be great if we could load test our production client-side applications without impacting our downstream systems?

00:18:36

Certainly for big sporting events our client apps have to work at really high levels of load, but if we're not actually close to that fixture just yet, we might want to test that the app holds up without impacting everything else. So what we started doing was implementing toggles within our applications that choose a mocking service rather than the production service. We can turn those toggles on for particular use cases: if a particular header is present, then this is a performance or load test, and the traffic is sent through to the mock service rather than the production service. And obviously you can target on different things as well. You could test browsers, platforms, devices, or particular networks to see that they all stand up as you expect. Testing how things behave on 2G is quite interesting, especially considering the size of some of the markets we operate in.
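A minimal sketch of that routing decision is below. The header name, toggle, and URLs are hypothetical, and the feature-flag lookup is stubbed with an environment variable so the sketch stays self-contained; in practice it would be a flag service such as LaunchDarkly.

```typescript
// Illustrative sketch: route load-test traffic to a mock downstream service.
const PRODUCTION_BETS_API = "https://bets.example.internal";
const MOCK_BETS_API = "https://bets-mock.example.internal";

interface IncomingRequest {
  headers: Record<string, string | undefined>;
}

// Stand-in for a feature-flag lookup deciding whether mocking is allowed at all.
function loadTestModeEnabled(): boolean {
  return process.env.LOAD_TEST_TOGGLE === "on";
}

export function downstreamBaseUrl(req: IncomingRequest): string {
  const isLoadTest =
    loadTestModeEnabled() && req.headers["x-load-test"] === "true";

  // Load-test traffic exercises the client app and its edge for real, but the
  // downstream systems only ever see the mock.
  return isLoadTest ? MOCK_BETS_API : PRODUCTION_BETS_API;
}
```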

00:19:33

So those are the different ways we test hypotheses by testing in production. Now I want to share some of the things we found along the way, the lessons we've learned. The first is that some of our teams have been able to adopt trunk-based development, and this is really interesting. Obviously we can still use gates, but rather than having lots and lots of feature branches, what we do is make full use of feature flags in our production environment. Developers can work in their local environment and push work that is contained within a feature toggle. They can push it to the production environment perfectly safely; it's not going to break production. They can turn that bit of work on for just themselves, even if it breaks for them.
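As a minimal sketch of what that looks like in code: the unfinished path is unreachable for everyone except whoever the flag targets, typically the developer working on it. The flag key, user key, and feature are illustrative, and the flag lookup is stubbed in-process rather than calling a real flag service.

```typescript
// Illustrative: an unfinished feature merged to trunk behind a toggle.
const enabledFor = new Set(["dev-michael"]); // hypothetical developer key

function isEnabled(flag: string, userKey: string): boolean {
  // Stand-in for a feature-flag service lookup (e.g. LaunchDarkly).
  return flag === "new-checkout-flow" && enabledFor.has(userKey);
}

export function renderCheckout(userKey: string): string {
  if (isEnabled("new-checkout-flow", userKey)) {
    // Half-built work in progress: safe to ship because only the targeted
    // developer ever reaches this branch, and any errors it throws land in
    // production logs where they can be diagnosed against real conditions.
    return renderNewCheckout();
  }
  return renderExistingCheckout(); // everyone else keeps the current behaviour
}

function renderNewCheckout(): string {
  return "<new checkout, not finished yet>";
}

function renderExistingCheckout(): string {
  return "<current checkout>";
}
```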

00:20:21

Maybe it's giving them some error logs they need in order to understand why that thing doesn't work in the production environment. It's a massive time-saver, though, because rather than building something locally, putting it in a test environment, having problems with the test environment, then getting it to live and finding another problem because the test environment and the live environment aren't quite the same, you can make sure the feature is being built on the environment it's going to run on, which is very efficient indeed. Next up, something we found very important was being able to track a device and not just a user's session, because what happens if you only track a session is that the user comes along once and sees version A of something, then sees version B in the next session.

00:21:10

Then, as they come back a few times, they flip between different versions of something. That's not a good customer experience, but it might even go further than that, in the sense that it might skew someone's impression of the experiment itself. They may not like the new feature purely because of the bad experience of how it was delivered to them. So being able to track a device is very important. This is certainly true of a logged-out user: obviously, once the user is logged in, you can use their username and consolidate the data accordingly, but when you don't necessarily know who the user is, you need to find a way to track that device as much as possible, and certainly that's not possible in every scenario. Being able to share a cookie, or something similar, across multiple systems and apps can be quite useful as well, especially if you're running a larger experiment that touches different applications rather than being defined within one. If you can keep track of the same user across the whole lot, they always get the same A or B variation, and it works really, really well.
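A minimal sketch of that idea, with an illustrative cookie name and hashing choice: a long-lived device cookie shared across apps, hashed deterministically so the same device always lands in the same variation regardless of session or which application evaluates it.

```typescript
import { createHash, randomUUID } from "crypto";

const DEVICE_COOKIE = "bw_device_id"; // hypothetical cookie name, shared across apps

// Read the long-lived device identifier from the cookies, or mint a new one.
export function deviceIdFrom(
  cookies: Record<string, string | undefined>
): string {
  return cookies[DEVICE_COOKIE] ?? randomUUID();
}

// Deterministically bucket a device into A or B: the same device id and
// experiment name always hash to the same variation, whichever app runs this.
export function variationFor(deviceId: string, experiment: string): "A" | "B" {
  const digest = createHash("sha256")
    .update(`${experiment}:${deviceId}`)
    .digest();
  return digest[0] % 2 === 0 ? "A" : "B";
}
```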

00:22:16

Now, this next one is a term I've come up with. It's probably wrong, but I think it works for the issue I'm trying to raise, which is avoiding scheduling complacency. What I mean by this is that when lots of teams start testing in production and make use of feature flags to deliver things safely, reliably, and with better quality, it might mean people don't feel they need to work together so closely on delivering something around the same time, because the assumption becomes, "Oh well, the other team can turn that on whenever we're done; we don't need to race to it, so we'll get to it a bit later." But if that other team is still busy building their feature while ours is done and sitting unused behind a flag, that ends up as wastage in terms of the resources people have available to them.

00:23:10

Features end up being done much sooner than they need to be, which in some ways is great, but it can lead to real wastage in terms of time management. So it's worth bearing in mind that just because you can do it, it doesn't mean you should. And this next one is a really cool one. A lot of what I've been discussing has been feature toggles as short-lived things used to prove a hypothesis, a test, or an experiment; once that's proven, you do some work to tidy your code up, and now you've just got that better feature in your pipeline, your app, whatever it is. But what if you had a toggle that could live for longer? We introduced this idea of a debugging mode. Some of our client applications actually ship with both a minified and an unminified version of the JavaScript.

00:23:57

Now, all users by default get the minified version of the script, but if needed we can turn on the unminified version, maybe for devs and QAs, or even in the case where a particular user is having an issue. It could be that the call center enables debugging for that user, and suddenly the dev team is getting all of these logs coming in for that one particular user who's hitting a weird edge-case problem. It's really powerful to be able to do that, and it's not really something we had considered when we adopted testing in production, but it's a nice by-product of it, for sure.
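A minimal sketch of that long-lived debugging toggle, assuming an Express app serving the bundle: the flag check is stubbed, and the device header, device keys, and file paths are illustrative rather than Betway's actual setup.

```typescript
import express from "express";

const app = express();

// Stand-in for the long-lived "debug mode" flag; in practice this would be a
// feature-flag lookup keyed on the device or account (for example, set by the
// call centre for one customer who is hitting an edge case).
function debugModeEnabled(deviceId: string | undefined): boolean {
  const debugDevices = new Set(["qa-laptop-1", "customer-12345-device"]); // hypothetical
  return deviceId !== undefined && debugDevices.has(deviceId);
}

app.get("/app.js", (req, res) => {
  const deviceId = req.header("x-device-id"); // however the device is identified
  if (debugModeEnabled(deviceId)) {
    // Unminified bundle with verbose logging left in: readable stack traces
    // and extra logs for just this one device.
    return res.sendFile("/srv/static/app.debug.js");
  }
  return res.sendFile("/srv/static/app.min.js"); // everyone else: minified bundle
});

app.listen(3000);
```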

00:24:49

Moving away from the technical side a little bit, this next point is much more about making sure that working with testing in production and hypothesis-driven engineering is effective. What we found early on was that not everyone was in agreement about what success looked like. What I mean by that is that there were a lot of assumptions and expectations from stakeholders, from the tech side, and from product. It meant that as a test was going on, as the hypothesis was being proven, or even as the conclusion was being drawn at the end, people disagreed about what success, or failure, really looked like for something. And if you get to the end of the hypothesis and that's where the disagreements surface, that's a very ineffective way of having run the test, because you might need to rerun it, which is not ideal. Leading on from that, there's also the problem of really understanding what the sample is for the hypothesis. When people say, "Oh, just target 50% of our users," what do you mean by that?

00:25:35

Is it a particular device type? Is it all devices? Do you want 50% across the board? Is it particular countries, sub-brands, network types? Who knows? There are so many things there, and again, people come with their own preconceived ideas about what they think the sample should be. Unless you have those discussions to draw that information out, people just don't state those assumptions, and then at the end the conclusions are in question, because the sample isn't what other people thought it was going to be. That's a real problem. So, wrapping it up: wherever possible, I think you should be looking to use hypothesis-driven engineering. It's definitely not possible everywhere; we've certainly encountered places where it doesn't lend itself to being an effective way of working. But I hope the examples I've given give you a sense of the power with which you can do things in your production environment in a very safe way, gather information, and understand where improvements can and should be made.

00:26:42

And that yields this idea of a flow of features going out all the time in a much better, higher-quality way, which is very similar to lean mass production. We're moving away from this idea of software being a case of "it feels right, it looks right." No, we're actually using the data, the logs, all of that information we've got within our software and our applications to really prove whether something is better or not, and then feeding that into the way we work, into our applications, into our processes, to continually improve the end product we're all building. So I do think hypothesis-driven engineering is a really, really effective way of working. Thank you very much for listening. I'll be answering Q&A in the chat, and I'm happy for any feedback and comments. Thank you very much.