How William Hill uses PagerDuty and Rundeck to Deliver Full-Service Ownership in a Highly Regulated Environment (Europe 2021)

William Hill, one of the world’s leading betting and gaming companies, needs to satisfy both its customers and its regulatory requirements. In their quest for a negative mean time to repair, find out how they use PagerDuty and Rundeck together to empower internal teams to take action and automate fixes to common problems – all while maintaining impressive levels of uptime and keeping the compliance team happy. This session is presented by Rundeck by PagerDuty.

Matt Livermore

Principal Solutions Consultant, PagerDuty

Rob King

Automation Product Owner, William Hill

TRANSCRIPT

00:00:12

Hi folks, welcome to our session. My name is Matt Livermore and I am a principal solutions consultant here at PagerDuty. Today I'm joined by Rob King, who's the head of technical automation at William Hill. We're going to talk to you a bit about what William Hill have been doing. Before I do any more, I'm going to hand over to Rob. My first question is, can you help the audience understand who William Hill are and what the journey has been so far?

00:00:41

Hi. Yeah. Well, William Hill, as I'm hoping most of you know, is a gambling company. It started many years ago in London, has been growing internationally ever since, and recently expanded into the United States. That was so successful we've been bought out by Caesars, who are now taking the American part of the business and going to push sports betting and gambling in the US, while the rest of the world will continue on as William Hill, where we hopefully will give customers the best betting experience possible, and the safest one. In regards to where we are and what we're up to, the automation team are here at William Hill to just make everything we do better. Whether it's support staff or technical staff, we look to automate away the drudge, the toil, and make life better for our employees and ultimately our customers.

00:01:36

Okay. So let's touch on that automation piece then. In terms of the key workflows, what sort of things are we talking about there? What specifically are you working on?

00:01:46

Well, one of the biggest things we've been working on recently is mean time to repair. So making sure that our customers can get to our product as often as possible, and if we do have an issue or we take some maintenance, that the product is back with the customer as quickly as possible. So that's generally where we start to concentrate our time. But more recently, especially with work from home, we're looking at ways we can improve how our staff work, how efficient they can be in what they're doing in their day, and to try and make their job easier and quicker, which ultimately does knock on to improve customer satisfaction. But yeah, we mainly start with making sure our customers get the best product possible at all times.

00:02:33

Okay. Got it. So in terms of how you do that, obviously I'm expecting a big ecosystem of tools. Walk us through some of the tools you're using and how you've put those together to deliver these sorts of outcomes.

00:02:47

Well, I mean, it's a mainly Linux estate, with Windows underneath in part. From there we've got VMware, we've got AWS, all of the classic infrastructure elements. On top of that are overlaid the classic corporate systems. We've got Office 365, et cetera, all the corporate systems, time and attendance, those kinds of systems. And we run a system called OpenBet, which is a fairly ubiquitous gambling platform that helps run not just us, but our competitors too. So what we tend to get is customer complaints, customer issues, and issues from our monitoring platforms, whether that be New Relic (we had the CA suite before), and we've got Splunk looking at all of our logs, aggregating our logs. We're looking for patterns, anything that's out of the norm. We hope to spot it, and very much hope to spot it before the customer does, so we can get there early, but that's not always the case.

00:03:46

So you sometimes get things direct from the customer, whether that be via tweets or phone calls, emails, chats, et cetera. What we're looking to do is string together all of those things, whether it's the service tool, so we've got ServiceNow, and then you want to alert into Jira if you're wanting a dev team to look at it, or an operational team via ServiceNow, or, as it used to be, emails. And that's where we started to get the real value from PagerDuty, because we started to get that response automated and get the team we needed on those issues as quickly as possible. And what we're doing now is trying to string that first bit, of getting an engineer on a line with the right information, right through to actually actioning a fix. So that's where we are right now.

00:04:35

Okay. So then let's dig a bit more into what you're doing with PagerDuty. Where does PagerDuty fit into this?

Originally it was a way of getting the right people on a call during an issue. And it was a way we could also start to unify the information around a fault or an issue. So most teams' first experience of PagerDuty was uploading their own call rotas, and their escalation paths if their on-call rota failed. For most people, that was the first value they got out of PagerDuty. The person who was actioning the call just hit "call the DBA" or "call network", and PagerDuty would do the rest. So we didn't have to know exactly who it was or what their mobile number was. The really clunky spreadsheets of who to call when, that was all gone.

00:05:27

It was just: right, that is the escalation path for that team, that's their current rota, and if that rota fails, it follows the escalation path. So we knew how to get the right people very quickly. That's where we started with PagerDuty. Then as it grew, we realized that we didn't just buy it for that. We then started to build the service structure, so you could start to identify in more detail what things were, where it had gone wrong, what we should do next, and who should be called. And that service structure started to be built so you could align PagerDuty with, in our case right now, New Relic. And we would match up specific areas with specific PagerDuty responses. And that's when you really start to get into the fine art of improving your response time to incidents.
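As an illustration of the rota-then-escalation behaviour described here, the following is a minimal sketch rather than PagerDuty's actual implementation. The `EscalationLevel` type, the `escalate` function and the DBA names are all hypothetical, and acknowledgement is modelled as a simple callback.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class EscalationLevel:
    """One step on a team's escalation path (names are hypothetical)."""
    name: str
    on_call: List[str]  # whoever the uploaded rota says is on call right now


def escalate(levels: List[EscalationLevel],
             acknowledges: Callable[[str], bool]) -> Optional[str]:
    """Notify each level in order and return the first person who acknowledges.

    Returns None if nobody on the whole path responds.
    """
    for level in levels:
        for person in level.on_call:
            if acknowledges(person):
                return person
    return None


# Hypothetical DBA escalation path: primary rota, then secondary, then team lead.
dba_path = [
    EscalationLevel("primary rota", ["dba-primary"]),
    EscalationLevel("secondary rota", ["dba-secondary"]),
    EscalationLevel("team lead", ["dba-lead"]),
]

# The primary misses the page, so the call falls through to the secondary.
responder = escalate(dba_path, lambda person: person != "dba-primary")
```

The caller only names the team's path; who is actually on it comes from the rota, which is the point made above about no longer needing spreadsheets of mobile numbers.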

00:06:22

Okay. Good stuff. So obviously that's all around the initial part of an incident response process: getting the right people into a virtual room, as it's become over the last year or so, rather than a physical one, but that whole point of getting people together to work on an issue. So where did Rundeck come in? I mean, I believe you were using that product before PagerDuty acquired it. So a,

00:06:46

Yeah, so Rundeck was born out of technical staff wanting not to have to log on to every box to make a change, and to run things in a batch. If you wanted to check a port is open across an entire service, all the VMs within a service, you want to know whether that port is open or closed, or you make a change or configure a new path, you could do that in an instant using Rundeck. That's where we started from, just the techies starting to make improvements. And then for us, the eureka moment was shortly after we'd done a lot of work to improve our patching. Patching prior to Rundeck was very laborious. We'd ask each team to come and help with the apps, and then the central team would take them down.

00:07:39

We'd take the database down, patch everything, bring it back up, and bring the apps back up with the engineers. And the number of engineers that were involved to make it quick, so that we could get the product back to the customer, was far too high. With Rundeck, we've managed to reduce that by over 95%. So what we asked for was for the teams who ran these services to give us service wrapper scripts: start it up, bring it down, take one out of a load balancer, patch it, put it back in, and all of the scripts needed to test those things. So in and out of a load balancer, check services are still up, reboot, apply patches, put it back in the load balancer, check services are good, move on to the next. That kind of flow, all the way through. We put quite a lot of time into that.
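The rolling routine described above can be sketched roughly as follows. This is an assumption-laden outline, not William Hill's actual Rundeck job: `rolling_patch` and the wrapper parameters are hypothetical stand-ins for the service wrapper scripts each team supplies, and the exact ordering of patch and reboot is a guess.

```python
from typing import Callable, Iterable, List

NodeAction = Callable[[str], None]


def rolling_patch(nodes: Iterable[str],
                  remove_from_lb: NodeAction,
                  apply_patches: NodeAction,
                  reboot: NodeAction,
                  is_healthy: Callable[[str], bool],
                  add_to_lb: NodeAction) -> List[str]:
    """Patch one node at a time so the rest of the pool keeps serving traffic."""
    patched: List[str] = []
    for node in nodes:
        remove_from_lb(node)   # drain the node before touching it
        apply_patches(node)
        reboot(node)
        if not is_healthy(node):
            # Stop the run: the bad node stays out of the pool for a human to inspect.
            raise RuntimeError(f"{node} failed its post-patch check")
        add_to_lb(node)        # only healthy nodes go back in
        patched.append(node)
    return patched


# Stand-in wrappers that only record what would have happened.
log: List[str] = []
done = rolling_patch(
    ["web-1", "web-2"],
    remove_from_lb=lambda n: log.append(f"out {n}"),
    apply_patches=lambda n: log.append(f"patch {n}"),
    reboot=lambda n: log.append(f"reboot {n}"),
    is_healthy=lambda n: True,
    add_to_lb=lambda n: log.append(f"in {n}"),
)
```

Passing the wrappers in as callables mirrors the split described in the talk: the owning team writes the scripts, while a central team runs the sequence.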

00:08:27

And in the end, we got a routine improvement of over 95% in time taken to patch. But what these scripts also gave us was something else. I know that not all applications are perfect, and you do try the classic Windows fix on things of just giving it a restart. So these scripts mean we no longer need an engineer to restart them in the middle of the night. We don't even have the on-call lag, particularly. We still use PagerDuty to raise tickets, spot the issue and alert people, but those people can now run those scripts. It doesn't have to be the exact engineer from that team, who might only have a small on-call team. It can be done by a central team who are given access via Rundeck projects, and they can restart applications. And that's when it started to mesh together. That's the point where we started to look at PagerDuty custom incident actions, to actually allow an incident trigger to also trigger a fix. So, potentially, depending on how you want to run it, it doesn't even need a human anymore to do some of the classic fixes. We can try a restart of a service before calling an engineer.
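The "try a restart before calling an engineer" idea can be sketched as below. This is illustrative only: `handle_alert` and its callbacks are hypothetical, and in practice the restart would be a Rundeck job and the page a PagerDuty escalation.

```python
from typing import Callable, List


def handle_alert(service: str,
                 restart: Callable[[str], None],
                 is_healthy: Callable[[str], bool],
                 page_on_call: Callable[[str], None]) -> str:
    """Attempt the automated fix first; only page a human if it did not work."""
    restart(service)
    if is_healthy(service):
        return "auto-resolved"
    page_on_call(service)
    return "escalated"


# Stand-ins: the restart is a no-op here and the health check still fails,
# so the on-call engineer is paged.
paged: List[str] = []
outcome = handle_alert(
    "bet-slip",
    restart=lambda s: None,
    is_healthy=lambda s: False,
    page_on_call=paged.append,
)
```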

00:09:48

Cool. All right. Sounds good. On one hand it sounds great. On the other hand it sounds a little bit scary, in terms of what happens if the machine runs away. So how do you tie that back to other systems to look for an audit trail or anything like that?

00:09:59

The Rundeck scripts are all audited. I mean, they are logged. So that was one of the things we worked through with InfoSec. With the actual system, because PagerDuty's a SaaS, it's up there in the cloud, and with Rundeck there are different ways of doing it, but quite often it's quite core, because it can action things on servers that are potentially very critical to your business. So we put a lot of work into matching up, to make sure that the payload from PagerDuty was exactly what it should be before it would initiate a Rundeck action. We had that thoroughly checked by InfoSec, and they signed that off. We use a Lambda function. It was shortly after we developed that and put it into live that we did show PagerDuty. And that's when they started asking lots of questions about Rundeck, and then a few months later they bought them. So I still swear blind it was all down to me and my team. Just waiting on my commission check. I really
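The talk does not say how the Lambda function checks the payload, so the following is only one plausible shape for such a gate, under stated assumptions: a shared-secret HMAC signature, a payload with exactly two fields, and an allow-list mapping requests to job names. None of the names (`ALLOWED_JOBS`, `job_for_request`, the field names, the job identifier) come from PagerDuty's or Rundeck's APIs.

```python
import hashlib
import hmac
import json
from typing import Dict, Optional, Tuple

# Hypothetical allow-list: only these (service, action) pairs may start a job.
ALLOWED_JOBS: Dict[Tuple[str, str], str] = {
    ("payments-api", "restart"): "job-restart-payments",
}


def job_for_request(body: bytes, signature: str, secret: bytes) -> Optional[str]:
    """Return the job to run, or None if the request is not exactly as expected."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison, so the check does not leak how much matched.
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return None
    try:
        payload = json.loads(body)
    except ValueError:
        return None
    # Reject anything with missing or unexpected fields.
    if not isinstance(payload, dict) or set(payload) != {"service", "action"}:
        return None
    service, action = payload["service"], payload["action"]
    if not isinstance(service, str) or not isinstance(action, str):
        return None
    return ALLOWED_JOBS.get((service, action))


secret = b"demo-secret"  # assumption: a secret shared with the sender
body = json.dumps({"service": "payments-api", "action": "restart"}).encode()
signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
job = job_for_request(body, signature, secret)
```

Anything that fails a check returns `None` and nothing runs, which matches the stated aim that the payload be exactly what it should be before an action is initiated.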

00:11:03

Okay. So just on that part, again, you're talking about lots of different moving parts. How difficult has it been to go through this journey? What level of effort did you have to put in, in terms of bolting PagerDuty and Rundeck together?

00:11:17

The bolting them together? I think because it hadn't been done, to our knowledge, and we didn't really find anything, and PagerDuty and Rundeck at that time weren't together, it did take us quite a while. I mean, we were a small team, and we were learning certain aspects of our role, because it was quite a new role, enterprise-level automation. It was difficult. It did take a lot of work, but I have got some excellent engineers. Once we'd broken the back of it, though, it was mainly around the communication. We had issues that were really down to how we originally set up PagerDuty, which I can go into a little bit. But the Rundeck end of it, the Rundeck setup, is really quite simple. I mean, it is the scripts.

00:12:03

You can write them in many different languages. It doesn't dictate what you have to do. It gives you a lot of options on how you want to do things, which allows separate teams who own a service, the sort of more agile DevOps teams now, who own that product, to write them in the way they want to, while others can run them, which is very powerful. And it frees up others to do all sorts of tasks. I mean, we've got great savings from the compliance team using Rundeck jobs. So when we have to do PCI audits, instead of having technical people who really hate doing PCI audits, you can work with the compliance team and write the scripts that will do, based on their input, the checks they want. So quite often a PCI audit will ask for, "Can I have X number of this type of server audited, please, for evidence packs?"

00:12:59

Yep. And now we can say, right, well, there you are, compliance team, there's the Rundeck job. You hit go, you input whichever ones you want. You can pick them at random, which auditors prefer rather than being led by techies, which I always think they're a bit suspicious of. So the audit team or the auditors themselves, even external auditors if they're being chaperoned, can run these things themselves. It's saved a lot of time, and it just gets a lot of drudgery work off techies' backs. But yeah, that's what was difficult. Once it was done, though, and I'm sure it will be a lot easier going forward now that PagerDuty and Rundeck are together, it's been pretty solid. I know there are already improvements in place in the system. We currently use custom incident actions, which I think are still currently limited to three per service instance. But there are new systems coming along that we've got some view of, which allow a much wider scope of actions that can be carried out automatically by PagerDuty.
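The audit job described above, where the compliance team asks for X servers of a given type picked at random, might look something like this. It is a sketch under assumptions: `pick_audit_sample`, the inventory layout and the server names are all hypothetical, and a real job would pull the inventory from the node source rather than a dictionary.

```python
import random
from typing import Dict, List, Optional


def pick_audit_sample(inventory: Dict[str, List[str]],
                      server_type: str,
                      count: int,
                      seed: Optional[int] = None) -> List[str]:
    """Pick `count` distinct servers of one type at random for an evidence pack.

    Passing a seed makes the selection repeatable, so the same sample can be
    shown again later if the auditor asks how it was chosen.
    """
    candidates = sorted(inventory.get(server_type, []))
    if count > len(candidates):
        raise ValueError(
            f"asked for {count} {server_type!r} servers, only {len(candidates)} exist"
        )
    return random.Random(seed).sample(candidates, count)


# Hypothetical inventory keyed by server type.
inventory = {
    "database": ["db-01", "db-02", "db-03", "db-04", "db-05"],
    "web": ["web-01", "web-02"],
}
sample = pick_audit_sample(inventory, "database", 3, seed=2021)
```

Because the job, not an engineer, makes the selection, the auditor can run it unassisted and the choice is not steered by the people being audited.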

00:14:03

The new V3 webhooks.

00:14:05

Webhooks, yeah. Which I think can even, depending on the flavor of what the alert is, tailor what the options are, which is really interesting to us.

00:14:15

Runbooks, all that sort of cool stuff.

00:14:18

So we're still in that end-to-end build. We're still fairly basic. We're now building up and understanding how we bolt these things together in a better way.

00:14:29

Okay. So I want to drill a bit into the benefits. You've already touched on some of the productivity ones. Let's come back to things like mean time to acknowledge. I can remember talking with some of the team at William Hill, like Alan, for example, around sort of, "We want to get all this down to 20 minutes end to end." Where are you at now in terms of mean time to acknowledge and mean time to resolve? I know it's different for different types of issues, but give us a flavor of the benefits you've seen as a result of having PagerDuty and Rundeck integrated into the environment.

00:15:06

Well, the acknowledgements, I barely get to the alert before it's acknowledged. I mean, I get it on my mobile phone. That's one of the great things about PagerDuty: the stakeholders can have everything if they want to, or you can have just your area, and you can have it by several different ways of getting that communication. But the acknowledgement rates: I'll hear my text go, and before I've even read the alert, someone's acknowledging it and we're on our way. In regards to mean time to repair, we've had our best Grand National, and in the last two quarters we have had exceptional uptime and resolution times. So PagerDuty and Rundeck and all of our other tools, like New Relic, are really starting to prove a great deal of benefit to our customer services.

00:15:56

That's good to hear. In terms of lessons learned, if you were going to start again now, what would you do differently, do you think?

00:16:09

So if I start at the PagerDuty end, I think you really need to know what your service structure wants to be. What tends to happen is you link monitoring into PagerDuty, and quite often PagerDuty might then go into, depending on what your choice for service reporting is, another tool like ServiceNow, and then that goes on further. If you don't line things up well, you can set up PagerDuty with just one service and run a large number of products like that, but you just can't then automate off the back of it. It's not granular enough. So you need to be quite granular in your service structure. I would say you need to understand where you're going to do your reporting, because in the work we did, and the work the teams have done since PagerDuty came along, we did a lot of work to match it up with ServiceNow to help reporting.

00:17:10

So the PagerDuty information would go into a ServiceNow ticket to help with the reporting. But again, if they're not lined up, if you've got a trading application that's got a thousand different parts, and you've got one big lump of a service definition in ServiceNow, you just can't match them up easily. It just makes reporting a bit of a mess. And in terms of the Rundeck end of things, I would say the biggest thing you can do to make it work for you is to have those restart scripts as part of all deliverables, so that a team has to deliver restart scripts, management and maintenance scripts, and compliance scripts as part of delivering any product. That will, from the get-go, empower many people to run those tools, including your service teams. And that should, I would say, empower you and Rundeck to really improve your service deliverables around mean time to repair, et cetera.

00:18:14

Okay. So you're talking about both the service design within PagerDuty and also the jobs that you want people to have available out of the box, effectively, with Rundeck. Are you templating that stuff now to make it easier for teams?

00:18:30

We've started. We've had a few goes at this. It does depend. I mean, a lot of times in tech people always say, oh, it's going to be new and fancy, and it's going to work in a different way. And it rarely does. It is starting to become a deliverable, to ensure that you deliver X, Y, Z in a certain way, or within guardrails. I wouldn't say we have a strict template for Rundeck especially, because we don't want people to be forced to use certain things, but there are guardrails to what we're trying to achieve. So I think rather than a strict template on how you do it, it's more that it must do these things within these parameters, if that makes sense.

00:19:15

Yeah. Okay. Get that. I guess going on from there then, you've achieved so much really. As you say, best Grand National ever, a really great last couple of quarters. Touch wood, it's going to keep going the way it's been going. And obviously you're making inroads, with massive success in the US right now. What's next?

00:19:39

AIOps is something you hear a lot about.

00:19:44

What does that mean to William Hill?

00:19:48

Well, if you listen to our guru on capacity monitoring and integration of APIs, it means a negative mean time to repair. For us, it's collecting the data through systems like Splunk, which obviously aggregates your logs, and New Relic, which is our current monitoring tool, both the out-of-the-box monitors and the specific ones that you build for an application. And over time, from that information, it's learning and predicting, from many sources of information, what's going to happen next. So if you know there's a marketing campaign and there's going to be an uptake here, but you can see that the pattern is actually a bit low, that the pattern is already on the way to a problem, and you know through the marketing update that there is going to be, or should be, more load, you can get ahead of that and make the change and scale things up.

00:20:42

So what we're after is Rundeck and the tools like it to provide the muscle that we can attach to this artificial intelligence and machine learning pattern-matching brain, to try and remove the human effort, human reports and human predictions, and rely on what's gone on in the past to learn about the future and make that happen. But to me, intelligent automation is linking many tools together. You can be triggered by AI. You can be triggered by a straight-up monitor. But it's what you do and how you communicate automatically. Obviously PagerDuty handles faults, communication and escalations really well, and Rundeck provides that muscle, if you've got the scripts you require written by the teams, to give you an outcome. Hopefully it's the right one, if you've configured it right and your script runs correctly.

00:21:38

Yep. Totally agree. Well, it's been fascinating hearing what you've been doing at William Hill. Thank you very much for the time today. I found it really illuminating, and I look forward to seeing more and more successes at William Hill going forward. Thanks very much. Thanks a lot.