How William Hill uses PagerDuty and Rundeck to Deliver Full-Service Ownership in a Highly Regulated Environment (US 2021)

William Hill, one of the world’s leading betting and gaming companies, needs to satisfy both its customers and its regulatory requirements. In their quest for a negative mean time to repair, find out how they use PagerDuty and Rundeck together to empower internal teams to take action and automate fixes to common problems – all while maintaining impressive levels of uptime and keeping the compliance team happy. This session is presented by PagerDuty.

Rob King

Automation Product Owner, William Hill

Matt Livermore

Principal Solutions Consultant, PagerDuty

TRANSCRIPT

00:00:12

Hi folks, welcome to our session. My name is Matt Livermore and I am a principal solutions consultant here at PagerDuty. Today I'm joined by Rob King, who's the head of technical automation at William Hill. We're going to talk to you a bit about what William Hill have been doing. Before I do any more, I'm going to hand over to Rob. My first question is: can you help the customers understand who William Hill are and what the journey has been so far?

00:00:41

Hi. Yeah. Well, William Hill, as I'm hoping most of you know, is a gambling company. It started many years ago in London, has been growing internationally ever since, and recently expanded into the United States. That was so successful that we've been bought out by Caesars, who are now taking the American part of the business and are going to push sports betting and gambling in the US, while the rest of the world will continue on as William Hill, where we hopefully will give customers the best betting experience possible, and the safest one. In regards to where we are and what we're up to, the automation team are here at William Hill to improve everything we do. Whether it's for support staff or technical staff, we look to automate away the drudge, the toil, and make life better for our employees and ultimately our customers.

00:01:36

Okay. So let's touch on that automation piece then. In terms of the key workflows, what sort of things are we talking about there? What specifically are you working on?

00:01:46

Well, one of the biggest things we've been working on recently is mean time to repair: making sure that our customers can get to our product as often as possible, and that if we do have an issue or we take some maintenance, that product is back with the customer as quickly as possible. So that's generally where we start to concentrate our time. But more recently, especially with work from home, we're looking at ways we can improve how our staff work, how efficient they can be in what they're doing in their day, and to try and make their job easier and quicker, which ultimately does knock on to improve customer satisfaction. But yeah, we mainly start with making sure our customers get the best product possible at all times.

00:02:32

Okay. Got it. So in terms of how you do that, I'm expecting a big ecosystem of tools. Walk us through some of the tools you're using and how you've put those together to deliver these sorts of outcomes.

00:02:46

Well, it's mainly a Linux estate, with Windows underneath in part. From there we've got VMware, we've got AWS, all of the classic infrastructure elements. On top of that is overlaid classic corporate systems: we've got Office 365, et cetera, time and attendance, those kinds of systems. And we run a system called OpenBet, which is a fairly ubiquitous gambling platform that helps run not just us, but our competitors too. So what we tend to get is customer complaints, customer issues, and issues from our monitoring platforms, whether that be New Relic (we had the CA suite before), and we've got Splunk looking at all of our logs, aggregating our logs. We're looking for patterns, anything that's out of the norm. We very much hope to spot it before the customer does so we can get there early, but that's not always the case.

00:03:46

So you sometimes get things direct from the customer, whether that be via tweets, phone calls, emails, chats, et cetera. What we're looking to do is string together all of those things with the service tools. So we've got ServiceNow, and then you want to alert into Jira if you're wanting a dev team to look at it, or the operational team via ServiceNow. We used to use emails, and that's where we started to get the real value from PagerDuty, because we started to get that response automated and get the team we needed on those issues as quickly as possible. What we're doing now is trying to string that first bit, getting an engineer on the line with the right information, right through to actually actioning a fix. So that's where we are right now.

00:04:35

Okay. So then let's dig a bit more into what you're doing with PagerDuty. Where does PagerDuty fit into this? Originally it was a way of getting the right people on a call during an issue, and it was a way we could also start to unify the information around a fault or an issue. So most teams' first experience of PagerDuty was uploading their on-call rotas and their escalation paths in case that on-call rota failed. For most people that was the first value they got out of PagerDuty: the person who was actioning the call just hit call the DBA, call network, and PagerDuty would do the rest. So we didn't have to know exactly who it was or what their mobile number was. The really clunky spreadsheets of who to call when, that was all gone. It was just right.

00:05:28

That is the escalation path for that team. That's their current rota. If that rota fails, it follows the escalation path. So we knew how to get the right people very quickly. That's where we started with PagerDuty. Then as it grew, we realized that we didn't just buy it for that. We started to build the service structure, so you could start to identify in more detail what things were, where it's gone wrong, what we should do next, who should be called. So you could align PagerDuty with, in our case right now, New Relic, and we would match up specific areas with specific PagerDuty responses. And that's when you really start to get into the art of improving your response time to incidents.

00:06:21

Okay. Good stuff. So that's all around the initial part of an incident response process: getting the right people into a room, a virtual one as it's become over the last year or so rather than a physical one, but that whole point of getting people together to work on an issue. So where did Rundeck come in? I believe you were using that product before PagerDuty acquired it, so tell us a little bit about that.

00:06:46

Yeah, so Rundeck was born out of technical staff wanting not to have to log on to every box to make a change, and to run things in a batch. If you wanted to check a port is open across an entire service, all the VMs within a service, you wanted to know whether that port is open or closed, or to make a change or configure a new path, you could do that in an instant using Rundeck. That's where we started from, just the techies starting to make improvements. And then for us, the eureka moment was shortly after we'd done a lot of work to improve our patching. Patching prior to Rundeck was very laborious. We'd ask each team to come and help with the apps, and then the central team would take them down.

00:07:39

We'd take the database down, patch everything, bring it back up, and bring the apps back up with the engineers, and the number of engineers involved to make it quick, so that we could get product back to the customer, was far too high. With Rundeck we managed to reduce that by over 95%. What we asked for was for the teams who ran these services to give us service wrapper scripts: start it up, bring it down, take one out of a load balancer, patch it, put it back in, and all of the scripts needed to test those things. So take it out of the load balancer, check services are still up, reboot, apply patches, put it back in the load balancer, check services are good, move on to the next, and follow it all the way through. We put quite a lot of time into that.
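
The node-by-node patching sequence described above can be sketched roughly as follows. This is a minimal illustration in Python, not William Hill's actual Rundeck jobs: every function name here is hypothetical, the five callables stand in for the service wrapper scripts each team would supply, and the exact ordering of the steps and the choice to stop at the first unhealthy node are assumptions.

```python
from typing import Callable, Iterable, List


def rolling_patch(
    nodes: Iterable[str],
    remove_from_lb: Callable[[str], None],
    apply_patches: Callable[[str], None],
    reboot: Callable[[str], None],
    service_healthy: Callable[[str], bool],
    add_to_lb: Callable[[str], None],
) -> List[str]:
    """Patch nodes one at a time and return the nodes that were patched.

    Each callable is a placeholder for a team-supplied wrapper script.
    """
    patched: List[str] = []
    for node in nodes:
        remove_from_lb(node)   # stop sending customer traffic to this node
        apply_patches(node)
        reboot(node)
        if not service_healthy(node):
            # Leave the node out of the load balancer and stop the run,
            # so the remaining nodes keep serving traffic unpatched.
            raise RuntimeError(f"{node} failed its health check after patching")
        add_to_lb(node)        # only healthy nodes go back into rotation
        patched.append(node)
    return patched
```

Because only one node is out of the load balancer at any time, the rest of the service keeps serving customers while the run works through the list.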

00:08:26

And in the end we reached over 95% improvement in time taken to patch. But what the scripts also gave us was this: we know that not all applications are perfect, and you do try the classic Windows fix on things of just giving it a restart. These scripts mean we no longer need an engineer to restart them in the middle of the night. We don't even have the on-call lag, particularly. We still use PagerDuty to spot the issue and alert people, but those people can now run those scripts. It doesn't have to be the exact engineer from that team, who might only have a small on-call team. It can be done by a central team who are given access via Rundeck projects, and they can restart applications. That's when it started to mesh together, and that's the point where we started to look at PagerDuty custom incident actions to actually allow an incident trigger to also trigger a fix. So, depending on how you want to run it, it potentially doesn't even need a human anymore to do some of the classic fixes. We can try a restart of a service before calling an engineer.

00:09:47

Cool. All right. Sounds great on one hand, but it also sounds a little bit scary in terms of what happens if the machine runs away. So how do you tie that back to other systems for an audit trail or anything like that?

00:09:58

The Rundeck scripts are all audited. I mean, they are logged. So that was one of the things we worked through with InfoSec: the actual system. PagerDuty is SaaS, it's up there in the cloud, and Rundeck can be different, there are different ways of doing it, but quite often it's quite core, because it can action things on servers that are potentially very critical to your business. So we put a lot of work into matching up, to make sure that the payload from PagerDuty was exactly what it should be before it would initiate a Rundeck action. We had that thoroughly checked by InfoSec, and they signed that off. We use a Lambda function. It was shortly after we developed that and put it into live that we did show PagerDuty, and that's when they started asking lots of questions about Rundeck, and then a few months later they bought them. So I still swear blind it was all down to me and my team. I'm just waiting on my commission check.
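
The idea of checking that a payload is exactly what it should be before any Rundeck action is started could look something like the sketch below. The talk does not describe the real Lambda function, so everything here is an assumption made for illustration: the shared-secret HMAC check, the two-field payload shape, the `ALLOWED_ACTIONS` table, and the job identifiers are all hypothetical. The function only decides which job, if any, may run; it does not call PagerDuty or Rundeck.

```python
import hashlib
import hmac
import json
from typing import Optional

# Hypothetical allowlist: (service, action) pairs mapped to Rundeck job identifiers.
ALLOWED_ACTIONS = {
    ("checkout-api", "restart"): "job-restart-checkout-api",
}


def signature_is_valid(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Compare an HMAC-SHA256 of the raw body with the supplied hex signature."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


def resolve_job(secret: bytes, body: bytes, signature_hex: str) -> Optional[str]:
    """Return the job to run, or None if anything about the request is unexpected."""
    if not signature_is_valid(secret, body, signature_hex):
        return None
    try:
        payload = json.loads(body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return None
    # Require exactly the expected fields: nothing missing, nothing extra.
    if not isinstance(payload, dict) or set(payload) != {"service", "action"}:
        return None
    service, action = payload["service"], payload["action"]
    if not isinstance(service, str) or not isinstance(action, str):
        return None
    return ALLOWED_ACTIONS.get((service, action))
```

The design choice worth noting is that the caller never names a script or command directly. A request can only select from jobs that were agreed in advance, which is the kind of property a security review can sign off on.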

00:11:03

So just on that part, again, you're talking about lots of different moving parts in the house. How difficult has it been to go through this journey? What level of effort did you have to put in, in terms of PagerDuty, Rundeck, bolting them together?

00:11:16

The bolting them together: I think because it hadn't been done, to our knowledge, and we didn't really find anything, and PagerDuty and Rundeck at the time weren't together, it did take us quite a while. We were a small team and we were learning certain aspects of our role, because it was quite a new role, this sort of enterprise-level automation. It was difficult. It did take a lot of work, but I have got some excellent engineers. Once we'd broken the back of it, though, it was mainly around the communication. We had issues that were really down to how we originally set up PagerDuty, which I can go into a little bit. But the Rundeck end of it, the Rundeck setup is really quite simple. I mean, it is the scripts. You can write them in many different languages.

00:12:06

It doesn't dictate what you have to do. It gives you a lot of options on how you want to do things, which allows separate teams who own a service, the more agile DevOps teams that own that product, to write them in the way they want to, but others can run them, which is very powerful. And it frees up others to do all sorts of tasks. We've got great savings from the compliance team using Rundeck jobs. So when we have to do PCI audits, instead of having technical people, who really hate doing PCI audits, do them, you can work with the compliance team and write the scripts that will do, based on their input, the checks they want. Quite often a PCI audit will ask, can I have X number of this type of server audited, please, for evidence packs.
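
A job that picks "X number of this type of server" for an evidence pack could be as small as the sketch below. It is an illustration only: the inventory structure, the function name, and the server names are hypothetical. Random selection is used so that the choice of servers is not steered by the engineers being audited, and the optional seed is an assumption added so a given sample can be reproduced later if an auditor asks how it was drawn.

```python
import random
from typing import Dict, List, Optional


def pick_audit_sample(
    inventory: Dict[str, List[str]],
    server_type: str,
    count: int,
    seed: Optional[int] = None,
) -> List[str]:
    """Pick `count` distinct servers of one type at random for an audit."""
    candidates = sorted(inventory.get(server_type, []))
    if count < 0 or count > len(candidates):
        raise ValueError(
            f"cannot pick {count} {server_type!r} servers; "
            f"{len(candidates)} are available"
        )
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return rng.sample(candidates, count)


# Hypothetical inventory keyed by server type.
inventory = {
    "web": ["web01", "web02", "web03", "web04"],
    "db": ["db01", "db02"],
}
sample = pick_audit_sample(inventory, "web", 2, seed=2021)
```

Sorting the candidates before sampling means the same seed gives the same sample regardless of the order in which the inventory happened to be listed.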

00:12:59

Yep. And now we can say, right, you, our compliance team, there's the Rundeck job. You hit go, you input whichever ones you want. You can pick them at random, which auditors prefer rather than being led by techies, which I always think they're a bit suspicious of. So the audit team or the auditors themselves, even external auditors if they're being chaperoned, can run these things themselves, and it saves a lot of time and gets a lot of drudgery work off techies' backs. But yeah, it was difficult, but once it was done, and I'm sure it will be a lot easier going forward now that PagerDuty and Rundeck are together out there, it's been pretty solid. I know there are already improvements in place through the system. We currently use custom incident actions, which I think are still currently limited to three per service instance. But there are new systems coming along that we've got some view of, which allow a much wider scope of actions that can be carried out automatically by PagerDuty,

00:14:03

The new V3 webhooks. They give

00:14:05

you the webhooks, which I think can even, depending on the flavor of the alert, tailor what the options are, which is really interesting to us, via runbooks,

00:14:15

All that sort of cool stuff.

00:14:18

So while we've got that end to end, we're still fairly basic. We're now building up and understanding how we bolt these things together in a better way. Okay.

00:14:29

So given this, I want to drill a bit into the benefits. You've already touched on some of the productivity ones. Coming back to things like mean time to acknowledge, I can remember talking to some of the team at William Hill, like Alan for example, around wanting to get all this down to sort of 20 minutes end to end. Where are you at now in terms of mean time to acknowledge and mean time to resolve? I know it's different for different types of issues, but give us a flavor of the benefits you've seen as a result of having PagerDuty and Rundeck integrated into

00:15:05

the environment? Oh, the acknowledgements: I barely get to the alert before it's acknowledged. I mean, I get it on my mobile phone. That's one of the great things about PagerDuty: the stakeholders can have everything if they want to, or you can have just your area, and you can have it by several different ways of getting that communication. But the acknowledgement rates: I'll hear my text go, and before I've even read the alert, someone's acknowledged it and we're on our way. In regards to mean time to repair, we've had our best Grand National, and the last two quarters we have had exceptional uptime and resolution times. So PagerDuty and Rundeck and all of our other tools like New Relic are really starting to prove a great deal of benefit to our customer services.

00:15:55

Oh, fantastic. Good to hear. In terms of lessons learned, if you were going to start again now, what would you do differently, do you think?

00:16:09

So if I start at the PagerDuty end, I think you really need to know what you want your service structure to be. What tends to happen is you link monitoring into PagerDuty, and quite often PagerDuty might then go into, depending on your choice of service reporting, another tool like ServiceNow, and then that goes on further, so you need to line things up well. You can set up PagerDuty and have just one service, and you can run a large number of products like that, but you just can't then automate off the back of it. It's not granular enough. So you need to be quite granular in your service structure. I would say you need to understand where you're going to do your reporting, because in the work we did, and the work the teams have done since PagerDuty came along, we did a lot of work to match it up with ServiceNow to help reporting.

00:17:10

So the PagerDuty information would go into a ServiceNow ticket to help with the reporting. But again, if they're not lined up, if you've got a trading application that's got a thousand different parts and you've got one big lump of a service definition in ServiceNow, you just can't match them up easily. It just makes reporting a bit of a mess. In terms of the Rundeck end of things, I would say the biggest thing you can do to make it work for you is to have those restart scripts be part of all deliverables, so that a team has to deliver restart scripts, management and maintenance scripts, and compliance scripts as part of delivering any product. That will, from the get-go, empower many people to run those tools, including your service teams. And that should, I would say, empower you and Rundeck to really improve your service deliverables around mean time to repair, et cetera.

00:18:14

Okay. So you're talking about both the service design within PagerDuty, and also the jobs that you want people to have available out of the box, effectively, with Rundeck. Are you templating that stuff now to make it easier for teams?

00:18:31

We've started. We've had a few goes at this. It does depend. A lot of times in tech people always say, oh, it's going to be new and fancy, and it's going to work in a different way, and it rarely does. It is starting to become a deliverable, to ensure that you deliver X, Y, Z in a certain way, or within guardrails. I wouldn't say we have a strict template for Rundeck especially, because we don't want people to be forced to use certain things, but there are guardrails to what we're trying to achieve. So rather than a strict template on how you do it, it's more: it must do these things within these parameters. If that makes sense.

00:19:14

Yeah. Okay. Got that. I guess going on from there then, you've achieved so much really: as you said, best Grand National ever, really great last couple of quarters, touch wood it's going to keep going the way it's been going, and obviously making inroads, massive success in the US right now. What's next?

00:19:39

AIOps is something you hear a lot about.

00:19:44

What does that mean to William Hill?

00:19:48

Well, if you listen to our guru on capacity monitoring and integration of APIs, it means a negative mean time to repair. For us, it's collecting the data through systems like Splunk, which obviously aggregates your logs, and New Relic, which is our current monitoring tool, both the out-of-the-box monitors and the specific ones that you build for an application, and over time, from that information, learning and predicting from many sources what's going to happen next. So if you notice a marketing campaign and there's going to be an uptick, but you know that the pattern is actually a bit low, that pattern is already on the way to a problem. And you know through the marketing update that there is going to be, or should be, more load, so you can get ahead of that, make the change, and scale things up.

00:20:42

So what we're after is for Rundeck and the tools like it to provide the muscle that we can attach to this artificial intelligence and machine learning pattern-matching brain, to try and remove the human effort, human reports, human predictions, and rely on what's gone on in the past to learn about the future and make that happen. But to me, intelligent automation is linking many tools together. You can be triggered by AI, you can be triggered by a straight-up monitor, but it's what you do and how you communicate automatically that matters. PagerDuty obviously handles faults, communication, and escalations really well, and Rundeck provides that muscle, if you've got the scripts written by the teams that you require, to give you an outcome. Hopefully it's the right one, if you've configured it right and your script runs correctly. Okay.

00:21:38

Yep. Totally agree. Well, it's been fascinating hearing what you've been doing at William Hill, and thank you very much for the time today. I found it really illuminating and I look forward to seeing more and more successes at William Hill going forward. Thank you very much. Thanks a lot.