How Sky Betting & Gaming is Driving Real-Time Operations Transformation

For many organizations, DevOps transformation is now a business imperative as it drives widely understood advantages in innovation, agility, and empowerment. However, many organizations struggle to implement and realize the true benefits of DevOps transformation due to challenges with culture, processes, and tooling. In fact, as many as 78% of organizations fail to get DevOps right.

Join Sky Betting & Gaming, PagerDuty, and New Relic as they discuss Sky Betting & Gaming's transformation story and their lessons learned over the past decade.


This session presented by PagerDuty.

Amit Juyal

Sr Service Lifecycle Manager, Sky Betting and Gaming

Steven Wheeldon

Service Operations Manager, Sky Betting and Gaming

Neil MacGowan

Technology Evangelist, New Relic

Rachel Obstler

VP Product, PagerDuty

Transcript

00:00:08

Thank you for joining us today on our webinar, How Sky Betting & Gaming is Driving Real-Time Operations Transformation. We've got a great presentation on what it looks like to become more operationally mature in your organization, with specific input from Sky Betting and Gaming on how they view maturity and what it takes to get there. Before we start, some quick housekeeping: today's presentation is being recorded and will be shared with all attendees, and we will have a Q&A at the end of the presentation. Please enter your questions in the chat box, and if we can't get to them during the presentation, we'll reach out afterwards. Okay. So our presenters today are Steven Wheeldon, Service Operations Manager at Sky Betting and Gaming; Amit Juyal, Senior Service Lifecycle Manager at Sky Betting and Gaming; Rachel Obstler, VP of Product at PagerDuty; and Neil MacGowan, Technology Evangelist at New Relic. So without further ado, I'll go ahead and hand it over to you, Rachel.

00:01:18

Thanks, Corinne. So today what we're going to talk about is first an overview of how digital is changing operations. Then I'll hand it over to Sky Betting and Gaming, who will talk about their DevOps and operational transformation and the key learnings they had going through that process. Then Neil and I will spend a little bit of time talking about tools that can accelerate your operational maturity: PagerDuty and New Relic. And lastly, as Corinne said, we'll take Q&A at the end. So if we move to the next slide, I'm going to talk a bit about how digital is really changing the way that you need to operate. And on the next slide, that leads to four key macro trends that we're seeing. The first one is that in the digital world, and probably in the last 10 years, pretty much every company that's been around longer than 10 years is going through a digital transformation.

00:02:27

And then there's all the companies that have been born in the cloud, who are those digital disruptors. But for all of them, it's clear that there's a real importance to real-time action: being able to respond to anything that happens in real time. Your customers, and that could be your external customers or internal employees, have grown to expect 24/7 capabilities and always-on access to whatever they want. And in fact, there are a lot of metrics out there; one of them is that 81% of users will wait less than a minute before abandoning an application. So they're just saying, this isn't working, it's taking too long, and I'm out of here. And so what that also means is that companies have been transforming the way that they operate and moving to more of a full-service-ownership model.

00:03:27

So where someone is coding and owning their software. And what that does is it improves agility: it means that the person who is operating something was the last one to touch it, knows what the problem could be, and can more quickly and easily fix it. And then this moves to the developers really being the architects of the whole digital experience. Another trend that's going on is the rise in operational complexity. There's a lot of infrastructure changing, and that's the other trend: you have migration to the cloud, you have monoliths changing to microservices, you have growth in services like Lambda where you're leveraging and building on cloud services. And this is also leading to a huge rise in operational complexity. A bit of data around that: the monthly events per responder that we've seen, just in PagerDuty data across all of our customers, has increased three times in the past three years.

00:04:42

So everyone who is managing this infrastructure is dealing with more noise, more information, and an increasingly complex environment. If we move on to the next slide: what that's leading to is that we're seeing a lot of leading companies really re-imagining the way that they're doing operations. It means that the old reactive method of doing queued work, or waiting to find out that there's an issue from a customer, doesn't work; you need to be able to operate in real time. It's less of a queued system and more of a swarm: when something goes wrong, you bring in all the people that you need quickly, so it's a lot more collaborative in that way. It's putting monitoring in place so you can be proactive in figuring out that there is a problem. It's investing in automation so that you can instantly route the data that you need to the right place, and so that you can spin up a response very quickly, rather than going down a call list, for instance, and trying to find someone to help you by calling them one by one.

00:06:03

And it's moving from a command-and-control type of method towards individual responsibility and democratization of data, to make sure that the people on the front line can respond quickly. And then lastly, it's going from static or rules-based approaches to a system that can learn, be intelligent, get smarter, and help you operate better all the time. Moving on to the next slide: what PagerDuty has done, working with a lot of our customers, is look at what makes a company successful, and we built a real-time operations maturity model to help companies understand where they are and how they can improve. This maturity model is pictured here at a high level. What it does is look at companies that operate in more of a reactive mode, all the way through to companies that are operating in a preventative mode.

00:07:14

And some examples of that are: when you're reactive, you're waiting until a customer files a ticket, and that's how you find out that you have a problem. Whereas when you move to responsive and proactive, you're getting timely information in automatically, you're using lots of monitoring capabilities, and you're automatically routing that information to the right person who can act on it in a matter of minutes, if not seconds. Some other examples of what maturity looks like: you may have knowledge in silos, a lot of people who know things that no one else knows, so that when they're out, you have a problem resolving things. Whereas you have a much better method of sharing information when you're getting to be more proactive and preventative. And then in the preventative world, you're using techniques like machine learning and predictive capabilities to get ahead of issues before they even happen.

00:08:21

So what does that mean in terms of how you end up operating? What we did at PagerDuty is we actually ran a survey: we asked a lot of companies about their operational practices, and we also asked them how they perform. And what we found is that the more mature organizations significantly outperformed their lower-maturity peers along several key metrics. Some of those metrics are: more mature companies acknowledge incidents on average seven minutes faster, they're able to mobilize around incidents 11 minutes faster, and they're able to resolve incidents, putting all those numbers together, two hours faster. And what that leads to: there was an average of seven major incidents per month across these companies, with major typically meaning customer-impacting. So on average, the more mature organizations had 14 hours less downtime each month than their less mature counterparts. So this has a real and very large impact on your customers when you can move up and operate in a more mature way. With that, I'd like to hand it over to Amit at Sky Betting and Gaming, who's going to talk about what they did to mature the organization and get a lot more effective at responding to issues. So, over to you.

00:10:07

Thanks, Rachel. So just a bit of background about us as a company. We are a leading betting and gaming company in the UK, purely online based, and we are basically aiming to be the UK's leading digital business. We do this with the help of 1,400-plus colleagues, and we aim to develop some of the country's biggest brands in the online betting and gaming industry. Just a bit about the product portfolio: it consists of products in the Sportsbook, Gaming, and free-to-play areas, and that's just a sample of the products that we develop and manage in-house. One of the key reasons for our success is how we do things, which is what we term the SBG ways of working. To highlight some key ones: we are customer obsessed. So everything that we develop, everything that we have facing the customers, is always built with the key aim of what is best for the customer in terms of experience, features, and promotions.

00:11:19

A few more examples: we are game changers. The key message here is basically that we don't shy away from experimenting. We make sure that we take risks, with the right level of planning and the right level of research. We learn and adapt, which is quite key for any company to be successful. We won't say that we're perfect all the time, but we make sure we take the learnings, adapt very quickly, and move on from there. And the last one I would like to highlight is that we are all one team. That is one of the biggest strengths that we hold: we don't shy away from going and speaking to people within the organization, taking learnings, adapting them, and sharing ideas openly. Amazing colleagues doing amazing things is what inspires people in our business. Moving on to the technical operations journey from here onwards.

00:12:24

The world of DevOps first got introduced to us as a company in 2011. We had a centralized team looking to support multiple products and functionalities going live to customers almost every week. Alongside this came our strategy to build new products in-house. Key events like the Grand National, the Cheltenham Festival, and big football events brought new challenges for the service and operations teams every year. Year on year we grew as a company, and in 2013 we adopted a big change, which is what we call the tribe structure, inspired by the model that Spotify introduced to the industry. This evolved over a period of time and resulted in Sky Betting and Gaming being divided into autonomous teams, first mainly at the product level and then at the squad level. The key change here was that DevOps, previously a centralized team,

00:13:27

now merged into three main areas. As you'll see through the presentation, you had a DevOps role that sat within the squad, named squad ops, and you had reliability and platform engineering teams that would sit within the tribes to help drive all the evolution of the DevOps and operational areas. The aim of these roles was to help support and optimize the reliability and delivery of products and features to customers. This fast growth brought its own challenges to the technical operations team. Steven Wheeldon, our Service Operations Manager, will now take you through some of the key challenges we experienced and how PagerDuty as a tool helped us through this journey to overcome them.

00:14:20

Great, thanks. So this is just a quick overview of where we are right now with PagerDuty. In this particular instance, it's taken us around two and a half years to get to this point. And while that may sound a little daunting to those of you considering similar transformations, you'll see in this deck that some of the most meaningful changes and improvements occurred within a couple of months of integrating. In the beginning, we had a fairly unpolished, unloved, and typical monitoring setup with alarms for all our hosts. Often they were riddled with useless default functions, false positives, and redundancy; monitoring was a real black box pre-2016. Alongside this sat a traditionally super-inefficient escalation process: random contact numbers for random people in random places within our knowledge base, with the information spread so sporadically that incorrect call-outs, human errors, and delays tainted our whole escalation process. The scenario is this: you sit staring at your screen for hours. Then another unknown alarm goes critical. You panic. You start trying to match keywords in the alarm to the scraps of information you've got in the knowledge base. You think you've got it. You pull up the page of numbers, you find your guy. You call as quickly as your trembling fingers will dial, but now you've called the CTO.

00:15:35

You give them a fake name and hang up abruptly. The panic intensifies. Back to the knowledge base, the next number. Could this be the one? You call. You call again. No answer. You are alone. To try and calm the situation, we began to track every alarm we came across and record what action was taken when it occurred, so that the next time we might be better informed. More manual work, more pain. Just to add to our sorrows, many teams would have a single shared on-call phone. It was always down to the engineers to make sure the phone was a) handed over and b) charged, or even worked. That was essentially how on-call was managed. As you can imagine, this caused a lot of problems. Something had to change. In case those words didn't quite hit home, we put this together just as a little image of what I'm trying to say.

00:16:33

In the summer of 2016, we made our first step toward introducing PagerDuty, in a minimal but albeit effective capacity. We did away with manual on-call rosters, opting instead to build the rosters in PagerDuty itself. If an alarm fired, we would still need to figure out the service impacted based on the alarm. But once we'd determined this and figured out which team was responsible, we would just need to select them from a dropdown, type a quick message into PagerDuty, and let it do all the work. It would notify engineers in the way they preferred, whether email, push notifications, or calls, and if they didn't answer, it would automatically escalate to the next in line as per their escalation policies, which is what you're looking at there. This removed the manual process of entering phone numbers, searching for specific on-call rosters, and deliberating over how many missed calls is too many calls. It also killed the on-call phone practice shown in the last slide.
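
The escalation behaviour Steven describes, notifying the primary on-call and automatically escalating to the next in line if nobody acknowledges, can be sketched roughly as follows. This is a toy illustration only, not PagerDuty's implementation; all responder names are invented, and a real system would notify each level and wait out an acknowledgement timeout rather than checking a set.

```python
def page(escalation_policy, acknowledged_by):
    """Walk the escalation policy until someone acknowledges the page.

    escalation_policy: ordered list of responders, primary on-call first.
    acknowledged_by: the set of responders who answer (simulating reality;
    a real system notifies each level and waits out an ack timeout).
    Returns the responder who acknowledged, or None if nobody did.
    """
    for responder in escalation_policy:
        # Notify this level; if they acknowledge, stop escalating.
        if responder in acknowledged_by:
            return responder
    # Policy exhausted without an acknowledgement.
    return None

policy = ["primary-oncall", "secondary-oncall", "team-lead"]
print(page(policy, acknowledged_by={"secondary-oncall"}))  # secondary-oncall
```

The key design point mirrored here is that the escalation order lives in one place, so nobody has to deliberate over how many missed calls is too many.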

00:17:31

It's fairly basic stuff. But even in this first basic sense, PagerDuty also granted us an invaluable break-glass protocol through its response plays. If SBG services ever take a real hit, then we have the ability to have every on-call resource, which is well over 50 individuals, online within a matter of minutes, with only one call actually being made. This protocol can also be used on a more granular level too. Should one product go down, for example, we could page out all the resources for that product through a tailored escalation policy or response play. When your products have really hit the fan, being able to immediately reach out to everyone who may be able to help is a real lifesaver, and it's gone a long way towards boosting our mean time to recovery.

00:18:17

With the initial success of our PagerDuty integration, the next step was to take automation to the next level. By early 2017, we had tied all of our services to the responsible squads, removing once and for all any doubts as to who should be contacted when a service has been impacted. All we need to do now is search for whatever it is that's broken in PagerDuty and let it do all the work for us. PagerDuty also helps us to manage the ownership of services. As development responsibilities change, some services are left behind in the process with no squad willing to take responsibility; PagerDuty provides an irrefutable catalog as to who will be contacted should an incident arise, and it can be amended very easily when you need to amend it. At this stage in the course of PagerDuty's implementation with our services, we recorded a decline in our mean time to recovery of approximately 20% by midway through 2017. This will certainly have been helped considerably by PagerDuty removing all the manual, time-wasting, trial-and-error processes of the past. Our mean time to acknowledge improved dramatically during this period too. The biggest change in the whole process was the move to where we are right now: we decommissioned our old monitoring tools and switched on Visibility in PagerDuty. We use a number of different tools to monitor different aspects of our infrastructure and services, and Visibility allows us to integrate them all into one central platform for monitoring. Our New Relic instance, for example, feeds directly into Visibility in real time. Combining these two powerful tools has been invaluable to our operations. Neil will go into this relationship a little bit more later.
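
The idea of tying every service to its responsible squad, so that ownership is never in doubt and orphaned services surface immediately, amounts to a service catalog lookup. A minimal sketch, with entirely invented service and squad names rather than SBG's real catalog:

```python
# Hypothetical service catalog: each service maps to the squad that owns it,
# mirroring the "search for what's broken and let PagerDuty do the work" idea.
SERVICE_CATALOG = {
    "sportsbook-api": "bet-tribe-squad-ops",
    "gaming-lobby": "gaming-tribe-squad-ops",
    "payments": "platform-engineering",
}

def route_alarm(service_name):
    """Return the squad to page for a service, or flag an orphaned service."""
    squad = SERVICE_CATALOG.get(service_name)
    if squad is None:
        # Services with no owning squad surface immediately, instead of
        # being silently left behind when responsibilities change.
        return "unowned-service-review"
    return squad
```

The useful property is that the catalog is the single, easily amended source of truth, which is exactly the "irrefutable catalog" benefit described above.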

00:19:47

Our main source of information about service-impacting incidents is now the major incidents section, which is the one in the middle, along with the service health section to the left. Naturally, the major incidents section is reserved for incidents with real-time impact to services; service health is for more generic, non-critical alarms. The section on the right, infrastructure health, or "bubbles" as it's fondly referred to in SBG, is helpful as a retrospective tool to analyze periods of instability for specific services: the more alarms fired at a certain time, the bigger the bubble. If a platform had an issue at two o'clock last Friday, we can use bubbles to quickly build a picture of what services specifically were impacted, and how.

00:20:24

At the same time as we moved to Visibility, we also tied alarms to specific services and then configured the alarms to automatically call out the responsible squads. This completely removes the need for any manual work in monitoring alarms and puts the onus of maintaining alarms onto the engineers. This change really drove refinement, with tribes taking ownership of their critical alarms, as opposed to leaving it up to the service desk to determine whether anyone cares that the disk space on host 203 is at 98% capacity. Tribes would configure these alarms so that PagerDuty pages whoever actually needs to know. With this latest step in our transformational journey completed, we've seen an additional decrease in mean time to recovery of approximately 8% since rolling out Visibility. From the beginning of PagerDuty's implementation, we've also had a substantial total decrease in mean time to recovery, which will certainly have been influenced by our PagerDuty and New Relic integration. Perhaps our most remarkable metric is the improvement in our mean time to acknowledge, which has improved by 86% over the last two and a half years.
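
The mean-time metrics quoted throughout this session (mean time to acknowledge, mean time to recovery) are straightforward to compute from incident timestamps. A minimal sketch, with an invented record shape; real incident data would come from your incident platform's analytics or API:

```python
def maturity_metrics(incidents):
    """Compute MTTA and MTTR in minutes from incident records.

    Each incident is a dict with 'triggered', 'acknowledged', and
    'resolved' timestamps expressed as minute offsets (field names are
    illustrative, not any particular tool's schema).
    """
    mtta = sum(i["acknowledged"] - i["triggered"] for i in incidents) / len(incidents)
    mttr = sum(i["resolved"] - i["triggered"] for i in incidents) / len(incidents)
    return mtta, mttr

sample = [
    {"triggered": 0, "acknowledged": 4, "resolved": 60},
    {"triggered": 10, "acknowledged": 16, "resolved": 100},
]
print(maturity_metrics(sample))  # (5.0, 75.0)
```

Tracking these two numbers over time is what lets you put a figure like "86% improvement in mean time to acknowledge" on a transformation.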

00:21:29

The next steps in our transformational journey are loosely centered around additional integrations, primarily ones where PagerDuty interacts with both our Slack and JIRA platforms. We recently launched a bot in Slack that allows us to type a simple command that triggers a PagerDuty alarm, so we won't even have to go into the PagerDuty portal to use PagerDuty. The aim is to reclaim a little bit more time during break-glass or major incident situations. That said, I'm reliably informed PagerDuty is on the verge of launching their own version of this, which will be up for grabs in the near future. With JIRA, we're looking to avoid duplicating work by having JIRA and PagerDuty work as one, whereby PagerDuty call-outs will trigger the automatic creation of JIRA tickets. Beyond those, we're actively working with our suppliers, such as PagerDuty, to further boost our combined capabilities. And with that, I'll hand you back to Amit for a closing statement on our part.
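
A Slack-command-to-alarm bot like the one Steven mentions would, at its core, translate a chat command into a trigger event. The payload shape below follows PagerDuty's public Events API v2 documentation (routing_key, event_action, and a payload with summary/source/severity); the command format, service name, and routing key are invented placeholders, and SBG's actual bot may work differently. The sketch only builds the payload; a real bot would POST it to the Events API endpoint.

```python
import json

def slash_command_to_event(command_text, routing_key="YOUR_ROUTING_KEY"):
    """Translate a chat command like '/page sportsbook-api checkout errors'
    into a PagerDuty Events API v2 trigger payload."""
    _, service, *words = command_text.split()
    return {
        "routing_key": routing_key,        # placeholder, per-service key
        "event_action": "trigger",
        "payload": {
            "summary": " ".join(words) or f"Manual page for {service}",
            "source": service,
            "severity": "critical",
        },
    }

event = slash_command_to_event("/page sportsbook-api checkout errors")
print(json.dumps(event, indent=2))
```

Keeping the bot this thin means all the routing and escalation logic stays in the incident platform, where it is already maintained.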

00:22:15

Thanks, Steven. So all these brilliant functionalities through PagerDuty have helped us drive the three main key areas we always aim to improve on. It's helped us drive revenue and improve customer outcomes: the 86% reduction in our mean time to acknowledge has ultimately resulted in us reducing the impact the customer sees every time the systems get impacted. It improves people's productivity and engagement: monitoring integration through the PagerDuty API helps us monitor services in real time, which reduces false alarms and call-outs, ultimately meaning more productive time in the office rather than getting called out at two o'clock in the night when service was not even impacted. And finally, it reduces business risk and improves cost efficiency, because the fewer issues customers experience, the better it is from their perspective in terms of product journeys. On that note, I'll hand back to Rachel to discuss more about the transformation of real-time operations with the right tooling.

00:23:34

Thanks, Amit. So that was a huge improvement that you saw, going down 86%, from 30 minutes to four, in mean time to acknowledge. I heard a lot of themes in what you were talking about, and I think it's important to talk about what it really takes to do an operational transformation; you really touched on a lot of different elements. One of them of course was a tool, but it was also changing your organizational structure, putting the right processes in place, and moving to a culture of ownership, of people both building and owning their services. So all those things work together. One of those elements is having the right tooling, and so I wanted to talk just really briefly, on the next slide, about PagerDuty's platform for real-time operations.

00:24:37

Another theme, or a couple of themes, I heard throughout your presentation was automation: making sure that you can automate as much as possible of those mundane operational things that you don't need your brilliant team to be spending time on. PagerDuty's platform does that with on-call management and modern incident response. That's where we essentially automate the whole runbook and automate the capability to spin up a response with multiple people in a very surgical way. So when you have this problem, you need these five people from these five teams, and PagerDuty can automate the whole process. The other thing is about having the right information in front of people so that they can be more effective. One of the ways we do that, which was shown in the presentation, is through Visibility: making sure that you can see all of the data, all of the issues that are going on at one time, who is working on them, and what other services could be impacted.

00:25:48

We also have Event Intelligence, which helps manage the noise. When you have a lot of data coming in, we can look across that data and say these things are related to each other, and instead of paging with multiple incidents, we can automatically group related issues into one incident, so that all of that context is available for the responders and they have more information from time zero, when they first get told there's an issue, about what's going on. And then lastly, PagerDuty has analytics so that you can look back afterwards and really learn and improve: look at how many incidents you had, what was causing them, how long they took to resolve, whether anything happening is repetitive, and whether certain services are noisier than others and maybe need an operational investment. So that's the PagerDuty platform, all built on top of an enterprise-class, scalable platform, with a large amount of unique data that all of these products on top of it can leverage to help you create a learning and improving organization. With that, I will hand it over to Neil to talk about monitoring and getting the right data into PagerDuty.
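
The alert-grouping idea behind Event Intelligence can be illustrated with a deliberately simple sketch: alerts for the same service arriving within a short window are folded into one incident instead of paging separately. The real product uses far more sophisticated correlation; the fixed time-window rule and data shapes here are purely illustrative.

```python
def group_alerts(alerts, window_seconds=300):
    """Fold related alerts into incidents to cut paging noise.

    alerts: time-ordered list of (timestamp_seconds, service, message).
    Returns a list of incidents, each carrying all of its grouped alerts
    so responders get full context from time zero.
    """
    incidents = []
    open_by_service = {}  # most recent incident per service
    for ts, service, message in alerts:
        incident = open_by_service.get(service)
        if incident and ts - incident["start"] <= window_seconds:
            # Same service, close in time: attach to the open incident.
            incident["alerts"].append(message)
        else:
            # New (or stale) context: open a fresh incident.
            incident = {"service": service, "start": ts, "alerts": [message]}
            incidents.append(incident)
            open_by_service[service] = incident
    return incidents
```

Three rapid-fire alerts on one service thus produce one page with three pieces of context, rather than three separate pages.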

00:27:27

Thank you, Rachel. I appreciate the introduction. So one of the things that I'm here to talk about is the fact that it's incredibly important to make sure that when you use these tools, when you put people onto triaging incidents and you start notifying people, they're working on the right thing. Nobody likes a false alarm, whether it comes to wake you up in the middle of the night, or whether it requires you to mobilize an entire team to try and resolve certain issues within your organization. An example of probably one of the worst false alarms that we've seen in recent years was the missile warning system in Hawaii in January last year, which was triggered through, actually, a bad user interface.

00:28:22

So it was down to a code-level issue, and it notified the entire population of Hawaii that there was an imminent risk of them being attacked by a missile. Now, we're not saying that when it comes to running your business-critical applications the consequences of a false alarm are going to be quite so drastic, but it just highlights the importance of making sure that when you do trigger an alarm or an event, it's accurate. And if you want to move your organization from being reactive to proactive and ultimately predictive, then you have to make sure that the notifications you provide can be relied upon. So how do we do that at New Relic? Well, fundamentally, we expand the concept of APM to go way beyond just looking inside your application code, or just looking at your infrastructure.

00:29:18

Instead, we provide full out-of-the-box instrumentation that allows you to quickly ascertain the impact on the customer, from every user experience and mobile device, through the application tier, to the backend infrastructure and cloud services that are supporting those particular applications. That immediate visibility gives you an idea as to where problems are potentially coming from and what the impact is. Secondly, you can extend the New Relic instrumentation with custom attributes and custom metrics, which are key performance indicators relevant to your business. So, how many bets are we taking, for example, within the betting and gaming industry? Or how many orders are we taking through e-commerce, and how is the alert that we are actually triggering at the moment impacting our ability to do business? How many customers are being impacted? This gives you much greater context into whether or not what we're triggering as an alert is meaningful to the business and requires the appropriate action.

00:30:27

And then finally, there's actually generating those alerts in an intelligent way. It's not just basing alerts on fixed threshold breaches, but using machine learning techniques to understand what's normal and what's abnormal, looking at things like cohort analysis. So, for example, if metrics are supposed to operate in a similar fashion, maybe across a load-balanced set of hosts, and one of them starts to behave abnormally, that's something which you should know about. And also, how has a change in the environment actually influenced that change in behavior? Was there a deployment which resulted in perhaps a performance regression, or introduced a number of new errors into the equation? Providing that information back in full context when you trigger an alert means it's not just intelligent; it's also providing the full context. So, fundamentally, what New Relic is doing is allowing you to connect the technical performance of your applications and infrastructure to the business value that they deliver.
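
The contrast Neil draws between fixed thresholds and learning what's normal can be illustrated with the simplest possible baseline model: flag a sample only when it deviates from the historical mean by several standard deviations. This is a toy z-score sketch, not New Relic's actual algorithm, and the numbers are invented response times.

```python
import statistics

def is_anomalous(history, latest, z_threshold=3.0):
    """Flag `latest` if it sits more than z_threshold standard
    deviations from the mean of `history` (learned 'normal')."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        # No observed variation: anything different is abnormal.
        return latest != mean
    return abs(latest - mean) / stdev > z_threshold

# A response-time series hovering around 100ms tolerates small wobble...
print(is_anomalous([100, 102, 98, 101, 99], 103))  # False
# ...but a genuine spike stands out without any hand-tuned threshold.
print(is_anomalous([100, 102, 98, 101, 99], 300))  # True
```

The advantage over a fixed threshold is that the same rule adapts to each metric's own baseline, which is what makes the resulting alerts reliable enough to act on.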

00:31:36

So to sum that up: if you think about becoming proactive, real-time, and delivering APM-driven operations, there are four key areas that you need to set up. One is that you need to leverage APM data, and when I say APM, I don't mean just looking at the code; Gartner recently expanded the APM category to include user experience and infrastructure, so that you have full context of what's going on, so that you are notifying the right teams and they understand everything which is going on within that complex environment. The second thing is to reduce mean time to repair. Now, we've heard from Sky Betting and Gaming how they've made significant gains in reducing mean time to acknowledge, making sure that somebody is in receipt of that alert and is working on it faster, but also providing the context to make sure that it's the right people that are involved, that they've got all the information to hand, and that they can collaborate accordingly in order to resolve the issue faster.

00:32:47

And the third thing is to make your alerts actually actionable. It's easy to see how these two technologies together are a force multiplier; it's not just a case of one plus one. If you can connect these two technologies, allowing you not only to alert more intelligently but also to mobilize the appropriate resources to resolve the issues faster, then that is going to benefit your business greatly. And then finally, with the frequency at which organizations are now deploying, it's never been more important to be able to detect issues faster: the faster you go, the more often you'll make mistakes. So you have to be able not only to deploy quickly, but also to determine whether that deployment has had a positive or a negative impact, and you have to be able to roll it back just as quickly as you rolled it out. And that requires either the ability to do that manually or the triggering of automated remediation in order to make that happen.