SRE From Scratch: An Enterprise Journey

This session describes our SRE transformation journey at a large European telco, starting from a traditional setup with a platform development team, a support team, and teams that used the platform to onboard users and applications.


In the traditional setup, no team was really happy. The symptoms:

- Our version of “works on my machine” and “well, doesn’t on mine!” conversations

- Ownership of the availability and resilience of the platform


Our journey started with a diktat from leadership on SRE adoption. How do we make SRE happen from a blank slate?


Same place as most people start off – we pulled the Google books and the DOI SRE certifications. Obviously (we thought), we had to set our SLOs; everything flows from there.


But how do you set an SLO if you don’t know the capability of the current system?


Since it was meaningless to set SLOs without a baseline, we started defining and then instrumenting our baseline. Once we had the instrumentation in place, we built on it to define our services, coverage, and observability.


Along the way, we’ll also share our experiences with tools and processes, what worked and what didn’t, and an overview of what it takes to establish SRE in a large traditional setup.

Monika Gupta

SRE & Support Lead - MOBIUS Platform & Tools, British Telecom

Avinash Rao

VP of Products, Digite

Transcript

00:00:13

Hi everyone. We're Avinash Rao and Monika Gupta, and it's a pleasure to be here at the DevOps Enterprise Summit. Today, we are going to take you through "SRE From Scratch: An Enterprise Journey".

00:00:29

Hey everyone, I'm Monika, from Mobius Platform and Tools at BT Technology. I have more than 18 years of IT experience in software development and management. I have been practicing Agile and DevOps for more than a decade now. I have also been a certified SRE Foundation professional since last year. For the last year, I have been playing the role of SRE lead for Mobius Platform and Tools in BT Technology.

00:00:58

I'm Avinash. I'm VP of Products at Digite. In the period that Monika and I collaborated on this particular journey, I was part of Wipro Digital, and together we were working on the Mobius program. I'm a DevOps Institute ambassador and a DASA-certified DevOps coach. I've spent about a decade now working on Agile, Lean, and DevOps, as well as Kanban. And that's me, by the way, on my way to Everest base camp; someone just reminded me that it's been exactly two years to the date since we started that particular journey. Let's start with an overall view of what BT was trying to achieve. What you see on the screen is the leadership ambition that BT has: by 2030, BT wants to become the world's most trusted connector of people, devices, and machines. And there are really two or three key things that must come together for BT to achieve what it wants to. The first one is the business strategy.

00:02:07

And the key part of building a strong foundation for that business strategy is to be able to bring a very strong technology capability to British Telecom. The Mobius platform was started in order to provide some of the DevOps capability to the British Telecom environment. Overall, if you look at the journey from an end-to-end perspective, the products that BT offers are underpinned fundamentally, and this is true of any organization in this digital world, by very strong IT. We also must upgrade the processes by which this value is delivered, and together create a better experience, both for the people utilizing these technologies inside BT and for BT's customers.

00:03:10

A second key item that really underpins this particular strategy is a culture where people can be their best. What BT wants to do is create an environment where people have access to cutting-edge work and the latest technology. And this is not technology for technology's sake, but really a way for BT to give a world-class IT experience to its end customers. So when we started working together at the start of the Mobius program, there were a few things that we were aware of in terms of what was needed for the success of what we call the Mobius program. Obviously, Mobius wanted to create a world-class DevOps and CI/CD environment. A lot of the environments at that point in BT were on-premise, and so we wanted to start off the cloud adoption journey. Also, if you look at many enterprise journeys, there is a focus on DevOps and CI/CD, but not as much on continuous testing, which becomes a key problem.

00:04:27

And hence, we thought of continuous testing as being an integral part of the Mobius journey. Achieving a high level of DevOps capability automatically implies the need for automation. When we looked at the overall BT strategy, we also saw that culture is an important part of this change, and that's where the Agile processes start coming in. However, there were some challenges that we faced right off the bat. One was the lack of a measurement culture. The second was a lack of tooling governance, along with localized optimization: we had several teams, each with their own DevOps pipelines at completely varying degrees of capability, some fairly advanced and some very rudimentary. Our first step was to put together a collection of open source tools and provide those as part of the Mobius platform at BT.

00:05:33

Yeah, so keeping all those goals in mind, we started building a framework to create a CI/CD pipeline using whatever tools we had available. Mainly those were open source tools like GitLab, Jenkins, and Nexus. We also defined the standard gates for quality and deployment for all our engineering teams to adhere to. In order to provide a centralized, robust, scalable platform, our goal was to move from open source to enterprise tools so that we could scale to support 2,000-plus applications. So slowly we started moving to enterprise-grade tools. We are now using the GitLab enterprise version to give us a strong foundation for a source control system. Then we moved to an enterprise-grade solution for CI, and for our deployment we moved to Ansible. Considering all this, we thought we had built a very robust platform for our engineering teams to use. And I'm sure with all this, you might be thinking that we must have a great platform that can be used by many engineering teams.

00:06:47

But no, we started seeing a lot of complaints from our end users who were using the platform. They were saying the platform is not stable, masters are frequently going down, 500 errors are coming. So this was the time when we felt there was a need for us to retrospect. Originally, how were we working? We were working as four different teams. There was an onboarding enablement team that was helping the end users to set up their pipelines on our platform. Then there was a platform team that was responsible for creating a standard pipeline with the standard stage gates, et cetera. Then there was a core team whose main purpose was to provide good features related to each and every tool. And there was a lean support team that supported customers on a daily basis, providing them access or setting up webhooks, et cetera, for them to run the normal pipeline.

00:07:49

So we realized that all of these four teams were really not working in collaboration. There were a lot of silos. So then we all got together and brainstormed, and we bought SRE books to understand SRE principles. We got ourselves certified. We also hired an SRE consultant to help us assess our current SRE maturity state, and we also wanted him to help us with the adoption. We followed this four-step SRE adoption model. We assessed our current state and defined a to-be state. In order to bootstrap the culture in our team, we set up some workshops, trainings, et cetera, for our team to actually understand what SLOs mean, what availability means, and what exactly service reliability means. So now we are more in the adopt phase, where we are trying to scale and adopt these SRE practices across all our services.

00:08:55

Based on a platform model that was available from a partner, we looked at what we needed in order to really achieve the performance metrics that were important to our customers. There were two parts to that: one with regard to speed, and the second with regard to stability and reliability. The key metrics we picked from a speed perspective were lead time and release cadence. From a stability perspective, it was the time to detect, engage, and restore from an incident, because what used to happen was that the Mobius team realized there was a problem only when an end customer reported it. So it was really important for us to be able to proactively detect, and then engage and restore, the issue or the error. And of course, we had to keep a really close watch on the change failure rate.
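The stability metrics named here can be derived from incident timestamps. The following is a minimal Python sketch, assuming each incident record carries four timestamps; the `Incident` class, its field names, and the sample numbers are hypothetical and are not taken from the Mobius tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime    # when the fault actually began
    detected: datetime   # when monitoring or a user first flagged it
    engaged: datetime    # when an engineer started working on it
    restored: datetime   # when service was back to normal


def mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def stability_metrics(incidents, changes_total, changes_failed):
    """Mean time to detect, engage and restore, plus change failure rate."""
    return {
        "mttd_min": mean_minutes([i.detected - i.started for i in incidents]),
        "mtte_min": mean_minutes([i.engaged - i.detected for i in incidents]),
        "mttr_min": mean_minutes([i.restored - i.started for i in incidents]),
        "change_failure_rate": changes_failed / changes_total,
    }


# Hypothetical sample data: two incidents and 40 changes, 6 of which failed.
t0 = datetime(2021, 1, 4, 9, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=30), t0 + timedelta(minutes=40),
             t0 + timedelta(minutes=90)),
    Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=20),
             t0 + timedelta(minutes=50)),
]
metrics = stability_metrics(incidents, changes_total=40, changes_failed=6)
print(metrics)
```

Here time to restore is measured from the start of the fault rather than from detection; either convention works as long as it is applied consistently. A long gap between `started` and `detected` is the signature of the customer-reported incidents described above.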

00:09:56

Now, underpinning this entire change was the need to bring in a culture that was focused more on Agile and Lean values. As Monika just described, the structure we had created multiple silos, because of which there was no end-to-end value stream thinking, and that had become a major impediment to achieving flow across the platform. In terms of our movement to the cloud, cloud clearly offers several significant advantages, such as letting us meet additional requirements through auto-scaling when there is higher demand, and so it was an important component of what we looked at. From a reliability engineering perspective, we first had to take a very close look at where we were that day in terms of service level management, monitoring and observability, and toil management. Given all these, we obviously couldn't do everything at the same time. So we took a good look at each of these, prioritized them, and created a subset that we should start with.

00:11:14

Yeah. So after assessing our current state, we decided to start slow, but do the right thing that made sense at that point in time. We focused on four key areas. One was service level management, where we wanted to measure availability well. We also wanted to set up very basic monitoring. We were very reactive in nature when we started, and we had no system in place that could let us know about problems well ahead of time. That's when we wanted to set up basic proactive monitoring. The other thing we wanted to focus on was improving our incident response model in order to address customer issues and queries on time. We wanted to provide them the right updates at the right time.

00:12:08

So, be it related to incidents or be it related to any change or upgrade that was coming up. And the fourth area we wanted to focus on was toil. Keeping this in mind, we set the goal to move from reactive to proactive, and to be more preventive in future. So let me now take you through our reactive-to-proactive journey and what we wanted to achieve in six months. Currently, here is where we are. We have set up good monitoring in Dynatrace. We have moved to a single ITSM platform, ServiceNow, where we are able to manage all our tickets: incidents, changes, problems, everything. We have implemented a clear comms strategy to send updates to our users on time. We are able to measure some of the key metrics like availability, throughput, mean time to detect, and mean time to resolve. We have set up a good knowledge base for our users and for our L1 and L2 engineers to follow. As a result, we are seeing a good number of problems identified by Dynatrace and being reported as incidents and improvements in ServiceNow. So now we are able to take the right decisions to address these problems, in terms of raising CRs, problem tickets, et cetera, with the SMEs or with the infrastructure specialists.
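To illustrate how a measured availability baseline can feed a first SLO, here is a small Python sketch. It assumes availability is computed from request counts, which the talk does not specify; the weekly figures, the function names, and the 99% candidate target are all hypothetical.

```python
def availability(good_requests, total_requests):
    """Fraction of requests served successfully over a window."""
    return good_requests / total_requests


def error_budget_remaining(slo_target, good_requests, total_requests):
    """Share of the error budget still unspent (1.0 = untouched, below 0 = overspent)."""
    allowed_failures = (1 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    return 1 - actual_failures / allowed_failures


# Hypothetical four weeks of (successful, total) request counts.
weekly_counts = [
    (99_200, 100_000),
    (99_650, 100_000),
    (99_400, 100_000),
    (99_750, 100_000),
]

# Step 1: establish the baseline before committing to any target.
baseline = [availability(good, total) for good, total in weekly_counts]
worst_week = min(baseline)

# Step 2: pick a candidate SLO the system has already shown it can meet.
candidate_slo = 0.99  # assumption: just below the worst observed week

# Step 3: track how much of the error budget the latest week consumed.
latest_good, latest_total = weekly_counts[-1]
remaining = error_budget_remaining(candidate_slo, latest_good, latest_total)

print(f"worst week: {worst_week:.4f}, budget remaining: {remaining:.2f}")
```

Starting from the worst observed week keeps the first target achievable; it can be tightened later as monitoring and incident response improve.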

00:13:29

So I'm aware of the improvements that happened as a result of some of the things that we did together, Monika. But given that I have moved on to a different role now, it'll be interesting for me to understand what the next steps are that you are looking at in order to continuously improve the platform.

00:13:52

Yeah, so you're right. We are still learning and improving. There are lots of areas that we still need to improve. Until now, we were working as a centralized SRE team, which as a whole was focusing on the service reliability issues for all the services. We are seeing that this model isn't working very well. So what we are planning to do is to set up a decentralized, federated, reliability-conscious team. We will align SRE engineers with the key services, so as and when any new feature or upgrade is planned, these engineers will make sure those features are robust enough. They will be driving the non-functional requirements. That is one thing that we will be bringing in. And as the next step, we want to take our monitoring to the next level.

00:14:42

We want to bring in good observability by introducing good log monitoring using the ELK stack, and we also want to introduce AIOps. With all these issues, et cetera, flowing in, we now want to analyze the trends and make use of them to give a good capacity prediction: say, six months or one year down the line, how much each service has to scale in terms of infrastructure. So we will be looking at that data. Also, we are very much on-prem heavy currently, and we want to move to cloud. So we are really trying to analyze what SRE practices we need to bring in to support our tools on cloud. These are the key things that we will be focusing on in the coming three to six months.
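The idea of using trends for capacity prediction can be shown with a very simple model. The sketch below fits a straight line to monthly peak load and extrapolates it. It is a deliberately basic stand-in and does not represent what an AIOps product does; the metric, the sample values, and the function names are hypothetical.

```python
def fit_line(ys):
    """Ordinary least-squares fit of y = slope * x + intercept for x = 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_mean - slope * x_mean


def forecast(ys, months_ahead):
    """Extrapolate the fitted trend months_ahead past the last observation."""
    slope, intercept = fit_line(ys)
    return slope * (len(ys) - 1 + months_ahead) + intercept


# Hypothetical peak concurrent pipeline runs per month over the last six months.
monthly_peak_runs = [1000, 1100, 1200, 1300, 1400, 1500]

six_months_out = forecast(monthly_peak_runs, months_ahead=6)
print(f"projected peak in six months: {six_months_out:.0f}")
```

A linear trend ignores seasonality and step changes such as onboarding a large batch of applications, so a projection like this is better treated as a starting point for a capacity conversation than as a commitment.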

00:15:37

One of the things that I've learned from being part of this journey, along with you and BT, is that in the beginning we did look to see if there was one SRE tool that we could bring in to make this entire process happen. But I think the real learning, which we would like to share with our audience today, is that you have to see what is necessary for your unique ecosystem and make the specific interventions needed to deliver the improvements that your particular situation demands. With that, we are signing off, and we are happy to take any questions.