Building Confidence in Your SRE Team

Does it seem like your SRE team is starting to look way too much like the familiar Operations team that you have always known? It’s easy to fall back on the well-known patterns of production support. While it is critical to demonstrate strong Operations expertise, it is equally critical that your SRE team adopt a new mindset.


You may think that your first step in building a healthy SRE team is to adopt all the acronyms: SLIs, SLOs, SLAs. It's enticing to immediately implement Error Budgets with consequences. But you first have to build a culture of trust and measurable performance.


We strive to drive our operational burden to zero. We look to automate last year’s work to make room for new challenges. We make the time to eliminate toil in our daily tasks.


In this talk, we will take a close look at how process and automation can be the driving force behind a truly empowered SRE team.

Michael Winslow

Senior Director, Software Development & Engineering, Comcast

Transcript

00:00:13

We have a phenomenal set of sessions for you this morning. Up first is Michael Winslow, who has spoken here at DevOps Enterprise six times. He is Senior Director of Engineering at Comcast, currently ranked 26 on the Fortune 500, with thousands of software developers. Over the years, he's had a variety of roles, including leading Xfinity Mobile and Xfinity Stream. Today, he will be sharing the story of how, after a decade of leading dev teams, he was asked to lead the SRE initiative for a critical service, the Xfinity back office. And when I say critical, it's because it enabled revenue-generating services such as plan activations, whether direct to consumer or through storefronts. He will be talking about the SRE principles he tried to apply, the lessons he learned, and the value that he found within the SRE body of knowledge, as well as the capabilities and value that he and the teams created. And I am so delighted for my friend Michael, because after we recorded the session, it was announced that he was being promoted to become a distinguished engineer, which represents less than 1% of the engineering community. So here's Michael.

00:01:24

Thank you so much for that introduction, Gene. I can't believe this is already my sixth talk. Hearing you say that takes me back to 2018, to that first lightning talk that I did in front of the crowd. I'm really excited to talk to the audience one more time. So, like Gene said, I'm Michael Winslow, currently senior director at Comcast. A fun fact about myself is that I still code for management tasks. I find a lot of joy in it, and I would say that anybody who moves over to management from technology should consider doing the same. So today I'm going to talk to you about building confidence in your SRE team. Now, before I do, I want to stay in touch with everybody. If you go out and Google me right now, you won't find me. You'll actually find this guy, Michael Winslow, who makes all the sound effects in Police Academy.

00:02:17

He was recently on America's Got Talent as well. He's the most famous Michael Winslow. So if you want to stay in touch with me, you're going to want to reach out to me directly. A funny story with that: I had a friend who said, well, if you have the same name as somebody else, just use your middle name and start using that as your professional name. Well, the problem with that is my middle name is Scott. So anybody who's a fan of The Office knows that there's a more famous Michael Scott out there as well. So just to be safe, it's Michael S Winslow on either Twitter or LinkedIn. Let's stay in touch. All right, let's start off. This is where I work. This is where I went into the office every day before the pandemic started. I honestly can't wait for the numbers to get back down so I can go back there.

00:03:02

This is the Comcast Technology Center, and I really want to call out specifically the word technology. Comcast is not always top of mind when people think about technology companies, but we have in fact made the change over the years from a cable provider to a technology company that provides cable and other things. To illustrate that, let me give an overview of what Comcast looks like as a whole. I work in the area that is Xfinity Mobile, X1, xFi, and Xfinity Home, but right in products and services we also have FreeWheel, Spotlight, and Comcast Business, and in Comcast as a whole we have Spectacor, Spectra, and a lot of other operating companies, including Comcast Ventures. Now in 2018, when we really started to build our family up and bring on Sky and NBCUniversal, you can imagine there were so many operating companies that we were working with, and DevOps became absolutely critical, because what we wanted to do as we brought on new operating companies was, as much as possible, to remove these artificial lines that were between the companies and share information across them.

00:04:17

So DevOps has been crucial in all of that. Right around 2016, I joined Xfinity Mobile. At the time it was actually a super secret project, because Xfinity Mobile didn't launch until April of 2017. So at this time we were just kind of acting as a startup. Specifically, my part and the team that I worked on was the Xfinity Mobile back office. We supplied all of the APIs and orchestration for the direct-to-consumer website and the Xfinity Mobile stores that started popping up. We did a lot of the logic and guts behind being able to sell and operate the phones. So we adopted this DevOps model, which worked well because, unlike Comcast as a whole, which is a large enterprise, at the time Xfinity Mobile acted as a startup.

00:05:16

So it was easy to pick up these practices and get teams to buy into them. Now, after the launch of Xfinity Mobile, and after I was on the project for a while, I moved to a different part of Comcast called software strategy and transformation. We were kind of the engineers that help the other engineers at Comcast. When I moved over to that group, I remember specifically wanting to bring over all of my knowledge of DevOps. And they said something interesting to me. They said, well, on this team we don't do DevOps. Instead, we do SRE. All right. Now, I had heard about SRE, but I had not dived deep into exactly what SRE was and how it differed from DevOps. Since I was starting on a new team, I really wanted to get an idea of how they were doing this.

00:06:04

And so I asked them, what is your definition of SRE? At the time the team said, well, our software engineers develop the software and our site reliability engineers operate the software. And I thought probably the same thing that a lot of you are thinking right now: how is that not just operations, you know, where developers would just toss code over to operations and have them operate it in production? So I was a little skeptical, but I still wanted to work with the team to really define what SRE was. So I joined a book club with the team, where we were going to read Site Reliability Engineering and see what practices we could take out of it. And I started with the mindset of, I want to bring a lot of my knowledge of DevOps into this and find a starting point that lets me relate to exactly what site reliability engineering is.

00:07:00

So, as I looked at the hierarchy here, I wanted to find a place that really spoke to me. And there it was: release procedures. Right in the center, testing and release procedures. That made me think of the CI/CD that I had done previously, and I thought that this would be my starting point that I could build out from. Now, my OCD kicked in, and one of the first things I noticed was that the P in release procedures was not capitalized. This was not title case, but I didn't think much of it at first. That was until I went to the part where they're supposed to be explaining what this part of the hierarchy was. And as you can see here, they have testing in there, but absolutely nothing for release procedures. You can go online to the Site Reliability Engineering book right now.

00:07:46

This is what it looks like. It really started to feel to me like release procedures in Site Reliability Engineering was almost an afterthought. All right. So you can see testing is there, but not release procedures. So what I wanted to make sure that I did with this team, to truly bring my expertise in, since I was leading this team, was modify the definition of SRE a little bit up front. I wanted to change it so that we said SRE and DevOps together, which means that our definition is: the site reliability engineers will use DevOps principles to operate and improve software. So that was the approach that we were taking at the time. All right. So let's get into some of the issues that we had in the beginning. Now, when you say you're going to start doing site reliability engineering, anyone who's ever gone to a talk or looked into it knows that there are the acronyms, you know, SLIs, SLOs, SLAs, the agreements, the objectives, and error budgets.

00:08:56

Now, the problem that we were having in this particular organization is that we had very mature software engineering groups, and they were not just going to buy into this idea of a group called site reliability engineers that had not proven themselves. Remember, we had people already working there, and you can't just slap a title on somebody and expect that everybody is automatically going to give them the gravitas to have power over how you develop your software, especially when coming from an operations standpoint like we were. So what we needed to do was find ways to build trust, to build confidence in the team. The best way that I could think of was to dive back into my expertise, and that is things that have to do with DevOps. And right there at the top, for anybody who knows the CAMS acronym, you have culture, automation, measurement, and sharing.

00:09:57

So I said, let's really get good at automating what we're doing right now in the software strategy and transformation group. And once we've built up that confidence in our teams, let's start slowly bringing in these other ideas of SLIs, error budgets, SLOs, and SLAs. We had an SVP of reliability engineering at the time, Dana Wilson, and she was quoted as saying, "We must automate away the hundreds of routine tasks which create the fog that impedes our vision." So I was able to hang this up and give the team a little bit of a north star while we were going through these initial automation improvements. Now, one of the problems that we had was that the folks who had set up any automation previously worked a lot like Brent. For anybody who knows The Phoenix Project, which I hope you do, Brent was that very powerful, very capable engineer that everybody went to, and he became overloaded, right?

00:11:02

We had our version of a Brent on the team, and we had all of these great engineer twos and engineer threes that could be doing some of this work, but they didn't have all the knowledge that was locked in Brent's head. So our first mission was to take some of these procedures that Brent had in his head and make sure that the team was skilled in them. All right. So the first thing I did was sit down and say, Brent, you need to document how you release, how you do your deploys. That was the first step. Now, one quality that you might find in Brents is that they don't always enjoy documentation. And while this on the screen might look like a representation of the instructions that he might have created, he actually only delivered five lines with very little detail as his procedure for releasing our software. And thanks to the wonderful joys of Confluence and being able to go back in history, I'm able to actually show you what our Brent put out as his first offering of documentation on how to release our software.

00:12:11

And like I said, five steps: verify the release notes; verify the release file; verify with the release manager that the notifications have been sent for the release; send communication about the start of the deployment; disable the VIPs. Now, clearly you can probably see that this isn't enough instruction for anybody to pick up, but it was a good starting point. At least we were able to get Brent to put something on there. And that's when we put our plan into operation. We actually had Brent sit down with a junior developer, and the first thing we said was, okay, this junior developer is going to do the next deploy. And you could see it on our Brent's face at the time: he knew that his documentation was not thorough enough. But we said, don't worry about that, we are going to improve it over time. So in the process that we put in place, the first step was that the engineer executing the runbook of the procedure should ideally be one of the most junior on the team.

00:13:13

And that's what we did. Now, Brent can help, but only when the engineer gets stuck on the documentation that Brent put together. There would be a third engineer in place, which helps increase the spread of the knowledge but also allows us to free up the hands of Brent to help the junior engineer. Every time the engineer has to interact with Brent, the third engineer documents the differences, the missing pieces of the documentation. We would repeat this process over and over again with every deploy until errors with the deployment were few. All right. And once you have that document in place, it becomes your pseudocode to eventually automate it. So this was the process that I'd gone through several times before and was now bringing to my new team. And we can go back to that guide that we had in Confluence and show you how his instructions evolved over time using this process. Just on the first iteration of the process, we found out that the junior developer didn't have access to Jenkins and didn't have access to the right place in GitHub.
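
As a rough illustration of how a written runbook can act as pseudocode for later automation, here is a minimal Python sketch. It is not the tooling the team used. The `Step` class, the `run_runbook` function, and the placeholder `lambda: True` actions are all hypothetical; only the step descriptions come from the five-line checklist in the talk. Each step starts out manual and gets an automated action once the documentation for it is solid.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """One runbook line: a description plus an optional automated action."""
    description: str
    # None means the step is still performed by hand (hypothetical convention).
    action: Optional[Callable[[], bool]] = None


def run_runbook(steps: List[Step]) -> List[str]:
    """Run automated steps in order and return the steps that are still manual.

    Raises RuntimeError as soon as an automated step reports failure.
    """
    still_manual = []
    for step in steps:
        if step.action is None:
            still_manual.append(step.description)
            continue
        if not step.action():
            raise RuntimeError(f"Step failed: {step.description}")
    return still_manual


# Descriptions follow the checklist from the talk; the actions are placeholders.
RUNBOOK = [
    Step("Verify the release notes", action=lambda: True),
    Step("Verify the release file", action=lambda: True),
    Step("Verify with the release manager that notifications were sent"),
    Step("Send communication about the start of the deployment"),
    Step("Disable the VIPs"),
]

if __name__ == "__main__":
    for description in run_runbook(RUNBOOK):
        print("Still manual:", description)
```

The list returned by `run_runbook` doubles as a backlog: every description still in it is a candidate for the next round of automation.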

00:14:27

So this was invaluable on its own, just being able to say, hey, we're able to find all the prerequisites just by having somebody other than the person who's always been doing it do the procedure. The other thing is, you can see that the notes are slowly getting larger and larger. I had to blur out a lot of the content, but from what you see here, instead of just five lines, we have links, we have if statements right inside of the step-by-step guide, and it's just a lot more thorough. Over time, it got to a point where this was so repeatable that we automated the process. And we're so confident in the deployment of this particular software right now that we attached it to an AWS button and put it right next to the water cooler, so that anybody at any time could walk by and push this button.

00:15:27

And we would confidently release our software to production. All right. I've probably repeated this process on at least three or four different groups in Comcast now. And it's such a good feeling when you realize that the deployment is so easy, so small of a step at this point. One thing that came out of this, which was great: our VP of engineering at the time, Gustavo POS Michelle, provided this quote about that time. "What was most amazing to me was how automating our mobile back office deployments positively affected our relationship with other teams. When the product team was faced with an urgent customer need, we could make the change and deploy it to several environments quickly for review." And this is the best part of the whole quote, if you ask me: "Delivering our software became a non-event." That's how you build confidence.

00:16:23

That's how you build gravitas. If your team can put together something like this, they might just be open to other things. All right. So the second thing: once we had automated several of the deployments in software strategy and transformation, we wanted to bring over this idea of reducing toil. This means that we had a lot of very repeatable tasks that the team was taking on. To let you know what Vivek Rau of Google says about toil: toil is the kind of work tied to running a production service that tends to be manual, repetitive, and automatable. All right. So to give you an idea of how we tackled this problem of too much toil that we had at this point: you can see the red bar here represents the amount of time we spent on toil, repeatable tasks, manual, mindless tasks.

00:17:22

Now, the green box that you see on the right is our engineering work. We want to be able to spend more time on engineering and less time on toil. My job as a leader was to provide enough time for the group to work on automating some of these things. Now, as you just saw, one of the first things that we did to save up this time was automate these deployments that we did so very often. What you want to do once you've automated a task is make sure that it is available for many people to use, and add it to your automation library. All right. And then once that work becomes time that you can spend on engineering work, all of a sudden you realize that you have enough time to start automating some of the other repeatable tasks, the other toil that you have.

00:18:13

All right. And so the next thing that we could tackle at software strategy and transformation was key rotation. Now, that didn't happen often at the time; I think it happened once a month. But it was enough of a disruptor when it did happen that it took us away from a lot of the other things that we were working on. So we automated key rotation and added that to our library, which then freed us up for a bunch of other things that we could automate. All right. And slowly we brought in these other things. Automate your self-healing, instead of having to bring down and bring up new microservices on your own. Automate certificates; you know, this is for servers that might not yet be using containers that can be destroyed and re-released. We still had some that needed new certificates to be added. Patches, release notes.

00:19:07

When we automated release notes, we saved so much time on that. Testing, and then common support questions that we would get in our Slack channels: we created a Slack bot that could answer a lot of your basic questions. We started seeing so much toil reduced and so much additional time for engineering that all of a sudden these things that needed to be done became so clear to us. We were able to go and spend time creating more robust dashboards, working with other teams to figure out what they're doing right with monitoring and bringing that into our own group, alerting. All of these are things that aren't always easy to automate, but you do need to spend time on them. And we were able to basically say that, as a general thing, we spent about half of our time on engineering work and half of our time on new toil that came in.
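
To make the half-toil, half-engineering split concrete, here is a small Python sketch of how a team might measure it. This is an assumption about how such tracking could work, not a description of what the team built. The `toil_fraction` function, the `"toil"` and `"engineering"` labels, and the sample hours are all hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def toil_fraction(entries: Iterable[Tuple[str, float]]) -> float:
    """Return the share of logged hours spent on toil.

    Each entry is a (kind, hours) pair where kind is "toil" or "engineering".
    Returns 0.0 when nothing has been logged.
    """
    totals = defaultdict(float)
    for kind, hours in entries:
        totals[kind] += hours
    total = totals["toil"] + totals["engineering"]
    if total == 0:
        return 0.0
    return totals["toil"] / total


# Hypothetical week of logged work for one engineer.
week_log = [
    ("toil", 12.0),         # for example, manual deploys
    ("engineering", 20.0),  # for example, building dashboards
    ("toil", 8.0),          # for example, key rotation by hand
]

if __name__ == "__main__":
    print(f"Toil share this week: {toil_fraction(week_log):.0%}")
```

Tracking the number over several weeks shows whether each newly automated task is actually moving time from the red bar to the green box.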

00:20:05

Now, one thing that happens when you get good at this is that the green time of engineering work becomes your team's premium time. And when other groups see that your team is not busy on toil all the time, there can be instances where teams try to just dump more of their toil work on you. And this is where the next step of what we implemented comes into play. We wanted to make sure that we were selective with the work that we took on, because once you free up that time, you have a choice. You can either bring on work that actually helps make your team stronger, or you can take whatever's dumped onto you. And I love this quote by Damon Edwards, which illustrates this: if an SRE team cannot regulate its own workload, it becomes the aggrieved party.

00:21:00

It becomes the group that, no matter how bad an offering a software engineering team throws over the wall and gives to you, you have to take it, you have to support it. In software strategy and transformation, as the leader of this SRE group, I stopped that. I did not just say we'll take anything that a software engineering group gives to us. Instead, we started creating a maturity model that allowed us to determine what work we wanted to take on. So we looked at it like this: what if handing off to our SRE group was not a right, but something that needed to be earned? We started treating our SRE group as a premium service that people would want to offload their work to, but they have to do things in order to do so. A benefit is that it encourages standards.

00:21:49

It encourages teams to not be as much of a snowflake, and it allows us to get the benefits of economies of scale. You know, if the team is using Prometheus, that's something we're really strong at, so that's a checkbox. We'll take a team on as long as they're putting out good endpoints that we can scrape with Prometheus. If they're using containers, with a good containerization strategy that fits in with ours, it makes it that much easier for us to support that. If they don't, what we can do is send one of our SRE engineers to work on that team for a while and find out: is it possible to get this team to align with what we're willing to take on? If so, we do that. If not, we say you'll have to either find another team or continue to work in DevOps fashion on your own.
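
The onboarding logic described here can be sketched as a small checklist in Python. This is only an illustration under stated assumptions. The two criteria (Prometheus endpoints and a compatible container strategy) are the ones named in the talk, but the `ServiceProfile` class, the `onboarding_decision` function, the field names, and the three outcome labels are hypothetical, and a real maturity model would likely have more checkboxes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceProfile:
    """What the SRE team knows about a service that wants to be handed off."""
    name: str
    exposes_prometheus_endpoints: bool
    uses_compatible_containers: bool
    # Whether the owning team can be brought into line after an SRE embeds.
    can_align_with_standards: bool


def onboarding_decision(service: ServiceProfile) -> str:
    """Return "accept", "embed_sre", or "decline" for a handoff request."""
    meets_standards = (
        service.exposes_prometheus_endpoints
        and service.uses_compatible_containers
    )
    if meets_standards:
        return "accept"
    if service.can_align_with_standards:
        # An SRE works with the team for a while before the handoff happens.
        return "embed_sre"
    # The team keeps operating the service itself, in DevOps fashion.
    return "decline"


if __name__ == "__main__":
    candidate = ServiceProfile(
        name="example-service",  # hypothetical service name
        exposes_prometheus_endpoints=True,
        uses_compatible_containers=False,
        can_align_with_standards=True,
    )
    print(candidate.name, "->", onboarding_decision(candidate))
```

Writing the criteria down in one place, in code or on a wiki page, is what turns "earning" a handoff into something other teams can plan for.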

00:22:42

So just to recap what we went over in this presentation: building confidence in your SRE team can happen when you build your team's automation skills, or at least that's what we did. We were able to get teams to trust us more by giving them wins with the strength that we already had in automation. Second, we track down and eliminate toil. If you become a really popular SRE team, you're not going to want to spend all your time on toil and overload your engineers with that work. Instead, you want them to be able to find some joy in some engineering work, to really solve the problems that come to them. And third, we were selective about our workload. It's one of the only ways that you can make sure that you have that time for engineering work.

00:23:33

All right. So that's where our team is now, the SRE team that I lead. And the next step that we're going to take could use a lot of conversation with you people, and possibly some advice. The help that I'm looking for at this point is: who's actually doing error budgets for real out there? Who has created a good enough relationship with your software engineering team to say, you know, there are error budgets that we have in place, and if you fall below those standards, there are consequences? We've tried it in the past, and quite frankly, priorities would always win out over what we called error budgets, what we called important. So I'd love to see how you got leadership to really buy in, and how you got individual contributors to be okay with this SRE team being able to set standards of quality, to the point where they could say: stop creating new features and start working on the errors that you're creating as a group. And with that, I say thank you so much, and I hope you can build your own confidence in your SRE team. Thank you.
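
For readers unfamiliar with the arithmetic behind an error budget, here is a minimal Python sketch. It assumes a request-based SLO, which is one common way to define a budget but is not something the talk specifies. The function name `error_budget_remaining`, the 99.9% target, and the request counts are hypothetical.

```python
def error_budget_remaining(
    slo_target: float, total_requests: int, failed_requests: int
) -> float:
    """Return the fraction of the error budget still unspent.

    slo_target is the required success ratio, for example 0.999.
    A negative result means the budget has been overspent.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        raise ValueError("no requests in the window, budget is undefined")
    return 1 - failed_requests / allowed_failures


if __name__ == "__main__":
    # With a 99.9% target over 1,000,000 requests, about 1,000 may fail.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"Budget remaining: {remaining:.0%}")
    if remaining <= 0:
        print("Budget spent: pause feature work and fix reliability.")
```

The consequence the talk asks about, stopping new features, corresponds to the case where the returned value drops to zero or below. Whether leadership honors that is the organizational question the code cannot answer.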