Las Vegas 2018

Operations Reform: Tom Sawyer-ing Your Way to Operational Excellence

Stack Overflow, Inc. had 150+ services across 3 teams with very little consistency in operational hygiene. Using a simple self-assessment, teams were motivated to improve. The self-assessment is blameless, non-bureaucratic, and succeeds where KPIs fail.


Tom is an internationally recognized author, speaker, system administrator and DevOps advocate. His latest book, the 3rd edition of "The Practice of System and Network Administration" (http://the-sysadmin-book.com) launched in 2016.


He is also known for The Practice of Cloud System Administration (http://the-cloud-book.com), and Time Management for System Administrators (http://www.tomontime.com). He works in New York City at Stack Overflow (stackoverflow.com).


Previously he's worked at Google, Bell Labs / Lucent, AT&T and others. His blog is http://EverythingSysadmin.com and he tweets @YesThatTom.


Tom Limoncelli

SRE Manager, Stack Overflow, Inc.

Transcript

00:00:05

My name is Tom Limoncelli. I'm the SRE manager at stackoverflow.com. Who here has heard of stackoverflow.com? Okay. 1, 2, 3. Okay, some of you, that's good. Uh, how many of you have heard of Stack Overflow Enterprise? Oh, okay. Not as many. You get the whole Stack Overflow Q&A thing, but it's for your enterprise. You can use it for your, you know, internal technology projects. Um, more about myself, uh, I've been a system administrator for much too long. Um, I do a lot of writing. I, I blog, I tweet. I have a column on the ACM Queue, uh, website. And I've written a number of books. Uh, a little bit about this talk: as mentioned earlier with, uh, Jason Cox's, uh, unhappy face, um, this is one of the talks that's trying to fix that. Most talks at DevOps conferences tend to be on the dev side.

00:00:56

This talk, we're gonna be, uh, on the operations side, so thank you for coming. Now, uh, I speak at a lot of conferences, and usually I try to have my slides all fancy with pictures and everything, and I'm doing something a little different this time. Um, I'm just gonna keep it simple on the slides, and I'm really gonna focus on three stories. So, the first story is about a big initiative, and these are all true stories, by the way. So, first story, rewind back to 1995. I am early in my career. I'm at this big, big telecom, uh, bureaucratic company. And like a lot of companies, every quarter you have a big all hands meeting and executives stand up on stage and talk about these exciting new initiatives. And being a young, you know, engineer, I was so inspired by this talk about this big initiative.

00:01:55

And afterwards, we're walking out of the auditorium and I'm talking to my coworker who's been with the company for more than a dozen years, and I said, oh, I'm so inspired by this and, and I'm gonna do this and this, and I think our team should do this and this, and this. What do, what do you think we should do, Andrew? (Name changed to protect the innocent.) Um, and he said, I'm, I'm not gonna do anything. And I was shocked. I said, what do you mean? He said, Tom, someday you'll, you'll get to learn that these initiatives, they get announced a lot. And, you know, everyone goes and they do all sorts of work, and then the initiative goes away. And that was all just wasted. I do nothing, and I get to the same place and do a lot less work <laugh>.

00:02:43

And I was devastated. I was shocked. Here I am, I'm like 23. And I'm like, this, this can't be how technology works. And I kind of, but, but it does, it does often work that way. Um, you know, operations can say no often because, you know, executives say yes, but we say no because we're, we're under-resourced. We don't have the time, we don't have the resources. Uh, there's too much complexity and history to, to try these new things. And so the lesson that I took away from this story was, I want to make it a, a big part of my career to find non-top-down ways of motivating people. I wanna learn how to, uh, motivate things through, you know, uh, peer influence and, and anything but executives saying, this is what we're gonna do, and we're gonna do it by fiat. One of my role models in this area is Tom Sawyer, the story of painting the fence.

00:03:44

Um, he was not the executive that said paint the fence. He was the executive that said, I'm really, I'm really good at painting this fence, probably better than you. I can't let you paint the fence. And that just made everyone want to paint the fence, right? Um, IT Revolution, the people who started this conference, um, one of the books that they published recently, their short-topics books, you can download it for free. I, I was involved in this, uh, project. Um, it's called Expanding Pockets of Greatness. This is, uh, four case studies of successful DevOps transformations. And one thing you'll notice: in none of them was there a top-down edict. Um, it was all, uh, from the grassroots building up, sometimes with management support, sometimes working around management. I think they're really good lessons to learn.

00:04:38

Okay, so story number two. I was at Google from 2006 to 2013, uh, some of their biggest growth years, where we went from, you know, a search engine to, you know, a million different applications. Around 2009, the Google SRE team, which at that point was, I think, probably 50 different teams (it's now something like a hundred SRE teams), realized that their operational hygiene was a bit uneven. Like, some services were run better than others. And, um, by, by hygiene I mean the things that every, every operational team should do, right? Like, we all know we should do backups. I think if you disagree, you're, you're at the wrong conference, right? I, I shouldn't, I shouldn't have to convince you. And we, we use the term hygiene 'cause it's like brushing your teeth. We all agree we should brush our teeth every day.

00:05:36

Right? Now, there might be, you know, disagreements about which toothpaste we should use, and there might be disagreements about how to do, uh, how to best do backups, but we all agree we should do backups, right? I'm, I'm not gonna write a ROI case study to explain that we should do backups. Uh, it's hygiene, it's stuff that we should do. Um, so now, uh, I didn't invent this. The, the brilliant, uh, management at Google had this, uh, very good insight that if they said, Hey, our hygiene isn't good in these, you know, nine different areas, go fix it. Well, that would just piss people off, right? So they had a very, you know, Tom Sawyer ish, uh, way of, uh, working on this. They had every team do a self-assessment of, uh, of their services. And they came up with these different categories. Um, and they didn't build this fancy app to track all this.

00:06:33

They just did it in spreadsheets. Um, and so here's a, a mockup of what the spreadsheet kind of looked like. Uh, one is bad, five is good. Um, the columns are every month. Every, every month, each team would fill in a column. So you have, you know, December, January, February, March, um, and you had these different, uh, categories of, of hygiene. So how are we doing on regular responses? Like, think of that as transactional requests, tickets. Uh, emergency responses, you know, IR kind of stuff, um, you know, when you get paged. And monitoring, capacity planning, et cetera, et cetera. And, uh, instead of a painful, big audit that happens, and maybe, you know, at a big bureaucratic company, each team would hire a full-time person that manages this, right? That's crazy. Um, no, this was just, you know, one hour a month. Uh, teams were expected to, uh, spend one hour a month, uh, just going through this for each of their services and, um, and do a very simple basic assessment.

00:07:35

And the goal here was data gathering, not project planning, right? In fact, I don't think I ever heard a manager say, you got a low score on this. You should do a project to fix it. It was, we're just collecting data. And the engineers were inspired to fix things, right? Engineers, just, they don't wanna see bad scores. They came up with their own projects. They would see, you know, by the time this exercise, uh, would complete, they'd have a general idea of what kind of things they wanna schedule for the next month. It gave people the data they needed to do their job better. So teams could actually take all of these, they could do a roll-up to the service level. And this gives the team two things. A, where should we put some of our focus? And B, how have we been doing?

00:08:26

So you can, you know, get that good dopamine feeling from knowing that things have gotten better over time. You can see, um, service A has been, you know, kind of, uh, teetering between one and two, while service, uh, C has been slowly improving over time. It also gives management the data they need. You could do a roll-up by team and see where resources are most needed. Now, notice I said where resources are most needed, not which teams are doing badly. Because an important part of this is the assessment is judging the service, not the people. And that's how you keep it from being, uh, you, you wanna make it a blameless situation. Um, and, uh, I'll, I'll talk more about that in a second. So why did this work? Well, psychologically this works because it's so simple. It's a spreadsheet and it's only an hour a month commitment.
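The talk is explicit that this lived in a spreadsheet, not in an app, so the following is only a sketch of the roll-up arithmetic, not a suggestion to build tooling. The team names, service names, categories, and scores are all made up for illustration, and averaging the latest month is an assumption; the talk does not say how the roll-ups were computed.

```python
from statistics import mean

# Hypothetical monthly self-assessment scores (1 = bad, 5 = good).
# Key: (team, service); value: category -> one score per month.
scores = {
    ("team-x", "service-a"): {"monitoring": [1, 2, 1, 2], "capacity planning": [2, 1, 2, 1]},
    ("team-x", "service-c"): {"monitoring": [2, 3, 3, 4], "capacity planning": [1, 2, 3, 4]},
    ("team-y", "service-b"): {"monitoring": [4, 4, 5, 5], "capacity planning": [3, 3, 4, 4]},
}

def service_rollup(scores):
    """Average the latest month's category scores for each service."""
    return {
        service: mean(months[-1] for months in categories.values())
        for (_team, service), categories in scores.items()
    }

def team_rollup(scores):
    """Average the per-service roll-ups within each team."""
    per_service = service_rollup(scores)
    teams = {}
    for team, service in scores:
        teams.setdefault(team, []).append(per_service[service])
    return {team: mean(values) for team, values in teams.items()}

if __name__ == "__main__":
    print(service_rollup(scores))
    print(team_rollup(scores))
```

Read in the spirit of the talk, a low team number points at where resources are most needed, not at who is doing badly.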

00:09:26

Uh, so that makes it a very low barrier to entry. It leverages pride and ego. No one wants to see a lot of red on their chart. So people are self-motivated to fix things. It also creates good culture. It's blameless. We're assessing the service, not the people. And it's transparent. These spreadsheets were visible and the roll-ups were visible across all of the company. So you could see, uh, how you're doing in respect to other teams. And, um, and that helped create a culture of wanting to fix things instead of hiding things. Um, it also had non-monetary recognition of good work. So you wanna encourage greatness. Um, and, you know, for engineers, money is kind of motivating, but there are other things that are much more motivating. Um, so for example, if you wanna improve how your team is doing in a certain area, now you have the data.

00:10:25

You could look at what other teams are ranked better in that category and go talk to them. And what greater motivator than being the person that people come to, like, hey, how did you guys get such a high score? That, that kind of recognition is more valuable than money. Um, it also helped direct, uh, projects. So a good sysadmin or a good IT worker fixes a problem. A great one fixes a problem permanently. And a really great tech, uh, engineer builds a new paradigm that fixes or eliminates the problem companywide. Well, now instead of thinking, hmm, I think we kind of generally do bad in this particular category, say backups, instead, you have the data. You can look and you can say, oh yeah, 30% of our teams aren't doing well in, say, capacity planning. I'm going to build that new paradigm that lets us do that really well. So it lets your engineers guide their, guide their career in terms of making a bigger impact and helps them achieve that greatness.

00:11:35

Let me talk more about non-monetary recognition of good work. Um, let's say that bonuses were tied to these scores. Well, first of all, everyone would magically have high scores, right? Because it, it would encourage lying, and you don't want to do that. Also, uh, if a, if a service is struggling, no one would join that team, because reforming a struggling service could take two or three pay cycles, uh, or, uh, performance review cycles to improve. And that's basically guaranteeing crappy bonuses for two or three, uh, performance cycles. Who would join a team that is struggling? That's the opposite of what you want. You want your best people to be looking at these charts and saying, oh, that's red. I want to join that team. You want your best people to be hopping to the biggest fires, putting them out and leaving good culture there, good technology, good practices, um, and this creates a virtuous cycle or virtuous circle that encourages that kind of behavior.

00:12:41

Another reason it works so well is it seeks perfection, but doesn't require it. You should absolutely never have an initiative that's like, we wanna see all fives across the board, for many reasons. First of all, perfection is impossible. Uh, second of all, you would be wasting the company's money. That last 10% of perfection is probably more expensive than the first 90%. So if you have absolute perfection, you're probably wasting money. Now, for example, backups are super important for, like, Gmail. I mean, if Google lost people's Gmail, that's like stabbing someone in the heart, right? You've lost their personal data. There's a certain commitment there. But maybe, and I'm just making this up, like Google Finance, maybe backups aren't so important. So a four there is fine, a five for Gmail is more important. So you don't want your engineering time, uh, wasted by going for perfection.

00:13:42

Okay, so now the third story. So in 2013, I joined Stack Overflow, and in 2016, I became the manager of the SRE team. And one of the first things I wanted to do was implement this kind of self-assessment program at our company. Now, that's really difficult because in, in our case, Stack Overflow is a little bit smaller than Google. And so we had to scale it down. So for example, uh, we only had one SRE team with many responsibilities instead of many, many SRE teams. Um, we had a more granular definition of service. Um, so the way I scaled the process down was it was one spreadsheet. Um, I let the SREs create their own rubric. They were kind of intimidated by this process 'cause they were like, what's, you know, it's, it's really difficult to tell someone that they have an ugly baby, right?

00:14:39

<laugh>. And, and that's what this is about. This is a polite way of telling someone that they have an ugly baby. And the best way to do that is let them tell you that they have an ugly baby <laugh>. Um, I also, so, wow, I didn't expect that to get such a laugh. Okay, I, I'm, I'm gonna have to tweet that tonight. Um, <laugh>. So also to keep it simple, I said, let's just have a pass/fail. I, I'm, we're just looking for, like, where are the areas that we're kind of in trouble, right? Um, this made it a little less insulting, 'cause people were like, oh, no problem. I'll, I'll just mostly be pass and a couple failures where we need work. That was what they thought. What they got, when they started scoring their first stuff: they came back to me and said, Tom, pass/fail's

00:15:26

Not good enough. There are some things that it's like, fail with an asterisk <laugh>. Like, negligently fail. You know, we want management to take notice. I said, okay, we'll, we'll add, we'll add one more grade. And then they came back the next day, they said, you know, we were thinking about it, and we, we want like a, a really good pass, like: pass, and this is so good everyone should copy what we're doing. I said, okay. So, so now it's a four-point scale, but people were empowered to make their own scale. In my mind, it's still a pass/fail, but in their mind it's a four-point scale. It just works. This is what our spreadsheet looks like. Um, this is real data. I didn't doctor this. Um, I just won't tell you what month it's from. And, um, uh, and the system worked really well.

00:16:14

Uh, even though it's one team, we kind of have these sub-teams, and, um, uh, some did it together, and for some, their leader went through and did the first draft and people updated it. Uh, what else should I point out about this? Um, I had nine different categories, uh, that I wanted to self-assess on. Um, but that felt very intimidating. So for the first iteration, we just did the four categories in, in light blue, up top, um, and drilled down on that. So we did it iteratively. We started just pass/fail with these four categories. Next iteration, we added more, uh, more categories. These categories, I see a lot of people taking pictures of the slide. These categories work for us. For your company, maybe totally different categories. Do what works for you. So why did this work?

00:17:08

We kept it simple, simple, simple, simple. If you find yourself writing an app to manage this, you're gonna, please don't. I mean, you're gonna, uh, you'll still be working out the database schema by the time you could have been done already. It's also blameless. It's, as I say, assess the service, not the people. Um, it motivates people to expose their own warts. Um, and that makes people wanna fix it. People are more motivated to work on a problem when they thought of it, which is why as a manager, I never say, like, that's broken. I say, how's that doing? And they say, oh, that's really broken. Oh, would you like to fix it? Oh, let me tell you, I can't wait to fix it. Yeah, it's like a Jedi mind trick. Um, and also these problems had existed for a while, but they were invisible to management.

00:18:02

And you know, engineers tend to think that managers have ESP. Um, and I actually at one point I said, uh, wow, I'm, this is great. I can show this to, you know, our, our management and they'll see where the problems were. And they're like, oh, they know all these problems. I'm like, wow, you really believe that the executives have ESP, that's so adorable <laugh>. Um, ironically, if I did it the other way, if I did the assessment for them and handed it over, this is what I think you're doing, right? A, they'd be insulted. B, I think what they'd really come back, no one ever said this 'cause I didn't do it, but I bet they would've come back and said, oh, don't tell me what's wrong. I've been complaining about this for weeks, but no one listens. Right? Because engineers, when they mumble underneath their breath, they think that CEOs hear that. Not true.

00:18:54

It did create a new problem, which was now all those red squares were like a hundred or so new projects that they wanted to work on. And if we tried to fix everything all at once, we wouldn't have time for feature-related work. Uh, so I approached this problem in three ways. Um, one is, uh, I tried to rate-limit stuff in our monthly and quarterly planning. I said, let's limit fixing things in this area to 20% of our work hours. The second thing I did was we, we established theme months. Like, we had a theme month of backups one month. And, um, so all the different sub-parts of the team, or, you know, everyone was working on it. And that worked really well for two reasons. First of all, it helped morale. Like, people were feeling a little isolated in their work, but because they were all working on backups, it actually improved team cohesion, even if they were working on backups of totally unrelated things.

00:19:56

Um, my, my team works all remote. We're all, um, well, two of us are in the New York City office. Uh, everyone else works from their home, um, in many different time zones. And, and that feels a little isolating. But the fact that everyone, uh, was talking about backup-related things, uh, that helped morale. The third thing we did is I tried to focus on the theory of constraints. How many people are familiar with that concept? It's explained really well in The Phoenix Project. Who here has not read The Phoenix Project? Yeah. Raise your hand in shame. No, <laugh>, that's, that's a great book. Oh, sorry, that was, that was kind of not blameless of me, but, um, but it got a laugh, so I'm sticking with it. Um, so theory of constraints says if you have a process, say it's four steps, and each step can process like 10 items per week, but you have that one step, step three, that is only five, you're gonna get a backlog of work between step two and three.

00:20:53

And when you're picking projects, well, the theory of constraints says you should focus all of your energy on fixing that backlog on step three. Because if you improve the system downstream, well, that downstream is starved for work. You're just making that more efficient for no good reason. And if you make upstream steps more efficient, you're just contributing to the problem of a, of a large backlog or a large bottleneck. So, um, so we try to identify, we, we put a lot of thought into, of all these things that are red, which of them would be fixing, you know, step three, for example, what would be, uh, fixing stuff at the bottleneck? Okay, so that's the end of my three stories, but there's actually a fourth story 'cause we have time. That story is your story. I'd like to see you all go home from this conference and write your own story. Take these three stories and apply them in your organization.
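The four-step example can be checked with a few lines of arithmetic. This is a minimal sketch under stated assumptions: the `simulate` function, the arrival rate of 10 items per week, and the one-step-per-week movement rule are all invented for illustration and are not from the talk. Only the capacities (10, 10, 5, 10) come from the example.

```python
def simulate(capacities, weeks, arrivals_per_week):
    """Push work through a pipeline of steps.

    Returns (items completed, items waiting in front of each step).
    Each week, every step processes at most its capacity, and an item
    advances at most one step per week.
    """
    queues = [0] * len(capacities)
    done = 0
    for _ in range(weeks):
        queues[0] += arrivals_per_week
        # Walk the steps from last to first so that work handed to the
        # next step is not processed again in the same week.
        for i in reversed(range(len(capacities))):
            moved = min(queues[i], capacities[i])
            queues[i] -= moved
            if i + 1 < len(capacities):
                queues[i + 1] += moved
            else:
                done += moved
    return done, queues

if __name__ == "__main__":
    baseline = simulate([10, 10, 5, 10], weeks=10, arrivals_per_week=10)
    faster_downstream = simulate([10, 10, 5, 20], weeks=10, arrivals_per_week=10)
    fixed_bottleneck = simulate([10, 10, 10, 10], weeks=10, arrivals_per_week=10)
    print("baseline:         ", baseline)
    print("faster step four: ", faster_downstream)
    print("fixed step three: ", fixed_bottleneck)
```

In this toy model, doubling step four changes nothing, because step three still only hands it five items a week. Raising step three to 10 is the only change that increases the number of items completed.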

00:22:00

Shameless plug, one thing that'll help you is, uh, in a book I wrote, uh, chapter 20 is kind of the instruction manual for doing this. And appendix A is, uh, all these sample assessment questions and also what I call look-fors. Uh, if your, you know, capacity planning is at level three, here's what you should look for. You should see these things, and that's how you know you're at level three. Or here's how you know you're at level two. It's something like 40 pages of, um, look-fors. Um, don't try to implement every single damn look-for, that would be crazy, but use this as a, as a guide, as inspiration. So do this in your, uh, in your enterprise. Um, try to do it grassroots. Management should set high standards, uh, but the engineers should be figuring out how to get there. Uh, let teams create their own rubric, maybe even their own, well, I think you should create your own grading system, but be flexible. Um, start with one team, uh, and then prove that success there. And then grow. Don't do a big corporate edict that all teams must do this assessment. At Google, this was only done by some of the highest-performing teams to kind of kick the tires and get it working. And then over time, all other teams started doing it.

00:23:24

Oh, there we go. Um, yeah, let the rubric grow over time and, um, people are gonna need resources to do this. That means postponing some projects or adding people. Um, but now you're gonna have the data that's gonna help you better focus those resources. And lastly, oh, no. Use, yeah, use a spreadsheet. Don't write code for this. Um, and it's so important that you do this in a blameless way. You want to encourage blamelessness, uh, transparency. Um, when I gave this talk as a, a dress rehearsal at a local, the New York City DevOps meetup, by the way, anyone here from New York? Okay, come, come to my meetup. Um, someone said, Tom, if you're giving this talk at an enterprise conference, you know, a lot of, no one here, but, you know, other enterprises <laugh>, um, have kind of a toxic culture. And this is just setting people up with a big target on their back.

00:24:23

They're, this is, this is gonna be weaponized against them. You should have some advice in that situation. Well, if you're really in that kind of culture, I have three bits of advice. One is maybe transparency isn't for you. Maybe you need to do this on one team and, uh, not be so trans, not be externally transparent, uh, until you've proved it out. And then people, um, one of the biggest motivators is people like to copy success. And so if you can be successful in one team, let other people copy it. The second bit of advice I have is have all of your management read this book, Beyond Blame. Um, fantastic book. It's like The Phoenix Project in that it's a fictional story that you read instead of a textbook. Um, it's very readable, it's a fast read, like a hundred pages, um, all about why it's so important from an executive level to have a blameless culture. And the third thing you can do is, uh, try to change your culture. Try to change your culture, try to change your culture. And if those things fail, send out your resume <laugh>. I'm very serious about that. For decades, actually, about 20 years ago, I wrote an article about it. It was called Just Quit. And it was, it was saying, you know, try your best, but sometimes the best thing to do is send out your resume. So that's my talk. Um, we do have five minutes for questions. And, uh, thank you all for being here.