Demystifying DevOps & SRE

SRE is a large, complex topic, so we’ll start with common terminology and theory, then dive into practical examples—including lessons learned from our own journey here at Datadog.


We’ll further cover the relationship between SRE and DevOps, what success looks like (and how to measure it), and how to identify and nurture both internal and external talent in order to build a cross-functional team.


This session is presented by Datadog.

Daniel Maher

Developer Relations, Datadog

Transcript

00:00:07

Hi everybody. My name's Daniel. I'm a developer advocate on the community team at Datadog, and for the next 30 minutes or so I'm going to be talking a little bit about DevOps and SRE. Just to give you some context: I am a recovering system administrator, and I spent most of my career either building and maintaining infrastructure or coding. I can say without hyperbole that when I first discovered the DevOps movement about a decade ago now, it absolutely changed my life, as it did for so many people. And then when I encountered SRE, there again it was like a massive revolution for me and for the industry. And so I'm pleased as punch to be able to share a little bit about our journey at Datadog and my journey, and hopefully help you on your DevOps and SRE journey as well. So if you'll permit me a brief moment, I'm just going to share my screen so that we can get some slides started.

00:01:03

All right. So this is not a talk about Datadog, but it is incumbent upon me to at least explain to you what Datadog is. Datadog is an observability platform that provides full visibility across your entire organization. We span end to end, from your infrastructure and your network, your applications, and your services all the way to your end users, wherever they may be, on mobile or in a data center somewhere. This enables everyone (ops and developers, certainly, but also security people, finance people, HR, if you like, anyone in the business) to have a shared understanding of your systems and the ability to communicate about and resolve problems when they arise, and ideally before they arise. For more information about how Datadog can help your organization, or to sign up for a 14-day free trial, no credit card required, visit datadoghq.com. All right, that's the spiel out of the way. Let's get into the meat.

00:02:01

What are we going to be covering today? Essentially four things. All right. And as good a place to start as any is: what are those four things? First things first, and maybe the most important takeaway: we want to clear up some confusion about what DevOps and SRE mean. These are loaded terms, very, very heavyweight, lots of baggage, right? So I just want to clear the air. When I say DevOps, when I say SRE, when Datadog says them, what do we mean? And by the way, these aren't necessarily our definitions. They're fairly industry standard, but I do want to make sure we get through them before we continue. Then we're going to move on to talking about teams, people, humans, and how those people can be organized together to achieve success. What does that look like within the context of DevOps and SRE?

00:02:52

And of course, those teams, as I mentioned briefly, are made up of people. So where do those people come from? Who are they? How can you nurture those people, those relationships, those structures? How do you put them into place? So we're going to talk about that a little. And then finally, we're going to wrap up with some practical suggestions, hard-won lessons and truths from the trenches of my own career, as well as that of Datadog, and share some pitfalls and things to avoid on your journey going forward as well. Okay, great. So first things first, let's start right there: clearing up some confusion about DevOps and SRE.

00:03:29

A key part of effective communication is having a common language. I'm speaking English right now, and hopefully you're understanding most of what I'm saying. If I were speaking French or Portuguese, perhaps you'd understand less. In other words, words have meaning, right? And unfortunately, when it comes to the terms DevOps and SRE, the internet as a whole has played pretty fast and loose with those meanings, with those definitions. So I'd like to establish what our common language is right now. DevOps is a professional and cultural movement that focuses on openness, sharing, and mutual respect. It seeks to improve the quality of life for its adherents and practitioners, for their company, for their organization, for their customers, for anybody participating in sort of the grand dream of DevOps.

00:04:30

An interesting aspect of improving quality of life is the idea of availability and reliability. And that's where SRE comes into play: how can you ensure that systems can be trusted? How can you ensure that people have the confidence that the systems and services that are in place will be there when they need them? That's actually an important part of DevOps, in a way, and certainly a key element of SRE. And we'll start to see how these two things sort of get put together. When we talk about DevOps, we could talk about a lot of things, but I like to talk about an acronym. We're technical people, and technical people love an acronym. So how about this one: CAMS, C-A-M-S, in no particular order. Well, perhaps a particular order, since if you were to spell it S-C-A-M, that'd be a word that nobody would want to associate with their business.

00:05:26

So CAMS, again, is an acronym. The C is for culture, right? A full discussion of culture is well out of scope for this presentation, but let's talk about it in terms of the way that people choose to organize, and the social contract that they put in place with one another. Then there's A, automation. This is a big one, and it's the one that I think is in many ways easiest for technical people and technical organizations to sort of get behind. But it doesn't necessarily just mean writing Bash scripts, although that could certainly be a part of it. When we talk about automation, we're talking about unlocking human potential. We're talking about allowing the computer to do what the computer does best and allowing the humans to do what the humans do best. Repetitive tasks, things that could be scripted, things that would be better if they were scripted, should be done by computers.

00:06:22

One, because it's more efficient, but two, because it unlocks the person who was formerly doing it and allows them to express their full creative potential within the organization. And that's powerful. That's powerful. M is measurement, sometimes metrics. I prefer measurement because there are a lot of things to measure that aren't metrics. The idea here is not just measuring, but knowing what to measure, knowing how to measure it, and, critically, knowing how to interpret what's being measured. Super, super important. When we talk about DevOps, it's about measuring, and doing it well. And finally S, sharing, right? It's all well and good to have this excellent culture and automate things and measure all the things you've automated, but unless you're sharing that information, helping people around you to also improve their quality of life through DevOps practices, then you're not really succeeding. As a sidebar, you may sometimes have seen an L kind of crammed in here: CALMS.

00:07:26

CALMS, right? The L stands for lean. I have no particular opinion on whether there should be an L in here or not, but sometimes you'll see one in here. So there you go. When we talk about SRE, then, we can't really talk about it without mentioning the elephant in the room, or perhaps the big lizard, I guess, in the room. And it's this book, the Site Reliability Engineering book, and a handbook that was released along with it. Not long ago, this came out in 2016, and it's a tome of how Google runs their systems, or at least how a part of Google ran some of their systems up to 2016. Now, I mention this because, again, it's basically impossible to talk about SRE without talking about this book, because it's hugely influential in how we understand and design SRE programs and environments today. But I think it's important also to note that this is just one interpretation. It's an important and good one, but it's one interpretation. And how Google, or a part of Google, did something in 2016 is not necessarily how you should do something today, although it's a good starting point.

00:08:45

It could be said that DevOps is an idea and SRE is an implementation. I said "it could be said" because not everybody agrees with it, but it's an interesting framework for thinking about it. I'm not going to talk about this book anymore, but if you haven't read it, I suggest you at least take a look at it. So when we talk about DevOps, we're talking about ideas, okay? That's kind of the key element that I'd like to get across there. And when we talk about SRE, we're talking about practicalities, we're talking about implementations, we're talking about actually doing something, right? And so if we have a philosophy and a practicality, we have DevOps and we have SRE. Fair enough. All right. So we've cleared up some confusion about DevOps and SRE. We're going to move on to topic number two: team and organizational structure. In order for DevOps to succeed, in order for SRE to be effective as a practice, you need to consider how your organization is structured.

00:09:53

This is actually super important. You can't just declare DevOps victory, name some team the SRE team, and say "we're done," right? That's not how it works. You can't just rename something and have it become something else. Anyway, let's talk a little bit about teams and organizational structure. At Datadog, as well as in many organizations, we can think about organizing people in sort of three different ways that are pertinent to SRE and DevOps in particular, and that's product teams, squads, and guilds. So starting from the top, we'll talk about product teams. Teams are big, right? Sometimes really big, and every single role and function and responsibility that needs to be addressed is accounted for by at least one person on that team. If you can actually reasonably state these things and have them be true, then you may very well have a product team already in your environment.

00:10:58

It's self-sustaining, right? It's autonomous. Everything that that particular vertical needs to survive and function and succeed is actually contained within a single team. An interesting element of the product team model that applies in particular to DevOps is that it's how DevOps scales. There are a lot of questions around, okay, well, DevOps works in startups or in really small companies, but how can DevOps work in a really big company? Product teams are one of the ways that DevOps scales to large companies, to enterprise companies, right? So when we talk about DevOps in the enterprise, we are also very likely talking about product teams. But, and this is a big but, the product team is not the end-all, be-all, right? The product team is just one aspect of that organization. It's a big one, both in terms of importance and size, but sometimes you don't want to be talking about something as massive as a product team.

00:12:02

It's just ungainly, right? Sometimes you have a little problem, or even a big problem, that you don't need a hundred or 500 people, or however many people, to be targeting. And that's when we talk about something a little bit smaller. This is where we talk about squads. Squads are short-term groups that are organized to solve a single goal or problem, or accomplish sort of a single unit of work that wouldn't fit neatly onto a team. Right? A team's scope is vast and large, which makes sense. The team itself could be vast and large, right? The squad, again, might focus on a single thing, like a single OKR or one particular intractable problem. Ideally that single OKR or intractable problem would benefit the team, right? If it doesn't benefit the team, why are you doing it? Ideally, though, it's going to benefit a large number of teams, because even across a large and varied organization, there are going to be commonalities between different product teams, between different product groups.

00:13:13

And so that means there are going to be commonalities in terms of problems and challenges across those groups. Assembling a squad to really focus in on a specific challenge, problem, or outcome is a great way to benefit across teams in a product team structure. I mentioned that they could be focused, for example, around an OKR. To put it in more concrete terms, at Datadog we've had squads organized around such things as recruiting, right? How do we build good coding tests, for example? Analytics, right? How do we actually help our customers to better understand this particular data science issue? We've even formed squads around hackathons: ideas for hackathons, how we want to do them, what sorts of cool outcomes we want to get from them. The key element here is that squads are short term. They have a defined beginning and they have a defined end.

00:14:19

And that defined end is not only goal oriented, it's probably also time-boxed, right? You don't want a squad that just goes on forever, because that probably means that the scope creeped or that the goal you defined was just the wrong one. So, a small sort of tiger team centered around a very specific goal or outcome. If there's a definition of done that's easy to explain, and it has a time box around it, those are some good ways of figuring out, okay, what's a squad, and did we make a good one? So we talked about the very, very large, right? That's the product team. We talked about the very, very small. That's the squad. Is there something in the middle there? Of course the answer is yes, right? It's a leading question, and that's guilds. So what's a guild? A guild owns and shepherds an important part of the organization, of the architecture, of the culture, that crosses many teams. Guilds are larger than a squad, but generally smaller than a product team, for values of "product team."

00:15:30

They're semi-permanent versions of a squad, maybe, as a way of thinking about it. I mentioned that squads have time boxes and fixed outcomes. Guilds can have time boxes and fixed outcomes as well, but there is an opportunity to make those scopes larger, right? They can exist for longer. They can try to accomplish a series of goals, for example. And I mentioned that they own and shepherd an important part of the organization, right? They cross product teams. They could be formed of a series of squads from a series of product teams, for example, or could operate independently. They work on things like culture. They work on things like standards around automation. They work on things like: how do we interpret these measurements, and how can we best share the results of our findings across the organization? You're going to have a lot of different stakeholders here.

00:16:29

You're going to have a lot of different roles and responsibilities represented within a guild, because you need to have that plurality, that diversity, in order to properly assess the success and suitability of a guild's outcomes for the organization. So we have the teams, right, responsible for the product. We have the squad, responsible for a little problem. And then we have the guild, responsible for kind of the stuff that's in between. That's how we've organized it at Datadog with, frankly, quite a good degree of success, and how we've seen a lot of our customers do it as well. And this has worked particularly nicely in large-scale and enterprise environments.

00:17:12

So you might stand up an SRE team, right? And this is one way that SRE is distinct from DevOps. You may have a specific role or a dedicated team called SRE in a way that you would not have a specific role called a DevOps or a specific team called a DevOps team. That said, there are different ways that this might take form. Think about the cohort of site reliability engineers, so a group of humans, right? Organized and managed as a group, they're going to form an SRE team. So you could have a site reliability engineer on an SRE team, in a way that you couldn't have a DevOps on a DevOps team. That just doesn't make sense, right? SRE teams are versatile. This is one of the key aspects of a good SRE team.

00:18:03

They can participate in a variety of things: code reviews, right? Incident reports. They can help facilitate post-mortems. SRE teams can be involved sort of in every aspect of the life cycle of a product or application, right from whiteboarding the idea to retiring it in five years. SRE teams could support a dedicated portfolio of products and services, or a functionality within the organization. They could exist solely to support a particular product team or series of product teams, for example, and any one individual SRE could rotate through these functions. You could have an SRE, for example, who is primarily assigned to a specific product team, but maybe they're just getting bored, or maybe their skills and talents could be used elsewhere. Don't worry about, you know, what the hierarchy looks like, right? SRE teams are versatile. Move them around. Figure out what works.

00:19:04

This is one of the ways that, you know, you can feel that an SRE team is being successful: if there is this concept of mobility and versatility, right? So we talk about DevOps as an idea, and again, you can't have "a DevOps." That doesn't exist. We talk about SRE as a practice because you can have an SRE. You don't organize DevOps into roles in the way that you organize SREs into roles, right? So we've talked about what these words mean. We've talked a little bit about teams and organizational structures, how these things can be organized. When we talk about people, we start to get into the real nitty-gritty of it. Where do these people come from? How can we, not only as an industry, but as any one particular organization, find and grow and nurture the talent necessary to successfully implement DevOps principles and, more concretely, successfully build and see the longevity and power of a good SRE team?

00:20:18

So we'll start at the start. SREs are people, right? They're human beings. DevOps is not a person. An SRE, that's a person, right? And people have personalities, sometimes very strong personalities. And these strong personalities can come out in a variety of different ways. Anybody could potentially do anything, but some people are happier doing some things than others. I, for example, really like being on stage. I really like getting up in front of people and waving my hands around and getting excited, right? Not everybody likes doing that, although of course anybody could. It's the same with SREs. People who are successful in the role of SRE tend to have certain personality attributes. This is not an exhaustive list, nor is it an exclusive list, but from what we as an organization have seen within our walls, and outside among our customers as well, these are some common personality traits among successful SREs in the industry.

00:21:25

The first one is, well, just patience. Patience for staring at code, patience for staring at infrastructure, patience for being able to analyze just absolute messes of data for long periods of time. Patience is a critical aspect of any one successful SRE and any successful SRE program. You have to be willing sometimes to wait things out, and that's not for everybody, right? I'm personally not a particularly patient person. It's something I'm working on, but patience is a big part of being successful and ultimately happy in this role. And you want your employees to be happy, right? That's super, super important. Most SREs really enjoy problem solving. And I know that sounds sort of banal. Show me a job requirement list that doesn't include problem solving on it somewhere, right? Yeah, of course. So here, when I say that most SREs enjoy problem solving,

00:22:23

I don't just mean in the sense of, you know, doing crosswords or something like this. The trick about being an SRE is you're often working on systems that were not necessarily designed by you, were not necessarily implemented by you (although of course this could be true), systems that were not necessarily programmed by you, software that you've maybe never seen before. And so that aspect of diving into the unknown, that aspect of working with software and systems and processes and people that you don't necessarily have any control over, has to be enjoyable. And again, it's not for everybody, but for the people who do enjoy it, it's just aces. So this is another personality aspect we see oftentimes in successful SREs. The capacity for self-teaching and self-learning is super important. You cannot currently get a degree in SRE in the way that you can get a computer science degree or a computer engineering degree, or, you know, a mechanical engineering degree, right?

00:23:27

You can't even get a diploma in SRE. So for the time being, this is something you have to want to teach yourself, and want to learn from others who are themselves self-taught to a large degree. And again, having not only the capacity but the desire there is super important. Most SREs have a wide range of technical interests. I would say that most technical people do, but in the SRE world, that's super, super critical. And again, it comes down to having to deal with an incredibly diverse array of things getting thrown at you sometimes, right? And that has to be enjoyable. Most SREs have found success not only with the technical aspect, but also with the human aspect of things. A big part of being an SRE is communicating, is teamwork, right? So good SREs are good at communicating in different ways: in text, to peers, to CTOs, to the outside community.

00:24:30

Right. Good communication skills: super important. I mentioned that they work on teams. SREs need to be teammates, but also team builders, building up that mutual trust and that confidence between their roles and their functions and the roles and functions of other people within the organization. And finally, as I mentioned, the world of SRE is something you have to teach yourself and be taught by people who taught themselves. And so a big part of SRE is that mentoring and teaching aspect. Also super critical. SREs have backgrounds, right? All SREs have backgrounds. Just to give you an idea of some of those backgrounds at Datadog: certainly traditional ops and dev backgrounds, but we've also hired into the SRE program from our own customer success agents. We have at least one person on the SRE team who has a PhD in computer science, and at least one who dropped out of high school.

00:25:24

So again, the backgrounds there are varied. Don't fixate on where these people came from. Worry about the who. Concentrate more on their personalities and their capacities and their desires, and what's enjoyable for them. That's how you're going to identify that good SRE talent. In other words, great SRE talent can come from anywhere. So, the last thing that I want to touch on today: just some practical suggestions about how to implement DevOps practices, and in particular SRE, within your own organization, and some pitfalls that we've seen, not only at Datadog but also outside in the industry as well. So first things first: is SRE a standalone team? Is it an embedded team? Is it something else? Well, in the words of my esteemed colleague Waldo, I hate to give an "it depends," but it depends, mainly on the talent and personalities that you have access to.

00:26:23

Right? The important thing here is that you're willing to try some different modes. You're willing to try some different ways of looking at it, and you're willing to run those experiments to see what works within your own organization. This is a super important part of not only DevOps, but also SRE. If you're building it, especially from scratch, you have to be willing to experiment, and you have to be willing to go, okay, this failed, but that's okay, because we learned, and we're going to try again. And this is where that CAMS acronym really comes into play, right? When we're talking about developing SRE, as much as I'd like to give you a definitive right way, it just doesn't exist. You have to be flexible. I have a quote here I'd like to share with you: "Archeology is the search for fact, not truth.

00:27:09

If it's truth you're looking for, Dr. Tyree's philosophy class is right down the hall." That's a great Indiana Jones quote, but to take it into the context of DevOps and SRE, we might say that SRE is the search for fact, not truth. If it's truth you're looking for, Gene Kim's DevOps class is right down the hall, right? Pitfalls, pitfalls. This could be an hour-long presentation on its own. We've run into problems, and we've seen other people run into problems as well. And so what I'd like to do is just share with you some common things to avoid. I would say the number one thing that you're going to want to avoid is dogma. Don't look at how another organization has implemented these things as the rigid, engraved-into-stone-tablets way of how you should be doing it.

00:28:08

You need to figure out what works for your organization, for your reality, with the talent you have today, right? Don't fall into the trap of dogma. Look to others for suggestions, come out to the conferences, go to the meetups, talk with people, read the blog posts, bring all that information together, and figure out how DevOps and SRE could look and work within your organization. And be willing to run those experiments. On the topic of dogma, you'll see a lot of chatter about how, oh, if you're doing SRE, you need to be doing it in Go, or, oh, Rust is the one true path. Again, no, right? Look at what's working within your organization, what programming languages and competencies people are comfortable with. You're also going to be bombarded with information from consultants who want to sell you a DevOps.

00:28:58

No one can do that, right? It's impossible to purchase. You have to build it yourself. And it's a long, very long path that never really ends. It's ongoing. And that's just something that you have to accept and embrace, because that journey is empowering. All right. So what did we talk about today? We talked about DevOps and SRE: what do those terms mean? We talked about team and organizational structure. We talked about how to find and nurture people within those structures. And I gave you some tips and some things to avoid. I hope you found this enjoyable, or at least informative. I had a lot of fun getting this all prepared for you, and I would like to thank DevOps Enterprise Summit for giving me the opportunity to share with you here today. Normally there's a Q&A that's going to be happening. If not, you can see my Twitter handle over there in the corner, and you could have seen it on most of the slides: @phrawzty, P-H-R-A-W-Z-T-Y. Feel free to hit me up. That's it for me. Thank you very much.