Las Vegas 2020

The Ops in DevOps is a VERB. It is not a Noun!

As the #1 insurer of cars and homes in the United States, State Farm® has embarked on a journey to fundamentally change the way teams deliver software through DevOps. State Farm has reshaped the way teams work and interact from the adoption of DevOps practices and behaviors, to the realignment into empowered product teams. With over 6000 IT professionals, a transformation of this magnitude is going to have a few bumps and bruises. What happens when you focus heavily on Development and allow Operations to be treated like a noun instead of a verb? Operations is not a place or group of people that do something. Operations is an action that teams take on their products to remain stable and successful.

This session provides attendees an in-depth look at the State Farm journey to raise awareness of DevOps within our leadership and to revamp our approach to Ops. We did this by improving the observability, tracing, and monitoring capabilities of our platforms, adopting a site reliability engineer (SRE) mindset, and bringing a lot of enthusiasm to encourage and accelerate adoption of DevOps by product teams. We will share how we used the SRE role and focused on improving the operations capabilities available to product teams to take our DevOps journey to the next level. We will also share how we helped our leadership understand how important DevOps is to being successful.

Jeremy Castle

Engineering Director, State Farm Insurance

Andy Hinegardner

Technology Manager, State Farm

Transcript

00:00:13

Hi, Andy.

00:00:14

Hey Jeremy, how's it going?

00:00:15

Not too bad. How are you doing?

00:00:16

I'm doing pretty good, but I could be doing a little better. I could be in Vegas.

00:00:19

Yeah, we could be in Vegas. So, um, we're here to talk about State Farm's DevOps journey and how we kind of focused a little bit more on operations and some of the wins we got out of that.

00:00:28

All right, let's go to the slides.

00:00:43

All right. So the name of our presentation is The Ops in DevOps is a verb. It's not a noun. So next slide, Andy. Um, so State Farm is a very large insurance company. Um, we're the 36th company in the Fortune 500 for 2019. Um, number one auto insurer since 1942 and homeowner, um, insurer since 1964. Um, we offer about a hundred different products. We have nearly 19,000 agents and almost 60,000 employees. Um, you know, so we're a very, very large, complex, uh, financial institution that offers insurance. So next slide. Um, to give you a little bit of background about State Farm's enterprise technology department: we have nearly 6,800 different, uh, employees. Out of that, almost 3,000 of those are actually software and infrastructure developers. We have hundreds of different technologies, um, Java, mainframe. We've started, you know, building a lot more things with Golang. Um, so just many, many different technologies. Um, over 2,000 different web applications across many different platforms, whether it's, um, PCF-based, um, or public cloud, and that's spread across 15 different business areas and, um, 1,200 product teams. So very large, complex systems that all have to interact and talk to each other every day. Um, a little bit of background on me. Um, I'm an engineering director within State Farm. Um, I work on horizontal enablement for enterprise technology, uh, mainly focused on the application development life cycle. Um, and I own the majority of our developer tools, um, and just the developer experience in general. Um, and really I promote a lot of our CI/CD practices and tools.

00:02:25

Yeah. And I'm a technology manager. I'm in a newly established resiliency engineering area. Uh, I see myself as an SRE ambassador. Um, so my primary responsibilities are enabling enterprise site reliability engineering practices, uh, for our environment, and also being an SRE evangelist. Yeah.

00:02:47

So I'm going to do a little recap from the talk that me and Kevin O'Dell gave last year. Um, I'd say State Farm's DevOps journey really started in 2015, and it really started with this value stream map. So this value stream map, we actually hung it up in my boss's office, and it's about 16, 17 feet long. And what we did is we used value stream mapping to really kind of understand our development process. So State Farm really took a heavy interest in how do I get code from my workstation out to production, um, as quick as possible. But I had to start with understanding that process, and it really started with a challenge from my boss on day one of, you know, my job: I need you to figure out how to go from my workstation to production in less than 24 hours. So that was kind of our north star. Um, just to kind of recap some fun little stats, um, you know, in 2015, what we found using value stream mapping was that it took about 1,500 hours to get, uh, an application out to production.

00:03:45

And some of that was actual time, a big chunk of it was wait time. Um, but you know, we found almost 150 steps, 35 different handoffs, and many different specialized roles just to get one production activation. And, um, you know, that can just really slow you down. Right? So, um, you know, using the value stream mapping and focusing on the dev side, we were able to take production deliveries from two, three weeks down to two, three hours. And actually now we get them down to minutes. So, you know, as we kind of started factoring in the number of changes we put out per year, we estimated we were able to go from 175,000 days to only 150 days of time. Um, so that's significant savings, right? Not all of that was, you know, people time. A lot of it was waiting, but, um, it really did help us hone in and make some good decisions, um, some good improvements in our processes.

00:04:36

Absolutely. Yeah. Uh, you know, another thing we did too is, you know, State Farm did that journey where we went from projects to products, um, really took in the agile mindset, um, really changed how we operated, and, you know, there were some significant improvements. Um, it also helped us work closer with the business. So, you know, hey, we focused a lot on dev. We focused a lot on products, right? So, you know, with that, you know, everyone gets DevOps, right? Andy gets DevOps, I get DevOps, whoever gets DevOps, you know, everything's perfect. You've read it in a book. Um, we're doing everything great. We get everything out the door quicker. Um, but if you kind of step back, you want to go next slide, Andy. Um, you know, if you think about, you know, day one and day two: day one, everyone's super happy.

00:05:21

Um, you know, we spent a lot of time removing friction from our teams. Um, we've focused really heavily on DevOps lately, or GitOps. Um, so in the last year we've made a major investment in GitOps, and that's really made our developer community much happier, much quicker. They go to one set of tools. Um, you're defining your infrastructure as code and you're placing it with your source code, and there are a lot of big wins, uh, out of GitOps. Um, we've also spent a lot of time on platform enablement. So things like investment in, um, cloud-based platforms, public cloud, um, heavily baking CI/CD and compliance into these platforms where we can. That really helps remove a lot of friction, right? So that's something else, you know, product teams don't have to worry about. Um, I mentioned pipelines. Um, what we've tried to do is tie all our strategic platforms to pipelines.

00:06:11

That's the only way you can deliver to production. Um, so as we started enabling GitOps and some of our PaaS work, um, on any strategic platform within State Farm, you're going to have to use GitOps or some type of pipeline to get out to production. So that helps us, you know, adopt the CI/CD mindset. And, um, last on that, we spent a lot of time as a company on developer experience, um, trying to understand, um, where we can improve their day-to-day life, whether it's better machines, um, better monitors, um, making it easier to, you know, remove as much friction as possible. I'll say that repeatedly, but that was honestly kind of our north star on how to get better. Um, so yeah, great. Day one, everything's awesome. Well, you kind of get to day two, and I'm pushing change out.

00:06:57

And are we really focusing on the right stuff in operations? Um, you know, a lot of times within the company, you know, you'd talk to folks and they would say, well, operations, that's a team. That's, you know, that's an area that I go to. I don't really have to talk to them. Right? Um, so as we made some of these, um, transitions from project to product and merged these teams and tried to have, you know, multi-skill-set teams, um, people started, you know, they'd ask questions. Well, I have to monitor my application in production? We didn't prepare people, um, all the time to be responsible for the end-to-end life cycle of products. And, um, we really tried to raise some education on how do I monitor my applications? What's observability? These are common things that we didn't prepare people for.

00:07:44

As we started thinking about our DevOps journey, we needed to do a better job of it. Um, so what are some ways we did that? So State Farm, I think, has a pretty interesting story behind that. Um, we're a large company. I think we have almost 800 leaders, um, you know, just within our enterprise technology, and that's from executive down to first line. Um, we needed to get our leaders excited about DevOps. A lot of, you know, you'd hear this a lot: well, that's just something, you know, that's a State Farm thing. And it's not. It's a, you know, industry-wide movement really. I think there are a lot of compelling and awesome stories behind what's going on in DevOps, and we tried to figure out how to bring that into State Farm. Another thing we wanted to really look at was site reliability engineering and how we could bring that into the day-to-day things, to bring, uh, a coding, um, a technical automation mindset into things that happen in production. You know, not just handling tickets and closing out manual tasks. How can site reliability, uh, site reliability engineering make huge improvements?

00:08:46

Um, so I'm going to cover the, you know, get leaders excited part. Um, you know, that's a picture of me talking to Kevin O'Dell, reading The Unicorn Project. Um, and it's kind of a funny video, but I'll give Kevin a ton of credit. Um, to try to get our folks excited, we started holding these big events, um, within State Farm. So me and Kevin spoke at, um, All Day DevOps. Um, uh, we had other people speak there, and we made this huge viewing party. So we tried to get everyone excited within our department to watch these videos and be engaged and learn more about it. And then Kevin spent a lot of time on this. Um, earlier this year, right before COVID hit, we had like five, six industry speakers lined up. Um, we were gonna have everyone come in.

00:09:33

And we had an all-day conference planned. State Farm is making a huge investment in their leaders to learn more about DevOps. And, um, it was an awesome event, but we had to move it. Um, COVID hit and we ended up doing it all virtual. Um, in August we got industry speakers to come in, and it was pretty amazing. We got a lot of, um, positive feedback. And we started building these DevOps cohorts, um, within State Farm. And that was like, you know, people are excited. Let's talk about DevOps. Let's read The Unicorn Project. That's the other thing we did. We got every single leader a copy of The Unicorn Project. We're starting to do book clubs around it, um, talking about how DevOps can be applied within our products at State Farm. So you're just getting a groundswell of excitement.

00:10:16

And then the other thing, you know, State Farm, um, you know, we're starting to ask people to go out and actually, you know, talk through these different items, talk at these types of conferences, get people excited about them. Because it shows that, you know, we are doing things in the industry, and that gets people excited, you know, back on the product teams. Um, I'm gonna go next slide. And so, you know, I've talked to you about how the leaders got excited, but Andy, about a year ago, was charged with, like, let's figure out how to get SRE into State Farm. And he's going to kind of tell you some of the cool stuff we've been doing in that space.

00:10:51

Yeah. So thanks, Jeremy. Um, yeah, so that's really what I've been tasked with: helping to really start up, uh, SRE practices within State Farm. And, you know, we are new at this. Like Jeremy said, um, I haven't been in this position, uh, that long. Um, we've built, um, a couple of teams, and we're in the process of building another. Um, but really, when I started to look at all of this, none of what we're really talking about from an SRE perspective is necessarily new. So, um, I thought I'd spend some time today kind of explaining how we got to where we are and then some of the steps that, uh, we've taken so far. So the big reason that we looked at, uh, site reliability engineering is because we were seeing some, uh, increasing customer impact. So like what Jeremy said, like, hey, we can get code from your desktop to production in, you know, two minutes.

00:11:48

And it's awesome. Right? But if you don't really know that the application is running, how it's behaving, is it healthy, is it dead, is it sick, um, then, you know, you're not really engaging in the full cycle of what this means. So we missed some availability targets. Um, we had our agency force that was seeing some higher, um, recovery times for some of the tools that they use. And that really got us to think, like, all right, we've got to do something about this. We can deploy code all day, but we need to look at the ops side of the house as well. So what we did was we decided to start with a small team with a startup mentality, and we wanted to fill that team with folks who had broad skill sets across the organization. Um, and we wanted to focus on our critical applications.

00:12:42

So our customer-facing applications, financially significant applications, the apps that generate money for the business, um, that was kind of what we started with. Uh, what does that mean? Right? Like, there's a lot of book reading, a lot of videos, uh, to really start up SRE. Um, and one of the biggest things within that is really the culture change associated with it. So, you know, we're all in IT, and things are going to break. Um, I think years ago it used to be, you know, we can't tolerate anything not functioning. Um, but I think what we have now is a realization that things are going to break. It's just how fast do you recover from them? Or what are the things that you can put in place, um, via automation, things like that, for recovery? Uh, the other thing we really saw in our organization was balancing features with the operations side.

00:13:38

So on the dev side of the house, you know, the business is saying, go, go, go. I want all this functionality out in production. Um, but there needs to be that balance, right? So it can't just be push the code and forget about it. Um, and that's one of the things we were seeing as well. A big thing too is just, like, automating toil, or automating, um, those tasks that are repetitive within your team or your environment. Um, that's a big one. And I think that kind of ties back in with balancing the features with the ops side. Uh, you need to give teams time to actually work on some of that toil so that they can just get it off their plate. And then another big one we found is the concept of a blameless postmortem, right? So when events happen, what does that mean afterwards?

00:14:24

So, you know, you pull a bunch of people together. Traditionally you get on a bridge call and, you know, everybody is working 24/7 to get something fixed. And then at the end, um, you figure out what the root cause was. You drag that person in and say, uh, don't do it again, you know, or else. And that really kind of promotes a negativity within the operations side of the house. So, you know, the concept behind blameless postmortems is that we have issues, we're in IT, things are going to break. Let's all get together and figure out what they are. And then let's ask questions like, hey, why was that allowed to happen? Like, can we do things from a systems perspective to not allow people to do that, for example, if it was, um, human error? Um, so that's kind of where we started. Some major things that we found, um, are around dependencies, observability, health, and traceability.

00:15:17

So dependencies: like, do you know what is using your application or your infrastructure, and do you know, you know, what impacts you may have on someone else or they may have on you? I know it seems simple, but, um, when you get into a large organization and pretty complex, um, you know, applications and infrastructure setups, that isn't as clear as it could be. So that's one target area we had. So the other was observability, and that's basically, like, is anything weird happening within the systems? And are you monitoring? Do you know that those things are happening? Are you getting notified when, you know, you see traffic spikes or things like that? Um, and then another big one was the health. So is your app or infrastructure healthy, sick, or dead? And I know I mentioned that before, but that was really a big one. When we moved from project to product teams, um, that product team kind of owns everything, you know, uh, start to finish on it.

00:16:14

And we were really seeing that folks weren't necessarily putting the instrumentation in place to understand the health of the system. And, um, that led us into also, like, the traceability side of the house. So, you know, if your system's okay, like, hey, all my screens are green. Awesome. Right? But if you're affecting your customers, right, because you're just a small part of their customer journey, then you need to really understand, you know, what that customer is doing via all the systems and know that those systems are acting how they should and acting healthy.

00:16:53

Let's see. Um, and that's really where we got to, like, knowing is half the battle, right? So we ended up instrumenting, um, what we call LATTS, which stands for latency, availability, tickets, traffic, and saturation. This was just to establish a baseline on some of those critical business applications, because we didn't necessarily have a good baseline where, um, dependent teams could talk to one another, um, and be talking about the same, uh, the same thing. So we very, uh, simply stood up, uh, a Prometheus instance. We use Promregator to scrape these metrics, um, on our, uh, PCF platform. And that gave us a good baseline and actually let these application owners, in this case, uh, really tell the dependent services that they're hitting how healthy they were, and see issues when there were any. Um, another thing, if you've looked at anything SRE-related, uh, the big thing is around SLOs and SLIs.
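To make the five signals named above concrete (latency, availability, tickets, traffic, saturation), here is a minimal Python sketch of how one baseline snapshot could be computed for a single application. It is not State Farm's implementation, which per the talk relies on Prometheus scraping metrics from PCF. The names `Request`, `LattsSnapshot`, and `latts_snapshot`, the choice of a 95th-percentile latency, and the capacity inputs are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took
    ok: bool            # True if the request succeeded


@dataclass
class LattsSnapshot:
    latency_p95_ms: float  # 95th-percentile request latency
    availability: float    # fraction of requests that succeeded
    tickets: int           # open incident tickets for the application
    traffic_rps: float     # requests per second over the window
    saturation: float      # fraction of capacity currently in use


def latts_snapshot(requests, window_s, open_tickets, used_capacity, total_capacity):
    """Summarize one observation window into the five baseline signals."""
    if not requests:
        raise ValueError("need at least one request in the window")
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
    rank = math.ceil(0.95 * len(durations))
    p95 = durations[rank - 1]
    good = sum(1 for r in requests if r.ok)
    return LattsSnapshot(
        latency_p95_ms=p95,
        availability=good / len(requests),
        tickets=open_tickets,
        traffic_rps=len(requests) / window_s,
        saturation=used_capacity / total_capacity,
    )


# Hypothetical sample data, not real measurements.
sample = [Request(100, True), Request(110, True), Request(120, True), Request(900, False)]
snapshot = latts_snapshot(sample, window_s=2.0, open_tickets=1,
                          used_capacity=3, total_capacity=4)
```

Keeping all five values in one record is what lets dependent teams compare like with like, which is the point of the baseline described in the talk.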

00:17:53

So you have to establish targets for availability of your systems, and then also measure that. So, you know, here at State Farm, we're probably, uh, you know, we could have done a better job in that space. We have SLOs and those things. Um, sometimes I think they just kind of got picked out of the sky and applied to things, and there wasn't a really good measurement. Um, some areas were doing it better than others, uh, but that's a big focus when we go engage with these teams, to say, like, hey, what's your goal for availability? And then are you measuring and knowing that you meet that? So on top of that too, that led us into kind of thresholds and alerting. So, hey, you found something, um, you need to let other folks know and then take action. So, um, I'll do a little more on that in just a minute here.
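The relationship between a target, a measurement, and an alert threshold can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the 99.9% target, and the 25% budget floor are assumptions, not figures from the talk.

```python
def availability_sli(good_events, total_events):
    """Measured availability: the fraction of events that were good."""
    if total_events == 0:
        return 1.0  # no traffic means nothing failed
    return good_events / total_events


def error_budget_remaining(sli, slo_target):
    """Fraction of the error budget left: 1.0 is untouched, 0.0 is spent, below 0 is overspent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target  # how much unreliability the target permits
    actual_failure = 1.0 - sli          # how much unreliability was measured
    return 1.0 - actual_failure / allowed_failure


def should_alert(sli, slo_target, budget_floor=0.25):
    """Alert once less than budget_floor of the error budget remains (hypothetical threshold)."""
    return error_budget_remaining(sli, slo_target) < budget_floor


# Hypothetical numbers: a 99.9% availability target and two measurement windows.
healthy_sli = availability_sli(9995, 10000)   # 99.95% measured
degraded_sli = availability_sli(9991, 10000)  # 99.91% measured
healthy_alert = should_alert(healthy_sli, 0.999)
degraded_alert = should_alert(degraded_sli, 0.999)
```

With these sample numbers, the first window has used about half of its budget and stays quiet, while the second has used about 90% and would notify the team. A target that is never measured this way is the "picked out of the sky" situation the talk warns about.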

00:18:41

So the big thing we were focusing on in our early journey of SRE at State Farm is really reducing the MTTs, or the mean-time-tos. So there are lots of them, and this isn't all of them: detect, identify, notify, repair, and then, uh, between failures. You need to be able to measure some of these things and know what's going on with your systems, and then be able to trend that over time. So I mentioned our, uh, LATTS monitoring solution earlier. Um, we were working with a customer-facing team. We implemented that solution and, um, within 24 hours, uh, we alerted on an event that was happening in that system that they normally didn't have visibility into, um, which was really cool. The teams were able to recover from that in about, uh, 54 minutes, which was a big improvement, where we would sometimes see hours, um, for recovery times.
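A few of the mean-time-to measures listed above can be computed directly from incident timestamps. The following Python sketch assumes a hypothetical `Incident` record with three timestamps, and it takes "repair" to mean the time from detection to restoration. Definitions vary between organizations, so treat both the record shape and the definitions as assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    started: datetime   # when the failure actually began
    detected: datetime  # when monitoring or a person noticed it
    resolved: datetime  # when service was restored


def _mean_minutes(deltas):
    """Average a list of timedeltas and express the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def mean_time_to_detect(incidents):
    return _mean_minutes([i.detected - i.started for i in incidents])


def mean_time_to_repair(incidents):
    # Assumption: repair time runs from detection to restoration.
    return _mean_minutes([i.resolved - i.detected for i in incidents])


def mean_time_between_failures(incidents):
    """Average healthy time between one incident ending and the next one starting."""
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [nxt.started - prev.resolved for prev, nxt in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("need at least two incidents")
    return _mean_minutes(gaps)


# Hypothetical incidents for illustration only.
incidents = [
    Incident(datetime(2020, 1, 1, 0, 0), datetime(2020, 1, 1, 0, 10), datetime(2020, 1, 1, 0, 50)),
    Incident(datetime(2020, 1, 1, 10, 0), datetime(2020, 1, 1, 10, 2), datetime(2020, 1, 1, 10, 12)),
]
mttd = mean_time_to_detect(incidents)
mttr = mean_time_to_repair(incidents)
mtbf = mean_time_between_failures(incidents)
```

Recomputing these values over successive periods gives the trend the talk describes, where each repeat of the same failure is detected and repaired faster.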

00:19:35

And, uh, you know, we also saw the same scenario hit again, but the next time the alert fired, we recovered in less than nine minutes. And I think even recently we're down to less than two minutes. So that's some awesome progress, and it comes from just understanding and knowing what's happening with your systems and measuring these things. So, uh, the other part of SRE at State Farm is focusing on the future. Um, we made a pretty quick determination that we really thought we could move the needle a lot more on applications that we're re-engineering to move to public cloud. Um, you know, we could do a lot around visibility, helping the stability of some of our current systems, but we really felt that, uh, we could build the resiliency engineering piece, uh, into applications as they move to platforms like AWS, for example. So we all know public cloud will solve all your problems, right? No, but, um, it is a really good platform, um, for resiliency, uh, with easy things that are just built into that platform.

00:20:42

Take a look at, uh, you know, things like, um, your deployment patterns and how you can back that code out if you see any issues in production. Um, just at a base level, that's a good place to start, and it's pretty easy to get in there. Uh, the next thing we really focused on was the architecture, engineering, and design aspects. And this goes for software and infrastructure as well. Um, we need to be engineering for failure, because it will happen, um, like I stated before, and we need to really bake resiliency, uh, into the design, and we all need to do it. Um, that really falls into that culture side of the house as well, that everybody really needs to be thinking about these, um, practices and principles. And we want to shift that as far left, um, in this process as we can. Uh, one of the last things we landed on is to make SRE easy: create consistency, and build an SRE platform that others can very easily consume. And then we really need leaders to support the SRE initiatives.

00:21:52

What do you mean by SRE as a platform?

00:21:53

So if you build a monitoring stack, then make that available for everybody to use very easily. Um, if you have a log shipping solution, like, those types of things from the ops side of the house, um, that teams can just implement and get out of the box. Um, so from a leadership perspective, it really all starts here. Um, I can't say enough how much I appreciate the support that we have, um, from the leadership side here at State Farm for SRE and for DevOps. Um, but from the leader perspective, uh, some things that, uh, you may want to think about if you're doing DevOps and SRE together: like, just support the creativity of the team, focus on automation, let the teams experiment, um, and also expect the teams to know the health of their system, their app or infrastructure. Like, that should just be a given.

00:22:48

Um, I know that there are more, uh, enterprise kind of teams that look more broadly, but you know, if every area is ensuring that their app is healthy and getting notified when it's not, um, or even taking automated actions if it's not, it's just going to be a much better experience for our customers. So that also leads to consistency. Um, we're a large organization. As Jeremy stated, we have a lot of different solutions, tools, teams, areas, departments. It can get kind of, uh, crazy, the amount of stuff to keep track of. So from an SRE perspective, we're really looking to partner with, um, a lot of different areas to kind of help provide, uh, consistency across the board. Um, you need to give teams the autonomy and responsibility for their solutions, and then give them the time to tackle their technical debt, um, and the toil that I mentioned earlier. Uh, the biggest two things I can give any leader in this space are to, uh, figure out how to measure everything and then automate everything. Um, that should be the first place to start. And if you can be even remotely successful in those two areas, I think you'll be successful in getting site reliability engineering, uh, kicked off for your organization.

00:24:14

So with that, I'd like to say thank you from myself and Jeremy. Thank you very much. And we look forward to hearing from you. Uh, you can hit us up at these, uh, addresses on the screen here. We'd be happy to, uh, have a chat or answer any questions you might have. Thank you very much. Thank you.