Governance, Compliance, and Risk in the SDLC Can Be a Fun Event! (US 2021)

As the #1 insurer of cars and homes in the United States, State Farm® has embarked on a journey to fundamentally change the way teams deliver software through DevOps. State Farm has reshaped the way teams work and interact from the adoption of DevOps practices and behaviors, to the realignment into empowered product teams, but how do you balance the organizational need to manage risk and provide governance of the Software Delivery Life Cycle at a highly regulated company? This session provides attendees an in-depth look at the State Farm journey to embed a loosely coupled event architecture into our DevOps toolchain to broadcast key events in the SDLC. This has allowed us to bring better overall compliance to the State Farm internal standards and policies and development teams don't even know they are broadcasting events. The capturing of DevOps Events and their corresponding data has allowed us to capture a holistic picture of what really happens during the life of a code change and this leads to opportunities to use real time data and automation to govern our SDLC instead of the dreaded manual reviews or controls.

Jeremy Castle

Architecture Director, State Farm Insurance

Ryan Chambers

Technology Engineer, State Farm Insurance

TRANSCRIPT

00:00:14

Alright. Hello, I'm Jeremy Castle, and I'm an architecture director at State Farm. I'm Ryan Chambers, and I'm a technology engineer. Ryan and I are here to talk about how governance, compliance, and risk in the SDLC can be a fun event. We've been working on an eventing framework here at State Farm that we've instrumented and hooked into our developer tools to try to achieve better transparency into the SDLC. It's going to allow us to build automation on top of it that makes for better governance, compliance, and risk management. So let's dive a little bit into the numbers behind the neighbors here at State Farm. We're the number one auto and homeowner insurer in the United States. We have 85 million active policies and accounts so far. And just in terms of size, we have over 57,000 employees.

00:01:03

And then we actually have over 19,000 independent agents all over the United States. In addition to being the number one auto and homeowner insurer, we're also the number two largest life insurer in the United States based on policies in force, as of 2016. So a little bit about the Enterprise Technology department here at State Farm. We've got a little under 3,000 software and infrastructure developers, out of a total of 7,000 employees that Enterprise Technology has within our walls. With that, there are hundreds of technologies. We have Java, mainframe, PL/I. We have [inaudible], we have public cloud. We have just tons and tons of technologies all over the place, and that makes for a complex environment. We have over 2,000 web applications that we host on these platforms, and those are spread across 1,200 product teams, across 15 different business areas that Enterprise Technology supports.

00:02:08

So it's just a large, complex environment. With us being a financial institution, we obviously have a lot of regulatory things we have to do to maintain compliance. And that's really what our talk today centers around: how are we going to try to automate this, make it a little bit easier on our developers, really get that DevOps mindset, and remove as much friction as possible. So let's start out and talk a little bit about where Ryan and I fit in at State Farm. It says engineering director here; I'm also an architecture director. Titles, we'll figure that out at some other point. Horizontal enablement for Enterprise Technology, that's really where I sit. My responsibilities are really the application development life cycle at State Farm. I own the majority of developer tools and the experience around that.

00:02:57

And I'm heavily involved with CI/CD practices and the different tools. And I'm Ryan Chambers. I'm a technology engineer in the delivery lifecycle insights area. I focus on building solutions that help improve the overall delivery experience at State Farm, and I built the framework that we're going to talk about today, the one behind all of these events. Ryan's one of those super engineers who can kind of figure anything out [inaudible]. So he is a great guy to work with, and I think he's a key piece to making what we're trying to do here successful. So let's talk a little bit about the area. I want to give everyone some context about where we fit in at State Farm, a little bit deeper. I live in a suite called the Delivery Experience.

00:03:42

That's what we call ourselves. I've got about 10 products that sit underneath me, and our mission is to provide one cohesive ecosystem of developer tools and services to improve the experience of our product teams within Enterprise Technology. This is a pretty cool mission because you actually get to go out and try to solve problems for developers, and you try to make their day-to-day life better. That's actually a really rewarding thing to do. I know, Ryan, you've helped a lot of developers, and I think it's pretty cool to see some of the stuff you do. And we've been fortunate enough at State Farm that they've really supported that within our Enterprise Technology department; our executives support it. So if you look at this picture, Delivery Experience is our suite, and we sit horizontally across these different business areas.

00:04:30

We're a group of about 60 folks that provide support for developer tools and services. And what are those tools and services? Think about things like SCM, so GitLab. How do you build pipelines in Jenkins? We're big investors in the CI runners that GitLab provides as well. GitOps, I would say we're doing some really cool stuff with; I've got a whole product team around GitOps and they're trying to enable it. We have several people that have done talks on that this year, which I recommend. Same thing with security scanning tools: we took that over from our InfoSec department. That's been a big win for us because I think we bring a developer mindset to some of the security tools. We've been trying to provide open source, APIs, and SDLC best practices.

00:05:15

So really I think about it as these end-to-end developer tools, and our mission at the end of the day is making everyone's lives a little bit easier and getting people thinking about DevOps and how to work differently. So we've got some questions, right? Does your developer experience look like this? Do you have CAB gates used for governance of the SDLC? Do you have to go into a committee and have them look at your changes? They probably don't really understand what is in those changes, but they're your rubber stamp before you go to production. Is your proof of testing manual? Do you have to put those in Word docs? Is it a test organization that has to fill it out and sign off on it? There's a lot of this. I think a lot of enterprises have this type of flow, or at least they used to, right?

00:05:56

No visibility to potential bottlenecks in the SDLC. How long does it take to get a change out to production? Do you measure that manually? Is there a way you can automatically do that, or is it all manual and guesswork? And is there just a myriad of tools and platforms to manage a change? How many people do you hand off to? Do you have a change area that has to sign off on things? How many tools do you have to hop through? Do you have to go into GitLab, ServiceNow, just various tools? That makes, I think, a pretty big headache for developers at the end of the day. So Ryan and I started talking, and this probably happened about the beginning of 2020, actually. I think we started having some discussions like, let's think about what it looks like for a developer to get something out to production.

00:06:46

We started seeing that delivering a feature is a lot like a shopping list, right? You have some requirements, you're going to do some coding, but then, did you commit your code? Did you run a security scan? Did you unit test, integration test, system test? What environments did you deploy to? Was your code reviewed by anybody? Think about when you're trying to bake a cake. You usually have a shopping list of different things you have to do, right? You have to go to the store, buy eggs, probably some mix, make sure you have pans. You have all these different things you've got to do, and it comes into a list. That's very similar to what a developer has to do when they're creating code and trying to deliver a feature out to production.

00:07:29

It's like, well, can we take this concept? Let's think about it as a developer who wants to push something out the door. So you start thinking about it: I have my list, and in today's model, like a lot of people know, there's a cashier, right? In the traditional model, you have a person ringing up all your groceries. This process takes longer. You've got to stand in line, you've got to talk to the person, they have to ring you up. It can be slow. There are a lot of people in line, and you could be waiting a lot longer than you want to. It's the same with the traditional model that a developer or engineer would have to face. Typically I compile my list, put all my ingredients in a basket, and I'm going to come up to the cashier.

00:08:12

Someone's going to check off on it. It's a slow and kind of tedious process for us. So we started saying, okay, where we're getting close to is the self-checkout model. If you think about that, this gives the customer additional freedom and hopefully makes for a smoother experience, right? You just go up with your basket, and there's an electronic kiosk. This process takes less time. The lines are typically shorter. You can ring up your own items. You don't have to deal with as many people or as many handoffs. So that's kind of where we're at today. I think at State Farm we've enabled a lot of self-checkout, but it's still pretty painful in some ways, because there are still hoops you have to jump through, there are still some gates, and there's still manual work.

00:08:57

How do we get to the point where what our developers do day to day can be self-reported, and maybe they don't have to go through a checkout at all? That's where we really started thinking. Today you can see where some of the shopping is going: you simply scan your app as you walk into the store, you grab your stuff, throw it in the basket, and you leave. There are no lines, you don't have to talk to anyone, you don't have to ring up items. Probably everything's embedded with near-field communication chips. But you basically go into the store to accomplish a goal, pick up your food, and you leave. You're automatically charged. So that was the concept as we started framing this up in our heads: can we get to this more grab-and-go model where whatever you're doing day to day as a developer, we're going to self-record that and then use it to determine whether you're compliant and can go out the door?

00:09:50

So an interesting thing happened. We had this theory: what if our developers self-reported important actions and we built governance around those events? And the funny thing is that Ryan and I, at a couple of points in this journey, have gone back to this paper called the DevOps Automated Governance Reference Architecture that IT Revolution published. We were already down this journey, and I think we both stumbled across the paper at about the same time. We IM'd each other like, hey, have you read this paper? And it was like, yeah, this actually seems to be what we're trying to do. So it's laid a foundation, a framework, and a map, I'd say a mental model, for how we're trying to approach this, and kind of an architecture, right?

00:10:37

So we've been able to refer back to this paper and say, hey, are we headed down the right path? Are we thinking about this the right way? Is this how we should architect our event framework? So what do you think about that, Ryan? Yeah. So this reference architecture has a lot of good points in it. It walks you through the entire process and those critical things that you really need to implement the pattern for automated governance. The thing that I zoned in on as being the most important thing as we started our journey is the event framework part. The document talks through how to collect the information, and the event framework is your backbone. That's where we really started our journey.

00:11:19

So we started out with a few architectural principles in mind, and these helped us keep in line with the direction that we wanted to go. The first one is "don't mind us, we're just listening." The goal for this principle is to collect as much data as possible with as little impact on our development community as we can possibly manage. We did this by using tooling that we already use: the webhooks within GitLab, and wrapping our custom CLIs, so that there was no impact on our developers for the initial set of events that were collected. A lot of modern tools, I think we discovered, already have webhooks, so the tools have ways to notify, and we used that too. Yep, it makes things super simple. The second one is "social distancing systems." For this one, we wanted to really advocate decoupling processes and systems so that our framework and the automated governance principle could grow over time. Change is the constant.

00:12:19

So we know that as time goes on, tooling's going to change. Processes are going to change. Patterns are going to change. We wanted to make sure that our framework could keep up and that we could adjust as things change. The third one: "you can run, but you cannot hide." For this one, we knew that we needed to collect a lot of data, but we knew that just having the data by itself wasn't going to be useful. We needed to provide context. So with this principle, what we wanted to do was tie each event back to something, whether that's a commit SHA, a product, or an artifact. We need to be able to associate it back to something so that we can paint the picture that we're looking for at the end of the day. The fourth one: "go, go gadget." As we built the framework, we knew that we were going to start off small, get our feet wet, figure out how things were working, and understand what the demand was.

00:13:12

And then as time goes on, we expect this to grow quite large, so the framework needed to be able to scale. For context, we started out with around 15,000 events per day when we first implemented this last year, and now we're at over 150,000, and we're only at the start of our journey. The last one: "robot insurance." I think this one's probably the most important. We wanted to advocate automation, get rid of the manual processes as much as possible, and give the teams and areas that need to provide governance a way to easily integrate with our framework to automate processes and make things frictionless for our developers. We did this by making the framework as easy as possible to both publish and subscribe to. So now anybody within State Farm's walls can subscribe to those events, no matter what platform they reside on, and do their processing as needed. I think just decoupling those systems has been incredibly powerful, taking an event-driven architectural mindset.
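
To make the "listen, decouple, add context" principles concrete, here is a minimal sketch of turning a GitLab push webhook payload into a small event envelope and handing it to subscribers. It is an illustration only, not the State Farm framework: the envelope layout, the `scm.push` type name, the `PRODUCT_BY_PROJECT` lookup, and the in-process `EventBus` are all assumptions. Only `checkout_sha`, `project.path_with_namespace`, and `total_commits_count` come from GitLab's push webhook payload.

```python
import uuid
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical mapping from a GitLab project path to an internal product name.
PRODUCT_BY_PROJECT = {"claims/auto-quote": "auto-quote"}


def to_devops_event(payload: dict) -> dict:
    """Wrap a GitLab push webhook payload in a tool-neutral envelope."""
    project = payload["project"]["path_with_namespace"]
    return {
        "id": str(uuid.uuid4()),
        "type": "scm.push",  # hypothetical event type name
        "time": datetime.now(timezone.utc).isoformat(),
        # Context ties the event back to a commit SHA and a product.
        "context": {
            "commit_sha": payload["checkout_sha"],
            "project": project,
            "product": PRODUCT_BY_PROJECT.get(project, "unknown"),
        },
        "data": {"commit_count": payload.get("total_commits_count", 0)},
    }


class EventBus:
    """Tiny in-process stand-in for a publish/subscribe backbone."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event):
        # Publishers never know who is listening; subscribers never call the tool.
        for handler in self._handlers[event["type"]]:
            handler(event)


# Trimmed-down push payload using field names from GitLab's push webhook.
sample_payload = {
    "object_kind": "push",
    "checkout_sha": "da1560886d4f094c3e6c9ef40349f7d38b5d27d7",
    "project": {"path_with_namespace": "claims/auto-quote"},
    "total_commits_count": 4,
}

received = []
bus = EventBus()
bus.subscribe("scm.push", received.append)
bus.publish(to_devops_event(sample_payload))
```

The developer who pushed does nothing extra here; the webhook fires on its own, which is the point of the "just listening" principle.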

00:14:12

The advantages that have popped up out of that have just been night and day. So we started building this out on AWS. The platform itself made it super easy to hit all those architecture principles that we just talked about. The big services that we leveraged for the design are Lambda functions, EventBridge, and Elasticsearch, or soon to be OpenSearch. Lambdas allow us to do serverless computing, which makes scaling, reducing costs, and even just pushing quick changes out to production very simple. EventBridge allows us to build the framework around that event-driven architecture, so that we are able to react to events as they're published through the framework and do the data manipulation, storage, or broadcasting when it's appropriate. And Elasticsearch gives us the visualization piece as well as the correlation and the ability to query data, which lets us tie everything together so that we can start building advanced analytics on top of the events and get that end-to-end picture we're looking for.
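
As a rough sketch of how an event might move through services like these, the snippet below builds an EventBridge `PutEvents` entry and shows a Lambda-style subscriber reading it back. The bus name `devops-events`, the `sdlc.` source prefix, and the event layout are assumptions for illustration. The entry keys (`Source`, `DetailType`, `Detail`, `EventBusName`) and the delivered keys (`detail-type`, `detail`) follow the EventBridge API. The client is passed in so the sketch runs without AWS credentials.

```python
import json


def build_entry(event: dict, bus_name: str = "devops-events") -> dict:
    """Shape one event as an EventBridge PutEvents entry."""
    return {
        "Source": "sdlc." + event["type"].split(".")[0],  # e.g. "sdlc.scm"
        "DetailType": event["type"],
        "Detail": json.dumps(event),  # Detail must be a JSON string
        "EventBusName": bus_name,
    }


def publish(events_client, event: dict, bus_name: str = "devops-events") -> bool:
    """Send one event; returns True when EventBridge reports no failed entries."""
    response = events_client.put_events(Entries=[build_entry(event, bus_name)])
    return response.get("FailedEntryCount", 0) == 0


def handler(event: dict, context) -> dict:
    """Lambda-style subscriber: EventBridge delivers the payload under 'detail'."""
    detail = event["detail"]
    # A real subscriber might index this document for dashboards and queries.
    return {
        "type": event["detail-type"],
        "commit_sha": detail["context"]["commit_sha"],
    }
```

In real use, `events_client` would be `boto3.client("events")`, and an EventBridge rule matching on `detail-type` would route events to the subscriber function.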

00:15:21

So as we started this journey, we had quite a few questions that we could probably answer through some manual work, but the answers were very hard to get to. Those were the things that we started with as we started collecting data, to see if we could just answer them with the data that we had. The first one is the frequency of a git push: how frequently are developers pushing to GitLab? We have quite a few projects, as we noted on the first two slides, since it's a big organization. What we were able to see was that we average around 15,000 pushes across 2,000 projects in a single day. There are many more than 2,000 projects, but that's how many get pushed to. The second one is security scans. At State Farm, each component that gets pushed to production has a wide variety of requirements, and security scans are one of those.

00:16:15

And we do [inaudible] scans. We were able to associate those back to the products themselves, but the process sometimes included manual work, and we wanted to get away from that. So today, with our event framework, we're able to tie the security scans back to the products as they're pushed to production. We're also able to determine what any security violations are in those scans and what type of scans they are. Today we do around 20,000 security scans on a daily basis. Of those security scans, 17,000 are secret scans, so checking GitLab for anybody pushing secrets by accident. 1,500 of those are dependency scans, so analyzing the composition of the application and figuring out if anything needs to be updated or if there are security findings for any open source dependencies. And 1,500 are static analysis scans.

00:17:06

So just inspecting the code, looking for best practices. And then the third one is lead time between change and deploy. This one, I think, is super important, and I was very surprised by the results. Today it takes around 25 hours from the time a developer pushes a code change to GitLab to the point that it is approved by a manager to go into production. This amazed me, because back when I started, it took months to get to production. So the fact that we're able to get down to 25 hours is really good, and hopefully by the end of this journey we're able to reduce that even further. Yeah. And the nice part for me, sitting in a director's seat, is that I get asked a lot of times, well, what is our lead time for change to production?
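
A lead-time figure like that can be derived directly from the event stream once each event carries a commit SHA. The sketch below pairs the earliest push with the earliest approval for each commit. The event type names `scm.push` and `change.approved` and the flat event layout are assumptions for illustration, not the framework's actual schema.

```python
from datetime import datetime
from statistics import mean


def lead_times_hours(events: list[dict]) -> dict[str, float]:
    """Hours from first push to first approval, keyed by commit SHA."""
    pushed: dict[str, datetime] = {}
    approved: dict[str, datetime] = {}
    for event in events:
        when = datetime.fromisoformat(event["time"])
        sha = event["commit_sha"]
        if event["type"] == "scm.push":
            pushed[sha] = min(when, pushed.get(sha, when))
        elif event["type"] == "change.approved":
            approved[sha] = min(when, approved.get(sha, when))
    # Only commits that have both a push and an approval get a lead time.
    return {
        sha: (approved[sha] - pushed[sha]).total_seconds() / 3600
        for sha in pushed
        if sha in approved
    }


sample_events = [
    {"type": "scm.push", "commit_sha": "abc123", "time": "2021-09-01T08:00:00+00:00"},
    {"type": "change.approved", "commit_sha": "abc123", "time": "2021-09-02T09:00:00+00:00"},
    {"type": "scm.push", "commit_sha": "def456", "time": "2021-09-01T10:00:00+00:00"},
]

per_commit = lead_times_hours(sample_events)
average_hours = mean(per_commit.values())
```

In this sample, commit `def456` has no approval yet, so it is left out of the average rather than counted as zero.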

00:17:52

How many scans are being performed? It was really hard to answer that before this; we had to skim through and kind of estimate counts. With this, it's all automated, and you're getting accurate data because it's based on what people are actually doing. With our tools publishing these events, we've been able to answer these questions with some accuracy, right? And now we can innovate on this stuff and go, hey, what are some things we might want to do in these spaces that we really couldn't do before? It's all set up to be automated. And actually, there are a couple of things that really surprised us. Yeah. So, the unexpected value and use cases that we were able to find right off the bat. We started out with tagging our events with the context that we discussed earlier.

00:18:40

So as we deploy things out to test and production, we attach metadata that relates the deployment back to the source code and the change, so the commit SHA within that source code, and also to the product within our organization, something that we know ties back to the team. That information allows us to make some really cool observations. One example is that a costing team came to us and wanted to understand the cost of a Cloud Foundry application as it sits in each of the environments, based on how much memory it's using and how much processing is being done within the application. With this, we were able to quickly tie that back to the organization and give them that information, to the point where they were able to plug it into their model, move along, and figure out what they needed to.

00:19:26

The other side effect of this is that now, as we're publishing these events across the organization, teams that have downstream dependencies are able to identify when those dependencies change and plug that into their alerting, which helps them get to root cause analysis much quicker if there are issues. Next, CLI instrumentation. This initial effort would not have been possible if we didn't have wrappers around many of the CLI interactions that we do within our delivery lifecycle. With the CLI instrumentation, we were able to plug in and get these events without any impact on the developers. They just had to bump the version, which happens automatically, and then we were getting those events. The cool thing that we noticed was that this not only helped us figure out those various actions, like how many scans are happening and what lead time looks like, it also helped the teams that support those custom CLIs do their job.

00:20:20

So now we're not only getting information around those actions, but they're getting information about errors occurring and how frequently certain commands are used within their CLIs. It's stuff that we never had before, and it helps us troubleshoot issues when people come in with them. It also helps us do things like sunset or transition versions. Now we can actually reach out to our consumers. Before, we broadcast to a whole bunch of people, not knowing if they got the message or not. Now we're able to directly communicate with them and say, hey, you're on this old version, or you're on this tool that's going to go away, now is the chance to move, and here's some documentation on how to do it. Yeah. Eventing has allowed us to know usage right down to the exact person, job, and time that it's happening.
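
The CLI wrapping idea can be sketched as a thin function that runs the real command and reports what happened. Everything named here is hypothetical: the `cli.invoked` type, the event fields, and the `emit` callback, which stands in for whatever publishes to the event framework. The wrapped command in the demo is just the Python interpreter exiting with code 3.

```python
import os
import subprocess
import sys
import time


def run_instrumented(argv: list[str], emit, wrapper_version: str = "0.0.0") -> int:
    """Run the wrapped command unchanged, then emit a usage event about it."""
    started = time.monotonic()
    result = subprocess.run(argv)  # stdin/stdout/stderr pass straight through
    emit(
        {
            "type": "cli.invoked",  # hypothetical event type name
            "tool": os.path.basename(argv[0]),
            "command": argv[1:],
            "wrapper_version": wrapper_version,
            "user": os.environ.get("USER", "unknown"),
            "exit_code": result.returncode,
            "duration_s": round(time.monotonic() - started, 3),
        }
    )
    # The caller sees the same exit code the real tool produced.
    return result.returncode


# Demo: wrap a command that fails with exit code 3 and collect the event.
events = []
exit_code = run_instrumented(
    [sys.executable, "-c", "raise SystemExit(3)"], events.append, "1.2.0"
)
```

Because the event records the wrapper version, user, and exit code, the same stream can answer both "who is still on an old version" and "which commands are failing."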

00:21:01

So that's been really cool. The last one was dashboards and alerts. We leverage Elasticsearch, and that allows development teams to plug in and create dashboards using Grafana or Kibana on the fly. They're able to create panels to show when their applications are deployed and when security scan vulnerabilities pop up, and create alerts based on those things. So where are we at with our shopping cart, to go back to our shopping experience analogy? We've hit all those checklist items. We're capturing that data automatically. We're capturing the GitOps information and the evidence-of-test repository information. At State Farm, we have a centralized place where we put all of the scans and tests done on an application, and we're putting those through the DevOps eventing framework with events that correlate back to the organization. Now the future goal here is to get to the point where we have that grab-and-go model. We collect all the information behind the scenes, and the teams are able to get through that experience much quicker and with much less friction.

00:22:06

And what does the future hold? First, transparency of changes in the environment. We want to get that end-to-end picture. We're collecting all this data, and we want to tie it back to the various things that it impacts and get a full picture of what's actually happening. That allows us to really get to that automated governance model. We're collecting a lot of data right now. Now we want to be able to use that data to say, hey, are you meeting this requirement? Are you doing the things that you need to be doing in order to go to production? Next, identifying the level of risk that's going on with the change that you make. Then analysis of code change: in almost real time, we're able to identify the exact changes that occurred on the application and what is going into production at this point.
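
The "are you meeting this requirement?" question can be sketched as a check over the events recorded for a commit. This is a hedged illustration, not State Farm's policy: the `REQUIRED_EVENTS` set, the event type names, and the `status` field are all assumptions.

```python
# Hypothetical evidence a change must have before it can go to production.
REQUIRED_EVENTS = frozenset(
    {"scm.push", "scan.secret", "scan.dependency", "scan.static",
     "test.unit", "review.approved"}
)


def evaluate_change(commit_sha: str, events: list[dict],
                    required: frozenset = REQUIRED_EVENTS) -> dict:
    """Report whether every required event was recorded, and passed, for a commit."""
    seen = {
        event["type"]
        for event in events
        if event["commit_sha"] == commit_sha
        and event.get("status", "passed") == "passed"
    }
    missing = sorted(required - seen)
    return {"commit_sha": commit_sha, "compliant": not missing, "missing": missing}


recorded = [
    {"type": "scm.push", "commit_sha": "abc123"},
    {"type": "scan.secret", "commit_sha": "abc123", "status": "passed"},
    {"type": "scan.dependency", "commit_sha": "abc123", "status": "failed"},
    {"type": "scan.static", "commit_sha": "abc123", "status": "passed"},
    {"type": "test.unit", "commit_sha": "abc123", "status": "passed"},
]

decision = evaluate_change("abc123", recorded)
```

For the sample data, the failed dependency scan and the absent review both show up in `missing`, so a change like this would be flagged for follow-up instead of sitting in a manual review queue.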

00:22:45

Now we can inspect those code changes and really see what the level of risk is. Was it a small change? Was it a large change? How frequently do changes occur? We can use that information to guide both the development team and the areas that govern these things on how they perceive the change. And the last one is analytics of quality of change. We can look at the quality of the change and see what was changed and how useful it is for the organization. I mean, just to wrap this up: for me, in a leadership spot, having the ability to look at the transparency of the changes in our environments, and being able to go to our auditing and risk folks and say, hey, this is exactly what's happening in our environments. We're not manually auditing change records.

00:23:35

Now I can do this off the data and dashboards, and you can visualize things. That's incredibly powerful, and we didn't really have that in the past. I mean, we were doing all the right things; it was just very manual and intensive and added a lot of friction. I think having this framework in place really sets us up for the future to do some interesting things with the data. Where can we put machine learning and AI on top of it? We didn't have that in the past. Now the future looks really bright for us, with the ability to automate a lot of these governance checks and remove a lot of friction from our developers' day-to-day lives. That's actually something really exciting. That's part of our mission. That's why we get out of bed in the morning, sometimes. And I think it just makes for a good working environment for the developers and engineers. So we'd like to thank you for your time. Ryan and I have our emails here; contact us if you have any questions and want to learn more. We'd love to talk to you. Thank you, and we really appreciate you attending our talk. Thank you.