Las Vegas 2020

Your Baby is Ugly. Let's Be Friends

Highly decoupled teams allow enterprise-scale engineering organizations to move at greater velocity. But how do you build a shared culture of quality and reliability among disparate teams? Going around breaking their software doesn’t sound like a good way to build systems or friends.


Matt Simons, Senior Engineering Manager of Quality Assessment, will share how his team partners with the dozens of teams and 400+ engineers at Workiva to help them understand quality in distributed systems by breaking their software without breaking their spirits. He’ll discuss the challenges of building cross-team reliability and how they approached it using Chaos Engineering.


Matthew Simons

Senior Engineering Manager, Workiva


Jason Yee

Director of Advocacy, Gremlin

Transcript

00:00:12

Okay, we're live. Yeah, sorry, we should've done like a 3, 2, 1, go, but here we are. So I'm Matt Simons, this is Jason Yee, and this talk is Your Baby is Ugly. Let's Be Friends. We're hoping to make some friends here today. So I guess with that, we'll jump to what this talk is going to look like. This is the rough agenda. This is a story talk. It's a war story. So I want to tell the story, or a story, of a DevOps transformation that we went through at the company that I work for, Workiva, and kind of its state and where we're at with that. I want to dedicate a portion of that to talking about chaos engineering specifically, and so Jason Yee is here from Gremlin to talk about that. And if you want to introduce yourself, Jason, now or later? Okay, we'll do that.

00:01:11

After we do that, then we'll kind of come back and I'll give you the rest of the story and share all the morals, the lessons learned if you will. That's the rough outline. So I guess let's get to it. Maybe the place to start here is that this is a war story. It's not a success story, not entirely. Like many war stories, the moral may well be, at the end of the day, that war is hell. And so is organizational transformation. We had successes, we had huge successes, but we had failures too. And as cliche as it may sound, we made some friends along the way. So come, friends, listen to a story of triumph and failure, of love and loss, of ambition and betrayal. My hope is that you'll be able to learn from my experiences and take something with you in your DevOps transformation, something that you can use in your war.

00:02:08

So without further preamble, let's start here: your baby is ugly. That's right, I went there. Your software sucks. It's a hacky mess of minimally viable garbage, and it's damaging your relationship with your customers. That's essentially the message that my team had to carry with us to every other team that we engaged with. As you can imagine, it made us very popular. But let me back up a bit. I'll go back to the beginning here. At Workiva, the first stage of our product, back when we were a young startup, was a monolith. We've worked in the public cloud since the beginning. Our first product was built on Google App Engine, which encouraged, at least for us, building in a monolith. That meant that we had a lot of alignment, I guess you could say, that we took for granted, just because the platform sort of forced us to work that closely with each other.

00:03:04

We had one monolithic code base, and that really gave us a sense of alignment that we sort of took for granted. Then we came to a place where we decided that that was holding us back in some ways. As many teams and companies have done, we decided to split off from that and go to more highly autonomous, highly decoupled, cross-functional teams. Now, this gave us huge velocity wins, but we found out that it also came with some drawbacks. That velocity, without a high degree of coordination, eventually led us into misalignment and some degree of drift. It's almost like a queen who walks out to her subjects and says, "We must expand the empire," and all the knights gather their armies and then just charge off in different directions, right? You need something to bring people together. We needed a high level of coordination to augment the autonomy that we gave the teams. For us, that resulted in a lack of quality, or at least quality was where we saw it first. So my team was born out of that. The quality assessment team was what we called ourselves. Assessment, not assurance. Assurance was someone else's job. Ours was to basically go to these teams and, in painstaking detail, describe exactly how their babies were ugly.

00:04:39

What we really wanted to avoid in this was becoming the inquisition. That was the big threat, right? We were essentially there interrupting their workflow, mandated by executive leadership to show up and figure out why this team was having issues with quality, or where they might have latent quality issues that just hadn't reared their heads yet. We triaged. We went to lots of different teams, but we tried to focus on those that had either a really large blast radius in their product or that had had quality issues so far. And we devised a process that would hopefully help us avoid becoming the inquisition, which was the assessment. The assessment had three main pillars to it. The first was that we did some quality consulting. This helped us build some goodwill with the team.

00:05:31

We just showed up and said, "What can we help with? What hurts?" and kind of rolled up our sleeves and got to work. The second was an assessment, an actual standards assessment, where we would go through a checklist of standards and best practices at the code and product level, and essentially provide a grade for the team. And the third was to do chaos testing. This was new to us. Workiva had not yet been doing chaos testing, at least not in any sort of widespread or coordinated way. So for many teams, this was something that we introduced them to. And for almost all teams, even if they were familiar with the concept, it was something they hadn't really gotten a chance to do yet. This was a really, really big part of the process for us, and it was a pretty huge win. So I wanted to take a really significant chunk of this presentation and dedicate it to that, because if it's not something that you've done before, it is something you should do, and it should be a part of your DevOps transformation. So with that, I'll kick it over to Jason, who's the director of developer advocacy over at Gremlin, which is a provider of chaos as a service that we use, and I'll let him tell you all about it.

00:06:46

Thanks, Matt. Yeah. So what is chaos engineering, or chaos testing? You've probably heard of it. I mean, as Matt's last slide said: hey kid, do you want to break stuff? All the cool kids are doing it, right? Those cool kids being Netflix or Amazon or other cool places like that. And so you've probably heard these stories of Chaos Monkey and how it randomly destroyed servers. And so you have these maybe achy breaky products, which leads us to: have you met Billy Ray Chaos, king of the mullet? So what is chaos engineering? Well, if you use the mullet definition, it's maybe some business on the front end and party in the back end. But that's probably not a great definition of chaos engineering, right? If we're serious about it, chaos engineering doesn't really care if it's front end or back end or infrastructure.

00:07:46

And it's not about random destruction. So really, what is it? At Gremlin we have this definition: chaos engineering is thoughtful, planned experiments that are designed to reveal the weaknesses in our systems. And when we talk about systems, we mean both our technical systems, things like our applications and our infrastructure, but also our human systems. Where are our processes broken? Where do we need better documentation? How do we handle the whole on-call and incident response process? All of these things can be encompassed, and you can learn a lot about how you think things work by running experiments. Now, if we take that idea of experiments, well, experiments are scientific. So let's do it for science. What do we do if we have a science experiment? Well, the first thing is we have to start with a hypothesis, right? How do we think our systems, our applications, our monitoring tools, our response processes work when failure happens?

00:08:53

So you come up with this hypothesis, and that hypothesis does not necessarily mean that things survive, right? If I have a database and I say that I'm going to kill that server, well, it may come back. But my hypothesis might be that my application goes down and that my end users receive, hopefully, some sort of meaningful error. That's a totally valid hypothesis: not surviving. But along with that, my users should see errors, and my monitoring tools should show me what's gone wrong. Those are perfectly valid hypotheses. Don't focus on simply surviving; focus on the things that you need to run good operations. Next, we set abort conditions. We want to be responsible adults, right? It's not about that random destruction. So what is it about? Well, if we're testing this for science, we want to have guardrails.
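That hypothesis-first framing can be sketched in code. This is a minimal illustration only; it is not Gremlin's or Workiva's actual tooling, and the handler and function names are hypothetical:

```python
# A minimal sketch of writing a chaos hypothesis down as checkable
# assertions rather than "it should survive." All names are hypothetical.

def handle_request(db_available: bool) -> dict:
    """Toy request handler that degrades gracefully when the database is down."""
    if not db_available:
        # The hypothesis: users get a meaningful error, not a stack trace.
        return {"status": 503, "body": "Service temporarily unavailable"}
    return {"status": 200, "body": "report data"}

def run_experiment() -> dict:
    """Inject the failure (database down) and check what users would see."""
    response = handle_request(db_available=False)
    # Note the hypothesis is not "we stay up" -- it's "we fail well."
    assert response["status"] == 503
    assert "unavailable" in response["body"].lower()
    return response

print(run_experiment()["status"])  # → 503
```

The point of the sketch is that "my users see a meaningful error" is just as testable a hypothesis as "nothing goes down."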

00:09:51

So what are our abort conditions? What happens if that small little failure that we're injecting leads to a massive cascading outage? We want to think about the conditions that cause us to say we're going to stop doing this. What are we monitoring for? And what's our backup plan, our plan B, in case everything goes bad? But ultimately, you just have to do it, right? You want to start small, and you want to start in a non-production environment, and just like your code or anything else, move up to staging and then move to production to get comfortable with it. But as my friend Bruce Wong likes to say, if you're trying to get in shape, you don't work out first before going to the gym. You just go to the gym and you work out. So you have to start.
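Those guardrails can be written down as explicit abort conditions that are checked while the experiment runs. A minimal sketch, with made-up metric names and thresholds:

```python
# A minimal sketch of abort conditions as explicit guardrails.
# Metric names and thresholds here are invented for illustration.

ABORT_CONDITIONS = {
    "error_rate": 0.05,      # abort if more than 5% of requests fail
    "p99_latency_ms": 2000,  # abort if p99 latency exceeds 2 seconds
}

def should_abort(metrics: dict) -> bool:
    """Return True if any observed metric crosses its guardrail."""
    return any(metrics.get(name, 0) > limit
               for name, limit in ABORT_CONDITIONS.items())

def run_with_guardrails(metric_samples) -> str:
    """Walk through observed metrics; halt the experiment at the first breach."""
    for step, metrics in enumerate(metric_samples):
        if should_abort(metrics):
            return f"aborted at step {step}"
    return "completed"

# Simulated observations: healthy, healthy, then an error-rate breach.
samples = [{"error_rate": 0.01}, {"error_rate": 0.02}, {"error_rate": 0.09}]
print(run_with_guardrails(samples))  # → aborted at step 2
```

Writing the limits down before the experiment is the "responsible adult" part: the decision to halt is made calmly in advance, not mid-incident.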

00:10:41

So don't overthink things, don't over-plan them. Just jump right in and do it. Like any science, it's all about repetition and iteration, and similarly with our engineering work as well. So as you start to do your experiments, you'll find that sometimes things work great. They work exactly as you expect, which is fantastic. It means that you know a lot about your systems and how they operate. When that happens, you'll want to increase your magnitude and your blast radius: how severe your tests are and how much of your systems they affect. So you'll want to keep iterating, trying to find those edges or limitations in your systems, and that'll help you gain a better understanding of their reliability. But sometimes things don't work as expected. In those cases, you'll want to create tickets, in JIRA, Trello, whatever you're using, and ensure that you're fixing those things.
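That iterate-and-widen pattern can be sketched as a loop that grows the blast radius only while the hypothesis keeps holding. The levels and the pass/fail rule below are invented purely for illustration:

```python
# A minimal sketch of widening blast radius only after a pass at the
# current level. The host percentages and pass rule are hypothetical.

def run_attack(percent_of_hosts: int) -> bool:
    """Stand-in for one chaos experiment; returns True if the hypothesis held.
    Here we pretend everything under 50% of hosts passes."""
    return percent_of_hosts < 50

def escalate(levels):
    """Grow the blast radius while experiments keep passing; stop at the
    first level that reveals a weakness (or finish the whole ladder)."""
    passed = []
    for pct in levels:
        if not run_attack(pct):
            return passed, pct  # weakness found at this level
        passed.append(pct)
    return passed, None

passed, failed_at = escalate([1, 10, 25, 50])
print(passed, failed_at)  # → [1, 10, 25] 50
```

The level that fails is exactly the edge Jason describes: it tells you where your system's reliability actually ends, which becomes the ticket you file and later rerun against.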

00:11:37

And then come back to these tests and actually rerun them to ensure that what you thought was a problem, and what you thought you fixed, has actually been fixed. But sometimes things go horribly wrong. We all know that from our own experiences as engineers. And when that happens, well, you have two options. One: you should have already set some abort conditions, so you're going to be monitoring against those, and you may want to halt. If things go horribly, horribly wrong and it affects your customers, then you may want to stop, so your customers don't have a bad experience. But depending on how things go wrong, you may want to lean into it and keep going. It's a good opportunity to practice your incident response, to validate your runbooks and your documentation. So when failure happens, sometimes just lean in and take advantage of that failure. Take advantage of the fact that people are already in the mindset of wanting to fix things, and they're already available.

00:12:37

Ultimately, though, it's about learning more. As we talk about experiments and science, well, the word science comes from the Latin word for knowledge. So it really is all about the more you know. The more you know about how your systems operate and how they fail, the better equipped you'll be to build more reliable systems and to be more reliable engineers who know how to operate the systems you're running. So, all that said, whether or not you choose to do the mullet style of chaos engineering and do a little bit of partying in your back end, just go out there and do it and start to learn and build up your knowledge. And with that, I'll turn it back over to Matt.

00:13:18

Thanks, Jason. Yeah, chaos for us was one of the big wins. In fact, as we talk about successes here in our story at Workiva, it's the first that I'll mention. Chaos was brand new to us. As I said before, it was something that hadn't really found its way into our standard practice. So this was a chance for us as a team to bring it to the teams we worked with as a standard part of our assessment. It became an opportunity for these teams, who again had never really done this, to go through the entire process: starting from forming hypotheses, to actually seeing their products break, and sometimes break in ways that they hadn't expected. Those were always some of the biggest wins for us. And this is something that we're now doing in a more concentrated and intentional way at Workiva.

00:14:12

More systemic, maybe, is a good way to say it. So we were really happy to be able to pioneer that and proof-of-concept it for the company. The next was quality. Quality did improve. This was in fits and starts, or in spurts maybe, as a way to put it. We were working with individual teams, but those individual teams came away with standards and best practices, again, that they hadn't necessarily been exposed to, and that left a mark on them, as well as a bunch of what we called findings, which were essentially deviations from those standards and best practices that they were able to take away from our assessment and go to work on. Maybe most notably, one of the teams that we worked with, which was in charge of our unified messaging bus, took every single ticket that we made. And understand, this was sort of a gold-standard checklist, right?

00:15:10

If you could put together every possible practice that would impact quality in a good way, that's what this represented. So for them to essentially close out every single ticket that we created for them, whether it was a mitigation or actually performing the upgrade that was prescribed, was pretty incredible. And the results were pretty incredible, so much so that it doesn't even make sense to talk about it in terms of a percentage reduction in incidents, or SEVs, or whatever you might call them. We call them SEVs. In the time since they did that, they've had like two minors. That's kind of what it comes down to. Before that, quite a bit more. So the efficacy of this at the team level, the impact that it had on quality, was pretty big.

00:16:03

That was a big win for us. And we didn't become the inquisition. That was maybe the biggest win, because we really didn't want to become the inquisition. You know, all the teams thanked us. Some of them even meant it. Think about that for a moment. We were an interruption for them. We were an inconvenience forced upon them by executive leadership that had concerns about their product quality. In some ways, our very presence was an indictment that indicated a lack of trust from the company. That's maybe hyperbolic, but you could see how, if you were put in their shoes, that might be how you would interpret it. But the formula for making friends is actually pretty simple: do something nice for them, then keep doing something nice for them. Eventually, if they have a functioning conscience and they're not totally insane, we sort of build up a debt in the face of that kind of imbalance.

00:16:57

And eventually we start trading back and forth, and that's how friendships are often born. So for us, we made friends. This whole process was about making friends and then asking our friends to increase our product quality, which was actually super easy. It was something they wanted to do anyway; it was really just showing them how. That was one of the biggest wins for us. But to go to failures now. This, I'll be honest, was a very difficult portion of the talk to write. I went through lots of written versions of it and never really stumbled on something that I was super happy with. It's hard to talk about. Failure is not an attractive color on anyone. It doesn't make anyone look good. But I think we're being honest, and if we're trying to fulfill the spirit of what this conference, and conferences in general, are about, it's about sharing what we learned, and we have to learn from our failures too.

00:18:02

The lesson we learned, I think most succinctly, was that we weren't enough. Our team was not enough. We had done what we could within our sphere of influence, but these were four-week engagements. An assessment was a four-week engagement where we would go to a team, we would spin up on their tech stack, we would have a dialogue with them about what they needed help with, we would run the assessment, the actual checklist, a binary yes/no checklist of "are you doing this, are you doing that" at the product level, and we would go through chaos planning and experimentation with them. It was a heavy thing, and the teams we went through it with, I hope, kind of left it changed. But we have dozens of product teams at Workiva. We have hundreds of services. So there was really never a way that we were going to be able to get around to everybody.

00:18:57

We were playing whack-a-mole with quality. And I think the lesson there for us was that quality is not a thing that you do. It's a stance that you take with your product. We could probably get into a whole tangent about what that means. The short version is that quality is not always a more-is-better thing. There's an appropriate level for your product based on your industry, the product itself, your customers. But the reason we got into this, the whole circumstances of why our team was even necessary in the first place, was not something that we really addressed. Our stance, from a company standpoint, was off, as evidenced by the need for our team, and we were going around applying patches and fixes to specific teams. We were swimming upstream.

00:19:57

We got really good at swimming upstream, but ultimately our team decided to disband. That was kind of a tough pill to swallow. It felt like a failure. But what we did with that, in sort of our estate planning, if you will, was to try and take the things that we had proof-of-concepted, the things that we had validated: chaos testing, standards work, and sort of general quality consulting, and get those baked into the processes of the company in ways that would outlive us. The interesting thing about that is that had we not gone through the exercise of working with these teams so closely, had we simply tried to rush straight to that point, I don't know that we would've had the influence, the relationships, the friendships, the political capital, whatever you want to call it, to actually get that done.

00:20:55

And this kind of leads me to the last point. So this is the last slide, and I'll leave you with a few thoughts here. If you'll allow me a moment of self-referential indulgence, I gave another conference talk where I stated that the DevOps values are empathy, collaboration, and automation. And I know that automation is a thing we value, not a value in and of itself, but I can confidently assert, having gone through this experience, that those values are still the right values, especially if you are looking to go through your own DevOps transformation. You know, we could've kicked in doors and brandished our badges and demanded cooperation from these teams, but I'm more convinced than ever that had we done that, the result would have been short-term begrudging compliance and a legacy of resentment.

00:21:49

The advice that I give to new managers is this: I always come to work with the assumption that nobody has to do their job. And I think that's true in any kind of big transformation, maybe even more so than in our day-to-day jobs. I work with people who would probably tell me this is a generational thing, but I would contend that people of any generation and all backgrounds will be more successful and more productive if they are aligned in terms of their desire, if they understand the goal that we're trying to achieve, and are not just acting out of a sense of duty. So ultimately, the question of making a DevOps transformation has almost nothing to do with the specifics of DevOps and almost everything to do with change management itself within an organization. So when it comes to your own transformation, I can't tell you exactly how to do it. Your company's internal culture and political structures will look different and require different solutions and different strategies. Hell, I can't even tell you that I completely successfully enacted a DevOps transformation at the company I work for.

00:23:00

But what I can tell you is this: I can tell you with confidence that to have any measure of success, you will need to embrace those values of empathy and collaboration, and you'll need to make some friends. The kinds of friends that will still be friends even after you tell them hard truths. Truths like: your baby is ugly. So thanks for listening. That's our talk, and we'll be around for questions.