DevOps and Lean in Legacy Environments (San Francisco 2014)

Startups are continually evangelizing DevOps to be able to reduce risk, hasten feedback and deploy 1000’s of times a day. But what about the rest of the world that comes from Waterfall, Mainframes, Long Release Cycles and Risk Aversion? Learn how one company went from 480 day lead times and 6 month releases to 3 month releases with high levels of automation and increased quality across disparate legacy environments. We will discuss how Optimizing People & Organizations, Increasing the Rate of Learning, Deploying Innovative Tools and Lean System Thinking can help large scale enterprises increase throughput while decreasing cost and risk.

Scott Prugh

Chief Architect & VP Software Development, CSG

TRANSCRIPT

00:00:08

I'm pretty excited to be here. We had some great conversations last night, and had some great conversations with Gene, about really how to change organizations, especially legacy organizations. And that's what I'm going to discuss today. So my background is really in software architecture. I usually draw things on the wall and then I leave, and teams build it, and I come back a year later. No, I mean, most of my background is really building large-scale systems, and I've been doing that for a long time and I'm pretty excited about that. But I started running into organizational problems of being able to actually deliver the great software that I was building. And it was really kind of a lot of silos that we saw across a lot of groups, and we needed to kind of defeat those problems. So I'm going to talk today about how we've applied really kind of lean thinking and DevOps techniques in a large organization.

00:00:55

And over the years we really kind of have documented 10 of these techniques that have helped us really get there. And I'm hoping that you guys can take those home. So really quick on CSG, who we are. And I kind of divided this into two sections. On the left, basically we are a large outsourced customer care and billing operation, and we basically have 50 million subscribers in the US across 120 customers that you may recognize, like Comcast, Time Warner, Charter, Dish Network, and other cable subscribers. We have over a hundred thousand call center seats deployed. And we have billions of transactions that run through this application platform every month. We have about 40 development teams and about a thousand practitioners that we're going to talk about, that we executed this optimization through. The product stack is called ACP, and it unfortunately runs across about 20 different technologies, everything from JavaScript to high-level assembly language on the mainframe.

00:01:49

So it's extremely challenging, because we also deliver this as an integrated suite of 50-plus applications that all actually have to work together. So our challenges have been really kind of quality, time to market, you know, release impact. We've got technology stovepipes, and we've got role stovepipes across these groups. The other side of the business, on the right-hand side, is our print and mail factory. It's a high-performance lean organization that prints over 70 million statements per month. And their challenge is really continuously optimizing that business. So for folks that have read The Phoenix Project, you'll see this is kind of eerily similar to the scenario there, where basically that print and mail factory is like our MRP-8. That's basically our manufacturing platform that we're lucky enough to have, and we can actually go look at their processes and basically really compare and contrast those to how IT work is actually managed.

00:02:44

So just to kind of show the improvements that we've seen, we'll start with those. So we used to have 28-week releases, and in 2013, basically, we had a release impact score, which we calculate by assigning a critical a four, a high a three, a medium a two, and a low a one, of basically 458. And when we would release our software with this 28-week batch, it would take us 10 days to recover. In 2014, we basically reduced our batch size to 14 weeks, and we reduced the release impact to 153, with two days of recovery time. So you'll see there that we had basically a 3x improvement in the impact to our customers, and reduced the impact duration to one fifth. And so that's significant. You can imagine how frustrating it is to put a release of software out there and have 10 days to recover with all those incidents.
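
A rough sketch of how a weighted release impact score like the one described here could be computed. The weights four, three, and one for critical, high, and low come from the talk; the name "medium" for the level weighted two, the function name, and the incident counts are assumptions made up for illustration, not CSG's actual data or tooling.

```python
# Severity weights as described in the talk. The label "medium" for the
# level weighted 2 is an assumption; the talk does not name it clearly.
SEVERITY_WEIGHTS = {"critical": 4, "high": 3, "medium": 2, "low": 1}


def release_impact_score(incident_counts):
    """Sum each severity's incident count multiplied by its weight."""
    return sum(SEVERITY_WEIGHTS[sev] * n for sev, n in incident_counts.items())


# Hypothetical incident counts for one release, invented for illustration.
example_release = {"critical": 2, "high": 5, "medium": 10, "low": 7}
score = release_impact_score(example_release)  # 2*4 + 5*3 + 10*2 + 7*1

# Ratios taken from the figures quoted in the talk.
impact_ratio = 458 / 153   # 2013 score over 2014 score, roughly 3x
recovery_ratio = 2 / 10    # recovery time fell from 10 days to 2 days
```

With these made-up counts the score is 50. The quoted scores give a ratio of just under three, and the recovery time drops to one fifth of what it was.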

00:03:34

So we immediately kind of saw improvement, right. There's also some additional information from a metric perspective that we've actually seen with this. So the two columns on the left represent what we saw in the previous slide. Now the next column, basically further to the right, is the exact same code that we deploy a second time, with fixes in it, to a second customer a few weeks later. And you'll see the dramatic improvement there, where our impact score goes to a two. And now we're talking about the order of magnitude improvement in quality that Gene talked about, which we've actually seen. Now, what that indicates to us is that we're actually not getting quite enough practice. We need a little bit more. And then the final column on the right-hand side is a four-week batch. That four-week batch is another product set, but again, you'll see that order of magnitude improvement from having that smaller batch size and the agility of releasing that code to production.

00:04:24

So if we kind of rewind and just think about 2010, when we looked at this, when we wanted to improve the quality and the time to market, we saw out there that we had all these system constraints. We had structure problems, we had stovepipes and handoffs. We had technology variance. We had silos of technology, as the Target folks talked about, defects and quality issues, low automation and fragility. And so we really started on this three-year journey then, and along the way documented these techniques that we found were valuable to actually help change the organization and really change how we were going to deliver value faster. And we'll dive into all those right now. So the first one, and the most important: Heather from Target talked about really talent management and people. That is the most vital.

00:05:11

And that's why we actually start with that one. So we realized that we needed to build a culture of learning, change the way people thought about self-improvement, and inject lean thinking into the environment. So we found a lean framework, and this framework we've adopted is called SAFe, or the Scaled Agile Framework. You may have heard about it. And we used this and trained 2200 people in our organization in lean thinking techniques. You'll see the pillars there: respect for people, incredibly important, people do all the work; product development flow, which is based on Don Reinertsen's work on how you flow value through product organizations; and finally kaizen, basically continuous improvement. It's built on a foundation of leadership and has a goal to basically deliver continuously at a sustainable pace. The other thing we realized we needed to do was encourage cross-training, and we use several techniques for that: taking I-shaped resources, encouraging them, and cross-training them to move to T-shaped resources, and in effect increase their response repertoire, which Kevin Behr quotes, basically, in his presentation about coal mines. That is, we increase their ability to react to change and basically take on different types of work.

00:06:25

And finally, we encourage people to move to E-shaped resources, where they build up the expertise and experience. And then finally we want them actually to explore new areas of work. The second technique is called the inverse Taylor maneuver. So Frederick Winslow Taylor's principles of scientific management, basically, really helped revolutionize certain areas of manufacturing. Unfortunately, they trickled pretty deep into large organizations. And prior to 2010, we had organizations that looked like this: extremely role-stovepiped, different roles. And they were really all optimizing for specific roles and communicating through large handoffs of documents and code. For example, you know, we dumped code over the wall to operations, and they suffered getting that at the end of a large kind of batch handoff. So basically, you know, what we knew from that is that the structure and the responsibility enforced the behavior, but it also prevented learning, because these role-specific groups did not learn the entire business process flow.

00:07:25

So really kind of a principle of the agile movement was to create those cross-functional teams. And we basically went and reorganized our groups so we had those cross-functional teams, feature teams, in those areas that could deliver, for the most part, entire features. And this structure basically removes queues. So if you know anything about lean, queues are extremely problematic. And it also incents the teams to learn at a faster rate. So we basically organized those teams to optimize the entire flow of value. The third technique is the inverse Conway maneuver. So Melvin Conway, in the sixties, developed Conway's law, which really states that your software architecture will generally represent the structure of your organization or teams. He found that four teams create a four-pass compiler. In many organizations, and ours in particular, we found that the team structures really enforced the technology and the architecture of the software that we had.

00:08:23

So in our case, basically, we had a traditional fat client-server desktop. We had legacy middleware. We've got just about every one; this is an example of one that we've got. And then we had a standard SOA architecture that we're building our API strategy around, very similar to how Target was building an API strategy. And we really favored that architecture, you know, really kind of centralizing the business logic in one place, operationalizing it one way, but we couldn't get there because we actually had all these stovepipes of teams that were basically building different things in different ways. So in effect, we inverted that and basically changed the structure and used kind of Conway's law against itself to say, look, we're going to structure the teams so that they provide a standard API strategy for all the applications. You'll also note on Martin Fowler's technology radar in July that he actually quotes this inverse Conway maneuver.

00:09:12

We've used it for many years, but that was the first time we've actually seen it in writing. So I adopted it here and credited it to him. So the fourth technique is shared service continuous delivery. So one of the problems with 40 agile teams is that they will actually produce work really quickly, and then they will actually pave 40 different ways to production. So one of the things that we wanted to do is make sure that we provided predictability for the downstream teams. So we basically created a shared service delivery set of teams that provided the common infrastructure for all the teams to consume. And we use many of the same tools that Target does. We use Jenkins for the continuous builds. We use basically the Atlassian stack, really, so the teams can communicate. We use Git and Subversion for all that infrastructure. But we have that common platform, and then we also try to make it as self-service as possible.

00:10:09

So it doesn't create a massive bottleneck for the 40 different teams within the organization that are consuming it. So the fifth technique is environment congruency. So one of the things that we saw when we kind of looked out across the teams was that we had development teams carrying out operational roles in development environments and getting to practice doing that every single day. And then two to four times a year, we would basically hand off the code to production teams that had really never practiced that deployment, and that created high batch transfers and high failure rates. So we kind of looked at that and said, well, this is a little bit crazy. And I kind of credit this to one of my colleagues, Steve Barr. He said, well, it's like, basically, we've got a game team and a practice team, and the game team never gets to practice, right?

00:10:58

So we kind of looked at that and said, well, sports teams never do that. They don't do that, right. They actually have the team practice every single day. And we basically looked at it and said, well, we need to fix it. So we basically created this concept of shared operations teams, where we have the exact same team do the deployments in every single environment, and they use similar environments to production. And so in our case, they actually get to practice about 70 times before release day. So with about 14 weeks, you basically get to practice the deployment five times a week. So by the time they get to production day, they understand the system, they understand what's coming, and they've practiced doing it 70 times. And basically now they have very low impact, and they're very successful at their deployments on production day, when we need to roll the software out.
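
The idea of one team rehearsing the same deployment in every environment can be sketched as a single deploy routine that takes the environment as a parameter. The 14 weeks and five practice deployments per week are the figures from the talk; the `Environment` class, the `deploy` function, and the build names are hypothetical stand-ins, not CSG's tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    """Hypothetical stand-in for a deployment target."""
    name: str
    deployed_versions: list = field(default_factory=list)


def deploy(env, version):
    """One deployment routine, identical for every environment."""
    env.deployed_versions.append(version)
    return f"{version} deployed to {env.name}"


WEEKS_PER_RELEASE = 14          # release cycle length quoted in the talk
PRACTICE_DEPLOYS_PER_WEEK = 5   # practice cadence quoted in the talk

staging = Environment("staging")        # assumed production-like environment
production = Environment("production")

# Rehearse with the same routine throughout the release cycle.
for n in range(WEEKS_PER_RELEASE * PRACTICE_DEPLOYS_PER_WEEK):
    deploy(staging, f"build-{n + 1}")

# Release day goes through the identical code path.
result = deploy(production, "build-70")
practice_runs = len(staging.deployed_versions)
```

Because production is just another argument to `deploy`, release day exercises nothing that has not already run 70 times.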

00:11:45

And you'll see at the top Jez Humble's quote, you know, if it hurts, actually do it more. So one of the things we found is that the skepticism in the beginning is: there's no way that we can take this on, right, operations teams are so busy. But because you start to do this every day, you automate all the infrastructure and the components that before were extremely painful and that were done at 2:00 AM in the morning, you know, when people were struggling to get the software working. You end up automating all of that. The sixth technique is application telemetry. And this one is one of my favorites, because it's actually very surprising to me what an improvement application telemetry can make. And I think it's often surprising to executives and other individuals why it's important to make an investment in this.

00:12:31

So if NASA launches a rocket, you know, they have millions of sensors that tell you what is going on with that rocket: temperature, altitude, direction. If there's a failure, they record it, right. And it's a very expensive piece of equipment. So they basically, you know, have invested in the telemetry to understand what is going on with that rocket. Unfortunately, with software we don't seem to take the same care. You know, we write distributed applications that run across a thousand nodes, and we use console write-lines to standard out to write the errors. And then we expect ops teams to go grep through thousands of files to go find the problem. And we wonder why they can't find the problem, the development teams can't help, and then it takes forever to actually fix the problem when something breaks.

00:13:17

So what we do is we actually build and embed very deep telemetry into all pieces of our application. So all of our process spaces are instrumented to basically collect trace and activity information. And it's sent in real time to a repository that we call StatHub, which sits on top of Elasticsearch. And basically we collect a billion events per day from all the process spaces that are running in that environment. We've instrumented over a hundred thousand locations in our software, so that we now understand distributed calls, database calls, writes to the file system. And when there are failures in those, we can see those in real time on a dashboard. So on deployment day, when we push things out, we can look at that pane of glass, and if we start to see red, we know we've actually created a problem.
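
As a minimal sketch of the contrast between console output and embedded telemetry, the example below wraps a function so that every call emits a structured event with a location name, status, duration, and, on failure, the stack trace. Everything here is an assumption for illustration: the `instrumented` decorator, the in-memory `EVENT_SINK` (standing in for a real-time event repository), and the `billing.parse_amount` location are invented and do not represent the actual instrumentation described in the talk.

```python
import functools
import json
import time
import traceback

EVENT_SINK = []  # hypothetical stand-in for a real-time event repository


def emit(event):
    """Record one structured event; a real system would ship it off-host."""
    EVENT_SINK.append(json.dumps(event))


def instrumented(location):
    """Decorator that emits a telemetry event for each call at `location`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            event = {"location": location, "status": "ok"}
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                event["status"] = "error"
                event["error"] = repr(exc)
                event["stack"] = traceback.format_exc()
                raise
            finally:
                event["duration_ms"] = (time.perf_counter() - start) * 1000
                emit(event)
        return wrapper
    return decorator


@instrumented("billing.parse_amount")  # hypothetical location name
def parse_amount(text):
    return float(text)


parse_amount("12.50")
try:
    parse_amount("twelve")  # simulates a parsing bug reaching production
except ValueError:
    pass

events = [json.loads(e) for e in EVENT_SINK]
```

The failing call still raises to its caller, but the error event, with its location and stack trace, is already recorded where a dashboard could query it rather than buried in a log file on one of a thousand nodes.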

00:14:07

We can click, we can get the stack trace. We can then go look back in the code, and we can go figure out what went wrong. Did a connection fail? Did the disk fill up? Did a developer inject a parsing bug somewhere in the code? And we can see those things in real time. Extremely powerful. And now the teams also learn more about how to make the application better. The seventh and eighth techniques are really around work visibility. So you'll find in a lot of organizations, and this is what they found in The Phoenix Project, that work gets injected from all different places. There's emails, there's phone calls, people walk up to your desk, there's IMs: hey, I need to get something done, right? And they go directly to the workers for that. That creates kind of chaos and context switching.

00:14:54

And then what happens is you get other workers that then get blocked, or aren't busy in the environment. So you don't get really great utilization. Really, you need to fix that. And one of the things that we've been doing, really for the last couple of years, and it's really accelerated in the last year, is getting a handle on that across all teams and making sure that we understand where the work is coming from. So you need to create an intake buffer. You need to basically create a way where all your work goes kind of through one process flow and into one set of tools. I mentioned the Atlassian stack; we use JIRA, but there's lots of great tools to manage this with. But it's not spreadsheets under people's desks. It's not napkins, and it's not phone calls, to actually process that work.

00:15:36

And you see that in a lot of places. Once you have that one list of work that you can basically triage and have unified visibility of, then you can start doing things like WIP management: limit the amount of work that you inject into your system. Because as you drive the amount of work up, as we saw in The Phoenix Project, basically things slow down and wait time increases. So you really have to adjust that WIP limit and also release the work into the environment in a predictable way. And in manufacturing, this is what they call job and materials release. And we'll look at some pictures of that in a second that really kind of illustrate more of how it works. But these types of techniques now allow you to put predictability into your work stream. So I mentioned the print center.
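
The combination of a single intake buffer and a WIP limit can be sketched as a queue that only releases work while capacity is available. The `WorkSystem` class, its method names, and the request labels are hypothetical and chosen for illustration; a real team would do this in its tracking tool rather than in code like this.

```python
from collections import deque


class WorkSystem:
    """Hypothetical model of one intake buffer feeding WIP-limited work."""

    def __init__(self, wip_limit):
        self.wip_limit = wip_limit
        self.intake = deque()      # every request lands here first
        self.in_progress = set()

    def submit(self, item):
        """All work, however it arrives, goes into the one intake buffer."""
        self.intake.append(item)

    def release_work(self):
        """Release queued work in order, but only up to the WIP limit."""
        released = []
        while self.intake and len(self.in_progress) < self.wip_limit:
            item = self.intake.popleft()
            self.in_progress.add(item)
            released.append(item)
        return released

    def finish(self, item):
        self.in_progress.remove(item)


system = WorkSystem(wip_limit=2)
for request in ["email request", "phone request", "desk request"]:
    system.submit(request)

first_release = system.release_work()   # only two items fit under the limit
system.finish("email request")
second_release = system.release_work()  # the third item is released now
```

The third request waits in the buffer, visible to everyone, until one of the first two finishes, instead of interrupting a worker directly.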

00:16:17

So this is a work visibility example from the print center. And this is standing in what's called row one or aisle one. Each one of those carts that are there represents a job that's going to be put into the print stream. And on those carts are proxies for all of the materials that will be needed to process that job. And they don't put one of these carts into the system until they understand that they have the capacity and all the resources to handle it. So this gives them that predictable job and materials release. So we take our IT managers here, and it's really great. We put them in aisle one and we're like, hey, how do you do this in IT? How do you release work? And the first couple of times they're like, we don't know, we're not sure, like, we just tell people to go.

00:17:02

And I'm like, well, then how do you know you have the right operational resources? How do you know that you have the other infrastructure, right? So it really kind of starts to click for them how important that is, and how, in some cases, manufacturing is very similar to IT. There are other cases that don't correlate well, but this is one that does. So what we ask people is: do you know where your work comes from, basically, and how it is scheduled and how it is released? So the second example is a new robot that we just got installed. And it's kind of a kitschy example, but I thought it was pretty cool because it reminded me of automated deployments. So I thought I'd show it. But again, this is from our print center. And you know, we have the luxury of actually being able to go look at these things and then actually translate them to the IT world. So if you play the video,

00:17:53

A robot that's actually auto-sorting mail to put it on a pallet for the US Postal Service to pick up. We used to have people do this 24/7: bend over, pick up boxes, sort them, get them on the right pallet. Now we have a robot that actually does the work perfectly, right. It never fails. It never needs to sleep, right. And so the thing that, you know, I ask people from this is: how do you get your code into production? Do you have someone up in the middle of the night deploying code and making mistakes, or do you have automated robots actually deploying that code the same way every single time? And again, this was a kind of great example that you can kind of point to, and look at, you know, the physicality in the print center and really, you know, translate that to IT.

00:18:39

So I thought it was another good example. The ninth technique is cadence and synchronization. So the example that I've got here is a picture of a bus stop. I'm actually standing at the corner waiting for the bus, and there's five buses lined up, and I'm sitting there kind of giggling a little bit, taking a picture, because I think it's a great example of what failure to have cadence and synchronization does. This is actually called the bus bunching problem. And so I also then took a screenshot of the CTA bus tracker site, which actually shows they have the data that this problem is occurring. They see that there's five buses bunched up in the same place, but they don't know how to fix it, right. So the end result is all those buses end up late. The first one has a whole bunch of people on it.

00:19:25

The one in the back has no people on it, right. And it goes down the street, right? So they see this happening. They've got the data, and they're not doing anything about it, right? So the thing to take from this is: in IT, and in work processes, if you don't inject cadence and synchronization, nature will do it for you. And it will do this exact same thing. All of your projects will collide and basically cause a problem in the system. In other words, all projects will probably be late or have low quality. So what we do to manage this is we inject cadence and synchronization. We basically line up unpredictable events, like release planning, to occur at the same time across all teams. So you basically have 40 teams that plan their dependencies at the same time.

00:20:10

And then inside a program increment, we have sub-harmonic, basically, iterations or sprints, where every two weeks they continue to plan and resynchronize. And then you have a major push, a major pull event into production. And then you replan again, and you keep doing that. That basically gives you predictable events to replan for problems that occur in the system. And humans like predictability. They like to look out and they like to know the date: hey, I know that in exactly 14 weeks I'm going to redo this planning, so let's get all these stakeholders together, the architects, all of them together, to actually plan out that next release. The other thing that allows you to do is it gives you boundaries to manage new work injection into the system. So you don't want new work coming in every single day, right? Because, you know, this unplanned work actually would create chaos for your teams if you're sending them new stuff every single day. So this gives us those boundaries to manage that on. And this is again how we use this cadence and synchronization to prevent problems like the bus bunching problem, of all teams colliding and things occurring late.
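
The cadence described here, a 14-week program increment divided into two-week iterations with release planning at each increment boundary, can be laid out as a simple calendar. The 14-week and two-week figures are from the talk; the `program_increment` function and the start date are hypothetical, chosen only so the example runs.

```python
from datetime import date, timedelta


def program_increment(start, weeks=14, iteration_weeks=2):
    """Return (number, first_day, last_day) for each iteration in the increment."""
    if weeks % iteration_weeks:
        raise ValueError("increment must be a whole number of iterations")
    length = timedelta(weeks=iteration_weeks)
    iterations = []
    for i in range(weeks // iteration_weeks):
        begin = start + i * length
        iterations.append((i + 1, begin, begin + length - timedelta(days=1)))
    return iterations


# Hypothetical start date (a Monday), used only for illustration.
schedule = program_increment(date(2014, 1, 6))

# Every team shares these boundaries, so planning events line up.
next_release_planning = schedule[-1][2] + timedelta(days=1)
```

With these assumptions the increment holds seven iterations, and the next release planning event falls on a date every team can know 14 weeks in advance.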

00:21:19

So the 10th, really kind of the final one of all this, is reducing batch size. So, you know, we knew, and everyone kind of knows from manufacturing, that reducing batch size can have significant effects. But as we mentioned, we had structures which prevented us from doing it. We couldn't have just woken up on January 1st in 2014 and said, hey, we'd like to do 14-week releases. We knew that the constraints in the system, the fragility, would have caused a disaster. But we put learning in place. We put infrastructure automation in place, we put processes in place. And then after doing all that, we were able to reduce batch size. So we basically took those 28 weeks, right, which were 14 iterations, and we reduced that to 14 weeks, smaller program increments. And then from that, we get smaller and fewer things going through the system, and they go through faster, and they have significantly less impact and higher quality, right?

00:22:13

Once we've done all these things to actually reduce that batch size. So kind of a summary of the metrics, you know, where we were before this: we had this, you know, 28-week batch size, we had high impact of 458, and it took us 10-plus days to recover. We had a lot of irate customers from that. Then we cranked things down to 14 weeks after all these improvements in 2014, and reduced our impact to 153, which is basically a 3x improvement over where we were. And it takes about two days to recover, which is a much shorter time to recover from putting a release in production. Where we want to be is probably somewhere between those last two. We know, with the extra practice that we're getting after that first deployment, that we can do a lot better. We could basically do two orders of magnitude better.

00:23:04

So we really want to get to that type of impact. We know on smaller products with that smaller batch size that we see a similar type of impact at four weeks. We do believe that somewhere in between, probably eight to 10 weeks, is the right sweet spot for us. If we release every four weeks, our customers probably can't consume those features that quickly. So it's probably an eight-to-10-week release cycle that we actually want to get to, with that lower impact and that higher quality. So to summarize these techniques: accelerating learning, very important, you know, really lean and systems thinking; the inverse Taylor maneuver and inverse Conway maneuver to change structure and technology; a shared service continuous delivery platform to provide kind of consistent delivery resources and tools across all the teams; providing environment congruency and practice for the teams that actually do the deployments; application telemetry, so that your teams can learn how the application behaves in production.

00:24:07

And so that they can actually now decrease the time to recovery and continue to make the system better; visualizing your work and creating work release and WIP limits; then applying cadence and synchronization to really kind of line up these unpredictable events in large enterprises; and then finally reducing that batch size. So a few credits that I've got here on the slide; you know, there's a lot of work that was pulled from for this. And then kind of the final questions, of how you can help me, or things that, you know, we struggle with. One is really kind of standardizing these applications at scale. Another is really proving the business case. We've seen great results, but we continue to struggle to prove out those cases, and trying to make progress quickly on that is a hard thing. You've got tons of legacy applications; trying to clean those up, continue to standardize those, and make the operations of those things a lot smoother. And then also balancing standards with innovation. You know, one of the challenges of having standards is it does stifle innovation a bit, because you don't have all your teams running out creating new things. And we continue to struggle with that in our enterprise, where we have teams that want to innovate, and they should, right. We increase their learning. We are asking them to explore things. But all of a sudden now they're injecting all these new tools, and then that creates a bit of a problem for us to actually manage. And that's it.