San Francisco 2016

When Ops Swallows Dev

CSG has been on an Agile and Lean journey to continually shorten feedback loops in its SDLC and operations processes. This began with moving from waterfall to agile and deploying cross-functional dev teams.


Today, we have taken this transformation further by deploying cross-functional product delivery teams that Design, Build, Test, and Run their products. Join us to discover the things that went as expected and the surprises we discovered on this journey.


Scott Prugh

Chief Architect & VP Software Development & Operations, CSG


Erica Morrison

Director, Software Development, CSG

Transcript

00:00:02

Pretty excited to be back here at DevOps Enterprise. This is our third time presenting, but it is the first time that I have official operational duties, so I have a little bit of a different perspective this year, a little bit of a scarier one. It's been kind of interesting. The title today is "When Ops Swallows Dev." About six or seven months ago, we went through a set of organizational changes to bring together disparate development and operations teams. In doing that, it has really felt like we've been doing more ops than dev as we work through the cultural and technical changes, and through a lot of the technical debt, as the teams form.

00:00:47

I'm going to go through how we accelerated feedback and learning to enable understanding, accountability, and engineering on the teams, and then walk through the different org archetypes we've journeyed through from 2012 to 2016, ending up with these service delivery teams. Then Erica will take us through what we learned, and we'll sum up quickly with what's next for us and what we're looking at next. Really quick: to the CSG folks, there are 14 of you here, so thank you so much for coming, thanks for the support, and really, thanks for your leadership. What CSG does, very quickly: we are a SaaS-based billing and customer care provider for the cable industry. We serve about 50 million subscribers, a very large operation, with about 20 technology stacks, everything from JavaScript to high-level assembly on the mainframe.

00:01:40

We deliver it as an integrated suite four times a year. We have the traditional challenges of a large legacy organization, so a big focus now is operational quality. On the right, we also run the largest print and mail factory in the U.S.; we turn out about 70 million statements, very lean and efficient. Below, you'll see a bunch of our customers: Comcast, Time Warner, Charter, DISH. If you're a customer of any of those cable and satellite companies, you're a customer of ours. We take our delivery very seriously, because people won't get service if there's a problem with our infrastructure. So I've had this picture in my head for a couple of years, and I needed to get it out. It's really a picture of what development and operations have traditionally looked like.

00:02:29

On the left you have development, on the right you have operations, and the leaders are basically yelling at each other. They don't agree about many things except that change stinks: we put change in and it blows up the environment. The second thing they agree on is that the path to production is a pretty precarious one, and it's usually borne on the backs of others, your change managers, your release managers, your production operations, and your PMO, while customers really want high-quality features quickly. It doesn't have to be this way. With DevOps, your company can win, your customers can win, and the path to production can be a lot smoother.

00:03:11

All right. So, setting a little more context here. I pulled this from Elisabeth Hendrickson's presentation last year, and she really hit on two things I thought were fantastic. There was some great stuff there, so I encourage folks to go back and watch it. The first thing she pulled out that I thought was great was the concept of courage. She talked about how important it is to experiment with team boundaries and re-forming teams. She also told us about her head of QA, whose first act was to basically get rid of the QA organization. That took a lot of courage. They had realized that having dedicated QA was creating a target-rich environment for defects and was de-optimizing value in the system. The second thing she covered was the concept of feedback, and I've got a picture of it here in the lower left.

00:04:04

You have the traditional PDCA loop, and the faster you go through that loop, the faster you learn. Above it, we have the different types of tests that provide feedback about quality, everything from unit tests to exploratory tests to manual regression tests, and all the way at the bottom you finally end up with customer opinion. To the right, we have those loops spread out over time, and latency in those loops means that you're learning more slowly across your teams or your organization. We're always trying to move that latency from days to hours to minutes, as fast as possible. There's one loop here that wasn't in the original presentation, which I added in, and that's the concept of operational quality and feedback. When we talk about DevOps, obviously operational quality is a key component.

00:04:58

Getting feedback about how your software is going to run in production is extremely important. I'll use this concept of latencies and loops as we talk about the different organizational archetypes and how they affect learning. All right, so we'll rewind to 2012. This is the traditional functional archetype. I'm hoping that a lot of companies don't have this structure in place anymore, but this is what CSG looked like pre-2012: traditional functional organizations, everything from requirements, design, development, and testing to ops. You even have situations, and we still have some of this today, where there are organizations inside those organizations. For example, in operations you have storage, network, SAs, and DBAs broken out separately. At the top you have the concept of queues, and that's how you pass work from organization to organization.

00:05:55

At the bottom, there are a few things about queues. There are really two works that illustrated the concept of queues and their problems for us. The first was Craig Larman's Scaling Lean & Agile Development. The second, of course, is The Phoenix Project, which illustrated the formula that wait time is percent busy over percent idle. So as you load these organizations up, things slow down exponentially; an organization loaded to one hundred percent takes almost forever to get something out. What happens with these queues is that they obviously delay feedback and learning, but the even more evil thing is what they do to your resources. They create what are considered I-shaped resources: resources that learn one role very narrowly and don't learn across your business processes. So if we look at our feedback loops, we have them spread out across all of these organizations and handoffs, and you can see how something like operational quality feedback is skewed in latency. You really don't understand very well how your software is going to behave in production with this type of structure.
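To make that wait-time formula concrete, here is a small illustrative sketch (not from the talk) that evaluates wait time = percent busy / percent idle at a few made-up utilization levels.

```python
# Illustrative sketch of the Phoenix Project wait-time formula:
#   wait time (in arbitrary units) = percent busy / percent idle
# The utilization levels below are made up for illustration.

def relative_wait_time(percent_busy: float) -> float:
    """Return the relative wait time for a resource at the given utilization."""
    percent_idle = 100.0 - percent_busy
    if percent_idle <= 0:
        return float("inf")  # a fully loaded resource never gets to new work
    return percent_busy / percent_idle

for busy in (50, 80, 90, 95, 99):
    print(f"{busy}% busy -> relative wait time {relative_wait_time(busy):.1f}x")

# Output:
#   50% busy -> relative wait time 1.0x
#   80% busy -> relative wait time 4.0x
#   90% busy -> relative wait time 9.0x
#   95% busy -> relative wait time 19.0x
#   99% busy -> relative wait time 99.0x
```

The point of the curve is the one the speaker makes: as utilization approaches 100%, wait time for anything new in the queue grows without bound.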

00:06:58

Next is the agile archetype. In 2012 we went through a reorganization, and we got rid of dedicated QA and dedicated analysis and design, a lot of the dedicated organizations that were very role-specific, and we put in agile teams. I think a lot of folks have gone through and done this. We rolled out the SAFe framework, and that framework helped us put in things like cadence and synchronization. We reduced batch size, we got work visibility in place, and we added a lot of test automation. We started inverting Conway's law and Taylorism: some of the software architecture that was broken and was imposing team structures, and the role-specific pieces, we started pulling those out. We got our CI/CD delivery pipeline, what I call the V1 pipeline, in place.

00:07:53

This type of structure really started to help us produce what I consider little-T resources, resources that now start to learn a little more broadly across the SDLC pipeline. They learn how to design and build the software with automated tests, they learn how to design for testing as testers pair with developers, and using the CD pipeline we start to deliver things a lot faster. When you now look at these feedback loops, if these teams are executing in two-week cycles, they're getting through many more of those loops before they actually release the software. You'll still see that operational quality is still skewed out across a queue to another group. There were some things like shared operations teams that we used during this timeframe to help combat that, teams that would deploy production, development, and QA environments

00:08:47

exactly the same. So they got a lot of that practice, and that did help things quite a bit. These are the quality results from this; they're well known and previously published, and they also ended up in The DevOps Handbook. This looks at the number of incidents from left to right over time, from 2013 on the left through almost the current day. You can see that we made a big jump and have leveled off in the areas we improved. What we'd done in a release ended up being about a 10x quality improvement and about half the time to market to get things into production. As I said before, we do it four times a year, which seems about right for our customers, although we're considering whether we want to reduce that batch size again.

00:09:31

All right. So we now have this picture of the total system view of quality. This was taken as a snapshot in March of this year, and it's a look at incidents being resolved by groups in the environment, and the different types of incidents. The blue are those release incidents we talked about, and you can barely see them; the ratio is very small, which reflects the previous picture I showed. The orange are all the other incidents that are occurring. On the left we again have development, and on the right we have operations. The summary at the bottom is that 98% of these incidents are occurring outside of our release. So although we did well with the release, we're not doing that great a job outside of the release.

00:10:14

92% of these incidents are being dumped onto the operations teams, and the operations teams are having to fix them. So these folks are still yelling, and they're probably saying something like, "Your code is not very good," and, "Well, it worked in the other environments." But the reality is that the lack of feedback and the lack of understanding are creating this target-rich environment for these incidents to occur. There's a kind of blindness on the development teams: they don't really understand how their software is actually used in production and how their customers consume it, and that's creating these problems. So the development guy is hanging his head, saying, "I really don't know what to do now; my code really does cause problems in production." When we look at this a little further, and this is what was highlighted in the IT Revolution article that came out this week, we get to the market archetype.

00:11:11

We now call it service delivery teams. There was a panel question about this, basically: what do you do with DevOps and traditional IT organizations? The short answer is you probably have to change things. So this is either going to scare you a lot, or you're going to be very excited. You have the traditional development organization on the left with the different product areas. You have the dev teams, and the dev teams hand off to operations teams to run their software, and then those operations teams work with platform teams to get infrastructure. So you still have those handoffs. And we had the observations I stated: release quality was optimized, but system quality was not, with manual processes around hard-to-run software.

00:12:00

You just saw manual processes everywhere. You saw dev really lacking understanding and lacking feedback. The collaboration was just very unnatural; one of the things you see with teams in this type of structure is that it takes so much work to get them to collaborate: different meetings, different schedules, getting all that lined up. And they lack something called esprit de corps, that sense of shared mission, that they're really fighting for the same thing. And there's lots of context switching and work chaos. From this we developed a bunch of hypotheses. These different org goals, we felt, were working against the system-level goals. The lack of operational understanding was creating the hard-to-run software; it was creating that target-rich environment for incidents. The lack of shared mission really meant a lack of empathy.

00:12:47

Folks had a lack of empathy for the different roles, for how hard it was to install this stuff and how hard it was to make it work, and handoffs were causing that elongated lead time. Finally, the lack of engineering skills in the operations group prevented improvements and encouraged what I call duct-tape engineering: throw some binaries over the wall, then tape some monitoring and other stuff on top to make it work. So in March of 2016, I was asked to take over product operations for these groups. The first thing we did was spend a couple of days planning in a room, where I challenged my leaders and said: how do we get rid of dedicated operations? How do we combine this together to get teams that build and run their software?

00:13:33

The picture on the right is what we ended up with. We have these service delivery, or DevOps, teams, and the teams have all the resources on them. They still interface manually with the platform teams, and where we're going is to move more to infrastructure as code, which is the green line where we connect our CI pipeline, and get out of infrastructure as a task, which is the IaT abbreviation there. That's where we're headed. If we take the team-level view, it looks something like this: these service delivery teams build and run their software. You can see our feedback loops now: with teams that execute in two-week sprints, we have contained those feedback loops around the team itself. We're getting a lot more operational quality feedback on the teams, we get product operations feedback on a daily basis about problems and how hard it is to run the software, and the things we need to improve in support are coming in as well.
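As a side note, here is a minimal, hypothetical sketch of the "infrastructure as code, not infrastructure as a task" idea from a CI pipeline's point of view: desired state lives in version control, a plan step diffs it against current state, and an apply step replaces the manual request. The pool names, members, and apply_change stub are invented for illustration; they are not CSG's actual tooling or any vendor's API.

```python
# Hypothetical sketch of infrastructure-as-code driven from a CI pipeline.
# Pool names, members, and apply_change() are illustrative stand-ins.

desired_pools = {  # what the team declares in version control
    "billing-api": {"members": ["app01", "app02", "app03"], "port": 8443},
    "care-portal": {"members": ["web01", "web02"], "port": 443},
}

current_pools = {  # what is actually configured today
    "billing-api": {"members": ["app01", "app02"], "port": 8443},
    "care-portal": {"members": ["web01", "web02"], "port": 443},
}

def plan(desired: dict, current: dict) -> list:
    """Return the pools whose configuration needs to change."""
    return [
        (name, current.get(name), want)
        for name, want in desired.items()
        if current.get(name) != want
    ]

def apply_change(name: str, have: dict, want: dict) -> None:
    # A real pipeline step would call the platform or load balancer API here,
    # then run automated validation before marking the change successful.
    print(f"update {name}: {have} -> {want}")

if __name__ == "__main__":
    for name, have, want in plan(desired_pools, current_pools):
        apply_change(name, have, want)
```

The design point is simply that the request for infrastructure becomes a reviewable change to code rather than a ticket handed to another team.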

00:14:25

We're really enabling those role shapes to be more big-T, going across that entire life cycle. The summary on why to bring the teams together: the first reason is understanding. The team now really has the ability to understand, and once they understand, they can start to become accountable for the whole delivery train. No more RACI diagrams of crazy stuff that say you install it but someone else is in charge of the quality; they can be accountable for all of it. The next is engineering. The team can now inject and cross-train these engineering skills and principles into operations, and the team can evolve operations from being solely a process activity, with a ton of focus on process, which is good, to actually changing that process, automating, and engineering the software to make it smoother to run. We want to shift left and get rid of duct-tape engineering. The final benefits are communication, meetings, planning, collaboration, work visibility, and shared leadership vision, which are all key things you get from this structure. So now I'll turn it over to Erica, and she'll cover what we learned from these changes over the last couple of months.

00:15:35

Thanks, Scott. So when we started our reorganization journey, I thought I knew a lot about running DevOps teams. I already had multiple teams running in a DevOps fashion, where they owned the end-to-end development and operations environment. And so, although I have a development background, through this experience I thought I had a good understanding of the operations world. This is how I viewed DevOps: this tranquil best practice of how we do software development. However, after we got going with the reorg, it started to feel a lot more like this. The teams that I previously had running in a DevOps fashion were internal support teams, teams that did things like our production telemetry system and our internal build system, but the teams that I gained after the reorganization were more pure operations in nature. They were customer facing, interfacing with a lot of our other product operations teams.

00:16:25

It's been an extremely educational experience, both for me and for all the development teams that have gone through this journey. It's given us a whole new perspective on the operations world. However, it's also been extremely challenging, with a number of bumps along the way. We've had production outages. We've felt like we were thrown into a world of chaos. Those of you from the operations side of things can attest to the fact that operations is more disruptive in nature; there's lots of unplanned work, and when you don't have the constructs to deal with that unplanned work and reduce its occurrence, it can feel very chaotic. So it's really solidified a lot of the DevOps principles for us, as we've seen what it's like to be in the operations world without those principles in place.

00:17:10

The good news is we've got a framework and methodologies to deal with these challenges. However, it's not an overnight thing; we've really had to dig in, roll up our sleeves, and start attacking these challenges. So I'd like to talk about my experience with one team in particular, the team that manages our network load balancer, and then extend some of our lessons to our larger organization. Before I get going with that, though, I thought I'd share a few thoughts from some coworkers of mine on the presentation itself that kind of frame the year for you. One suggestion for a title was "When a Dev Manager Takes Over the Network Load Balancer: The Horrors and Realizations That Follow." Another suggestion was "Pizza, Beer, and Illegal Drugs: An Overview of Motivational Techniques in DevOps." And finally, I got a suggestion on timing, as I'm now well versed in how we ask our operations teams to make lots of late-night changes: we should present something on the network load balancer, but we need to present at 1:00 AM.

00:18:10

As I got going with this team, I quickly started to feel like I was in The Phoenix Project. I'd seen parallels to the book in previous work experience, but nothing like this. First of all, lots of invisible work and work in multiple systems. People were working on things that nobody else on the team had visibility into. We had a lot of work that was just an email and wasn't getting tracked anywhere. And then there was work in multiple systems: our operations teams used multiple tools to track the work they were doing, and they used pretty much all of those tools to send their work to us, so we did not have a single pane of glass where we could view our work. We had competing priorities from across the company; we support lots of products at CSG, and all of them feel that their items are the number one priority.

00:18:54

That's been a challenge to juggle. High WIP and overutilization: it became especially evident once we got all of our work onto one board just how much work we had in flight at any given time, and it's been a challenge to get that down. And then overutilization: the team was understaffed. Many of you have probably seen the graph of what happens to wait time as percent busy goes up; we definitely saw that here as team members jumped from emergency to emergency, driving that WIP up further. Manual configuration: most of the configuration in this product was manual in nature, and a large part of the work we do is going in and making those configuration changes. Technical debt was multifaceted here. First of all, we've struggled to get some upgrades of our vendor software into production, and as a result of running those older versions of software in production, we had production outages.

00:19:44

Some of the challenges were things within our control; some were not. Standards: in some cases we lacked standards, and in others they hadn't been universally applied or rolled out to production. In other cases, we'd built complexity into the operations side that really belongs more on the development side, and as a result we've had maintenance challenges and, again, more production outages. We definitely have our own Brent: he's extremely intelligent and knows our system very well, but we ended up single-threaded through him for certain types of things. And then poor visibility into specific changes: the feedback, both within our team and from the other teams we interface with, was that they didn't always understand the changes we were making and what the downstream impacts could be, and at times they didn't even understand what we had configured for them today.

00:20:37

So we took a look at the challenges facing this team and attacked them with DevOps concepts. John Shook, who worked at Toyota, talks about changing what we do in order to change our underlying belief system and culture, and that's basically what we've done here. We've had people resistant to change, viewing this as just a fad, both on this team and throughout our larger organization, but what we did was basically just jump in, start applying the methods, and let them speak for themselves. First of all, we made resource changes: we brought architects, developers, and QA resources onto this team to help drive automation and culture change. We implemented automated reporting. I mentioned that we were doing a lot of manual configuration; we were doing lots of other manual work as well. For instance, we were logging into each of our devices every day, checking swap memory by hand, cutting and pasting that into an Excel spreadsheet, and emailing it out.

00:21:32

That was a leading indicator of performance issues for us. That's all automated now. We've also automated a series of reports for our DevOps teams so they can see more about what we have configured for them in the product, things like traffic through their different IP addresses. We've implemented dev best practices. I mentioned we brought those developers and architects over; they're familiar with these best practices, and they've brought them not only to the code they're writing but to the operations side as well. For instance, that manual configuration is backed by a config file that wasn't in source control; we've got that in source control now. We've brought in peer reviews: just like we do peer reviews for code, using Crucible, we're now doing that for all the operations changes going in as well. For any code being written, the expectation is that we build on top of our continuous integration framework with Jenkins, and automated test coverage is expected to be very high, both unit tests and feature tests.
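As one concrete illustration of the kind of automation Erica describes (the daily swap-memory check that used to be done by hand), here is a hedged sketch of what such a report could look like. The device names, threshold, and get_swap_usage stub are hypothetical; a real version would pull the numbers from each device's API or CLI and email the result on a schedule.

```python
# Hypothetical sketch of automating a daily swap-memory report.
# Device names, the 20% threshold, and get_swap_usage() are illustrative;
# a real implementation would query each device over its API or CLI.

DEVICES = ["lb-east-01", "lb-east-02", "lb-west-01"]
SWAP_ALERT_THRESHOLD = 20.0  # percent of swap in use that warrants attention

def get_swap_usage(device: str) -> float:
    """Stub: return percent of swap in use on the device."""
    sample = {"lb-east-01": 3.2, "lb-east-02": 27.5, "lb-west-01": 0.0}
    return sample[device]

def build_report(devices: list) -> str:
    lines = ["Daily swap usage report"]
    for device in devices:
        usage = get_swap_usage(device)
        flag = "  <-- investigate" if usage >= SWAP_ALERT_THRESHOLD else ""
        lines.append(f"  {device}: {usage:.1f}% swap used{flag}")
    return "\n".join(lines)

if __name__ == "__main__":
    # A scheduler (cron, Jenkins, etc.) would run this and send the result,
    # instead of someone logging in and pasting numbers into a spreadsheet.
    print(build_report(DEVICES))
```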

00:22:27

We're retroactively trying to add automated testing to a lot of these manual configuration changes that we're doing. We're bringing in continuous delivery concepts: we're deploying to QA multiple times a day and deploying to production a couple of times a week. Historically, this team deployed to QA once, if we were lucky, and then went straight to production. So we're instilling that culture of small batch size and getting lots of practice with these things. We're also integrating with our logging and tracing framework. We've implemented configuration as code: we selected a config-as-code framework and are moving product by product over to it, we've developed an automated deployment strategy for it, and we're also working on automated validation. And we completed a work tracking system overhaul, so now we've got everything in JIRA on one Kanban board, and we wrote scripts to pull work in from our other tools.
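To give a feel for the automated validation mentioned for those config-as-code deployments, here is a small hypothetical sketch: after a change is applied, verify that each configured virtual service still accepts connections. The hostnames, ports, and gating policy are illustrative assumptions, not CSG's real checks.

```python
# Hypothetical post-deployment validation sketch for load balancer changes:
# confirm each configured virtual service accepts TCP connections.
# The service list and timeout are illustrative, not real configuration.

import socket

EXPECTED_SERVICES = [
    ("billing-api.example.internal", 8443),
    ("care-portal.example.internal", 443),
]

def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def validate(services) -> bool:
    ok = True
    for host, port in services:
        reachable = is_reachable(host, port)
        print(f"{host}:{port} -> {'OK' if reachable else 'FAILED'}")
        ok = ok and reachable
    return ok

if __name__ == "__main__":
    # A pipeline step could gate the change on this exit code,
    # alerting or rolling back if validation fails.
    raise SystemExit(0 if validate(EXPECTED_SERVICES) else 1)
```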

00:23:25

Then there was changing workload management, which came as a bit of a surprise to us. We thought we'd bring developers in and they would do deployments and help troubleshoot incidents, and they were going to see problems, fix them, and automate them. Well, they definitely got in and learned a ton, and they saw better ways to do things, but they didn't have any time to actually do it; they were so burdened by doing all these changes. So we took a step back. We've still got a joint team and a joint Kanban board, but the developers are now primarily responsible for the strategic work, the coding, and the automation; they peer review the operations changes, and we've got an ops rotation. The ops side is primarily responsible for implementing the changes, but they also help contribute to the config-as-code framework and they help design the long-term strategic work. And we've increased our change process visibility: we now meet with our key stakeholders on upcoming changes and talk through the details, what's needed for validation, et cetera.

00:24:12

Before I turn it back over to Scott to talk about where we're going next from a DevOps standpoint, I just wanted to share a few thoughts on operations from a development perspective. As our development organization has gone through this journey this year, there have been a lot of things that we expected to find that came true, but there have also been a number of surprises for us. First of all, ops is hard. I think we knew this, but the extent to which it's true really came as a surprise to us. The change process wasn't something we dealt with very much before; now it's part of our everyday process. It can be time-consuming, it can be cumbersome, and we deal with large volumes, which can be daunting. We've seen that we need to streamline it, but to get there we need to be able to demonstrate, day in and day out, that we can safely make change.

00:24:58

We're not quite there yet, but we are applying DevOps principles to get there. Along those same lines, change can be scary. As developers, we want to push changes to production quickly, but when we're the ones on the front lines, deploying those changes, dealing with the fallout, and working directly with angry customers, we start to realize there is some inherent risk in this, especially if we don't have good automated tests and automated deployment. So we understand the viewpoint of being resistant to change a lot better; we know we have to move through it, but we understand it. Application architecture is needed. We've historically thought of architecture as applying to the software development life cycle phase, but it really applies to the entire thing, so we need to take a system view. We found the same principles of good design, interfaces, automation, and standards apply to the whole thing.

00:25:44

And we need to rethink where our boundaries are with architecture. Enablement is key: we need to give these folks the proper number of resources, training, and good equipment to work on. They're a hardworking, dedicated group of people, but we have to set them up for success. Support is truly 24/7, so I now sleep with my cell phone next to me and make sure to crank up the volume every night so that I can get woken up, and I do get woken up. I have troubleshot production issues from a boat while on vacation and from a car while road tripping. You are always on, supporting your product and making sure it's up in production, and until you experience that firsthand, it's just hard to put into words. And finally, ops is forced to tolerate a lot of pain. So with that I'll turn it back over to Scott, and he's going to talk about where we're going next to help reduce this pain.

00:26:31

Thank you, Erica. Erica has been doing some fantastic work with this team. You know, when I first took over, I went through PCI; who here has been through PCI? That was my first experience in March, encountering PCI as a new operations leader. So, a really quick summary of what's up at the top: accelerating feedback and learning is incredibly important, to get the understanding and learning onto the teams; accountability then comes after that, and engineering can really help. Challenging the org norms is something all of us see as really necessary in this, and it takes a lot of courage. Structure for feedback: the outcome you want is speed and quality, and you have to really look at the org norms for that. So, what's next? There's a lot of stuff we're doing next.

00:27:20

First, change lead time; I'm going to show a slide on it, a bonus slide that I wanted to stick in. Getting change under SDLC rigor: Erica talked about that, getting ownership of the change onto the teams. Bridging ITIL and the SDLC, both the processes and the tools, bringing those together; you really have to have that end-to-end flow. Impact reduction. Moving more to centers of enablement as opposed to centers of excellence; centers of excellence generally give you huge bottlenecks and knowledge siloing. On the technology side, more engineering and less duct tape; I talked about getting that into the front of the process. Some things we're doing on the mainframe: we're moving from assembly to Java, and we get about a 3:1 code reduction there. We're going to use our same tools and CD pipeline to deploy Java to the mainframe.

00:28:04

Developers really won't know much of a difference between developing for the mainframe or for open systems. We're also getting version two of our CI/CD pipeline in place, which is basically infrastructure as code and cloud pipeline enablement. And of course, people are really important: engineering culture, cross-skilling, and up-skilling. Now for the bonus slide; I couldn't resist putting this in. Kevin, our change process manager, is here somewhere; thank you, Kevin, for this. He put this together looking at change. There's a lot of complicated stuff here, but very simply, the blue lines are changes we've put into the system, bucketed into different hour windows based on how long their lead time was. We shoot for a 99.5% success rate as the KPI. The yellow line to the left is changes that meet that success criteria; to the right are the ones that are failing. Basically, if you look at it, we fail to meet our KPI goal for changes that go in with 24 hours or more of lead time, while changes that we put in really quickly meet our success criteria. That shouldn't be a surprise, but it's very interesting when you see it in the data and then start looking at the processes that help enforce it. We want to make change great again; that's the platform I'm running on this year.

00:29:25

So we start to look into this and we see some things in the system. Very quickly, the summary at the top shows that our failure rate is 0.17% when changes have less than 24 hours of lead time, and it gets some 600% worse beyond that. That's a huge difference when you're putting thousands and thousands of changes in; when these changes fail, set-top boxes and other things stop working, and that's a big deal. Then you look at our scheduling policy: it requires a five-day lead time. So we baked a schedule buffer into the scheduling policy, and schedule buffers are bad, bottom line. Don Reinertsen says that what a schedule buffer does is transform uncertain lateness into certain lateness in any process. In this case, we transformed uncertain failure into more certain failure by adding the scheduling lead time.
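A rough back-of-the-envelope reading of those numbers (my arithmetic, not a slide from the talk): a 99.5% success KPI allows a 0.5% failure rate, the short-lead-time changes come in at 0.17%, and reading "600% worse" as roughly a 7x multiplier puts the long-lead-time changes near 1.2%, well above the KPI.

```python
# Rough arithmetic on the change-failure numbers as quoted in the talk.
# Interpreting "600% worse" as roughly a 7x multiplier is my assumption.

kpi_success_rate = 0.995                        # 99.5% success KPI
kpi_max_failure = 1 - kpi_success_rate          # 0.5% allowed failure

fast_change_failure = 0.0017                    # 0.17% for < 24h lead time
slow_change_failure = fast_change_failure * 7   # "600% worse" => ~7x

print(f"KPI allows        {kpi_max_failure:.2%} failures")
print(f"< 24h lead time:  {fast_change_failure:.2%}  (meets KPI)")
print(f"longer lead time: {slow_change_failure:.2%}  (misses KPI)")
```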

00:30:20

So again, I had to put this in even though I'm over my time, because I thought it was really telling: when you start looking at the data across the system, you look at the processes people put in place that you think are making things better, and they're actually making things worse. One of our challenges now, of course, is to figure out why we have that lead time, break it down, and make changes smaller. People's argument would be, "Well, the bigger changes are the ones out to the right." Fine: we'll make them smaller, put them in quicker, and make them lower risk. Those are the types of things we're now starting to look at to make change great again. So thanks; thank you to the CSG folks, thank you all for coming, it's awesome, and thank you, IT Revolution. Thank you, guys.