Erica Morrison (CSG) x Shaaron Alvares (Las Vegas 2020)

Erica Morrison is the Vice President of Software Engineering and her teams provide software solutions to CSG’s 40+ development teams. These solutions range from continuous integration frameworks to reusable libraries to telemetry visualization platforms. Erica is passionate about agile and has experience leading DevOps teams where members own the end-to-end infrastructure and code. Erica also has software development experience in the defense and aerospace industries where she worked on projects such as the replacement for the space shuttle. She lives in Omaha, Nebraska with her husband and two kids.

Shaaron A. Alvares works as an Agile and DevOps Transformation Coach at T-Mobile. She has global work experience leading product, organizational agility and cultural transformation across technology, aerospace, automotive, finance and telecom industries within various global Fortune 500 companies in Europe and the US. She introduced lean product and software development practices and has led significant lean and DevOps practices adoption at Amazon.com, Expedia, Microsoft and T-Mobile. A speaker, trainer and writer, she is a news reporter and editor at InfoQ for Agile, Culture and DevOps, and an Ambassador at the DevOps Institute. Shaaron published her M.Phil. and Ph.D. theses with the French National Center for Scientific Research (CNRS).

Shaaron A Alvares

Sr. Agile DevOps Transformation Coach, T-Mobile

Erica Morrison

Chief of Platform, CSG

TRANSCRIPT

00:00:00

Welcome to another DevOps Enterprise Summit interview with our speakers. Today I'm very excited to have Erica Morrison with us. Erica is a VP at CSG. She presented a keynote at the DevOps Enterprise Summit in London, and she's presenting a keynote as well at the DevOps Enterprise Summit in Vegas this year, which is remote. So if you haven't registered and you have an opportunity to listen to her talk, I highly recommend that you register. It's going to be next week, actually, on October 13 to October 15. So welcome, Erica. I'm so excited to have you. We collaborated on a paper about incident management that's going to be published by IT Revolution sometime this year, and a lot of the paper is actually based on the work that you and Scott Prugh have been leading at CSG. So welcome, Erica. Would you like to introduce yourself?

00:01:04

Yeah, thanks so much for having me. So as you mentioned, I work at CSG. I'm over a number of different software engineering teams. A lot of the work that they do is in the shared services space, so we're providing services for other teams, things like monitoring solutions and CI systems. And then we also provide some front end for some of our next-generation products as well. I've been introduced into the DevOps space over the course of the last several years. My background is in writing software as a developer, and I got introduced into the operations space as part of my journey through DevOps. So the teams that I run today are teams that cross the development and operations space.

00:01:51

That's awesome. Yeah. And I think you've recently been promoted to vice president at CSG, so congratulations. I think you've been doing amazing work. So could you tell us a little bit more about CSG, the projects that you develop for your clients, and any client success story?

00:02:12

Yeah, sure. So the space that I'm in within CSG is revenue management, digital monetization, and output. We're North America's largest SaaS-based customer care and billing provider, and we're also working on next-generation products that leverage that base and expand into other markets and things like wireless. So basically we're providing these solutions to support our customers, customers that you've probably heard of, like Comcast, Dish, Time Warner, Disney. Those are all customers that we serve.

00:02:50

That's incredible. So what was CSG's journey to DevOps adoption? I know you mentioned you've been working in the DevOps area for a little while, and I know that you've been present at the DevOps Enterprise Summit since it started, I think in 2015. So were you influenced by the talks, and what was your journey at CSG?

00:03:15

Yeah, so our journey has really been over the course of many years. It started with the foundational concepts, things like agile, lean, and reducing batch size, and then it evolved over time. We took a drastic step in 2016 with a major organizational change in part of our business, bringing together development and operations. That's proved really foundational for us, for our journey, and for some of our successes. The next couple of years were really about working through what it means to have those teams together. Again, coming from that development background, there was so much I didn't know. The operations space is so much harder than I think I fully appreciated. And likewise, we were bringing those development best practices to the operations world. So we kept building on that, looking at things like: how do we automate our deployments? How do we design better software so it's easier to run in production, get defects down, detect issues sooner, and monitor? Those have all been things that we've built on, and now we're taking those learnings and applying them in other areas of the business as well.

00:04:30

Wow. So I know you're going to talk about incident management. You're going to do the keynote, actually, and it's going to be really interesting because you're going to talk about how the worst outage in CSG history was turned into a powerful learning and growth opportunity. I don't want to disclose too much of your talk, but can you tell us a little bit more about this outage and how you turned it around?

00:05:04

Yeah, sure. So the outage, when all was said and done, was about 13 hours in duration, and it took a good chunk of our products pretty much hard down. It started in the middle of the night and we troubleshot it throughout the day. It was a particularly challenging issue because not only were our products down, but so were all the tools that we normally use to troubleshoot them. Things like our monitoring system and access tools were victims of the same issue that was affecting our production services. So as the hours crept by, we were troubleshooting blind in a lot of ways compared to how we normally do things. We would come up with an idea, we would try it, and then things would work for a little bit.

00:05:51

So there was this sense of false hope for a few minutes, and then the problem would just move around from place to place. We eventually took more drastic action as the day progressed and started killing VLAN by VLAN until we identified one specific VLAN where everything just started working. Once we identified that, we could zero in on it. It took us a couple of days to reproduce in a lab what was going on. It was actually just some routine maintenance, a server patching activity. The server was a non-standard OS type that behaved a bit differently when it rebooted: it put some traffic out on the network that got interpreted as spanning tree by our load balancer, which had a misconfiguration. And so it looped, and we created this network storm.

00:06:43

So that was why nothing was working on the network. That was the outage itself. And I think you asked what we took from that outage. This outage was very big, very impactful, much worse than a normal outage. In fact, it was the worst outage in our company's history. So we knew we wanted to respond differently in this particular case. We reached out to several experts in the field. We reached out to John Allspaw and Dr. Richard Cook of Adaptive Capacity Labs, and they came in and taught us a bunch about incident analysis. And then we implemented the incident management system. We worked with Blackrock 3, Chris, Rob, and Ron there, and implemented incident management and transformed how we run our outage calls. You know, I was someone that ran those calls before.

00:07:39

And I can't tell you how different and amazing it is. If you had told me before that we'd be able to transform how we ran a call so much, I wouldn't have believed it, but it's just night and day in terms of how a call is run. So that's been very transformational for us. There have been cultural changes, and we've had a number of other things that we've improved on the technical front, things like improved monitoring and some things we're doing on our network side as well. So a ton of learning has come out of this for us as an organization, where at the time it just felt like this devastating thing. I talk about this in my talk: you could feel it walking in the halls, just sitting in meetings. We felt we had let our customers down, and it was just this devastating feeling. So we wanted to maximize that. ACL has this saying, something like, "Outages are unplanned investments; make the most of your unplanned investment." And we really feel that we did as much as we could to leverage that terrible event and learn as much as we could from it.

00:08:48

Yeah. You talked about the impact on people within the company. Did you feel there was an impact on the culture as well? In incident management we talk a lot about safety, psychological safety. So how was the culture, and that aspect, handled at that time?

00:09:08

Yeah. So with an outage this big, as you can imagine, there was lots of pressure and lots of people involved in all of that. To us, it exposed that we still had work to do. So we started talking about psychological safety a lot. It became something that we talked about at staff meetings and at all-hands. We do a monthly DevOps leadership series, so we started talking about it there. One thing we realized is that we needed to make it safe to talk about failing. What we've learned is, you know, we have outages and we move past them. People are embarrassed and want to move on, but in reality almost all of those outages had valuable learnings that aren't being shared if we're just trying to move past them. So we actually dragged back up some old outages from several years past, and it took some convincing to get people to talk about them and to get it to a more comfortable state, so that it's a normal thing where we can talk about these and learn from them. We will continue to talk about psychological safety, I think, forever. And at the end of the day, if you talk about it and you don't live it, you undermine it. So it is something that we as leaders continue to try to make sure is present within the culture.

00:10:29

Yeah, I know that's really important. I agree. We can collaborate better and respond better to incidents when we feel safe, right? So what were some of the key lessons of the incident? I know you introduced new practices, like a monthly leadership meeting to look at incident management across the entire organization, not just single, local incidents. So what were some of the key lessons learned, and maybe practices that you introduced after that incident, that could benefit other organizations?

00:11:04

Yeah, I think far and away the most impactful thing that we changed was rolling out the incident management system and, again, how we run our calls. That's been a big focus for us. You mentioned the global meetings where we review our incidents. We already had a practice of local review and this global review, but I think we've taken another look at both of those and asked: how do we leverage these for maximum learning, and make sure that everyone's participating, that we're doing these in the same way, and that people understand them? And one of the key things that I have been reminded of is that it's easy to look at an outage and focus on how we make this better. That's where we all want to go. How do we make sure this doesn't happen again?

00:11:50

But "what did we learn from this?" is also a really important piece. With this particular outage, we learned a lot about how our systems function, and that was eye-opening to me: to say, hey, go focus on this particular aspect and make sure you're learning about your systems as well. So now that's something I try to include when I'm doing a post-incident review. Organizationally, I think it was a good wake-up call for us that, hey, there's work to do in how we handle our incidents. We've come a long way; now it's time to up our game to the next level. And then I think the other key thing is that we learned about complex system failure, what complex system failure looks like, and what we can expect going forward, which means that we need to be better prepared for these sorts of things.

00:12:45

And then, like I mentioned, we collaborated, along with other collaborators, on writing this framework for incident management and learning. It was a great opportunity for me, actually, because that's when I learned about all the work you've been doing in this space. It's a Forum paper that's going to be published by IT Revolution, and it's packed with a lot of very valuable lessons learned, practices, and patterns to set up an incident process and framework within any organization. So I think it's wonderful what you're doing, because you're sharing the lessons learned from your experience at conferences, but you're also publishing them and collaborating across other companies. I think that's the right thing to do. So thank you very much, Erica. It was great to have you today. If you want to listen to Erica's keynote at the DevOps Enterprise Summit, please register. There's still time to register. It's starting next week on October 13. Thank you very much, Erica.

00:13:51

Thanks for having me.