Las Vegas 2018

More Engineering, More Culture, More Security

Over the last several DOES conferences, we've outlined CSG's DevOps journey. This is a continuation of that story.


Erica’s teams provide software solutions to CSG’s 40+ development teams. These solutions range from continuous integration frameworks to reusable libraries to telemetry visualization platforms. Erica is passionate about agile and has experience leading DevOps teams where members own the end-to-end infrastructure and code. Erica also has software development experience in the defense and aerospace industries where she worked on projects such as the replacement for the space shuttle. She lives in Omaha, Nebraska with her husband and two kids.


Joseph Wilson is a computer security expert whose career included five years as the Chief of a major DoD Network Operations and Defense Center (NOC) before entering the private sector. He also served as the security architect, strategist, and Manager of Security Operations for a Fortune 250 food company prior to joining CSG International. Joe now serves as the Executive Director of Global Information Security (SecOps, NetOps, and SecDev) and is responsible for protecting customer as well as company assets for CSG International 24/7. He lives in Omaha, Nebraska with his wife and three kids.

EM

Erica Morrison

Executive Director Software Development, CSG

JW

Joseph Wilson

Executive Director, Global Information Security, CSG

Transcript

00:00:05

My name is Erica Morrison. I'm an executive director in our software engineering space, and this is Joe Wilson. Joe's an executive director with our Global Information Security team. Before I get going, a little bit more about CSG if you don't know who we are: we're a global company with about 3,300 employees around the globe. We're the largest SaaS-based customer care and billing provider in North America. Some of our biggest customers are companies you may have heard of, like Comcast, Time Warner, and Dish. We've got about 62 million subscribers for these customers and about 150,000 call center seats. We support all this with a tech stack that really runs the gamut, everything from JavaScript to mainframe. We've got about 40 DevOps teams, with the same challenges as many companies: things like time to market and quality of software and operations.

00:00:54

So a little bit more about our DevOps journey. I've had the privilege to present at DevOps Enterprise Summit three previous times in San Francisco and also this past summer in London. And when I look back at our different presentations, I think they kind of tell the story of our journey. So in 2015, we talked about reducing batch sizes, applying Agile and Lean. In 2016, we had a major organizational transformation where we brought development and operations together with true you-build-it, you-run-it teams. In 2017, we built on top of that foundation, spreading culture, investing in engineering, and shifting ops left. And in 2018, we're focusing on more automation and shifting security left.

00:01:39

So I wanna start with some metrics that show some of the progress with our journey. I get asked sometimes, what's the most important metric to track when you're doing a DevOps transformation? And my answer is always: it depends. It depends what's important to your company. For us, reducing customer outages is something that we really wanted to focus on, and so we came up with a way to quantify this. We call it impact minutes, and we take into account the duration of the outage, the severity of the outage, and the products that are impacted. And we use a framework from the book The 4 Disciplines of Execution, or 4DX, to track how we're doing with this. So discipline number one is to focus on the wildly important. We've said we want to focus on a wildly important goal in this area in 2018.

00:02:25

We wanna significantly improve how we're doing with impact minutes compared to 2017. I'm very excited to report that we are actually 58% better than we were last year with impact minutes; we had set a goal of 10%. So we have wildly exceeded that goal, which makes us very happy. And we also have 74% fewer incidents. This doesn't just happen, right? This happens with a conscious decision to continue improving how we're doing. So we'll talk about some things that some of the teams are doing. Examples include improving our synthetics framework, improving telemetry, and modernizing our platforms. Lots of people have worked really hard to accomplish the numbers that you see here on the screen. So let's talk a little bit more about the other disciplines so you understand this framework. Discipline number two is to act on the lead measures. This is where we create epics and features and we actually plan and do the work. Discipline number three is to keep a compelling scoreboard. With the scorecard here, we've actually used Power BI. We can slice and dice the data in lots of different ways: by service owner, by product, by customer, by date. This gives us a lot of insight into what's going on. And then discipline number four is to create a cadence of accountability. So we do a regular review with our executive team of the progress that we're making and make sure that we're going in the right direction as a company.
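A minimal sketch of how an impact-minutes style metric could be computed. The weighting scheme below is invented for illustration; the talk only says the metric combines outage duration, severity, and products impacted, not how CSG actually weights them.

```python
# Illustrative impact-minutes metric: outage duration weighted by severity and
# by the number of products affected. The severity weights are hypothetical.

SEVERITY_WEIGHT = {"sev1": 3.0, "sev2": 2.0, "sev3": 1.0}

def impact_minutes(duration_minutes, severity, products_impacted):
    """Weight outage duration by severity and breadth of impact."""
    return duration_minutes * SEVERITY_WEIGHT[severity] * products_impacted

def yearly_improvement(previous_total, current_total):
    """Percent reduction in total impact minutes versus the prior year."""
    return 100.0 * (previous_total - current_total) / previous_total

# A 30-minute sev1 outage hitting 2 products:
incident = impact_minutes(30, "sev1", 2)  # 180.0
```

Totaling this per service owner, product, or customer is what makes the Power BI slice-and-dice view described above possible.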

00:03:49

So let's talk a little bit more about some of the things that we've done this year. When I look back on the year, infrastructure as code is something that I see as a theme that we've talked about quite a bit. We started our infrastructure as code journey several years ago, and this isn't new to the DevOps space, but we've really accelerated our progress in this space this year. For platforms, we've chosen Chef for core infrastructure as code and Rundeck for our operations management platform. These are really the foundational elements for us that support our public and our private cloud rollout. I wanna talk about how we support this organizationally: we've got a core team that supports all of this for us. This team is a key source of knowledge. They provide a training curriculum.

00:04:37

They also help answer lots of different questions as teams are rolling this out. They own our Chef infrastructure and they also own a common set of cookbooks. They own our standards and best practices. So with it, the teams themselves are responsible for rolling out Chef, but they often do this with the support of the ASA team, which is the name of this core team. And the ASA members get loaned out to these teams from time to time as well, and with this we can feed that back into our standards and best practices. The team supports workstation configuration and also a test framework that leverages AWS and Test Kitchen. So with all of this, we've rolled out to production: eight teams are using this in production now, and we've got another five that are using it in dev. This is substantial improvement for us this past year. And one thing we found is it really requires a change in mindset. For our more mature teams, this has kind of allowed them to level up and take their game to the next level. And for some of our less mature teams, it's been a forcing function for getting onto some of our foundations, like our continuous integration system.

00:05:44

Thanks, Erica. I wanna talk a little bit about what we've done at CSG regarding combining DevOps and security. We all know that when we combine those two things, it can be a very powerful combination and everybody wins. So we took a step back and understood that the attackers always have the upper hand; they always have more time, money, and resources. So we wanted to create a National Guard model at CSG, and that is to give autonomy, mastery, and purpose to our developers. So what did we do? We took a step back and looked at what is considered a defensible architecture, right? Richard Bejtlich, a well-known security author, says there are six attributes of a network architecture that make it defensible: it's monitored, inventoried, controlled, minimized, assessed, and current. Those six bullet points are really, really hard. And we knew we had to get some new technology in order to accomplish them.

00:06:43

So we evaluated some technology and we landed, as Erica said, on Chef, in particular Chef InSpec. The difficulty we had, the biggest pain point at CSG, was that every single PCI assessment we struggled with configuration management at the granular level that was expected. So we leveraged Chef InSpec and combined that with our application specifications and our platform specifications, as well as our global asset management capability that we have in-house. Our global asset management capability includes things like IP addresses, host names, whether or not those systems are PCI-based systems, and it also includes some information about who's in charge of that asset and the leadership at the next level, so we can always monitor. What we accomplished was we now have the ability to detect drift across the enterprise. Application and operations teams have the ability to update that configuration and that application specification on demand, or they can go and fix the drift as needed through our change management process.
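The core of the drift detection described above, stripped of all tooling, is a comparison of a declared specification against observed state. This is a hypothetical sketch; in practice CSG drives this with Chef InSpec profiles, and the field names here are invented:

```python
# Hypothetical configuration-drift check: compare a declared application
# specification against the observed state of a host and report every
# attribute that has drifted. Field names are illustrative only.

def detect_drift(spec: dict, observed: dict) -> dict:
    """Return {attribute: (expected, actual)} for every drifted setting."""
    drift = {}
    for key, expected in spec.items():
        actual = observed.get(key)
        if actual != expected:
            drift[key] = (expected, actual)
    return drift

app_spec = {"ssh_port": 22, "tls_version": "1.2", "pci_scope": True}
host_state = {"ssh_port": 2222, "tls_version": "1.2", "pci_scope": True}

print(detect_drift(app_spec, host_state))  # {'ssh_port': (22, 2222)}
```

The interesting part is organizational, not algorithmic: because the spec lives with the application team, they can either update the spec on demand or remediate the drift through change management, as described above.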

00:07:52

Scott Prugh covered this information earlier this week, but if you missed it, this is now open source. So you can go and grab our Asset Compliance Tool, otherwise known as ACT. And please do; we'd love your feedback. Go ahead, next slide. So the second biggest pain point for us this year in the security space was vulnerability management. We had lots of feedback from teams saying, hey, I don't understand what I need to patch, I don't have enough time to do it, and it's really, really challenging for us. So again, addressing this at a National Guard level, we decided to enable those teams, and we did that via the same development processes that Erica's teams use. We're using Jenkins, Python, Ruby, MSSQL, and our global asset management database. So what we've done with vulnerability management is we've automated the process end to end.

00:08:47

So we take our information from our global asset management database and port that into our vulnerability scanner. That vulnerability scanner automatically runs every day. We take the vulnerabilities from that scanner and pipe them right back into our global asset management database, with role-based access control wrapped around it. What that does is give the application teams and the platform teams immediate feedback, a vulnerability fast feedback loop with our daily scans, to know what they have to prioritize and go fix. Additionally, we've piped that information directly into Power BI, which gives us the ability to understand who's performing well from a patch cadence perspective. Senior leadership reports are also included over email. Our patch cadence at CSG is a 90-day marker: no matter what patch needs to be applied, you have 90 days from the first detection of that vulnerability.
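The 90-day cadence with escalating reminders lends itself to a small deadline calculation. This sketch is illustrative, not CSG's actual code; only the 30/60/80/90-day thresholds come from the talk:

```python
from datetime import date, timedelta

# Sketch of the 90-day patch-cadence logic: given the date a vulnerability was
# first detected, compute the patch deadline and which state applies today.
# Thresholds mirror the talk; the function itself is invented for illustration.

REMINDER_DAYS = (30, 60, 80)
DEADLINE_DAYS = 90

def patch_status(first_detected: date, today: date):
    """Return (deadline, state) where state is a reminder marker, 'on-track',
    or 'policy-exception-required' once the 90-day window has passed."""
    age = (today - first_detected).days
    deadline = first_detected + timedelta(days=DEADLINE_DAYS)
    if age > DEADLINE_DAYS:
        return deadline, "policy-exception-required"
    reminders = [d for d in REMINDER_DAYS if age >= d]
    return deadline, f"reminder-{max(reminders)}" if reminders else "on-track"

deadline, state = patch_status(date(2018, 1, 1), date(2018, 3, 7))  # day 65
```

Running this daily against the asset database is enough to drive both the automated reminder emails and the governance-board exception list the talk describes next.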

00:09:41

We send out automated emails at the 30-day, the 60-day, and the 80-day marker. If you can't patch within 90 days, we ask for a policy exception to be put in place through our governance board. And this has been extremely powerful for us. Go ahead, next slide. The next biggest challenge for us was file integrity monitoring. Across the board, we had disparate processes and procedures at the application and platform levels. So what we decided to do was consolidate to a central tool, and that tool was Trend Micro Deep Security. But we also wanted to leverage the entire framework that we talked about earlier; let's learn from that and accelerate this process for a faster onboarding. We stood up a Confluence page and we integrated Trend Micro Deep Security FIM into that Confluence page. And what happened was we gave our developers the autonomy to go ahead and modify that Confluence page and select folders that needed to be monitored.

00:10:40

So in the case of a web server, the web root directory obviously needed to be monitored. So for application teams, once TMDS was installed, all they had to do was go and update the Confluence site, and immediately Trend started receiving alerts. But beyond that, we also made molehills out of mountains: we decided to discard all of the alerts that were no longer of value, and we worked with those teams specifically to do that. And we separated into operating system and application alerts. So every day they get automatic reporting; they're enabled, they're empowered, they can look at it and quickly remediate. The other thing that we gained from this is huge visibility across our enterprise. Today we've got 30 products, 51 technical owners, and nine service owners that are using this end to end, and that will continue to grow.
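At its core, file integrity monitoring is hashing the opted-in folders and diffing against a baseline. This is only the detect-change idea in miniature; real FIM tooling like Trend Micro Deep Security does far more (tamper-resistant baselines, real-time hooks, alert routing):

```python
import hashlib
from pathlib import Path

# Illustrative FIM core: hash every file under the folders a team has opted in
# to monitor (e.g. a web root), and diff against a stored baseline snapshot.

def snapshot(folders):
    """Map each file path under the given folders to its SHA-256 digest."""
    digests = {}
    for folder in folders:
        for path in Path(folder).rglob("*"):
            if path.is_file():
                digests[str(path)] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests

def changed_files(baseline, current):
    """Paths added, removed, or modified since the baseline was taken."""
    keys = baseline.keys() | current.keys()
    return sorted(k for k in keys if baseline.get(k) != current.get(k))
```

Splitting the resulting alerts into operating-system versus application buckets, as the talk describes, is then just a matter of which folder each changed path falls under.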

00:11:29

Another huge challenge, for those that are familiar, is third-party software, right? I think everybody struggles with this. So what we did to address this is we leveraged Chef, ACT, our code scanning capabilities, Checkmarx, Flexera, as well as US-CERT data and all the vulnerability information that we could possibly pull down, including SCCM data. What we obtained then was a complete application snapshot. That snapshot's very powerful. We started to do regex and other vulnerability pattern matching so that we could automatically detect, further left in the development process, what needed to be remediated, and then we would automatically create a Jira ticket and remediate that as a part of the normal developer workflow, which is a huge win for us.
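A hedged sketch of that pattern-matching step: scan an application snapshot (here just a list of bundled third-party components) for versions matching known-vulnerable patterns, and emit a ticket-shaped record for each hit. The patterns and ticket fields are invented for illustration; CSG's actual feeds (Checkmarx, Flexera, US-CERT) are far richer:

```python
import re

# Illustrative vulnerability pattern matching over an application snapshot.
# Each pattern pairs a regex for a vulnerable component version with the CVE
# it corresponds to. Both example CVEs are real; the matching is simplified.

VULNERABLE_PATTERNS = [
    (re.compile(r"struts2?-2\.3\."), "CVE-2017-5638"),      # Struts 2 RCE
    (re.compile(r"openssl-1\.0\.1[a-f]\b"), "CVE-2014-0160"),  # Heartbleed
]

def scan_snapshot(components):
    """Return one Jira-ticket-shaped dict per vulnerable component detected."""
    tickets = []
    for comp in components:
        for pattern, cve in VULNERABLE_PATTERNS:
            if pattern.search(comp):
                tickets.append({"component": comp, "cve": cve,
                                "summary": f"Remediate {cve} in {comp}"})
    return tickets

hits = scan_snapshot(["struts2-2.3.31.jar", "openssl-1.0.1f", "log4j-2.11"])
```

In the pipeline the talk describes, each hit would be posted to Jira automatically, so remediation enters the normal developer workflow rather than a separate security queue.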

00:12:19

So it's been pretty cool getting to partner with Joe's teams this year, and adding security into this has really been kind of a natural progression; Joe shared a couple examples of that. Another area is the cloud space, and Joe's gonna tell you in a minute a little bit more about how his team was engaged here, but the infrastructure as code work that we did really laid the foundation for some major cloud wins for us. The first one is StatHub. This is our system monitoring tool. In May we moved 40 backend servers to AWS. We did this all via automation with Chef and Terraform. Not only did we spin up our servers using automation, but we wrote over 1,400 InSpec tests that verify that we are PCI compliant, which makes Joe happy with me and makes it a lot easier for me to convince him to let me do cool, fun stuff.

00:13:05

So along with this, we really greatly improved our patching. We can now do blue-green, and we're able to scale in a manner that we just couldn't on-prem, due to our ever-increasing needs for storage and compute. This was really a culmination of months of work across many different teams partnering together: Joe's security team, DevOps teams, platform, and our networking teams. We had to work through things like figuring out how to create a reusable AMI that meets security requirements and also how to integrate with our on-prem server inventory. So through all of this, now we've got an inventory and a blueprint that other teams can follow. Another important cloud rollout for us is a voice product. With this product, they got a lot of the benefits that I'd say would be pretty textbook benefits of an infrastructure as code and cloud rollout.

00:13:55

So, consistent rollout of changes. This is a big one. They have a lot of different unique environments, and making sure that the right code went to the right place at the right time, and that it came out of source control, all those things had been challenging for this team and are now greatly simplified. And they don't have to log into the servers to administer them; obviously this is very nice as well. And then backout had been another challenge for this team. Their backout could take over an hour, very manual, and if you're backing something out, your customer's probably down, so you're not in a good situation, and you're in that for an extended period of time. Now we've enabled blue-green deployment, so we can back out changes in minutes, which is obviously a substantial improvement. And then another product, our eCare product, which is a customer care product.

00:14:42

With this change, they were doing manual server build-outs for 100 servers per release, and we do four releases a year. They were actually doing blue-green, which is great, but they were doing all of this manually. As you can imagine, a lot of work goes into this; the requests had to get to our SAs two to three months in advance. So now we've streamlined all that, we've automated all that, and they've been able to go one step further and come up with a dedicated VM cluster per client.
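The blue-green pattern both of these teams rely on can be modeled minimally: traffic points at one color at a time, a deploy installs to the idle color and cuts over, and a backout is just pointing traffic back. This is a toy model with invented names; the real rollout is driven by the Chef/Rundeck/Terraform automation described above:

```python
# Toy model of blue-green deployment, showing why backout drops from over an
# hour to minutes: nothing is reinstalled on backout, traffic just moves back.

class BlueGreen:
    def __init__(self):
        self.live = "blue"          # color currently taking traffic

    @property
    def idle(self):
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version, releases):
        """Install the new version on the idle color, then cut traffic over."""
        releases[self.idle] = version
        self.live = self.idle

    def backout(self):
        """Point traffic back at the previous color: minutes, not hours."""
        self.live = self.idle

releases = {"blue": "v1", "green": None}
lb = BlueGreen()
lb.deploy("v2", releases)   # green now runs v2 and takes traffic
lb.backout()                # traffic instantly back on blue's known-good v1
```

The cost of the pattern is keeping two environments standing, which is exactly what the infrastructure-as-code and cloud work made affordable.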

00:15:13

So how do we reframe cloud security, right? This is a common challenge, and Erica talked about some of the wins that we've had this year. So, the first two bullets: cross-pollination is everything. There might be some pain associated with it, but I'll tell you, it's worth its weight in gold. So why not embed security experts in the development teams, specifically on the hardest security challenges? If we solve it once, we can solve it for many. Also, embed developers in our security team. Thanks for letting me have one of your resources this year, Erica; it's been a huge win for us. We've got two employees on the security operations team that get to work on automation. Additionally, when we talk about autonomy, mastery, and purpose for the developers, we want to increase the speed of solving complex security issues.

00:16:03

So we wanna set precedents, set guardrails, but we wanna remain iterative. This includes communities of practice. So when we have somebody that comes up with a great idea, let's share that. And it's not always the security team that comes up with great security ideas. For example, Erica mentioned the AMI that we're using. We said, absolutely you can use that AMI, and what other security benefits do we get? And Erica's team came up with a great solution: they automatically search the webpage and identify any available patches, and as soon as they're available, they're pushing them into dev. Additionally, I think it's important for us to use the cloud to secure the cloud. It's there, right? Let's leverage the capabilities, AWS capabilities, Azure capabilities, and the like, and take advantage of those IaaS provider capabilities if they're there. Additionally, don't forget about your legacy data centers.

00:17:01

This year we've migrated our two primary data centers at CSG over to software-defined networking. What that's done is shorten the time to resolve vulnerabilities as well as configuration management issues. Previously it took us almost six months to plan and work out any changes to our core network, because that's how we eat and breathe our business. Today we have the ability to upgrade firmware on those switches in about an hour. That's moving at network speed, and that's where we get value. Lastly, everyone's gotta eat: education, awareness, and training is everything. So we still dedicate resource time and education for our employees regarding PCI and what it means to them, and we also hold a DevOps leadership series. That might include beer or other types of drinks to draw some folks in, but we talk about key topics, whether it's development or security, and everybody gets that ground-level base of knowledge.

00:18:08

So at the beginning of this presentation, I showed the slide that had kind of the history of the presentations that I've done with our senior VP Scott Prugh. And as we've gone through this, and we've added Joe's teams in, as we've added security focus, a compelling theme that we've heard from our teams is: hey, we need to have a better focus on work-life balance. And so this year we went one step further with this. This is something that we take very seriously. We've partnered with our product management team and we've said, hey, we're gonna dedicate 15% of teams' time this year to focus on work-life balance initiatives. Not only that, but this is gonna be a team-driven initiative, right? Let those closest to the pain figure out how to solve that pain. So the teams are responsible for coming up with areas to target, coming up with the metrics that they're going to use to track how they're doing, and then actually executing against that.

00:18:59

So I'd like to share some of the success stories that we've had coming out of this, with a couple of different teams and some of the things that they've chosen to work on this past year. Our continuous integration team has done something called patching on demand. The way patching used to work for this team, our SAs once a month would give them a six-hour window, and of course this is off-hours, middle of the night; we like to call that stupid o'clock. And with that time, they didn't even get to control which servers were gonna be patched when. So now they've taken this into their own hands and they can patch these servers themselves. They've gotten it down to three hours during business hours, and they can control the order that the servers go in. So obviously much better for work-life balance in this regard.

00:19:46

And then they've also improved their health monitoring. Our CI system is an essential system that needs to be up all the time. That means that if it's not up, I'm paging out in the middle of the night. So we've developed a synthetics framework; we're testing, we're always making sure builds are succeeding, but it was kind of flaky. And when we started, we said, that's okay, it's important that we get this working. But as we started getting more of those middle-of-the-night pages, we said, hey, we need to fix this. So we've gone in and we've improved that. And I'll say that that was actually a pretty common area that a lot of different teams have tackled with their paging. Our data warehouse team chose to tackle some test automation. In one particular case, there are jobs that used to take 15 manual steps to validate.

00:20:29

And we've now reduced that to three steps. The team believes overall this is about a 50% reduction in the amount of manual work to validate each of these jobs, which is a big win for us. And then in another area, we were looking at the pages that this team receives, and this team gets paged a lot, trying to identify some themes and maybe some areas that we could tackle. In one particular area, we identified that there was some cache locking going on, and every time this happens, at least three different teams get paged out, and then it's an hour of research for each of these teams. Obviously a lot of time goes into this, so we said we think we can actually fix this just by putting some better error handling in place. So that's what we've done there. Next, our SLBOS product, which is an API product.

00:21:12

They've enhanced their synthetics framework. They already had a very strong synthetics framework; what they've done here is split out all the different components and test each of those individually. I'll give you a couple examples of how that can be really helpful. Imagine you've got three servers that sit behind a load balancer. If one of those servers is not quite right but it's still responding to pings from that load balancer, two-thirds of the time my tests are actually gonna pass and I might not know what's going on. But if I've got tests that also go directly to those servers, in addition to going through the load balancer, I'm gonna detect that a lot faster and I can improve MTTR with that. And then take that same example: let's say I know I've got something wrong in my system, but I don't know what component it is. Is it at the load balancer layer or is it at the server layer? If I've got tests that go both directions, I can much more quickly zero in on what's going on. And then the last team that I wanna talk about, order management, kind of ties together a lot of what we've been talking about here. They said, hey, we wanna streamline our deployments. They were doing deployments in the middle of the night, and they were time-consuming. So they used Chef, Rundeck, and cloud to make their deployments much smoother.
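The split-synthetics idea from the load balancer example can be sketched as a small triage function: run the same health check both through the load balancer and directly against each backend, so a single bad server (which the LB-only check passes two-thirds of the time) is caught immediately and the faulty layer is obvious. The check results here are stand-ins for real probes:

```python
# Sketch of layered synthetics triage: given the result of a check through the
# load balancer and per-server direct checks, name the unhealthy layer.

def localize_failure(lb_check_ok, server_checks):
    """Return which layer (if any) is unhealthy.

    lb_check_ok:   bool result of the test routed through the load balancer
    server_checks: {server_name: bool} results of direct-to-server tests
    """
    bad_servers = [name for name, ok in server_checks.items() if not ok]
    if bad_servers:
        return f"server-layer: {', '.join(sorted(bad_servers))}"
    if not lb_check_ok:
        return "load-balancer-layer"
    return "healthy"

# One of three backends is sick, but the LB probe happened to hit a good one:
result = localize_failure(True, {"web1": True, "web2": False, "web3": True})
```

With only the LB-routed check, this scenario reports healthy most of the time; with the direct checks, `web2` is flagged on the first run, which is exactly the MTTR win described above.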

00:22:27

So to kind of wrap things up, some help that we're looking for. We'd love to hear what you guys are doing from a DevSecOps pipeline perspective, particularly what tools are being used at your company. This is something Joe and I have talked a lot about; we do have a DevSecOps pipeline, but it's something that we're looking to continue to iterate and improve on. Next, SRE best practices. Site reliability engineering for us feels like a natural progression of continuing our DevOps transformation, so we'd love to hear what best practices you are implementing and how it's going for you. And then finally, reducing toil. This ties right into SRE, and it also ties into the work-life balance stuff. Hopefully you got a chance to see Damon Edwards speak yesterday; he talked about reducing toil and how, if you have excessive toil, you actually can't fix your system. So we wanna make sure that we're continuing to tackle this and continuing to move our teams forward. So that's it for us. We've got a few minutes left if anybody has questions. All right, I can't see any hands over there. If there's anybody up there, feel free to just speak up.

00:23:44

All right, I'll take that as a no. Thank you, everybody. Thanks.