Europe 2022

Automated Change Management

Automate the change management process by using development activities and data-driven automated assessment of the risk of making a change to production, such that changes are automatically approved, resulting in both better delivery outcomes and reduced risk of change implementation.


Gus Paul

Executive Director Application Infrastructure, Morgan Stanley



I am so excited about all the talks here at DevOps Enterprise this year. Our first speaker of the conference is Gus Paul, an executive director at Morgan Stanley, one of the largest financial services firms in the world. I can say with some certainty that in my decades of studying high-performing technology organizations, I've never met anyone in banking like Gus. When I met him, he told me about his 20 years at Morgan Stanley and his early days on the trading floor, where traders and developers worked side by side, where quickly developing capabilities and even microsecond advantages could not only help their firm win but, more importantly, help their clients. He talked so eloquently about how life changed after the 2007 financial crisis, when Morgan Stanley was one of nearly 30 banks designated as a systemically important financial institution: those are the organizations sometimes called too big, too complex, and too interconnected to fail.


All of those organizations are impacted, often with an increased focus on controls, which in turn creates more processes, more rules, more approvals, and that definitely changed the way developers work. Those stories reminded me of that amazing poem from The Lego Movie: "The world was once free and full of possibility, then came order and after it authority; everything changed until nothing changed at all." So Gus is going to tell the incredible story of how a group of amazing and dedicated technologists worked to liberate over 15,000 software engineers across the firm from an ineffective change management process, to liberate their full creativity and problem-solving potential, and by doing so improve the reliability, safety, and security of the control environment far better than what one could do with manual reviews alone. Here's Gus Paul.


Thanks, Gene. As you said, I'm from Morgan Stanley. My name is Gus Paul. I'm a product owner in our software delivery assurance squad, which is part of our DevOps enablement platform. I'd like to talk to you today about our journey from our old change management process to our new automated change management process, and hopefully give you some insights into what life is like in a huge financial institution. For those of you who don't know Morgan Stanley, we've been in business for almost a hundred years. We're split three ways: Institutional Securities, which is our investment banking, sales and trading, and advisory business, and our Wealth Management and Investment Management franchises, where we manage assets for our clients. For those of you who know the US, you may have heard of a company called E-Trade. If you haven't heard of E-Trade, they were the original Robinhood.


That's where you go if you want to trade securities as an individual customer; wealth management is for people with lots of money. We split our revenues basically 50/50 between Institutional Securities and our asset management businesses. We have global scale: we are on every continent, and as you can see, our Institutional Securities business is split fairly evenly between those regions. So in a huge business like this, with over 76,000 employees, what does technology look like? Well, we've got over 15,000 people working in technology. We've got three and a half thousand applications and systems, and those systems are processing up to 10 billion transactions every year. The critical things about Morgan Stanley technology are volume processing and time-to-market efficiency. We have trading algorithms that require sub-microsecond response times to give them that little bit of an edge in the market. Because of this, Morgan Stanley adopted technology aggressively in the early nineties.


And back then there was no cloud; there was not a lot of stuff that you could use. So we hired a bunch of smart engineers and we built a lot of things that are still with us today: server management, binary artifact distribution on a global basis, our market data plant for providing price feeds to our traders. All of those things are bespoke to Morgan Stanley and have given us a significant competitive edge over the years. It was fun being a developer when I started 20 years ago. You would sit on the trading floor with the people you worked with every day. If there was a problem, you'd roll out a fix, but that was part of the fun. You enjoyed being able to enable the business to work faster. When we became a bank holding company in 2008, there was a subtle change at first, but we increasingly ended up with more and more processes and procedures.


Some of them were completely understandable: we had to be able to document what we were doing and why. But with some of the interpretations of those processes, I was convinced there was a better way of doing it. What benefit was it to the firm that I had to go on a call every day at 2:00 PM to tell somebody what I was gonna do the next day, only for them to say yes every time? It seemed like we could find a better way of doing that. In 2018, our CIO at the time decided we needed to get more on board with this fancy new cloud thing that everyone else was using. So we started an agile and cloud transformation. I know sometimes those transformations with the capital T get a bad reputation, but I think we had a good balance between teaching people the foundational subjects.


There's lots of lifers like me at Morgan Stanley who had no idea what agile and DevOps were, and then, once they had those foundational skills, they were able to adapt and run their teams the way they thought was appropriate for their business area. When we started, DevOps was seen as a critical enabler, and like most things at Morgan Stanley, we have a lot of people working in tech who are asked to volunteer for these new initiatives. When the volunteer call went out for DevOps, I couldn't wait to get involved, because I knew that there was a better way of doing change management. We started off the first year of the transformation just teaching people what DevOps was. It was around that time Accelerate came out, and it was really useful to have the DORA metrics to base our conversations on, because we already had a pipeline that we used to build our applications.


We were able to use that as the starting point for increasing people's automation and making sure that they knew it wasn't just about speed; it was also about efficiency and reducing risk. That went well for the first year or so, and we decided to turn DevOps into a strategic program at Morgan Stanley. From that point onwards until now, we've had three key focus areas. First, accelerating software delivery and deployment: that's the one where we talk about how we can improve automated deployment and automated testing. Morgan Stanley has many different endpoints for doing web services, for doing binary distribution, for doing all kinds of things. Those are all custom to Morgan Stanley, and we need custom tools to be able to talk to them. Second, increasing the predictability, frequency, and quality of change: I'll talk to you about that in a second.


And third, revolutionizing how we operate technology. SRE opened a lot of eyes at Morgan Stanley. We always had a very close relationship, particularly in our institutional business, between our developers and our operations people. I was never in a situation where we had to give our release to somebody else to deploy it; we deployed our own stuff. But there was definitely a divide between the guys who had to sit on the trading floor every day listening to the support queries from the traders, and developers, sometimes in other regions, who were not quite as close to that cutting edge. How could we bring that back together? The SRE principles seemed to fit, and we've been on that journey for the last three years. But back to change management: why do we bother doing change management? Well, we did a developer survey, and it came out with the result that I wasn't the only one who thought our change process needed some work.


One engineer took it on themselves to document all the things they had to do to get one line of code into production: three different JIRAs, a change ticket, 81 individual steps just to get that paperwork created and approved. You could have up to seven or eight different approvers on that, usually senior people in the business or senior people in IT. Surely they have better things to do than approve these change tickets. Four hours of effort over two business days just to get permission to put something in production. I really took heart from the part of Accelerate, in chapter seven, where they talked about how change advisory boards and manual approvals are not correlated with better outcomes when it comes to risk or efficiency. In fact, having a change approval process is worse than having none at all.


That definitely rang a bell with me. So what else was wrong with our change process? Well, it was manual. Our tooling was very old; we don't use ServiceNow, so we built our own change management system, because we're Morgan Stanley. And we had many different approvals that were required. When we started, there were only one or two approvals, but we added a few more over the years: when individual events resulted in the call going out that we needed more oversight, we'd add another approval, and that approver stayed forevermore. We kept adding additional bits and pieces to the change management process. On their own, they all sounded fine, but as we continued to scale up and increase the amount of change we were doing, it was really slowing down the change process. We also, after we became a bank holding company, had to have a software delivery lifecycle procedure. That came in after the change management procedure and actually duplicated large parts of it.


Things like testing approval and approval of the thing you were deploying were repeated in both processes. One thing that came in about five years after we started this more formal change process was that we needed something better than just a risk assessment that said low, medium, or high. So we put in place a process where you had to answer eight questions, but those questions were filled in by the person creating the change, and they became so routine, because you were doing so much change, that you answered them the same way every time. So they were not contextual, and they were not necessarily providing the right level of risk assessment. They weren't bad, but surely we can do better than routinely filling in the same questions. And we were behind the curve by this point: we did that in the era of ITIL 2.


Now we're at ITIL 4. ITIL, for better or worse, is a good way of managing our IT service management function; we needed to make sure we adopted the best practices from there while still being able to be efficient and as risk-reducing as possible. And finally, it was time to make sure that change wasn't seen as this spooky barrier by a lot of developers. Quite often, developers were not necessarily aware that it was the change process that was causing some of their behaviors. Some teams over-interpreted the rules; we had to make them simpler to follow so that everyone could benefit. What kind of volume are we talking about? At Morgan Stanley, as I said, we've got three and a half thousand systems, over two and a half thousand of which are software systems. This chart shows the volume of change we do every year, from 160,000 in 2019 to 175,000 last year, with a small dip in 2020.


But if you look at the right-hand side, the percentage of software change keeps increasing every year. This is a trend we definitely expect to continue, because we're gonna start doing more infrastructure as code and more cloud deployment; those are still early days at Morgan Stanley. How are we gonna cope if we keep the same onerous change process? We also have lots of different change restrictions in place that we don't have good ways to make granular enough to be effective. How can you do a change on the weekend if every weekend is restricted for one reason or another? Because of this over-onerous process, we saw a lot of people batching up their changes into larger and larger releases. And because the releases were then so large, they had to be done at a time when we weren't gonna risk the business being unstable.


We are Morgan Stanley; we don't operate 24/7, apart from E-Trade. The rest of the business, particularly the big-volume businesses, don't operate on Saturdays. So a lot of our changes got pushed to the weekend, because that was the time when we thought it was easier to recover when things went wrong, or because the changes themselves were complicated. Now, some of that is because of the architecture of the systems, but there's also no incentive to change that architecture, because you still have to suffer the same change process whatever architecture you have. How can we make this better? First of all, we focused on the SDLC side of the equation. As well as the change management process with its flaws, we had an SDLC process where we were asking developers to commit their code to source control, to use JIRA for their requirements, and to make sure they were testing stuff in the pipeline.


And then we'd get to the end of that process and have someone approve all those things all over again. Why do we have to do it twice? Let's fix the SDLC. People are already approving the requirements in their squads. People are already approving the code, because for the most part everyone was using the pull request workflow in Git. People were already generating test cases; they had the results from their automated testing system. Does it really need someone else to say, "Yes, I approve that this log file says all the tests pass"? Is the log file itself not evidence enough? So we were able to change the pipeline and shift left. For the SDLC, we reduced the approvers from the senior people approving at the end to the people in the squads who are working together, approving each other's work.
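The "evidence instead of re-approval" idea above could be sketched as a simple pipeline gate. This is a hypothetical illustration, not Morgan Stanley's actual tooling; all field names here are made up.

```python
# Hypothetical SDLC evidence gate: if the human approvals and test evidence
# already exist in the pipeline, no end-of-process sign-off is needed.

def sdlc_evidence_complete(release):
    """Return True when the pipeline already holds the evidence."""
    pr = release["pull_request"]
    # Separation of duties: one reviewer who is not the author is enough.
    reviewed = any(r != pr["author"] for r in pr["approved_by"])
    requirements_ok = release["story"]["approved_in_squad"]
    tests_ok = release["test_report"]["failed"] == 0
    return reviewed and requirements_ok and tests_ok

release = {
    "pull_request": {"author": "alice", "approved_by": ["bob"]},
    "story": {"approved_in_squad": True},
    "test_report": {"failed": 0},
}
print(sdlc_evidence_complete(release))  # True
```

The point of the sketch is that each check is something the squad has already done in the normal course of work; the gate just reads the record rather than asking a senior approver to restate it.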


This crucially meant we were able to keep the separation of duties that's required. Sometimes you have these conversations; I've seen discussions on Twitter where people say even pull requests are something you shouldn't do, that they're constraining on the team. Unfortunately, we do ultimately work in a regulated environment, and separation of duties is critical for the things we do. But separation of duties only requires one other person; it doesn't have to be five. So let's see how far we can get: if the pull request and the requirements are approved by a human, can we automate the rest of the process? With that in place, we were ready to start tackling the change management side. Change management at Morgan Stanley: as I said, because we have such a complicated process, we were actually able to divide lead time up into subcategories of lead time.


Believe it or not. So the lead time, from the time the story is marked In Progress in JIRA to the time it was deployed, was broken down into the delivery lead time: that's from the point when the change was committed into source control to when it was released to production. And then we broke that down further, because there was enough of a gap between these pieces: the delivery prep time, which is the time from when the code was committed to when it was built and ready to deploy, and then the change approval lead time, when everything is finished and ready to deploy and we've just gotta get the paperwork done and push it out the door. We wanted to know, if we actually fixed the change management process, would we actually see any benefit? From this value stream mapping exercise, the answer was yes, because on average, from our data, we saw that it was three and a half days to get a change approved in the change approval lead time, but only half a day after that approval happened for it to be deployed.


Now, quite often that's probably because people were getting to the point where they wanted to deploy their change and then saying to their boss, "Oh, quick, can you approve that ticket for me so I can get it out the door?" So that was more evidence that we needed to fix this process, and that we could maybe get three days back on average for each team. Some of that batching, as I said, is due to architecture; some of it is to do with the business requirements of changing things when the business is not open. But we were confident that we could find a way to get that down and get back to a point where we didn't have such a big chunk of our time spent chasing approvals for changes. So how were we gonna do that?


We looked at the different bits of the change process and wondered what we could automate. We had the SDLC side; that's taking care of the code approval and separation of duties. Then we had the risk assessment, those eight questions. Most of them were to do with things like: are you deploying this in a repeatable way? Do you have a good sense of the risk of the system you're deploying to? Do you have a good sense of the back-out procedures? Couldn't we automate some of those things? We know how risky the system is; that's metadata. We have a history of how many incidents the system has caused related to change; we can track that. We know how this deployment is done; we can tell the difference between an automated and a manual deployment.


We have the evidence from the SDLC that it's been tested. Let's feed all of that data into a risk calculation, and we'll give you a score back. We decided to make it a number rather than just low, medium, and high, because that would allow more differentiation, and we decided that the more points you got, the riskier your change was. There was a certain threshold, which we decided we'd work out from looking at the data: if you were under that threshold, you could go down the automated change approval path. It wasn't no approval; it was automated approval. The system decided that you were low risk. If we could strip out all those low-risk changes, then the really risky changes could get the focus they needed from the human approvers, who could drill into what was happening.


Before, you couldn't see the wood for the trees: every change went through the same process, so how could you work out which ones were higher risk and which ones needed your time and focus? This would give us increased confidence in our risk assessment, which would allow us to do the automatic approval, which would then allow us to push these things out without having to go through many different change advisory boards. Later on, we'd go back and fix the normal change process too; even there, we don't need five approvals. Let's work out the best way to get the necessary approval. But that's for later.


So what things did we think would make up this automated risk assessment? There's a whole list on this slide; just to highlight some of the things we thought were important: the quality of your SDLC release. Other things came from industry statistics as well as our own data analysis, things like the size of the change and how long it takes you to do the change: the longer it takes, the more complicated it's likely to be. Have you automated your execution? Things with manual steps are usually more likely to go wrong. What's your previous history, both of incidents and deployment failures? That's likely to be a leading indicator that you might have a problem again. And what's the risk of the system you're changing? High-risk systems will have more impact if you cause an incident.


Remember, our whole basis here is: are we gonna have an incident from this change? Not, is it a good change? Not, is it gonna make money? Just, are we gonna cause stability problems or incidents when we make this change? Because if we can eliminate that risk, then if there's a business problem, we'll just turn around and do another change to fix the business problem; we don't have to wait another week to do that. So that was our hypothesis. We built a whole matrix of the things we thought were important and the score weighting we'd apply to each one. We had over 200,000 changes every two years, so we could use that data: take the rules we'd got, apply them back to that data, and ask, do we think this is gonna work? What that meant was we had confidence in the data, the levels, and the thresholds we built.


We found some things that were not relevant to the scoring, and we found other things that were very strongly correlated. The ones that were strongly correlated we were able to weight higher than the ones that were not as strongly correlated. That meant we were fairly confident we had a good range of things that would allow us to reduce the impact of changes if they went wrong. But Morgan Stanley is a big place. There's a lot of people who've been here longer than me, and people who've been here less time than me. Some people have never done anything other than TCM, lucky them. That meant we had to have a process where we could both show that the system was working as we expected and overcome the resistance of people who were still convinced that manual approval was always gonna be better than having a machine do it for you.
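The back-testing idea, replaying the scoring factors over historical changes to see which ones actually correlate with change-related incidents, can be sketched with a plain Pearson correlation. The history data below is fabricated for illustration; the real exercise ran over 200,000 changes.

```python
from math import sqrt

def pearson(xs, ys):
    # Plain Pearson correlation coefficient between two equal-length series.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var)

# Toy history: (manual_deployment, high_risk_system, caused_incident) per change.
history = [
    (1, 1, 1), (1, 0, 1), (0, 1, 0), (0, 0, 0),
    (1, 1, 1), (0, 0, 0), (1, 0, 0), (0, 1, 0),
]
manual   = [row[0] for row in history]
highrisk = [row[1] for row in history]
incident = [row[2] for row in history]

# Strongly correlated factors earn a higher weight; weak ones can be dropped.
print(round(pearson(manual, incident), 2))    # 0.77 on this toy data
print(round(pearson(highrisk, incident), 2))  # 0.26
```

On this fabricated sample, manual deployment correlates strongly with incidents while the system's risk tier correlates weakly, so the scoring matrix would weight the former higher, exactly the weighting exercise the talk describes.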


So what we did was, first of all, we built the risk service. We had a system where we calculated the risk score based on the different inputs, and we said, let's roll that out first, because then everyone can find out what they need to do, how they need to build, and how they can potentially change their system to get underneath the thresholds. One senior person said to me, "I don't understand these rules. You've set it up so people can easily game these rules and get under the score." I'm like, yeah, that's exactly the point. We want people to game this, because given the rules we've set up, if they game it, their system will be lower risk. That's exactly what we wanted to happen. Then we started a pilot with the actual change process from end to end. First of all, we did it internally with our own teams, and that was used as proof to say, can we start a pilot with the wider teams?


Yeah, Morgan Stanley is a regulated environment. It's not easy to vary things like the change procedure without having new procedures in place first. The organization worked with us and said, yes, we can give you a risk exception to do a pilot, and then that can inform whether we go forward with the full thing. So we did a pilot; it took us six months, in the second half of last year. We did 1,500 changes across 58 different systems. There was not one change-related incident. Not one. Our average is about one and a half percent. There was a significant improvement in lead time, even better than we thought there was gonna be, and we reckoned about $10,000 in savings a month just in approvals from doing this work. So then we released it to everybody. This year, we've seen increasing adoption; we're up to about 10% of all software changes now done by this process.


We're hoping for 25% this year and then maybe 40% next year. But just to emphasize, this is not just about going fast. You get a lot of people who say, "Oh, developers, you just don't wanna have a change call, because you wanna just do it and not have any oversight." We also strongly believe this is about reducing risk. We have two examples from the pilot. In the first, on one of our critical systems in terms of compliance, the product owner found a bug at 5:57 PM. By 6:42 PM, less than an hour later, that change was fixed and in production. This is a system that used to take two weeks to get a change out the door. It didn't have to go through any exception process: no emergency change, no break-the-glass, totally legit, fully automated. Committed the code, code review, boom, in production. Similarly, there's another system, a much bigger system.


It's one of our key infrastructure components. They did a regular change the old-fashioned way, where they rolled it out and did all the paperwork, and they saw there was an issue with that change that, if they'd left it untouched, would've resulted in an incident of increasing severity the longer it went on. But because it was a small fix, they were able to use systematic change to get that fix out the door, again without any break-the-glass, without waking people up in the night to approve things. And they were able to resolve that incident before it became a big incident, which was obviously beneficial. Sometimes there's no easy cost for these things, right? You don't necessarily see the incident happen; you can't prove to me that this was the thing that helped.


But here we have evidence that it prevented the bigger incident. And the stats were even better than we expected. This is from the pilot: the blue bars are the old-fashioned way, the orange bars are the systematic change way. Significantly improved delivery lead time, significantly reduced deployment size, significantly increased frequency of deployment, and many more changes on weekdays. That was critical, because we had quite a large morale issue with our developers, who were fed up with doing releases on Saturday morning. They wanted to do it when they were in the office; why can't we do that? Key to that was that a lot of these systems had to adopt zero-downtime releases, but that's part of the architecture discussion; that's probably another talk. And then, was this broad-based? Was it just one system that did all those changes and made the numbers look good?


No. Almost three quarters of the systems in the pilot showed improvement in these metrics, on delivery lead time, delivery size, and deployment frequency. I don't want to say I told you so, but I was pretty pleased with these numbers, because they really bore out the impact we thought this could have. And then, almost as important, the sentiment. As I said, I used to be a developer; in my heart, I'm still a developer. I wanted to make sure that people saw the value in this. These are the kinds of quotes we were getting: things like "the ability to move faster in a more sustainable way"; people saying this is the best thing that has happened to them in 20 years at the firm; people who felt like they were working at a different company; and even somebody who felt this was just an example of us implementing exactly what happened in The Phoenix Project.


Those kinds of quotes really gave us heart that we were doing the right thing, and it's created its own culture of increasing adoption, because people are seeing these things shared by other teams and they want a piece of it too. And then, how is this impacting the business? Well, these are quotes from our overall DevOps program and how it impacted particular businesses. In wealth management, there was one non-technical product owner who was very skeptical about doing things more frequently; they thought going faster surely means higher risk. But we were able to show them that actually going faster, with smaller releases more often, reduces risk, because each release is less complex. In algorithmic trading, the key is time to market: the faster you can get something of benefit to the customer into production, the faster the customer benefits. That rapid iteration unlocked.


What I used to do when I worked on the trading floor was exactly that: being able to push stuff out very quickly, even the same day, to benefit those clients. And then one of our big IPO clients: we had to build an entire system, because they wanted to run the IPO in a specific way. That was built with automation from the start, and it was revolutionary to the investment bankers, who'd been used to working with the old tech, how fast and how reliably it could be built and rolled out to enable that client's business. As usual, Gene asks us to say what help we're looking for. Big-cap financial services firms: there's only a few of us, and we're trying to work together to see if there's a way we can present a consistent view of this stuff.


We went through that internally, overcoming people who thought manual approval is better than automated approval. Can we present that consistently across all our different organizations, externally, to external regulators, and say, look, we've got the results that prove this is good? We are working together through an organization called FINOS, which is the Fintech Open Source Foundation. Many big firms are part of it; if you're a financial firm, you probably are, even if you don't realize it. We've got a working group in there called DevOps Mutualization. If you wanna get involved with that and you're a member of FINOS, let me know. If you're not a member of FINOS, I'd still love to hear from you about what you are doing around DevOps in a regulated environment. Our risk assessment: can we make that better? It's targeted at software changes right now.


What things do we need to think about when we're talking about infrastructure changes? We're looking at a number of vendor products that could potentially do machine learning on this, so rather than a static set of rules, every change gets its own assessment that's contextual to that change. If you're doing something in that space, we'd love to hear from you. And then the audit trail: we're getting better at this, but we still have a number of different tools involved in this pipeline, and it's still hard to present an end-to-end picture simply and consistently to our auditors. If you've got anything you're working on in that space, or you've solved that problem, I'd love to hear from you as well. So that's it. I hope you enjoyed and found this talk useful, and I'd love to hear from you if you're going through something similar.