Iterative Enterprise SRE Transformation

It's easy to get discouraged reading books about industry best practices that say things like "always test in prod!" and "10 deploys a day!!" At times, they can make the goal of being a high-functioning DevOps organization feel out-of-reach for large enterprises, where changes to the way we operate take time to roll out.


A few years ago, Vanguard started its journey to adopting Site Reliability Engineering across the IT organization, and that transformation effort is still underway today.


In this talk, we will share where we started, how far we've come since then, and all of the steps we've taken along the way, as we've worked to evangelize changes to the way we measure availability, enable experimentation, leverage highly-available architecture patterns, and learn from failure.

Christina Yakomin

Site Reliability Engineer, Vanguard

Robbie Daitzman

Vanguard Intermediary Platform – Delivery Lead, Vanguard

Transcript

00:00:13

The first talk today is from Christina Yakomin. She is a senior technical lead for site reliability engineering at Vanguard, which created the world's first index fund in 1975. They currently have over $8 trillion of assets under management and are now the world's largest provider of mutual funds and the second-largest provider of exchange-traded funds. Technology has always been critical to their business, evidenced by the fact that they have over 7,000 developers. So last year, Christina gave a talk at this conference, which I thought was one of the best talks on next-generation infrastructure and operations that I had ever seen, because it describes so wonderfully how one modernizes a technology landscape, and the technical practices around it, that has been built up for over 50 years. I asked if she could present a talk to all of us here at DevOps Enterprise and bring someone who could tell specifically how they benefited from what she and her team created. So I am so delighted that Christina will be presenting with Robbie Daitzman, an IT delivery manager who supports Vanguard's financial advising technologies, and he will be sharing some of their goals and how modern infrastructure practices help them achieve them in ways that Vanguard customers appreciate. Here are Christina and Robbie.

00:01:35

Thanks, Gene. In this talk, we'll tell you the story of Vanguard's iterative SRE transformation and our journey as we've modernized all of our technology platforms over the past few years. To set some context, if you're not familiar with Vanguard, we are a global asset management company with over $8 trillion in global assets under management, and working behind the scenes to power all of that are over 17,000 crew members, both domestically and around the globe. And for a company like Vanguard, technology is critical, not just because of the current modern landscape of the industry, but also because Vanguard is unique in its structure. We don't have brick-and-mortar locations where our clients can come in and talk to their financial advisor face to face. We never have. All of our interfacing with our clients happens either over the phone or on the web. Anyone familiar with call centers knows that that really incentivizes us to have a great web experience for our clients.

00:02:34

It makes technology one of the top priorities for the firm. And for some background on who we are: as Gene mentioned, my name is Christina Yakomin, and in my current role, I am the senior technical lead on our site reliability engineering coaching team. Over the past few years, I've picked up a few AWS certifications, including most recently the AWS Solutions Architect Professional certification, and in my career so far, I've had experience with full-stack application development, as well as cloud infrastructure and automation. In my most recent prior role, I was a member of our chaos engineering team. So if you've heard me speak in the past, it may have been on that topic, and I think of myself as a bit of a chaos engineering enthusiast still today in my current role. Outside of work, in my free time, I am a member at the Philadelphia Zoo, and I love to spend some time there on a nice day on a weekend.

00:03:32

And I'm Robbie Daitzman. I'm currently an IT delivery manager supporting one of our lines of business at Vanguard. My background is fairly similar to Christina's, and we actually crossed paths, as we were both on the chaos engineering team together, helping to stand that up at Vanguard a couple of years ago. Something that has always intrigued me is the people side of the engineering organization and how things such as organizational dynamics come into play there, which is actually something that helped lead me into leadership. And like many of us, I had to find different hobbies throughout our quarantine period over the past 18 months. So something I picked up was a real interest in wine, including picking up a certification in both service and tasting.

00:04:13

All right. So now let's talk about where Vanguard started this journey. To do that, we're going to rewind about five to seven years and paint this picture. I give a range because we're still on this journey today, and I want to preface all of this with that. I'm not here to tell you today that we solved every single one of these problems throughout the entire organization, but hopefully I'll be able to demonstrate all of the great progress that we have made. So to start, about five to seven years ago, all of our technology was hosted in a private data center. We had pretty much exclusively monolithic applications and no presence in the public cloud. All of our deployments to production were extremely controlled. There were roughly quarterly releases, and the development teams were definitely not the ones handling the deployments. In terms of observability,

00:05:06

There wasn't any. There was alert-only visibility, meaning really no dashboards that were particularly meaningful and no positive affirmation that our systems were functioning the way that we wanted them to. All we really had was whether or not an alert had fired recently, and we depended on those alerts to tell us whether or not our systems were functioning. To take things one step further, those alerts were centrally owned. Development teams weren't setting them up on their own. They were submitting requests to another team that managed the alert portfolio, who were configuring all of those, hopefully correctly. And another ticket would need to be submitted to update that alert in any way. Finally, as you might expect, development and operations were completely siloed. We had teams of programmers and teams of operators, and they only interacted when something went wrong. All of that production support was centralized too.

00:06:03

So the next step from here is to talk a little bit about the cloud migration that enabled our SRE transformation, from that data center that we had all the way to majority public cloud. The first step was breaking down that monolith. It is incredibly difficult to take a large, unwieldy monolithic application and lift and shift it into the cloud directly. And if I had figured out how to do that, I probably wouldn't be giving this talk right now. Instead, we carved out microservices from our monolith, slowly but surely, and started running them initially on a private cloud platform as a service that we hosted internally. As part of this move, we were able to improve the deployment frequency for those portions that were carved out. It really changed the game for those development teams who were used to waiting for a quarterly release cycle. Now the regression test cycle was able to be significantly reduced, and a test automation engineer role was introduced to help automate as much of the testing as possible, so that we could ensure that we were thoroughly testing our systems.

00:07:11

Before we went to production, a new pipeline was introduced to run our automated tests, and then our deployment frequency drastically increased. We even developed automation to generate change records and attach that test evidence to them, all as part of the CI/CD workflow. In order to now migrate all of those microservices that we'd spent all of that time carving out into the public cloud, we made an interesting decision. Instead of moving them one by one and putting each and every team through a migration effort, we decided to lift and shift the underlying platform as a service, the private cloud that we had been hosting. Now, this made the lives of the microservice development teams significantly easier, but it was incredibly hard on the infrastructure teams that were migrating that underlying infrastructure. I can speak to this because I was on that team at the time. That platform as a service really did not want to run in the cloud, but we got it there. And once we did, it left us with a now-unnecessary abstraction layer between our applications and the AWS infrastructure. And because it had been such a headache to get it to the cloud in the first place, ultimately it was overcomplicating the environment and causing more problems than it solved.
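The change-record automation mentioned above might look roughly like the following. This is only an illustrative sketch: the `build_change_record` function, the record fields, and the shape of the test results are all assumptions made for the example, not Vanguard's actual pipeline or change-management schema.

```python
from datetime import datetime, timezone


def build_change_record(service, version, test_results, created_at=None):
    """Assemble a change record with test evidence attached (hypothetical schema)."""
    failed = [t["name"] for t in test_results if not t["passed"]]
    if failed:
        # Refuse to raise a change for a build whose automated tests did not pass.
        raise ValueError(f"cannot create change record, failing tests: {failed}")
    created_at = created_at or datetime.now(timezone.utc)
    return {
        "service": service,
        "version": version,
        "created_at": created_at.isoformat(),
        # The evidence section is what a reviewer or auditor would look at.
        "evidence": {
            "tests_run": len(test_results),
            "tests_passed": len(test_results),
            "test_names": [t["name"] for t in test_results],
        },
    }
```

A pipeline stage could call this after the automated test step and submit the resulting dictionary to whatever change-management system is in use, so that no one has to assemble the evidence by hand.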

00:08:32

That's when we decided it was time to move to a more cloud-native solution, and actually to more cloud-native solutions. For the first time, we were able to introduce optionality into the application design process. We removed the abstraction layer and instead leveraged AWS resources like ECS Fargate to provide a really similar experience for the majority of those microservice teams. Now they were getting basically the same experience, where it was primarily serverless. They didn't have to think about VM provisioning, and we no longer had a team managing the operational overhead of the PaaS maintenance. But there were other resources that Amazon made available that were a better fit for some workloads, like Lambda. Lambda is incredibly cheap to run and very lightweight. It was especially great for migrating some of our event-driven workloads, or those microservices that are accessed very infrequently, so we didn't have to pay around the clock for compute. And there were other applications that were a better fit for Kubernetes and could benefit from a control plane. So EKS emerged as another front-runner for hosting those microservice applications.
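To make the event-driven case above concrete, here is a minimal sketch of a Python Lambda handler. It assumes the function is triggered by a queue that delivers a `Records` list whose entries carry a JSON string in `body` (the shape SQS uses); the `order_id` field is purely hypothetical.

```python
import json


def handler(event, context):
    """Invoked once per batch of events; nothing runs (or bills) between invocations."""
    processed = []
    for record in event.get("Records", []):
        # Each record body is a JSON document published by an upstream producer.
        body = json.loads(record["body"])
        processed.append(body["order_id"])  # hypothetical field, for illustration only
    return {"processed": processed}
```

Because compute exists only while `handler` is running, an infrequently accessed service written this way costs nothing while idle, which is the trade-off the talk describes.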

00:09:45

Now, everyone's heard the quote, "With great power comes great responsibility," and this is what those product teams ran into next. While most of them were very excited to take on more accountability for their systems, it was new to them in most cases. Now they had to test not only the application code, but also their configurations, extremely thoroughly. We provided some guidance to them on how they could best do that with the adoption of a process called failure modes and effects analysis. This is a technique borrowed from the more physical engineering disciplines and adopted into software and technology. We look through your system architecture, identifying possible failure points, how those could fail, and what the effects might be. Once you've done that exercise, you basically have a list of hypotheses about the way that your system behaves under stress and failure. And what better way to test those hypotheses than chaos engineering and performance testing?
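The output of a failure modes and effects analysis can be captured as simple structured data. The sketch below is an assumption about how one might record it, not the format Vanguard uses: it scores each failure mode with the conventional FMEA risk priority number (severity times occurrence times detection, each on a 1 to 10 scale) and turns each entry into a testable hypothesis.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    component: str   # where in the architecture the failure occurs
    failure: str     # how it fails
    effect: str      # what we expect to happen as a result
    severity: int    # 1 (negligible) to 10 (severe)
    occurrence: int  # 1 (rare) to 10 (frequent)
    detection: int   # 1 (certain to be detected) to 10 (likely to go unnoticed)

    @property
    def rpn(self) -> int:
        """Risk priority number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection

    def hypothesis(self) -> str:
        """Phrase the entry as a statement a chaos experiment can confirm or refute."""
        return f"If {self.component} suffers '{self.failure}', we expect: {self.effect}"


def prioritize(modes):
    """Order failure modes so the riskiest ones are tested first."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)
```

Sorting by `rpn` gives a team a defensible order in which to schedule chaos experiments and performance tests against the hypotheses.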

00:10:50

So we created platforms to self-service both of those things as well, in our non-production environments. And I'd actually like to talk a little bit more about some specific instances where we did run chaos experiments and performance tests and saw some really great outcomes for these teams. First, we ran a chaos game day where, after we developed those hypotheses about our system resilience, we went ahead and caused task crashes on ECS, for example, and validated that under certain amounts of load we would observe auto scaling and self-healing from crashes. This was great. In most cases when we've done this, we've seen a lot of our hypotheses verified and maybe just one or two go a bit differently than we expected. And it's always a great opportunity for us to learn and continuously validate that our systems are running the way that we think they are.
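The structure of a game-day experiment like the one described above (check steady state, inject a failure, verify the hypothesis) can be sketched with a toy model. `ToyService` below is a stand-in written for this example; it does not call ECS or any real chaos tooling, and its reconcile loop only imitates what a container scheduler does.

```python
import random


class ToyService:
    """Stand-in for a container service that keeps a desired number of tasks running."""

    def __init__(self, desired_count: int):
        self.desired_count = desired_count
        self.running = [f"task-{i}" for i in range(desired_count)]
        self._next_id = desired_count

    def crash_tasks(self, n: int, rng: random.Random) -> list:
        """Inject failure: stop up to n randomly chosen tasks."""
        victims = rng.sample(self.running, min(n, len(self.running)))
        self.running = [t for t in self.running if t not in victims]
        return victims

    def reconcile(self) -> None:
        """Imitate the scheduler: start replacements until the desired count is met."""
        while len(self.running) < self.desired_count:
            self.running.append(f"task-{self._next_id}")
            self._next_id += 1


def run_experiment(service: ToyService, crashes: int, seed: int = 0) -> dict:
    """Hypothesis: after task crashes, the service returns to its desired count."""
    rng = random.Random(seed)
    steady_before = len(service.running) == service.desired_count
    killed = service.crash_tasks(crashes, rng)
    running_after_crash = len(service.running)
    service.reconcile()
    return {
        "steady_state_before": steady_before,
        "killed": killed,
        "running_after_crash": running_after_crash,
        "hypothesis_held": len(service.running) == service.desired_count,
    }
```

In a real game day the injection and the observation would come from the platform and its telemetry rather than from in-memory state, but the report has the same shape: what was steady, what was broken, and whether the system healed as predicted.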

00:11:42

We've also run a chaos fire drill. We differentiate this one in that the failure we are adding into the system is one we don't expect to be resilient to. We expect to break things. The reason we did this is because we weren't trying to test the application. We were trying to test a new observability tool. The new tool we were bringing in promised easier troubleshooting, and we wanted to put it to the test. So we injected failures that purposely broke the system and let engineers try out the tool for the first time and see if it really was easier. And it was. This was a great way of onboarding engineers onto the new observability platform, and we still have the recording of that and use it for training new engineers who are going on call and who might be interfacing with this tool. Finally, break testing our CI/CD pipeline.

00:12:29

This is a really interesting one because our pipeline is not SaaS. We are hosting the pipeline, and the team that was handling that hosting and maintenance was faced with recurring instability of their pipeline at high-traffic times, which is the worst time for any system to be having instability. But of course, with a pipeline, what that means is it's when your developers are most productive; they are building and deploying the most. And these crashes, because the system was wiping itself out before it could even offload its logs, were preventing thorough investigation during the business day. So we actually had to come up with creative ways to recreate the condition: create deployment environments and build environments that looked like real ones and generated the same number of logs, without actually impacting all of the pipeline's users. We did this on a weekend, recreated the condition while someone was able to watch what was happening in real time on the server, collected those logs, collected a thread dump, and we were easily able to spot the bottleneck and increase the sizing, and we saw benefits the very next Monday.

00:13:38

Now I'd like to spend some time talking about our observability journey. I mentioned right at the beginning that we started from a place of no observability whatsoever, alert-only visibility. And this timeline kind of walks you through the transition from there to where we are now. Initially, the operations teams started to make dashboards that were really more like alert consoles, so that they could at least see the current status of the various alarms that existed. And this worked for a while, but we knew that it was insufficient. So when we developed that central microservice platform, the team developed some standard application dashboards to go along with it, to give us better visibility into all of those disparate microservices, because we simply weren't going to be able to keep track centrally of all of those different microservices anymore. Teams loved these, and it started to go really well.

00:14:35

But what we found was that they were looking for more. We got feature requests, slowly but surely, as teams started using these to monitor their own applications independently. We added more logs. We added metrics and made these dashboards really, really robust. And then finally, the level of customization that was being asked for was so great that we allowed teams to gain access to the tool. After completing a training, they were able to clone the standard dashboards that had been provided and tweak them to meet their needs and the specific monitoring that their application would benefit from. As we saw this grow in scale, we saw some really positive outcomes, as well as some not-so-great consequences of this decision. The benefits were obviously that creating new dashboards, and alerts from those dashboards' queries, was so much faster. Teams now could move more quickly.

00:15:34

They didn't have to submit tickets, and that's always a good thing. Teams were also leveraging data in their decisions. We saw teams decide not to move forward with a deployment based on what they observed in their launch, or even decide to roll back early, because they were able to monitor and they knew really intimately what their steady state was. And there was an increased focus on production support. Teams in pockets really started to swell around this and take ownership for what felt like the very first time. But the consequences that I mentioned included dashboard clutter. Some teams took this to the extreme. They made dashboards for everything, so many dashboards that you couldn't keep track of them anymore. The same goes for alerts. A lot of the time, teams were alerting on the wrong things, or they were alerting on just too many things. And when you're alerting on too many things, as you probably know, that can lead to alert fatigue and ultimately ignored alerts, which is a real problem when one of those ignored alerts was actually the signal of a production incident.

00:16:39

Another issue we started running into is that up to this point, the platform that we were using for these dashboards and alerts was simply our log aggregation tool, and we were sending all of this information to it as logs. It worked at first, but at scale things started to fall over. Just because you can do everything in your log aggregation tool definitely doesn't mean that you should. We saw cost increase exponentially and observed performance concerns, which is a real problem. The last thing that you want is to be trying to troubleshoot a production incident and have your resolution slowed down because you can't even see the error message that the system's printing out for you. So we started to adapt. We needed to put metrics and traces where they belonged, starting with using Amazon CloudWatch directly for looking into the critical system metrics, and then bringing in Honeycomb to observe distributed traces across all of our microservices. When I mentioned earlier that we had done a chaos fire drill to put a new observability tool to the test, this is the one that I was talking about.

00:17:46

We've also decided to standardize around OpenTelemetry. We see this as an investment for the future. OpenTelemetry is a framework that's going to allow us to make the investment in tracing and instrumentation but avoid vendor lock-in. We can do it once and have confidence that it won't be thrown away later, because we're not the only ones standardizing around this framework. It really looks like the industry is doing so as well, and lots of observability tools are now coming with out-of-the-box integration with OpenTelemetry collectors. We've also spent time at Vanguard developing shared libraries to extract common fields that might be helpful to add to the metadata of a trace, for example, to make this as easy for teams as possible. And we've added auto-instrumentation for OpenTelemetry to all of our exemplars, so all new projects will get it right away with no additional effort from the application teams.
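A shared library for extracting common trace fields, as described above, could be as small as one function. The sketch below is an assumption about what such a helper might look like: the function name, the header names, and the attribute keys are all hypothetical and are not taken from Vanguard's libraries. It uses only the standard library; with the OpenTelemetry Python API, each resulting key and value could then be attached to the current span through `Span.set_attribute`.

```python
def common_span_attributes(headers: dict, service: str, environment: str) -> dict:
    """Collect fields worth attaching to every span (all names are hypothetical)."""
    # HTTP header names are case-insensitive, so normalize before looking them up.
    normalized = {k.lower(): v for k, v in headers.items()}
    attributes = {
        "app.service": service,
        "app.environment": environment,
    }
    optional = {
        "app.request_id": "x-request-id",
        "app.client_channel": "x-client-channel",
    }
    for attribute, header in optional.items():
        value = normalized.get(header)
        if value:  # skip missing or empty headers rather than recording blanks
            attributes[attribute] = value
    return attributes
```

Centralizing this logic means every team's traces carry the same field names, which keeps queries portable across services and across observability vendors.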

00:18:43

Now, finally, after talking about our cloud migration and our observability journey, I can start to talk about our adoption of site reliability engineering. To start, we needed to change the way we measured availability. Up to this point, it had been extremely binary, with no nuance whatsoever. We were either up, which was good, or down, which was bad. And what this implied is an attempt to achieve 100% uptime, which we all know is a truly impossible goal. Instead, we needed to shift our thinking to these core concepts that SRE introduces: service level indicators and objectives, and their associated error budgets. These aren't real SLIs and SLOs that we're using at Vanguard, but they look a lot like the ones that our applications are using. Say, for example, you want to set a definition of healthy for availability, which is HTTP status codes, and for latency. You'd measure against maybe the HTTP statuses being less than 500

00:19:41

and the response times being less than two seconds, or 2,000 milliseconds. Now we can set targets accordingly, maybe 95% success, or 99% faster than two seconds, and modify those over time based on our clients' expectations. This gives us better impact measurement during incidents. Instead of just up or down, we can say 10% of requests are unhealthy or 50% of requests are unhealthy. It also lets us better prioritize availability and resilience initiatives against our feature delivery. Now product owners are a part of the conversation, saying, you know, we really need to be 99.9% available for our clients. And if that's the case, they can maybe make some trade-offs with their team and deliver features a little bit more slowly. Or if the product owner says, I really need this feature sooner, they may be willing to compromise and accept 99% or 99.5% availability for a given product.
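The example indicators and targets above translate directly into a small calculation. This is a minimal sketch using the illustrative numbers from the talk (status below 500, responses under 2,000 milliseconds, targets of 95% and 99%); the class and function names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the client
    latency_ms: float  # response time in milliseconds


def is_available(req: Request) -> bool:
    """Availability SLI: the request did not fail with a server error."""
    return req.status < 500


def is_fast(req: Request, threshold_ms: float = 2000.0) -> bool:
    """Latency SLI: the response arrived within the threshold."""
    return req.latency_ms < threshold_ms


def sli(requests, predicate) -> float:
    """Fraction of requests that satisfy the predicate (1.0 if there is no traffic)."""
    if not requests:
        return 1.0
    return sum(1 for r in requests if predicate(r)) / len(requests)


def error_budget_remaining(sli_value: float, slo_target: float) -> float:
    """Share of the error budget still unspent; negative means the SLO is breached."""
    if not 0.0 <= slo_target < 1.0:
        # A 100% target leaves no budget at all, which is why it is not a usable goal.
        raise ValueError("slo_target must be at least 0 and strictly less than 1")
    allowed_failure = 1.0 - slo_target
    actual_failure = 1.0 - sli_value
    return 1.0 - actual_failure / allowed_failure
```

For instance, if 980 of 1,000 requests return a status below 500, the availability SLI is 98%; against a 95% objective that leaves roughly 60% of the error budget, a number a product owner can weigh against shipping the next feature.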

00:20:38

We also introduced an SRE coaching team as part of this new adoption of SRE. The purpose of the coaching team, which is the team that I am on, is to evaluate new tools that can make this transition easier, develop a self-study curriculum so that all of the engineers at Vanguard have an opportunity to learn about the core concepts of SRE and observability, and also help to define the strategic vision for what adoption of SRE will look like at Vanguard. And here's what that is. This is our SRE operating model. We take a hub-and-spoke approach to this, where the SRE coaching team, that's my team, will work with designated full-time SRE champions in each one of our IT subdivisions that support our various lines of business. Those SRE champions will work with various SRE leads within the subdivision. Those leads will be aligned to groups of related products, responsible for making sure that those teams are conducting their failure modes and effects analyses, their chaos experiments, and their performance tests, and that they are setting appropriate SLOs and SLIs that are complementary to the other products that they're interfacing with. And then finally, we have the product teams, where we may or may not have dedicated SREs on those teams, there to ensure that the nonfunctional requirements are always prioritized. In the absence of that full-time role, all of the DevOps engineers on those product teams will share the responsibility of managing the alert portfolio and ensuring that the individual SLIs and SLOs are being met and the error budget is being appropriately managed.

00:22:18

Now I'd like to turn it over to Robbie, to talk a little bit about the impact that this modernization has had at Vanguard.

00:22:26

All right. Thanks, Christina. So to help set a little context around what exactly it is that my team supports: we're responsible for supporting Vanguard's financial advisor services line of business. Really, the way that you can think about this line of business is that it's how Vanguard partners with external financial advisors so that their clients can meet outcomes such as saving for retirement or leaving legacy gifts behind to those that they love or are close to. Financial advisor services is responsible for roughly half of Vanguard's assets under management. So in terms of criticality to the business, it's pretty up there. And what's most interesting about this business right now is that for years, Vanguard has been able to compete on the investment product. But as the industry is really seeing pressures in terms of fees and performance, being able to compete solely on that investment product isn't feasible for the business.

00:23:21

We are really working towards making this shift in terms of how we are providing that best-in-class experience, in terms of the methodology and tools that those advisors have access to. And for us, from a technology perspective, this really puts the ball in our court around how we can really make sure that we're delivering for our business. This was a journey that, for our team, really started in 2019, when we set out to create Vanguard's first SaaS product that we were able to bring to market. Initially, what we were taking a look at was a retirement planning suite of tools that an advisor would be able to work through. This was going to be a multi-tenant platform, and we were building it 100% in AWS. For Vanguard, this was our first time building an application like this. Building an application that, all the way from login through data, lives 100% in the cloud was something that came with a lot of challenges.

00:24:14

And as we continue maturing what this product is and what it looks like, we see that this really is the future for the way our business is going to continue to operate. However, understanding the challenges that come with that, how do we not only build but operate this type of application, has been one of the biggest things that we've been looking forward to doing. And so for us to be able to do that, we've had to focus on these core reliability principles, bringing in tooling and practices, and really it's in partnership with Christina's team in the SRE space that we've been able to be successful as a part of that. And so in order to make this journey really be successful, one of the first things we needed to do was take a look at what tools we were using.

00:24:59

Christina already walked us through this journey that Vanguard has been on, to be able to separate out everything from living in that central log aggregation tool to understanding what is the right tool for the job at the right time. This was something that definitely was a point of friction for us, not only from a technology perspective but also from a process side, in that the log aggregation tool was something that engineers were comfortable and familiar with. However, as we continued to see issues, both with performance and with understanding what the benefits would be going forward, we were able to make that transition and separate out into these three pillars of observability. As a result of this, we've seen that we've been able to provide better service to our clients, by not only being able to recognize incidents in a more timely fashion but also having better information at our fingertips as our engineers are actually going through and responding to those incidents, to be able to restore service to our end clients.

00:25:57

Now, the tooling is absolutely very valuable, but what really excites me are the practices that we've been able to adopt as a part of this transformation as well. We talked about the failure modes and effects analysis. With that, what we've actually been able to do is sit down with not only our engineering teams but also our business teams and talk through: if something were to fail at this given point, what would we expect the technical response to be? Would there be a level of self-healing? What alerts would we expect to fire? But with our business partners, we've also been able to walk through the activity to say, what would we want our business response to be, both in terms of communications to the external world and whether there are ways that we would want to manually do a certain type of processing or respond in some way like that? That has been able to identify a list of action items, to be able to say there are areas where we want to go and put in additional self-healing or make it more resilient. But it also has sparked discussion around areas where we either weren't quite sure what the impact would be or, what's even more exciting in some ways, where we had different engineers actually disagree about what would happen.

00:27:06

And this is where the role of self-service tooling, such as performance testing and chaos testing, has been able to come into play, as we've been able to take those hypotheses that our teams have had and really put them to the test to see what does happen in the system when we're under those given conditions. There's really no better way to align mental models than to see it in action. Finally, something that I think we're all familiar with is that in incidents, it's most often the senior engineers that will be taking the lead and helping to actually go through and resolve that. While it is great to have those rock stars and those people that you know you can always count on, what happens when they're not available, whether they're on vacation or they're not the one on call on that given evening? We ran a chaos fire drill just a couple of months ago where we actually told our senior engineers, hey, you are all sitting this one out.

00:27:58

And what we wanted to do is provide an opportunity to some more junior engineers to, one, step up to the plate and be able to show that they're capable of it, but also to be able to push themselves to learn and grow overall. What this has done is really strengthened the engineering excellence within our organization, pushing folks to get involved in parts of the stack that they may not be involved with day to day. And so that's been something that's been really exciting for us to be able to do as we go forward. Now, we aren't quite in a perfect state, and so I think this is aptly titled as my wishlist going forward. One of the first things that we need to continue doing is educating our business partners, not only around the practices that we're adopting.

00:28:42

So those activities, like the failure modes and effects analysis, and understanding the benefits of SLO-based monitoring, but also understanding the investment that we're making in these areas and what that really does to help drive our business forward. Vanguard is a company that's very much built on reputation, and for us to be able to maintain that reputation, we need to make sure that we're continuing to provide that best-in-class service to all of our clients. Second, from more of a technical aspect, something I would love for us to be able to do is truly mimic external traffic. We have the tools and the capabilities today to mimic traffic that's coming from within our network, within our firewall. However, we would like to be able to truly mimic what it would look like for an end client, someone coming in from the outside, and from different regions of the globe as well, keeping in mind that Vanguard is a global company.

00:29:34

This is something that would be just another valuable tool in our toolbox to give to our engineering teams. And then finally, we've been able to see a huge transformation as Vanguard has gone from SRE as a buzzword, to a role, to really moving towards a mindset. For us to have these dedicated champions embedded within our teams would just be something that would help us to continue pushing that forward and continue helping us to educate ourselves as our practices evolve. As the tooling is out there, we want to make sure that we're staying on that cutting edge, and having those dedicated folks would just be invaluable as part of that. And so I've talked through a little bit of the impact that we've seen and kind of what I hope is coming up next. I'm now going to toss it over to Christina to talk us through what really is next.

00:30:18

Thanks, Robbie. So of course, straight from Robbie's wishlist directly into my backlog. But along with that, there's lots more that we have left to do. Some of the challenges that we are still facing are around striking the right balance between efficiency and flexibility. There's always tension between providing recommendations and standardization versus giving teams the freedom to deviate from the norm. While standardization reduces rework and increases efficiency, it's going to limit the ability to do things that are more custom, and we need to be mindful of that when making decisions. We also need to strike the right balance, and this is of particular importance for my coaching team, between time spent training and time spent delivering. Every hour that I take away from an engineer to put them through a training course is an hour that they could have spent delivering a new feature. Now, when done right, the right amount of training accelerates the rest of their time spent, but this is a very tricky balance to get exactly right.

00:31:19

There's a base level of knowledge that you need to take advantage of a really good observability platform, and finding out what that is without boring people in a week-long training is something that I will always be trying to get exactly right. Some additional challenges that I face are demonstrating the impact, especially anticipated impacts and more subtle impacts. Just because you have now set a service level objective and some service level indicators doesn't necessarily mean that you are suddenly going to be more available or that the number of incidents is going to decrease. This is something that happens really slowly over time, and people who are used to hearing that we're shooting for uptime all the time might see the impact of my work as negative. There's also the budgeting challenge. Today, all of our budget in IT is allocated based on how much work it's going to be to implement a feature.

00:32:10

And in order to consider those nonfunctional requirements and dedicated SREs, the budgeting is going to need to shift a little bit. And finally, staffing. This is really hard to do. Anyone who has tried to hire SREs can probably empathize with this point. We need to find a way to do a combination of internal upskilling and external hiring to meet our needs for talent. So now, looking ahead to the future vision and what we still have left to accomplish in the next few years, we'd like to continue to reduce our on-premises workloads. I told you we hadn't gotten all the way there yet. We've got the majority, but we'd like to go even further, to over 90% of our workloads cloud native. We'd also like to do a better job of being fully observable. I mentioned that we were dealing with alert fatigue and dashboard clutter, and while education through our SRE curriculum has alleviated that somewhat,

00:33:00

We still see problems with that today. Additionally, we see teams struggling with where and how to create their alerts, now that all of our sources of information are in different places. We hope to, at some point, put in a telemetry aggregation layer in front of all of our logs, metrics, and traces, so that dashboards can surface a combination of those various things. And finally, I would love to see us adopt a truly blameless culture toward post-incident reviews. Where we have started to do this in pockets, we have seen incredible feedback from everyone involved about how much it facilitates knowledge sharing and learning, and an overall better culture of psychological safety. And I'd really love to see that adopted across the entire organization for all of our incidents.

00:33:45

And I'd also like to leave you with a call to action here. Vanguard is hiring, so of course I will get a plug out there for our career site at vanguard.com/career, if you are interested in site reliability engineering or any of the topics that Robbie and I have talked about today. And you can connect with us as well. I would love to hear what stories you have to share, how you've solved similar problems, and what problems you're facing now. You can reach out to us on Twitter. My handle is @SREChristina; Robbie's is @Robbie_Daitzman. Or you can find either one of us under our full names on LinkedIn, and we would love to talk to you even more. With that, I will wrap up here, and thank you all so much for your time and attention. Robbie and I will be available on Slack to answer questions following the presentation.