Iterative Enterprise SRE Transformation (US 2021)

It's easy to get discouraged reading books about industry best practices that say things like "always test in prod!" and "10 deploys a day!!" At times, they can make the goal of being a high-functioning DevOps organization feel out-of-reach for large enterprises, where changes to the way we operate take time to roll out. A few years ago, Vanguard started its journey to adopting Site Reliability Engineering across the IT organization, and that transformation effort is still underway today. In this talk, we will share where we started, how far we've come since then, and all of the steps we've taken along the way, as we've worked to evangelize changes to the way we measure availability, enable experimentation, leverage highly-available architecture patterns, and learn from failure.

plenary, us, las vegas, vegas2021

Christina Yakomin

Site Reliability Engineer, Vanguard

Robbie Daitzman

Vanguard Intermediary Platform – Delivery Lead, Vanguard

TRANSCRIPT

00:00:12

Thank you, Danny and Mark. So Christina Yakomin is a senior technology lead for site reliability engineering at Vanguard, the creator of the world's first index fund back in 1975. They currently have over $8 trillion of assets under management, and are now the world's largest provider of mutual funds and the second-largest provider of exchange-traded funds. Technology has always been critical to their business, evidenced by the fact that they currently have over 7,000 developers. So earlier this year, Christina gave a talk at DevOps Enterprise Summit London, which I thought was one of the best talks on next-generation infrastructure and operations that I've ever seen, because it describes so wonderfully how one modernizes technology practices across a vast landscape that has been built up over 50 years. I asked if she could present a talk to all of us here at DevOps Enterprise and bring someone who could tell us specifically how they benefited from everything that she and her team did. So I am so delighted that Christina will be presenting with Robbie Daitzman, an IT delivery manager who supports Vanguard's financial advising technologies, and he will be sharing some of their goals and how modern infrastructure practices are helping them achieve them in ways that Vanguard customers appreciate. So here's Christina and Robbie.

00:01:32

Thanks, Gene. In this talk, we'll tell you the story of Vanguard's iterative SRE transformation and our journey as we've modernized all of our technology platforms over the past few years. To set some context, if you're not familiar with Vanguard, we are a global asset management company with over $8 trillion in global assets under management, and working behind the scenes to power all of that are over 17,000 crew members, both domestically and around the globe. And for a company like Vanguard, technology is critical, not just because of the current modern landscape of the industry, but also because Vanguard is unique in its structure. We don't have brick-and-mortar locations where our clients can come in and talk to their financial advisor face-to-face. We never have. All of our interfacing with our clients happens either over the phone or on the web. Anyone familiar with call centers knows that that really incentivizes us to have a great web experience for our clients.

00:02:31

It makes technology one of the top priorities for the firm. And for some background on who we are: as Gene mentioned, my name is Christina Yakomin, and in my current role, I am the senior technical lead on our site reliability engineering coaching team. Over the past few years, I've picked up a few AWS certifications, including most recently the AWS Solutions Architect Professional certification, and in my career so far, I've had experience with full-stack application development, as well as cloud infrastructure and automation. In my most recent prior role, I was a member of our chaos engineering team. So if you've heard me speak in the past, it may have been on that topic, and I think of myself as a bit of a chaos engineering enthusiast still today in my current role. And outside of work, in my free time, I am a member at the Philadelphia Zoo, and I love to spend some time there on a nice day on a weekend.

00:03:29

And I'm Robbie Daitzman. I'm currently an IT delivery manager supporting one of our lines of business at Vanguard. My background is fairly similar to Christina's, and we actually crossed paths as we were both on the chaos engineering team together, helping to stand that up at Vanguard a couple of years ago. Something that has always intrigued me is the people side of the engineering organization, how things such as organizational dynamics come into play there, which is actually something that helped lead me into leadership. And like many of us, I had to find different hobbies throughout our quarantine period over the past 18 months. And so something I picked up was a real interest in wine, including picking up a certification in both service and tasting.

00:04:10

All right, so now let's talk about where Vanguard started this journey. To do that, we're going to rewind about five to seven years and paint this picture. I give a range because we're still on this journey today, and I want to preface all of this with that. I'm not here to tell you today that we solved every single one of these problems throughout the entire organization, but hopefully I'll be able to demonstrate all of the great progress that we have made. So to start, about five to seven years ago, all of our technology was hosted in a private data center. We had pretty much exclusively monolithic applications and no presence in the public cloud. All of our deployments to production were extremely controlled. There were about quarterly releases, and the development teams were definitely not the ones handling the deployments. In terms of observability,

00:05:03

There wasn't any. There was alert-only visibility, meaning really no dashboards that were particularly meaningful and no positive affirmation that our systems were functioning the way that we wanted them to. All we really had was whether or not an alert had fired recently, and we depended on those alerts to tell us whether or not our systems were functioning. To take things one step further, those alerts were centrally owned. Development teams weren't setting them up on their own. They were submitting requests to another team that managed the alert portfolio and was configuring all of those, hopefully correctly. And another ticket would need to be submitted to update that alert in any way. Finally, as you might expect, development and operations were completely siloed. We had teams of programmers and teams of operators, and they only interacted when something went wrong. All of that production support was centralized, too.

00:06:00

So the next step from here is to talk a little bit about the cloud migration that enabled our SRE transformation, from that data center that we had all the way to majority public cloud. The first step was breaking down that monolith. It is incredibly difficult to take a large, unwieldy monolithic application and lift and shift it into the cloud directly. And if I had figured out how to do that, I probably wouldn't be giving this talk right now. Instead, we carved out microservices from our monolith, slowly but surely, and started running them initially on a private cloud platform as a service that we hosted internally. As part of this move, we were able to improve the deployment frequency for those portions that were carved out. It really changed the game for those development teams who were used to waiting for a quarterly release cycle. Now the regression test cycle was able to be significantly reduced, and a test automation engineer role was introduced to help automate as much as possible of the testing so that we could ensure that we were thoroughly testing our systems.

00:07:08

Before we went to production, a new pipeline was introduced to run our automated tests, and then our deployment frequency drastically increased. We even developed automation to generate change records and attach that test evidence to them, all as part of the CI/CD workflow. In order to now migrate all of those microservices that we'd spent all of that time carving out into the public cloud, we made an interesting decision. Instead of moving them one by one and putting each and every team through a migration effort, we decided to lift and shift the underlying platform as a service, the private cloud that we had been hosting. This made the lives of the microservice development teams significantly easier, but it was incredibly hard on the infrastructure teams that were migrating that underlying infrastructure. I can speak to this because I was on that team at the time, and that platform as a service really did not want to run in the cloud. But we got it there. And once we did, it left us with a now-unnecessary abstraction layer between our applications and the AWS infrastructure. And because it had been such a headache to get it to the cloud in the first place, ultimately it was overcomplicating the environment and causing more problems than it solved.

00:08:29

That's when we decided it was time to move to a more cloud-native solution, or actually, more cloud-native solutions. For the first time, we were able to introduce optionality into the application design process. We removed the abstraction layer and instead leveraged AWS resources like ECS Fargate to provide a really similar experience for the majority of those microservices. Now they were getting basically the same experience, where it was primarily serverless. They didn't have to think about VM provisioning, but we didn't have a team managing the operational overhead of the PaaS maintenance. But there were other resources that Amazon made available that were a better fit for some workloads, like Lambda. Lambda's incredibly cheap to run and very lightweight. It was especially great for migrating some of our event-driven workloads or those microservices that are accessed very infrequently, so we didn't have to pay around the clock for compute. And there were other applications that were a better fit for Kubernetes and could benefit from a control plane, so EKS emerged as another front-runner for hosting those microservice applications.

00:09:41

Now, everyone's heard the quote, "With great power comes great responsibility," and this is what those product teams ran into next. While most of them were very excited to take on more accountability for their systems, it was new to them in most cases. Now they had to test not only the application code, but also their configurations, extremely thoroughly. We provided some guidance to them on how they could best do that with the adoption of a process called the failure modes and effects analysis. This is a technique borrowed from the more physical engineering disciplines and adopted into software and technology, where you look through your system architecture, identifying possible failure points, how those could fail, and what the effects might be. Once you've done that exercise, you basically have a list of hypotheses about the way that your system behaves under stress and failure, and what better way to test those hypotheses than chaos engineering and performance testing?
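
[Editor's illustration] The talk does not show what a failure modes and effects analysis looks like on paper, so the following is only a minimal sketch of one common way to record it. It assumes the conventional FMEA scoring of severity, occurrence, and detection multiplied into a risk priority number (RPN); the talk does not say whether Vanguard scores its analyses this way. The `FailureMode` class, the `prioritize` helper, and every row in `worksheet` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FailureMode:
    """One row of a hypothetical FMEA worksheet."""
    component: str   # where the failure happens
    failure: str     # how it fails
    effect: str      # what we expect to happen as a result
    severity: int    # 1 (negligible) to 10 (catastrophic)
    occurrence: int  # 1 (rare) to 10 (frequent)
    detection: int   # 1 (always detected) to 10 (rarely detected)

    @property
    def rpn(self) -> int:
        """Risk priority number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection

    def hypothesis(self) -> str:
        """Phrase the row as a testable statement for a chaos experiment."""
        return f"When {self.component} {self.failure}, {self.effect}."


def prioritize(modes: List[FailureMode]) -> List[FailureMode]:
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Illustrative rows only; the components and scores are made up.
worksheet = [
    FailureMode("an ECS task", "crashes under load",
                "a replacement task starts and requests keep succeeding", 6, 4, 2),
    FailureMode("the database primary", "becomes unreachable",
                "reads fail until failover completes", 9, 2, 3),
    FailureMode("a downstream API", "responds slowly",
                "callers time out and return errors", 5, 5, 6),
]

for mode in prioritize(worksheet):
    print(mode.rpn, mode.hypothesis())
```

Each printed line is one hypothesis of the kind the talk describes, ordered so that the riskiest assumptions are the first candidates for a chaos experiment or performance test.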

00:10:47

So we created platforms to self-service both of those things as well in our non-production environments. I'd actually like to talk a little bit more about some specific instances where we did run chaos experiments and performance tests and saw some really great outcomes for these teams. First, we ran a chaos game day where, after we developed those hypotheses about our system resilience, we went ahead and caused task crashes on ECS, for example, and validated that under certain amounts of load, we would observe auto-scaling and self-healing from crashes. This is great. And in most cases, when we've done this, we've seen a lot of our hypotheses verified and maybe just one or two go a bit differently than we expected. And it's always a great opportunity for us to learn and continuously validate that our systems are running the way that we think they are. We also ran a chaos fire drill. We differentiate this one in that the failure we are adding into the system is one we don't expect to be resilient to.

00:11:48

We expect to break things. The reason we did this is because we weren't trying to test the application. We were trying to test a new observability tool. The new tool we were bringing in promised easier troubleshooting, and we wanted to put it to the test. So we injected failures that purposely broke the system and let engineers try out the tool for the first time and see if it really was easier. And it was. This was a great tool for onboarding onto the new observability platform, and we still have the recording of that and use it for training new engineers who are going on call, who might be interfacing with this tool. Finally, break testing our CI/CD pipeline. This is a really interesting one because our pipeline is not SaaS. We are hosting the pipeline, and the team that was handling that hosting and maintenance was faced with recurring instability of their pipeline at high-traffic times, which is the worst time for any system to be having instability.

00:12:46

But of course, with a pipeline, what that means is it's when your developers are most productive. They are building and deploying the most. And these crashes, because the system was wiping itself out before it could even offload its logs, were preventing thorough investigation during the business day. So we actually had to come up with creative ways to recreate the condition: create deployment environments and build environments that looked like real ones and generated the same number of logs, without actually impacting all of the pipeline's users. We did this on a weekend. We created the condition while someone was able to watch what was happening in real time on the server, collect those logs, and collect a thread dump, and we were easily able to spot the bottleneck, increase the sizing, and see the benefits the very next Monday.
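
[Editor's illustration] The game day described earlier in the talk follows a general pattern: confirm steady state, inject a failure, then check whether the system returns to steady state. Below is a minimal sketch of that pattern, assuming a toy in-memory service rather than real ECS tasks. `run_experiment`, `ExperimentResult`, and `ToyService` are hypothetical names and do not represent Vanguard's self-service chaos platform or any AWS API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ExperimentResult:
    hypothesis: str
    steady_before: bool                    # was the system healthy before injection?
    recovered: bool                        # did it return to steady state in time?
    checks_until_recovery: Optional[int]   # number of checks needed, if it recovered


def run_experiment(
    hypothesis: str,
    steady_state: Callable[[], bool],
    inject_failure: Callable[[], None],
    advance: Callable[[], None],
    max_checks: int = 10,
) -> ExperimentResult:
    """Verify steady state, inject a failure, and watch for recovery."""
    if not steady_state():
        # Never inject failure into a system that is already unhealthy.
        return ExperimentResult(hypothesis, False, False, None)
    inject_failure()
    for check in range(1, max_checks + 1):
        advance()  # let one scheduling interval pass
        if steady_state():
            return ExperimentResult(hypothesis, True, True, check)
    return ExperimentResult(hypothesis, True, False, None)


class ToyService:
    """Stand-in for a service whose scheduler replaces crashed tasks."""

    def __init__(self, desired: int, restarts_per_tick: int) -> None:
        self.desired = desired
        self.running = desired
        self.restarts_per_tick = restarts_per_tick

    def crash_tasks(self, count: int) -> None:
        self.running = max(0, self.running - count)

    def tick(self) -> None:
        self.running = min(self.desired, self.running + self.restarts_per_tick)

    def healthy(self) -> bool:
        return self.running == self.desired


service = ToyService(desired=4, restarts_per_tick=1)
result = run_experiment(
    "Crashing two tasks is healed without intervention",
    steady_state=service.healthy,
    inject_failure=lambda: service.crash_tasks(2),
    advance=service.tick,
)
print(result)
```

In a fire drill of the kind described above, the same harness would be expected to report `recovered=False`, because the injected failure is one the system is not designed to survive.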

00:13:35

Now I'd like to spend some time talking about our observability journey. I mentioned right at the beginning that we started from a place of no observability whatsoever, alert-only visibility, and this timeline kind of walks you through the transition from there to where we are now. Initially, the operations teams started to make dashboards that were really more like alert consoles, so that they could at least see what the current status of the various alarms that existed was. And this worked for a while, but we knew that it was insufficient. So when we developed that central microservice platform, the team developed some standard application dashboards to go along with it, to give us better visibility into all of those disparate microservices, because we simply weren't going to be able to keep track centrally of all of those different microservices anymore. Teams took to these, and it started to go really well.

00:14:31

But what we found was they were looking for more. We got feature requests. Slowly but surely, as teams started using these to monitor their own applications independently, we added more logs, we added metrics, and we made these dashboards really, really robust. And then finally, the level of customization that was being asked for was so great that we allowed teams to gain access to the tool. After completing a training, they were able to clone the standard dashboards that had been provided and tweak them to meet their needs and the specific monitoring that their application would benefit from. As we saw this grow in scale, we saw some really positive outcomes, as well as some not-so-great consequences of this decision. The benefits were obviously that creating new dashboards, and alerts from those dashboards' queries, was so much faster. Teams now could move more quickly.

00:15:31

They didn't have to submit tickets, and that's always a good thing. Teams were also leveraging data in their decisions. We saw teams decide not to move forward with a deployment based on what they observed in their logs, or even decide to roll back early, because they were able to monitor and they knew really intimately well what their steady state was. And there was an increased focus on production support. Teams in pockets really started to rally around this and take ownership for what felt like the very first time. But the consequences that I mentioned included dashboard clutter. Some teams took this to the extreme. They made dashboards for everything, so many dashboards that you couldn't keep track of them anymore. The same goes for alerts. A lot of the time, teams were alerting on the wrong things, or they were alerting on just too many things. And when you're alerting on too many things, as you probably know, that can lead to alert fatigue and ultimately ignored alerts, which is a real problem when one of those ignored alerts is actually the signal of a production incident.

00:16:36

Another issue we started running into is that up to this point, the platform that we were using for these dashboards and alerts was simply our log aggregation tool, and we were sending all of this information to it as logs. It worked at first, but at scale, things started to fall over. Just because you can do everything in your log aggregation tool definitely doesn't mean that you should. We saw costs increase exponentially and observed performance concerns, which is a real problem. The last thing that you want is to be trying to troubleshoot a production incident and have your resolution slowed down because you can't even see the error message that the system's printing out for you. So we started to adapt. We needed to put metrics and traces where they belonged, starting with using Amazon CloudWatch directly for looking into the critical system metrics, and then bringing in Honeycomb to observe distributed traces across all of our microservices. When I mentioned earlier that we had done a chaos fire drill to put a new observability tool to the test, this is the one that I was talking about.

00:17:43

We've also decided to standardize around OpenTelemetry. We see this as an investment for the future. OpenTelemetry is a framework that's going to allow us to make the investment in tracing and instrumentation but avoid vendor lock-in. We can do it once and have confidence that it won't be thrown away later, because we're not the only ones standardizing around this framework. It really looks like the industry is doing so as well, and lots of observability tools are now coming with out-of-the-box integration with OpenTelemetry collectors. We've also spent time at Vanguard developing shared libraries to extract common fields that might be helpful to add to the metadata of a trace, for example, to make this as easy for teams as possible. And we've added auto-instrumentation for OpenTelemetry to all of our exemplars, so all new projects will get it right away with no additional effort from the application teams.
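
[Editor's illustration] The talk mentions shared libraries that extract common fields for trace metadata but does not show one. The sketch below is a guess at the general shape such a helper might take, using only the Python standard library. The header names, the attribute names, and the `extract_common_attributes` function are all hypothetical; they are neither Vanguard's library nor OpenTelemetry's semantic conventions.

```python
from typing import Dict, Mapping

# Hypothetical mapping from incoming HTTP header to trace attribute name.
COMMON_FIELDS: Dict[str, str] = {
    "x-request-id": "app.request_id",
    "x-client-channel": "app.client_channel",
    "x-release-version": "app.release_version",
}


def extract_common_attributes(headers: Mapping[str, str]) -> Dict[str, str]:
    """Pick the agreed-upon fields out of request headers.

    Header names are matched case-insensitively, and missing or empty
    headers are skipped so that traces never carry blank attributes.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    attributes: Dict[str, str] = {}
    for header, attribute in COMMON_FIELDS.items():
        value = normalized.get(header, "")
        if value:
            attributes[attribute] = value
    return attributes


incoming = {
    "X-Request-ID": "abc-123",
    "X-Client-Channel": "web",
    "Accept": "application/json",
}
print(extract_common_attributes(incoming))
```

The returned dictionary is what a team would then attach to the current span through whichever tracing API it uses. Keeping the mapping in one shared place is what lets every service emit the same attribute names.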

00:18:40

Now, finally, after talking about our cloud migration and observability journey, I can start to talk about our adoption of site reliability engineering. To start, we needed to change the way we measured availability. Up to this point, it had been extremely binary, with no nuance whatsoever. We were either up, which was good, or down, which was bad. And what this implied is an attempt to achieve 100% uptime, which we all know is a truly impossible goal. Instead, we needed to shift our thinking to these core concepts that SRE introduces: the service level indicators and objectives, and the associated error budgets. These aren't real SLIs and SLOs that we're using at Vanguard, but they look a lot like the ones that our applications are using. Say, for example, you want to set a definition of healthy in terms of availability, which is HTTP status codes, and latency. You'd measure against maybe the HTTP statuses being less than 500.

00:19:38

And the response times being less than two seconds, or 2,000 milliseconds. Now we can set targets accordingly, maybe 95% success, or 99% are faster than two seconds, and modify those over time based on our clients' expectations. It gives us better impact measurement during incidents. Instead of just up or down, we can say that 10% of requests are unhealthy or 50% of requests are unhealthy. It also lets us better prioritize availability and resilience initiatives against our feature delivery. Now product owners are a part of the conversation, saying, you know, we really need to be 99.9% available for our clients. And if that's the case, they can maybe make some trade-offs with their team and deliver features a little bit more slowly. Or if the product owner says, I really need this feature sooner, they may be willing to compromise and accept 99% or 99.5% availability for a given product.
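
[Editor's illustration] To make the arithmetic behind these example SLIs, SLOs, and error budgets concrete, here is a minimal sketch. It uses the thresholds mentioned in the talk (status below 500, responses under 2,000 milliseconds, targets of 95% and 99%). The `Request` type, the function names, the 30-day window, and the sample traffic are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code
    latency_ms: float  # response time in milliseconds


def availability_sli(requests: Sequence[Request]) -> float:
    """Fraction of requests whose HTTP status is below 500."""
    if not requests:
        raise ValueError("need at least one request")
    return sum(r.status < 500 for r in requests) / len(requests)


def latency_sli(requests: Sequence[Request], threshold_ms: float = 2000.0) -> float:
    """Fraction of requests answered faster than the threshold."""
    if not requests:
        raise ValueError("need at least one request")
    return sum(r.latency_ms < threshold_ms for r in requests) / len(requests)


def error_budget_remaining(sli: float, slo: float) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    allowed = 1.0 - slo
    if allowed <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return 1.0 - (1.0 - sli) / allowed


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage an availability SLO tolerates in the window."""
    return (1.0 - slo) * window_days * 24 * 60


# Made-up traffic: 97 healthy requests and 3 slow server errors.
sample = [Request(200, 120.0)] * 97 + [Request(503, 2500.0)] * 3

print("availability SLI:", availability_sli(sample))
print("latency SLI:", latency_sli(sample))
print("budget left vs 95% availability:", error_budget_remaining(availability_sli(sample), 0.95))
print("budget left vs 99% latency:", error_budget_remaining(latency_sli(sample), 0.99))
for target in (0.999, 0.995, 0.99):
    print(target, round(allowed_downtime_minutes(target), 1), "minutes per 30 days")
```

With this sample, the 95% availability target still has budget left while the 99% latency target is overspent, which is the kind of signal that would feed a conversation about slowing feature work. The last loop shows the trade-off a product owner is making: over 30 days, 99.9% allows roughly 43 minutes of full outage, 99.5% roughly 216 minutes, and 99% roughly 432 minutes.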

00:20:35

We also introduced an SRE coaching team as part of this new adoption of SRE. The purpose of the coaching team, which is the team that I'm on, is to evaluate new tools that can make this transition easier, develop a self-study curriculum so that all of the engineers at Vanguard have an opportunity to learn about the core concepts of SRE and observability, and also help to define the strategic vision for what adoption of SRE will look like at Vanguard. And here's what that is. This is our SRE operating model. We take a hub-and-spoke approach to this, where the SRE coaching team, that's my team, will work with designated full-time SRE champions in each one of our IT subdivisions that support our various lines of business. Those SRE champions will work with various SRE leads within the subdivision. Those leads will be aligned to groups of related products and responsible for making sure that those teams are conducting their failure modes and effects analysis, their chaos experiments, and their performance tests, and that they are setting appropriate SLOs and SLIs that are complementary to the other products that they're interfacing with. And then finally, we have the product teams, where we may or may not have dedicated SREs on those teams, there to ensure that the non-functional requirements are always prioritized. In the absence of that full-time role, all of the DevOps engineers on those product teams will share the responsibility of managing the alert portfolio and ensuring that the individual SLIs and SLOs are being met and the error budget is being appropriately managed.

00:22:15

Now I'd like to turn it over to Robbie, to talk a little bit about the impact that this modernization has had at Vanguard.

00:22:23

All right, thanks, Christina. And so to help set a little context around what exactly it is that my team supports, we're responsible for supporting Vanguard's Financial Advisor Services line of business. Really, the way that you can think about this line of business is that it's how Vanguard partners with external financial advisors so that they can help their clients meet outcomes such as saving for retirement or leaving legacy gifts behind to those that they love or that they're close to. Financial Advisor Services is responsible for roughly half of Vanguard's assets under management. So in terms of criticality to the business, it's pretty up there. And what's most interesting about this business right now is that for years, Vanguard has been able to compete on the investment product. And as the industry has really seen pressures in terms of fees and performance, being able to compete solely on that investment product isn't feasible for the business.

00:23:18

We are really working towards making the shift in terms of how we are providing that best-in-class experience in terms of the methodology and tools that those advisors have access to. And for us, from a technology perspective, this really puts the ball into our court around how we can really make sure that we're delivering for our business. And so this was a journey that for our team really started in 2019, when we set out to create Vanguard's first SaaS product that we were able to bring to market. Initially, what we were taking a look at was a retirement planning suite of tools that an advisor would be able to work through. This was going to be a multi-tenant platform, and we were building it 100% in AWS. For Vanguard, this was our first time building an application like this. To be able to build an application that, all the way from login through data, lives 100% in the cloud was one that came with a lot of challenges.

00:24:11

And as we continue maturing what this product is and what it looks like, we see that this really has a future for the way our business is going to continue to operate. However, understanding the challenges that come with that, how do we not only build but operate this type of application, has been one of the biggest things that we've been looking forward to doing. And so for us to be able to do that, we've had to focus on these core reliability principles, bringing in tooling and practices, and really, partnership with Christina's team in the SRE space is how we've been able to be successful as a part of that. And so in order to make this journey really be successful, one of the first things we needed to do is take a look at what tools we are using.

00:24:56

Christina already walked us through this journey that Vanguard has been on, to be able to separate out everything from living in that central logging aggregation tool to understanding what is the right tool for the job at the right time. This was something that definitely was a point of friction for us, not only from a technology perspective but also from a process side, in that the logging aggregation tool was something that engineers were comfortable with and familiar with. However, as we continued to see issues with performance, and came to understand what the benefits would be going forward, we were able to make that transition and separate out into these three pillars of observability. As a result of this, we've seen that we've been able to provide better service to our clients by not only being able to recognize incidents in a more timely fashion, but also having better information at our fingertips as our engineers are actually going through and responding to those incidents, to be able to restore service to our end clients.

00:25:54

Now, the tooling is absolutely very valuable, but what really excites me are the practices that we've been able to adopt as a part of this transformation as well. We talked about the failure modes and effects analysis. With that, what we've actually been able to do is sit down with not only our engineering teams, but also our business teams, and talk through: if something were to fail at this given point, what would we expect the technical response to be? Would there be a level of self-healing? What alerts would we expect to fire? But with our business partners, we've also been able to walk through the activity to say, what would we want our business response to be, both from communications to the external world, or are there ways that manually we would want to do a certain type of processing or respond in some lightweight way? That has been able to identify a list of action items, to be able to say there are areas where we want to go and put in additional self-healing or make it more resilient. But it also has sparked discussion around areas where we either weren't quite sure what the impact would be, or, what's even more exciting in some ways, where we had different engineers actually disagree about what would happen.

00:27:03

And this is where the role of self-service tooling, such as performance testing and chaos testing, has been able to come into play, as we've been able to take those hypotheses that our teams have had and really put them to the test to see what does happen in the system when we're under those given conditions. There's really no better way to align mental models than to see it in action. Finally, something that I think we're all familiar with is that in incidents, it's most often the senior engineers that will be taking the lead and helping to actually go through and resolve them. While it is great to have those rock stars and those people that you know you can always count on, what happens when they're not available, whether they're on vacation or they're not the one on call on that given evening? We ran a chaos fire drill just a couple of months ago where we actually told our senior engineers, you will sit this one out.

00:27:55

And what we wanted to do is provide an opportunity to some of our more junior engineers to, one, step up to the plate and be able to show that they're capable of it, but also be able to push themselves to learn and grow overall. What this has done is really strengthened the engineering excellence within our organization, pushing folks to get involved in parts of the stack that they may not be involved with day-to-day. And so that's been something that's been really exciting for us to be able to do as we go forward. Now, we aren't quite in a perfect state, and so I think this is aptly titled as my wishlist going forward. And so one of the first things that we need to continue doing is educating our business partners, not only around the practices that we're adopting.

00:28:39

So there are activities like the failure modes and effects analysis, and understanding the benefits of SLO-based monitoring, but also understanding the investment that we're making in these areas and what that really does to help drive our business forward. Vanguard is a company that is very much built on reputation, and for us to be able to maintain that reputation, we need to make sure that we're continuing to provide that best-in-class service to all of our clients. Second, from more of a technical aspect, something that I would love for us to be able to do is truly mimic external traffic. We have the tools and the capabilities today to be able to mimic traffic that's coming from within our network, within our firewall. However, we'd like to be able to truly mimic what it would look like for an end client, someone coming in from the outside, and from different regions of the globe as well, keeping in mind that Vanguard is a global company.

00:29:31

This is something that would be just another valuable tool in our toolbox to be able to give to our engineering teams. And then finally, we've been able to see a huge transformation as SRE has gone from a buzzword, to a role, to really moving towards a mindset. And for us to have these dedicated champions embedded within our teams would just be something that would help us to continue pushing that forward and continue helping us to educate ourselves as our practices evolve. As the tooling is out there, we want to make sure that we're staying on that cutting edge, and having those dedicated folks would just be invaluable as part of that. And so I've talked through a little bit of the impact that we've seen and kind of what I hope is coming up next. I'm now going to toss it over to Christina to talk us through what really is next.

00:30:16

Thanks, Robbie. So of course, straight from Robbie's wishlist directly into my backlog. But along with that, there's lots more that we have left to do. Some of the challenges that we are still facing are striking the right balance between efficiency and flexibility. There's always tension between providing recommendations and standardization versus giving teams the freedom to deviate from the norm. While standardization reduces rework and increases that efficiency, it's going to limit the ability to do things that are more custom, and we need to be mindful of that when making decisions. We also need to strike the right balance, and this is of particular importance for my coaching team, between time spent training and time spent delivering. Every hour that I take away from an engineer to put them through a training course is an hour that they could have spent delivering a new feature. Now, when done right, the right amount of training accelerates the rest of their time spent, but this is a very tricky balance to get exactly right.

00:31:16

There's a base level of knowledge that you need to take advantage of a really good observability platform, and finding out what that is without boring people in a week-long training is something that I will always be trying to get exactly right. Some additional challenges that I face are demonstrating the impact, especially anticipated impacts and more subtle impacts. Just because you now have set a service level objective and some service level indicators doesn't necessarily mean that you're suddenly going to be more available or that the number of incidents is going to decrease. This is something that happens really slowly over time, and people who are used to hearing that we're shooting for uptime all the time might see the impact of my work as negative. There's also the budgeting challenge. Today, all of our budget in IT is allocated based on how much work it's going to be to implement a feature.

00:32:07

And in order to consider those nonfunctional requirements and dedicated SREs, the budgeting is going to need to shift a little bit. And finally, staffing. This is really hard to do; anyone who has tried to hire SREs can probably empathize with this point. We need to find a way to do a combination of internal upskilling and external hiring to meet our needs for talent. So now, looking ahead to the future vision and what we still have left to accomplish in the next few years, we'd like to continue to reduce our on-premise workloads. I told you, we haven't gotten all the way there yet. We've got the majority, but we'd like to go even further: over 90% of our workloads cloud native. We'd also like to do a better job of being fully observable. I mentioned that we were dealing with alert fatigue and dashboard clutter, and while education through our SRE curriculum has alleviated that somewhat,

00:32:57

We still see problems with that today. Additionally, we see teams struggling with where and how to create their alerts, now that all of our sources of information are in different places. We hope to, at some point, put in a telemetry aggregation layer in front of all of our logs, metrics, and traces, so that dashboards can surface the combination of those various things. And finally, I would love to see us adopt a truly blameless culture toward post-incident reviews. Where we have started to do this in pockets, we have seen incredible feedback from everyone involved about how much it facilitates knowledge sharing and learning, and an overall better culture of psychological safety. And I'd really love to see that adopted across the entire organization for all of our incidents.

00:33:42

And I'd also like to leave you with a call to action here. Vanguard is hiring. So of course I will get a plug out there for our careers site at vanguard.com/careers, if you are interested in site reliability engineering or any of the topics that Robbie and I talked about today. And you can connect with us as well. I would love to hear what stories you have to share, how you've solved similar problems, and what problems you're facing now. So you can reach out to us on Twitter. My handle is SREChristina, Robbie's is Robbie underscore Daitzman, or you can find either one of us under our full names on LinkedIn, and we would love to talk to you even more. With that, I will wrap up here, and thank you all so much for your time and attention. Robbie and I will be available on Slack to answer questions following the presentation.