Las Vegas 2018

The Nth Region Project: An Open Retrospective

For the past year, a small team of engineers and I have had one job: allow New Relic to run an independent European region for data sovereignty reasons. That means taking around 500 services written by around 50 teams that have historically been assumed to run in just one deployment and changing them to work anywhere. And, at the end of the process, we needed to be able to spin up new regions quickly and sustainably operate them with our existing staff.


The talk will be in two parts, because a project like this isn't purely technical or organizational. We needed to choose technical changes that turned building out a new region from a many-month-long process for all teams into a project for one small team. We decided that the key was to move all services to run in containers, and have them all do service discovery via dependency injection. The reality of working at a medium-sized organization meant we had to have a lot of coordination and buy-in. I'll talk about how our roadmapping process both hindered and enabled this project to work at all, and how we used test buildouts and teardowns to integrate early and often.


This wouldn't be an open retrospective without talking about what didn't work well, which was primarily organizational rather than technical. We've learned some lessons on how to run large-scale projects that will hopefully help us on our next one, so I hope that we can provide some hard-earned lessons.


Andrew has worked on a wide range of projects, including the NRDB distributed event database, charting, the autocompleting NRQL query editor, bare metal hardware provisioning, and supporting multiple regions. He lives in Pittsburgh, Pennsylvania, USA, where he also sings classically in the Mendelssohn Choir of Pittsburgh.


Andrew Bloomgarden

Principal Software Engineer, New Relic

Transcript

00:00:05

In this talk, I'm presenting an open retrospective, talking about both the technical and organizational aspects of the project and what we learned along the way. So think of this as a case study in the way one medium-sized company deals with the challenges of operating at scale. Most of you have probably heard of New Relic, and some of you are probably customers of ours, but let me introduce you to what we do from the perspective of an engineer at the company. New Relic is an observability company that builds software that our customers use to analyze how their software actually works in the wild. That means that our customers send us a ton of data that we process and store, and then we present it back to them in the form of alerts, curated UIs, and ad hoc queries, across a variety of different products.

00:00:41

Browser, Mobile, APM, and a few more. Last year, as part of our partnership with IBM, we agreed to build out a new region in Europe, in IBM's data centers in Frankfurt, in order to let European customers keep their data in the EU. Note that this isn't the same as some kinds of multi-region projects, where it's purely for redundancy; we needed to actually keep the data in a certain place. This was a pretty scary proposition to us in engineering, despite our fairly well-run organization, so let me explain why. When I started at New Relic eight years ago, we had just one application. It was the UI and the data collection tier all in one, a true Rails monolith. The company was two years old at the time and hadn't really had the time to build out a whole ton of technical debt.

00:01:22

Shortly after I started, one of our engineers split out the collection tier into the aptly named Collector. It was written in Java, and it was a couple of orders of magnitude faster and more efficient than the initial Rails implementation. If we had stopped here and said, great, we're going to build out a European region, technically and organizationally we would have been totally fine. This wouldn't have been a huge project. It might not have been the right decision for the business, and given that we didn't do it, it probably wasn't. But changing a couple of code bases at a two-year-old company is not a huge deal. We didn't do that, though. Instead, in eight years, we've done a lot of things. We've scaled a lot, introduced new products and features, and to handle that growth we've had to continuously re-architect our software. Today we're handling around 30 gigabytes per second of data inbound, 15 million Kafka messages a second, and writing around 600 million events per minute.

00:02:06

I think that last number is now up to around a billion events per minute, actually; the slides are a bit out of date. We have around 50 engineering teams and hundreds of engineers working on that. And along the way, one transition we made was recognizing that there's no way this can work with a central operations team: all of our teams are on call for their services. This is what our architecture looks like today. Like most service-oriented architectures, it reflects both technical requirements, like actual products doing actually different things, and organizational structure, like this team worked on that thing and this other team worked on that other thing. Our architecture is at the point where a single, easily understandable diagram can't faithfully represent it in all its detail, and there's no way that our original one- or two-app architecture would have scaled this far. So we made the right choice, but we never really considered that we'd ever have to run in more than one region.

00:02:53

That was always just a potential future that never seemed to arrive. So when the business said it's time to build an EU region, we knew that this was going to be a very painful exercise. And when we confirmed that there was a really high chance that this wouldn't be the last region the business wanted to build, we said, okay, this is going to be a slightly different project. We're going to focus on building tools for building regions, so that even though we know there's a lot of manual work going into this round, we can make it an automated process the next time. The aspiration is that one small team can be in charge of clicking a few buttons, making some changes, and supporting a whole new region. And we called this project Backpack, as we mostly-American engineers were finally getting to go on a trip to Europe.

00:03:37

So I'd said that we ran in one region and didn't really have experience building out multiple regions, but that's not technically true. Every year we do a disaster recovery exercise where we prove to ourselves and to our customers that we can successfully rebuild the entire New Relic stack in a new environment. So we did actually have experience building out new regions, and we knew that it was just incredibly painful. We had proven that in the event of a real emergency we could drop everything and recover, but the exercise took a lot of effort across the entire engineering organization. So we knew when we started this project that that's what we'd have to look into. Why was it so painful? In the eight years I've been at the company, we've had to solve a bunch of problems; this is something that's very familiar to organizations that have tried to adopt DevOps.

00:04:20

We had to figure out how to deploy many services, support a polyglot environment, have some kind of service discovery system, some sane secret management, and, more recently, some better container orchestration. So we were just going to tack a new thing onto the list, leverage all of our expertise in all of those earlier things, and we would be fine. That was at least the theory. The reality is a little different. Will Larson, who's worked at Digg, Uber, and now Stripe, wrote an article earlier this year about migrations. He said that migrations are the only mechanism to effectively manage technical debt as your company and code grow, and if you don't get effective at software and system migrations, you'll just end up languishing in technical debt. And it turned out that we just weren't very good at large-scale infrastructure migrations.

00:05:02

They just kind of seemed successful because we were growing so fast. We would do a bunch of new things, they would use the new practices, and those would be great. The old thing didn't really come along for the ride, and it wasn't really a problem until it maybe blew up in production. Then we'd realize, oh, we have a problem here. Now, a variety of vintages might be nice in a wine cellar, but it's not what you want for large-scale software systems. Our many vintages included applications dating back to 2010 deployed via Capistrano and Puppet. In 2013, we realized that Docker was key to scaling to way more services than we had before. That was the very early days of Docker; it was kind of on fire a lot of the time, but it was crucial for us. We wrote an internal tool, Centurion, to deploy it, and that kind of worked. We had an in-house service discovery system that we abandoned after a couple of years, but it was still used in a bunch of production software. Vault we started using in 2016.

00:05:57

And in 2017 we started building out a new container orchestration platform based on Mesos, with our own internal tooling on top of that. The reality was that for disaster recovery, we just had to account for all of these different things. We had to copy-paste configuration, tweak it, and find all the places where there were little edge cases in the code. It got so bad that for our most recent disaster recovery exercise, when we started the project, we just closed out an entire pull request for our Puppet environment because we couldn't trust ourselves to move forward with it. All of this is to say that if you have a DevOps environment like we did, you do it because you think you can move faster, but it's very, very easy to let yourself mask over the problems that you still do have.

00:06:40

Then, when a large project request like this comes along, you find yourself in a position where you're not actually able to execute on it. That said, we had the actual requirement: we have to build this region, and we have to make it so the next one won't be as bad. So we had to look for the high-leverage interfaces, the things we could implement now that maybe wouldn't check all the boxes, but would put us in a better place for next time and make this round better as well. We decided a few things. First, we needed to solve service discovery. As I mentioned, we had an internal system that we didn't really use well, and we needed some kind of system. But another way we were thinking about this problem was in terms of static analysis.

00:07:28

If you're building and deploying a new region, you're deploying all of the hundreds of services that you already have, in our case, and you don't really know everything about them. You know that they all exist in production today and are therefore all necessary for production to work, but you don't necessarily know the relationships between them, how they depend on each other, or the subtleties about them. So if you're deploying service Alice to your new production environment, it would sure be nice to know that it actually depends on service Bob before deploying Alice, and maybe a few layers on top of that, rather than realizing later that this bottom-layer thing was never deployed or was broken in some way and didn't actually work. With static analysis, with the ability to say, I know for a fact that Alice depends on Bob, you can just deploy Bob first, then deploy Alice, and you're in great shape.
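
To make that concrete, here's a minimal sketch (not New Relic's actual tooling) of why declared dependencies matter: once the "Alice depends on Bob" relationships exist as data, a buildout can compute a safe deploy order statically. The manifest shape and service names are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical dependency data, as it might be extracted at deploy time.
# Each service maps to the set of services it depends on.
declared_deps = {
    "alice": {"bob"},      # Alice calls Bob
    "bob": {"storage"},    # Bob needs the storage tier
    "storage": set(),      # storage depends on nothing upstream
}

# Because the dependencies are data rather than buried in code,
# a new-region buildout can compute a safe bottom-up deploy order.
deploy_order = list(TopologicalSorter(declared_deps).static_order())
print(deploy_order)  # ['storage', 'bob', 'alice']
```

The same data also tells you, before anything is deployed, that a region missing "storage" can never run Alice.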

00:08:17

Our original service discovery system just didn't have that property. It was buried in code, so you could guess that something depended on something else, but you couldn't be confident; it could be buried many libraries deep, which wasn't really helpful when you were trying to do a from-scratch buildout. So we wanted to figure out a way to encode that information in our software, so that if one service depended on another, we knew for a fact that that was the case. We had a configuration system for deployments that we usually passed hardcoded information into, but we introduced an abstraction layer. So we could say, okay, we're normally passing in these hardcoded values; instead, ask for something that says where Bob is and we'll replace that for you. Then we can sniff out that information at deploy time to understand what was actually going on.

00:09:03

That is, that Alice depends on Bob. This also let us solve a closely related problem: how do you provision credentials? We had a solution for where you put credentials, Vault, which is a really useful tool, but how do you get credentials there in the first place? Again, we have hundreds of services and hundreds of databases, and these dependencies are somewhat known, somewhat not. So we needed a way to say, okay, I'm going to take this kind of information, encode it in a URL, and then get that same kind of information extraction out of this abstraction layer, so it can tell me that this service actually has an authenticated database dependency. This is service discovery as dependency injection: services declare their dependencies in a standard format, you can put credentials in there, and static analysis is actually possible. Next: containers everywhere. This was key for us.
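
As a rough illustration of the idea (the URL scheme, names, and fields here are made up, not the actual New Relic format), a service might declare each dependency as a URL, and the deploy tooling could then extract both the service wiring and the credentials it needs to provision:

```python
from urllib.parse import urlparse

# Hypothetical dependency declarations for service "alice".
ALICE_DEPENDENCIES = [
    "service://bob",                           # Alice calls the Bob service
    "mysql://alice-app@accounts-db/accounts",  # an authenticated database dependency
]

def analyze(dependency_urls):
    """Statically extract what the deploy tooling needs to know."""
    services, databases = [], []
    for raw in dependency_urls:
        url = urlparse(raw)
        if url.scheme == "service":
            services.append(url.hostname)
        elif url.scheme == "mysql":
            databases.append({
                "user": url.username,  # credential to provision, e.g. in Vault
                "host": url.hostname,
                "database": url.path.lstrip("/"),
            })
    return services, databases

print(analyze(ALICE_DEPENDENCIES))
# (['bob'], [{'user': 'alice-app', 'host': 'accounts-db', 'database': 'accounts'}])
```

At deploy time, the same declarations would be rewritten with region-specific hosts and injected credentials; that's the dependency-injection half of the design.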

00:09:51

They're a great interface, most importantly between teams and machines. Originally, when we had a small number of services, we would have to go and make all sorts of configuration management changes on machines to deploy new services that had slightly different dependencies. This was killing us. So we had known since 2013 that containers, in the form of Docker, were crucial for us. But we could push that further and make it better for the new region: push everything into our Container Fabric, our Mesos platform, and say, okay, we're setting ourselves up even better for the future. We're going to put stateful services there as well, because even if we can't necessarily orchestrate them today, we do have some experience running stateful services, like our Cassandra clusters, this way. We're going to push more services into that. In fact, all of our services are going to be in containers, and eventually we're going to orchestrate all of them, just not right now.

00:10:37

Next, one important thing was to standardize on an operating system better suited to our modern use case: CoreOS, not CentOS. CentOS was great for us when we managed machines via Puppet, but we had sort of aged past that, and it wasn't really helping us today. CoreOS allowed us to encourage the behavior we wanted to see. Configuration is very limited and there's no package manager, which means that anything you want to do, you want to do in a container, which is exactly the behavior we want to see. It also has a first-boot configuration system that's really well matched to the configuration it does support, and that means we can stop using Puppet, which we just hadn't managed very well. We were even able to use the config transpiler that they published to make assertions in our machine provisioning code that things were actually happening.

00:11:17

So we can assert that our New Relic infrastructure agent is installed on every machine in our clusters, just as part of our provisioning process. And finally, we use Terraform, because some infrastructure is just fiddly customization and we needed a way to make that repeatable. We needed to know that, okay, yes, this S3 bucket is different from that one, but the next time we go and build this out, we're going to build them out in the same fiddly, different way, instead of accidentally making them the same and running into problems at runtime. And importantly, if you use Terraform (maybe you don't, maybe you do), you can develop your own providers, and it's relatively easy; we found that the investment in doing that paid off repeatedly. So I've just spouted off about all the requirements that we tacked onto our initially simpler project of an EU region.
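
Here's the flavor of that kind of provisioning assertion, sketched in Python; the file path, JSON shape, and unit name are illustrative rather than the real provisioning code:

```python
import json

def assert_infra_agent_configured(ignition_path):
    """Fail the buildout if a host config doesn't install the agent."""
    with open(ignition_path) as f:
        config = json.load(f)

    # Collect the systemd units the first-boot config would enable.
    unit_names = {
        unit.get("name")
        for unit in config.get("systemd", {}).get("units", [])
    }
    assert "newrelic-infra.service" in unit_names, (
        "every cluster machine must run the infrastructure agent"
    )

if __name__ == "__main__":
    # Run against every rendered machine config as part of provisioning,
    # so "the agent is everywhere" is proven on each buildout, not assumed.
    assert_infra_agent_configured("rendered/ignition.json")
```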

00:12:03

You may be thinking that this sounds like a second-system kind of project: that we turned a smaller project into a huge one, which is a great way to have a project fail. And you're right, this was risky, but we did try to reduce the risk by cutting out changes that weren't actually necessary. For example, we have our load balancing infrastructure on F5 hardware load balancers in our US region. We couldn't give those to IBM, but we could run virtual F5s in containers on CoreOS. That let us say: cool, F5s work just the same as in the US; they're not part of this project.

00:12:36

But here's the reason we did all this: let's say we had built out the European region without making the infrastructure investments we needed, just business as usual. We might have ended up with twice the ongoing operational load, and that might have been a killer. You spend the same amount of time to build a new region, you do that a couple more times, you've burned months and months of time, and you now have five regions to support that are each eating the same amount of time. That's just lighting money on fire. We might drag the company down under endless operational toil, or be forced to turn down business opportunities by not building those regions in the first place. So we had to find the Goldilocks zone: the right work to do so that future regions would be better and we were set on the right path for success.

00:13:22

So that's all the technical stuff we wanted to do, and we knew that pretty early on in the project. We hoped that we would do that discovery, then fan out all the work that we had to do, because every team was going to have to do something to their own software; they'd all do that, we'd integrate, build out the region a few times, test it, and then release it to our customers. That was the hope. Here's the reality; now it's time for the retro on what actually happened. I wouldn't be standing up here if the project had gone perfectly, because honestly, "this went perfectly" is just a boring talk. More importantly, none of the things we did technically are all that unusual these days; this is the kind of buzzword bingo you'll see at many conferences. What's more interesting to me is that we made these changes at some real scale. So I'm going to go over some lessons we learned along the way, and cap each one off with a set of things that we'd start, stop, and continue in future large-scale projects, whether at the infrastructure layer or the product level. I hope these lessons can be relevant to you, even if you're working at a smaller or bigger company, as you consider large-scale projects of your own.

00:14:30

First: quick ramp-ups. How do you prioritize work? The Backpack project needed work done by every team at the company, which meant that we needed some way for all those teams to agree to do the work. So let's talk about how roadmapping works at New Relic. First, all of our engineering teams are autonomous. They have their own roadmaps, set by the team and their product manager, and those roadmaps are supposed to meet the team's own goals as well as the broader requirements of the business. Those broader requirements come from a group called the Product Council, which meets quarterly, or more often if necessary (hopefully not), and publishes a list of up to five high-priority projects that are going on across the company, in order of priority. Teams are supposed to contribute to those projects, in order, if they can.

00:15:12

Some of those might be something that only one team can effectively work on, like "team A needs to make this thing better." What that means for everybody else is just to make sure that team A can do everything they possibly can: make sure they're not blocked, and if they're blocked and you can unblock them, do it. That's the most important thing you can do. Then there are other projects like Backpack, which was a high-priority cross-cutting project, and then we have other things, like features that we want to make a press splash with, that kind of thing. This process actually does work, but there's a catch: what happens if the project needs resources to move forward and the Product Council just hasn't prioritized it yet? That's what happened with Backpack. There was a month or so of delay past when the project should have been prioritized but wasn't.

00:15:55

Then suddenly it was all systems go. We were prioritized, and almost all teams at the company were knocking on our door trying to figure out what they actually had to do. In retrospect, this sudden swing from no support to near-total support really wasn't good for the project. We suddenly had to transition from operating relatively independently, solving some problems in the IBM infrastructure, doing some strategizing, to, oh wow, there are like 40 teams talking to me right now. We'd had some organized discovery work, helping teams figure out what they had to do, and that was really well organized. But even still, it left the central engineering team scrambling with our sudden success at getting attention. We weren't ready: not with documentation, service discovery, or other core tooling that was critical to the success of the project.

00:16:38

Worst of all, we just didn't have an easily digestible philosophy to help people make decisions. If you're going to embark on something like this, and you want everybody to be able to make their own decisions locally and have them be the right kind of decisions, you have to tell them what they should be thinking about and how they should be trading things off, so that they can prioritize the right things locally. We just didn't have that philosophy available; we'd talked some internally, but it wasn't good enough. So in the future, we're going to start preparing for what happens when we get that high-priority slot, and produce a project philosophy document to help people make those decisions. And we're going to continue prioritizing important work across the company. A very related problem: moving goalposts.

00:17:17

I mentioned that we had this big discovery process and that we were going to specify the work upfront: containerization, moving to our Mesos platform, service discovery. But there was also later work that we kind of knew was coming and just didn't talk about. For example, our initial instructions for teams asked them to make their services ready to receive URLs in the service discovery format, but not to actually use the tooling, because the tooling wasn't ready yet. And we didn't mention that the tooling wasn't ready, because we didn't want them to delay starting the work until it was. This logic made some kind of sense at the time. We thought that changing the last step, using the central tooling, was going to be pretty trivial, just changing a couple of configuration files, but it was still work that we didn't really specify.

00:18:02

We thought this was a good idea for a couple of reasons. First, each team has a hero role that gets passed around, typically via the on-call rotation; they're supposed to handle small requests from other teams. So we thought the smaller stuff to come could be handled by whoever was the team's hero that week. Second, we didn't want to exhaustively list work that couldn't be done yet, especially since we didn't know precisely what the work looked like. We didn't want teams to say, okay, I can't start until you know exactly what I'm supposed to do; we'd rather teams make mistakes and follow up. In retrospect, this was just a mistake, especially not communicating that there would be follow-up work, even if we didn't know its full extent. And the fact that the hero role rotated meant there was a lot of context switching, since the person who did the work initially might not be the person responding to it this week.

00:18:46

That was one kind of moving goalpost, and there was another. We did three test buildouts as part of this, where we tore the environment down and brought it back up again, and everything was changing basically every time. Our goal was to iteratively improve the infrastructure side of the buildouts and figure out the changes we needed to make while it wasn't a production environment. If you're working on an infrastructure project, it's great to be able to say, oh, that's not actually production; I can make a cross-cutting decision and just have it come into effect, without having to manage a really slow rollout.

00:19:19

But the reality of the buildouts was that teams would do the work (we encouraged them to do it in the US first, because we were making the same improvements there), then they'd wait a few days or weeks, depending, because they couldn't test it in the EU yet. Then the Backpack team would try to deploy their work, and we'd realize something was broken. We'd go to them, and suddenly everybody's blocking the project and it's a mess. In our minds, the goalposts weren't really moving here, because the goal was always "your software works in the EU." But that's a really ambiguous statement, and given that things on the ground were changing, it's not surprising that things weren't always working correctly. In the future, the most important thing we could do to fix this is to use a steel thread approach: validating the design using a sub-project that tests it thoroughly.

00:20:09

For example, if we could find a slice of our system that runs top to bottom, from some product down to data storage, with authentication and everything we need, but that covers, say, 20% of the services at the company, we could have tested things with that 20% instead of making the other 80% come along for the ride and then realizing that we'd screwed something up. Now, this might actually have been the wrong decision if we'd gone this way, because it could have lengthened the wall-clock time of the project. If we'd said, okay, we're going to spend three months with the 20% and then six months with the 80%, maybe those 80% of teams wouldn't have been done in time; maybe we did get a benefit by starting everybody at the same time. But there's probably a better balance to strike than what we actually did. So in the future, we're going to start having that steel thread test case, and be more honest and transparent. We're going to stop hiding work, even if only by acknowledging that some unknown future work exists. And that's because we're going to continue to avoid complete waterfall planning; agility is really important. We have to be able to react to changing conditions and changing realizations. We don't know everything upfront.

00:21:16

Next: communication is hard. I know this is surprising to everyone in the room. New Relic has a strong culture of internal blogging as a means of broadcasting ideas and posting updates on projects; if you want to influence the company, this is the way you do it. Every step of the way, the Backpack team wrote documents about the ideas behind the project, the changes we needed made in software, plans for buildouts, and what we did each week. But the trouble with an internal blogging culture is that everyone's doing this, so there's a ton to read and a ton to discuss, and you as an individual basically have no way of knowing exactly what you should be reading. You can read the things you think you need to read, but you don't know what you should have been reading all that time.

00:21:55

And this is true across the board; it's really hard to keep up on everything. We had town hall events to broadcast updates, but people would be on vacation and miss them. We tried having a checklist application, but it wasn't looked at, and when it was, people would have to notice when a change was made to it. We tried some automated linting so that we could sniff out problems in code before teams ran into them in production, but that wasn't used very well because it was kind of out of band. And emails, people just don't read them. When I say blog posts don't get read and emails don't get read, I don't mean nobody is reading them; it's just that you can't count on it.

00:22:33

You can't count on an individual having read any individual thing. So the most important thing we could have done was have some kind of centralized documentation. This is something we realized later in the project and implemented, and it probably would have helped teams and individuals catch up on what they missed without having to hunt through the blog post history or watch recorded events, which is something no one is going to do. So we'll have some kind of centralized documentation, and we're going to have a human-readable changelog of requirements; Git commits are not good enough. Nobody's going to look at those, and if whitespace changes show up in them, people are probably going to tune the whole thing out. We're going to have some kind of better linting. But we're going to continue to blog internally, because it's really useful for us, and we're going to continue to communicate using as many channels as possible, because we know that one is not enough.

00:23:20

Next: local maximums. I mentioned that we'd gone through all these phases of incrementally improving our infrastructure, but we didn't really have standardization at any point along the way. One consequence of that was that teams were heavily incentivized to make the system better locally. What I mean by that: let's perform a thought experiment and travel back, say, three years in time. You're a team at New Relic, you own 20 services, you deploy them all frequently, but the tooling's not that good. You kind of want to be able to deploy all 20 at once, or say, okay, when I deploy this one, these other ones have to be deployed too, or you want to move from staging to production automatically. So rather than just twiddle your thumbs, you build that, and this is great. You get three years of productivity benefits, you're able to move faster; it was the right decision for you.

00:24:06

Maybe you throw in a couple of other features too, like shared service discovery. Three years later, you've had that benefit, but you don't have the standardized platform that was built in the meantime. And now someone comes along and says, cool, if you want to get on our European region project (which, by the way, you have to, because we need to ship this thing), you need to adopt the standardized tooling. So then you have a bit of a problem. You have some transition pain, but there's a promise of, oh, that standardized tooling is great: we have a build and deploy tool, there's a team working on it, they have your interests in mind, this is going to be better for you. But the reality is a little different. That standardized tooling is going to be worse for some teams.

00:24:48

In some ways it's not going to quite match what they want to see, and the benefit is for the company as a whole, not for them. So there's this future tooling benefit that everybody wants, but we need everybody to move to it now, yesterday, not when it checks all the boxes. This really isn't a good way to make friends, basically forcing all of your engineering teams to simultaneously adopt shared tooling, but it's kind of critical. I don't have a great answer here, other than just having more empathy for teams stuck in this situation. Communicating well in advance could help too: if the tools team had known that there were a couple of gaps affecting 20 teams that they could have closed to make this easier, that might have helped. And frankly, we just need to stop making assumptions in general about how teams or individuals will react.

00:25:33

I said this team built out this tooling and they've had it for three years, and you might assume they're going to be really reluctant to give it up, because they built it and it's working so well for them. In reality, they may just hate it. The team may have completely cycled over that time, and they may just be stuck with some legacy stuff they don't like; maybe they just want to get rid of it. So stopping those assumptions is critical, and we're going to continue to make standard tooling better. The best decision we made was leaning on what we already had. We had our in-flight projects: our Container Fabric, which is our Mesos platform; our Grand Central build and deploy system, which I mentioned or had on a slide earlier; and our containerized database platform.

00:26:18

All of these were crucial in making this project possible. By saying, okay, everyone move to this, we were able to get a lot of bang for our buck. There was a huge uptick in adoption as part of this, which wasn't necessarily good for the teams supporting those central things. On the other hand, part of our design goal here was to make it so that platform teams are the ones bearing the brunt of the work, rather than spreading it across a bunch of teams that aren't actually equipped to do that work. So in the future, we're going to start making clear which priorities are highest for infrastructure teams, so they know what's coming, and we're going to look for high-leverage work that a small number of teams can do, because it's really, really useful to have platform teams that are able to contribute to large projects like this.

00:26:57

And finally, the last lesson learned: the importance of a pilot phase. Our original plan was discovery, fan out, test ourselves, and release. We realized that there were just way too many unknown unknowns in this environment to live up to our own expectations for reliability. So we changed it: we delayed GA significantly and opted instead to run a pilot phase for a very limited number of customers. That allowed our teams to be on the hook for reliability without the same consequences when things inevitably went wrong, whether in the underlying cloud or in our software. We knew things were going to go wrong, and we wanted to learn how to deal with that before we were live in production for everybody. This was absolutely the right decision. So in the future, honestly, we just have to stop magical thinking.

00:27:36

I have no idea how we thought the original plan was okay in the first place. To sum up, where are we now? First off, the project did work. It was painful at times, but we do have an EU region; if you're interested in it, contact us and we can hook you up. Our disaster recovery exercise has seen a ton of benefits: we used an order of magnitude fewer engineering hours this year to do it. We've had less busywork in general service operations, a lot of the boilerplate configuration is gone, and we've laid the groundwork for future improvements, with 95% of our services in our Container Fabric. There's a meta benefit as well: we're an observability company serving the modern market, and the more we are on the bleeding edge of things, the more we're able to see the gaps in our own product and cover them before our customers encounter them. That's a really nice thing to have. And we've learned a lot; we now know how to run large-scale projects better. So in the future, we're going to continue just a little bit of magical thinking, trying bold things like projects like this. Thank you to all of you for listening, and to all the hundreds of people who worked on this project. Thank you.