Las Vegas 2018

Progressive Deployment, Experimentation, Multitenancy, No Downtime, Cloud Security, Oh My!

This experience report is about rearchitecting a monolith for cloud-native practices. We cover moving stepwise from single tenancy to multitenancy, scaling up to scaling out, fixed resources to optimized variable costs, periodic upgrades to zero-downtime updates, a single backlog to continual experimentation, linear to progressive deployment with a controlled blast radius, long release cycles to continual testing, opacity to observability, and pre-release security reports to continuous security practices.


Dylan Smith was a Microsoft MVP (ALM) and DevOps consultant for many years before joining Microsoft to lead the DevOps Customer Advisory Team. Now he works with Microsoft’s largest customers to help them accelerate their DevOps journey.


Sam Guckenheimer is the Product Owner for Microsoft Visual Studio Team Services and Team Foundation Server. In this capacity, he acts as the chief customer advocate, responsible for the strategy of the next releases of these products, focusing on DevOps, Agile, and Application Lifecycle Management.


Sam edits the website DevOps at Microsoft. He is a regular speaker and has keynoted at many conferences, including DevOps Enterprise Summit and Agile. He is the author of four books, most recently Journey to Cloud Cadence and Visual Studio Team Foundation Server 2012: Adopting Agile Software Practices: From Backlog to Continuous Feedback. Prior to joining Microsoft in 2003, Sam was Director of Product Line Strategy at Rational Software Corporation, now the Rational Division of IBM. Sam lives in the Seattle area with his wife and three children in a sustainable house they built that has been described in articles in Metropolitan Home and Houzz.

DS

Dylan Smith

DevOps Architect, Microsoft

SG

Sam Guckenheimer

Product Owner, Visual Studio Team Services, Microsoft

Transcript

00:00:05

We both work on something that's now called Azure DevOps. You may know it from its predecessor, Visual Studio Team Services, that I've talked about, or its on-premises sibling, Team Foundation Server. This is also the basis for Microsoft's One Engineering System. To give you an idea of the scale at which we are doing DevOps, we rolled up some stats from our SaaS: we're doing about 78,000 deployments a day. Not on the slide is that we have 94,000 active engineers internally who are our customers, in addition to millions of folks like you externally. The State of DevOps Report, which I hope you've all read, and Nicole and Jez will be here talking about it, I think tomorrow, talks about how, in order to become a high performer or elite performer, you need to go a lot faster. And it emphasizes all of these practices that high performers follow with regard to speed of delivery, time to recover, lower change failure rate, and so forth. And the question that the report doesn't really speak to, that I get in customer conversations all the time, is: well, we've got an existing business. What do we do about our existing code? Should we throw it away, go cloud native, and start over? And how do we do that and keep the business going forward? So this is a story about that. Dylan,

00:02:09

Sam.

00:02:11

All right. So what I wanna talk to you about is: what did that journey, that story, look like? We used to have, or we still have, TFS on-premises. Sometime around 2010, we started building what is now called Azure DevOps; I think the first preview was 2011 or 2012. So we've been on this journey for, I guess, seven or eight years now, and we've told various aspects of our story at previous conferences like this. Today I wanna focus on a specific aspect of that journey. The first question that we faced when we decided to build this cloud-hosted, software-as-a-service version of TFS was: we have this existing code, this existing architecture. Do we re-architect it for the cloud, whatever that means, and then move it? Or do we just move it as is and deal with the problems as we run into them?

00:03:07

And we chose the latter. We just moved it as is and went from there. What I wanna talk to you about is: what did that look like? What were the problems that we faced, specifically? And where have we come in the last five or six years on this journey? So when we started, we had TFS, which we shipped every two years or so as an on-premises server product. Some context: architecturally, TFS is basically a SQL Server database that has all the data, plus application tiers and job agents, which are ASP.NET web applications hosted in IIS. It's not multi-tenant, but we did have the concept of a collection in TFS. A collection is a collection of team projects, and each collection got its own database; that'll be important. And we could load-balance the application tiers and job agents. So when we decided to move this to the cloud, we basically took what we had, with almost no changes, and threw it up into Azure. Specifically, it used web roles and worker roles, which are an ancient Azure technology: basically a bunch of VMs, plus Azure SQL databases. We essentially installed TFS up in Azure and made it available to our customers. The only tweak we had to make was how we do identities. Everything else was almost identical to what we had,

00:04:31

And we ran into some pretty significant problems almost immediately. The first problem we ran into is that every time a customer signed up, they got a new collection inside of our software. Each collection meant a new SQL Server database, and we very quickly had something like 11,000-plus databases in the cloud. And I don't know if it was our software or the SQL software, but that just fell over dead. It was never designed to handle that many databases. So the first thing we needed to do was multi-tenancy: we needed to have multiple customers in one database, for several reasons, but mostly because having 11,000 databases just wasn't sustainable.

00:05:14

And our approach to that was a typical multi-tenant implementation. We added a customer ID column to every row in every table and changed every query to filter by customer ID. We had a clever trick that we used to do automated testing to make sure we didn't miss any queries, so we weren't leaking customer data. But that's the approach we took: now, instead of 11,000 databases, we have one, or at least a small number of, databases. The next problem we faced was that, now that we have one big database, it turns out SQL Azure back in the day couldn't handle a giant database with multiple terabytes; I think at the time the limit was 500 gigs. But even if it could handle it, there was our cost. We get an Azure bill just like everybody else, and our Azure bill was big, especially the database part of it. So the second change we made, to optimize our COGS, our cost of goods sold, our cost of running the service, was to move as much of that data as we could out of SQL Server and into much cheaper blob storage. I believe nowadays we still have something like 60 or 70 terabytes in SQL Server, but that's just the metadata. All the customer data, the source code files, work item attachments, build outputs, all that other stuff is in blob storage.
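To make the multi-tenancy idea concrete, here is a minimal sketch of that kind of per-tenant filtering, assuming a hypothetical work_items table and partition column; it is illustrative only, not the actual Azure DevOps data layer.

```python
import sqlite3

# Illustrative schema: every table carries a tenant (customer/collection) id,
# and every query must filter on it so one tenant never sees another's rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE work_items (partition_id INTEGER, id INTEGER, title TEXT)")
conn.execute("INSERT INTO work_items VALUES (1, 100, 'Fix login bug'), (2, 100, 'Other tenant item')")

def get_work_items(conn, partition_id):
    # The tenant id is a mandatory parameter; a query that forgets this WHERE
    # clause is exactly the data-leak bug the automated query checks guard against.
    return conn.execute(
        "SELECT id, title FROM work_items WHERE partition_id = ?",
        (partition_id,),
    ).fetchall()

print(get_work_items(conn, 1))  # only tenant 1's rows come back
```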

00:06:33

And the other major problem was that with TFS on-premises, and this is still true today, when you upgrade it, you need downtime. We go and make changes to the database schema, and that requires downtime. When we moved this up into the cloud, for the first, I wanna say, nine months, every time we wanted to upgrade our cloud service, we scheduled a maintenance window and took it down for the world. Today, in 2018, that obviously wouldn't be acceptable, but that's what it was: the first, I dunno, four or five major updates to Azure DevOps required downtime. Our approach to that was that we basically had to come up with a system to do database updates specifically, because there are a lot of stored procedures and database logic in Azure DevOps, without downtime. And we have a system of PowerShell scripts that allows us to do that.

00:07:24

The key implementation detail is that every time we implement a feature that has code changes and database changes, those code changes need to work with both the old database version and the new database version. That's the key to how we do no-downtime deployments: we can roll out the code change first, which works with the old database schema, and then use our PowerShell script framework to make that database change online, in place. And that works great. The big downside, or cost, is that every feature we implement from now until forever is more expensive. For every single feature we have to consider: how do we make it work with the old database version and the new database version? And that's a cost. Actually, I don't think of it as a cost; I think of it as a tax. It's a tax on all feature development forever. I don't see a way to avoid it; we've just accepted that we're gonna pay that tax. But any time I see a change that is a tax instead of a one-time cost, I need to think very long and hard about it.
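A hedged sketch of that "works with both schemas" pattern is below. It uses an expand-then-migrate flow with made-up table and column names; the real system does this with stored procedures and PowerShell scripting, not Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE builds (id INTEGER PRIMARY KEY, status INTEGER)")
conn.execute("INSERT INTO builds (status) VALUES (0), (1)")

def schema_has_column(conn, table, column):
    return any(row[1] == column for row in conn.execute(f"PRAGMA table_info({table})"))

# Step 1 (expand): an additive, backward-compatible schema change.
# Old code that never reads status_text keeps working against the new schema.
conn.execute("ALTER TABLE builds ADD COLUMN status_text TEXT")

# Step 2: new application code tolerates BOTH schemas, so it can roll out
# before or after the migration has run on any given scale unit.
def read_status(conn, build_id):
    if schema_has_column(conn, "builds", "status_text"):
        row = conn.execute(
            "SELECT COALESCE(status_text, status) FROM builds WHERE id = ?", (build_id,)
        ).fetchone()
    else:
        row = conn.execute("SELECT status FROM builds WHERE id = ?", (build_id,)).fetchone()
    return row[0]

# Step 3: migrate the data online, in place (in batches, in the real system).
# Only much later, once no deployed code reads it, would the old column go away.
conn.execute("UPDATE builds SET status_text = CASE status WHEN 0 THEN 'queued' ELSE 'done' END")
print(read_status(conn, 1))
```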

00:08:36

Another thing that we make extensive use of is feature flags. Pretty much every feature that we develop is hidden behind a feature flag for some period of time, until we're ready to turn it on for the world. And I'm sure you've all heard about feature flags and how great they are, so I'm gonna tell a story about where they bit us.

00:08:56

So back in 2013, there was a Microsoft conference called Connect, and these are the conferences where, during the keynote, we usually announce big new features. And we had some big new features for, what was it called at the time, VSTS or VSO, whatever it was called at the time. We had deployed them to production in the weeks leading up to the event, hidden behind a feature flag. And our plan was that an hour or so before the keynote, we'd flip that flag on for the world, and then we could show off these great new features in the keynote. Well, that didn't go so well. We flipped that flag an hour, maybe two hours, before the keynote. And this is a chart from one of our blog posts, the root cause analysis report of the chaos that ensued.

00:09:40

We flipped it on for the world, and the feature started seeing load like it had never seen before. We tested it, obviously, but there's no place like production. The service started going up and down; it was down during the keynote. It didn't feel good. So we learned a few things from that experience. Number one: probably don't flip on major feature flags an hour before a major keynote. So if we ever have any Microsoft conferences with keynotes where we're announcing new features, go check out our service the night before that keynote and you'll probably see some stuff that we haven't announced yet. So that was one lesson. But the bigger lesson was that at this point in time, I think it was November 2013, we had one instance of our service in the cloud, which means if we break it, if it goes down, we break everybody. The blast radius is global. So our bigger learning was that we needed to do something about that: we needed to slice our service up so we can limit the blast radius when we inevitably break stuff.

00:10:50

And the way we did that (hopefully the animations work here) is we split it up into things that we call scale units: effectively independent instances of our software in the cloud. Nowadays we have many dozens of scale units. And this helped us in a few ways. Number one, it helps limit the blast radius: if we break something, hopefully it's limited to just that scale unit. That's not always true, but often it is. But it brought us a couple of other really important benefits.

00:11:30

We have these dozens of scale units, and we group them into rings. We have six rings, ring zero through ring five, and this is how we do progressive deployment. When we roll out an update, and we publish our release notes every three weeks, we go to ring zero first, we wait a little bit, then ring one, wait a little bit; it takes about a week to go through all the rings. And we designed these rings deliberately. Ring zero is our internal accounts, so if we break something, we will likely break it in ring zero first; it'll take us down and we'll fix it before it hits our customers. The first few rings after that are specifically designed too. The next ring, ring one, has specifically targeted accounts that use features in our product that we don't use internally. For example, we have test plans for manual test management.

00:12:18

We don't really use that internally on our team, so that feature doesn't really get tested in ring zero. So ring one is customers that use the breadth of the features that we may not exercise much in ring zero. And then, I can't remember exactly what each ring is, but one of them is a non-US geography and one of them is very large customers. And then eventually ring four and ring five are everybody. When we do our three-week deployments, we deploy to a ring, we wait 24 hours, we deploy to the next ring. When we do daily hotfixes, we wait an hour between rings. And we're basically waiting to see if any alarm bells go off. The third thing that scale units allow us to do is put an instance of our software in different geographies around the world. We have customers all over the world, even 11 customers in Antarctica, so that statement is true. So now we can put the scale units in different regions, and when you sign up for an account, your account lives in one specific scale unit.
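As a rough sketch of that ring progression, here is a minimal orchestration loop with made-up ring contents, hypothetical deploy_to and health_alarms_firing helpers, and shortened bake times; the real orchestration lives in the release pipelines shown in the demo that follows.

```python
import time

# Illustrative ring layout: ring 0 is internal accounts, later rings add targeted
# external accounts, other geographies, very large customers, and finally everyone.
RINGS = [
    {"name": "ring0-internal", "scale_units": ["su00"]},
    {"name": "ring1-targeted", "scale_units": ["su01"]},
    {"name": "ring2-non-us", "scale_units": ["su02", "su03"]},
    {"name": "ring3-large", "scale_units": ["su04"]},
    {"name": "ring4+5-everyone", "scale_units": ["su05", "su06", "su07"]},
]

def deploy_to(scale_unit, version):        # hypothetical: roll the bits to one scale unit
    print(f"deploying {version} to {scale_unit}")

def health_alarms_firing(scale_units):     # hypothetical: query monitoring/telemetry
    return False

def progressive_rollout(version, bake_seconds):
    """Deploy ring by ring, baking between rings and halting if alarms fire."""
    for ring in RINGS:
        for su in ring["scale_units"]:
            deploy_to(su, version)
        time.sleep(bake_seconds)           # ~24h for sprint deployments, ~1h for hotfixes
        if health_alarms_firing(ring["scale_units"]):
            raise RuntimeError(f"halting rollout: alarms firing in {ring['name']}")

progressive_rollout("sprint-143", bake_seconds=1)
```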

00:13:23

Let me show you real quick what this looks like. I'm gonna show two things. This is Azure DevOps, and this is our release screen, where we release Azure DevOps. If I bring one of these up... so this is an active release, I think it started yesterday, and I'm gonna have to zoom out my browser here to make it all fit on the screen. These individual boxes are effectively our scale units, and the columns represent roughly the rings. I can see here that ring zero is up top, ring one is done, ring two is done, and it's in the process of deploying to ring three. So every release goes through our rings; we've modeled the scale units and rings inside of our release tooling. While I'm in here, I'm gonna show you one other thing. So I said we use feature flags extensively,

00:14:41

When you hear us talk about features being in private preview or public preview, this is how we surface that to our customers. If you've used our tool, you may have seen this: we can go in here and see preview features.

00:14:55

I can turn various preview features on and off, just for me or for my whole account. And we have our own little mini workflow that some of these preview features go through. Not every feature flag is exposed through this screen; it's just the big, major, potentially disruptive ones. But by exposing this directly to our customers, it allows us to do some pretty useful stuff. When we have a major feature, we will release it and it'll be off for everybody. At some point we'll flip it to on for everybody and allow them to turn it off. And then at some point we'll just get rid of the flag. That first stage, where the flag is there but it's turned off for everybody by default, allows us to release a feature when it's not done.

00:15:36

So oftentimes some of these big features will be half done, but we'll release them in public preview, hidden behind one of these flags. And that allows us to iterate on the feature in the open, getting feedback, not having to wait until the feature is done to put it in the hands of our users. So we release features behind those flags very early. Then at some point we feel that we're done, or done enough, and we'll flip it to on for everybody. We'll monitor our telemetry to see if people are manually flipping it back off and try to figure out why. Once we feel comfortable that people aren't turning it off anymore, we'll just remove the flag.
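Here is a sketch of how a per-account preview flag and that lifecycle might be modeled; the flag name, stages, and opt-out measure are invented for illustration and are not the actual Azure DevOps feature flag service.

```python
from enum import Enum

class FlagStage(Enum):
    OFF_BY_DEFAULT = 1   # feature shipped but dark; early adopters can opt in
    ON_BY_DEFAULT = 2    # on for everyone, but users/accounts can still opt out
    RETIRED = 3          # flag removed; the code path is simply the product now

class FeatureFlag:
    def __init__(self, name, stage=FlagStage.OFF_BY_DEFAULT):
        self.name = name
        self.stage = stage
        self.overrides = {}              # per-user or per-account opt in/out choices

    def is_enabled(self, account_id):
        if self.stage is FlagStage.RETIRED:
            return True
        default = self.stage is FlagStage.ON_BY_DEFAULT
        return self.overrides.get(account_id, default)

    def opt_out_rate(self):
        """Telemetry proxy: if many accounts flip the flag back off after it
        defaults to on, investigate why before removing the flag."""
        if not self.overrides:
            return 0.0
        return sum(1 for v in self.overrides.values() if not v) / len(self.overrides)

new_navigation = FeatureFlag("new-navigation")
new_navigation.overrides["contoso"] = True      # preview opt-in for one account
print(new_navigation.is_enabled("contoso"))     # True
print(new_navigation.is_enabled("fabrikam"))    # False until the default flips
```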

00:16:28

All right.

00:16:33

And then there's one other really important aspect of our journey that we touch on regardless of which part of the story we're telling. And that is: now that we're in this cloud-service, cloud-native world, we're releasing every three weeks instead of every two years. We're moving much, much faster than we ever have before. And at some point in this journey, quality became a really big focus for us. We're moving very fast, which means we're potentially breaking things very fast, and we needed to have a really good handle on that. So we made a few changes

00:17:15

In order to wrap our arms around the quality of our product. The biggest change we made, in my opinion, is that we combined the developer role and the tester role into one role: combined engineering. And that really changed the culture. It wasn't one person responsible for features and another team responsible for the quality; every engineer is responsible for their feature, the quality of their feature, and now also the health of their feature once it's in production. The other thing we did is change our approach to testing, specifically how we implemented our tests. It used to be that pretty much all of our tests were end-to-end integration tests, whatever you wanna call them, where the app needs to be deployed and it runs an end-to-end test. And we had tens of thousands of them. They were slow, they were brittle, they often failed, and we weren't sure if it was really a bug or just a problem with the tests. At its peak, I believe it took 22 hours to do a test run, so we did it about once a day.

00:18:13

And we ran those tests every day for eight years, and never once did we have a completely passing test run. I'm told we came close one Christmas, when people were on holiday. So we knew that needed to change. And the change that we made is that we basically just adopted good unit testing practices. We came up with our own hierarchy of tests, and because we like to invent names for things at Microsoft, we call them L0, L1, and L2. L0 and L1 are like traditional unit tests; L2, and there is actually an L3 as well, are the more end-to-end scenario tests. And our philosophy was: if you can test it with an L0, don't use an L2. The vast majority of our tests should be these L0s, these unit tests that are fast and not brittle.
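A small illustration of that distinction is below; the function under test and the deployed endpoint are hypothetical. The point is that an L0 runs in-process with no deployment at all, while an L2/L3-style test needs a deployed service, which is exactly why the guidance is to prefer the L0 whenever it can cover the behavior.

```python
import unittest

def classify_work_item(title: str) -> str:
    """Toy logic under test; stands in for real product code."""
    return "bug" if "error" in title.lower() else "task"

class L0ClassifyTests(unittest.TestCase):
    # L0: pure in-memory unit test. No deployment, no network, milliseconds to run,
    # so tens of thousands of these can gate every pull request.
    def test_error_titles_become_bugs(self):
        self.assertEqual(classify_work_item("Error saving file"), "bug")

    def test_other_titles_become_tasks(self):
        self.assertEqual(classify_work_item("Update docs"), "task")

class L2EndToEndTests(unittest.TestCase):
    # L2/L3: exercises a deployed instance over the wire. Slower and more brittle,
    # so it is reserved for scenarios an L0 genuinely cannot cover.
    def test_create_work_item_against_deployed_service(self):
        # Hypothetical deployed test environment; skipped here because none exists.
        self.skipTest("requires a deployed test environment")

if __name__ == "__main__":
    unittest.main()
```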

00:19:00

So that's what we did. The chart at the bottom of the slide represents the main scorecard that we used. Making that shift, from a 22-hour test run of something like 50,000 end-to-end tests to what we have now, took about two and a half years. In that chart, every column is a three-week sprint, so the whole chart represents about two and a half years. What you're seeing is that the orange bar is the old-style tests, the big bar over on the right is the L0 tests, and the smaller blue bars are the L1s and L2s. So it was slow, it took years, but we just had to start, and we scorecarded the crap out of it. We had other scorecards that showed how many of those orange tests each team owned and whether they were slowly working that number down and shifting the mix. And over two and a half years, we got there.

00:19:56

So let me show you what that looks like. All right.

00:20:08

So this is our pull requests view, and it's live: everything on the screen is pull requests completed in the last 20 minutes or so, so I hope there's nothing secret in those comments. And if I pick one of these recent ones, let's say

00:20:28

This one.

00:20:36

So we have various builds and tests that run on our pull requests; I'll say a brief blurb about them in a second. In addition to shifting our test mix to these fast, reliable L0 tests, we also needed to shift left. Now that a run is not 22 hours but more like 22 minutes, we want to run those tests earlier in the process, before the code makes it into our master branch. In our case, with Git, that means we want to run them on every pull request. So if I look at any of these pull requests and I go to Tests, I'm gonna see somewhere around 85,000 tests; every time I look at it, it's a few hundred more. So, 84,000 tests. These are all of our L0 and L1 tests, and they take about 18 minutes. Every single code change has to pass all 85,000 tests, or it doesn't get in. That's our L0 and L1 tests, and the vast majority of our tests fall into that bucket. But we do have what we call L2s and L3s.
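As a rough sketch of that pull request gate, here is a stand-in script with invented suite paths; the real gate is enforced by branch-policy builds in Azure DevOps, running the L0/L1 suites in parallel, rather than by a script like this.

```python
import subprocess
import sys

# Suites that must pass on every pull request before the merge to master is allowed.
# The paths and the pytest invocation are illustrative stand-ins.
PR_GATE_SUITES = ["tests/l0", "tests/l1"]

def run_suite(path: str) -> bool:
    result = subprocess.run([sys.executable, "-m", "pytest", "-q", path])
    return result.returncode == 0

def pr_gate() -> int:
    for suite in PR_GATE_SUITES:
        if not run_suite(suite):
            print(f"{suite} failed: pull request cannot be completed")
            return 1
    print("all gate suites passed: pull request may merge")
    return 0

if __name__ == "__main__":
    sys.exit(pr_gate())
```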

00:21:46

And if I look at another dashboard here... all right, so this crazy colorful chart. What it's showing is that once the code gets into master, we kick off some other builds, and we run some of our slower tests there. Every column is one of the CI builds: basically, once a pull request is merged, it kicks off one of these CI builds. The blue stuff means it's still running, so over on the right are the most recent builds, and each of the rows is a different suite of our slower L2 and L3 tests. So you see every build runs through ten or so suites of the slower tests. Most of them are green; I see a bit of red sprinkled down there. But that's our approach to testing and to trying to keep quality high in the product.

00:22:38

One last comment, something I've been thinking about a lot lately: if you look at some of our root cause analyses (we publish a root cause analysis when we have major livesite incidents; you may know that a month or so ago we had one where the South Central US data center got hit by lightning and bad stuff happened), what I notice is that the failure scenarios are getting increasingly complex. We've gone to microservices. I didn't really talk about that, but over the course of those six years we carved up our monolith into, I think, 31 separate services now. And there are lots of benefits to that: our teams go faster, with higher velocity, and for the teams on the microservices, versus the monolith, life is much better. But I feel like there's a downside, which is that the complexity has exploded. If we look at the causes of some of our livesite incidents, it's these really niche, intricate, very specific failure cases, and there are just so many of them now with the increased complexity of microservices. So that's a challenge we're still struggling with today. I think one of our approaches to help solve it is to really get into chaos engineering, or what we call fault injection testing, which we do a little bit of today, but probably not as much as we should. So that's probably where our journey will lead us next.
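A minimal sketch of the fault injection idea, assuming a hypothetical call between two services; real chaos experiments run against production-like environments with guardrails, not a toy wrapper like this.

```python
import random

class FaultInjector:
    """Wraps a cross-service call and randomly injects the failures the caller
    is expected to tolerate: here, a timeout from a downstream dependency."""
    def __init__(self, failure_rate=0.5, seed=42):
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)

    def call(self, fn, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            raise TimeoutError("injected fault: downstream service unavailable")
        return fn(*args, **kwargs)

def get_identity(user_id):                   # hypothetical downstream microservice call
    return {"id": user_id, "name": "example"}

def get_work_items_for_user(user_id, injector):
    try:
        identity = injector.call(get_identity, user_id)
    except TimeoutError:
        # The experiment verifies this degraded path actually works:
        # fall back to a cached identity instead of failing the whole request.
        identity = {"id": user_id, "name": "(cached)"}
    return f"work items for {identity['name']}"

injector = FaultInjector(failure_rate=0.5)
for _ in range(4):
    print(get_work_items_for_user(7, injector))
```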

00:23:58

Sam. Yeah,

00:23:59

Thanks, Dylan. So Dylan has shown us how we try to go faster without breaking things: how we use feature flags to control exposure to whom; how we use ring deployment to control exposure to where, going from the data center with the smallest user count, to the largest user count, to the highest latency; and then how we modified testing so that we can test at the earliest level possible, shifting left as far as we can. And by the way, there's also monitoring, which helps us shift right. This has been a significant change in engineering process.

00:24:51

If you remember the move to Agile, we got this idea of a definition of done. The definition of done that we believe in, in DevOps, is that you have delivered code with tests and telemetry, and the telemetry that goes with your code will substantiate or diminish the hypothesis that motivated that deployment. In other words, you're not done until you can prove in production that you're getting the results you wanted. Whether those results are higher customer engagement, faster performance, lower abandonment, any of the things you might want to achieve, you need to measure, and you need to be sure. And if you're not getting those results, you pivot. That's a real change. As Dylan pointed out, we didn't have any telemetry when we were on-prem, and we had to go through this process of introducing extensive trace points and then using a big data pipeline so that we can gather everything that happens.
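A hedged sketch of that "done means the telemetry confirms the hypothesis" idea; the metric, numbers, and threshold are invented, and a real experiment would use proper statistics over the telemetry pipeline rather than a simple ratio.

```python
def evaluate_hypothesis(baseline_engagement: float,
                        exposed_engagement: float,
                        expected_lift: float = 0.05) -> str:
    """Compare a metric for users exposed to the new feature against a baseline.
    The hypothesis behind the deployment: 'this feature raises engagement by >= 5%'."""
    if baseline_engagement <= 0:
        return "inconclusive: no baseline signal"
    lift = (exposed_engagement - baseline_engagement) / baseline_engagement
    if lift >= expected_lift:
        return f"hypothesis substantiated (lift {lift:.1%}): the feature is done"
    return f"hypothesis diminished (lift {lift:.1%}): keep iterating or pivot"

# Illustrative numbers, e.g. weekly engaged users of the feature's surface area.
print(evaluate_hypothesis(baseline_engagement=0.40, exposed_engagement=0.43))
print(evaluate_hypothesis(baseline_engagement=0.40, exposed_engagement=0.41))
```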

00:26:09

We gather eight terabytes a day of telemetry, I believe,

00:26:12

Yeah, and across Azure services it's five petabytes daily, using what you would know as Azure Log Analytics and Azure Monitor. Now, one of the other things that comes out of this experience is what the State of DevOps Report highlights as the idea of a J curve. We got better; we introduced feature flags and hit a bump; we introduced progressive exposure and hit a bump; and then we said, oh my God, we cannot go fast enough, because our tests make us too slow. We had all of this technical debt in these long-running tests. Dylan talked about the 22-hour, quote, nightly automation run; there was actually a full automation run that was more than twice as long, and they were inconclusive. They always ran red, and someone needed to investigate. So we needed to go through that valley of darkness and refactor away that technical debt in order to get back to a place where we could actually go fast, with high quality and high reliability for customers.

00:27:34

So the lessons learned that we'll leave you with: in our case, we said we are not going to throw away the existing business. We are going to refactor. We are, in fact, going to make the code base work both for the continuing on-prem business and for the SaaS in the cloud. And we did that incrementally. It starts with a single sprint; we're now in Sprint 143, and we're still doing it. We then needed to figure out how we could get to safe deployment by controlling exposure: not inflicting changes on everyone at once, but a little bit at a time. Folks who were in the preview based on feature flags, then people who were in the canary ring, then people who were in the later rings. And we do that for every change.

00:28:35

That also allows us, with feature flags, to do continuous experiments as part of our continuous improvement journey; without the experiments and the measurement of the results, we just wouldn't know. It lets us do trunk-based development, where everyone is committing to one master branch, and it's safe because the testing is done in the pull request before the commit to master. And we get there because we have enforced the idea that green is green and red is red: testing needs to be a reliable signal, which means, culturally, that you're responsible for the tests with your code and you're responsible for the telemetry. It's not done until telemetry says it's done. That's the story of eight years. The change takes time. We're maybe halfway there. We're going to keep going, we're going to keep improving, but we wouldn't be halfway if we hadn't started. That's what we have time for. There's a meet-the-authors session this afternoon during the networking time, I think at 3:15; we'll be up there to chat further and answer questions, and we'll have our laptops if you want to drill into any demos. We look forward to seeing you there, and look forward to seeing you at our booth. Thank you very much for joining us.