Las Vegas 2018

Evolving Windows: This Journey to DevOps

An overview of the journey Windows has been on to transform its process, tools, and culture to a DevOps model.


Catherine Kamerling is a Principal Program Manager in the Windows Engineering Systems team, and manages the Windows Engineering work management team. Her team oversees the largest VSTS account in the world and uses DevOps to listen, experiment, and engage with Windows developers to improve their productivity.


Catherine Kamerling

Principal Program Manager, Windows Engineering Systems, Microsoft

Transcript

00:00:04

I'm Sam Guckenheimer from Azure DevOps.

00:00:07

I'm Catherine Kamerling from the Windows Engineering Systems team.

00:00:11

Did any of you see my talk on Monday with Dylan Smith? Great. Okay, so about half the room. That was one experience report about moving to SaaS, and we showed you how we worked with that. Gene's been asking for a long time, "Can we hear about Windows? I wanna hear about Windows. If Windows could move to DevOps, then anyone can." So I brought someone who knows the story, and you can hear about doing this at a huge scale. Catherine.

00:00:54

Great. So thank you everyone for coming and taking the time to listen to our story. I thought we would start by going a little bit in the way-back time machine, back to Microsoft in 2007. In January of that year, Vista was released, a much-needed release. Remember, XP was in 2001. So, after fits and starts, we had eventually released Vista to the marketplace with much bated breath. I was a brand new PM joining the company, and my job at the time was to go around to all of the areas around the world and have the privilege of telling the country managers that they were not going to get their bonuses that year for the deployment of Vista. And the reason why: I don't know if you remember any of the predictions, but the predictions were that the deployment of Vista was going to be around 8 to 10% and was going to re-energize the PC unit shipment base.

00:01:58

And a lot of the country managers had talked to their employees, who had talked to all of their sales team, and felt that this number was on par. One of the challenges that we'll talk about is that we didn't really have good telemetry at that time. And so my job was to go around the world and, through a survey process, scan people's machines to understand what was really deployed. And the reality was the deployment was 1%, vastly different from the 8% number that we had hoped for and that a lot of our goals had been measured against. As you might expect, we were just stunned. We had no idea why this was happening. And a lot of that was because, at the end of the day, there were three big challenges that we just were not paying attention to, given the unconscious bias of our market position at the time.

00:02:51

The first one was telemetry. We just did not have the data to understand what was going on in the market. We relied primarily on third-party providers, who would provide the information they had back to us. The second was that we had horrible customer connection and listening systems. We were very good at reaching out to customers to ask their opinions on all sorts of things, but that revealed the unconscious biases that we had from our position, and we didn't have any pipes that allowed information to flow back the other way. And as you all know where this story goes, a lot of what people were trying to tell us was this: we had huge compatibility issues with Vista. If you remember, with XP being in the market for so long, our customers had built so many customized solutions and tools on that operating system that even if they wanted to move forward to Vista, they were unable to do that and had to roll back. And we missed those opportunities to take that information in. And the third one was just speed. Our waterfall process was three years. This one had doubled to six years. We had done that planning back in 2002, and by the time we released in 2007, the market was starting to move much faster, and we couldn't respond with the plans that we had built.

00:04:12

So we knew we needed to do something differently. These are some of the big challenges that people know about the Windows database. We have about 11 million work items in our database, across 12,000 engineers who work on this system on a daily basis. On an average day, we do two and a half million queries or updates. And we have about 350 million revisions. If you look at that data from a different perspective, you can see how it can be pictured as five and a half Space Needles, or steam engines, or the population of Chicago. We have that much data, and I think that is one of the things that everybody knows about Windows. But the issue we really wanted to talk about today is that scale is so much more than just that content.

00:05:04

We often talk about it as being like a learning disability: it's often invisible, and you need your own terms to figure out how you're going to move through what that scale means. So far we've talked about it just in terms of content and just in terms of engineers, but when you look at our scale in a general update, there are day-to-day processes and activities that also create a lot of complexity for our system and that aren't readily apparent. We work on over eight and a half million devices that are out there in the world. We scale across everything from HoloLens to Xbox to servers to embedded OSes that we build, proprietary, for some of the companies that are out here. And as you can see, in a general release we have half a million pull requests. That's half a million people asking for a code review from someone else.

00:05:57

We have 3.2 billion test cases that we're running. These day-to-day activities happen at every frequency imaginable. When you look at a general release, we have some teams, such as Defender and Store, that need to release multiple times a day, whereas if you get to Xbox, it's monthly. Our core OS is twice a year. And if you're building hardware, you're on a longer cadence, such as an 18 to 24 month cadence. So it's not just the people. When you add in those complexities, we need to give all of these engineers the ability to be nimble in the environments that they have, but also have us all work in one structure, and that was not the system we had when we left the Vista days.

00:06:53

And so we needed to do a few things differently. We're going to talk about things that we did in three areas. First, I'm going to talk about things we have done to help us be more agile when it comes to planning and work management. Then I'm going to hand it over to Sam, who's going to talk a little bit about how we moved agility through code and through the process of build and release. And then we will finish up with some of the challenges and changes that we've made for our customer listening systems. If you take a look at this, it's very similar to the Agile framework. It's a little bit different, because Windows always thinks they're a little different, so it's got different names up here, but the intent is the same. When we ended up with Vista and decided that we really needed to restructure and move everything onto one OS code base, we were all in different systems with different languages, and we needed to merge to one language.

00:07:46

So, working with Sam and the DevOps team, we looked at the Agile framework. We have something very similar for Windows. As you can see, at the top is the story, which is a little bit like an epic in that Agile framework. Those are the stories that executives tell all of you about the things that we're releasing. We weren't that great at understanding what these big rocks were that we wanted to do, or the story from the customer perspective. So we've gotten much more crisp: every piece of code that we work on has to ladder up in some way into one of the stories. And just to give you a sense, this is what we did last year, with a couple of the numbers for how this looks for the Windows base. We do about 43 stories across all of this work that we talked about.

00:08:35

You can see that those ladder into value props, about 206 customer promises, down to about 143,000 or so tasks that engineers are working on. So now that we understood we had a custom language, we recognized that while we were working with Azure DevOps with Sam's team and putting all of our data in there, we needed a couple of extra ways to assess what was going on in a nimble fashion, given all those invisible complexities that we had talked about. So my team has built a couple of extensions that we put on top of Azure DevOps, and I thought I would go through some of those today. The first one is Story Tracker. Again, if you think about those 43 stories that we generally have in a release, the executives need to understand in the planning cycle which stories we are going to go after, and they need a better way to assess how we are doing on those stories as we move through the release.

00:09:38

And so we built this extension. At the top, for each one of the stories, you can see the state of the deliverables for all of the stories that are going on. In this particular story, half of the deliverables are gray, which means they're proposed; they haven't yet started being worked on. That might be okay or it might not, depending on where we are in the cycle. So that gives the executives the information to make those decisions, should they want to dig in a little bit more. On the right, there's a bar chart that shows all of the groups that are involved. And below, you see all of the customer promises that align to that story. You can dig into each customer promise and also get a sense of how we are doing in terms of work that's been completed, down to the deliverable and task.
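The rollup described here, deliverable states aggregated per story across its customer promises, can be sketched in a few lines. This is only an illustration of the idea, not the Story Tracker extension itself: the class names, the state names ("proposed", "active", "completed"), and the sample data are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Deliverable:
    title: str
    state: str  # hypothetical state names: "proposed", "active", "completed"


@dataclass
class CustomerPromise:
    title: str
    deliverables: list = field(default_factory=list)


@dataclass
class Story:
    title: str
    promises: list = field(default_factory=list)


def rollup(story):
    """Count deliverables by state across every customer promise in a story."""
    return Counter(d.state for p in story.promises for d in p.deliverables)


def proposed_share(story):
    """Fraction of a story's deliverables that have not been started yet."""
    counts = rollup(story)
    total = sum(counts.values())
    return counts["proposed"] / total if total else 0.0


# Hypothetical sample: one story with two promises and four deliverables.
story = Story("Example story", [
    CustomerPromise("Promise A", [Deliverable("d1", "proposed"),
                                  Deliverable("d2", "completed")]),
    CustomerPromise("Promise B", [Deliverable("d3", "proposed"),
                                  Deliverable("d4", "active")]),
])
```

A story whose `proposed_share` is still one half late in a cycle would be the kind of "half gray" story that prompts a closer look.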

00:10:25

So this is a great way for the executives, who meet monthly and review these stories, to get a sense of how they are doing across some of these big items that they're tackling. But then we needed another layer, for the senior leads of our teams. A lot of our teams are managing groups of anywhere from 500 to a thousand engineers at a time. And one of the things with this complexity is that we have a lot of dependencies. In Windows-speak, a dependency is something that I'm building that Sam and his group need me to finish before he can finish his work. That is something that I'm producing. Conversely, we have a dependency that goes the other way: Sam's creating an API that I need him to finish before I can move forward with the plans that I've committed to for my team.

00:11:18

And so I'm consuming work that Sam's building. In the past, all of that was done organically, generally face to face, with people working to try to understand how things were moving through the system. So what we did was create an extension to pull that information together in one place. This is the Dependency Tracker. What you can see here, in the view that I just pulled up, are all the groups that are creating work that other people depend on. If you take the group on the left, which is called Sigma, you can see that they create the most work that other teams across Microsoft need them to complete before those teams can move forward with their jobs as they roll out their release cycles. You can look down below and understand what those specific tasks are, who's consuming them, who needs that information, and the risk.

00:12:12

We also have a timeline view and a risk view, so that if I'm Sigma and all of a sudden I decide I'm going to cut these five features, and you're the first team here, PCE, and those five features actually matter to you, you are now aware sooner rather than later that something is materially impacting your schedule. We've also created an immediate communication system, so that emails go out to everybody to let people know that these changes are happening, so that conversations can happen sooner and people can make the changes they need and move forward without it causing gridlock or blocking issues for a release. We've also rolled this out in our planning system, so that all of our executives now have to think about these dependencies before they even start and before they commit. They all agree to these dependencies at a high level and are aware of how they're interacting with each other, so that more thoughtful work is done throughout the release, with an understanding of how this work crosses other teams.
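The producer and consumer relationship, and the "who needs to hear about a cut" question, can be pictured with a small sketch. It is an illustration under assumptions, not the Dependency Tracker's actual data model: the `Dependency` record, both function names, the item names, and the team "TeamX" are hypothetical, while Sigma and PCE are the team names mentioned in the talk.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    producer: str  # team building the work
    consumer: str  # team that needs it finished first
    item: str      # the feature or API being depended on


def produced_counts(deps):
    """How many dependencies each team is producing for others."""
    counts = defaultdict(int)
    for d in deps:
        counts[d.producer] += 1
    return dict(counts)


def consumers_to_notify(deps, producer, cut_items):
    """Teams affected when a producer cuts the given items."""
    cut = set(cut_items)
    return sorted({d.consumer for d in deps
                   if d.producer == producer and d.item in cut})


# Hypothetical sample data.
deps = [
    Dependency("Sigma", "PCE", "feature-1"),
    Dependency("Sigma", "PCE", "feature-2"),
    Dependency("Sigma", "TeamX", "feature-3"),
    Dependency("TeamX", "Sigma", "api-1"),
]
```

Here `consumers_to_notify(deps, "Sigma", ["feature-1", "feature-2"])` yields only PCE, which is the list an email notification would go to in this toy model.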

00:13:18

Finally, for the leads of smaller teams, we have something that's a sprint view. In Windows we call it an iteration, but think of it as a sprint. This is a way for a lead who's not a senior person, but who has multiple customer promises that they're working on, to see how that work is laddering up. You can see here all of the work for a month or a sprint, and what's been completed and what hasn't. We've added information about what's getting pushed out and what's getting pushed in, so that we can understand whether we are losing ground, what that timeline is, and how we can regain ground, and have those conversations as needed.

00:14:07

Dashboards are a hugely valuable tool in the Azure DevOps world, and we use them as well. I just threw this one up here. You can see some customized widgets that we've created for bug tracking. We have a heat map over here on the right. For all of the groups that are related to this person's work, they see 48-hour bugs, and on through the various definitions that they need to understand what's a hot bug and what's a blocking bug. And we've created a bug glide widget, which we can share with people, that helps us understand where we are in the process of getting to a healthy state before we get ready to release. And then we have two things that we have created for ICs.

00:14:58

So if you are one of these 12,000 engineers who need to work in this base, it can get a little overwhelming, especially because a lot of our code is 30 years old and isn't as clean and as agile as we would like it to be. So one of the things that we have created is something called an Areas extension. Really, it is a place for us to understand where information is situated organizationally. What you can see here, in our OS project, are the various groups that ladder up to the OS project. You can double-click on those to get further and further information about the various teams, where their information and data sits, and who owns it, so you can go have a conversation as needed and not feel like you have to work up through the task to understand where your information is and who you need to talk to as you are building out your features.

00:15:56

The other thing that we've spent a lot of time on, which might be unique to Windows, is this: because our code base is so old and our culture was not set up in a way that is as agile as we would like, we have lots of binaries and lots of files that don't have owners associated with them. Some of it is really old code, but some of it is newer code that just didn't have an owner. So if you have gotten assigned a bug and you're trying to understand what's going on, you're looking in the binary and you don't know who to go to for more information. So one of the things that we've spent a lot of time on over the past two years is how to increase the percentage of files that have owners assigned to them. We have this extension where you can go down to the file, see the area path, and see the owners that are associated. If an owner isn't there, you have information on the right that gives all of the history of the pull requests and the other history around that file, so that you can get a sense of who you need to talk to. And an algorithm is constantly working through this to help assign owners to files, to improve that piece of information that people need to uncover as they go through their work.
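The talk does not say how the owner-assignment algorithm works. Purely as an illustration of how pull request history could feed such a suggestion, here is one simple heuristic: propose the author who accounts for a clear majority of the pull requests touching a file. The function name, the `min_share` threshold, and the author names are hypothetical, and this is not a description of the actual Windows tooling.

```python
from collections import Counter


def suggest_owner(pr_authors, min_share=0.5):
    """Suggest an owner for a file from the authors of PRs that touched it.

    Returns the most frequent author if they account for at least
    `min_share` of the PRs, otherwise None (no confident suggestion).
    """
    if not pr_authors:
        return None
    author, count = Counter(pr_authors).most_common(1)[0]
    return author if count / len(pr_authors) >= min_share else None
```

Returning `None` when no one dominates the history keeps the heuristic from guessing; in that case a person would still fall back to reading the file's history directly.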

00:17:13

So those are basically some of the tools that we've created in terms of planning and work management, but we know that there's much more to it than that. And part of that is how we handle all of this code, this legacy code, and get it into a system that makes it more agile and aligns with this journey to DevOps. For that, I'm going to hand it over to Sam, and he's going to talk a little bit about some of the work we've done in that space.

00:17:40

Thanks, Catherine. So four or five years ago, the senior leadership team under the new CEO, Satya, said, "We're going to have one engineering system for the company, and everyone's going to use it, and it needs to work for the whole company." And that of course includes Windows. We started this mail alias called Engineering System Architecture Discussion, and this was a torture machine. It had all of these threads about, "Well, we need to get to modern code practices, and Git and pull requests, and blah, blah, blah. And the solution is we just need to refactor Windows. We've got this monolith; we need to turn it into microservices, right?" And so, explaining up the management hierarchy that you're going to refactor this thing, and it's a journey of we don't know how long, but certainly measured in years, and there's going to be no customer benefit or deliverable along the way...

00:18:51

That didn't work so well. In the core Windows repo (there are several side ones, but this is the core repo at the center, the monolith), we have something over 7,000 developers who need to deliver code. That translates into about 11,000 topic branches that they work on. In a month, that's something like a third of a million commits, 30,000 pull requests, and around 10,000 branch integrations, because those topic branches get collapsed. If you look at that daily or in real time, that's 10 commits per minute and 1,100 pull requests a day. So if you think about people working in master, it's churning all the time. And if you assume that these are fantastic developers who, say, only make a mistake one day a year, it means that your code's broken all the time, which by the way is how Vista, and as Jeffrey talked about on Monday, Longhorn, didn't quite happen.
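The "broken all the time" point can be checked with a little arithmetic. A minimal sketch, assuming each of the 7,000 developers makes a build-breaking mistake on one day out of 365 and that developers err independently of one another (the independence assumption is a simplification added here, not something stated in the talk):

```python
developers = 7000          # "something over 7,000 developers" in the core repo
p_break = 1 / 365          # each developer breaks things one day a year

# Expected number of developers who break master on any given day.
expected_breaks_per_day = developers * p_break

# Probability that nobody breaks master on a given day, assuming independence.
p_clean_day = (1 - p_break) ** developers

# Probability that at least one developer breaks master on a given day.
p_broken_day = 1 - p_clean_day

print(f"expected breaks per day: {expected_breaks_per_day:.1f}")
print(f"chance of a clean day:   {p_clean_day:.2e}")
```

Under these assumptions there are roughly 19 breaking changes a day, and a day with no break at all is vanishingly unlikely, which is why validating changes before they reach master matters at this scale.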

00:20:42

It creates a merging problem that isn't a nice "I'll take your pull request, you'll take mine." It's like the freeway from hell. And we know, as Jez and Nicole said yesterday afternoon, you're supposed to do integration and collapse branches every day. You're supposed to get your code to master all the time. How on earth do you do it at that scale? Well, the good news is that we realized that the proprietary hierarchical version control that had been used in Windows and most of Microsoft for decades, called Source Depot, wouldn't cut it. So we did an eval. This is like four years ago now. We looked at Source Depot, we also looked at commercial alternatives like Perforce, and we looked at Git, and said: to get where we want to go, we need all these good things about Git that you see there. Only Git was going to meet our needs of being able to work fast, get a pull request flow going, and so forth. But only if we could get it to scale appropriately.

00:22:18

What do I mean? In that core repo of Windows, we had 360 gigabytes of data. To put that in perspective, if you saw Dylan and me on Monday, we were showing you Azure DevOps, whose most monolithic repo is maybe three gigabytes. So 100x down. If you compare this to Linux, which was built a different way from the beginning, it's more like 300x. So to move to Git, we needed to do something about performance. It took 12 hours to clone that repo. And that's counting only the successful ones.

00:23:14

If your laptop went to sleep, you had to start over. If the wireless burped, you had to start over. So this is being in the office on a great network, with a machine that's up, with no hiccups in anything, and no restarts. I mean, just doing git status was eight minutes, and half an hour for a commit was ridiculous, unusable. So to make Git work for Windows, we had to fix Git. We took three tries at this. You may have heard about the Git large file system. That was one of those attempts. It took about three years and a lot of dedication, top down, to the belief that this was worth it. So we developed what's now called the Virtual File System for Git.

00:24:22

It was GVFS in the beginning, and that's still the repo name. It gave us 300x performance improvements pretty much across the board. That 12-hour clone was down to five minutes. A commit was not half an hour; it was six seconds. The way we did that technically was essentially to use the pattern that you see on photo sharing sites. If you think about OneDrive or Google Photos or anything like that, you see thumbnails of everything, but you don't actually download the big JPEG until you click on the thumbnail. So you can think of this as providing thumbnails of all the files, but not downloading them until needed.
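The thumbnail analogy is a lazy-loading pattern: every file is visible as a lightweight placeholder, and its contents are fetched only on first read. The sketch below illustrates that pattern in isolation. It is not how GVFS is implemented (GVFS works at the file system level); `LazyFile`, `fake_fetch`, and the file names are hypothetical stand-ins.

```python
class LazyFile:
    """A placeholder that knows a file's path and size but not its contents."""

    def __init__(self, path, size, fetch):
        self.path = path
        self.size = size          # metadata is available without downloading
        self._fetch = fetch       # callable that downloads the real contents
        self._content = None

    @property
    def hydrated(self):
        """True once the real contents have been downloaded."""
        return self._content is not None

    def read(self):
        # Download on first access only; later reads reuse the local copy.
        if self._content is None:
            self._content = self._fetch(self.path)
        return self._content


downloads = []  # records which paths were actually fetched


def fake_fetch(path):
    """Stand-in for a network download (hypothetical)."""
    downloads.append(path)
    return f"contents of {path}".encode()


# "Cloning" creates placeholders for every file without downloading any of them.
tree = {path: LazyFile(path, size, fake_fetch)
        for path, size in [("a.c", 10), ("b.c", 20)]}
```

Listing `tree` shows every file immediately, while `downloads` stays empty until something calls `read()`, which is the property that makes the initial clone cheap.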

00:25:16

So we implemented GVFS and started moving parts of the Windows team to it. How'd it go? Well, it took about six months from no use. The blue is Source Depot, the predecessor, when everything was in Source Depot, and the orange is Git. If you notice, there was a point in March when we moved the bulk of the organization, and it happened over a weekend. And if you look at the heights of the curves, that's the number of pull requests or, before pull requests, what Source Depot called submits, a similar idea. So the amount of code activity actually went up, with no interruption, which was quite remarkable.

00:26:18

And now all of Windows is using Git, along with the rest of Microsoft, on what we offer in the market as Azure Repos, part of Azure DevOps. How did we deal with that problem of getting to intraday pull requests in that fast flow? Well, we have this problem: master's up here, and you've got all these people in Windows core, 7,000 engineers, working on their branches down here, and you need to figure out how to validate the code before it moves to master. So we developed some custom tooling. When Dylan and I showed you what we do on the DevOps SaaS, you saw one build running for each pull request. Each pull request got its build, and when the build completed, that's when you saw the results. Windows builds take too long to do that.

00:27:33

So we set up a system where we would have a continual build running as soon as it could, or continual builds running in parallel as soon as they could. The changes in your pull request would be applied to an LKG, a last known good, from master and validated before that pull request could move forward to be committed to master. So it's a way of getting that high speed of changes back to master and having them validated before the commit to master. And we had to build some custom machinery for that. So how'd it work?
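The core idea, validating a pull request's changes on top of the last known good (LKG) state before they reach master, can be sketched as follows. This is a deliberately simplified, serial model under assumptions: the real system runs continual builds in parallel on custom machinery that the talk does not detail, and `validate_and_merge`, `build_passes`, and the sample data are hypothetical.

```python
def validate_and_merge(lkg, pull_requests, build_passes):
    """Apply each PR's changes on top of the last known good snapshot.

    lkg:           dict mapping path -> content (the validated state of master)
    pull_requests: list of (name, changes) where changes maps path -> content
    build_passes:  callable that returns True if a snapshot builds and tests clean

    Returns (new_lkg, merged_names, rejected_names).
    """
    merged, rejected = [], []
    for name, changes in pull_requests:
        candidate = {**lkg, **changes}   # LKG plus this PR's changes
        if build_passes(candidate):
            lkg = candidate              # the candidate becomes the new LKG
            merged.append(name)
        else:
            rejected.append(name)        # master never sees the broken change
    return lkg, merged, rejected


# Hypothetical stand-in for a build: any file containing "BUG" fails it.
def build_passes(snapshot):
    return all("BUG" not in content for content in snapshot.values())
```

Because a rejected pull request never advances the LKG, later pull requests are always validated against a state that is known to build.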

00:28:25

So one of the things with the pre-code validation is that we're in the middle of a whole bunch of pilots right now with Windows. It's working really well. Our hope is that next year we can come back and go deeper into our learnings and whether it's something that we can roll out, but we wanted to share with you where we are in that journey. The other thing we wanted to quickly isolate: remember, back in the way-back machine with Vista, we had no customer pipes for listening. So we created the Windows Insider Program. The Windows Insider Program is basically worldwide, in almost all countries, with 95% coverage in terms of the 8 million devices we have and the 21 million apps that we have to take a look at. How do we make sure that a build reaches out to these and doesn't cause issues?

00:29:13

And our insiders are a key team for us. We rolled that out with Windows 10, and you can see with the public previews how we have consistently increased our connections with this team of insiders. I don't know if anyone here is an insider, but thank you very much. Since we started this with Windows 10, the Windows Insider teams have isolated and identified about half a million bugs that we have fixed as a result of that connection with our customers. So we thank you for that. The other thing we wanted to share: as we talked about some of these extensions, what we're trying to do is make them public, so that if anybody is interested in using them, if there are unique issues on your team where some of these tools might be helpful, you have the ability to do that.

00:30:04

So we have two tools on GitHub right now. One is Work Item One Click. That is for the individual engineers on your teams: if they need to create certain rules for their WITs or their queries, they can do this and have it help them with their workflow. We found that to be incredibly helpful because, again, with these 12,000 engineers, there's no consistent default system that we could set up that helps everybody. So personalizing this for them was the way for us to go. So that's there. The second one is a Work Item Migrator. You might have noticed that we had 11 million items in our account, and that really affects performance. So we worked with Sam's team to create a tool to migrate some of the archived, older code or older files that we weren't using into another system that is accessible if we need it, but helps us maintain our performance.

00:31:00

And teams are finding all sorts of interesting ways to use that tool. That is also on GitHub. And then finally, the Dependency Tracker that I talked about is on the Marketplace store from Microsoft. If your team has unique needs around dependencies, that tool is available for you all to use, to see if it can help meet your needs and provide additional value back to us that we can learn from as we continue to move forward with advancements in that space. So in conclusion, I guess I would say this is the challenge and the passion that Sam and I have, which is really trying to get these 12,000 software engineers to work independently and together. We're in that process right now, and we'd love to come back to you next year and let you know how that's going in terms of some of the build and release cycles that we have. And with that, thank you very much.