Las Vegas 2018

DevOps for AI

Because the AI field is young compared to traditional software development, best practices and solutions around lifecycle management for AI systems have yet to solidify. This talk discusses how we approached this at Microsoft in different departments (one of them being Bing).


Gabrielle Davelaar is a Data Platform Solution Architect specialized in Artificial Intelligence solutions at Microsoft. She was originally trained as a computational neuroscientist. Currently she helps Microsoft’s top 15 Fortune 500 customers build trustworthy and scalable platforms able to create the next generation of A.I. applications.


While helping customers with their digital A.I. transformation, she started working with engineering to tackle one key issue: A.I. maturity. The demand for this work is high, and Gabrielle is now working on bringing together the right people to create a full offering.


Her aspiration is to be a technical leader in the digital transformation of healthcare: empowering people to find new treatments using A.I. while ensuring privacy and taking data governance into consideration.


Jordan Edwards is a Senior Program Manager on the Azure AI Platform team. He has worked on a number of highly performant, globally distributed systems across Bing, Cortana and Microsoft Advertising and is currently working on CI/CD experiences for the next generation of Azure Machine Learning.


Jordan has been a key driver of DevOps modernization in AI+R, including but not limited to: moving to Git, moving the organization at large to CI/CD, packaging and build language modernization, movement from monolithic services to microservice platforms, and driving a culture of friction-free DevOps and a flexible engineering culture.


His passion is to continue driving Microsoft towards a culture which enables our engineering talent to do and achieve more.

GD

Gabrielle Davelaar

Data Platform Solution Architect/A.I., Microsoft

JE

Jordan Edwards

Senior Program Manager, Microsoft

Transcript

00:00:05

What does DevOps for AI mean? Well, I don't have to explain what DevOps is. I'm assuming everyone knows this. But it's a bit different when we're talking about DevOps for AI, because the traditional way of DevOps doesn't really work for AI. What we see is that a CI/CD solution requires reproducibility, validation, storage and versioning, deployment tracking, and data collection, all these things. Because eventually, if you have this really cool, awesome model, you want to bring it into production. And this is where we see a lot of customers fail big time. We actually call it the value of disappointment. Why? Because people start very enthusiastically with this model. They hire a data scientist, and the data scientist works on some data set, figures out the model, and then says: okay, I'm done, here you go, you can have it in production. And, you know, a typical software developer will say: yeah, not so much. It definitely doesn't work this way. We have to start all over again. So you get two annoyed people: the software developer saying, what the fuck are you actually presenting here? And me saying, as a traditional data scientist: well, I did my job, I gave you the model. That's what you were asking me, right?

00:01:33

Or here's a Jupyter Notebook, figure out how to make it work in production.

00:01:37

So how do we make this work? These are the trends that we saw. People are dealing with suboptimal knowledge, not knowing exactly what they're doing, whether it's the business delivery manager, the data scientist, or the engineer. And that also brings us to return on investment: people want to see their money coming back to them, especially those business owners that are saying, yeah, I'm going to give you that million. They want to see their return on investment. And then what we also see is that people just start randomly using a data set. They think, oh, this works, definitely going to use that, put it in. And then at some stage they are actually unable to replicate it.

00:02:36

Then we have another problem: the evil black box. GDPR, everyone knows it's a major pain point for a lot of big customers, especially the customers that I'm working with. They have to demonstrate to regulators how they built their models and how they can replicate them. If I provide a model that will tell you whether I should get my mortgage, then I at least want to know, if I got rejected, why I got rejected. And here you have a pain point, because how am I going to show that if I don't actually know much about how I built my model, how I can reproduce it, et cetera, et cetera? And then we have this problem that we're all familiar with: I've always done it this way. And this problem actually applies to a lot of data scientists.

00:03:31

And I was one of them. Coming freshly out of college, I ran my model, and then I actually got into a huge fight with one of the engineers, who said: well, I don't know what you're doing, but I cannot put this in production. So my model actually ended up in a PowerPoint presentation, and well, we all know what happens with PowerPoint presentations <laugh>, <laugh>. So yeah, that was definitely a disappointment. But I also soon realized that the data engineer I was talking to and the software developer actually had a point. And that got me to the stage of saying: well, we have to do something about this. So I moved to Microsoft to work on this and see how we can bridge this gap. So how can DevOps for AI help in this space? Well, it can provide you an overview: an overview of the resources, the ability to know how many resources are actually being used. Is this really helping, or do we have to shut it down? Because, again, the business owner wants to know: do I get my return on investment?

00:04:48

And now we're getting to the digital audit trail: the kind of trail that you can walk back through, saying, if I start all over again, I can reproduce this whole model, and I know exactly which data set I've been using, who has been working on it, and where it all started.
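To make the digital audit trail concrete: below is a minimal Python sketch of what capturing the lineage for a single training run might look like; the file names, fields, and the record_run() helper are illustrative, not the exact setup used at Microsoft.

# Minimal sketch: capture enough lineage to reproduce a training run later.
import hashlib
import json
import subprocess
from datetime import datetime, timezone

def sha256_of_file(path, chunk_size=1 << 20):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def record_run(dataset_path, params, author, out_path="run_record.json"):
    record = {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "author": author,
        "dataset_path": dataset_path,
        "dataset_sha256": sha256_of_file(dataset_path),
        "git_commit": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip(),
        "hyperparameters": params,
    }
    with open(out_path, "w") as f:
        json.dump(record, f, indent=2)
    return record

# Example usage (paths and parameters are hypothetical):
# record_run("data/mortgage_applications.csv", {"learning_rate": 0.01}, "gabrielle")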

00:05:11

And this is very important for regulators. They want to know these kinds of things. They want to know how you came to a certain decision. And if you can show that regulator, especially in the financial sector: this is how I got to my model, then even if there is a mistake in there, they won't be as harsh on you as they otherwise would be, because you are able to replicate it, and everyone makes mistakes. At least if you can track it down and follow what happened, it will definitely help saving your, how to put it politely, <laugh> <laugh>, okay. And then building transparent models. You want to know what your model is doing, and you want to know if there's an algorithmic bias in there. Algorithmic bias is when your data set is skewed to one side and there's no equal distribution to it.

00:06:13

And then the data science unicorns; everyone knows that, at least I hope so. Everyone has probably had the experience of working with a data scientist coming fresh out of university and working the way you do in university, because obviously they have not been trained in traditional software development. So if you start talking about CI/CD, if you're talking about unit testing, you will see blank faces. Even Git, <laugh>, even Git. I even got a question like: is this an acronym, and what kind of acronym is it? Is this a new model that I have to know of? No, it's actually not a model, but it will definitely help you get there. So yeah,

00:07:06

That brings us to how to do this in a much more mature way. You want to do model control, model validation, model versioning, model storage. Storage is so important because ultimately it will help you trace back how you came to that model. And you have to do that in quite an organized way, because otherwise you will have a swamp with so many models and so many data sets; if you're unable to connect them, you still have a swamp, and you are still unable to create a digital audit trail or do model deployment. So now we're getting somewhere; we're going to go a little bit more into the tech side.

00:07:50

What does it mean if you are combining DevOps with AI? You want to have three phases: experiment, develop, operate. You want to prove feasibility, then build it, then operate it and eventually scale it. Because ultimately that's the goal we want to achieve, right? We want to make it possible for people to have successful AI in production that can scale on a large scale. During Ignite, we actually had a very awesome case that we have been working on. Shell showed a very cool case where they are using an AI model at retail stations to predict behavior. And they want to scale this to 45,000 retail stations. That means you have 45,000 retail stations that all have their own models and all have their different situations, but you still want to bring that together and still want to learn from it. That's a lot. So now we're getting to an awesome stage where I'm going to hand it over to my fellow engineer, who's actually going to show you how we do this.

00:09:05

How do I click it?

00:09:07

Uh,

00:09:08

Okay,

00:09:10

Cool. So again, from a data science point of view, these are the three major steps we have in creating models. There's the data preparation step, where you're taking data out of your lake, shaping it, and extracting the features you care about to go and build the model. There's the experimentation step, where you use the IDE of your choice, submit a job on some type of compute, and try to find a model that actually solves your problem. Then we get to: once I have a good enough model, I want to be able to register it and track it and allow my developers to use it in production scenarios. So those are the three major steps we have there: the model lifecycle. From a data scientist's point of view, this is really what they care about.

00:09:58

They're taking data in, creating a model, publishing it. They may be customizing an existing model, but from their point of view, all they really care about is the model asset itself. The deploying of the model is sort of a black box to them. And the model may be used in a variety of different places: you could be deploying it to the cloud, it could be running as part of a larger data pipeline, like inside of a Spark pipeline, or it could be deployed on edge devices, as Gabrielle mentioned is the case with Shell. And so you have this whole funnel; this is the data scientist's point of view. Now we're going to talk about what changes when we introduce the developers.

00:10:38

So, when we talk about breaking the wall and lifecycle convergence, we have the app developer flow, which I'm sure you're all very familiar with: IDE, source control, CI/CD, going to the cloud. For data scientists, again, it's a bit different and more simplified. From their point of view, they're just building a model that's going to solve a problem. They might have a little test app or a couple of cells in their Jupyter Notebook showing that the model works, or they may write a paper showing how the model works, but that's all they really care about. When we talk about normal applications versus AI-infused applications, you'll see two new assets which enter the domain: one of them is data and the other is the model. So before you get started with creating a new version of a model, or building your first model, you need to analyze and see: has the data changed?

00:11:30

Has the profile of the data changed? Is one of the columns not there anymore, or all zeroed out? Do I have enough data to go ahead and build a new model? Then you have to analyze the model itself, comparing it to the previous version. It's not like with software, where your tests pass and you're good to ship: you might have, say, higher precision but lower recall on what you're trying to do. Then you talk about testing the model in the application. Again, the model is usually built on a different stack than the rest of the app, so how do I make sure that I have those same features available in my application so I can actually use the model to predict? All these things just make it tricky. So, step one: this is what I call the nasty handoff phase.
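As an aside, here is a minimal Python sketch of the kind of data checks just described (a missing column, a column that is all zeroed out, not enough rows to retrain); the column names, thresholds, and validate_profile() helper are illustrative assumptions.

# Minimal sketch of pre-training data profile checks.
import pandas as pd

def validate_profile(df: pd.DataFrame, expected_columns, min_rows=10_000):
    problems = []
    missing = set(expected_columns) - set(df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if len(df) < min_rows:
        problems.append(f"only {len(df)} rows, need at least {min_rows}")
    for col in expected_columns:
        if col in df.columns and df[col].nunique(dropna=True) <= 1:
            problems.append(f"column '{col}' is constant or all zeroed out")
    return problems

# Example: fail the pipeline before training if the profile has drifted.
# issues = validate_profile(pd.read_parquet("features.parquet"),
#                           expected_columns=["age", "income", "loan_amount"])
# if issues:
#     raise RuntimeError("data validation failed: " + "; ".join(issues))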

00:12:14

That's the phase where the data scientist throws the model over the wall to the developer and says: hey, make this work in the application, I think I've got something that's better than that conditional statement you have in your code. That usually takes on the order of a couple of months today, to be able to actually use the model in a real application in production, and even longer for some teams, depending on the compliance concerns involved and what type of data is in the model. After a little bit of back and forth between the developers and data scientists, we usually get to this step here, where the developers say: hey, at least put the code you're using to generate this model in source control somewhere, so I can reproduce the model. Most of the customers we're working with today are just at this phase now, where they're trying to automate the training process and have reproducible models, not just a path to something in a data lake somewhere saying: hey, here's the model.

00:13:09

I don't know where it came from or what data is inside of it, and that's very dangerous. Along with that, you can now throw things like unit tests on your code. So before I go and waste eight hours of GPU compute time training a new model, how about I make sure the code actually passes all the tests first? Surprisingly commonly, we'll see somebody burn hours or days of compute time, and there was a bug in their code, right? So how do you short-circuit that?
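A minimal sketch of what "pass the tests before you burn GPU time" can look like, written as pytest-style checks; the my_project.training module and its build_features()/train_model() functions are hypothetical stand-ins for your own training code.

# Cheap CI checks that run before any training job is submitted.
import pandas as pd
from my_project.training import build_features, train_model  # hypothetical module

def test_build_features_handles_missing_values():
    raw = pd.DataFrame({"age": [42, None], "income": [50_000, 60_000]})
    features = build_features(raw)
    assert not features.isnull().any().any()

def test_train_model_one_step_runs():
    # A tiny synthetic batch is enough to catch import errors, shape bugs,
    # and bad default parameters before an eight-hour GPU run.
    raw = pd.DataFrame({"age": [30, 40, 50, 60],
                        "income": [30_000, 40_000, 50_000, 60_000],
                        "approved": [0, 0, 1, 1]})
    model = train_model(build_features(raw), raw["approved"], max_iter=1)
    assert model is not None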

00:13:38

Next we talk about: now that I have this process to get the model trained automatically, where do I store the model? How do I version the model? Do we apply standard semantic versioning concepts like we do in a packaging environment? From a lineage point of view, do I need to trace which dataset went into my model, what code went into my model, and which compute was used to train my model? Because that also has potential concerns on the compliance side the second your data and/or your customer's data enters a different compute context: how do you handle that? And so that's where we start to talk about lifecycle management as well. The goal from a model CI/CD point of view is to give you the controls and knobs to be able to effectively manage that lifecycle.
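For illustration, here is a short sketch of registering a model version with lineage tags, using the (v1) azureml-core Python SDK as one possible registry; any registry that supports versioning and arbitrary metadata would do, and the paths, names, and tag values here are assumptions.

# Register a trained artifact; the registry assigns an incrementing version.
from azureml.core import Workspace
from azureml.core.model import Model

ws = Workspace.from_config()  # reads the workspace config.json

model = Model.register(
    workspace=ws,
    model_path="outputs/model.pkl",      # local artifact produced by training
    model_name="mortgage-approval",      # illustrative name
    tags={
        "dataset_sha256": "…",           # e.g. from the run record sketched earlier
        "git_commit": "…",
        "compute_target": "gpu-cluster",
        "training_run_id": "…",
    },
    description="Gradient boosted classifier for mortgage approval",
)
print(model.name, model.version)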

00:14:22

The final step, and the happiest path, is when you actually have feedback flowing from your model deployed across a variety of targets. That data can go back to your data scientists, and they can actually use live information from your app to improve their model, so they can see whether it was helping the users or not, and how the model was behaving when it was being used in a real application. And they can have a healthy relationship now with the developer, where they can actually say: hey, if you instrument and add this extra telemetry into the app, then I can improve the model for you. So now you get to this healthy and productive flow where the friction's gone, the model's being used in the app, the developers are happy, and the data scientist is happy because their work is being used for a real production system.
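A minimal sketch of that feedback loop in Python: log each prediction with the model version that produced it, and let the application report outcomes later so the two can be joined for retraining. The field names and file-based storage are illustrative assumptions.

# Log predictions and app-observed outcomes for the retraining loop.
import json
import uuid
from datetime import datetime, timezone

def log_prediction(features, prediction, model_version, path="predictions.jsonl"):
    event = {
        "event_id": str(uuid.uuid4()),
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
    }
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")
    return event["event_id"]

def log_outcome(event_id, outcome, path="outcomes.jsonl"):
    # The app calls this later (e.g. "user accepted the offer"), giving the
    # data scientist labels to join against predictions.jsonl.
    with open(path, "a") as f:
        f.write(json.dumps({"event_id": event_id, "outcome": outcome}) + "\n")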

00:15:08

And this is just, again, a different pivot on the happy-path flow, where you have a company-wide model store or model catalog. Developers can browse the catalog and figure out which models they want to use in their application. They can ask data scientists: hey, can you customize this model for me? Sure, okay, I'll take it off the store; I know how it was produced before; I can feed different data into it or try different features on it; then I can take that, publish it out, and have it seamlessly consumed on the development side. So even giving the developer something like: oh, here's a Swagger spec where you can just generate a client library to call my model, or package it up into a DLL, or make it easy to use in the modules I'm trying to deploy out to the edge. All those things are just sugar and help to make this a happy relationship.

00:15:59

<laugh>. So, pain points. I think I touched on a few of these already, but on the ML stack, the code is often R or Python, or some variant of Spark, Java, or Scala code. It's usually not the same as the rest of the application stack, so again, that featurization logic needs to be rewritten, and there may be lots of glue you have to wire up. It's also hard to track breaking changes when you're dealing with different languages. And then on the model side, testing the accuracy of models is not easy for developers; they don't really understand it. How do you design tests that can float and have some variance on them, instead of just asking: okay, was this the exact metric I expected? Where do you set that barrier from one version of a model to another?

00:16:49

How much float can you actually support? You need to be flexible there and work with the data scientists to figure out what an acceptable accuracy loss is. Also on the performance side: how do I compare and contrast the improved accuracy of a model when it takes three times as long to run now as it did before? These are all things you need to think about when you're bringing models into production systems. When we talk about traditional applications, I'm not going to waste too much time on this, you're all familiar with it: traditional CI/CD pipelines, you have build and release in place. In AI applications, again, you now have two personas working in two different contexts or environments, and normally they're working in different repositories as well. So as I'm working in these repos, I go and build my code and test it. Either the model gets directly integrated into my app and deployed, or the model is deployed as a separate service and I call out to it over a REST API or something like that. But in either case, you need to have integration testing in place to make sure that it works.
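Here is a minimal Python sketch of a model test "with float on it": the candidate's metrics are compared to the previous version's within agreed tolerances instead of against exact values. The metric names and thresholds are illustrative and would be negotiated with the data scientists.

# Compare a candidate model's metrics to the previous version within tolerances.
def check_candidate(candidate, previous, max_accuracy_drop=0.01,
                    max_latency_ratio=1.5):
    failures = []
    if candidate["accuracy"] < previous["accuracy"] - max_accuracy_drop:
        failures.append(
            f"accuracy dropped from {previous['accuracy']:.3f} "
            f"to {candidate['accuracy']:.3f}")
    if candidate["p95_latency_ms"] > previous["p95_latency_ms"] * max_latency_ratio:
        failures.append(
            f"p95 latency {candidate['p95_latency_ms']} ms exceeds "
            f"{max_latency_ratio}x the previous {previous['p95_latency_ms']} ms")
    return failures

# Example: better accuracy but 3x slower would still fail the latency gate.
# failures = check_candidate({"accuracy": 0.91, "p95_latency_ms": 120},
#                            {"accuracy": 0.90, "p95_latency_ms": 45})
# if failures:
#     raise AssertionError("; ".join(failures))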

00:17:56

So, what we have now is a proposed process for doing CI/CD for models, and we have services in Azure that help support doing this.

00:18:05

So this is from a feature-branch point of view: every time I commit to my data science repo, I want to actually create a sandbox environment. It says NDA here, but we encapsulate everything we can in Docker containers to clean it up. Make sure you have all the requirements there, lint the code, run unit tests on it, publish those test results, and also look at your code coverage to make sure that you're not adding a bunch of extra functions and features into your model that aren't being covered. Sorry.

00:18:37

So from a PR point of view, whenever I kick off a PR, I want to actually do testing and validation and go and train the model. However, I may not always want to train the model on the full set of data, because again, that could take time and be expensive, so I may want to have a smaller sample set I can use. Also, in the case of compliance, when I'm doing pull requests I may want to use, say, public data I've scraped off the internet and not my customer's data, because then I can debug and figure out what's actually going on in the training process if I'm having issues. This example here is talking about deploying to ACI, that's Azure Container Instances for those of you who don't know what it is: basically you deploy it into a container you can spin up quickly and tear down, and make sure that it actually works once you've put it in the container and put the service on top of it. Then here are the example flows for when I'm actually going to production: I've got my model artifact, and I'm packaging it up and deploying it out to a production cluster. Today, our happy path for real-time inference of models is to use AKS, which is the Azure Kubernetes Service. So we deploy it out to there, put an autoscaler on it, and then your applications can start calling into it.
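For illustration, a minimal Python sketch of the smoke test run against the dev/test container instance before promoting to AKS: post a known payload to the scoring endpoint and check the response shape. The URL, payload, and expected response format are assumptions about a particular service.

# Smoke-test a freshly deployed scoring container.
import json
import requests

def smoke_test(scoring_uri, timeout_s=10):
    payload = {"data": [[42, 50_000, 250_000]]}  # one sample row of features
    response = requests.post(
        scoring_uri,
        data=json.dumps(payload),
        headers={"Content-Type": "application/json"},
        timeout=timeout_s,
    )
    response.raise_for_status()
    result = response.json()
    # Assumes the service returns one prediction per input row.
    assert isinstance(result, list) and len(result) == 1, result
    return result

# smoke_test("http://<aci-instance>.azurecontainer.io/score")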

00:19:47

One second.

00:19:47

Sorry. Okay.

00:19:50

So, other pipelines we'll talk about here. One of the things we haven't talked about in this talk yet is converting and quantizing models. I may have trained a model on a set of data, but to make it run on an edge device, I need to shrink it. It might be three and a half or four gigs of a deep neural net, but my device may not have that much memory on it, especially when you're talking about edge devices. Or I may not want to wait for the time it takes to propagate that model out to everything. So we have converting and quantizing of the model, where we shrink it and then analyze how much accuracy it lost when we pruned layers of the graph out.
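A minimal Python sketch of that convert-and-quantize check: measure how much accuracy the shrunken model lost on a held-out set before shipping it to edge devices. The quantize_model() converter is a hypothetical stand-in for whatever tooling you use (ONNX, TF Lite, etc.), and the threshold is illustrative.

# Compare full vs. quantized model accuracy on a held-out validation set.
from sklearn.metrics import accuracy_score

def quantization_report(full_model, quantized_model, X_val, y_val,
                        max_accuracy_loss=0.02):
    full_acc = accuracy_score(y_val, full_model.predict(X_val))
    quant_acc = accuracy_score(y_val, quantized_model.predict(X_val))
    loss = full_acc - quant_acc
    return {"full_accuracy": full_acc,
            "quantized_accuracy": quant_acc,
            "accuracy_loss": loss,
            "acceptable": loss <= max_accuracy_loss}

# quantized = quantize_model(full_model)   # hypothetical converter
# print(quantization_report(full_model, quantized, X_val, y_val))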

00:20:26

Okay?

00:20:27

Then there is retraining. This is sort of the happy nirvana I showed in phase four there, where now my model is reproducible, I have validation on top of it, I have a proper place to store it and version it, and I have deployment pipelines that allow me to do a safe and controlled rollout. I can do real retraining, and then I can get it out to customers and do A/B testing on it. So I can have both models in my service at the same time and flight a subset of the customers with the new version to see how it works. This is how we do it today in all of our big AI-infused products, like Bing and Office. So, a couple of demos here. I'm just going to jump to the CI/CD pipeline because I have that one popped up already. Could you flip to my computer?
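As an aside on the flighting step just described, here is a minimal Python sketch of routing a sticky percentage of users to the new model version; the version names and percentage are examples, not the mechanism used in Bing or Office.

# Route a sticky percentage of users to the new model version.
import hashlib

def pick_model_version(user_id, new_version="v2", current_version="v1",
                       flight_percent=5):
    # Hashing the user id keeps each user in the same bucket across requests.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return new_version if bucket < flight_percent else current_version

# 5% of users get v2; compare their telemetry (e.g. the feedback logs sketched
# earlier) against the v1 population before ramping up.
# version = pick_model_version("user-1234")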

00:21:19

Yep.

00:21:19

Awesome.

00:21:21

So, this is an example build pipeline showing basically how to train a model. In this case I am using Azure DevOps; for those of you who aren't familiar, it's an awesome CI/CD solution from Microsoft. Here, the first thing I'm doing is installing the ML extension for Azure Machine Learning. I'm unit testing my code before I go and submit the training job, analyzing the quality of my code, and storing those test results along with the build. Then I'm basically creating my log for the run with the name of my experiment and submitting the job to go run. In this case we're running on a Data Science VM, but you can also submit a job to go and run on a batch compute cluster that will spin up and tear down on demand for you. You can also submit jobs to go and run against a Spark cluster.

00:22:10

So the intent here is to keep the code agnostic of the compute. We also have data stores as a concept, so you can mount and unmount data stores as part of your training job. I'm downloading the model, I'm putting the model in the model registry here, and then I'm basically preparing the dependencies I need. In this case, I have a score file and a Conda file that I'm going to use to create my container, copying the dependencies, and then preparing the artifacts for deployment. And I'll just show you what one of those actually looks like. Live demo <laugh>. So, drill into the logs for this one here. I can say: oh, here are my tests that ran. We've got 40 tests sitting inside here, and I can see how long they took to run. I've got the artifacts, so if I pop them open here, you'll see: okay, I've got my score file. I've got an example model file I could use; that's not the one in the registry, but I have one stored in the registry as well. Then I have some other dependencies that I care about, like classes I may need when I bring my model to production.
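For context, here is roughly what the "submit the training job, then register the model" steps of a pipeline like this do, sketched with the (v1) azureml-core Python SDK; exact class and parameter names vary by SDK version, and the experiment, folder, and compute names are illustrative.

# Submit a training script to a compute target, then register the result.
from azureml.core import Workspace, Experiment, ScriptRunConfig

ws = Workspace.from_config()
experiment = Experiment(workspace=ws, name="mortgage-approval-training")

run_config = ScriptRunConfig(
    source_directory="training",      # repo folder containing train.py
    script="train.py",
    compute_target="cpu-cluster",     # could equally be a DSVM or Spark cluster
)

run = experiment.submit(run_config)
run.wait_for_completion(show_output=True)

# Promote the artifact the run wrote to ./outputs into the model registry.
model = run.register_model(model_name="mortgage-approval",
                           model_path="outputs/model.pkl")
print(model.name, model.version)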

00:23:22

And then you can see, from a logs point of view, about how long everything took to run. When I go and run my experiment, I can drill into here and basically say: oh, okay, here are all the dependencies,

00:23:40

Authenticating

00:23:41

with Azure, and here's the actual training of the model. We also give you a link where you can go and click and see your experiment run results live. And then here are the metrics for my model; we pin these as well. And then I can also track all the individual runs on the release side. So here's an example release pipeline, again using similar steps to what I had over here. Let's look at what these commands are actually doing.

00:24:11

So if I drill into this, this is that test deployment step I was talking about. Basically, I am creating the container image, which has my model packaged up inside of it, then I'm creating a service, testing it, and then tearing it down. And you can see these commands are pretty simple: I'm creating a container, here's my model artifact and my score file, the code's in Python, go ahead and deploy it, create the service. Again, it's a dev/test service; we're just creating a container instance, making sure that it actually works, and running this test query against it. I have this as an inline script, but you could also put it in a repo if you want to treat it more as infrastructure as code. And then I tear it down when I'm done, because I don't want to keep my ephemeral resources running; there's no reason to do that. Then I have the production AKS deployment plugged in here, with similar steps as before. I've already created that image, so I don't need to make a new one; I'm just going to take that same image from the previous phase, that's this image ID parameter here. You can pick which AKS cluster you want to deploy to, and I also make sure that it works over here. I don't want to tear down the production one, because that's a production service.

00:25:24

And I'll give you a quick insight into the repo as well, so you can see the actual code. Here's some example training code; this is what's actually getting submitted. Then on the scoring side I can say: okay, here are my Conda dependencies that I care about, and then my score file.
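For reference, a score file like the one shown typically follows an init()/run() pattern that the hosting service calls; below is a minimal sketch, with the registered model name and the input handling as illustrative assumptions.

# Minimal scoring script: load the registered model once, score per request.
import json
import joblib
from azureml.core.model import Model

model = None

def init():
    global model
    # Resolves to the copy of the registered model packed into the container.
    model_path = Model.get_model_path("mortgage-approval")
    model = joblib.load(model_path)

def run(raw_data):
    try:
        data = json.loads(raw_data)["data"]
        predictions = model.predict(data)
        return predictions.tolist()
    except Exception as exc:
        # Returning the error keeps the service responsive and debuggable.
        return {"error": str(exc)}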

00:25:45

Cool.

00:25:46

And I think it's also important to know that what we're trying to develop at this moment is going to be agnostic. Obviously we would love people to use our products, but we find it way more important for the industry to change in this regard, because this is really necessary for AI to grow to a large scale. So if you would like to use Jenkins, or if you would like to use Databricks in those pipelines, fine by us. It's really more about standardizing a way of working rather than saying: please use Azure. I do say please do use Azure, but as a side note, this is also possible if you want to use other products. And it is really important to have this standardization, because companies are running into these problems, and that's what we call the value of disappointment; this can actually help a lot of customers. Also, one note: if you would like to have the presentation, because I saw a lot of people taking photos <laugh>, feel free to reach out and we can send it to you. We're also posting it; I think the decks are uploaded, right? Yeah, I think so. And otherwise we also have it available on LinkedIn now. So, thank you so much for being here. If there are any questions, please feel free to ask them. Cool. Alright, thanks.