Las Vegas 2019

MLOps - Accelerating Data Science with DevOps

The business impact of high-performance DevOps has been proven widely. Unfortunately, as enterprises now try to use big data with machine learning, data science tends to get left behind. The challenges engineers and data scientists face when developing an ML-based system include reliably deploying models, managing ML assets at scale, and knowing when models are going stale.


Introducing a DevOps approach to machine learning helps solve these challenges by providing structure for collaboration and for bringing models into production.


This talk covers real improvements that MLOps has brought to large enterprise customers in maintaining asset integrity across teams, accelerating model operationalization, and enabling a sophisticated AI application lifecycle. We share pointers on how you can follow the same journey in your org.


Jordan Edwards is a Senior Program Manager on the Azure AI Platform team. He has worked on a number of highly performant, globally distributed systems across Bing, Cortana and Microsoft Advertising and is currently working on CI/CD experiences for the next generation of Azure Machine Learning. Jordan has been a key driver of DevOps modernization in AI+R, including but not limited to: moving to Git, moving the organization at large to CI/CD, packaging and build language modernization, movement from monolithic services to microservice platforms, and driving for a culture of friction-free DevOps and flexible engineering. His passion is to continue driving Microsoft towards a culture which enables our engineering talent to do and achieve more.


Shivani Patel is a Program Manager at Microsoft working on MLOps for the Azure Machine Learning Platform team.

Jordan Edwards

Senior Program Manager, Microsoft

Shivani Patel

Program Manager, Microsoft

Transcript

00:00:02

Hi, I'm Shivani Patel. I'm a program manager at Microsoft on Azure Machine Learning.

00:00:05

I'm Jordan Edwards. I'm also a program manager at Microsoft working on Azure Machine Learning.

00:00:09

Um, and so today we're going to talk about MLOps and basically accelerating your data science workflows with DevOps. All right. So for today, we're going to be covering: what is MLOps? What does the MLOps life cycle look like? And how do customers in the real world use MLOps and get to an MLOps workflow? Raise of hands: how many of you have heard of MLOps, used it, are familiar with it? Okay. Wow.

00:00:39

A little bit, nearly 10%

00:00:41

Of the room. All right. So some of this might be repeat for some of you, but let's dive in a little bit into the ML piece of it. So what does the machine learning life cycle generally look like? First you start with getting the data, data acquisition, cleaning it up, putting it into a dataset, and then developing, experimenting, and training, eventually coming to a model that solves a real business problem. Then you package it up into a format where you can actually run it and get value out of it. Next you want to go ahead and validate it, making sure that it's reaching all, you know, accuracy and performance thresholds, and any sort of regulatory compliance that you have in place for that model. Next is the fun part of actually deploying it and making predictions, so actually getting the real value out of your model. And the last phase here is monitoring it. You want to make sure that you're constantly looking at the model, making sure that it is going to perform and give you the value that you expect it to. And then once it's, you know, degrading, not performing as well, kicking off this training pipeline all over again. So that's pretty much the machine learning life cycle in a nutshell.
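To make those stages concrete, here is a minimal toy version of that loop in Python using scikit-learn. It's an illustrative sketch, not the speakers' actual pipeline; the dataset, the pickle packaging, and the 0.9 accuracy gate are all assumptions for the example.

```python
import pickle
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 1. Data acquisition / prep
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2. Develop / experiment / train
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 3. Package the model into a runnable artifact
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# 4. Validate: enforce an accuracy threshold before release
accuracy = model.score(X_test, y_test)
assert accuracy >= 0.9, f"failed validation gate: {accuracy:.2f}"

# 5. Deploy and 6. monitor come next; when monitored performance
#    decays, this whole script is re-run as the retraining pipeline.
print(f"model validated at {accuracy:.2%}, ready to deploy")
```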

00:01:45

Now, what is MLOps? It's basically just bringing DevOps principles to your machine learning workflow. So it's integrating a continuous integration kind of flow into the data science workflow, automating the building and testing of your code, creating these repeatable training pipelines, and then providing this continuous deployment workflow, which automates the validation of the packaged model and then deploys it out to your target, you know, device, server, wherever you're deploying your model to. And then monitoring not only your pipelines and infrastructure, um, but the model performance, as well as the new data that's actually coming in, creating this data feedback flow to kick off this pipeline all over again. So how is this different from DevOps?

00:02:30

So how many of you have heard of DevOps?

00:02:33

Just making sure we're at the right conference. Um, so as you can see, you bring data and models into your system and it pretty much just explodes. In a traditional software development life cycle you have these small integration tests and unit tests; you're not managing a lot of pieces. And then when you move into an ML-based life cycle, you have data tests, model tests, you know, your system-level monitoring, your model monitoring, the training pipeline monitoring. So it gets a little complex. Yeah. Yeah. So what ends up happening is that your ML code is actually such a small piece of your infrastructure. You have so many more assets that you have to manage, um, you know, in a scaled-out machine learning workflow.

00:03:16

So essentially what makes it really different is you're putting together three different workflows here. You have your data engineer, who is cleaning up this data and creating these data pipelines; your data science workflow, which is experimenting and creating models; and then your software engineer, who's operationalizing the model. So what makes this so different is that the versioning of your assets ends up being very different. You need to be able to version the datasets, the schema, how it's changing. Um, along with that, you also need to have that lineage of where your model is coming from, where it's being deployed, what datasets are being used in your model. So you end up having a billion more requirements on, uh, additional artifacts that you're creating. And the same piece here is model reuse; it's very different from reusing software. You need to create a training pipeline where you're caching a bunch of these steps that you're creating, so then you can maybe transfer-learn or fine-tune the model that you've created, so it stays relevant in that context. Inevitably, models tend to decay over time. Everything in the world is changing around us. Data is changing. So you have to go in and update these models over time.

00:04:22

Now, what does the life cycle look like? So initially what ends up happening is you start by doing kind of this exploratory phase of first getting in that data, you know, creating a model, finding an algorithm that works, and then you have this really useful model. So your data engineer is, like, hoping, okay, maybe this data acquisition works and I can actually scale it out. The data scientists create the model. And then you have your software engineer, your DevOps engineer, who's saying, how the heck do I deploy this? Where do I deploy it? How do I package it?

00:04:53

There tends to be a lot of friction between the personas here. Yeah. Okay.

00:04:58

And then the next phase here is actually reproducing that, um, process of creating your model. So this is where the continuous integration piece comes in, um, turning your training process into this frozen training pipeline that you can reproduce over and over again from anywhere.

00:05:15

So, um, for those here who aren't familiar with data science and data scientists, how do they normally work today?

00:05:21

So right now, what happens is you're a data scientist working on your own laptop. You're training in your own environment, you're tracking your own things. There's no standardized approach to creating your model. You're doing it in your own context, which is really hard to share across a huge team. Yeah.

00:05:37

A lot of them are researchers or have PhDs. They're not classically trained software engineers and thus they don't understand things like version control. So taking any of the work they do and bringing it to production is tricky.

00:05:48

So you need to be able to capture all these pieces that we talked about: datasets, the environment that you're actually working in, um, the code that you're creating, and then all the metrics that come along with creating that model. All right. And the next phase is, let's just get this model running somewhere, so you're actually getting value from it. So once you push this model into this centralized store, um, you know, forcing the data scientists to standardize a little bit of their approach and push what they've created into a central store, you kick off this release process of actually packaging up the model, um, in the context of its deployment environment. Your training environment and your deployment environment may be different, so you make sure you have that environment, and then you test it by, uh, throwing some sample input data at it, making sure it's behaving the way that you expect, um, in the packaged format, and then you release the model. So this is, you know, triggering it and pushing it out to a device, a server, wherever your deployment target is.
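As a sketch of what that push-to-a-central-store, package, and smoke-test step can look like, here's a hedged example assuming the v1 Azure ML Python SDK (azureml-core) of this talk's era; the model name, tags, score.py, and environment.yml are illustrative placeholders, not the speakers' actual assets.

```python
import json
from azureml.core import Workspace, Model, Environment
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()  # the team's shared workspace is the central store

# Push the trained model into the central registry, with lineage tags
model = Model.register(
    workspace=ws,
    model_path="outputs/model.pkl",
    model_name="safety-classifier",          # placeholder name
    tags={"dataset": "factory-footage-v3", "git_commit": "abc123"},
)

# Package it in the context of its deployment environment...
env = Environment.from_conda_specification("inference-env", "environment.yml")
inference_config = InferenceConfig(entry_script="score.py", environment=env)

# ...and throw sample input at the packaged model before release
service = Model.deploy(ws, "safety-classifier-smoke", [model],
                       inference_config,
                       AciWebservice.deploy_configuration(cpu_cores=1,
                                                          memory_gb=1))
service.wait_for_deployment(show_output=True)
print(service.run(json.dumps({"data": [[0.1, 0.2, 0.3]]})))
```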

00:06:49

Well, that's where your engineer comes in. Now, the last phase here is actually automating it and reaching that happy state of MLOps. So this is basically: we start with the data engineer, who has a centralized data store. They create these pipelines that are constantly pushing cleaned-up data into a centralized data catalog. And that's where your data scientists can pick up the new data and trigger this pipeline that essentially will go through these standardized steps of creating a model, and then compare it to the last known model that was out there, the last known good model. Um, and then push this new model that's been certified by the data scientists into the centralized model registry. And that's where your, you know, software engineer comes in, picks it up, and triggers a deployment pipeline that'll go through kind of the same steps as it did before: package it, certify it.
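A sketch of that compare-against-the-last-known-good-model gate, again assuming the v1 Azure ML SDK; the model name and the accuracy tag are invented for the example.

```python
from azureml.core import Workspace, Model

ws = Workspace.from_config()

def promote_if_better(new_accuracy: float, model_path: str) -> bool:
    """Register the new model only if it beats the last known good one."""
    try:
        current = Model(ws, name="demand-forecast")    # latest registered version
        baseline = float(current.tags.get("accuracy", 0.0))
    except Exception:
        baseline = 0.0                                 # no model registered yet
    if new_accuracy <= baseline:
        print(f"new model ({new_accuracy:.3f}) does not beat "
              f"last known good ({baseline:.3f}); not promoting")
        return False
    Model.register(workspace=ws, model_path=model_path,
                   model_name="demand-forecast",
                   tags={"accuracy": str(new_accuracy)})
    return True
```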

00:07:39

So you have a standardized way of testing the model, so you're deploying your model out with confidence, um, and then maybe doing some A/B testing, which is testing different versions of the model, um, and then monitoring that feedback, uh, and the outputs that are coming out of the different versions of the model. So how do I do A/B testing with machine learning models? Basically, with A/B testing you'll have a kind of scoring endpoint, basically the way that you call the model, and you'll have a bunch of different versions of the model behind that same scoring endpoint. So when you call it, you'll, you know, configure the traffic of data going to each of those versions, so then you can ramp up and down, um, according to the performance of those models. So the model behaves like any other microservice. Okay.
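Here's a minimal sketch of that traffic-splitting idea: several model versions behind one scoring endpoint, with configurable weights you ramp up and down. Real platforms do this in the serving layer; the scoring functions and weights here are stand-ins.

```python
import random

MODEL_VERSIONS = {
    "v1": lambda features: 0,   # stand-ins for real scoring functions
    "v2": lambda features: 1,
}
TRAFFIC = {"v1": 0.9, "v2": 0.1}   # ramp v2 up as it proves itself

def score(features):
    version = random.choices(list(TRAFFIC),
                             weights=list(TRAFFIC.values()))[0]
    prediction = MODEL_VERSIONS[version](features)
    log_for_comparison(version, features, prediction)  # feeds the A/B analysis
    return prediction

def log_for_comparison(version, features, prediction):
    print(f"{version}: {features} -> {prediction}")

score([0.2, 0.5, 0.1])
```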

00:08:22

Exactly. And then you want to bring in that data that you're getting and push it back into your central data store, so then you have fresh, new, current data that you can train your models on. Can you, uh, describe what data drift is? Yes. So basically what happens is, for models that are out in production, uh, the data that's coming into the model can be changing over time. That'll pretty much change the performance of the model, because it's getting these unexpected inputs. So you need to be able to monitor those pieces, understand the performance degrading immediately, and be able to kick off, um, these retraining pipelines as you see the model performance decaying over time with the new data.

00:09:03

So, yeah, say you've got a model running on an offshore oil rig detecting if, you know, a valve for an air compressor is about to blow. The temperature goes from spring to summer; it's now much hotter. The sensor values are higher. Uh, you're going to get a lot of false positives saying there's an incident. So, uh, those types of signals are what looking at drift on the model is all about.

00:09:23

You want to make sure it's, again, staying relevant in that context as time passes and the world changes around us. Yeah.
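One common way to detect the kind of input drift in that oil-rig example is to compare the distribution of incoming sensor values against the training data with a two-sample Kolmogorov-Smirnov test. This is a generic sketch, not the speakers' monitoring system, and the p-value threshold is an assumption to tune.

```python
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(train_values, live_values, p_threshold=0.01):
    """True if the live data looks statistically unlike the training data."""
    _stat, p_value = ks_2samp(train_values, live_values)
    return p_value < p_threshold

rng = np.random.default_rng(0)
train = rng.normal(20, 5, size=10_000)   # spring-time sensor temperatures
summer = rng.normal(30, 5, size=1_000)   # hotter summer readings

if feature_drifted(train, summer):
    print("drift detected -> kick off the retraining pipeline")
```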

00:09:30

Oh, cool. Uh, so I'm just going to talk about how Microsoft at a macro level tries to address the issues in the MLOps space, and then run through a few different customer examples to give you a flavor for what MLOps looks like in the wild. Uh, so Shivani touched on, you know, sort of the key problems of MLOps. How do you, uh, reproduce models when data scientists are largely researchers coding on their laptops? How do you reproduce predictions? Uh, so say a bank is running a model which is, uh, determining if they're going to approve or deny a person a loan. Uh, they need to be able to demonstrate exactly why the model gave that prediction of approved or rejected. Uh, so you need to be able to trace back exactly which data, which features were used to train this model in the first place, and prove that things like, uh, you know, fairness testing and, uh, bias testing were done, especially when you're dealing with these highly regulated industries.

00:10:25

Um, then when it comes to operationalization and automation of the ML life cycle, again, that's about more than just the data scientist or the software engineer or the data engineer in isolation. It's about how all three of them work together in a collaborative fashion. Um, you know, oftentimes the data you have when the data scientist trains the model isn't available when you're trying to make predictions. How do you ensure you can have that same data, quality of data, availability of the data, and also that the model is performant? Say you're trying to have, you know, a real-time system detecting if there is a giant spill on a factory floor; that model can't take like 10, 15 minutes to run, uh, in a real-time context. So how do you operationalize and automate, uh, three different personas, oftentimes with different tools and technologies, together? And the answer is hopefully with DevOps. Uh, then there's also, how do you do collaboration, uh, within and across teams on your ML workflow?

00:11:21

So as Shivani mentioned earlier, you can't share models around like you can share normal software packages. You need to actually share a pipeline along with a model that can be used to, uh, reproduce it and tune it based on, uh, the data specific to that scenario. Uh, so, you know, one example: say I have, uh, a model that's detecting anomalies in video feeds, like the workplace safety one I mentioned earlier. Um, that model might be trained for a factory in Beijing, but, uh, if you want to use that same model for a factory in, say, the U.S., you need to customize and tune that model based on footage of what, uh, the video feeds in that factory in the U.S. are going to look like. Uh, then there's, you know, how do you collaborate and share? Uh, another thing we've seen that's really common with large enterprises in particular is each of their organizations has independent data science teams running projects and experiments right now.

00:12:13

And lots of them are working on very similar workflows, like 80, 90% the same. So how do you collaborate and share on, hey, I'm using these datasets to train these types of models and solve these types of problems? Uh, then when it comes to enterprise readiness, uh, I mentioned briefly highly regulated industries, uh, areas around how do you do governance when dealing with data and machine learning, especially when you're talking about, uh, deep learning and models that are really hard to understand, because they're so convoluted and deep that even people with PhDs don't know what they're doing. Uh, and then how do you do compliance and infrastructure as code, uh, when you start to deal with, uh, specialized types of hardware? Uh, then cost management comes into play if you're using, uh, you know, uh, compute with large amounts of GPU, large amounts of memory, jobs that take a really long time to run.

00:13:04

Uh, it's not like a normal software build where it takes a few minutes. Some of these jobs can take days or even weeks to run, potentially, to get a good model. So how do you trust that? So, uh, from a Microsoft and Azure point of view, uh, we sort of have this recommended flow, which involves using three different technologies together: uh, Data Factory, uh, Machine Learning, and DevOps services. Each has pipelines that are optimized for the different personas that are involved in this flow. Uh, so you have Azure Data Factory for your data engineers, Azure Machine Learning for your data scientists, and Azure DevOps for your DevOps professionals. And those are just sort of the three personas, hopefully working happily together, uh, to get the end-to-end flow going. So as far as, uh, you know, what does Azure, or any platform, need to be able to effectively manage MLOps flows? There's, uh, keeping track of your infrastructure, your code, your datasets; uh, data versioning is super important.

00:14:02

You can't dump all your data in a Git repository, it turns out, and you can't really diff large binaries in Git yet. So you need ways to, um, track metadata, profile, hash, and compare changes in your data over time, to determine, uh, you know: has my data changed enough to warrant training a new model? Has my data changed too much, to the point where now my training pipeline is going to be useless because half the columns or features have dropped out of the tabular data? Uh, and tracking your environments. So say I train a model on my laptop. I want to know the exact state of the world when I trained it there. What were all the Python packages? Was I using Docker? If so, what Docker image was I running? And how do you easily shift that from the training to the deployment side of the house? Uh, and tracking all my runs and experiments.
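A simple sketch of that profile-hash-and-compare-your-data-over-time idea, since you can't meaningfully git-diff large binaries; the CSV file names here are placeholders.

```python
import hashlib
import pandas as pd

def profile(path: str) -> dict:
    """Fingerprint a dataset snapshot: content hash, schema, summary stats."""
    df = pd.read_csv(path)
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return {
        "hash": digest,
        "columns": sorted(df.columns),
        "rows": len(df),
        "means": df.mean(numeric_only=True).round(3).to_dict(),
    }

old, new = profile("data_2019_09.csv"), profile("data_2019_10.csv")
if old["columns"] != new["columns"]:
    raise RuntimeError("schema changed: the training pipeline needs updating")
if old["hash"] != new["hash"]:
    print("data changed; consider triggering retraining:", new["means"])
```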

00:14:48

Uh, another common thing we'll see from enterprises is: hey, this data scientist left my team, I have no idea what they did, all the work is gone, uh, because, you know, it was just sitting in a Jupyter notebook on their laptop. So, uh, having an easy way to track your runs in the cloud should not impede the agility of your data scientists, but should give you a way to centrally track and understand the work that's going on, the types of, uh, experiments that are being run, the types of models that are being produced. And then of course the models themselves: it's important to be able to share and reuse those, and to integrate those models and events around those models into your end-to-end, uh, ML-infused application life cycle. Uh, then when it comes to making models less of a black box, you need ways to, um, explain how models are behaving.
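On the run-tracking side, centrally recording a run so the work outlives any one laptop can look like this with the v1 Azure ML SDK; the experiment name, metrics, and artifact path are made up for the example.

```python
from azureml.core import Workspace, Experiment

ws = Workspace.from_config()
run = Experiment(ws, "churn-experiments").start_logging()

run.log("learning_rate", 0.01)          # hyperparameters...
run.log("accuracy", 0.91)               # ...and resulting metrics
run.upload_file("outputs/model.pkl", "outputs/model.pkl")  # keep the artifact
run.complete()
```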

00:15:33

Uh, so we have a few different approaches that we try inside of Microsoft, mostly using these, uh, open-source explainers, um, called SHAP and LIME. And what they will do is actually analyze, for a given machine learning model, all the features and the relative importance of those features to the model that's making a prediction. Uh, it's super useful, again, if you're talking about any highly regulated industry, whether it's financial services or healthcare. Uh, then there's profiling the model, determining how long the model takes to actually go and run, and then deploying it to a variety of contexts, whether that's as a real-time API, uh, to an edge device, or as part of a larger data pipeline. Uh, most customers we've seen right now are mostly working on getting their models integrated into data pipelines in batch and running them in more of an offline fashion to help build trust around the model, with an eventual goal of getting those models running in real time, uh, on the edge, closer to the devices, making predictions faster and adding more business value. And then for enterprise life cycle management, there are data pipelines, your training pipelines, uh, release pipelines, and then events to connect everything together end to end.
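A small sketch of the SHAP side of that, using the open-source package on a toy model; the dataset and classifier are stand-ins for whatever model you need to explain.

```python
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Per-feature contributions to each prediction: the kind of evidence a
# regulated industry needs to justify an approve/deny decision.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[:100])
shap.summary_plot(shap_values, X.iloc[:100])
```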

00:16:39

So, uh, we have documented best practices around how to do this, and, uh, there's actually a repository on GitHub you can go and look at that's got this whole flow set up, uh, including ARM templates to go and deploy everything on Azure. But basically you have your data engineers working in tools they're comfortable with, uh, dropping data into, whether it's, you know, Blob storage or a SQL database. You have your data scientists who are working, um, checking code into Git, with built-in experiment tracking, uh, on top of that. When they're ready to say, hey, I think this model is good, uh, all they do is submit a pull request to the master branch of the Git repository. Uh, that will actually run code quality checks, data checks, unit tests, and build and publish, uh, the reproducible training pipeline that's then shared along with the model.

00:17:26

And then from the model registry, you tie into your DevOps, uh, release pipeline, whether that's Azure DevOps or anything else; again, it doesn't really matter. The whole goal is to show how this flow can work end to end and can go from "I have a data scientist doing something in a notebook locally" to "now the code's checked in, there's a reproducible training pipeline published, there's an official model, and I have the full end-to-end lineage to figure out where everything came from." And that flow, again, that GitHub repo at the bottom, uh, shows how you can actually go and do it. Uh, now, uh, again, just another pivot on the same thing, which is: uh, to bring ML workflows to production, you need scalable compute and storage, uh, to train your models; the ability to manage all the assets and collaborate and share on them; to package and validate your workflows before bringing them out; and, uh, to deploy and serve them at scale.

00:18:15

Our general recommendation from a Microsoft point of view is to use containers whenever you can. It's easy to encapsulate all your dependencies; it's, uh, you know, build it once, run it anywhere, whether it's in the cloud on a Kubernetes cluster or, uh, running on that offshore oil rig I mentioned earlier. And then monitor the models when they're deployed and know, um, as Shivani mentioned, when you should go and retrain them. Uh, now, uh, just a few customer stories about how our customers are actually doing MLOps. Uh, so this is a transportation company in Canada, and they're using 16,000 models at scale, one model for each of the bus stations, to, uh, determine bus departure times more accurately. They're called, uh, TransLink; go look them up. But basically, uh, they use DevOps and Azure Machine Learning together, uh, to submit a request to pull all the data, train the model, and do the scoring in batch.

00:19:06

And based on the results of that, they'll tune their calculations around, you know, estimated delays and how long it's going to take for the buses to arrive or depart. Uh, just an example, for more of your classical government industries, of how you can just take a pipeline, train it for one bus station, and then scale it out, again, to 16,000. So it's: how do you take that flow and bring it up and out and really help to do production ML? Uh, we've seen the same thing with customers who will, you know, train a model for one retail store, then take that same training pipeline and apply it against all the data for all their stores in the country or around the world. Uh, again, this is how you get that value. Uh, so, uh, another customer here in the retail sector is using MLOps to ship recommender systems.
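The scale-out pattern behind that story is one parameterized training pipeline submitted once per station (or store). Here's a hedged sketch with the v1 Azure ML SDK; the pipeline ID, parameter name, and load_station_ids helper are hypothetical.

```python
from azureml.core import Workspace
from azureml.pipeline.core import PublishedPipeline

ws = Workspace.from_config()
pipeline = PublishedPipeline.get(ws, id="<published-pipeline-id>")

def load_station_ids():
    # hypothetical helper: in practice these come from the data catalog
    return [f"station-{i:05d}" for i in range(1, 16001)]

# One pipeline definition, one run per bus station
for station_id in load_station_ids():
    pipeline.submit(ws, experiment_name="departure-time-models",
                    pipeline_parameters={"station_id": station_id})
```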

00:19:52

So in this case, uh, again, they use Azure Machine Learning and DevOps together to customize, train, and publish models. And in this case, they actually deploy the model as a real-time API on Kubernetes. And they also use the model inside of a Spark pipeline to generate these, uh, sort of static recommendations on products, which they push into a Cosmos DB. And then those are both served from, whether it's the website, the mobile app, whatever. Uh, but the whole point is, it's the combination of the machine learning services and DevOps services that allows them to actually do this at production scale. And of course, mostly because I ran out of space on the slide, uh, the data engineer is an important part of that too, making sure the Spark pipelines are working properly, uh, making sure that all that data at the bottom, inside of the data lake, is available and clean.

00:20:38

So all, all super important roles. Uh, and then another example here is MLOps for running predictive maintenance at the edge. So again, taking machine learning and DevOps together: in this case, they have the maintenance models deployed and running on IoT Edge devices, taking all the sensor data in as it comes, determining if they want to send an operator to the offshore oil rig to go and take a look and see what's going on. And that same feedback collection process goes in: the operator can say, like, oh, that was a good prediction or a bad prediction. Those signals are fed back in, uh, to help improve the model over time. And, uh, I'll let Shivani recap and close it out.

00:21:17

A quick recap. So, I mean, the key takeaway here is that it's not just about using a bunch of technologies together. It's bringing this workflow of three different personas together to collaborate, um, and scale out. So essentially it's a life cycle that we want to introduce with MLOps. So, um, it's not impeding on certain workflows, but bringing all these workflows together to collaborate and, you know, push their data and artifacts and what they're creating into centralized stores that other personas can go in and pick up.

00:21:51

Yeah. And again, if you're looking, for your organizations, at how to do digital transformation, how to bring data science into the fold, the recommendation is to not try to force a ton of existing software engineering practices onto data scientists. Instead, give them tools to make it easier for them to get compute in the cloud, make it seamless for them to track their work, and make it more apparent to them that when they're training these models, someone's actually going to go and use them. Because, uh, one of the most important pieces that we, uh, bring up when talking to customers is, uh, that 88% of models that are trained in the enterprise never actually make it to production. So unless you put these MLOps processes into place, uh, your organizations are just never going to get there. Really cool. And yeah, that's all our content. So thank you.