Las Vegas 2018

Lightning Talk: Code + ML - Will Automation Take Our Jobs

Lightning Talk

Dr. Stephen Magill

CEO, MuseDev

Transcript

00:00:02

So I'm Stephen Magill. I'm gonna be talking about the combination of code and machine learning. So a bit more technical, but I'm gonna keep it high level. I'm gonna go fast and stick to five minutes. So the question is: how is this combination of code and ML gonna change development in the future, and will robots be taking over our development jobs? The background for this is a bunch of work that's happened over the last five or so years in the academic research community, looking at how you can take machine learning concepts and use them in a development or code analysis context. That's led to a bunch of startups focused exclusively on how you can apply learning to code, as well as other companies, like ours, looking at ways to combine learning with other techniques to enable more robust, useful code analysis.

00:00:47

So the question here is: are all these technologies and tools just gonna form this tidal wave of change that eliminates developers, and we all have to go find something else to do? Or are they gonna enable developers in a new way and allow developers to be more effective, more productive, and have more fun? Look how much fun that guy's having, right? So the idea is, if these tools can take care of the tedious aspects of development, then we get to focus on the fun stuff, and that's great. And so I think it's much more the latter than the former, and I'm gonna go into why in a little bit. But first I wanna show an example of just some of the things that are possible using machine learning with code. I'm gonna use this example from natural language processing, 'cause that's a great place to look for techniques. After all, we write code in programming languages, right?

00:01:36

So a really cool result from natural language learning is you can do this thing where you take a neural network and you train it like this: you input words, English-language words, on the left, and then you ask it to predict the context in which those words occurred, that is, what other words were nearby them. For this, you give it a huge corpus of just English-language text: articles, books, et cetera. And what happens is, if you then look at what's happening inside the network, it learns this cool representation of the meaning of these words. It's just a bunch of numbers floating around in here, but if you look at those numbers and you plot them (here I've plotted them in two dimensions), the spatial relationships tell you really cool things. So this is a collection of country names and capital city names. And if you draw a line here, you can see that for the items below the line, those countries are entirely in Europe.
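The training setup described at the start of this passage, where each input word is paired with the words that occurred near it, can be sketched as follows. This is a minimal illustration only: the function name `skipgram_pairs`, the `window` parameter, and the one-sentence corpus are assumptions made for the example, not details from the talk, and no network is actually trained here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token and its neighbors.

    `window` (an assumed parameter) is how many words on each side
    count as "nearby".
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical one-sentence corpus; a real one would be articles, books, etc.
corpus = "moscow is the capital of russia".split()
pairs = skipgram_pairs(corpus, window=1)

for center, context in pairs:
    print(center, "->", context)
```

A network trained to predict the second element of each pair from the first is what ends up holding the learned word representations.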

00:02:28

Those above the line are at least partly in Asia. So already there's some spatial clustering that tells you something about what those words represent, the countries that they represent. And then there's other cool effects that fall out of this. So you have this effect where distance captures similarity of concepts. Russia is closer to China than to Italy, and that's true geographically, it's true geopolitically, and it's true geometrically in this representation of the data that's learned by the network. Even cooler, you can use math to create analogies. So if you take the point representing Russia, you subtract Moscow, its capital, and you add Paris, the capital of France, you get a point that is very close to where France is in this display. So you can use these vector operations to discover relationships in the data. So how does this apply to code?
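The distance and analogy effects described above can be tried out on toy data. In this sketch the 2-D coordinates are made up by hand purely for illustration (real embeddings are learned and have many more dimensions), and the helper names `nearest` and `analogy` are hypothetical.

```python
import math

# Made-up 2-D coordinates, NOT learned embeddings. They are arranged so that
# each country sits roughly "above" its capital, as in the plot described.
vectors = {
    "russia": (3.0, 4.0), "moscow": (3.0, 1.0),
    "china": (4.0, 4.5), "beijing": (4.0, 1.5),
    "france": (1.1, 3.4), "paris": (1.0, 0.5),
    "italy": (0.5, 3.0), "rome": (0.5, 0.0),
}


def nearest(point, exclude=()):
    """Return the word whose vector is closest (Euclidean) to `point`."""
    candidates = (w for w in vectors if w not in exclude)
    return min(candidates, key=lambda w: math.dist(vectors[w], point))


def analogy(a, b, c):
    """Return the word nearest to vector(a) - vector(b) + vector(c)."""
    target = tuple(
        x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])
    )
    # Exclude the inputs so the answer is a genuinely different word.
    return nearest(target, exclude={a, b, c})


# Distance captures similarity: Russia is closer to China than to Italy.
russia_china = math.dist(vectors["russia"], vectors["china"])
russia_italy = math.dist(vectors["russia"], vectors["italy"])
print(russia_china < russia_italy)

# Analogy by arithmetic: Russia - Moscow + Paris lands near France.
result = analogy("russia", "moscow", "paris")
print(result)
```

With learned embeddings the same arithmetic applies, only the vectors come out of the trained network instead of being written by hand.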

00:03:18

Well, you can take these same techniques and apply them to code: replace English-language words with method names, right? And you can use similar approaches to discover that the method count does something very similar to the method getCount, or that if you take equals and add toLower, then you get something that's like equalsIgnoreCase. And so this sort of technology has a lot of applications: program deobfuscation, adding code comments, code completion, code similarity. And then if you go through and look at other machine learning techniques, a lot of them have very cool applications to code. So the typical classification task: take an image and say, is this a cute cat or an ugly cat? And I know the second one's an ugly cat 'cause I asked Google for a picture of an ugly cat and it gave me that <laugh>.
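To swap words for method names, code first has to be turned into (name, context) data. The talk does not say how the context of a method is defined, so the sketch below makes an assumption: a method's context is the set of calls in its body. It uses Python's standard `ast` module on a hypothetical snippet; `SOURCE` and `call_contexts` are names invented for this example.

```python
import ast

# Hypothetical source code to analyze.
SOURCE = '''
def normalize(text):
    return text.strip().lower()

def same_ignoring_case(a, b):
    return normalize(a) == normalize(b)
'''


def call_contexts(source):
    """Map each function name to the sorted names of the calls in its body."""
    contexts = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            names = set()
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call):
                    func = inner.func
                    if isinstance(func, ast.Attribute):
                        names.add(func.attr)  # method call: obj.name(...)
                    elif isinstance(func, ast.Name):
                        names.add(func.id)    # plain call: name(...)
            contexts[node.name] = sorted(names)
    return contexts


contexts = call_contexts(SOURCE)
for name, calls in contexts.items():
    print(name, "->", calls)
```

Pairs like these could then be fed to the same kind of training described earlier for English words, so that methods used in similar ways end up with nearby vectors.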

00:04:03

So <laugh>. So that corresponds to, like, a code smell detection or a vulnerability detection task. Automated translation, the kind that you would do to convert English phrases to German phrases, corresponds to automated porting of code among programming languages. And then there's this image completion task where you take a picture and remove part of it (maybe a telephone pole was in the way) and you ask the neural network to fill in the details. Well, that corresponds to a smarter, more context-aware code completion. And so you can get some really cool code completion tools out of this. And that's just scratching the surface. There's a bunch of other tasks people have looked at: focusing attention during code review, automatically generating glue code, checking API usage, predicting performance problems, and even taking actual English-language descriptions, you know, "search for this string in this buffer," and generating code from those.

00:04:51

But what you'll find in common among all of those is the tools are focusing on the formulaic parts of development, these local, sort of repetitive tasks. And that's actually great, 'cause then developers get to focus on the fun, creative parts: the architecture, the business logic, the security story, and so forth. And so we can reach this point, hopefully, in the future where we can develop enterprise-grade, scalable applications without a lot of the minor annoyances and roadblocks that go along with that. So a bunch of these techniques have open source implementations you can go play with. If you look at that last slide, there are pointers, and if you're interested in this, come find me. I'd love to talk about it. Thank you.