Secrets of Developer Productivity at the Tech Giants

Google, Facebook, and Amazon have invested heavily in developer productivity, with a focus on using automation to flag performance, reliability, and security issues early in development.


They have each published extensively on their productivity experiments, describing lessons learned and best practices for incorporating static analysis and testing into DevOps workflows in a developer-centric manner.


Google has developed a static analysis platform that integrates so seamlessly with the development process that many devs aren’t aware of the platform’s existence until it points out a critical vulnerability.


Facebook has integrated fuzz testing and machine-learning backed fix suggestions into their workflow in a similarly low-friction manner.


And Amazon has integrated advanced formal reasoning tools into CI processes to streamline certification reviews and catch security regressions.


What all of these efforts share is that they allow developers to recognize and fix issues themselves, without the involvement of security or QA teams, enhancing the productivity of both the security/QA and development teams.


This talk will describe these automation efforts, with a focus on open source tools and integration steps that any enterprise can adopt to boost their own developer productivity and enhance software security.


Dr. Stephen Magill

CEO, MuseDev

Transcript

00:00:08

I want to start by saying how excited I am to be here and how thrilled I am for the first ever all-virtual DevOps Enterprise Summit. I miss seeing some of you in the hallways, but I'm very glad to be interacting with everyone via the Q&As, the virtual happy hours, and all the other great networking opportunities the organizers have planned. I'm monitoring the Q&A right now, so please, as I dive into this, send your questions over so we can keep this interactive. First, a bit about myself. I've been doing academic research in software analysis, security, and programming languages for more than 15 years, first as part of my PhD work at Carnegie Mellon, and then at other universities and research labs since then. To give you an idea of the timeline, I still remember the first talks on the research that led to Coverity, which was founded almost 18 years ago.

00:00:54

So I've seen a lot of change since then. I've been involved in this area for a long time, and lately there have been particularly exciting changes happening at the largest tech companies in how they use program analysis tools, not just to find security issues but to enhance developer productivity. That's what I'm going to talk about now: these approaches to enhancing productivity using advanced tooling. This approach influences a lot of what we do at MuseDev, where we work to make this technology more accessible, but it's broader than any one company. Today I want to talk about this general trend and the other places I see it cropping up, including at various companies here in the DevOps enterprise community.

00:01:33

So what is this trend? It starts with a word we all love. I'm imagining the rush of dopamine, the warm smiles as the slide sinks in: continuous. This is an ethos we all believe in, this idea of continuous. But continuous what? Well, continuous everything. And what exactly is so magical about the ideal of continuous? Striving to do things continually has this effect of both improving outcomes and lessening pain. It's this at-first counterintuitive concept that if there's something you're doing that's valuable but painful, you should probably be doing it more frequently, and that frequency will cause it to become less painful, because you'll find ways to automate it and ways to decrease friction. So by aggressively automating a task and making its execution the concern of infrastructure, you free up humans to focus on other things.

00:02:26

And we've seen this happen with continuous integration, where this painful process of combining components, building code, and performing integration testing all becomes automated, and you reach the state where everything is just always working together. It happens with continuous deployment, where standing up production infrastructure, rolling out code updates, and switching over production traffic all becomes automated, and the associated pain and risk in that process decreases significantly. And there's an emerging trend to apply the same spirit of automation to assurance tasks. I see it happening at the largest tech companies, I see it being explored in the government world, and I see it at various enterprises right here in the DevOps enterprise community. This is what I'm going to be talking a lot about in this talk. But first, what do I mean by assurance? Broadly speaking, I mean quality, security, and compliance. One interesting thing about this movement is that for organizations that have started down this path, it's often one of these three that's the driver, but it quickly becomes about all three, because at the end of the day these goals are really very tightly coupled.

00:03:33

So it's this continuous assurance process I'm going to be focusing on as a driver of new productivity gains, and I'm going to explore it by telling three stories: first, two from some large tech companies, Google and Facebook, and then one from the government world. That one is a government-driven effort, but it also has community elements. Then I'm going to talk about what's happening in the broader community, in particular the DevOps enterprise community right here. But let's start with Google and the work they've done to decrease their pain. The pain we're talking about in this case is static analysis, and what could be more painful than static analysis, right? I'm sure we've all had situations where we fought with static analysis tools, and it's true, they can be painful when they're deployed badly.

00:04:20

But what does it mean to deploy a tool badly, or to deploy it well? There's a lot to learn from how Google has explored deploying these tools, so I want to start by telling the story of an experiment they did. They had a new Java static analysis tool that they were bringing in. It had shown a lot of promise and found valuable errors in the test code they'd run it on. So they took this tool and tried to apply it to the code base as a whole. They actually organized a sort of hackathon, a week-long process where they had hundreds of Google engineers come together and try to address all of the findings the tool surfaced. But it didn't go well. Even with hundreds of Google engineers working for a week, they got through less than half of the issues that were reported, and of those that they managed to triage, less than a sixth were actually fixed.

00:05:09

And that's because when you do it this way, when you're going through issues in code that you wrote a long time ago, code that engineers aren't familiar with, the bar is so high for how bad a bug has to be to make it worth digging through that old code and remembering enough about what's going on to fix the error without breaking anything else. Very few issues reach that bar. And even if they had managed to go through all of the code and finish this process, Google has over 2 billion lines of code and 16,000 changes per day, and those are actually old numbers, so I'm sure it's more now. There's no way that a human-driven bug triage process can keep up with that level of change. And this should sound familiar, right?

00:05:50

There's also no way that a batch-mode, human-driven process can keep up with the scale of modern systems development and deployment. And so we don't do this. We don't wait and integrate our systems just once a week, or save building the full system for shortly before release. We use continuous integration; we use continuous deployment. So why should we take that batch-mode approach to static analysis? But it's often what companies do. The situation on this slide probably looks familiar to a lot of people here: you set up a process that has developers on one side, periodically throwing code over a fence to the security team on the other side. That team reviews the code and does testing, applies SAST tools and DAST tools, and sends the results back over to developers.

00:06:39

Developers never asked for these results in the first place. They just want to move and get their code out; they want to implement new features and keep pushing the product forward. So, not surprisingly, this doesn't go well. Developers start to see security as a gating process, a blocker to getting their releases out. And security is frustrated too. Their goal is to get bugs fixed, to attend to the security goals of the company and the product, but they have to fight with developers to get things fixed. It's a costly and inefficient process, and it also just leads to personal conflict; it's not an enjoyable process to be a part of. So it doesn't work well in general, but it really seriously breaks down at Google scale.

00:07:23

So what did Google do to move beyond this process? Well, just as teams rolling out continuous integration realized they needed a platform for predictable, automated builds, something like Jenkins to drive all the different tools involved in a continuous integration process, and just as continuous deployment requires a platform for automated, repeatable deployments, with things like infrastructure as code and configuration as code, when you move your assurance tools into this more continuous process, you need a platform that can orchestrate these tools, run them automatically, and take the security engineer out of that loop. So this is what Google did. They moved to this process of continuous assurance, where you have a static analysis platform that's tightly coupled with and integrated into all the key pieces of the software development process.

00:08:21

So it integrates into the build, it integrates into code review, and it integrates into the code repository service. I have the Google names for these things up on this slide; they of course have their own infrastructure for all of this that they've developed in house. At a typical company outside Google, this might be something like Maven for builds, pull requests as the code review infrastructure, and something like GitHub, GitLab, or Bitbucket as the repository service. So Google built this platform, deployed it, and plugged a variety of different tools into it, and as they evaluated its effectiveness, they found two factors to be particularly important in how they went about this. The first was to integrate tools into code review. This is the best place in the process to raise the results that tools find, because it's the stage in the software development process where developers have just written the code.

00:09:17

So it's fresh in their mind. They're expecting to discuss it with the team and make suggested changes, so they're already in the mindset of being open to feedback on their code, and because the code is fresh in their mind, it's easy to make those changes. If a tool suggests a change at this point, it's really easy to act on. In fact, it's often easier to just fix the issue, even a minor one, than to argue for why you shouldn't take a step to improve the code. The other important thing about Google's approach is that they didn't just integrate one tool; they implemented a platform for hosting multiple tools. In fact, the last time they reported on the Tricorder system, they were up to 146 different tools on the platform. And this is critical, because different tools tend to catch different types of errors.
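
To make the shape of that code review integration concrete, here is a minimal sketch, in Python, of the kind of glue a team outside Google might write: run an analyzer against the files a pull request touches and post each finding back as a review comment. The analyzer command, repository name, and pull request number are illustrative assumptions of mine, and this uses the public GitHub comments API rather than describing Google's internal Tricorder infrastructure.

```python
import json
import subprocess

import requests  # assumes the 'requests' package is installed

GITHUB_API = "https://api.github.com"
REPO = "example-org/example-repo"   # hypothetical repository
PR_NUMBER = 42                      # hypothetical pull request
TOKEN = "ghp_..."                   # supplied by CI in practice


def changed_files(base: str, head: str) -> list[str]:
    """List files touched by the pull request, via plain git."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...{head}"],
        capture_output=True, text=True, check=True,
    )
    return [f for f in out.stdout.splitlines() if f.endswith(".java")]


def run_analyzer(files: list[str]) -> list[dict]:
    """Run a hypothetical analyzer that emits JSON findings per file."""
    findings = []
    for path in files:
        result = subprocess.run(
            ["my-analyzer", "--format=json", path],  # placeholder tool name
            capture_output=True, text=True,
        )
        if result.stdout:
            findings.extend(json.loads(result.stdout))
    return findings


def post_review_comment(finding: dict) -> None:
    """Surface one finding as a comment on the pull request."""
    body = f"[analysis] {finding['file']}:{finding['line']}: {finding['message']}"
    requests.post(
        f"{GITHUB_API}/repos/{REPO}/issues/{PR_NUMBER}/comments",
        headers={"Authorization": f"token {TOKEN}"},
        json={"body": body},
        timeout=30,
    ).raise_for_status()


if __name__ == "__main__":
    for finding in run_analyzer(changed_files("origin/main", "HEAD")):
        post_review_comment(finding)
```

The point of the sketch is the placement, not the plumbing: findings land in the review conversation while the author is already looking at the diff, rather than in a separate dashboard.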

00:10:01

And if you think about those 146 tools, a lot of them are very application- or domain-specific. They're looking for particular coding patterns, or usage of particular frameworks, for a particular type of task. So by providing the platform, they let teams bring in the analyzers that matter for their code base, analyzers that may not be generally applicable, and run them alongside the general analysis tools, all as part of the same assurance process. If you want that sort of broad coverage, if you want teams to be able to customize things for themselves, you really do need a platform that lets you run multiple analysis tools. So Google invested a lot of engineering effort into creating this, standing it up, and experimenting with it. What did they find? Well, they got it running across the entire Google code base.

00:10:54

They analyze 50,000 changes per day, and for the errors reported across those changes, they have a 95% fix rate. Compare this with the results from the previous experiment, where they were doing that batch-mode process: about 16% of issues got fixed, and less than half of the errors were even examined. From 16% to 95%: that's the difference that integrating into the right step in the process makes. So integration is key, and in particular this mechanism of integration, hooking into the pull request process, lets you do essentially small-batch triage. Just as DevOps, by pushing deployments more frequently, lets you implement a process of small-batch code changes, where you keep changes small and can iterate more quickly, code review integration for analysis tools lets you deal with issues in the moment and in a distributed manner.

00:11:51

So individual developers triage their own issues as they come up, which really unlocks a new level of scale and agility. Google saw productivity gains primarily from this more optimized workflow: getting humans out of the loop and being able to run it at higher scale. But there was another interesting source of productivity gains. Because they made the platform extensible, developers can plug new tools in, and the platform also includes tools that can be customized, so developers can easily write new bug patterns they want to search for. They found that developers frequently used those mechanisms to add checks that were focused on productivity; a sketch of what such a check might look like follows. And we see this happening a lot: a company might bring in tools or start to set up this process primarily to attend to security outcomes.
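
As a rough illustration of one of those developer-written checks (my own minimal sketch, not a check from Google's platform), here is a small Python script that flags a project-specific pattern, say calls to a deprecated internal helper, in the files touched by a change. The helper name and suggested replacement are hypothetical.

```python
import re
import subprocess
import sys

# Hypothetical project-specific rule: flag calls to a deprecated helper
# that the team wants migrated to a newer API.
DEPRECATED_CALL = re.compile(r"\blegacy_retry\s*\(")
SUGGESTION = "legacy_retry() is deprecated; prefer retry_with_backoff()"


def changed_files() -> list[str]:
    """Limit the check to files touched by the current change."""
    out = subprocess.run(
        ["git", "diff", "--name-only", "origin/main...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return [f for f in out.stdout.splitlines() if f.endswith(".py")]


def check_file(path: str) -> list[str]:
    """Return one finding per line matching the deprecated pattern."""
    findings = []
    with open(path, encoding="utf-8") as src:
        for lineno, line in enumerate(src, start=1):
            if DEPRECATED_CALL.search(line):
                findings.append(f"{path}:{lineno}: {SUGGESTION}")
    return findings


if __name__ == "__main__":
    all_findings = [f for path in changed_files() for f in check_file(path)]
    for finding in all_findings:
        print(finding)
    # A non-zero exit lets the surrounding platform surface these in review.
    sys.exit(1 if all_findings else 0)
```

Checks like this are trivial individually; the value comes from the platform running them automatically on every change alongside the heavier general-purpose analyzers.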

00:12:41

But then developers quickly see the productivity advantages you can have here, the sorts of bugs you can find early so that you don't have to go through a more expensive debugging process later. So it enhances developer productivity directly, as well as optimizing the workflow. I should say now that, just as moving to a continuous integration process doesn't mean you need to get rid of everyone who understands the build (you still need build experts and people who can improve and tweak it), it's the same with continuous assurance. You can't get rid of your security team, and you don't want to get rid of your security team. But what this lets you do is have them focus on more valuable activities.

00:13:23

Things like pen testing, architecture review, and application risk assessments: these are things you need human experts for. It's a waste of time to have security professionals focused on the code-level issues that tools can easily find. The next story I want to tell is about Facebook. Facebook's original motto, famously, was "move fast and break things." You've probably all heard that they changed it recently to "move fast with stable infrastructure," which doesn't roll off the tongue quite as well, but it is a great acknowledgement that it's important to focus not just on speed, but also on stability and quality. And I think it shows that they believe you can really have both. This makes sense as a goal; these two things are not in conflict.

00:14:13

You can move fast with stable infrastructure and stable code. So Facebook wanted to reach this new level of stability, and in particular, for the story I'm going to tell here, they were looking at what was happening on their mobile platforms and seeing a lot of app crashes and performance issues that were leading to customer disengagement. The thing about mobile apps is you can't just push updates. It's not like your cloud infrastructure, where you can deploy multiple times per day and know it's always up to date. Customers have to opt in and download these updates, and you never know when that will happen, so it's extra important to get it right the first time. At the same time, Facebook has a culture of not putting any roadblocks in the way of change and release processes.

00:15:05

The original motto was "move fast and break things," and they've always been focused on allowing developers to quickly make changes and push new code to production. So what did they do to preserve this speed but still have high assurance? Well, they went through a similar progression to what we saw with Google. They brought in new tools that could detect the sorts of bugs that lead to those mobile app crashes: things like null pointer exceptions, resource leaks, and thread safety issues (see the small example below). First they tried deploying them in a batch mode, like we saw in the Google example, and it led to very similar results: very few of the issues the tools found were acted on. It's just too hard to go back and fix old code in a batch-mode process. The analysis tool authors, in the paper they wrote on this experience, described this pitfall as the ROFL assumption, for "report only failure list."
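
To make one of those bug classes concrete, here is a tiny illustrative example, written in Python for readability rather than the mobile code Facebook was actually analyzing, of the kind of resource leak a tool in this space would flag, along with the obvious fix.

```python
import json


def load_config_leaky(path: str) -> dict:
    # Leaky version: if json.loads raises, the file handle is never
    # closed. An analyzer that tracks resource lifetimes would flag
    # this path.
    f = open(path)
    data = json.loads(f.read())
    f.close()
    return data


def load_config(path: str) -> dict:
    # Fixed version: the 'with' block guarantees the handle is closed
    # on every path, including when json.loads raises.
    with open(path) as f:
        return json.loads(f.read())
```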

00:15:58

This is the assumption that a lot of tool authors, and a lot of people who bring in these analysis tools, just sort of make: that all a tool has to do to provide value is return a list of failures, and as long as those failures are mostly legitimate, so it has a low false positive rate, the tool will bring value. But it doesn't work that way. They essentially reached the same conclusion Google did: it really matters how you integrate into the process. If you have tools reporting separately to a bug dashboard or some separate workflow, that doesn't work. You really need to be interfacing directly with the developer, while the code is fresh in their mind. So they did that. They moved to this other integration process, working directly with developers as part of code review.

00:16:44

They did this with two different tools. The first is the one focused on reliability and app crashes. It has since been applied to the rest of the code base, and the fix rate went from almost nothing up to 70% following that integration. That's another example of the difference in effectiveness and value that having the right integration makes, and it's for the same tool: nothing changed except how it was integrated. They've since deployed another tool, focused more on the security space, and a lot of the security findings from that tool also go directly to developers. I think that's a great testament to the importance of connecting with developers in the right way and the effect that can have on these tools' value.

00:17:30

Those are both stories from large, forward-leaning tech companies, and I want to tell a story now from a very different organization, targeting a very different problem space. NIST, the National Institute of Standards and Technology, is a government organization responsible for publishing certification requirements for a variety of different things, but in particular for implementations of cryptography that will be used by the government, and they've been quietly and effectively transforming how this certification process is done. I think it provides a really nice model for how other certification-heavy processes could be transformed by automation and DevOps. And I know we have people in this community who are in highly regulated industries, who have their own internal certification and audit processes that are really ripe for automation and streamlining. So what did NIST do? First of all, the particular certification we're talking about here is called FIPS 140-2. It's a lengthy external review process that all cryptography used in government services has to go through.

00:18:35

It's managed by a third-party testing lab: you ship your code off to the lab, they run a bunch of tests, there's a back and forth and lots of reports written, and it's a very slow process that really gets in the way of development velocity. It can even get in the way of pushing out important security updates. There are stories of applications and libraries where security vulnerabilities were found, but it took time to get the patch out because it had to be re-certified. NIST realized this was a problem and that this slow process was increasingly out of step with modern development practices, so they put together an industry working group to really transform how they do certification of crypto algorithms. The new process focuses on integrating an automated testing process into the DevOps pipeline.

00:19:22

They start from the assumption that you have a pipeline, a process that's already using testing tools, analysis tools, and so forth to certify that you're reaching your assurance goals. What they then do is collect evidence off to the side from that process. They have a mechanism where you can send results from these tests directly to NIST, and they check and monitor those results, which serves as evidence that the testing process they expected was executed on this code. It enables a sort of self-certification workflow, where teams can have continually certified versions of their software. So it really shifts from certifying individual releases to certification at the process level: you certify the process that produces new versions of the code, and then anything produced by that process that passes the automated checks and comes with the required evidence inherits the certification.
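
As a rough sketch of what that "evidence off to the side" could look like in an ordinary pipeline (my own illustration, not NIST's actual submission protocol or schema), a build step might bundle the test results with enough metadata to tie them to a specific commit, hash the bundle, and submit it to an evidence endpoint. The endpoint URL and report filename here are hypothetical.

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone

import requests  # assumes the 'requests' package is installed

# Hypothetical evidence store; in practice this would be authenticated
# and the payload would likely be signed.
EVIDENCE_ENDPOINT = "https://evidence.example.internal/api/v1/submissions"


def current_commit() -> str:
    """Identify exactly which revision the evidence describes."""
    out = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    )
    return out.stdout.strip()


def build_evidence(test_report_path: str) -> dict:
    """Package test results plus metadata tying them to this build."""
    with open(test_report_path, "rb") as f:
        report_bytes = f.read()
    return {
        "commit": current_commit(),
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "report_sha256": hashlib.sha256(report_bytes).hexdigest(),
        "report": json.loads(report_bytes),
    }


def submit(evidence: dict) -> None:
    """Send the evidence bundle to the evidence store."""
    resp = requests.post(EVIDENCE_ENDPOINT, json=evidence, timeout=30)
    resp.raise_for_status()


if __name__ == "__main__":
    submit(build_evidence("crypto-test-results.json"))
```

The pipeline itself stays flexible; the only hard requirement is that each run emits this kind of verifiable record showing the expected checks actually ran against the code being released.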

00:20:24

It was great to hear about this shift, because it's something that I and a number of other members of this community recommended last year in a white paper published by IT Revolution. We had this paper on automated governance; it was a reference architecture for how you might set up an automated governance process, and it's really all about automating your software governance structure: automating compliance and achieving a continuous assurance posture. We showed how you might hook into the checks you're already running in your pipeline to collect the evidence you need to certify that the key checks, the ones important to you and to your auditors, are running. Since then, several of the authors have been pursuing this philosophy at their own organizations, and in particular I know John Rez at PNC Bank has been working on integrating evidence collection into their pipeline.

00:21:17

The advantage of this is that it lets you have a flexible pipeline process, which is important: teams sometimes need to customize their pipelines to bring in certain frameworks or work with certain languages. So you want to provide that flexibility, but you also want to be sure that the checks that matter to you are still running. You can achieve both of those goals by having this sort of parallel certification process sitting alongside the pipeline. At PNC Bank, they're working on leveraging this evidence collection to switch from a process of sending individual changes through the change advisory board to get individual releases certified, and move to a process where the change advisory board looks at the process as a whole, together with the evidence collection they have in place, and then sets up a certification workflow that says: if you have a certified process, any changes that go through that process inherit the approval.

00:22:14

That sounds a lot like what happened with the NIST certification process, right? It's really cool to see this approach working out and providing value in different contexts, and it's great to see the general benefits of this continuous assurance approach. We've seen it work at the largest tech companies, and we've seen examples from government and from industry in the DevOps community as well. Like I said, different organizations tend to come at it from different entry points, but it's the same underlying philosophy: moving these checks, these certifications, these pipeline elements focused on quality and security, into a continuous posture. I think we're going to see more and more of this.

00:23:01

I want to close with an observation from Nicole Forsgren and her team's excellent work on the State of DevOps Report this year. New this year was a section where they dug into what automation practices various organizations have in place. They found, as you might expect and very encouragingly, great progress when it comes to automation of builds, deployments, production monitoring, and testing in general. But there's a clear area for growth when you dig into the results, in particular in the row I've highlighted here: automation of security testing. I'm hoping that next year we'll see these numbers increase, because this is really what I've been talking about here: the approaches different organizations have taken to automating, and making more continuous, their security and other certification and assurance processes.

00:23:53

So I'm hoping next year we'll see these numbers increase, and that we'll hear more and more about continuous assurance and automation of these compliance workflows, because it really is a key opportunity for improvements in productivity. The largest tech companies, Google and Facebook, have shown that it can work at scale, and now we just need to go through the cultural transformations necessary to embrace it more broadly. Because just like successful CI/CD and DevOps transformations in general, there's a technology component, but these efforts are really rooted in transformations of culture. It's going to be the same way with continuous assurance: it's about shifting the processes involved, shifting various humans' roles in those processes, and so forth. So see this as a call to action and dialogue. I want to hear your stories, I want to hear about what you're doing, and I want to leverage this community, this scenius as Gene so rightly puts it, to help everyone make progress in this area. So thank you. I'll be around monitoring Slack; please send me your questions, your stories, your anecdotes. I want to hear about what everyone's doing in this space and what everyone thinks would be valuable.

00:25:07

So please, in the Q&A, if you have any questions or thoughts, send them over before I leave. Bye.