Las Vegas 2019

DevOps Patterns & Antipatterns for Continuous Software Updates

So, you want to update the software for your user, be it the nodes in your K8s cluster, a browser on a user's desktop, an app on a user's smartphone or even a user's car. What can possibly go wrong?

In this talk, we'll analyze real-world software update fails and how multiple DevOps patterns that fit a variety of scenarios could have saved the developers. Manually making sure that everything works before sending an update, and expecting the user to do acceptance tests before they update, is most definitely not on the list of such patterns.

Stephen Chin

Senior Director, JFrog

Transcript

00:00:02

So, um, um, someone in the audience called this out, I have a small typo on my slide. It should say, um, Steve on Java rather than the missing J, so apologies on the, on the typo on the first slide, it gets better from here. Um, so I run the developer relations team at JFrog. Um, so the plan for today is we're going to chat a little bit about, um, the reasons why you want to do continuous updates and update your software. Um, some anti-patterns for things which can go wrong. I think, you know, anti-patterns are always the fun part. You look at other people's mistakes and you're like, I'm, I'm not that guy, I can do better. But then some practical advice on what you guys can do with your process to, to avoid some of these mistakes, and common practices, um, that'll save you from, um, hitting issues with your updates and affecting your customer base.

00:00:53

So, um, first question we all want to ask ourselves is why, why do we care about software updates? Um, so I hear a lot of good people in the, in the audience, shouting out questions. So, you know, security, what, features? Okay. So that, that's my number one reason, because we have these annoying folks, users, and of course, what do they want? Right. Features. And, um, when, when did, when do you think they want the features? Yeah. Now, now see, you guys all have customers. I can see the audience. This is not a, a new problem for you guys. And this, it wasn't always like this. Right? So, so what, no customers for you?

00:01:40

Yes. Yes. You're very right. So this, it wasn't always like this, right? So it used to be like, you got a feature phone and when you want to update your device, you brought it back to the store. Cause you, maybe you wanted like, like Snake on your phone, which is clearly why Nokia won the smartphone war or the feature phone war. And, um, you took it back to the store and you got a new phone. Um, maybe eventually you could update your software by cable. Although initially, if you remember, on the early, um, feature phones, all the cable was good for was transferring contacts. You couldn't even update anything or connect it to the internet. Um, then this iPhone came around, right? This was the big game changer, but you had to actually, you know, get prompted for updates. So you had to make a choice about updates, and hopefully we're in a, in a world now where your phone is constantly being updated.

00:02:31

You don't even think about it, the, the apps on your phone, um, what you're using for software, all of this stuff is continually getting pushed down as updates. And this is what users expect. Now, this is the expectation on any software application, is that you're going to continually get updates. Okay. So now the problem here is all of these updates. Um, this is the new, this is the new, um, the new oil spill: security vulnerabilities, which can destroy your customers and destroy your business. And this is actually something, someone else. This is kind of the second big reason why you want to do updates, is because you have to patch security vulnerabilities, otherwise you're impacting your users. Um, okay. So we have features, we have updates. We all know that we want to do updates, but the question is, how long does it take us to actually get updates? And a good model for looking at this is, um, you guys all drive cars.

00:03:26

All right. So this is the braking distance that you need, um, to, to stop your car in the case of an accident or, um, to avoid a collision, right? So there's kind of two different things going on here. One is your thinking time, that there's a, an obstacle, I need to stop the car. The second one is the braking distance. Now the, um, I've pressed the brake pedal. Now the car needs some time to actually mechanically stop. And these are two separate actions. One of them is a, a thought process. So it happens pretty much linearly as you, um, as you have to stop at shorter distances. Um, the other one is a mechanical process. So the faster you're going, the harder it is to stop a big hunk of iron. And, um, when you add these two up, this is the actual stopping distance that you need to stop your car.

00:04:16

And you can think of, um, fixing software as a very similar process. So first you identify that you have a, um, a reason why you need to update, then you fix the issue. And finally you deploy that, so your end users can actually take advantage of this, this new fix. And there's a, there's a bunch of examples, um, of this, where this has not been a very quick process. So we're getting into the anti-pattern process of this. Um, so one example is, um, back in 2017, there was actually a shutdown of a UK hospital due to ransomware attacks. So this is, this is horrible stuff, right? So they have x-ray machines, they have dialysis machines. They have patients who depend upon healthcare and, um, their, um, their machines were hacked. Um, they had to shut down the hospital, they had to patch this and it took them a very long time to, um, obviously they identified it pretty quickly, but then actually doing the OS upgrade and then deploying this took a very long time where the hospital was not operational. And, um, the problem with this case was that they were running an outdated version of Windows. They were running Windows XP. So clearly, clearly they weren't doing a good job staying up to date on their updates. Um, that probably was the first set of issues. Obviously the fix is easy, upgrade to a modern version of Windows, which has a bunch of security patches, but it's not that easy when you haven't done updates for decades.

00:05:51

Okay. So, so another example, um, probably all of us remember Equifax and probably were impacted in some way, right? When they lost all of this personal data, um, it was worse than just a credit card leak because attackers knew your social security number. Um, they could open up new accounts in your name. And actually there was so much data put out in the market on personal information that social security numbers went down to being worth only $10 apiece. So it actually pushed down the price of illegal information because so much of it was available. So the case here was it took them a while to even identify that they had a vulnerability. So hackers had access to the security vulnerability for a couple of months. Um, it turns out that they were running an outdated version of Struts. So they had to do an upgrade of this.

00:06:37

Fortunately, it was already fixed in the current version. So it was just an upgrade to an existing library version, but they weren't using a continual update or continuous deployment process. So it took them two months to just get the update, um, out to the market. So again, you know, huge, um, security vulnerability, which was open for several months to identify and then to fix it. And then, so another example, Spectre and Meltdown. So this hit in January, 2018. So Meltdown is the easier of the two because it simply requires a few JavaScript lines of code to hack, but at the same time, it can be fixed systematically. Um, Spectre is a bigger problem because it relies upon, um, branch execution strategies inside the processor. So basically what you're doing is you're trying to get programs using predictive branch execution to do things which then give you information about what's happening in memory and other processes.

00:07:31

Not only can it be another process running in the same machine, you can also target a virtual machine running inside of it. So it doesn't isolate you between containers. Um, so this, this is really, really bad stuff. And in a lot of cases, since there is not simply a fix the hardware manufacturers can put out for it, because it's inherent in the design, you need to do this in software. So you need to patch your software to account for Spectre. A bunch of libraries are updated specifically for this. There are constantly new attacks and new ways of exploiting, which need to be addressed. And then you get in the situation where you need to update your software very, very fast if there's a new Spectre exploit or a new way of attacking your software which isn't accounted for. So you need to identify this as fast as possible, fix it as fast as possible, and then deploy it as fast as possible. Right? So this, this is, this is the reason why you need continual updates.

00:08:25

Um, we'll, we'll stand by for just a sec. So, um, now I, hopefully I've convinced you guys you want to update your software faster and, um, you know, as you guys know, I'm a Java hacker and the Java guys actually, um, recently changed the release model, right? So, so Mark Reinhold announced back in 2017 that they were going to move to a release model where Java, rather than being shipped every five years or, more practically, every seven or eight years, um, they were going to release it every six months. So, um, this is, this is great stuff, right? So, so the, the core platform that a lot of our enterprise software is built in is actually going to be updated on a more regular cadence. They're going to get more features out more quickly. They're going to, um, help our developers with patching security vulnerabilities more reliably.

00:09:17

And, um, so there was a, there's a study which actually, um, looks at the usage of different Java libraries and how well they're adopted. Um, so this is the State of Developer Ecosystem report in 2019, 7,000 developers. So a very credible poll and, um, yeah, we're not, we're not doing so good on, on adoption past Java 8. So Java 8 was the version before the announcement was made on the release cadence. Java 9, 10, 11, 12, we're up to 13 now, are the, um, the more recent releases. And so we, we have a problem here, right? What, what happens?

00:09:57

Okay. So to, to understand kind of where we are, um, this, this is a graph which shows you the thought process people go through when they want to do updates and an update's available. Do we want it? So if we don't, if we don't want the update, if it doesn't have features we want, if there's no security issues, maybe we don't even care about the update, but hopefully we want it. And then we ask ourselves, are there any high risks with this? Um, if there is no risk at all, we're probably willing to update it because it's easy, but if there are risks, then you ask yourself, do you trust the update? Because potentially this could cause issues for me. It could cause downtime, it might need to be retested. And, um, why, so when you ask yourself, why do I want to update? Do we trust the updates? Really the, the answer to this, um, is best shown in a comic.

00:10:52

So really the problem is we don't, we don't trust the process, right? So if we trusted the process, we might be willing to go along with updates, but typically we don't trust large companies to QA the software. Um, we know there's going to be issues. We have a track record in our industry of, um, releasing software, which has bugs in it. And this is a complexity problem. If you look at the complexity of software, the complexity of software keeps increasing over time. So, um, you know, we start out with agile processes. So you're releasing software, you know, faster, um, continuous integration. Now we have builds running. We're continuously, um, updating code we're continuously delivering code. Next level is you have infrastructure as code. So now everything inside of your organization is treated as code to deploy to servers. Then you have microservices and serverless and smaller, smaller bits of code, um, containers, runtimes like Docker and Kubernetes.

00:11:53

Then finally in the IoT world, everything, everything needs to be updated. So as the number, as the complexity of the system goes up, it's harder and harder to, um, to actually determine that you're not going to hit any software bugs when you update. And the other aspect of this is data. So the amount of data in the world is increasing exponentially. This is some data from Seagate. So arguably they, they, they sell storage. They, um, they predict high, but this isn't too far off. Right. So we already, we already know in 2017 that we have, um, over 20 zettabytes. Does anyone know what a zettabyte is? A lot? Yeah. Okay. So the answer is a lot of zeros. If you know what a petabyte is, a thousand petabytes is an exabyte. A thousand exabytes is a, um, a zettabyte. So it's a lot, it's a lot of storage.

00:12:43

And, um, the prediction is by 2025, we're going to get to 175 zettabytes. So that's, that's a lot of data. And if you do an update, you need to make sure it works with a lot of data. The question is, how do you test with that much data? Well, the answer is probably you don't, because it's, it's simply impossible to exactly mirror what's happening in production with large data sets inside of a QA region, inside of a test region. So, um, one example of this is, um, some people get these letters, um, these unsolicited letters from China, and then inside of the letter, you get like a red sock or a little bit of, a little bit of red tape or a black cloth or a little ring. Um, and they're just sent out to random people. There's actually a thread on Facebook about empty envelopes from China.

00:13:31

So why, why are people getting these like random envelopes? Um, is it, is it some sort of like, like government plot, maybe like China's trying to spam the US or destroy the, um, the federal, um, um, um, shipping system. So the, the answer is this, this is quality control. So in a large, in a large system, in a large shipping system where you have to actually verify end to end, if you want to do an end-to-end test, the end-to-end test is shipping something to somebody. So in China, um, like Alibaba and these large companies, they actually test their end-to-end verification by shipping out random packages sometimes to make sure that your package is going to get to the final destination. And so this, this is the challenge in extremely large systems. It's hard to check that you actually are not introducing any problems.

00:14:21

Okay. So getting back to how do we update? So, um, we go back to, do we trust the update? If we trust it, yes, we'll update. If we don't trust it, we have another option. Can we verify the update? So can we verify that this is something which won't break production? And if the answer is yes, then we might trust it, but probably increasingly the amount of time needed to verify the update is going to be very long. So it's very long and labor intensive to verify the update and make sure it doesn't introduce any problems. And of course, if we can't verify it, then we're, we're back to no, which is not great. So this, this is a problem, right? So we, we have a lot of food on our table. We have a lot of features as a user. Do we need more features at the risk of the update?
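Editor's sketch: the decision flow the speaker walks through (want it, high risk, trust, verify) can be written down as a small function. This is only an illustration of the diagram as described in the talk; the function and parameter names are hypothetical.

```python
def should_update(wanted: bool, high_risk: bool, trusted: bool, verifiable: bool) -> bool:
    """Mirror the talk's flowchart: each question is only asked if the previous one did not settle it."""
    if not wanted:
        return False       # no features or fixes we care about
    if not high_risk:
        return True        # low risk: updating is the easy choice
    if trusted:
        return True        # risky, but we trust the update
    return verifiable      # last resort: update only if we can verify it ourselves


# A risky update we neither trust nor can verify ends up at "no".
print(should_update(wanted=True, high_risk=True, trusted=False, verifiable=False))
```

The costly branch is the last one: verification works, but as the speaker notes, it is slow and labor intensive.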

00:15:07

And the balance here is, is the feature more valuable for doing the update or is the cost higher for doing the update? So this, this is what we're doing as, as either individual users or as consumers of, um, open source libraries or other packages which we're importing into our own projects, is we make this trade-off, like, how long is it going to take me to test, versus is it a feature I actually need? Um, and this is a problem, right? This means that you're often gonna choose not to update. You might have security vulnerabilities. You might not be exposing features. You might not be taking advantage of the latest libraries, and we need to find a solution to this. So one way of doing this is to look at what the industry does as an example. So, um, we're gonna, we're going to look at some folks in the industry who, who cheat the system and see what they're doing.

00:15:55

How are, how are they getting updates out to their users? Or how are they getting folks to make updates without doing the time-intensive verification? Okay. So there's actually some, some examples of this you guys probably are very familiar with just from using your computers. So your, your browser is one example. Um, who knows which version of the browser you're running on your desktop? Okay. So a couple of, a couple of folks do. You're probably developers. For the rest of us, like, Firefox started incrementing the version number so quickly on your browser, it's actually hard to keep track. Every time you open the browser, it's like you have a new update. Chrome does the same thing. Um, Safari does the same thing. So basically you're, you're probably not really keeping track of the version of your browser, unless you're doing software development and testing specific browser issues.

00:16:46

Um, second one is Twitter, Twitter in your browser. Do you guys even know what version of Twitter? Well, this one's an easy one. You can't, because it's a, it's a software as a service. And, um, you probably don't care as long as it works, as long as it's continually being updated and it works. It's fine. Twitter on your smartphone's a similar story. It has a version. Um, I don't even know how you get access to it because they, they push app updates to you. Um, but you probably don't care about that either. Um, what about your smartphone? I'll ask, who knows what version of their smartphone OS? Okay. So a lot more hands. And there's a reason for this: updating your smartphone OS is risky. When you go for a major iPhone update to the next version, we all know that the first version of the software is buggy, has issues.

00:17:33

Um, my wife was actually complaining to me on a, on a trip recently because the latest iOS update messed up her to-do list. And I checked online and there was a known bug with Apple where they put a new to-do app out, the migration of to-do items didn't work in the first version, getting fixed in a subsequent update, right? So this, this is the problem when you make very, very large, um, not granular changes, then the chance of having a high-risk update's much, much higher. Okay. So what we'd like to happen is we'd like to have small updates that continually get pushed, but then the question is what can possibly go wrong with this model? And there, there's a bunch of things which can go wrong actually. So OnHub, um, which is now owned by Google, is a wifi router that's self-updating.

00:18:22

So this, this is awesome stuff, right? So you have your, just like you have your thermostat, you have your wifi hub, it automatically downloads new software. It updates online. It's a self-improving wifi hub. You get new features pushed to it constantly. What could possibly go wrong? Well, of course, a lot because, um, the wifi hub is how you get access to the internet. Google pushed a, um, an update where they actually broke and reset the settings on the routers. Because the router is your only access to the internet, they couldn't then push an update to fix it. So this is a problem. Um, there's a, there, there is a fix to this, but it is slightly complicated. If you're doing something like this on edge devices, you want to have a local rollback strategy. Um, so basically in this case for, for a wifi router, what you'd want to happen is it would have a local copy of the last version of the OS. It would do a self-health check to see if it can connect to the internet. If after a certain time the update's not working, it automatically rolls back and calls home, right? So that's ideally what you'd like to have in a situation like this. Um, the caveat with the local rollbacks is if you don't need it often, the implementation complexity of local rollbacks outweighs the benefit.
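Editor's sketch: a minimal illustration of the local rollback idea described above, assuming the device keeps the last known good version locally and can run a connectivity probe after booting the new one. The function name, the probe callables, and the retry count are all hypothetical, not taken from any real router firmware.

```python
from typing import Callable


def apply_update_with_local_rollback(
    current_version: str,
    new_version: str,
    health_check: Callable[[str], bool],
    max_attempts: int = 3,
) -> str:
    """Return the version the device ends up running.

    `current_version` stays on local storage as the last known good image.
    `health_check` is a stand-in for "can I reach the internet on this version?".
    """
    for _ in range(max_attempts):
        if health_check(new_version):
            return new_version  # the update is healthy, keep it
    # Every probe failed: fall back to the locally stored image.
    # A real device would also report the failure home once it is back online.
    return current_version


# Hypothetical probes used only for this illustration.
def online_only_on_v2(version: str) -> bool:
    return version == "2.0"


def always_offline(version: str) -> bool:
    return False


print(apply_update_with_local_rollback("1.0", "2.0", online_only_on_v2))  # prints 2.0
print(apply_update_with_local_rollback("1.0", "2.1", always_offline))     # prints 1.0
```

The key design point is that the rollback decision is made on the device itself, so it still works when the update has cut the device off from the update server.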

00:19:34

Um, so then getting to the internet of things, pretty much everything's in the internet of things now, and there's a whole bunch of different devices, possibly even smart cars, which get updated constantly, hopefully not while we're driving. And there's actually an example of, um, why it's important to be doing updates to your car continuously, um, which is Jaguar had an issue with their cars, where they had to do a massive recall, and it was a problem with the braking system. So, um, you know, obviously if you're driving a car, you want to make sure that this, this is a safety feature. Brakes should obviously be working. Fortunately, in this case, the core braking function was fine. The car stopped. It was the regenerative braking which was broken. Um, so you could take it back for the recall, they would fix it. But the problem with this is it's extremely expensive to then take cars physically back and to do manual recalls of them.

00:20:29

So, um, the answer here is over-the-air updates. Um, Tesla does this, a bunch of other car companies do this. Jaguar is doing this as well now. Um, and this helps you to avoid the problem of users not doing updates, and also pushing updates when something critical comes out. Um, so continuous updates are even better than over-the-air updates. And even though Tesla's doing over-the-air updates, they're not doing continuous updates. And one of the problems it introduces is stuff like this. So there was an issue with Tesla with phantom braking. Um, the way phantom braking works is you're cruising down the freeway. It has all those automatic collision detection systems working. And it, it thinks there's an obstacle when there's actually not one and the car suddenly stops. So this is incredibly dangerous. It's dangerous for a different reason, right? Not that you can't stop the car, but that somebody might crash into you because the car thinks that it needs to stop for an obstacle. Um, this was a trending thread on the Tesla forums. It was a big issue. They identified that it was a software issue, they fixed it in the patch. The patch for the phantom braking was in red there: "This release contains minor improvements and bug fixes." It took a couple of weeks to come out, Tesla updates every two weeks, because it was waiting for a very important feature, which is chess.

00:21:49

So you don't really want major critical. You don't want critical features waiting on, um, large features. So the answer here is do granular updates, do continuous updates, um, do batch updates in small sizes. And then this way your end users are getting important fixes and they're not waiting for large, um, features to come out. Okay. So, um, another example of this is in the mobile space. Um, so most of you auto update applications, right? And, um, there's a game called newbs adventure, which is done by a developer where he actually documented the process of, um, building the game as part of the game itself. So it's kind of a cute, like, like developer story built into a game. Um, and one of the challenges he ran into on the game was, um, a new feature update which broke a certain percentage of users. Basically the problem was, um, with the feature update, um, some of the Apple servers would return prices without dollar signs.

00:22:52

Some would return prices with dollar signs. The dollar sign was the templating character used in some of the scripts in the game. Therefore, even though it was tested and it shouldn't have broken, on a certain number of end users it would break randomly. And so it took a while to even identify that this was an issue. It took time to fix the issue. And one of the challenges with the app store is there's no way to do a rollback. So, um, you can't say, I have a bug, I need to roll back to the previous version. You need to push a new version of your software and then Apple will, um, then go through the whole validation process and let you push the new version of the software. So the update pattern for this is do canary updates. Um, so when you push a new feature that might affect users, you want to push it to a few users at first, let them test it and find the problem.

00:23:46

Um, and then if there's an issue, then you can revert back before you affect your entire customer base. Um, another pattern here is observability. So some, some problems are really hard to trace. If you build observability and monitoring into your application, that makes it easier to identify when there's an issue. And then, uh, another thing is rollbacks. Um, so in the case of the app store, you can't do rollbacks within the app store, but what you can do is you can do feature flags. So feature flags allow you to release a version, have some features turned on or turned off via configuration. And then if you had an issue, you could roll back to the previous implementation inside your own code base, rather than relying on the app store to do the rollback for you, which in this case Apple doesn't support. And then once you get a few versions down and you realize that code is not needed, you can take out the code which is surrounded by a feature flag.
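Editor's sketch: one common way to combine the two patterns above is a feature flag whose value is a rollout percentage, so a new code path reaches a few users first and can be switched off through configuration without shipping a new app version. Everything named here (`in_rollout`, `price_label`, the `new_price_format` flag) is hypothetical and only illustrates the idea.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets and admit the first `percent`."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


def price_label(price: str, user_id: str, flags: dict) -> str:
    """Choose between the new and the previous implementation at runtime."""
    if in_rollout(user_id, "new_price_format", flags.get("new_price_format", 0)):
        return "$" + price.lstrip("$")  # new code path, behind the flag
    return price                        # previous implementation stays in the code base


# Assumed remote configuration: 5 means "canary to roughly 5% of users".
flags = {"new_price_format": 5}
print(price_label("1.99", "user-42", flags))

# "Rolling back" is a configuration change, not an app store submission.
flags["new_price_format"] = 0
print(price_label("1.99", "user-42", flags))  # prints 1.99
```

Hashing the user id keeps each user on the same side of the flag between sessions, which makes problems reported by canary users easier to reproduce.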

00:24:41

Okay. So another example of this is, um, an entirely different space. We've been talking a lot about, um, you know, um, um, IoT devices and mobile devices, but the same thing applies to server-side software, which a lot of us build, and, um, Knight Capital. This is, this is a historic example of a huge fail in our industry. John Willis loves to use this example as well. Um, and basically what happened is a company disappeared overnight by a big, um, IT failure in how they do their DevOps toolchain. And what happened in this case is they had a bug which was introduced, um, because one out of eight of their servers wasn't updated. So there's a manual process for doing the server updates. They've made changes in the API between the client and the server. If you hit one of the servers which wasn't updated, it failed. When they tried to, um, debug this problem, they rolled back the servers.

00:25:43

So now all eight of the servers were running old code. The API, the clients are running the new code. All of the API requests are failing. Now, if you're in the trading industry where you have millions of dollars every minute exchanging, you can imagine debugging a situation like this, where you're losing money, your servers are down, incredibly stressful. They finally identified the issue and figured it out. But by that time it was too late, they lost $400 million and went out of business the next day. So there's a classic example of, of failure. And in this case, automated deployments would have helped them out, right? We're really, really bad at repetitive tasks. If you can automate tasks which humans do, you're less likely to have problems with this. Um, you're, you're, it's going to be easier to debug and troubleshoot. And another one is to do frequent updates. If you only update infrequently, when you actually go to do the updates, you don't have the muscle memory needed to actually effectively do updates in a repeatable way that's reliable.
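Editor's sketch: one small piece of what an automated deployment can do that a manual one tends to miss is checking that every server actually ended up on the expected version. The helper below is a hypothetical illustration, not a description of any particular deployment tool.

```python
from typing import Dict, List


def stale_servers(deployed: Dict[str, str], expected_version: str) -> List[str]:
    """Return the names of servers that are not running the expected version."""
    return sorted(name for name, version in deployed.items() if version != expected_version)


def assert_fleet_consistent(deployed: Dict[str, str], expected_version: str) -> None:
    """Fail the deployment loudly instead of leaving a mixed fleet in production."""
    stale = stale_servers(deployed, expected_version)
    if stale:
        raise RuntimeError(f"deployment incomplete, stale servers: {stale}")


# Hypothetical fleet of eight servers where one was missed, as in the story above.
fleet = {f"server-{i}": "2.0" for i in range(1, 9)}
fleet["server-8"] = "1.0"
print(stale_servers(fleet, "2.0"))  # prints ['server-8']
```

Running a check like this as the last step of every deployment turns "one of eight servers was missed" from a production incident into a failed pipeline run.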

00:26:48

And then finally, um, state awareness. So something to keep in mind, and it affected this case, is that when you're deploying code, you have to be very careful about the state of the system, the APIs which you're using, because that can also affect how the updates happen. Um, and rolling back might not fix things if you have state involved in the system. Okay. And then one final example. Um, so Verizon had an outage, which was based, which was caused by some of their upstream providers, or rather, Cloudflare had an issue and they blamed Verizon and Noction. So here was the, um, um, CEO of, of Cloudflare blaming some of the folks downstream from them. And so this, this is, this is kind of how we, we work in the business, right? So we, when we have an issue and we see our competitors fall down, we like to point a finger and say, oh, look at those guys. They screwed up. Um, a couple of weeks later, Cloudflare went down by themselves and the internet was not kind to them. So if you bash other people, it comes back to you in spades. So, um, this was an example where there's, first of all, there's a real life pattern behind it.

00:28:06

And then there's also a, um, a technical problem here. So basically the whole cloud went dark as a result of Cloudflare having issues. And, um, the, the reason why this occurred is because they had a bunch of regular expressions which were used for filtering. One regular, one bad regular expression caused the entire system to spike on load. And that took down the entire system. So it's very simple. As you guys know, if you've done any regular expressions, they're very easy to write, impossible to read, impossible to validate the correctness of, and you can destroy your entire deployment with a single misconfigured regular expression. And in this case, affect the entire earth.

00:28:48

So again, the pattern here is canary releases. Don't, don't ship your code, don't ship your regular expressions on all the servers. Test it on a few. Um, Netflix, Facebook, a whole bunch of companies are very good at doing this, where they test all their new releases on us, the unwitting end users, um, but just a few of us, and they roll back if there's an issue. Um, and to summarize, these are all of the, um, different patterns we've talked about. So doing frequent updates, doing automatic updates, making sure that it's all tested, doing canary releases, being aware of state effects in your system, having observability. And in some cases doing local rollbacks. It doesn't always make sense, but if you have edge devices where you think there might be problems updating them, if they have an issue, having a local rollback strategy will help you on the edge.
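Editor's sketch: the server-side version of a canary release can be outlined as "update a few servers, check them, then either continue or roll back only the canaries". The function below is a simplified, hypothetical model that tracks versions in a dictionary; a real rollout would call deployment and monitoring systems where this code calls `is_healthy`.

```python
from typing import Callable, Dict, List


def canary_release(
    servers: List[str],
    old_version: str,
    new_version: str,
    is_healthy: Callable[[str, str], bool],
    canary_count: int = 1,
) -> Dict[str, str]:
    """Return the version each server runs after the rollout attempt."""
    fleet = {name: old_version for name in servers}
    canaries = servers[:canary_count]

    # Step 1: only the canaries get the new version.
    for name in canaries:
        fleet[name] = new_version

    # Step 2: if any canary is unhealthy, roll the canaries back and stop.
    if not all(is_healthy(name, new_version) for name in canaries):
        for name in canaries:
            fleet[name] = old_version
        return fleet

    # Step 3: canaries look fine, roll out to everyone.
    for name in servers:
        fleet[name] = new_version
    return fleet


servers = ["edge-1", "edge-2", "edge-3", "edge-4"]

# Hypothetical health probe: version "2.1" spikes CPU, everything else is fine.
def cpu_ok(server: str, version: str) -> bool:
    return version != "2.1"

print(canary_release(servers, "2.0", "2.1", cpu_ok))  # every server stays on 2.0
print(canary_release(servers, "2.0", "2.2", cpu_ok))  # every server moves to 2.2
```

With this shape, a bad change such as an expensive filter rule only ever runs on the canary servers, which limits how many users see the failure.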

00:29:38

Okay. So getting back to our diagram, so do we want it? Sure. Yes. And even if not, we're going to auto update. Are there high risks? If there aren't, great, but if yes, hopefully we now trust the update and we can actually get folks to effectively update our software and take incremental, um, updates to what you're working on. So a quick quote, this is from the Liquid Software book. Our goal is to transition from bulk and rare software updates to extremely tiny and extremely frequent software updates. So tiny and so frequent, they provide an illusion of software flowing from development to the update target. Um, this book is co-authored by our founders, Yoav Landman, Fred Simon and, um, Baruch, who is, um, one of our developer advocates. And, um, you can come by our booth and find out more, but this is kind of the overall picture we're trying to paint.

00:30:39

So all of that stuff in the top right, where you need updates or you need automation, that's liquid software. There are some cases where you might need to do manual updates, or you might need to avoid updates in critical situations. Like, does anyone want their plane updated while they're flying on it? Okay. So perhaps not, but, but what about if there's a hacker on your plane, would you want a security vulnerability patched? Okay. So maybe a few folks want the update in that case. So hopefully we're all moving towards a world of continuous updates. Thank you guys very much for coming to see the presentation and enjoy the rest of the conference.