You Suck at Cloud and It's [Not] All Your Fault

Corey is the Chief Cloud Economist at The Duckbill Group, where he specializes in helping companies improve their AWS bills by making them smaller and less horrifying. He also hosts the "Screaming in the Cloud" and "AWS Morning Brief" podcasts, and curates "Last Week in AWS," a weekly newsletter summarizing the latest in AWS news, blogs, and tools, sprinkled with snark and thoughtful analysis in roughly equal measure.

Corey Quinn

The Duckbill Group, Chief Cloud Economist

Transcript

00:00:13

Thank you, Nora. Our next speaker is Corey Quinn, who you may know from the snark he delivers in his newsletter, podcast, and of course, Twitter. I've found him to be an incredible source of insights, and I'm personally grateful for the help and the critiques he gave me on an early draft of The Unicorn Project. He has a keen eye for the absurd, and there's plenty of that in how organizations are using and misusing the public cloud. He'll be presenting some startling observations on what we tend to do wrong, and he provides some equally startling advice on what we can do about it. Here's Corey.

00:00:57

Hi, I'm Corey Quinn, and I'm The Duckbill Group's Chief Cloud Economist. You might have little to no idea what that means, but that's okay, because I have absolutely no idea what it means. Basically, I took an engineering background and applied it to an expensive business problem that, and this was key, nobody would wake me up about at 3:00 AM: the horrifying AWS bill. Four years later, we're about 10 employees, our customers spend billions a year on cloud services, and we have angry opinions backed by data. How did we get here? Well, I got yelled at a lot about the AWS bill when I used to run ops teams. I wanted someone that I could just give money to and they would make that problem go away. I couldn't find that person, so I started consulting. I wound up creating my newsletter, Last Week in AWS, that gathered all the information from Amazon's ecosystem that had an economic impact, which, let's face

00:02:02

it, was pretty much everything. And then I shared my idea with Mike Julian, and we partnered to create The Duckbill Group. So we now do AWS cost optimization, because it's a growing problem experienced by approximately everybody. And we host three podcasts and write two newsletters covering all things AWS, because focusing on something is not really something we're great at. Oh, and our mascot is a venomous platypus named Billy, unless he's doing consulting work; then he wears a tie and goes by William. That's where you folks come in. I gave a version of this talk a year ago or so at the DevOps Enterprise Summit event, back when the pandemic was just getting started, and Gene Kim apparently lost a bet or something, because he wanted me to bring the better version of that talk here as a plenary. So this version of the talk is very reasonably called

00:02:59

"You Suck at Cloud and It's All Your Fault," just to make sure that we set the proper tone and context for the nonsense I'm about to hit you with. Because everyone feels on some level like everyone else has figured this stuff out, and somehow we're the ones who are missing the bigger picture that those other companies have managed to get right. I'm sorry: everyone is secretly ashamed of how they're working in the cloud. You're not alone, and it's not really your fault. So get comfortable. Let's chat. I'm here to talk to you about the plethora of mistakes we see in the world of managing cloud costs. Now, when I say managing costs, you're probably going to expect a talk that covers points that are a lot like the ones right here, albeit with slightly less inferred violence against your account manager's furry friends, because this is the usual stuff you'll see in every talk about cloud bills.

00:04:01

And they usually end with a rousing call to action, either to go forth and tag everything better or to buy some company's product or service that won't solve your actual problems. If this is the kind of thing you're into, great. Allow me to suggest every single talk about cloud cost optimization that isn't this one. It's always the same tired advice, and it doesn't matter if someone's giving it to you in 2021 or 2012, because it doesn't really change. I consider this proof that all of the advice on this slide doesn't actually work for crap when it comes to achieving outcomes that even slightly resemble lasting change. So if the usual suspects aren't what you should be focusing on, what are the worst mistakes I see companies making around cost? Let's begin, of course, with running Kubernetes. You might chuckle at this and think I'm being intentionally antagonistic or setting up to make some clever point. Rest assured, I'm not.

00:05:03

We have a large enterprise client who had their cloud bill divided into Kubernetes and everything else. Kubernetes was a giant, expensive question mark. Maybe it's where real work happened, maybe it was all waste, but nobody had a clue. This is because running Kubernetes is and remains a giant mistake from the bill side of the world. Now, from the perspective of a cloud provider, you can spin up a whole bunch of instances and then, on top of that, run all of your workloads inside of Kubernetes and get yourself into billing hell. That's because, to that provider, you're running a single workload called Kubernetes. There's no visibility at all into what workloads are going on inside of that environment. Scaling your clusters up and down is a ridiculous fantasy that everyone talks about but effectively nobody actually does, because it is often not worth the expense of implementing it properly, and it requires you to accurately predict the future.

00:06:04

So in the real world of enterprise, Kubernetes looks like a lot of big instances sitting around costing you the same every hour of every day. Then those instances talk to each other in weird ways. In AWS, transferring data from one availability zone to another costs the same as it does to transfer it from one region to another: 2 cents per gigabyte, in most cases. Kubernetes has no sense of zone affinity, so that weird workload that the cloud provider sees spends an inordinate amount of time talking to itself, and not in the fun way that I do, and it racks up the bills as you go. Think about that for a second. Something inside of Kubernetes wants to talk to something else, and it'll frequently ignore the thing that's right next door that it can talk to for free and opt instead to shove a few petabytes a month at something that's charging 2 cents per gigabyte. Worst of all, you can't attribute those costs,

00:07:00

be they for data transfer, compute, or RAM, to workloads within those Kubernetes clusters, other than by what amounts to basically dead reckoning. You squint, you figure out that 70% of that cluster is workload A, the rest is for workload B, and that's how you allocate it. Namespaces kind of work to solve this and do an okay-ish job. But now, what about the AWS primitives that Kubernetes needs regardless of workload or capacity: the control plane itself, the snapshot storage, the backups, the stuff that doesn't cleanly allocate to one particular workload? How do those get attributed? They often don't or can't be, because there's no tagging mechanism that actually freaking works for this. And the sad fact of the matter is, as an enterprise, you invariably care a lot more about allocating where your spend is going than shaving dollars and cents off of the bill.
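
The "dead reckoning" allocation described above can be sketched in a few lines of Python. This is only an illustration: the function name, the dollar amounts, and the choice to spread shared costs (control plane, snapshots, backups) in the same proportion as estimated usage are all assumptions, not something the talk prescribes.

```python
def allocate_cluster_cost(total_cost, usage_share, shared_cost=0.0):
    """Split a cluster bill across workloads in proportion to estimated usage.

    usage_share maps a workload name to its estimated fraction of the cluster
    (the "squint and guess" number). shared_cost covers items that belong to
    no single workload; spreading it proportionally is an assumption.
    """
    total_share = sum(usage_share.values())
    if total_share <= 0:
        raise ValueError("usage shares must sum to a positive number")
    billable = total_cost + shared_cost
    return {
        name: billable * share / total_share
        for name, share in usage_share.items()
    }


# Hypothetical numbers: a $100,000 cluster plus $10,000 of shared overhead,
# split 70/30 between workload A and workload B as in the talk's example.
bill = allocate_cluster_cost(
    100_000.0,
    {"workload_a": 0.7, "workload_b": 0.3},
    shared_cost=10_000.0,
)
for name, dollars in bill.items():
    print(f"{name}: ${dollars:,.2f}")
```

The weakness is exactly the one raised in the talk: the 70/30 split is an estimate, so the output is only as trustworthy as that guess.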

00:07:54

So this client of ours, like so many of them do, went with Kubernetes because they were doing a hybrid environment, that is, a data center and a cloud environment. They wanted to move workloads between data centers and cloud seamlessly, and that middle layer was, of course, called Kubernetes. What they were doing, in fact, was improving their data center at the expense of their cloud environment. Look, before I get angry letters, I'm not denying that Kubernetes offers advantages. If you want to learn more about that, go ahead and talk to anyone who has the word Kubernetes in their talk title at any event ever, and they'll be thrilled to evangelize a technology that nobody fully understands for any problem you have, and several you don't. I'm pretty sure that's a job requirement at this point for some folks. I talked to a company recently who was receiving vast quantities of data from their customers. That doesn't really narrow it down in 2021 at all, because we live in a data economy and we want more data, faster, across the board. Image and video files have gotten larger,

00:09:00

and as we've increased the bandwidth available on the network, companies have rushed to fill it with a whole bunch of telemetry. Now, no judgment here. That's a different talk, where I get into it about privacy, so I digress. That client was receiving scads of data from their customers, in the multi-petabyte range. Then they were transferring that data internally between availability zones over and over as they sliced, diced, and restructured that data. They ran some queries, then ran other queries on the results of those queries. And it turned out that for every gigabyte of data that they received from customers, they were transferring 50 gigabytes internally. That's not an exaggeration that I made up to prove a point; it's true. Now, this isn't exactly the best approach to take, so we dug into it a bit further. Now, they had valid reasons for doing it. They were slicing the data apart, taking the results, transforming it further.
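
To get a feel for what a 50x internal amplification does to a bill, here is a rough Python sketch. It assumes every internally transferred gigabyte crosses an availability zone boundary and is billed at the 2 cents per gigabyte figure quoted in this talk; the 1 PB monthly ingest and the decimal petabyte are hypothetical choices for illustration.

```python
GB_PER_PB = 1_000_000   # decimal petabyte; an assumption for round numbers
CROSS_AZ_RATE = 0.02    # dollars per GB, the figure quoted in the talk


def internal_transfer_cost(ingested_gb, amplification=50, rate_per_gb=CROSS_AZ_RATE):
    """Estimate the monthly cross-AZ bill for data that is re-transferred internally.

    amplification is the number of GB moved internally per GB received
    (50 in the talk's example).
    """
    return ingested_gb * amplification * rate_per_gb


# Hypothetical: 1 PB received from customers in a month.
monthly_cost = internal_transfer_cost(1 * GB_PER_PB)
print(f"Estimated cross-AZ transfer cost: ${monthly_cost:,.0f} per month")
```

Under these assumptions the transfer line item scales linearly with both ingest volume and the amplification factor, so halving the number of internal hops halves that part of the bill.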

00:09:58

They weren't just being silly with it, but it had a very real cost. In a data center, doing this just means that your switching fabric is going to be pretty congested and you might have to spend a bit more money on your network. In AWS, it manifests differently. Now, I mentioned previously that it costs 2 cents per gigabyte to move data between most regions and availability zones. Try thinking about it this way instead: every time you move data between availability zones or regions, it costs the same as storing that data in S3 for three weeks. You're a slight, simple misconfiguration away from that number exploding to just shy of four months. Do you want to achieve that high score? Your search term is "Managed NAT Gateway data processing fees." I know it sounds like I'm getting into the weeds of how data transfer is billed in AWS, right? Let's see what it looks like.

00:10:56

I'm not joking. This is how the billing works. Now, don't worry: I'm going to give you a link to the high-res version of this image at the end of the talk, because it's probably going to be useful for you. I find myself consulting this constantly, and the only winning move with data transfer charges is simply not to play. The lesson here could be distilled down into a basic truism: once you get data into your environment, absolutely do not pass it back and forth between EC2 instances in different availability zones. Please, if you're going to be doing that much data processing, store multiple copies of the data, and we'll all be happier. To that end, data should live on the cheapest storage possible. A lot of companies fall into this trap constantly, and it stems from not fully understanding how the data life cycle works.

00:11:53

It's well beyond SSD versus spinning disk in this era. What do you need from a durability perspective versus a latency perspective? One of our reference customers, and it's rare that we get to name names when talking about our clients, but we can here: they're Honeycomb. They had a Kafka cluster running with local EBS volumes as a backing store. They were able to save money by changing to instances with NVMe volumes, which cut cost and increased throughput, and the durability concerns of having the volumes tied to specific instances were offset by Kafka's built-in replication factor. A different company has a whole bunch of data that lives in Splunk, because data always finds its way to Splunk. Imagine that: an expensive story in which Splunk is the main character. They're currently in the process of moving that data to S3. Why S3? It turns out that the cloud long ago took a collective vote about what the data storage model of the future was going to look like,

00:12:56

and object stores won, hands down. Let's do a little math here. Typical gp3 SSD storage in AWS costs 8 cents per gigabyte per month. You need multiple copies of that for redundancy, let's say across three availability zones, so that's 24 cents per month per gigabyte. You're also not a dangerous lunatic, so you're not going to be completely filling your disk volumes. Let's assume your disks are 80% full, which is aggressive but doable. So now each gigabyte of data that you're storing, not including replication costs, is costing you 30 cents a month per gigabyte. Or you could store it in S3 for 2.3 cents per gigabyte per month. But wait, there's more. S3 offers a sarcastic number of nines in its durability, 11 of them, which is win-the-lottery-while-simultaneously-getting-struck-by-a-meteorite level of likely. And when you access that data from different availability zones in the same region, there are no data transfer charges.
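
The storage arithmetic above can be written out as a short Python sketch. The prices are the ones quoted in the talk (and will drift over time); the function and constant names are made up for illustration, and S3 request charges are deliberately left out.

```python
GP3_PRICE_PER_GB = 0.08   # dollars per GB-month, as quoted in the talk
COPIES = 3                # one copy per availability zone
FILL_RATIO = 0.80         # fraction of each volume actually holding data
S3_PRICE_PER_GB = 0.023   # dollars per GB-month, as quoted in the talk


def effective_block_storage_cost(price=GP3_PRICE_PER_GB, copies=COPIES, fill=FILL_RATIO):
    """Cost per GB of real data once redundancy and headroom are included."""
    if not 0 < fill <= 1:
        raise ValueError("fill ratio must be in (0, 1]")
    return price * copies / fill


ebs_cost = effective_block_storage_cost()
ratio = ebs_cost / S3_PRICE_PER_GB

print(f"Block storage, effective: {ebs_cost * 100:.1f} cents per GB-month")
print(f"S3:                       {S3_PRICE_PER_GB * 100:.1f} cents per GB-month")
print(f"Block storage is roughly {ratio:.0f}x the S3 price")
```

Raising the fill ratio or dropping a replica narrows the gap, but with these inputs block storage stays an order of magnitude more expensive per stored gigabyte.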

00:13:59

There are request charges; ballpark that a thousand requests costs you a penny. It adds up, so be sensible, but the economic wins here are absolutely massive, if and only if your applications know how to speak to object storage instead of disk volumes. Please don't try to treat S3 like it's a file system. Absolutely nobody likes what happens when you do, and it ends in tears before bedtime. So this is where we start to get to the idea of modernizing your applications to speak to object stores. It's not easy; please don't think I'm saying it is. And for some workloads, it may as well be impossible. I get it. It's hard, but there are serious durability and scaling wins if you can pull it off. Ah, but what about not wanting to get locked into one provider? You want to have disks; disks are available everywhere. Let's talk about multi-cloud, and ignore for a minute

00:14:59

the fact that everyone has an object store. Over the past year, ranting about multi-cloud has become one of my signature talking points, because sometimes you set up a straw man and then it comes to life, almost like Frosty the Snowman. My position on this is often misconstrued to be understood as "never use multi-cloud for anything." Some folks also like to use it as proof positive that I'm a shill for AWS, or that I hate them, or frequently both at the same time, however the hell that's supposed to work. I want to be clear here: I have no partnerships with any vendor in this space. So let me first state the point that I'm trying to make, and then I'll shoot down the various ways it'll be misinterpreted. This is very important. Don't email me until you've listened to this part. Now, everyone knows that lock-in is to be avoided.

00:15:54

So the right play is obviously to build everything in a cloud-agnostic way, so you can deploy your entire stack to different cloud providers on a moment's notice. To quote my friend Ben Kehoe of iRobot, think of multi-cloud as being a lot like cow tipping, which is a myth. How do we know that cow tipping is a myth? Easy: there are no videos of anyone successfully tipping a cow on YouTube. That's right, if it's not on YouTube, it doesn't exist. Similarly, we know that building everything in a cloud-agnostic way is also a myth, because if someone had actually built their entire application stack in a way that wasn't completely horrifying, we would never hear the end of it from a whole bunch of different vendors' keynote stages. Instead, we're treated to things like 2019's VMworld keynote, where their vision of the future was so horribly complicated

00:16:52

they could never find a real customer who would do this. So they had to invent a fake t-shirt company called Tanzu Tees to theoretically use all of the ridiculous nonsense they were talking about that was a part of their Tanzu platform. It doesn't exist. I like VMware, please don't think I don't, but I'm serious: after all that nonsense, they didn't even have the decency to give me a way to go to that website and buy an actual t-shirt. Please stop trying to chase the impossible dream. You're turning your back on a whole bunch of differentiated higher-level services that cloud providers offer that can massively improve your business, and you're getting basically nothing of value in return. You're paying for an optionality you're not cashing in, and the coin you're using to buy it is your own feature velocity. So as a rule of thumb, pick a provider per workload and go all in.

00:17:45

I am not an AWS partner. I don't care if you use AWS or Azure or GCP or Oracle Cloud. Hell, if I really dislike you, I'll suggest you use cloud, but pick a provider and go all in until you're forced not to, on a per-workload basis. And that last point is key, because I personally at The Duckbill Group use AWS for my infrastructure; GitHub, or just have as it is properly pronounced, for my code storage, because AWS CodeCommit is really a sad joke; and G Suite for my email, because I don't want to run mail servers anymore this decade. But each workload lives in a distinct provider. There's no workload that has to seamlessly flow between different providers, because that's not sensible for most use cases. You'll spend more time getting that workload to speak different cloud providers' various dialects than you will actually running the blasted thing.

00:18:46

And if you're not running it in multiple providers, it's like a DR plan. You update it and get a binder and everything's set. And then three months go by and it's time to test your DR plan again, if you're one of those forward-looking shops that believes in testing your DR plan, and it breaks. Then you keep iterating forward and you finally get it to work, and the next commit breaks the whole thing again. If you're not active-active, you're not really multi-cloud. Now, there are exceptions for workloads where multi-cloud makes sense, but they're usually stateless workloads that often fit inside of containers. We see them, but it's infrequent, and it's certainly not happening nearly often enough to suggest this is somehow some kind of best practice. It's an edge case, but one that's talked up to be way more common than it is by vendors who will have absolutely nothing left to sell you

00:19:37

if you go all in on your cloud provider, and by crappy cloud providers who know that if you're going all in on a cloud, it will certainly not be theirs or anything their dirty little hands have ever touched. So to that end, we don't have a whole lot of clients that are actually doing multi-cloud for this definition of the term. Instead, we see a number of clients who don't want to commit to a single vendor out of a misplaced fear of lock-in, so they build everything to be provider-agnostic. One company we've spoken to is spending over a hundred million dollars a year, almost entirely on . They run a whole bunch of databases on top of it. Their single concession to the cloud is an object store, but they're careful to only use S3 functionality that's widely replicated elsewhere. They spent three months of an engineer's time trying to get VPC peering over IPsec working between Google Cloud and AWS in Terraform. You'll note that I said they spent three months trying, not that they succeeded. The idea doesn't pan out in the real world. Avoiding lock-in? Please. You're already locked in by virtue of how identity gets managed, how networking fails to interact, and by, you know, the people you've hired, who are more expensive than your cloud bill.

00:20:59

They're good at your cloud provider of choice, presumably. Tell them to learn another cloud, and an awful lot of them are going to opt to just move down the street instead, to continue working with the thing they're good at. So pick a provider per workload and go all in. If your company is all in on cloud provider A and you acquire a company on cloud provider B, cool. Leave them alone. There's not a lot of business value in upsetting that apple cart. In most scenarios, there really isn't. So let's say you're in a data center environment. Your colo site bill shows up, your CFO sees it and has a heart attack, and your month goes on, man, more or less, presuming they recover. It's a data center. What are you realistically going to do about it? Arson, and claim the insurance money? Not recommended. Now imagine a cloud environment.

00:21:48

The AWS bill shows up. Well, that can be broken down by different business units. It's not "I spent $2 million on EC2"; it's "I spent $1.4 million on search, $300,000 on displaying impressive Norton Antivirus badges to our website visitors," et cetera. That latter version can be used to communicate with a business in a way that carries weight and context beyond "it turns out computers are super expensive." But to get there, you have to first slice the bills into resources that align with the business. Finance doesn't really care how much you spend on AWS. They don't particularly care if it's $1 million or $5 million. If you come to them with a "well, cloud is just expensive, deal with it," that doesn't help them. They're not there to hold the purse strings. They're there to help the business grow. Give them information that informs their forecasts, and then they in turn can help the business make better decisions.

00:22:40

If one month it's a million dollars, and the next it's $2 million, and the month after that it's $3 million, the CFO is going to be pretty pissed off. What's not well understood is that if this month it's $2 million, next month it's $1 million, and the third month it's 300K, the CFO is going to be just as upset. It's not about the cost. It's about blindsiding the business with what it costs to provide your goods or services to your customers, presuming you have some. Pro tip from someone who's been there all too frequently: you almost certainly don't want to surprise your business leadership, because they will surprise you in return, and those scars last. We spoke with a company who was a bit undersold on the value of being able to allocate their spend, which in this example was a bit on the smaller side.

00:23:25

They knew that it was important to have a vague idea of where the money was going, but they weren't so sold on the idea of why it really mattered. Then their Amazon contract came up for renewal. A fun story: if you Google for "AWS contract negotiation," I'm results one and two. I kind of do this a lot. Now, details are buried under deep, secret levels of NDA, but here's an open secret in the industry: if you commit to longer-term spend at certain dollar figures, you get discounts off of retail pricing. There are a lot of complexities to this on a service-by-service basis, and spend causes different pricing structures to unlock, but that's the baseline situation. Now, the trick is, what's the right number that you should commit to? Too low, and you leave money on the table. Too high, and you don't hit your commit and have a shortfall. And complicating

00:24:15

this whole thing is that Amazon has their own math on how to calculate what they think your commitment should be. It's okay-ish. It's a tool. It has no context. It doesn't recognize that the holiday season just ended and you sell Christmas decorations. But what if you knew more about your growth than they did, and had the data to back that up? What if you were able to predict your spend better than Amazon can? This company was in exactly that position. It let them take the discount offer they received and, in turn, realize another half million dollars of additional savings on a $5.7 million commit, solely because they had the better predictive model and the data to back it up. When you're able to predict your spend, you're able to negotiate better deals. The same was true back in the data center days. If you knew how many racks you were going to need by the end of year five, you could have negotiated it all up front.

00:25:03

That's the futures market in a nutshell, and it's come to cloud whether we want it to or not. All of this is building up to a mistake that's the source behind a lot of the things that we encounter in the wild. They collectively all speak to the single biggest strategic blunder that we see, and that's thinking of the cloud like it's just another data center. You can do that, but it's expensive, fragile and basically tire historical business model. They're the payday lender of technical debt. Here's the easiest way to figure out if you've fallen into this trap. Look at your architecture. Now look at this image. Now back to your architecture. Now back to me. Are you basically only using EC2 instances? If so, I've got some bad news for you: you're pretty much a data center, over on the left side. Another way to tell whether you're stuck in data center thinking is to take a look at the hour-by-hour cost metrics in your environment.

00:25:58

Do they decrease significantly in the off hours for your business? Does your spend ramp up as your user traffic increases? Or are you seeing what a lot of folks did over the past year: 80% of your user traffic evaporates, but your cloud bill doesn't budge? The entire value proposition of the cloud is the ability to scale up and down to meet your workload's needs, which is what everyone tells themselves so they can feel better about not actually doing it. The entire premise of the cloud is that you can dynamically scale up to meet demand. Then, and this is the part everyone forgets, you can scale the hell back down again. Further, you can use higher-level managed services that remove the operational toil from your environment. I know I talk a lot about AWS in this talk, because that's what my business does, but every mistake I've described here applies to any cloud provider. There's nothing here that doesn't apply to Azure

00:26:47

if you add in a pile of licensing issues, or GCP if you sprinkle in a bit of healthy fear of them turning your production environment off because they got distracted by something shiny. So you're more efficient as you become cloudier, but there's a dark counterpoint here. It's getting more efficient the cloudier it gets, but it's also getting way, way, way less predictable. The unfortunate reality here is that the cloudier you become, if you start embracing the idea of paying per consumption on a per-request basis, it's way harder to predict your spend in terms of dollars and cents. The best way you can get to a positive outcome here is to identify your baseline costs, the things that are going to charge you regardless of what else you do, and then take a look at your workloads. For the super cloudy environments, figure out what it costs to serve a monthly active customer, or a thousand of them, whatever metrics make sense for your business.
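
One way to express the baseline-plus-unit-cost idea is the small Python sketch below. It assumes a simple linear model (a fixed baseline plus a variable cost that scales with monthly active customers); the function names and all dollar and customer figures are hypothetical.

```python
def cost_per_thousand_customers(total_bill, baseline_cost, monthly_active_customers):
    """Variable cloud cost attributable to each 1,000 monthly active customers."""
    if monthly_active_customers <= 0:
        raise ValueError("need at least one active customer")
    variable_cost = total_bill - baseline_cost
    return variable_cost / monthly_active_customers * 1000


def forecast_bill(baseline_cost, unit_cost_per_thousand, forecast_customers):
    """Project next month's bill from the business's own customer forecast."""
    return baseline_cost + unit_cost_per_thousand * forecast_customers / 1000


# Hypothetical month: $250,000 bill, $100,000 of which is baseline,
# serving 500,000 monthly active customers.
unit_cost = cost_per_thousand_customers(250_000.0, 100_000.0, 500_000)

# The business planning team expects 650,000 active customers next month.
projected = forecast_bill(100_000.0, unit_cost, 650_000)

print(f"Variable cost per 1,000 customers: ${unit_cost:,.2f}")
print(f"Projected bill at 650,000 customers: ${projected:,.2f}")
```

A linear model is a simplification; the point is that the forecast input comes from a metric the business already tracks rather than from a guess about infrastructure.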

00:27:36

Then you can turn it back around on the business planning folks in your organization. They're having to predict the future too, and you can sort of draft along behind them as they do it. The business is used to variability in their user metrics, or whatever KPIs they're using as a lens through which they view their business. They're not used to IT spending being treated the same way. If you can tie the IT cost directly to the metrics used to ascertain the health of a business and the growth of its offerings, suddenly people will accept that very differently than they will you just shrugging when they ask you what the cloud bill is going to be next month. Congratulations. You've sat through the slings and arrows I've hurled at a bunch of cloud decisions and come out the other side still intact. Good for you. Maybe it's not all your fault after all.

00:28:20

Now go to lastweekinaws.com/dos, where I've put together resources for you. First, a PDF on how to cut your AWS bill. If you care, it's gorgeous; it has a platypus on it. It's something you can use to immediately start the cloud cost conversation internally. It has actionable things you can do yourself today. Secondly, a high-resolution download of the AWS data transfer cost diagram that I talked about earlier. Oh, and before I forget, I also write the Last Week in AWS newsletter, which gathers the news from Amazon's ecosystem, gently and lovingly makes fun of it, and then goes out every Monday and Wednesday to over 26,000 people. I also host a pair of podcasts: Screaming in the Cloud, which is a serious interview show about the business of cloud, and the AWS Morning Brief, which allows me to indulge my ongoing love affair with the sound of my own voice. Again, I'm Corey Quinn, Chief Cloud Economist at The Duckbill Group. We help companies fix their horrifying AWS bills. Thank you very much, and enjoy the rest of the conference.