Top Ten Mistakes in Managing Cloud Costs

"(Number four will blow your feet off!)"

Corey Quinn

Cloud Economist, The Duckbill Group

Transcript

00:00:17

Hi there, I'm Corey Quinn, and I'm a cloud economist. You may have little to no idea of what that means, but that's okay, because I have no idea what it means. Fundamentally, I took an engineering background and applied it to an expensive business problem that nobody would wake me up about at three in the morning, namely, the horrifying AWS bill. Four years later, we're seven employees, and our customers spend over a billion dollars a year on cloud services. And we have angry opinions that we back with data. I also write the Last Week in AWS newsletter, which gathers the news from AWS's cloud ecosystem and gently and lovingly makes fun of it. It goes out every Monday and Wednesday to over 20,000 readers. I also host a pair of podcasts: Screaming in the Cloud, which is a serious interview show about the business of cloud, and the AWS Morning Brief, which allows me to indulge my ongoing love affair with the sound of my own voice and what I imagine to be humor.

00:01:13

Last year, Onalytica ran an analysis that determined I was the greatest cloud influencer in the world. I've been completely insufferable ever since. Now, the reason I bring all of that up is to validate that while you are going to see some of what passes for humor in this talk, I do know what I'm talking about. And today that's what I'm here to do: talk to you about the top 10 mistakes we see in the world of managing cloud costs. We start with the usual stuff that you'll see in every talk about cloud economics. If this is the kind of thing you're into, great. Allow me to refer you to every single talk about cloud cost optimization other than this one. It's always the same tired advice. And it doesn't matter if someone's giving you this talk in 2020 or in 2012, because it doesn't really change.

00:02:00

This is proof that the advice given here doesn't actually work for crap. Instead, I'm going to talk to you about the top 10 terrible mistakes that I see companies making around cost. Let's get started. We begin, of course, with running Kubernetes. You might quarrel with this and think that I'm being either intentionally antagonistic or setting up to make a clever point, but I'm not doing either of those things. We had a large enterprise client who had their cloud billing divided into Kubernetes and everything else. Kubernetes was a giant, expensive question mark. From the perspective of a cloud provider, you can spin up a whole bunch of instances and run all of your workloads inside of Kubernetes and then get yourself into billing hell. That's because to that provider, you're really just running a single workload: Kubernetes. There's no visibility at all into what workloads are going on inside of that environment. Scaling your clusters up or down is a ridiculous fantasy that everyone talks about, but effectively nobody actually does.

00:03:02

So in practice, it's a bunch of big instances sitting around costing you the same every hour of every day. Those instances talk to each other in weird ways. In AWS, transferring data from one availability zone to another in the same region costs the same as it does to transfer it from one region to another: 2 cents per gigabyte in most cases, although there are exceptions that are egregiously high. Kubernetes also has no sense of zone affinity. So that weird workload the cloud provider is seeing spends an inordinate amount of time not only talking to itself, but racking up the bill as it does so. Worst of all, you can't really attribute those costs to workloads within those Kubernetes clusters other than by what basically amounts to dead reckoning. You squint, you figure that 70% of that cluster is for workload A, the rest is for workload B, and that's how you allocate it.

00:03:52

Now, namespaces do kind of work to solve this in a somewhat passable way. But now where do you wind up putting the AWS primitives that Kube needs regardless of workload or capacity? How do they get attributed? They often don't or can't be, because there's no tagging mechanism for any of these things that actually freaking works. When the world isn't melting down into a recession fueled by a pandemic, you generally care a lot more about allocating where your spend is going than shaving dollars and cents off of the bill. This client went with Kubernetes originally because they were hybrid. They had workloads in data centers and they had workloads in cloud. They wanted to move workloads between those two environments seamlessly. And the middle layer that did that was, of course, called Kubernetes. What they were doing was improving their data center at the expense of their cloud environment.

00:04:41

Every year, until this one at least, the CEO of AWS gets on stage at re:Invent in Las Vegas and unleashes a torrent of product and service announcements. They're bizarre. Two years ago, they announced something called Ground Station, a service that's used to talk to satellites in orbit around Earth. This is a legitimate service that exists, but at least a third of you watching this are reasonably sure I'm making it up for the sake of a joke. I'm not. AWS has over 200 services, and yet over 80% of spend on AWS comes down to just five: EC2, RDS, S3, EBS, and data transfer, which all sounds like a bunch of letters that I'm throwing at you. But roll with me here. The rest of the spend is either long-term strategic bets, interesting technologies that customers ask for, or something else. Remember that every AWS service is for someone, but no AWS service is for everyone. Just because your cloud provider has built a thing does not mean that you should use it or, frankly, even that you can. If you're using something that your cloud provider has built as soon as it launches, you're going to run into tricky edge cases. A company that was very excited to use Amazon's MSK, their managed Kafka service, instead of running their own Kafka, jumped aboard as soon as it came out.

00:05:57

Now, every time I talk to that company about a new release that comes to Amazon's offering of it, their response is, "Well, that sure would have been nice to have at release time instead of these ugly, hacky workarounds that we spent a month building, since that thing is a core feature of Kafka that we still can't believe Amazon forgot." There have been no fewer than six of these after-the-fact releases that would have made their jobs easier. Sometimes Amazon releases features that you'd swear I was making up to insult Amazon unfairly. One that was my favorite was "Amazon Neptune now supports TLS." Now, as I said in my sarcastic newsletter, the far bigger story was that it somehow launched without supporting TLS. In the past few years, we've long since passed the point where I can talk incredibly convincingly about AWS services that don't really exist and not get called out by AWS employees.

00:06:48

There are over 200 of them. Who's to know which ones are real or not? Just because a visionary from your cloud provider shows up on stage to tell you what the future is going to look like doesn't mean that you need to be the first person or company to embrace that future in your production environment. You presumably have a cloud strategy. Don't let the flashy announcements distract you from that. Similarly, AWS announces all kinds of different services. You have a limited set of things that you're going to be able to innovate on. Choose wisely. The third mistake that we see companies making: I talked to a company recently that was receiving vast quantities of data from their customers. Then they were transferring that data internally between availability zones over and over and over, as they sliced, diced, and restructured the data and ran and reran different queries.

00:07:37

It turned out, and this is not an exaggeration, that for every gigabyte of data that they received from customers, they were transferring over 50 gigabytes of data internally. This isn't exactly the best architectural approach you can take. So we dug into it a little bit further. They did have valid reasons for doing it. They were slicing data apart. They were doing useful transformations. They weren't just doing this out of ignorance or to be silly, but it had a very real cost. Now, in a data center, that means your switching fabric is going to be fairly congested and you might have to spend a bit more on your network equipment. In AWS, it manifests very differently. Every time you move data between availability zones or regions, it costs the same as storing that data inside of S3 for three weeks. You are a slight, simple, easy-to-make misconfiguration away from that number exploding to just shy of four months.
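
As a rough illustration of the arithmetic above, here is a minimal Python sketch. It assumes the 2 cents per gigabyte cross-zone figure and the 2.90 cents per gigabyte-month S3 figure quoted elsewhere in this talk, plus a 30-day month; the function names are hypothetical, and none of these numbers should be read as current list prices.

```python
# Figures quoted in the talk; illustrative only, not current list prices.
CROSS_AZ_PRICE_PER_GB = 0.02    # dollars per GB moved between availability zones
S3_PRICE_PER_GB_MONTH = 0.029   # dollars per GB-month of S3 storage


def internal_transfer_cost(ingested_gb, amplification=50.0,
                           price_per_gb=CROSS_AZ_PRICE_PER_GB):
    """Dollars spent moving data between zones for a given ingest volume.

    amplification is the number of gigabytes moved internally for every
    gigabyte received (over 50 in the anecdote).
    """
    return ingested_gb * amplification * price_per_gb


def s3_days_equivalent(transfer_price=CROSS_AZ_PRICE_PER_GB,
                       s3_price=S3_PRICE_PER_GB_MONTH, days_per_month=30):
    """Days of S3 storage that cost the same as one cross-zone transfer."""
    return transfer_price / s3_price * days_per_month


if __name__ == "__main__":
    # Transfer bill for 1,000 GB received from customers.
    print(internal_transfer_cost(1_000))
    # Storage-time equivalent of a single transfer, in days.
    print(round(s3_days_equivalent(), 1))
```

Under these assumptions, one cross-zone transfer lands at roughly three weeks of S3 storage, which is consistent with the comparison made above.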

00:08:29

Once you get data into your environment, absolutely do not pass it back and forth between EC2 instances in different availability zones. Please, if you're going to be doing that much data processing, store multiple copies of the data instead. Which brings us to our next point. A lot of companies fall into the trap of not storing data in the right location constantly. And that stems from not fully understanding how the data life cycle works in cloud. It goes well beyond "Do I put that on SSD or spinning disk?" Well, what do you need from a durability perspective versus a latency perspective? There are a lot of options here. One of our reference customers, and it's rare that we get to name names when we're talking about our clients, but we can here: they're Honeycomb. They had a Kafka cluster running with local EBS volumes as a backing store.

00:09:16

They were able to save money by changing to instances with NVMe volumes, save a lot on cost, increase throughput, and address the durability concern of having those volumes tied to specific instances by offsetting that with Kafka's built-in replication factor. If they lost an instance or two, there are multiple copies of that data, but it's replicated intelligently. A different company has a bunch of data that lives in Splunk, by which I mean pretty much every company. Imagine that: an expensive story that features Splunk. Who would have thought? They're currently in the process of moving all that Splunk data to S3. Why S3? Well, the clouds, all of them, long ago took a collective vote about what the data storage model of the future was going to look like, and object stores won hands down. Typical gp2 volumes, which is SSD in AWS, cost you 10 cents per gigabyte per month.

00:10:06

And you'll need multiple copies of that for redundancy. Let's say three availability zones. So that's 30 cents per month per gigabyte. You're also not a dangerous lunatic, so you're not going to be completely filling up your disk volumes. So let's assume they're all 75% full: aggressive, but doable. So now each gigabyte of data you're storing is costing you 40 cents per month. Or you could store that same data in S3 for 2.90 cents a month instead. But wait, there's more. S3 offers a sarcastic durability guarantee. They say their design goal is 11 nines of durability, which is in the realm of "win the lottery while getting struck by a meteorite at the same time" level of likely. When you access that data from different availability zones in the same region, there are no data transfer charges. There are request charges, which ballpark to a thousand requests to S3 costing you a penny.
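
The storage comparison above can be worked through in a few lines of Python. This is a sketch using only the figures quoted in the talk (10 cents per gigabyte-month for gp2, three copies, 75% full volumes, 2.90 cents for S3, a penny per thousand requests); the function names are hypothetical.

```python
def block_storage_cost_per_gb(price_per_gb_month=0.10, copies=3, fill_ratio=0.75):
    """Effective monthly dollars per gigabyte of data actually stored.

    Every gigabyte is kept `copies` times, and each volume is only
    `fill_ratio` full, so the provisioned capacity exceeds the data size.
    """
    return price_per_gb_month * copies / fill_ratio


def s3_request_cost(requests, price_per_thousand=0.01):
    """Dollars charged for a number of S3 requests at the ballpark rate."""
    return requests / 1000 * price_per_thousand


if __name__ == "__main__":
    ebs = block_storage_cost_per_gb()
    s3 = 0.029  # dollars per GB-month, as quoted in the talk
    print(f"block storage: ${ebs:.2f} per GB-month")
    print(f"object storage: ${s3:.3f} per GB-month")
    print(f"ratio: {ebs / s3:.1f}x")
    print(f"one million requests: ${s3_request_cost(1_000_000):.2f}")
```

The request charge is the term to watch: a workload that makes many small reads can erode the storage savings, which is why the application's access pattern matters as much as the per-gigabyte price.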

00:10:56

And that adds up. So while you want to be sensible about it, the economic wins here are so incredibly massive that it's a slam dunk, but if, and only if, your applications can speak to object stores instead of disk volumes. If you try to treat S3 like a file system, you'll make it worse. No one's going to be happy with that if you do it. Another common mistake is chasing the multi-cloud dragon. Everyone knows that lock-in is to be avoided. So the right answer is obviously to build everything you can in a completely cloud-agnostic way, so you can deploy your entire stack to different cloud providers on a moment's notice. To quote my friend Ben Kehoe, think of multi-cloud like cow tipping. Cow tipping is an urban myth. And how do we know it's a myth? Simple: there are no videos of anyone ever having successfully tipped a cow on YouTube.

00:11:44

Similarly, we know that building in a cloud-agnostic way for multi-cloud is also a myth, because if anyone had actually done such a thing in a way that wasn't completely horrifying, we would never hear the end of it from every multi-cloud vendor's keynote stage. Instead, we're treated to things like last year's VMworld keynote, where their vision was so horribly complicated that, to use it all, they had to invent a fake t-shirt company called Tanzu Tees to theoretically use all of the ridiculous nonsense they were talking about that was a part of their platform. Multi-cloud like that doesn't exist. Stop trying to chase the impossible dream. By doing so, you're really turning your back on a bunch of higher-level differentiated services that cloud providers offer to massively improve your business, and in return getting basically nothing of value. You're paying for an optionality that you're not cashing in.

00:12:33

And the coin you're using to buy that is your own feature velocity. As a rule of thumb, pick a provider per workload and go all in. I am not an AWS partner. I don't care if you use AWS or Azure. If I really dislike you, I'll suggest you use Oracle Cloud. But pick a provider and go all in until you're forced not to. To that end, we don't have clients that are actually doing multi-cloud, for this version of multi-cloud. Instead, we see a number of clients refusing to commit to a single vendor out of a misplaced fear of lock-in. So they build everything to be provider agnostic. One company we've spoken to is spending over a hundred million dollars a year, almost entirely on just EC2 instances. They run a whole bunch of databases on top of those instances, but nothing managed. Their single concession to anything cloudy is the object store, but they're careful to only use S3 functionality

00:13:30

that's widely replicated in other providers. As a result, they once spent three months of an engineer's time trying to get VPC peering over IPsec working between Google Cloud and AWS using Terraform. Now, you'll note that I said they spent three months trying, not that they succeeded. The idea doesn't pan out in the real world. And for all of the effort they've put into trying to maintain this, they are still a single-cloud shop. Let's talk a bit more about analysis paralysis and what it means to finance. Let's hypothetically say that you've bought in heavily on the idea of data centers. You didn't build your companies on top of public cloud because you're responsible grownup companies instead of Twitter for Pets. And back when you were building out your technology capability, the cloud didn't really exist the way that it does today. Now you're hearing lots of good things about the cloud.

00:14:18

You're also hearing a lot of bad things about the cloud that usually masquerade as digital transformation, but in reality are the sound of a vacuum cleaner that's run by a giant consulting company being fired up and aimed at your wallet. So here's the big question: will you save money by migrating to the cloud? You can spend the next 18 months doing a TCO analysis to answer that question. And at the end of it, the results will be wishy-washy or incredibly one-sided, depending upon internal corporate politics. Let me save you some time on this. If you're going to save money by doing a cloud migration, it's going to happen on a five-year time horizon. Let me be even more clear to you. You can assume that you will not save money by moving to the cloud. What you're going to gain is capability and a level of rigor that your data centers will not be able to match.

00:15:05

If you leave an application running unmaintained in a cloud provider for years on end, it gets better. The instances it runs on grow more durable. The network gets faster. If you try that in a data center, you're going to discover, much to your chagrin, that raccoons have carried your servers off somewhere around the two-year mark. At some point you have to stop measuring and make a decision. There's always risk, but we've reached a point where the cloud has been effectively de-risked for most workloads. It's time to grow up and get off the fence. We had a conversation with a company that needed to get approval to begin planning a move out of the data centers onto AWS. They got approval on that plan last month, and they started that plan in November. That's not a migration plan. That was a "can we even do it at all?"

00:15:49

Is it feasible for our business? They have not built a migration plan yet, and they're seven months in. For anyone who's watching this video later in time and trying to do calendar math: seven months to do an analysis as to whether moving to the cloud was feasible. Now that they've decided it is, they get to build their migration plan, which, as anyone who's ever done a data center to cloud migration knows, is going to be another six months if you rush it. Their AWS bill, if they migrate everything, which isn't guaranteed, will be roughly $120 million before discounting. Whether cloud is worth it or not for you is a question you have to answer. Your path to get there is not going to be via financial analysis. Instead, it has to be a capability story. Similar to analysis paralysis is letting the accountants believe that the world hasn't changed.

00:16:34

If you tell me to build a small WordPress website in AWS, I could not tell you what it's going to cost to run for the first month within any closer than 20%. We need to first figure out how much traffic it gets and what that looks like. Then we can adjust our financial forecasting to match reality. This is not how it used to work. If you go back to data centers, you pretty much knew what the servers you bought in that data center were going to cost you for the next three years to a pretty accurate degree. If you build a pure serverless environment, on the other end of the extreme, you're going to spend way, way, way less money than building a data center. But the answer to what it's going to cost is going to be a lot more variable. In fact, it's going to be a function of how many users you have on the system

00:17:16

over a given time span. You're going to have to bring finance people who understand the concept of unit economics into these conversations sooner rather than later, but stop them before they lead you down a different path to utter madness. This isn't cost accounting. It's the cloud. It's elastic. You aren't going to get accurate models to the penny. Getting to within 10% will optimistically take you months, and most of the time, it's not directly worth it. If you over-index on this too soon, you're going to become like a different company we ran into. We'd spoken with someone who requested our help in building out this model for them. They're prepared to devote six engineers to building that model over eight months, which is more than their entire cloud bill costs. This isn't an early-stage startup trying to dial in unit economics. They're already in steady state.

00:18:01

They're trying to allocate almost down to the penny, and they're spending far more than they're ever going to recapture in value to do it. So let's pretend for a second that you're in a data center environment. Your colo site bill shows up, your CFO sees it, has a mid-sized heart attack, and your month goes on. That's the end of the story, more or less. It's a data center. I mean, what are you realistically going to do about it? Breach contract and leave? Turn the servers off? Now the AWS bill shows up. Well, that's a different story. That can be broken down into different business units, but that's not how the bills align. It's not "I spent $2 million on EC2." It's "I spent $1.4 million on our search service, $300,000 on displaying very impressive Norton antivirus badges to website visitors," et cetera. You start allocating it to business functions.
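
Slicing a bill by business function rather than by service can be sketched as a simple roll-up. This is a minimal Python illustration, not a billing tool: the line-item shape, the `business_function` tag key, and the `unallocated` bucket are all hypothetical, and it assumes each resource already carries a tag naming the function it serves.

```python
from collections import defaultdict


def allocate(line_items, tag_key="business_function"):
    """Sum line-item cost per business function.

    Each line item is a dict with a "cost" in dollars and an optional
    "tags" dict. Anything without the tag lands in "unallocated", which
    makes the untagged share of the bill visible instead of hiding it.
    """
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "unallocated")
        totals[owner] += item["cost"]
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical line items: the same service appears under several functions.
    bill = [
        {"service": "EC2", "cost": 900.0, "tags": {"business_function": "search"}},
        {"service": "S3", "cost": 500.0, "tags": {"business_function": "search"}},
        {"service": "EC2", "cost": 300.0, "tags": {"business_function": "badges"}},
        {"service": "EC2", "cost": 300.0},  # untagged resource
    ]
    for function, cost in sorted(allocate(bill).items()):
        print(f"{function}: ${cost:,.2f}")
```

The size of the unallocated bucket is a useful number in its own right: it shows how much of the spend still cannot be tied to any part of the business.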

00:18:45

The latter version of that story can be used to communicate with the business in a way that carries weight and context beyond "it turns out that upon careful analysis, computers are super expensive." In order to get there, you first have to slice the bill into resources that align with that business. Finance doesn't really care how much you spend on AWS. Believe it or not, they don't particularly care if it's $1 million or $5 million. If you come to them with "Well, cloud is just expensive. It does that. Deal with it," that doesn't help them any. They're not there to hold the purse strings. They're there to help the business grow. Give them information that helps inform their forecast, and they in turn can help the business make better decisions. If this month your cloud bill is a million dollars, next month it's $2 million, and the month after that it's $3 million, the CFO is probably going to be pretty upset.

00:19:32

What's not as well understood is that if next month it's $2 million, the month after that it's $1 million, and the third month out it's $300,000, the CFO is going to be just as upset. It's not about the cost. It's about blindsiding the business with what it costs to provide your goods or services to your customers. This ties into making the mistake of underestimating the power of prediction. We spoke with a company that was a bit undersold on the value of being able to allocate their spend. They knew that it was important to have a vague idea of where the money was going, but they weren't so sold on the idea of why that actually mattered. Then their Amazon contract came up for renewal. So, open secret in the industry: if you commit to longer-term spend at certain dollar figures, you'll get discounts off of retail pricing.

00:20:18

There are a lot of complexities to this, but that is the baseline universal truth of the situation. The trick is: what is the right number that you should commit to? If you commit to spend too little, you're leaving money on the table. If you commit to spend too much and you haven't hit your commitment, you'll have a shortfall. Complicating this whole thing is that Amazon has their own math for how to calculate out what they think your commitment should be. And it's okay-ish. It's a tool. It has no context. It doesn't recognize that the holiday season just ended and your company sells Christmas decorations. But what if you knew more about your growth needs and had the data to back it up? What if you were able to predict your spend better than Amazon was? This company happened to be in precisely that position, and it let them take the discount offer that they received.
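
The commitment trade-off described above can be made concrete with a toy model. The sketch below is a deliberate simplification and not how any provider's contract actually works: it assumes the discount applies to all usage and that the commitment acts as a floor on what you pay. The offers in the example are made-up numbers, and the function names are hypothetical.

```python
def annual_cost(retail_usage, commitment, discount):
    """Dollars paid in a year under a simplified commitment model.

    retail_usage: what the year's usage would cost at retail prices.
    commitment:   minimum spend promised to the provider.
    discount:     fraction taken off retail prices in exchange.
    Falling short of the commitment means paying the commitment anyway.
    """
    discounted = retail_usage * (1.0 - discount)
    return max(discounted, commitment)


def cheapest_offer(forecast_usage, offers):
    """Pick the (commitment, discount) pair that minimizes cost for a forecast."""
    return min(offers, key=lambda offer: annual_cost(forecast_usage, *offer))


if __name__ == "__main__":
    # Hypothetical offers: larger commitments earn larger discounts.
    offers = [(0, 0.0), (3_000_000, 0.05), (5_000_000, 0.10)]
    for forecast in (4_000_000, 6_000_000):
        commitment, discount = cheapest_offer(forecast, offers)
        cost = annual_cost(forecast, commitment, discount)
        print(f"forecast ${forecast:,}: commit ${commitment:,} "
              f"at {discount:.0%}, pay ${cost:,.0f}")
```

In this model the right offer depends entirely on the forecast, which is why the quality of the prediction, more than the discount schedule, decides how much is saved.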

00:21:02

In turn, they realized half a million dollars in additional savings on a $5.7 million commitment, just because they had the better predictive model and the data to back it up. That's the value here. When you're able to predict your spend, you're able to negotiate better deals. The same was true back in the data center days. If you knew how many racks you were going to need by the end of year five, you could have negotiated for all of that upfront. We're really, in technology, reinventing the futures market in a nutshell. The last thing that I want to talk about is the biggest strategic blunder that I see, and everything's been tying into this: thinking of the cloud like it's just another data center. You can do that, but it's expensive, fragile, and basically VMware's entire business model. They are the payday lender of technical debt.

00:21:48

Here's the easiest way for you to figure out if you've fallen into this trap: look at your architecture in the cloud. Are you basically only using EC2 instances and disk volumes? If so, I've got some bad news for you. Another way to tell whether you're stuck in data center thinking is a bit more nuanced. Take a look at the hour-by-hour costs in your environment. Do they decrease significantly in your business's off hours? Does the spend ramp up as your user traffic increases? Or are you seeing what a lot of folks did over the past few months as the pandemic hit: 80% of your user traffic evaporates, but your cloud bill doesn't really move at all? The entire value proposition of the cloud is the ability to scale up and down to meet your workload's needs, which is what everyone tells themselves so they can feel better about not actually doing it.
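
The hour-by-hour check described above can be turned into a rough number. The following Python sketch compares how far cost swings between its peak and trough with how far traffic does. The metric and its name are hypothetical, chosen for illustration; the hourly series would come from whatever billing and traffic data you already have.

```python
def swing(series):
    """Fraction by which a series drops from its peak to its trough."""
    peak = max(series)
    if peak == 0:
        return 0.0
    return (peak - min(series)) / peak


def elasticity(hourly_cost, hourly_traffic):
    """Ratio of the cost swing to the traffic swing.

    Near 0: the bill barely moves while traffic rises and falls.
    Near 1: spend tracks demand closely.
    """
    traffic_swing = swing(hourly_traffic)
    if traffic_swing == 0:
        return 0.0  # traffic never changes, so there is nothing to track
    return swing(hourly_cost) / traffic_swing


if __name__ == "__main__":
    traffic = [100, 80, 20, 20, 60, 100]   # requests per hour, made-up values
    flat_bill = [50, 50, 50, 50, 50, 50]   # fixed fleet that never scales
    scaled_bill = [50, 42, 14, 14, 32, 50] # spend that follows demand
    print(round(elasticity(flat_bill, traffic), 2))
    print(round(elasticity(scaled_bill, traffic), 2))
```

A result close to zero is the data center pattern: traffic can fall by 80% and the bill stays where it was.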

00:22:30

If the premise of cloud is that you can elastically scale to meet demand, then, and this is the part everyone forgets, you can also scale the hell back down again. Further, you can use higher-level managed services that remove the operational toil from your environment. I know I've talked a lot about AWS in this talk, but every mistake I've mentioned applies to every cloud provider. There's nothing here that doesn't apply to Azure, if you add in a whole bunch of licensing issues, or GCP, if you sprinkle in a bit of healthy fear of them turning your production environment off because they got bored and distracted by something shiny. So you've sat through the slings and arrows I've hurled at a bunch of cloud decisions and come out the other side still intact. Good for you. Now go to lastweekinaws.com/does, D-O-E-S.

00:23:16

I've put together a few things for you. First, a PDF on how you can cut your AWS bill right now. It's gorgeous. It has a platypus on it. It's something you can use immediately to start the cloud cost conversation internally. It's full of actionable things you can do yourself today. There's a newsletter signup box on that page as well. I gather all of Amazon's cloud ecosystem news every week, strip out the things I don't care about, and then I make fun of them because I have serious problems with my personality. You're going to want to sign up for that too. Lastly, I'll be hosting a free-form Q&A as a workshop here on cloud cost management. Ask questions, and I'll make jokes like the ones you've just sat through while I answer them. It will be a grand old time as we're trapped inside during a global pandemic. Again, I'm Corey Quinn, cloud economist at The Duckbill Group. We help companies fix their horrifying AWS bills. Thanks very much.