Las Vegas 2019

POST No AWS Bills: Cloud Cost Optimization Without APIs

Corey Quinn

Cloud Economist, The Duckbill Group

Transcript

00:00:02

We begin with a simple, noncontroversial statement that no one will take personally: specifically, that all of the tools that help you optimize your cloud bills kind of suck. Let me back up for a minute. Who here had no idea who I was until today? Oh wow. You really should have gone to a talk that was actually good, but we're going to see how this goes. I'm a cloud economist, which is like a regular economist, except I dress far better. I also rant incoherently about clouds. My first language, as you're probably picking up by now from the accent, is sarcasm, and I wind up doing an awful lot of it, mostly crapping all over the conference hashtag on Twitter. I'm hilarious in my own mind if nowhere else; it's fine. But what I do for a day job is go into large companies, definitionally, and fix the horrifying AWS bill, a small problem experienced by pretty much everyone.

00:00:58

And this has led to some interesting insights as I've gone down this path, and I wanted to share some of them with you today. So we go back to my original noncontroversial statement. I could name companies here, but I won't, because first, some would be very annoyed that I named them, and then others would be annoyed that I didn't name them. That's how this whole thing works. So I'm just going to have sharp elbows regardless. It's part of my milieu; it's fine. The problem is that all these tools are more or less equivalent, despite people's urge to tell you otherwise. For example, they'll always tell you to buy some reserved instances: prepay for some capacity on a thing that you're about to turn off. Just set more money on fire, why not? Or they'll tell you to turn off those idle instances labeled "DR site." Fun story, by the way, as a slight detour.

00:01:55

For those of you with engineering minds: you want to have a hot DR site up and running, because if there's an availability zone or regional outage and you're going to fail over to another region, I hate to burst your bubble, but you're not the only person with that plan. The control plane often becomes saturated, so you're waiting an hour or two for things to spin up. If it needs to be there quickly, it needs to exist before the disaster happens. Don't turn it off to save money. Picture it like an insurance policy: hopefully you'll never need it. In reality, we play with matches an awful lot. And of course, there are other things they're not going to tell you as you go down this path: yeah, there are things we could tell you programmatically, but for some godforsaken reason we're not going to, because, well, it's not a big problem in our environment.

00:02:41

The quality is uneven across all of these things, and it's not their fault. This is a DevOps summit, and one of the things we do in DevOps, and at DevOps summits, is blameless postmortems, because it's not about blame. It's about fixing systemic problems. So we conducted a blameless postmortem and found it was all the VCs' fault. I didn't say I was good at it; I just said I knew how it worked. Delving into this a little bit, it turns out that a lot of this comes down to venture capitalists having incentives that may or may not align with the companies they're working with or, importantly, their customers. For example, if you're a VC and you're investing in something, everything has to be massively scalable. If you think that perhaps consulting can be massively scalable, I urge you to talk to someone who started a consulting firm.

00:03:43

Yeah, it's not. And fundamentally, since everything they want to fund has a low-touch element to it, they're going for a plan where everything should be self-service, done with software as a service, which is totally going to solve things. And because they're VCs, of course, bonus points if they can work "Uber for something" into it, in this case cloud billing, because Uber was the last really disruptive, innovative thing and everyone's waiting for the next one. Before this, believe it or not, it was the next Yahoo. Sorry if anyone works at what used to be Yahoo; I know it's a sad time for all of us. I'm going to digress briefly into the problem with the entire approach to VC funding in this space. At the beginning, you may have been told that there would be no math.

00:04:29

That was a lie. Let's do some math. Let's assume for the sake of argument, based on last week's earnings call, that AWS makes about $35 billion a year. That may not be exactly correct, but it's at least directionally correct. Again, we're not checking your math work on this. Assume that a company goes out and charges a percentage of the bill, as they all seem to want to do. Okay, great. Doing a little math there, we figure out that the total addressable market is a bit over a billion dollars. That assumes that one company could capture all of it. They won't; they have competitors. It further assumes that everyone is going to go out and jump on that. They're not; a lot of companies want or need to build their own. As a direct result, it winds up looking an awful lot like a pretty shitty unicorn. Incidentally, The Shitty Unicorn Project would be a great sequel or parody to Gene Kim's new book. I bet we can strike a deal if we position it right. He's not in here, is he? Yeah.
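The back-of-the-envelope math above can be sketched in a few lines. The $35 billion figure is the one quoted in the talk; the fee rate is never stated, so the 3% used here is an assumption chosen only because it lands at "a bit over a billion dollars." The market share and adoption numbers are likewise hypothetical, included to show how quickly the number shrinks once competitors and build-your-own shops are accounted for.

```python
# Rough total-addressable-market sketch for a "percentage of bill" tool.
AWS_ANNUAL_REVENUE_USD = 35_000_000_000  # approximate figure quoted in the talk
ASSUMED_FEE_BASIS_POINTS = 300           # assumption: 3% of the bill (not from the talk)


def total_addressable_market(revenue_usd: int, fee_basis_points: int) -> int:
    """Revenue if one vendor billed every customer the same fee rate."""
    return revenue_usd * fee_basis_points // 10_000


def realistic_revenue(tam_usd: int, market_share: float, adoption_rate: float) -> float:
    """Scale the TAM by the share one vendor wins and the share of
    customers that buy a tool at all instead of building their own."""
    return tam_usd * market_share * adoption_rate


tam = total_addressable_market(AWS_ANNUAL_REVENUE_USD, ASSUMED_FEE_BASIS_POINTS)

# Hypothetical: one vendor wins a quarter of the market, and half of all
# customers buy a tool rather than build one.
likely = realistic_revenue(tam, market_share=0.25, adoption_rate=0.5)

print(f"TAM: ${tam:,}")
print(f"After competition and build-your-own: ${likely:,.0f}")
```

Under these assumed inputs the ceiling is just over a billion dollars, and the plausible slice for any one vendor is a small fraction of that.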

00:05:32

There are some counterarguments to this, like: cloud revenues are growing. True, they are. That's the fun thing about bills: you ever notice that they don't tend to get smaller? And there is more to cloud than AWS. I should call out that I'm biased toward talking in an AWS context because when I started this company a few years ago, that's where the expensive problems were, and by and large that continues to be true. I'm not saying it's a one-horse race, but I found that specializing worked out well. This is not a comment on Azure, GCP, or IBM. It is on Oracle Cloud, but I digress. We'll get there.

00:06:08

So let's think some of these things through. First, the position they're taking: we're going to have a single pane of glass where you can look at all of your different cloud estates. Yeah, multicloud done that way is terrible, and no one is really asking for it. I'm not saying you don't have different workloads in different clouds, and I'm not saying that having different lines of business with different providers is a bad thing. I am saying that, absent a compelling argument to the contrary, having a workload you can magically deploy onto any cloud provider forces you down to lowest-common-denominator APIs, and you spend an awful lot of time solving global problems locally. No one wants to do it. That's a separate talk where I spit a lot more when I talk. And of course it also presupposes, as this continues to grow, that the cloud providers themselves are not in turn going to make the tools to slice and dice the bills more accessible and better featured. Because, you know, if there's one thing we know about cloud products, it's that they get worse with time.

00:07:12

And wait a minute, why am I asking VCs for business advice in the first place? I seem to recall you having a bit of a swing and a miss recently. It feels, on some level, not all but some, that being a venture capitalist is mostly about having won a lottery once, and now you're going to teach other people to do the same thing. Forgive me if that's not the most compelling sales pitch I can imagine. And this is, of course, rather beside the point, because none of the tools in question will actually fix the problems that you have. What problems do you have? Let me tell you. To be clear, I'm going to tell you what your problems are strictly in the context of cloud billing, because it's very hard to start a conversation with "You know what your problem is?" and remain at a conference. Ask me how I know. You can probably guess.

00:08:06

So invariably, you need to explain what's going on in a cloud environment, in a financial sense, to someone who is not directly involved with it. It turns out that people in finance generally don't spend a lot of time logging in to various cloud consoles and poking around in various services. It also turns out that the bill itself is vast and deep. And of course it is not structured in a way that answers business questions, because it answers things like "how much did you spend on storage?" and "how much did you spend on data transfer?" but not things like "how much did we spend in development versus production?" or "how much did that subservice cost once it broke?" These are the things that companies want to know, and nothing out there from a tooling perspective aligns with answering them. It also runs into other problems, specifically that not everyone who touches a cloud environment in an engineering sense is going to be a responsible steward.

00:09:04

And I mean that in both directions. Sure, there's someone who spins up the biggest instances all the time, because bigger obviously means best, and who never turns anything off. But we also have people who love to spend weeks golfing a couple hundred bucks off of their developer spend, as if they have no idea what they themselves cost. At some point that adds no business value; there's a bigger problem to focus on. Context matters. So what does finance tend to care about in the context of a cloud bill? Mostly it's about allocation and prediction, but after you play the corporate game of telephone, that's not what anyone hears. I've traced this back countless times. It's true. For example, did you know the AWS bill does in fact have tax consequences? In the United States, at least, research and development tax credits are available to companies, and that includes, as engineers think about it, pre-production environments. If you're in a position to divide out what is pre-production and what is not, that has meaningful impact. Of course, if you just guess and give wrong numbers that you can't back up, it turns out some auditors would like a word, and it's not the fun, happy conversation from earlier this morning, where they said "the auditors are here" and everyone clapped. That's never happened before. No one has ever been thrilled that auditors showed up, I promise.
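To make the allocation point concrete, here is a minimal sketch of the difference between the question the bill answers (spend per service) and the question finance asks (spend per environment). The line-item shape, the `env` tag name, and every number are hypothetical; this is not the format of any real billing export, and it assumes resources were tagged by environment in the first place.

```python
from collections import defaultdict

# Hypothetical, simplified line items. Real billing exports are far larger
# and messier; the "env" tag is an assumed tagging convention.
line_items = [
    {"service": "EC2", "cost_usd": 1200, "tags": {"env": "production"}},
    {"service": "EC2", "cost_usd": 300, "tags": {"env": "development"}},
    {"service": "S3", "cost_usd": 150, "tags": {"env": "production"}},
    {"service": "S3", "cost_usd": 50, "tags": {}},  # nobody tagged this one
    {"service": "DataTransfer", "cost_usd": 100, "tags": {"env": "staging"}},
]


def total_by(items, key_fn):
    """Sum cost_usd over items, grouped by whatever key_fn returns."""
    totals = defaultdict(int)
    for item in items:
        totals[key_fn(item)] += item["cost_usd"]
    return dict(totals)


# The view the bill gives you by default.
by_service = total_by(line_items, lambda item: item["service"])

# The view the business asks for. Untagged spend is kept visible
# rather than silently guessed at.
by_env = total_by(line_items, lambda item: item["tags"].get("env", "untagged"))

# Assumption: these two environments count as pre-production.
PRE_PRODUCTION_ENVS = {"development", "staging"}
pre_production_spend = sum(
    cost for env, cost in by_env.items() if env in PRE_PRODUCTION_ENVS
)

print(by_service)
print(by_env)
print(f"Pre-production spend: ${pre_production_spend}")
```

Keeping an explicit "untagged" bucket is a deliberate choice in this sketch: a number you cannot attribute is better reported as unknown than folded into a guess that has to be defended later.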

00:10:25

And more importantly, as they start doing predictions based upon where money is going: how do we wind up calculating the cost of the goods or services we're providing? When that winds up being a function of something else, that's a difficult conversation. But they wind up talking to engineering about these problems, and what is heard is "you're spending too much on the cloud; spend less." Sometimes that's even what they think they mean, but it isn't. The problem is that visibility, and being able to predict what your spend is going to look like in the future, matter to businesses. They inform strategic decision-making. Presumably. I would call some of these decisions less than strategic, but I'll be charitable, because that is again a separate talk.

00:11:13

And of course, finance doesn't have the context engineering does. They see a big Amazon bill and think: that's a whole lot of books, and I didn't see that many boxes being dropped off at the office this past week. And wait, how do they have time to read all of them anyway? Don't they have full-time jobs? It's not that they're dumb. They're not; they just have a completely different skill set. And historically we haven't done a great job of bridging the gap between finance and the engineering world, similar to the gap a decade or so ago between development and operations. Please don't call it FinOps. It just doesn't work. Meanwhile, on the other side of the fence, what does engineering care about?

00:11:53

Very often, whether they admit it or not, it comes down to feature delivery and how quickly they can get something out, and companies wholeheartedly encourage this. That's fair. It's reasonable. But I've been in conversations with engineers who were dinged on their annual review because they spent a couple of weeks, unprompted, optimizing the cloud bill. In one case, they optimized for a couple of hundred bucks, and I get it; yeah, that was probably not the best move. In the other case, they knocked $10 million off the company's cloud spend, but that's not the feature they were supposed to be working on. Context matters. Communication matters. And of course the other half of what engineering does, which we don't generally talk about at a high level, is deal with problems with the computers that break things. If this doesn't resonate, I urge you to go troubleshoot an app for about eight hours where it turns out the problem is either a comma or a whitespace character that isn't UTF-encoded quite the way that the thing reading it expects it to be.

00:12:50

And then tell me you're not table-flipping. The challenge here is that as you start building out governance and control (and yes, cost optimization is a part of governance, but as soon as you use that word, half the audience tunes out; I don't blame you, I'm in that half of the audience), as you build these guardrails, doing things the right way fundamentally has to be easier than not doing them the right way. Now, some of you are fortunate enough to work in regulated industries, where it turns out that disregarding the compliance requirements doesn't just mean you're fired. It means you're going to prison. Most companies, for better or worse, don't have quite that strong a guardrail around this. And increasingly companies do governance wrong. It comes out as trying to be an impenetrable gate rather than a constructive filter. You have to make it easier to do things the right way, or things break down. Simple example:

00:13:44

If it takes six weeks to provision a physical server, and we're in the cloud now, so we're saving time and it only takes four weeks to provision an instance, you're going to have someone at your desk every 20 minutes asking you to spin a thing up. They're never going to turn it off, because it takes four weeks to spin up a new one. So they're just going to leave it around forever. And people don't remember to turn things off. It's never as exciting to clean up after yourself as it is to build anything. Just ask any child with a pile of Legos.

00:14:15

In fact, someone on Twitter said the bill is not even about what you use; you're billed instead for what you forget to turn off. Who here has not left something running in a cloud environment and found out because of the bill? Yeah, few people. Most of us have been in that painful place. And the reason that you can't do any of this with tooling is that there's no API for business insight. The thing that might be the right move for one company could be completely disastrous for another, because there's no good way to get information from people programmatically while staying within the bounds of the law. Installing an API without consent is a problem when it comes to people.

00:15:00

Lastly, of course, you can't be a real economist unless you slap your name on a theory or a law or something else equally self-aggrandizing. And I don't have the attention span most days to write a tweet, let alone a book. So instead, I've noticed something that I wanted to bring up for your consideration, based upon conversations I've had with fascinating people who are in turn doing fascinating things. This is a data center, for those who don't know what one looks like. Ooh and aah at its majesty. Come on. There we go. If you're building out a data center for your application, construction first takes way more money than you thought and way longer than you thought. But once it's up and running, you can tell to a very high degree of fidelity what it's going to cost to run for the next three years, almost to the penny, which is awesome.

00:16:00

That's one end of the spectrum. Let's go to the other end. This is the actual architecture of a serverless application (and yes, I get paid a dollar every time I use the word serverless on a conference stage) that builds my ridiculous email newsletter and sends it out every Monday. There's a whole bunch of different things going on in here as it transforms my ridiculous nonsense into something that is still ridiculous, but now it's prettier and goes out to a whole bunch of people. And I can trace, as Simon Wardley says, the flow of capital throughout this application. Right now it's immaterial, because we're talking a couple of cents a month. It does not matter to the business. But if I were sending out 10,000 email newsletters a day, which no one wants, I assure you, I could start tracking: where is the expensive part?

00:16:48

Where is it bottlenecking? And I could start focusing on that. If someone in finance has a question, I can get into a very detailed, granular discussion about where that bill is going. This really winds up on a spectrum that I like to call cloudiness, and we'll get to that in a minute. On the one side, you have the story of data centers. Then we move into instances, or virtual machines. Then we move into auto scaling, the idea of scaling up and down. Yes, that is the AWS diagram icon for an auto scaling group. Their art is about as good as some of their service names as far as being clearly understood, but roll with it; take it on faith. That's what it's for. And then Docker or Kubernetes, a word I hate using onstage because I lose $5 every time I say it. Again, I digress.

00:17:36

And then into the serverless world at the end of it. To be clear, these are indicative, not prescriptive, because it is possible to do it exactly wrong across the board. If you take what's running in your data center, shove it into a bunch of VMs, and just run them as-is in a cloud provider, first, it runs on money. Secondly, you haven't really solved anything, unless the problem you're trying to solve is "you know what? This company sucks at running data centers." And yeah, hey, I suck at running data centers. There's no shame in that, but I've never yet seen that as the stated rationale for doing a cloud migration: "Well, we suck at running data centers." Cool. Some folks will say that now that you've done that, you're finished. VMware. Sorry, something caught in my nose there. But it's a transitional step.

00:18:22

It doesn't get you far enough down the line for that to work. And this isn't necessarily just me saying this. It turns out that if you say something dubious that you can cite, it works out super well. As we learned from the State of DevOps Report this year, Dr. Nicole Forsgren, who I believe is giving a talk next, so she could not be here for me to call out in person, highlighted something foundational that NIST has come out with. If nothing else, it was notable because it was the first time NIST had done something that wasn't 400 pages long, so people could actually internalize it and do something useful with it. And there's a high degree of correlation between how well people align with these cloud characteristics in their environment and how high-performing the team is. There's data that backs this up.

00:19:07

So it's not just about doing things because the thought leader on stage said you should. There are measurable impacts that come out of this, and that's valuable and important. That is something that opens up doors, but it also unlocks something else. Forgive my amazing skills as an artist; as you might tell, this is not my first skill set. As something gets cloudier, it becomes inherently more cost efficient. This shouldn't be a tremendous surprise to anyone because, with things like auto scaling, if you turn off things you're not using, you don't pay for them. It turns out it's super hard to go and sell a server on eBay and then, a day later when you need capacity, buy that server back on eBay, and do that in an effective way. But with cloud providers, and only paying for what you care about, you can become far more cost efficient, all the way up to the serverless end, where it is pay-per-invocation, consumption-based billing.

00:20:09

So there is no idle that you're paying for. That becomes fantastic, but it's certainly not easy to get there, and I'm not advocating that anyone should attempt it in one fell swoop. That's, again, a different talk by other people who are way more idealistic than I am. But as things get more cost effective, the ability to predict what that's going to cost in absolute dollars and cents becomes almost nil. And finance doesn't like hearing that as the answer to the question "what is this going to cost us to run during the holiday rush?" They want you to say it in a [unclear] manner, and it still doesn't help, because this is over Slack, and how can you even express [unclear] that way? They're asking a question that can be answered in dollars and cents, and instead they're getting an answer that, while accurate, isn't helpful to them.
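The trade-off described here, cheaper but harder to predict, can be illustrated with a toy comparison. Everything in this sketch is hypothetical: the daily request counts are made up, and it assumes (unrealistically) that a unit of fixed capacity and a unit of consumption cost the same one cent, so that the only difference between the two models is whether you pay for peak capacity every day or only for what was used.

```python
from statistics import pstdev

# Hypothetical daily request counts for one week, with a spike on day four.
daily_requests = [100, 120, 90, 400, 110, 95, 105]

COST_PER_REQUEST_CENTS = 1  # assumed unit price, identical for both models

# Data-center-style model: provision for the peak and pay for it every day.
peak_capacity = max(daily_requests)
fixed_daily_costs = [peak_capacity * COST_PER_REQUEST_CENTS for _ in daily_requests]

# Consumption-style model: pay only for the requests actually served.
usage_daily_costs = [r * COST_PER_REQUEST_CENTS for r in daily_requests]

fixed_total = sum(fixed_daily_costs)
usage_total = sum(usage_daily_costs)

# Population standard deviation as a crude stand-in for "how predictable".
fixed_spread = pstdev(fixed_daily_costs)
usage_spread = pstdev(usage_daily_costs)

print(f"Fixed capacity: total {fixed_total} cents, day-to-day spread {fixed_spread:.1f}")
print(f"Pay per use:    total {usage_total} cents, day-to-day spread {usage_spread:.1f}")
```

In this toy data the consumption model costs well under half as much, but the fixed model's daily cost never moves, while the consumption model's cost swings with demand. That swing is what makes the dollars-and-cents answer hard to give in advance.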

00:21:04

One of the ways to start expressing this in a more helpful way is to view it as a function of a KPI that the business cares about. When you tie it back to business metrics, a popular one in the SaaS world is monthly active users (MAU). If every thousand monthly active users winds up costing X dollars to service, plus a fixed fee for all the things that don't spin up or spin down, like the Jenkins box or a whole bunch of supporting infrastructure, that's something that they can at least work with. Then it goes on to the BI folks to try and figure out exactly what the real numbers are going to look like. It's like reading tea leaves. The challenge, though, is that people often take the wrong message from that, which is why having the right conversation with people is valuable.
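The "X dollars per thousand monthly active users, plus a fixed fee" framing is a straight line, so it can be sketched as a simple least-squares fit. The history below is invented for illustration and is deliberately perfectly linear; real data will be noisier, and the function names are hypothetical rather than part of any billing tool.

```python
def fit_cost_model(mau_history, cost_history):
    """Ordinary least-squares fit of: monthly_cost = fixed_cost + cost_per_user * mau."""
    n = len(mau_history)
    mean_mau = sum(mau_history) / n
    mean_cost = sum(cost_history) / n
    covariance = sum(
        (m - mean_mau) * (c - mean_cost) for m, c in zip(mau_history, cost_history)
    )
    variance = sum((m - mean_mau) ** 2 for m in mau_history)
    cost_per_user = covariance / variance
    fixed_cost = mean_cost - cost_per_user * mean_mau
    return fixed_cost, cost_per_user


def forecast_monthly_cost(mau, fixed_cost, cost_per_user):
    """Predicted bill for a given number of monthly active users."""
    return fixed_cost + cost_per_user * mau


# Hypothetical history: three months of MAU and the matching bill in USD.
mau_history = [10_000, 20_000, 30_000]
cost_history = [2_500, 3_000, 3_500]

fixed_cost, cost_per_user = fit_cost_model(mau_history, cost_history)
cost_per_thousand_mau = cost_per_user * 1_000

# The holiday-rush question, answered as a function of expected users.
holiday_forecast = forecast_monthly_cost(50_000, fixed_cost, cost_per_user)

print(f"Fixed cost: ${fixed_cost:,.2f} per month")
print(f"Variable cost: ${cost_per_thousand_mau:,.2f} per thousand MAU")
print(f"Forecast at 50,000 MAU: ${holiday_forecast:,.2f}")
```

The useful part for finance is not the fitted numbers themselves but the shape of the answer: a fixed floor plus a rate tied to a metric the business already forecasts.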

00:21:52

I am asked consistently by people for industry benchmarks around monthly active users: what should it cost? It's a fair question. It's not a bad one at all, but there's no reasonable way to answer it. In my spare time, I run Twitter for Pets.com. It's like regular Twitter, but 80 times less racist, and each user is just more or less tweeting. Other companies have a very different kind of monthly active user: banks, for example. They're going to be doing giant machine learning projects on it. Those cost a little bit differently. But I'm tired of answering the question. And now that I'm on stage and I have a microphone and no one else does, I'm going to say for the record that the benchmark for monthly active users is 32 cents each. It's founded on absolutely nothing. And ops teams will absolutely hate me for what I just said.

00:22:45

But 32 cents seems about right, because the answer is anywhere from a penny or less to multiple millions of dollars, and it's hard to get an industry benchmark even within a sector, because everyone builds things fundamentally differently. All of which is to say that my working thesis (and I would love to hear people's thoughts on this, directed loudly at me, tied to a brick through my window) is that the more cost optimized a cloud environment becomes, the harder it is to predict the cost: there's an inverse correlation. This isn't inherently a bad thing, as long as the costs are tied to something the business understands. But having those conversations requires a bit of empathy. It requires getting to a point where finance and engineering can have a conversation together, and everyone walks out of the room content with the conversation and not thinking that they're being condescended to or being dragged into an engineering exercise

00:23:50

that makes no sense for them. That's been my working theory. When I tell people this in various contexts, roughly half of them say, "Well, yeah, that's obvious. What about it?" and the other half just sort of stare at me with an "oh my God, I hadn't considered that" look. That tells me I'm either onto something or I've got the basis for a terrific scam, and I'm not entirely sure which way it goes. So if I have one ask for folks here, it's to tell me what I missed. What am I not thinking about when it comes to optimized environments being harder to predict in absolute terms? I work with a certain subset of the industry. Obviously I don't work with every company yet, so there are things that I'm missing. There are clearly missing pieces of context for me, and I would be very interested to hear from you folks what you think those are. My name is Corey Quinn. I am a cloud economist, whatever the hell that means. Thanks for listening to me. You can follow my exploits at lastweekinaws.com, and I'll be hunting around the conference until they herd me out of here tomorrow afternoon. Thanks.