Fabulous Fortunes, Fewer Failures, and Faster Fixes from Functional Fundamentals - Scott Havens
Learn how real-world enterprises that adopted functional programming principles and adapted them to their line-of-business systems achieved greater resiliency, faster time-to-delivery, and lower total cost of ownership.
Scott is the former Director of Software Engineering at Jet.com and Walmart Labs for supply chain technology. He specializes in data-intensive systems.
Senior Director, Head of Supply Chain Technology, Moda Operandi
Alright, to motivate the next talk, I want to tell you just a little bit about something that has influenced me a lot. About three years ago, I learned a language called Clojure, and it changed my life. It was probably one of the most difficult things I've learned professionally, but it has also been one of the most rewarding. It brought the joy of programming back into my life. For the first time in my career, as I'm nearing 50 years old, I'm finally able to write programs that do what I want them to do, and I'm able to build upon them for years without them falling over like a house of cards, as has been my experience for nearly 30 years. The famous French philosopher Claude Lévi-Strauss would say that certain tools are good to think with. For reasons that I will try to explain in the next five minutes, I believe functional programming and things like immutability are truly better tools to think with, and they have really taught me how to prevent myself from constantly sabotaging my code, which I've been doing for decades.
I'm going to make the astonishing claim that these things have eliminated 90% of the errors I used to make. So I'm going to try to motivate why. About a year ago, I found this amazing graphic on Twitter that describes the difference between passing variables by value versus passing variables by reference. When I was in graduate school in 1993, most mainstream languages supported only passing things by value, which meant that if you passed a variable to a function and changed it within the function, you would only change your local copy. Often this means that you'd have to return the new state. And if this was a structure or a large object, it means you would have to do a lot of copying and pasting. This is tedious, error-prone, and very time-consuming.
I often found myself complaining about this, wishing there were a better way. It turns out you could eliminate this by using pointers, but pointers are now considered so dangerous that few languages besides C, C++, and assembly even let you do it. In 1995, I got introduced to a huge innovation in programming languages called passing values by reference. It showed up in C++, Java, and Modula-3, and it allowed you to change the value that was passed to you as a parameter, and it would change the value that was passed in from the caller. This seemed really great. I loved it because it's such a time-saver, because it lets you write less code. But three years ago I changed my mind. Clojure is one of a category of languages called functional programming languages, along with Haskell and F#.
They all have the same sensibility. They don't let you change variables. Functions need to be pure: the functions always return the same output given the same inputs, and there are never any side effects. You're not allowed to change the world around you. You're not allowed to read or write from disk. Certainly not write; even reading from disk is not allowed, because it's not always the same. This was one of the biggest aha moments as a programmer for me, because it taught me how terrifying passing variables by reference should be. Because when you see this, what you really should be seeing is this. It's like: why is my coffee cup changing? Who is messing with my coffee cup? And how do I make them stop? The point here is that it's very difficult to understand your code and to reason about what is happening when anyone can change your internal state.
You may have heard of heisenbugs, where even the mere act of observation changes the result. These are the hallmarks of multithreading errors, which are considered to be one of the most difficult problems in distributed systems. I'm fixing my coffee cup and I can't figure out how to get it to fill up again, because I need to replicate the problem. So in the real world, uncontrolled mutation makes things extraordinarily difficult to reason about, because other people can put anything they want in your coffee cup. John Carmack, who wrote Wolfenstein 3D, Doom, and Quake, gave this amazing keynote at the QuakeCon conference in 2013, saying: a large fraction of the flaws in software development are due to programmers not fully understanding all the possible states their code may execute in. In a multithreaded environment, the lack of understanding and the resulting problems are greatly amplified, to the point of panic if you're actually paying attention.
The point here is that in the real world, it's not just your coffee cup. You're operating in a universe of coffee cups. And if you zoom out, there are many, many more coffee cups around that. If anyone can change your state because they have a reference to it, it becomes almost impossible to reason about. Under these conditions, it's almost impossible to understand what is actually happening and how to make things truly deterministic. This is one of the beliefs that functional programming truly taught me: the belief that uncontrolled state mutation is at the very limits of what humans can reasonably understand and be able to test and run in production.
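The coffee cup problem can be shown in a few lines. This is only an illustrative sketch: Python is used for convenience, and the names `MutableCup`, `FrozenCup`, `sip_in_place`, and `sip` are hypothetical, not something shown in the talk. The first function changes the caller's object through a shared reference; the second returns a new value and leaves the original alone.

```python
from dataclasses import dataclass, replace


class MutableCup:
    """A cup whose contents anyone holding a reference can change."""

    def __init__(self, ounces: int) -> None:
        self.ounces = ounces


def sip_in_place(cup: MutableCup) -> None:
    # Mutates the caller's object: the change leaks out of this function.
    cup.ounces -= 2


@dataclass(frozen=True)
class FrozenCup:
    """An immutable cup: 'changing' it means building a new value."""

    ounces: int


def sip(cup: FrozenCup) -> FrozenCup:
    # Returns a new value; the caller's original is untouched.
    return replace(cup, ounces=cup.ounces - 2)


mine = MutableCup(8)
sip_in_place(mine)            # someone else's code ran...
mutable_after = mine.ounces   # ...and my cup changed underneath me

original = FrozenCup(8)
sipped = sip(original)        # original still holds 8 ounces
```

With the frozen version, any code that wants a different cup has to produce one explicitly, so there is exactly one place to look when the contents are not what you expected.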
And so programming languages that pioneered functional programming techniques, such as Haskell, OCaml, Clojure, Scala, Erlang, Elm, Elixir, and ReasonML, are becoming increasingly popular. And what I find so exciting is that these concepts are now showing up in infrastructure as well. Docker is immutable, right? You can't change containers; if you really want to make a change that persists, you have to make a new container. Kubernetes uses this concept, not in the small but in the large, for systems of systems. If you see Apache Kafka, chances are they're using it for an immutable data model that says you're not allowed to rewrite the past. It turns out version control is immutable, right? You get yelled at if you actually rewrite history. So I'm going to introduce the next speaker, who is Scott Havens.
As we were talking about this slide, he said: everyone knows now, as Dr. Dijkstra said, that goto statements are considered harmful to program flow. And he said that, without a doubt, uncontrolled state mutation will surely within our generation be considered the next goto. So one is for code, one is for data. The next speaker is Scott Havens. Until very recently, he was Director of Software Engineering at Jet.com and Walmart Labs. His remit was to rebuild the entire inventory management systems at Walmart, the world's largest company. He earned this right by the amazing work he did building the incredible systems that power Jet.com, a company that Walmart then acquired. It powered the inventory management systems, order management, transportation, available to promise, available to ship, and tons of other critical processes that must all operate correctly to compete effectively as an online retailer. He is now Senior Director, Head of Supply Chain Technology at Moda Operandi, an upscale fashion retailer. And I hope what he presents will blow your mind as it blew my mind, showing that functional programming principles apply not just in the small, in a program, but can be applied at the most vast scales, such as the Walmart enterprise. With that, Scott Havens.
Good morning, DevOps Enterprise Summit. I'm really excited to be here today and talk about something that's really near and dear to my heart. My name is Scott Havens. I'm a Senior Director and Head of Supply Chain Tech at Moda Operandi. It's a fashion e-commerce company that was founded in 2010. Our mission is to make it easy for fashion designers to grow their business and for consumers to recognize their personal style. I joined Moda because fashion supply chains are notoriously challenging, and I'm excited about how we can use technology to improve time to market, lower costs, and even help designers predict next season's fashion trends before the season starts. However, I just joined two weeks ago, so I'm going to supplement a lot of my discussion today with my experiences prior to Moda. Before Moda Operandi, I was an architect at Walmart, the largest company in the world by revenue, over half a trillion dollars a year.
And by number of employees: 2.3 million, to be precise. I was responsible for designing and building supply chain systems like inventory management for Walmart, including the 4,500 stores in the U.S., e-commerce like walmart.com, owned brands like jet.com, and international markets. I joined Walmart via the acquisition of Jet.com three years ago for $3.3 billion, at the time the largest e-commerce acquisition to date. One of the reasons that Walmart had bought Jet is because the Jet tech stack looked transformative. It was cloud native, microservice-based, event sourced, and fundamentally it was based on functional programming principles. It looked cool, but not everyone is convinced by just cool. They didn't know if Jet's techniques were just the latest buzzwords or if they provided real-world benefits. Well, it wasn't long before we were fortunate enough to get the chance to demonstrate these benefits. And when I say fortunate, what I really mean is that disaster struck. About three years ago, in the middle of the night, I got paged for a system alert.
I woke up, hopped on the phone bridge and our PagerDuty Slack channel, and started looking into it. Almost immediately, I was joined by coworkers from several other teams. It turned out our production Kafka cluster was down. If you're not familiar with Kafka, it's a very scalable pub/sub messaging system. We use it as the primary method of communication among all of our backend services. Before too long, we realized that the cluster wasn't just down, it was dead. It was an ex-Kafka cluster. Every single message in flight was gone. Customer orders, replenishment requests, catalog changes, inventory updates, warehouse replenishment notifications, pricing updates: every single one just gone. We were going to have to rebuild the cluster from the ground up. Now, this could have been catastrophic. This could have been the end of the grand Jet experiment, enough to convince our new Walmart compatriots that Jet's technical tenets sound good on paper but don't work in a real enterprise compared to tried-and-true systems.
So what happened? Well, first we rebuilt the cluster. New brokers were deployed in minutes via Ansible scripts. While this was happening, we coordinated with all the teams who manage the edge systems, the systems that are exposed to the outside world, like merchant API inputs and customer order inputs. These edge systems, like all the others, are event sourced. Each of these teams reset the checkpoints in their event streams to a point in time just prior to the outage. All of the events after that point were re-emitted to all the downstream consumers. When these checkpoints were set back, there was some overlap with messages that had already been sent and processed downstream, but the downstream systems were all designed, and had been fully tested, to handle duplicates and act idempotently, even when messages were out of order. These downstream systems were hit by a flood of messages, but we were able to scale them out in seconds, some automatically, some manually, to handle the throughput and stay entirely within our SLAs.
In the end, this potentially catastrophic event was little more than a minor annoyance. No data was lost, and not a single customer order was delayed. Walmart was happy that their $3 billion wasn't wasted on worthless tech. And it afforded us, coming from the Jet side, the opportunity to examine the rest of the Walmart technology ecosystem and see where we could provide value. Now, Jet, as a startup, had the advantage of being completely greenfield and focused in their business. Walmart, on the other hand, had built their incredibly successful and wide-ranging business over many decades, requiring a number of different stacks and technologies. What we found was an organization and an architecture of enormous complexity and cost. Now, I'm not going to attempt to capture the entire mammoth business that is Walmart, or even any other e-commerce company, here. Instead, I'm going to dig into just one common, small piece of e-commerce website functionality.
Our customer Jane wants to buy a cocktail dress for an upcoming party. She wants to know if it's available in her size. It doesn't have to be in a store or a warehouse nearby. It can be anywhere, as long as it can be shipped to her. This item availability is served via an API. When she checks her favorite e-commerce site, it can't be down or take too long to load, because competitors' websites are only a click away. So our item availability API has an SLA of, let's say, 99.98% uptime (that's just shy of two hours a year of permissible downtime) at, say, 300 milliseconds latency.
What factors will go into this item availability? The first ones that may come to mind are the inventory in the warehouse and any reservations that may exist from existing orders, but there's a lot more to it than that. In addition to the warehouse inventory, you might have the store inventory on the floor or the inventory in the back room of the stores. Now, if you are a marketplace, you might have a lot of third parties; it could be thousands of different third parties' inventory. You have to look at the item and see if you're even eligible to sell it on this site. Just because it's sitting in a warehouse doesn't mean that you're permitted to sell it. And there is warehouse eligibility: perhaps a certain warehouse that has your item isn't permitted to ship to a certain area, or is not allowed to sell on a particular website.
There are sales caps, where you might have limits for a particular timeframe on how many you're permitted to sell, like maybe a cap of a thousand during some kind of discounted special. And there are back orders from all the orders that already exist that weren't able to be filled originally, if the customer still wants them. And for every single one of these factors, at a large enough organization, there are going to be legacy systems that have duplicates of all this information that you need to consider as well. So how do we add all of these things together to give Jane her answer? A common model is via service-oriented architecture, or SOA, in which we decompose each of these factors into a service. You call each of these services on demand, in real time, to get the information you need. What does that look like here? Well, now I have the pleasure of showing you one of the ugliest diagrams I have ever made, and don't worry.
I don't expect you to memorize this or even be able to read it. The complexity is the point. You can still see, at the top, the website calling the item availability API. Each of the item availability factors that I listed is represented somewhere on here by a service, which may depend on other services. To give you a sense of scope, each one of these boxes is a whole system or multiple systems, each maintained by one or more whole teams. So let's walk through what happens when Jane looks for her dress. At the top, highlighted in red, the customer-facing website calls the item availability API. That general API calls the global item availability API, which checks its cache, doesn't find it, and falls back to other services, which call other services, and more, until we can finally compute the answer for Jane. So let me save you some time on the math: to get the dress availability in under 300 milliseconds 99.98% of the time requires 23 service calls.
Each of those calls has five nines of uptime and a 50 millisecond marginal service level objective. Without every single one of these services working correctly, it is impossible to know if an item is available. You're better off not even guessing with partial information; that's better than risking telling the customer the wrong answer. To be blunt, an outage in any one of these services takes down the entire availability API. Because each of these systems has business logic that is so tightly coupled to so many other systems, it's extremely difficult to properly test them. Unit testing covers a tiny fraction of the space of potential errors, and relying on integration tests to fully vet something this complex is absurdly costly and absurdly ineffective. Further, each of these systems was fundamentally designed internally in a traditional manner. As changes happen, the current state, usually stored in a relational database, is mutated in place.
And there is an expectation, correct or not, that servers are reliable and will only be shut down and restarted with permission. And we all know how well that works. So how do we go about tackling these problems? Can we take what we learned at Jet and extract lessons further? These problems probably aren't unique. Can we ensure that these lessons are broadly useful to anyone or any company that might suffer similar problems? The Jet.com way of approaching these problems was to look at them through the lens of functional programming. So let's walk through these principles and learn what the implications are for system design. There are many principles, but I'm going to focus on just a handful today. Let's start with immutability: the idea that the inputs don't change, and functions that take these inputs produce outputs that are also immutable. State is not directly mutated. We embrace purity. We avoid writing functions that produce side effects: no writing to disk or network until the last possible moment, and we strictly control those side effects when we do. This makes it easier to reason about the code and test the code. The external world outside the function can't affect the results, and the function won't affect the external world. This makes the function very predictable and repeatable. Given a given input, the output will be the same every time.
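A pure piece of business logic in this style might look like the following. This is a minimal sketch under stated assumptions: the `Inventory`, `Reserved`, and `Rejected` types and the `reserve` function are hypothetical names invented for illustration, not code from Jet or Walmart. The function takes all the state it needs as arguments, performs no I/O, and returns a new state plus a description of what happened.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Inventory:
    sku: str
    on_hand: int


@dataclass(frozen=True)
class Reserved:
    sku: str
    quantity: int


@dataclass(frozen=True)
class Rejected:
    sku: str
    quantity: int
    reason: str


Outcome = Union[Reserved, Rejected]


def reserve(inventory: Inventory, quantity: int) -> Tuple[Inventory, Outcome]:
    """Pure: the result depends only on the arguments; nothing is mutated."""
    if quantity <= 0:
        return inventory, Rejected(inventory.sku, quantity, "quantity must be positive")
    if quantity > inventory.on_hand:
        return inventory, Rejected(inventory.sku, quantity, "insufficient inventory")
    remaining = Inventory(inventory.sku, inventory.on_hand - quantity)
    return remaining, Reserved(inventory.sku, quantity)


before = Inventory("dress-m", 3)
after, outcome = reserve(before, 2)   # `before` is unchanged by the call
```

Whatever reads the inventory from storage and writes the outcome back would sit outside this function, which is what keeps the logic itself repeatable.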
And that repeatability unlocks a principle called the duality of code and data. It's a fancy way of saying that the code and the data are interchangeable. A function that accepts parameter A and computes output B could be replaced with a lookup table with a key of A and a value of B. Conversely, a really big lookup table that takes up gigabytes of space and maps A to B could be compressed into a function that computes B for me. You can go back and forth between the two. Gene did a great job introducing some of these principles and showing how they work in the small, when you're writing code. We took these same principles and applied them in the large, changing how we design systems and systems of systems. Let's walk through some of these results.
Starting with immutability, you get message-based, log-driven communication. The first part of this, the message-based part, is pretty ubiquitous. Systems communicate with each other via messages, synchronously over HTTP or asynchronously over some kind of queue. Log-based pub/sub systems like Kafka, AWS Kinesis, and Azure Event Hubs take this a step further. Not only do the messages themselves not change, but they're ordered and retained for an extended period of time, even after you've consumed the message. The consuming services keep track of their own progress via checkpoints into the log. So what does this mean? Imagine you suffer an outage that causes you to lose the last day's worth of transactions, or even worse, you've introduced a bug in your code that corrupts data. You can deploy your fix and reset the checkpoint to the point in time before the bug was introduced. This will force your consumer to replay all of the subsequent messages, reconsidering them with corrected code and fixing your corrupt data. This approach drastically improves your mean time to recovery on an entire category of production errors. At Walmart, we replaced HTTP calls, queues, and even enterprise service buses with Kafka.
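The checkpoint-and-replay recovery described above can be sketched with an in-memory stand-in for the log. Everything here is an assumption made for illustration: `Log`, `Consumer`, and `InventoryUpdate` are hypothetical and do not use the Kafka or Kinesis client APIs, and the per-SKU `version` field is an assumed way of making the consumer idempotent and tolerant of out-of-order messages.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class InventoryUpdate:
    sku: str
    version: int   # per-SKU sequence number set by the producer (assumed)
    on_hand: int   # absolute count, not a delta, so re-applying is harmless


class Log:
    """Append-only and retained: reading never removes a message."""

    def __init__(self) -> None:
        self._entries: List[InventoryUpdate] = []

    def append(self, message: InventoryUpdate) -> None:
        self._entries.append(message)

    def read_from(self, offset: int) -> List[InventoryUpdate]:
        return self._entries[offset:]


Handler = Callable[[InventoryUpdate], int]


class Consumer:
    def __init__(self, log: Log, handler: Handler) -> None:
        self.log = log
        self.handler = handler
        self.checkpoint = 0                           # next offset to read
        self.state: Dict[str, Tuple[int, int]] = {}   # sku -> (version, count)

    def poll(self) -> None:
        for message in self.log.read_from(self.checkpoint):
            seen_version = self.state.get(message.sku, (-1, 0))[0]
            # Skip messages older than what we hold; re-applying the same
            # version gives the same state, so duplicates are harmless.
            if message.version >= seen_version:
                self.state[message.sku] = (message.version, self.handler(message))
            self.checkpoint += 1


def buggy_handler(message: InventoryUpdate) -> int:
    return message.on_hand * 10   # the bug that corrupts stored counts


def fixed_handler(message: InventoryUpdate) -> int:
    return message.on_hand


log = Log()
for update in (
    InventoryUpdate("dress-m", 1, 5),
    InventoryUpdate("dress-s", 1, 2),
    InventoryUpdate("dress-m", 2, 4),
):
    log.append(update)

consumer = Consumer(log, buggy_handler)
consumer.poll()
corrupted = dict(consumer.state)

consumer.handler = fixed_handler   # deploy the fix
consumer.checkpoint = 0            # reset to before the bug
consumer.poll()                    # replay rebuilds correct state
repaired = dict(consumer.state)
```

Because the log keeps every message and the consumer only tracks an offset, recovery is a matter of moving that offset back; no upstream system has to resend anything.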
Now, at Moda, we're using AWS Kinesis to the same end. With immutability, you also get event sourcing. Events are facts about something that happened in the world. Once an event occurs, it will always have occurred. It doesn't change because, by definition, it's already happened. In an event-sourced system, events are first-class citizens. The canonical data store consists of ordered streams of events. The current state is secondary, a consequence of the events. You use the stream of events to build the current state by aggregating over all of them. Bank accounts are an obvious example of this approach: your account balance, your current state, is the result of summing over every deposit and withdrawal that has ever happened. Event sourcing, storing the events this way, is extremely powerful. It effectively gives you a time machine. You can see the state for any point in time, and you can walk step by step through everything that's ever happened.
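The bank account example translates directly into a fold over an event stream. The sketch below is illustrative only: the event types `Deposited` and `Withdrew`, the integer timestamps, and the amounts in cents are all assumptions. The balance is never stored as the primary record; it is derived from the events, and restricting the fold to events up to a given time gives the state as of that time.

```python
from dataclasses import dataclass
from functools import reduce
from itertools import accumulate
from typing import Sequence, Union


@dataclass(frozen=True)
class Deposited:
    at: int       # timestamp (hypothetical integer clock)
    amount: int   # cents


@dataclass(frozen=True)
class Withdrew:
    at: int
    amount: int


Event = Union[Deposited, Withdrew]


def apply_event(balance: int, event: Event) -> int:
    """Pure step function: previous state plus one event gives the next state."""
    if isinstance(event, Deposited):
        return balance + event.amount
    return balance - event.amount


def balance_at(events: Sequence[Event], as_of: int) -> int:
    """State at a point in time: fold over only the events up to `as_of`."""
    return reduce(apply_event, (e for e in events if e.at <= as_of), 0)


EVENTS = (Deposited(1, 10_000), Withdrew(2, 2_500), Deposited(3, 500))

current = balance_at(EVENTS, as_of=3)
yesterday = balance_at(EVENTS, as_of=2)
# Every intermediate state, one per event, for step-by-step inspection.
history = list(accumulate(EVENTS, apply_event, initial=0))[1:]
```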
This time-travel ability is fantastic for troubleshooting. You can validate behaviors that people are observing. People may report that they saw a problem at a specific time. It could be days, weeks, even months after the fact, and we can go back in time, re-observe it, and perform a root cause analysis. Further, event sourcing unlocks entire new areas of analytics. We've found that our marketing teams love having this kind of data about everything that's happened over time, and our operations and audit people love knowing exactly everything that's ever happened. Now, purity. Our goal of purity means that we isolate computations from the real world. We write all business logic as stateless functions with zero external dependencies, and that means zero I/O. Instead, collect all the state you need up front and pass it into your business logic as parameters. That statelessness, that isolation of the computation, gives it predictability and it gives it atomicity.
There are no random outcomes, so-called heisenbugs, and there are no partial results. Real-world failures (and in the cloud, you are constantly dealing with real-world failures) may keep your code from running, but they will never affect correctness or consistency. Now, because the business logic doesn't have side effects, because it's pure, 100% of the domain logic is unit testable. You can provably identify every single path through the business code and write unit tests for it. Not just write unit tests, but create executable specifications. You can define invariants from your specifications explicitly as properties. For example, we say that inventory counts should never be negative. That is an invariant. These properties can be checked automatically with large numbers of randomized inputs, extremely quickly. Spec-based and property-based testing frameworks that do this are available for most languages.
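The "inventory counts should never be negative" invariant can be checked against randomized inputs even without a framework. The following is a hand-rolled stand-in, not a real property-based testing library (Hypothesis is one such library for Python, and it would also shrink failing inputs to a minimal case). The `Stock` type, the two `reserve` variants, and the value ranges are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Stock:
    on_hand: int


def reserve(stock: Stock, quantity: int) -> Stock:
    """Pure reservation: refuses anything it cannot fill."""
    if 0 < quantity <= stock.on_hand:
        return Stock(stock.on_hand - quantity)
    return stock


def careless_reserve(stock: Stock, quantity: int) -> Stock:
    """Deliberately wrong variant that never checks availability."""
    return Stock(stock.on_hand - quantity)


def never_negative(
    reserve_fn: Callable[[Stock, int], Stock], trials: int = 2000, seed: int = 0
) -> bool:
    """Property: no sequence of reservations drives the count below zero."""
    rng = random.Random(seed)   # seeded so a failure is reproducible
    for _ in range(trials):
        stock = Stock(rng.randint(0, 50))
        for _ in range(rng.randint(0, 20)):
            stock = reserve_fn(stock, rng.randint(-5, 60))
            if stock.on_hand < 0:
                return False
    return True


correct_holds = never_negative(reserve)
careless_holds = never_negative(careless_reserve)
```

The check is only this simple because `reserve` is stateless: each trial needs nothing but a starting value and a list of random quantities.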
To work well, these frameworks depend on your code being stateless. And if you do this well, integration tests are only needed for establishing basic connectivity between services. You can test much more thoroughly in less time and for less cost. Now, you can't remain pure forever. Once your business logic is complete and you have a result, you have to do something with it. But don't make any more changes than you absolutely have to in this process: write to one, and only one, place. You may be tempted (and this happens all the time) in the same process to write to a database and then notify a downstream consumer about that change, maybe via a queue. Don't. This is called a dual write. In a distributed environment like the cloud, failures can and will happen at any point. As soon as one of those writes succeeds and the other fails, your system is now in an inconsistent state. Dual writes take all the hard work you did to get guaranteed outcomes and toss it out the window.
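The dual-write failure is easy to reproduce with two simulated stores that can fail independently. This sketch is purely illustrative: `Database`, `Queue`, and `WriteFailed` are hypothetical in-memory stand-ins, not real client libraries. It shows that neither ordering of the two writes avoids the inconsistency.

```python
from typing import Callable, Dict, List, Tuple


class WriteFailed(Exception):
    """Raised when a simulated store is unavailable."""


class Database:
    def __init__(self, available: bool = True) -> None:
        self.available = available
        self.rows: Dict[str, int] = {}

    def write(self, sku: str, on_hand: int) -> None:
        if not self.available:
            raise WriteFailed("database unavailable")
        self.rows[sku] = on_hand


class Queue:
    def __init__(self, available: bool = True) -> None:
        self.available = available
        self.messages: List[Tuple[str, int]] = []

    def publish(self, sku: str, on_hand: int) -> None:
        if not self.available:
            raise WriteFailed("queue unavailable")
        self.messages.append((sku, on_hand))


def db_then_queue(db: Database, queue: Queue, sku: str, on_hand: int) -> None:
    db.write(sku, on_hand)
    queue.publish(sku, on_hand)   # if this fails, consumers never hear of the change


def queue_then_db(db: Database, queue: Queue, sku: str, on_hand: int) -> None:
    queue.publish(sku, on_hand)
    db.write(sku, on_hand)        # if this fails, consumers heard of a change that never happened


def attempt(operation: Callable[[Database, Queue, str, int], None],
            db: Database, queue: Queue) -> bool:
    try:
        operation(db, queue, "dress-m", 4)
        return True
    except WriteFailed:
        return False


db_a, queue_a = Database(), Queue(available=False)
ok_a = attempt(db_then_queue, db_a, queue_a)

db_b, queue_b = Database(available=False), Queue()
ok_b = attempt(queue_then_db, db_b, queue_b)
```

In both runs the caller sees a failure, yet one of the two stores has already accepted the change, so a simple retry or rollback in the caller cannot be relied on to restore agreement.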
Instead, the safe way to accomplish this is via change data capture. The result is that an event is published downstream if, and only if, it's been committed to the database. This ensures eventual consistency in failure scenarios. You may fall behind publishing, but you'll never lose events. You'll never lose events in your data store, and you'll never lose telling your downstream consumer. Different databases support this in different ways. Walmart now uses the Azure Cosmos DB change feed for this, and at Moda we use Kinesis streams. So by applying these principles, we've established a pattern for designing systems that looks like this. We receive immutable messages over Kafka that are consumed by a microservice running stateless domain logic that emits immutable events into data streams. The events are then published downstream to any consumers over Kafka again. But we're not done yet. When we're employing immutability and purity, we can take advantage of the third principle and replace real-time compute with data lookup wherever feasible. When you know the set of possible inputs in advance, or you've seen specific inputs before, you can replace the often expensive runtime computation with a pre-computed cache of the result.
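Both cases mentioned here, inputs known in advance and inputs seen before, can be shown with a small pure function. This is a minimal sketch: `shipping_days` and its formula are entirely made up for illustration and stand in for any expensive, deterministic computation.

```python
from functools import lru_cache


def shipping_days(zone: int) -> int:
    """Hypothetical pure computation; imagine it being expensive."""
    return 1 + (zone * 3) // 2


# Case 1: the inputs are known in advance, so compute every answer once
# and serve lookups from the table.
KNOWN_ZONES = range(1, 9)
SHIPPING_DAYS_TABLE = {zone: shipping_days(zone) for zone in KNOWN_ZONES}


# Case 2: the inputs are not known in advance, so remember each answer
# the first time it is computed.
@lru_cache(maxsize=None)
def shipping_days_cached(zone: int) -> int:
    return shipping_days(zone)
```

Either replacement is only safe because the function is pure: if its result depended on anything other than `zone`, the stored answers could silently go stale.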
For instance, in event sourcing, if you sum over the first thousand events in a stream more than once, you'll get the same result every time. Particularly for long-running streams that are millions of events long, it makes sense to save a snapshot and use that as your starting point next time, instead of retrieving and summing the entire stream. This costs you a very small amount of storage for the snapshot. And congratulations: you've just exchanged a computation for data. That gives us a final pattern for system design that looks like this. What has changed from the previous diagram is that we've added a service that consumes the events from the Kafka feed, builds updated stream snapshots, and then updates the cache. Further, we're publishing all of those snapshots via change feed to Kafka as well. Downstream consumers will have a choice.
They can consume all of the events as they happen, or if they only care about the latest state, they can consume that feed instead. One of the first teams to use this pattern at Walmart was called Panther. Panther is an inventory tracking and reservation management system. On the supply side, it aggregates and tracks all sources of inventory. This includes the Walmart- and Jet-owned warehouses and all partner merchants and their warehouses. On the demand side, it acts as the source of truth for reservations against the available inventory at those warehouses. When a customer is checking out, the contents of their cart are reserved to make sure that no one else will order them. If there's only one left, that's pretty important. If the inventory is not available at that point, the reservation fails, and the items must either be re-sourced from a different location or different items must be selected.
The primary goals of Panther were to maximize on-site availability while minimizing order reject rates due to lack of inventory. There were a lot of secondary goals as well. We wanted to improve the customer experience by reserving inventory early in the order pipeline. We wanted to enhance insights for the marketing and operations teams by providing more historical data and better analytics. And we wanted to unify inventory management responsibilities typically spread across multiple systems. Of course, along with these business goals, our solution had a lot of non-functional goals, like high availability, geo-redundancy, and fast performance, backed up by SLAs. We found a lot of success with this architecture. The entire team, started by a single engineer in July 2016, had only three team members when Panther went into production by Black Friday that same year. That's only five months later. After one year, the team still only needed five engineers.
Once in production, we found it very easy to add features. With inventory tracking, staleness of data is an issue. Simply put, if a merchant last told us their inventory months ago, there is no way we would trust that. So we wanted to implement a feature that expires the merchant updates after a certain amount of time: just zero it out. The results were immediate. We cut our third-party reject rate in half, from 0.8% to 0.4%. And what's great about this is that it was done by a single engineer who was new to the company, with light F# training and no cloud microservice background, and it went from design to production in three weeks. So with the success of Panther, we started rebuilding a number of our supply chain systems following the same principles and patterns. But we didn't stop there. I want to revisit my ugly diagram from earlier.
This is the one with all the nested synchronous API calls. We looked at this mess of dependent services. There may be dozens of teams and a lot of deployments on heterogeneous stacks on thousands of servers. But as far as the front-end shopping site is concerned, looking up the availability of Jane's dress may as well be a function that calls other functions that call other functions and eventually returns a single end result. This is a call graph. It's code. It's distributed, unreliable, stupidly expensive code, but it's code. And if we remember the duality of code and data, there's a way to exchange that code for data. Maybe that data will be more reliable and less expensive than this monstrosity. It turns out it is. The systems modeled after the Panther architecture all stream events and state changes as messages over Kafka. We can use these message streams to invert the dependencies.
Instead of the dependent service polling its needed inputs in real time, the source system can push the data changes. The dependent service consumes these changes as they happen, updates its own state accordingly, and pushes its own changes downstream. We can convert from a primarily synchronous service-oriented architecture to a primarily event-driven architecture. All of the same item availability factors are represented in this diagram, but now almost all of them are hooked up asynchronously. Messages are flowing in this diagram from left to right. We're trading the real-time computations, the real-time calls, for pre-computed data throughout the supply chain systems. How does that affect the hot path, the moment that Jane looks for her dress? Well, a moment ago, when I showed the SOA model, I highlighted that hot path in red. Let's look carefully to see what it looks like here.
That's it. To achieve the same SLA, we need only two service calls, not 23, both of which need only four nines of uptime, not five, and only a 150 millisecond SLO, not 50. All of the event-driven systems still need uptime and processing time SLAs, but they're no longer in any customer hot paths. They are completely asynchronous. Three nines of uptime and end-to-end processing in seconds or even minutes is sufficient. So how does all of this affect cost? This is going to vary among organizations, but we can ballpark it. First, an event-driven system with three nines of uptime and minute latency is about as cheap to operate as any system we're likely to see. If we increase our uptime by 10x to four nines and drop our latency by 400x, down to 150 milliseconds, for a lot of orgs you're looking at an order of magnitude higher cost. To push your uptime to five nines while tightening the latency even more, for most organizations that is an obscene amount of money. How does the total operational cost compare once you've replaced all of these things with the functional, event-driven approach? Now, I'm not allowed to give you precise numbers, but I can tell you that for walmart.com, this difference is millions of dollars per year. And, uh, my notes just went off the screen. Can we get those back on, please?
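The uptime figures quoted for the two designs can be sanity-checked with a few lines of arithmetic. This sketch assumes that failures of the individual services are independent and that a request succeeds only if every call on its path succeeds; the talk does not state these assumptions explicitly, and the function names are made up here. Latency for the 23-call case is not modeled, because the depth of the call graph is not given.

```python
HOURS_PER_YEAR = 365 * 24


def serial_availability(per_call: float, calls: int) -> float:
    """Availability of a request that needs every one of `calls` independent calls."""
    return per_call ** calls


def downtime_hours_per_year(availability: float) -> float:
    return (1 - availability) * HOURS_PER_YEAR


# Synchronous SOA hot path: 23 calls, each at five nines.
soa = serial_availability(0.99999, 23)

# Event-driven hot path: 2 calls, each at four nines.
event_driven = serial_availability(0.9999, 2)

# Two sequential calls at a 150 ms objective each fit the 300 ms budget.
event_driven_latency_ms = 2 * 150

# Permissible downtime at the 99.98% target.
sla_downtime_hours = downtime_hours_per_year(0.9998)
```

Both paths come out at roughly 99.98%, and the target allows about 1.75 hours of downtime a year, which matches the "just shy of two hours" figure given for the SLA.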
And we're running a little low on time, so I'm going to just walk through this really quick. He found that it turned out to be a lot more practical than he expected. And the last one is: it'll cost too much, it'll take too long, it's too dangerous, and it's just too much. I have this enormous, creaky, baling-wire-and-duct-tape spaghetti code monstrosity. It grew uncontrolled over the years, if not decades, and has dozens, hundreds, thousands of people trying to keep it working. Well, I recognize this is a pretty big shift in mindset, but there's an old joke about this. How do you eat an elephant? The answer is one bite at a time. There are small steps you can take right now to apply these principles, regardless of what your systems look like now.
You can identify just one dual write somewhere in all of your systems and figure out a way to eliminate it. Consider using change data capture to do so. You can encourage property-based testing in just one system. Most of your devs won't find it that different from regular unit testing. And you can switch one web service to also publish events. You don't have to fully commit to event sourcing; just publish your changes as they happen. Then switch one consumer to read the events rather than make HTTP calls at runtime. This is a very easy way to bite off a small piece and ensure the safety of the system while you do it. So if you have an architecture that looks like this, and you don't have someone, an architect, who is talking about how to move to something like this, you're doing your organization a grave injustice. My mission in life is to reduce the amount of entropy in the universe, or at least in our little corner of it. So if you want to help me in this journey, if you want to replicate what we've done, or you have new ideas, here's how to reach me. I'm Scott dot firstname.lastname@example.org, or Scott Havens on Twitter. And I'd be remiss if I didn't say that we are hiring. So thank you very much. Have a great day.