Las Vegas 2019

Fabulous Fortunes, Fewer Failures, and Faster Fixes from Functional Fundamentals - Scott Havens

Learn how real-world enterprises that adopted functional programming principles and adapted them to their line-of-business systems achieved greater resiliency, faster time-to-delivery, and lower total cost of ownership.


Scott is the former Director of Software Engineering for supply chain technology at Jet.com and Walmart Labs. He specializes in data-intensive systems.


Scott Havens

Senior Director, Head of Supply Chain Technology, Moda Operandi

Transcript

00:00:02

Uh, to motivate the next talk, I wanna tell you just a little bit about something that has influenced me a lot. About three years ago I learned a language called Clojure, and it changed my life. It was probably one of the most difficult things I've learned professionally, but it's also been one of the most rewarding. It brought the joy of programming back into my life. For the first time in my career, as I'm nearing 50 years old, I'm finally able to write programs that do what I want them to do, and I'm able to build upon them for years without them falling over like a house of cards, as has been my experience for nearly 30 years. The famous French anthropologist Claude Lévi-Strauss would ask of certain tools, is it good to think with? And for reasons that I will try to explain in the next five minutes, I believe functional programming and things like immutability are truly better tools to think with, and they have taught me how to prevent myself from constantly sabotaging my code, which I've been doing for decades.

00:00:55

Uh, I'm going to make the astonishing claim that these things have eliminated 90% of the errors I used to make. So I'm going to try to motivate why. About a year ago, I found this amazing graphic on Twitter that describes the difference between passing variables by value versus passing variables by reference. When I was in graduate school in 1993, most mainstream languages supported only passing things by value, which meant that if you passed a variable to a function and you changed it within the function, you would only change your local copy. So often this meant that you would have to return the new state, and if this was a structure or a large object, it meant you would have to do a lot of copying. This is tedious, error-prone, and very time consuming.
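As a minimal sketch of that "return the new state" pattern, here is how it looks in F#, the language that comes up later in this talk; the Cup record and drink function are invented for illustration:

```fsharp
// Invented example: with pass-by-value semantics, the callee cannot touch the
// caller's copy; the only way to "change" anything is to return a new value.
type Cup = { Owner: string; FillLevel: float }

let drink (cup: Cup) : Cup =
    { cup with FillLevel = max 0.0 (cup.FillLevel - 0.1) }

let mine = { Owner = "Gene"; FillLevel = 1.0 }
let afterSip = drink mine
// mine.FillLevel is still 1.0; afterSip.FillLevel is 0.9
```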

00:01:42

I often found myself complaining about this, wishing there were a better way. And it turns out you could eliminate this by using pointers, but pointers are now considered so dangerous that few languages besides C, C++, and assembly even let you use them. In 1995, I was introduced to a huge innovation in programming languages called passing variables by reference, which showed up in C++, Java, and Modula-3. It allowed you to change the value that was passed to you as a parameter, and it would change the reference that was passed in from the caller. And this seemed really great. I loved it because it was such a time saver, because it lets you write less code. But three years ago I changed my mind.

00:02:25

So Clojure is one of a category of languages called functional programming languages. Haskell and F# are part of the same family and share the same sensibility. They don't let you change variables, and functions need to be pure: a function always returns the same output given the same inputs, and there are never any side effects. You're not allowed to change the world around you. You're not allowed to write to disk, and even reading from disk is not allowed, because it's not always the same. And this was one of the biggest aha moments of programming for me, because it taught me how terrifying passing variables by reference should be. Because when you see this, what you really should be seeing is this. It's like, why is my coffee cup changing? Who is messing with my coffee cup, and how do I make them stop?
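To make "pure" concrete, here is a minimal F# sketch of the distinction; both functions are invented for illustration:

```fsharp
// Both functions are invented for illustration.

// Pure: same inputs always produce the same output, and nothing outside
// the function is read or changed.
let subtotal (prices: decimal list) (taxRate: decimal) : decimal =
    let sum = List.sum prices
    sum + sum * taxRate

// Impure: the result depends on, and changes, state outside the function.
let mutable runningTotal = 0m
let addToTotal (price: decimal) =
    runningTotal <- runningTotal + price   // side effect: mutates shared state
    runningTotal                           // result depends on call history
```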

00:03:11

The point here is that it's very difficult to understand your code and to reason about what is happening when anyone can change your internal state. You may have heard of Heisenbugs, where even the mere act of observation changes the result. These are the hallmarks of multi-threading errors, which are considered to be among the most difficult problems in distributed systems. I'm fixing my coffee cup and I can't figure out how to get it to fill up again, right? Because I need to replicate the problem. So in the real world, uncontrolled mutation makes things extraordinarily difficult to reason about, because other people can put anything they want in your coffee cup. John Carmack, who wrote Wolfenstein 3D, Doom, and Quake, gave an amazing keynote at QuakeCon in 2013, saying that a large fraction of the flaws in software development are due to programmers not fully understanding all the possible states their code may execute in. In a multi-threaded environment,

00:04:01

the lack of understanding and the resulting problems are greatly amplified, almost to the point of panic if you're actually paying attention. And so the point here is that in the real world, it's not just your coffee cup. You are operating in a universe of coffee cups. And if you zoom out, there are many, many more coffee cups around that. If anyone can change your state because they have a reference to it, it becomes almost impossible to reason about. Under these conditions, it's almost impossible to understand what is actually happening and how to make things truly deterministic. And this is one of the beliefs that functional programming truly taught me: that uncontrolled state mutation is at the very limits of what humans can reasonably understand, test, and run in production. And so programming languages that pioneered functional programming techniques like this, Haskell, OCaml, Clojure, Scala, Erlang, Elixir, Reason <inaudible>, are becoming increasingly popular.

00:04:56

And what I find so exciting is that these concepts are now showing up in infrastructure as well. Docker is immutable, right? You can't change containers; if you really wanna make a change that persists, you have to make a new container. Kubernetes uses this concept not in the small but in the large, for systems of systems. If you see Apache Kafka, chances are it's being used for an immutable data model, one that says you're not allowed to rewrite the past. It turns out version control is immutable, right? You get yelled at if you actually rewrite history. And so, as we were preparing this slide, I was talking with the next speaker, Scott Havens, and he said, everyone knows now, as Dr. Dijkstra said, that go-to statements are considered harmful to program flow.

00:05:42

And he said that it is without a doubt that uncontrolled state mutation will surely, within our generation, be considered the next go-to. One is for code, one is for data. So the next speaker is Scott Havens. Until very recently, he was Director of Software Engineering at Jet.com and Walmart Labs. His remit was to rebuild the entire inventory management systems at Walmart, the world's largest company. He earned this right by the amazing work he did building the incredible systems that powered jet.com, a company that Walmart then acquired. That work powered the inventory management systems, order management, transportation, available-to-promise, available-to-ship, and tons of other critical processes that must all operate correctly to compete effectively as an online retailer. He's now Senior Director, Head of Supply Chain Technology at Moda Operandi, an upscale fashion retailer. And I hope what he presents will blow your mind as it blew mine, showing that functional programming principles apply not just in the small, in a program, but can be applied at the most vast scales, such as the Walmart enterprise. With that, Scott Havens.

00:06:54

Good morning, DevOps Enterprise Summit. I'm really excited to be here today and talk about something that's really near and dear to my heart. My name is Scott Havens. I'm a Senior Director and Head of Supply Chain Tech at Moda Operandi, a fashion e-commerce company that was founded in 2010. Our mission is to make it easy for fashion designers to grow their business and for consumers to recognize their personal style. I joined Moda because fashion supply chains are notoriously challenging, and I'm excited about how we can use technology to improve time to market, lower costs, and even help designers predict next season's fashion trends before the season starts. However, I just joined two weeks ago, so I'm going to supplement a lot of my discussion today with my experiences prior. Before Moda Operandi, I was an architect at Walmart, the largest company in the world by revenue, over half a trillion dollars a year, and by number of employees, 2.3 million to be precise.

00:07:59

I was responsible for designing and building supply chain systems like inventory management for Walmart, including the 4,500 stores in the US, e-commerce sites like walmart.com, owned brands like jet.com, and international markets. I joined Walmart via the acquisition of jet.com three years ago for $3.3 billion, at the time the largest e-commerce acquisition to date. One of the reasons that Walmart bought Jet is that the Jet tech stack looked transformative. It was cloud native, microservice based, event sourced, and fundamentally it was based on functional programming principles. It looked cool, but not everyone is convinced by just cool. They didn't know if Jet's techniques were just the latest buzzwords or if they provided real-world benefits. Well, it wasn't long before we were fortunate enough to get the chance to demonstrate these benefits. And when I say fortunate, what I really mean is that disaster struck. About three years ago, in the middle of the night, I got paged for a system alert.

00:09:07

I woke up, hopped on the phone bridge and our PagerDuty Slack channel, and started looking into it. Almost immediately, I was joined by coworkers from several other teams. It turned out our production Kafka cluster was down. If you're not familiar with Kafka, it's a very scalable pub-sub messaging system; we used it as the primary method of communication among all of our backend services. Before too long, we realized that the cluster wasn't just down, it was dead. It was an ex-Kafka cluster. Every single message in flight was gone. Customer orders, replenishment requests, catalog changes, inventory updates, warehouse replenishment notifications, pricing updates, every single one just gone. We were going to have to rebuild the cluster from the ground up. Now, this could have been catastrophic. This could have been the end of the grand Jet experiment, enough to convince our new Walmart compatriots that Jet's technical tenets sound good on paper but don't work in a real enterprise compared to tried-and-true systems.

00:10:11

So what happened? Well, first we rebuilt the cluster: new brokers deployed in minutes via Ansible scripts. While this was happening, we coordinated with all the teams who managed the edge systems, the systems that are exposed to the outside world, like merchant API inputs and customer order inputs. These edge systems, like all the others, are event sourced. Each of these teams reset the checkpoints in their event streams to a point in time from just prior to the outage. All of the events after that point were re-emitted to all the downstream consumers. When these checkpoints were set back, there was some overlap with messages that had already been sent and processed downstream, but the downstream systems were all designed, and had been fully tested, to handle duplicates and act idempotently, even when messages arrived out of order. These downstream systems were hit by a flood of messages, but we were able to just scale them out in seconds, some automatically, some manually, to handle the throughput and stay entirely within our SLAs.
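A minimal sketch of what "handle duplicates and act idempotently" can mean in code, assuming each message carries a stable identifier; the types here are illustrative, not Jet's actual model:

```fsharp
// Illustrative types: each message carries a stable id, and processing is
// skipped (not failed) when that id has been seen before.
type Message = { Id: System.Guid; Payload: string }

type ConsumerState = { Seen: Set<System.Guid>; Applied: string list }

let handle (state: ConsumerState) (msg: Message) : ConsumerState =
    if state.Seen.Contains msg.Id then
        state   // a duplicate from the checkpoint reset: safe no-op
    else
        { Seen = state.Seen.Add msg.Id
          Applied = msg.Payload :: state.Applied }

// Replaying an overlapping slice of the stream is then harmless:
// List.fold handle initialState (alreadyProcessed @ replayedMessages)
```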

00:11:13

In the end, this potentially catastrophic event was little more than a minor annoyance. No data was lost, and not a single customer order was delayed. Walmart was happy that their $3 billion wasn't wasted on worthless tech, and it afforded us, coming from the Jet side, the opportunity to examine the rest of the Walmart technology ecosystem and see where we could provide value. Now, Jet as a startup had the advantage of being completely greenfield and focused in its business. Walmart, on the other hand, had built its incredibly successful and wide-ranging business over many decades, requiring a number of different stacks and technologies. What we found was an organization and an architecture of enormous complexity and cost. Now, I'm not going to attempt to capture the entire mammoth business that is Walmart, or even any other e-commerce company, here. Instead, I'm gonna dig into just one common small piece of e-commerce website functionality.

00:12:18

Our customer, Jane, wants to buy a cocktail dress for an upcoming party. She wants to know if it's available in her size. It doesn't have to be in a store or a warehouse nearby; it can be anywhere, as long as it can be shipped to her. This item availability is served via an API. When she checks her favorite e-commerce site, it can't be down or take too long to load, 'cause competitors' websites are only a click away. So our item availability API has an SLA of, let's say, 99.98% uptime, that's just shy of two hours a year of permissible downtime, at, say, 300 milliseconds latency. So what factors go into this item availability? The first ones that may come to mind are the inventory in the warehouse and any reservations that may exist from existing orders, but there's a lot more to it than that.

00:13:13

In addition to the warehouse inventory, you might have the store inventory on the floor, or the inventory in the back room of the stores. If you are a marketplace, you might have third-party inventory, possibly from thousands of different third parties. You have to look at the item and see if you're even eligible to sell it on this site; just because it's sitting in a warehouse doesn't mean that you're permitted to sell it. And there is warehouse eligibility: perhaps a certain warehouse that has your item isn't permitted to ship to a certain area, or is not allowed to sell on a particular website. There are sales caps, where you might have limits for a particular timeframe on how many you're permitted to sell, like maybe a cap of a thousand during some kind of discounted special.

00:13:57

And there are back orders, from all the orders that already exist that weren't able to be filled originally, but the customer still wants them. And for every single one of these factors, at a large enough organization, there are going to be legacy systems that have duplicates of all this information that you need to consider as well. So how do we add all of these things together to give Jane her answer? A common model is via service-oriented architecture, or SOA, in which we decompose each of these factors into a service. You call each of these services on demand, in real time, to get the information you need. What does that look like here? Well, now I have the pleasure of showing you one of the ugliest diagrams I have ever made. And don't worry, I don't expect you to memorize this or even be able to read it.

00:14:45

The complexity is the point. You can still see at the top the website calling the item availability API. Each of the item availability factors that I listed is represented somewhere on here by a service, which may depend on other services. To give you a sense of scope, each one of these boxes is a whole system, or multiple systems, each maintained by one or more whole teams. So let's walk through what happens when Jane looks for her dress. At the top, highlighted in red, the customer-facing website calls the item availability API. That general API calls the global item availability API, which checks its cache, doesn't find it, and falls back to other services, which call other services, and more, until we can finally compute the answer for Jane. So let me save you some time on the math: to get the dress availability in under 300 milliseconds 99.98% of the time requires 23 service calls, each of which needs five nines of uptime and a 50 millisecond marginal service level objective. Without every single one of these services working correctly, it is impossible to know if an item is available. You're better off not even guessing with partial information; that's better than risking telling the customer the wrong answer. To be blunt, an outage in any one of these services takes down the entire availability API.
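The arithmetic behind that claim is worth seeing once: the availabilities of serial dependencies multiply, so a quick back-of-the-envelope check in F#, using the numbers from the talk, looks like this:

```fsharp
// Serial dependencies multiply: the hot path works only if all 23 calls do.
let fiveNines = 0.99999
let combined = fiveNines ** 23.0   // ≈ 0.99977, i.e. right at ~99.98% overall
// With merely four nines per call: 0.9999 ** 23.0 ≈ 0.9977, which already
// blows the 99.98% SLA.
```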

00:16:22

Because each of these systems has business logic that is so tightly coupled to so many other systems, it's extremely difficult to properly test them. Unit testing covers a tiny fraction of the space of potential errors, and relying on integration tests to fully vet something this complex is absurdly costly and absurdly ineffective. Further, each of these systems was fundamentally designed internally in a traditional manner: as changes happen, the current state, usually stored in a relational database, is mutated in place, and there's an expectation, correct or not, that servers are reliable and will only be shut down or restarted with permission. And we all know how well that works. So how do we go about tackling these problems? Can we take what we learned at Jet and extract lessons further? These problems probably aren't unique. Can we ensure that these lessons are broadly useful to anyone or any company that might suffer similar problems?

00:17:21

The jet.com way of approaching these problems was to look at them through the lens of functional programming. So let's walk through these principles and learn what the implications are for system design. There are many principles, but I'm gonna focus on just a handful today. I'm gonna start with immutability: the idea that the inputs don't change, and functions that take these inputs produce outputs that are also immutable. State is not directly mutated. We embrace purity: we avoid writing functions that produce side effects, no writing to disk or network until the last possible moment, and we strictly control those side effects when we do. This makes it easier to reason about the code and test the code. The external world outside the function can't affect the results, and the function won't affect the external world. This makes the function very predictable and repeatable: given the same input, the output will be the same every time. And that repeatability unlocks a principle called the duality of code and data.

00:18:29

It's a fancy way of saying that the code and the data are interchangeable. A function that accepts parameter A and computes output B could be replaced with a lookup table with a key of A and a value of B. Conversely, a really big lookup table that maps A to B, taking up gigabytes of space, could be compressed into a function that computes B from A. You can go back and forth between the two. Gene did a great job introducing some of these principles and showing how they work in the small, when you're writing code. We took these same principles and applied them in the large, changing how we design systems and systems of systems. Let's walk through some of these results.
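A minimal sketch of that exchange in F#: memoization wraps a pure function in exactly the kind of lookup table described here (the availability function is a hypothetical stand-in):

```fsharp
open System.Collections.Generic

// Exchanging code for data: wrap a pure function in a lookup table.
let memoize (f: 'a -> 'b) =
    let table = Dictionary<'a, 'b>()
    fun key ->
        match table.TryGetValue key with
        | true, value -> value          // data: the answer is already known
        | false, _ ->
            let value = f key           // code: compute it once
            table.[key] <- value
            value

// Hypothetical stand-in for an expensive availability computation.
let slowAvailability (sku: string) = sku.Length > 0
let fastAvailability = memoize slowAvailability
```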

00:19:13

Starting with immutability, you get message-based, log-driven communication. The first part of this, the message-based part, is pretty ubiquitous: systems communicate with each other via messages, synchronously over HTTP or asynchronously over some kind of queue. Log-based pub-sub systems like Kafka, AWS Kinesis, and Azure Event Hubs take this a step further. Not only do the messages themselves not change, but they're ordered and retained for an extended period of time, even after you've consumed them. The consuming services keep track of their own progress via checkpoints into the log. So what does this mean? Imagine you suffer an outage that causes you to lose the last day's worth of transactions. Or even worse, you've introduced a bug in your code that corrupts data. You can deploy your fix and reset the checkpoint to the point in time before the bug was introduced.

00:20:10

This will force your consumer to replay all of the subsequent messages, re-consuming them with corrected code and fixing your corrupt data. This approach drastically improves your mean time to recovery on an entire category of production errors. At Walmart, we replaced HTTP calls, queues, and even enterprise service buses with Kafka. At Moda, we're using AWS Kinesis for the same end. With immutability, you also get event sourcing. Events are facts about something that happened in the world. Once an event occurs, it always will have occurred. It doesn't change, because by definition it's already happened. In an event-sourced system, events are first-class citizens. The canonical data store consists of ordered streams of events. The current state is secondary, a consequence of the events. You use the stream of events to build the current state by aggregating over all of them. Bank accounts are an obvious example of this approach.

00:21:15

Your account balance, your current state, is the result of summing over every deposit and withdrawal that has ever happened. Event sourcing, storing the events this way, is extremely powerful. It effectively gives you a time machine: you can see the state for any point in time, and you can walk step by step through everything that's ever happened. This is fantastic for troubleshooting. You can validate behaviors that people are observing. People may report that they saw a problem at a specific time, which could be days, weeks, even months after the fact, and we can go back in time, replay it, and perform a root cause analysis. Further, event sourcing unlocks entire new areas of analytics; we've found that our marketing teams love having this kind of data of everything that's happened over time, and our operations and audit people love knowing exactly everything that's ever happened.
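As a minimal sketch, the bank account example in F#: the current state is a fold over the event stream, and replaying a prefix of the stream is the time machine (types invented for illustration):

```fsharp
// Events are facts; current state is a fold over them.
type AccountEvent =
    | Deposited of decimal
    | Withdrew of decimal

let applyEvent (balance: decimal) (event: AccountEvent) =
    match event with
    | Deposited amount -> balance + amount
    | Withdrew amount -> balance - amount

let events = [ Deposited 100m; Withdrew 30m; Deposited 5m ]

// Current state: aggregate over every event that has ever happened.
let balance = List.fold applyEvent 0m events                           // 75m

// The time machine: state as of any earlier point in the stream.
let balanceAfter n = events |> List.take n |> List.fold applyEvent 0m
```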

00:22:13

With purity, our goal is to isolate computations from the real world. We write all business logic as stateless functions with zero external dependencies, and that means zero IO. Instead, collect all the state you need upfront and pass it into your business logic as parameters. That statelessness, that isolation of the computation, gives it predictability and determinism. There are no random outcomes, no so-called Heisenbugs, and there are no partial results. Real-world failures, and in the cloud you are constantly dealing with real-world failures, may keep your code from running, but they will never affect correctness or consistency. Now, because the business logic doesn't have side effects, because the functions are pure, it means that 100% of the domain logic is unit testable. You can provably identify every single path through the business code and write unit tests for it. Not just write unit tests, but create executable specifications. You can define invariants from your specification explicitly as properties. For example, we say that inventory counts should never be negative; that is an invariant. And these properties can be checked automatically with large numbers of randomized inputs, extremely quickly. Spec-based and property-based testing frameworks that do this are available for most languages, but to work well, they depend on your code being stateless. And if you do this well, integration tests are only needed for establishing basic connectivity between services. You can test much more thoroughly in less time and for less cost.
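A sketch of such an executable specification, using FsCheck, a property-based testing library for .NET; the inventory event model is invented for illustration:

```fsharp
open FsCheck   // property-based testing library for .NET

// Invented inventory model for illustration.
type InventoryEvent =
    | Received of int   // units added by a shipment
    | Reserved of int   // units claimed by an order

// The business rule under test: counts are clamped and never go negative.
let applyEvent (count: int) (event: InventoryEvent) =
    match event with
    | Received n -> count + abs n
    | Reserved n -> max 0 (count - abs n)

// The invariant as an executable property: for ANY sequence of events,
// the resulting count is non-negative. FsCheck generates the sequences.
let countNeverNegative (events: InventoryEvent list) =
    List.fold applyEvent 0 events >= 0

Check.Quick countNeverNegative
```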

00:23:56

Now, you can't remain pure forever. Once your business logic is complete and you have a result, you have to do something with it. But don't make any more changes than you absolutely have to in this process: write it to one and only one place. You may be tempted, and this happens all the time, in the same process to write to a database and then notify a downstream consumer about that change, maybe via a queue. Don't. This is called a dual write. In a distributed environment like the cloud, failures can and will happen at any point. As soon as one of those writes succeeds and the other fails, your system is now in an inconsistent state. Dual writes take all the hard work you did to get guaranteed outcomes and toss it out the window. Instead, the safe way to accomplish this is via change data capture. The result is that an event is published downstream if and only if it's been committed to the database.
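A sketch of the shape of the fix, with hypothetical stand-ins (readChangeFeed, publishToKafka) for your database and messaging clients; the point is that publication reads from the database's committed change feed rather than being a second, independent write:

```fsharp
// Anti-pattern (dual write): two independent effects that can half-succeed.
//   do! saveToDatabase doc
//   do! publishToKafka doc   // if this fails after the save, downstream never hears
//
// Safer shape: one write; publication is driven off the database's committed
// change feed, so an event goes out if and only if the write committed.
// readChangeFeed and publishToKafka are hypothetical stand-ins here.
let publishLoop (readChangeFeed: unit -> Async<seq<'doc>>)
                (publishToKafka: 'doc -> Async<unit>) =
    async {
        while true do
            let! changes = readChangeFeed ()   // committed documents only
            for change in changes do
                do! publishToKafka change      // at-least-once; consumers deduplicate
    }
```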

00:24:58

This ensures eventual consistency in failure scenarios: you may fall behind publishing, but you'll never lose events. You'll never lose events in your data store, and you'll never fail to tell your downstream consumer. Different databases will support this in different ways. Walmart now uses the Azure Cosmos DB change feed for this, and at Moda we use Kinesis streams. So, by applying these principles, we've established a pattern for designing systems that looks like this: we receive immutable messages over Kafka that are consumed by a microservice running stateless domain logic, which emits immutable events into data streams. The events are then published downstream to any consumers, over Kafka again. But we're not done yet. When we're employing immutability and purity, we can take advantage of the third principle and replace real-time compute with data lookup wherever feasible. When you know the set of possible inputs in advance, or you've seen specific inputs before, you can replace the often expensive runtime computation with a pre-computed cache of the result.

00:26:13

For instance, in event sourcing, if you try summing over the first thousand events in a stream more than once, you'll get the same result every time. Particularly for long-running streams that are millions of events long, it makes sense to save a snapshot and use that as your starting point next time, instead of retrieving and summing the entire stream. This costs you a very small amount of storage for the snapshot, and congratulations, you've just exchanged a computation for data. And that gives us a final pattern for system design that looks like this. What has changed from the previous diagram is that we've added a service that consumes the events from the Kafka feed, builds updated stream snapshots, and then updates the cache. Further, we're publishing all of those snapshots via change feed to Kafka as well. Downstream consumers will have a choice.
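A minimal sketch of that snapshotting idea, assuming an applyEvent fold like the one shown earlier:

```fsharp
// A snapshot is a pre-computed fold: the state as of some stream position.
type Snapshot<'state> = { StreamPosition: int64; State: 'state }

// Rebuilding current state no longer touches the whole stream: start from
// the snapshot and fold only the events recorded after it.
let currentState (applyEvent: 'state -> 'event -> 'state)
                 (snapshot: Snapshot<'state>)
                 (eventsAfter: 'event list) =
    List.fold applyEvent snapshot.State eventsAfter
```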

00:27:06

They can consume all of the events as they happen, or, if they only care about the latest state, they can consume that feed instead. One of the first teams to use this pattern at Walmart was called Panther. Panther is an inventory tracking and reservation management system. On the supply side, it aggregates and tracks all sources of inventory, including the Walmart- and Jet-owned warehouses and all partner merchants and their warehouses. And on the demand side, it acts as the source of truth for reservations against the available inventory at those warehouses. When a customer is checking out, the contents of their cart are reserved to make sure that no one else will order them; if there's only one left, that's pretty important. If the inventory is not available at that point, the reservation fails, and the items must either be resourced from a different location or different items must be selected.
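As a sketch, the reservation decision itself can be a pure function; the types here are illustrative, not Panther's actual model:

```fsharp
// Illustrative types, not Panther's actual model.
type ReservationResult =
    | Reserved of sku: string * quantity: int
    | OutOfStock of sku: string

// Pure decision: given known availability and a request, either emit a
// Reserved event or reject. No IO happens in here.
let tryReserve (available: Map<string, int>) (sku: string) (quantity: int) =
    match Map.tryFind sku available with
    | Some onHand when onHand >= quantity -> Reserved (sku, quantity)
    | _ -> OutOfStock sku
```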

00:28:01

The primary goals of Panther were to maximize onsite availability while minimizing order reject rates due to lack of inventory. There were a lot of secondary goals as well: to improve the customer experience by reserving inventory early in the order pipeline, to enhance insights for the marketing and operations teams by providing more historical data and better analytics, and to unify inventory management responsibilities typically spread across multiple systems. Of course, along with these business goals, our solution had a lot of non-functional goals, like high availability, geo-redundancy, and fast performance backed up by SLAs. We found a lot of success with this architecture. The entire team, started by a single engineer in July 2016, had only three team members when Panther went into production by Black Friday that same year, only five months later. After one year, the team still only needed five engineers. Once in production,

00:28:58

we found it very easy to add features. With inventory tracking, staleness of data is an issue. Simply put, if a merchant last told us their inventory months ago, there is no way we would trust that. So we wanted to implement a feature that expires the merchant updates after a certain amount of time, just zeroes them out. The results were immediate: we cut our third-party reject rate in half, from 0.8% to 0.4%. And what's great about this is that it was done by a single engineer who was new to the company, with light F# training and no cloud or microservice background, and who went from design to production in three weeks.
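A rule like that can be a one-liner in the domain logic. A minimal sketch, with an invented type and an arbitrary cutoff:

```fsharp
// Invented type and cutoff, for illustration: a quantity reported too long
// ago simply counts as zero when availability is computed.
type MerchantUpdate = { Quantity: int; ReportedAt: System.DateTime }

let effectiveQuantity (now: System.DateTime) (maxAge: System.TimeSpan) (update: MerchantUpdate) =
    if now - update.ReportedAt > maxAge then 0 else update.Quantity

// e.g. effectiveQuantity System.DateTime.UtcNow (System.TimeSpan.FromDays 30.0) lastUpdate
```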

00:29:52

So with the success of Panther, we started rebuilding a number of our supply chain systems following the same principles and patterns. But we didn't stop there. I want to revisit my ugly diagram from earlier, the one with all the nested synchronous API calls. We looked at this mess of dependent services. There may be dozens of teams and a lot of deployments on heterogeneous stacks, on thousands of servers. But as far as the front-end shopping site is concerned, looking up the availability of Jane's dress may as well be a function. It calls other functions that call other functions and eventually returns a single end result. This is a call graph. It's code. It's distributed, unreliable, stupidly expensive code, but it's code. And if we remember the duality of code and data, there's a way to exchange that code for data, and maybe that data will be more reliable and less expensive than this monstrosity. It turns out it is. The systems modeled after the Panther architecture all stream events and state changes as messages over Kafka. We can use these message streams to invert the dependencies: instead of the dependent service pulling its needed inputs in real time, the source system can push the data changes.

00:31:01

The dependent service consumes these changes as they happen, updates its own state accordingly, and pushes its own changes downstream. We can convert from a primarily synchronous service-oriented architecture to a primarily event-driven architecture. All of the same item availability factors are represented in this diagram, but now almost all of them are hooked up asynchronously. Messages are flowing in this diagram from left to right. We are trading the real-time computations, the real-time calls, for pre-computed data throughout the supply chain systems. How does that affect the hot path, the moment that Jane looks for her dress? Well, a moment ago, in the SOA model, I highlighted that hot path in red. So let's look carefully to see what it looks like here. That's it.
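A minimal sketch of that inversion, with illustrative types: the service folds incoming changes into a local read model off the hot path, and the request itself becomes a lookup:

```fsharp
// Illustrative types: the availability service folds upstream changes into
// its own read model as they arrive, off the hot path.
type AvailabilityChange = { Sku: string; Available: int }

let applyChange (model: Map<string, int>) (change: AvailabilityChange) =
    Map.add change.Sku change.Available model

// Background consumer: model <- applyChange model incomingChange (per message).
// Hot path: a single local lookup, no downstream calls at request time.
let availability (model: Map<string, int>) (sku: string) =
    Map.tryFind sku model |> Option.defaultValue 0
```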

00:32:02

To achieve the same SLA, we need only two service calls, not 23, both of which need only four nines of uptime, not five, and only a 150 millisecond SLO, not 50. All of the event-driven systems still need uptime and processing-time SLOs, but they're no longer in any customer hot paths. They are completely asynchronous; three nines of uptime and end-to-end processing in seconds or even minutes is sufficient. So how does all of this affect cost? This is going to vary among organizations, but we can ballpark it. First, an event-driven system, three nines uptime, mid latency, is about as cheap to operate as any system we're likely to see. If we increase our uptime by 10x to four nines and drop our latency by 400x, to 100 to 150 milliseconds, for a lot of orgs you're looking at an order of magnitude higher cost. To push your uptime to five nines while tightening the latency even more, for most organizations, that is an obscene amount of money. So how do the total operational costs compare once you've replaced all of these things with the functional, event-driven approach? Now, I'm not allowed to give you precise numbers, but I can tell you that for walmart.com, this difference is millions of dollars per year.

00:33:29

And, uh, my notes just went off the screen. Can we get those back on, please? You may have a lot of objections to this. You may be thinking, wow, this sounds really great, but there's no way we can do that. Well, let's talk through some of the more common reasons. "My dev team isn't skilled enough." Well, I've trained up not just senior devs but junior and mid-level engineers from all kinds of backgrounds, Java, C#, JavaScript, Ruby, Python, and they've all succeeded at this. "We don't have the technology to do this." Well, you can follow these principles in any language, and if you're talking about the infrastructure that you need, every cloud provider has some kind of infrastructure available for messaging. "It'll make your app too complex." Well, that might be true for some systems,

00:34:27

if you're only talking about the most basic ones. Gene talked about a system that he built that was really just a toy system, but he wanted to see what he could do to simplify it; and we're running a little low on time, so I'm gonna walk through this really quick: he found that it turned out to be a lot more practical than he expected. The last objections are: it'll cost too much, it'll take too long, it's too dangerous, and it's just too much. "I have this enormous, creaky, baling-wire-and-duct-tape spaghetti code monstrosity. It grew uncontrolled over the years, if not decades, and has dozens, hundreds, thousands of people trying to keep it working." Well, I recognize this is a pretty big shift in mindset, but there's an old joke about this.

00:35:16

How do you eat an elephant? The answer is: one bite at a time. There are small steps you can take right now to apply these principles, regardless of what your systems look like now. You can identify just one dual write somewhere in all of your systems and figure out a way to eliminate it; consider using change data capture to do so. You can encourage property-based testing in just one system; most of your devs won't find it that different from regular unit testing. And you can switch one web service to also publish events. You don't have to fully commit to event sourcing; just publish your changes as they happen. Then switch one consumer to read the events rather than make HTTP calls at runtime. This is a very easy way to bite off a small piece and ensure the safety of the system while you do it.

00:36:06

So if you have an architecture that looks like this, and you don't have someone, an architect, who is talking about how to move to something like this, you're doing your organization a grave injustice. My mission in life is to reduce the amount of entropy in the universe, or at least our little corner of it. So if you want to help me in this journey, if you want to replicate what we've done, or you have new ideas, here's how to reach me: I'm scott.havens@modaoperandi.com, or Scott Havens on Twitter. And I'd be remiss if I didn't say that we are hiring. So thank you very much. Have a great day.