Monoliths vs Microservices is Missing the Point (Las Vegas 2019)

The debate on monoliths vs microservices as architectural patterns for modern software systems usually focuses on technological aspects, missing crucial details around organizational strategy and team dynamics. Should we start with a monolith and extract microservices or start with microservices? How many microservices is the right number? These kinds of questions indicate a confusion that is made worse by the perceived need to adopt lots of new technology in order to make microservices work. The false dichotomy between monoliths and microservices helps no-one. Instead, switched-on organizations start with the team cognitive load required to build and run a part of the software system. If a team is not able to understand fully the details of a service or subsystem, there is little chance of the team being able to own and support it. The resulting team-sized services are by definition suitable in size and complexity for a single team to own, develop, and run. No longer do we care how many lines of code there are in a single service or whether it is a "monolith": what we care about is that a team can own and run the software effectively. Using team cognitive load as the guiding principle - assessed by the team via measures such as supportability, deployability, testability, operability, prioritization difficulties and domain complexity - organizations can optimize for sustainable ownership and evolution of software systems. This talk draws on research and case studies from the book Team Topologies by Matthew Skelton and Manuel Pais (IT Revolution Press, 2019) together with first-hand consulting experience from the authors with organizations around the world.

Manuel Pais

Co-Author, Team Topologies

Matthew Skelton

Co-Author, Team Topologies

TRANSCRIPT

00:00:02

My name is Matthew Skelton, and we're here today to share with you some thoughts about software architecture and how that relates to teams. The talk today has four sections. First we'll look at monoliths and microservices. We'll then look at something we've called team cognitive load and how that relates to building software systems. Manuel will then take us through some case studies that apply some of these ideas. And then finally, we'll have a little look at how to get started with some of these ideas in your organization.

00:00:50

There we go. So we're the authors of this book, Team Topologies, published by IT Revolution Press. There are copies available in the bookstand, and this evening we have a book signing, at 7:15 I think it is, in the Chelsea Theater, along with all the other authors from IT Revolution who are signing today. So if you're interested in what you hear today, come and get your free copy signed by us and take it back with you. So we hear quite a lot about monoliths and microservices at the moment, in terms of software architectures for cloud native or to enable teams to deliver very rapidly. But we think this is a bit of a false distinction. Let me try and explain why. It sometimes feels a bit like Street Fighter or Mortal Kombat or something. So over here, we've got someone like Tammer Saleh saying: start with a monolith and extract microservices. Then on the other side, we've got Stefan Tilkov saying: don't start with a monolith when your goal is a microservices architecture. And then we've got someone in the middle, who's like a guru, Simon Brown, saying: well, if you can't build a monolith, what makes you think microservices are the answer?

00:02:10

Something is kind of missing here, right? There's an angle on this problem which we're missing. So where should we focus? What should we focus on in order to make this stuff effective? And I think Daniel Terhorst-North puts it very well when he says we should think about software that fits in your head. Can we understand the software that we're building ourselves? If it doesn't fit in our head, if it's too big for us, we've got a problem. In the context of the talk and of the book that we've written, Team Topologies, we like to extend what Daniel has said and say: software that fits in our heads, when we're working as a team.

00:03:13

Why is this important? Who has a copy of Accelerate, or will have a copy by this evening? Every hand in this room should be up. Okay, you need to get yourself a copy of the Accelerate book. There are four key metrics in the Accelerate book that are indicators for high-performing organizations: lead time, deployment frequency, mean time to restore, and change fail percentage. I'm not going to go into these now; that's for Nicole, if you chat to her this evening. However, the problem we have is that if software does not fit in our heads, there's a real danger that each one of these four key indicators is going to get worse. So if the software is too big for our heads, then the lead time, which, depending on how you measure it, is the time from starting to work on a new feature to it being in production, there's a danger that that will take longer.

00:04:10

That will start to extend. There's a danger that the deployment frequency will decrease rather than increase. We will deploy less frequently if the software is too big for our heads, because we won't have the confidence to deploy more frequently. There's a danger that if the software's too big for our heads, then we will not be able to restore service in production as quickly, because it's too complicated, it's too involved. And likewise, there's a danger that if the software is too big for our heads, the percentage of deployments that result in failure will increase. We're trying to drive that down. So that's why we think that the framing around monoliths and microservices is sort of the wrong way to look at it. And a useful way to look at it is this phrase that Daniel Terhorst-North comes up with, which is that software should fit in our heads.
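
As an editorial illustration (not part of the talk), the four indicators described above can be computed from a simple deployment log. This is a minimal sketch in Python; the `Deployment` record shape, the field names, and the sample data are all assumptions made for the example, and the Accelerate book should be consulted for the precise definitions it uses.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    work_started: datetime               # when work on the change began
    deployed: datetime                   # when the change reached production
    failed: bool = False                 # did this deployment cause a failure?
    restored: Optional[datetime] = None  # when service was restored, if it failed


def four_key_metrics(deployments, period_days):
    """Summarise the four indicators over a reporting period."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [d.deployed - d.work_started for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_times = [d.restored - d.deployed for d in failures if d.restored]
    return {
        "median_lead_time": median(lead_times),
        "deploys_per_day": len(deployments) / period_days,
        "mean_time_to_restore": (
            sum(restore_times, timedelta()) / len(restore_times)
            if restore_times else None
        ),
        "change_fail_percentage": 100 * len(failures) / len(deployments),
    }


# Made-up sample data: four deployments over an eight-day period.
t0 = datetime(2019, 10, 1, 9, 0)
history = [
    Deployment(t0, t0 + timedelta(days=2)),
    Deployment(t0, t0 + timedelta(days=4), failed=True,
               restored=t0 + timedelta(days=4, hours=1)),
    Deployment(t0, t0 + timedelta(days=6)),
    Deployment(t0, t0 + timedelta(days=8)),
]
metrics = four_key_metrics(history, period_days=8)
```

Tracking these per team over time is one way to notice the drift the talk warns about: lead time and change fail percentage creeping up while deployment frequency drops.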

00:05:04

Software that is too big for our heads works against organizational agility. This is the key. If you want to go to sleep for the rest of the talk, just take a picture of this slide. That's the only thing you really need to worry about. And that's a really key thing, right? So the thing I want you to take away from the session today is that we do need to think about the size of software that teams work with, and I'm talking about teams, not individuals, because there's a direct impact on organizational agility.

00:05:45

And so how do we approach this? In the book, we talk about team cognitive load. Let me just talk you through a little bit of background first. Cognitive load is a concept that was defined by John Sweller in 1988. He defined it as the total amount of mental effort being used in the working memory. So when we're building software systems, working with software systems, we've got a lot of stuff in our working memory, as we're juggling concepts and trying to put those into code, or working out how to shape a dataset, or whatever. Cognitive load comes into play a huge amount as we're working with software systems. And there are three kinds of cognitive load: intrinsic, which is something fundamental to the way we're working or the kind of problem domain; extraneous, which is stuff which gets in the way, effectively, which prevents us from really thinking about the problems at hand; and germane, which is useful stuff about the problem domain that actually helps us to solve a particular problem.

00:06:52

Now, in a software development context, this would be something like this. Intrinsic could be remembering how classes are defined in Java. Extraneous would be, for example: oh, how the hell do I deploy this application again? It's really complicated; we shouldn't have to think about it. Germane: if we're working on a financial services application, it might be, well, how do bank transfers work? We need to have that cognitive load on the people who are working, because they need to be thinking about the details of, in this case, how bank transfers work in order to be able to write code effectively.

00:07:31

You could sort of see it like this in a software delivery context: intrinsic is the fundamental skills that we bring as engineers; extraneous could be something like the mechanism, which we shouldn't really have to think about; and germane is the important stuff about the domain that we're working with, the business domain. It's a bit of a simplification, but you can think of it like that for now. What we're trying to do is this. We have to work with the intrinsic cognitive load. That's just the nature of the beast; we have to do it. We're trying to squeeze down as much as possible the extraneous cognitive load, which doesn't add value, which gets in the way. And we're trying to give ourselves as much space as possible for the germane cognitive load, the stuff that is really business differentiating. I've tried to represent that in this slide here.

00:08:31

If you want to know more about this, by the way, have a search for talks and slides called Hacking Your Head by Jo Pearce. You'll find some interesting talks and slides and blog posts and things around that.

00:08:51

What all this means is, if we want to enable organizational agility, we need to explicitly limit the size of software services and products to the cognitive load that the team can handle. Because as soon as we exceed the cognitive load of the team, there's a danger for those four metrics, if you remember, from Accelerate: the danger that we're going to be driving bad decisions, we're going to be increasing bugs, we're going to be making it more difficult to diagnose and redeploy, and so on. That's not where we want to be. So this is a very different starting point for software architecture, and for ways of thinking about team responsibility boundaries and so on. We've started to think now about what's an effective size of software. Well, the size of software should be no more than the owning team can handle, based on the cognitive load.

00:09:52

It's certainly not something that many organizations have been explicitly doing. Many organizations have been thinking about this, but perhaps not exactly in these terms until more recently. So again, this is this software-that-fits-in-our-heads concept. If it fits in our heads, we're more able to own it as we build and run it in production. So we're starting with the team. It's very much a team-focused way of thinking about software responsibilities, boundaries, architecture, and so on. When we say team, we mean a long-lived group of people with a shared purpose and backlog, probably fewer than nine. In some organizations with very high trust, you might be able to get away with a team being more like 15 people. But certainly in our book, in the Team Topologies book, team means something very, very specific, which is this long-lived collection of individuals who work together over a long period of time, multiple months, years possibly, with a common purpose, and work together as a team, rather than just a collection of individuals with the same manager. The reason for that is because a high-performing team is far more effective than just a collection of individuals. So if we want to be a high-performing organization, we use teams to do the work.

00:11:25

So there's a really important point here: each service or application, each part of the software estate, must be fully owned by a team with sufficient cognitive capacity to be able to build and operate it. There are no applications or services which are kind of shared, or which don't have an owner, or which only have a BAU team keeping them ticking over. Every application or service has got full ownership: one team that builds and runs it. And it has sufficient cognitive capacity; we haven't exceeded the cognitive load of that team. So we're not just piling more and more services onto the same team. At some point, that team will have reached its limit of cognitive load.
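
As an editorial illustration (not part of the talk), the ownership rule above can be checked mechanically against a service inventory. This is a rough sketch in Python under stated assumptions: the inventory format, the team names, the self-assessed `load` scores, and the per-team `capacity` numbers are all hypothetical, and a numeric score is only a crude stand-in for the team's own judgement of its cognitive load.

```python
from collections import defaultdict

# Hypothetical inventory. "load" is a score the owning team assigned itself.
services = {
    "checkout":        {"owners": ["payments-team"], "load": 3},
    "product-catalog": {"owners": ["catalog-team"], "load": 3},
    "legacy-reports":  {"owners": [], "load": 2},  # nobody owns this
    "shared-auth":     {"owners": ["payments-team", "catalog-team"], "load": 2},
}

# Hypothetical limit each team has said it can handle.
capacity = {"payments-team": 5, "catalog-team": 4}


def ownership_problems(services, capacity):
    """Return human-readable violations of 'one owning team, within capacity'."""
    problems = []
    load_by_team = defaultdict(int)
    for name, info in sorted(services.items()):
        owners = info["owners"]
        if len(owners) != 1:
            problems.append(
                f"{name}: expected exactly one owning team, found {len(owners)}"
            )
        for team in owners:
            load_by_team[team] += info["load"]
    for team, load in sorted(load_by_team.items()):
        limit = capacity.get(team, 0)
        if load > limit:
            problems.append(f"{team}: load {load} exceeds capacity {limit}")
    return problems


problems = ownership_problems(services, capacity)
```

With this sample data the check flags the unowned service, the shared service, and the one team whose total load is above its stated capacity.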

00:12:12

And there are some techniques we can use these days, which we know work, to help us do all these things. So whole-team techniques like mobbing, where the whole team comes around a single keyboard and brings multiple viewpoints to solving a problem. We solve that problem with very high quality, we've reduced the likelihood of downstream problems and bugs and so on, and then we move on to the next feature. That's a very whole-team approach to getting work done. We can use techniques like domain-driven design, DDD, to help us establish effective boundaries between different parts of the business domain, and therefore assign the responsibilities to teams to match those domain boundaries.

00:12:55

We can emphasize developer experience, sometimes called DevEx, where we've got strong emphasis on the experience that developers and other engineers have of using other parts of the software estate: platform, tools, this kind of thing. So we're making sure there's as little friction as possible in using various tools and parts of the platform and so on. We also need to focus on operator experience. So whoever is running the systems in production, whether it's the same team or whether it's a separate team, maybe it's SREs or ops people, whoever's running it, we need to understand what their experience should be, because if their experience is terrible when there's an outage, we're going to be hurting that mean time to recovery, right, from the Accelerate metrics. We need to be building in operability as a first-class thing for our software, so the operator experience is excellent.

00:14:01

In the book, we talk about something called a thinnest viable platform. So this is a concept where we are going to need some sort of platform underneath what we're building. We might choose to ignore it, but it will be there. And we're not looking to build a platform which is absolutely huge and all-singing, all-dancing. We're looking at just the smallest amount of platform to accelerate teams who are building application software and services, and make it safe and rapid to do the right thing. We'll come back to this one a little bit later on. In the book, we talk about four fundamental topologies. These are four team types, which as far as we can see are the only four types of team that we need in a modern organization building and running software systems.

00:14:56

We've tried hard to find more types that are necessary, but we've not yet found them. So if you're sure you've got another team type, please come and tell us; we'd like to hear about it. But based on our experience and so on, this is what we've come up with. And the most important one is the stream-aligned team, because we're trying to optimize for a fast flow of change. We want to make sure we've got a team that is aligned to the stream of change coming from the business. And we've used things like DDD to help us get boundaries between these different teams, different streams, so that that team is able to take an idea or a change from concept all the way through to production and running it. So the stream-aligned teams build and run applications and services. And the other three types of team are there effectively to reduce the cognitive load on the stream-aligned team.

00:15:51

So if the stream-aligned team needs to understand a new kind of technology, let's say a new kind of database type, we might have an enabling team, shown in green, the second one. The enabling team will come in, perhaps they're database experts, and they will work with the stream-aligned team to help them get to grips with this new kind of database technology for a period of time. Perhaps it's two months, perhaps it's just two weeks. At some point the enabling team will move to a different team and start to help them with this new technology. They're not there permanently. They're not there as permanent support. The complicated-subsystem team is optional, but if there is a part of the system which is really awkward and requires really highly specialist knowledge, then we might give that particular chunk of work to a team with the extra expertise. And then at the bottom, underneath, we've got a platform. As we've heard, there's always a platform, but we need to define it very well and make sure that the way in which we build this platform is focused on enabling the stream-aligned teams to deliver rapidly and safely. So the people in the platform treat the stream-aligned teams as their customers.

00:17:19

So in some organizations, they even use things like Net Promoter Score so that the stream-aligned teams can rate aspects of the platform as if this were a kind of public SaaS service. So let's say in an organization we've got three stream-aligned teams. They're running on a platform. Two of the teams are using a component which is quite complicated, so there's a specialist team looking after that; that's on the left, in red. And the top two teams are having some help from an enabling team to get to grips with some new technology. Perhaps it's databases, perhaps it's machine learning, something else. So you can immediately see that the kinds of interactions between different teams are different, depending on what they're doing. We don't have exactly the same kind of interactions and needs.
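
As an editorial illustration (not part of the talk), here is how a platform team might compute a Net Promoter Score from ratings given by stream-aligned teams. The sketch uses the standard NPS definition (percentage of ratings of 9 or 10 minus percentage of ratings of 0 to 6); the platform aspects and the ratings themselves are made up for the example.

```python
def net_promoter_score(ratings):
    """NPS from 0-10 'how likely are you to recommend this?' ratings.

    Promoters rate 9-10 and detractors rate 0-6. The score is the
    percentage of promoters minus the percentage of detractors, so it
    ranges from -100 to 100.
    """
    if not ratings:
        raise ValueError("need at least one rating")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100 * (promoters - detractors) / len(ratings)


# Hypothetical ratings from stream-aligned teams, per platform aspect.
platform_ratings = {
    "deployment pipeline": [9, 10, 8, 9, 7],
    "documentation": [4, 6, 7, 9, 5],
}
scores = {
    aspect: net_promoter_score(ratings)
    for aspect, ratings in platform_ratings.items()
}
```

Scoring each aspect separately, rather than the platform as a whole, shows the platform team where its internal customers are struggling.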

00:18:10

And the kinds of dependencies between different teams vary depending on what the teams are doing in the organization. The way in which those teams interact can also be different; it needs to be different. The top two teams there have this enabling team working with them, and that enabling team is going to be facilitating those two teams. The way in which those interactions work will feel very different from the way in which the component is being used by these bottom two teams, for example. The bottom two just want to consume this component as a service, if you like. They've got a nice clean interface, a nice easy way to install it, or an easy way to test it and access it. There's very little additional interaction that's really needed there.

00:19:01

And likewise, all three of these stream-aligned teams can just consume stuff from the platform in a very straightforward way. There are nice APIs, nice documentation. It's nice and straightforward. The stream-aligned team at the bottom, however, is collaborating with the platform on something new. Perhaps they're moving cloud provider, or perhaps they're changing the way they do infrastructure automation or something. They need to interact with the platform team in a different way. So we've got different kinds of team interactions at different parts of the organization at the same time, depending on what's happening. This is just a snapshot. In six months' time, the interactions will look different because the teams are doing something different. So this is an important point: the platform, the enabling team, and the complicated-subsystem team are there to reduce the cognitive load on the stream-aligned teams, to enable them to own their parts of the system effectively. We're expecting to interact differently with other teams in the organization. And this starts to help us move towards the concept of environmental scanning; in this case, it's our internal environment within the organization. Dr. Naomi Stanford, who's one of the world's foremost experts on organization design, talks about environmental scanning as a really crucial aspect of how organizations should expect to set themselves up for success. And the patterns we're talking about today start to touch on that. So let's have a look at some case studies now.

00:20:43

Thank you, Matthew. So I'm going to talk about two case studies. The first one is from a large worldwide retailer. They're still growing into new markets, and they realized: we're kind of a traditional enterprise, our delivery cycles are very slow, so we want to do something different. So they had a specific market that they wanted to enter, and they said: we need a new mobile experience, so we're going to create a cross-functional team and give them the autonomy to decide whatever architecture they think is best to do this. So this team had all these good practices around DevOps, continuous delivery, using public cloud, et cetera. And they had this iterative approach, so they were very quickly able to deliver something working, and then iterate and improve over time. So it was a very concrete success story for this organization, you know, the kind of success story you'd put in a presentation like this.

00:21:40

So what happened next is that, because they were successful, they were asked to do another mobile experience for another market. So you can see they're starting to have a bit more complexity in terms of backend, and they needed a CMS to control different types of changes for different markets. And this went on for quite a while. So about a year to a year and a half later, you can see the team has grown considerably, and the system around them as well, or the system they're responsible for. So they started having more backend services, a product catalog, a framework with shared services between different mobile applications, et cetera. The interesting thing here is that a couple of people in this team, the more senior architects, were realizing: actually, our delivery cadence is slowing down.

00:22:34

We're actually starting to have more dependencies within the team. And what's happening here is, as you can see, this is becoming a little bit of a monolith that the team is working on. And you start having people who are specializing in certain parts of the system, so those people become bottlenecks. You know, if we need to change this part of the system, only one or two people know how to do it effectively. You start having different work streams within this larger system, and some of them are blocking each other.

00:23:08

So what they decided to do, and at first there was a lot of pushback against this decision, was to split the team into two smaller teams. You can see that on your right side, one of the teams is more focused on the front-end experience and the product catalog, and on the left side, the team is more focused on the backend services. Because the team was working quite well before, they were not really very happy with the split, but they did it. And it turned out quite well, because actually most of the time they could work independently on their part of the backlog, on their features. But obviously there are some that were cross-cutting across the two teams. And those are represented between the two teams; you can see those two blue bars.

00:23:57

That means they have a very considerable amount of communication between the two teams. You could almost see it as a pair of teams that come together for specific needs. So there will be some features, some changes, where they need to synchronize and actually work together for a period of time. But this is intentional; it is explicitly designed like that. And the rest of the time they can work more independently. So this worked out quite well for them. They even went on to split further. I believe now they have front-end teams almost aligned to a single market, so they can go as fast as possible to meet the needs of that specific market. And on the backend, they also split, and they aligned to almost one service per team. So what was happening here is that as the team grew and the system grew, it was becoming more monolithic and having flow of work being blocked within the team.

00:24:58

But they were able to listen to some of these triggers that said: okay, we need to evolve; what was working before, the structure we had before, is not working anymore. So: software growing too large; over-specialization, so people like Brent in The Phoenix Project who are the only ones who know how to change or support parts of the system; and just an overall increased need for coordination, spending more time coordinating different changes, et cetera, even within the team. So the other case study I want to talk about is from OutSystems. They are a low-code platform vendor, and they have also grown considerably in the last years. In particular, they had one team which was called the engineering productivity team. They were helping the product teams get better in terms of these domains of continuous delivery, test automation, build and continuous integration, as well as infrastructure automation.

00:25:56

Over time, they were acquiring more responsibilities in these different domains. But what happened was that, again, people had to specialize in one or at most two domains. Although they wanted everyone to be able to work on everything, in reality people had to specialize because it was too much cognitive load. And so they realized: we're actually getting people demotivated and not engaged with the work, because there's so much happening. They were just trying to stay alive and respond to the product teams; that alone was very hard. So again, they also decided to split into smaller teams. Each of these smaller teams is aligned to a single domain, and they don't have a team lead anymore, so it's a flat structure within the team. And this quickly proved to be very useful, very successful for them, if you think about the intrinsic motivators for individuals.

00:26:56

So if anyone has read the book Drive by Daniel Pink, he talks about three intrinsic motivators: autonomy, mastery, and purpose. Each of these teams was in a much better place to have those motivators, because they had a shared purpose, a single domain of focus that they were engaged with and interested in. They had more autonomy to decide: okay, what are the priorities for this domain? Where do we want to go? What are we missing as an organization? And mastery in the sense of: okay, we have the autonomy to allocate effort to improve our knowledge, to learn new techniques, maybe go to conferences, try out new tools, et cetera. So this worked out quite well for them. And you can see, again, there are cross-cutting concerns. Maybe some requests will need people from different teams to come together because they cross different domains, but that's kind of the exception.

00:27:52

And what they do in that case is they create a kind of micro-team for a period of time: we're going to work specifically on this request or this feature that is cross-domain. But most of the time they're able to work independently, as they are aligned to a single domain. Ironically, this engineering productivity team was created to reduce the cognitive load on the product teams, but they themselves fell victim to too much cognitive load, too many responsibilities. So it's not always just about software size. Some teams are more support teams or productivity teams, so they have domains of responsibility, and you need to be careful that those are not overbearing for the team. So if you aim for teams with this kind of high cohesion internally, this shared purpose, autonomy, and mastery that we talked about, that can be quite powerful.

00:28:46

And between teams, there will always be a need for coordination and communication, but you can try to make that low-bandwidth, the minimal communication that you need. And most of the time they are independent; they can work on their own backlogs. So again, they were listening to triggers for evolution: awkward interactions within the team; people not invested, some people were at the point of almost burnout and leaving the organization because they didn't like how the work was being done; and frequent context switching. Every time we switch contexts, we need to upload to our working memory the skills and the domain knowledge that are necessary for the problem or feature we're working on. And I'll give it back to Matthew.

00:29:36

So technically we're out of time; if you're happy to leave, thank you for coming. Otherwise I'll take about two minutes to run through a few extra things. Here are some ideas for getting started. Go and ask your team how confident they are, or rather, how anxious they are, how much anxiety they have about the software that they're working on. Try and get a sense for that. Try and get to the point where they feel comfortable giving you an honest answer, because the anxiety about the software that we're working on is a leading indicator for potential problems in production. And we want to use leading indicators rather than lagging indicators, right? So if we can actually assess how confident the team is that they understand everything about the software they're working on, that can be a powerful indicator for whether we're likely to get problems later on. Have we exceeded the team's cognitive load? Do we therefore need to pull some things into a platform? Maybe, maybe not. It depends.
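
As an editorial illustration (not part of the talk), the "ask your team how anxious they are" idea can be turned into a small anonymous survey. This is a sketch in Python under stated assumptions: the dimensions are borrowed from the measures named in the talk abstract, while the 1 to 5 scale, the threshold, and the responses are invented for the example.

```python
from statistics import mean

# Hypothetical anonymous responses: 1 = very confident, 5 = very anxious.
responses = [
    {"deployability": 4, "testability": 2, "operability": 5, "domain complexity": 3},
    {"deployability": 5, "testability": 2, "operability": 4, "domain complexity": 2},
    {"deployability": 4, "testability": 1, "operability": 4, "domain complexity": 3},
]


def anxiety_hotspots(responses, threshold=3.5):
    """Return dimensions whose average anxiety is at or above the threshold."""
    if not responses:
        raise ValueError("need at least one response")
    dimensions = responses[0].keys()
    averages = {d: mean(r[d] for r in responses) for d in dimensions}
    return {d: round(avg, 2) for d, avg in averages.items() if avg >= threshold}


hotspots = anxiety_hotspots(responses)
```

In this made-up data, deployability and operability stand out, which would be a prompt to ask whether those concerns belong in the platform. The numbers only open the conversation; the talk's point is that the team's honest answer matters more than the score.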

00:30:42

Are there skills or capabilities missing within the team? So these things are signals. If we've gone beyond the team's cognitive load, that might indicate that there are other things we need to change around that team. These slides are obviously going to be available online. Think about what your platform is. How is that defined? How do the teams understand what they are consuming from that platform? How good is the documentation? How good is the developer experience for using the stuff that's in that platform? Because if any of that stuff is not first class, then you're increasing the cognitive load on the stream-aligned teams that are supposed to be developing software. And why would you do that? We need to minimize that kind of extraneous cognitive load that is involved around: how do I deploy this component? How do I update the package? Whatever it is, we want to minimize that kind of extraneous load. So work out how easy it is for teams to use that platform, to understand how to use the platform, and so on. So that's the kind of developer experience.

00:31:51

So here's the book, with the book signing at, I think it's 7:15 this evening, in Chelsea. If you're interested in training, get in touch; we have some options available. We are also looking for industry case studies. We're talking to three organizations at the moment who have started to use the patterns and ideas from the Team Topologies book. We're talking to a global manufacturing company, a large government agency, two large government agencies, and a company involved in global financial services. But if you're working in a situation where you think you've got some interesting dynamics in your software delivery challenges, then do get in touch. If you find the material useful, we've got a newsletter; sign up if you like. Thank you for coming.