Monoliths vs Microservices is Missing the Point—Start with Team Cognitive Load

The “monoliths vs microservices” debate often focuses on technological aspects, ignoring strategy and team dynamics. Instead of technology, smart-thinking organizations are beginning with team cognitive load as the guiding principle for modern software. In this talk, we explain how and why, illustrated by real case studies.


Matthew Skelton has been building, deploying, and operating commercial software systems since 1998. Head of Consulting at Conflux (http://confluxdigital.net/), he specialises in Continuous Delivery and operability for software in manufacturing, ecommerce, and online services, including cloud, IoT, and embedded software.

Matthew curates the well-known DevOps team topologies patterns at devopstopologies.com and is co-author of the books Continuous Delivery with Windows and .NET (O’Reilly, 2016) and Team Guide to Software Operability (Skelton Thatcher Publications, 2016). He is also co-founder at Skelton Thatcher Publications (http://skeltonthatcher.com/), a specialist publisher of techniques for software teams.

Matthew founded and leads the 2300-member London Continuous Delivery meetup group (http://londoncd.org.uk/), and instigated the first conference in Europe dedicated to Continuous Delivery, PIPELINE Conference (http://pipelineconf.info/). He also leads the CodeMill digital skills initiative in the North of England (http://codemill.tech/), and is a Chartered Engineer (CEng).


Manuel Pais is an independent IT consultant and trainer, focused on team interactions, delivery practices, and accelerating flow. Manuel is co-author of the book "Team Topologies: Organizing Business and Technology Teams for Fast Flow" (IT Revolution Press, 2019). He helps organizations rethink their approach to software delivery, operations and support via strategic assessments, practical workshops, and coaching.

Matthew Skelton

Author, Team Topologies

Manuel Pais

Author, Team Topologies

Transcript

00:00:02

Hi. Good afternoon, everyone. Good to see you here. My name's Matthew Skelton, and together Manuel Pais and I are the co-authors of a new book called Team Topologies. We're here today to share with you some insights, advice, and experience on how to size software services, and the focus of that is team cognitive load. So today's talk will look something like this. We'll have a section where we're looking at monoliths and microservices, different kinds of sizes of software. We'll then look at what we mean by team cognitive load. We've actually heard this mentioned a couple of times already today in some of the earlier talks. Manuel will then take us through some case studies: organizations that have used team cognitive load as a way of helping them to evolve their software systems. And then right at the end, we'll look at a few tips for getting started with this approach. So this is a book published by IT Revolution Press. It looks like this. The publication date is September 2019. These are the early advance copies, and we're doing book signings tonight, half past seven, at the Sonatype stand. So if you're interested by what you hear today, come along and you can get a signed copy.

00:01:18

So in the past few years, many organizations have started to adopt microservices as a way of being able to deploy their software systems more rapidly, with greater focus on specific areas of the system. But there's often lots of debate around what size a microservice should be. Should it be 10 lines of code? Should it be a hundred lines of code? And it starts to look a little bit like this kind of Mortal Kombat thing. So in the blue corner we've got Tammer Saleh, who says, start with a monolith and extract microservices. And then over the other side of the arena, we've got Stefan Tilkov, who says, don't start with a monolith when your goal is microservices. And then the wise words of Simon Brown, who says, if you can't build a monolith, what makes you think microservices are the answer? So what's going on here? Where should we actually focus? I think that Daniel Terhorst-North has got it right when he talks about software that fits in your head. There's an awful lot of experience and awareness behind that phrase. In the context of teams owning and running software, we might rephrase this to be software that fits in our heads as a team.

00:02:54

But the intent is the same. Who has yet to buy or read a copy of Accelerate? Put your hand up and be shamed.

00:03:04

Right?

00:03:06

Fine. So you need to get yourself a copy from the stand. Very, very straightforward. These are the four key metrics from the Accelerate book, based on five years' worth of State of DevOps reports and assessment of many thousands of companies around the world. These are the four key metrics that are strongly indicative of high organizational performance: lead time, deployment frequency, mean time to restore, and change fail percentage. The problem is, if the software that we're working with does not fit in our heads, these things are going to be very difficult to improve upon. If the lead time is the time from, say, version control to production, and the software's too big, we're likely to distrust the tests and we're likely to want to take more time to find out what's going on. The lead time is going to extend. Same with the deployment frequency. If we don't understand the software well enough,

00:03:59

are we going to have the confidence to deploy more and more frequently? Probably not. We're probably going to want to restrict how many times we deploy, and so on. If the software we're working with is too complex and too complicated and fails in really awkward ways in production, it's going to be difficult for us to restore that service as quickly. So again, our MTTR will extend. So if we want to start improving these four key metrics as recommended by the Accelerate book, then we need to start thinking about the size of software that we're expecting teams to work with. Software that is too big for our heads works against organizational agility. And this is a different starting point compared to how many organizations, many people, have started in terms of thinking about software and architecture. Because often in the past, we've started with bits of technology. We started with the database, we started with a message bus, we started with something else. If we start with the team and the cognitive load for that team, we get some different results.
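[Editor's illustration, not part of the talk] To make the four key metrics concrete, here is a minimal sketch of how a team might compute them from its own deployment records. The `Deployment` record, its field names, and the `four_key_metrics` function are assumptions made for this example; the talk and the Accelerate book do not prescribe any particular data model, and "lead time" here follows the speaker's rough definition of version control to production.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime                  # change reached version control
    deployed_at: datetime                   # change reached production
    failed: bool = False                    # deployment caused a production failure
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def _hours(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def four_key_metrics(deployments: list[Deployment], period_days: int) -> dict[str, float]:
    """Summarize lead time, deployment frequency, MTTR and change fail percentage."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")

    failures = [d for d in deployments if d.failed]
    # Only failures that have already been restored contribute to MTTR.
    restore_hours = [
        _hours(d.deployed_at, d.restored_at)
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "lead_time_hours": mean(_hours(d.committed_at, d.deployed_at) for d in deployments),
        "deployments_per_day": len(deployments) / period_days,
        "mttr_hours": mean(restore_hours) if restore_hours else 0.0,
        "change_fail_percentage": 100 * len(failures) / len(deployments),
    }
```

Tracking these numbers per team rather than per organization is one way to notice when a particular team's software has stopped fitting in its heads.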

00:05:12

So let's have a little look at what we mean by team cognitive load. Cognitive load was defined in 1998 by psychologist John Sweller as the total amount of mental effort being used in the working memory. There are three kinds of cognitive load that John Sweller identified: intrinsic, extraneous, and germane. In the context of software development, we can think of them in these three ways. We can think of intrinsic as something like how a class is defined in Java. It's just a fundamental of how we're working with, in this case, software systems.

00:05:53

We don't have that front and foremost all the time. Once we've spent, you know, six months or a year doing Java development, then that naturally becomes an intrinsic part of how we work. Extraneous is something that works against what we're doing, something that is kind of like a distraction. So how do I deploy this app again? I can't remember. Oh, I've got to set this config property, blah, blah, blah. This is extraneous cognitive load, and it's effectively valueless. We don't want to have this kind of cognitive load on our teams. Germane cognitive load is load that we have to deal with because it is part of the business problem we're trying to solve. So if we are building an app for online banking, then part of the germane cognitive load of a software developer or tester, whoever is building the application at that point, might be: how do bank transfers work? Because you need to have that kind of load in your head as you're building the software. So we can sort of see these in a software context: intrinsic is kind of like the skills that we bring to the table, extraneous is stuff to do with the mechanisms of how we do things in a software world, and germane is sort of like the domain focus. It's a bit more involved than that, but that's how you could think of it.

00:07:17

And what we're really trying to do is give the most space to the germane cognitive load. The intrinsic we have to deal with; we can't get rid of it. We're working with software, working with computers, it's the stuff that we just have to know. We're trying to minimize and squeeze the extraneous cognitive load, to get rid of that as much as possible, if possible entirely, so that we've got the most space available for the germane cognitive load, the business focus of the problem which we have to deal with. If you want to know more about this in some detail, by the way, there are great presentations by Jo Pearce. If you search for Hacking Your Head, then you'll find lots of slides, lots of videos, and so on. There's some really good material there. By the way, these slides from today will be online later on today. We'll tweet out the link and you'll be able to download the slides. So this is the implication of what we've just been talking about. We should be thinking about limiting the size of software services and products to the cognitive load that the team can handle,

00:08:31

so that we're starting to take a socio-technical approach to building our software systems here. We don't just pretend that we can throw any kind of software architecture or design or technology at a team, and they'll just have to deal with it. We're actually using the constraints or properties, if you like, of the human systems that we have in our organizations and working with them to produce more effective software delivery and more effective software systems. So this again is software that fits in our heads. This is quite a different approach to thinking about software boundaries. This will feel very unfamiliar to many people, not to everyone. There are organizations already doing this, as we'll see very shortly, but it does feel a bit unusual. When we talk about teams, we're talking about a group of people that's probably fewer than about nine people in size. There are evolutionary reasons for this. Some organizations have

00:09:37

found patterns where you're able to bring two of these kinds of teams together in close harmony. If you think about a rugby team, you've effectively got two closely operating teams together. You've got the forwards and the people at the back. I don't play rugby, but I've spoken to people who do, and they do say it feels a little bit like there are two separate teams, but working really closely together. So some organizations have found ways in which they can do that. But generally speaking, we're talking about a cohesive, long-lived group of people that work together on the same set of business problems for an extended period, and that group of people is fewer than about nine. We've heard in many of the talks this morning about ownership of software services and how important that is. We need to move to the point where every service must be fully owned by a team with sufficient cognitive capacity to build and operate it. In the words of Andy Burgin from Sky Betting & Gaming earlier on, it was: you build it, you run it, you fix it, you support it, you diagnose it. And so that's what we're talking about here. There are no services, there are no products, which do not have an owner.

00:10:49

We know that there are techniques to help us do this kind of stuff. We've got techniques like mobbing, applied to the whole team, which will help the team to own that service. We've got techniques like domain-driven design (DDD) to help us choose domain boundaries in an effective way that really works for the business context.

00:11:13

We've heard many people talk about the importance of developer experience, particularly when building a platform: making sure that platform is very compelling and very easy and natural for product teams, development teams, to use. So we're making sure we're explicitly addressing developer experience, particularly when we're building a platform but, to be honest, when we're doing anything inside our organization where other people need to use our software. But we also need to think about the operator experience. What about the people who actually need to run this stuff? People who are on call: how easy is it to diagnose these systems, and so on? If we've built a system that's fine for our team, but we've handed it over to another team and it's terrible, it's really difficult to operate that stuff. If the cognitive load is way too high, we're in a bad place. We need to focus on operability to make this stuff work. And another technique is what we've called in the book thinnest viable platform, which is an approach where we explicitly define what the platform looks like. So again, from Andy Burgin's talk this morning, there's a really nice slide where he showed the very beginning of their platform evolution. They had a wiki page which defined exactly what that platform was aiming to do and listed the services it provided. So being super explicit about what our platform is, is important. It's also important to make sure that it's not bigger than

00:12:41

necessary, hence thinnest viable. If you're a startup

00:12:49

and you're quite small, with only maybe 10 or 15 people in your organization, the underlying platform is going to be something like AWS or Azure or Google Cloud or whatever, and you might decide to build an extra platform layer on top of that. But your platform might simply be a wiki page listing the five services that you are going to use from AWS. And if you don't need to build anything more, don't build anything more. That's enough. That is your thinnest viable platform: just a wiki page with the list of five services. We're not trying to build a huge, great thing. We need to make sure that whatever we build is compelling to use and has a strong developer experience. We're treating the product teams, or what we call stream-aligned teams, as customers. We're treating them as people whom we need to speak to, to understand what they need, and we need to be set up to meet their needs.

00:13:46

So I've talked about a few different team types. In the book, we've identified four different kinds of team, which as far as we can see are really the only types of team that are needed in this context of building modern software systems. The first team type is the most fundamental, and this is the stream-aligned team, the team that is aligned to part of the value stream for the business. They have end-to-end responsibility for building, deploying, running, supporting, and eventually retiring that slice of the business domain or that slice of service. And really the other kinds of teams listed below are effectively there to reduce the cognitive load of the stream-aligned team. That's how we can see it. If we've chosen our domain boundaries well, the stream-aligned team should have everything they need to deploy changes for that part of the business system, but they can't do everything. They need some supporting services from a platform. For example, we heard a great talk from Tom this morning about the platform at ITV. We need some support from a platform so we don't have to think about how we spin up a Kubernetes cluster, because that would be too much increased cognitive load compared to deploying something more business focused.

00:15:03

Likewise for a complicated-subsystem team. If there's part of the system where, let's say in the case of media streaming, we need to write a specialized video transcoding component, we probably hire some people with PhDs in maths or something like this and get them to work on a complicated subsystem. We're taking the cognitive load off the stream-aligned teams, who can focus on more of the end-to-end customer experience. Enabling teams help to up-skill the stream-aligned teams, typically on a temporary basis, and also detect if there are any gaps in the platform or gaps in what the stream-aligned teams are expected to do. So this is maybe an organization here where we've got three stream-aligned teams. We've got a platform underneath, we've got a complicated subsystem on the left in red, and towards the right-hand side we've got one of those enabling teams facilitating two of the stream-aligned teams. Perhaps they're moving from one container platform to another or something like that, so they're just trying to get up to speed. Another key idea in the book that we've identified is the need to be much more explicit about the ways in which teams interact,

00:16:16

because what we can see from our experience, and what we hear from other people talking about their experience, is that in many organizations, teams don't understand why or how they should interact with other teams. So what we've defined is just three interaction modes, and part of the purpose of these three interaction modes is to help reduce confusion and effectively reduce the irrelevant cognitive load, so that it's easier for teams to understand how they should be operating effectively. So if the complicated subsystem is our transcoding component, let's say, and that team is busy building that, and if we set up the expectation that they're simply providing that component as a service to these other teams, these two teams at the bottom, then all three teams involved in that interaction

00:17:18

have a clear understanding about how they're supposed to interact, how they're supposed to provide something or consume something. So we've minimized the cognitive load around how we should operate as a team. Similarly, the stream-aligned team at the bottom here is currently collaborating with the platform to discover something about, let's say, logging or a better way of doing Kubernetes, something like this. They know that for a period of time, their cognitive load is going to be higher because they're working together closely with another team. But perhaps after, say, three months, we've finished that discovery and we go back to consuming the container platform as a service. So there are mechanisms here: if we define much more clearly the ways of working with other teams, we're actually able to address cognitive load and minimize it in different parts of the organization. So now we're going to look at some case studies from organizations.
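[Editor's illustration, not part of the talk] The four team types and three interaction modes described above can be written down as a small data model, which makes it easy to record who interacts with whom and in which mode. This is only a sketch under stated assumptions: the class names, the `Interaction` record, and the `collaboration_count` helper are hypothetical and do not come from the book. The helper reflects the point made in the talk that close collaboration temporarily raises a team's cognitive load, so a team in several collaborations at once may deserve a closer look.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class TeamType(Enum):
    STREAM_ALIGNED = "stream-aligned"
    PLATFORM = "platform"
    COMPLICATED_SUBSYSTEM = "complicated-subsystem"
    ENABLING = "enabling"


class InteractionMode(Enum):
    COLLABORATION = "collaboration"    # working closely together for a period
    X_AS_A_SERVICE = "x-as-a-service"  # one team provides, the other consumes
    FACILITATING = "facilitating"      # one team helps the other get up to speed


@dataclass(frozen=True)
class Interaction:
    """A declared interaction between two named teams (hypothetical record)."""
    team: str
    other: str
    mode: InteractionMode


def collaboration_count(interactions: Iterable[Interaction], team: str) -> int:
    """Count the close collaborations a team is currently part of."""
    return sum(
        1
        for i in interactions
        if i.mode is InteractionMode.COLLABORATION and team in (i.team, i.other)
    )
```

Writing interactions down this explicitly, even in a wiki table rather than code, is one way to give teams the clear expectations the speakers describe.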

00:18:24

I'm going to talk about two case studies from the book. The first one is a large worldwide retailer, and they're still growing into new markets. Back in 2016, they decided they wanted a new mobile site for one of these new markets. They put a team together from scratch: a cross-functional team, with business people directly involved in the team, that had all the technical skills to have this kind of end-to-end ownership that Matthew was talking about. They had good DevOps practices, everything was in the cloud, kind of the typical success story that you would include in a presentation like this. And so, you know, they were able to quickly release a working version of the mobile website and then iterate frequently. After a while, they were asked to do the same for a new market, a new mobile site, and they wanted this to be rather independent, so that the different sites for different markets could evolve more or less independently. But in the backend, they started to have a need for a little bit more complexity.

00:19:26

They needed a content management system so they could upload content to different sites, but overall this was still working quite well. And of course, over time, they were asked to do even more markets, more sites, and the backend started to get a little bit more complicated. They needed a subsystem to handle product management, the product catalog. Different markets are going to have different sets of products and versions available, and pricing, et cetera, so they needed to manage that. They also started this framework, which is essentially a collection of services common to all the sites: things like searching for a product or uploading static files to a CDN, things that all the sites would need but that you wouldn't want to repeat for every code base. So I think you can probably tell what's happening here. The system is growing and the team is growing along with it.

00:20:23

So by now they had far more people than in the beginning, and it's becoming a little bit of a monolith, right? And so some of the people in the team started to realize: actually, now we also have different work streams going through the team. You have maybe feature requests for one of the market sites, other feature requests for other markets. You also might have changes that need to be done in the CMS for the content editors, and so on. And the fact that the system was a little bit monolithic by now meant that these work streams were kind of impeding each other. There were dependencies, and they were actually slowing down the pace of delivery. The thing that had made them so successful in the beginning was now harder to achieve. In particular, two people in this team who had a kind of senior architect role started to realize this.

00:21:12

And even though the team worked quite well together, and they were a high-performing team, they noticed these dependencies, and also that people had to start specializing in certain parts of the system. Whereas before it was pretty fluid, in that you would get a change request or a feature and you would know exactly which parts of the system to change and get it out, now people were starting to specialize in specific parts. So those two senior architects proposed to split the team in two. They got a lot of pushback because the team members felt that they were working well together, but eventually they did that. So they got into this pattern Matthew mentioned, kind of a paired team. Obviously there was a lot of communication going on on a regular basis, but after doing some refactoring of the system and rearchitecting a bit, they were able to split into two teams: essentially one team more focused on the customer-facing applications and markets, and the other team focusing more on the CMS and this framework. This worked quite well for them, and now these two teams were able to deliver more independently. There was still obviously some correlation between the roadmaps for these two teams, and they had this communication going on on a regular basis, but they were much more independent at this point. So they had realized that there was too much cognitive load. The system was too large to handle as efficiently as before.

00:22:42

And from what we've heard, they went on to actually break down these teams further. I believe they now have smaller teams aligned to markets on the customer-facing side, and they have split the CMS and the framework, which is a kind of platform team, as Matthew was mentioning, with common services. So this worked quite well for them. The key point here was that, as they grew and they were successful, the system became larger and the team became larger, and things were starting not to work as well. The flow of work was getting blocked, or at least significantly delayed.

00:23:21

The critical thing here is that some people in the team were listening to the signals that something was not as efficient as it was before. The software was getting too large in this kind of monolithic architecture. Some people were over-specialized. If you've read The Phoenix Project, it's kind of the Brent syndrome, where only this person or this couple of people know how to change that part of the system. So you're introducing this dependency, even inside one team, that only when those people are available will we be able to get this out the door. And overall it was just increasing the need to coordinate releases (when is that part done, so I can do this other part?) and introducing delays in delivery.

00:24:06

But it's not always just about the size of the software the team is responsible for. There are other types of responsibilities. So in the case of OutSystems, who are one of the leading low-code platform vendors in the world, a few years ago they started the engineering productivity team. In the beginning, this team worked as an enabling team around build and continuous integration and also test automation. So the two domains at the bottom, that's what they started with. Their goal was to actually reduce cognitive load for the other engineering teams, who are in fact their customers, if you like. They were helping them adopt good practices around these areas, set up tooling in a good way, and just overall help the engineering teams increase their maturity in these areas. So again, they were quite successful. What happened was that they took on more domains, in particular infrastructure automation and CD, continuous delivery enablement, and the team grew to cope with that.

00:25:07

And the interesting fact here also was that the other engineering teams were getting really more mature, more advanced in the way they used test automation, CI/CD, et cetera. And so they were coming back with requests for help that were much more specific, much more domain specific for those teams. So what this productivity team was facing now was a large number of requests across different domains, coming from different teams with specific needs. They were barely able to keep afloat, let's say, and respond on a timely enough basis to these requests. And inside the team, what happened was that it became very difficult for any one team member to understand all these different domains. So people were in practice working on one or perhaps two domains, and motivation went down significantly. Some of the people felt like they didn't have enough effort available to actually master the domains that they were supposed to support and understand in detail.

00:26:09

And at the same time, they were spending a lot of time in planning meetings and stand-up meetings where most of the things being discussed were not directly related to the work that they were doing. So at this point, and this is quite recent, late 2018, they made a bold decision to split into smaller teams, almost micro teams, where any one team was only responsible for one of these domains. And the early results were quite positive. Motivation went up, and people felt like they had more autonomy to actually decide what the priorities are for their domain of responsibility. They could also interact much more closely with the other engineering teams, their clients, if you like, and really understand: what are the problems you have, and what are the best solutions I can find for you? And they had a little bit of breathing space to actually master this domain, understand good practices, perhaps come to conferences like this and get to know what other people are doing.

00:27:08

And so the motivation really went up, and there was a feeling of shared purpose inside each of these teams. Obviously there were still issues and requests that cut across some of these domains, as they are closely related, but it turns out those are kind of the exception. So when that happens, people from different teams will come together; if needed, they will create a temporary team to work on that specific problem or need, and then go back to the original teams. So in fact, before, they were optimizing for this situation, which is the exception: needs and requests across multiple domains. So this has worked quite well for them for now. And you can see here also that there's still communication going on between different teams, but the bandwidth that is required is much lower. So the key is that it's not always just about software size, but actually aligning the number and complexity of the domains that the team is responsible for to their cognitive capacity. And if you aim for this kind of pattern, with smaller teams with high cohesion internally, high communication internally, and shared purpose, where you need some synchronization with other teams but that can be much lower bandwidth, so you don't need to be communicating across all teams all the time,

00:28:29

that can work quite well. And then finally, again, they were listening to these signals that what worked for us in the beginning is now becoming a problem: awkward interactions, some people who were not really invested, some people maybe almost burned out because they were trying to really keep up with all these different domains, so they'd have to put in a lot of extra time to actually understand all of this, and definitely frequent context switching inside the team.

00:28:58

The last example is not from the book; it's actually from a recent talk, again from Sky Betting & Gaming. And besides getting a slide of a cat in the presentation, it's also just to ask: is it always the good pattern to split into smaller teams? Well, not necessarily. In this case, they decided to keep a kind of large team of 12 people because they had different applications: some older applications that were making money today, and new applications, more experimentation, trying new markets. And what happened was that within the same business domain, the demand for working on one part, older applications or newer, would change over time. So in one quarter, maybe we need to increase the resilience of the older systems and spend most of the time on that. Next quarter, maybe we'll want to push out new applications and try new things. So it made sense to keep the same team, but within the team there were clear work streams, and people know: now we're focusing on this part, these older systems or the newer systems.

00:30:07

So how do we get started with this kind of approach? A few ideas here. Simply speaking, just ask team members. Just do a survey of members in a given team: how well do they understand the software they're working on? Give it a score of one to five or something like this, and just get a very rough idea of which teams currently are really struggling with the cognitive load of the systems that they have been asked to own and develop. Could there be some things that are candidates for pushing into a platform? Don't rush ahead and do it, but come up with a candidate list to start with and have some conversations. We're looking for missing skills or capabilities. It could be that within the team there are actually missing skills. It could be that the organization as a whole is missing skills. If we adopted these three team interaction patterns that we saw earlier on, so close collaboration, where we know our cognitive load is going to be higher, or X as a service, where we know we're just supposed to consume something, or facilitating, so kind of helping or being helped, what would happen if we adopted these patterns?
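[Editor's illustration, not part of the talk] The one-to-five survey suggested above can be tallied with very little tooling. The sketch below is a rough example under stated assumptions: the `struggling_teams` function, the shape of the input, and the threshold of 3.0 are all made up for illustration; the speakers only suggest asking each team member for a score and looking for teams that are struggling.

```python
from statistics import mean


def struggling_teams(scores: dict[str, list[int]], threshold: float = 3.0) -> list[str]:
    """Return teams whose average self-rated understanding is below the threshold.

    `scores` maps a team name to its members' answers (1 = poor, 5 = excellent)
    to "how well do you understand the software you work on?".
    Teams are returned lowest average first; teams with no answers are skipped.
    """
    averages: dict[str, float] = {}
    for team, answers in scores.items():
        if not answers:
            continue
        if any(not 1 <= a <= 5 for a in answers):
            raise ValueError(f"scores for {team!r} must be between 1 and 5")
        averages[team] = mean(answers)

    below = [team for team, avg in averages.items() if avg < threshold]
    return sorted(below, key=lambda team: averages[team])
```

The result is only a conversation starter: a low average says a team may be overloaded, not why, so it would normally be followed by the kind of discussion about platform candidates and missing skills described in the talk.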

00:31:18

Like, how would teams actually react and behave in this context? Because you need to sense your organizational situation, your maturity or the dynamics within the organization, as to where to start to apply some of these practices. Don't just rush in and do it. Is your platform well defined? If not, go ahead and define it, and really quite carefully. You'll probably be surprised that there are far more services that are actually being run by a small group of nearly burned-out platform engineers, and so it's time to do something about that. What is the thinnest platform that could work in your context? It doesn't have to be thin, but the thinnest and no more. So, as I mentioned, here's the book. We've got the signing at half past seven this evening. It goes on sale in September. You can pre-order now if you go to teamtopologies.com/book, and bookstores all over the world are currently stocking it, which is great. We've got some training; if you're interested, give us a shout. And you can sign up for some news and tips if you go to teamtopologies.com. So thank you very much, everyone, for attending today. Hopefully it's useful.