Our First Inverse Conway's Maneuver Was Not Enough

I guess you already know that there is a Conway's Law.


During this session, I will not tell you too much about what it is, but rather why we thought we should make use of it. I will share what we did in our inverse Conway's maneuver and how we did it.


Finally, I will close with some insights into where we still feel a lot of pain, and why.

Rene Lippert

DevOps Evangelist, Lufthansa Systems GmbH & Co

Transcript

00:00:08

Hello, my name is Rene Lippert, and I welcome you to the presentation which I've prepared for this year's DevOps Enterprise Summit. It is about our first inverse Conway's maneuver and why it was not enough. My talk is split into four sections. The first is about the Lido business domain within Lufthansa Systems and what the Lido Flight product is about. Second, where did we originate from and why did we take the journey? Third, our trip into agility describes the new structure and where we are. And in the fourth part, I will end with where our pain points are today. So I will start now with the Lido domain.

00:00:57

We offer products for commercial airlines of all kinds to ease their operations. In Europe, we are a market leader, and we have customers all around the world. We also offer data services, which means that we digitize information which is only available on paper. Yes, that still happens nowadays. We have around 850 employees to offer the software product and the data services. Just to give you a feeling: the product we have has more than 2,700 configuration parameters to configure and tailor our software to the customers' needs, because they are quite special, with all the laws they have to fulfill and all the different policies they have to fulfill. Of course, that makes our testing a nightmare, because every individual installation looks like a piece of software in itself.

00:02:00

Here you see a short look at our customers. In the previous slide, you saw that there are 120 airlines around the world that are our customers, at least those were the figures before the crisis, before the Corona pandemic. As we are a 100 percent owned Lufthansa subsidiary, we understand and we feel what it means at the moment to be an airline. Nevertheless, we are looking optimistically into the future, and we call the new normal our new chance. And we know that our software will be needed again when the flights are coming back and the airlines are back in the sky.

00:02:49

Here, we see a map covering the different business scenarios where our products are in use. It starts at the bottom, where we plan the trip, check the route of the flight, provide the flight crew with the required information about their flight event, and support the use of airport maps for taxiing on the airport itself and the takeoff parameter calculation. In flight, our product allows aircraft-ground communication, so we can even provide updates during the flight, and ground staff can monitor the flight event on the ground while it's in the air. And finally, we support during the approach and landing and the taxi back to the ramp. The last step is then the collection of the data for post-flight analysis. In today's talk, I will focus on one of the products used in this cycle. It's called the Lido Flight software. To oversimplify, we can say it's the TomTom for the flight crew, knowing not only the optimal way the pilot has to go from A to B, but also a lot more, like what's the weather on the road, what's the wind, for example, and all the other information relevant for the flight. So now to the second part, which explains the way the organization looked before our agile transformation, and also gives some insights about the architectural characteristics during this time.

00:04:22

We had many specialized people on the development side as well as on the business side: people with dispatch know-how, who know exactly what the customers are doing, as well as people with deep know-how in IT and in writing code. You can see them here as green shirts. The resource managers, so-called RPMs, planned our green shirts, first in a tool called Planter and later in MS Project. You see them here on the next slide in light blue shirts. The team leader was responsible for business and disciplinary decisions. So we made the most skilled person from the business side a team leader, as this was the career path for the people. You can see them with the black cylinder. And the release manager was in charge of coordinating and tracking everything that was done according to the release process, and was responsible for delivering the release in scope and in time, seen here with a gray cylinder.

00:05:32

So here you see the schema of how the development organization looked. You can see the Lido Flight product is developed mainly at two locations, in Gdansk and Frankfurt. As I said, the team lead was responsible for the decisions on what has to be done, what can be delivered with which release, and what the capacity planning for this is. The RPMs managed the work for the people long in advance, and we even shared people between different teams, and that was planned there as well. In short, we had a very classical, traditional waterfall way of product development: two version releases a year, seven service releases a year, and, if needed, special patches. It worked, but we fought with all the problems such a traditional way of working has. Now let's look at the software side. We had a wide set of languages in use. It was C, C++, embedded C, shell scripts, Perl, Python, Fortran, Lex, Yacc, SQL, Tcl/Tk, and lots more. All together, we had 3.6 million raw lines of code; stripped down, it was more than 2.7 million. Some of our components had more than 100 dependencies, 26 of them overall, and you can see a picture of it on the left-hand side.

00:07:00

I like to speak about our system as a kind of solar system, with a database core in the middle and over 1,500 planets, binaries and shell scripts, around it. Not only the database was used as a persistence layer. We also used a shared file system, and a complex shared-memory construct was built around the product to provide data consistency and performance. In this galaxy, we had a stable build and release process, which was already a big achievement, but it was a huge effort and hard to maintain and even harder to enhance. So it could not stay like this, and it was clear we needed to transition into the modern world. We wanted to do this with a step-by-step and not a big-bang approach, because we had tried big-bang approaches in the past and had failed with them. We started a journey into a new, bright, and shiny world of working: the agile way. Before I start explaining what we did during our redesign of the organization, I want to spend a minute on Conway's law and what the inverse Conway's maneuver is. Let me quote one of the conclusions from Melvin E. Conway's paper "How Do Committees Invent?":

00:08:29

"The basic thesis of this article is that organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations. We have seen that this fact has important implications for the management of system design. Primarily, we have found a criterion for the structuring of design organizations: a design effort should be organized according to the need for communication." Let me give you an example so that you better understand what we intended to do. If you have two software components, A and B, which are closely related, and you want to merge them into a single component, you had better merge the teams first. Then there is a good chance that you have created a homomorphic force which is reflected in your architecture, so that the software component really becomes one. As long as you have two teams, your software will always have a kind of fracture where the two teams' work meets.

00:09:40

We designed the new architecture to launch the inverse Conway's maneuver. Our lead architects built the new domain model with clear responsibilities. The aim was clearly cut services, decoupled from each other, with all communication happening through a well-defined technical layer, a kind of interface bus. The services are well structured into components; we call them building blocks. We decided to follow the self-contained systems paradigm, so the architectural patterns to use were defined in a macro and a micro architecture. After we had this, we set up the teams according to this model and hoped that, by this, the architecture would simply follow Conway's law and all would become good, because the homomorphic force would make the teams turn everything bright and shiny. Unfortunately, we did not understand, or we underestimated, how much the communication, especially the uncontrolled communication, and how much the pull of the old architecture against this homomorphic force from Conway's law was going to harm us.

00:11:02

As the architectural work was done, we could start our agile transformation by analyzing which different roles we need and how a team should be structured. We also thought about which roles we need besides the delivery teams. Here you see one of the whiteboards from this time. We decided that, besides the delivery teams we had identified from the architecture, we need the following. One, platform teams, like infrastructure, monitoring, and logging, to work across the teams. Two, chapters to work on the definition of standards, spread them into all these teams, and reach an alignment across the teams. Three, for very special know-how, traveling experts enabling the teams, for example a security expert going to a team, joining them for a sprint or two, and sticking with them to enable them to understand security. Four, coaches who coach the team or individuals into their new roles on the job, for example an agile coach. Here, we made a couple of mistakes. The first and biggest, from my point of view: we did not take care of our ops people. We understood how agile software development works, but not how to do agile software operation. The second is that we did not manage to bring the concept of traveling experts and coaches into an efficient life.

00:12:48

Here is an overview of the roles which we have introduced in our organizational redesign. We still have many team members, of course. So for the majority, nothing has changed in what they do on a daily basis. What has changed is that they are now clearly assigned to one team and that they have to work in an agile way. They are still the green shirts. To support the agile way of working, we've introduced Scrum Masters, here as the light blue ninjas, and the product owners, here to be seen with the black cylinder and the red shirt. For us, the product owner is very close to the team, and some of them even understand themselves to be part of the team. And we have introduced one role called architecture owner, who bears the responsibility for the architecture, makes the technical decisions together with the team, and supports the PO in all the technical aspects. You can say the PO decides what has to be done and when, and the AO decides how it is to be done and documented. Here, we had a steep learning curve on our assessment for the right people in the right shirt, but this is a talk of its own.

00:14:25

Today, we have more than 30 such teams, with 15 POs and service owners, 15 Scrum Masters, and 17 architecture owners. How does that match with more than 800 people, you might ask? As I said before, it is only the Lido Flight part which I am describing here.

00:14:48

During the transition, we managed the change, some might call it a reorganization, by the use of quality gates. We had three of them. The first was the preparation phase. We called it ready for decentralization. A clear mapping of the people into the teams happened, the tooling was in place and ready to use, and the team was aware that they are no longer centrally planned by resource managers. The second was about readiness for the new roles: the assignments happened, so the POs, the AOs, and the Scrum Masters had all been found and placed into the teams. The agile way of working was understood, and the team decided which methods to use. And the last one, the third, was done with Lido 4D. That was about upskilling the people in the team so that the new technology could be used, and about starting to work towards the new architecture to decouple from the centralized core, starting to form the team's own galaxy.

00:16:01

That is, if you want to see it in the old analogy with the solar system. Here, you see the principles and values we introduced when we switched to the new agile way of working. First, the team must have the feeling of owning their code. So give them the autonomy to decide and the responsibility to bear the consequences. Second, we aim for simplicity and fight complexity, especially the accidentally introduced one. Third, we want to have a collaborative culture. On the customer side, the product is seen as a single piece, so we need to collaborate to make this happen in front of the customer. Four, rules and guidelines from the lead architects on how the architecture owners design the micro architecture in their service. Five, teams also have budget responsibility so that they can become cost efficient. Six, you build it, you run it. That was the way we intended to think with the individual teams. Seven, infrastructure as code and automation should be in the DNA of each team. Eight, increase the bus hit factor to something higher than one. So far, that sounds like we had a perfect plan and it was very simple to execute and to reach what we've achieved today. I can tell you that was not the case. We had plenty of discussions and hordes of skeptical employees, and quite often we had to argue and fight for the greater good. We started at the beginning of 2017, and now, more than three years later, we still face a lot of pain.

00:18:11

Here are some examples. Before we started our change, the value stream was driven along the release process. So there was not much communication required for new releases, and the technology change was also rather low. When we allowed the teams to work in an agile way and reach out for the new architecture, we started to introduce a lot of new technology. The number of REST services increased heavily. Automation tools like Jenkins popped in, also for operational tasks. Containerization in the form of Docker was introduced. We also allowed the teams to bypass the release process to move faster for new services which are in Docker, but the old release process still dictates the value stream, and by this also the overall speed. Shifting the installation process left increased the cognitive load on the installing and THT team, because they now have to work out how to run all this new technology and learn all the new automation and the tooling around the containerization.

00:19:38

Instead of letting the delivery teams do all this, including the releasing of their service, we asked the delivery teams to automate the installation only in the internal systems, and they did. But we still try to protect our customers from the delivery teams. So we just shifted parts of the release process left, but the pain is still handled by the same ops teams, only with a much higher need for communication and a much higher cognitive load from all that new technology coming in. Compared with our values and principles, we have not achieved "you build it, you run it". According to the book Team Topologies, there are four types of teams. I will mention only two of them here. There is the so-called stream-aligned team. They should be loosely coupled, with a clearly defined, versioned API, and be responsible for the value stream of their product, component, or microservice, you name it.

00:20:52

They have the end-to-end responsibility for their part. Our teams, compared to this, focus at the moment a lot on developing new features or shifting to the new technology, but with the old logic and with the old coupling. So even if we have renewed the software with the new technology, we still have a tightly coupled monolithic architecture. In the book Team Topologies, what we now have is called a monolithic release. This is the reason why we still need the release heartbeat for almost all our teams. As I said before, we still have the "ops runs it" model. The way we do this at the moment is a nightmare in operations. As a single team running all the components and being the first line of defense, they have to collect and provide all the feedback to the delivery teams, which results in even more communication on top. There is no good tooling around to operate the new services. Monitoring, traceability, and logging are not developed enough and not tailored to the operational needs, as this is mainly done by the delivery teams themselves, who barely get in contact with the customer systems where the incidents happen and where this should therefore take place.

00:22:34

As I told you before, we did not run through the agile transformation with the ops teams. So we have no clear idea what kind of team type a help desk, for example, should be. Should they be understood as a stream-aligned team, delivering incident resolution, software updates, and more as a service to the customer? Or should we design them in the form of an X-as-a-service team, offering the services to the teams and covering incident resolution and 24/7 operation, but with a clear API to be used by all the different delivery teams? Or maybe a completely different form and type of team? We are also still fighting with the Three Ways. Because of these topics, we could do a lot better on the management of the value stream and focus more on flow. We should do better in the shift left to avoid too much indirect feedback and foster more direct feedback. And last but not least, we have to become better in the way we do postmortems for customer incidents, to be better with our continuous improvement and learning.

00:24:03

Let's now have another look at our teams, more specifically at the team size. Dunbar's number, simplified, says: you can only build deep trust with a small group, five individuals, maybe two or three more. You can only keep regular contact with a few, namely 15 individuals. You can only stay in touch with some 150 individuals. Recently, that number was increased to something like 180. More than half of our teams are quite big. Our old philosophy on that was: we lack a skill in the team, so add a person with this skill to the team. The team feels overloaded? Add more people to the team. On top of this, we do not manage the communication between the teams. We just let it happen. That leads to a very high cognitive load in many of these teams, because a lot of people have to communicate. And a lot of our people complain that, by this, we have far too many meetings, far too much communication, and far too much information needed.

00:25:28

We have easily forgotten one simple mathematical law. If the team size is N, you calculate the number of communication channels as N times N minus one, divided by two. So if you have, for example, six people, take the six dots in the little icon I'm showing here, you have six times five, which makes 30, divided by two, which makes 15 communication channels, if all these people need to communicate, simply within a team. And if you have six teams, it is the same for the communication channels between the teams. Our batch size is driven by the release cycle. As we still have the waterfall-based release cycle of two version releases and eight service releases, our batches are big. Big batches are high risk, and by this, we still suffer on the quality of our software. So we have to reduce these batch sizes and release more frequently so that we can manage the risk a lot better.
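
As a minimal sketch of the channel arithmetic from this segment, N × (N − 1) / 2, assuming every member of a group talks to every other member: the function name `communication_channels` and the sample group sizes are illustrative assumptions, not something shown in the talk.

```python
def communication_channels(n: int) -> int:
    """Return the number of pairwise channels in a group of n people (or teams)."""
    if n < 0:
        raise ValueError("group size must be non-negative")
    # Each of the n members can pair with n - 1 others; dividing by two
    # avoids counting the pair (a, b) and the pair (b, a) separately.
    return n * (n - 1) // 2


if __name__ == "__main__":
    # Hypothetical group sizes, chosen only to show how quickly the count grows.
    for size in (3, 6, 9, 12):
        print(f"{size:>2} members -> {communication_channels(size):>2} channels")
```

For a group of six this gives 15, matching the figure in the talk; doubling the group to twelve gives 66, which illustrates why adding people to an overloaded team increases the communication load faster than the headcount.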

00:26:50

We still have many handoffs in our end-to-end value stream. We should reduce that and allow the delivery teams to really deliver and be responsible for their product at the customer's end. I am not saying that there should be no ops. Ops is still needed, I'm pretty sure. Very short reaction times of 15 minutes cannot be handled by an on-call of the delivery teams. Here, we still have the need for a clear 24/7 operations unit, and it's more a question of what the right team type for the 24/7 unit is. Last but not least, our ratio of unplanned work is not tracked, especially in our ops teams, and it is for sure far too high. It's even understood as: we are here to work on the incidents, so there is no planned work for us at all; we have only unplanned work.

00:27:57

It is the planned work which fights against the unplanned work. You need to know what kind of unplanned work you have. So we need to start to track and understand the roots of this unplanned work, and then we can plan to fight against it and prevent it from happening. So planning is your weapon against the unplanned work. And we need, for sure, a shift in mindset in our operations departments. Before I close, I would like to say thank you to Matthew Skelton and Manuel Pais for their book Team Topologies. This book helped me a lot to understand what kind of teams we have and what kind of problems we are facing in our organization and organizational design. And it helped me to raise the why questions and to answer them. If you have to design an organization, I highly recommend that you read this book first. Let me close with a very short feedback.

00:29:07

I think we did the right things, but we didn't do them consistently enough. What makes me look optimistically into the future is the fact that the collaboration and the way the teams work together has changed a lot over the last three years. I more often have the feeling that we are being agile and not only doing agile. I'm sure we are going to address our pain points. And if you are interested in how successful we will be, then let me know in the feedback to the talk whether I should come back next year and tell you the outcome in another talk. Thank you very much for joining me in this session. I'm Rene Lippert, DevOps evangelist of Lido.