Las Vegas 2020

The Observable Universe...Creating Observability at Scale

As a Fortune 50 company in the Aerospace & Defense industry, with 60K engineers across x countries developing products in the digital age, you can imagine the complexity.


Effectively transitioning to a DevOps mindset required exploring the problem space by asking:


-How do we create value that supports the cyber to physical spectrum at scale?

-How can we create observability of the work system?

-Is it possible to give teams end-to-end visibility of flow?


To answer these questions, we developed an approach that integrated lean, agile, and DevOps with sense-making and the science of socio-technical systems, so that teams are able to peer into the unobservable.


Tiani Jones

Sociotechnologist, The Ready

Transcript

00:00:13

Hi, everyone. I'm really excited to be here today to share some work that I was able to do with a couple of colleagues that really expanded our ideas of transformation. So I'm just going to get this slide started and we'll get going with the discussion.

00:00:33

So the corporation that I was working in had questions about how to face the challenges of serving their customers, how to position themselves to take advantage of new markets, and how to innovate. And as we know, the aerospace industry has seen disruption already. So how could they get ahead of those potential threats? Transformation became the topic of the day, and this problem started with a question, first of all: can we build and design engines faster than five years? This is a complicated product that is developed in a complex environment: multi-year development projects, programs where sometimes a thousand people can be working on one of these systems. We observed a few things that are very typical in most large conglomerates. If you're familiar with Conway's law, or the mirroring hypothesis, then you know that any organization that designs a system will produce a design whose structure is a copy of the organization's communication structure. So naturally we saw silos, organization around disciplines and departments rather than value streams. There were boundaries for information sharing based on countries, which was exacerbated by tightened ITC constraints to share and protect technical data, so data wouldn't be leaked where it shouldn't be. Quality was also inspected in, through lengthy and multiple reviews where others were inspecting quality, rather than it being built into the ways of working. And the list goes on.

00:02:14

Additionally, they had standard work and a lean operating system, which sort of worked for operations, but they were trying to push that into engineering, and it was causing friction, because engineering didn't have the flexibility needed for the type of work that is done in engineering and design. They couldn't unlock velocity or flow; things seemed to be getting slower. There was also a lack of understanding of the interconnectedness of the system, and of how and where to try things, whether that's a new tool, new ways of working, or changes to the org structure.

00:02:54

There was one team that we coached. We helped them experiment with a new material for engine fan design, and we helped them change from a typical V model, with lengthy design phases and kicking things over the wall and back between disciplines, to a somewhat co-located, collaborative, multidisciplinary team that designed experiments. So we asked them: how would you run this experiment? What's the first question that you need to answer? If that goes well, what's the second question that you need to answer? If that goes well, what's the third question that you would need to answer? And they put that together in a plan: three hypotheses, three tests based on their hypotheses about this material, what they would need, what the dependencies were, and what physical things they actually needed to build in order to test this new material.

00:03:53

And one of the scientists on the team had 30 years of experience in this domain. I remember him saying to me: "I remember how we used to work. We just started building until we got it right. But somewhere along the way, the red tape came in, standard work came in, and we started to slow down. And I'm just not sure how that happened." So the question was: could we duplicate this? Could we take a team, think of an outcome we want to achieve, and change the way we work? How could that happen? That was part of the essence of the transformation. Additionally, there was this hypothesis that if we incorporate model-based engineering, which is sort of like DevOps for cyber-physical products, it'll make things better. That should make things go faster.

00:04:44

And then, you know what else? We'll roll out agile. So we heard people say roll out agile, implement agile, deploy agile. Outside of where engineering was happening, though, there was not much in the way of new ways of working. And there were systemic issues that came from outside the boundary of the primary work system, which teams couldn't solve. Yet all of these ideas addressed only what was close to the primary work system, without really thinking through the interconnectedness with other parts of the organization. Additionally, SAFe was kind of what they call the apex predator in terms of agile at scale. And then there was the question: if we have our lean operating system and we have standard work, isn't that going to conflict? Who's responsible for helping us do things differently? And then it became, like, this way to scratch the agile itch,

00:05:34

because that was the buzzword again, and nobody was really sure where to start. My team was a very small team, but we started out making sense of everything that we were seeing, and we thought: what could we propose? How should we be thinking about transformation? How should we be talking about transformation to these other groups? What could be successful, given that we were only a team of two? Ultimately we came to one clear conclusion. We understood that agile would not be enough. This was a systems problem, and it required a systems approach. So that's kind of where we started. Another part of our reasoning process was: how could we talk differently, and bring to bear some theories and research around organizations that might resonate in the system, and start to change the conversation from "can we just roll out agile and SAFe,"

00:06:37

"and can it be made to fit with our lean operating system," to a little bit more thinking through and perceiving the system that exists, having an appreciation for it and some of the theory behind it. One of those theories is the fundamental attribution error. As we were doing this research, thinking about different ways of framing it and talking to different people, we found a white paper: "Nobody Ever Gets Credit for Fixing Problems that Never Happened." What we did observe is that when a problem happened, everyone would rush to address the work or the people close to that problem, assuming that was the cause. And yet the true cause may be distant in time and space from the defect that it creates. There are so many examples you can think of. If you have a maintenance procedure with a high level of defects, and the manager presumes that the operator is at fault because they're close to the procedure, the true cause could be an inadequate maintenance procedure.

00:07:52

It could be poor quality of the training program. It could be any of a series of other causes. So this is the fundamental attribution error. That was one thing we observed, and it was another way of talking about looking at the system and at change. Another thing we discovered was working harder versus working smarter, and the capability trap, which is from that same white paper. Suppose that managers conclude that people, and not the process, the ways of working, the interconnectedness of the system, or causes distant in space and time, are the source of low performance. Once they make that attribution and increase the production pressure, two effects happen. One of them is that worker effort immediately rises, which closes the performance gap. But workers are now less able to achieve objectives by increasing the time they spend working.

00:08:53

And so as they continue to try to hit ever-increasing targets, they eventually resort to shortcuts, because the pressure to produce is still increasing. And what that leads to is cutting the time they spend on improvement. Once they start the shortcuts loop, cutting the time spent on improvement, in the short run they might get the desired effect. But in the long run there's a side effect: with less effort dedicated to improvement, capability begins to decline and performance falls, which offsets the initial gains. So by increasing throughput objectives in pursuit of better performance, managers who mistakenly attribute low performance to the attitudes and dispositions of the workforce have inadvertently forced the system into this capability trap. This was another way to talk about and reason about this with people at different levels of the organization.
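The capability-trap loop just described can be sketched as a toy discrete-time simulation. This is an illustration only: the parameters and equations here are my own invented assumptions, not the actual system dynamics model from the white paper.

```python
# Toy sketch of the capability-trap feedback loop: production pressure
# erodes improvement time, which erodes capability, which erodes output.
# All coefficients are illustrative assumptions, not values from the paper.

def simulate(months=36, pressure=1.2):
    capability = 1.0          # the process's underlying ability to produce
    improvement_time = 0.2    # fraction of time spent on improvement work
    history = []
    for _ in range(months):
        # Pressure above baseline (1.0) pushes workers to cut improvement time.
        improvement_time = max(0.0, improvement_time - 0.01 * (pressure - 1.0))
        # Capability grows with improvement work and decays without it.
        capability += 0.05 * improvement_time - 0.02
        # Short-run output is effort (pressure) times eroding capability.
        history.append(round(pressure * capability, 3))
    return history

run = simulate()
# Early on, extra effort beats the baseline output of 1.0; over time,
# declining capability offsets and then reverses the initial gains.
print("first month:", run[0], " last month:", run[-1])
```

Running it shows the shape of the trap: the first months sit above the baseline, and the later months fall below it.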

00:09:55

Another thing that we surfaced, and thought through, was how we could distinguish ourselves from the fields of agile coaches who were talking about SAFe, story points, and velocity, some of them on journeys of five years and then hitting this wall of systemic blockages. From early on, we were asked to bring agile to the teams, but we knew, as I said before, that we had to go beyond agile coaching and the lean operating system. So we leaned into: how can we reason about and talk about the system with the people who want to hear about agile? I found a white paper, "Agile Base Patterns in the Agile Canon," and it gave us some food for thought about how to step back from specific methods and techniques and think through how to apply patterns. And the first one is to measure economic progress.

00:10:51

The interesting thing about this is that the author, Daniel Greening, puts forth that you should have a balanced suite of well-thought-out metrics. In this case, the company really only had bowler charts, choppy bits of aggregated data that provided no insight. There was no data thread. The second pattern was to proactively experiment. That sounds great, but how can you know where to start experimenting? How do you know where to put the feedback loops? In addition, you have the sunk cost fallacy: after we invest time or money in a project, we have to continue investing. And additionally: I'm just too busy building the thing, I'm too busy producing, I'm in that capability trap. Then there's limiting work in process. How do we illustrate, and how do we explain, how much work we have in process? Do we even know? Do we know what a work item is when we're talking about engine design? Is it a model, a simulation? Is it code? Is it a document? We found all of those things as work items. So how do you represent the value that's flowing through the work system? Then collective responsibility: how do we share responsibility, given the organizational structure and the other factors at play? How do we work together, and what does it really mean? And finally, solving systemic problems. Bringing that appreciation of the system, surfacing what a system is and how it works, was really important here.

00:12:32

And so based on all of that, we started to gather up all these morsels of information, all these morsels of reasoning and ideas, and they helped us develop some shared language with people already familiar with and exploring agile, while we were leaning into the complex systems thinking.

00:12:52

So we started to articulate the values of the system. Okay, so you want to build faster. So if you're trying to optimize the system, you want to optimize it, or help it to be disposed, to produce more quickly. And leadership was talking about speed a lot, about building products faster. But then my teammates and I added learning. Is our work system disposed to, or optimized for, learning? The ability to learn is related to the ability to go fast. In product development, the goal is to produce the product that meets the outcomes of the market and the customer's needs and desires. But the system may not be operating with that same goal; it may not have the same goal. And when that happens, you get that strategic gap that Stephen Bungay talks about, and you can put in some countermeasures to reduce that gap. However, the system may still not exhibit the desired behavior, which is learning quickly. So then why aren't we learning quickly? Because the system isn't designed for it. What are we actually observing in the system? Ultimately it's behaviors, in a complex adaptive system. Can we modify the system to enable the behaviors of learning? Maybe lean, maybe theory of constraints, maybe agile will help. These are the questions we surfaced, and that's how we were able to bubble up this topic.

00:14:21

Additionally, we were working with Jabe Bloom at one point, and this is a quote from him: observability is about seeing what's happening, not about an answer. So what we landed on was: given that we have a system for developing these complicated products, observability was needed to enable transformation. Jabe actually worked with us to start shaping how we would propose a way forward. To build products faster, is the system enabling learning quickly? Can we observe the system's dispositionality to learning? We know about flow metrics. Flow metrics give you insight into the primary work system; they are about the flow of value through the system. You have work in process, throughput, variability of throughput, and other things you can measure that are related to flow: queues, lead times. And this is the space for figuring out some of the answers to those questions we had about work items, handoffs, boundaries, and ownership for a large program of up to maybe a thousand people. It's interesting, because the lack of the data thread is where this surfaced. Can we do anything about that, so that we can even represent flow? But we know about this, and there's a lot of research and experimentation, and tools that we can use to instrument this if we need to, or if we're able to.
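The flow metrics named here (WIP, throughput, variability of throughput, lead time) can be computed from very little data: just start and finish dates per work item. A minimal sketch, with invented work-item records and field names rather than anything from the actual program:

```python
# Minimal sketch of flow metrics from work-item records.
# The items, field names, and dates are hypothetical illustrations.
from datetime import date
from statistics import mean, pstdev

items = [
    {"id": "A", "started": date(2020, 1, 6),  "finished": date(2020, 1, 20)},
    {"id": "B", "started": date(2020, 1, 8),  "finished": date(2020, 2, 3)},
    {"id": "C", "started": date(2020, 1, 15), "finished": date(2020, 1, 29)},
    {"id": "D", "started": date(2020, 2, 1),  "finished": None},  # still in process
]

# Lead time per finished item, in days.
lead_times = [(i["finished"] - i["started"]).days
              for i in items if i["finished"]]

# WIP on a given day: items started but not yet finished.
def wip(day):
    return sum(1 for i in items
               if i["started"] <= day
               and (i["finished"] is None or i["finished"] > day))

# Throughput per calendar month, whose spread gives variability of throughput.
throughput = {}
for i in items:
    if i["finished"]:
        key = (i["finished"].year, i["finished"].month)
        throughput[key] = throughput.get(key, 0) + 1

print("avg lead time (days):", mean(lead_times))
print("WIP on Jan 16:", wip(date(2020, 1, 16)))
print("throughput variability:", pstdev(throughput.values()))
```

The hard part in practice, as the talk notes, isn't the arithmetic; it's the missing data thread that would let you record those start and finish events at all.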

00:15:57

But then the thing that was interesting, once we started talking about whether the system is disposed to learning, is: how is the flow of knowledge related to the flow of value? It's really about measuring the impact of what creates flow, the correlated impact of measuring what creates flow. It almost feels like it's more upstream from the flow metrics, in a sense. And that's what we were really interested in exploring. So lean says to eliminate waste everywhere, always; everything is painted with an equal brush. Theory of constraints says to find where the hunting is good. What we understood is that in a large complex adaptive system, you can't just hunt everywhere; equal hunting across all parts of the system would be wasteful. But hunting is like going to the gemba. It's going to where the value is created. So what if going to the gemba and observing the system is looking for where there are things that are out of balance? We create observability by means of flow metrics, and we're trying to understand if we can do the same regarding the flow of knowledge. What if we could look and see where things are out of balance? Not looking for active problems, because by the time the problem happens, you're too late. But where are things off? Where are the trends?

00:17:29

And what if we could use metrics to hunt the system for behaviors related to the flow of knowledge specifically? Like I said, we already know about flow metrics. We know there's this idea of going to the gemba and looking and seeing where value is created; the gemba is the place where value is created. If that's connected to knowledge, connected to learning, can we use metrics to do that? So the key concept was: if observability is not about the answer, observability is about finding where to hunt.

00:18:07

What is the hunting ground, then? The socio-technical system. Another white paper that we discovered, "The Evolution of Socio-Technical Systems" by Trist, describes the socio-technical system as the whole organization, the primary work system, and the macrosocial phenomena. So the hunting ground begins to show itself here. We know what the primary work system is, or at least we think we do: where the complicated products are developed. And now we can reason about it. We see there are macrosocial phenomena, and elements in the whole organization, that touch on and affect it; there's interconnectedness there. So talking about the socio-technical system, and distinguishing that from the work system, became language that we could use and reason with as we explored this topic of the flow of knowledge.

00:19:08

And so I thought about: what are the skills of a hunter? And I found this cool website that showed these really neat maps for all different kinds of animals. If you're hunting an armadillo, in this case, or a black bear, it shows you the clusters of where you will likely find those animals. On this website, they also talked about the top skills of a hunter: marksmanship, mental toughness, physical fitness, and then the ones I bolded because I thought they were interesting to draw on: technology, navigation, bushcraft, observation, and patience. For navigation, you'll need to know the hunting ground; you'll need your map and your compass. That helps you to develop bushcraft. So what is bushcraft? It's all about venturing into the wilderness, and here's how they described it: venturing into the wilderness, feeling comfortable and confident. This comes with time spent on the hill, observing others and learning; setting up a correctly sited camp where you'll be out of the wind and protected from the worst of the weather whilst avoiding any pooling rainwater; identifying native flora and fauna; reading signs left by animals; and surviving any of the weather changes that mother nature has to throw at you.

00:20:25

And that all comes from experience. Experience leads to confidence. Confidence leads you to explore, and the more you explore, the more comfortable you feel. And don't be afraid to ask questions. So if you think about it, the hunter within the socio-technical system needs to develop a type of bushcraft where they feel comfortable, rather than having the typical management response, which is to jump in and assume that any problems, or potential problems, and their causes are very close together. You begin to be a hunter using tools for hunting, metrics in this case, as one example, that allow you to peruse the hunting ground and get familiar with it, asking probing questions, finding where the hunting is good, finding where there are things out of balance, which can lead you to potential problems, or to opportunities before they become problems, so you can get things back in balance and manage your system.

00:21:34

So then, what if we could hunt? Now that we've established that this is a hunting ground, and that hunting is the skill rather than finding an answer, what if we could hunt for a Brent everywhere that he might be? We know the most common bottlenecks are policies and people. People comprise the socio aspect of the work system, and these bottlenecks prevent us from learning quickly. We noticed that we had the international ITC constraints, the matrixed org, the silos; all these things indicate there's potential for bottlenecks. And there was the firsthand experience of that propeller team that found one person in the lab who had all the skills to do the lab setup. Nobody else knew how to do it. They had to send someone in there to document it and save it, and other people were then able to take advantage of that knowledge. They took that tacit knowledge out of that one person and were able to spread it around and unblock themselves.
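One simple way to hunt for Brents is a skill-coverage matrix: any skill held by exactly one person is a single point of knowledge. A minimal sketch with invented names and skills (nothing here is from the actual team):

```python
# Hypothetical sketch: hunting for "Brents" with a skill-coverage matrix.
# The people and skills below are invented for illustration.
skills = {
    "lab_setup":     {"brent"},                  # tacit knowledge, one holder
    "fan_modeling":  {"asha", "brent", "lee"},
    "test_analysis": {"asha", "lee"},
}

# A skill held by exactly one person is a single point of knowledge:
# when that person is unavailable, both the work and the learning stop.
single_points = [skill for skill, holders in skills.items() if len(holders) == 1]
print("single points of knowledge:", single_points)

# After documenting the lab setup and teaching it to someone else,
# as the team in the story did, the bottleneck dissolves:
skills["lab_setup"] |= {"asha"}
single_points_after = [s for s, h in skills.items() if len(h) == 1]
print("after knowledge transfer:", single_points_after)
```

The same counting idea reappears later in the talk as "skills liquidity": tracking whether knowledge is spreading or pooling in individuals.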

00:22:34

So it's that idea of de-Brenting: Brents have the tacit knowledge. And what comes forth from this is the idea of interpredictability, rather than controlling the response and making sure all responses are the same. If X happens, one person thinks the response is Y; if X happens, another person thinks it's Z. What if they work together and share the information? They increase sense-making. The thing is, you don't really know what you know until you teach it to someone. So the idea of the transfer of knowledge begins to emerge here. And that's where we started our metrics approach. We came up with this idea of balanced metrics. There's lots of reading and research and ideas out there on balanced metrics already; we leaned into the idea of pairs. If we're talking about flow metrics, and you only focus on throughput, what about variability of throughput as a pair? Now that shows you whether you're predictable or stable in your production. So we wanted a balanced metrics system for observability that would highlight where gaming was happening, because if you overdrive on one, you'll see a negative effect in the other. And that gives you an idea that things are out of balance.
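The throughput-plus-variability pair just described can be sketched in a few lines. The data and the "out of balance" threshold below are illustrative assumptions, not figures from the program:

```python
# Sketch of the balanced-pairs idea: watch throughput together with its
# variability, and flag when one is overdriven at the other's expense.
# The sample data and the 2x threshold are invented for illustration.
from statistics import mean, pstdev

# Monthly throughput (finished work items) before and after a production push.
before = [8, 9, 8, 9]
after = [14, 4, 15, 3]   # higher average, but wildly unstable

def summarize(samples):
    return {"throughput": mean(samples), "variability": pstdev(samples)}

b, a = summarize(before), summarize(after)

# Throughput "improved" while its paired metric degraded sharply:
# the signature of gaming or of a system pushed out of balance.
overdriven = (a["throughput"] > b["throughput"]
              and a["variability"] > 2 * b["variability"])
print("before:", b, "after:", a, "out of balance:", overdriven)
```

The point of the pair is exactly this asymmetry: a single-number throughput report would call the second period a success.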

00:23:57

We used the ODIM approach. The ODIM approach was really interesting because it focused on the new behaviors and conversations that we wanted to foster. What's the outcome we want to achieve? What is the decision that we need to make? What's the insight you hope this metric gives? How are you measuring it? And then, in the measurement category: the sampling frequency, how to collect it, how to calculate it, what the data sources are, and how it could be displayed. We finally did land on a few metrics that might be usable; I'm just going to give you a taste, a sampling, of this. The hunter metrics that we came up with, that we thought would create observability of the work system, would be flow metrics (WIP, throughput, variability of throughput, among others), flow metrics as the general category, which is the flow of value through the work system, balanced with metrics centered on knowledge. How do we know we're learning? Where are the Brents? What is the optimal skills mix? How do we transfer knowledge? Skills liquidity: in finance, liquidity is how easily you can move your investments around; in this case, it's the knowledge that's liquid.
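An ODIM metric definition (outcome, decision, insight, metric, plus the measurement details) is easy to capture as a small record. A sketch of that shape, where the example content is hypothetical, patterned on the skills-liquidity idea from the talk rather than taken from the actual metric suite:

```python
# Sketch of a metric definition in the ODIM shape
# (Outcome, Decision, Insight, Metric). Example content is hypothetical.
from dataclasses import dataclass

@dataclass
class OdimMetric:
    outcome: str    # what we want to achieve
    decision: str   # the decision this metric informs
    insight: str    # what the metric should reveal
    metric: str     # what is actually measured
    sampling: str   # how often, and from what data source

skills_liquidity = OdimMetric(
    outcome="Knowledge moves freely through the work system",
    decision="Where to invest in pairing, teaching, and documentation",
    insight="Which skills are locked in single holders (Brents)",
    metric="Count of skills held by exactly one person",
    sampling="Monthly, from the team skill matrix",
)

print(skills_liquidity.outcome, "->", skills_liquidity.metric)
```

Starting from the outcome and decision, and only then naming the measurement, is what keeps the metric tied to the behaviors you want to foster rather than to whatever data happens to be available.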

00:25:20

So the question is: what is our active strategy for moving knowledge around? We want to see how knowledge flows, and to what extent we can introduce the idea of interpredictability.

00:25:38

Another thing that was really important in this is measurements and behavior. In the ODIM approach, when we talked through the outcomes and the decisions and the insights and the metrics, a key concept underlying that was the behaviors that we wanted to foster. Behavior changes depending on what's being measured. Management behaviors change when they have information they didn't have before; it's their hunting ground, their hunting maps. And the desired behavior is more investment in learning, more investment in double-loop learning, in epistemic actions, actions that create new knowledge. One thing I really liked was from Troy Magennis, who talked about three types of dashboards and why you need them, strategic, analytical, and operational, and some of the time concepts that go with them. So we played with all these ideas to see how we could build this suite of metrics based on flow of value and flow of knowledge: how they might be visualized, how they might be collected, who would use them, and who would benefit from them.

00:26:45

Who's using them to hunt? Who is using them to make their way through the socio-technical system? And finally: I know this was just a sample, and I went through the metrics part quickly; hopefully in a future talk we'll have an opportunity to explore it in more detail. This is work that we had just gotten started on, and it had some interruption, but we could deep dive on those metrics, and we're looking to continue this work, ultimately to test it in practice and discover if in fact these are universal patterns that expand the idea of transformation. So thank you very much, and catch me on Twitter if this work interests you, or reach out in the Slack channel if you'd like to explore this more, along with any questions that you might have. Thank you so much.