Chasing the Unicorns at T-Mobile

Twelve-hour outage bridges, worn-out headphones, 90% unplanned work, and 25TB of randomly corrupted file systems were normal business for T-Mobile developer platforms.


When the foundation of where software delivery happens is the bottleneck, throughput remains buried under a large pile of debt. Ripe for improvement, T-Mobile has begun to embrace DevOps principles including transparency, telemetry, post-mortems, and continuous experimentation to spark a turnaround of historic proportions.


Listen as Chris Hill, Senior Manager of Developer Platforms, walks through a journey capitalizing on T-Mobile culture and desire to create experiences customers love. The culture, otherwise known as "Team Magenta," led to an appetite for change and now has teams achieving up to 30x throughput gains and decreased deployment pain.

Chris Hill

Developer Platforms, T-Mobile

Transcript

00:00:07

Hello and good afternoon, London. My name is Chris Hill and I am the senior manager of developer platforms. I'm very excited to speak to you all today about T-Mobile's pursuit of unicorn status. Now, this is with respect to developer platforms, which is my favorite subject, and I'm highly passionate about this. And I'm really excited because all of us can relate to this. We either lead developers, or we've been a developer, or we understand and empathize with the developer experience. So I hope that it resonates. I'm going to start with the developer experience, walking us through what a developer experience may be like in your enterprise and how we can improve it. Then I'm going to go into transformation. I know that word gets overused a lot, but what does it mean to transform and change the way that you work? I'll dissect that a little bit.

00:01:07

I'll also mention the unicorn playbook: how unicorns basically operate and maybe what we need to do to catch up. And some of the lessons learned within T-Mobile, at least on our journey so far. So I'm going to kick us off in an area that a lot of people don't really like to talk about, and that is the onboarding process. This is where I feel developers first lose their motivation. The best analogy is, you know, when you get your IKEA furniture, you've got this instruction booklet, all the steps are laid out in front of you, and you have little cartoon versions of yourself that basically show you either really confused or "don't do that." This is the designers from IKEA empathizing with the experience that you are going to go through.

00:02:02

As you build this table, there's also a nice little tools chart and inventory chart that literally lays it all out in front of you: these are all of the pieces that you'll need, and these are all the tools that you need to put these pieces together. I'm still waiting for the day when I inherit or join a software project that comes with a list of reasonable instructions like this, like I'm putting together an IKEA table. But every software project I've joined as a developer feels like I just stumbled on step number 650 of the IKEA table build cycle, and the instructions are actually missing. Now, where did those go? I don't know. And every screw is the wrong size and every screw is completely stripped. There's no way I can be productive at what I just landed myself in. If I could just get somebody to walk me through how the software has been built before, or if someone showed me the design documents and why they made those decisions. But instead I'm stuck raising a flurry of access requests. I'm stuck looking at incomplete architecture drawings, if I can find something. And there are thousands of people, well, everyone has their food ready, and they're watching me try and build a table and saying, "How come your table's not done yet?"

00:03:35

Maybe after scouring the earth, I'm able to obtain a hard copy that tells me how or where I need to go to ask for access to code and environments. This would be the step-by-step instructions on how to run. Essentially, you get into ServiceNow, Remedy, ticket-raising types of tools. And if I'm lucky, when I do make those requests, the people in the approval chain are in the office this week, and I can finally start to actually look at code. I'm really excited. In a larger enterprise, that SLA for the approval workflow of what you just asked for? Oh yeah, that's a five-day SLA. Okay, so I'm sitting on my hands until then. But then maybe when I actually do get access to the systems, they're so fragmented that it's a context switch every time I go from one tool to the next tool.

00:04:29

And basically, in order for me to fully understand and fully utilize the toolchain that I have at my disposal to do my job, I need to understand the narratives of every company that builds each one of these tools. I've got to jump into why one company decided to put that part of the workflow there. And then I've got to log into another tool and understand where that company was coming from. Just to do my job. Does this sound familiar to anybody? I mean, what a way to welcome a new developer into a company or even into a new project. Like, congratulations, here's the most demotivational and disenfranchising way that you can join our project. I honestly don't know why more developers don't run for the hills. Patience is not a quality that I usually see a lot of in developers.

00:05:27

Let's say your motivation started at 10 out of 10 on day one. New project, new company. "I'm really excited. Now I'm going to kick off my career." You're at 10 out of 10, right? After going through a painful onboarding experience, and before you're ever even able to look at code, you're down to like a two. You'd better hope that the code is perfectly written and there's no debt and it's tightly organized, because you don't have much motivation left to work with. Now, at T-Mobile, customer experience is in our blood. We love the customer experience. And just like a better subscriber experience leads to lasting and growing revenue, a better developer experience leads to higher software throughput, higher retention, and higher levels of innovation. The more investment you make to keep the motivation up and the friction down from the beginning, the happier and more productive your devs will be.

00:06:34

So why does this make sense? How do I rationalize developer experience to results? How do I reconcile this? Less cognitive load from context switches will decrease your overall cycle time. You will be faster delivering products the less you have to worry about the fragmentation of the toolchains that deliver your products. There's also less wait time within your developer value stream. Say it takes you 30 seconds just to log in to each one of the tools that exist in your software value stream.

00:07:16

You've already lost hundreds of thousands of seconds, based on how big your enterprise is. We've even calculated that if we save, right now, one second in every single CI/CD job, so we get feedback back one second faster, it's just like we hired a full-time person. That's pretty amazing. And it makes somebody think twice when they go, "Oh yeah, that'll add five seconds. It doesn't really bother me. It's five seconds. I'm getting out of it what I want." Yeah, but at the whole enterprise scale, that's a huge tax. We'd like to empower rather than impede. We don't necessarily always know that we're going in the direction of impeding, but we do know when we have a loss in creativity and when people are unhappy. It's all about empowering. If a decision is made and you're lowering the empowerment, you will sacrifice from a throughput perspective and a results perspective.
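The "one second per CI/CD job equals a full-time person" arithmetic can be sketched as follows. This is only an illustration: the talk does not give T-Mobile's job volume, so the 2,080-hour working year and the function names are assumptions, and the model assumes a person is actually waiting on each job's feedback.

```python
# Hypothetical sketch: how saved seconds per CI/CD job translate into
# full-time-equivalent (FTE) capacity. The working-year figure is an assumption.

WORK_HOURS_PER_YEAR = 2080  # assumed: 40 hours/week * 52 weeks
SECONDS_PER_WORK_YEAR = WORK_HOURS_PER_YEAR * 3600


def fte_equivalent(jobs_per_year: int, seconds_saved_per_job: float) -> float:
    """Return the fraction of one person's working year recovered."""
    return jobs_per_year * seconds_saved_per_job / SECONDS_PER_WORK_YEAR


def jobs_needed_for_one_fte(seconds_saved_per_job: float) -> float:
    """Return how many jobs per year make the saving worth one FTE."""
    return SECONDS_PER_WORK_YEAR / seconds_saved_per_job


if __name__ == "__main__":
    # Under these assumptions, one second saved per job pays for a
    # full-time person at roughly 7.5 million jobs per year.
    print(jobs_needed_for_one_fte(1))
    # The same volume with five seconds *added* per job costs five FTEs.
    print(fte_equivalent(7_488_000, 5))
```

The same function run with a positive "added" delay shows why a casual five-second addition to every job is a real tax at enterprise scale.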

00:08:21

Now, the assumption is, if you're trying to instill change, or you're trying to make a difference, that you have confidence in the business. And this isn't just confidence in the leadership; this is confidence in your strategy and how your product works. If you don't have it, you're not going to get priority cycles to even be effective. Now I want to dive in a little bit on change. Change, for me, and transformation, is really about fear of loss. I really think it's a fear of loss because I hear phrases like, "We've been told this is the last time we're making a change in this area," or "the last time we're moving." "What I already have already works for us. Why are we mixing things up?" Right? If you're not armed with a why, or if you're not investing in earning confidence in the value of your shared service, if you're trading for economies of scale, are you really doing the best thing for your company?

00:09:25

I've been with T-Mobile on this initiative for at least two years. And I've always asked myself: were we just late in determining how valuable developer experience is? If T-Mobile had started five years ago, would we be further along in our pursuit of unicorn status? Would we be a unicorn right now? If we had invested earlier, would it have even made a difference? I've reconciled that the answer is way more complicated than that. You can't take a unicorn's playbook and just magically become the unicorn overnight. There's a lot of work to make a transformation and a transition successful.

00:10:11

You've got people to convince, funding to earn, legacy systems to keep running, anti-patterns of behavior to break, feelings to probably hurt, architectures to rip apart, network firewall rule changes, policies to challenge, a culture to evolve, and unplanned work to compete with. If the platform or product or service that you'd like to transform is at 90% unplanned work, the entire team has no room to even have a thought on how this could be better. Back about two years ago, when I first joined, we definitely weren't in the state that we're in right now.

00:11:07

We were on our feet most of the time. And I went through multiple pairs of headphones due to 10-plus-hour bridges, where the padded black part of my headphones was starting to wear off. It was like I was getting a haircut every day. We used to be in constant crisis. That's where your hair gets matted down, because you've been on a bridge for so long. We definitely knew we needed a new way. One of our guidance points was John Allspaw's phrase, "incidents are unplanned investments." This was our biggest headline and was our fuel for how we were going to change this experience. Now, the interesting part about the experience for our users is that we had to acknowledge that it was currently a poor experience. That was half the battle. We had to basically say, "Yes, we know this isn't ideal, but this is what we're doing today to make it better in the future." I feel like transformation is intended to be fruitful for all, but it's painful for some and uncomfortable for most.

00:12:30

Now, if we acknowledge that the experience isn't ideal, we're leaving ourselves room for improvement. The thing is, what should this actually be? It changes every day. In a crisis, at hour number 10, what this should be is really just, "Man, I wish I never had to get on this bridge or dissect this technology again." But ultimately your postmortems should actually dictate your priority. What you've established is not sustainable. How do we get everyone together to reflect on how to make it more sustainable? This is my fourth industry going through a digital transformation. I started in semiconductors, then retail, then automotive, and now telecom. And I keep thinking one day this is absolutely going to get easier, but it never does get easier. I mean, who am I really kidding? Telecom has the same crippling legacy debt smothering our ability to ever improve, just like the other industries. It's like you're always in a hole trying to claw your way out, but the bottom is your legacy quicksand. There's hope to get out of that hole. Here are some things that worked for us. The overall objective was for us to turn unplanned work into planned work.

00:14:07

The book Making Work Visible by Dominica DeGrandis talks about this: How do we understand where work comes from? How do we accept work? Do we pull work? Do we push work? How do we ensure that our capacity limits and our WIP limits for individuals aren't being exceeded? How do we know what value is in progress and what value isn't? That transparency is a pillar of a transformation, in my mind. The blameless postmortems I talked about before: turn them into investments. Take them as an opportunity where you can take an incident and really find comfort in understanding that your assumptions were wrong. Everyone else's assumptions were wrong. How do we ground ourselves on an assumption that we know is actually right? Or how do we prepare ourselves or our systems in case our assumptions are wrong? That's architecture safety.

00:15:12

We took an interesting stance with our customers right away. When we were in crisis mode, we basically raised our hand and said, "Hey, you know what? We screwed up. Things are really, really bad." We struggled to even come up with 80% uptime. We weren't even having the nines discussion at that point, unless maybe it was 89. We told our customers we needed six hours every single week. And yes, it was going to be early-morning business hours. We need six hours every week to harden our old systems and make sure that the experience improves over time. That was really hard for customers to swallow, and I don't blame them. Again: we know this is not ideal, but this is what we're going to do today to make it better for the future.

00:16:10

We were disciplined in our operations. This seems obvious for a unicorn, but may not seem obvious for horses. Take a buddy. Pre-check all formal runbooks. Peer-check all formal runbooks. Make sure you have the estimated time allotted. These are things that probably seem natural to a unicorn and might not necessarily seem as natural to horses. And don't be afraid to back out. This one is probably one of my favorite methods to turn unplanned into planned, and I'm going to go over a little example of how I explain this particular point. This was one of the downtimes, the two-hour early-morning downtimes on Tuesdays, Wednesdays, and Thursdays. I had woken up and we had about five minutes left in the downtime. I had known from the previous day, because I looked at the runbook, that the downtime was only supposed to take 30 minutes, and we had actually allocated the full two hours.

00:17:13

We had five minutes left in the two hours, and I hadn't seen any sort of celebratory messages. I hadn't seen any sort of indicators or signals that things were going right. So I joined the bridge, and I think we've all been here before. When you join the bridge, you hear phrases like, "Well, we'll see how long that takes," or "I'm going to try this," and "Man, why is that working like that?" Right? You're hearing all of these very negative phrases. And we had five minutes left in our downtime. So I asked the question. I said, "When do we expect to be able to hand our systems back over to our customers in a known state?"

00:17:57

And the answer was, "Well, based off of the napkin math, I think we'll be done by noon." That was approximately four or five hours away. My next question was, "How long would it take for us to get back into a known state if we backed out?" "Oh, well, we put that into our runbook, and that will only take 20 seconds." And so I said, "Just back out. Why are we waiting on a judgment call to do something that we know is unknown?" We keep failing forward, patching, trying to figure things out, patching again to see if it works. Your trial-and-error periods aren't meant for your downtimes; your trial-and-error periods are meant for before the downtimes. So to get into a known state, don't feel bad about a backout. I actually celebrate backouts just as much as I celebrate non-backouts, because the whole thing was planned: you planned a backout, we made a commitment, and you fulfilled your commitment within those two hours. So we backed the whole thing out. Don't ever be afraid to back out. We also started to reward flawless execution. I really like ensuring that when things go as planned, we celebrate and embrace it. That is a huge success for us.
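The backout reasoning in this story can be written down as a small decision rule. What follows is a minimal sketch, not T-Mobile's actual runbook logic: the function name, parameters, and the exact rule (back out whenever the fix-forward estimate is unknown or does not fit in the remaining window) are assumptions made for illustration.

```python
# Hypothetical sketch of a maintenance-window backout rule.
# All names and the rule itself are illustrative assumptions.
from typing import Optional


def decide(minutes_left: float,
           fix_forward_estimate_minutes: Optional[float]) -> str:
    """Return "back out" or "continue" for a change in progress.

    minutes_left: time remaining in the committed downtime window.
    fix_forward_estimate_minutes: best estimate to finish by pushing on,
        or None when nobody can give one ("we'll see how long that takes").
    """
    # No estimate means we are in trial-and-error mode: return to a known state.
    if fix_forward_estimate_minutes is None:
        return "back out"
    # An estimate that overruns the window breaks the customer commitment.
    if fix_forward_estimate_minutes > minutes_left:
        return "back out"
    return "continue"


if __name__ == "__main__":
    # The situation from the talk: 5 minutes left, roughly 4.5 hours to
    # finish by failing forward, and a rehearsed backout in the runbook.
    print(decide(minutes_left=5, fix_forward_estimate_minutes=270))
    # A change that is on track inside its window keeps going.
    print(decide(minutes_left=90, fix_forward_estimate_minutes=30))
```

The rule only works when the backout itself is already written into the runbook and rehearsed, which is why the planned backout counts as fulfilling the commitment.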

00:19:26

Now, I feel that when you've turned unplanned into planned, you can start asking the right questions. Do you know what good looks like? What are our measurements to see whether or not we are successful as a platform, as a service? Where are our bottlenecks? What standards should be enforced, and what standards should be flexible or distributed? Are we actively impeding, or are we empowering? Do we have a community of support? Do we have a community of experts? Do we even know what an expert is in this area? And do our customers believe in us? This goes back to the confidence question.

00:20:08

If you have the right set of questions that you know you need to ask, now you can transition into finding the right solutions. This is when you can define the best practices, the ones that you can control, and simultaneously challenge the best practices that you can't control for more refinement. Think unicorn, think ideal, but also think in iterations. I don't like the phrase, "Yeah, I agree, but that's not really possible here." Fair enough. Okay, what is the path by which we can make it not impossible? What is the iteration? Determine if you have any large-scale directional movements in people, process, and tools to make during this transitionary period. This is your opportunity to change how things work after you've pulled yourself out of crisis. Treat any sort of feedback that you get as gold. I have a set of pre-transformation metrics and a set of post-transformation metrics. Is the transformation and adoption going as we would expect? Has it been successful?

00:21:27

And then what is the quality of adoption? What is the quality in your post-transformation state? And what about people: how happy are your users? We measure net promoter score. For anyone that you can turn into a net promoter, you've basically just started multiplying your staff, and you've started to build your knowledge economy of this capability, which I think is extremely important. Also, start to move to tools with low context switching, low cognitive load, and a low amount of friction, so that users are throughput focused. One of the changes we made involved source control and CI/CD. Those were two different things and two different narratives. We moved to something like GitLab and GitLab CI, which have the same narrative and were built with a minimal context switch from the beginning. Things like Conway's law come into play here. When you have the same company with the same narrative that is building and architecting for flow and throughput, you are able to be more successful from a flow perspective. You're able to eliminate waste and create a better experience.
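For readers unfamiliar with the net promoter score mentioned above, here is a short sketch of the standard calculation: on a 0 to 10 "how likely are you to recommend this" scale, scores of 9 or 10 count as promoters, 0 through 6 as detractors, and the score is the percentage of promoters minus the percentage of detractors. The function name and sample data are made up for illustration; the talk does not describe how T-Mobile collects or computes its scores.

```python
# Sketch of the standard net promoter score (NPS) calculation.
# Sample survey responses below are invented for illustration.
from typing import Sequence


def net_promoter_score(scores: Sequence[int]) -> float:
    """Return NPS in the range -100..100 for 0-10 survey responses."""
    if not scores:
        raise ValueError("at least one survey response is required")
    if any(s < 0 or s > 10 for s in scores):
        raise ValueError("scores must be between 0 and 10")
    promoters = sum(1 for s in scores if s >= 9)   # 9 or 10
    detractors = sum(1 for s in scores if s <= 6)  # 0 through 6
    # Passives (7 or 8) count toward the total but toward neither group.
    return 100 * (promoters - detractors) / len(scores)


if __name__ == "__main__":
    before = [3, 5, 6, 7, 8, 9]     # hypothetical pre-transformation survey
    after = [7, 8, 9, 9, 10, 10]    # hypothetical post-transformation survey
    print(net_promoter_score(before), net_promoter_score(after))
```

Tracking the same survey before and after a platform change fits the pre- and post-transformation metrics idea from the previous segment.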

00:22:52

The last one is core versus context. Gene mentions this in his book. It's this idea where you focus on what makes you special and let someone else focus on what makes them special. There's a good example of us doing this, actually. We chose not to run GitLab on-premise. I always ask myself this question: will T-Mobile ever be able to run GitLab better than GitLab can run GitLab? And that answer was always no. So by taking the context and handing it over to somebody else who's really better qualified and better at it in general, we can now focus on what's core to T-Mobile, and that is T-Mobile's implementation of how we use GitLab. Everyone who would have been involved in babysitting an on-prem GitLab instance (and I use the word babysitting on purpose) can now focus on what is core to T-Mobile: automation patterns, automation instruction, reusability, internal runners, our pipeline integrity, our SOX integrity, any of our integrations with T-Mobile's internal ecosystem. Now our implementation becomes something where we can have a direct influence on what the experience is. And we can leverage the fact that another company can ensure that they hold up their end of the bargain and deliver what they're good at.

00:24:42

There are a couple of lessons we learned along the way. Failed transformations can lead to an extremely productive second attempt. Too many times there's the fear of loss, when it's actually not a loss: it helped inform the decision in the future, in the evolution. We also learned that there is a great question to ask anytime you're in the middle of a discussion about standards: is this the best thing for your team, or is this the best thing for your enterprise? I love it because it's a conversation starter. It essentially goes: what is being requested is a deviation. Okay, fair enough. Is the deviation something that we should now make the new standard, that the rest of T-Mobile should now adopt, so we have the economies of scale where we can lift and shift very quickly? Or is the deviation not something that T-Mobile should adopt as a standard, and therefore the team should adopt the existing standard themselves? This gets people's brains working in a different way, understanding that we're all a part of the same ecosystem. We all have a responsibility to be a steward of that ecosystem, and it's a shared responsibility.

00:26:05

Transformation fatigue is absolutely a thing. If you have multiple transformations in flight every single day, you will be in constant chaos. This means being selective about which transformations you want to do simultaneously, or whether you want to do any transformations simultaneously. There's a balance here. How can you be productive and also transform at the same time? I think it's important to focus on the constraints that can't move and to move the ones that you can move. When I hear a no three or four different times, when I say, "Well, how about we do it this way?" or "How about we no longer do it this way?" and I continue to get nos, I realize that constraint probably won't budge and I need to come up with a better solution. I also need to know when a constraint is completely unreasonable, and maybe there are iterative steps to eventually get out of the business of maintaining that constraint. We also feel that obtaining adoption inertia is part of unlocking passion, almost like the Rebellion within The Unicorn Project. There are people who are excited and who have passion. They may not have been able to actually unlock that passion until you partner with them and understand that this transformation is about unlocking that passion. We want to create and foster that creativity and innovation.

00:27:50

We just merged with Sprint. We have a ton of work to do in terms of taking these behemoths of telecom companies and merging them together, along with the technology associated with that and the changes in people, process, and tools that are required. We need a lot of help. So please, if you're interested in joining us, go to the link you'll see there. I do want to thank everyone for listening. I know that it's been a challenge right now that we're doing things virtually, and I would love to be in London in person. Given the circumstances, I just want to thank the IT Revolution staff. They've done a phenomenal job putting on a conference, taking a constraint that they've never had before, which is doing things virtually, and making the best of it and making it a successful conference. So with that, I appreciate your time, and have a good rest of your conference.