Operability and You Build It You Run It at John Lewis & Partners

How do you transform speed from 10 to 5,000 deployments per year, while increasing overall website reliability for 30 teams working on a £3.5 billion retail website? How do you embed an operability mindset into product teams working at one of Britain’s oldest, largest, and most popular retailers?


John Lewis & Partners has provided its customers with retail merchandise since 1864. The company is co-owned by its 78,000 employees, and it operates 42 stores across the UK as well as johnlewis.com. Over the past few years, John Lewis & Partners has been on a digital transformation journey, replacing its ecommerce monolith with tens of microservices and teams, and a bespoke, award-winning digital platform.


We have a big emphasis on operability. We’ve built a Paved Road for telemetry, including availability targets and service level objective alerts. We’ve implemented You Build It You Run It at scale. We’ve adopted Chaos Days, post-incident reviews, and per-team incident management.


We’d like to share with you the successes we’ve had, and the lessons we’ve learned while adopting operability at scale. We’re hoping to encourage and inspire other folks working in large enterprise organisations! Your takeaways will be: how a digital platform can remove telemetry friction, how you can track leading indicators of operability, and how you can measure the cost effectiveness of You Build It You Run It.


Simon Skelton

Platform & Operations Manager, John Lewis Partnership


Steve Smith

Principal Consultant, Equal Experts

Transcript


00:00:13

Hi, I'm Simon Skelton, the Platform and Operations Manager at John Lewis & Partners. This means I have overall accountability for the smooth running of johnlewis.com. And whilst I consider myself a newcomer with a mere 20 years in the Partnership, throughout my career I've been an on-call programmer, led development and ops teams, and implemented ITIL across the Partnership, but I'm definitely an advocate for DevOps.

00:00:43

And I'm Steve Smith from Equal Experts. I've worked there for seven years, and I spent two and a half years working at John Lewis & Partners with Simon and a whole bunch of other great people.

00:00:53

And we're here to talk about operability and You Build It You Run It at John Lewis & Partners, and how we've gone from 10 releases a year to 5,000 deployments a year whilst also improving website stability. First, a little bit of background on John Lewis. This is definitely not John Lewis - this is the classic Are You Being Served? sitcom from the seventies and eighties - but our history actually goes back much further than that, to 1864, when John Lewis Sr opened the first store on Oxford Street. It was actually his son, John Spedan Lewis, who believed in fairness and humanity, and he experimented with a new business model, as he thought it was unfair that the three owners earned more than all 300 employees in total. In 1920 he shared the first bonus of seven weeks' pay with them all. This now means the 78,000 employees - our Partners, as we call ourselves - are all co-owners of the business.

00:01:54

And as you'll see in the middle, John Lewis Partnership is the overall brand. We're talking here about John Lewis & Partners, the department store, and we've also got Waitrose & Partners, the grocery chain, as well. Those strong foundations have still allowed us to adapt and innovate, with Edgar the Dragon being our first combined John Lewis and Waitrose Christmas advert, which is often called the official start of Christmas - and that was trending number one on UK Twitter within two minutes of launch. Our stores continue to be updated as well, to meet ever changing customer needs, with much more focus on experiences. And indeed they've had to change. Looking back, 2019 was a very challenging year for retail, with the likes of Mothercare closing down. Then, unbelievably, Brexit was no longer filling the headlines when coronavirus hit in early 2020, and other well-known brands, such as Debenhams and House of Fraser, had to close their shops too.

00:02:58

As you'll see on the right, with almost 10,000 store closures in 2020, it's been tough for John Lewis. So let's look at some positives. Here is a great example of how we quickly adapted, with our virtual services launched just two weeks after the first lockdown forced our shops to close: virtual nursery, personal styling, home design, and beauty classes all proving really popular, and something that will likely last post coronavirus. Coronavirus has also accelerated what we believe will be a permanent shift to more online. Johnlewis.com was already very successful at 40% of total sales, but we believe this is now likely to remain at 60 to 70%. Black Friday is normally our biggest online day, but of course the shops were closed once again last year. Our estimates proved very accurate, and we saw an additional 50% increase in sales on the previous year. I'm very pleased to say that the last three years' worth of investment paid off: the platform scaled perfectly throughout Black Friday and the whole Christmas period, and we traded without any issues.

00:04:13

But now let's step back in time to 2017 and look at some of the challenges we were facing then. We felt our speed to market for new features was too slow, and the technology was seen as constraining the business, not enabling it. It was also difficult to manually scale up the on-premise servers for the likes of Black Friday, as well as difficult to add more teams working simultaneously to deliver new features. This was also a key decision point: did we invest the majority of our budget and resources over the next 18 months in upgrading our commercial off-the-shelf ecommerce platform, when that would only enable us to stay in support without adding any new features? So, no surprises for guessing what we did. Back then, we had six teams working on multiple ecommerce monoliths. They were a mix of third-party commerce packages and bespoke frontends as well. We had a central operations team called Application Operations Support, which was mostly comprised of a third-party managed service with some Partners as well. We only managed one overnight deploy a month, and with summer clearance and Christmas trading period change freezes, that meant we only did 10 deploys a year. These big releases caused plenty of problems - plenty of major incidents - and we had quality issues as well. We were losing millions of pounds a year in opportunity costs, because we couldn't release new features fast enough.

00:05:52

So now let's have a look at what we did to tackle those challenges. This is a timeline and a brief narrative of a huge amount of work by a lot of people - we can't cover everything here today. In 2017, we made a commitment to replace our ecommerce monoliths with digital services while still delivering new features to customers. Those digital services run on what we call the John Lewis Digital Platform. It provides a Paved Road of bespoke platform capabilities built on top of the Google Cloud Platform. This allows us to scale up product teams without compromising on throughput, quality, or reliability. In 2018, our cloud search team were successful in taking 1% of the live traffic away from the old search engine. This validated not only the technology, but the ways of working as well. By 2019, we had nine times as many teams, and we had those product teams on call for their own services.

00:07:00

And we had new customer propositions emerging by 2020. We continued to grow and accelerate, moving significant traffic away from our monoliths to the new services, and as you saw from the Black Friday traffic, that's been very successful. Back in 2017, we believed that product teams and You Build It You Run It were prerequisites for daily deployments and higher reliability. In the two thousands, we actually used to have combined delivery and ops teams, but they were eventually split, as delivery deadlines were frequently missed and operational issues became overwhelming. But what were those issues, and what can we learn from them? Well, back then we had project-based delivery with infrequent business owner input; we now have agile product teams with frequent prioritisation from a product owner. We had manual testing, which didn't catch enough defects; we now have automated testing with continuous integration.

00:08:06

Releases were infrequent, large, and manual; we now have continuous delivery with small, frequent deployments. And the on-premise test and live environments were too different and slow to provision; we now have the John Lewis Digital Platform with cloud-based, self-service infrastructure. But when it came to operability - keeping availability high and operational issues low - the question I kept asking myself and Steve was: how do you embed operability into digital teams at scale, in an organisation that is 150 years old? Well, we broke operability down into these four areas: growing awareness, by making product teams responsible for supporting live digital services; identifying concerns, by standardising and then visualising leading and trailing indicators; testing proficiency, by running Chaos Days and live load tests; and embedding principles, by creating new learning pathways and opportunities for Partners. And now I'll hand over to Steve to give you some more details on these.

00:09:18

Thanks, Simon. So an operating model is insurance for your business outcomes, and You Build It You Run It is a policy that can achieve high standards of deployment throughput and service reliability together, in a way that's cost effective. This table shows how You Build It You Run It works at John Lewis & Partners. It's a table of availability levels matched to revenue and out of hours support. A product manager has an idea for a new digital service. They then go to the digital platform onboarding guide, which has a copy of this table, and they have to estimate the maximum amount of revenue that can flow through that digital service in a period of time. Then they match it to one of these levels and their own tolerance for risk, and that gives them an availability target, and it gives them out of hours support as well.

00:10:05

So for example, if I'm a product owner and I have an idea for a cloud search service, and maybe in 45 minutes or so it will have £570,000 flowing through it, then that will match to the 99.9% target, and I have to have a team rota for on-call. Alternatively, maybe I have an idea for a merchandising service, and I think it might take £50,000 within more than seven hours. In that case, I'd have a 99.0% target, and that would mean no on-call - we'll come to what that means in a moment. The important part here is that it's the product manager who makes the decision. It's not a platform lead such as myself, nor an operations manager or delivery manager such as Simon. The product manager is the budget holder, they make the prioritisation decisions, it's up to them. This is a good framework for revenue versus availability versus on-call, but it's not a recipe for other organisations - the maximum revenue that you tie to an availability level, and the availability level that you choose, are going to really vary based on your own business.
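To make that lookup concrete, here is a minimal Python sketch of how estimated revenue over a time window could be matched to an availability level. The real John Lewis table isn't public, so the middle level, its thresholds, and the rate-based comparison are illustrative assumptions; only the two worked examples above come from the talk.

```python
from dataclasses import dataclass

@dataclass
class AvailabilityLevel:
    target: float               # e.g. 99.9 means a 99.9% availability target
    max_revenue_at_risk: float  # max pounds flowing through the service in the window
    window_minutes: int         # the period of time the revenue estimate covers
    on_call: str                # out of hours support model

# Hypothetical levels, ordered from lowest to highest risk; only the first and
# last rows correspond to the worked examples in the talk.
LEVELS = [
    AvailabilityLevel(99.0, 50_000, 7 * 60, "no out of hours on-call"),
    AvailabilityLevel(99.5, 200_000, 120, "product domain rota"),
    AvailabilityLevel(99.9, 570_000, 45, "team rota"),
]

def choose_level(estimated_revenue: float, window_minutes: int) -> AvailabilityLevel:
    """Pick the lowest level whose revenue band covers the estimate."""
    # Normalise to revenue per minute so estimates over different windows compare.
    rate = estimated_revenue / window_minutes
    for level in LEVELS:
        if rate <= level.max_revenue_at_risk / level.window_minutes:
            return level
    return LEVELS[-1]  # fall back to the highest level

# Cloud search example from the talk: £570,000 in roughly 45 minutes.
print(choose_level(570_000, 45))     # -> 99.9% target, team rota
# Merchandising example: £50,000 over more than seven hours.
print(choose_level(50_000, 7 * 60))  # -> 99.0% target, no out of hours on-call
```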

00:11:06

We took our initial revenue numbers from an incident management policy across the Partnership and iterated on them, and you should take a similar approach. So this diagram shows the workflow of incident notifications for the monoliths on premise and for digital services hosted on JLDP. For a monolith, if there's an alert that comes out of New Relic, it goes to the Ops Bridge team. They scramble around a bunch of spreadsheets and hunt down the right member of the Application Operations team to call. Once they've found them, they invite them into a Google chat room and bring in a major incident manager for incident response. Ops Bridge also manually creates a ServiceNow record. With a digital service, it's very different. The alert could come from the platform's alerting stack or it could come from New Relic; both fire into PagerDuty, which has teams, services, escalation policies, and rotas all automatically provisioned as part of JLDP's Paved Road offering.

00:11:57

All you have to do in your service is a bit of config - just type in your service name, your team name, and your availability targets - and JLDP provisions PagerDuty and ServiceNow for you. So PagerDuty gets an alert. It matches it to a service, matches it to a team, matches it to an on-call engineer, and phones them straight away. It also immediately creates a record in ServiceNow, and there's bi-directional sync, so any changes in ServiceNow itself are reflected back to PagerDuty as well. An incident channel is created in Slack, and then the product engineer starts to do incident response. Other people can view the response because the channel is public and searchable. An engineer also has a shiny button in PagerDuty called "declare a major incident" that lets them pull in a major incident manager and follow the exact same major incident process. On reflection, adding PagerDuty into the alerting toolchain was a really key part of the operability journey at John Lewis & Partners.
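As a rough illustration of that "bit of config", here is a hypothetical sketch of the kind of descriptor a team might supply and what a paved-road provisioner could derive from it. The field names, the 99.5% on-call threshold, and the provisioning plan are all assumptions; the real JLDP config format isn't shown in the talk.

```python
# Hypothetical service descriptor supplied by a product team.
service_descriptor = {
    "service_name": "cloud-search",
    "team_name": "search-team",
    "availability_target": 99.9,  # percent, taken from the onboarding table
}

def plan_provisioning(descriptor: dict) -> dict:
    """Derive the PagerDuty and ServiceNow objects the platform would create."""
    needs_on_call = descriptor["availability_target"] >= 99.5  # assumed threshold
    return {
        "pagerduty_service": descriptor["service_name"],
        "pagerduty_team": descriptor["team_name"],
        "escalation_policy": f"{descriptor['team_name']}-escalation",
        "on_call_rota": "team rota" if needs_on_call else "none (in hours only)",
        "servicenow_record": descriptor["service_name"],  # kept in bi-directional sync
    }

print(plan_provisioning(service_descriptor))
```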

00:12:53

It meant that the time to acknowledge an incident could come down from five to 20 minutes to 60 seconds, consistently, because the incident notification workflow was fully automated. It also meant the painful friction points in PagerDuty setup, and especially in ServiceNow setup, could all be eliminated. And it meant a commitment to working with all aspects of IT operations could be demonstrated, because there was no attempt to create a separate digital incident management process. I remember insisting myself that we use the process as is, and that we work with the incident managers to help them get the most out of their role.

00:13:32

It also means that the use of public, searchable channels means incident response and incident reviews can all happen in one place. All right, this is a diagram that shows out of hours support for digital services in early 2020, based on their on-call level. On the y-axis we have availability levels from low to high, and on the x-axis we have product demand from low to high. We mentioned earlier that different services have different levels of on-call. At the lowest availability levels, there is no on-call out of hours for a service, and that includes no fallback to an operations team. That's an intentional, carefully designed approach that's appropriate for the lowest level of revenue risk. We do this because having absolutely no operations fallback generates stronger operability incentives for the delivery teams, because now they're thinking: if there's an incident out of hours, no one else is going to fix it.

00:14:35

I've got to fix it when I get in in the morning. So that encourages people to think more about operational features upfront. If a service has a middling availability target, then what will happen is a product team engineer will be on call for their digital service, or an engineer in a sibling team in the same product domain. A product domain is a logical grouping of services in the same business domain, with a focus on customer outcomes and minimal cognitive load for engineers. The way that it works is that, for example, the basket service, the electricals service, and the fashion service all operate in the same Commercial Journeys product domain rota. So tonight, one person from those three teams will be on call for those three services. This is the secret to growing You Build It You Run It at scale in a way that doesn't go up in a linear fashion.

00:15:28

As you increase the number of teams and services, you don't want to have 20 people on call for 20 services, nor do you want to have one person on call for the world. This is a way of striking an effective balance. And if a service has the highest availability target and the highest amount of customer demand, then the team operates their own on-call rota, and they have maximum operability incentives too. But that isn't forever. If product demand slows down and the product manager announces that demand has been fulfilled, at least for now, then that digital service gracefully transitions into the appropriate product domain rota. Only a minority of digital services should be in a team rota. If too many services are in that kind of rota, then it's considered to be an overestimation of revenue impact risks, or an underestimation of the mitigations in place for downstream dependencies. In each case, a team rota or a product domain rota needs a minimum of three to four product engineers. Both John Lewis Partners and Equal Experts engineers all go on call together. No one is made to do it - it's not like we have an operations team to fall back on - it's all about personal choice and trying to get an on-call rota that works for the team.
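Here is a minimal sketch of that product domain rota idea, with hypothetical domain, service, and engineer names. The point it illustrates is the one above: the number of people on call grows with product domains, not with services.

```python
from datetime import date

# Hypothetical domains, services, and volunteer engineers.
product_domains = {
    "commercial-journeys": {
        "services": ["basket", "electricals", "fashion"],
        "engineers": ["alice", "bikram", "carol", "dev"],  # volunteers from the sibling teams
    },
    "search": {
        "services": ["cloud-search"],
        "engineers": ["erin", "farid", "grace"],
    },
}

def on_call_tonight(domains: dict, today: date) -> dict:
    """Rotate nightly through each domain's volunteers; one engineer covers the whole domain."""
    rota = {}
    for domain, info in domains.items():
        engineer = info["engineers"][today.toordinal() % len(info["engineers"])]
        rota[domain] = {"engineer": engineer, "covers": info["services"]}
    return rota

# Four services across two domains, but only two people on call tonight.
print(on_call_tonight(product_domains, date.today()))
```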

00:16:40

All right, let's move on to identifying concerns with leading indicators. I vividly remember Simon saying to me at some point that trailing indicators of operability weren't enough - we needed to understand the presence of adaptive capacity, not just see it after it had been used. So this is a screenshot of the JLDP service catalogue. It's a service that runs itself on JLDP, and it's kind of like a developer portal, I guess, which is the trendy name now. What this shows is a bunch of different services by their service level and their availability rate at present, and then there's an assessments column and a telemetry column. These are showing leading indicators of operability that are relevant to the John Lewis & Partners context. With telemetry, there are automated checks that look for bespoke telemetry. Now, JLDP gives every digital service logging, monitoring, and alerting out of the box, but it's been observed that teams who build their own bespoke telemetry on top of that are more likely to handle live traffic incidents as they occur, in a timely fashion.

00:17:47

So JLDP scans for bespoke telemetry and flags up if there's nothing there at all. Green would mean that there are no outstanding tasks to complete; red, as in this screenshot, implies that teams have some work to do there. With assessments, this refers to a set of exploratory questions where teams self-assess their own services every quarter. It's called a service operability assessment, and they are "how" questions - there are no yes/no questions. It's all about how, and diving down into how teams actually operate their services. So for example, one of the questions that has to be completed says: how do you handle latency problems in a downstream dependency? What might happen is you look at that and think, we need to put in a circuit breaker. You'd record that in your response, or you might note down a JIRA ID for that task.

00:18:37

That's all machine readable. It's scanned by JLDP, and it's visualised in the catalogue as something that needs to be handled. So, as we can see in this screenshot, green for the first service means that there's been a recent assessment and there are no outstanding tasks to complete; grey means that there's been no assessment for a while; and red means that there's been an assessment and there are outstanding tasks to complete. All of this is about identifying operational problems - latent faults, if you like - before we actually have a major incident.
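The traffic-light rules just described can be sketched as a couple of simple checks. The 90-day window below is an assumption based on the quarterly assessment cadence; the rest follows the green, grey, and red rules from the talk.

```python
from datetime import date, timedelta
from typing import Optional

ASSESSMENT_WINDOW = timedelta(days=90)  # assumed: roughly one quarter

def telemetry_status(has_bespoke_telemetry: bool) -> str:
    # Automated check: red if a team relies only on the out-of-the-box telemetry.
    return "green" if has_bespoke_telemetry else "red"

def assessment_status(last_assessed: Optional[date], outstanding_tasks: int, today: date) -> str:
    if last_assessed is None or today - last_assessed > ASSESSMENT_WINDOW:
        return "grey"  # no recent service operability assessment
    return "red" if outstanding_tasks > 0 else "green"

print(telemetry_status(False))                                    # red: no bespoke telemetry found
print(assessment_status(date(2020, 1, 10), 0, date(2020, 2, 1)))  # green: recent, nothing outstanding
print(assessment_status(date(2020, 1, 10), 2, date(2020, 2, 1)))  # red: tasks outstanding
print(assessment_status(None, 0, date(2020, 2, 1)))               # grey: never assessed
```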

00:19:33

All right, this is about trailing indicators as well. We use service availability and deployment throughput as automated checks to show adaptive capacity as it has been used. This screenshot shows a delivery indicator. It's a visualisation of deployment throughput: it shows the number of days in between production deployments and the amount of time it takes to do a deployment. With this service, we can see that through 2019 into 2020 it went from fortnightly deploys to weekly deploys, which was really good, and there's some wobbliness in how long it takes to deploy. So there are a couple of conversations that Simon or someone else can choose to have with that team, about how they've improved their deployments, how they've made them smaller and more frequent, and put themselves in a better position to diagnose problems and roll back quickly - and yet there's still a bit of wobbliness about how long it takes to get something out of the door. All of this data is contextual. It's there as a placeholder for conversations; no one's going to come round with a clipboard saying you must do better. That's definitely not the John Lewis & Partners way.
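As a rough sketch, the two measures on that delivery indicator - days between production deployments and deployment duration - could be derived from deployment records like this. The data shape and timestamps are invented for illustration.

```python
from datetime import datetime

# (started, finished) timestamps for one service's production deployments (hypothetical).
deployments = [
    (datetime(2020, 1, 6, 9, 0),   datetime(2020, 1, 6, 9, 25)),
    (datetime(2020, 1, 13, 10, 0), datetime(2020, 1, 13, 10, 40)),
    (datetime(2020, 1, 20, 9, 30), datetime(2020, 1, 20, 9, 50)),
]

# Days between consecutive production deployments.
days_between = [
    (later[0] - earlier[0]).days
    for earlier, later in zip(deployments, deployments[1:])
]
# How long each deployment took, in minutes.
durations_minutes = [
    (finished - started).total_seconds() / 60 for started, finished in deployments
]

print(days_between)       # [7, 7]  -> weekly deploys
print(durations_minutes)  # [25.0, 40.0, 20.0]  -> the 'wobbliness' in deploy time
```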

00:20:15

So one way we test operability proficiency is by running Chaos Days. We want to identify digital services that may fail in production under certain conditions, before a major incident actually occurs. This is a photo of a Chaos Day review in our head office, and standing up there presenting is Rob Holmby, our product owner for the platform. This particular Chaos Day was targeted at the John Lewis Digital Platform itself, in a test environment, with some of the platform team members acting as agents of chaos. Product teams were asked to monitor their own services in that test environment and contact the platform team in a dedicated Slack channel if any issues were seen. We run Chaos Days on a quarterly basis in a test environment, and we intentionally select the most experienced team members to be those agents of chaos, to ensure they can act as human runbooks during the incident response.

00:21:19

We've uncovered plenty of latent faults in the past, such as a product team who didn't notice that a database had vanished. The learnings from a Chaos Day and follow-up tasks are captured, and we've observed that teams who fix latent faults soon after the Chaos Days are less likely to have painful incidents later on. We also regularly validate our ability to handle Black Friday levels of traffic, and we have a similar approach to that as to our Chaos Days: we visualise key components of the website and use our knowledge and experience to determine what load scenarios to try. Although product teams do their own load testing per digital service, we still find that extreme simulations of customer browsing surface issues from interactions between the different johnlewis.com website components. A live load test happens overnight to reduce customer impact, where real profiles of customer behaviour are compressed and skewed to fit the Black Friday traffic profile, and then injected into the live website.
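Here is a minimal sketch of what compressing and skewing a recorded traffic profile might look like. The request rates, the target peak, and the scaling approach are illustrative assumptions rather than the real load tooling.

```python
# Requests per minute sampled across a normal trading day (hypothetical).
normal_day_profile = [120, 150, 300, 480, 520, 410, 260, 180]

BLACK_FRIDAY_PEAK = 2600    # assumed target peak requests per minute
TEST_WINDOW_SAMPLES = 4     # compress the whole day into a short overnight window

def black_friday_profile(profile, target_peak, samples):
    # Skew: scale so the busiest sample matches the Black Friday peak.
    scale = target_peak / max(profile)
    skewed = [rate * scale for rate in profile]
    # Compress: resample the day into the shorter overnight test window.
    step = len(skewed) / samples
    return [skewed[int(i * step)] for i in range(samples)]

print(black_friday_profile(normal_day_profile, BLACK_FRIDAY_PEAK, TEST_WINDOW_SAMPLES))
# -> [600.0, 1500.0, 2600.0, 1300.0]
```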

00:22:29

Product teams use the analysis from those live load tests to improve their own digital services and protect our Black Friday capacity. We also take the professional development of our Partners very seriously - after all, they're co-owners of our business. From the very outset of our digital journey, we've ensured that Partners have opportunities to learn new skills and move into new roles. Partner engineers can embark on a number of different learning pathways. We've designed one specifically for operability that covers topics such as agile operations, security, testing, performance, learning from incidents, and more. We've mentioned before that the App Ops Support team was mostly staffed by a third-party managed service with some Partners, and those Partners have invaluable skills and experience. As we wind down that support team and reduce the managed service, those Partners are gradually moving into product teams and into the platform team itself, to share that operational wisdom and learn new skills as well. So let's come more up to date now, looking at the outcomes.

00:23:50

Thanks, Simon. So in terms of deployment throughput, the graph on the left shows deploys from 2018 to 2021, and you can see that it's rocketed up from 10 to 5,000 a year. You'll see a drop around Black Friday 2019 - that's because digital services were still in a change freeze process. There's no such dip for 2020, because by that point stakeholder confidence had increased and digital services had been lifted out of that process, which was great. The graph on the right is the JLDP service catalogue again, and that's showing the time to first customer. The time to provision a new digital service has come down from six months to one day. The average timescale to the first live customer is now 90 days, and it's coming down all the time, and teams are reporting additional millions per year in incremental revenue as a result. This is about service reliability: the graph on the left shows incident rate, and you'll see there's been no significant increase in major incidents for the past two years, during the introduction of digital services. The graph on the right shows time to restore for those exact same incidents, and you'll see that for both the monoliths and digital services there is a trend downwards, which is really encouraging. You'll also see that digital services have a much faster time to restore than the monoliths.

00:25:06

And this is my favourite slide. This is the magic table. This is all about service reliability, and about showing how the hybrid operating model works at John Lewis & Partners. This was an analysis, between April 2019 and April 2020, of all the different components of the live website at the time. For You Build It You Run It at the time, there were six digital services operated by four rotas - so one service not on call, perhaps, and some services in a product domain rota. Again, remember six services doesn't mean six people on call. The deployment frequency was daily; that's seven times faster than the third-party managed service operating the three monoliths under one rota. Digital services had only six major incidents, compared to 13 for the monoliths. The hand-off rate - the amount of incidents that required a second person to be paged in to respond, and incurred a time penalty -

00:25:58

that was one and a half times lower with You Build It You Run It. The time to restore was three times faster. You might recall that the target for 99.9% was 43 minutes - well, You Build It You Run It was awfully close to that on average, which is pretty good. And revenue protection effectiveness was three times higher. This is a measure that looks at the percentage of estimated revenue loss per incident that's actually protected, because the actual revenue loss is less than the estimate thanks to a fast time to restore. So because You Build It You Run It has a faster time to restore than the third-party managed service could achieve with the monoliths, more revenue could be protected and more money could be saved, which is a really good thing.
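Two bits of arithmetic sit behind this slide. The 43 minutes quoted for the 99.9% level looks like a monthly error budget (0.1% of a 30 day month), and revenue protection effectiveness compares actual revenue loss with the estimate. Here is a minimal sketch of both; the incident figures in the second part are invented for illustration.

```python
MINUTES_PER_MONTH = 30 * 24 * 60

def monthly_error_budget_minutes(availability_target_pct: float) -> float:
    """Allowed downtime per 30-day month for a given availability target."""
    return MINUTES_PER_MONTH * (1 - availability_target_pct / 100)

print(monthly_error_budget_minutes(99.9))  # ~43.2 minutes

def revenue_protection_effectiveness(estimated_loss: float, actual_loss: float) -> float:
    """Share of the estimated revenue loss that was protected by restoring quickly."""
    return (estimated_loss - actual_loss) / estimated_loss

# Hypothetical incident: £90,000 estimated at risk, £15,000 actually lost
# because the team restored the service well inside its target.
print(revenue_protection_effectiveness(90_000, 15_000))  # 0.83 -> 83% protected
```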

00:26:41

So what does this kind of speed and agility allow us to deliver for our customers? I've picked out one example here, which was pre-COVID: this was one of our first trials on johnlewis.com, where we wanted to improve the experience of choosing the right sofa. For an online retailer, it's not easy to gather feedback from our customers directly, but we have the advantage of being able to tap into the past experience of our shop selling Partners. So after putting the first iteration live on the website, some of the team visited one of the stores. Now, the shop floor Partners are most used to multi-year projects to roll out the likes of a new point of sale system, so they were absolutely amazed when they could see their feedback being implemented on the live website within the same day, which was excellent. So let's move forward to our current challenges, and where you may be able to help us learn from your experiences.

00:27:39

We're still working out how we achieve the best value support model, such as influencing teams to adopt the demand model. How do we safely reduce and remove the reliance on the 24x7 eyes-on support model? That's still work in progress. And of course there's the ongoing challenge of evolving service management to become more agile. So what are our takeaways? Well, how do you embed operability into digital teams at scale for a 150 year old enterprise? We think: test, learn, and continually evolve your model. Think about operability as early as possible to ensure sustainability. Maintain visibility of operability with both leading and trailing indicators. Encourage little and often deployments wherever possible, to increase agility and reduce the blast radius of deployment issues or defects. And adopt You Build It You Run It for all product teams, to maximise operability incentives and create cost effective insurance for the business outcomes. So it just remains for Steve and me to say thank you for listening. We've put a few references up here, one to our Partner recruitment website, so please check that out. There are also some talks here from some of our colleagues, articles on medium.com, and some of the playbooks. So thank you for listening. Thanks

00:29:13

Very much.