DevOps Journey at adidas III: Exploring Data in the Cloud

Team adidas comes back this year to describe the consolidation of DevOps and SRE practices across the whole IT department, now renamed Technology.


Fernando Cornago, who has taken more responsibility and whose scope is extending towards Cloud and Connectivity, brings this year Daniel Eichten. Heading up Enterprise Architecture, Daniel will talk about adidas’ Cloud and Data Strategy.


Last but not least, gamification is always a topic at adidas; we'll see how it is being implemented all over business and tech.

Fernando Cornago

VP, Platform Engineering, adidas

Daniel Eichten

VP, Enterprise Architecture, adidas

Transcript

00:00:10

And now it's time for our first talk. Three years ago, the team from Adidas attended this conference and left inspired by the presentation by Jason Cox from Disney. The next year, Fernando Cornago presented on the plenary stage with his VP, Markus Rautert. Last year, he presented with Benjamin Grimm, who drives the product vision for the entire multi-billion-euro e-commerce channel. And I'm so delighted that Fernando is presenting again, because the last year has been very exciting for him. He was promoted to VP of Platform Engineering, and he moved his family from Spain to the Adidas headquarters in Germany. This year, he's presenting with Daniel Eichten, VP of Enterprise Architecture. They will be describing the continuation of their journey, including what it has been like for Fernando to be given responsibility for all of infrastructure and operations, even as a career dev leader, and how they are transforming the data architecture for the entire company. Please welcome Fernando and Daniel.

00:01:07

So thanks a lot, Gene, and hello everyone. My name is Fernando Cornago, from Adidas, and it's an honor to be here for the third year in a row, this time remotely. Today is all about the consolidation of our model. At Adidas we started our transformation very early, and now it's all about spreading it globally. For this I have with me my colleague and brother from a different mother. And to be honest, I'm not lying if I say that he is the most technically brilliant mind that I've ever met: Daniel Eichten.

Thank you. Well, it's an honor to be here, even though it's just virtually. I was joking with Fernando that if he didn't take me along the third year in a row, I would not be friends with him anymore and I would not help him with the move. Without further ado, over to Fernando. Okay.

00:01:57

Thank you. So let's get started. You know our logo and our motto. We have more than 400 million lines of code at Adidas. You also know the fancy videos that we typically use to start our presentations. This year, we don't have one. Black Lives Matter, period. It's so obvious in 2020 that we don't need to say anything else. So let's get started. Okay. As a company, you can see some of our figures. Last year we did more than 23 billion in revenue. We operate in every market, in every region. And we do this with a total of almost 60,000 employees around the globe.

00:02:44

And Adidas is all about our products. We see everything as products and assets, including technology. At the end of last year, we started our global transformation of IT, even renaming it: we are now called Technology. As a company, we are complex. We cover the full life cycle of our physical products, from idea, fashion, design, planning, manufacturing and supply to selling through B2B and B2C channels. And this is reflected in our domain map that you see in the picture. This transformation allowed us to push more responsibility, ownership and partnership for technology into the business. And we are convinced that with this we create empowered teams that are accountable and focused on value, while at the same time driving simplification and innovation into the company. That bottom part resembles some of the five ideals from The Unicorn Project. Gene, I'm sure it does, right? This operating model allows us to measure our products by happiness, value, quality, reliability and flow. Thanks to flow, we are convinced that we will be able to detect bottlenecks and drive our investments better as Technology. For this I want to thank Mik Kersten personally, and all his team, for all the conversations around implementing flow that we have had within the last year.

00:04:14

You saw the products. We also categorize these products into three different types, at three different levels. On the top you see the experiences and touch points, where the key is really speed, fast reaction and innovation. Then, in the middle layer, you see our core products, where basically the outcome is data. With these products we abstract the experiences from the complexity of a company that is 70 years old and has more than 1,500 IT systems. The key factors when working in these products are quality and scalability. 80% of the job is happening behind, in the core. Experiences come and go, but the work that you do in the core and on the platform stays. And on the bottom, which is my field, are the platform products. We really encourage all of you to manage your foundations as products too: IT for IT, taking care of your customers the same way and applying all the practices that the rest of the products apply. So how are the platform teams engaging with our users?

00:05:23

This picture that you see here may look familiar to everyone who has read the fantastic Team Topologies book by Matthew Skelton and Manuel Pais. How do you manage it? I would say: it depends. Some teams, as you see in the picture, cover critical capabilities that we decided to decentralize for the sake of speed. These teams typically engage with the users through enabling or collaborating models. You see the API teams, the CI/CD, fast data and streaming teams. They help others to build their own solutions and master the capability themselves. Other teams still provide end-to-end services, centralized for the sake of efficiency, because of the small size of the demand, or simply because there is some big technical complexity. The cognitive load, or the amount of information that a team can handle, is a critical factor to consider whenever you are designing your organization.

00:06:25

And last but not least, we have a playbook. Do we really tell our teams how to function? We give our teams a couple of suggestions and a couple of basic rules. On the left, you can see the rule for how they should spend their time. We always tell them to continuously decrease, through automation, the time that they spend on manual operations. The rest of the time is spent on value creation, and we tell them to split it 50/50 into the next two buckets. First, building the platform, so the platform doesn't become obsolete. Second, consulting: working with the users, so they can see first-hand how the platform feels from the outside and also keep the community alive, which drives standardization and speed. And the other basic rule that we give the platform teams, which you can see on the bottom, is how to measure themselves.

00:07:21

So, the platform value. We are a product, so we need to measure our value. The platform value that we measure is our adoption, so the number of people and products that are using our technology, and the NPS: how happy they are with the service that you are providing. IT for IT is hard to measure by direct business value. Where you really can measure the business value is in the evolutions that you make to the platform. Once you reach a certain number of users, every single evolution that you make in the platform creates a huge impact on a lot of teams. And of course, on the right, you can see that we apply the same operational metrics as the rest of the teams: flow time, because you don't want to be the bottleneck, and availability. You will see more later about availability and about the business loss that you create with outages.
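To make the two platform value metrics named here concrete, the following is a minimal sketch, not the adidas implementation. It assumes adoption is counted as the share of eligible products using the platform and that NPS follows the standard promoter-minus-detractor formula; the `PlatformSnapshot` type, its field names and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PlatformSnapshot:
    """Hypothetical snapshot for one platform product (not from the talk)."""
    products_total: int        # products that could use the platform
    products_onboarded: int    # products that actually use it
    survey_scores: list[int]   # 0-10 answers to "would you recommend it?"


def adoption_rate(snapshot: PlatformSnapshot) -> float:
    """Share of eligible products that use the platform (0.0 to 1.0)."""
    if snapshot.products_total == 0:
        return 0.0
    return snapshot.products_onboarded / snapshot.products_total


def net_promoter_score(scores: list[int]) -> float:
    """Standard NPS: percentage of promoters (9-10) minus detractors (0-6)."""
    if not scores:
        return 0.0
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)


if __name__ == "__main__":
    snap = PlatformSnapshot(200, 150, [10, 9, 8, 7, 6, 10, 3, 9])
    print(f"adoption: {adoption_rate(snap):.0%}")
    print(f"NPS: {net_promoter_score(snap.survey_scores):+.1f}")
```

Tracking both numbers per platform product over time shows whether growth in adoption comes with satisfied users or with users who have no alternative.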

00:08:18

So this is the overall framework, and then Corona came into our lives. How did our way of working help us to get through the crisis? The first thing was focus: cutting costs and avoiding cash-out wherever possible, due to the big uncertainty that we had three months ago. We used our product domain map to visualize our key drivers and move our resources and investment to where we were creating business, in this case only our digital e-commerce ecosystem. It was almost our single open store. As my boss Markus always says, our investment model looked pretty much like a hammer: very thick on the top and very thin on the bottom. That is okay, but we all know that it's only sustainable for a short period of time. By the way, by moving all of this focus to one area, we verified that our platform strategy and common technology toolset was the right one, because it allowed us to move teams from one area to another, or to move scope to teams, which we always prefer over moving teams.

00:09:28

And all of that with really short ramp-up times. Next, after Corona, let's look into efficiency levers that take a little bit more time and more effort. Can we really do the same for less? Can we be more efficient? We looked at the product domain map from a different angle, in this case for the run cost. While the build costs that you saw earlier resembled a hammer, the run budget looked more like a kettlebell that we use for exercise: very, very heavy at the bottom. Essentially, that was because of the way we have traditionally managed our infrastructure cost. The products, which are the ones using IT, have really low visibility of the cost that they are creating, because the budget has always been managed centrally.

00:10:20

We are challenging this by implementing Technology Business Management, which will help us to visualize the total cost of ownership by product, so the products see the real cost that they are creating. We are convinced that this will boost the cost-awareness culture across the company. Imagine all the things that we can do in the IT management team with value, cost and flow really measured at the product level. But that's cost, cost, cost, and we know that DevOps is not only about cost, right? We truly believe that DevOps also comes with value generation. As I told you, our digital channels have been our single open store for almost three months due to coronavirus. So the company doubled down on e-com, and we even increased the target of 4 billion of revenue a year on this channel to 4.5.

00:11:15

And now we are even talking about 4.7, which would be an increase of around 30 to 40% year over year. And the biggest pain for our digital ecosystem is outages. Apart from damaging the brand and our relationship with the consumer, they create revenue loss. So we checked with our data and analytics team, and in the last months before this initiative we were losing almost a million euros a month in revenue. This revenue loss, together with the percentage of defects leaked to production, became the key KPIs that we decided on when we launched our initiative, which we call Digital Experience Excellence. There we put our platform engineering teams, focused on engineering enablement and developer productivity, to work together with the team for the last quarter. We wanted to be in the elite, right? So we measure ourselves against the best, because the elite is where we need to be. And our time to restore, as you can see here, lately 4.5 hours, was really not looking right, or not looking as expected.

00:12:28

So we really thought about attacking this metric. For this, we created a framework with four streams, each led by one e-com person and supported by one of our platform experts. We implemented site reliability engineering practices at scale. We revisited our QA strategy and our end-to-end experience testing. And let's not forget that we are talking about a big area here: more than a thousand engineers working in our ecosystem on a daily basis. Last but not least, we also reviewed the release management practices followed in the area, where we were maybe too far towards the DevOps extreme. We found out that 62% of the outages were caused by changes. Of course, like everything that we do, all these streams had their own KPIs, which contributed to the two killer KPIs: net sales loss and defect leakage to production.

00:13:34

For all these collaborations, we decided to move out of our digital ecosystem and into our dojo, our space dedicated to learning and experimenting in platform engineering. You saw our collaboration models before. We had struggles in the past with defining the scope and duration of our collaborations, so we sometimes found our people stuck in a project instead of developing platforms or spreading platform usage across different areas. So now we always start with the problem statement: the KPI you want to act on, and how the capabilities of both teams getting into the dojo can help each other. Our learning: this is a game of peers. The platform engineering team is not the smarty-pants that comes to tell a product team that everything is wrong, right? Dojos help to combine the strengths of the platform team, with the deepest knowledge of a technical matter, and the team that owns and lives the product on a daily basis.

00:14:39

And of course, last but not least, they need to be time-bound and value-based. Please set clear expectations and a time limit in order to achieve them. The results so far are amazing. As you can see in the graphs, we have drastically reduced both mean time to restore and mean time to detect, and also improved the mean time between service incidents. And although you can see on the top right that the revenue loss has a small spike in May, this is because of two outliers that were caused by external services, in our case payment providers. Because, let's face it, nowadays with Corona every single digital online service is struggling with increased demand. Our learning there, anyway, is that we need to protect ourselves better and react better to outages from third parties. And with that, I'd better stop talking and hand over to Daniel to enlighten us with the Adidas architectural vision. None of the things I've told you would have been possible without it.
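The incident metrics mentioned in this part of the talk (mean time to detect, mean time to restore, mean time between service incidents) can be derived from a simple incident log. The sketch below is illustrative only: the `Incident` record and its fields are hypothetical, and it assumes that both detection and restoration are measured from the moment the incident started, which is one of several common conventions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    started: datetime    # when the service degradation began
    detected: datetime   # when monitoring or a person noticed it
    restored: datetime   # when the service was healthy again


def _mean_delta(deltas):
    """Average a sequence of timedeltas; None if there is nothing to average."""
    deltas = list(deltas)
    if not deltas:
        return None
    return timedelta(seconds=mean(d.total_seconds() for d in deltas))


def mean_time_to_detect(incidents):
    return _mean_delta(i.detected - i.started for i in incidents)


def mean_time_to_restore(incidents):
    return _mean_delta(i.restored - i.started for i in incidents)


def mean_time_between_incidents(incidents):
    """Average gap between the start times of consecutive incidents."""
    starts = sorted(i.started for i in incidents)
    return _mean_delta(later - earlier for earlier, later in zip(starts, starts[1:]))


if __name__ == "__main__":
    log = [
        Incident(datetime(2020, 4, 1, 10, 0), datetime(2020, 4, 1, 10, 30),
                 datetime(2020, 4, 1, 14, 30)),
        Incident(datetime(2020, 4, 11, 10, 0), datetime(2020, 4, 11, 10, 10),
                 datetime(2020, 4, 11, 11, 30)),
    ]
    print("MTTD :", mean_time_to_detect(log))
    print("MTTR :", mean_time_to_restore(log))
    print("MTBSI:", mean_time_between_incidents(log))
```

For the first two metrics lower is better, while for the time between incidents higher is better, so a dashboard should state the desired direction next to each number.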

00:15:47

Yeah. Thank you, Fernando. I would now say "and now for something completely different", but actually it's not different, because we are just talking about how we can underpin everything that we heard earlier with our updated cloud strategy. When I was looking around for a picture that could represent our cloud strategy 2.0, I found that nice picture, which I really liked. But then, looking at it for some time, I said: hmm, that's maybe a little bit too intimidating, so we shouldn't really use it. And actually, our cloud strategy 2.0 is not exactly 2.0 either. It's more like cloud strategy 2.1.4, or whatever our wiki gives us as the versioning scheme. And this one already looks quite a bit nicer, like a warm summer day, right? So what are the building principles of our updated cloud strategy?

00:16:40

Our first cloud strategy was really oriented around how to avoid lock-in, looking at a potential "clexit", or cloud exit, and giving our developers and engineers the same developer experience in all of our areas. Although all of these principles were good principles, and really meaningful, we encountered a problem with them. Out of our great ambition not to have vendor lock-in, we created another lock-in, which is our tech stack choice lock-in. We locked ourselves into containers. We locked ourselves into Kubernetes. We locked ourselves into Jenkins. That is something we can still perfectly live with today, but it may not make very much sense when I look back at all of these metrics and we say we should focus on creating value.

00:17:43

So what is our secret key to avoiding vendor lock-in? Well, it's pretty simple: there isn't any. We deal with it differently, because now we pick lock-ins that we love. I would call them love locks, like the ones you see in this picture. So what do I mean by lock-ins that we love? You might be asking yourself what can be beneficial about a lock-in. Well, there are a couple of benefits to bringing in standards, even though they might be just de facto standards. Knowledge is usually broadly available on the market, so if we bring in new talent, there is a good chance that they have already mastered the skill sets we are looking for. And just thinking about de facto standards: if I say photo editing, you probably have one or two choices in mind, the same ones everyone else has in mind.

00:18:32

One is about the magician, the other one is from this big company starting with an A, but you get the idea. So in our new cloud strategy, we simply accept the fact that it's not super easy to move workloads from left to right. But we gain a lot of benefits from making use of higher-level services. We are not wasting our time anymore spinning up basic systems. We dive in directly and create value. And actually, I was lying to you a little bit, because there is still a secret weapon to avoid vendor lock-in, which looks a little bit like this. The only way we do it, if there is really a good reason to move things from left to right, is actually destroying it and redoing it from scratch in the new technology. Talking about technology: our technology vendors.

00:19:32

So obviously we also work with the big hyperscalers: with AWS, with Azure, with GCP. By our history, the AWS usage is more on the consumer-facing side, which is very prominent. Azure is something that we make use of for our employee-facing applications, for managing business or employee productivity. And GCP is just a new member of the group. At the moment it's very, very thin and special purpose, so maybe we'll talk about this next year, but I'm very excited to have them on board as well. So this is what we call our multi-cloud strategy. And obviously we also still have our own data centers on-prem, as Gaia-X is at the moment nothing really more than an architectural idea or a concept. There is still a need for that, because we still have people in the company who say you are not supposed to put everything onto the public cloud.

00:20:29

So how do we now decide what goes on public cloud and what goes on premise, and what are the typical workloads that we have? Let me start with this quadrant. It might look a little bit like some other quadrants that you are aware of, where the leaders are also in the upper right corner. And that is also where our key areas are. That's the big e-commerce, that's our big data, where we really leverage the benefits of modern cloud platforms. Right now that makes up 25% of all workloads. The second area is all of these new container, sorry, cloud-native workloads that we have on-prem and that we keep over there, where we have some data gravity, or where we are not even allowed by data privacy principles to move them to a public cloud provider.

00:21:18

That's roughly 10%. Then there's still a huge, huge, huge area of legacy workloads where we stay on-prem, for a good reason as well. This is more of the traditional infrastructure setup, and that is, for example, our warehouse management solutions. And then we are also making use of the cloud for some legacy workloads, specifically in areas where we don't have our own data center and it would be more expensive to build one than to just pick these services. And how do we make that available to our developers? It's pretty simple: we create a nice landing zone. It starts with very locked-down cloud defaults, which are secure and compliant, but we also make it as easy and as convenient as possible for everyone to use. And for this one we actually stole, okay, let's say adopted, some other people's nice ideas.

00:22:13

And we're doing everything through Git. GitOps is anyhow something that we practiced before, but creating new cloud accounts, or even asking for them, is now also run through Git. If you want to have a new cloud account, you go to Git, you fork the repository and you send in a pull request. It gets reviewed, it gets merged, and once the merge is accepted, it is also directly published into the cloud and you get your keys. For the less experienced users, and we have some, we just had to build a UI which integrates with Git, rather than having them use Git and configuration files directly. That's also pretty simple and easily doable. So the request form now integrates directly with Git.
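The pull-request flow described here implies some automated check on the request file before it is merged and published. A minimal sketch of such a check follows. Everything in it is an assumption made for illustration: the JSON request format, the field names, the allowed regions and the `public_access` flag are hypothetical and not taken from the adidas setup.

```python
import json

# Hypothetical policy values, chosen only for this illustration.
ALLOWED_REGIONS = {"eu-central-1", "eu-west-1"}
REQUIRED_FIELDS = {"account_name", "owner_team", "cost_center", "region"}


def validate_account_request(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the request can be merged."""
    try:
        request = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(request, dict):
        return ["request must be a JSON object"]

    missing = sorted(REQUIRED_FIELDS - set(request))
    problems = [f"missing field: {name}" for name in missing]

    region = request.get("region")
    if region is not None and region not in ALLOWED_REGIONS:
        problems.append(f"region not allowed: {region}")

    # Locked-down default: public exposure has to stay switched off.
    if request.get("public_access", False):
        problems.append("public_access must stay disabled by default")
    return problems


if __name__ == "__main__":
    good = '{"account_name": "shop-eu", "owner_team": "ecom", ' \
           '"cost_center": "1234", "region": "eu-central-1"}'
    print(validate_account_request(good))
    print(validate_account_request('{"region": "us-east-1"}'))
```

A check like this would typically run in the pull-request pipeline, so reviewers only see requests that already satisfy the defaults, and the UI for less experienced users can reuse the same function before it opens the pull request.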

00:23:00

And then the rest of the process stays as is. Now I have to say that all this is valid for our setup in AWS, because it's rather big. For the other cloud providers, like Azure and GCP, we took another strategy. We just say: we leave the doors open, you can have whatever you want, here are the keys. But be aware that there will be a watchdog looking after what you are doing. And this watchdog is also helping us to nail compliance down to a metric. And there we are again, measuring everything in a metric, right? This brings us to the next thing that we are doing from an architecture point of view: we are implementing architectural fitness functions across the different products and across the different product domains. So in general, what does it mean? If you attended last year's conference, you saw this nice presentation of the improvements that we made to the site speed.

00:24:00

And you see on the left-hand side how it looked afterwards, and on the right-hand side how it looked before. So it's already a good indicator of some improvement. You heard earlier from Fernando that we are measuring our mean time between failures, our mean time to detect and our mean time to recover. I just hope this bus driver had a good disaster recovery strategy. But where we can, we are also measuring financials. We are measuring the cost of outages. We are measuring the cost of running the service, and, where possible, we are also measuring the benefit of deploying a feature. We make all of that available in our central dashboard, our global metrics portal, which we have already had for quite some years. I was trying to find a picture without a brand name, but I think I'm quite ... because most people would recognize what brand that is.
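An architectural fitness function, as the term is used here, is essentially an automated pass or fail check on a measured characteristic such as site speed, recovery time or run cost. The sketch below shows one possible shape. The metric names and thresholds are hypothetical values chosen for illustration, not figures from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class FitnessFunction:
    """One automated check on a measured architectural characteristic."""
    name: str
    metric: str                      # key expected in the measurements
    passes: Callable[[float], bool]  # True when the value is acceptable


def evaluate(functions: list[FitnessFunction],
             measurements: dict[str, float]) -> dict[str, Optional[bool]]:
    """Map each function name to True/False, or None when nothing was measured."""
    results: dict[str, Optional[bool]] = {}
    for function in functions:
        value = measurements.get(function.metric)
        results[function.name] = None if value is None else function.passes(value)
    return results


# Hypothetical thresholds, for illustration only.
FUNCTIONS = [
    FitnessFunction("site speed", "page_load_seconds", lambda v: v <= 3.0),
    FitnessFunction("time to restore", "mttr_hours", lambda v: v <= 1.0),
    FitnessFunction("run cost", "monthly_run_cost_eur", lambda v: v <= 50_000),
]

if __name__ == "__main__":
    product_metrics = {"page_load_seconds": 2.1, "mttr_hours": 4.5}
    for name, ok in evaluate(FUNCTIONS, product_metrics).items():
        status = "not measured" if ok is None else ("pass" if ok else "FAIL")
        print(f"{name}: {status}")
```

Returning `None` for unmeasured metrics keeps "we do not know" distinct from "we fail", which matters when the same set of functions is applied across many products and domains.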

00:24:52

But let's also talk about one tiny failure of ours, and that is our data lake. Looks beautiful, huh? Well, I picked the picture of a golf course on purpose. Why? Because it is not only always super beautiful, but everything over there is quite artificial, yet created to look natural. And you have a couple of greenkeepers, usually a small group, taking care of that golf course. And you also allow access only in a very limited way. And this is actually where I say we failed. If I now translate that picture: we had our greenkeepers out to make sure that the lake had the cleanest and purest water that you can find, and to always ensure that it had drinking quality. Well, guess what? It became a bottleneck, because they were trying to fill that lake with a garden hose.

00:25:46

Their capacity was very limited, their bandwidth was very limited. It's just a single team, right? And the result was that other teams who really had the demand were creating data buckets left and right, as well as some very playful puddles of data used by the data scientists. But obviously that was creating other issues again, as we had no central visibility of all the data that is available, and also no way to make it available to everyone. So it could be that we already had some very meaningful data available, but others recreated it. And this is how we came to making use of what we now call a data mesh, or what some other people refer to as DATSIS, which goes back to the article on Martin Fowler's site and says a data product has to be discoverable, addressable, trustworthy, self-describing, interoperable and secure.
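The six data product properties listed at the end of this passage (discoverable, addressable, trustworthy, self-describing, interoperable, secure) can be turned into a simple checklist on a data product descriptor. The sketch below is one hypothetical way to do that. The `DataProduct` fields and the mapping from field to property are assumptions made for illustration, and a real check would verify far more than the presence of a value.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical descriptor; each field backs one of the six properties."""
    name: str
    catalog_entry: str = ""                      # discoverable
    address: str = ""                            # addressable
    quality_slo: str = ""                        # trustworthy
    schema: dict = field(default_factory=dict)   # self-describing
    data_format: str = ""                        # interoperable
    access_policy: str = ""                      # secure


# Property name -> descriptor attribute that has to be filled in.
PROPERTY_FIELDS = {
    "discoverable": "catalog_entry",
    "addressable": "address",
    "trustworthy": "quality_slo",
    "self-describing": "schema",
    "interoperable": "data_format",
    "secure": "access_policy",
}


def missing_properties(product: DataProduct) -> list[str]:
    """List the properties whose backing field is still empty."""
    return [prop for prop, attr in PROPERTY_FIELDS.items()
            if not getattr(product, attr)]


if __name__ == "__main__":
    orders = DataProduct(
        name="orders",
        catalog_entry="catalog/orders",
        address="s3://example-bucket/orders/",   # placeholder location
        schema={"order_id": "string", "amount": "decimal"},
    )
    print(missing_properties(orders))
```

Run across every team's buckets and puddles, a report like this gives the central visibility the passage says was missing, while ownership of each data product stays with the team that produces it.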

00:26:48

But the biggest learning that we took out of that is that it wasn't really a failure in terms of technical setup or architecture. It was a failure in terms of organizational setup, because we just gave one team that one single target. And now we are using that to our benefit. We are reversing Conway's law and changing some organizational setups, with different incentives, to achieve the target that we actually want, which is that the data is available to everyone. And with that change, it's pretty clear for us as well: we know that change is a team sport. With that, I give back to Fernando.

00:27:31

Thanks a lot, Daniel, it's always amazing. And as you say, change is a team sport. For us, as usual, it is all about gamification, right? It is in the DNA of our company. So let me finish up with a couple of things that we are doing right now and how we are using gamification across the whole of IT. Things that we started in engineering last year with the DevOps Cup are now expanding across IT, like Cash is Queen, right? We spoke about how to save money for the company, how to do the same for less. We launched this campaign, Cash is Queen, where we give a copy of The Unicorn Project to the team that saves the most money in a week. Thanks to this campaign, we have saved 10%: we have switched off 10% of the machines in our on-prem data center, more than 500 VMs.

00:28:28

And some databases. DevOps and gamification are also everywhere, right? So we have the game of technical debt. One team, one product area with around a hundred engineers, is playing one technical debt sprint every six sprints, working with the Flow Framework. And they realized that after the cleanup sprint the velocity goes right up, and then starts decreasing over the next five sprints until they clean up again. Operational analytics: our operational analytics team is following Accelerate by the book, step by step, decentralizing operational analytics to the market teams. And what I am most proud of this year is the network and the identity teams. These teams come from a former infrastructure way of working, and they are applying, little by little, more DevSecOps principles. So they are not throwing anything over the fence to security. They are working in sprints, they visualize the backlog, and security can chip in. Security now feels more confident that the network team and the identity team are really building a secure way of working into their backlog.

00:29:39

And they can really affect these backlogs for the good of the company. The biggest success of the year, no matter whether we reach the 4.5 billion or the 4.7 billion, is that from day one of Corona all our 60,000 employees were able to work from home, without us having to risk their lives. Last but not least, this gamification is coming to our business. Our robotic process automation team launched the "License to Automate" race, in which they are opening up our stack to the business, because basically it's not difficult to use. We have 360 cases that people have submitted. We are training them, and so far 60 to 70 of the cases look really promising for saving manual work in the company. Last year alone, at the beginning of the creation of the platform, we saved 62 FTEs of manual work with a team of five to ten people.

00:30:50

Okay. And just like every year, I want to close by asking for help. It's about our plans. You've seen our product domain map, you've seen our products. Daniel and I really want to define all the data flows around entities throughout the different product areas, so that we are really strict in governing all these data interactions. And we want to put a data catalog and a data visualization service there, so that we are really able to manage and test this data, because this will make the products internally faster. So if any of you have done something like this in the past, in a company of our size or our complexity, your help will be really appreciated. And without further ado, thank you for your time. Hope to see you next time in person, in Vegas. In Vegas, yes, baby! Yeah. Thanks.