Las Vegas 2018

Scaling Continuous Delivery to Walmart

As we continue Walmart's journey to accelerate value delivery, incentivizing the outcomes we need from development teams is critical to success. We'll discuss the tight integration of metrics, open source tooling, education, and community we use to drive DevOps at the scale of a Fortune 1 company.


Bryan Finster has been a developer since 1996. In 2001 he joined Walmart, developing warehouse management systems for their global supply chain. In 2017, he joined Walmart’s Software Delivery & Enablement organization as the product owner of Hygieia development to bring metrics visibility to the teams. Currently, he leads the CD Sherpa team who assist product teams with removing constraints to continuous delivery.


Dana Finster has been a software developer since 1998. She joined Walmart in 2015 and currently works on the InfoSec SIRT Tools Support team. In 2016, she organized our 3rd DevOps day, bringing together thought leaders from all over the country to share experiences. She is the founder of our grassroots CI/CD community of practice, Continuous Chai.

Bryan Finster

Staff Software Engineer, Walmart

Dana Finster

Sr. Software Engineer, Walmart

Transcript

00:00:05

Scaling Continuous Delivery to Walmart. I am Dana Finster. I am a CD evangelist and senior software engineer in information security.

00:00:16

I'm Bryan Finster. I'm a staff software engineer and team lead for the CD Sherpa team, and we work for a small retailer in northwest Arkansas. You may have heard of it.

00:00:26

In 1950, Sam Walton opened his first little Walton's Five and Dime, but today Walmart employs 2.3 million associates who support almost 12,000 stores in 28 countries worldwide, with half a trillion dollars in sales annually. This is our scale. Yeah, <laugh>.

00:00:48

And on the IT side (ooh, that's loud), we've got hundreds of development teams worldwide deploying to hundreds of thousands of nodes supporting every business. And we have really diverse tech stacks, everything from mainframe and C to Golang.

00:01:07

And we're here to talk about scaling DevOps to this size, to Walmart size. So let's start with the first rule of DevOps. Everyone knows the first rule of DevOps, right?

00:01:18

Don't talk about DevOps. <laugh>.

00:01:21

DevOps is overloaded. The term is interpreted in many different and often confusing ways. You can't just go out and buy the DevOps. You can't hire the DevOps, but at its core, DevOps is really simple, right? The people collaborating together using lean process and heavy automation to deliver quality software rapidly. But if we don't talk about the DevOps, what is it that we talk about? And what we do is we focus in on the outcomes that we're looking for and foster the culture to attain them. We're all here looking for the same outcome, right? To deliver quality software rapidly. And the key to the culture change that's needed to attain that outcome is our people and our teams.

00:02:13

And we know from experience (we spoke about this last year) that we can grow really effective development teams by having the team focus on trunk-based continuous integration, real continuous integration, reducing the delivery increments, driving down that batch size, asking "Why can't we deliver today?" and solving those problems. The act of that team solving the problem not only makes the team really good problem solvers, but it generates a lot of teamwork. You get a really effective team that can deliver value very rapidly. The team that I came from, we went from zero to 12 deliveries a day to production.

00:02:50

Yeah. So we started by holding annual DevOps events to educate people about the concepts of DevOps and continuous delivery. Oops, sorry,

00:03:00

Go ahead.

00:03:00

No, go ahead. <laugh> <laugh>,

00:03:04

The way we're approaching scaling this to Walmart: you can't go team to team to change it, so we're taking an approach of using gamified metrics, a sharing community, a unified deploy platform, which is really key, and Sherpa guides to help teams with any struggles that they have.

00:03:25

And we started by educating people holding DevOps days to teach people about the concepts of DevOps and continuous delivery. These really got people excited and it started getting the word out. I went to one a couple years ago and was really excited to bring continuous delivery back to my team. I knew that it would make our lives easier. I knew that it would allow us to work more, work better with our business partners and deliver value faster. The problem I encountered was that I couldn't find a central area within the organization to learn more, to find out what initiatives were currently going on and how to actually implement continuous delivery. I looked around and I found a lot of pockets of really good progress. We had teams that were building pipelines. We had teams that were focused in on testing and continuous integration, and many, many teams were all trying to solve exactly the same problems independently. I had to figure this out. So I decided to host another DevOps day, and I brought in leaders and developers from all over the country to share the vision and highlight the progress that they were making within the organization.

00:04:50

But I knew that this event was going to garner a whole lot more excitement. People were gonna learn more, they were gonna want to accelerate faster, but these same excited people were gonna end up just like I was, looking for that central area to guide them in what the next steps are. So I built us a home. I started Continuous Chai, which is a CI/CD user group where people can come to share and learn about continuous integration, continuous delivery, and the myriad of topics that go along with it. This community of sharing is the first of four initiatives that we wanna share with you today.

00:05:31

I believe that when we want to change culture, it helps to use that culture to teach the culture. Sharing is a key tenet of DevOps, and it's important to share early and often. We have old habits and human nature that just hold us back, right? We only wanna show people beautiful, shiny, finished products after we're all done and say, "Look how successful I was." What we've built in Continuous Chai is a forum where people actually have the freedom to share off-the-cuff ideas, to share their work in progress, and to highlight not just their successes; it's a trusting environment where they can honestly speak about their failures and their struggles along the way. And by sharing early and openly, teams can learn a lot from each other and avoid wasting time with duplicate work efforts, struggling to solve the same problems alone. So an active, trusting user community is key to enabling large-scale change.

00:06:36

Yeah, and having the network in place has been a really valuable tool. As people start onboarding to new tech stacks, they start asking the same questions. We see it over and over again in ChatOps: how do I test this React app? Or how do I plug Sonar into this thing? And we say, "Well, have you asked in Continuous Chai?" They go to the community, and the community dives in and helps them. You get solutions so much faster than you do by Google or Stack Overflow. In fact, when Dana and I were working on the deck together, we went to Continuous Chai to get feedback, because we knew not only that we had a trusting environment with friends, but that we would get actual real feedback, not "Oh, yes, it's wonderful." It was a little daunting. Daunting. Yeah, it was a little daunting seeing all the notes come by, but it absolutely helped us improve this material.

00:07:21

So we've talked a bit about where we are and what an asset a user community can be. I hope that if you don't have an engaged community, you might be thinking about starting one. I've learned some things along the way. First of all, a leader of a user community has to have passion. It's not something that you can just tell someone to take care of and expect it to be finished. Building a community takes ongoing work to engage associates, keep a consistent schedule, and bring interesting demos and discussion topics to the group. Secondly, it takes a lot of patience. When we first started, there were many times when I was sitting in a room by myself <laugh>, or with one or two people, but even just a handful of people can brainstorm ideas and start bringing true value to the group.

00:08:17

We have iterated on different formats along the way. We've done informal coffee chats, we've done demos and discussions focused on specific topics, and even offsite meetings. Over time, we've come to find that we have the most success in our environment with meetings that have specific demo topics. We just keep iterating on the format and on the timing. We currently have over 600 members and offer weekly demo and discussion sessions. It can't be built in a day, and when the momentum does periodically slow, because it will, there's one fail-safe way to incentivize people to keep showing up: swag and free food.

00:08:58

<laugh>. Yeah, and it is true. It's amazing what we'll do for, you know, as developers for a, you know, a t-shirt. Um, but the important thing is, you know, this T-shirt is not something you get for showing up. You only get this T-shirt for contributing to the community. It's a badge of honor. And so people celebrate. Look, I have a continuous chai t-shirt. It's, it's been really important.

00:09:18

I only got one that looks like this if you're me, <laugh> <laugh>.

00:09:24

So the other thing is a unified platform. When we first started on this journey, the areas that were really digging into CD would go and spin up their own Jenkins instance or whatever tooling they were using, and other areas weren't focusing on it at all. They didn't have the tooling or the bandwidth to get it done. But it doesn't scale for every single area, every single team, to get their own platform stood up. Product teams should be focusing on delivering products. Having a consistent, unified platform that's easy to use is absolutely key. So I work in Software Delivery & Enablement. We're the area that's responsible for building the CD platform. It's a set of tools we're building; it's delivery as a service. We want teams to be able to focus on those products and then just use the automation to deliver them. We don't deliver it for them. We just build the automation. We're using open source tools and scaling them to Walmart, and I'll tell you, we break a lot of tools. And we want to make the right thing the easy thing. We want you to flow downhill to success. The initiative we work on is called Irresistible Developer Experience. We want you to use the tools because you believe that they are better, that they're easier to use, and we find it's really fast to onboard people on these tools.

00:10:42

Our delivery platform is designed to be implemented by all the teams across all the tech stacks in the organization. Having this single pipeline allows for security and code standards to be consistent across all the products in the enterprise. New tools and controls can be injected and all teams can immediately benefit. Our platform uses simple configuration files that hide the complexity of the implementation from the development teams. And not all of the features can be configured by the developers. Things like code scanning and security controls are automatically turned on. Developers don't have to set those up, and more importantly, they can't be skipped. We actually showed this slide in a Continuous Chai presentation, and one of the developers there noted that he'd never really seen all the intricacies that go on within the pipeline. To him, it was kind of like magic <laugh>. He said, "As a developer, it's almost transparent to me. It goes to Git, it gets built, magic happens."
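
To make the "can't be skipped" idea concrete, here is a minimal sketch of how a platform might merge a team's configuration with enforced controls. It is an illustration only, not the Walmart pipeline or the Concord API: the step names, `MANDATORY_STEPS`, `DEFAULT_STEPS`, and `resolve_pipeline` are all hypothetical.

```python
# Hypothetical names throughout; this is not the actual pipeline or Concord API.
MANDATORY_STEPS = ("code_scan", "security_controls")  # always on, cannot be skipped

# Optional steps and their assumed defaults.
DEFAULT_STEPS = {
    "build": True,
    "unit_tests": True,
    "publish_metrics": True,
}


def resolve_pipeline(team_config: dict) -> list[str]:
    """Return the ordered list of steps that will run for one repository."""
    steps = dict(DEFAULT_STEPS)
    # Teams may toggle optional steps only.
    for name, enabled in team_config.get("steps", {}).items():
        if name in MANDATORY_STEPS:
            continue  # ignore any attempt to configure an enforced control
        steps[name] = bool(enabled)
    enabled_steps = [name for name, on in steps.items() if on]
    # Enforced controls are appended regardless of what the team configured.
    return enabled_steps + list(MANDATORY_STEPS)


if __name__ == "__main__":
    config = {"steps": {"publish_metrics": False, "code_scan": False}}
    print(resolve_pipeline(config))
```

In this sketch the team's attempt to turn off `code_scan` has no effect, while turning off the optional `publish_metrics` step is honored.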

00:11:54

An example is Concord. It's our workflow orchestration engine. It's a general automation tool. We use it mostly for our CD pipelines, but we also use it for any other automation we want to do, including signing people up for classes. It's got plugins for all the tools we use, and it's easily extensible for other plugins that we need. More importantly, developers don't need to understand the underlying implementation. All they need to understand is how to call those things from Concord. And because it's been planned from day one to release this back to the community as open source, it's enabled a use case for Dana's team. Yeah,

00:12:35

My team actually supports our security infrastructure and incident response teams, and because of that, we are on a completely segregated, air-gapped network. Because it's designed to be released to the broader community, it's designed to be very easy to install. We're able to implement our enterprise platform in our segregated network, very easily pull in the new features, and still take advantage of all the work going on across the enterprise. And here's an example of how simple it is to configure the tools from the developer standpoint. We simply have a configuration file located right alongside the code, using a simple declarative language. This makes configuring and versioning individual repositories very simple. It hides the complexity from the developers, and each feature is just a simple function call. We can see right here that a single line of code calls Hygieia and publishes the build metrics from this repo.

00:13:50

And metrics are also very important. If teams have pipelines but they don't have goals to deliver to, they have no idea what the outcomes are supposed to be. So it's really important to make those goals clear and make the metrics clear so they understand how they're trending against those goals. To do that, we use Hygieia. If you don't know about Hygieia, Capital One open sourced it several years ago. There have been teams around our building using it for years, and we now have it integrated into our pipeline. We've also, um, I'm sorry, it's... wow. Anyway, it gives you a real-time view of the CD pipelines, and it's a really important tool for the teams.

00:14:36

The product dashboard gives teams the metrics they need to understand the health of each individual repository, with metrics including build stability, the frequency of commits to master, static analysis and test results, as well as code coverage and the frequency of deploys per day within each environment. We've also added scoring to Hygieia. The metrics are weighted and aggregated to give an overall health score. This scoring gamification helps drive improvement, and it allows teams to quickly see which code bases are more hardened and which might need a little bit of attention. Teams can analyze where they might need to put attention by drilling down into each individual metric widget. For example, we can see here that most of the code repo score is determined by frequency of merge to master. But if teams are committing directly to master without using a pull request, then they're gonna take a hit on that score.
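
As a rough illustration of "weighted and aggregated," the sketch below computes an overall health score as a weighted average of per-widget scores. The metric names, the weights, and the 0 to 5 scale are assumptions made for this example; they are not Hygieia's actual scoring configuration.

```python
# Assumed weights and metric names for illustration; not Hygieia's real settings.
WEIGHTS = {
    "build_stability": 2,
    "commit_frequency": 4,
    "static_analysis": 1,
    "code_coverage": 1,
    "deploy_frequency": 2,
}


def health_score(metrics: dict[str, float], weights: dict[str, int] = WEIGHTS) -> float:
    """Weighted average of per-widget scores, each assumed to be on a 0-5 scale.

    A widget with no data contributes a score of 0.
    """
    total_weight = sum(weights.values())
    weighted_sum = sum(weight * metrics.get(name, 0.0) for name, weight in weights.items())
    return round(weighted_sum / total_weight, 2)


if __name__ == "__main__":
    repo = {
        "build_stability": 5.0,
        "commit_frequency": 3.0,
        "static_analysis": 4.0,
        "code_coverage": 2.0,
        "deploy_frequency": 1.0,
    }
    print(health_score(repo))
```

Drilling down then amounts to looking at the individual entries in `repo` to see which widget is pulling the aggregate down.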

00:15:49

Yeah, and I was talking to Scott here from Columbia Sports last year, and he was saying we should have taken a psychology class to get this done, because to get teams to change, he called it hacking the biggest undocumented API: you poke and prod and see what the outcomes are gonna be. Metrics are incredibly dangerous things if used inappropriately, right? So you really need to understand those metrics and understand how people react to those metrics. Just because you put a metric in place and expect an outcome doesn't mean you're gonna get it. Go and investigate. As an example, we had teams coming to us when we implemented the scoring who said, "Okay, now I'm having to go and make changes to repositories that currently don't need changes just to keep the score up." Well, that just generates waste.

00:16:38

We don't like waste, right? We want value. So we made some additional changes in response to that to make things better. We created a higher-level view on top of Hygieia that aggregates the metrics up and averages them across the team. We have a tool inside Walmart that tells us how big a development team is, how many engineers are on that team. So we can average the scores based on the team size, and we can get deploys per day per developer and commits to master per day per developer, find out how teams are doing, and use that to say, "Okay, here's our goals. Here's where we are. How do we help you achieve those goals?" But even then, currently all those scores are weighted equally, and commits and deploys are far more important than code coverage. We have teams that are right now trying to raise their scores by raising their code coverage, which is incredibly easy to do. So we have on our backlog, shortly, to drop the weighting on code coverage and increase the weighting on commits and deploys to get the outcomes we want. Now, all of these scoring and widget changes, we have pushed those back to Capital One, and you can find those on their master today. Yeah.
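
The per-developer normalization described here can be sketched as follows. This is a guess at the arithmetic only, assuming activity counts collected over a fixed reporting window; `TeamActivity` and `per_developer_rates` are hypothetical names, not part of Hygieia or any Walmart tool.

```python
from dataclasses import dataclass


@dataclass
class TeamActivity:
    """Hypothetical roll-up of one team's activity over a reporting window."""
    name: str
    engineers: int          # team size, e.g. from an internal directory tool
    commits_to_master: int  # total commits merged to master in the window
    deploys: int            # total production deploys in the window
    days: int               # length of the window in working days


def per_developer_rates(team: TeamActivity) -> dict[str, float]:
    """Normalize raw counts by team size and window length."""
    if team.engineers <= 0 or team.days <= 0:
        raise ValueError("engineers and days must be positive")
    developer_days = team.engineers * team.days
    return {
        "commits_per_dev_per_day": team.commits_to_master / developer_days,
        "deploys_per_dev_per_day": team.deploys / developer_days,
    }


if __name__ == "__main__":
    team = TeamActivity("example-team", engineers=4, commits_to_master=40, deploys=20, days=5)
    print(per_developer_rates(team))
```

Normalizing by developer-days lets a four-person team and a twelve-person team be compared against the same goal.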

00:18:00

And this team view also adds some competitive fun for the teams, because there's a view above this one where we can see the scores of all the teams in the enterprise. I know this firsthand because I have a tech lead who pulls up this dashboard every day to make sure that we're still winning, that we have the highest score of all the teams in our area. So it really goes a long way when you provide the visibility, especially in this case for the more competitive teammates or teams. And while Hygieia does give us the metrics to evaluate how we're doing as a team and where improvements might be needed, sometimes teams need a little extra help to actually determine how to implement those next improvements and get to the next level.

00:18:50

So we have Sherpa guides. We're a group of developers who've done this before, and we can embed with teams and help them out. The team that I lead is the CD Sherpa team. We have been up and down the mountain. We know where the ravines are, we know where the landmarks are. We don't want you to become one of the landmarks. So we will help with anything required to get it done. We do platform support for the tools to make sure you understand how to use them, but we also run tech workshops on domain-driven design or strangulation. My favorite one's Agile Rehab. That was a really popular one. We do Leadership Outreach, where we explain that this is a change in how teams should work. You need to understand how this impacts how you incentivize the teams to get the outcomes you want. And we do team boot camps where we will embed with the team for six weeks and run two-and-a-half-day sprints. It's very similar to the dojos you may have heard about from other companies. We help the team with whatever their biggest constraint is, help them move the needle, and show them that improvement's not only possible, but can be really fun and build teamwork.

00:20:03

And we are pi-shaped developers, not T-shaped developers. We have to have breadth and depth, and it's really hard to find people for this team. You can talk to anybody trying to build teams like this. It's incredibly difficult. We tell the teams we are not agile coaches. We can coach you in Agile, because you have to be good at this stuff to get this done. But we will also help you with planning out a legacy strangulation. If you need help with test architecture, we'll prepare a program with you to teach you how to unit test. We'll do anything required. And if our team doesn't know, we've been here for a long time and we know people that do. We'll bring them in and get that knowledge to you as fast as we can.

00:20:41

So at Walmart, we're focusing on outcomes and fostering the culture change to attain them. It's not an easy task, and it's taking work from all directions. Growing an excited base of people to advocate every day is important. We're reaching out to leadership and developing a strategy that includes a single enterprise deployment pipeline, metrics that are focused on the right outcomes, and teams dedicated to training and enabling teams to deliver value safely and quickly. The single pipeline across all teams and tech stacks makes the right thing the easy thing to do. Standardized metrics make progress visible and understandable at different levels, and the gamification makes it competitive and fun. Our people really keep the momentum going to enable large-scale change.

00:21:40

Now, this works for us. We're seeing a lot of improvement using this process where we were not seeing it before, but context is really important. You need to understand that nothing you see is a cookie-cutter solution, right? You need to find out what works in your culture. If people are incentivized by badges, give them badges. If they're incentivized by certifications, do that. Whatever it takes to move the needle. Cash, get it done. <laugh> Cash, money; money is good. But also, give people permission. Dana and I are not management. We are developers, okay? Walmart has a strong culture of grassroots improvement, and we didn't ask for permission. When Dana needed a DevOps day to learn more, she said, "How do I reserve the auditorium?" Not, "May I have a DevOps day?" When she decided to start Continuous Chai, she just got meeting rooms together, spun up a Slack channel, started Continuous Chai, and then made all of the leadership help. There are people that are passionate about this in your organizations. Make sure they know they have permission. Don't assume that they think that they do. Find those people, elevate them, and give them all the runway that you can give them to bring everybody else along.

00:23:00

Now, like everybody else here, we came with a problem. And it's a common problem we hear: how do we get non-technical people aligned to the changes required on the technical side? How do we build that empathy? We're looking for effective ways to get that done. I mean, I'm a technical person and I can do a little bit, but how do I get them really to understand the change required? If you have any ideas, if you've had any success on that, please come see me afterwards and I would love to hear what that is. Um, go ahead.

00:23:34

Okay, <laugh>, we wanna just wrap up by sharing some of the outcomes that we have seen so far. First of all, teams are collaborating. Lots of collaboration between teams is helping to remove duplication of efforts and really shortening the learning curve for teams. Teams who are focused on continuous delivery are delivering faster, and they're delivering with higher quality. And when they see that and start to do that, they realize that CD removes the drama from delivery. And improvement is addictive. Teams are actively trying to improve and using metrics to measure that progress.

00:24:24

And teams are having more fun. The motto of my team is "deploy more, sleep better." When I get to sleep at night, I can find more entertaining ways to get things done at work, and I have time to just find joy. And here's an example of a team finding joy. This is literally how this team deploys to production. That toggle switch there: if it's not blinking, that means that Hygieia shows that everything is good. He then flips the toggle switch; if it's still good, Looper says the CI build is green and you're good to go. He hits the button, deploy, and Concord sends it to production. And that's exactly how he gets it done every single day. Actually, version two is all steampunky; it's really cool <laugh>. And if you'd like to have fun like that, we're always looking for good people, especially on my team: careers.walmart.com. Feel free to come and talk to me after. We're gonna be in the speaker's corner at 3:15. We're always happy to share. That's the best way we know how to learn. Thanks very much. Thank you.

00:25:39

Now we have a few minutes for questions if anybody wants. I think we have four minutes and 39 seconds <laugh>, and there's a mic coming up right behind you.

00:25:48

Quick question. Great talk, by the way. Thank you. Was there any governance model in place? I mean, did you have to go through any kind of governance model to improve that adoption of DevOps culture?

00:26:03

You know, not so much a governance model, but there is an ask, and this is super important. It's not enough to have developers pushing for this. You've gotta have executives pushing for this as well. We have an ask from our CTO that every team deploys to production at minimum once a day, right? And having him drive that, and having him look at the metrics on Hygieia to say, "Okay, this is where we're at today. What do we need to do to get there?" Right? You have the bottom pushing up because it's better for us as developers. You have him pushing down, and then we just get it done. As far as safety goes, we bake the safety into the pipeline. We don't use process, right? We automate the safety.

00:26:49

So as you release the code, do you store that metadata somewhere, like in configuration management? So that management could go back and ask when was the patch applied, or when was this defect fixed, et cetera?

00:27:04

Yeah, I mean, we know what revision went to production. Okay. And, and we can trace that all the way back to GitHub to, you know, and we, I actually, I honestly would prefer more tracking, but we're building that all the time. We're, we're hardening the pipeline constantly.

00:27:19

Yeah, we have same challenge <inaudible> right now, if there any, lemme know.

00:27:28

Sure. <laugh>,

00:27:33

I'm curious about your Sherpa team. Is that a permanent role? Is that a full-time role for people on that team, or are they balancing that with their regular workload?

00:27:41

Yeah, and, you know, Ross Clanton from Verizon presented last year on the DevOps dojos at Verizon. I gave that to my VP and said, "We really need something like this." And he said, "Okay." And so now it's my job, right? Yeah, it's our full-time job, and it's a really hard job. It's hard, number one, to find those people, and it's hard not to get them burned out. The other challenging part is we also have to do some development work. I mean, we've gotta keep our fingers dirty, because you lose that so easily. So I'm trying to make sure that on our backlog we have development work that also adds value for training, right? If we want to teach people how to test, we just build an application that's tested appropriately, that has all the information in it, and show it to them, right? But it's a full-time job. It used to be my hobby.

00:28:33

How do you guys navigate the challenges of getting consistent trunk-based development versus development teams that may want to say, "Oh, no, no. Git flow workflow's gonna work really well <laugh>, we're gonna do it that way"? You know very well it gets into religious wars over that topic. So

00:28:49

Outcome metrics, right? You're not gonna be able to go to production every single day if you're running Git flow. You're just not gonna do it. It's gonna cause too much drag. You're gonna be spending all your time trying to deal with merge conflicts and all that nightmare, right? Also, your score is gonna suck. I mean, to get a good score on the SCM widget, you have to use trunk-based development with a feature branch that is less than 24 hours old going to master, right? And if you're not merging to master, you won't even show any score, because we're only gonna measure master. So the scoring of that widget and the outcomes you're looking for help drive that behavior. And then it's just like, "Well, I can't because..." Okay, well, let me help you with why you can't.
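
The SCM rule described in this answer could be sketched roughly like this. It is not Hygieia's widget code: the merge record fields, `scm_score`, and the choice to score as a simple fraction are assumptions for illustration. Only the three rules come from the talk: measure master only, expect pull requests, and expect feature branches under 24 hours old.

```python
from datetime import datetime, timedelta
from typing import Optional

MAX_BRANCH_AGE = timedelta(hours=24)  # "less than 24 hours old", per the talk
TRUNK = "master"                      # only merges to master are measured


def scm_score(merges: list[dict]) -> Optional[float]:
    """Fraction of merges to master that follow trunk-based practice.

    Each merge is a dict with hypothetical keys: "target", "branch_created",
    "merged_at", and "via_pull_request". Returns None when nothing was merged
    to master, mirroring "you won't even show any score".
    """
    trunk_merges = [m for m in merges if m["target"] == TRUNK]
    if not trunk_merges:
        return None
    good = sum(
        1
        for m in trunk_merges
        if m["via_pull_request"]
        and (m["merged_at"] - m["branch_created"]) < MAX_BRANCH_AGE
    )
    return good / len(trunk_merges)


if __name__ == "__main__":
    start = datetime(2018, 10, 22, 9, 0)
    history = [
        {"target": "master", "branch_created": start,
         "merged_at": start + timedelta(hours=3), "via_pull_request": True},
        {"target": "master", "branch_created": start,
         "merged_at": start + timedelta(hours=30), "via_pull_request": True},
        {"target": "develop", "branch_created": start,
         "merged_at": start + timedelta(hours=1), "via_pull_request": True},
    ]
    print(scm_score(history))
```

Under these assumptions, a team that merges only to a long-lived `develop` branch gets no score at all, which is the incentive the speaker describes.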

00:29:37

From a team buy-in standpoint. I also feel like being able to show it, if, if you have teams that are doing it and they've had those aha moments going, wow, yeah, we really need to work this way because look at these outcomes and look what we're doing. If you can share that with the teams that are not doing it, and try to build them into having those same kind of, I call 'em aha moments, like <laugh>. It's like when the team has that and then they, there's no turning back, continuous delivery all the way

00:30:05

<laugh>. Well, the other thing is because we, uh, my team are all, we're all developers that come from product teams and we've worked this way. We know this is a better way of working. So we can absolutely tell 'em for real, this isn't theoretical. Your life will be better if you just let us help you move this way.

00:30:20

How many people are on your team right now?

00:30:21

Right now there's four. careers.walmart.com, right now. <laugh>

00:30:27

And I think our time is up, so thank you very much.