San Francisco 2016

The Mainframe DevOps Team Saves the Day at Walmart

This session will discuss the success story from Walmart on how they built a set of services on the mainframe to provide capabilities at a large scale for their distributed teams, as well as discuss the transformation required for mainframe teams to achieve this success.

Rosalind Radcliffe

Distinguished Engineer, Chief Architect for DevOps, IBM

Rich Jackson

Principal Systems Engineer, Walmart Technology

Transcript

00:00:02

We're not gonna talk a lot about DevOps in particular, or, say, CI/CD or automated testing. This is really a story about attitude and interactions. Before we get there, let me tell you a little bit about Walmart. We were established in 1962, so we're just shy of a 55-year-old company. We employ 2.3 million associates around the world. We serve approximately 260 million customers each week. We operate over 11,500 physical retail units under 63 banners in 28 countries across the globe. We also operate e-commerce sites for 11 countries. And last year we clocked in just north of $482 billion in revenue. Of those 2.3 million associates, I think 6,700 to 6,800 are Walmart-employed technology associates. So that's a little background on us. I'll let Rosalind introduce herself.

00:01:22

So I'm Rosalind Radcliffe. I'm a Distinguished Engineer at IBM, and I'm the chief architect for enterprise systems DevOps. I spend a lot of time working with clients to help them understand their transformation, how to do that transformation, and how to include the mainframe in it. I've done a number of sessions like this; sometimes I get to do them with Rich, sometimes with Rich and Randy. But the fun part is I'm the dev side of this. Now, I've been in ops and I've been in dev — I've been in all sorts of different fields — but I get to play the dev in this discussion.

00:01:56

And as Rosalind implied, I come from the ops side. I guess my path has led me to identify firstly as an ops guy, then a mainframer, and a little bit of a programmer. I started back in the mid-nineties doing freelance work — basic sysadmin stuff for Windows folks around town, small businesses, network setups, backup systems, that type of thing. I went to school, got a BBA in computer information systems, and started looking for a real job. I actually wanted to go to Walmart; there were things I found attractive about the company. I was trying to get a job in the DBA space, but I got a callback and an offer to work on the mainframe, which I had never done before. I had no experience with the mainframe.

00:02:49

And I explained this to them, and they said, that's fine, we'll train you. So I went to Walmart, started working on the mainframe, and really dug the platform. I did storage admin and sysadmin, then moved into systems and solutions design. That's where I was when this story started — in that design space. I guess the point is, it's probably a similar story to most folks in this room; I just happened to end up on a unique platform, so I got a different tool to use. So the story involves a problem with one of Walmart's critical business functions: inventory management. I know there's at least one other retailer in this room that understands the importance of inventory management. It is a huge deal for a retailer. Gene showed the slide the other day about the core conflict —

00:03:51

— and that is the essence of inventory management. You want to always have enough on hand so that a customer gets what they want when they want it. On the flip side, you don't want to have too much, where you're tying up cost in how much inventory you're holding. To use an analogy that's probably relevant for this room, you can think of it as a small-batches scenario: a steady flow of just enough inventory moving through the system on a continuous basis would be ideal for a retailer. And the implications, especially at scale, are significant, to put it lightly. We had an officer recently go on record talking about out-of-stock issues at some of our stores.

00:04:48

And this was a year and a half ago, I think. The estimate of the impact of our out-of-stock issues was about $3 billion in sales a year. And that probably doesn't include reputational impact from customers that may not come back because we didn't have what they wanted. So not having it there when the customers need it is a huge deal. On the flip side, you can look at our balance sheets and see that we hold about $45 billion worth of inventory quarter by quarter. To use a very simple theoretical example of the impact there: if I could do something to reduce that by 1%, that's $450 million. I'm sure most of you could find a way to use $450 million for your business, right?

00:05:46

Absolutely. <laugh> You could reinvest that in price, in technology investments, in people investments. That's a lot of capital that could be very useful if you could manage that inventory a little more tightly. So it's huge. And Walmart actually does a pretty good job. It's not perfect, but we're kind of known for doing a pretty good job at that. It's actually surprising, if you go to a Supercenter and look at the back room where stock comes in, how small that space is. I was surprised by it, actually. Basically it's a glorified hallway; relative to the floor space of the store itself, there's not a lot of room back there to hold a lot of inventory. So we do pretty well with it. And that efficiency Walmart achieved in inventory management came from technology investments made quite a while ago: in the early nineties, Walmart made about a $4 billion investment in a new system called Retail Link.

00:06:51

A short time before that, we had really bought into barcode and POS systems and started collecting all the data and storing it in a huge data warehouse. What we did with Retail Link is open up all that data to all of our suppliers and implement vendor-managed inventory. What that means is that the suppliers who provide us inventory and products have visibility into our sales — how their products are competing against other products, what the turnover rate is, velocity, all these things — so they can help us manage that inventory, and they can also manage their production cycles better. It's a win-win for everyone.

00:07:39

There's a quote from Sam Walton that I love related to this, which really speaks to how big a deal it was for Walmart, how it differentiated us at that time, and how it ultimately became a game changer for the industry. He said: people think we got big by putting big stores in small towns, but really we got big by replacing inventory with information. And that was the key — the information, and the information sharing in particular, and getting those partnerships with the suppliers — that allowed us to grow like we did during those times. Now, nothing's static. It was an awesome system, it did a lot for us, and it had received enhancements over the years, but come the 2010s it was in need of some significant retooling and reworking — pretty much an overhaul. A few things needed to happen.

00:08:43

Well, firstly, there was a desire to free our suppliers from IE 6. Don't judge me. <laugh> That was a pretty big deal. But along with that, there were other things. There are new types of data available that have an impact on sales and sales forecasts: we needed to incorporate support for social streams, the free availability of weather data, and things like that, because all those things affect sales and sales velocity. We also wanted to convert it to a much more service- and API-driven system than it was. So it was time to make another substantial investment in this system, and in February 2012 our development team started working on the first phase of this enhancement effort. The design they were working on included a pattern I'm sure we're all familiar with.

00:09:46

It included, as part of the design, a web server layer backed by a caching layer. The caching layer was used for performance and things like that, but primarily for session state management. One of my distributed middleware engineer counterparts actually likes to put the "caching" in quotes, because they use it outside the norm of a simple token cache — they cache quite a lot of information about these sessions. So it's a very heavy implementation of a cache, and that will become relevant; we'll touch on it a bit more in a moment. For that caching layer they chose an off-the-shelf appliance. From an enterprise perspective, there's a build-versus-buy discussion that goes on all the time, and there's an appeal to just buying off-the-shelf things for anything that isn't necessarily related to your core business function, right?
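The "heavy" session caching described above can be pictured as a key-value store that holds a whole session object, not just a token. A minimal sketch of that idea — the names and TTL here are illustrative, not Walmart's actual API:

```python
import time

class SessionCache:
    """In-memory key-value cache holding a full session object
    (not just a token), with a simple time-to-live."""

    def __init__(self, ttl_seconds=1800):
        self.ttl = ttl_seconds
        self._store = {}  # session_id -> (expires_at, session_data)

    def put(self, session_id, session_data):
        self._store[session_id] = (time.time() + self.ttl, session_data)

    def get(self, session_id):
        entry = self._store.get(session_id)
        if entry is None:
            return None
        expires_at, data = entry
        if time.time() > expires_at:  # expired: drop the entry, report a miss
            del self._store[session_id]
            return None
        return data

cache = SessionCache(ttl_seconds=1800)
cache.put("sess-abc123", {"user": "jdoe", "cart": ["sku-1", "sku-2"]})
```

The point of caching the whole session, rather than a token that keys into another store, is that every request can be served from one lookup — which is also why the cache's own performance and availability mattered so much here.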

00:10:46

The problem we have regularly at Walmart is that it never really fits what we need it to do. There's typically a lack of ability to customize these off-the-shelf products to really fit our needs. But more importantly, scale becomes an issue: a lot of the off-the-shelf products we use run into scale issues once you get them into Walmart's environment, because it's so big and so massive. And that happened with this application. They didn't discover these issues until later in the year. They started in February, and the first release was scheduled for February the next year. It was about November when they really recognized that this caching solution was not going to work for the business commitment we had made. So it was becoming a hair-on-fire kind of situation.

00:11:41

So they approached our engineering department looking for help — and here's the sad thing. They were looking at other solutions, like software-based solutions running on commodity servers, that type of thing. But our engineering and architecture board said: yeah, we'll look at some solutions, we'll trial them, vet them, get them qualified for production, and have them ready for you in about eight to ten months. They had about three months till go-live, right? So they didn't like hearing that. They were basically in a pickle, which is where we came in. Now, they didn't come to us looking for help — remember, we're the old mainframe guys; they're not going to come talk to us. But we overheard. We worked with the other engineers, and we heard about their problem.

00:12:30

And coincidentally, Randy Frerking — my buddy that I worked with — and I had been thinking about exactly this. There was a whole lot of talk in the enterprise about cloud at the time: private cloud, public cloud, what does it mean for my business, all that. And we were asking the same questions about the things we manage. What does it mean for us? In particular, what does it mean for the mainframe platform — is there a place for it to participate in this model? Just that October we had presented some ideas about how the mainframe footprint in an enterprise could be utilized in a better, different way that assimilated more easily into this new cloud model: forgetting about infrastructure, focusing more on platform and more fully realized services — and of course on-prem <laugh>, using it for private things, tying into existing backend assets, and focusing on services and APIs. So we caught wind of this business problem, this application problem, and we also had these ideas we wanted to try out and see if they were viable. So we took it as an opportunity to see if we could address their needs in a new way.

00:13:54

And we didn't ask to go do anything — we just did it in our spare time, totally skunkworks. We worked on it nights and weekends, that kind of thing, and went about building a caching service on the mainframe. We knew going in that we wanted to minimize any burden on the developer, as far as possible. That went as far as the design of the service itself — we did some things to help with that. For example, in the service we were designing and building, we mimicked the API of the existing appliance solution so that, ideally, it would be a drop-in for the developers: they just make a configuration change to their host name, that type of thing, and they can start using ours in place of the old one.

00:14:47

There were some other things we did. We even violated the HTTP spec by allowing them to POST on top of existing records, doing the error handling for them, and things like that. Again, we just wanted to make it easy. And it wasn't completely selfless — like I said, they didn't come to us for help for a reason, and we suspected they might not want help from us. So there needed to be as little friction and as few barriers to entry as possible if they were even going to consider it. Also, look: the functional requirements of a caching service are pretty simple. Grab some ones and zeros, stick them somewhere, give them back when you're asked for them. What really was the problem in their case — and as a whole — was the non-functional requirements: performance, scalability, availability, ease of use. So that was where we focused, and we leveraged some of the strengths of the mainframe platform as well as the cloud computing model to achieve those things.
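The HTTP-spec "violation" Rich describes — accepting a POST on top of an existing record instead of rejecting it — amounts to upsert semantics with the error handling done server-side. A minimal sketch of the idea (the handler and status codes here are illustrative, not the actual Walmart code):

```python
def handle_post(store, key, value):
    """Upsert: a strictly spec-following service might answer a POST to an
    existing resource with 409 Conflict; here we silently overwrite, so
    client code never has to branch on create-vs-update."""
    created = key not in store
    store[key] = value
    return 201 if created else 200  # created vs. overwritten

store = {}
first = handle_post(store, "session:42", {"step": 1})   # create
second = handle_post(store, "session:42", {"step": 2})  # overwrite, no 409
```

Bending the spec this way was a deliberate adoption decision: the fewer cases the consuming developers had to handle, the lower the barrier to trying the service at all.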

00:15:57

So —

00:15:58

We ran it on this stack, which Rosalind will help me describe.

00:16:03

Yeah. So Z is built for the reliability and scalability of the system. And since they already had a sysplex environment — which allowed them to scale and to be up all the time — they wanted to take advantage of that. Now, in this environment you run z/OS. Well, z/OS is just another operating system. Z hardware can actually run three different operating systems. There's z/OS, the standard multipurpose operating system for high availability, which they and lots of others are using. It also runs Linux — so if you want to run your Linux apps there, that's just fine. And it runs TPF, the Transaction Processing Facility, which is just another operating system, highly optimized for transaction performance for people like airlines. So z/OS is just an operating system, and that's the one they chose. On top of that, you needed an app server.

00:16:55

You needed the equivalent of a Tomcat — not that I really want to say that, but okay. CICS — sometimes pronounced "kicks" — is our transaction processing system. It's the high-performance transaction processing system in this environment, and it allows you to run your apps and do what you need to do. It even has Liberty sitting in the middle of it, so you can run Java in CICS if you want to. They also needed a file access method, and in z/OS we actually have a whole bunch of them. One that's been around since the beginning of time is VSAM — it's just a way of accessing data. In this case they used the key-sequenced flavor of VSAM, the KSDS. Well, that's what you need for a cache — it's just a key store, right?

00:17:47

So that was available, and that's what they used. So they had the system available — and in this particular case, I think it's actually just sitting in memory, not really written to disk. And then come the languages. What languages did they choose? Now, z/OS can run Java, no problem, but you write in the language you're familiar with, based on what you want to write in. In this case, Randy likes assembler. So it's in assembler — and that assembler also provides the high performance they really needed here. The I/O routines were written in assembler, because that's the fastest you can be, and the other parts were written in COBOL, because those were the parts that might need more maintenance — and any good developer can write COBOL. It is true: any good developer can write COBOL. It's just English; it really is just English. So that's how it was done: they could keep it easy, get the speed they wanted, and make it efficient.
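The storage piece Rosalind describes — a VSAM key-sequenced data set — behaves like an ordered key-value store: records are addressable by key and can also be read sequentially in key order. A toy model of those semantics (in Python, purely for illustration; the real implementation is assembler and COBOL under CICS):

```python
import bisect

class KeyedStore:
    """Toy model of a key-sequenced data set: write/read by key,
    plus a sequential browse in key order from a starting key."""

    def __init__(self):
        self._keys = []   # kept sorted, playing the role of the KSDS index
        self._recs = {}

    def write(self, key, record):
        if key not in self._recs:
            bisect.insort(self._keys, key)
        self._recs[key] = record  # rewrite in place if the key exists

    def read(self, key):
        return self._recs.get(key)

    def browse(self, start_key=""):
        """Yield (key, record) pairs in key order, starting at start_key."""
        i = bisect.bisect_left(self._keys, start_key)
        for k in self._keys[i:]:
            yield k, self._recs[k]

ds = KeyedStore()
for k, v in [("CART003", "gadget"), ("CART001", "widget"), ("CART002", "gizmo")]:
    ds.write(k, v)
```

For a pure cache only the keyed read/write path matters, which is why a plain key store was "what you need for a cache."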

00:19:01

Yep. And like I said, we did this completely on our own time, nights and weekends. We spent about three weeks and had a pretty solid product — about 4,600 lines of code — and we were comfortable enough at that point to approach the development team and offer this as an option. So we got hold of the lead developers on the team, and they saw us coming, right? <laugh> "You're mainframe guys. Is this a mainframe product?" "Yeah." "Nope, we don't want anything to do with it." They had a lot of excuses: the mainframe's a single point of failure, it can't scale, it can't perform. And I was a little disappointed, just from the impression, because those claims were all false. I mean, you can make a mainframe that way through poor practices if you want to, but that's not inherently the case. <laugh>

00:20:05

Okay?

00:20:06

We were also a little disappointed that they still shot us down, because we had a drop-in replacement for them. It was basically prejudice — well, not just prejudice; it was preconceived notions that were incorrect, basically. But it wasn't soul-crushing. We went into this expecting some resistance. So we were a little disappointed, but we weren't giving up. In fact, we had an existing relationship with their management team, so we went and talked to them next and explained: it's a hands-off managed service, go use it, we've got it for you — but your devs don't want to use it. And he took the action item to go talk to the team and get things straightened out, and explained to them: you don't really have an option right now.

00:20:58

We're on the line to the business. This thing's failing every time we put load on it. You don't have any other options. Let's try it. If any of these criticisms you have come to light, or if you can break it with your load, okay, we'll keep looking; otherwise, let's move forward and use it. So the developer came back and begrudgingly agreed to at least start testing the solution. The first thing he asked us was whether it could handle a hundred transactions per second. We had to think about that — and then we had a real good laugh about it <laugh>, for a couple of reasons. We had already run our own tests and pushed it up to about 5,000 TPS, so we knew a hundred wasn't going to be a problem. But also, we know the environment and we know the tools.

00:21:49

I mean, in CICS alone we process between 550 and 600 million transactions every day. If you average that out — on aggregate, admittedly — that's well over 6,000 per second. We handle volume; it's not something we're concerned about. So they kept asking if they could throw more load at it, and they did — 500, then 2,000, then about 4,000 TPS. They were satisfied — I wouldn't say happy, but satisfied enough — to agree to move their project forward using this service. And we saved the day. They ended up going live on time and met the business requirements. It was a huge success for the apps team — they didn't get egg on their face — and it was a huge success for us too, to contribute to something of that magnitude and that level of importance for the business.

00:22:50

And, funny enough, that team is now one of our biggest advocates. From one perspective, it's the service itself — they love the service and what it can do. But they're also an advocate of us as a team, as an infrastructure team they can work with, one that will work with them to get their problems solved. So that was a big win from a relationship perspective. A couple of side items: that service instance is still in production today. They don't really hit the volume we tested for, but so far it's processed about 21 billion requests on that initial service instance and hasn't gone down a single time. Zero downtime.

00:23:40

Let, let's be really clear, that's zero downtime, not planned downtime, not any downtime. It's been up through system upgrade, through software upgrade. It doesn't matter the way it's set up in the, in the cis plex, it can stay up, period.

00:23:58

Yep. Yeah. <laugh> A fun fact: you've heard Rosalind mention Z. Back in 2000, the IBM mainframe got re-architected to 64-bit and acquired new branding, z Systems. The "z" in z Systems stands for zero downtime, if you didn't know that. Mm-hmm, <affirmative> — and we actually achieved that here. And some cool stuff started happening after that. We had a big win, and like I said, they became an advocate and started talking to other developer teams, and we started getting people coming to us. Now, reflecting back on that cloud model: we didn't do it for that initial customer, but by the next customer we were already starting to build the self-service capability, so that we could get ourselves out of the process entirely.

00:24:53

If you want to come use it, there's a webpage you go to, you click a button, and in about 10 seconds you've got a cache service you can start using. So we immediately got to work on that and started bringing additional applications onto the service. Another cool thing that helped adoption is that these folks never see a green screen. This is an endpoint — that's all they ever see. They don't log onto a mainframe; they do it through a webpage and then start using it. As a matter of fact, most of our consumers are actually from the distributed space, and some of them don't even know it runs on a mainframe. So that's been cool. We've got probably 30 applications on that particular service right now. And the really cool thing, from my perspective, is that this is now my day job.

00:25:45

Instead of working nights and weekends to develop solutions, Randy and I now have a team whose job is to develop services that help my development community and increase their productivity. We've developed a suite of services ranging from product-level stuff like the caching service to automating things that created waste in their day-to-day activities. For example, to get a RACF service ID, our developers used to go through a service catalog and put in a request that had a five-day SLA on it. It was just ridiculous, right? Now they go to our portal and have it in less than a second. Little things like that — anything we can do to help our developers become more productive and move quicker is the focus of the services we provide.

00:26:39

Now, I spend quite a bit of time in the mainframe community, engaging with other organizations, speaking, and looking at modernization — trying to bring their practices forward and, probably primarily, their mindset forward. And we get a lot of interest in some of these services we've described and shown. A really cool thing happened last week: <laugh> I got approval to share our code for some of these services with anybody who wants to use it. So we are releasing the code for the caching service. We've also got a persistent object store service with a pretty rich feature set wrapped around it, and a UUID generator where, again, we mimic what the UUID spec looks like but use a different algorithm to guarantee avoidance of collisions and clashes. We're releasing these under the Apache license, and we're going to put them on GitHub — because that's where you put stuff, right? However, I've got to get this done by the end of the month, and I'm trying to think through how to put an EBCDIC-encoded assembler and COBOL code base on GitHub so that people can ingest it. So I've had a little struggle working through this, but —
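The talk doesn't describe the actual algorithm behind their UUID-look-alike generator, but one common way to guarantee collision avoidance without relying on randomness is to combine a unique node identifier with a per-node counter. A sketch of that idea — the node-id width and bit layout here are my assumptions, not Walmart's:

```python
import itertools

class IdGenerator:
    """UUID-shaped identifiers built deterministically: the top 16 bits
    carry a node id and the rest a per-node counter, so two generators
    with distinct node ids can never collide, and one generator never
    repeats itself."""

    def __init__(self, node_id):
        self.node_id = node_id & 0xFFFF
        self._counter = itertools.count()

    def next_id(self):
        n = next(self._counter) & ((1 << 112) - 1)
        raw = (self.node_id << 112) | n  # 128 bits total
        h = f"{raw:032x}"
        # Render in the UUID spec's 8-4-4-4-12 text form
        return f"{h[:8]}-{h[8:12]}-{h[12:16]}-{h[16:20]}-{h[20:]}"

gen = IdGenerator(node_id=7)
a = gen.next_id()
b = gen.next_id()
```

Note that an ID shaped like this mimics only the *format* of RFC 4122 UUIDs, much as the talk describes — it doesn't set the spec's version and variant bits.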

00:28:14

We're solving that problem. Yeah. <laugh> So, I'm glad you all realize what this means, making these services available. We're trying to make sure we have a z/OS community in this environment. And Git is a problem, because Git doesn't support EBCDIC. So we are working with Rocket Software, and they will make a version of Git available this month that supports code page tagging. When you run Git on z/OS, it will do the correct encoding and extract the files in EBCDIC; but everywhere else, where you live in an ASCII or UTF-8 world, the files stay in ASCII or UTF-8. It's all under the covers, it's all built in, they've got it running, and I'm playing with it. We're going to make sure they get it right, so that when they post it to GitHub it will have the right attribute tagging and it will be available.
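For readers hitting the same problem today: mainline Git later gained a `working-tree-encoding` attribute (Git 2.18+) that performs exactly this kind of code-page conversion via `.gitattributes`. A sketch, assuming conventional extensions for COBOL, assembler, and JCL sources and the usual z/OS EBCDIC code page:

```
# .gitattributes — illustrative; the extensions and code page are assumptions
*.cbl  working-tree-encoding=IBM-1047
*.asm  working-tree-encoding=IBM-1047
*.jcl  working-tree-encoding=IBM-1047
```

With attributes like these, files check out as EBCDIC in the working tree on z/OS but are stored in the repository (and shown on GitHub) as UTF-8.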

00:29:08

So you can now use Git for your z/OS source code. Now, there's still some trouble — we've got trouble doing builds, but they're going to provide JCL to get the builds done. And we at IBM want to work with customers as sponsor users to actually build a dependency-based build system sitting on top of Git, so you can use Jenkins or whatever build system you want to build your z/OS code. Let's keep z/OS development just like everyone else's. There's no reason it needs to be different; there's no reason it needs to be segregated. And as you can see, you built the service in three weeks-ish. I'd say that's pretty fast and efficient.

00:29:57

I think so. Mm-hmm, <affirmative>. So, thank you. With that, some parting thoughts. This week we've been asking what you can help us do. I guess, first of all: don't be that guy. Please. And on the flip side, help us not be that guy as well. I fully recognize that the reputation a lot of mainframers get is earned <laugh> — I deal with that community a lot. But there are also a lot of folks in the community who want to push boundaries, raise practices, get more progressive, and utilize the tool where it fits and where it can provide value. So I'd ask you to reach out to those folks in your shop who are open, and help them get engaged. They want to be there. With that, thank you so much for the time.

00:31:10

And before we forget: if you're interested in joining the community, they actually have a community set up. If you send one of us an email, there's a way to get attached to the Z community. They've got a Slack

00:31:22

channel. Yeah, we've got a Slack team set up. We're still filling all that out, still building it, but we're adding people to it now — we've got a number of folks there already. And, Gene, thank you so much for the opportunity. Yeah, thank you. Thank you, Rich. <laugh>