Las Vegas 2020

How A Hotel Company Ran $30B of Revenue In Containers

App modernization with containers for the rest of us.

Dwayne Holmes

Vice President of Converged Applications, Large Bank

Transcript

00:00:06

Thank you, Adam and Lauren. So I have been a big fan of this next speaker. For over two years, people kept coming to me and saying, you've gotta meet this guy because he's doing things with containers that will blow your mind. And it was so true. I learned that, among many other things, he was containerizing all of the revenue-generating systems at a top hotel company that were collectively supporting over $30 billion of annual revenue. Dwayne Holmes did this work as a senior director of DevSecOps and Enterprise Platforms. For years, I wanted him to give a talk about what he was doing here at this conference, but due to a variety of reasons, we were never able to make that happen. But thanks to him moving to a new company, I'm so delighted that he's finally able to share his story, which, among many other things, earned him the title of Google Cloud Certified Fellow, having built and managed one of the world's largest Kubernetes installations. So I'm even happier that he is now joining our longtime friend John Rodowski at PNC Bank as their VP of converged applications and cloud. Please welcome Dwayne Holmes.

00:01:25

Thanks, Gene, so much. I really appreciate the opportunity to be here. I have definitely been following your work, like everyone, and I appreciate the awesome words. So my name is Dwayne Holmes and I'm talking about my DevOps journey. And to protect the innocent in this presentation, I will be vague. All the way up until last month, I worked for one of the largest hotel companies in the world. In 2019, they had revenues of over $20 billion. They have 170,000 employees. They're over 90 years old, with over 7,400 locations, or hotels, and 1.4 million rooms available. In 2016, they had a large merger approved. In 2018, that merger integration was completed, and in 2019, we rolled out a massive program for our customers. So a little about the things that we accomplished during that time. All the way up until last month, I ran a team that supported over 3,000 developers across multiple service providers.

00:02:41

Our model was that we had few FTEs but lots of service providers when it came to development. So development was done when there was a project that was green-lit by corporate. In 2016, microservices and containers were actually running in production. In 2017, over $1 billion was processed in containers. I didn't say microservices, because we had microservices and also micro-monoliths that were running in containers. At the time, 90% of all new applications coming out of development were in containers, and Kubernetes was actually running in production in 2017. In 2018, we were one of the top five largest production Kubernetes clusters by revenue, according to Red Hat and JFrog. And by 2020, when I left, we did thousands of builds and deployments per day, and we ended up having two Google Cloud Certified Fellows. And we had experience running Kubernetes in five cloud providers. And the one that most people won't guess is Alibaba.

00:03:54

So in order to know where you're going, you need to know where you've been. So about me: in 2012, I worked for a financial company where over 95% of infrastructure was outsourced. Out of 500 employees, only five were retained. Developers thought that if they outsourced all of infrastructure to a provider, the provider would do architecture, engineering, and operations. Well, as everyone knows, the same amount of work was still required. We had no engineers or architects, but we had a large outsourced-provider contingent that was able to help us. So I learned three principles when I was at this financial company. The first one: the CIO always talked about dial tone. I'm also a closet economist, so I believe in Adam Smith's division of labor and trade specialization. And finally, I believe in automating everything. So what is the dial tone principle?

00:04:59

I always tell this story anytime I get a new team. And it's basically that no one cares about any of the technology that you use to implement a phone. It could be Cisco or Avaya or whatever. It just better work. When the business picks up the phone, they expect a dial tone. Anything less makes the business upset. And this, to me, means that you focus on what is important. So this led me to Ruby on Rails. Most people, especially my mom, know what Etsy and Hulu and Square and Instacart and Airbnb and Twitter and Twitch do. However, they don't know the development platform that all of these started on, which was Ruby on Rails. And the reason why I love Ruby on Rails is doctrine two, where it talks about convention over configuration. And that basically means that we should not focus on things that do not accelerate business value; we should focus on things that do accelerate business value.

00:06:06

So because of that, a lot of decisions are already made before you even use the framework. The other thing: because I'm a closet economist, I love Adam Smith's The Wealth of Nations. I firmly believe in division of labor and also trade specialization. And there's a story called the lawyer versus the secretary. Suppose you have a lawyer, and this lawyer can type faster, file faster, and use a computer faster than the secretary. However, would the attorney choose to be a secretary or choose to be an attorney? Of course, they would choose to be the attorney. Every hour that an attorney spends doing secretarial work is an hour that they can't be a lawyer. So because of that, you bring on a secretary in order to maximize your productivity. And that's how I feel about my DevOps or release engineering teams: developers are like the attorney. Their job is to put out amazing code, and our job is to support them.

00:07:20

So every hour that a developer is focusing on things that don't provide business value is an hour that they're not providing that value. The other thing that I love is: when all else fails, automate. There are only two ways that you can increase productivity. The first way is automation, and the second way is increasing resources. The issue is that, in my career, resources have always been scarce. So because of that, in order to increase productivity, I've always had to fall back on automation to do things. So how does this all fit in? Well, in late 2015, I was a vice president at a financial company. And if you look at the top left corner, that was the office I was in. I had a corner desk that overlooked the harbor and the city. On top of that, because they got rid of most of the infrastructure,

00:08:25

I had amazing career stability, and everything was great. However, one day I go to a meetup, and this meetup fills my head with crazy ideas about containers. See, I went to this meetup to learn more about Ruby on Rails, because I was doing some work for my mom and I was doing lots of development. However, when I heard about containers, it satisfied three things. Dial tone: containers abstract infrastructure. Specialization: operations could create containers that devs could use over and over and over again. And automation: I can build containers over and over again and everything will just work, which is awesome. So I knew I needed to make a change. I found out that this hotel company was willing to go all in on containers. The issue is that it was probably bad for my career, at least.

00:09:28

So I thought, and everyone told me that. I went from a VP to a contractor. I went from amazing stability to no stability at all, because, one, I was a contractor, and two, this was an experimental project, and if it didn't work, they could cut the project. And not only that, but instead of having amazing views of the city and the harbor, I was sitting at a table, not a desk, and I was in a room with no windows. So as a result, I didn't know if I had made the right decision. I would call and talk over and over again with people about whether or not this was the right thing to do, and most people said I was a fool. However, the thing that allowed me to stay on was an amazing team. They formed a great cross-functional team of people who had amazing talents.

00:10:26

We had three developers and three infrastructure people. And I love giving nicknames to people. So we had our fearless leader, who rallied the troops. We had the genius, who was the superstar developer. We had the professor, who knew everything about everything. He was a developer and an infrastructure person, and he's the one who actually suggested containers. And we had Superman, who had unbelievable energy, and he was our doer. And for me, I didn't give myself a nickname. I was just glad to help. So the goal of this team was essentially about evolution versus revolution. The goal was that we would take something and totally change the way the enterprise worked through this cross-functional team. And I learned lots of things. One of the things I learned, especially early on, is that environments, especially lower-level environments, should be production-like.

00:11:29

And the reason why is that we were a high-performing DevOps team, but unfortunately, we didn't make any money, so the only performance slot that we could get was from 12 midnight to 5:00 AM. Legacy teams had all the best slots, and we actually got the worst slot. The other great thing about having this amazing team was that, at the time, we couldn't Google anything about containers. We had to actually create our own orchestration engine in order to orchestrate containers on multiple VMs. And so we were able to prop each other up and bring ourselves along. As a result, we started creating frameworks and libraries that were based on containers. And we really thought about how we could secure these and how we could deploy these over and over again. And so everything was kind of like a pyramid, where we built on things so that things would go faster.

00:12:33

The other thing I learned was about the greatest microservices on the planet. I'm a Linux guy, and the way we thought about containers is that you have the command, which is a container; then your command-line options, which are environment variables; and then anytime you do a pipe, think of that as a sidecar. And so, as a result, Linux commands make the best microservices, because you can take a command and, based on command-line options, change how it works, and then you can pipe its output and change it even more by adding extra commands. So this is how we built a lot of our containers. And the result was that we came up with a framework for how we could deploy containers on multiple servers in multiple ways. So a lot of people believe that you have to be in Kubernetes on day one in order to use containers.
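The analogy in this passage (a command is the container, command-line options are environment variables, and a pipe is a sidecar) can be illustrated with a small Python sketch. This is only an illustration of the idea, not the team's framework: the names `grep_service`, `redact_sidecar`, `pipeline`, `PATTERN`, and `IGNORE_CASE` are all hypothetical.

```python
import os
from typing import Callable, Iterable, Iterator, List, Mapping, Optional


def grep_service(lines: Iterable[str],
                 env: Optional[Mapping[str, str]] = None) -> Iterator[str]:
    """The 'command' (the container). Its behaviour is changed only through
    environment variables, the way options change a Linux command."""
    env = os.environ if env is None else env
    pattern = env.get("PATTERN", "")  # hypothetical variable name
    ignore_case = env.get("IGNORE_CASE", "false").lower() == "true"
    needle = pattern.lower() if ignore_case else pattern
    for line in lines:
        haystack = line.lower() if ignore_case else line
        if needle in haystack:
            yield line


def redact_sidecar(lines: Iterable[str]) -> Iterator[str]:
    """The 'pipe' (the sidecar). It post-processes the main command's output
    without the main command knowing it exists."""
    for line in lines:
        yield line.replace("secret", "******")


def pipeline(lines: Iterable[str],
             *stages: Callable[[Iterable[str]], Iterable[str]]) -> List[str]:
    """Chain stages left to right, like `cmd | sidecar` in a shell."""
    for stage in stages:
        lines = stage(lines)
    return list(lines)
```

For example, `pipeline(logs, lambda ls: grep_service(ls, {"PATTERN": "get", "IGNORE_CASE": "true"}), redact_sidecar)` filters first and redacts second; swapping the environment mapping changes the behaviour without touching the code, which is the point of the analogy.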

00:13:29

That's not true. The other thing is that people believe containers are immutable. Well, depending on how you design them, they're not, especially if you're using them on a VM. So we really believe that containers are awesome, and that if you focus on container hygiene, in other words, how you build your containers, you can run a container, or anything inside of a container, on a VM. So in the end, especially when you start out, focus on maybe putting containers on VMs, and then you can go to Kubernetes. And the reason why we loved these frameworks, especially the ones that we built, is, first, dial tone: we abstracted where we ran containers. In other words, no one knew where we were running these containers. They just knew there was a URL where they could get a microservice. The other thing is, in terms of specialization, we realized that a small team can service a much larger team.

00:14:30

And then finally, automation: we could build hundreds of times without us getting involved. So after the RAM project was successful, they asked me whether or not I wanted to come on as a full-time employee. And so I was like, sure. So I asked for six things. I asked for all containerized workloads. I asked for developer tools, pipelines, platforms, and base images. And if you think about it, three years on, that is actually considered modern operations. So I was actually asking to run modern operations, but I felt that containers and platforms and base images and pipelines and developer tools were really important for me to do my job in release engineering. So the reality was that, if you look at the classic DevOps issue, you have development throwing code over the DevOps wall and hitting operations.

00:15:36

Well, guess what? I was like code being thrown over the wall into operations. One, the infra SVP didn't believe in the team. Two, I had few allies in operations because most of my time was spent with development. Not only that, but multiple reorgs left me under a different VP who thought the DevOps team, or the release engineering team, was really a QA team. On top of that, the operations team had created another DevOps team, where they were essentially implementing a service catalog with Chef. And I was just thinking to myself, oh my goodness, I'm a team of one. Even though I have all these developer friends, this is deja vu all over again, and I'm by myself. So the issue was that we had two competing platforms, and the question was, what will devs choose? You could have a service catalog, where you have a dropdown menu and you pick compute, memory, or storage, or you could have a platform where you go into Git and hit commit.

00:16:51

And then afterwards Jenkins does some things, creates an artifact, which is a container, and then deploys it to compute. And the key thing is that the service catalog, in my mind, was TMI, too much information, where developers had to know all this compute stuff, and the pipeline was abstraction, which is what I want. So the fortunate thing was that developers chose our pipelines a hundred percent of the time, which is awesome. Because if you think about it in terms of dial tone, infrastructure provided too much information that developers didn't want, whereas with us, they just hit commit to see their code running. Once they do something in Jenkins, we manage Jenkins, containers, and compute, so the developers could be amazing at what they do and do what they love. And then we focused on workflow, not necessarily building servers. So as a result, if you look at everything that we built, this is kind of like our framework, where we had a whole process for how we built our base images, how we pulled in libraries and frameworks, how developers interacted with the pipeline as well as tickets, and how we secured things, whether it was with code quality scans or static code analysis.

00:18:20

And then security was everywhere, with Aqua Security basically scanning containers as well as making sure that when a container was running, it was secure. So once everyone found out that a hundred percent of things were going our team's way, we got even more work. Part of this work was merger integration work, customer product work, and also the refactoring of APIs. And we were also going all in on international expansion with a partner. So this is the phase which I call "go faster." The issue is that even though we had done all these amazing things and we wanted to go ahead and use Kubernetes, the SVP wasn't really sold on Kubernetes. On top of that, because we were a small team and developers were choosing us a hundred percent of the time, we ended up having an unbelievable workload.

00:19:24

And I thought to myself, oh my goodness, did anyone get the memo? So fortunately, or unfortunately, depending on who you are, we found out one day that our multi-billion-dollar website went down because of a release. Everyone's on the call, the SVP is hot, and people are talking about how to roll back release changes. And of course, when these calls happen, it starts out with infrastructure, and then afterwards you bring on more and more people as time goes on. So the SVP and the people on the call, who were mostly infrastructure people, assume it's gonna take hours to roll back the change. Well, we meekly shared our screen. We pushed one button, which is the easy button, and we rolled back the change. And literally, we blew everyone away, because everyone was thinking that it would take hours to roll back the change, and instead it took minutes.

00:20:24

And this is one of my favorite diagrams that I used to put up. I created it in 2007 after attending Google I/O, when I was obsessed with machine learning. And I showed this to the SVP later on. You already know how operations and development people go: infrastructure people are always trying to figure out why developers are doing stuff, and developers are always like, you know, it's the network. So I go to my SVP and I say, well, guess what? I asked for dev tools, and I own most of them. I need some more. But in the end, all these tools can be used to build models to do three things: grade commits, grade developers, and also grade the team, and, based on the things that they do, either reject or approve a release. And then I said, in order to do that, I need a couple of tools.

00:21:30

I need Jira, I need Jenkins, I need Git, I need Artifactory, I need all these various tools. And then we can go ahead and integrate them, and then they can begin to grade these different things that developers do. Well, guess what? That blew my SVP's mind, and he loved it. So as a result, we got the green light to continue to consolidate, and we formed a new infrastructure organization. And this is another one of my favorite slides that I go to. In terms of value, we can talk about customer service, and customer service is what I call hands-on service. It's high touch, high cost, very low value. A lot of times when people think about DevOps organizations, they think of sitting developers and operations people together, and they might be in the same room, and through osmosis, all this amazing stuff happens. To me,

00:22:39

I really believe that the amazing stuff happens when there are clear communications between operations and development. So for example, if we begin to create APIs that developers can use in order to use infrastructure, developers are able to go faster, and as a result, their happiness goes up. One of the people I love listening to is Kelsey Hightower. Kelsey Hightower talks about NoOps, and it's not removing the operations organization; it's forming product teams. These product teams control the end-to-end flow of how to provide a product to development. And as a result, we think about how to productize things. So when I think about the value scale of providing DevOps, I think of customer service, which is low value. Then you go to the platform, which is location agnostic. Not only that, but instead of being an artisan-focused team, you're now focused on process. Then CI/CD, which allows a team to be the enablers of all the people who are experts in their field.

00:23:59

So we can begin to standardize tools and platforms, we can begin to enforce good practices, and it allows people to do specialization. And then finally, base images. Base images contain enterprise standards. They're opinionated, they're automated. But the great thing about them is that they are a contract between development and operations. Operations sometimes gets involved when a development team has a hundred steps to go and they're on step 90, and no one likes being told on step 90 that you have to go back to step one. So our base images are something you can use, and if everything works in conjunction with our pipeline, you can deploy to our platform as fast as possible. So the dial tone is that developers should understand how to use your platform.

00:25:01

And in our case, we used key-value pairs with Git, and this controlled the pipeline as well as Kubernetes, 'cause Kubernetes is hard. We focused on Legos versus hands-on work. The specialization was that the team could focus on innovation. In other words, developers could focus on providing business value while we think about and worry about all the rest of the stuff. And then automation: taking all these pieces together, over and over again, to build something amazing. So as a result, we went from two teams that had separate focuses to one team where we combined all these things: cloud, base images, infrastructure automation, shared services, general programming, platform, and CI/CD, which was awesome. And the results were, again, that we were able to support lots of developers. We were able to provide microservices and containers that were running in production. We processed a lot of money on our containers, and we had Kubernetes in production when most people were just playing with it in a lab.
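The idea of a few key-value pairs in Git driving both the pipeline and Kubernetes might look something like the following Python sketch. The talk does not describe the actual file format or keys, so `APP_NAME`, `REPLICAS`, `PORT`, and the helper names here are assumptions made purely for illustration; only the shape of the `apps/v1` Deployment manifest is standard Kubernetes.

```python
from typing import Any, Dict


def parse_pairs(text: str) -> Dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and '#' comments."""
    pairs: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected KEY=value, got: {raw!r}")
        pairs[key.strip()] = value.strip()
    return pairs


def to_deployment(pairs: Dict[str, str], image: str) -> Dict[str, Any]:
    """Expand a handful of developer-facing keys (hypothetical names) into a
    Kubernetes Deployment manifest, so developers never write one by hand."""
    name = pairs["APP_NAME"]
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": int(pairs.get("REPLICAS", "1")),
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "ports": [{"containerPort": int(pairs.get("PORT", "8080"))}],
                    }]
                },
            },
        },
    }
```

A pipeline step could read such a file from the repository on each commit, call `to_deployment` with the image it has just built, and serialize the result for the cluster. The developer only ever edits the key-value file.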

00:26:06

And as a result, we were able to do thousands of builds a day, and we were able to do what most businesses try to do, and that is be multi-cloud. Most people can't get one cloud provider right. We actually got five cloud providers right. So if I were to give people advice: one, I'd say take calculated risks. In other words, I didn't know whether or not my foray into being a contractor at the time would be good. I thought it was actually the worst mistake I ever made. But I went ahead and did it because I firmly believed in the technology and I thought it would be a paradigm shift. Not only that, form teams of like-minded individuals. When you feel down and you feel like you're fighting against everyone, you can actually look to the person right next to you and feel better.

00:27:10

Not only that, but digital transformation, unfortunately, is politics. I was offered a VP role in 2017 and I didn't take it, because I wanted to be hands-on-keyboard and I didn't like politics. The issue is that sometimes you have to take a promotion if you need to control your own destiny. And for me, I love technology, but I really thought I did a disservice to my team. Finally, start slow. You don't need to run Kubernetes today, but most workloads can go inside of a container. A lot of people jump to microservices and doing all this amazing stuff. It's okay to take baby steps. We took baby steps a long time ago, and that was the reason why we were able to be where we are now. So then, finally, dial tone: that means abstract. Specialize: have a team obsessed with release engineering. And finally, automate, automate, automate.

00:28:16

Resources are normally scarce, and so the only way to overcome that is to automate. And this is the help I'm looking for. Everyone needs to convince Gene to allow me to do a container, Kubernetes, and CI/CD pipeline deep dive, because in the end, when I was at the hotel company, we were on our Gen 4 containers. They were cloud portable, they were scalable, and health checks were built in. We had tests for latency and for CPU, and certs were no longer in the application or managed by developers. On top of that, we focused on circuit breaking, and we had APM built in, and zero trust. And finally, our images were very small. Again, that's all about container hygiene, and our sidecars were used to enhance everything. The other thing that I'm really passionate about is pipeline security.

00:29:15

In other words, how do you secure an end-to-end CI/CD pipeline? So have plugins in the IDE that give feedback all the time to developers. Have a remote repository for libraries and frameworks; most people use Artifactory or Nexus to do that. Have container scanning over and over again; Aqua Security is great for that. And then when you talk about even running, or day-two operations of, containers, now you have to think about: how do I secure the host? How do I whitelist commands? And how do I do forensics, because containers are ephemeral and they're dying all the time? The other thing I would talk about is that it's very important to use environment variables, because most of the time people do ENV=prod or ENV=qa or ENV=dev, and the issue is that tree folder structures are horrible to maintain. The other thing is that configuration should go through a separate pipeline from your artifact. And finally, environment variables should be used. So if you can convince Gene to allow me to do that type of deep dive, that would be amazing. And in the end, thank you so much for your time.
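The point about environment variables versus per-environment folder trees can be sketched as follows. This is a minimal illustration under assumed names (`APP_ENV`, `DATABASE_URL`, `LOG_LEVEL`, `Settings`, `load_settings` are all hypothetical), not the configuration scheme used at the hotel company: the same artifact runs everywhere, and only the variables injected at deploy time differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    environment: str
    database_url: str
    log_level: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build settings from environment variables only. There is no
    prod/, qa/, or dev/ folder to keep in sync; the artifact is identical
    in every environment and the deploy step supplies the values."""
    env = os.environ if env is None else env
    required = ("APP_ENV", "DATABASE_URL")  # hypothetical variable names
    missing = [key for key in required if key not in env]
    if missing:
        # Fail fast at startup rather than at first use.
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return Settings(
        environment=env["APP_ENV"],
        database_url=env["DATABASE_URL"],
        log_level=env.get("LOG_LEVEL", "INFO"),
    )
```

Because the values arrive from outside the image, they can be versioned and promoted through their own pipeline, separate from the artifact, which matches the separation described in the talk.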
