Las Vegas 2020

Cloud - Banking on the Future

KeyBank started its DevOps journey in 2015 with a small group designing and implementing container workloads, continuous integration, and automated testing. This work enabled the team to deliver on commitments to a new user experience and to release more frequently and more confidently. Over the course of the past several years, the group grew to include observability and monitoring, while also adding responsibilities for overall technology planning. By building in core observability and monitoring capabilities, we were able to provide a vital feedback loop to the development community. Pulling in metrics and telemetry and aggregating that data gives our teams the ability to make data-driven decisions. Once the group had gotten through many of the “day 2” problems and scaling hurdles, we began to tackle cloud workloads.


Historically, we had been very “traditional,” running our workloads within our own data center. However, we were given the opportunity to scale our practices and move some of our core processes out into the cloud. Taking what we had learned throughout the years, we built an end-to-end approach using infrastructure as code (IaC), agile planning, CI/CD, and automation. We chose to move away from traditional planning, change management, and siloed team development. This approach allowed us, within six months and with a team of only five, to build everything we needed to have our first production workloads go live in mid-January of 2020. We have many more “go-lives” planned throughout the rest of 2020, and the principles and practices we’ve used will allow them to be “non-events.” In this session, you will learn how the KeyBank team used core DevOps practices as accelerators to drive a successful first of many cloud “go-lives.”


Chris McFee

Director of DevOps Practices, KeyBank


Mick Miller

Senior Product Manager, Cloud Native, KeyBank

Transcript

00:00:13

Hello, and welcome to our presentation, Banking on the Future. I'm Chris McFee, the senior vice president of our enterprise DevOps practices group. Within my organization, there are six different teams practicing and moving our DevOps practices forward. We have teams to guide and assist with continuous integration and continuous delivery pipelines, server build and server lifecycle activities, service catalog and self-service development, cloud, and cloud native technologies. Today I'll be presenting with one of my leaders, Mick Miller, the senior product manager for our cloud native initiatives. First, a little bit about KeyBank. Our roots trace back nearly 190 years to Albany, New York. Today, KeyBank is based in Cleveland, Ohio, and we're one of the nation's largest bank-based financial services companies, and we're continuing to grow. We have retail branch footprints in 15 states, including Ohio, New York, Washington, and Oregon, a network of over 40,000 ATMs, and over 17,000 employees, and our financial services range from personal banking and small business to commercial and corporate banking.

00:01:28

Where did our DevOps journey start? It started back in November of 2015, after coming back from the DevOps Enterprise Summit. We were invigorated and excited to hear everybody's experiences and to see what the art of the possible was. We all purchased The Phoenix Project and began to plan what we were going to tackle. There was no bigger bang we could have made than choosing our online banking experience. Focusing first on the architecture, we knew we wanted a new, modern, tiered stack that broke up our user experience layers and our different service layers. We also had to reinvent how we delivered our infrastructure: instead of focusing on virtual machines or bare metal servers, we were now focusing on containers and their orchestration. We knew that the way to accelerate was to embrace open source software and technologies, frameworks like AngularJS and Spring Boot.

00:02:25

We also needed shared, enterprise-class capabilities that offerings like Kubernetes would bring to the table. Our user experience would change to be widget focused, which would allow us to make small, incremental changes to that user experience and get feedback directly from our clients. We introduced real-time feedback loops and incorporated them into all of our development and design efforts. Our people are the backbone of how we operate. Like many of you, we implemented new and flexible teams by co-locating them with our business partners and increasing the amount of feedback they were getting. We've been strong advocates of mentoring, pairing, and participating in rotation programs to give awareness and visibility to the ways that we work. And speaking of the ways that we work, we moved away from waterfall to agile-based delivery practices. There has been an infusion of automation across our testing practices, and we've been focusing on smaller and more frequent releases.

00:03:30

That was a super quick refresher of the last four years. So what have we been doing since? We started 2020 primed for expansion into the cloud. We had recently acquired Laurel Road, a digital lending business that included platforms for student loan refinancing, mortgage applications, and payments, focused on medical professionals. These platforms were developed as cloud enabled, and they would accelerate our cloud strategy and increase our cloud footprint. They also needed a quick and easy paved road to integrate with our applications and our systems. Additionally, in 2019 we examined natural language processing, chatbots, and virtual assistant skills internally. We believed these could be central to a new set of capabilities that would make our customer experience delightful. We had implemented an event supply chain to bring in data across our applications, our systems, and our endpoints, like ATMs and our laptops, to strengthen our feedback loops.

00:04:35

We'd done a lot of work on the front door, and we knew there were opportunities in our middle and back office processes and applications. In 2019, we had over 30 programs using agile delivery, and we were looking to double that in 2020. We also had plans to create collaborative, safe learning environments to bring folks through teaching and learning new processes and technology, and to infuse DevOps culture. Additionally, we had interns and rotational analysts, a program here at KeyBank where newly hired employees have three six-month rotations on different teams, and that was also planned to start in 2020. Those programs assist in enabling upskilling and reskilling and in building a pipeline of talented individuals. What else was on tap for 2020? We were looking to grow our strategic partnership with Google while we worked with them on testing their Google Kubernetes Engine On-Prem.

00:05:33

And that would allow us to open the door for hybrid cloud workloads. Data is the core of many businesses, ours included, and we had plans to execute on a migration of our data warehouses into Google Cloud Platform. The foundation of moving workloads into GCP was scaling our infrastructure as code practices, and we definitely wanted to automate all of the things. Another strategic initiative would be modernizing our contact center capabilities. We believed capabilities like natural language processing and chatbots, and other items that I mentioned previously, would be a great set of new offerings and add delight for our customers. Finally, tackling the safe learning environments I previously mentioned, we began to put a plan and frameworks into place to help support our own KeyBank dojos.

00:06:27

We all wanted to start off 2020 strong, and it was shaping up to be a tremendous strategic year. And then the pandemic reared its head, and we had to reprioritize our efforts as many of our partners ramped down due to shelter-in-place orders. Our data warehouse migrations to the cloud were reprioritized to focus on providing analytical workspaces in the cloud, and moving those analytical workspaces to the cloud would offer significant computing power for things like our fraud detection systems. Due to the pandemic, we witnessed a considerable increase in calls to our contact centers, and also due to shelter-in-place orders, long wait times impacted our customers calling into those contact centers. We had to implement many cloud-hosted chatbots to reduce those calls, using natural language processing to answer frequently asked questions or provide self-service to our customers. Early in the pandemic, the government announced the Paycheck Protection Program, also known as PPP. PPP was a loan program through the Small Business Administration designed to provide incentives for businesses to keep their staff on payroll.

00:07:35

As many states were sheltering in place, through the use of the DevOps practices we already had, we went from ideation to deployment in less than 10 days. Additionally, we were able to react to customer experience quickly and deploy enhancements. We also automated many of those middle and back office processes to help support our PPP efforts, using those same DevOps practices. Finally, to support our Laurel Road line of business, we continued development efforts and transparently changed their underlying cloud platform, and this will allow us to serve our medical professionals better this year and beyond. We were able to pivot quickly and tackle many initiatives and programs that we never thought we would have needed to. It speaks to our underlying design principles and how we've moved to the cloud. And with that, I'll turn it over to Mick to walk through many of those design principles.

00:08:33

Thanks, Chris. I'm Mick Miller, the lead of our cloud native team. Today I want to talk to you about how we used DevOps practices and some of our core principles to move KeyBank to the cloud. Here's what I want to talk about today: KeyBank's approach to large initiatives. The first thing we do is define a set of goals, and I want to take you through those goals for this project. Then we define the strategies, the architectural principles, and any tools we've adopted in order to be successful along the way. And finally, I want to talk about some of the challenges we've encountered and how we've approached them. I joined the cloud team in May of 2019, and the first thing I did was to reformulate the team with a specific set of goals and strategies that were aligned with the successes we were trying to realize with the team.

00:09:21

Then we chose GCP as our main cloud compute platform, and all the networking and security pieces needed to be in place for us to onboard our partners and applications into the cloud. Our first major initiative was the cloud data warehouse, and this was a big cost save for KeyBank, moving a lot of the compute from on-prem into the cloud. We did that in the October timeframe and worked through January, and that was our first go-live with those data sets. The next thing is, once that data was up there, we wanted to use it, so our analytics team started moving to the cloud, and we're still ongoing with both of these projects as we move more and more data and more and more analytics processes to the cloud. And then around May of 2020, we adopted and began working on replacing our on-prem Kubernetes solution with GKE On-Prem, which was a big move for us.

00:10:10

There are a lot of migrations, and we're deep into that project right now. We landed on five goals. The first was to make sure that we were delivering multi-cloud capabilities: we wanted to leverage the best of Azure, the best of GCP, and the best of our private cloud, and we wanted to provide a single user experience no matter where the processing was running. The second was to make sure that everything we built could be used on our private cloud. The third was to make sure that we automated everything end to end. The fourth major goal was to make sure that we had operationalized everything and handed everything we built off to our run teams, our networking teams, and any of the teams that were supporting these applications, including the application teams themselves. And then finally, we wanted to make sure that we did something along the way to eliminate this notion of unlimited capacity.

00:11:04

And we did that in small steps, in that we started using charge-back and show-back billing. That way we made sure that the folks who were using the cloud were also seeing the actual bills and that their cost centers were actually paying for the compute capabilities. After our goals were defined, we wanted to adopt principle-based engineering approaches. We started with zero trust networking, to make sure anything and everything was secure all the time. After that, we thought about zero trust deployment as a critical piece, to make sure that we didn't have to have operators running our runbooks and that we had intelligence built into our runbooks and into our automation. We also wanted to make sure that whatever we built, we could write it once and run it anywhere. We also knew that we were going to get a lot of work coming at us in terms of onboarding different partners, different groups, and different opportunities.
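
To make the show-back idea concrete, here is a minimal sketch (an illustration only, not KeyBank's actual tooling) that rolls up cloud spend by a cost-center label; the file name and column names are assumptions:

```python
"""Show-back sketch: roll up cloud spend by cost-center label.

Assumes a billing export flattened to a CSV with hypothetical columns
'project_id', 'cost_center' (sourced from a project label), and 'cost_usd'.
"""
import csv
from collections import defaultdict


def showback(billing_csv: str) -> dict[str, float]:
    totals: dict[str, float] = defaultdict(float)
    with open(billing_csv, newline="") as f:
        for row in csv.DictReader(f):
            # Unlabeled spend is called out explicitly so someone chases it down.
            cost_center = row.get("cost_center") or "UNLABELED"
            totals[cost_center] += float(row["cost_usd"])
    return dict(totals)


if __name__ == "__main__":
    for cc, cost in sorted(showback("billing_export.csv").items()):
        print(f"{cc:<20} ${cost:,.2f}")
```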

00:11:58

But we had to have a structure to make sure that we were prioritizing the work in the best way, so we took a value-based approach, with a set of questions and surveys, to make sure that we were adopting the right things in the right order. Then the next thing is core to everything that we do, which is infrastructure as code. This allows us to look at all the code changes we've made and understand who is making changes, and all of our deployments are pulled from our Git repos. The final two things are also really important and core to everything. They are to make sure that we're not just lifting and shifting our on-prem capabilities to the cloud, but that we're really leveraging cloud capabilities, with elastic compute, ephemeral compute, and on-demand computing. That also keeps costs down and makes sure we're successful in our cloud journey.
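
As a sketch of what Git-driven infrastructure as code can look like in a pipeline (an illustration under assumptions, not KeyBank's actual setup), here is a small Python gate a CI job could run after checkout; it relies on Terraform's documented -detailed-exitcode behavior, and the directory layout is hypothetical:

```python
"""IaC drift gate sketch: fail the pipeline when live infrastructure
differs from what is committed in Git.

Assumes Terraform is installed and the checked-out configuration lives in
the given working directory (the 'environments/dev' path is hypothetical).
"""
import subprocess
import sys


def plan_exit_code(workdir: str) -> int:
    subprocess.run(["terraform", "init", "-input=false"], cwd=workdir, check=True)
    # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes pending.
    result = subprocess.run(
        ["terraform", "plan", "-input=false", "-detailed-exitcode"],
        cwd=workdir,
    )
    return result.returncode


if __name__ == "__main__":
    code = plan_exit_code("environments/dev")
    if code == 2:
        print("Pending changes or drift detected; review the plan before merging.")
    sys.exit(0 if code == 0 else 1)
```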

00:12:51

Then finally, we wanted to really make sure that whatever we built was operationalized and we could hand all of that work off to our run teams, so that we weren't the only cloud teams, and so that cloud becomes a core capability at KeyBank across all of our IT teams. In order to be successful, we also had to choose some engineering approaches in terms of design. We wanted to make sure that everything we built was open source first, that it was all event-driven architecture in the background, that all of the components were loosely coupled, that the things we built were highly available and fault tolerant, and finally that everything was independently scalable at any one of the tiers. As we got closer to execution and being able to deliver on the promise of cloud, we wanted to make sure that we established a set of tools across our delivery pipeline.
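
Before getting into the tooling, here is a minimal sketch of the loosely coupled, event-driven pattern described above, assuming Google Cloud Pub/Sub as the event backbone; the project, topic, and event shape are illustrative, not KeyBank's actual design:

```python
"""Event-driven sketch: publish a domain event instead of calling a
downstream service directly.

Requires the google-cloud-pubsub package; project, topic, and payload
names are hypothetical.
"""
import json

from google.cloud import pubsub_v1

PROJECT_ID = "example-project"   # hypothetical
TOPIC_ID = "account-events"      # hypothetical


def publish_event(event_type: str, payload: dict) -> str:
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)
    # Consumers subscribe independently, so the publisher and each consumer
    # stay loosely coupled and can scale on their own tiers.
    future = publisher.publish(
        topic_path,
        data=json.dumps(payload).encode("utf-8"),
        event_type=event_type,  # message attribute for subscriber-side filtering
    )
    return future.result()  # server-assigned message ID


if __name__ == "__main__":
    print(publish_event("balance.updated", {"account": "1234", "delta": -42.5}))
```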

00:13:41

So for collaboration, we made sure that we had the right kind of tooling in place so we could move as quickly as possible, yet in a highly disciplined way. We chose Jira for our agile processes, Confluence for documentation, and Slack for high-bandwidth communication. In terms of development, we chose Terraform and Ansible as the core pieces that we wanted all of our main automation built on top of. But there are times when Terraform and Ansible are not necessarily the best tools, so we chose Go and Python to make sure that we additionally had capabilities all the way down to a lower-level programming language, like Go, for example. We additionally wanted to make sure that we were building in security, and with that in mind, we wanted to make sure that we were using things like Forseti or other security policy-based capabilities.
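
As one example of where a general-purpose language fills the gap (a hedged illustration, not KeyBank's actual checks), here is a small Python policy check over Terraform's JSON plan output; the required label name is a hypothetical convention:

```python
"""Policy-check sketch: scan a Terraform JSON plan for risky bucket settings.

Assumes the plan was exported with `terraform show -json plan.out > plan.json`.
The required label is a hypothetical tagging convention.
"""
import json
import sys

REQUIRED_LABEL = "cost_center"  # hypothetical


def violations(plan_path: str) -> list[str]:
    with open(plan_path) as f:
        plan = json.load(f)
    problems = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change") or {}
        if rc.get("type") != "google_storage_bucket" or "delete" in change.get("actions", []):
            continue
        after = change.get("after") or {}
        if not after.get("uniform_bucket_level_access"):
            problems.append(f"{rc['address']}: uniform bucket-level access disabled")
        if REQUIRED_LABEL not in (after.get("labels") or {}):
            problems.append(f"{rc['address']}: missing '{REQUIRED_LABEL}' label")
    return problems


if __name__ == "__main__":
    found = violations("plan.json")
    print("\n".join(found) or "No policy violations found.")
    sys.exit(1 if found else 0)
```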

00:14:36

And then finally, we wanted to make sure that our development, especially for the Terraform code, was as easy and as solid as possible, and we ended up using Terragrunt for this, for our repositories. We had to make a hard decision between GitHub Enterprise and GitLab, both fine tools, great tools in fact, and we ended up choosing Bitbucket. We felt that was the best tool to align with both the Jira and Confluence experience, and the integration there is very, very good for us. Also, for our binaries and our Docker repositories, we chose Artifactory as our main repo. For building and continuous integration, we chose Jenkins, Kubernetes, and things that help us quickly deploy. And then for continuous delivery, we chose Terraform Enterprise, Ansible Tower (actually, we use the open source AWX version), and Digital.ai, which is XebiaLabs XL Deploy.

00:15:33

The platforms that we landed on for now are: Google Cloud as our main compute platform; Azure for any SaaS-type things like Office 365, Intune, and the automation around our workstations; Anthos for on-prem Kubernetes compute; and finally VMware, which is still core to a lot of the compute that we have here at KeyBank. Some metrics I wanted to mention along the way are around our GCP interconnects. I mentioned we had to build a lot of infrastructure to make sure the interconnects between Google Cloud and our on-prem environments were in place, and we built all of that with Terraform. As you can imagine, with well over a hundred devices per environment, we wanted to make sure that was fully automated and repeatable. In the end, we were able to do a full end-to-end deploy of firewalls, switches, routers, all of the different security pieces that had to be in place, and all the accounts that had to be in place.

00:16:31

And we were able to build that end to end in under eight minutes per environment, and it's all operational at the end of that run. Another area is our on-prem clusters: one environment, for example, has about 75 different Kubernetes nodes, and we know that we can build those in well under 10 minutes, with node scaling in less than two minutes and pod scaling anywhere from one to five seconds. And finally, upgrading: we wanted to make sure that all of our upgrades were zero-downtime upgrades. We don't really know exactly how long an upgrade is going to take; it varies depending on the size of the environment, which is why I put a bunch of Xs in there, but we wanted to make sure that we understood those environments. The most important concept is zero downtime. Some of the upgrades can take up to 60 minutes, but again, without any downtime, it doesn't really matter how long it actually takes.

00:17:21

And the process is really a rolling update: you take a node out, you wait for the cluster to rebalance, you update that node, put it back into the cluster, let the cluster balance again, pull the next one out, and walk through each one of the nodes until you have a full upgrade in place. Then finally, in terms of scale and growth, we have over 400 GCP projects. That's a fairly small amount for today, but it's still a lot from just a year of work. All of the automation that we were able to build for these is really driven by the fact that we have locked down our GCP console to read-only, and that makes sure that anybody using our GCP platform has to build automation in order to deploy things.
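
A minimal sketch of the rolling, drain-and-upgrade loop described above, assuming kubectl access to the cluster; the actual per-node upgrade step is platform-specific and left as a placeholder, and the readiness check is simplified:

```python
"""Rolling-upgrade sketch: cordon, drain, upgrade, and uncordon one node at a time.

Assumes kubectl is configured for the target cluster; the node upgrade step
itself is environment-specific and only a placeholder here.
"""
import json
import subprocess
import time


def kubectl(*args: str) -> str:
    return subprocess.run(
        ["kubectl", *args], check=True, capture_output=True, text=True
    ).stdout


def node_ready(name: str) -> bool:
    node = json.loads(kubectl("get", "node", name, "-o", "json"))
    return any(
        c["type"] == "Ready" and c["status"] == "True"
        for c in node["status"]["conditions"]
    )


def rolling_upgrade(nodes: list[str]) -> None:
    for name in nodes:
        kubectl("cordon", name)
        # Evict workloads so the cluster rebalances before the node is touched.
        kubectl("drain", name, "--ignore-daemonsets", "--delete-emptydir-data")
        # Placeholder: run the platform-specific node upgrade here (hypothetical).
        kubectl("uncordon", name)
        while not node_ready(name):  # wait for the node to rejoin before moving on
            time.sleep(10)


if __name__ == "__main__":
    items = json.loads(kubectl("get", "nodes", "-o", "json"))["items"]
    rolling_upgrade([n["metadata"]["name"] for n in items])
```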

00:18:05

We know that we can create a project in under 40 seconds, and soon that's all going to be self-service through ServiceNow. Along our cloud journey we've had hurdles and challenges, and I wanted to speak directly about a few of them. Top of mind is going to be the security requirements. It's a complex subject and a complex area, and we started off by just taking our existing on-prem policies and on-prem tooling and trying to apply those to the cloud. That didn't work out so well for us, so we had to really step back and think about how we can make sure that we are secure and that our policies are still in line, but take a more cloud native approach to some of the security concerns. That's been very successful for us, and it's been done through a couple of things.

00:18:52

One is high-touch collaboration, iterative development, lots of testing, and, yes, bringing in some consultants: we've had a number of security firms come in, and we've had Google come in, the Google services team, to help us through some of these security concerns. Scale is another big area, in that it's been very difficult for us, a small team of six people, to support all of the cloud capabilities. So we've chosen a more federated model, to make sure that each team has increased their skill sets and capabilities to have cloud on the list of those things. And we've helped them in a number of ways: with dojos, with infrastructure as code, and by pairing with their teams along the way as we scale out the entire cloud platform across KeyBank. Time pressures have also been tough. Most cloud companies are moving really fast.

00:19:47

Banking is transforming at light speed, and so we want to make sure that we have taken an approach that allows us to move quickly. Iterative cycles have helped reduce cycle times from ideation through deployment, and having C-level leadership support allows us to do things like develop an overall COVID response in a matter of two weeks and get that to our customers. We've built some new, complex things that normally would take a month or so, and we got them done in a couple of days. A lot of that is continuous testing, continuous security testing, and continuous development through iteration. Skill sets are also a really big thing. As I mentioned, as we use this federated model, how do we get all of our teams, including our cloud teams, up to speed as quickly as possible? OJT is a really big one; it's not our only approach, but we know hands-on, on-the-job training and pairing has really helped get folks' skills up to speed.

00:20:45

Udemy is also one of the big things that we use here at KeyBank, but we also allow really any other training programs, and we encourage our folks to get up to speed on cloud compute, automation, and a lot of the core things that are needed for successful cloud deployments and cloud projects. And dojos are also at the center of a lot of the work we do. Costs are another thing. As I've mentioned, for the teams that are moving to the cloud, we charge back their compute. They didn't necessarily like that at first, but as they started seeing some of the reporting and some of the work they were doing, they really started to use and leverage a lot of that reporting, and to request different kinds of reports so they can get the visibility they want into their cloud compute. So that's it. I wanted to thank the DevOps Enterprise Summit for giving KeyBank the opportunity to share our cloud journey and a lot of the DevOps principles that apply to that journey. So thanks, and best of luck on your journeys.