Las Vegas 2018

Next-Generation Ops at Standard Chartered Bank

Shaun heads up the Cloud Infrastructure team at Standard Chartered Bank in Singapore. With 20+ years of experience in the industry, Shaun has previously held roles at EDS in Ottawa, lastminute.com in London, VeriSign in Cape Town, and Amazon Web Services in Singapore. Shaun’s team are building a developer tools pipeline and corresponding cloud capability as the new standard for software delivery in the bank.

Shaun Norris

Global Head, Cloud Infrastructure Services, Standard Chartered Bank

Transcript

00:00:05

Good morning. I'm Shaun Norris. I run Cloud infrastructure services at Standard Chartered Bank in Singapore, and it's a real honor and pleasure to be here today and to share a bit of our journey of next generation operations and a bit of our DevOps journey in general. So, how did we get here? Um, I met Damon Edwards at the DevOps Enterprise Summit last year in San Francisco, and when we got chatting about some of the stuff we were doing in operations, he suggested that it might be an interesting idea to come and share it at the summit in London. Well, after some trepidation and, you know, really a, a fantastic experience in London, amazingly enough, Gene and the organizers asked me to come back and, uh, kind of give you a, an update on where we're at. So it's a real privilege to be here. It's, um, it's a bit humbling because my DevOps journey kind of started in 2012.

00:01:04

I was working on an MBA while working full-time. I don't recommend that to anyone. And while doing that, I was searching around for a topic for a, for a thesis paper, and I kept, I kept hearing about this DevOps concept, and at first, I, I mean, I'm an operations guy by trade. I started in the mid nineties with EDS that you can see up here. And I ended up in London in 2000 with lastminute.com, and ended up with VeriSign doing some security work. But these have all been infrastructure jobs. You know, I, my first job was as a Novell systems administrator. So, you know, I started coming across Agile teams in the early two thousands with lastminute.com. And then really around this 2011, 2012, I started hearing all this talk about DevOps and being skeptical, being an operations guy, paranoia is a very healthy trait.

00:01:55

Uh, if you've spent as long in operations as I have, and it was, uh, you know, I, I was skeptical. So my thesis was really a, a fairly poor attempt, really, if I'm honest, at trying to investigate was there any reality to this DevOps thing? Were there actual real practices, or was it just a marketing buzzword to sell, uh, you know, the next version of software. So that got me introduced to, you know, people like John Allspaw, at least reading their books and, and, uh, you know, Gene Kim and Dr. Nicole Forsgren, et cetera. And so, you know, we, well, I went on this journey of kind of getting immersed in this community and learning a whole lot from it, and ended up at the very first DevOps Summit in San Francisco. So this is my fourth one now. It, it's definitely imposter syndrome to be here on this stage and sharing it with people who've really started and created the movement.

00:02:47

I consider myself, uh, hopefully a fast follower of those who've kind of built this movement. And it's, it's, uh, awesome to be here. Let's talk a little bit about Standard Chartered Bank. Um, you've probably heard of us because we sponsored the Liverpool Football Club jersey. So, you know, our name is on the front of the football club jersey. Um, but, you know, I think more interesting story to introduce the bank is what we do with that. We've got this really amazing charity called Seeing is Believing that we've partnered with since 2003. In 2003, some folks got together and said, how can we, you know, help prevent blindness? Now, just to give you some idea of the numbers, there's over 200 million people globally who have some sort of vision impairment. More than 30 million of those are, uh, are actually blind. And they estimate that 80% of that could be prevented or cured with the right medical interventions.

00:03:37

So, uh, Standard Chartered set out in 2003 to raise a hundred million US dollars, and it's really awesome to share with you guys to start out today that, uh, just this year we've achieved the hundred million dollars, uh, two years early. So we've helped, you know, you can see the numbers up here. We've helped kind of 4 million people, uh, with, with actual interventions. We've trained 300,000 health professionals. We've done hundreds of projects worldwide. This is one of the most gratifying parts of working at this organization, that we are, we're more than just a bank. We're involved in the communities we're in, and we're, we're trying to make a real difference in, in people's lives, of our customers and the communities they live in. So, so I'm really pleased with this. It's a, it's an exciting part of working for the bank.

00:04:25

So, you know, on the topic of the bank, um, we've been around for almost 165 years. Uh, Queen Victoria signed our Royal charter that kind of initiated the bank way back then. So we predate things like electric lights and electricity, and definitely things like DevOps and cloud. We had about 14, uh, billion dollars in revenue in US dollars last year. Um, our technology is headquartered in Singapore. We operate in over 60 countries, and we've got more than 9 million individual customers. We've got more than a thousand applications in production, and we've got really all the way from mainframe through to microservices running in containers, you, you know, and everything in between. So if we look at our footprint, this is one of the things that makes Standard Chartered unique. If I go back, uh, really quickly, you can see our mission up there in the top, that it's really about our unique diversity.

00:05:20

And this unique diversity comes from our really unique footprint. All the blue shaded areas you can see on this map are, um, areas where we operate as a bank. So you can see we're really heavily involved in Africa, the Middle East, and Asia. And you know, the interesting thing about this is that each of these countries have their own financial regulators. They often have their own views on things like data sovereignty and cloud governance and, and many other things. So, um, a bit of background about me. I, I don't come from a banking background, as you saw the kind of NASCAR slide of the, some of the companies I've been lucky enough to work with. And, you know, I, I've really only been in banking for about three years in the technology side. And, you know, one of the most surprising things was just how much paperwork there is.

00:06:11

You know, you hear about it, but when you actually experience it on the inside, it's immense. So you think those 63 or so countries that we're in, uh, with regulations, we also have obviously a lot of internal compliance and policy. And so to give, to put this into context for you, the average application we put into production, we have to look after at least 150 different security controls that have to be mapped, tested, evidenced, audited, et cetera. And so, um, this, this is one of the overwhelming challenges of doing technology in a financial services environment versus doing it, uh, you know, in a, uh, in a regular enterprise.
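To make "mapped, tested, evidenced" concrete, here is a minimal sketch of one way a control could be expressed as an automated check, in the spirit of the "compliance as code" idea the talk returns to near the end. Everything in it is an assumption for illustration: the `Control` record, the `SEC-001` and `SEC-002` identifiers, and the configuration keys are hypothetical and are not the bank's actual controls or tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable


@dataclass
class Control:
    control_id: str                   # hypothetical identifier
    description: str
    check: Callable[[dict], bool]     # automated test against an app's config


def evaluate(controls, app_config):
    """Run every control and return one evidence record per control."""
    evidence = []
    for control in controls:
        evidence.append({
            "control_id": control.control_id,
            "passed": bool(control.check(app_config)),
            "checked_at": datetime.now(timezone.utc).isoformat(),
        })
    return evidence


# Two made-up controls; a real catalogue would hold 150 or more per application.
CONTROLS = [
    Control("SEC-001", "Data at rest is encrypted",
            lambda cfg: cfg.get("encryption_at_rest") is True),
    Control("SEC-002", "TLS 1.2 or newer is enforced",
            lambda cfg: tuple(cfg.get("min_tls_version", (0, 0))) >= (1, 2)),
]

app = {"encryption_at_rest": True, "min_tls_version": (1, 0)}
results = evaluate(CONTROLS, app)
failed = [r["control_id"] for r in results if not r["passed"]]
print(failed)
```

The point of the sketch is that the evidence record is produced by the same run that performs the test, so there is nothing to assemble by hand afterwards.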

00:06:54

So if we now come into the infrastructure world, kind of where, where I've operated in for most of my career, um, the result of all this regulation and paperwork is that our processes over time have grown organically to be optimized for compliance and not speed. Many of these, um, processes are around managing risk and controls, and they were really invented or designed for waterfall, one to two times a year releases. And so in an era when development teams and businesses want to go faster and they want to embrace agile and they wanna release more often, the status quo for infrastructure provisioning looks increasingly antiquated. You know, servers still take weeks to provision in our environment. And while parts of the delivery chain are automated, a lot of it remains manual. You know, one story in particular, I think, will illustrate how we tend to add controls and bureaucracy to our work over time.

00:07:49

You know, we have this idea of, uh, a break glass mechanism for production, that if you want to, uh, do an operation in production, say there's a production incident going on, and you need to SSH into a Linux server to, you know, maybe do some troubleshooting and remediation, you have to go through and unvault a privileged password from a system that, you know, stores the password. You unvault it, you log who you are, what incident ticket you're on, you go in and do your work, and you save it in the change request. And, you know, you re-vault the password. As if that wasn't enough, in order to make sure that you only did the things you were supposed to do with that privileged password, we have an extra component where, uh, we like this idea of maker checker. This is an idea that well predates, you know, technology even, or information technology, that in a banking environment you'd really like anyone's work to be checked by someone else, and you know, as an account holder, I'm sure you appreciate that level of diligence around, uh, you know, maintaining the right account balances, et cetera.

00:08:48

But, you know, in production, if you are doing a change, then you need to have your manager kind of sign off. So what that means in practicality is that after this change request is done and checked back in, your manager then has to watch a video of your entire screen session and then attest and sign off that you did the things you were supposed to do and you didn't do any of the things you weren't supposed to do, effectively at least doubling the amount of, you know, uh, resource effort that goes into production changes. So, um, this is, this is a flow chart, nothing outta the ordinary, of our change and release process. It's roughly 37 steps. It's about a 10 day SLA for a normal production change. And this is largely manually driven. Uh, we heard yesterday from Dr. Forsgren and Jez Humble that, um, you know, change approval boards are, uh, are not really correlated with IT performance.

00:09:41

I wonder how they feel about pre change approval boards where you have a meeting to prepare for the change approval board meeting. 'cause we've got those too. And, uh, you might not be surprised that this process and this way of working is sometimes prone to delay. You know, in a typical incident, this is, uh, this is one that I was involved with about a year ago. We had this many different teams all on a conference bridge to try and resolve the problem. These aren't just different individuals. These are actually different teams with different leadership, uh, who all had to get on. And so each of them had their different lens, uh, of what was working and what wasn't working. And what you find is the further down the stack you go, by the time you get to network or data center people, they, they often don't know the application or its context of how it serves the bank or its customers.

00:10:31

Um, the other challenge we've got is that we've already had a failed cloud transformation. We set up a proper bimodal kind of, uh, you know, consulting-compliant separate team, and we spent a couple million dollars on it and we didn't accomplish very much, and it was a real big failure. So, um, you know, I won't spend a lot of time on that, but failure's expensive. And banks historically don't like failed projects. And so people are a little nervous about this DevOps and even agile thing. You know, uh, people think things like DevOps just means, oh, I'm using Jenkins, so I'm doing DevOps. Or, oh, our DevOps team takes care of all the DevOps thing. And you know, sometimes people have been using Agile as an excuse to just, you know, skirt around the bureaucracy, which fair enough, their intention is good, but you know, we've got, you know, uh, commitments to regulators and compliance, et cetera.

00:11:30

So at this point, you're probably thinking, well, this is a pretty depressing story and there's lots to be pessimistic about. And you know, perhaps that's true. But really what I wanna focus on this morning is talking about, you know, why we're optimistic that we're making progress, even if it's small incremental progress towards, you know, what the DevOps movement is really all about. So, you know, we're gonna talk about a few things up here. Um, one of the things in general is that we've got recognition at all levels of our organization that we need to change the way we do technology. And we've got really enthusiastic participation really across groups and outside organizational structures to do that. One of the things we did is we got together in my group, which is called technology services. So, you know, cloud infrastructure sits within technology services. That's a important point to bring up that I'm not part of an innovation lab or a separate kind of incubator or, you know, like a bimodal go fast team.

00:12:23

We're actually part of the core infrastructure team that, you know, the technology services that runs all thousand plus applications in the bank. So we happen to be doing some of the cloud and DevOps pipeline work, but we're right there with the rest of, uh, you know, the infrastructure folks in the bank. And we got together with all of our leadership earlier in the year, and we actually came up after a one day workshop with this set of principles and tenets. And you probably saw I spent some time at, uh, AWS earlier in my career. And one of the things that I took away from that experience is how powerful this idea of starting with the right thinking is, that if you can come up with the right principles and agree on them and then implement them and hold each other accountable to implement them well, you can do really big things.

00:13:07

And it also streamlines things and removes bureaucracy, because now you don't have to argue over first principles all the time because you've agreed on them. So in, you know, the Amazon six pager process you've heard of talking about principles is kind of a key part of that. And, and if you were to prepare a six pager internally that didn't have the principles laid out, usually someone senior in the room would ask you, why not? Um, so let's jump in and talk about some of the stuff we're doing. One of the things I wanna call out is that when you're talking about, you know, cloud in particular, um, which is, you know, the high level title of my team, uh, the, my big takeaway from the last year and a bit is really the how is bigger than where. And so, um, the, you know, have to credit people like Cornelia Davis at, uh, at Pivotal, who I first heard this from.

00:13:57

But you know, if you think of that long provisioning chain that we put up earlier, the bottleneck in that process is not a systems administrator struggling to find out how to right click and launch a new VM in vCenter. You know, and so, you know, we've heard this a lot and it's really resonated the last couple days at the conference, that if you just pick up your data center with a forklift and you put it into anyone's public cloud kind of infrastructure as a service, you're gonna be disappointed. And I saw this on the other side of the table, working for a cloud provider, folks who didn't wanna do the work. Uh, Andrew Clay Shafer's analogy of kind of health and fitness as a metaphor for digital transformation really resonated. And I'm gonna steal that from him. I'm sorry, I'm gonna use it over and over again.

00:14:41

'cause I think it's awesome. It, it reminds me of the kind of maybe person looking like me who's sitting on the sofa watching, you know, someone buying weightlifting equipment or something, going, yeah, I'd really love to be thin, but I don't wanna put in the work. So that, that really resonates in terms of how I think a lot of enterprises, not just ours, kind of operate. So one of the things we've done is we've built a DevOps pipeline. So I'll just throw it up quickly. These are all pretty standard tools, but what we've tried to do is say, how do we, how do we modernize the way that teams deploy to production? So if we look at the best teams in the world, whether it's Amazon or Google, or Microsoft or Netflix, uh, one of the commonalities I've found is that they all have a very rigid defined process of how you're allowed to go to production.

00:15:26

At Amazon, it was called Apollo. Uh, I caught up with some of the Microsoft team a couple weeks ago. They told me that within Azure, they've got something very similar. Within the bank, we're a little more fluid, we've got several different routes. And so when you don't have that standardization of how to go to production, it's difficult. Um, the last mile has been a really big challenge though. When I got involved with this team about, you know, six months ago, it was, um, a situation where lots of teams were on the left side of the, of the pipeline, the kind of stuff you see up here in collaborate and build, and maybe even test, but they weren't getting all the way to production. So, you know, of a 10 mile journey, o only eight or nine miles were paved. And the rest of the time you had, it was too bumpy to even drive.

00:16:13

You kind of had to get out and walk. So one of our big focuses for 2018 has been how do we get apps end to end, all the way from kind of idea and from the business all the way to production, automated all the way through with all the compliance checks, and they have to end up on a cloud environment. Now, I would love for our definition of cloud to be the same as what we heard last night in terms of, you know, the, the five things you really need. Um, right at this point we're starting incrementally. Cloud is like, can I drive my infrastructure with an API? So we said if it's API driven infrastructure in a public cloud or in our own data center or on a container platform, hopefully Kubernetes, then, you know, we'll, we'll count that as, as kind of compliant.
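The working definition above (API-driven infrastructure, in a public cloud, in the bank's own data center, or on a container platform) can be written down as a one-line rule. The sketch below is only an illustration of that rule as stated in the talk; the `Target` record, its field names, and the location labels are hypothetical and do not describe any real inventory system.

```python
from dataclasses import dataclass

# The three places named in the talk where API-driven infrastructure counts.
ALLOWED_LOCATIONS = {"public_cloud", "own_data_center", "container_platform"}


@dataclass(frozen=True)
class Target:
    name: str          # hypothetical deployment target name
    location: str      # one of ALLOWED_LOCATIONS, or anything else
    api_driven: bool   # can the infrastructure be provisioned through an API?


def counts_as_cloud(target: Target) -> bool:
    """Apply the incremental definition: API-driven, in an accepted location."""
    return target.api_driven and target.location in ALLOWED_LOCATIONS


hand_built_vm = Target("legacy-vm", "own_data_center", api_driven=False)
k8s_namespace = Target("payments-ns", "container_platform", api_driven=True)

print(counts_as_cloud(hand_built_vm))
print(counts_as_cloud(k8s_namespace))
```

Note that location alone never qualifies a target: a manually provisioned VM fails the rule wherever it runs, which matches the "how is bigger than where" point made earlier in the talk.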

00:16:58

Now, when I talked to some of you in London in June, uh, we had one app in the whole bank that was live end to end on that journey. As of today, we've got 13. So we've made good progress. We've been doing kind of more than one every two weeks since, since, uh, since I last talked to you. So, so we feel good about that. But when you look at overall numbers, it's, it's still a lot of work to do. So we've got roughly a third of the bank's applications on our new pipeline. We've done almost a million builds on the platform this year. About 4% of those can deploy to some environment, and 1% of those can deploy to cloud. Now you might say 1% cloud. Wow, you guys haven't even gotten started. What I would say is, if, you know, that's 1% by number of applications we've got, we're using cloud to do things that we really can't do in our own data centers.

00:17:49

Uh, just as a idea every, or as a, to give you an example of this, without sharing hard numbers, we spin up more cores every day in public cloud to do some risk simulations than we have in all of our private data centers combined by a factor of two or three, and then we shut them down again six hours later. So we've got a couple of applications in that 1% that are using huge amounts of compute that we couldn't really afford to spin up and down like that anywhere else. So one of the, one of the other things that we get challenged a lot is like, what's your multi-cloud strategy? How do we avoid lock-in? How do we, we hear this from regulators, we hear this from, uh, our executive, from our business. And so we've been thinking a lot about what does multi-cloud mean?

00:18:35

And so I've done kind of, uh, just a simple quadrant map of how we're thinking about cloud. Kind of in the top half is public, kind of somebody else's data center. In the bottom half is our own data center. On the right half is maybe more modern cloud abstractions, and on the left is more infrastructure as a service. So, you know, we think in the fullness of time we're gonna be doing business with all three of the large, uh, IaaS providers. The regulation and complexity of operating in those 60 countries, though, uh, really means that we have to, uh, we have to have a strategy for running in our own data centers for some time. I, I mean, I think if you look 10 years in the future, it's quite possible we could be entirely in a public situation, but probably not in five.

00:19:21

And you know, what we've, uh, learned a lot along the way is that, you know, this is hard and, you know, we've, we've got some OpenShift running in our own internal environment. If you look over on the right side, we, we think Kubernetes is the way as well. I went to John Willis' talk, uh, uh, at this conference, and it was awesome as usual. Uh, and, and I agree that Kubernetes and containers are the future. We've also been seeing that the friction for our teams going live, like of, of those 13 apps that I talked about, more than 10 of them are in containers, uh, on OpenShift. So those were way easier to do. And, and the app teams were able to move a lot faster than, uh, than they could otherwise. So, you know, this is kind of how we're thinking about cloud. Uh, really, I, I think the top left quadrant is already obsolete.

00:20:14

If, if you're having to manage VMs and log in and patch them, it kind of doesn't matter whose data center they're in. So maybe if you're using composable infrastructure and it's all terraformed up and you know, you, you don't really have to touch things, and there's, there's lots of tools out there to kind of slipstream and make managing VMs easier. But we really want to get to the top right, or at least the bottom right. Um, let's talk about some of the less sexy stuff or the, the more legacy parts of our environment, though. You know, when I joined the bank, it wasn't actually to run the cloud team. It was to run a production operations team running all the applications in the retail bank. So about 250 applications were directly under my purview in that. And what we found is that whenever we had stability issues, we went back and did a bit of a review over a few months of stability, uh, incidents.

00:21:08

And what we found from that is that there were some common causes. One of the common causes of instability was the fact that we were doing a lot of things manually. Things like code deployments, things like DR failovers, things like service restarts. And so, um, really interesting story of innovation. One of the things banks grapple with is how do we be more innovative? How do we avoid being disrupted? How do we, uh, you know, keep up with the cool FinTech kids? How do we be more innovative? And so you see lots of things going on, whether it's innovation labs or you know, uh, incubator labs or those sorts of things. And those are all great, but you need to be innovative in your core technology team as well. There can't be like, that's why bimodal doesn't work for me. You need to innovate and improve in your, in your kind of core operations as well.

00:21:54

'cause that's where you'd probably need it most. And so a junior engineer, uh, about two years ago in the bank said, I've heard of this thing called Rundeck, and I think we should start using it. So it was an unfunded project, and he really had no budget to do it. Uh, he didn't really have the organizational position to get it done, but he just kept at it and he kind of got it through the bureaucracy to get it installed as a POC, and he got it up and running. And so that was in the web support team. This is like web middleware, WebSphere mostly, support. And they started using it for things like service restarts and, you know, some failover type activities. Well, when I came in and started running the retail team, I said, well, we need some of this. And so, um, this has moved on quite a bit.

00:22:42

Uh, you know, with, with people like Damon's help, we've, we've now gotten to the point where we've run hundreds of thousands of Rundeck jobs. Really interesting from this because we're using it for specific, uh, incident remediation. Uh, you know, say there's a service restart. Well, rather than having to go in and unvault a password and do that whole video watching story that I told you about earlier, now because it's constrained and in version control and can be audited, there's no, we can pass that requirement because it's, it's a kind of a chain of custody thing. You know exactly what the script's gonna do every time, and you can tell its provenance, et cetera. And so we've found that it reduces incident time. So TTR reduces by on average 25 minutes for apps where we're using Rundeck versus where we don't. So that, that's really exciting.
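The "constrained, in version control, and can be audited" argument can be pictured with a small sketch. This is not Rundeck's API and not the bank's implementation; it is a plain-Python illustration under stated assumptions. The job name, the script text, the `run_job` helper, and the injected `executor` callable are all hypothetical.

```python
import hashlib
from datetime import datetime, timezone

# Hypothetical allow-list: job name -> exact script text as held in version control.
APPROVED_JOBS = {
    "restart-web-service": "systemctl restart example-web.service\n",
}

AUDIT_LOG = []  # in practice this would be an append-only store


def run_job(job_name, operator, incident_ticket, executor):
    """Run only pre-approved jobs and record who ran what, for which incident."""
    if job_name not in APPROVED_JOBS:
        raise PermissionError(f"{job_name!r} is not an approved job")
    script = APPROVED_JOBS[job_name]
    record = {
        "job": job_name,
        "operator": operator,
        "incident": incident_ticket,
        # The hash ties the run to one exact, reviewable version of the script.
        "script_sha256": hashlib.sha256(script.encode()).hexdigest(),
        "started_at": datetime.now(timezone.utc).isoformat(),
    }
    record["exit_code"] = executor(script)
    AUDIT_LOG.append(record)
    return record


# A stand-in executor so the sketch runs anywhere; a real one would run the script.
def fake_executor(script):
    return 0


record = run_job("restart-web-service", "operator-1", "INC-0001", fake_executor)
print(record["job"], record["exit_code"])
```

Because the operator can only pick from the approved list and never gets an interactive privileged shell, the audit record plus the script hash stands in for the screen recording: a reviewer checks the script once, not every session.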

00:23:33

We think next year Rundeck is probably gonna save us about 28 people years worth of work, um, at fairly conservative estimates, and we haven't even rolled it out particularly widely yet. So, uh, this has been a huge win for us. If, uh, I'd be happy to talk to anybody if you, if you wanna find out more about that. The other thing we've been experimenting with is, you know, shifting to an SRE team model. Now, you know, at banks it's a lot easier to just call things new names rather than actually do the new name. I think in this case, we're actually taking good steps towards doing the new name. A colleague of mine named Venkat Raghavan, he has been experimenting with an alternative to the typical L3 support model we used. What happened there is that production support people like Venkat and me in my previous job, we would pay a budget to the development teams to actually fund L3 work.

00:24:25

So this is like bug fixes, operational fixes. Inevitably what would happen is those would go down to the bottom of the pile priority wise, and then they wouldn't get done. So Venkat said, well, why don't I take some of this L3 budget back? You lend me some developers. I'll pull some people with coding skills up from operations. So we kind of had, you know, developers and operations working together and, you know, they, they made some really good results, uh, within a couple months. This is really early, it's still experimental. We're still kind of feeling our way through this and seeing what works and what doesn't. But this is really exciting. They've reduced their backlog by like 80 some percent in the first few months they've been doing this. And, uh, they've reduced the amount of incidents, they've improved stability in the platform. It's, it's a really exciting story.

00:25:08

If I look at the bank and the organization as a whole, you always hear in these, you know, talks, and this is really a converted audience already, but culture is so important. One of the exciting things that happened earlier this year is we streamlined and simplified kind of our, our codified culture. Before we had six or seven different kind of statements of culture or, you know, valued behaviors that were, you know, they were good, but they weren't as good as these. And these are really simple, kind of: do the right thing, better together, uh, never settle. And then with some individual specific sub behaviors under those. And this has been rolled out bank wide, not just in technology. And it seems to have started to become a sort of a movement that I start, I, I see people regularly hashtag things like hashtag better together when they're trying to collaborate with a team in, in a different department.

00:25:57

And if someone's kind of pushing back that we should do something better, they'll hashtag it in an email or, you know, uh, on our internal social media with like, you know, hashtag never settle. So this is really optimistic. I think it might take a long time before the true outcome of it is really realized. But that, um, is exciting. If, uh, if we look at, you know, we got a lot of work still in progress. We're still trying to figure out things like our support model, um, like how we really do SRE and make it scale across the whole team. Like how do we retool our processes from this kind of manual, handmade paper process to something that is API driven and, and eliminates handoffs, kind of implement lean, uh, across our process world. We're, we're struggling with that. I, it, it's, uh, it's work in progress, but we've got a lot of work left to do.

00:26:49

And so if I kind of finish up with things we've learned along the way, um, you know, if I go back to the story of this junior engineer who kind of launched our Rundeck idea, often innovation comes from unusual ideas. You know, the other story I sometimes tell is about Gmail, that at Google that apparently came from somebody's 20% time. This was just an engineer who thought, hey, it'd be cool to have an email system. And in his 20% time he came up with a prototype and then it kind of snowballed and went from there. So sometimes, you know, we think the more senior leader you are, well, all the innovation's gonna come from, from me and my team, and I'm gonna have an offsite and we're gonna incubate innovation ideas and we're gonna go that way. It doesn't always work that way. Be open to innovation coming from unexpected sources, but eventually it's gonna need, you know, if you're a senior leader in an organization, it's gonna need your kind of backing to get it further and get it to the next level.

00:27:42

Um, you know, the, uh, don't let perfect be the enemy of the good in this start small. We're, we're, you know, not fully cloud as in all the five factors of scalability and burstability, et cetera, like we heard about yesterday, but we're starting somewhere. Let's get our infrastructure API driven, let's remove manual processes. Let's take our core world and start making it, making it better that way. So, so that's really encouraging. What I'll, I'll finish up is, uh, you know, help, I'm looking for, uh, feedback on this. You know, if you have, uh, complaints, disagreements, questions, et cetera, more than happy to chat with you. Um, Kubernetes it, it may be and probably is the future, but it seems to change so fast that by the time we get the semblance of a strategy written down, it feels old and antiquated. Um, we're also grappling with how do we extend what we've done really around application deployments now and apply it to data as well.

00:28:40

How do we, you know, take things like evolutionary database design and scale it across a large number of teams to make, you know, kind of a DevOps model for how data and schema changes and, and putting data in secure, safe ways into public clouds, et cetera, gets done. And, you know, compliance as code is the other one. A lot of our compliance is manual. We need to get to continuous compliance. I liked the ideas this week I heard of, you know, kind of minimum viable compliance. We're, we're still looking for inspiration and examples and ideas of that. So, um, with that, I'm gonna wrap up. Uh, I hope this has been of some use. It's been a really interesting journey. I wanna thank the organizers again for the opportunity to be here. Uh, I'll leave you with my details and my Twitter handle up in the top left. And that, uh, you know, this is, uh, from a recent advertising campaign we did, that our journey is, uh, really just getting started. It has no finish line and it, and it, uh, is gonna continue. So I really like this as a metaphor for, uh, you know, continuous improvement and continuous learning. Thank you very much.