The Art of Platform Engineering (US 2021)

Sanjeev Sharma is an internationally known DevOps, Cloud Transformation, and Data Modernization thought leader, technology executive, and author. Sanjeev’s industry experience includes tenures as CTO, Technical Executive, and Cloud Architect leader. As a former IBM Distinguished Engineer, Sanjeev was recognized at the highest levels of IBM’s core of technical leaders. Sanjeev is currently Head of Platform and Automation Engineering at Truist, the 6th largest bank in the US, created by the merger of BB&T and SunTrust. Sanjeev provides leadership to drive the adoption of cutting-edge solutions, architectures, and strategies for DevOps, Cloud, and Data driven transformations, working with C-level executives leading these transformations. Sanjeev published his 2nd bestselling book, The DevOps Adoption Playbook, in 2017. He regularly blogs and podcasts on DevOps, Cloud, and Data Modernization on his popular blog http://sdarchitect.blog

John Comas is currently Senior Vice President of Platform DevOps Automation at Truist Financial, managing the DevOps transformation for the 2nd largest bank merger in US history. Formerly, he was Manager of DevOps Solutions at NBCUniversal for over 9 years, where over 200 critical applications were automated with DevOps enterprise services.

breakout, US, Las Vegas, Vegas 2021

Sanjeev Sharma

SVP, Head of Platform Engineering, Truist Financial

John Comas

SVP, Platform DevOps Automation, Truist Financial

TRANSCRIPT

00:00:12

Hi everyone. I'm John Comas. I was born and raised in Northern New Jersey, so I guess you can say I'm a Jersey boy. I've been working with DevOps principles and practices since before DevOps was an industry buzzword. I'm currently SVP of Platform DevOps Automation for Truist Financial. I was previously the DevOps leader for Barnes & Noble.com and NBCUniversal. I received my PhD in systems engineering, DevOps, from Stevens Institute of Technology in Hoboken, New Jersey. And in short, I've been a lifelong advocate of advancing technology through promulgating the principles of enterprise DevOps and the elegance of standardization and simplification.

00:01:06

Thank you, John. And thank you, Gene, for having us again at the DevOps Enterprise Summit. Just like John, I have been in the DevOps industry since long before it was called DevOps. I actually met John in 2013. He was one of my clients. I was at IBM, and I was the first Distinguished Engineer at IBM who was helping clients adopt DevOps. Around that time, as Gene mentioned, I got an opportunity to work with dozens of clients around the world, helping them adopt DevOps at large scale. Today I'm at Truist, the seventh largest bank in the US and the fourth largest insurance holding company on the planet. And I'm taking what I learned working with several of those clients and implementing that, working with John, as he mentioned. Now, for those of you who do not know Truist, it was created back in 2019 by the merger of BB&T and SunTrust.

00:01:55

And that made us a very large bank. Very rarely does a bank suddenly enter the top 10 list out of nowhere, but here we are. The merger is ongoing. And while the merger is ongoing, we are taking this opportunity to improve our capabilities, improve our developer experience and our developer productivity. And that's what we want to talk about today: how are we doing that? So to take a step back, let's talk about what a platform is. We call this session the art of platform engineering, right? It sounds very philosophical, sounds very artistic, but at the end of the day it is all about improving developer productivity and developer experience, as I mentioned. And there's no better quote or description I could find on that than Matt Skelton's definition of what a platform is, and I'll read it as written for you because it is concise.

00:02:46

And I love things which are simple and concise. He says a platform is a curated experience for engineers to accelerate the delivery teams that use it. Let's dissect this a little bit, because this is the philosophy, this is what we are building at Truist, and this is what we're going to talk about. The first keyword there is curated, as opposed to ad hoc. We don't want developers to have an ad hoc experience which varies based on how long they've been there, what technology they're using, what tool set they're using, who is on their team, or what kind of product they are building. It should be a curated experience. Now, I agree that in a large enterprise it cannot be the same experience for everyone. It might vary, but we want to reduce that variance and make it a better experience, for lack of a better term. And it is for engineers.

00:03:35

Now, we are using the term engineer here generically, right? It means anybody who's a stakeholder in getting requirements converted to code running in production. And running in production also, so it includes people in operations, it includes people at the help desk, incident management, everybody. Our goal is to accelerate their capability, to make them more productive, make their life easier, and reduce toil upon them. That, at the end of the day, is the definition of what we are calling a platform and what we are building here. So when I got started (I've been at Truist only less than a year and a half, and John joined a couple of months after me), we looked at what was going on and we decided: okay, if we had a blank slate (and trust me, we don't; obviously software is being developed here), how would we design a platform?

00:04:25

What would we want it to look like? And we came up with a few tenets. Now, if you've heard me speak before, I've been talking about these tenets since long before I joined Truist. So this is not new, but these are what we are implementing right now, and I'll go into detail about how we are implementing them later in the presentation. First and foremost, the first tenet is that everything should be self-service. So how can we take all the capabilities and services that our shared services organization, which we are a part of, is delivering to the developers, to the engineers, so that they can do their jobs, and make them self-service? We need to make sure that the self-service is not manual, not ticket driven. Second, and this is very important when you're talking about a large enterprise, especially one in a regulated industry, it needs to come with permission to act.

00:05:11

It's pointless having something which is self-service if you still need to get approvals from five people before you can self-serve. In order to give permission to act, what we need to make sure of (and that's our responsibility, and we will talk about that) is that we build guardrails around every service we are delivering via self-service, so that people don't break things and it doesn't create jeopardy for us and our company. All of this, then, will only come about if we have a culture and environment of trust. The engineers who are consuming our services trust us that the service will behave, perform, and function as designed, and that we'll be able to deliver the SLAs, or SLOs, we are promising. And we can trust them that they are not trying to bypass all the guardrails and game the system.
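To make the "self-service with permission to act" tenet concrete, here is a minimal sketch of a request being checked against guardrails instead of being routed to human approvers. Everything in it is an assumption for illustration: the talk does not describe Truist's actual portal, and the names `ServerRequest`, `check_guardrails`, `self_serve`, and the limit values are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical guardrail limits; the talk does not give actual values.
ALLOWED_SIZES = {"small", "medium"}
ALLOWED_ENVIRONMENTS = {"dev", "qa"}
MAX_SERVERS_PER_REQUEST = 5


@dataclass(frozen=True)
class ServerRequest:
    team: str
    environment: str
    size: str
    count: int


def check_guardrails(request: ServerRequest) -> list[str]:
    """Return the guardrail violations for a request (empty means permitted)."""
    violations = []
    if request.environment not in ALLOWED_ENVIRONMENTS:
        violations.append(f"environment '{request.environment}' is not self-service")
    if request.size not in ALLOWED_SIZES:
        violations.append(f"size '{request.size}' is not allowed")
    if not 1 <= request.count <= MAX_SERVERS_PER_REQUEST:
        violations.append(f"count must be between 1 and {MAX_SERVERS_PER_REQUEST}")
    return violations


def self_serve(request: ServerRequest) -> str:
    """Act immediately when inside the guardrails; no ticket, no approver."""
    violations = check_guardrails(request)
    if violations:
        return "rejected: " + "; ".join(violations)
    return f"provisioned {request.count} {request.size} server(s) in {request.environment}"
```

The design point is that the policy check replaces the approval chain: a request inside the guardrails is acted on right away, and a request outside them is told exactly which guardrail it hit.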

00:05:56

There's a mutual understanding of trust. As we started building and designing the platform, we looked at it as a layer cake approach. We need to have an ability to provision environments, configure environments, and deprovision environments: an environment pipeline. Next, we need to have an application delivery pipeline, which is what John will talk about. He's responsible for that. And all of this needs to be secure and compliant. Even if you're not in a regulated industry, you need stuff to be secure and compliant. But when you work for a bank, when you're in a regulated market, everything needs to be secure and compliant. We started building this multilayered cake and we're in the middle of doing that. We want to share how we got where we are and where we are headed. But here, very briefly, is the capability map of what each of those areas looks like.

00:06:44

So we have environment engineering, which forms the basis. You can't do anything without an environment. Then you have the ability to capture requirements, write code, test code, and deliver code: the entire application delivery pipeline. And then, on top of that, you have a data pipeline. How do I get data to the right people at the right time, so that they can make the right decisions? This includes test data. It also includes other data, which is feedback coming from operations as to how the application is performing and behaving in the real world. And obviously all of that has to have an aura, a fondant layer or a chocolate layer on that cake, of security and compliance. On the top, you'll see, is what we are specifically building: a portal through which the developers can consume all the services via self-service and participate in that curated experience we are talking about. What I'd like to do now is hand it over to John. He'll talk about the application delivery pipeline, the layer of the layer cake which he is responsible for. So over to you, John.

00:07:48

Thanks, Sanjeev. When I started my career nearly 20 years ago, one of the most pervasive industry challenges I encountered was how large corporations had difficulty in releasing software successfully into prod. There was a definite need to deploy changes rapidly to a live customer-facing system with minimal to no disruption, but there was no mechanism to do so, and teams were not organized in such a way as to make this viable. In and around 2008, the industry really began to see the DevOps movement take hold as a way to modernize traditional software development practices. But one of the challenges I encountered was that the principles and practices of DevOps were often viewed as more conducive to a startup or smaller organization. Implementing DevOps at a large corporation was seemingly a monumental task that involved changing the way hundreds or even thousands of people work daily and how they interact with their peers.

00:09:00

And this is where enterprise DevOps was born for me. Much like the agile process itself, getting the IT organization to scale DevOps consisted of many small iterative steps to achieve a cohesive, well-oiled end-to-end process. It's crucial that when you embark on your enterprise DevOps journey, you steer clear of the pervasive anti-patterns which will present themselves. I've learned that these traps can sneak up on you and muddy the waters for your transition. Firstly, and most importantly, DevOps is not a new silo that sits between dev and ops. DevOps engineers are highly comprehensive in their knowledge set and understand the application undergoing development, the SDLC tooling, the hosting model, et cetera. DevOps is not a tools enablement team for developers. DevOps is also a cohesive end-to-end automated process, and you cannot automate half of the software development process. I've learned that to be successful, it's really all or nothing. For example, you can't say that we implemented continuous delivery for dev, QA, and stage, but we're still doing manual prod deployments. That just doesn't work. And we have to be careful not to confuse SRE, site reliability engineering, with DevOps. We need to prevent the use of non-standardized and/or multiple team-centric DevOps frameworks, which can create a hero anti-pattern. Next slide.

00:10:55

So as we all know, the term DevOps, as well as the term DevSecOps, is a portmanteau of development, security, and operations. In the industry, the term DevOps has multiple definitions and perspectives on its core philosophy. In my career, and here at Truist, I have defined DevOps through the use of the five Cs: continuous integration, continuous delivery, continuous testing, continuous monitoring and feedback, and continuous compliance. And while we all know the core of what CI, CD, and CT mean, there are key aspects to these practices which a DevOps leader must be aware of. When we implement CI, we're not just implementing continuous build; we're implementing full, true continuous integration. Remember that you're not just setting up pre-flight builds, nightly builds, et cetera. You have to successfully branch and merge and continuously integrate your developers' check-ins back to the mainline.

00:12:08

There's a big difference between continuous delivery and continuous deployment. You have to judge for yourself which is the most realistic and appropriate for your organization, and I've learned that it depends upon the application which is best. With continuous delivery, your DevOps pipeline is creating a deployable asset and release package which is deemed fully tested and approved and is in a holding pattern, ready to be deployed to prod. With continuous deployment, the deployable asset is automatically deployed to prod as soon as it's deemed ready. This is a critical implementation decision, which needs to be mapped to your industry type and organizational appetite for change. Also, and I can't stress this enough, never silo your DBs. DevOps for the database is critical. The DB changes should be implemented right along with your non-database code in the same CD process. So if you ask me what the single most important step toward implementing enterprise DevOps has been for me, I would say that it was simplicity.
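The distinction between continuous delivery and continuous deployment drawn above can be sketched as a small decision function. This is only an illustrative sketch; `ReleaseMode`, `next_action`, and the returned strings are hypothetical names and not part of any pipeline described in the talk.

```python
from enum import Enum


class ReleaseMode(Enum):
    CONTINUOUS_DELIVERY = "delivery"
    CONTINUOUS_DEPLOYMENT = "deployment"


def next_action(mode: ReleaseMode, tests_passed: bool, approved: bool,
                release_requested: bool = False) -> str:
    """Decide what happens to a release package at the end of the pipeline."""
    if not (tests_passed and approved):
        # Neither mode ships a package that has not been deemed ready.
        return "blocked"
    if mode is ReleaseMode.CONTINUOUS_DEPLOYMENT:
        # Deployed automatically as soon as it is deemed ready.
        return "deploy to prod"
    # Continuous delivery: the package waits in a holding pattern
    # until someone decides to release it.
    return "deploy to prod" if release_requested else "hold release package"
```

Both modes share the same gates; they differ only in whether the final push to prod is automatic or waits for a release decision.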

00:13:21

I started the enterprise journey being as simple as possible, never trying to boil the ocean. The goal has always been to improve the quality of the code developed and to deploy faster and more efficiently at a reduced cost. I like to promulgate the idea of the five-point DevOps star: simplicity, traceability, accountability, repeatability, and reliability. The DevOps pipeline has to build, configure, and deploy software of any platform and technology. We designed a robust system that is both highly scalable and highly available. And so, quite simply, the end-to-end pipeline should be able to deploy anything anywhere, on premise or cloud. You should be able to achieve everything with your single enterprise pipeline. Next slide.

00:14:25

So, to the greatest extent possible, we standardized our tool sets and promulgated a single unified path to production. Standardizing build and deployment practices reduces your costs and prevents errors through automation, with faster and more frequent releases. We created unified development standards, which deliver confidence and code quality. Your pipeline needs to exude confidence and always reliably and rapidly push out changes to production to meet the ever-changing business needs. All developer tools used in your process need to come from the enterprise toolbox. This is very critical to ensure stability and compliance. And so, as you can see here, I like to look at DevOps tooling as a pyramid. At the bottom of the pyramid, our developers have access to a rich and diverse set of tools from the enterprise toolbox. But as we move up the pyramid, which represents the various states (dev, QA, stage, prod), the standardization of the tool set narrows.
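One way to picture the tooling pyramid is as a set of allowed tools per stage, where each stage's set is contained in the one below it. The sketch below is an assumption made for illustration: the tool names and the `ALLOWED_TOOLS` table are invented placeholders, not the actual enterprise toolbox.

```python
# Stages ordered from the base of the pyramid to its apex.
STAGES = ["dev", "qa", "stage", "prod"]

# Hypothetical tool names; the talk does not list the actual toolbox.
ALLOWED_TOOLS = {
    "dev": {"deployer", "build-a", "build-b", "test-a", "test-b"},
    "qa": {"deployer", "build-a", "test-a", "test-b"},
    "stage": {"deployer", "build-a", "test-a"},
    "prod": {"deployer"},
}


def narrows_toward_prod(allowed: dict[str, set[str]]) -> bool:
    """Check that every stage allows only a subset of the previous stage's tools."""
    return all(
        allowed[later] <= allowed[earlier]
        for earlier, later in zip(STAGES, STAGES[1:])
    )


def is_allowed(tool: str, stage: str) -> bool:
    """Check whether a tool may be used in a given stage."""
    return tool in ALLOWED_TOOLS[stage]
```

A check like `narrows_toward_prod` could run whenever the toolbox definition changes, so that nobody widens the set of tools allowed near production by accident.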

00:15:41

So when you reach the apex of the pyramid, you are, to the greatest extent possible, achieving a unified path to production. Next slide. Much like our holistic universe, mother nature, and humanity itself, our IT enterprises are comprised of highly complex systems of systems, where each individual system, while capable of independent operation, interoperates with other independent systems to create a fully comprehensive system which is overall greater than the sum of its parts. You have to be very careful and critically aware of system interdependencies and how you roll out releases to production. That's something I've learned throughout my career. Transitional state deployment problems are actually at the core of my PhD thesis. Application interdependencies, and the potential impact that a deployment may have on live systems, need to be very carefully analyzed. I cannot emphasize enough the importance of understanding the effect that a deployment can have on the holistic system undergoing change.

00:16:56

Next slide. Enterprise DevOps standardization and simplification is a fundamental building block of our core implementation. And because our systems are so complex, you need to bring order to the chaos and be simple. I've always said, think about how you can do more with less. Think about using an aphorism like "I'm implementing a single button instead of a keyboard." The simplified methodology allows you to focus your energy on developing the logic necessary to deal with enormously complex interdependent software systems of systems. You have to understand the transitional states during a software deployment and continuously assess them throughout all aspects of the enterprise DevOps pipeline. When I was working in e-commerce, I remember encountering a situation where, in a highly complex order management system, a change to one of the key middleware messaging systems between our front end and the fulfillment system caused major disruption during a deployment to the live system. Customers were able to place online orders, but they could not see their order status being updated, as the deployment was affecting the messaging queues. The messages were there and generated properly, but were just being held in a queue for release post deployment.

00:18:33

So frustrated customers who couldn't see their updated order status called up customer service requesting refunds and order cancellations, which the customer service reps obliged. However, even though customer service reps canceled the orders, those cancellations didn't update, because the message queuing system was still backlogged from the deployment. So all those orders still flowed to the fulfillment system and got processed. Even though the orders were canceled and credit cards were refunded, customers still received their orders. And since you couldn't recharge the credit cards of customers, people received free merchandise, and you can imagine that's not great for the financial health of your business.
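The order story above can be illustrated with a toy model of a held message backlog. The talk does not describe how the real systems processed messages, so this sketch rests on a loudly stated assumption: fulfillment ships an order as soon as it reads the "placed" message, and the matching "cancelled" message sits later in the same backlog. The function names and message tuples are hypothetical.

```python
Message = tuple[str, int]  # (kind, order_id), kind is "placed" or "cancelled"


def drain_naively(backlog: list[Message]) -> list[int]:
    """Process held messages one at a time, shipping on every 'placed' message.

    Assumption: fulfillment acts on each message as it arrives, so a
    cancellation further back in the backlog arrives too late.
    """
    shipped = []
    for kind, order_id in backlog:
        if kind == "placed":
            shipped.append(order_id)
    return shipped


def drain_with_reconciliation(backlog: list[Message]) -> list[int]:
    """Scan the whole backlog for cancellations before shipping anything."""
    cancelled = {order_id for kind, order_id in backlog if kind == "cancelled"}
    return [
        order_id
        for kind, order_id in backlog
        if kind == "placed" and order_id not in cancelled
    ]
```

With a backlog of `[("placed", 1), ("placed", 2), ("cancelled", 1)]`, the naive drain ships both orders while the reconciling drain ships only order 2. The broader point from the talk still stands: the transitional state during a deployment has to be analyzed, whatever mechanism is chosen to handle it.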

00:19:22

Thank you, John. Now let's go into where we are. How do we take these two banks, which were separate, which had different philosophies and different IT technologies, merge them, and bring in that simplicity John talked about? If we look at the organizations, they were very diverse, very different. I have worked with dozens of large enterprises around the world, and they all have a high variance of technology stacks, tools being used, team maturity, and so on. You're taking two enterprises who had that variance, putting them together, and trying to create one unified organization. How do we bring all those people together? Remember, we are talking of tens of thousands of developers supporting thousands of applications for tens of millions of customers. Furthermore, our goal was not to build to support the bank as it would be today, as a combined Truist, but to support the bank as it would be three years from now, based on our strategic plans and all our growth plans. Organizing those is one of the things we are doing, and those of you who watched the TV series Loki will get this reference, right?

00:20:36

We are acting like the TVA. We are pruning how things are done. We are reducing it, moving up John's pyramid, standardizing as you move to higher environments and towards production. So reorganizing ourselves to support this large and diverse team set, even as we start moving towards standardization, is very important. We are the platform team, which is taking the capabilities from across all IT shared services and delivering them as the platform to the developers, to the engineers who are consuming them. What John is doing, and what he talked about, is taking these diverse application delivery teams and tools for the various teams we support and helping them standardize, not just in terms of the standards we are writing, but standardize on that unified path to production, so that we can make life easier, not just for ourselves, but in the way it was talked about in the Dear Auditor book. So where are we today?

00:21:32

Right. You see some color coding here, and I'll go over it in detail, but I want to walk you through what we have achieved. Now, I've been here, as I mentioned, slightly over a year, and we are still very early in our journey from that perspective, but we weren't starting at zero either. We have a bank with a lot of automation already in place, but there is still a long way to go. Let's talk about it area by area. First of all, I'd like to start off with test automation. That's probably the most mature area we have as far as automation is concerned. We are at 100% coverage for our performance testing. For example, I just got an email a few weeks ago from one of the projects, which is way ahead of schedule, which is rare to hear nowadays, right?

00:22:15

For any project anywhere, right? We are ahead of schedule on getting all the performance testing done, and that's an area where we are very mature. But it is not an area with self-service, by definition, because performance testing is highly specialized. We have a specialized team which goes in and works with the application teams to figure out what needs to be tested, what kind of loads they are expecting, load and stress, and all the various criteria which go into performance testing. The other aspect of testing where we have pretty much close to a hundred percent coverage is test automation. We have a centralized test center of excellence, which has developed a test framework which everybody uses, and all the teams which have been onboarded to it (and as I said, there's close to a hundred percent adoption) utilize that test framework to test their applications.

00:23:02

Of course, there are always outlier applications, right, where the technology doesn't fit the test framework. Barring those, the rest are using the standardized test automation framework. We are gathering all the test data at a centralized location to ensure coverage and ensure compliance with all our QA requirements. And this goes on even in the area of security testing, where we are doing code coverage analysis and code vulnerability analysis. All of that is automated and is done to a very large extent via self-service by the individual teams wanting to consume those tests. At the other extreme, we have areas like SRE. In fact, I mention SRE now because just this morning I got an email from a project where we are piloting SRE. SRE is a new area for us, where we are looking at what the most common toil areas, as SRE folks call it, are for an operations team.

00:23:58

We are working with one of our divisions, and we have experimented with what we can automate to reduce that toil on the operations teams when it comes to handling the repetitive tasks. This is mostly in the data space, where we have started, and we are helping the data operations team address some of these automations, which can reduce the toil and make life easier for our operations team. Another area, a place where we already have a self-service portal, is environment provisioning, you know, putting out servers. All that work is completely automated. The workflows are automated. Our teams can come in and provision their own servers. Where we are looking right now, so that we can truly make it self-service, is in the guardrail space. Recall our four tenets.

00:24:47

I spoke about self-service as one of them, but self-service is kind of not possible to roll out to various teams to consume if we don't have the guardrails in place. Without the guardrails, you cannot control what kind of servers people configure and provision for themselves. So we are working on providing those guardrails. That way we can open up that self-service portal to anybody, anything that needs servers, to put it very frankly. I can go into test data automation, right? All the test data we use is masked. All the PII, the privacy data, is obfuscated in it. That's a hundred percent coverage. What we want to get to is self-service of test data, so we can use data virtualization so that a delivery team can self-provision data into the test environment.
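As a small illustration of what masking test data can mean in practice, the sketch below replaces selected fields with a salted hash so that the same input always maps to the same masked value. It is an assumption for illustration only: the field list, the `mask_record` function, and the salt are hypothetical, the talk does not say which masking technique is used, and a real masking solution would need considerably more care than this.

```python
import hashlib

# Hypothetical list of fields treated as PII.
PII_FIELDS = {"name", "email", "ssn"}


def mask_record(record: dict, salt: str = "demo-salt") -> dict:
    """Return a copy of the record with PII fields replaced by masked tokens."""
    masked = {}
    for key, value in record.items():
        if key in PII_FIELDS:
            digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
            # Deterministic token: the same input yields the same masked value,
            # so joins across masked tables still line up.
            masked[key] = "masked-" + digest[:12]
        else:
            masked[key] = value
    return masked
```

A self-service version of this would let a delivery team request a masked data set for its test environment without waiting on a central team to prepare it.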

00:25:38

We do that today, but it's not self-service. Another area where we have a lot of atomic automation is in the space of what John was talking about: the application delivery pipeline. I call it atomic because the automation works, but it is just all the Lego bricks. Or, as John put it earlier on, the individual keys on the keyboard are there, but we don't have the curated pipeline, the keyboard, so to speak. Your builds, CI/CD, all the gates and automated gates that need to be gone through in order for us to be ready to say a particular application is ready for deployment to production: all those individual automations are available. What we are working on is creating a composable pipeline, so that people can self-serve that pipeline itself and, with the push of a button, provision that pipeline. And of course the pipeline will need to be configured based on the application, based on the technology stack being used, and also based on the maturity of the team. The level of handholding which we in the DevOps team need to provide to the team, what John was referring to as synergize, how much is needed, depends on the maturity of the team.
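The idea of composing atomic automations into a pipeline can be sketched as follows. This is a minimal illustration under stated assumptions: the step names, the `CATALOG` registry, and the `compose` function are hypothetical, and each step is a stand-in that only records what it would have done.

```python
from typing import Callable

# A step takes the pipeline context and returns an updated copy.
Step = Callable[[dict], dict]


def build(ctx: dict) -> dict:
    return {**ctx, "artifact": f"{ctx['app']}.pkg"}


def unit_test(ctx: dict) -> dict:
    return {**ctx, "tests_passed": True}


def security_scan(ctx: dict) -> dict:
    return {**ctx, "scan_passed": True}


def db_migrate(ctx: dict) -> dict:
    return {**ctx, "db_migrated": True}


# Hypothetical catalog of atomic automations (the "Lego bricks").
CATALOG: dict[str, Step] = {
    "build": build,
    "unit_test": unit_test,
    "security_scan": security_scan,
    "db_migrate": db_migrate,
}


def compose(step_names: list[str]) -> Step:
    """Assemble a pipeline from catalog steps; unknown names raise KeyError."""
    steps = [CATALOG[name] for name in step_names]

    def pipeline(ctx: dict) -> dict:
        for step in steps:
            ctx = step(ctx)
        return ctx

    return pipeline
```

A team with a database would include `db_migrate` in its list of steps and a team without one would leave it out, which is one way the same bricks can yield pipelines configured per application and technology stack.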

00:26:45

The really dark color, which is purple, marks things which are at full adoption. That automation is built, and we are at a hundred percent adoption, with varying levels of self-service. The ones which are colored in orange are work in progress: we are building automation and onboarding teams, doing pilots like I talked about earlier on. And then there is work in progress. These are some new areas like SRE, where we are doing some pilots to really understand for ourselves what we need when it comes to SRE, what we want to focus on, where the low-hanging fruit is, and what our long-term strategy should be. We do not today, as you can probably guess, have a true platform. I mean, there's not a platform. These are atomic services which are available, some of them self-service, some of them not. What we really want is a portal, a marketplace, just like a private PaaS, where our teams can go in and consume these various services and compose them together to create the pipelines they need for their specific consumption needs and their specific requirements.

00:27:47

So how can you help? We're on this journey, and it's going to go on for a while. But if you have been on the journey that we've been on, and you're a little bit ahead of us, we'd love to learn from you. If you know where the landmines are and where the traps are, share that with us so that we don't step in the same ones again. Secondly, we need some guidance. We are still figuring out how to make the onboarding to the platform itself self-service. That means if one of my sister organizations, let's say the network team, wants to build self-service for our SDN capabilities, and we want to make that available on our platform, what are the SLAs, what are the criteria we use to allow them to self-serve onboarding their own new service, or updating a service they have available?

00:28:33

Because we understand how to put the services that John has on this broader platform but, as you saw, it's much broader than just DevOps. Other services will also be available, so that we can have an end-to-end developer experience. And lastly: do mandates work, or is it "if you build it, they will come"? We'd love to get some feedback from you on that. With that, I'd like to thank you all for your time. We'll be on Slack taking questions and answers. Thank you, it's great to be here. Thank you.