Las Vegas 2019

Shaping the Cloud - How We Transformed FINRA With DevOps

FINRA regulates the US broker-dealer industry and monitors market/exchanges daily, processing up to 135 billion transactions per day. Given its massive amount of data processing needs and 30+ petabytes of data storage, it was necessary for FINRA to rely on cloud computing and services to ultimately meet the regulatory demand. However, it was not an easy task to migrate the systems to the cloud due to the regulatory environment, compliance needs, audit requirements, security risks, and so forth. On top of that, culture change was an absolute must to shift from the traditional data center mindset to the public cloud world.

DevOps transformation was a key to success in migrating the entire portfolio of applications to AWS. The transformation involved organizational structure changes, introduction of new tools/concepts, continual education, and rapid refinements. So how exactly can you achieve this transformation for your organization?

In this session, Daniel Koo will speak about FINRA's DevOps journey, how it started and how it evolved throughout the past 3-4 years. The talk will highlight the organizational structure/transformation that worked, provide a walkthrough of the DevOps toolchain (custom tools, Open Source, SaaS) supporting from project inception to delivery/operations, as well as the successful bootcamp training implemented and conducted across 1000+ technologists. Additionally, it will outline how you can measure DevOps maturity to support sustainment and continuous improvement.

Daniel Koo

DevOps Products & Engineering, FINRA

Transcript

00:00:02

My name is Daniel Koo. I'm a senior director at FINRA, currently managing the DevOps products and engineering organization. I strive to be a visionary leader, an influencer, and a change agent wherever I am and wherever I belong. I have led the DevOps movement at FINRA since 2015, and I'm always looking for ways to positively disrupt the enterprise and then contribute back to the industry. All right. So, a brief history about myself, where I started and where I am now. Prior to 2007, I was a software developer. I worked for various companies, pretty much coding any task that was given to me. And then in 2007, I joined FINRA as a contractor to help with modern test automation development using Selenium and Java. I played that role for the next two to three years, and then FINRA appointed me as the automation architect. They wanted to see what I had built for several projects scaled across the enterprise. 2012 is when I joined FINRA as a full-time employee. And then 2015 was when I was asked to lead a group called productivity engineering. That's when we started to think about DevOps and CI/CD. And then in 2017, my team expanded, we grew, and we were tasked to implement DevOps across the enterprise. And here I am now.

00:01:41

So, a quick intro on FINRA. Who has heard about FINRA here? Quite a few. Wow. So FINRA stands for Financial Industry Regulatory Authority. We oversee the broker-dealer industry in the US and we regulate the markets. We have about 4,000 brokerage firms that we regulate and more than 634,000 securities representatives. And we monitor 12 different markets and exchanges. Our mission is to protect investors and promote market integrity. We are a not-for-profit organization. We have close to $1 billion per year in revenue. We have a total of 3,600 employees, and in the tech organization we have around 1,200 people.

00:02:31

FINRA is a big data company. We ingest up to 135 billion transactions in a single day. When we reconstruct this data, we can get up to trillions of edges and nodes. Our storage footprint is more than 30 petabytes of data. We have more than 150 different applications that we deploy in AWS to run our regulatory function. And we run up to 50,000 compute nodes per day. Not per month, not per year, but per day. So here's a quick snapshot of our data volume growth. In 2017, we had about 37 billion events on average per day, and now it has grown up to 135 billion transactions in a single day. So this shows how much data we're ingesting, how much we're analyzing, and how much data we're storing to do our regulatory function.

00:03:30

All right, so let's talk DevOps now. Our journey really began back in 2014 when we started to think about cloud. With the growing demand of the data volume, we couldn't sustain the data anymore in the data center on prem. So that's when we decided to move to the cloud. And when we thought about moving to the cloud, we had to change our mindset. We couldn't continue to practice the way we were doing in the data center, right? Having the ops team, the infrastructure team, provision the servers, and then we deploy the applications onto the servers, right? We had to change our mindset. How can we, in the cloud, provision, configure, deploy, automate, and operate? So we had to rethink, right? We had to reestablish our CI/CD strategies as well. So that's when, in 2015, a group was formed called productivity engineering.

00:04:29

It was a small team that was formed to think about what the gaps were. We were already practicing CI to a good level. But then when we thought about, again, cloud and infrastructure automation, we had to think about what gaps we had within our current toolchain, right? We were using things like Jenkins back then, SVN, Git, right? We were doing continuous integration. But again, we thought there were gaps in terms of the toolchain. So a lot of different entities got together, including architects, security, operations, and the developer community, to think about some of the products that we needed to build to practice DevOps. So 2015 and 2016 is when we really cranked up building these products, right? These products covered onboarding applications, provisioning, setting up your network and your security groups, and how to do deployment in the cloud. And as we were building these products, we had to think about baking in security, baking in compliance, and following architecture patterns within them, not just go and create any products. Right? With us being a regulatory company, we really needed to think about compliance and security. So we had to build all those things into the tools.

00:05:52

Now, we built these tools and we saw some projects starting to adopt them, right? So we started to see some success, but it wasn't scaling, right? We had 150 different applications that were trying to migrate to the cloud, but it just wasn't scaling enough. So what we decided to do in 2017 was to transform our software configuration management group, right, the SCMs, into an application engineering group. These guys became the experts in DevOps, experts in the tools and products that we custom developed and maintain, and they were embedded within the project teams. Now, it wasn't that these guys were doing all the DevOps work. That was not our point. That was not our goal. It was for these guys to go and help teams practice DevOps. That was the differentiator.

00:06:46

Now in 2018, we really wanted to scale across the enterprise, right? We were really serious about moving all of our systems to the cloud. That's when we invested a lot in training. So we built a self-paced bootcamp training website for developers to come anytime, at their own pace, and take the bootcamp training, where we teach the concepts and the tools and how to practice DevOps. So by 2018, we had about 600 developers who took the course, and they were able to actually start practicing DevOps. And 2018 was the year when we started to move a good portion of our portfolio to containers. We started to move to Docker, started to move to AWS ECS. So a good portion of our portfolio got migrated over to containers. And in 2019, this year, we started to look at serverless. The industry was moving to serverless, and we were seeing high maintenance costs even after moving to containers, right? So we started to think about serverless, using Lambdas, Fargate, and Aurora. So we're starting to see a small movement of apps to serverless. And also this year, we really started to look at AI/ML. Within our business application portfolio, there were apps that were already adopting AI/ML and seeing value. So what we wanted to do was look at what we can do within the DevOps space. How can we apply AI/ML to DevOps? So that's where we are now.

00:08:25

So I want to take a moment to read our vision statement. I truly believe in having a vision statement and a mission. It really helps the organization understand where we're trying to go and what we need to do. So we have four vision statements within my organization. First is to enable teams to deliver software faster and ensure reliability at enterprise scale, through automation and by building products with built-in security, compliance, best practices, and continuous monitoring. The second one is to advocate DevOps practices that allow teams to gain confidence in the delivery pipeline and empower them to continuously deploy to production on their own, anytime, anywhere. Third, provide a feedback mechanism to teams by continuously collecting data and making it accessible via meaningful interfaces. And lastly, promote collaboration and innovation by shaping the DevOps culture within the technology community. So now, are we meeting all these goals? Not yet, but this really helps us to drive and steer the way we want to go.

00:09:45

So let's talk about the organizational structure. I touched on some of the groups already, but the DevOps engineering practice at FINRA consists of these three different groups. First is the productivity engineering group. My entire organization is about 50 people serving 1,200 technologists. We have about 20% of the folks working within productivity engineering, focusing on building custom tools and maintaining developer tools like Jenkins, Bitbucket, Jira, Confluence, et cetera. These guys are full stack developers. They can develop front end and back end. They are DevOps and test capable. The second group is called cloud engineering. They are primarily responsible for releasing the base images, the Docker images and the AMIs, Amazon Machine Images, that bake in all the compliance, all the security, and all the governance tools that are needed to run our applications. They are system engineers with programming knowledge, and they are 20% of the organization. And lastly, the application engineering group. I briefly touched upon those guys earlier. These guys are the field engineers that are on the floor, helping the business application teams. They are the experts in CI/CD, DevOps, and SRE, and they are the automation gurus. 60% of my organization are application engineers. Without this group, it would not have been possible for us to scale DevOps across the enterprise.

00:11:26

So let's talk about the toolchain. There are a lot of tools that we use, but I want to highlight some of the custom tools that we built that really helped us to implement DevOps. And I want to speak to four different phases. These are the phases that project teams typically go through. The inception phase is when teams start thinking about building their app, or they want to migrate their app to the cloud. Kickstart is when they start coding. Development and testing is a cycle that they go through before production. And finally, release is after they go to production, with our monitoring and governance. So let's talk about the inception phase. We built a tool called Onboard. This tool basically sets up all of your access to AWS. It creates your Active Directory group, your key pairs, your IAM roles, your certificates, your token to make dynamic DNS registration, your Jira projects, your Confluence space, your Bitbucket projects, and your Jenkins setup.

00:12:32

So everything is done through this application. It used to take us a month, sometimes even two months, between different groups to get all of this set up. Now it happens in one day. The second app is called Portus. This is an app that we created in conjunction with our information security group. It helps teams manage their security groups in AWS. So InfoSec would come and create the policies, right? They whitelist the rules. And then the development team can come into this app, self-service, and start creating their security groups using infrastructure as code.
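As a rough illustration of that self-service model, the sketch below checks requested security group rules against an allowlist that a security team would maintain. It is not FINRA's implementation: the `Rule` and `AllowedRule` types, the `POLICY` contents, and the `is_allowed` function are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from ipaddress import ip_network
from typing import Sequence


@dataclass(frozen=True)
class Rule:
    """An ingress rule a development team asks for (hypothetical shape)."""
    protocol: str
    port: int
    cidr: str


@dataclass(frozen=True)
class AllowedRule:
    """A policy entry written by InfoSec (hypothetical shape)."""
    protocol: str
    ports: range
    cidr: str  # widest source network the policy permits


def is_allowed(rule: Rule, policy: Sequence[AllowedRule]) -> bool:
    """Return True if some policy entry covers the requested rule."""
    requested = ip_network(rule.cidr)
    for allowed in policy:
        if (
            rule.protocol == allowed.protocol
            and rule.port in allowed.ports
            and requested.subnet_of(ip_network(allowed.cidr))
        ):
            return True
    return False


# Example policy: HTTPS from the internal network, app ports from one subnet.
POLICY = [
    AllowedRule("tcp", range(443, 444), "10.0.0.0/8"),
    AllowedRule("tcp", range(8080, 8091), "10.20.0.0/16"),
]

if __name__ == "__main__":
    for rule in (
        Rule("tcp", 443, "10.1.2.0/24"),
        Rule("tcp", 22, "10.1.2.0/24"),
        Rule("tcp", 443, "0.0.0.0/0"),
    ):
        print(rule, "approved" if is_allowed(rule, POLICY) else "rejected")
```

The point of the design is that the policy is data owned by security, while teams submit rules as code and get an immediate answer without a ticket.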

00:13:16

Now, the kickstart phase is when they start coding. And when I talk about coding, this is both infrastructure code and application code. That's the difference, right, when we move to the cloud and think about DevOps. So we spent a lot of time thinking about how we can create resources in AWS in a compliant, secure manner, as well as putting in the architecture patterns that we want to see in our infrastructure. So we created a tool called Provision. Underneath, it uses CloudFormation and various APIs and SDKs to create the resources in AWS. But again, we define different stacks, right? We define the patterns. For a traditional web application, you use this type, and it will go ahead and create all your resources for you and configure them for you. That was the goal for Provision. App Config was the Puppet and Ansible module that we created to configure your servers.
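The idea of predefined stack types can be sketched as a lookup from a pattern name to the resources it should contain. This is only an illustration under assumptions: the `STACK_TYPES` catalog, the pattern names, and `render_template` are hypothetical, and the output is a CloudFormation-style skeleton (resource types only, no properties), not something deployable as is.

```python
import json

# Hypothetical catalog of architecture patterns. The resource type strings
# are real CloudFormation types; the grouping into patterns is invented here.
STACK_TYPES = {
    "web-app": {
        "LoadBalancer": "AWS::ElasticLoadBalancingV2::LoadBalancer",
        "AppServers": "AWS::AutoScaling::AutoScalingGroup",
        "AppSecurityGroup": "AWS::EC2::SecurityGroup",
    },
    "batch": {
        "JobQueue": "AWS::SQS::Queue",
        "Workers": "AWS::AutoScaling::AutoScalingGroup",
    },
}


def render_template(app_name: str, stack_type: str) -> dict:
    """Expand a stack type into a template skeleton for one application."""
    try:
        pattern = STACK_TYPES[stack_type]
    except KeyError:
        raise ValueError(
            f"unknown stack type {stack_type!r}; choose from {sorted(STACK_TYPES)}"
        ) from None

    resources = {
        f"{app_name}{logical_id}": {
            "Type": cfn_type,
            # Record which pattern produced the resource, for later audits.
            "Metadata": {"StackType": stack_type},
        }
        for logical_id, cfn_type in pattern.items()
    }
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"{stack_type} stack for {app_name}",
        "Resources": resources,
    }


if __name__ == "__main__":
    print(json.dumps(render_template("Portal", "web-app"), indent=2))
```

Because teams only pick a type, the approved pattern lives in one place and can be changed for everyone at once.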

00:14:15

that is a product that we built on top of Jenkins. I'm sure everyone here knows about Jenkins. What we created was to help teams build their build jobs, their deployment jobs, their pipelines, and orchestration, again in a compliant, secure, convenient manner. And Docker base images and base AMIs are what we produce, so that everything that's needed for you to run your application is built in. Now you get into the development and testing cycle, right? You use the same tools to continue to build your application. And then there are some other tools that we built. We have a tool called Fidelius, which manages your secrets in AWS.

00:15:02

Now it's ready, it's in production. And there we have different tools that we use to monitor and govern our infrastructure, right? Open source, third-party, and our custom tools to achieve that. These are the goals that we try to integrate into our tools, whatever we build, whatever we choose. Standardization, very important. Make it convenient, make it easy for people. Compliance. Building security in from the beginning, not at the end, but from the beginning. Integrating ops from the beginning. Embedding architecture patterns. And self-service and automation all the way through. So we have many open source projects out on GitHub. I encourage everyone to go check them out. We also have other tools that are coming out as well.

00:16:04

All right. So we talked about our history, our toolchain, and our organizational structure. Now I want to touch upon some of the best practices. I'm sure many of you guys are already aware of these, but I want to touch on these six items. First, automate everything. This was our motto going into DevOps and CI/CD: to try to automate everything that we can. I know we can't automate everything, but having this mindset helps when we're dealing with a problem or trying to solve a problem. Right? We always think about automating that. Infrastructure as code, configuration as code. We have a lot of APIs that we can integrate, right? Automated tests and rollback. Second thing, compliance and security. Many times you think about compliance and security at the end, but you need to be thinking about compliance and security from the beginning.

00:17:02

And also, the key thing here is baking the compliance and security into the tools themselves, right? No opting out. So when teams use a tool, they already have compliance and security built in: encryption at rest and in transit, authentication and authorization on all traffic, and making sure you have scans in place. Very important. The third thing is standardization and architecture patterns, right? I talked about that when we described the tools. We define different stacks, we define different types, and then let teams use those types, right? Make it easy for people, make it convenient for people, right? Easy to troubleshoot, no reinventing the wheel. That's what we try to do. Fourth, resiliency and reliability. Moving to the cloud, again, we have to change our mindset. We have to design for resiliency, right? In the cloud, your AZ could go down. Your server could go down.

00:18:07

Your service could just go down. So you have to design for resiliency and think about auto recovery. And then, delivery insights. We try to capture all the relevant data within the pipeline and use that to analyze how we're doing and predict what we want to do next. Right? So you've got to make sure you define the different data that you want to collect, catalog it, and use it to query, analyze, and visualize. Lastly, monitoring and governance. Similar to compliance and security, think about monitoring and governance from the beginning, not at the end. Preconfigure them, bake them into the tools, right? We want to have transparency. We want to be audit friendly and audit ready. So these are the different best practices that we try to follow.
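To make the delivery-insights idea concrete, here is a minimal sketch that turns raw pipeline events into a failure rate per application and stage. The `PipelineEvent` record, its field names, and `failure_rates` are assumptions made for illustration; the talk does not describe the actual data model.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class PipelineEvent:
    """One recorded pipeline run (hypothetical schema)."""
    app: str
    stage: str       # e.g. "build" or "deploy"
    succeeded: bool


def failure_rates(events: Iterable[PipelineEvent]) -> Dict[Tuple[str, str], float]:
    """Return the fraction of failed runs for each (app, stage) pair."""
    totals: Counter = Counter()
    failures: Counter = Counter()
    for event in events:
        key = (event.app, event.stage)
        totals[key] += 1
        if not event.succeeded:
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}


if __name__ == "__main__":
    sample = [
        PipelineEvent("portal", "build", True),
        PipelineEvent("portal", "build", False),
        PipelineEvent("portal", "build", True),
        PipelineEvent("portal", "build", True),
        PipelineEvent("portal", "deploy", True),
    ]
    for (app, stage), rate in sorted(failure_rates(sample).items()):
        print(f"{app:10} {stage:8} failure rate {rate:.0%}")
```

Once events are cataloged in a consistent shape like this, the same records can feed dashboards, trend queries, or later prediction work.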

00:19:05

So I want to show you how we're practicing this within FINRA, with a simple workflow. First, we have different entities working together to define the policies and standards, right? We have security and enterprise architecture. We could have the development community creating the policies and standards. And then those get fed into the product team. Right? The productivity engineering team that I talked about is the one that takes the standards and policies and creates the products. Once they're created, with the help of application engineers, development teams consume these tools. And then they start deploying to the different life cycle accounts that we have. And after that, because we have monitoring and governance in place, we can see what is happening within our infrastructure, right? What are the changes that are happening? And on some occasions, we even go back, we roll back to the configurations that we have set.

00:20:08

And now with that, we can create compliance reports that can be used by auditors, and it gets fed back into, again, the security, architecture, and development community. We refine the policies and standards. Now that goes into the products, we make enhancements and the necessary changes, and then you go into the cycle. And that is how we're practicing those best practices at FINRA. All right. So now we've got different tools and products that we developed, and teams are starting to use them, but how do we scale, right? How do we see this across the board? Again, our goal was to enable teams, right? My organization's goal was to enable teams to practice DevOps. We don't have a separate DevOps team that is doing all the DevOps work. The goal is to make developers practice DevOps.

00:21:06

And we also provide oversight. We need to become the feedback channel to listen to the developers. What are they going through? What problems, right? What do they want to see? So you have the focused product team developing the products, and the embedded engineers helping the developers and getting the feedback, right? And also continuously training these folks so that they are DevOps aware. So speaking of training, this is the self-paced bootcamp that I mentioned before. We created a website that the developers can come to, and this is purely lab-based exercises. So it's not just theories and concepts, right? They can actually come and do lab exercises. The goal was that at the end of this training, the developers can go and practice DevOps, right? Not just with knowledge, but with hands-on experience using our tools and understanding our concepts. So we teach things like tools and best practices, containers, serverless, database deployments, security, and operations. So we cover all the fundamentals of the DevOps practice through this training.

00:22:28

Okay. So now we're seeing DevOps being scaled across the enterprise, right? Now, how do we measure the maturity? We don't want to stay stagnant, you know, stay at the same level. We want to continuously measure how we're maturing, right? So there are six different aspects of measurement here. Automation level: we want to make sure teams are using approved CI/CD solutions, with no manual steps whatsoever and no downtime. Do they have adequate test coverage? Do they have health checks in place? So these are the things that we try to measure for each of the projects. Tool adoption: are they using approved solutions? If so, what version are they using? Because as we ship out new products, some teams are not upgrading. So we've got to make sure that as we release new features and enhancements, they do adopt the new versions.

00:23:37

CI/CD metrics: from beginning to end, from source code all the way to deployment, we try to measure and collect relevant data, right? From a source code perspective, how many check-ins are they doing, right? Are they checking in every day? Are they checking in only at the end of the sprint? Right? How many builds are they doing? How many failures do they have? How often are they deploying? How many failures do they have there? So all of these things are what we try to collect, and we analyze them to see their maturity. The next thing is called scorecard. This is what we developed in house. This is where we measure the compliance aspect, quality assurance, security, and operations. So what happens is, when teams are trying to go to production, we run the scorecard, right? We run the scorecard and go, all right, your score is X, right?

00:24:32

But you need to reach Y in order for you to deploy into production, right? So teams are forced to up their score so they can actually deploy to production. Deployment: we really look at their deployment strategies. Are they using things like blue-green? Do they have automated rollback, zero-downtime deployment? Do they have orchestration? Right. All these things are what we look at to see if teams are implementing any of these deployment strategies. Lastly, reliability and availability. We look at the outage numbers, right? How much downtime do they have in their servers? Are they auto scaling? Are they resilient? Do they have the right alerts set up? Are they multi-AZ, right? How good is their performance? So all of these elements and aspects are what we look at when we measure reliability and availability. So it's important for us to look at all of these measurements to see how mature teams are, so we can continue to get to the next level.
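The scorecard gate described above (your score is X, you need Y to deploy) can be sketched as a weighted score compared against a threshold. The four categories come from the talk; the weights, the 0 to 100 scale, the `REQUIRED_SCORE` value, and the function names are assumptions made up for this example.

```python
from typing import Mapping

# Hypothetical weights in percent; they sum to 100 so the result stays on
# the same 0-100 scale as the per-category scores.
WEIGHTS = {"compliance": 30, "quality": 20, "security": 30, "operations": 20}
REQUIRED_SCORE = 80.0  # the "Y" a team must reach (value invented here)


def overall_score(category_scores: Mapping[str, float]) -> float:
    """Combine per-category scores (0-100) into one weighted score."""
    missing = WEIGHTS.keys() - category_scores.keys()
    if missing:
        raise ValueError(f"missing category scores: {sorted(missing)}")
    return sum(WEIGHTS[c] * category_scores[c] for c in WEIGHTS) / 100


def may_deploy(category_scores: Mapping[str, float],
               required: float = REQUIRED_SCORE) -> bool:
    """Return True when the weighted score meets the production threshold."""
    return overall_score(category_scores) >= required


if __name__ == "__main__":
    scores = {"compliance": 90, "quality": 70, "security": 85, "operations": 60}
    print("score:", overall_score(scores))                    # score: 78.5
    print("production deploy allowed:", may_deploy(scores))   # False
```

In a pipeline, a check like `may_deploy` would run as a step before the production deployment job and fail the run when the score is too low, which is what pushes teams to raise their score.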

00:25:43

All right. So this is the last slide. What problems still remain? What are we trying to solve now? We are continuing to shape the future. One of the things that we're looking at is continuous deployment. I think we're doing pretty well on CI/CD and DevOps, but how can we get to the level where we're continuously deploying to production, right? We're building things like the production deployment readiness check, doing dry runs in production, doing auto upgrades. Blue-green, zero downtime, feature flagging: all the different deployment strategies are what we're trying to implement.

00:26:27

And also, using bots, using AI, right? Less humans, more bots. Apply ChatOps, right, to promote collaboration and being transparent. This is one area that we're looking at now, which we haven't solved yet. DevOps insight. I talked about being able to collect relevant data to analyze, right? So what we're trying to do is collect the data to be able to predict different patterns, right? Predict failures before they happen, right? Applying AI/ML is what we're trying to do with the data that we collect. There are a lot of different things that we can do with AI/ML. And it's not just a buzzword that the industry is talking about. I truly believe in using AI now in DevOps. I think that's really the next phase within DevOps in the industry. And with that, I want to thank everyone for listening. Like I said before, I hope you guys have something to take back to your organization. Please feel free to reach out to me. Below is my email, or you can find me on LinkedIn. I would love to continue our conversation. Thank you very much.