Las Vegas 2019

In Search of DevOps - the Evolution of a Data Warehouse Team Towards DevOps

This is the story map of our journey through our Corporate Data Department's cultural transformation to create autonomous DevOps squads managing multiple data warehouses.


We went through the classic transformation axes (people, process and tools) while continuously delivering projects, supporting 24/7 production environments and having fun.


We will share stories from more than 2 years of iterations/improvements/fails/learnings by our team, with the help of other internal groups at National Bank of Canada.


A lot was learned, from our redefinition of roles to the way our "source code" (ETL) is stored in git repositories, as we took control of our own software release life cycle and experimented with database virtualization and automated testing.

MC

Maxime Clerk-Lamalice

Software Engineering, National Bank of Canada

Transcript

00:00:02

When I joined a data warehouse team, two things were said to me: first, agile was not really working in this group, and second, automating the deployment of software components in this team was really not possible. So we did both, and this is our story. Before going further: who else in the audience is working in a data warehouse environment or on data-related work? All right. So my name is Maxime Clerk-Lamalice. I'm a software engineer by training, based in Montreal, Canada. Yes, we do speak French in Canada. Most of my time is focused on software engineering practice and building high-performance teams while, obviously, having fun. A quick intro about my journey: I started in the startup world, where I was for 10 years. We built software and hardware for the healthcare industry, then I went to e-commerce, then banking since 2016.

00:01:18

I'm at National Bank right now. In 2018, last year, I attended the London edition, and since then it has been a personal goal of mine to present at this conference, so I'm very excited. National Bank of Canada, based in Montreal, is one of the six major banks in Canada. It's the leading one in Quebec, and it's also very active in the small and medium business sector. Founded in 1859, it grew organically and through acquisitions, which obviously has an impact on the IT infrastructure, but also on the richness of the data that is available.

00:02:09

The bank's mission is to have a positive impact on people's lives. It does this by building long relationships with its clients through four lines of business. A couple of numbers to give you a sense of scope: more than 24,000 employees, of which around 2,000 are in IT, and 2.6 million clients. Keep in mind that National Bank is part of the Canadian banking system, which is highly regulated and very stable. A disclaimer before going further: this story is not about big data, data visualization, or even the cloud. It's about corporate data. Our team is around 65 employees, who are maintaining multiple ODSs (operational data stores) and EDWs (enterprise data warehouses). We're doing it with a pretty classical technical stack: DataStage, Oracle and Control-M.

00:03:34

Keep in mind that the data producers and data consumers never sleep, so we're always open, 24 hours. A bit more about the context: we are one of many teams at the bank using agile and DevOps to deliver solutions quicker and faster. But what's specific about us is that we are managing data, which means that we're very popular. We're involved in very small to very large programs, and the only thing they all have in common is that they're very anxious and impatient to get access to our data. Also, since we grew organically, our IT ecosystem is a mix of legacy and new systems, obviously.

00:04:33

So going back to 2016, a typical ETL deployment used lots of interactions between different profiles; over three teams were involved. We were releasing every three months or so, and the deployment was manual. So where did we start our transformation? Where did we put most of our focus? Obviously, on the classical three axes: people, process, technology. We started everywhere, but in iterations and always involving the team. So let's go through some highlights. About the people: having an agile mindset was key, so the bank invested in training, not only for the IT people but for the business people, which led to a mindset of "this is my job" versus "I will kind of help the team." A big change in the mindset of the people, which led to being able to have a "you build it, you run it" mindset at the squad level.

00:05:55

Secondly, we removed the QA people that came from a centralized group at the bank and moved quality to the team level, using different tactics: peer review, test coverage, and testing built into the pipeline. And lastly, we merged the dev and the ops teams, which was a big move for us. We did this more than six months ago, and those are autonomous squads right now, basically from the request to the production environment. That was a major change for us. About the process: to support the new autonomous squads, we changed the process. We moved to a unified backlog, coming from three different backlogs (a dev, an ops and even a management backlog), so we were able to have a unique view of what was in the pipeline and what was coming next.

00:07:05

And obviously, the refinement happens with different POs, at either the program level or the project level. Secondly, we are a data team, so obviously we have lots of data: data about the commits, about the time to release. And it grew from only acquiring the data to having a conversation with the team about the performance, about our reflexes, about the different best practices. That was a nice shift for us. And lastly, new requests: basically everything that came into the group, we were able to funnel through a service desk, with different workflows to dispatch it to the right squad. Technology: obviously, DevOps is often associated with tech tools. So obviously we built pipelines. We built pipelines to deploy ETL, we built pipelines to deploy our scheduler's schedules. And most of the new tools we did, we built them in our team.
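As a rough illustration of what a pipeline step deploying ETL could look like: `istool` is IBM's real command-line tool for exporting/importing InfoSphere DataStage assets, but the exact flags, paths and account names below are illustrative assumptions, not the team's verified invocation.

```python
# Sketch of a pipeline step promoting DataStage export archives (.isx) from a
# git-tagged release into a target environment. The istool CLI is real, but
# the flags and paths here are illustrative assumptions.
import subprocess

def build_import_command(archive: str, domain: str, user: str,
                         project: str) -> list[str]:
    """Assemble one (hypothetical) istool import call for an ETL archive."""
    return [
        "istool", "import",
        "-domain", domain,
        "-username", user,
        "-archive", archive,
        "-datastage", f"{domain}/{project}",
    ]

def deploy(archives: list[str], domain: str, user: str, project: str,
           dry_run: bool = True) -> list[list[str]]:
    """Run (or just print, when dry_run) one import per archive, in order."""
    commands = [build_import_command(a, domain, user, project)
                for a in archives]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # fail the pipeline on any error
    return commands

# Example: promote two jobs exported at the release tag (dry run).
cmds = deploy(["jobs/load_clients.isx", "jobs/load_accounts.isx"],
              domain="services:9445", user="deploy_bot", project="EDW")
```

The point of the dry-run mode is that the same script can be reviewed and tested in the pipeline before it is allowed to touch a real engine.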

00:08:20

So people grew accustomed to maintaining them. Earlier I talked about our classical stack. Right now there's a need for speed in accessing the data, so new integration patterns are coming in with the new projects, with the new programs. So we're building the team, adding new profiles to the squads, to have APIs, to have streaming capabilities. This is a great shift for us. Also, lastly, all our lower environments were switched from physical to virtual environments. With this, we obviously save lots of disk space and gain lots of speed to refresh the data. But most importantly, we were able to have self-serve capabilities at the squad level, so the developers could request and manage their own data pods, which was a big time-saver for all the programs currently in the team.

00:09:34

So what did change? Basically, the interaction for a deployment right now is at the squad level. There are no other teams involved, no multiple coordinators implied, no problem getting the right information about what is currently being deployed. This new capability has also allowed us to bring in new interns through mentorship, and also to bring the architects closer to the implementation. And obviously, the pipelines we talked about earlier are being used at the squad level. So a big shift for us. Key results: this new approach is making it scalable for us, so we can scale up to multiple squads or scale down based on the work required. We went from monthly to weekly deployments. We're more stable in production, with 55% fewer incidents. The performance, the global time it takes to ingest and manage all the data, went up, so it's much quicker with the same team. And obviously more fun, which was very important for us.

00:11:07

So, the timeline: there was no magic. We went from a classical banking model, with a dev team and an ops team and more of a data science/BI background for the developers, through the agile and DevOps transformation, and right now we've made the shift to autonomous squads. The future for us is very exciting, because the speed at which the team wants to change is getting faster and faster. That's very interesting for us: it means we can add new concepts and virtualize more environments, since we're using multiple tracks per squad. Another tool that we used was the team meeting as a snapshot in time: what changed since the last meeting, what can we improve, and getting feedback. Obviously, there were some challenges, and we're sharing three with you today.

00:12:13

There was lots of fixing forward. Obviously, prod was stable, but the knowledge of how things were fixed was not shared. So we made it public: we had what we called a "wall of fail," so everyone in the group was able to see what went wrong, but most importantly, what the solution was, shared with the whole group. Secondly, we built custom software and we built pipelines, so obviously we had to train our employees. But what we did not expect is that we had to train neighboring groups too, and that required more time than expected. This was a surprise, but the end result was that the general knowledge of software engineering practices went up across the group and also the neighboring groups, which was positive. And lastly, we grew a dependency on those DevOps experts.

00:13:20

They were doing the coding, they were doing the pipelines. We switched this mentality: we made sure that our employees understood that those DevOps experts were doing development. So right now, those experts are more like coaches for our employees, making sure that the new hires, but also the senior ones, have the right information. DataOps is still a concept for us. We're not yet done with the DevOps transformation, obviously, but it's now built into our mission. We want to have an open discussion about the complexity of managing data, the complexity of making data available across different groups, and making sure that people focus on the data, not only the code. Obviously, we are doing experimentation with new tools, for example data as code and data catalogs. We are using multiple vendors and it's still a work in progress.

00:14:36

So next year, obviously, I'll give you an update for sure. A couple of must-dos: be bold. Share your wins; make sure that everyone knows about them. It's super important to inspire. We developed a startup mindset at the department level. Obviously, there's a tax to running a startup inside a big corporation, so plan for it; complexity and processes come with it. Get closer to your end users, and especially define the data owner. Who owns the data? This will simplify discussions with POs, with PMs, and with the actual team about what to do with the data. Work on the hard and complex issues and problems: automating regression testing in a complex data warehouse environment is very hard, and you should be working on that right now, not only tackling quick wins. And lastly, set the department mission, but stay aligned with the enterprise guidelines. So, let's talk one-on-one. Here's what I'd like to hear about from you: how did you tackle automated testing in a complex data warehouse environment? And secondly, the conjunction of data governance and release management: how are you doing it? Those are two topics I would like to hear more about from you. Thank you.

00:16:37

If you have any questions... Sure.

00:16:41

When you combine the teams, let's say a data engineer and an ops person on one team, do they stick to their own skills, or are they learning from the others? Because usually, in most of the teams I've seen, people say, "I don't want to learn that." So how have you overcome that when you bring them together?

00:17:01

We have the notion of major and minor skills. Your major skills are this, but obviously we expect you to learn something else to help the team. Some people are not open to it; you cannot force them, but we can encourage it. That's what we're doing right now, making it available through training and pairing too. Yeah.

00:17:29

You described data pods earlier. What is that? That sounds like something a developer would use, or...

00:17:38

Yeah. So we are using Delphix, and they have the notion of a data pod, which is basically a branching model applied to data. You have a source and you create pods of data that developers, or any employees in your team, can use and play with over time. So you can run a batch, for example, an ETL will do transformations, and then you can go back in time. So that's one of the notions that we're using right now. It's a mix; it's based on the use case. It could be, yes. Yeah.
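As a toy illustration of that "branching model applied to data" idea (this is not the Delphix API; Delphix manages virtual database copies, while the class below just mimics the bookmark/rewind workflow on an in-memory dataset):

```python
# Toy sketch (NOT the Delphix API) of a data pod: a lightweight copy of a
# source dataset that a developer can mutate, bookmark, and rewind, without
# ever touching the source.
import copy

class DataPod:
    def __init__(self, source_rows):
        self.rows = copy.deepcopy(source_rows)        # virtual copy of source
        self._bookmarks = [copy.deepcopy(self.rows)]  # initial state

    def bookmark(self):
        """Snapshot the current state so we can come back to it."""
        self._bookmarks.append(copy.deepcopy(self.rows))

    def rewind(self):
        """Go back in time to the last bookmark, e.g. to rerun an ETL test."""
        self.rows = copy.deepcopy(self._bookmarks[-1])

source = [{"client": 1, "balance": 100}]
pod = DataPod(source)
pod.bookmark()
pod.rows[0]["balance"] = 999   # an ETL batch mutates the pod...
pod.rewind()                   # ...then we rewind to rerun the run
assert pod.rows[0]["balance"] == 100
assert source[0]["balance"] == 100   # the source is untouched
```

The real value in a virtualized environment is that the "copy" is block-level and near-instant, so the rewind/rerun loop costs seconds rather than a full refresh.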

00:18:23

Great. Can you talk a little bit more about your data virtualization? What are you using that for?

00:18:30

Well, basically, we were using all physical environments, Oracle physical environments, and then we switched to this provider, Delphix, which allows us to quickly replicate an environment and to clone a database. So based on the needs of the project, the life cycle of this database will match the project. Exactly: once the project is done, we remove the environment, and then we can create a new one based on the specs of the next project. So developers are only accessing what they need to have for the project.

00:19:18

So you're using external tools to create those virtual copies, if you will, of the database?

00:19:25

The Delphix environment is managing a lot of the logic, but we also have pipelines to facilitate the creation of those different environments. So we stitch it together. It's not out of the box, but we could use this kind of approach, yeah. Sure.

00:19:50

So we're currently using a very similar environment, and Delphix for virtualization. I'm curious how you're plugging Delphix into your pipeline. That's one thing that we're trying to figure out.

00:20:01

Well, first, it's for non-prod environments. How we are using it: when we start a project, we create an environment specific to it. We plug in and reconfigure DataStage, Oracle, the scheduler, and the rest of the ecosystem, to have a self-maintained environment that people can use. And then it's self-serve. We use this self-serve capability so developers can do their own refreshes or roll back in time. I don't know if I'm answering the question. How are you using it?

00:20:38

Well, right now we're using it to refresh our physical

00:20:42

Servers. Fantastic.

00:20:44

What we're trying to figure out is how we give the developers that capability, right, without them going crazy and spending...

00:20:51

Yeah. It needs, at first, some coaching, and DBAs are still involved to make sure that parts of it are standardized, but we're starting to give access more and more. Yeah.

00:21:04

Okay. Do you give access to end users in your data warehouse environment, or...

00:21:10

Yes. Well, we create specific environments for exploration, or for data engineers, which are locked down. Yeah, kind of. So they can have their fun. It's self-serve, so they plug in whatever they want. Yeah. We just make sure that it's secure and only the right data is accessed.

00:21:40

Can you discuss your schema and DDL promotions as you go through, I imagine, your branches, and the kinds of tests? What does that look like over the life cycle?

00:21:53

No, so it's not all automated. The DDLs are in git, and the DBAs are managing them based on the requirements of the project. So it's a pretty standard evolution of the data warehouse; there's no magic right now. We are integrating DBmaestro, but we're still too early to give you an integrated answer of how we are using it. Yeah.
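A minimal sketch of the "DDLs live in git" workflow described here, assuming plain `.sql` scripts whose filename prefix fixes the order and a record of what was already applied (the file layout and function names are illustrative, not the team's actual tooling):

```python
# Sketch: DDL scripts are versioned .sql files in a git repo; a promotion
# step applies, in filename order, any script not yet recorded as applied.
from pathlib import Path

def pending_migrations(repo_dir: str, applied: set[str]) -> list[Path]:
    """Return the .sql scripts not yet applied, in filename order."""
    scripts = sorted(Path(repo_dir).glob("*.sql"))
    return [s for s in scripts if s.name not in applied]

def promote(repo_dir: str, applied: set[str], execute) -> list[str]:
    """Apply each pending script via execute(sql) and record it as applied."""
    done = []
    for script in pending_migrations(repo_dir, applied):
        execute(script.read_text())   # in reality: run against the target DB
        applied.add(script.name)
        done.append(script.name)
    return done

# Example run against a throwaway "repo":
import tempfile
with tempfile.TemporaryDirectory() as repo:
    Path(repo, "001_create_clients.sql").write_text(
        "CREATE TABLE clients (id INT);")
    Path(repo, "002_add_index.sql").write_text(
        "CREATE INDEX ix_clients ON clients (id);")
    ran = promote(repo, applied={"001_create_clients.sql"},
                  execute=lambda sql: None)
    assert ran == ["002_add_index.sql"]   # only the new script is applied
```

Tools like DBmaestro automate the same idea (plus drift detection and approvals); the sketch just shows why keeping DDL in git makes promotion repeatable.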

00:22:34

I have a question regarding testing. Since you're refreshing your data, how do you validate your logic? In application development, they'll give you a scenario saying this input put into this function should produce that output. But since you're refreshing your data, you don't know what's in there, yet you have to build something. So what's the process to validate, to make sure that whatever ETL code you're writing is right?

00:23:03

Well, you mean regression testing. Unit testing is done at the DataStage level, so the small transformations are covered there, and we're using more of a higher-level, functional, almost end-to-end testing. We did part of the testing life cycle using... how would you say...

00:23:35

Well, say you get a story that says this table leads to this data, and there are a couple of formulas...

00:23:43

How we match the data for the project.

00:23:46

Are you kind of matching whatever transforms you have, since you have refreshed data? I mean, I guess in some scenarios, if a BA or someone gives you test data and, you know...

00:23:57

Well, in our case, the BAs are generating synthetic data. They're creating data to match the test cases, and this synthetic data lives inside the pod and gets refreshed with our cycle. At the end of the project they recreate it, and obviously the end goal is to version it, so that when we destroy the pods we can reuse it in a different context and build a library of them. Yeah, exactly. So this is what we are working on right now.
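The versioning goal mentioned here can be sketched simply: serialize each BA-authored synthetic data set next to the code so it outlives the pod and can be reloaded into any fresh environment. The file layout and function names below are assumptions for illustration.

```python
# Sketch: version BA-authored synthetic test data as JSON files so each named
# test case survives pod destruction and can be reloaded into a new pod.
import json
from pathlib import Path

def save_test_set(library: str, name: str, rows: list[dict]) -> Path:
    """Write one named synthetic data set (e.g. 'negative_balance_case')."""
    out = Path(library, f"{name}.json")
    out.write_text(json.dumps(rows, indent=2))
    return out

def load_test_set(library: str, name: str) -> list[dict]:
    """Reload the data set later, into a freshly created pod/environment."""
    return json.loads(Path(library, f"{name}.json").read_text())

# Example round trip in a throwaway library directory:
import tempfile
with tempfile.TemporaryDirectory() as lib:
    rows = [{"client": 42, "balance": -50.0}]  # crafted to hit one test case
    save_test_set(lib, "negative_balance_case", rows)
    assert load_test_set(lib, "negative_balance_case") == rows
```

In practice the library directory would itself live in git, so the test data is reviewed and versioned like the ETL code it exercises.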

00:24:37

Uh, I should find out the next time.

00:24:42

Okay.

00:24:44

Another question? So...

00:24:47

Our code base is about 90% DataStage ETL. Are you guys looking to maybe do something else outside of DataStage, or is that...

00:24:58

Yeah, for sure, for sure. There's no final decision right now, but obviously, you know, 10 years of development is a lot of ETL that you cannot shift in a day. But yeah, obviously we want to migrate to a newer, simpler, more DevOps-friendly approach. Absolutely. Yeah. All right, one last question, maybe. Yeah, sure.

00:25:32

So, basically, in development, the customer feels that bringing new functionality into the application is quicker than actually getting a report out of the data. In your experience, how do you bridge that gap between the application and the data, when what the ETL extracts may be totally different? How do you minimize that effort so that it will be quicker to...

00:25:54

It all comes down to how, at the architecture level or the design level, we split the different portions of the normalization of the data into smaller ETL transformations, and push them out progressively. I don't know if I'm answering your question. It's more at the design level: fewer big transformations, more smaller blocks that can be deployed independently. All right. Thank you very much, guys.