Las Vegas 2018

Oracle Corporation: 'oRE - DevOps Transformation at OMC

Ajay Chankramath will talk about how his team went about creating an SRE model at a place with very rigid ideas on who should do what. This has been challenging not just due to the cultural issues but also significant cost cutting as Oracle was forced to move to Cloud from a traditional capex model.


Ajay Chankramath has more than 20 years of experience in Development and DevOps leadership roles in various industries. He is currently the Director of Platform & Release Engineering at the Marketing Cloud division of Oracle. Prior to that, he was Vice President of Development at Broadridge, a leader in Fintech, and held several Senior DevOps Management roles at Xilinx, the pioneer of fabless semiconductors. He is passionate about developer productivity, self-sufficiency, SRE, cultural transformation, and breaking down silos in organizations using lightweight, self-enforceable processes.

Ajay Chankramath

Director, DevOps, Oracle B2B Marketing Cloud, Oracle Corporation

Transcript

00:00:04

Okay, let's get started. I have an experience report, the way that Gene puts it every time. So I have a classical experience report I wanna talk about. Did any of you attend Ian's presentation yesterday on site reliability engineering? Okay, we've got a couple of people there. That's good. If you did, I'm sure you're gonna appreciate this a little bit more. How about the presentation from Rundeck, I mean Douglas from Rundeck this morning, the previous session? I think that was also very relevant to what I'm gonna talk about today. So let's jump right in. You see the name here as Primor. One of the reasons for that... I do see some Oracle folks there. Are there any other Oracle folks in this audience? I see a few.

00:00:51

So I don't represent all of Oracle. I'm sure all of you know a lot of people at Oracle, or maybe you worked at Oracle in the past. I represent one specific line of business at Oracle, and that's called Oracle Marketing Cloud. Before I start, I want to tell you what ORE stands for: Oracle Reliability Engineering. So it's not a classical site reliability engineering model. We are talking about an Oracle reliability engineering model. And even in that, I want to talk about our specific implementation of it. That's why I'm calling it Primor. It doesn't represent all of Oracle; it's about how Oracle Marketing Cloud has gone through the transformation of using reliability engineering principles. Just so that you understand a little more of the context, the first thing to think about is how this $40 billion company is transitioning from where it was, this big database giant, into all the cloud-first policies and strategies that you keep hearing about.

00:02:04

So to do that, what I'm trying to do here is break down that $40 billion company into a $200 million company. That's what we actually are in Oracle Marketing Cloud. Marketing Cloud is a fairly easy concept to understand. If you think about your companies, or any similar companies trying to sell a product or a service to somebody else, the first thing you really need is a set of marketing campaigns and the ability to generate your sales leads out of them. That's where marketing automation tools come in. This marketing automation field has exploded over the past 10 years or so. There are so many different companies, so many different players there. We, as Oracle, are obviously one of the market leaders.

00:02:50

Adobe is a huge player there, and Salesforce is a huge player. There are a lot of very big companies playing in this market. The part that is interesting here is the fact that, let me just go back there, there are lots of different components of the Oracle Marketing Cloud. There's B2B, obviously. Then there's B2C, how you actually reach out to your customers and your consumers. Then you have social marketing, where you bring the inputs from your social media into your marketing campaigns. And you also have your test analytics and a lot more analytics-driven marketing. All these things are put together into what we call the Oracle Marketing Cloud, and the pieces of this marketing cloud work off of each other.

00:03:41

Out of those, the biggest component is what I'm gonna talk about, which is this B2B marketing cloud. As I said, our business is about 200 million. We've got about 2,000 customers. And the number of employees we are talking about here is about 225, plus the sales folks. So think of this as a classical SMB kind of thing. When you hear about Oracle, don't think of a large company in the context of what we are talking about here. It's a very classical SMB with fairly profitable margins. So what do I do there? I do platform engineering. I know that a lot of you have been exposed to platform engineering and have probably been working in platform. I do platform and release engineering for the Oracle Marketing Cloud, and our goals are fairly straightforward.

00:04:28

The number one thing that we provide is platforms for developers, and possibly SREs, to build things off of. Basically these are tools and systems that they can build on. The second one is, of course, that just like all of you, we are also moving to the cloud. In our case, we are moving to Oracle cloud. So how do you make that transition process a lot more seamless? That is another one of our focus areas. And then there is the classical DevOps value stream: how do you have the tools for CI/CD, monitoring, and basically the deployment process, end to end from a lifecycle point of view? We provide tools for all of that.

00:05:12

So this is sort of like an eye chart, but I hope you can see it. I just wanna throw that out there because it gives you context on what we mean by platform engineering. It might be different from what you have. The reason is that we look at the platform engineering team as the glue between pretty much all the different services that are needed for this product to actually get deployed. On the bottom, you can see that it works with lots of product services; think of them like scrum teams who need the services. And on the other side, the left side, you can see the cloud services. This is another set of teams, another set of organizations, that use some of the tools that we have and make sure that the whole thing works as one process as opposed to multiple processes.

00:06:03

So when I first talked about this at my company, the first question I got was, why don't you use the classical SRE model, right? I don't know about all of you, but I've had some trouble getting the classical SRE model working. You read about all the successes at the banks and everywhere; I'm not really seeing that at my place. So obviously there are some mismatches there. And how did that happen? A bunch of reasons. The number one reason is Oracle's growth over the past several years in the cloud space. I talked about a $40 billion company. Out of that, 15% of the revenues come from cloud. So it's only 15%.

00:06:48

The rest of it is still database and all our older products, but that 15% has been through tremendous growth. We have had 2x growth over the past couple of years. What we are really talking about is that when Oracle buys companies, we bring the whole stack in and we don't make any attempt to change it over. This was something that was surprising to me when I joined Oracle. They believe that if a business is successful, you try to keep that success intact instead of disrupting it by trying to migrate them. So obviously one of the reasons is that you have multiple stacks with different technologies all playing together. That brings about challenges with skill sets. You're not going to have a lot of people with uniform skill sets who could be turned into an SRE or who could join an SRE team.

00:07:41

And because we are in the process of migrating from a classical SaaS platform to more of a cloud platform in our OCI paradigm, we find that there is a significant lack of interest among people in really understanding what is happening today, with the fear that, okay, when we move to OCI, things are gonna be different. And if things are gonna be different in OCI, I would rather focus on my product right now as opposed to trying to do an SRE kind of work. And the last one: I was at Nike's presentation yesterday, and I really like the concept of MVC, but this is something that has been hurting all of us, right? When you talk about compliance, what is that minimum viable compliance?

00:08:29

Yeah, that's great. Let's try and reduce that compliance level to the lowest possible level. But what if we have way too many compliance activities that are going to bring you down? That's always a challenge that we work through. So in that context, let's talk about why, right? What I've said so far may not have convinced you as to why you'd want a different process, so let's talk about very specific use cases here. The number one use case, I'm sure you have seen this. If you haven't seen this, you are really lucky; you're working in a place that has already gone through that transition. What I saw when I started doing this was a tacit expectation that the roles of every team are very well defined, and that there has to be some kind of team or activity that glues everything together.

00:09:24

I would call it duct tape or Band-Aid. But essentially every step of the process, from product definition to product development, to deployment, to monitoring, to ensuring that your customer success is there, is going through, as you can see, DevOps, DevOps, DevOps, right? Again, it's not about understanding what DevOps is; it's about using DevOps as that de facto player, that de facto box in there, to say, okay, this is the team that glues everything together. So this is great. You know, I was feeling great. Yeah, I'm so important there. But having said that, the number one issue you're gonna have is that you're gonna be spread too thin, and you're gonna be on the critical path of everything.

00:10:11

And that really brings down your overall efficiency. So why does it happen like that? Why do you need that glue between all these different steps in the process? This sort of explains it. Organizationally, you can see that this splits into multiple lines of business. The fact that we are inherently in very different lines of business, the fact that my SVP doesn't even have a regular conversation with the SVP of the operations team, makes my day job a lot more difficult, because their priorities are so misaligned with what we have in our world. So this really highlights the challenge: if we are trying to do something like this by putting a team in between, at least let's make sure that the lines of business can actually talk to each other.

00:11:05

So this one is truly an eye chart, but I wanna show it to you because it is really telling. This was the number one reason why I thought we had to do something differently. To deploy one line of code to our customers on the cloud, it takes us 21 different handoffs and six different teams to work through it. Think about it: one line of code. This is where you have some kind of release cadence. You release the product, and the customers come back with one issue that you might wanna solve. It could be a configuration issue, it could be a code issue, it could be anything so simple that it's just a line of code. The inefficiency of something like this is so telling, if we have to spend that much mental energy to do that. Again, the handoffs in themselves aren't really bad.

00:12:03

If you're really talking about one team handing things off, say you write the code, you build it, and you hand it off to a tester to test it, that sort of makes sense. But that's not what we are talking about here, right? What we are really talking about is transcending organizational boundaries when you do this. It always has to go for some kind of approval somewhere, and it's all built in using the Jira workflow. So after every step, somebody has to go in and ping somebody: Hey, did you see my Jira request? Can you approve that? Once they approve it, okay, it's approved, now you go to the next person. That incredibly slows it down, and it becomes such an issue with morale for the developers, right?

00:12:44

Because as a developer, you heard about the problem, you solved the problem, you want the customers to have it right now, and you're going to wait for one week because it has to go through your approval board. So this was the context in which we wanted to do what we did. What you see on the left side here is something that is totally familiar to all of you, right? Is there anybody who has not seen this? If you have read Gene and John's DevOps Handbook, I'm sure you have seen this. If you haven't read it, you should be reading that book, right? This is originally coming from Google. Google came up with this several years back, and I think it was highly popularized by the DevOps Handbook's illustration of it. What you're seeing there is a fairly simple SRE concept.

00:13:31

This is the traditional SRE model. In the traditional SRE model, as Jan was talking about today as well, and as most of you know, it's a permanent role. It's a group that eventually takes over a product that is ready to be launched from the developers. So development gets the product into a state of being launched, and they go through a process of launch readiness review and handoff readiness review. Once that is done, the SRE group takes it over. Even in the presentation from Douglas this morning, one of the things we heard is that not every feature is going to be SRE-ready, or going to have an SRE for it. You have to reach a level of maturity to do that. So we basically took that whole thing and flipped it around.

00:14:18

What we did in Primor was create a rolling role within the development community, within the Scrum teams, to make sure that the role that is played by developers is played by a Primor. So your typical developers do not have to get up to the skill of using the platform to build what they need to build. And once it has reached a level of maturity, we don't have a situation where we have to go look for an SRE, or ask whether there is an SRE team, because we don't have an SRE team. So what do we do? We go with the traditional model there. We don't disrupt every part of that process. We keep as many of the steps and silos in the process as need to be there, but then use our existing leverage to make sure that this happens.

00:15:06

So the primary difference, as you can see here, is that in the SRE model, developers initially create the service and self-run it for six months or some amount of time. In the Primor model, the Primor is doing that. And when it reaches the level of maturity it needs, this is where the operations teams are very skilled, right? They're extremely skilled at operating things, making sure that things are monitored and that issues are reported back. So by the time it reaches that level of maturity, operations teams have the right kinds of SOPs they need to troubleshoot and make sure that it works well. That is the fundamental difference between SRE and the Primor model that we created. So what does the Primor typically need, right?

00:15:55

Obviously they need a lot of things from the platform engineering team, which is my team. These are all classical things that you can think of. Let's say I want to create a node type, an environment in which I can run my builds. I need to know what that cookbook is, and I need to know how I'm gonna build that node. So obviously they need to be skilled at Chef. From a build configuration point of view, your local builds aren't gonna cut it. You need to integrate into the CI builds as soon as possible. So their ability to use TeamCity, set up the configs, and get them rolled into our CI build processes is very critical.

00:16:43

Same thing here: our platform team has created an ELK stack that is very generic, that anybody can plug into with their own workflows. In the past, before Primor, any team wanting to build any kind of telemetry would come back to us, the platform team, and say, Hey, can you do this for us? Now that the platform is already there, the Primor can take that and design a workflow. They don't need to be an expert in anything other than the fact that they know how the product works. Similarly, we are moving headfirst into the whole concept of containers and Kubernetes. So if you need to deploy a service, you need to know how to run your kubectl, and how to set up your Helm charts and run them.
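As a minimal sketch of that deploy step, assuming a Helm-packaged service: the Python below only builds and prints a `helm upgrade --install` command followed by a `kubectl rollout status` check. The service name, chart path, and namespace are hypothetical, and the sketch assumes the Deployment is named after the Helm release, which the talk does not state.

```python
import shlex
from typing import List


def helm_upgrade_command(release: str, chart: str, namespace: str) -> List[str]:
    """Install the release if it is missing, otherwise upgrade it in place."""
    return ["helm", "upgrade", "--install", release, chart, "--namespace", namespace]


def rollout_status_command(deployment: str, namespace: str) -> List[str]:
    """Wait until the Deployment's rollout has finished."""
    return ["kubectl", "rollout", "status", f"deployment/{deployment}",
            "--namespace", namespace]


def deploy_plan(release: str, chart: str, namespace: str) -> List[List[str]]:
    # Assumption: the Deployment carries the same name as the Helm release.
    return [
        helm_upgrade_command(release, chart, namespace),
        rollout_status_command(release, namespace),
    ]


# Hypothetical service, chart path, and namespace, for illustration only.
for command in deploy_plan("email-service", "./charts/email-service", "staging"):
    print(shlex.join(command))
```

Building the commands as lists keeps them easy to review and to hand to a pipeline step; nothing here talks to a cluster.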

00:17:31

So that's where a Primor can help too. I don't know if any of you stopped by the Sensu booth. This is a fantastic product we've been using for quite some time. One of the ways in which we use Sensu is not just for infrastructure monitoring; we do application monitoring too. The way it works is that platform engineering provides the containers that developers can check out, add their application monitoring checks to, and then push through the pipeline, so that all the elements get monitored in production. That's another thing we would expect the Primor to do. And if you're talking about real-time metrics, primarily time series data metrics and all that, we have a TICK stack that we provide, and the Primor can hook into that.
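To make "application monitoring checks" a little more concrete, here is a minimal sketch of the kind of check a developer might add. Sensu checks report their result through Nagios-style exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The metric (a queue depth) and the thresholds are hypothetical and not taken from the talk.

```python
from typing import Tuple

# Exit codes follow the Nagios plugin convention that Sensu checks use.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_queue_depth(depth: int, warn: int, crit: int) -> Tuple[int, str]:
    """Map a (hypothetical) application queue depth to a check status and message."""
    if depth < 0 or warn > crit:
        # Bad input means the check itself is broken, not the application.
        return UNKNOWN, f"UNKNOWN: invalid input depth={depth} warn={warn} crit={crit}"
    if depth >= crit:
        return CRITICAL, f"CRITICAL: queue depth {depth} >= {crit}"
    if depth >= warn:
        return WARNING, f"WARNING: queue depth {depth} >= {warn}"
    return OK, f"OK: queue depth {depth}"


# Illustrative values only; a real check would read the depth from the application.
status, message = check_queue_depth(depth=120, warn=100, crit=500)
print(status, message)
```

A wrapper script would print the message and call `sys.exit(status)` so that the monitoring agent can read the result from the exit code.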

00:18:22

So as you can see, the way we have defined the Primor is as a consumer of the platform services, in a more efficient and somewhat more elevated manner than a typical developer who is not initiated into some of these things. The other critical aspect: information is key for the Primors and the developers to succeed. So what we have here is a really interesting dashboard that we have provided. This is a real-time dashboard that tells the Primors and the developers, at any given point in time, the status of the build on each of the branches they're working on, and the status of each of the pods they have deployed their code to.

00:19:12

That kind of information is very critical for them to make their decisions without having to come back to us, the platform team, to say, okay, how do I do this? There are a few other things, right? As I mentioned earlier, we don't use AWS, and we don't have elastic resources in house. We don't even have access to OCI to get elastic resources. So how do we do that? We use really innovative technologies there. This is something that was mentioned yesterday at one of the presentations; I believe it was the Verizon presentation. We basically take larger servers, containerize them, and run Kubernetes there to create elastic resources on top. These are the solutions provided by the platform team, and they enable the Primors to do what they need to do.

00:19:59

Then there are some other aspects. There might be times when you want to access shared databases and run queries on them. Pretty much all these things are provided as a service to the Primors so that they don't have to go figure out how to do them. And the obvious things: everybody does CI. I hope everybody does preflights, but we obviously do a significant amount of preflights. This is all automated to the extent that the merges are typically auto-merged, meaning that if your preflights pass and your tests pass, your commits are gonna get merged in. This ensures that your trunk is never broken.
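The auto-merge rule described above can be sketched as a simple gate: a commit is merged only when both its preflight and its tests have passed, and everything else is held back from trunk. The `CommitResult` type, its field names, and the commit hashes are hypothetical; the talk does not describe how the real pipeline represents these results.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class CommitResult:
    sha: str
    preflight_passed: bool
    tests_passed: bool


def should_auto_merge(result: CommitResult) -> bool:
    """A commit reaches trunk only if both gates are green."""
    return result.preflight_passed and result.tests_passed


def split_queue(results: Iterable[CommitResult]) -> Tuple[List[str], List[str]]:
    """Return (merged, held) commit hashes, preserving submission order."""
    merged: List[str] = []
    held: List[str] = []
    for result in results:
        (merged if should_auto_merge(result) else held).append(result.sha)
    return merged, held


# Illustrative queue with made-up commit hashes.
queue = [
    CommitResult("a1b2c3", preflight_passed=True, tests_passed=True),
    CommitResult("d4e5f6", preflight_passed=True, tests_passed=False),
    CommitResult("0718ab", preflight_passed=False, tests_passed=True),
]
merged, held = split_queue(queue)
print("merged:", merged)
print("held:", held)
```

Requiring both gates is what keeps a failing commit out of trunk; a held commit simply waits until its author fixes it and resubmits.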

00:20:49

The other aspect of it is Chef. ChefDK has been a huge godsend for us. Anybody here use ChefDK? That's great. Yeah, I see a few of you here. Without ChefDK, I don't think we would ever have been as successful with Primor as we have been so far. The fundamental reason is this: we provide the cookbooks for various node types. Now the Primors come in and say, okay, I need to make some tweaks to your cookbook. I need to add some recipes so that my new service can work on these nodes. If they have to come back to us, the platform team, to do that, obviously it's going to be a lot more inefficient.

00:21:32

So the ability for the Primors to download the ChefDK, set up their environment using Vagrant and VirtualBox, get whatever OS they want installed there, and get it tested is significantly useful for us. I talked about some of the self-sufficiency aspects. It's really important to empower your developers and Primors instead of having them come back and open tickets and things like that, right? So we have built a completely seamless, self-sufficient reporting system, built on all the services here. These are the kinds of things that a Primor would typically need, right? You wanna do your CI/CD, you want to do your Kubernetes deployments, you want to get your alerts and things like that.

00:22:24

Pretty much all those things are integrated into a very simple in-house system, which we have built on Sinatra and MySQL. Now, these are all the things that the platform team provides, and, great, hopefully all these things work. But unfortunately we still have a lot of problems. That's where I would absolutely love to have some feedback from any of you. The number one issue we have is the whole Primor lifecycle ownership, right? If you remember, we started from the point where you saw those DevOps boxes in between in the whole workflow. Now it looks like we have changed all of that to put PE everywhere, right? It's not as bad as I said, but you can still see that there is that dependency somewhere. Eventually our goal is to make sure that these are owned by Primors or a traditional SRE, so that there are no transitions happening that are very heavy and costly for us.

00:23:30

And that's where we are really going with it. So here, if you look at it, across three different streams, your systems architecture, your infrastructure architecture, and your launch plans, PEs are involved every step of the way. How about if we can get the actual developers, the people who actually code, to own these things? That's our eventual goal, and we are still working through some of this. Again, it's a progression, an evolutionary path for us. The next step in the process is to start replacing PE here with a lot of Primors, and then see to what extent we can do it. Is it possible to do it a hundred percent, or is it going to be somewhat less than that?

00:24:13

So that's one of the challenges. And to summarize this whole thing, right? We still have a lot of problems, as I said. As we are migrating to cloud, migrating to OCI, that is, Oracle Cloud Infrastructure, we are seeing a lot of things that are different. And as I mentioned earlier, that is one of the reasons why we are getting hesitance from people to become an SRE kind of thing. So is it even possible to eliminate the whole operations team with services? That is something we are considering. How practical that is, I don't know. But sometimes it becomes a self-fulfilling prophecy: some of our operations teams, when they see these things, are moving away from being able to support some of these services.

00:24:58

So we have to think about a model in which maybe some of these operational activities have to be supported a lot more by a true SRE model, or through a bunch of automated activities. Training is obviously the other one. A Primor is typically a developer who continues to be a developer. They're not changing groups, they're not doing anything different; they're a developer. If you look at their typical sprint responsibility, it's about 50% on the actual product features and the other 50% on Primor activities. So the question is, is that really the right kind of mix? Is that something we are able to sustain? Let's say we have multiple Primors in a Scrum team, which is always possible because there might be one developer who is a lot more well-versed in Chef, and another developer who is a lot more well-versed in Docker and Kubernetes.

00:25:54

Depending on that, we would end up having multiple Primors. How sustainable is that model? I don't know. This is something that we are still figuring out. And going ahead, right? One of the things we want to do, as I said, is try and see if we can automate our way out of this whole mess, right? One of the things that's been suggested, which we are exploring right now, is this concept of super containers. It's pretty simple. I talked about monitoring and logging and telemetry and pretty much all the different services that platform engineering provides. How about we create containers that include all those things, so that they all come right off the bat? Then it becomes a lot easier for the developers to do that themselves, instead of having a Primor or a separate set of people who are responsible for it. So those are some of the thoughts I have. I know that we've got another two-plus minutes, so any questions, any thoughts you wanna share?

00:27:03

Maybe I should ask this. Are there other people who are using SRE today in your organizations? Are any of you considering using SRE?

00:27:13

Great.

00:27:15

What do you think is the biggest challenge that you're seeing right now, if you're trying to use an SRE model?

00:27:22

We're not really a service organization. We're a bank. Well, we're an internal part of the bank, so we don't offer a retail service. We're just maintaining a system. So I think that SRE is geared towards that type of... well, Google invented it. It's for their type of operation. I think your way of modifying it to ORE is probably the kind of concept we would have to look at.

00:27:54

Yeah, I think that's something I found out the hard way. We tried to do SRE in the traditional way. It wasn't working, and I keep reading everywhere that it works great. Sure, it works great for places where you have the right kind of infrastructure to support it. But how do we actually get there? So I hope this has given you some insight into the thought process that goes into trying to get to the point where we want to be. Right. Thank you. Anybody else wanna share? If you're considering SRE, go ahead.

00:28:27

So far, the main challenge is how to interested in the other way we're others where we have people who have op skills, we're trying to them,

00:28:42

Yeah, so I've seen that. That's always a lot easier, right? People who are on the ops side of things say, hey, I want to learn some of these things. And a lot of the ops people these days are fairly savvy with a lot of the tools that are out there, right? I think it's a good point. Again, as you can see, the way we have tried to address that is by providing developers the right kind of tools so that they don't have to go beyond a certain extent, which is always their concern: am I doing more core versus context kind of thing? So basically, you focus on your core, but at the same time you understand what you really need to do to make sure that your core reaches your customers as fast as possible. I think that's the mindset flip that we have been able to at least start right now. Again, as you can see, there's a lot more to be done to get to where we want to be, but I think that transition has started happening and we really want to continue pursuing it.

00:29:42

Alright, thanks everyone. Appreciate it.