Las Vegas 2023

The Business Necessity for Platform Engineering

The momentum around Platform Engineering as an industry trend continues to increase almost exponentially. A challenge for many technology leaders is making sense of how industry trends might apply to our own business strategies. Like Cloud Computing, Containers, Kubernetes, and even DevOps, trends with this kind of momentum time and again prove to be a critical success factor for an organization's technology strategy. As a result, it's vital we understand how Platform Engineering applies to us and can be implemented to help support our organization's success.



In this talk I will share the story of SPS Commerce's evolution to platform engineering and how it supports our strategic business priorities to ensure we enable the continued growth of our business and improve our delivery on the organization's strategic priorities. We will also explore how SPS Commerce uses platform engineering strategies to improve operational confidence in resiliency and security while increasing developer productivity.

Andy Domeier

Sr Director of Technology, SPS Commerce

Nathaniel Andersen

Senior Director, Technology, SPS Commerce

Transcript

00:00:00

<silence>

00:00:12

Uh, thanks a ton for coming to our talk today. We're real excited about this talk. We put it together, um, just kind of set some expectations. This is really more just sharing our experiences and journey and this topic so far. So, uh, we think it's a pretty fun story, and hopefully you all can take something away from it or relate to it in some way. I do wanna give a quick props to, uh, all of our, uh, folks back home at SPS Commerce, where we're from. A lot of what Nate and I are gonna talk about today is, uh, the culmination of a lot of hard work from a lot of awesome folks. So wouldn't be here and have all these fun things to talk about without 'em. And props to the conference organizers, too. I know it's day three. I don't know if you're all as exhausted as I am. Um, but I'm exhausted 'cause the content's been really awesome. So, uh, along with the content, you know, I think one of the things that's really surprised me coming in is I didn't anticipate this level of platform engineering, um, content being here. And I think, um, one of the things that we see with these kinds of events is usually when you see that kind of momentum, it's pretty meaningful. And we can all build confidence together that there really is something here. So, really quickly, let's jump into intros. Nate, like always, why don't you go first?

00:01:22

Yeah, great. Thanks. Uh, the advantage of having a name that starts with A alphabetically, I get to go. Uh, I'm one of Andy's, uh, compatriots at SPS. Uh, I lead a handful of teams that, uh, focus on shipping software, solving customer problems. Uh, and I am generally one of Andy's, uh, top debating sparring partners. Um, often he comes to me with, uh, pushes to standardize. So when he said, I'd like to do a talk with you at DevOps Days Enterprise, I said, with me? I'm not sure what you're going for. Uh, but I'm super happy to be here and I'm, uh, excited to talk about something that I actually have been convinced on, uh, in the last few years working with Andy.

00:02:06

Yeah, sparring partners is, is a good analogy. We've worked together for 13 years now. Um, and I think one of the things that's been just awesome about getting to the spot where we're at is this really isn't something I think either of us would've thought we'd be on stage talking about together, and to, to the point where we agree. So I've been working at SPS Commerce for, uh, 19 years now. Uh, it's a really fun organization. Nate's gonna chat a little bit more about it. I oversee the cloud operations group, so, um, network security, access management, accounts, the platform teams, deploy and observability teams and, and the things that kind of go along in that space with SRE as well. Um, so our talk today is titled, uh, the Business Necessity for Platform Engineering. And

00:02:45

I added the subtitle. It felt a little, uh, especially when Andy first pitched it a little, uh, overly prescriptive. Uh, and there have been times in my career where I've been like, I don't need a platform. I just need it to work <laugh>. So, uh, I think this has been a really good conversation as we built this content, and I think probably is something that's evocative of conversations that, uh, if you're a platform engineer you've had with your, uh, delivery teams or if you're a delivery team you've had with some of your shared services teams. Yeah,

00:03:15

I believe once, not that recently, I, you've said the words out of my way <laugh> to me in the past. I think folks can relate to that.

00:03:24

Um, but before we get into like, uh, uh, platform engineering needs to be treated like a product, treating delivery teams like customers, I thought it would be good to talk a little bit about the customers that, uh, we are, uh, collectively trying to serve. And the problems that I'm particularly targeting. Um, SPS Commerce, you might not know the brand, uh, but it sells, uh, solutions that connect members of the supply chain together to exchange, uh, assortment or item data, sales data and fulfillment data. Uh, and the solutions we bring to market are somewhat varied. Um, and I think to explain it, I think it'd be good to just cast ourselves in the lens of one particular type of customer that I have, which is a supplier. Uh, so suppliers' business problems are pretty complicated. Uh, to get a product to market and sell it and be profitable, they have to solve for a lot of things.

00:04:19

And so they will bring in, uh, companies like SPS to solve for some of their connection details. Um, the, the, the details that they need to solve for, like get their product produced, manufactured, and then to sell it are further complicated by the fact that once they, uh, get a deal going, uh, or maybe get five deals where you have retailers of the four different colors, uh, buying your product, the, the complexity just multiplies because those retailers treat the way that they interact with their suppliers as a business differentiator. Uh, and so they're then requiring different things from suppliers, uh, for, uh, each retailer. So the supplier has a very complicated problem. What my teams are attempting to do is take those, uh, data requirements and the rules that they use to be successful with their customers. And we try to roll those up into consolidated rule books with a standard interface so that when the supplier, uh, plugs in to SPS, they have a single interface, um, where all those rules are normalized and their data exchange, uh, is standardized so that they have a single set of validation, a core way of understanding how they're supposed to get their shipments done, get their items shared.

00:05:40

Um, and generally speaking, this has been a successful business. Um, it might sound a little niche to you, but we've got over a million connections on our platform, um, and lots of suppliers bringing lots of, uh, uh, not just suppliers, but logistics companies, retailers as well along for the ride. Um, uh, but the growth has, uh, been something that's been organic over time. We've actually had about 90 quarters of consecutive growth, which has meant that we've needed to solve problems and scale our organization as we've gone along. Um,

00:06:16

So I think, I think that part, you know, the, the growth part explains a lot of why, why I've been here for 19 years, but also explains a lot about why we get to have the fun conversations we're having now with the technologies. We're consistently solving for scale and trying to accelerate the business and make sure that we, our technology has enough runway to meet the market demand and meet the opportunity, uh, in front of us. And so the way that we think about that then is we really think about things like being a high performing technology organization as just a requirement. It's just baseline for us. We have to do this. We have to make sure we're doing these things to produce the runway we need within our tech to meet the needs that, uh, that our business has in terms of the opportunity in front of us.

00:06:54

So everyone's probably really familiar with, you know, if you step back, I think DORA has done a great job with a lot of data over time. And these four things continue to kind of lead the way in terms of saying, where's the bar? Where are you relative to the bar? You know, how frequently are you deploying, how fast can you get ideas to production? Um, when you are making change, you know, what's your success rate? I thought the talk, the S3 talk this morning was awesome. And then how fast do you recover? So everyone knows kind of those four metrics, and Nate being somebody who's leading product teams does this every day and it's super easy, right? Yeah, yeah,
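As a rough illustration of the four metrics just mentioned (deployment frequency, lead time, change success or failure rate, and time to recover), here is a minimal Python sketch. It is not SPS Commerce's tooling: the `Deployment` record, its field names, and the `dora_metrics` function are assumptions made up for this example, and real pipelines would pull this data from their deploy and incident systems.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause an incident or rollback?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deployments: list[Deployment], period_days: int) -> dict:
    """Compute the four metrics over a window of `period_days` days."""
    if not deployments:
        raise ValueError("need at least one deployment")
    failures = [d for d in deployments if d.failed]
    restore_times = [
        d.restored_at - d.deployed_at
        for d in failures
        if d.restored_at is not None
    ]
    return {
        # How frequently are you deploying?
        "deploys_per_day": len(deployments) / period_days,
        # How fast do ideas get to production?
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deployments),
        # When you make a change, how often does it go wrong?
        "change_failure_rate": len(failures) / len(deployments),
        # How fast do you recover? None if nothing failed in the window.
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }
```

The sketch uses medians rather than means so that one unusually slow change does not dominate the lead-time or recovery numbers; that is a design choice for the example, not something stated in the talk.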

00:07:26

Yeah. And I, I like these metrics. Uh, they're valuable lenses to have. However, uh, nine days out of, or whatever, six days outta seven of a week, I'm not thinking about these metrics as my primary driver. I'm thinking about my customer and the problems that they have, um, as opposed to how frequently am I shipping? That said, I have, uh, like I think they're good aims to have as you layer on how do you succeed at scale. And I actually think, uh, like the story of the last few years really demonstrates that. So I'd like to go into casting you back to 2014. At the time, I was leading an organization at SPS, uh, called release engineering. I was attempting to, uh, implement all the things I'd read in the DevOps, uh, uh, or in The Phoenix Project, uh, that I learned at DevOps Days.

00:08:15

Things around DORA-like metrics, shipping frequently, having smaller teams, iterating, failing fast, learning. Um, and I just wasn't able to get traction across the whole organization. So the CTO at the time, the new CTO was like, Hey, I've got an opportunity for you, uh, a special project. Uh, so why don't you go try and solve this customer problem? So I got a focused runway to solve. Unfortunately, at the time, uh, that was my top priority, but the rest of the business had a different top priority: move all of our workloads into the cloud. Uh, so the, the rest of the team was focused on that, which meant, uh, I and my two-pizza team were on a bit of an island. Uh, we were able to consult with some of the experts, uh, and I see a few of them actually even in the room <laugh>.

00:09:02

Uh, but we did have to like, learn a lot as we went along. We had to learn how to do CloudFormation, deal with, uh, VPC routes and the reasons why our, uh, or how to provision our environment into production. But with a lot of iteration and focus on the customer problems we were trying to solve, and the tech stack we were trying to solve for, uh, we on our island kind of iterated into a nice spot. We became artists, uh, with people taking different slices of the pizza, focusing on different problems. Uh, our dev people got a little more opsy and our ops people got a lot more devy. Um, and it felt really successful. It was a really successful operating model, and we carried that forward into really adjacent problem spaces. Uh, we took a monolith and we decomposed it into a bunch of microservices.

00:09:52

We had, uh, need to launch a lot of new workloads, and so we leveraged serverless technologies to get those things done. Uh, we knew that we needed a better way of, uh, enabling data sharing. And so we enabled some event sourcing solutions. Um, and all of those things created something beautiful. We felt really proud of the solutions we were able to bring to market. And we just iterated customer problem after customer problem. And, uh, we produced something of beauty. And, uh, the saying might go, I think a thing of beauty, uh, is a joy forever. Uh, but snowflakes, uh, maybe have a slightly different, uh, implication for how long their beauty lasts. Um, over time, uh, a snowflake becomes its own source of pain, uh, because in order to adopt, uh, security protocols or controls, in order to adopt some of the shared services benefits that the rest of the organization was, was, uh, implementing, we weren't able to pull 'em in seamlessly. And so our system that looked beautiful at one point in time was now a puzzle piece that didn't fit with the rest of the organization. Our serverless workloads ran into problems, and we needed to adopt Kubernetes and, uh, to learn that full stack and implement it felt like a heavy lift. Uh, our artistry, uh, <laugh> started to feel more like toil, uh, where we were implementing things that the rest of the organization had already, uh, built out.

00:11:22

I think, I think one part of this story that always gets me is that we're like, we're from Minnesota, so we're supposed to love snow <laugh>. But, but in this case, uh, it doesn't, doesn't necessarily feel that way. One of the things that I think has been really powerful for me in, in the journey that I've been on with Nate and everyone else at SPS is again, like we're, we're, we're chasing this growth opportunity. And as Nate's telling his story and showing some of the toil that comes from, um, kind of moving fast and focusing on that customer problem with as much tunnel vision as you can, but delivery, um, it really, it really presents something that we feel we've grown a good understanding of at SPS, which is this, this priority friction that these concepts put us in, especially the folks that are doing product development.

00:12:05

At the end of the day, Nate talks a lot about wanting to focus on the customer problem. How can he really spend as much time as possible within his group solving for really meaningful customer problems? At the same time, we have these expectations and the customers are setting these expectations, right? They expect our services to be available, they expect them to be secure, and they, and, you know, they don't expect them, but we expect them to be cost effective, right? We're gonna come knock on Nate's door if he's spending way too much money. And so I, I think that what we're starting to see is this movement of DevOps and understanding how to, how to move quickly, is bringing a lot of these delivery teams that are advanced that can move quickly into this space where they're like, okay, I've done this for a couple years.

00:12:46

We've done really great things, but now we have this friction and this burden that's really heavy. And so I think platform engineering is really representing the response to this. Ultimately, the way that I, I feel like the story comes together really well is to talk about it from a concept of like undifferentiated engineering. If you step back and think about that, just as a whole, everybody has to deploy software. Whether you're copy pasting or dragging or whatever you're doing, you have to get your code to production somehow. It needs to be secure. That's not, that's not debatable, right? We need to monitor those things. Even if it's the customer calling you to tell you the site's down, that's a really bad way to monitor, but it's a form of monitoring. And so thinking about the fact that everybody in our organization has to do those creates this, this approach that makes a lot of sense.

00:13:37

Now, why wouldn't we share those things? We all have to do them. Why wouldn't we all do them the same way and share them and learn in a way where we can benefit from each other? And that's where platform engineering, from our perspective, has really come into play. And it's been really important to approach it with a product mindset. And I think the reason why that, that we're starting to see this, and if you've seen some of the talks this week, um, or even some of the blog posts, the idea of approaching with the product mindset, I would challenge us a little bit there. What we're really doing is not so much saying, Hey, we have to productize our undifferentiated engineering, but we have to think about who the customer is. That's what actually matters here. The whole point of productizing is you're creating a new customer relationship that we didn't have before, right?

00:14:18

It was just IT operations and we were there, we were trying to help, we were trying to make sure that we were, uh, being resilient, being cost effective, and enabling our product engineers to move as fast as they can. But really that customer relationship is what sets us up. So something that I wanted to share, uh, along our journey here that I, it's been really fun to hear a lot of the other talks reference as well. Something that gets really hard in this space is adoption. I'd just be really curious, quick, does anybody in the room currently have a platform engineering project going on and you're currently trying to navigate how do you get people to move to it, move existing workloads to it? Yeah, quite a few. Uh, it's super, super hard. I'm gonna share a story that, uh, that's been really great for us at SPS, uh, and then share a few more thoughts.

00:15:04

So, um, really generically, uh, as Nate mentioned, we have a growing business. It's moving really quickly, and part of the opportunities for us there is to make sure that we're staying ahead of the curve when it comes to scale and resiliency. Well, one of the things that we started looking at a few years ago is we really have to be more region resilient. We have to be able to have an active-active network that's processing supply chain communications with higher availability and less dependency on a single region. So from the executive management team, we just gotta copy paste, right? You can do that, right? Easy. Yeah. Just like, copy it over. Yeah, control, easy enough. Um, so we, it was really, uh, we have a great leadership team and we had a chance to really step back and think deeply about this. And when we started trying to decompose what it is that made up our network and made up our technologies, some things really came to light.

00:15:57

As you start trying to go through this process of like, well, what are we, how are we gonna take this tech? How are we gonna make it look, um, uh, how are we gonna make it active-active? And so as you start to unpack it, you start to look at the different architectures and the different approaches you've used over time. Um, Nate had mentioned we did a lot to lift out into the cloud initially. So we had a decent amount of workloads that were pretty standard. They were running EC2 behind an ELB, nothing too crazy. It worked great. Easy pattern, auto scaled, cost effective, um, no problems there. As Nate and his teams got new opportunities to go solve customer problems, he referenced the fact that they got into serverless. They started looking at how to move quicker. If we saw a market opportunity or a feature that a customer needed, they could go deliver those things.

00:16:39

And so we started seeing this approach, uh, and started seeing more serverless workloads come into our environment too. And so we started seeing those APIs being available for different, different pathways. We are a very transformation-heavy, uh, product. We're doing a lot of data transformation between file formats. And so sometimes we have kind of heavier Java running, uh, heavier Java workloads that run a bit longer. And so we needed to start looking more at the container world. The serverless space didn't make a lot of sense at the time. ECS was for sure the most approachable, kind of safest way to get going in container orchestration. And so we started moving a lot of our containerized workloads to ECS at the time. And that was when we started kind of moving forward with those types of patterns or container patterns. Over time, the industry and Kubernetes and the momentum and all the, the opportunities that come with that tech stack came in.

00:17:32

And so we again, then shifted more towards building services inside Kubernetes and really getting good at running, uh, that compute platform. So now we're starting to unpack this like, Hey, take this multi-region. Well, okay, um, I suppose we could, but then, I don't know if you're all thinking this already. You start looking back at those and you're like, oh, and by the way, there's a Jenkins server that deploys those things and we're still using Chef on those and it works fine. It's not a bad pattern. Um, but it's a lot different than the CloudFormation that Nate was using to deploy serverless functions. And I'm sure most of you probably at some point touched on a little bit. It was pretty cool and worked pretty well for us for a while, right? Yeah. Yeah. Okay. I thought it was kind of fun for a while.

00:18:17

<laugh>. Uh, and then, um, you know, there's a lot that we've seen great things with, either GitHub Actions, in our case, Azure DevOps pipelines, um, and if you're playing with Kubernetes, use Helm. So as you kinda keep going down this path and you start seeing these things, you're like, oh my gosh, okay, let's really look at this. Let's start to think about some of the just basic problems that, to ask Nate to try to go multi-region active-active, he has to think about. And there's a whole, I mean, the list is way longer than this, right? But there's a lot of basic ones. How are you gonna route traffic? How are you gonna decide where traffic's gonna go? What are you supposed to do with your secrets? What build servers are you using? Do you need them in both environments or both regions? Are you gonna copy that?

00:18:57

What's your network strategy? Um, of the things you have today, how many of them are hard-coded to a region? We saw quite a bit of that. Um, but I think overall what you see here is a lot of the things, a lot of the questions that we're suddenly having to ask Nate and his team about are undifferentiated engineering, right? They're things that everybody has to go figure out. And so we got really, really lucky here, I think, with the timing, to a point where in our business this was, this is our sell for platform adoption, right? Why would anybody try to take their serverless functions in CloudFormation and go try to figure out how to make that multi-region when we have other ways to approach that, or take those, those EC2 patterns? It's a really meaningful business reason to go this route. So, um, I did, uh, I, you'd think that I asked my 8-year-old to do some of these slides <laugh>, um, his were way too nice.
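One of the questions above, how much of what you have today is hard-coded to a region, lends itself to a quick audit. The following Python sketch is only an illustration and not anything described in the talk: the function name `find_hardcoded_regions`, the input shape (a mapping of file name to file contents), and the regular expression for AWS-style region identifiers are all assumptions for this example.

```python
import re

# Assumed pattern for AWS-style region identifiers such as "us-east-1"
# or "eu-west-2". It is illustrative and may not cover every region name.
REGION_PATTERN = re.compile(
    r"\b(?:us|eu|ap|sa|ca|me|af)-"
    r"(?:north|south|east|west|central|northeast|northwest|southeast|southwest)-\d\b"
)


def find_hardcoded_regions(files: dict[str, str]) -> dict[str, list[tuple[int, str]]]:
    """Return {file name: [(line number, region), ...]} for each region literal found."""
    findings: dict[str, list[tuple[int, str]]] = {}
    for name, text in files.items():
        hits = [
            (lineno, match.group(0))
            for lineno, line in enumerate(text.splitlines(), start=1)
            for match in REGION_PATTERN.finditer(line)
        ]
        if hits:
            findings[name] = hits
    return findings


if __name__ == "__main__":
    # Hypothetical file contents, inlined so the sketch is self-contained.
    sample = {
        "deploy.yaml": "region: us-east-1\nbucket: my-bucket",
        "app.py": "print('no regions here')",
    }
    for name, hits in find_hardcoded_regions(sample).items():
        for lineno, region in hits:
            print(f"{name}:{lineno}: {region}")
```

A scan like this only finds literal region strings; regions assembled at runtime or implied by account-level defaults would need a different kind of review.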

00:19:47

So I ended up erasing it all and doing it myself. But we did go with a, with a platform strategy. We call our platform Atlas, like pretty much 90% of the world decided to call their platforms. Um, but pretty basic stack. I don't think that this is too crazy. We were on an Istio service mesh on Kubernetes, and when we started talking about multi-region, we said, you know what? We think that there's a really, really obvious path for us to be able to get here with Atlas. And so we're able to load balance at the edge and create an environment where we can federate the service mesh across regions and have a really nice, consistent, easy to use, approachable platform for developers. So now in the sense of trying to accomplish that business goal that we have of trying to get active-active multi-region, we have an interface, but we also have a business purpose, right?

00:20:29

It's one of those scenarios where we're not necessarily just asking somebody to move because my OCD really wants to shut off that, that old tech, right? OCD isn't a business value we can necessarily drive, uh, a lot of action on. And so this has been a really great and a really fun story for us. But I think the thing that I would just kind of want to really highlight here for us is we've had a lot of success in just thinking about what's your business trying to accomplish and how do the platform technologies that you're trying to provide go solve those things? Some of, some of the errors, the traps we got caught in along the way is, it feels really easy to worry about how to reduce the friction for net new services, right? It's like kind of a fun process. How, how much like button click can get us to a, to like an initial container running in an environment.

00:21:16

But, but have you stopped and thought about how many new services your teams actually start each quarter? For us, that was kind of a distraction at first. We wanted that net new service flow to be really easy to use, really approachable. And when we step back and look, we were deploying a ton, but it was on services that existed. We're not creating a lot of new services every quarter. We are, but in terms of what we're optimizing for, that wasn't really gonna provide the value. So speaking of value, we did wanna talk and share a little bit of the data that we have. One of the things that makes me really excited about this space and this consistency is the more consistent that your team uses tools, the easier it is to pull data out of them. You can understand things like change rates, understand, um, different aspects and approaches.

00:22:00

And so this is just some of our story that we thought made a lot of sense to, to talk through. Um, I'd highlight just the bottom graph there. I had to pull the axis off, but the red line is incidents and the green line is our change rates. So we're deploying thousands of times a month. Um, and in general we feel pretty good. We're generally breaking, um, the change or like deployment frequency records every quarter. You do kind of see a dip. We're very cyclical; the holiday season is a busy season for us, so that kind of changes. Um, usually Q4 is a little bit where we're kind of focusing more on, um, being hyper aware and hypersensitive. I'd draw your attention to some of the middle numbers there. Um, we do keep a really close eye on our change rates, on our change success rate, things like that. This system, um, and the predictability that we have gives us an opportunity to do that at scale. So, um, had a lot of fun with these numbers and the journeys we've had, and we thought maybe we would just kind of wrap up by just kinda sharing some of the high level takeaways that we feel we're taking back.

00:23:00

So to summarize, at least from my anecdote, uh, at a certain point, iterating locally, uh, hits a, a velocity ceiling. And there are times that, whether it's going to be platform engineering or another shared service that will spin out, uh, optimizing, uh, for a narrower customer set can, uh, end up helping with focus and flow.

00:23:25

And I think that, um, mentioned it earlier, like the DORA metrics are great, but that customer relationship is where things really start to matter. If your DORA metrics are awesome, if your deployment frequency is great, but you're not shipping anything that, that has a customer relationship that's producing value, it really doesn't matter, right? And so this platform engineering product mindset gives this customer relationship that we have now where my intentions and my motivations and the metrics that we're looking at are all oriented around the value that Nate's receiving. And for us, keeping Nate's team, to kind of steal from Nicole's talk yesterday, in that flow zone as much as possible, uh, is a huge part of what we're trying to accomplish.

00:24:03

And I'd just add a logical argument to this point. Generally speaking, the service economy, of which most of us, I think probably all of us, are pretty much a part of, is built by abstracted levels of customers or consumers. Uh, and so it is valuable to add layers of abstraction or customer abstractions. And so internal customers have a lot of value for providing focus. Um, and that may feel a little counterintuitive when you're thinking about like DevOps culture. Uh, I have a T-shirt that says like, uh, silos are for grain. Um, which, like as an adage, having hard barriers between your teams is something that we've tried to like tear down culturally over the last decade. Um, but silos or areas of focus in order to facilitate like faster customer value, delivering customer focus, actually can accelerate your overall velocity. You can accelerate your quality, um, and reduce duplicative work. So silos aren't always a bad thing, um, when done, uh, kind of depending on a platform. And so you might just like strap a platform engineering rocket on that there silo and finish the talk, <laugh>.

00:25:18

Alright, thank you so much. We hope our journey was helpful.