Strap on your coolest tech swag and set your Slack away message - because we're going deep on DevOps. Join Rob Jahn of Dynatrace and Dawn Parzych of LaunchDarkly as they go beyond the buzzwords to discuss and debate the topics that will set up software teams for today, and well into the future. No sales pitches, no PowerPoint - just real talk from industry leaders who help DevOps teams have fun building cool software and worry less about bugs and outages. Moderated by Erin Jones. This session is presented by LaunchDarkly and Dynatrace.
Manager, Developer Marketing, LaunchDarkly
Tech Partner Manager & DevOps Advocate, Dynatrace
Tech Alliance Manager & Marketer, Dynatrace
Hi to everybody joining us for Beyond the Buzzwords: Going Deep on DevOps with Dynatrace and LaunchDarkly. My name is Erin Jones. I'm from Dynatrace, and I'll be your moderator today. Let's kick things over first to our wonderful speakers. Dawn, I'm going to let you introduce yourself first.
Hey everyone. I am Dawn, manager of developer marketing at LaunchDarkly, and I'm very happy to be battling Rob today.
Well, thank you for joining us, Dawn. I know y'all are super busy getting ready for your own user conference coming up. And I'll turn things over to another very busy presenter of ours. Rob Jahn, if you'd like to introduce yourself.
Yes, hi. Thanks for having me. I'm excited today. I am a technical partner manager here at Dynatrace, so we talk a lot about DevOps and using observability to help drive decisions. I think today will be a really good topic, because we're seeing folks use new frameworks and new processes to help automate and deliver good stuff. So yeah, excited about today, and thanks for joining.
Well, thank you both for being here. I know you're both very, very busy, but obviously we've got some great things to talk about and cover today. As our title alludes to, we are going to try to get away from buzzwords. We may even make it into a fun little drinking game, although I know it's lunchtime, but it's a conference, so anything goes, right? But first, let's jump in at a high level. Dawn, Rob, what are some of the big SRE and DevOps trends that y'all are seeing, especially as we look ahead to 2023?
I think the big trend that we're seeing is a continued drive and push toward automation. Not everything is automated yet; there's still a long way to go with that. And part of the desire for so much automation is to improve developer and organizational productivity. The more you can get away from the toil, the repetitive tasks, and the manual tasks, the more productive your employees and your organization will be.
Dawn, can I follow up and ask: when you think about productivity, what is that KPI? What is the thing that everybody points to to say, hey, I'm being productive?
Trick question. I mean, I think everybody has their own metrics. Some of what you can look at in terms of productivity is release velocity: how often are you releasing? How often are you deploying? What is your change failure rate? All of these metrics, the ones we've heard about repeatedly over the last day and a half, are important pieces to look at, comparing where you were last year versus where you are today. Studies are coming out showing that some companies are deploying multiple times a day, but there are still companies deploying on a weekly or monthly basis. If you look at those trends over time, are they shifting their deployments? Are they moving from monthly to weekly? From weekly to daily? How are you improving against your own benchmarks and baselines?
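The metrics Dawn lists here can be made concrete with a little code. A rough Python sketch, using an invented deployment log; the dates, numbers, and function names are illustrative, not from the talk or either vendor's tooling:

```python
from datetime import date

# Hypothetical deployment log: (date, change succeeded?) pairs.
deploys = [
    (date(2022, 11, 1), True),
    (date(2022, 11, 3), False),   # this change caused a failure
    (date(2022, 11, 8), True),
    (date(2022, 11, 10), True),
]

def deployment_frequency(deploys, days):
    """Average deploys per week over a window of `days` days."""
    return len(deploys) / (days / 7)

def change_failure_rate(deploys):
    """Fraction of deploys that caused a failure."""
    failures = sum(1 for _, ok in deploys if not ok)
    return failures / len(deploys)

print(deployment_frequency(deploys, days=14))  # 2.0 deploys per week
print(change_failure_rate(deploys))            # 0.25
```

The point of Dawn's answer survives in code form: the absolute numbers matter less than whether they trend in the right direction against your own baseline from last quarter or last year.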
That's good perspective. I think a lot of folks are probably under the gun to try to be a Netflix or one of these shops that we look to to set the pace, but it's good to understand that we're all looking to make marked improvements against where we were a year ago. Obviously, Rob, with the concept of delivering both faster and better-quality software, those two goals seem almost at odds with each other. Are there any trends you're noticing within the DevOps community that help alleviate that tension of needing to produce something faster while also producing it better?
Yeah. Well, I would say those are certainly the goals: get things to production, deliver faster, more frequently, and obviously don't cause problems. But another trend that's happening, maybe not so much DevOps itself, but something DevOps teams and SREs are having to deal with, is that we're changing the underlying architectures of the applications: containerization. There's rapid adoption of Kubernetes, microservices architectures, and feature-flagging architectures to be able to deliver that. So it's not like you're keeping your monolithic applications as you used to have them and just delivering quicker; there's a big trend to fundamentally change the architectures. And that's coupled with the move to cloud infrastructures, with all the SaaS and PaaS offerings from the Azures and AWSs of the world.
So the trend of changing architecture is driven by the need to deliver things faster, but it's also causing teams to have to deal with more tools and more technologies with the same number of people, and it's overwhelming. I think we're going to talk a little bit about the toil and stress that goes along with that. So yes, deliver faster, but also, what are you actually delivering? That's rapidly changing to these new containerized architectures.
Yes, we've now introduced the concepts of more tools, more toil, more anxiety, all the more reason I hope you all at home are drinking along with us today. Only partially kidding there. So obviously, as we're moving toward this more cloud-native reality, is that creating new problems, or challenges rather, within these DevOps teams? Because now you're hosting things in the cloud. There are so many technologies, and unlike the days when your team could just go to the server closet and press a button, things are out of your control. So how are teams embracing this rather than letting it prohibit them from being able to rapidly innovate and release?
Yeah, I have a couple of thoughts. I'll start, and then Dawn can jump in. I mean, that's really what DevOps is all about. You can't work in siloed teams; it's forcing people to work together because you're now intertwined. The whole philosophy is that you're responsible from the code to delivery, understanding how it works. And from an observability platform point of view, it's getting that common view from the dev environments to the production environments. Because there's more to monitor and more complexity in microservices architectures, you need automatic tracing of what's going on, seeing what the end-user behavior is, seeing what features are rolled out to which target audiences. It's demanding a new way to put monitoring in place, leverage those tools, and then have those tools inform the work, and not just the manual work.
I mean, you can definitely do the manual work and look at a dashboard, but the way you really scale and do things faster is the automation Dawn talked about earlier. We're seeing people automate stuff; that is a trend. But automation for the sake of automation doesn't mean anything. It has to be driving a process: a software delivery process, a remediation process, an incident management process, a business decision about which features to roll out to whom and when to roll things back. Automation will help you do all of that, but you have to have the right foundation, and the right teams working together, to do it.
Completely agree. And I'm seeing in the chat here that one of our attendees is commenting: "Architecture changes, monolith to microservices, and feature development at the same time is killing us." So, Dawn, not to put you on the spot and make you solve all the world's problems, but I know Rob spoke to the observability piece of overcoming some of these challenges. What are some things you're seeing with your colleagues or customers that are helping address this anxiety and where we're at right now?
When I look at DevOps, I think of it as three pieces: the people, the processes, and the culture. The culture is a huge, huge piece that a lot of people often overlook; they look at the tools, they look at the processes. But in order to be successful, in order to reduce anxiety, you need an environment that is psychologically safe, where people are free to ask questions and able to question the way work is being done. If we're doing things too fast, are we doing them the right way? In the right order? Everybody wants to move fast, but you can only move fast if you have the safety nets in place to recover when things go awry. Trying to do everything simultaneously may seem like the right thing to do, because we can't slow down, we can't stop innovating, but doing everything simultaneously raises anxiety and stress. And when your anxiety and stress levels are higher, you're more prone to make mistakes. You work more effectively and more productively when you're not overwhelmed by all the things that need to be done, and not just the things at work, but the things at home and in other places as well. There has to be that kind of balance.
Yeah. And it's stressful because a lot of people are still doing things manually. I was actually just looking at a report before this call to get some new data. People have embraced the DevOps culture, but if the measurement is "am I automating things," it turns out a lot of people still aren't. This is from 451 Research: in their survey, 12% of people are doing everything manually, and another 18% have only some automation, so about 30% of people are doing mostly manual work to get their software out there. The trends are moving, as Dawn said, but when you do things manually, that's stress, right?
If I'm relying on some person to remember to make a change before I do my work, that's stressful, because you can't see what they're doing. Or I'm doing the same thing over and over, and maybe I missed a step. Automation is key to reducing that stress because it's repeatable: I run a program, and it does the work for me. But the trick is to not do it as a one-off. This is where DevOps, I really believe, starts from the beginning of the life cycle all the way through. I'll use monitoring tools as an example: if I can program my alerting rules and my tagging rules and have them go with my code in a GitOps manner. We're seeing that now as a big trend: configuration as code, feature flags as code, driving what gets turned on in which environments.
If we're doing all of those things in a repeatable way from the beginning, every time, I think it takes out the stress Dawn's talking about, because it's nothing new. The big release once a quarter, that's stress. But if I'm delivering software every week or every day through a process that I continuously improve, that's a way to scale, and also to take out some of that stress and anxiety. But it takes a commitment. That's where another trend we didn't really highlight comes in: SREs. Some organizations have the DevOps teams and the SRE team as two different groups; sometimes they mush them together. It depends on the org. But they're there to support the framework. They're not necessarily responsible for signing off on everything; it's more "I enable the developers to be self-service, I enable operations to have the tooling to put the guardrails in place before it gets there." The stress comes down by having guardrails, having resources dedicated to putting those foundational things in place, and having the right tools to support those people. That allows you to have a repeatable process that accelerates delivery and ultimately gets to reduced stress, hopefully.
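The "configuration as code" idea Rob describes can be sketched in a few lines: monitoring and alerting rules live in the repo next to the application code and get validated by the pipeline on every run, instead of being clicked together by hand. This is a generic illustration; the config keys are invented, not a real Dynatrace or LaunchDarkly schema:

```python
# Hypothetical monitoring config committed alongside the code,
# so it travels with every deployment (GitOps style).
monitoring_config = {
    "service": "checkout",
    "tags": ["team:payments", "tier:frontend"],
    "alerts": [
        {"metric": "response_time_p95_ms", "threshold": 500},
        {"metric": "error_rate_pct", "threshold": 2.0},
    ],
}

def validate(config):
    """Fail the pipeline early if the monitoring config is malformed."""
    assert config.get("service"), "service name is required"
    for alert in config.get("alerts", []):
        assert alert["threshold"] > 0, f"bad threshold for {alert['metric']}"
    return True

print(validate(monitoring_config))  # True: config is safe to apply
```

Because the same validated config is applied on every run, the "did someone remember to set up the alerts?" stress Rob mentions goes away: it either passes the gate or the pipeline stops.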
That's the vision, anyway. We're here to talk about the vision, and there's the reality, but we learn from great books and DevOps examples; people incrementally improve, and there are best practices. We'll talk about some examples of people doing this stuff for real. It's not just methodology and talk, and it's measurable through the metrics you were talking about earlier, Erin: the KPIs you use to compare the high achievers, the real industry go-getters if you will, to the rest of the pack.
Well, you brought up some good points there, Rob, and I know Dawn had spoken previously about this concept of safety nets. It seems like that role of an SRE putting guardrails in place would be a good example of a safety net. One other safety net I'm going to throw out to y'all, to see if you feel like it really is one for DevOps teams, is this concept, and I know we said no buzzwords, but I have an inkling this may be more than just a buzzword. What are y'all seeing in terms of the idea of shifting left? Is that a safety net that teams are embracing, and what do those practices look like? Dawn, maybe I'll let you go first.
We don't really talk about shift left. We talk a lot about testing in production, which is kind of the opposite of shifting left; it's shifting all the way right, if you will. But when we say testing in production, it's not "don't test at all." Do your unit testing, do your integration testing. But no matter how often you test, no matter how soon you test, your environments are not production. Your users are going to use your app and your website in unique and unusual ways, and you're never going to be able to test all of those corner cases. You need to see how things interact in your production environment, with all of those third-party components integrating and firing simultaneously, and all the wonderful weirdness that exists in your production world. So it's about making sure you're testing in a way that's most indicative of how your users are using the application. We talk a lot about this testing-in-production concept, as opposed to the notion of shifting left and testing early and testing often. Yes, you need to do all of that, but you also need to look at things from the user perspective. Use canary deployments, use ring deployments, slowly roll out a feature to see what's happening. That's the only way you're going to get true feedback on how things are actually operating and whether they're successful.
I'm not a developer, but the concept of testing in production gives me heart palpitations on their behalf. It sounds like we keep going back to this safeguard concept, though. With things like feature flags and being able to incrementally roll things out into production, it gives people the opportunity, correct me if I'm wrong, to kind of dip their toe in the proverbial waters, to see if what they're releasing is going to work for the masses before they unleash it to the whole world. Is that accurate?
Absolutely. Yeah, it's about starting with small circles first. The first time you deploy the software, it's only available to the engineers who wrote the code and the testers, so they can see how things are working. It's in the production environment, but nobody else sees that feature; only a very, very small percentage has access to what's going on. And then you can widen the ring: okay, that's working well, let's open it up to more people inside the company and see how things are working at that point in time. Then let's roll it out to 10% of our users, then 20%, then 50%. You can roll back, or turn off a feature, much faster if things go wrong, and you're only impacting a small percentage. As opposed to a big-bang release out to all the users that crashes and doesn't work well: now you've impacted a hundred percent of your users instead of a very small percentage. So when we talk about testing in production, we're not talking about not testing. We're talking about having the safety mechanisms in place and rolling things out in a safe and sensible manner so you can catch things early on. It's easier to correct things when only a small amount of change has been made versus a large amount.
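The "widen the ring" rollout Dawn describes is usually implemented by deterministically bucketing each user into 0-99 and comparing against the current rollout percentage, so the same user stays enabled as the percentage grows. A minimal sketch of that idea; this is not LaunchDarkly's actual bucketing algorithm, just a common hash-based approach:

```python
import hashlib

def bucket(user_id, flag_key):
    """Deterministically map a user to a bucket 0-99 for a given flag."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def in_rollout(user_id, flag_key, percent):
    """True if this user falls inside the current rollout percentage."""
    return bucket(user_id, flag_key) < percent

# Widening the ring from 10% to 50%: because buckets are stable,
# any user enabled at 10% remains enabled at 50%.
user = "user-42"
assert (not in_rollout(user, "new-checkout", 10)) or in_rollout(user, "new-checkout", 50)
```

Hashing on flag key plus user ID also means different flags get independent 10% slices of the user base, so the same unlucky users are not always the canaries.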
Right, absolutely. I mean, as we were talking about before we got online, there's that difference between releasing and deployment; think of them as two different things. I can deploy my software, but enabling a feature is a different thing. That's the benefit of a LaunchDarkly-type framework: it's in your code with the deployment, and then you can turn the feature on and off, which is more of a business decision. I only want my premium customers to get this, or I want my early-access folks to get access to this, or this is a support-only feature, something like that.
And there could be another use case where you try something out to get early feedback, and it may never go to anybody. So it's also a way to iterate through different design options. That's a more advanced use case, but we're certainly seeing that too. I think that's a way to minimize your risk and take advantage of something. But I'll just comment, maybe as the observability guy here, that you have to be able to measure it. Without getting into tools, it really starts with service level objectives. This is the domain of the SREs out there, but really: how am I measuring and verifying that I'm not having a customer impact, and that I'm not using more resources than I expected?
Sometimes these are architectural service levels as well as business service levels. But if we can put those in and measure them in a continuous, automated way, then with the right tools you can build these things, call APIs to automatically score them, and bake them into your automation. So now you're deploying and running automated service level verification. It's good? Roll it out to more people. It's bad? Roll it back by simply turning off a feature flag to disable something. That's another way to incorporate it. Yes, you can turn features on; yes, you can deploy; but you also have to measure the health, primarily for the end user, or for services that other downstream systems depend on.
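Rob's automated service level verification loop, deploy, measure, score, then widen or kill the flag, can be sketched like this. The metric names, objective values, and the decision strings are all invented for illustration; in a real pipeline the final step would call your flag service's API rather than print a string:

```python
# Objectives: each service level indicator and the maximum it may reach.
objectives = {
    "error_rate_pct": {"max": 1.0},
    "p95_latency_ms": {"max": 400},
    "cpu_millicores": {"max": 800},
}

# Metrics observed from the canary after the deploy (made-up values).
canary_metrics = {
    "error_rate_pct": 0.4,
    "p95_latency_ms": 350,
    "cpu_millicores": 620,
}

def evaluate_slos(metrics, objectives):
    """Return (passed, violations) for one verification run."""
    violations = [name for name, obj in objectives.items()
                  if metrics[name] > obj["max"]]
    return (len(violations) == 0, violations)

passed, violations = evaluate_slos(canary_metrics, objectives)
# Decision point: widen the rollout, or disable the flag.
action = "widen rollout" if passed else f"kill flag ({violations})"
print(action)  # widen rollout
```

The key property is that the gate is the same code on every run, so "it's good, roll it out; it's bad, roll it back" stops being a human judgment call made under stress.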
That's a good point. I'm jotting that down, Rob, because I definitely want to circle back to it, but I have a question in our chat right now that I want to get to. Laura from American Airlines is asking, or rather letting us in on some conversations being had at her organization: "Whenever I bring up testing in production, with the safety nets clarified, I get an immediate reaction of no, it is too risky. So how do you get people to just try it?" So, Dawn, what would you tell someone like Laura that she can take back to her team and say, it's going to be okay, guys, we just need to give it a shot?
Hmm. You know, a lot of it is the nomenclature; that's the terminology we use as well. A lot of people call it an experiment: say, we want to do a beta test, we want to examine this. Change the wording and describe what it is that we're trying to do. It's a very targeted release; we're not sending this out to everybody, we're targeting a very specific group of users. So explain what that is, and define who your segments are, who you're targeting with that specific test.
Yeah, I like the phrase "targeted release" too. To me that sounds better than "test," but that's just terminology. You could also say, hey, look at what leading software vendors do. So Dynatrace, to pick on us: this is integral to our product. When you think about it, in your case you're a company building custom apps as a project, but we're responsible for delivering a SaaS offering to multiple customers. We have clusters that all have the same deployment; everything's deployed at the same version, but it's all manipulated through feature flags: early-access features, combinations of things. That's how we can do it. We can deploy twice a month, but we can turn features on anytime and turn them off anytime.
So you could say, hey, this is what companies that do it at scale do, so we should do it too. Maybe that's another argument, if that helps. Especially with a SaaS offering, and if you're an airline company it's the same thing: you're going to have customer-facing applications that need to be up 24/7, with different pricing, geographies, whatever. There are a lot of things you have to manage in terms of who gets access to what. So yeah, I would maybe not call it testing. I don't know where that word came from, but you can do testing in production; when we're talking about feature flags and enabling things, though, it's more of a targeted release, for sure.
And it goes back to the notion of separating deploys from releases and really defining what those two terms mean. They're very often used interchangeably, and we use them in a very specific way at LaunchDarkly. Deploying code does not mean it's available to everybody; deploying code means it's maybe available to a small group. Deploying code is a technical decision: is everything operating the way it should? Are we seeing the right metrics? Is everything all-systems-green? A release is a business decision. A release asks: do we have everything lined up to make this available to all of our users? Do we have the marketing collateral to go along with it, the promotions? There are a lot of things that go into a release that have nothing to do with deploying code and software. So testing in production is about deploying, getting early feedback, and improving those feedback loops, so that when you're ready to release, the release goes smoothly.
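Dawn's deploy-versus-release distinction is visible in code: the deployed binary contains both paths, and the flag decides which one users see. A tiny sketch; `flags` here is a plain dict standing in for a real flag service like LaunchDarkly, not its actual SDK:

```python
# Deployed, but not yet released: the new code path ships dark.
flags = {"new-checkout": False}

def checkout(cart, flags):
    """Both flows are deployed; the flag picks which one runs."""
    if flags.get("new-checkout", False):
        return f"new flow: {len(cart)} items"   # deployed, dark
    return f"old flow: {len(cart)} items"       # what users see today

cart = ["book", "pen"]
print(checkout(cart, flags))      # old flow: 2 items

# The "release" is just a flag flip, no redeploy needed,
# and turning it back off is an equally instant rollback.
flags["new-checkout"] = True
print(checkout(cart, flags))      # new flow: 2 items
```

Because the release is a data change rather than a code change, the business can time it to marketing and promotions while engineering has already validated the deploy.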
That's a great point, Dawn. We've also got some great suggestions from the chat that definitely support what y'all are saying: basically, rebranding the concept of testing in production to make it more palatable for the organization, while also reassuring people that this is a great way to give early access to new features and deployments before they go out en masse. So I think we've got some good consensus here in the channel. One thing I did want to circle back to, Rob: you brought up earlier the concept of having different service level indicators, with the metrics that have the most customer impact identified, and, for lack of a better term, quality checks before that big day comes when you release things into production. Can you speak a little more to that practice?
Yeah, sure. A lot of this is driven by volume. We were talking to a customer, a big customer, I can't name them, but they have over 2,000 different projects underway for different applications, some customer-facing, some internal, and over a thousand pipelines delivering software in different places. When you're talking a thousand pipelines and thousands of projects, you can't do it all manually. So it's all about repeatability, and measuring things so that the systems do a lot of the work for you.
And you leverage that. Say it's a microservice: you take the key performance metrics, throughput, response time, failure rate. There are also architectural aspects, like database connections, number of objects, or payload size. These are the sorts of metrics someone typically analyzes as the result of a test, like a performance test or a series of user acceptance tests. If you look at what people are analyzing, you can codify a lot of those metrics into service level indicators and have objectives for them. Then, as you run your pipelines over and over, you can compare relative to the last run. This is a practice where, as long as my response time is within plus or minus a certain percentage of my last set of deployments or builds, I pass. I can also have a fixed threshold.
Like, it can never exceed this amount. By building that automation into your software delivery pipelines, you're measuring in a repeatable way. And with automation, you can try out the things Dawn was talking about: here's the deployment, let's put it in this environment, run my tests, score it, let me flip a flag through an API call to LaunchDarkly to enable a feature, rerun it, measure it. You get a lot of that feedback in an automated way. So this automated scoring is definitely something we're seeing. "Quality gates" is the term we often use at Dynatrace when we talk about it: you're measuring the quality and having a gate, which can just be a feedback loop to the people doing the test, or it can be a decision point that allows something to progress to the next phase.
So it could be QA to staging to production, or, to Dawn's point, a targeted release within a broader release. You use a quality gate from multiple aspects, not just performance: security is another aspect, and resource consumption is another. Quality is measured across a few different dimensions; non-functional requirements is the broader category for all of those things. If we can score them through the service levels, we can then use those same service levels for ongoing production. When you talk about service level agreements, they're typically a broader range, like I can't have downtime within a month, but I might also have performance requirements for certain key services that can't exceed a certain threshold. Those things can be tested earlier, and they're often the same service level indicators. If you can codify that all along the way, you have consistency between your prod and non-prod environments.
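Rob's quality gate combines both rules he mentions: stay within a relative tolerance of the previous run, and never exceed a fixed hard limit. A rough sketch of such a gate; metric names, the 10% tolerance, and the numbers are illustrative, not a real Dynatrace quality-gate definition:

```python
def quality_gate(current, previous, rel_tolerance=0.10, hard_limits=None):
    """Pass a build if every metric is within rel_tolerance of the
    previous run AND under any fixed hard limit. Lower is better
    for all metrics in this sketch."""
    hard_limits = hard_limits or {}
    failures = []
    for name, value in current.items():
        prev = previous.get(name)
        if prev and value > prev * (1 + rel_tolerance):
            failures.append(f"{name} regressed vs last run")
        if name in hard_limits and value > hard_limits[name]:
            failures.append(f"{name} over hard limit")
    return (len(failures) == 0, failures)

previous = {"p95_latency_ms": 300, "error_rate_pct": 0.5}
current  = {"p95_latency_ms": 320, "error_rate_pct": 0.5}
ok, failures = quality_gate(current, previous,
                            hard_limits={"p95_latency_ms": 400})
print(ok)  # True: within 10% of the last run and under the hard limit
```

The gate's verdict can then be wired to either outcome Rob names: a feedback report to the testers, or an automated stop that blocks promotion to the next stage.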
We need to get beyond just the technical metrics. A lot of times we look at how we're doing from a response-time perspective and an error-rate perspective, and we get very focused on those, but the bigger picture is the business metrics. Why should a business owner care that we're at this level of availability or this response time? If we're shaving milliseconds off a page load, is that leading to greater engagement? Is it leading to more conversions? Tying those things back to the business is the piece that's missing. We get very focused on technical metrics because they're a lot more concrete, and we feel like, yes, I can control this. You can't necessarily control the users, but the users are who we're making these changes for. So how are we measuring things in a way that matters to the users, and that shows we're improving their use of the application, in whatever way the business matters and measures that?
Yeah, I think that's a great point. There are a lot of technical people on this call, so we tend to gravitate toward those technical metrics: defects, vulnerabilities, availability. But absolutely, customer satisfaction, conversion, dollars per transaction, all those types of metrics really drive the business, more so than the other ones.
Yeah. You can go on the internet and find stories of companies that got so laser-focused on tracking one metric that they didn't realize they were losing users. All their metrics were going great, they were excellent, but their subscriber base was stagnating and going down. They weren't giving users what mattered. So remember to stay user-centric: we want to make sure we're thinking about how people are using the tools and the services we're building.
Yeah, and when you talk about a framework that can do both of those things: with LaunchDarkly, the customer population you're targeting for something is very much front and center. You can analyze who's getting what, why they're getting it, and on what schedule. And on our side, part of our platform is tying the individual transactions a user is doing in their mobile app or their web browser app back into service health and giving that end-to-end view. So you can map: all right, this service has a problem.
What's the customer impact, in an automated way? When we connect that trace from the user down, and I know that's another techie thing, we're connecting the dots, if you will, in an automated way to build the relationship between the user experience, whether they're able to complete something, or they abandon, or they get through a workflow faster as a result of a feature being enabled for them. So maybe that's another requirement of doing this at scale: being able to connect the infrastructure, the application monitoring, and the end-user behavior, and map it to what's happening with releases, versions, and deployments in our environment. It's a complex problem, for sure. But when you can get all of that in one place, then the teams with those different viewpoints, whether it's a business person looking at conversion rates in real time, tied to operational health, tied to what's happening with releases, that becomes a really powerful thing. And that's what some of the leading companies are doing.
Yeah. We'd say this was recent, though it probably isn't that recent anymore, but a good example is an experiment we ran. We realized, as users started adding more and more feature flags to their LaunchDarkly accounts, that we didn't originally have pagination on the main feature flag page, and people with large numbers of flags were complaining that the page took way too long to load, which, given that we hadn't built pagination originally, wasn't surprising. But instead of just saying, great, we're going to go do pagination and it'll be great, we wanted to make sure that adding pagination for the users with a lot of flags didn't accidentally harm the users that didn't have a lot of flags. Would pagination slow down the response for people with a lower number of flags? So we ran an experiment where we looked at stats and information from two different groups of users.
One group had a lot of flags and the other group had a smaller number of flags, just to make sure the second group's response time wasn't negatively impacted, because you don't want to help one group and then accidentally create a negative experience for another group. So we ran an experiment using feature flags, targeted a couple of groups of users, yes, we did it in production, and we got data showing that pagination wasn't a negative hit for the other users. And so we rolled out pagination. The point is making your decisions using data, as opposed to, hey, I just have a feeling in my gut that this is the right thing to do. It might be the right thing to do, but it doesn't hurt to get a little bit of data to confirm.
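The cohort comparison behind that pagination experiment could be sketched roughly like this. This is a hypothetical, self-contained illustration, not the LaunchDarkly SDK: the flag name, page size, cohort split, and timing model are all invented for the example.

```python
import hashlib
import statistics

PAGE_SIZE = 20  # assumed page size for the paginated flag list

def pagination_enabled(user_id: str) -> bool:
    # Deterministic 50/50 cohort split by hashing the user id
    # (a stand-in for real flag targeting rules).
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 2 == 0

def simulated_load_ms(user_id: str, flag_count: int) -> float:
    # Pagination caps the rendered items at one page; without it, all flags render.
    items = min(flag_count, PAGE_SIZE) if pagination_enabled(user_id) else flag_count
    return 5.0 + 0.1 * items  # base cost + per-item cost, in milliseconds

# Two cohorts: users with few flags and users with many flags.
few = [simulated_load_ms(f"user-{i}", flag_count=10) for i in range(100)]
many = [simulated_load_ms(f"user-{i}", flag_count=5000) for i in range(100)]

# The experiment's question: did pagination hurt the low-flag cohort?
print(f"few flags:  mean {statistics.mean(few):.1f} ms")
print(f"many flags: mean {statistics.mean(many):.1f} ms")
```

In this toy model the low-flag cohort is unaffected either way (their flags fit on one page), which is exactly the outcome the experiment was designed to confirm before rolling pagination out to everyone.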
Absolutely. And that's the real power, if you can turn those things off. We were helping a lot of state agencies that, as you can imagine with COVID, had a lot more activity on their websites. There were quite a few projects where we were helping folks who had stood up an informational website, and it would get hammered: a new announcement would come out and everyone would rush in, asking, when is my vaccine available, or whatever. So there were things that had to be disabled. Where I'm going with this is another strategy: designing your application so that critical functions and non-critical functions can be controlled, turned on or off.
So if it's not critical, you can give it up. In the COVID example, it was a lot of third-party tracking scripts; they were willing to drop all of those site trackers that were tracking their site so that the page would load. We've seen other cases like a shopping site with customer surveys. The survey isn't critical to the page, because you're just trying to find the product. It's nice to see the surveys, but if push comes to shove, you can yank them off; it's more important that the person buys the thing than sees the survey. So that's another use case: turning functionality on and off under high volumes of load.
And feature flags help with that too, right? You can disable certain things based on load, and that's remediation. I don't know if I mentioned this at the beginning, but what we're seeing is people coupling the monitoring, in real time, as it detects problems, to automated remediation workflows. Often companies have to tie it into a ticketing tool. So if I can automatically create my Jira or ServiceNow or PagerDuty incident ticket, and have that automatically call a remediation playbook, turn off a feature flag, recycle a process, revalidate that it took place, and close the ticket, then there was no human intervention. The trend we're seeing is people doing those simpler things first.
They're looking at where their ops teams spend a lot of time: recycling a box, flipping something on and off, or digging through a bunch of logs. What are they really doing? That's where the automation comes in. All right, let me find those patterns in the logs, let me get the developers to write better logs so I can easily parse them, and then automate those simple use cases that take a lot of time, like recycling a box where you need approval for a bunch of stuff. It takes data, to Dawn's point, to drive it. But then you need a platform you can drive, and that's what feature flags give you; it's an API, primarily, right, Dawn? So it can be automated: hey, turn this thing off, it's a problem. I think you and I actually did a webinar where we demoed that exact use case: a problem detected by Dynatrace calls a feature flag, turns it off, problem solved. That's the kind of automation that's possible now, and people are doing it.
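That detect, disable-flag, revalidate, close-ticket loop can be sketched in a few lines. Every class and method name below is invented for illustration; a real integration would call the Dynatrace, LaunchDarkly, and Jira/ServiceNow/PagerDuty APIs instead of these stand-ins.

```python
class FlagService:
    """Stand-in for a feature-flag API (flags are toggled over an API in practice)."""
    def __init__(self):
        self.flags = {"new-checkout": True}

    def disable(self, key: str) -> None:
        self.flags[key] = False


class Ticket:
    """Stand-in for an incident ticket in a ticketing tool."""
    def __init__(self, problem_id: str, flag_key: str):
        self.problem_id = problem_id
        self.flag_key = flag_key
        self.status = "open"


def remediate(ticket: Ticket, flags: FlagService, is_healthy) -> bool:
    """Playbook: turn off the offending flag, revalidate, close the ticket."""
    flags.disable(ticket.flag_key)
    if is_healthy():                # revalidation step, e.g. re-run the health check
        ticket.status = "closed"    # no human intervention required
        return True
    ticket.status = "escalated"     # automation didn't fix it: page a human
    return False


flags = FlagService()
ticket = Ticket(problem_id="P-123", flag_key="new-checkout")
# In this sketch, the health check passes once the problematic feature is off:
resolved = remediate(ticket, flags, is_healthy=lambda: not flags.flags["new-checkout"])
print(ticket.status)  # closed, with no human in the loop
```

The key design choice is the revalidation step before closing the ticket: the automation only closes out the incident when the system actually recovers, and otherwise escalates to a person.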
And one thing that's great about that kind of automation, when you have that flag trigger: I receive an alert or a page, I know this feature is causing it, and it's associated with a flag. Turning that flag off automatically gives the people troubleshooting a chance to take a moment, breathe, collect their thoughts, and then dive in. If you're diving in and trying to troubleshoot while your stress levels are high and the alarm bells are all going off, it's a lot easier to miss things than if you take that split second to catch your breath, look at the data, and then go. Stop the alerts from happening and take a moment, not days, literally like five minutes, to collect your thoughts before going in. That's potentially going to reduce the time it takes to resolve incidents, because you're doing it with a clear head and from a healthier frame of mind, as opposed to: I've got to get this fixed, I've got to get this fixed.
Oh, crap. Everything's on fire.
Right. And I think what's important to making it even possible to not be stressed out is that you can identify what's broken. Forget the monitoring tool for a second. What this starts with is configuration as code as you're building out your environment, and this is where, as people are rearchitecting and modernizing their platforms, a key part of enabling everything we've talked about is tagging things. It's kind of a weird intro to tagging, but when you can tag, say, this host is production versus non-production, that's a simple example, but then you go up the stack: this is the web service, what is it doing?
That's what it's all about. If I can tag my transactions as they execute with the fact that they have this feature flag on, once that's in the traces, at every layer, from the physical hosts to the services to the transactions our customers are doing, that's the data that drives all of this decision-making. I can query and say: this problem was identified on this specific thing, I know this tag. Now I can use it to look up the team responsible, so I only bug that one little team. I don't bug everybody, and I don't send a bunch of alerts to everybody who then has to decipher all that stuff.
And the same thing with the analysis of: what is the behavior for the person with this flag on, or that flag, or this combination of flags? So tagging is really key. Anyone who's been trying to advocate, hey, we need to spend time on a tagging strategy or tagging architecture and bake it into our configuration tools: you have my support, because that's foundational. I always call it the "whereby" clause in Dynatrace: give me all the things whereby this tag, or push events, or run queries, always in an abstracted way. We've abstracted the layers of the architecture, from end user all the way down to a host, through metadata, and the value of a flag is a piece of metadata. All of that data can then be used to drive the automation.
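The "whereby" idea, entities carrying metadata tags that queries and alert routing filter on, can be illustrated with a toy example. The entity data, tag names, and `whereby` helper below are all invented for this sketch; they are not a Dynatrace API.

```python
# Each monitored entity carries metadata tags (env, owning team, active flag, ...).
entities = [
    {"name": "web-frontend", "tags": {"env": "production", "team": "storefront",
                                      "flag": "new-checkout"}},
    {"name": "web-frontend", "tags": {"env": "staging", "team": "storefront"}},
    {"name": "billing-svc",  "tags": {"env": "production", "team": "payments"}},
]

def whereby(items, **tags):
    """Return the entities whose metadata matches every given tag."""
    return [e for e in items if all(e["tags"].get(k) == v for k, v in tags.items())]

# A problem tied to the new-checkout flag in production:
affected = whereby(entities, env="production", flag="new-checkout")
teams_to_page = {e["tags"]["team"] for e in affected}
print(teams_to_page)  # only the owning team is paged, not everybody
```

Because the flag value is just another piece of metadata, the same query mechanism that separates production from staging also narrows an alert down to the one team whose feature is implicated.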
And the other piece, going back to what we were talking about earlier, is that testing in production is what allows you to put those runbooks and processes in place, because you've been able to identify through your testing: hey, this is some weird behavior that occurs in certain scenarios. So you're able to tag and know, if we're seeing an alert on this, it's likely this feature. Things we also consider testing in production at LaunchDarkly include chaos days, using game days and chaos days to figure out how things break and then how you remediate. Once you have that knowledge, you can build the automations. You can't build an automation if you don't know how things break. You can guess, maybe, but get as much knowledge and information as you can, so that when a failure occurs, you know: right, I know what happened here, we're going to flip that flag and disable this feature, and we know we have to go fix it by doing X, Y, and Z.
Right, absolutely. That's a whole other use case. When I think of testing in production, that's what comes to mind: chaos experiments. How do you enable those easily? Chaos is usually a series of experiments you want to run relatively quickly, and that's where you can take advantage of that framework: all right, turn this on or that off, validate it, see what happens, understand it. That's usually what chaos is all about: you have a hypothesis, you run an experiment, and you get data to prove or disprove it. And hopefully you learn something you weren't expecting from those runs, like, oh geez, we had a total black hole, no visibility into this happening.
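A bare-bones version of that hypothesis-experiment loop might look like the sketch below, using a flag to inject the fault and always restoring steady state afterward. All names and values here are invented for illustration; this is not a real chaos-engineering tool.

```python
# Fault-injection flags: each names one failure mode we can switch on.
faults = {"cache-down": False}

def cached_price(product: str) -> float:
    if faults["cache-down"]:
        return source_of_truth(product)  # fallback path under failure
    return 9.99                          # normally served from cache

def source_of_truth(product: str) -> float:
    return 9.99                          # the database agrees with the cache

def run_experiment(fault: str, hypothesis) -> bool:
    """Inject one fault, test the hypothesis, and always turn the fault back off."""
    faults[fault] = True
    try:
        return hypothesis()
    finally:
        faults[fault] = False  # restore steady state even if the check blows up

# Hypothesis: customers still see correct prices while the cache is down.
survived = run_experiment("cache-down",
                          hypothesis=lambda: cached_price("book") == 9.99)
print("hypothesis held" if survived else "found a gap")
```

The `try`/`finally` is the important part: however the experiment goes, the injected fault is switched off again, which is what makes it safe to run a series of these quickly.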
And that gives you the feedback to make it better. So yeah, that is the DevOps philosophy, and I think people are doing it. Chaos is definitely another trend; people have matured to that point. And that's the fun of it, right? If I can get out of the mundane, stressed-out stuff and automate some things, now I actually have a more interesting job once in a while, where I can run an experiment, or try something, or build the framework that other people can use. That's what keeps it exciting, for sure. So Erin, hit us up, any other questions over in Slack land?
A lot of great agreement. It looks like everybody's really enjoying what y'all have been sharing, which is awesome. No pressure, but I do see Gene Kim in the chat. So don't screw it up, y'all.
He's watching? Big fan. So far, yeah, it's good that we're all in consensus mode. I came in thinking, oh no, it's the battle dome, but we're here to share ideas. And there's definitely pain out there. I don't know, maybe you have a pain story, Dawn, any pain stories from people doing it the wrong way and feeling that stress, and how we can help them out?
Oh man, you don't have to name names. It could be anybody on this call. So, customer X, what are the horror stories, Dawn?
No, I don't really have any horror stories. What we hear a lot is people coming to LaunchDarkly, if they've already been doing feature flagging, not because things are broken, but because they need to scale the way they're doing things. They need uniformity. They're tired of supporting three different flagging systems internally, or they want to enable more use cases than they can internally and focus on building their own features instead of supporting an internal tool. So I don't have horror stories of things going horribly wrong and bringing a customer to LaunchDarkly; normally it's just a progression of growth: okay, we need to buy this instead of trying to build it internally.
Yeah, there's definitely that build-versus-buy question with tools. We see the same thing; oftentimes there's tool consolidation. And I'm trying to think of pain stories. I mean, the pain is really the complexity of triaging. And, sorry, I lost my train of thought there; I totally zoned on what I was going to say, so maybe hit me up with something else. Sorry about that.
Actually, going back to my days in incident management before I joined Dynatrace, and hearing horror stories there, hopefully nobody on the line has experienced this in their career, but a major outage happens and you have literally hundreds of people joining a conference bridge. Just imagine hundreds of people, and your boss, and your boss's boss, on this call, going down the list saying, nope, not my fault, not my fault, and finger-pointing. I've never been there personally, but I can only imagine how stressful that is. So hopefully, with what we've shared today and with what LaunchDarkly and Dynatrace are doing in the industry, we're keeping folks from finding themselves on those nightmarish conference bridge calls going forward. I do want to be cognizant of time here and make sure that y'all each have a chance to let our listeners know where they can go to learn more. So Dawn, where can folks learn more about LaunchDarkly, or maybe chat with you further after the session?
Sure. I will be here in Slack for a few more hours; there's the expo LaunchDarkly Slack channel if you want to check that out. You can also go to launchdarkly.com for more information, and you can find me on Twitter and LinkedIn if that's your preferred method of communicating. I'm online probably more often than I should be.
And how about you, Rob?
Yeah, same thing. Dynatrace.com is a great resource for information; we have pages specifically around DevOps and some of these use cases. I'm available too, on LinkedIn, and I've also joined the new Slack channel for the summit. For these use cases around reducing problems, taking advantage of feature flags, and automation, we have a lot of material on how to go about it and how to get started. I'd say service levels and release validation in production are often where people start, and then as they shift left, they're validating those same changes earlier in the software delivery pipeline. We also have a great YouTube channel with a lot of performance clinics, longer videos that walk through the use cases and demo them in action. So that's a great resource as well.
Thank you for that, Rob. And just to summarize some of the great takeaways from this session: first and foremost, breathe, take a beat, look at some data, and figure out ways to automate things; we're always incrementally improving. Let's also think about maybe rebranding the concept of testing in production to make it a little more palatable and get that cultural adoption organizationally. And I would be remiss if I ended the session without one of my favorite Gene Kim quotes. Take this as you will, audience: "I'm starting to associate the smell of pizza with the futility of a death march." Great quote from The Phoenix Project. So Dawn, if y'all are doing your chaos game days on-site, I hope they're catering something other than pizza; it shouldn't be a death march, it should be a lot of fun. Thank y'all again for joining us today. Thank you to our audience, and to Rob and Dawn for presenting, and I hope y'all enjoy the rest of the conference. Please stop by our virtual expos to chat more with our friends at LaunchDarkly and Dynatrace. Thank y'all.
Have a great rest of your day. Thanks, everyone. Bye. Thanks, Erin. Thanks, Dawn. Bye-bye. Thanks, all.