VendorDome: Beyond the Buzzwords: Going Deep on DevOps with Dynatrace & LaunchDarkly

Strap on your coolest tech swag and put your Slack away message on, because we're going deep on DevOps. Join Rob Jahn of Dynatrace and Dawn Parzych of LaunchDarkly as they go beyond the buzzwords to discuss and debate the topics that will set software teams up for today and well into the future. No sales pitches, no PowerPoint, just real talk from industry leaders who help DevOps teams every day to have fun building cool software and worry less about bugs and outages. Moderated by Erin Jones.


This session is presented by LaunchDarkly and Dynatrace.

Dawn Parzych

Manager, Developer Marketing, LaunchDarkly

Rob Jahn

Tech Partner Manager & DevOps Advocate, Dynatrace

Erin Jones

Tech Alliance Manager & Marketer, Dynatrace

Full transcript

Erin Jones (Moderator)

Hi to everybody joining us for Beyond the Buzzwords: Going Deep on DevOps with Dynatrace and LaunchDarkly. My name is Erin Jones. I'm from Dynatrace, and I'll be your moderator today. Let's kick things over first to our wonderful speakers. Dawn, I'm going to let you introduce yourself first.

Dawn Parzych

Hi, everyone. I am Dawn Parzych, manager of developer marketing at LaunchDarkly, and I'm very happy to be battling Rob today.

Erin Jones

Thank you for joining us, Dawn. I know y'all are super busy getting ready for your own user conference coming up. I'll turn things over to another very busy presenter of ours, Rob Jahn, if you'd like to introduce yourself.

Rob Jahn

Yes. Hi. Thanks for having me. I'm excited today. I am a technical partner manager here at Dynatrace, so we talk a lot about DevOps and using observability to help drive decisions. I think today will be a really good topic because we're seeing folks use new frameworks and new processes to help automate and deliver good stuff. So yeah, excited about today, and thanks for joining.

Erin Jones

Thank you both for being here. As our title alludes to, we are going to try and get away from buzzwords. We may even make it into a fun little drinking game, although I know it's lunchtime, but it's a conference, so anything goes, right? First, let's jump in at a high level. Dawn, Rob, what are some of these big SRE and DevOps trends that y'all are seeing, especially as we look ahead to 2023?

Dawn Parzych

The big trends that we're seeing are a continued drive and push towards automation. Not everything is automated yet. There's still a long way to go with that automation, and part of the desire for so much automation is to improve developer and organizational productivity. The more that you can get away from the toil and the repetitive tasks and the manual tasks, the more productive your employees and your organization will be.

Rob Jahn

Dawn, can I follow up and ask, when you think about productivity, what is that KPI? What is that thing that everybody points to, to say, "Hey, I'm being productive"? Trick question, I know.

Dawn Parzych

That's a loaded question. I think everybody has their own metrics. Some of what you can look at in terms of productivity is release velocity: how often are you releasing, how often are you deploying, what is your change failure rate? All these metrics, like the DORA metrics that we've heard about repeatedly over the last year and a half, are important pieces to look at.

Look at where you were last year versus where you are today. Studies are coming out showing that some companies are deploying multiple times a day, but there are still some companies deploying on a weekly or monthly basis. If you look at those trends over time, are they shifting their deployments? Are they moving from monthly to weekly? Are they moving from weekly to daily? How are you improving on your own personal benchmarks and baselines?
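The year-over-year comparison Dawn describes can be sketched in a few lines of Python. This is only an illustration: the `Deployment` record, the helper names, and the sample data are all hypothetical, and it covers just two of the DORA metrics mentioned above (deployment frequency and change failure rate).

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Deployment:
    day: date
    caused_failure: bool  # True if this deploy led to an incident or rollback


def deployments_per_week(deploys, start, end):
    """Average number of deployments per 7-day week within [start, end]."""
    days = (end - start).days + 1
    in_window = [d for d in deploys if start <= d.day <= end]
    return len(in_window) / (days / 7)


def change_failure_rate(deploys):
    """Fraction of deployments that caused a failure (0.0 when there are none)."""
    if not deploys:
        return 0.0
    return sum(d.caused_failure for d in deploys) / len(deploys)


# Hypothetical sample data: the same four-week window in two different years.
last_year = [Deployment(date(2021, 10, d), d == 18) for d in (4, 18)]
this_year = [
    Deployment(date(2022, 10, d), d == 5)
    for d in (3, 5, 10, 12, 17, 19, 24, 26)
]

for label, deploys, year in (("last year", last_year, 2021), ("this year", this_year, 2022)):
    freq = deployments_per_week(deploys, date(year, 10, 1), date(year, 10, 28))
    cfr = change_failure_rate(deploys)
    print(f"{label}: {freq:.1f} deploys/week, change failure rate {cfr:.1%}")
```

The point of the sketch is the comparison against your own baseline, not the absolute numbers: the same two functions run over last year's and this year's data show whether the trend is moving in the right direction.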

Erin Jones

That's a good perspective. A lot of folks are probably under the gun to try and be a Netflix or one of these shops that we look to to set that pace, but it's good to understand that we're all looking to make marked improvements against where we were a year ago. Rob, with the concept of delivering both faster and better quality software, those two concepts seem almost at odds. Are there trends you're noticing in the DevOps community that help alleviate the need to produce something faster while also producing it better?

Rob Jahn

Those are certainly the goals: get things to production, deliver faster, deliver more frequently, and obviously don't cause problems. Another trend is that we're changing the underlying architectures of the application. There's rapid adoption of Kubernetes, microservices architectures, and feature flagging architectures. You're not taking monolithic applications as you used to have them and just delivering them quicker. There's a big trend to fundamentally change the architectures, coupled with the move to cloud infrastructures and SaaS and PaaS offerings from Azure and AWS.

The trend is driven by the need to deliver faster, but it's also causing teams to deal with more tools and more technologies with the same number of people, and it's overwhelming. Yes, deliver faster, but what you're actually delivering is also rapidly changing to these new containerized architectures.

Erin Jones

We've introduced the concept of more tools, more toil, more anxiety. As we're moving toward this more cloud-native reality, is that creating new challenges within DevOps teams? Now you're hosting things in the cloud. There are so many technologies that maybe your team does not fully control, unlike the days when you could go to the server closet and press a button. How are teams embracing this rather than letting it inhibit rapid innovation and release?

Rob Jahn

That's really what DevOps is all about. You can't work in siloed teams. It's forcing people to work together because you're intertwined. The philosophy is that you're responsible from code to delivery, understanding how it works. From an observability-platform point of view, it's getting a common point of view from development environments to production environments.

There's more to monitor, more complexity in microservices architectures, and a need for automatic tracing of what's going on: seeing end-user behavior, seeing what features are rolled out to target audiences. It's demanding a new way to put monitoring in place, use these tools, leverage those tools, and then have those tools inform the work.

Automation for the sake of automation doesn't mean anything. It has to automate a process: a software delivery process, a remediation process, an incident management process, or a business decision about whom to roll features out to or whether to roll things back. Automation helps, but you have to have the right foundation and the right teams working together.

Erin Jones

An attendee is commenting, "Architecture changes, monolith to microservice and feature flagging at the same time is killing us." Dawn, Rob spoke to the observability piece of overcoming these challenges. What are you seeing with colleagues or customers that helps address anxiety and where we are right now?

Dawn Parzych

When I look at DevOps, I think of three pieces: people, processes, and culture. Culture is a huge piece that people often overlook. They go look at the tools and the processes, but to be successful and reduce anxiety, you need an environment that is psychologically safe, where people are free to ask questions and question the way work is being done. If we're doing things too fast, are we doing it in the right way? Are we doing it in the right order?

Everybody wants to move fast, but you can only move fast if you have the safety nets in place to recover when things go awry. Trying to do everything simultaneously may seem right because we can't slow down and we can't stop innovating, but doing everything simultaneously raises anxiety and stress, and when anxiety and stress levels are higher, you're more prone to make mistakes. You're able to work more effectively and productively if you're not overwhelmed with all the things that need to be done, both at work and at home. There has to be balance.

Rob Jahn

It's stressful because a lot of people are still doing things manually. I was looking at a 451 Research survey: 12% are doing everything manually and 18% have some automation, so 30% of people are doing mostly manual work to get their software out there. When you do things manually, that's stress. If I'm relying on some person to remember to do this change before I do my work, that's stress because you can't see what they're doing. If I'm doing the same thing over and over, maybe I missed a step.

Automation is key to reducing stress because it's repeatable, but the trick is not to do it as a one-off. DevOps starts from the beginning of the life cycle and runs all the way through. One trend is configuration as code: I can define my alerting rules and tagging rules and have them travel with my code in a GitOps manner, and drive feature flags as code as they're turned on in each environment. If we do that repeatably from the beginning every time, it takes out stress because it's nothing new. The big release once a quarter is stress. Delivering software every week or every day through a process you continuously improve is a way to scale and take out some stress and anxiety, but it takes commitment.

That's where SREs come in. DevOps teams and SRE teams may be two different groups or mushed together, depending on the organization. They're there to support the framework. They are not necessarily responsible for signing off on everything; they enable developers to be self-service and operations to have tooling and guardrails in place before production. The stress reduction comes from guardrails, dedicated resources for foundations, the right tools, and a repeatable process that accelerates delivery.

Erin Jones

Rob, you brought up SRE guardrails as a safety net. Another safety net I want to throw out is the concept of shifting left. Is that one of the safety nets teams are embracing, and what do those practices look like?

Dawn Parzych

We don't really talk about shift left. We talk a lot about testing in production, which is kind of the opposite of shifting left. But when we say test in production, it's not about "don't test at all" or "don't do unit testing" or "don't do integration testing." No matter how often or how soon you're testing, your environments are not production. Your users are going to use your app and website in unique and unusual ways, and you're never going to be able to test all of those corner cases.

You need to see how things interact in production environments, with all those third-party components integrating and firing simultaneously and all the wonderful weirdness that exists in your production world. It's about testing in a way that's most indicative of how your users are using the application.

Yes, you need to test early and often, but you also need to look from the user perspective. Use canary deployments, use blue-green deployments, slowly roll out a feature to see what's happening, because that's the only way you're going to get true feedback on how things are actually operating and whether they are successful.

Erin Jones

The concept of testing in production gives me heart palpitations on developers' behalf. But it sounds like with feature flags and incremental rollout, people can dip a toe in the water to see if what they're releasing will work before unleashing it to the whole world. Is that accurate?

Dawn Parzych

Absolutely. It's about using small circles first. The first time you deploy software, it's only available to the engineers who wrote the code and the testers, so they can see how things are working. It's in production, but nobody else sees that feature. Then you widen the ring: more people inside the company, then 10% of users, 20%, 50%.

You can roll back or turn off a feature much faster if things go wrong, and you're only impacting a small percentage, as opposed to a big-bang release where 100% of users are affected. When we talk about testing in production, we're not talking about not testing. We're talking about having safety mechanisms in place and rolling things out in a safe and sensible manner so you're able to catch things early, and it's easier to correct things if only a small amount of change has been made versus a large amount of change.
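The widening rings Dawn describes (authors and testers first, then 10%, 20%, 50% of users) can be sketched with a stable hash bucket per user. This is a minimal illustration under stated assumptions, not LaunchDarkly's actual algorithm or SDK: `bucket`, `is_enabled`, the flag key, and the user IDs are all hypothetical.

```python
import hashlib


def bucket(flag_key: str, user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_key: str, user_id: str, inner_ring: set, percent: int) -> bool:
    """Inner-ring users always see the feature; everyone else by percentage."""
    if user_id in inner_ring:
        return True
    return bucket(flag_key, user_id) < percent


# Hypothetical rollout: engineers and testers first, then a widening percentage.
inner_ring = {"dev-alice", "qa-bob"}
users = [f"user-{i}" for i in range(1000)]

for percent in (0, 10, 20, 50, 100):
    exposed = sum(is_enabled("new-dashboard", u, inner_ring, percent) for u in users)
    print(f"{percent:>3}% rollout -> {exposed} of {len(users)} external users exposed")
```

Because the bucket depends only on the flag key and the user ID, a user who sees the feature at 10% still sees it at 20% and 50%, and setting the percentage back to 0 turns the feature off for everyone outside the inner ring without a redeploy.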

Rob Jahn

We were talking before we got online about the difference between release and deployment. If you think of them as two different things, I can deploy my software, but enabling a feature is different. That's the benefit of a LaunchDarkly-type framework: it's in your code with the deployment, and then you can turn the feature on or off, which is more of a business decision. Maybe only premium customers get it, or early-access users, or it is a support-only feature.

It can also be a way to iterate through design options. But you have to be able to measure it. It starts with service-level objectives. How am I measuring and verifying that I'm not having customer impact or using more resources than expected? These can be architectural service levels as well as business service levels. If we can measure them in a continuous automated way, bake them into automation, deploy, run automated service-level verification, and then roll out or roll back by turning a feature flag on or off, that's another way to incorporate it. You have to measure health both for the end user and for the services that downstream systems depend on.

Erin Jones

Laura from American Airlines says that when she brings up testing in production with safety nets, she gets an immediate reaction of, "No, it is too risky." How do you get people to try it?

Dawn Parzych

Instead of calling it testing in production, call it an experiment. Say we want to do a beta test. Change the wording and describe what you're trying to do. It's a very targeted release. We're not sending this out to everybody; we're targeting a very specific group of users. Explain that and define your segments and who you're targeting with that specific test.

Rob Jahn

I like the phrase targeted release. You could also point to what leading software vendors do. Dynatrace is a SaaS offering to multiple customers. We have clusters with the same deployment and the same version, but what each one exposes is controlled through feature flags: early-access features and combinations of things. We can deploy twice a month, but turn features on anytime and off anytime. This is what software companies do at scale, especially for SaaS offerings and customer-facing applications that need to be up 24/7.

Dawn Parzych

It goes back to separating deploys from releases and defining what those terms mean. They are often used interchangeably. At LaunchDarkly we use them in a very specific way: deploying code does not mean it is available to everybody. Deploying code may mean it's available to a small group. Deploying code is a technical decision: is everything operating as it should, are we seeing the right metrics, are all systems green? A release is a business decision: do we have everything lined up to make this available to all users, including marketing collateral and promotions? Testing in production is about deploying, getting early feedback, and improving feedback loops so that when you're ready to release, the release goes smoothly.

Rob Jahn

A lot of service-level practice is driven by volume. We were talking to a large customer with 2,000 different projects underway for different applications, some customer-facing and some internal, and over 1,000 pipelines delivering software in different places. When you're talking about 1,000 pipelines and thousands of projects, you can't do this manually. It's about repeatability and measuring things so systems do a lot of the work for you.

For a microservice, metrics might include throughput, response time, failure rates, database connections, number of objects, or payload size. If you can codify those metrics into service-level indicators with objectives, then as you run pipelines again and again, you compare to the last run or against fixed thresholds. Automation can then deploy, run tests, score it, flip a flag through an API call, enable a feature, rerun it, and measure it.

We use the term quality gates in Dynatrace. A quality gate can be a feedback loop to the people doing the test, or a decision point to allow something to progress to the next phase: QA to staging to production, or a targeted release within a broader release. Quality gates can cover performance, security, resource consumption, and other non-functional requirements. The same service-level indicators can be used for ongoing production. If you codify them all along the way, you have consistency between production and non-production.
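The idea of codifying service-level indicators with objectives, then comparing each pipeline run against fixed thresholds and against the previous run, can be sketched as follows. This is an illustrative sketch only and does not use or represent the Dynatrace API: `Objective`, `evaluate_gate`, and every metric name and number are hypothetical, and all indicators are assumed to be "lower is better".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    name: str              # indicator name, e.g. "response_time_ms"
    threshold: float       # fixed upper bound the value must not exceed
    max_regression: float  # allowed relative increase vs. the previous run


def evaluate_gate(objectives, current, previous):
    """Return (passed, failures) for one pipeline run.

    A run fails an objective if it exceeds the fixed threshold or if it
    regresses by more than max_regression relative to the previous run.
    """
    failures = []
    for obj in objectives:
        value = current[obj.name]
        if value > obj.threshold:
            failures.append(f"{obj.name}: {value} exceeds threshold {obj.threshold}")
        baseline = previous.get(obj.name)
        if baseline and value > baseline * (1 + obj.max_regression):
            failures.append(f"{obj.name}: {value} regressed from {baseline}")
    return (not failures, failures)


# Hypothetical objectives and measurements for one microservice.
objectives = [
    Objective("response_time_ms", threshold=500, max_regression=0.10),
    Objective("failure_rate", threshold=0.01, max_regression=0.50),
]
previous_run = {"response_time_ms": 200, "failure_rate": 0.002}
current_run = {"response_time_ms": 450, "failure_rate": 0.002}

passed, failures = evaluate_gate(objectives, current_run, previous_run)
print("promote to next stage" if passed else f"hold the release: {failures}")
```

In this sample the run stays under the fixed 500 ms threshold but is more than 10% slower than the previous run, so the gate holds the release. The same objectives could be evaluated between QA, staging, and production, which is the consistency Rob points to.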

Dawn Parzych

We need to get beyond just technical metrics. A lot of times we look at response time and error rate and get very focused on those. The bigger picture is business metrics. Why should a business owner care that we're at this level of availability or response time? If shaving milliseconds off a page leads to greater engagement or more conversion, tie those things back to the business.

That piece is missing. We get focused on technical metrics because they're concrete and we feel we can control them. You can't necessarily control users, but we're making changes for users. How are we measuring things in a way that matters to users and shows, in the terms the business measures, that we are improving their use of the application?

Rob Jahn

Technical people gravitate toward technical metrics: defects, vulnerabilities, availability. But customer satisfaction, conversion, dollars for transactions, all those metrics really drive the business more.

Dawn Parzych

You can find stories of companies that got so laser-focused on tracking a metric that they did not realize they were losing users. The metrics were going great, but the subscriber base was stagnating or going down. They weren't giving users what mattered. Remember to stay user-centric: think about how people are using the tools and services that we're building.

Rob Jahn

A framework can connect customer population, feature targeting, individual transactions, service health, and an end-to-end view. You can map a service problem to customer impact in an automated way by connecting the trace from the user down. That's another requirement of scale: connect infrastructure, application monitoring, end-user behavior, and what is happening with releases, versions, and deployments. When you can get all that in one place, business people can look at conversion rates in real time tied to operational health and releases. That's powerful, and leading companies are doing it.

Dawn Parzych

A LaunchDarkly example: as more users added more feature flags in their accounts, we realized we didn't originally have pagination on the main feature flag page. Users with large numbers of flags complained that the page took too long to load. Instead of just adding pagination, we wanted to make sure adding pagination for users with a lot of flags didn't harm users with fewer flags.

We ran an experiment looking at stats from two different groups: one with a lot of flags and one with a smaller number. We did it in production, targeted a couple of users, and got data showing pagination was not a negative hit for other users, so we rolled it out. Make decisions using data instead of just a gut feeling.

Rob Jahn

During COVID, a lot of state agencies had much more activity on their websites. We helped folks who stood up informational websites and then got hit with traffic when announcements came out. They needed to disable non-critical things so the page would load. One example was third-party tracking sites: they were willing to give up site trackers so the page would load. On shopping sites, a survey is not critical to buying the product. If push comes to shove, you can yank the survey off and it's more important that the person buys the thing.

Feature flags help turn functionality on and off under high load. We're seeing people couple real-time monitoring that detects problems to automated remediation workflows. That might tie into Jira, ServiceNow, PagerDuty, or incident management: automatically make a ticket, call a remediation playbook, turn off a feature flag, recycle a process, revalidate, and close the ticket without human intervention. Teams look at where operations spends time recycling a box, flipping something, looking at logs, or seeking approvals, and automate those patterns. Developers need to write better logs so the data can drive automation.

Dawn and I did a webinar where Dynatrace detected a problem, called the feature flag, turned it on or off, and the problem was solved. That automation is possible now, and people are doing it.
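The closed loop Rob outlines (detect a problem, open a ticket, turn off the suspect flag, revalidate, close the ticket) can be sketched in miniature. Everything here is a stand-in: the in-memory `flags` dictionary, the `Ticket` class, and the `check_health` callback are hypothetical and do not represent the Dynatrace, LaunchDarkly, Jira, ServiceNow, or PagerDuty APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    summary: str
    status: str = "open"
    notes: list = field(default_factory=list)


def remediate(suspect_flag, flags, check_health):
    """Open a ticket, turn the suspect flag off, revalidate, then close or escalate."""
    ticket = Ticket(f"Problem detected; suspect flag '{suspect_flag}'")

    # Remediation step: disable the feature instead of redeploying.
    flags[suspect_flag] = False
    ticket.notes.append(f"turned off '{suspect_flag}'")

    # Revalidation step: only close the ticket if health actually recovered.
    if check_health():
        ticket.status = "closed"
        ticket.notes.append("health check passed after remediation")
    else:
        ticket.status = "escalated"
        ticket.notes.append("still unhealthy; paging a human")
    return ticket


# Hypothetical scenario: the service is unhealthy only while the flag is on.
flags = {"new-checkout": True}
ticket = remediate("new-checkout", flags, check_health=lambda: not flags["new-checkout"])
print(ticket.status, ticket.notes)
```

The escalation branch matters as much as the happy path: if turning the flag off does not restore health, the ticket stays with a human rather than being closed automatically.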

Dawn Parzych

What is great about that automation is that when you have a flag trigger, and you receive an alert or page and know which feature is causing it, the flag is turned off automatically. That gives troubleshooters a chance to take a moment, breathe, collect their thoughts, and then dive in. If you're troubleshooting when stress levels are high and alarm bells are all going off, it's easier to miss things than if you take a split second to catch your breath, look at the data, and then go. Stop the alerts from happening. Take five minutes to collect your thoughts and then go in. It can reduce incident resolution time because you're doing it with a clearer head, instead of "I've got to get this fixed" and "everything's on fire."

Rob Jahn

What's important to making that possible is identifying what's broken. Forget the monitoring tool for a second: it starts with configuration as code. As people re-architect and modernize platforms, tagging is key. You tag a host as production versus non-production, then services, what they do, and transactions as they execute with a feature flag on.

Once you have tags in the traces of the data, from the physical layer to services to customer transactions, that's the data to drive decision-making. If there's a problem, you can identify the specific thing, use the tag to look up the team responsible, and only bug that team instead of alerting everybody. You can analyze behavior for users with a flag on, a flag off, or combinations of flags. Anyone advocating for a tagging strategy or tagging architecture and baking it into configuration tools has my support because that's foundational. In Dynatrace, I call it the whereby clause: give me all things whereby this tag. Metadata, including flag value, drives automation.
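The "give me all things whereby this tag" idea, and using tags to alert only the owning team, can be sketched with plain dictionaries. This is an illustration under assumptions, not Dynatrace's tagging model or query syntax: the entity list, the tag keys (`env`, `team`, `flag.new_checkout`), and both helper functions are hypothetical.

```python
def where(entities, tags):
    """Return the entities whose tags include every given key/value pair."""
    return [
        e for e in entities
        if all(e["tags"].get(key) == value for key, value in tags.items())
    ]


def owning_teams(entities):
    """Collect the distinct teams responsible for the given entities."""
    return sorted({e["tags"]["team"] for e in entities if "team" in e["tags"]})


# Hypothetical monitored entities, tagged from host level down to flag state.
ENTITIES = [
    {"name": "checkout-svc", "tags": {"env": "production", "team": "payments", "flag.new_checkout": "on"}},
    {"name": "checkout-svc-canary", "tags": {"env": "staging", "team": "payments", "flag.new_checkout": "on"}},
    {"name": "search-svc", "tags": {"env": "production", "team": "discovery"}},
    {"name": "cart-svc", "tags": {"env": "production", "team": "payments", "flag.new_checkout": "off"}},
]

affected = where(ENTITIES, {"env": "production", "flag.new_checkout": "on"})
print("affected:", [e["name"] for e in affected])
print("notify only:", owning_teams(affected))
```

The same `where` filter supports the comparison Rob mentions: querying once with the flag tag set to "on" and once with "off" gives the two populations whose behavior can be analyzed side by side.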

Dawn Parzych

Testing in production is what allows you to put runbooks and processes in place because you've identified through testing that weird behavior occurs in certain scenarios. You're able to tag and know that if you're seeing an alert, it's likely this feature. At LaunchDarkly, we also consider chaos days and game days as testing in production: figure out how things break and how to remediate. Once you have that knowledge, you can build automations. You can't build an automation if you don't know how things break. When failure occurs, you can say, "Great, I know what happened here. We're going to turn off that flag, disable this feature, and fix X, Y, and Z."

Rob Jahn

Chaos experiments are about having a hypothesis, running an experiment, and getting data to prove or disprove something. If you learn you had a visibility black hole, that feedback makes things better. That's the DevOps philosophy. If you can get out of mundane stressed-out work and automate some things, you can do experiments and build frameworks other people can use. That's what keeps IT exciting.

Erin Jones

I'm seeing a lot of agreement in the chat. I see Gene Kim in the chat, so don't screw it up, y'all. The boss is watching. Rob asks if Dawn has any pain stories about customers doing it the wrong way.

Dawn Parzych

I don't have real horror stories. What we hear a lot is people come to LaunchDarkly after they have already been doing feature flagging. It's not that things are broken; it's that they need to scale the way they're doing things. They need uniformity. They're tired of supporting three different flagging systems internally, or they want to do more use cases than they can internally and focus on building their own features instead of supporting an internal tool. Usually it's a progression of growth: we need to buy things instead of trying to build them internally.

Rob Jahn

There's definitely that build-versus-buy issue and tool consolidation. The pain is the complexity of triaging.

Erin Jones

Going back to my incident-management days before Dynatrace, a major outage can mean hundreds of people joining a conference bridge: your boss and your boss's boss on the call, going through the list saying "Nope, not my fault," with finger-pointing. Hopefully, what we shared today, and what LaunchDarkly and Dynatrace are doing, keeps folks from finding themselves on those nightmarish conference bridge calls.

Dawn, where can folks learn more about LaunchDarkly or chat with you?

Dawn Parzych

I will be here in Slack for a few more hours. There is the Expo LaunchDarkly Slack channel. You can also go to launchdarkly.com. You can find me on Twitter and LinkedIn if that's your preferred method of communicating.

Erin Jones

Rob, where can folks learn more about Dynatrace?

Rob Jahn

Dynatrace.com is a great resource. We have pages around DevOps and use cases. I'm available on LinkedIn and joined the new Slack channel for the summit. Use cases around reducing problems, taking advantage of feature flags, and automation are topics where we can show how to get started. Service levels and release validation in production are often where people start; as they shift left, they're still validating changes through the software delivery pipeline. We have a YouTube channel with performance clinics and long videos that show use cases and demos in action.

Erin Jones

To summarize the takeaways: breathe, take a beat, look at data, and figure out ways to automate things. We're always incrementally improving. Think about rebranding the concept of testing in production to make it more palatable for cultural adoption. I'll end with a Gene Kim quote from The Phoenix Project: "I'm starting to associate the smell of pizza with the futility of a death march." As y'all are doing chaos game days on site, I hope they are getting catering other than pizza. It shouldn't be a death march. It should be a lot of fun.

Thank you again for joining us today. Thank you to our audience and to Rob and Dawn for presenting today. Enjoy the rest of the conference, and stop by the virtual expos to chat more with LaunchDarkly and Dynatrace.

Dawn Parzych / Rob Jahn

Thanks, everyone. Bye. Thanks, Erin. Thanks, Dawn. Bye-bye. Thanks, all.