Las Vegas 2020

Innovate or Die: Use Experimentation to Eradicate Uncertainty

This session is presented by Split.

Henry Jewkes

Staff Software Engineer, Split

Transcript

00:00:14

Hello. My name is Henry Jewkes and I'm the experimentation architect at Split Software, where we empower engineering teams to build impactful products. We also host the Adventures in DevOps podcast; if you are enjoying this conference, you should definitely check it out. This year, the world changed. The COVID-19 pandemic has impacted every single one of us in small ways, in large ways, in permanent ways. So too has it impacted our businesses. In some cases, we have seen long-term trends take giant leaps forward. For most of us here, our work has shifted from offices into the home. We even attend our conferences remotely. The already booming e-commerce sector has scaled up dramatically. Those positioned to take advantage of it have thrived while small businesses have struggled to stay afloat. And where only a year ago the Academy was voting on whether streaming movies were eligible for Oscars at all,

00:01:25

in 2020, almost all major movie releases have occurred online. On the other end, many businesses that had been seeing rapid growth have all but disappeared. What little travel occurs, local or long distance, has moved to personal vehicles. Airlines were forced to clear flight schedules, and the ride-sharing sector has gone from Silicon Valley success story to cutting drivers and employees alike. Visitors to hotels and casinos have similarly dropped dramatically as the world stays home. To maintain relevance, it has become essential for businesses to respond quickly to market conditions. Our schools have secured suites of online tools, and teachers are learning to run video conferences and use virtual whiteboards. Those same ride-sharing apps are doubling down on their food delivery offerings and finding new ways to engage drivers by being the link in the chain for local personal deliveries. And manufacturing companies of all types are producing personal protective equipment, ventilators, and even sanitation supplies to help our frontline workers. For us in developer operations,

00:02:48

innovation means ensuring our organization is delivering value as fast as possible. By this time, I think almost all of us in DevOps have made it our duty to facilitate the move away from waterfall processes and toward a more flexible way to manage and develop our systems and software. Elastic scaling, automated testing, and continuous build and deployment are so ingrained in the successful DevOps organization that they are almost automatic. We have incredible tools available to us and have the knowledge and training to spread the culture of continuous releases throughout our engineering departments. As time goes on, we have facilitated accelerating all these schedules from weeks to days to minutes. It is so much easier to provide robust, effective testing environments, to ensure that the build and release process is automated to the point of being truly continuous, and to empower developers to work directly off of trunk and release new code as it is created. So with these pillars already in place, our job is done, right? Release management is solved. Let's all head to the bar. Of course not. Migrating services and databases is still a very painful process. Developers will always push bugs, some small, some large, and many of them requiring rollbacks. And even if our changes are released perfectly without a hitch, we need to handle the cases where what was changed proves to be unsuccessful for our customers.

00:04:40

Fortunately, as with everything else, we can solve these problems with software. Many of you are likely already using feature flags. They're a tool that empowers your organization to separate code deploy from feature release. For those not familiar, a feature flag is a simple if/else statement that is powered by a configuration service or external tool. It can be modified to target the enclosed code to a specific subset of your population. Usually feature flags are targeted at users or customers, but they can also modify functionality by session, service, request, or database transaction, whatever is applicable for the code change. Feature flags come in many types. They can be a simple switch that enables the feature either globally or for a specific portion of traffic. They can be ramped to a random population, allowing you to steadily roll out the functionality. Or a flag may gate multiple versions of the feature, either to compare the impact or as part of a phased rollout strategy. At LinkedIn, the team proposed that effective release strategies balance speed, quality, and risk.
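
To make that concrete, here is a minimal sketch of a feature flag as an if/else statement backed by a configuration source. The flag name, user IDs, and in-memory config are hypothetical; in practice these values would come from a configuration service or feature-flagging tool rather than a hard-coded dictionary.

```python
# Hypothetical flag configuration; a real system would fetch this from a
# configuration service or feature-flag tool rather than hard-coding it.
FLAG_CONFIG = {
    "new-checkout-flow": {"enabled_for": {"user-123", "user-456"}},
}

def is_enabled(flag_name: str, user_id: str) -> bool:
    """Return True when the flag targets this user."""
    flag = FLAG_CONFIG.get(flag_name, {})
    return user_id in flag.get("enabled_for", set())

def render_checkout(user_id: str) -> str:
    # Both code paths are deployed; only targeted users see the new one,
    # which is what separates deploy from release.
    if is_enabled("new-checkout-flow", user_id):
        return "new checkout flow"
    return "legacy checkout flow"

print(render_checkout("user-123"))  # new checkout flow
print(render_checkout("user-999"))  # legacy checkout flow
```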

00:06:16

Every release is a decision: the decision to provide the change to users, the decision to introduce this change to your code long-term. Speed refers to how quickly you reach this decision; the faster you decide, the faster you start delivering value. Quality is not whether the change is bug-free, but whether this is the right decision, whether the change accomplishes what we expected it to. Risks come in the form of bugs, performance issues, and security holes, but also of having worse results than what was replaced. Now, traditional deployments maximize speed, meaning the code is immediately active, but they make that decision blindly and expose the entire system to any negative ramifications. On the opposite side of the spectrum, never changing our code minimizes the risk related to change, but also has a velocity of exactly zero, and in some cases not taking action can be the greatest risk of all. The ramping process divides your release into phases. Each phase protects and informs the next step in that process. When first deployed behind a feature flag, the change is not active, which eliminates all risk at that time.

00:07:54

Ramping begins by targeting a small percentage of customers, watching for issues while protecting the majority of your traffic. You can target randomly, or you can use a specific whitelist to target beta customers or even internal users to validate the change. Understanding the effects of the change typically requires data to be collected, and this data can be collected most efficiently when the population is evenly divided, increasing the quality of your decision in the fastest way. If you do decide to launch, you can either release the feature fully or continue to ramp to monitor how your system scales under load. As a developer, the greatest advantage of a feature flag is the peace of mind it provides, knowing that should an issue ever occur with a release, rolling it back is just one click away. There's no writing a hotfix at 3:00 AM and no scrambling to get a rollback deployment approved.
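
As a rough illustration, here is a minimal sketch of that targeting order, with a hypothetical whitelist checked before a small percentage ramp. A real targeting system would use a sticky hash rather than random.random(); see the hashing sketch later in this talk.

```python
import random

# Hypothetical targeting rules for an early ramp phase.
BETA_WHITELIST = {"internal-qa-1", "beta-customer-42"}  # known users who validate first
RAMP_PERCENTAGE = 5  # expose 5% of remaining traffic while watching for issues

def get_treatment(user_id: str) -> str:
    if user_id in BETA_WHITELIST:
        return "on"  # rule 1: explicit targeting of beta/internal users
    if random.random() * 100 < RAMP_PERCENTAGE:
        return "on"  # rule 2: small random ramp (use a sticky hash in practice)
    return "off"     # default: the majority of traffic stays protected

print(get_treatment("internal-qa-1"))  # always "on"
print(get_treatment("user-789"))       # "on" roughly 5% of the time
```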

00:09:09

Well, now that we are all on the same page about the capabilities and benefits of feature flags, let's see what this might look like in action. Migrations are one of the most common transitions we manage in DevOps. They are also one of the most challenging scenarios in release management. Whether you're changing data stores, shifting schemas, updating API endpoints, or breaking a monolith into individual microservices, migrations must be designed carefully to be successful. The traditional model of migrations is a painful one. The service must be disconnected from the systems that rely on it. The data must be copied into the new service, taking minutes, hours, or even days. And only once that copy is completed can you start reconnecting systems to the new infrastructure, and then watch and hope that everything went according to plan. This approach inevitably results in downtime. It must be scheduled during off hours, teams work nights and weekends, and customers need to be warned. Then, only once the migration is complete, can you validate that it was successful. And if an issue is discovered, you need to have a recovery plan, because your data isn't reaching that retired service that you wish to roll back to.

00:10:47

Fortunately, there is a much better way. By leveraging feature flags, you can architect the transition ahead of time and then migrate in phases. Start by sending your writes to both services. During the transition, the flag lets you ramp up the load on the new service and keep the latest data in both systems throughout the whole migration. The migration can then be performed while writes are still active, resulting in a final state where both versions of the service have the same set of data. To validate, you can perform dark reads. This is a pattern where some portion of requests hit both versions of the service and check that the results are the same, reporting issues before customers are ever exposed to them. Once the migration is proven to be successful, reads can be transitioned to the new service and the old infrastructure can be retired. This allows even the most complex migrations to be done with little to no risk to the system. You only launch once you confirm the two systems match, and at each step you're able to progressively ramp and validate, making sure that the service writes work, that the migration is successful, that the reads work, and that the system scales every step of the way.
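
Here is a hedged sketch of the dual-write and dark-read pattern just described, using in-memory dictionaries as stand-ins for the old and new services and plain booleans where a feature flag would be evaluated per request.

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("migration")

# In-memory stand-ins for the two services; real code would call service clients.
old_store = {}
new_store = {}
DUAL_WRITE_ENABLED = True  # flag: send writes to both services during the transition
DARK_READ_ENABLED = True   # flag: shadow-read from the new service and compare

def write_record(key, value):
    old_store[key] = value           # the old service remains the source of truth
    if DUAL_WRITE_ENABLED:
        new_store[key] = value       # keep the new service in sync

def read_record(key):
    result = old_store.get(key)      # customers are still served from the old service
    if DARK_READ_ENABLED:
        shadow = new_store.get(key)  # dark read: compare, never serve
        if shadow != result:
            log.warning("dark read mismatch for %s: old=%r new=%r", key, result, shadow)
    return result

write_record("order-1", "widget")
print(read_record("order-1"))  # "widget"; any mismatch would only be logged
```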

00:12:25

All right, let's pat ourselves on the back. Not only is our deployment pipeline continuous, we have now created independence between those deployments and the release process. Our development teams can ramp those rollouts in stages to minimize risk, and we have that big red button to press if things go wrong. Time for that beer? Well, no. Sure, we can ramp out those features, but how do we make the decision to roll out? How do we know it's safe, or that the feature is successful? Similarly, how are we supposed to know when to hit that big red button? And with many teams running many releases, how do we know which of those releases is responsible for any given issue? Do we just kill them all to be safe? Oh, easy, you say, we have dashboards. So many dashboards. You're probably already measuring server metrics, service metrics, system metrics, requests and clicks and views. Today, every company has more dashboards than they know what to do with. So when it comes to measuring a feature release, we can just turn it on and watch the dashboards, right? If an issue does occur, you can expect to see something like this screenshot: a key metric spikes and everyone on the team scrambles, trying to identify what changed and how it might be causing the issue.

00:14:09

One of the first tools that we built at Split was a way to overlay feature flag changes onto these dashboards, which empowers the team to correlate the release and the metric change much faster. Unfortunately, the amount of data flowing into these metrics means issues for a small ramp percentage can be impossible to see. Also, just because two events occur alongside one another does not mean that they are causally linked. Every summer, the rate of shark attacks in the United States rises right alongside the sales of ice cream, but that doesn't mean the sharks have changed their diets to rocky road. So back to that dashboard I showed earlier. This is a real view of an incident that we encountered some time back. The team spent hours looking for root cause, only to find that the source of the issue was external to our system and not even related to our release at all. So unfortunately, just looking for changes on the dashboard often isn't enough to measure your releases. Other effects, whether they be a denial of service attack, bad weather, or a global pandemic, can have a direct impact on your data that needs to be accounted for.

00:15:38

To go beyond correlation and look for causality, science has provided us with the randomized controlled trial. By enabling the change at random and measuring the behavior of both the exposed and unexposed traffic, we are able to distribute the effects of outside factors between the two samples, thus attributing any behavior difference to the change itself. This process begins with attribution: for every data point, we identify what feature variation the traffic received. As the data is plotted based on its exposure, patterns may emerge. Those patterns can then be analyzed and turned into a distribution of data, allowing us to understand how that metric behaves within that particular sample. Through the use of statistics, those distributions can then be compared, determining whether a difference exists beyond normal variation. This statistical analysis is the cornerstone of experimentation and A/B testing, but it has also proven itself to be critical regardless of the type of release. It can prove that a bug was really fixed, whether a new refactor is more performant, or whether a feature actually provides value to customers. There are many ways that companies can achieve this comparative analysis. Most dashboarding tools offer ways to tag and segment data. This lacks the statistical rigor that is key to making decisions, but it is a stepping stone on the path and provides far better data than the overall dashboard.

00:17:41

Teams that are already performing internal data analysis can store their feature flag assignments in the same analytics warehouse and process the results manually. And we are seeing more and more companies either building or procuring experimentation platforms capable of this data collection and analysis in addition to feature management. At this point, I've spent a lot of time describing how your team can benefit from better release management, but how might you build such a tool for your very own? A release platform is built of four core parts: the targeting system, which powers your feature flags and records the assignments performed; the tracking sensors, which capture key metrics, whether they be technical, like errors, load time, or throughput, or business, like retention, engagement, and revenue; the statistical engine, which is responsible for the attribution, calculation, and analysis we discussed earlier; and finally the management console, through which releases are configured and analysis is shared.

00:19:01

Your targeting system must be fast, to avoid becoming a bottleneck. It must be random, to remove bias in how users are assigned to each variation. And it must be sticky, assigning a user to the same variant no matter how many times an experiment is evaluated. It also must be reliable, as under no circumstances can the targeting engine be down. In the spirit of DevOps, the targeting system, and indeed the entire release platform, is best isolated into a microservice. This service contains the logic for assigning a given identifier or key to a feature and stores those assignments for later processing. This allows any part of your infrastructure to quickly and reliably use feature flags in its code. Some of you may be asking how a system can be both random and sticky. Targeting systems achieve this through a hashing process that maps arbitrary data,

00:20:06

in this case the traffic key, to a fixed value in a consistent and reliable way. The same data will always result in the same hash, provided the algorithm is given the same seed. So to ensure each feature is released independently, a new seed should be generated each time a feature is created. Modern hashing algorithms assign values uniformly, meaning that a key is equally likely to hash to any value. This behavior allows us to normalize the hash to a percentage and know that the population will be randomly distributed. If a rollout is targeted at 55% of the population, any key which maps to less than 55 will be assigned the treatment, and the rest will be assigned the control.
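
As a rough sketch of that mechanic, the following assumes SHA-256 purely to keep the example dependency-free; production targeting systems often use faster non-cryptographic hashes. The seed, key, and 55% rollout figure are illustrative.

```python
import hashlib

def hash_to_percentage(key: str, seed: str) -> float:
    """Map a traffic key to a stable value in [0, 100)."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).hexdigest()
    return (int(digest, 16) % 10000) / 100.0  # normalize the hash to a percentage

def assign(key: str, seed: str, rollout_percentage: float) -> str:
    # Same key + same seed -> same bucket every time (sticky), while a fresh
    # seed per feature keeps each release's assignment independent.
    if hash_to_percentage(key, seed) < rollout_percentage:
        return "treatment"
    return "control"

# 55% rollout: keys hashing below 55 get the treatment, the rest the control.
print(assign("user-123", seed="feature-search-v2", rollout_percentage=55))
```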

00:21:02

Almost any modern software already has measurement and telemetry in place. It may be stored internally or in a central business intelligence tool; it might also be automatically collected by another product. Collecting data for your release platform is typically as simple as building a wrapper for your existing tracking and sending those events to your release service in addition to their other destinations. For teams with an internal analytics warehouse, this step can be skipped and the release service can read from that warehouse directly. It is important that your telemetry data incorporates the key of the traffic generating the event, as this is needed to tie the data to your features. For the statistical engine, we return to the attribution, calculation, and analysis steps. In the attribution process, the assignment data is combined with your telemetry to determine which events are relevant to the release. It is important to note that this process should be limited to the data received during a single phase of your rollout.
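
A minimal sketch of such a tracking wrapper follows. The two destination functions are hypothetical stand-ins for your existing analytics pipeline and your release service's ingestion endpoint; the important detail is that every event carries the traffic key.

```python
import time

def track_analytics(event: dict) -> None:
    print("analytics:", event)        # stand-in for the existing tracking destination

def send_to_release_service(event: dict) -> None:
    print("release service:", event)  # stand-in for the release platform's endpoint

def track(traffic_key: str, event_name: str, value: float = 1.0) -> None:
    event = {
        "key": traffic_key,           # ties the event back to the flag assignment
        "event": event_name,
        "value": value,
        "timestamp": time.time(),
    }
    track_analytics(event)            # keep the existing destination
    send_to_release_service(event)    # and also feed the release platform

track("user-123", "page_load_time_ms", 412.0)
```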

00:22:07

If you try to combine data from before and after ramping, the experience of returning traffic has changed over that period and the telemetry cannot be attributed properly. There are many ways to calculate the distribution of your attributed data. With sufficient sample sizes, the most practical approach is to calculate summary statistics, such as the mean, variance, and size of the sample. Other statistical techniques may require different data collection, but with this approach these statistics can be collected very efficiently. The final analysis then compares the two distributions using a statistical test. There are a wide variety of tests available; most commonly in release monitoring, analytics teams use a t-test, though I recommend reviewing your options and choosing the technique right for your team. Statistical tests typically return a probability, or p-value, that the two samples are the result of the same underlying behavior. If that probability is very low, then it can be inferred that a meaningful difference exists between the two samples.
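
As an illustration, here is a hedged sketch of that comparison from summary statistics alone, using Welch's t-test from SciPy; the means, standard deviations, and sample sizes are made-up numbers.

```python
from scipy.stats import ttest_ind_from_stats

# Compare treatment vs. control using only summary statistics
# (mean, standard deviation, sample size) collected during one rollout phase.
result = ttest_ind_from_stats(
    mean1=412.0, std1=95.0, nobs1=4800,   # treatment: e.g. page load time in ms
    mean2=431.0, std2=102.0, nobs2=4750,  # control
    equal_var=False,                      # Welch's t-test: no equal-variance assumption
)
print(f"p-value = {result.pvalue:.4f}")
# A very low p-value suggests the difference is unlikely to be normal variation.
```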

00:23:19

When the samples are randomly assigned to the release, we can conclude that the change is responsible for the difference we observe in the impacted metrics. The last component of a release platform is the management console. This is where your team can manage your rollouts and review metric results. Many people's first feature flagging tool is powered by either a static configuration file or an entry in a database. These approaches limit who has access to a rollout and require technical knowledge to make a change. We found that once targeting is made available more intuitively, organizations find value in empowering product, customer support, and even sales teams to control or whitelist features for specific customers. Access to such changes should be regulated as part of any good security model, but simplifying the system increases the likelihood that it will be used. Then comes the challenging question of deciding what to build with this new release platform. What does it matter if you're releasing new code a hundred times per day, operating in one-week sprints, and crushing your deliverables? If you aren't building the right things, aren't you just making lousy software faster?

00:24:44

An incredibly common way to find your priorities is to conduct customer and market interviews. There are many companies for whom it is sufficient for their customers to be the sole source of feature prioritization. After all, if you can keep your customers happy, that happiness often spreads. It is worth noting, though, that customer requests are limited by the current experience. They can help polish, optimize, and identify gaps. However, true innovation often requires a spark of creativity whose ownership should not live with the customers alone. Then, obviously, we can look internally. Team members throughout the organization can suggest their best ideas, and those can be combined with customer requests to prioritize what should be done first. I'm a strong proponent of the impact-effort matrix.

00:25:45

These should be scores filled out by as many team members as possible, helping to identify high-value targets and unproductive time sinks. It is important to note, however, that humans are notoriously bad at estimating. Trying to make a good guess at the amount of work required is a skill developed over entire careers, and knowing ahead of time which changes will be successful is surprisingly difficult. In fact, the team at Microsoft's Bing search engine started off discovering that 80 to 90% of the features they shipped failed to have the success that was expected of them. Even now, with a decade of experience, they report that more than half of the experiments they run have surprising results.

00:26:41

In the absence of meaningful evidence, it is common that prioritization will be driven from the top. If you are just focused on making your boss happy, this isn't necessarily a bad thing, but rarely has any individual, no matter how senior, shown a perfect track record. The secret, then, is that there is no secret. True innovation is the result of trying things, often failing, sometimes succeeding. What is essential is to not have those attempts occur in a vacuum. By collecting data on each attempt, you can inform future steps in a deep, meaningful way. Successful avenues can be explored further. Failures can be learned from and changed in subsequent trials, or abandoned. Without meaningful data, and without a record of that process, you're left moving on gut instinct, a vague sense of what has happened before.

00:27:46

The real value of continuous delivery is the ability to continuously learn. The goal is to fail fast, to learn faster, and to use that knowledge to steer your future decisions. At this point, we have seen how feature flags can streamline the release process. We've reviewed some advanced patterns for releasing features across services, and we've explored how combining feature assignment data with your metrics can provide understanding not available otherwise. Finally, we've talked about the process to run and prioritize experiments at your organization. In DevOps, our job is never really done, but each year we can try to automate away the challenges of the year before. I think tonight you all deserve a beer. Thank you all for your time and attention, and thank you IT Revolution for inviting me to speak. If you have any questions, I will be available to chat in the conference Slack. If you've enjoyed this talk and want to learn more about the power of release platforms, please check us out at split.io, where you can learn how to kill the release night, how to automate your deliveries with data, and how to turn every feature release into an experiment. Thank you and enjoy the rest of the conference.