San Francisco 2017

Best Practices for Availability

The ask was almost naively complicated: get more than 100 geographically distributed SaaS engineering teams to deliver services that meet their stated SLAs in production. Also, do it without a budget or dedicated headcount. No pressure, right?

To address this challenge, the team co-opted a half-dozen other teams that had demonstrated highly mature development practices and had also achieved 99.99+% uptime. This pilot group helped the team meet its goal of identifying and visualizing risks to uptime across the organization. The pilot teams also helped identify critically important behaviors, especially those observed in every one of the pilot teams.

The first deliverable turned out, unexpectedly, to be a document entitled Best Practices for Availability, which sets out the foundations of what successful teams have in common. It lays out three pillars for availability: culture and ownership, mature service management, and availability-driven architecture.

But the Best Practices for Availability document is not a policy or standard; it is intended to drive awareness and inspire other engineering teams to emulate successful techniques. Meeting the goal of the project also required quantitative analysis. Working with the pilot teams, the team developed a set of tools to assess the organizational maturity and availability risk associated with a service team.

One of those tools was a failure mode analysis instrument loosely based on the Six Sigma process of the same name. Service teams were asked to contemplate the likelihood and impact of problems across thirteen different problem domains. These domains include failures of an instance, database, or region, as well as DDoS vulnerability and risks associated with cross-service dependencies. The collected responses allow senior management to see risk as a function of problem domain in a heat map, which is extremely helpful when making investment decisions.
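As a rough illustration of how likelihood and impact ratings might roll up into a per-domain heat map, here is a minimal sketch in Python. Everything specific in it is an assumption made for the example, not a description of the actual instrument: the 1 to 5 rating scale, the likelihood-times-impact score, the averaging across teams, and the names `risk_score`, `heat_map`, and `submissions` are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def risk_score(likelihood: int, impact: int) -> int:
    """Combine two ratings into one score.

    Assumption: both ratings use a 1-5 scale and are multiplied.
    The talk does not state the real scoring scheme.
    """
    for value in (likelihood, impact):
        if not 1 <= value <= 5:
            raise ValueError("ratings must be between 1 and 5")
    return likelihood * impact


def heat_map(submissions):
    """Average the risk score per problem domain across all teams."""
    scores = defaultdict(list)
    for _team, domain, likelihood, impact in submissions:
        scores[domain].append(risk_score(likelihood, impact))
    return {domain: mean(values) for domain, values in scores.items()}


# Hypothetical submissions: (team, problem domain, likelihood, impact)
submissions = [
    ("team-a", "instance failure", 4, 2),
    ("team-a", "region failure", 1, 5),
    ("team-b", "instance failure", 2, 2),
    ("team-b", "region failure", 2, 5),
]

for domain, score in sorted(heat_map(submissions).items()):
    print(f"{domain}: {score}")
```

Under these assumptions, a higher average marks a domain where teams collectively report more risk. Resubmitting the same ratings periodically would give one such map per period, which is one way progress over time could be compared.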

The program produced a best practices document, created engagement within and between service teams, and allowed senior management to measure risk and progress. It also provided a roadmap showing how successful teams find a way to dedicate resources to availability and quality work. The tactics used by the team were a notable outcome as well, and will be presented in a sidebar discussion entitled "How to get 100 engineering teams to do something for you." The program is ongoing, with service teams periodically resubmitting failure mode risk information so that progress across the organization can be observed over time.

David Owczarek, Sr. Manager, Document Cloud, Adobe
