Las Vegas 2022

How Google SREs Modify Production Resources Securely & Safely

Most outages are caused by changes to production resources. Automated production changes are typically fast and secure, but they can't address every use case, especially during an incident. An SRE with production access can fill this gap, but that access introduces reliability and security risk if the SRE makes a mistake or their account is compromised. To balance this risk, Google developed a framework that automates the majority of production operations while providing routes for manual changes when necessary.

Brett Beekley

SRE, Google

Michael Bird

SRE, Google