Site Reliability Engineering
Site Reliability Engineering (SRE) is what happens when you ask a software engineer to design an operations team. An SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning. Fundamentally, SRE does work that has historically been done by an operations team, but it uses engineers with software expertise and banks on the fact that these engineers are both inclined and able to substitute automation for human labor.
In this free online book, members of Google's SRE team explain how their engagement with the entire software lifecycle has enabled the company to build, deploy, monitor, and maintain some of the largest software systems in the world.
-Rushi