Site Reliability Engineering (SRE) is Google’s approach to systems engineering and operations. Given Google’s scale and position in the market, the problems it faces are unique. Furthermore, a big part of the appeal of Google’s services is their reliability, i.e. they are very rarely down. The consequences of downtime are easy to gauge: angry customers, lost revenue, etc. The company found that a traditional Operations team would have to scale linearly with load, which is not possible unless you can breed engineers at the same rate as the load grows. Google’s solution was SRE, which applies the principles of software engineering to operations.

As far as I know, Google was the first company to have something like SRE. Today many companies have copied this approach (some unsuccessfully). Google has codified the principles and practices in a book that is available for free online.

The major principles:

  • Make frequent releases, even though it goes against the ethos of operations. Developers are incentivized to release frequently, while Operations is incentivized to maintain reliability, which is most easily achieved by avoiding change (releases). This is a fundamental tension. SRE resolves it with an error budget: a measure of how much downtime a service is allowed in a given time frame (say, a quarter). Developers can keep doing releases as long as the error budget is not used up. Once it is exhausted, all releases are stopped.
  • Service Level Objectives (SLOs), commonly confused with SLAs. An SLO describes the target level for a service level indicator (SLI). An SLI measures some aspect of a system, e.g. latency or uptime. The SLO gives a target value or range for this SLI, e.g. uptime should be at least 99.9%.
  • Limiting toil (which can include other operational duties) to a maximum of 50% of an SRE’s time. The remaining time has to be spent on projects. Personally I like this, since doing traditional Ops work means throwing the same solutions at the same problems over and over again.
  • Focus on monitoring, specifically time-series monitoring of the kind used with Prometheus.
  • Automation, since it can perform the same work more consistently than humans, with fewer mistakes, and it reduces toil.
  • Having a specific release engineering role in the organization that provides infrastructure and guidelines on how to test, package and deploy software.
  • Simplicity, i.e. making systems as simple as possible through minimal APIs and modularity.
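
To make the error budget and SLO ideas from the list above concrete, here is a minimal sketch of the arithmetic in Python. It is my own illustration, not code from the book: the names `ErrorBudget` and `availability_sli`, the 90-day quarter, and the 99.9% target are all assumptions picked for the example.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Downtime allowance implied by an availability SLO over a time window.

    Hypothetical helper for illustration; not taken from the SRE book.
    """

    slo_target: float      # e.g. 0.999 for "99.9% uptime"
    window_minutes: float  # length of the measurement window in minutes

    @property
    def budget_minutes(self) -> float:
        # Whatever the SLO does not promise is the budget.
        return (1.0 - self.slo_target) * self.window_minutes

    def remaining_minutes(self, downtime_minutes: float) -> float:
        # Budget left after subtracting the downtime observed so far.
        return self.budget_minutes - downtime_minutes

    def releases_allowed(self, downtime_minutes: float) -> bool:
        # Releases continue only while some budget is left.
        return self.remaining_minutes(downtime_minutes) > 0


def availability_sli(good_minutes: float, total_minutes: float) -> float:
    """SLI: fraction of the window during which the service was up."""
    return good_minutes / total_minutes


QUARTER_MINUTES = 90 * 24 * 60  # assumption: a 90-day quarter
budget = ErrorBudget(slo_target=0.999, window_minutes=QUARTER_MINUTES)

if __name__ == "__main__":
    print(f"Budget for the quarter: {budget.budget_minutes:.1f} minutes")
    for downtime in (60, 130):
        sli = availability_sli(QUARTER_MINUTES - downtime, QUARTER_MINUTES)
        print(
            f"downtime={downtime} min, SLI={sli:.5f}, "
            f"releases allowed: {budget.releases_allowed(downtime)}"
        )
```

Under these assumptions a 99.9% uptime SLO leaves a little over two hours of downtime per quarter. An hour of downtime keeps releases open, while 130 minutes exhausts the budget and stops them.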

The practices part of the book covers theoretical and practical topics like on-call, postmortems, testing, load balancing, distributed consensus, cron, data pipelines and data integrity.

There is also a management section, which I skipped; it seemed to be focused on building and maintaining SRE teams.

To meet its needs, SRE hires engineers who have a mix of experience in both software engineering and systems engineering (Unix system internals and networking).

Overall I like the book a lot, except for the chapter on testing, which was a slog to read through. Seriously, did someone not proofread this chapter?

However, my main concern with this book is whether an SRE department is required when a company has not reached Google’s scale. Sure, these are good practices that in an ideal world should be incorporated from the beginning. In practice, however, this will increase the time to market for a product. Also, a lot of traditional Ops solutions may end up working really well, and you may never reach Google’s scale. I think that most companies will find it easier to start by implementing some of the SRE principles - like frequent releases, SLOs, monitoring, simplicity - at the beginning and then implementing the rest - limiting toil, automation, release engineering - once they get bigger.