
The Site Reliability Workbook: Practical Ways to Implement SRE

In 2016, Google's Site Reliability Engineering book ignited an industry discussion on what it means to run production services today--and why reliability considerations are fundamental to service design. Now, Google engineers who worked on that bestseller introduce The Site Reliability Workbook, a hands-on companion that uses concrete examples to show you how to put SRE principles and practices to work in your environment.

This new workbook not only combines practical examples from Google's experiences, but also provides case studies from Google's Cloud Platform customers who underwent this journey. Evernote, The Home Depot, The New York Times, and other companies outline hard-won experiences of what worked for them and what didn't.

Dive into this workbook and learn how to flesh out your own SRE practice, no matter what size your company is.

You'll learn:


How to run reliable services in environments you don't completely control--like cloud
Practical applications of how to create, monitor, and run your services via Service Level Objectives
How to convert existing ops teams to SRE--including how to dig out of operational overload
Methods for starting SRE from either greenfield or brownfield

512 pages, ebook

Published July 25, 2018

295 people are currently reading
1578 people want to read

About the author

Betsy Beyer

9 books, 33 followers
Betsy Beyer is a Technical Writer for Google in New York City specializing in Site Reliability Engineering. She has previously written documentation for Google’s Data Center and Hardware Operations Teams in Mountain View and across its globally distributed datacenters. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University. En route to her current career, Betsy studied International Relations and English Literature, and holds degrees from Stanford and Tulane.

Ratings & Reviews


Community Reviews

5 stars: 185 (47%)
4 stars: 162 (41%)
3 stars: 35 (9%)
2 stars: 6 (1%)
1 star: 0 (0%)
Displaying 1 - 25 of 25 reviews
Profile Image for Sebastian Gebski.
1,157 reviews, 1,265 followers
August 23, 2018
Solid 4.5 stars.

Surprisingly good supplement to the original SRE book. BUT be warned - it's a workbook, it's practical, but that doesn't mean it's tech-based; in fact it's more conceptual & tech-agnostic. Filled with many good examples, coming not just from Google, but from various (but all well-known) organizations.

What did I like most? That it's tied to real-life practices - there's a full chapter on SLOs, another one on on-call duties, postmortems, etc. Some are quite specific (the data processing one felt almost out of place), but some can easily be related to any case - like the configuration one (btw. I didn't expect to learn anything new here, but I actually really like some of the conceptual figures for approaching this topic - ones I hadn't used before).

Anything I didn't like? The 3 final chapters need some more work - it's not just about polish, it's more that they are not "selling their goal" properly - IMHO it was like a tangle of thoughts that was in general hard to disagree with, but that lacked the clarity & natural flow of the previous ones.

Still - it's a very decent book. Highly recommended; additionally, the book is currently (for a limited time) available for free - not using this opportunity to educate yourself would be a sin.
Profile Image for Gui.
42 reviews, 6 followers
October 29, 2019
Very interesting to see how some issues related to the SRE world are applied in practice. I feel I will have to go back to the first book to understand how to present simple, foundational concepts to the rest of the company in order to start a gradual process of adopting some of the practices.
Profile Image for Rastko Vukasinovic.
32 reviews, 3 followers
September 7, 2019
The more of Google's technical writing in this area I read, the more redundancy I feel - the exceptional writing style and discussion-based approach are covering for a lack of original thought compared to the first book.

Still, I recommend it in some of my coaching sessions: modern, high-quality approaches right from the source. Could live without it, tho 😀
Profile Image for Nat.
Author, 3 books, 57 followers
December 7, 2019
Written by a bunch of my coworkers, I really enjoyed this book. Arguably much more practical than the original SRE book. I find myself sending folks chapters from this book far more than the first.
Profile Image for Ines.
560 reviews, 32 followers
October 5, 2020
This book is great at explaining by example, which makes it easier to absorb the information. I found some chapters "boring" because they focus on areas that are alien to me, but mostly I think this book works wonderfully as a future reference: when confused or in need of help, go directly to the chapter that explains what you need.
Profile Image for Giovani Facchini.
47 reviews, 4 followers
November 4, 2018
This is a great book for those who want to learn how to build an SRE team and the main concepts that need to be taken into consideration for a successful team.

On the positive front, it gives the reader a step-by-step approach on how to start: defining SLOs that determine when the team should take action, so it focuses only on what is important; having the team spend 50% to 70% of its time automating the resolution of problems and improving operations, and a smaller percentage on toil; and finally the blameless postmortem culture.

Another good point is the focus on the human aspect of processes (culture, respect, challenge), since it is paramount to having a stable, functioning team. On the challenge front, it empowers team members to focus on the problem themselves and not just pass it on to someone else. On respect, the blameless postmortem tells us to focus on the technical problems and everything that happened for the incident to occur, and on how it could be avoided with changes in tools, automation, alerts, etc., taking the blame off people for their mistakes. Mistakes will happen.

A lot of technical detail is provided on how to solve specific problems, and this is great for those who are starting out and do not have experience on an SRE team. It also focuses on many aspects of performance engineering, networks, parallelization, and distributed processing, and has great content for people interested in massive, globally distributed systems.

The downside is the oversimplification of transitions and the marketing style of writing in some of the business cases. Since those cases came from companies trying to promote themselves, you may not find the real struggles and issues you will face in your team. At least some hardships are described and you are able to get a feel for the transition process, but at the end of the day, tacit knowledge plays a big part, and it is hard to find in writing in a way that lets you understand the things that can happen (behavior, challenges, culture, issues, etc.).

I strongly recommend this book for those who like operations, automation, and improving stability! Have fun reading.
Profile Image for Ahmad hosseini.
314 reviews, 71 followers
June 26, 2019
This is the second book that Google published about SRE. The first one explains the theories and principles of SRE, and this book shows you how to implement SRE at any company, startup or giant.
What is SRE?
SRE is a job role, a set of practices we’ve found to work, and some beliefs that animate those practices. If you think of DevOps as a philosophy and an approach to working, you can argue that SRE implements some of the philosophy that DevOps describes. So, in a way,
class SRE implements interface DevOps.

The book provides good practices and has good case studies from Spotify, PagerDuty, Pokémon GO, etc. that can help you understand the practices. The book examines real incidents at Google and other big companies and how their SRE teams handled them. In the last part of the book, there is good advice on creating and managing SRE teams in every kind of company.
Profile Image for Shoshana.
44 reviews, 13 followers
Read
August 28, 2021
I read this without reading Site Reliability Engineering first, but read the relevant chapters from that as I read through this and ended up reading almost the whole book anyway. I liked doing it that way.

This feels relevant for anyone trying to run reliable software, even if your job description doesn't fit into the SRE role, and it was nice to see everything supported with real examples, including from other companies. I mostly started because I wanted to read the sections on monitoring and SLOs (since that has become such a standard in the tech industry) but I also liked the focus on eliminating toil and managing overload. The parts on architecting distributed systems somehow managed to find the balance of being general enough to apply to many situations while still being informative, which isn't a simple task. There were plenty of interesting links in this and the first book that I hope I remember to dig into.
Profile Image for Dimos Raptis.
Author, 2 books, 3 followers
February 9, 2025
When I got to read this book, I had already read the first SRE book from Google, but it had been several years. I was in a period where the topic of SLIs/SLOs was a big theme in the team I was leading, so I was looking for a good book on that topic.

It was what I was looking for. A good, concise overview of the basic terminology, with a lot of practical examples for real-life projects. A good description of the trade-offs around setting up alerting strategies instead of prescriptive instructions, such as the drawbacks of using a time duration for alerts, since detection time then doesn't scale with the severity of the outage. The book provides 6 different strategies with formulas and trade-offs for each one.
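To make the duration trade-off mentioned above concrete, here is a minimal sketch, not taken from the book. The function names are hypothetical, and the simplified model is an assumption: no errors before the outage, then a constant error ratio. It compares an alert with a fixed duration clause against an alert on the average error ratio over a lookback window, expressed as a burn rate (observed error ratio divided by the error budget, 1 - SLO).

```python
from typing import Optional


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return error_ratio / (1.0 - slo_target)


def duration_alert_detection_minutes(
    error_ratio: float, threshold_ratio: float, duration_minutes: float
) -> Optional[float]:
    """Alert fires once the error ratio has stayed above the threshold for
    `duration_minutes`. Detection time is the same for mild and total outages."""
    if error_ratio < threshold_ratio:
        return None  # never fires
    return duration_minutes


def window_alert_detection_minutes(
    error_ratio: float,
    slo_target: float,
    burn_rate_threshold: float,
    window_minutes: float,
) -> Optional[float]:
    """Alert fires once the average error ratio over the lookback window
    reaches burn_rate_threshold * (1 - slo_target).

    Assumed model: zero errors before the outage, constant `error_ratio` after,
    so the window average after t minutes is error_ratio * t / window_minutes.
    """
    threshold_ratio = burn_rate_threshold * (1.0 - slo_target)
    if error_ratio < threshold_ratio:
        return None  # the window average can never reach the threshold
    return threshold_ratio * window_minutes / error_ratio


def budget_consumed_fraction(
    rate: float, alert_window_hours: float, slo_period_hours: float
) -> float:
    """Fraction of the period's error budget spent while burning at `rate`
    for `alert_window_hours`."""
    return rate * alert_window_hours / slo_period_hours


if __name__ == "__main__":
    slo = 0.999  # assumed example SLO: 99.9% of requests succeed
    for ratio in (1.0, 0.05):
        fixed = duration_alert_detection_minutes(ratio, 0.001, 10)
        windowed = window_alert_detection_minutes(ratio, slo, 14.4, 60)
        print(f"error ratio {ratio}: duration={fixed} min, window={windowed} min")
```

Under these assumptions the duration-based alert always takes the full 10 minutes, while the windowed alert fires faster the more severe the outage is, which is the scaling property the duration clause lacks.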

Some chapters, such as toil, were a bit long and boring to go through though for me personally, which made me go from 5 down to 4.5 stars.
Profile Image for Dimitris.
10 reviews, 3 followers
October 7, 2018
Excellent read, with a lot of interesting ideas on how to change your organization into an SRE-oriented structure. There are a lot of methodologies and guides on how to achieve this, as well as real-world case studies - both of which make the book much more approachable than the first SRE book with its more theoretical approach.

I found the 3rd part of the book ("Processes") the most useful one and I consider it reason enough to read this book. I wish I could have read the "Identifying and recovering from overload" chapter from it 3 or 4 years ago. I was very happy to see that this book puts a strong focus on the well-being of the people working in this field and makes it clear that burnout is an issue that is directly correlated to organizational issues.
Profile Image for Andrew Hatch.
20 reviews, 1 follower
November 5, 2020

Overall this is a good book and a worthy follow-up to the original. However, it is really long and many of the case studies presented are very similar, so there are a lot of "groundhog day" moments. You will probably find you read for quite some time before identifying advice or paragraphs that are worthy of highlighting or note-taking.

Sections on incident diagnosis require caution from the reader. There was a lot of advocacy for Root Cause Analysis, reductionism and linear thinking - practices that will make you drift into failure in very complex and dynamic systems. More thought and understanding by the authors is needed before championing these processes.
59 reviews, 1 follower
March 31, 2019
Great book about site reliability. I have never read any book that covers this domain as well. It's clear that you can only write the best book about what you actually do, and Google is doing great.

I most enjoyed the postmortem chapter, the material on how to build an SRE team, and how to design scalable systems. The chapter about configuration philosophy is also great, as is the one on the importance of reliability.

The best part is about SLAs, SLOs and SLIs, which deserve to be the main pillar of site reliability! Thank you very much!



346 reviews, 1 follower
December 15, 2019
Similar to the first book, it is a worthy read and covers a lot of the best practices (including extensive techniques for dealing with toil). It also comes across as product placement advertising (write your software in Go, deploy it to GCP). Despite that, it is worth your time if you are in the IT industry.
80 reviews, 4 followers
June 15, 2023
A lot of practical ideas, know-how, interesting use cases, and approaches are shared. Definitely something that can be referred to multiple times for ideas and as a refresher.

Most of the content is for larger corporations with big SRE teams, but it can easily be adapted or extracted for smaller companies with leaner teams.
Profile Image for Sharon.
18 reviews
August 22, 2021
This book is practical. I used the chapters on team lifecycles to build a roadmap and checklist, and also sent the SLO and on-call chapters to the existing development and ops teams while creating NFRs/SLOs/SLIs and revamping runbooks.
Profile Image for Daniela.
34 reviews
November 11, 2021
Great book for understanding the problems of high-performance applications.
It has many examples from Google, but they can be related to problems with systems in general.
Following all of the described procedures may not be realistic for most companies.
Profile Image for James Wasson.
33 reviews
May 3, 2024
After reading the first book, I had the immediate "but how do you do this thing?" question. This workbook gave a lot of practical steps for accomplishing goals and standards. I still reference it at times.
Profile Image for Yaroslav.
41 reviews, 1 follower
November 19, 2018
Great workbook for SRE practices. I like the last few chapters, where the authors explain how change management might be done to implement SRE.
Profile Image for Ivan.
223 reviews, 10 followers
December 29, 2018
The same SRE, but with practical examples from Google's own practice. There is probably no point in reading it without the first book.
Profile Image for Miguel Alho.
56 reviews, 9 followers
June 7, 2019
I've found a great set of ideas and practical examples to bring back into my work, even though I am not an SRE.
Profile Image for August Schau.
151 reviews, 1 follower
February 19, 2021
Great case studies and ideas for improving operations within an organization, small or large.
Profile Image for SolidM.
176 reviews, 1 follower
March 24, 2025
Good book, but some chapters/topics are more interesting than others.
5 reviews, 1 follower
Read
December 27, 2018
I stopped about three quarters through ... was being overloaded with SRE.