In 2016, Google's Site Reliability Engineering book ignited an industry discussion on what it means to run production services today--and why reliability considerations are fundamental to service design. Now, Google engineers who worked on that bestseller introduce The Site Reliability Workbook, a hands-on companion that uses concrete examples to show you how to put SRE principles and practices to work in your environment.
This new workbook not only combines practical examples from Google's experiences, but also provides case studies from Google's Cloud Platform customers who underwent this journey. Evernote, The Home Depot, The New York Times, and other companies outline hard-won experiences of what worked for them and what didn't.
Dive into this workbook and learn how to flesh out your own SRE practice, no matter what size your company is.
You'll learn:
How to run reliable services in environments you don't completely control--like cloud
Practical applications of how to create, monitor, and run your services via Service Level Objectives
How to convert existing ops teams to SRE--including how to dig out of operational overload
Methods for starting SRE from either greenfield or brownfield
Betsy Beyer is a Technical Writer for Google in New York City specializing in Site Reliability Engineering. She has previously written documentation for Google’s Data Center and Hardware Operations Teams in Mountain View and across its globally distributed datacenters. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University. En route to her current career, Betsy studied International Relations and English Literature, and holds degrees from Stanford and Tulane.
A surprisingly good supplement to the original SRE book. BUT be warned - it's a workbook and it's practical, but that doesn't mean it's tech-based; in fact, it's more conceptual & tech-agnostic. It's filled with many good examples, coming not just from Google but from various (all well-known) organizations.
What did I like most? That it's tied to real-life practices - there's a full chapter on SLOs, another one on on-call duties, postmortems, etc. Some are quite specific (the data processing one felt almost out of place), but some can easily be related to any case - like the configuration one (btw. I didn't expect to learn anything new here, but I really liked some of the conceptual figures for approaching this topic - ones I hadn't used before).
Anything I didn't like? The 3 final chapters need some more work - it's not just about polish, it's more that they are not "selling their goal" properly - IMHO they read like a tangle of thoughts that was generally hard to disagree with, but lacked the clarity & natural flow of the previous ones.
Still - it's a very decent book. Highly recommended. Additionally, the book is currently (for a limited time) available for free - not using this opportunity to educate yourself would be a sin.
Very interesting to see how some issues from the SRE world are applied in practice. I feel I will have to go back to the first book to understand how to present simple, foundational concepts to the rest of the company in order to start gradually adopting some of the practices.
The more of Google's technical writing in this area I read, the more redundancy I feel - the exceptional writing style and discussion approach cover for a lack of original thought compared to the first book.
Still, I recommend it in some of my coaching sessions: modern, high-quality approaches right from the source. Could live without it, tho 😀
Written by a bunch of my coworkers, I really enjoyed this book. Arguably much more practical than the original SRE book. I find myself sending folks chapters from this book far more than the first.
This book is great at explaining by example, which makes it easier to absorb the information. I found some chapters "boring" because they focus on areas that are alien to me, but mostly I think this book works wonderfully as a reference to consult later: when confused or in need of help, go directly to the chapter that explains what you need.
This is a great book for those who want to learn how to build an SRE team and the main concepts that need to be taken into consideration for a successful team.
On the positive side, it gives the reader a step-by-step approach to getting started: defining SLOs that determine when the team should take action, so that it focuses only on what is important; having the team spend 50% to 70% of its time automating the resolution of problems and improving operations, with a smaller percentage of time spent on toil; and, finally, the blameless postmortem culture.
Another good point is the focus on the human aspect of processes (culture, respect, challenge), since it is paramount for a functioning, stable team. On the challenge front, it empowers team members to focus on the problem themselves and not just pass it on to someone else. On respect, the blameless postmortem tells us to focus on the technical problems and everything that happened for the situation (incident) to occur, and on how it could be avoided with changes in tools, automation, alerts, etc., while taking the blame off people for their mistakes. Mistakes will happen.
A lot of technical detail is provided on how to solve specific problems, and this is great for those who are starting out and do not have experience on an SRE team. It also focuses on many aspects of performance engineering, networks, parallelization, and distributed processing, and has great content for people interested in massive, globally distributed systems.
The downside is the oversimplification of transitions and the marketing style of writing in some of the business cases. Since those cases came from companies trying to promote themselves, you may not find the real struggles and issues you will face in your team. At least some hardships are described and you are able to get a feel for the transition process, but at the end of the day, tacit knowledge plays a big part, and it is hard to find it in writing in a way that lets you understand the things that can happen (behavior, challenges, culture, issues, etc).
I strongly recommend this book for those who like operations, automation, and improving stability! Have fun reading.
This is the second book that Google published about SRE. The first one explains the theories and principles of SRE, and this book shows you how to implement SRE at any company, from startup to giant. What is SRE? SRE is a job role, a set of practices we’ve found to work, and some beliefs that animate those practices. If you think of DevOps as a philosophy and an approach to working, you can argue that SRE implements some of the philosophy that DevOps describes. So, in a way,
class SRE implements interface DevOps.
The book provides good practices and has good case studies from Spotify, PagerDuty, Pokemon Go, etc. that can help you understand the practices. The book examines real incidents at Google and other big companies and how their SRE teams handled them. In the last part of the book, there is good advice for creating and managing SRE teams in every kind of company.
I read this without reading Site Reliability Engineering first, but read the relevant chapters from that as I read through this and ended up reading almost the whole book anyway. I liked doing it that way.
This feels relevant for anyone trying to run reliable software, even if your job description doesn't fit into the SRE role, and it was nice to see everything supported with real examples, including from other companies. I mostly started because I wanted to read the sections on monitoring and SLOs (since that has become such a standard in the tech industry) but I also liked the focus on eliminating toil and managing overload. The parts on architecting distributed systems somehow managed to find the balance of being general enough to apply to many situations while still being informative, which isn't a simple task. There were plenty of interesting links in this and the first book that I hope I remember to dig into.
When I got to read this book, I had already read the first SRE book from Google, but it had been several years. It was a period when the topic of SLIs/SLOs was a big theme in the team I was leading, so I was looking for a good book on that topic.
It was what I was looking for. A good, concise overview of the basic terminology, with a lot of practical examples for real-life projects. It gives a good description of the trade-offs around setting up alerting strategies instead of prescriptive instructions, such as the drawback of using a time duration for alerts: detection time doesn't scale with the severity of the outage. The book provides 6 different strategies with formulas and trade-offs for each one.
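To make the alerting trade-off above concrete, here is a minimal Python sketch of the burn-rate idea commonly used for SLO-based alerts. It is an illustration only, not the book's code: the function names, the 30-day period, and the sample numbers are all assumptions chosen for the example.

```python
# Hypothetical helpers for illustration; names and defaults are assumptions.

def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the error budget is being consumed.

    A burn rate of 1 means the budget lasts exactly one SLO period;
    higher values mean it runs out proportionally sooner.
    """
    budget = 1.0 - slo  # allowed fraction of failed requests
    if budget <= 0:
        raise ValueError("SLO must be below 1.0 to leave an error budget")
    return error_ratio / budget


def hours_to_exhaust_budget(error_ratio: float, slo: float,
                            period_hours: float = 30 * 24) -> float:
    """Hours until the whole budget is spent at the current error ratio."""
    rate = burn_rate(error_ratio, slo)
    if rate == 0:
        return float("inf")  # no errors, so the budget is never exhausted
    return period_hours / rate


if __name__ == "__main__":
    # With a 99.9% SLO, a 1.44% error ratio burns the budget about
    # 14.4 times faster than sustainable.
    print(burn_rate(0.0144, 0.999))
    print(hours_to_exhaust_budget(0.0144, 0.999))
```

Unlike a fixed "alert after N minutes" rule, a threshold on this value takes the severity of the outage into account: the worse the error ratio, the higher the burn rate and the sooner the budget is gone.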
Some chapters, such as toil, were a bit long and boring to go through though for me personally, which made me go from 5 down to 4.5 stars.
Excellent read, with a lot of interesting ideas on how to change your organization into an SRE-oriented structure. There are a lot of methodologies and guides on how to achieve this, as well as real-world case studies - both of which make the book much more approachable than the first SRE book with its more theoretical approach.
I found the 3rd part of the book ("Processes") the most useful one and I consider it reason enough to read this book. I wish I could have read the "Identifying and recovering from overload" chapter from it 3 or 4 years ago. I was very happy to see that this book puts a strong focus on the well-being of the people working in this field and makes it clear that burnout is an issue that is directly correlated to organizational issues.
Overall this is a good book and a worthy follow-up to the original. However, it is really long and many of the case studies presented are very similar, so there are a lot of "groundhog day" moments. You will probably find yourself reading for quite some time before identifying advice or paragraphs that are worth highlighting or taking notes on.
The sections on incident diagnosis require caution from the reader. There was a lot of advocacy for Root Cause Analysis, reductionism and linear thinking - practices that will make you drift into failure in very complex and dynamic systems. More thought and understanding by the authors is needed before championing these processes.
A great book about site reliability. I have never read another book that covers this domain so well. It's clear that you can only write the best book about what you actually do, and Google is doing great.
I most enjoyed the postmortem chapter, the guidance on how to build an SRE team, and how to design scalable systems. The chapter on configuration philosophy is also great, as is the discussion of the importance of reliability.
The best part is about SLAs, SLOs and SLIs, which deserve to be the main pillar of site reliability! Thank you very much!
Similar to the first book, it is a worthy read and covers a lot of the best practices (including extensive techniques for dealing with toil). It also comes across as product placement advertising (write your software in Go, deploy it to GCP). Despite that, it is worth your time if you are in the IT industry.
A lot of practical ideas, know-how, interesting use cases, and approaches are shared. Definitely something that can be referred to multiple times for ideas and a refresher.
Most of the content is for larger corporations with big SRE teams - but it can easily be extracted and adapted for smaller companies with leaner teams.
This book is practical. I used the chapters on team lifecycles to build a roadmap and checklist, and also sent the SLO and on-call chapters to the existing development and ops teams while creating NFRs/SLOs/SLIs and revamping runbooks.
A great book for understanding the problems of high-performance applications. It has many examples from Google, but you can relate them to problems with systems in general. Following all of the described procedures may not be realistic for most companies.
After reading the first book, I had the immediate "but how do you do this thing?" question. This workbook gave a lot of practical steps for accomplishing goals and standards. I still reference it at times.