The Complete Guide to Optimizing Systems Performance
Written by the winner of the 2013 LISA Award for Outstanding Achievement in System Administration
Large-scale enterprise, cloud, and virtualized computing systems have introduced serious performance challenges. Now, internationally renowned performance expert Brendan Gregg has brought together proven methodologies, tools, and metrics for analyzing and tuning even the most complex environments. Systems Performance: Enterprise and the Cloud focuses on Linux® and Unix® performance, while illuminating performance issues that are relevant to all operating systems. You'll gain deep insight into how systems work and perform, and learn methodologies for analyzing and improving system and application performance. Gregg presents examples from bare-metal systems and virtualized cloud tenants running Linux-based Ubuntu®, Fedora®, CentOS, and the illumos-based Joyent® SmartOS™ and OmniTI OmniOS®. He systematically covers modern systems performance, including the "traditional" analysis of CPUs, memory, disks, and networks, and new areas including cloud computing and dynamic tracing. This book also helps you identify and fix the "unknown unknowns" of complex bottlenecks that emerge from elements and interactions you were not aware of. The text concludes with a detailed case study, showing how a real cloud customer issue was analyzed from start to finish.
Coverage includes
• Modern performance analysis: terminology, concepts, models, methods, and techniques
• Dynamic tracing techniques and tools, including examples of DTrace, SystemTap, and perf
• Kernel: uncovering what the OS is doing
• Using system observability tools, interfaces, and frameworks
• Understanding and monitoring application performance
• Optimizing processors, cores, hardware threads, caches, interconnects, and kernel scheduling
• Memory: virtual memory, paging, swapping, memory architectures, busses, address spaces, and allocators
• File system I/O, including caching
• Storage devices/controllers, disk I/O workloads, RAID, and kernel I/O
• Network-related performance: protocols, sockets, interfaces, and physical connections
• Performance implications of OS and hardware-based virtualization, and new issues encountered with cloud computing
• Benchmarking: getting accurate results and avoiding common mistakes
This guide is indispensable for anyone who operates enterprise or cloud environments: system, network, database, and web admins; developers; and other professionals. For students and others new to optimization, it also provides exercises reflecting Gregg's extensive instructional experience.
This isn't a book, so much as it is a reference manual or an appendix. It's nearly 800 pages of dense, low-level discussions of performance issues related to the CPU, memory, hard drive, OS, and so on. The writing is very structured, repetitive, and dry and resembles a list of facts more than prose. If you have a specific performance issue and need to know how to, say, use DTrace to diagnose an issue with a memory leak, this book is perfect. If you're looking for something you can read cover to cover to generally improve your understanding of system performance, this book probably isn't it.
If you are going to read this, I recommend reading the first few sections of each chapter, which typically have a nice introduction to the architecture of the CPU, memory, etc. They are also full of handy tables, such as typical, real-world latencies and typical performance trade-offs to consider (e.g. CPU vs. memory, small vs. large record sizes). The remainder of each chapter is a deep dive into specific performance tools you can use, which is handy as a reference but does not make for interesting reading otherwise, as there is no way you can retain so much detailed info. I'd also mention that since the author is a Solaris expert and the creator of the DTraceToolkit, you will see a lot of information about Solaris and DTrace in every single chapter.
The final chapter of the book is great: it walks through a real-world case study and shows how to use various techniques to analyze it, along with the thought process that goes into tracking down performance bottlenecks. Seeing such a case study gives you a much better sense of the context in which the various performance tools should be used and some awareness of whether the data returned by those tools is normal or not. This would have been a much better book if every chapter had been primarily focused on such case studies, with all the other nitty-gritty details tacked on solely as supporting information (perhaps in an appendix!).
Whenever I watch Netflix movies from different locations, on different networks, and via different devices without any serious performance problems, I think about this book. I just can't help it. Guess why?
Brendan Gregg wrote one of the most pragmatic and comprehensive books on Unix and Linux performance engineering. If you've been in any way seriously involved with benchmarking, tuning, and analyzing GNU/Linux or Unix-based systems in the last 10 years, you've either come across some of the tools developed by the author or used concepts and techniques developed by him.
I'll cut it short, and try to answer a simple question: Should you buy and study this book?
Yes, and no.
Yes, in the sense that it has solid, very well laid out and presented material on the fundamental aspects of performance engineering for most of the underlying components and layers, such as the kernel, file systems, disks, memory, networking, etc. On top of that, the principles clearly explained and exemplified by Gregg are timeless, e.g. queueing theory, benchmarking pitfalls, workload characterization, checklists, visualization techniques, etc. (The book also provides nice historical context, which is a plus in my book, but that's just me.) Moreover, the questions at the end of each chapter are very nice additions: some of the straightforward technical questions will help you fill the gaps in your technical knowledge, whereas some of the more open-ended ones are very good practice material to stretch your analytic thinking capabilities.
No, in the sense that a period of 6 years is like a lifetime in this industry, and technology is a moving target, GNU/Linux doubly so! From 2021 onward, I don't think the majority of the readers of this review will need any of the Solaris-related practical and historical material in the book (it is great, but not for everyone, I guess). Moreover, it would be good to have more up-to-date resources for the finer details of practical performance engineering tools for GNU/Linux systems, as well as a heavier focus on cloud-computing-related tips, tricks, and pitfalls when it comes to analyzing and tuning performance.
I bought this book some time ago, but I'll definitely buy the 2nd Edition from the same author.
Long story short, most of the body of knowledge in this book (and most probably the one in the 2nd Edition) can be considered "required" for most professional performance engineers working with Linux-based systems.
This is possibly the most hardcore technical book I've ever read. In times of abstractions, when all things are distributed and it's almost impossible to find where your app is running, Brendan Gregg has reinvented his own wheel. He's doing what most engineers dream of doing but have neither the time for nor the interest from their stakeholders. He's getting to the bottom of what makes up performance.
Screw your Dockers and Prometheuses: with the exception of a few self-written scripts and eBPF one-liners, the author uses common tools to measure and troubleshoot performance problems. This book doesn't go so far as to try to solve all issues with GNU utils only, but almost every tool described in the book is battle-tested and earned its fame long before this book. Gee, it has full chapters on sar & perf.
Whoa, this book is a refresher in the world of abstractions that we live in now. Not sure how practical this book is for anyone outside of Netflix, but it's nevertheless 100% worth your time if you're interested in how things work under the hood.
This is an impressive book, packed with information about methodology and tooling to diagnose performance problems in modern computing environments. The problem is that the presentation of the material is so incredibly dull, in the end I wanted to go read a telephone book instead for pleasure.
The structure is clear and I see what the author was trying to do: provide a repeating pattern of how information is laid out across chapters. But it makes the robotic writing feel even less appealing. I was also annoyed by Gregg tooting his own horn a little too much for my taste. He wastes no opportunity to point out which programs presented here exist thanks to him, down to the exact date and circumstances in which he built them. Who cares?! That is useless information to the reader.
I wish the last chapter, the case study, had appeared in every chapter, presenting real-world examples of how to apply all the theory Gregg teaches. In that sense the book feels like a mixture of a university course and a reference: lots of theory and frameworks, but very little time is spent on actually guiding the reader through how to transfer this knowledge to actual problems.
Nevertheless, I can see myself coming back to this book a lot. Again, the university analogy comes to mind: by the time you're taught the subject you have no idea how to apply any of it or when, but several years later you might encounter a problem and you have that light bulb moment in which you remember and dust off this tome and find the advice you need.
If you're looking for a gentle and most of all practical introduction to performance analysis, however, this isn't it.
Great book on debugging production systems. It offers a comprehensive but simple mental model for how systems work, and solid methodologies for looking at each component. The USE method stands out in particular: look at each system component (network, disk, CPU, memory, mutexes, ...) for utilization, saturation, and errors. Most of the time people use the 'streetlight' method instead, going through random tools they happen to know. Its absurdity is best illustrated by the parable of the drunk man who lost his keys in the dark but was looking for them under the streetlight.
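To make the USE idea concrete, here is a minimal sketch of what such a checklist could look like in Python. It is only an illustration, not code from the book: the `ResourceSample` type, the `use_check` function, the sample numbers, and the 0.7 utilization threshold are all hypothetical choices made for this example. In practice the numbers would come from whatever tools you already use for each resource.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """One observation of a resource (hypothetical structure for illustration)."""
    name: str
    utilization: float  # fraction of capacity in use, 0.0 to 1.0
    saturation: float   # queued work that could not be serviced, e.g. run-queue length
    errors: int         # count of error events


def use_check(samples, util_threshold=0.7):
    """Walk every resource and flag utilization, saturation, and errors.

    The threshold is an assumed value for this sketch, not a rule from the book.
    Errors are checked first because they are usually the quickest to interpret.
    """
    findings = []
    for s in samples:
        if s.errors > 0:
            findings.append((s.name, "errors"))
        if s.saturation > 0:
            findings.append((s.name, "saturation"))
        if s.utilization >= util_threshold:
            findings.append((s.name, "utilization"))
    return findings


# Made-up sample data: a busy CPU with queued work, and a disk reporting errors.
samples = [
    ResourceSample("cpu", utilization=0.95, saturation=3.0, errors=0),
    ResourceSample("memory", utilization=0.40, saturation=0.0, errors=0),
    ResourceSample("disk", utilization=0.20, saturation=0.0, errors=2),
    ResourceSample("network", utilization=0.10, saturation=0.0, errors=0),
]

for name, issue in use_check(samples):
    print(f"{name}: check {issue}")
```

The point of iterating over a fixed list of resources is that nothing gets skipped just because you have no favourite tool for it, which is exactly the failure mode of the streetlight approach.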
The reason why I can't give the 5th star is because it's focused on currently observable problems. Many of the gnarliest systems performance problems I've encountered happen for a shorter period of time under some hard-to-reproduce condition (where focus is on recovery, not understanding). You then have to dig through metrics _after_ the fact to find out what might have happened. This is often easy for errors, but not for saturation and utilization. Why is there nothing in the book about this?
Do not let the size daunt you, however. Chapters are self-contained, as the author understands that the book might be read under pressure, and they end with useful exercises.
What really makes this book stand out is not the top-notch technical writing or the abundance of useful one-liners, but the fact that the author goes further and suggests a methodology for troubleshooting and performance analysis, as opposed to the ad hoc methods of the past (or, best-case scenario, a checklist, and $DEITY forbid the use of the "blame someone else" methodology). In particular, the author suggests the USE methodology, USE standing for Utilization, Saturation, and Errors, to methodically and accurately analyze and diagnose problems. This methodology (which can be adapted and expanded at will; last time I checked, the book was not written in stone) is worth the price of the book alone.
The author correctly maintains that you must have an X-ray (so to speak) of the system at all times. By utilizing tools such as DTrace (available for Solaris and BSD) or the Linux equivalent SystemTap, much insight can be gained from the internals of a system.
Chapters 5-10 are self-explanatory: the author presents what the chapter is about, common errors, and common one-liners used to diagnose possible problems. As said before, chapters aim to be self-contained and can be read while actually troubleshooting a live system, so there are no lengthy explanations. At the end of each chapter, the bibliography section provides useful pointers to resources for further study, something that is greatly appreciated. Finally, the exercises can easily be turned into interview questions, which is another bonus.
Cloud computing, and the special considerations it presents, gets its own chapter, and the author tries to keep it platform-agnostic (even though he is employed by a "Cloud Computing" company), which is a nice touch. This is followed by a chapter of useful advice on how to actually benchmark systems, and the book ends with a sadly too short case study.
The appendices that follow should be read, as they contain a lot of useful one-liners (as if the ones in the book were not enough), concrete examples of the USE method, a guide to porting DTrace to SystemTap, and a who's who of the world of systems performance.
So how to sum up the book? "Incredible value" is one thought that comes to mind, "timeless classic" is another. If you are a systems {operator|engineer|administrator|architect}, this book is a must-have and should be kept within reach at all times. Even if your $DAYJOB does not have systems in the title, the book is going to be useful if you have to interact with Unix-like systems on a frequent basis.
Good to skim through to learn about what is possible and out there, but it's more like a reference book to check when needed with a specific performance problem.
Brendan is probably the de facto authority in the performance world. He walks through the Linux kernel internals and covers the performance of each area: memory, CPU, file systems, disks, and networks. His methodologies for analyzing performance problems are a must-read for SREs and performance engineers. There is a plethora of tools that Brendan helped create for Linux performance troubleshooting. I love the easy-to-follow, structured approach of Brendan's writing. In particular, the USE methodology, the drill-down methodology, and the block diagram with tools should be on the desk of every SRE, production engineer, and performance engineer.
My only concern is that I'm too late in picking up this first edition. BPF tools are not covered and some of the content is outdated at this point in time.
If you want to read this, please wait until November 2020, when the new edition hits the stands.
Very, very well written book. I didn't actually read it front to back. I read the first 4 chapters, which cover the foundations; chapter 5, which covers application-level performance; and the last 3 chapters, on cloud and multi-tenant performance, benchmarking, and a case study. The middle chapters dive into other specific topics like CPU, memory, file systems, etc. that I will reference on an as-needed basis.
Overall very well written, communicates concepts clearly, and reifies a lot of things that I often see go unnoticed or underappreciated. A book worth keeping on the bookshelf even after being read.
Absolutely amazing book on performance measurement. It contains a lot of theory on how to measure performance, from "what performance really is" (which is not so obvious) to examples of how to drill down. The book also contains a lot of practical examples of investigating performance issues. It looks slightly outdated (SystemTap, Solaris, DTrace), but it is really worth reading for admins and anyone who cares about performance.
Though at risk of being a tad ranty about how Solaris is better than Linux, Brendan Gregg shows a comprehensive understanding of kernel development and performance: he introduces each topic and then guides the reader through how to measure it. It's a must-read for Linux developers.
When investigating a CPU issue, I found that I had to dive much deeper than the details provided by this book. It's a good starting point though. Brendan Gregg is a legend.
One of my favourite all-round references when I need to do system benchmarking or troubleshooting. This book covers almost all aspects of the low-level stuff, from the kernel to network protocols, file systems, and disk systems. Highly recommended.