The popularity of the Web and Internet commerce provides many extremely large datasets from which information can be gleaned by data mining. This book focuses on practical algorithms that have been used to solve key problems in data mining and which can be used on even the largest datasets. It begins with a discussion of the map-reduce framework, an important tool for parallelizing algorithms automatically. The authors explain the tricks of locality-sensitive hashing and stream processing algorithms for mining data that arrives too fast for exhaustive processing. The PageRank idea and related tricks for organizing the Web are covered next. Other chapters cover the problems of finding frequent itemsets and clustering. The final chapters cover two applications: recommendation systems and Web advertising, each vital in e-commerce. Written by two authorities in database and Web technologies, this book is essential reading for students and practitioners alike.
Mining of Massive Datasets is a clear, practical, and studied exploration of how to extract meaning from huge datasets (terabytes, exabytes, petabytes, oh my). I recommend the .
The book uses practical examples, including spam email, Google's PageRank, and Netflix's recommendation service, to explore the algorithms necessary to process huge data on infrastructures like MapReduce.
The authors have the necessary experience to define the field. Ullman is the powerhouse behind several venerable CS textbooks, including Compilers: Principles, Techniques, and Tools and Database Systems: The Complete Book. He also advised Sergey Brin before that student went on to co-found his own small startup. The first author, Anand Rajaraman, is another of Ullman's students who's had his own helping Amazon get its wings.
The book is not a statistical exploration, but a true computer science book. The statistics they do employ are simple. They avoid the mathematical rigour of validating their statistical approaches and take a more intuitive approach, employing the simplest statistical models they can. But calculating even simple statistics can be complex when data is distributed across hundreds of computers.
For example, Google's PageRank algorithm has to multiply matrices that represent the entire Web's link structure. If you'd like to know how that kind of work is done, this book is for you. However, the text is not a how-to focused on implementation details. This means it will age gracefully, but it will require supplementary reading before you can analyze huge data on your own.
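To give a flavour of the PageRank computation mentioned above, here is a minimal single-machine sketch in Python. It is not the book's distributed method: the function name `pagerank`, the three-page graph, and the damping value `beta=0.85` are all illustrative assumptions, and the matrix-vector product is written as plain loops over a link dictionary rather than a real matrix.

```python
def pagerank(links, beta=0.85, iterations=50):
    """Power iteration on a small link graph (illustrative sketch only).

    links maps each page to the list of pages it links to; every link
    target is assumed to also appear as a key. beta is an assumed
    damping factor, not a value taken from the review.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start from a uniform vector
    for _ in range(iterations):
        # Every page receives the "teleport" share first.
        new_rank = {page: (1.0 - beta) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # Split this page's rank evenly among the pages it links to.
                share = beta * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dead end: spread its rank evenly over all pages.
                share = beta * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: a links to b and c, b links to c, c links to a.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

Each pass of the outer loop is one multiplication of the rank vector by the link matrix. At Web scale that single step is the part that has to be spread across many machines.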
The most profound moment of this book came to me while reading about the Bloom filter. Without going into detail, the Bloom filter is capable of filtering a huge list of incoming items and answering the simple question "have I seen this before?" But the Bloom filter is optimized so that it doesn't remember what it has seen, just that it has seen it. Therefore, the Bloom filter can recognize that an object is familiar without the ability to pinpoint when it first saw it.
This algorithm produces a remarkably human result. I know the experience of recognizing that an item is familiar without being able to figure out where I've seen it before. It seems our minds have developed something very similar to the Bloom filter to mine the data of our experience.
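For readers who want to see the "familiar, but I can't say from where" behaviour concretely, here is a minimal Bloom filter sketch in Python. The class name, the sizes (`num_bits=1024`, `num_hashes=3`), and the use of seeded SHA-256 as the hash family are my own assumptions for illustration, not the book's construction.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter sketch; sizes and hashing scheme are assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the code simple; a real filter packs bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive num_hashes positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # Only bits are set; the item itself is never stored.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True means "probably seen"; False means "definitely not seen".
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
seen.add("alice@example.com")
print(seen.might_contain("alice@example.com"))  # True: added items are always recognized
```

The filter keeps only a few set bits per item, so it cannot say what it saw or when. An item that was never added can occasionally look familiar (a false positive), but an item that was added is never missed.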
Lots of insight into assorted subjects: networks, algorithms, matrix operations for CS, machine learning and advertising (of all things). Some history on data dredging and its development. Quite a lot of data modeling made simple and easy to understand.
Takeaways:
- PageRank
- Apriori
- MapReduce
- hashing
- graphs
- SimRank
- CineMatch
- CUR-decomp
Easy material delivery and quite a lot of breadth in topic selection and coverage. The only minus I see is that there are not as many practical tasks as I'd like. That's easily outweighed by the easy explanations of all kinds of unwieldy theoretical concepts.
This is the textbook for the Mining of Massive Datasets course at Stanford. It was very helpful when I took the course on Coursera. It describes different aspects of the domain and the theory behind existing solutions (search engines, network analysis, recommender systems, online algorithms). It keeps a good balance between strict mathematical theory, with all the proofs, and references to its practical applications in modern systems. There is a wide variety of algorithms and ideas for applications in different domains. Not boring at all; I recommend it.
Sooner or later you're going to discover problems too big to solve with most traditional approaches. The authors show a wide range of problems where size can get out of control: finding similar items, working with data streams and graphs, and many more. I loved the Coursera course, but I found the book a bit too dry. Many of these problems could be well illustrated, but the book is lacking visually. Still, it is a great resource written in a rather accessible manner. It could have been more entertaining, though.
This is more what I was looking for with the other "Big Data" book I read, although it is quite a bit over my head and aimed more at college-level study.
I expect this is something I will reference back to later. I read the "free" pdf version, but I'd like to have a copy of the updated version when it becomes available.
There's an up-to-date free version online and a full-fledged Stanford MOOC on Coursera. I took the course initially without much reference to the text, but while the lectures were excellent (and the whole course one of the best I've taken on the platform), I wish I'd had more time to go through the book first.
I skimmed this book to decide whether to enroll in the Stanford course of the same name. I will definitely enroll in the next available session. Very interesting stuff about squeezing information from big datasets.
This book is definitely a great companion to the Coursera MMDS class. But it lacks a few things, such as proper introductions and a natural flow of information. Good for quick reference and examples, not so great for studying or understanding in depth.
A great way to learn about many of the roots of modern software engineering. I sometimes missed having more figures and visual explanations, but it's an excellent book overall.
Read the part on recommendations and combined it with the online lecture. Good for theory, but not suitable if you are in dire need of implementing the code.
Brilliant introduction to data mining and its real-world uses and scope. The chapters on frequent itemsets and mining social graphs are explained brilliantly.
Highly recommend this book for anybody interested in Data Mining.