Mining of Massive Datasets

The popularity of the Web and Internet commerce provides many extremely large datasets from which information can be gleaned by data mining. This book focuses on practical algorithms that have been used to solve key problems in data mining and which can be used on even the largest datasets. It begins with a discussion of the map-reduce framework, an important tool for parallelizing algorithms automatically. The authors explain the tricks of locality-sensitive hashing and stream processing algorithms for mining data that arrives too fast for exhaustive processing. The PageRank idea and related tricks for organizing the Web are covered next. Other chapters cover the problems of finding frequent itemsets and clustering. The final chapters cover two applications: recommendation systems and Web advertising, each vital in e-commerce. Written by two authorities in database and Web technologies, this book is essential reading for students and practitioners alike.

326 pages, Hardcover

First published October 27, 2011

91 people are currently reading
1,048 people want to read

About the author

Jure Leskovec

1 book, 2 followers

Ratings & Reviews


Community Reviews

5 stars: 120 (48%)
4 stars: 97 (39%)
3 stars: 22 (8%)
2 stars: 5 (2%)
1 star: 1 (<1%)
Displaying 1 - 19 of 19 reviews
Profile Image for Ben Haley.
58 reviews, 16 followers
October 31, 2011
Mining of Massive Datasets is a clear, practical, and studied exploration of how to extract meaning from huge datasets (Terabytes, Exabytes, Petabytes oh my). I recommend the .

The book uses practical examples, including spam email, Google's PageRank, and Netflix's recommendation service, to explore the algorithms necessary to process huge data on infrastructures like MapReduce.

The authors have the necessary experience to define the field. Ullman is the powerhouse behind several venerable CS textbooks, the '' (Compilers: Principles, Techniques, and Tools), and Database Systems: The Complete Book. He also advised Sergey Brin before that student went on to co-found his own small startup. The first author, Anand Rajaraman, is another of Ullman's students who's had his own helping Amazon get its wings.

The book is not a statistical exploration, but a true computer science book. The statistics the authors do employ are simple. They avoid the mathematical rigour of validating their statistical approaches and take a more intuitive approach, employing the simplest statistical models they can. But calculating even simple statistics can be complex when data is distributed across hundreds of computers.

For example, Google's PageRank algorithm has to multiply together matrices that represent the entire web's link structure. If you'd like to know how that kind of work is done, this book is for you. However, the text is not a how-to focused on implementation details. This means it will age gracefully, but will require supplementary reading before you can analyze huge data on your own.

The most profound moment of this book came to me while reading about the . Without going into detail, the Bloom filter is capable of filtering a huge list of incoming items and answering the simple question 'have I seen this before?'. But the Bloom filter is optimized so that it doesn't remember what it has seen, just that it has seen it. Therefore, the Bloom filter can recognize that an object is familiar without the ability to pinpoint when it first saw it.

This algorithm produces a remarkably human result. I know the experience of recognizing that an item is familiar without being able to figure out where I've seen it before. It seems our minds have developed something very similar to the bloom filter to mine the data of our experience.
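
The "familiar but can't say from where" behaviour described above can be illustrated with a minimal Bloom filter sketch in Python. This is not code from the book: the class name `BloomFilter`, the method names, the default sizes, and the use of salted SHA-256 as the hash family are all assumptions chosen for illustration. The filter stores only a fixed array of bits, so it can report that an item was probably seen, never which items were stored or when. It may occasionally report an unseen item as seen (a false positive), but it never misses an item that was actually added.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter; names and sizes are hypothetical."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple (not space-optimal).
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # Record the item by setting its bits; the item itself is not stored.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True means "probably seen"; False means "definitely not seen".
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
seen.add("spam@example.com")
print(seen.might_contain("spam@example.com"))  # True
```

Because only bits are kept, memory use stays fixed no matter how many items pass through, at the cost of a false-positive rate that grows as the bit array fills up.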
Profile Image for ☘Misericordia☘ ⚡ϟ⚡⛈⚡☁ ❇️❤❣.
2,519 reviews, 19.2k followers
September 29, 2020
Lots of insight into assorted subjects: networks, algorithms, matrix operations for CS, machine learning and advertising (of all things). Some history on data dredging and its development. Quite a lot of data modeling made simple and easy to understand.
Takeaways:
- PageRank
- Apriori
- MapReduce
- hashing
- graphs
- SimRank
- CineMatch
- CUR-decomp
Easy material delivery and quite a lot of breadth in topic selection and coverage. The only minus I see is that there are not as many practical tasks as I would like. That's easily outweighed by the easy explanations of all kinds of unwieldier theoretical concepts.

Profile Image for Natalia Shakhalova.
5 reviews
November 30, 2014
This is the textbook for the Mining of Massive Datasets course at Stanford. It was very helpful when taking this course on Coursera. It describes different aspects of the domain and the theory behind existing solutions (search engines, network analysis, recommender systems, online algorithms). It keeps a good balance of strict mathematical theory with all the proofs and references to its practical applications in modern systems. Wide variety of algorithms and ideas for applications in different domains. Not boring at all, I recommend it.
16 reviews, 2 followers
May 28, 2017
Sooner or later you're going to discover problems too big to solve with most traditional approaches. The authors show a wide range of problems where size can get out of control: finding similar items, working with data streams and graphs, and many more. I loved the Coursera course, but I found the book a bit too dry. Many of these problems could be illustrated well, but the book is short on visuals. Still, it is a great resource written in a rather accessible manner. It could have been more entertaining, though.
Profile Image for Shane.
97 reviews, 2 followers
August 6, 2013
This is closer to what I was looking for from the other "Big Data" book I read.
That said, it is quite a bit over my head and aimed more at college study.

I expect this is something I will reference back to later. I read the "free" pdf version, but I'd like to have a copy of the updated version when it becomes available.
Profile Image for Yasiru.
197 reviews, 138 followers
April 10, 2015
There's an up-to-date free version at and a full-fledged Stanford MOOC on Coursera. I took the course initially without much reference to the text, but while the lectures were excellent (and the whole course one of the best I've taken on the platform) I wish I'd had more time to go through the book first.
19 reviews, 4 followers
December 20, 2014
I got a lot from this, which was surprising as I had already read some books on the subject. The nicest part is the Locality-Sensitive Hashing.

This is just very good quality. Quite a few ideas you can use in your practice.
Profile Image for Victor.
72 reviews, 9 followers
September 14, 2015
I skimmed this book to decide whether to enroll in the Stanford course of the same name. I will definitely enroll in the next available session: very interesting stuff about squeezing information from big data sets.
Profile Image for Akash Goel.
164 reviews, 13 followers
November 18, 2015
This book is definitely a great companion to the Coursera MMDS class. But it lacks a few things, such as proper introductions and a natural information flow. Good for quick reference and examples, not too great for studying or understanding in depth.
Profile Image for Bryan.
670 reviews, 24 followers
April 27, 2015
It is a textbook. It is a good book for the topic. Not much to say here.
Profile Image for Nick Greenquist.
121 reviews, 3 followers
June 18, 2018
One of the best books you can read in the realm of data mining, machine learning, and generally doing really cool things with piles of data.
Profile Image for Mickaël A.
137 reviews, 7 followers
December 12, 2018
A great way to learn about many roots of modern software engineering. I sometimes missed having more figures and visual explanations, but it's an excellent book overall.
Profile Image for Cem.
48 reviews
December 5, 2019
A really good textbook for the "Foundations and Applications of Data Mining" INF 553 class.
Profile Image for Yk Chia.
75 reviews, 1 follower
February 6, 2021
Read the part on recommendations and combined it with the online lecture. Good for theory, but not suitable if you urgently need to implement the code.
Profile Image for Varun Reddy.
17 reviews
April 4, 2019
Brilliant introduction to Data Mining and its real-world usages and scope. The chapters on Frequent Itemsets and Mining Social graphs are explained brilliantly.

Highly recommend this book for anybody interested in Data Mining.
