In this insightful book, you'll learn from the best data practitioners in the field just how wide-ranging -- and beautiful -- working with data can be. Join 39 contributors as they explain how they developed simple and elegant solutions on projects ranging from the Mars lander to a Radiohead video.
That's only a small sample of what you'll find in Beautiful Data. For anyone who handles data, this is a truly fascinating book.
A wide array of anecdotes about how people work with data, from creating surveys that people actually take, to making music videos with Radiohead and LiDAR, to genomics. Fun to read if you work with data all day; it gives some ideas of what to incorporate into your own workflow.
Beautiful Data, edited by Toby Segaran and Jeff Hammerbacher. The first chapter describes two projects by Yahoo. The first captures the area through which the user is travelling, with the aim of showing the user how much pollution they have been exposed to. The challenge was to show the exposure levels along with the user's movements and stops. A variety of techniques were tried before arriving at a way of overlaying the exposure as a coloured line and showing the locations where the user stayed for a long time as circles, with the colour of the line or circle indicating the exposure level. The second project involved analyzing users' activity tweets to give them a perspective on their behaviour. The challenge here was to get users to tweet their activities. This was achieved by asking the user to set a goal and then tweet, so that the application could analyze the user's activities and provide hints on which activities were positive and which were negative.
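As a rough idea of that line-and-circle overlay, here is a hypothetical sketch in Python/matplotlib with fabricated coordinates and exposure values; it is not the project's actual code.

# Hypothetical sketch (not the authors' code): overlay a GPS trace coloured by
# pollution exposure, with circles where the user lingered.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection

# Fabricated example data: longitude, latitude, exposure (0..1), dwell minutes.
lon = np.array([0.0, 0.1, 0.2, 0.3, 0.4])
lat = np.array([0.0, 0.05, 0.2, 0.25, 0.4])
exposure = np.array([0.1, 0.4, 0.9, 0.6, 0.2])
dwell = np.array([2, 30, 1, 45, 3])   # minutes spent near each point

# Build one coloured segment per consecutive pair of points.
points = np.column_stack([lon, lat]).reshape(-1, 1, 2)
segments = np.concatenate([points[:-1], points[1:]], axis=1)
lines = LineCollection(segments, cmap="RdYlGn_r", linewidths=4)
lines.set_array(exposure[:-1])        # colour each segment by its exposure level

fig, ax = plt.subplots()
ax.add_collection(lines)
# Long stays become circles, sized by dwell time and coloured by exposure.
stops = dwell > 15
ax.scatter(lon[stops], lat[stops], s=dwell[stops] * 20,
           c=exposure[stops], cmap="RdYlGn_r", alpha=0.6, edgecolors="k")
ax.autoscale()
fig.colorbar(lines, ax=ax, label="relative exposure")
plt.show()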
The second chapter is about a project to survey older people on the usage of a product that was about to be launched. The challenges were: how to survey a population that is not comfortable with computers; how to survey a population whose vision may be deteriorating with age; how to induce them to respond to a survey; and how to identify the geography from which they are responding. To induce respondents to complete the survey, it was kept as short as possible. The questions were displayed so that users did not feel they had to answer too much or too little, the fonts were selected so that the questions stood out, and the wording avoided all ambiguity. The geographic region was determined from the IP addresses of the respondents. In the end the project managed to garner a response rate far beyond that of a normal survey.
The third chapter presents the challenges of image processing on board the Mars rovers. Processing power was a challenge, since one worked with toughened (radiation-hardened) CPUs rather than ordinary processors (the CPU clock speed was only 20 MHz). RAM was limited to a few megabytes, and the only storage was some flash memory. The VxWorks real-time operating system was used as the OS, and the code was written in C. The main functionality was to store the images captured by the onboard cameras (each image about 1 megabyte in size), process them for any errors, and downlink them to Earth when connectivity was available. The final solution was a queued system: the image captured by the camera was stored in a section of the flash memory, the image processor picked up this image and cleaned it up, and the downlink module then sent it to Earth and, after a successful downlink, marked that space as available for the next image. All of this happened without the image ever being moved within the flash memory.
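Here is a minimal sketch of that fixed-slot queue idea in Python (the real system was written in C on VxWorks; the slot states and the cleanup step here are illustrative only).

# Illustrative sketch only: an image stays in its flash slot from capture
# until a confirmed downlink frees the slot.
from enum import Enum

class SlotState(Enum):
    FREE = 0
    CAPTURED = 1
    PROCESSED = 2

class FlashImageQueue:
    def __init__(self, num_slots):
        self.state = [SlotState.FREE] * num_slots
        self.data = [None] * num_slots

    def capture(self, image):
        """Camera writes a new image into the first free slot."""
        for i, s in enumerate(self.state):
            if s is SlotState.FREE:
                self.data[i] = image
                self.state[i] = SlotState.CAPTURED
                return i
        raise RuntimeError("flash full: no free slot")

    def process(self):
        """Image processor cleans up captured images in place."""
        for i, s in enumerate(self.state):
            if s is SlotState.CAPTURED:
                self.data[i] = self.data[i].strip()   # stand-in for real cleanup
                self.state[i] = SlotState.PROCESSED

    def downlink(self, send):
        """Send processed images; free a slot only after a confirmed downlink."""
        for i, s in enumerate(self.state):
            if s is SlotState.PROCESSED and send(self.data[i]):
                self.data[i] = None
                self.state[i] = SlotState.FREE

q = FlashImageQueue(num_slots=4)
q.capture(" raw image bytes ")
q.process()
q.downlink(send=lambda img: True)   # pretend the link was available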
The fourth chapter describes the design of PNUTShell, a distributed database at Yahoo! used for a variety of purposes. The key requirement was geographic replication and distribution of the data with minimum latency. Some of the principles that were followed were (a small sketch of the mastership idea follows this list):
1. Each record had a mastership, which is the place where it gets updated. The mastership changes only if updates are observed to be coming from another geographic location consistently over a long period of time.
2. The tables were not modelled in the traditional relational way but were logically grouped based on the expected access patterns.
3. The replication order from one geography to another was fixed rather than random, which ensured that replication travelled the minimum distance.
4. The system let the application choose between availability and consistency on a per-table basis. If the main replica goes down, the application can keep the table available and sacrifice consistency, or stall the functionality and sacrifice availability until the main partition is back up.
5. Because of the data structures used, certain complex queries were not possible.
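To make principle 1 concrete, here is a hedged sketch of per-record mastership (my own illustration, not Yahoo!'s code; the handoff threshold of three consecutive remote updates is an assumption).

# Writes are attributed to the region they come from; mastership migrates only
# after several consecutive updates arrive from the same non-master region.
from collections import deque

HANDOFF_THRESHOLD = 3   # consecutive remote updates before mastership moves

class Record:
    def __init__(self, key, master_region):
        self.key = key
        self.master = master_region
        self.value = None
        self.recent_origins = deque(maxlen=HANDOFF_THRESHOLD)

    def update(self, value, origin_region):
        # In the real system a non-master region forwards the write to the
        # master; here we only model the bookkeeping.
        self.value = value
        self.recent_origins.append(origin_region)
        if (len(self.recent_origins) == HANDOFF_THRESHOLD
                and all(r == origin_region for r in self.recent_origins)
                and origin_region != self.master):
            self.master = origin_region   # mastership follows the traffic

rec = Record("user:42", master_region="us-west")
for _ in range(3):
    rec.update({"status": "active"}, origin_region="eu-central")
print(rec.master)   # -> "eu-central" after three consistent remote updates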
The fifth chapter talks about the rise of the data scientist: people who are expected to take a vast amount of data, process it quickly, and make sense of it. The chapter describes how the data about access patterns collected at Facebook was initially kept in MySQL, moved to Oracle, and finally had to be moved to Hadoop as the volume of data being generated grew.
The sixth chapter talks about an exercise carried out in England in which users were asked to take photographs of the localities they move around in, geotag them, and provide tags describing the topography of the terrain. These images were then used to create tree maps, which provided insights into the geography of England as well as the differences in the terms used by people in different regions.
The seventh chapter speaks about the very interesting concept of "data finds data". Some of the examples quoted are:
1. A guest calls reception at 8:00 AM asking for a wake-up call at 12:00 PM, but soon the maid knocks on the door for housekeeping. If "data could find data", the maid's schedule would have been altered so that the guest was not disturbed.
2. A user searches for a soon-to-be-released book which is not yet available on Amazon. If Amazon keeps track of this search and notifies the user when the shipment of books arrives, the user can buy it if interested. If this notification does not go through, the user may come back after a month only to find that the book is sold out.
3. A parent checks a website for the safety of a particular street because her child walks to school along it. There is nothing to report on the street. If the website keeps track of this inquiry and, when an incident occurs, notifies the parent, the parent is alerted; otherwise she may never notice until the next time she checks the site.
4. Government departments generally do not share data with each other, given security and privacy requirements. This prevents identification of patterns: for example, the same person being involved in terrorism and narcotics could show up if these departments used the "data finds data" principle and looked up each other's databases, possibly in a restricted, secure fashion.
A federated search across the various data stores is one way to achieve this, but it can be a tedious and long-drawn-out process given the amount of data that needs to be parsed. A better way is to extract and classify the data and then act on it. A system that implements this feature needs the following building blocks (a small sketch of the idea follows the list):
1. The existence and availability of observations
2. The ability to extract and classify features from the observations
3. The ability to efficiently discover related historical context
4. The ability to make assertions (same or related) about new observations
5. The ability to recognize when new observations reverse earlier assertions
6. The ability to accumulate and persist this asserted context
7. The ability to recognize the formation of relevance/insight
8. The ability to notify the appropriate entity of such insight
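As an illustration only (this is not the chapter's system), here is a toy Python sketch of standing interests that are persisted so that a new observation can "find" the people who care about it.

from collections import defaultdict

class DataFindsData:
    def __init__(self):
        self.interests = defaultdict(list)   # feature -> [subscriber, ...]
        self.history = defaultdict(list)     # feature -> [observation, ...]

    def register_interest(self, feature, subscriber):
        """E.g. a parent watching a street, or a shopper watching a book title."""
        self.interests[feature].append(subscriber)
        # Anything already known about this feature is surfaced immediately.
        return list(self.history[feature])

    def observe(self, observation, features):
        """A new observation arrives; it finds the people who care about it."""
        notifications = []
        for f in features:
            self.history[f].append(observation)
            for who in self.interests[f]:
                notifications.append((who, observation))
        return notifications

system = DataFindsData()
system.register_interest("elm-street", "worried-parent@example.com")
print(system.observe("incident reported on Elm Street", ["elm-street"]))
# -> [('worried-parent@example.com', 'incident reported on Elm Street')]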
The eighth chapter is about how Gnip uses an event-driven mechanism, rather than polling, to gather social data, "clean" and "normalize" it, and provide it to subscribers. The key takeaway is how inefficient polling for data is compared to having data pushed to subscribers.
The ninth chapter is about how search engines scan only the static content of websites and fail to scan "deep". Take, for example, a second-hand-car website. Typically the user provides some inputs, such as the state in which she is looking for second-hand cars, the models she is interested in, and the price range, and then clicks a button to get the list of matching cars. A normal search-engine bot will not do any of this; all it will tell somebody searching for second-hand cars is that here is a site that sells them. If instead the crawler could dig "deep" into the site by simulating a real user and index those pages too, the search engine could provide much more detail. But this is not easy to achieve, since one needs to decide which parameter values to search with: the permutations and combinations are too many, and most do not make sense. The chapter describes an approach of first probing the site with generic values such as "*" or blank for the parameters, parsing the output, and using it to gauge the actual parameter values that can be passed to get meaningful data. The idea is not to retrieve every single record stored behind the scenes, but to cover most types of hidden data and index them so that the search engine is enriched.
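A simplified, self-contained sketch of that probe-then-refine idea follows (my own illustration, not the chapter's crawler; the "site" here is a stand-in function rather than a real form endpoint).

import re

def fake_used_car_site(make="*"):
    """Stand-in for a real form endpoint; '*' behaves like an empty search."""
    inventory = {
        "honda": ["Honda Civic 2006", "Honda Accord 2004"],
        "toyota": ["Toyota Corolla 2005"],
    }
    if make == "*":
        # Many sites leak the valid facet values on a generic results page.
        return "Browse by make: Honda | Toyota"
    return " ; ".join(inventory.get(make.lower(), []))

def probe_deep(site):
    # Step 1: generic probe with a wildcard value.
    overview = site("*")
    # Step 2: mine the overview page for plausible parameter values.
    candidates = re.findall(r"[A-Z][a-z]+", overview.split(":", 1)[1])
    # Step 3: re-query with each candidate and index whatever comes back.
    index = {}
    for make in candidates:
        index[make] = site(make)
    return index

print(probe_deep(fake_used_car_site))
# {'Honda': 'Honda Civic 2006 ; Honda Accord 2004', 'Toyota': 'Toyota Corolla 2005'}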
The tenth chapter is about how Radiohead's music video "House of Cards" was made without any cameras, using only data captured by two main pieces of equipment that measured the amount of light reflected back by the environment they scanned. The whole scene is captured as a set of point coordinates along with the intensity at each point. The data was made publicly available for people to play around with and create their own videos and effects.
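For anyone who wants to play with point-cloud data of this sort, here is a hedged sketch; it assumes a hypothetical local CSV (points.csv) whose rows are x, y, z coordinates plus an intensity value - the layout of the actual released files may differ.

import numpy as np
import matplotlib.pyplot as plt

# points.csv is a hypothetical local file with rows like: 1.2,3.4,0.5,87
points = np.loadtxt("points.csv", delimiter=",")
x, y, z, intensity = points.T

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
# Brighter points reflected more light back to the scanner.
ax.scatter(x, y, z, c=intensity, cmap="gray", s=1)
ax.set_axis_off()
plt.show()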
The eleventh chapter is about how a website was built for the residents of Oakland by scraping data from the police department and displaying it on a map, making it easier for residents to track crime in the streets of Oakland. As in the first chapter, this involves displaying data on a map, and the challenges are similar. The site was briefly shut down when the police department figured out what these people were doing and blocked access to its website. The scraping stopped only when the police department finally relented and started supplying the data at the end of each day as an Excel sheet.
The twelfth chapter is about a project that took more than 100 years of US census data and built visualizations around it. Some interesting concepts were:
1. Stacked line charts were used to show percentages over time, e.g. the percentage of the population engaged in a particular field of work, or the percentage of the population by place of birth. These could be drilled down into as further sets of stacked line charts.
2. A population pyramid is a chart where the left side represents the male population and the right side the female population, with age group on the Y axis; a year slider let the user visualize how the population changed by age group over time (see the small illustration below).
3. Doubly linked discussion: users could annotate the views, and each comment was linked to the view the user was seeing, so users could navigate from comments to views and from views to comments.
4. Users could also annotate the graphs graphically, overlaying the view with lines, circles, etc., and share these with others.
5. There was a feature by which views could be collected for later viewing or sharing: the user simply chose "Add View" to add it to a graphical list of bookmarks.
This project was not released to the outside world. Instead, the technology was used by IBM to build the Many Eyes site (many-eyes.com), which users can use to upload their own data, create visualizations of it, and share them.
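A small illustration of the population-pyramid idea, with made-up numbers (matplotlib; not the project's code).

import matplotlib.pyplot as plt

age_groups = ["0-9", "10-19", "20-29", "30-39", "40-49", "50+"]
male = [520, 540, 610, 580, 450, 700]      # fabricated counts
female = [500, 530, 620, 600, 470, 820]

fig, ax = plt.subplots()
# Male counts drawn to the left (negative), female counts to the right.
ax.barh(age_groups, [-m for m in male], color="steelblue", label="male")
ax.barh(age_groups, female, color="salmon", label="female")
ax.set_xlabel("population (left: male, right: female)")
ax.set_ylabel("age group")
# A year slider in the original let users watch this shape change over time.
ax.legend()
plt.show()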
Chapter 13 - What Data Doesn't Do This chapter talks about scenarios which cannot be handled well with data alone. Some things which seem trivial to the eye are not easy to figure out from data: an ugly duckling amongst a set of swans in an image can be immediately spotted by the eye, but detecting the same from the data that represents the image is not easy. Similarly, reading the line "Iamnotgoingtocomehometoday" is easy for the eye, but not for the computer. Conversely, there are scenarios where the computer can figure out trends more easily than the human eye: from a complex scatter plot it is difficult for the eye to detect a trend, whereas with mathematical techniques a computer can find a suitable one. Sometimes one can be misled by a biased "narrative"; this is called the "narrative fallacy". For example, if one is shown a set of graphs, told that they represent the stock prices of three companies in the manufacturing industry, and asked to pick the stock that will perform well in the coming days, one tends to guess based on the data provided; humans tend to build a story around the data to support their conclusions. The tendency to apply a past conclusion to the present analysis is called "confirmation bias". The author states that data does not necessarily drive one in the right direction because (1) our tools for using data are inexact, and (2) we process data with known biases. The author also makes a set of statements:
1. More data isn't always better: this argument works well for data that has a normal distribution, but not all data is normally distributed, so it does not apply in all scenarios.
2. More data isn't always easy: capturing, storing and processing large amounts of data is not easily done, even given the advances we have seen in processors.
3. Data alone doesn't explain: the author argues that "given two variables correlated in a statistically significant way, causality can work forward, backward, in both directions, or not at all". The author cites an example of how an article in the Wall Street Journal suggested that since premarital cohabitation is correlated with higher rates of divorce, unwed couples could avoid living together in order to improve their chances of staying together after marriage - a very skewed conclusion at best.
4. Data isn't good for a single answer: analysis of data does not usually lead to a single conclusion; it usually points to several possible ones.
5. Data doesn't predict: in a controlled environment it is possible to predict an outcome with near certainty, but in domains with less certainty, such as human or physical behaviour, modeling is an important tool to help explain patterns, and in one's eagerness one can tend to overfit a model.
6. Probability isn't intuitive: the author cites an example of how probability is not always intuitive and states that "when using data to answer a question, we don't know what evidence to exclude and how to weigh what we include".
7. Probabilities aren't intuitive: when dealing with multiple probabilities it becomes even trickier, and one tends to be biased by prior experience.
8. The real world doesn't create random variables: one can get carried away by statistics, forgetting that statistics are not laws of nature, which can lead to very wrong conclusions. In the real world there is a lot of interconnection, and the data observed is not a set of random, independent values.
9. Data doesn't stand alone: it is not easy to make a decision based only on data. For example, when deciding whether to give a person a loan, it is not only the financial credentials that influence the decision; factors like the social background of the applicant and of the approver also influence the outcome.
10. Data isn't free from the eye of the beholder: the same data viewed and analyzed by different people can lead to different conclusions because of their personal cognitive biases.
Chapter 14 - Natural Language Corpus Data This chapter talks about how the corpus of natural-language words (tokens) that Google has accumulated over the years can be effectively used to improve the interpretation of natural-language text. One example the author illustrates is word segmentation: interpreting a phrase like choosespain.com. Does this mean "choose spain" or "chooses pain"? Based on the frequency with which these terms appear together in Google's collection of phrases, it is possible to say with reasonable certainty that the phrase is "choose spain", since "choose spain" occurs 3120 times whereas "chooses pain" does not occur at all. But for a phrase like "insufficientnumbers" it becomes difficult to determine whether it means "in sufficient numbers" or "insufficient numbers". A human may be able to tell from context, but for a computer this is hard, especially since "in sufficient numbers" occurs 32378 times and "insufficient numbers" occurs 20751 times. The second topic the author covers is how to use this data to decipher secret codes, and the third is spelling correction. The author mentions that other interesting applications would be: 1. language identification, 2. spam detection and other classification tasks, 3. author identification, 4. document unshredding and DNA sequencing, 5. machine translation.
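In the spirit of the chapter (this is not Norvig's actual code), here is a toy frequency-based segmenter using a tiny made-up count table in place of the real Google corpus.

from functools import lru_cache

COUNTS = {"choose": 800, "spain": 700, "chooses": 90, "pain": 400,
          "in": 9000, "sufficient": 300, "insufficient": 250, "numbers": 500}
TOTAL = sum(COUNTS.values())

def prob(word):
    if word in COUNTS:
        return COUNTS[word] / TOTAL
    # Unseen words get a small penalty that grows with their length.
    return 10.0 / (TOTAL * 10 ** len(word))

def prod(nums):
    result = 1.0
    for n in nums:
        result *= n
    return result

@lru_cache(maxsize=None)
def segment(text):
    """Return the split into words with the highest product of word probabilities."""
    if not text:
        return ()
    candidates = ((text[:i],) + segment(text[i:]) for i in range(1, len(text) + 1))
    return max(candidates, key=lambda words: prod(prob(w) for w in words))

print(segment("choosespain"))          # -> ('choose', 'spain')
print(segment("insufficientnumbers"))  # result depends entirely on the counts used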
Chapter 15 - Life in Data: The Story of DNA In this chapter the author describes the massive effort needed to create, capture, and process the billions of DNA sequence reads involved in sequencing genomes.
Chapter 16 - Beautifying Data in the Real World This chapter starts off with the premise that there are two fundamental problems with collecting "beautiful data": 1. The universe is inherently noisy, so one will tend to get different readings of the same experiment under slightly varying circumstances. 2. Space limitations: the raw data on which conclusions are based tends to be too large. The question that arises is how to present the raw data so that somebody looking at the conclusions can validate their relevance.
The author goes on to describe the different techniques used to minimize these problems when collecting data on the chemical behaviour of different chemicals.
Chapter 17 - Superficial Data Analysis: Exploring Millions of Social Stereotypes This chapter talks about the site facestat.com, which allows users to post their photos and ask the world to comment on their appearance. When sufficient data had been collected, it was used to analyze stereotypes. One of the standout observations was that people tended to rate women as more beautiful than men, and that the majority of children were considered cute or beautiful, more so than any other age group.
Chapter 18 - Bay Area Blues: The Effect of the Housing Crisis This chapter talks about how data on house sales in California's Bay Area over the period 2000 to 2010 was analyzed to figure out the impact of the housing-loan crisis on house prices. The data was analyzed from various perspectives, each providing different insights.
Chapter 19 - Beautiful Political Data This chapter explains how data from different US elections was analyzed to yield some very interesting insights into how people tended to vote.
Chapter 20 - Connecting Data This chapter talks about how similar or identical data obtained from different sources can be linked to find the right matches. It discusses how data stored in graph form can be identified as the same by approaching it from different directions. For example, to decide whether two movie records are the same, one can try to reach the movie through its actors, through its director, and through its year of release; if all, or a majority, of these paths lead to the same pair of nodes from the different sources, then it is very likely that the two records describe the same movie.
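A rough sketch of that idea (my own illustration, not the chapter's code): count how many independent paths agree before declaring two records the same.

def same_movie(a, b, threshold=2):
    evidence = 0
    if set(a["actors"]) & set(b["actors"]):
        evidence += 1
    if a["director"] == b["director"]:
        evidence += 1
    if a["year"] == b["year"]:
        evidence += 1
    # A majority of independent paths leading to the same node is strong
    # evidence that the two records describe one movie.
    return evidence >= threshold

record_from_source_a = {"actors": ["Actor A", "Actor B"], "director": "Director X", "year": 1999}
record_from_source_b = {"actors": ["Actor B"], "director": "Director X", "year": 1999}
print(same_movie(record_from_source_a, record_from_source_b))   # -> True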
A collection of 20 articles about data. Some are very dated (~2009), some are a bit timeless. Some are actually about data (and beautiful data), while more are about "I did this project to look at a dataset, here's how I did it" (both dated, and not really about beautiful data). One article on user interface (or "UX"/user experience) design for online forms definitely has not been read widely enough: spend the time to make forms easy and engaging, not painful and bureaucratic. Overall, worth reading and reviewing. Also - not at all a book like Tufte's classics. Most of the articles in the collection referenced Tufte, and a few had minor mentions of him. I was expecting something "Tufte-like" and this isn't.
"Beautiful Data" is a collection of essays on data; how people have transformed it, worked within its confines, and offers a glimpse of where we might go. Many of the essays are wonderful snippets into how some people perceive data while others fall flat. Overall its a mostly enjoyable read that helps open up your mind to new potentials.
First, a disclaimer: I am not a data person. However, I've been involved fairly heavily in the data field. In the parlance of the world, I'm a back-end person. However, I'm always trying to think about the front end: how will things be used, and what information can we glean from the system (or systems)? With that in mind, this is a book that speaks to me - it's all about the front end.
Some of the best essays in the book would be:
The first essay, by Nathan Yau, talks very much about user-created data and personal databases (knowledge bases). What's exciting here is how he takes data already out there, data you have provided, and creates something useful and, yes, beautiful out of it.
The second essay, by Follett and Holm, really gets down to the point that if you want the data, you need to present the survey in a way that brings people into the process. As someone who has a slight crush on the statistics and practice of polling (and designing poll questions), I found this essay a fascinating read.
The third essay by Hughes detailed how he handled images on the Mars mission. There wasn't anything here that wasn't done in embedded systems 15 years ago; still it was a great walk down memory lane since I used to program embedded imaging systems.
Chapter 4 really hit home: PNUTShell is cloud storage and data processing in real time. This really is the stuff of the future.
Chapter 5 by Jeff Hammerbacher really didn't offer too many insights, but his writing style is fluid and fun, plus he offered a glimpse into how Facebook grew.
We then have the slow section of the book - Chapter 8 on distributed social data had promise but read more like a company white paper than an interesting article. Same with Chapter 12 and sense.us.
Thankfully chapter 10 on Radiohead's "House of Cards" video was there - and here we are presented with true beauty in data - beautiful enough to create a music video out of!
I'm still on the fence about Chapter 13 - What Data Doesn't Do. It was an interesting chapter, but it felt both too long and too short at the same time. I almost felt that if the author, Coco Krumme, were to write a book on this topic, I'd want to read it. However, her essay was not the right vehicle.
Finally, the last chapter - "Connecting Data" - was a truly inspiring piece, one that offers up paths for the future. I am sure a few startups will form around the questions posed by Segaran (or maybe around the questions those questions raise).
Overall there were enough strengths to overcome the weak chapters. My main complaints are trivial: poor binding of the book, and too many PhD-candidate papers with not enough from out in the trenches. I'd love to see something from Stonebraker here; it's hard to talk about beautiful data and not have him in it. Or forget Sense.us and talk about Many Eyes. Or MapReduce. Still, "Beautiful Data" succeeds. It opened up my mind to different possibilities for data representation and usage.
Looking forward to his book which is due in July 2009.
The 39 contributors of the book explain how they developed simple and elegant solutions on projects ranging from the Mars lander to a Radiohead video.

Some of the topics include:
* opportunities and challenges of datasets on the Web
* visualizing trends in urban crime using maps and data mashups
* the challenges of data processing systems for space travel
* how crowdsourcing and transparency help drug research
* automatic alerts when new data overlaps pre-existing data
* the massive investment needed to create, capture, and process DNA data
This is a collection of 20 different stories about data - gathering, planning, interpreting, storing, visualizing, etc.
The case study methodology points out the necessity of designing all phases of the data capture, processing, analysis and representation process around the goals, open questions and constraints of the client organization, or of the user/consumer of the data whose decisions are being informed. The thinking and design process behind these cases of beautiful data is fully described - this will enable you (or an untalented artist such as myself) to design systems which answer the questions and support the decisions of the individual or organization who needs this data.
The book has three kinds of chapters - crap, somewhat OK and amazing.
The bad chapters are hand-waving, with nothing new or interesting - mostly fodder - but they are easy to spot. The amazing ones, on the other hand (like the one by Peter Norvig), give you great insights and ideas about what can be done, and show it very well.
Also, some people over-use the word 'beautiful' which makes them sound like a (very) bad love poem.
This one's a keeper. An excellent book offering insightful surveys of what people are doing with data these days.
My favorite chapters are: "Seeing Your Life in Data", "Information Platforms and the Rise of Data Scientists", "Portable Data in Real Time", "Visualizing Urban Data", "Natural Language Corpus Data", and "Superficial Data Analysis".
Thought-provoking collection of essays about different aspects of data. The book was a bit slow to start but soon picked up and covered a variety of thought-provoking topics. Although I feel that it just scratched the surface and the whole book isn't greater than the sum of its parts, frequent changes of subject with each chapter worked to its benefit and kept me interested.
"Natural Language Corpus Data", "What Data Doesn't Do", and "Connecting Data" were my favorite articles. I also liked the idea of using Second Life for easy scientific 3D visualization in "Beautifying Data in the Real World".
A thought-provoking tour of recent data-intensive research. I found it most useful for its "aha" and "I should have thought of that" moments. I also appreciated the numerous external references to free datasets and data processing tools.
Always tough when you deal with an edited volume, but a good book overall. I expected a bit more from some chapters, but it gives a good idea of what is going on overall.