|Beautiful Data: The Stories Behind Elegant Data Solutions|
|author||Edited by Toby Segaran and Jeff Hammerbacher|
|publisher||O'Reilly Media, Inc.|
|summary||A collection of twenty essays and chronicles from the implementers of successful projects revolving around real world data processing and display.|
Chapter One deals with two projects done by grad students: Personal Environmental Impact Report (PEIR) and your.flowingdata (YFD). This chapter starts out slow describing how the system harnesses personal GPS devices — a common trend in phone development these days. After clearing the basics, the chapter reveals a lot about the iterative developments the author took to select and include a map interface to effectively and quickly display several routes that a user has driven with intuitive visual queues to indicate which was the most environmentally expensive. Trying to stick with the green means good and red means bad proved difficult and they employed an inverted map of mostly shades of gray to avoid clashing colors with the natural colors on a regular map. The final part of PEIR discussed a Facebook application that simply paired you up against friends also using PEIR. This gave the user a relative value basis of otherwise incomprehensible numbers surrounding their environmental impact. YFD focuses more on an interface for accumulating Twitter data from a user to help them track sleeping and weight loss.
The second chapter deals entirely with constructing a very simple survey that has a variable length depending on what answer you give to an earlier question. While this seems to be a very simple task, the chapter does a great job of explaining how you can make it better and why doing this makes it better. A great quote from this chapter is "The key method for collecting data from people online is, of course, through the use of the dreaded form. There is no artifact potentially more valuable to a business, or more boring and tedious to a participant." The chapter points out that for every action you require the user to make, the user may decide the survey is not worth their time. Yes, clicking "Next" on a multi-page form only gives the user another chance to decide this isn't worth it. Furthermore, many pages might cause the user to be unsure of the real length of the survey. So they decided against this and instead made the survey branch from one page so that page would continually get a little larger depending on how you answered the questions. Knowing the targets for the surveys were older made a copy large font mandatory as 72% of Americans report vision impairment by the time they are age 45. This chapter dealt more with collecting the data, respecting the source of data and building trust with the participants than displaying the data they provided.
Chapter Three deals with the recently disabled Phoenix that landed on Mars and how precisely the image collection was done. While it might seem like the wrong place to do it, there was actually pre-processing and compression done on board the lander before transmission to Earth. This article tackles interesting issues that are long thought to be an extinct animal in computer science where resources are constrained and radiation bombarding keeps the CPU modestly lower than your average desktop. Do you process the image in place in memory or make a copy so that the original image can be retained during processing? These are familiar issues to embedded developers but stuff I haven't touched since college. While the author details the situation on all fronts down to the cameras being used, it's largely a blast from the past as far as resource aware computing is concerned. Then again, I doubt any of my code will ever be flight certified by NASA.
Chapter Four has a very interesting analysis and description of Yahoo!'s PNUTS system for serving up data in complex environments like tackling issues with latency across the world when dealing with social networking. The chapter does a decent job of explaining how issues are resolved when replicated servers across the United States become out of sync and the resolution strategy. The chapter ends on an even more interesting note explaining why Yahoo! deviated from Google's BigTable, Amazon's Dynamo, Microsoft's Azure and other existing implementations. This tale of well thought out design is a stark contrast to Chapter Five which centers on a Facebook 'data scientist' that — instead of explaining the solution as a well planned finalized implementation — tells the trial and error approach of a very small team of developers treading into waters unknown with data sets of Sisyphean proportions. It was tempting for me to read this chapter and chastise the author for not foreseeing what numbers could come with making it big in social networking. But the chapter has a lot of value in a "lessons learned" realm. It may even prepare some of you who are writing web applications with a potentially explosive or viral user base. While it's popular to hate Facebook and in turn transfer that hate to the developers, no one can argue against them being one of the most successful social networking sites and any information of their (sometimes flawed) operations certainly proves to be interesting.
Chapter Six was completely unengaging for me. The chapter covers geographing. More specifically the efforts to take pictures of Britain and Ireland and map/display them geographically. The images would aim to cover a large area than users could tag them with what they see (tree, road, hill, etc). Unfortunately it never really registered with me why someone would want to do this and what the end goal was that they were aiming for. Instead they managed to produce some pretty heinous and very difficult to digest heat maps or "spatial tree maps." By embedding coloration and lines into the treemaps the authors hoped to convey intuitive information to the reader. Instead my eyes often glazed over and sometimes I flat out disagreed with their affirmation that this is how to display data beautifully. You're welcome to try to convince me that geographing has some sort of merit other than producing pretty mosaics of large image sets but it took a lot of effort for me to continue reading at points in this chapter.
Chapter Seven sets the book back on track in "Data Finds Data" where the writers cover very important concepts and problems surrounding federated search and instead offer up directories with some semantic metadata or relationship data that makes keyword searching possible over billions of documents. For anyone dealing with large volumes of data, this chapter is a great start to understanding the options you have to processing your data when you first get it (and only once) versus searching for that data just in time and paying for it in delay. While the former incurs much more disk space cost, Google has proven that paradigm shift definitely has merit.
Chapter Eight is about social data APIs and pushes gnip heavily as the de facto social endpoint aggregator for programmers. The chapter mentions WebHooks as an up and coming HTTP Post event transmission project but doesn't offer much more than a wake up call for programmers. The traditional polling has dominated web APIs and has lead to fragile points of failure. This chapter is a much needed call for sanity in the insane world of HTTP transactional polling. Unfortunately, the community seems to be so in love with the simplicity of polling that they use it for everything, even when a slightly more complicated eventing model would save them a large percentage of transactions.
Chapter Nine is a tutorial on harvesting data from the deep web. What they mean by this is that — given proper permission — one can exploit forms on websites to access database data and then index that instead of merely being relegated to static HTML pages. In my opinion, this is a fragile and often frowned upon approach to data collection but as this chapter (and many others) illustrates, sometimes data is locked up due to lack of resources to expose it. This means that if a repository of information is meant to be available to you through a simple submission form, you can tease that information out of "the deep web" and into your system with the tricks mentioned in this chapter.
Chapter Ten is the story of Radiohead's open sourced "data" music video of "House of Cards" and the collection process from the kinds of devices used to the methodology of collecting that data to the attitude they used when treating the data. This chapter is a sort of key for understanding what data you have with Radiohead's offerings and I heavily recommend it for anyone interested in taking a stab at this video. The most interesting things I found in this was their method for collection and, more importantly, their decision to actually degrade the data and opted not to texture when displaying Thom Yorke's face — citing artistic choice. This chapter gave me one very amazing display tool that I am embarrassed to admit I had no knowledge of prior to this book: processing.
Chapter Eleven is the story of a few people that chose to do something about serious crime problems in Oakland. The city was compiling reports of crimes weekly but they weren't opening up the data. You could do a search and get a very minimal display on a map of crimes that had happened. This caused Oakland Crimespotting to arise. At first they were forced to graphically scrape and estimate crime locations so their own system could offer it back to the user in more intuitive and useful ways to the citizens so the citizens could take action. At first they were forced to work around problems but in the end the city government came to its senses and began offering them the data in a far more open format. From browsing the site now, you can get an idea of the tale this chapter tells. The evolution of that end product is chronicled in this chapter.
Chapter Twelve center's on sense.us, a potentially powerful product that aims to empower users to analyze and create notations on graphs that might relay correlations between factors inside US Census data. The only disappointment with this chapter is that sense.us isn't live for us to use. The tool shows powerful abilities in collaboration in analysis of census data but also is a double edged sword. There's nothing that stops this tool from being used for political and monetary ideals instead of purely academic revelations. They used tools like Colorbrewer and prefuse to dynamically generate graphs and charts that were pleasing to the eye. Then they used 'geometric annotation' (a vector graphic approach to recording user's doodling and annotations) in order to facilitate collaboration. The notes the researchers took on the collaboration between their pilot users is probably more intriguing than their actual approach to display good graphics. Each user seemed to take a natural progression from annotation producer to annotation crawler and then bounce between them as other user annotations gave them ideas for more annotations to create. While not exactly ideal collaboration, it's interesting to hear what users do in the wild when left to their own.
Chapter Thirteen "What Data Doesn't Do" is a very short chapter with a set of ten or so rules that are intended to remind you that data doesn't predict, more data isn't always better or easier, probabilities do not explain, data doesn't stand alone, etc. This chapter felt sort of like a pause and remember way point through the book. Just when you've gone through these great stories of success, the book, reels you back into reality with this chapter. In other chapters you'll be reminded to avoid pitfalls like the narrative fallacy but this book just reminds you quite literally what data doesn't do automatically for you. It's an indicator that you need to shore up these things that data doesn't magically do when you present data.
Chapter Fourteen is Peter Norvig's "Natural Language Corpus Data" and does not disappoint. Once the reader is empowered with the code and the data in this chapter, it almost seems like one could solve several problems using ngrams, Bayes' theorem and natural language analysis. As you read this chapter, Norvig lays out how to tackle several problems with ease: decoding encryption levels up to WWII, spelling correction, machine translation and even spam detection. In just 23 pages, Norvig conveys a tiny bit of the power of a corpus of documents coupled with the willingness to be a little dirty (total probabilities summing to more than one, dropping ngrams below a threshold, etc). It's clear why he's employed at Google.
Chapter Fifteen takes a drastic turn into one of Earth's oldest data stores: DNA. As the chapter so coyly notes, programmers can view DNA as a simple string: char(3*10^6) human_genome; The chapter gives you a brief glimpse of DNA analysis but focuses more on the data storage involved in facilities that are currently working to harvest data from many subjects. As of the writing of this chapter, one facility was generating 75 terabits per week in raw data. Most interesting to me from this chapter was ensemble.org, a site to find DNA data, genome data and also collaborate with other researchers on annotating and commenting on certain parts and regions of DNA.
Similar to the previous chapter, Chapter Sixteen focuses briefly on chemistry and describes how data was collected "to predict teh solubility of a wide range of chemicals in non-aqueous solvents such as ethanol, methanol, etc." Having a very minimal chemistry background, it's never really revealed what purpose this data collection has but nonetheless the chapter explains a lot of challenges in this environment that are similar to other chapters. The interesting aspect of this chapter is that the team used open notebook science to collect this data and therefore faced the challenge of cleaning crowd-sourced data. A constantly recurring problem in these chapters is how one represents data and chemistry apparently has many standards — some more open than others. This book makes a very good argument for open standards and selecting open standards when one witnesses the screen scraping, licensing issues and costs researchers face when unifying data even for something as old as the representations of chemicals.
Chapter Seventeen is the case study of FaceStat, a statistically more ambitious Hot-or-Not effort from researchers. The site would allow anyone to upload a photo of a person and then allow users to rate them and tag them. After collecting this data, the researchers used the ubiquitous R statistical language to do some feature extraction on the data. Of course, the chapter first deals with cleaning the data and catching bad user input. While this chapter sounds like vanilla run-of-the-mill feature extraction, it also includes some interesting display examples as well as the very interesting yet controversial stereotype analysis. From taboo topics like attractiveness vs age line fitting to the sexism of tags to using k-means in order to establish stereotype clusters in the data. While other chapters sought offense through possible privacy concerns, this chapter reveals more about the callow stereotypes that internet inflict upon each other.
Chapter Eighteen looks at the San Fransisco Bay Area housing market from a very interesting selection of recent years. What differentiates this chapter from so many of the others (we collect, clean and process the data) is that it needed to break the data down by neighborhood to find the really interesting features of the data. The neighborhoods could then be grouped into six different groups with their increase in house prices to their decline in house prices. Only one group had one neighborhood that showed no decline (Mountain View). Unfortunately for this chapter and the next one, by the time the reader arrives they appear to be straight forward replications of ideas from other chapters. Chapter Nineteen is brief chapter on statistics inside politics. Aside from revealing five or six interesting correlations in voting revealed through data, this chapter merely relays what we already know: politicians implement statistics to a sometimes harmful degree (gerrymandering).
The last chapter is, appropriately, about the many sources of data exposed on the internet and the problems everyone faces in matching entities from one data source to another. The idea of using a URI to describe a movie hasn't really seemed to catch on. And if that wasn't enough, even words like "location" used to describe a column could mean drastically different things between houses and genomes. The chapter lists out a number of sources where data is available to download and tinker with (most already listed in the book) and proceeds to analyze an algorithmic (collective reconciliation) way for a system to differentiate between two movies with the same name. Naturally the author of this chapter worked on freebase which was recently (and predictably) acquired by Google. Although a short chapter, it speaks to problems that all online data communities face and what prohibits mashups from automagically happening between two disparate data sources holding data that is actually related.
With the exception of chapter six, every chapter offered me something that I won't forget. More importantly, most chapters offered a data source or data processing tool that expanded my toolbox of things to use when programming. The only reason this book misses a perfect 10/10 from me is chapter six and a couple of the later chapters feeling like weaker ideas from earlier chapters rehashed into a different domain. A worthwhile book if you work with data — whether you be a consumer or producer.
You can purchase Beautiful Data: The Stories Behind Elegant Data Solutions from amazon.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.