Edward Tufte is a personal and professional hero of mine. Professionally, he’s best known for his magisterial work in data visualization and data communication through such classics as The Visual Display of Quantitative Information (1983) and its follow-on volumes, but less well-known is his outstanding academic work in econometrics and statistical analysis. His 1974 book Data Analysis for Politics and Policy remains the single best book I’ve ever read in terms of teaching the power and pitfalls of statistical analysis. If you’re fluent in the language of econometrics (this is not a book for the uninitiated) and now you want to say something meaningful and true using that language, you should read this book (available for $2 in Kindle form on Tufte’s website). Personally, Tufte is a hero to me for escaping the ivory tower, pioneering what we know today as self-publishing, making a lot of money in the process, and becoming an interesting sculptor and artist. That’s my dream. That one day when the Great Central Bank Wars of the 21st century are over, I will be allowed to return, Cincinnatus-like, to my Connecticut farm where I will write short stories and weld monumental sculptures in peace. That and beekeeping.
But until that happy day, I am inspired in my war-fighting efforts by Tufte’s skepticism and truth-seeking. The former is summed up well in an anecdote Tufte found in a medical journal and cites in Data Analysis:
One day when I was a junior medical student, a very important Boston surgeon visited the school and delivered a great treatise on a large number of patients who had undergone successful operations for vascular reconstruction. At the end of the lecture, a young student at the back of the room timidly asked, “Do you have any controls?” Well, the great surgeon drew himself up to his full height, hit the desk, and said, “Do you mean did I not operate on half of the patients?” The hall grew very quiet then. The voice at the back of the room very hesitantly replied, “Yes, that’s what I had in mind.” Then the visitor’s fist really came down as he thundered, “Of course not. That would have doomed half of them to their death.” God, it was quiet then, and one could scarcely hear the small voice ask, “Which half?”
The latter quality — truth-seeking — takes on many forms in Tufte’s work, but most noticeably in his constant admonitions to LOOK at the data for hints and clues on asking the right questions of the data. This is the flip-side of the coin for which Tufte is best known, that good/bad visual representations of data communicate useful/useless answers to questions that we have about the world. Or to put it another way, an information-rich data visualization is not only the most powerful way to communicate our answers as to how the world really works, but it is also the most powerful way to design our questions as to how the world really works. Here’s a quick example of what I mean, using a famous data set known as “Anscombe’s Quartet”.
In this original example (developed by hand by Frank Anscombe in 1973; today there’s an app for generating all the Anscombe sets you could want) Roman numerals I – IV refer to four data sets of 11 (x,y) coordinates, in other words 11 points on a simple 2-dimensional plane. If you were comparing these four sets of numbers using traditional statistical methods, you might well think that they were four separate data measurements of exactly the same phenomenon. After all, the mean of x is exactly the same in each set of measurements (9), the mean of y is the same in each set of measurements to two decimal places (7.50), the sample variance of x is exactly the same in each set (11), the sample variance of y agrees across the sets to within about 0.01 (roughly 4.12), the correlation between x and y is the same in each set to three decimal places (0.816), and if you run a linear regression on each data set you get the same line plotted through the observations (y = 3.00 + 0.500x).
But when you LOOK at these four data sets, they are totally alien to each other, with essentially no similarity in meaning or probable causal mechanism. Of the four, linear regression and our typical summary statistical efforts make sense for only the upper left data set (set I). For the other three, applying our standard toolkit makes absolutely no sense. But we’d never know that, and we’d never know how to ask the right questions about our data, if we didn’t eyeball it first.
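If you don't have a plotting library handy, even a crude text plot makes the point. What follows is a rough sketch using only the Python standard library; `ascii_scatter` is a hypothetical helper written for this illustration, and the grid size and axis ranges are arbitrary assumptions. The data are again Anscombe's published values.

```python
# Anscombe's quartet (1973). Sets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
X4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
Y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
Y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
Y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
Y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]


def ascii_scatter(xs, ys, width=40, height=12, x_range=(0, 20), y_range=(0, 14)):
    """Draw points as '*' on a fixed character grid and return it as a string."""
    grid = [[" "] * width for _ in range(height)]
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    for x, y in zip(xs, ys):
        # Scale each coordinate onto the grid; nearby points may share a cell.
        col = round((x - x_lo) / (x_hi - x_lo) * (width - 1))
        row = round((y - y_lo) / (y_hi - y_lo) * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 is the top of the plot
    body = "\n".join("|" + "".join(cells) for cells in grid)
    return body + "\n+" + "-" * width


for name, xs, ys in [("I", X123, Y1), ("II", X123, Y2),
                     ("III", X123, Y3), ("IV", X4, Y4)]:
    print(f"Set {name}")
    print(ascii_scatter(xs, ys))
    print()
```

Even at this resolution the differences jump out: set I is a loose linear scatter, set II traces a smooth curve, set III is a tight line with a single outlier, and set IV stacks every point at x = 8 except one lonely observation at x = 19 that single-handedly determines the regression line.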
Okay, you might say, duly noted. From now on we will certainly look at a visual plot of our data before doing things like forcing a line through it and reporting summary statistics like r-squared and standard deviation as if they were trumpets of angels from on high. But how do you “see” multi-variate datasets? It’s one thing to imagine a line through a set of points on a plane, quite another to visualize a plane through a set of points in space, and impossible to imagine a cubic solid through a set of points in hyperspace. And how do you “see” embedded or invisible data dimensions, whether it’s an invisible market dimension like volatility or an invisible measurement dimension like time aggregation or an invisible statistical dimension like the underlying distribution of errors?