As I’ve mentioned previously, currently I’m working in the realm of Web analytics. I don’t have a deep statistics background, and I’m definitely not what anyone would mistake for a data scientist, but I do have a good understanding of how analytics can be applied to business problems.
I gained most of that understanding by way of being a baseball fan. I was hanging out with baseball nerds on the Internet talking about baseball analytics long before Moneyball was a twinkle in Michael Lewis’ eye.
Around the time most baseball teams started hiring their own analysts, I assumed that baseball analytics was a solved problem. Given all of the money at stake and all of the eyes on the problem, I expected new analytic insights to become less common. That has turned out not to be the case, for interesting reasons.
The aspect of baseball that makes it the perfect subject for statistical analysis is that every game is a series of discrete, recordable events that can be aggregated at any number of levels. At the top, you have the score of the game. Below that, there’s the box score, which shows how each batter and pitcher performed in the game as a whole. From there, you go to the scorecard, which is used to record the result of every play in a game, in sequence. Most of the early groundbreaking research into baseball was conducted at this level of granularity.
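To make those levels concrete, here is a minimal sketch in Python of how scorecard-style play records roll up into a box score and then into a final score. The `Play` record, its field names, the sample players, and the simplified at-bat rule (only walks are excluded) are all assumptions made for illustration, not a real scoring format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Play:
    """One scorecard entry (hypothetical layout, for illustration only)."""
    inning: int
    team: str
    batter: str
    result: str  # e.g. "single", "strikeout", "home_run", "walk"
    runs: int    # runs scored on this play


HITS = {"single", "double", "triple", "home_run"}
NOT_AT_BAT = {"walk"}  # simplified: real scoring rules exclude more cases


def box_score(plays):
    """Aggregate plays to one batting line per (team, batter)."""
    lines = defaultdict(lambda: {"at_bats": 0, "hits": 0})
    for play in plays:
        line = lines[(play.team, play.batter)]
        if play.result not in NOT_AT_BAT:
            line["at_bats"] += 1
        if play.result in HITS:
            line["hits"] += 1
    return dict(lines)


def game_score(plays):
    """Aggregate plays to total runs per team."""
    score = defaultdict(int)
    for play in plays:
        score[play.team] += play.runs
    return dict(score)


# Made-up sample data, not taken from any real game.
plays = [
    Play(1, "Home", "Adams", "single", 0),
    Play(1, "Home", "Baker", "home_run", 2),
    Play(1, "Away", "Cruz", "strikeout", 0),
    Play(2, "Away", "Diaz", "walk", 0),
    Play(2, "Home", "Adams", "strikeout", 0),
]

print(box_score(plays))
print(game_score(plays))
```

The same list of plays feeds both functions, which is the point: the finer-grained the recorded events, the more levels of summary you can build on top of them.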
What happened in baseball is that the instrumentation got a lot better, and the new data created the opportunity for new insights. For example, pitch-by-pitch records from every game became available, enabling a number of interesting new findings.
Now baseball analytics is being fed by superior physical observation of games. To go back in time, one of the greatest breakthroughs in baseball instrumentation was the radar gun, which enabled scouts to measure the velocity of pitches. That enabled analysts to determine how pitch velocity affects the success of a pitcher, and to more accurately value pitching prospects.
More recently, a new system called PITCHf/x has been installed at every major league ballpark. It measures the speed and movement of pitches, as well as exactly where they cross the strike zone. With it, you can measure how well umpires perform, as well as how good a pitcher’s various pitches really are. You can also measure how well batters can distinguish between balls and strikes and whether they’re swinging at the wrong pitches. This data enabled the New York Times to create the visualization in “How Mariano Rivera Dominates Hitters” back in 2010.
If you’re working on analytics and you find it’s difficult to glean new insights, it may be time to see if you can add further instrumentation. More granular data will always provide the opportunity for deeper analysis.
In an opinion piece in today’s New York Times, psychology researchers Leaf Van Boven and Charles M. Judd talk about research into people’s willingness to trust secret information simply because it is secret. I assume that one day we’ll have to add some sort of secrecy bias to the list of cognitive biases.