rc3.org

Strong opinions, weakly held

Tag: analytics

Where have referrers gone?

This article in Business Insider is the first media mention I’ve seen discussing the disappearance of referrers on inbound traffic to Web sites. For people who work in analytics, especially on sites that make money by selling advertising, this is a really big deal. In many cases, analytics can be invasive from a privacy standpoint, but referrers generally don’t contain any information you’d just as soon not disclose. Hopefully this will spur a wider discussion of this change.

For what it’s worth, the article is wrong about why browsers strip referrers from traffic that originates on HTTPS sites. When you are viewing an encrypted page, browsers want to make sure that none of the encrypted information is sent over a non-encrypted connection. So when you click a link on an encrypted page that points to a non-encrypted page, the browser strips the referrer to avoid leaking information that was encrypted over the non-encrypted connection. Referrers are not stripped when you click a link from one encrypted page to another, even if they’re on different domains. Sites can potentially get referrers back by switching to HTTPS, but only if people link to the HTTPS URLs. So if I have a site that accepts both HTTP and HTTPS, and all of the links indexed by Google are HTTP links, the referrers will be stripped even if the user ultimately lands on a secure page. In this case, it’s not really a choice on the part of browser vendors to protect user privacy, but rather one to preserve the confidentiality of encrypted information.
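
To make the rule concrete, here’s a minimal sketch of the default behavior described above (my own illustration in Python, not any browser’s actual code):

```python
# Minimal sketch of the default referrer behavior described above -- my own
# illustration, not any browser's actual implementation.
from typing import Optional
from urllib.parse import urlparse

def referer_sent(from_url: str, to_url: str) -> Optional[str]:
    """Return the Referer value a browser would send by default, or None if it is stripped."""
    if urlparse(from_url).scheme == "https" and urlparse(to_url).scheme == "http":
        return None  # encrypted origin, unencrypted destination: referrer stripped
    return from_url  # otherwise sent, including HTTPS to HTTPS across domains

# A search result that links to the HTTP version of a site loses the referrer,
# even if the site later redirects the visitor to HTTPS.
print(referer_sent("https://www.google.com/search?q=example", "http://example.com/"))   # None
print(referer_sent("https://www.google.com/search?q=example", "https://example.com/"))  # the full URL
```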

Update: Also, apparently this discussion of traffic without referrers has been going on for a while.

Retailers fight to control customer data

John Gruber has a piece up about retailers disabling NFC at checkout to prevent customers from checking out using Apple Pay. Retailers are intentionally degrading the customer experience in order to retain the ability to collect data about their customers’ habits. This tradeoff is near and dear to me, as analytics is currently a huge part of my job.

What I’d like to know is, what’s the return these companies are getting from tracking the behavior of specific users? For one thing, the work to build systems to exploit this data is resource intensive, and often results in failure. Companies are risking hurting their business by inconveniencing customers in exchange for the opportunity to make more money by exploiting the purchase history of their customers. I’d be really, really surprised if the economics actually work.

That’s not analytics

OK, so a guy made a Web site called Tab Closed Didn’t Read to post screen shots of sites that hide their content behind various kinds of overlays that demand that users take some action before proceeding. He’s written a follow-up blaming the problem on over-reliance on analytics, mainly because some people justify the use of these intrusive calls to action by citing analytics. Anyone who justifies this sort of thing based on analytics should be sued for malpractice.

You can measure almost anything you like. It’s up to the practitioner to determine which metrics really matter to them. In e-commerce the formula is relatively simple. If you’re Amazon.com, you want people to buy more stuff. Cart adds are good, but not as good as purchases. Adding items to your wish list is good, but not as good as putting them in your cart. If Amazon.com added an overlay to the home page urging people to sign up for a newsletter, it might add newsletter subscribers, but it’s quite likely that it would lead to less buying of stuff, less adding of stuff to the shopping cart, and less adding of items to wish lists. That’s why you don’t see annoying overlays on Amazon.com.

Perhaps in publishing, companies are less clear on which metrics really matter, so they optimize for the wrong things. Let’s not blame analytics for that.

Analysts and their instruments

As I’ve mentioned previously, currently I’m working in the realm of Web analytics. I don’t have a deep statistics background, and I’m definitely not what anyone would mistake for a data scientist, but I do have a good understanding of how analytics can be applied to business problems.

I gained most of that understanding by way of being a baseball fan. I was hanging out with baseball nerds on the Internet talking about baseball analytics long before Moneyball was a twinkle in Michael Lewis’ eye.

Around the time most baseball teams started hiring their own analysts, I assumed that baseball analytics was a solved problem: given all of the money at stake and all of the eyes on the problem, new analytic insights would become less and less common. That has turned out not to be the case, for interesting reasons.

The aspect of baseball that makes it the perfect subject for statistical analysis is that every game is a series of discrete, recordable events that can be aggregated at any number of levels. At the top, you have the score of the game. Below that, there’s the box score, which shows how each batter and pitcher performed in the game as a whole. From there, you go to the scorecard, which is used to record the result of every play in a game, in sequence. Most of the early groundbreaking research into baseball was conducted at this level of granularity.
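
To make those levels of aggregation concrete, here’s a small sketch that rolls hypothetical play-by-play records up into a single box score line (the field names and plays are invented for illustration):

```python
# Hypothetical play-by-play records, rolled up into a box score line.
# Field names and plays are invented for illustration.
from collections import Counter

plays = [
    {"batter": "Jeter", "result": "single"},
    {"batter": "Jeter", "result": "strikeout"},
    {"batter": "Jeter", "result": "home_run"},
    {"batter": "Jeter", "result": "flyout"},
]

HITS = {"single", "double", "triple", "home_run"}

def box_score_line(batter, plays):
    results = Counter(p["result"] for p in plays if p["batter"] == batter)
    at_bats = sum(results.values())            # ignoring walks, sacrifices, etc.
    hits = sum(results[r] for r in HITS)
    return {"AB": at_bats, "H": hits, "HR": results["home_run"]}

print(box_score_line("Jeter", plays))  # {'AB': 4, 'H': 2, 'HR': 1}
```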

What happened in baseball is that the instrumentation got a lot better, and the new data created the opportunity for new insights. For example, pitch-by-pitch records from every game became available, enabling a number of interesting new findings.

Now baseball analytics is being fed by superior physical observation of games. To go back in time, one of the greatest breakthroughs in baseball instrumentation was the radar gun, which enabled scouts to measure the velocity of pitches. That enabled analysts to determine how pitch velocity affects the success of a pitcher, and to more accurately value pitching prospects.

More recently, a new system called PITCHf/x has been installed at every major league ball park. It measures the speed and movement of pitches, as well as where, exactly, they cross the strike zone. With it, you can measure how well umpires perform, as well as how good a pitcher’s various pitches really are. You can also measure how well batters can distinguish between balls and strikes and whether they’re swinging at the wrong pitches. This data enabled the New York Times to create the visualization in How Mariano Rivera Dominates Hitters back in 2010.
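
As a rough illustration of the kind of question this data can answer, here’s a sketch that scores an umpire’s called pitches against a simplified rectangular strike zone (the field names and zone boundaries are my assumptions for illustration, not the actual PITCHf/x schema):

```python
# Rough sketch: score an umpire's called pitches against a simplified
# rectangular strike zone. Field names and zone boundaries are assumptions
# for illustration, not the actual PITCHf/x schema.
ZONE_HALF_WIDTH = 0.83   # feet from the center of the plate (approximate)

def call_is_correct(pitch):
    in_zone = (abs(pitch["px"]) <= ZONE_HALF_WIDTH and
               pitch["sz_bot"] <= pitch["pz"] <= pitch["sz_top"])
    return in_zone == (pitch["call"] == "called_strike")

pitches = [
    {"px": 0.2,  "pz": 2.5, "sz_bot": 1.5, "sz_top": 3.4, "call": "called_strike"},
    {"px": 1.1,  "pz": 2.0, "sz_bot": 1.5, "sz_top": 3.4, "call": "called_strike"},  # off the plate
    {"px": -0.3, "pz": 3.0, "sz_bot": 1.5, "sz_top": 3.4, "call": "ball"},           # should be a strike
]

accuracy = sum(call_is_correct(p) for p in pitches) / len(pitches)
print(f"called-pitch accuracy: {accuracy:.0%}")  # 33% on this toy sample
```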

If you’re working on analytics and you find it’s difficult to glean new insights, it may be time to see if you can add further instrumentation. More granular data will always provide the opportunity for deeper analysis.

Big Data and analytics link roundup

Here are a few things that have caught my eye lately from the world of Big Data and analytics.

Back in September, I explained why Web developers should care about analytics. This week I noticed a job opening for a Web developer at Grist that includes knowledge of analytics in the list of requirements. That doesn’t exactly make for a trend, but I expect to see a lot more of this going forward.

Also worth noting are the two data-related job openings at Rent the Runway. They have an opening for a data engineer and one for a data scientist. These two jobs are frequently conflated, and there is some overlap in the skill sets, but they’re not the same thing. For the most part what I do is data engineering, not data science.

If you do want to get started in data science, you could do worse than to read Hilary Mason’s short guide. Seth Brown has posted an excellent guide to basic data exploration in the Unix shell. I do this kind of stuff all the time.

Here are a couple of contrary takes on Big Data. In the New York Times, Steve Lohr has a trend piece on Big Data, Sure, Big Data Is Great. But So Is Intuition. Maybe it’s different on Wall Street, but I don’t see too many people divorcing Big Data from intuition. Usually intuition leads us to ask a question, and then we try to answer that question using quantitative analysis. That’s Big Data to me. For a more technical take on the same subject, see Data-driven science is a failure of imagination from Petr Keil.

On a lighter note, Sean Taylor writes about the Statistics Software Signal.

John Myles White explains multi-armed bandit testing

If you’re interested in A/B testing on the Web, you should check out John Myles White’s talk at Tumblr on multi-armed bandit testing. You can learn a lot about standard A/B testing from the explanation he gives as a contrast to how multi-armed bandit tests work. I’ve read a lot of blog posts on multi-armed bandit tests, and this lecture is better than any of them in terms of explaining how this sort of testing actually works.
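
For a flavor of the difference, here’s a minimal epsilon-greedy sketch, which is one of the simplest bandit strategies and not necessarily the approach covered in the talk: instead of splitting traffic evenly for the duration of the test the way a standard A/B test does, it gradually shifts traffic toward whichever variant is converting better while still exploring occasionally.

```python
# Minimal epsilon-greedy bandit sketch: show the variant with the best observed
# conversion rate most of the time, but keep exploring with probability epsilon.
# One of the simplest bandit strategies, not necessarily the one from the talk.
import random

class EpsilonGreedy:
    def __init__(self, variants, epsilon=0.1):
        self.epsilon = epsilon
        self.shows = {v: 0 for v in variants}
        self.conversions = {v: 0 for v in variants}

    def _rate(self, variant):
        shows = self.shows[variant]
        return self.conversions[variant] / shows if shows else 0.0

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(list(self.shows))   # explore
        return max(self.shows, key=self._rate)       # exploit the best observed rate

    def record(self, variant, converted):
        self.shows[variant] += 1
        self.conversions[variant] += int(converted)

# Usage: each time a visitor arrives, choose() picks the variant to show,
# and record() feeds back whether that visitor converted.
bandit = EpsilonGreedy(["A", "B"])
variant = bandit.choose()
bandit.record(variant, converted=True)
```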

People who are wrong about data analysis

The first order of people who don’t get data analysis are those who believe it’s impossible to make accurate predictions based on data models. They’ve been much discussed all week in light of the controversy over Nate Silver’s predictions about the Presidential campaign. If you want to catch up on this topic, Jay Rosen has a useful roundup of links.

There are, however, a number of other mistaken ideas about how data analysis works that are also problematic. For example, professional blowhard Henry Blodget argues in favor of using data-driven approaches, but then says the following:

If Romney wins, however, Silver’s reputation will go “poof.” And that’s the way it should be.

I agree that if Silver’s model turns out to be a poor predictor of the actual results, his reputation will take a major hit; that’s inevitable. However, Blodget puts himself on the same side as the Italian court that sent six scientists to jail for their inaccurate earthquake forecast.

If Silver’s model fails in 2012, he’ll revisit it and create a new model that better fits the newly available data. That’s what forecasting is. Models can be judged on the performance of a single forecast, but analysts should be judged on how effectively they adapt their models to account for new data.

Another post that I felt missed the point was Natalia Cecire’s argument that attempting to predict the winner of the election by whatever means is a childish waste of time:

A Nieman Lab defense of Silver by Jonathan Stray celebrates that “FiveThirtyEight has set a new standard for horse race coverage” of elections. That this can be represented as an unqualified good speaks to the power of puerility in the present epistemological culture. But we oughtn’t consider better horse race coverage the ultimate aim of knowledge; somehow we have inadvertently landed ourselves back in the world of sports. An election is not, in the end, a game. Coverage should not be reducible to who will win? Here are some other questions: How will the next administration govern? How will the election affect my reproductive health? When will women see equal representation in Congress? How will the U.S. extricate itself from permanent war, or will it even try? These are questions with real ethical resonance. FiveThirtyEight knows better than to try to answer with statistics. But we should still ask them, and try to answer them too.

I, of course, agree with her that these are the important questions about the election. When people decide who to vote for, it should be based on these criteria, and the press should be focused on getting accurate and detailed answers to these questions from the candidates.

The fact remains, however, that much of the coverage is still focused on the horse race. Furthermore, much of that horse race coverage is devoted to topics that do not seem to matter when it comes to predicting who will win the election. This is where data-driven analysis can potentially save us.

If it can be shown that silly gaffes don’t affect the ultimate result of the election, there may be some hope that the press will stop fixating on them. One of the greatest benefits of data analysis is that it creates the opportunity to end pointless speculation about things that can in fact be accurately measured, and more importantly, to measure more things. That creates the opportunity to focus on matters of greater importance or of less certainty.

One quick analytics lesson

Yesterday I saw an interesting link on Daring Fireball to a study that reported the results of searching for 2028 cities and towns in Ontario in the new iOS 6 Maps app, the one for which Apple has apologized. Unsurprisingly, the results of the searches were not very good.

The first question that sprang to my mind when I read the piece, though, was, “How good are the Google Maps results for these searches?” Not because I thought Google’s results would be just as bad, but because looking at statistics in isolation is not particularly helpful when it comes to doing useful analysis. Obviously you can look at the results of the searches and rate Apple Maps against reality, but rating them against their competitors is also important. What should our expectations be, really?

Marco Tabini dug into that question by running the same searches under iOS 5.1 (which still uses the Maps app backed by Google’s data). He found that the old Maps app does not outperform the new one by a wide margin, and he found some interesting differences in how Apple and Google handle location searches.

This isn’t an argument that people shouldn’t be mad about the iOS 6 Maps search capabilities or lack of data, but rather that useful comparisons are critical when it comes to data analysis. That’s why experiments have control groups. Analysis that lacks baseline data is particularly pernicious in cases when people are operating under the assumption that they already know what the baseline is. In these cases, statistics are more likely to actually make people less informed.
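
As a toy illustration of why the baseline matters, here’s the arithmetic with invented counts (these are not the actual results of either study):

```python
# Toy illustration of why a baseline matters. The counts are invented;
# they are not the actual results of either maps comparison.
queries_tested = 2028

results = {
    "new_maps": 1150,   # hypothetical count of successful searches
    "old_maps": 1320,   # hypothetical count for the comparison app
}

for app, successes in results.items():
    print(f"{app}: {successes / queries_tested:.0%} of searches succeeded")

# Judged against an implicit baseline of 100%, the first number looks awful.
# Judged against the available alternative, the gap is much smaller.
gap = (results["old_maps"] - results["new_maps"]) / queries_tested
print(f"gap between the two: {gap:.0%}")
```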

Why Web developers should care about analytics

I’m pretty sure the universe is trying to teach me something. For as long as I can remember, I’ve been dismissive of Web analytics. I’ve always felt that they’re for marketing people and that, at least in the realm of personal publishing, paying attention to analytics makes you some kind of sellout. Analytics is a discipline rife with unfounded claims and terrible, terrible products, as well as people engaging in cargo cultism that they pretend is analysis. Even the terminology is annoying. When people start talking about “key performance indicators” and “bounce rate” my flight instinct kicks in immediately.

In a strange turn of events, I’ve spent most of this year working in the field of Web analytics. I am a huge believer in making decisions based on quantitative analysis, but I never connected that to Web analytics. As I’ve learned, Web analytics is just quantitative analysis of user behavior on Web sites. The problem is that it’s often misunderstood and usually practiced rather poorly.

The point behind this post is to make the argument that if you’re like me, a developer who has passively or actively rejected Web analytics, you might want to change your point of view. Most importantly, an understanding of analytics gives the team building for the Web a data-based framework within which they can discuss their goals, how to achieve those goals, and how to measure progress toward achieving those goals.

It’s really important as a developer to be able to participate in discussions on these terms. If you want to spend a couple of weeks making performance improvements to your database access layer, it helps to be able to explain the value in terms of increased conversion rate that results from lower page load time. Understanding what makes your project successful and how that success is measured enables you to make an argument for your priorities and, just as importantly, to be able to understand the arguments that other people are making for their priorities as well. Will a project contribute to achieving the overall goals? Can its effect be measured? Developers should be asking these questions if nobody else is.
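
Here’s a back-of-the-envelope sketch of what that kind of argument might look like; every number in it is a placeholder to be replaced with measurements from your own site, and the assumed lift per 100ms is exactly the sort of figure you’d want to validate with an experiment:

```python
# Back-of-the-envelope sketch of the page-speed argument. Every number here
# is a placeholder to be replaced with your own site's measurements.
monthly_visits = 500_000
current_conversion_rate = 0.020          # 2% of visits convert today
average_order_value = 60.00              # dollars per conversion

expected_load_time_saving_ms = 400       # what the database work should save
lift_per_100ms = 0.01                    # assumed 1% relative conversion lift per 100ms saved

relative_lift = lift_per_100ms * (expected_load_time_saving_ms / 100)
new_conversion_rate = current_conversion_rate * (1 + relative_lift)

added_revenue = monthly_visits * (new_conversion_rate - current_conversion_rate) * average_order_value
print(f"estimated added revenue per month: ${added_revenue:,.0f}")
```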

It’s also important to be able to contribute to the evaluation of metrics themselves. If someone tells you that increasing the number of pages seen per visit to the site will increase the overall conversion rate on the site, it’s important to be able to evaluate whether they’re right or wrong. This is what half of the arguments in sports statistics are about. Does batting average or on base percentage better predict whether a hitter helps his team win? What better predicts the success of a quarterback in football, yards per attempt or yards per completion? Choosing the right metrics is no less important than monitoring the metrics that have been selected.
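
One simple way to approach that kind of argument is to check how strongly each candidate metric correlates with the outcome you actually care about. Here’s a sketch using Pearson correlation; the team numbers are invented purely to show the method, and with real data you’d use full league seasons:

```python
# Sketch: which metric tracks the outcome more closely? The team stats below
# are invented for illustration; the point is the method, not these numbers.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

teams = [
    # (batting average, on-base percentage, wins) -- made-up numbers
    (0.270, 0.345, 95),
    (0.262, 0.330, 88),
    (0.275, 0.322, 80),
    (0.255, 0.318, 76),
    (0.268, 0.310, 70),
]

avg, obp, wins = zip(*teams)
print("AVG vs wins:", round(pearson(avg, wins), 2))
print("OBP vs wins:", round(pearson(obp, wins), 2))
```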

Finally, it often falls on the developer to instrument the application to collect the metrics needed for analytics, or at least to figure out whether the instrumentation that’s provided by a third party is actually working. Again, understanding analytics makes this part of the job much easier. It’s not uncommon for non-developers to ask for metrics based on data that is extremely difficult or costly to collect. Understanding analytics can help developers recommend alternatives that are just as useful and less burdensome.

The most important thing I’ve learned this year is that the analytics discussion is one that developers can’t really afford to sit out. As it turns out, analytics is also an extremely interesting problem, but I’ll talk more about that in another post. I’m also going to revisit the analytics for this site, which I ordinarily never look at, and write about that as well.
