rc3.org

Strong opinions, weakly held

Category: Data Engineering

That’s not analytics

OK, so a guy made a Web site called Tab Closed Didn’t Read to post screenshots of sites that hide their content behind various kinds of overlays that demand that users take some action before proceeding. He’s written a follow-up blaming the problem on over-reliance on analytics, mainly because some people justify the use of these intrusive calls to action by citing analytics. Anyone who justifies this sort of thing based on analytics should be sued for malpractice.

You can measure almost anything you like. It’s up to the practitioner to determine which metrics really matter to them. In e-commerce the formula is relatively simple. If you’re Amazon.com, you want people to buy more stuff. Cart adds are good, but not as good as purchases. Adding items to your wish list is good, but not as good as putting them in your cart. If Amazon.com added an overlay to the home page urging people to sign up for a newsletter, it might add newsletter subscribers, but it’s quite likely that it would lead to less buying of stuff, less adding of stuff to the shopping cart, and less adding of items to wish lists. That’s why you don’t see annoying overlays on Amazon.com.

Perhaps in publishing, companies are less clear on which metrics really matter, so they optimize for the wrong things. Let’s not blame analytics for that.

Defining data engineering

Last year I started working in the world of Big Data, and at the time, I didn’t know that “data science” and “data engineering” were separate things. At some point, I looked at what my team was working on and realized that the distinction between the two is important, and that the team is firmly entrenched in the data engineering camp.

Data scientists get all the glory and attention, but without data engineering, there’s no way for data scientists to practice real science. I’ll talk more about this in another post.

In this first post, I want to talk about the four basic layers of the data engineering stack. These apply whether you’re enabling a Web-based business to collect analytics data or building the infrastructure for scientists to analyze rainfall patterns. The layers are:

  1. Instrumentation
  2. Data crunching
  3. Data warehousing
  4. End-user tools

Let’s look at an example from Web analytics, because that’s what I understand the best. A tool like Google Analytics spans all four layers, but end users only have a measure of control over two of them. When you add the Google Analytics JavaScript to your Web site, you’re setting up the instrumentation. Google crunches the data they collect, and they warehouse it for you. You can then view reports using the Web interface. Google Analytics is a great general-purpose tool, but the lack of control and visibility is what limits its potential.

At Etsy, we have our own custom instrumentation, our own Hadoop jobs to crunch the logs the instruments write to, our own data warehouse, and, for the most part, end-user tools that we wrote ourselves for exploring that data.
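To make the crunching layer a little more concrete, here’s a rough sketch, in Python, of the kind of Hadoop streaming mapper that might chew through event logs. The tab-delimited field layout is an assumption made up for this example, not our actual log format.

    #!/usr/bin/env python
    # Hypothetical Hadoop streaming mapper: read raw event log lines on stdin
    # and emit (event_name, 1) pairs for a reducer to sum into daily counts.
    # The tab-delimited layout (timestamp, event, user_id, url) is invented
    # for illustration.
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip truncated or corrupted lines instead of failing the job
        timestamp, event, user_id, url = fields[:4]
        print("%s\t%d" % (event, 1))

A reducer that sums the values for each key finishes the job, and the warehouse and end-user tools are built on output like this.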

All of the data engineering team’s projects involve at least one layer of the stack. For example, we worked with our mobile team to add instrumentation to our native iOS and Android apps, and then we made changes to our Hadoop jobs to make sure that the new incoming data was handled correctly. The new mobile data also has implications for our end-user tools.

Along with building up the data infrastructure, managing data quality is the other priority of data engineering. It’s possible to lose data at every layer of the stack. If your instrumentation is built using JavaScript, you lose data from browsers that don’t have it enabled. Your instruments usually log through calls to some kind of endpoint, and if that endpoint is down or the connection is unreliable, you lose data. If people close the browser window before the instruments load, you lose data. If your data crunching layer can’t properly process some of the data from the instruments (often due to corruption that’s beyond your control), it’s lost. Data can be lost between the data crunching layer and the data warehouse layer, and of course bugs in your end-user tools can give the appearance of data loss as well.
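One simple defense is to reconcile counts between layers. Here’s a toy sketch of the idea; the numbers and the tolerance are invented, and in practice the counts would come from the collection endpoint’s logs and a warehouse query rather than being hard-coded.

    # Toy data-quality check: compare the number of events the collection
    # endpoint logged with the number of rows that made it into the warehouse
    # for the same day, and complain when the gap exceeds a tolerance.
    def check_loss(sent, loaded, tolerance=0.02):
        lost = sent - loaded
        rate = lost / float(sent) if sent else 0.0
        return rate <= tolerance, rate

    ok, rate = check_loss(sent=1000000, loaded=968000)
    if not ok:
        print("data loss above tolerance: %.2f%% of events missing" % (rate * 100))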

In terms of skills and daily work, data engineering is not much different than other areas of software development. There are cases where having a background in math or quantitative analysis can be hugely helpful, but many of the problems are straightforward programming or operations problems. The big problems tend to be scaling each of the layers of the stack to accommodate the volume of data being collected, and doing the hardcore debugging and analysis required to manage data loss effectively.

That’s a quick description of what life in data engineering is like. I am planning on writing a lot more about this topic. If you have questions, please comment.

Analysts and their instruments

As I’ve mentioned previously, I’m currently working in the realm of Web analytics. I don’t have a deep statistics background, and I’m definitely not what anyone would mistake for a data scientist, but I do have a good understanding of how analytics can be applied to business problems.

I gained most of that understanding by way of being a baseball fan. I was hanging out with baseball nerds on the Internet talking about baseball analytics long before Moneyball was a twinkle in Michael Lewis’ eye.

Around the time most baseball teams started hiring their own analysts, I assumed that baseball analytics had become a solved problem. Given all of the money at stake and all of the eyes on the problem, I figured that new analytic insights would become less and less common. That has turned out not to be the case, for interesting reasons.

The aspect of baseball that makes it the perfect subject for statistical analysis is that every game is a series of discrete, recordable events that can be aggregated at any number of levels. At the top, you have the score of the game. Below that, there’s the box score, which shows how each batter and pitcher performed in the game as a whole. From there, you go to the scorecard, which is used to record the result of every play in a game, in sequence. Most of the early groundbreaking research into baseball was conducted at this level of granularity.
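To make “aggregated at any number of levels” concrete, here’s a toy sketch that rolls play-by-play records up into box score lines. The records and their format are invented for the example; real play-by-play data carries far more detail.

    # Toy aggregation: roll invented play-by-play records up to the box score
    # level (hits and at-bats per batter). The point is only that fine-grained
    # events sum cleanly into coarser views.
    from collections import defaultdict

    plays = [
        {"batter": "Smith", "result": "single"},
        {"batter": "Smith", "result": "strikeout"},
        {"batter": "Jones", "result": "home_run"},
        {"batter": "Jones", "result": "groundout"},
    ]

    HITS = {"single", "double", "triple", "home_run"}
    box = defaultdict(lambda: {"ab": 0, "h": 0})
    for play in plays:
        line = box[play["batter"]]
        line["ab"] += 1
        line["h"] += int(play["result"] in HITS)

    for batter, line in box.items():
        print("%s: %d for %d" % (batter, line["h"], line["ab"]))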

What happened in baseball is that the instrumentation got a lot better, and the new data created the opportunity for new insights. For example, pitch-by-pitch records from every game became available, enabling a number of interesting new findings.

Now baseball analytics is being fed by superior physical observation of games. To go back in time, one of the greatest breakthroughs in baseball instrumentation was the radar gun, which enabled scouts to measure the velocity of pitches. That enabled analysts to determine how pitch velocity affects the success of a pitcher, and to more accurately value pitching prospects.

More recently, a new system called PITCHf/x has been installed at every major league ballpark. It measures the speed and movement of pitches, as well as where, exactly, they cross the strike zone. With it, you can measure how well umpires perform, as well as how good a pitcher’s various pitches really are. You can also measure how well batters can distinguish between balls and strikes and whether they’re swinging at the wrong pitches. This data enabled the New York Times to create the visualization in How Mariano Rivera Dominates Hitters back in 2010.
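As a rough illustration of the umpire question, here’s a sketch that scores called pitches against a rule-book strike zone. The field names follow the public PITCHf/x conventions (px and pz for location, sz_top and sz_bot for the batter’s zone), but the pitches themselves are invented and the zone here ignores the radius of the ball.

    # Score called pitches against a simple rule-book strike zone using
    # PITCHf/x-style fields. Locations are in feet; the pitches are invented.
    PLATE_HALF_WIDTH = 17.0 / 12 / 2  # home plate is 17 inches wide

    def in_zone(pitch):
        return (abs(pitch["px"]) <= PLATE_HALF_WIDTH
                and pitch["sz_bot"] <= pitch["pz"] <= pitch["sz_top"])

    called = [
        {"px": 0.20, "pz": 2.5, "sz_top": 3.4, "sz_bot": 1.5, "call": "strike"},
        {"px": 1.10, "pz": 2.1, "sz_top": 3.4, "sz_bot": 1.5, "call": "strike"},
        {"px": -0.30, "pz": 1.0, "sz_top": 3.4, "sz_bot": 1.5, "call": "ball"},
    ]

    correct = sum(in_zone(p) == (p["call"] == "strike") for p in called)
    print("agreed with the rule-book zone on %d of %d calls" % (correct, len(called)))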

If you’re working on analytics and you find it’s difficult to glean new insights, it may be time to see if you can add further instrumentation. More granular data will always provide the opportunity for deeper analysis.

Big Data and analytics link roundup

Here are a few things that have caught my eye lately from the world of Big Data and analytics.

Back in September, I explained why Web developers should care about analytics. This week I noticed a job opening for a Web developer at Grist that includes knowledge of analytics in the list of requirements. That doesn’t exactly make for a trend, but I expect to see a lot more of this going forward.

Also worth noting are the two data-related job openings at Rent the Runway. They have an opening for a data engineer and one for a data scientist. These two jobs are frequently conflated, and there is some overlap in the skill sets, but they’re not the same thing. For the most part what I do is data engineering, not data science.

If you do want to get started in data science, you could do worse than to read Hilary Mason’s short guide. Seth Brown has posted an excellent guide to basic data exploration in the Unix shell. I do this kind of stuff all the time.

Here are a couple of contrary takes on Big Data. In the New York Times, Steve Lohr has a trend piece on Big Data, “Sure, Big Data Is Great. But So Is Intuition.” Maybe it’s different on Wall Street, but I don’t see too many people divorcing Big Data from intuition. Usually intuition leads us to ask a question, and then we try to answer that question using quantitative analysis. That’s Big Data to me. For a more technical take on the same subject, see “Data-driven science is a failure of imagination” from Petr Keil.

On a lighter note, Sean Taylor writes about the Statistics Software Signal.

One quick analytics lesson

Yesterday I saw an interesting link on Daring Fireball to a study that reported the results of searching for 2,028 cities and towns in Ontario in the new iOS 6 Maps app for which Apple has apologized. Unsurprisingly, the results of the searches were not very good.

The first question that sprang to my mind when I read the piece, though, was, “How good are the Google Maps results for these searches?” Not because I thought Google’s results would be just as bad, but because looking at statistics in isolation is not particularly helpful when it comes to doing useful analysis. Obviously you can look at the results of the searches and rate Apple Maps against reality, but rating it against its competitors is also important. What should our expectations be, really?

Marco Tabini dug into that question, running the same searches under iOS 5.1 (running the Maps app that uses Google’s data). He found that the old Maps app does not outperform the new one by a wide margin, and he noted some interesting differences in how Apple and Google handle location searches.

This isn’t an argument that people shouldn’t be mad about the iOS 6 Maps search capabilities or lack of data, but rather that useful comparisons are critical when it comes to data analysis. That’s why experiments have control groups. Analysis that lacks baseline data is particularly pernicious in cases when people are operating under the assumption that they already know what the baseline is. In these cases, statistics are more likely to actually make people less informed.
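In code, the lesson amounts to something as simple as the sketch below: score the same queries against both providers and report the rates side by side rather than one number in isolation. The hit counts here are invented.

    # Toy baseline comparison: the same searches scored against two providers.
    # Reporting one provider's rate alone tells you much less than the
    # comparison does. The hit counts are invented.
    def success_rate(hits, total):
        return hits / float(total)

    total_searches = 2028  # the number of Ontario place names in the study
    apple_hits = 1200      # invented
    google_hits = 1350     # invented

    print("Apple Maps:  %.1f%%" % (100 * success_rate(apple_hits, total_searches)))
    print("Google Maps: %.1f%%" % (100 * success_rate(google_hits, total_searches)))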

Why Web developers should care about analytics

I’m pretty sure the universe is trying to teach me something. For as long as I can remember, I’ve been dismissive of Web analytics. I’ve always felt that they’re for marketing people and that, at least in the realm of personal publishing, paying attention to analytics makes you some kind of sellout. Analytics is a discipline rife with unfounded claims and terrible, terrible products, as well as people engaging in cargo cultism that they pretend is analysis. Even the terminology is annoying. When people start talking about “key performance indicators” and “bounce rate” my flight instinct kicks in immediately.

In a strange turn of events, I’ve spent most of this year working in the field of Web analytics. I am a huge believer in making decisions based on quantitative analysis but I never connected that to Web analytics. As I’ve learned, Web analytics is just quantitative analysis of user behavior on Web sites. The problem is that it’s often misunderstood and usually practiced rather poorly.

The point behind this post is to make the argument that if you’re like me, a developer who has passively or actively rejected Web analytics, you might want to change your point of view. Most importantly, an understanding of analytics gives the team building for the Web a data-based framework within which they can discuss their goals, how to achieve those goals, and how to measure progress toward achieving those goals.

It’s really important as a developer to be able to participate in discussions on these terms. If you want to spend a couple of weeks making performance improvements to your database access layer, it helps to be able to explain the value in terms of increased conversion rate that results from lower page load time. Understanding what makes your project successful and how that success is measured enables you to make an argument for your priorities and, just as importantly, to be able to understand the arguments that other people are making for their priorities as well. Will a project contribute to achieving the overall goals? Can its effect be measured? Developers should be asking these questions if nobody else is.
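A back-of-the-envelope version of that argument might look like the sketch below. Every number in it is invented; the point is only that a performance project can be framed in the same terms the rest of the team already measures.

    # Invented back-of-the-envelope estimate: if faster pages lift conversion
    # rate by a small relative amount, how many extra orders does that imply?
    monthly_visits = 2000000     # assumption
    baseline_conversion = 0.025  # assumption: 2.5% of visits convert
    relative_lift = 0.02         # assumption: 2% relative lift from faster pages

    extra_orders = monthly_visits * baseline_conversion * relative_lift
    print("estimated additional orders per month: %.0f" % extra_orders)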

It’s also important to be able to contribute to the evaluation of metrics themselves. If someone tells you that increasing the number of pages seen per visit to the site will increase the overall conversion rate on the site, it’s important to be able to evaluate whether they’re right or wrong. This is what half of the arguments in sports statistics are about. Does batting average or on base percentage better predict whether a hitter helps his team win? What better predicts the success of a quarterback in football, yards per attempt or yards per completion? Choosing the right metrics is no less important than monitoring the metrics that have been selected.
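Evaluating a candidate metric usually comes down to the same kind of check: which one tracks the outcome you actually care about more closely? Here’s a generic sketch with invented numbers; for the baseball argument, the rows would be team-seasons and the outcome would be runs scored.

    # Generic metric evaluation: compare how strongly two candidate metrics
    # correlate with the outcome you care about. All numbers are invented.
    import statistics

    def pearson(xs, ys):
        mx, my = statistics.mean(xs), statistics.mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy)

    outcome  = [700, 650, 820, 760, 590]            # e.g. runs scored
    metric_a = [0.262, 0.259, 0.270, 0.261, 0.250]  # e.g. batting average
    metric_b = [0.330, 0.318, 0.355, 0.342, 0.305]  # e.g. on-base percentage

    print("metric A vs outcome: %.3f" % pearson(metric_a, outcome))
    print("metric B vs outcome: %.3f" % pearson(metric_b, outcome))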

Finally, it often falls on the developer to instrument the application to collect the metrics needed for analytics, or at least to figure out whether the instrumentation that’s provided by a third party is actually working. Again, understanding analytics makes this part of the job much easier. It’s not uncommon for non-developers to ask for metrics based on data that is extremely difficult or costly to collect. Understanding analytics can help developers recommend alternatives that are just as useful and less burdensome.

The most important thing I’ve learned this year is that the analytics discussion is one that developers can’t really afford to sit out. As it turns out, analytics is also an extremely interesting problem, but I’ll talk more about that in another post. I’m also going to revisit the analytics for this site, which I ordinarily never look at, and write about that as well.
