rc3.org

Strong opinions, weakly held

Log it, don’t count it

Last week at work we were having a discussion about reporting, and I shared one of my principles when it comes to data collection, which is that logging is nearly always preferable to counting.

Let’s say I’m required to produce reports that show how often every entry on a weblog has been viewed. There are two obvious options for doing so. I could create a table called entry_views with a column for the entry ID and a date field recording when each view occurred. If I had some identifying information for the viewers, I could store that in the table as well.

Another option is to just add a column called times_viewed to the entries table. Every time someone views the article, I can just increment the field. Some people would argue that if all you care about is the count, this is all you should do. Their reasoning is that if you build a table just for logging views, it’s going to get huge, your reports will be slow, and you’ll be recording lots of extra information.
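
To make the two options concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The table and column names entry_views and times_viewed come from the description above; everything else (the viewer_id and viewed_at columns, the helper functions, the sample viewers and timestamps) is a made-up assumption for illustration, not a recommended schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entries (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        times_viewed INTEGER NOT NULL DEFAULT 0  -- the counting approach
    );
    CREATE TABLE entry_views (                    -- the logging approach
        entry_id INTEGER NOT NULL REFERENCES entries(id),
        viewer_id TEXT,                           -- hypothetical viewer identifier
        viewed_at TEXT NOT NULL                   -- ISO 8601 timestamp
    );
""")
conn.execute("INSERT INTO entries (id, title) VALUES (?, ?)", (1, "First post"))


def count_view(conn, entry_id):
    """Counting: bump a number and keep nothing else."""
    conn.execute(
        "UPDATE entries SET times_viewed = times_viewed + 1 WHERE id = ?",
        (entry_id,),
    )


def log_view(conn, entry_id, viewer_id, viewed_at):
    """Logging: store one row per view."""
    conn.execute(
        "INSERT INTO entry_views (entry_id, viewer_id, viewed_at) VALUES (?, ?, ?)",
        (entry_id, viewer_id, viewed_at),
    )


def views_per_entry(conn):
    """The count can always be derived from the log."""
    return conn.execute(
        "SELECT entry_id, COUNT(*) FROM entry_views "
        "GROUP BY entry_id ORDER BY entry_id"
    ).fetchall()


sample_views = [
    ("alice", "2024-01-01T09:00:00"),
    ("bob", "2024-01-01T09:05:00"),
    ("alice", "2024-01-02T10:00:00"),
]
for viewer, when in sample_views:
    count_view(conn, 1)
    log_view(conn, 1, viewer, when)

counter = conn.execute("SELECT times_viewed FROM entries WHERE id = 1").fetchone()[0]
print(counter)                # 3
print(views_per_entry(conn))  # [(1, 3)]
```

Both approaches produce the same number here. The difference is that the log can also say who viewed the entry and when, while the counter cannot.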

There are two things that you lose if you count rather than log. The first is that you lose all of the information that’s stored in the log. Take a look at Amazon.com sometime and you’ll see tons of features built on what someone once probably thought was extra information. You can see a list of the products you’ve viewed recently, products that were purchased by people who bought the product you’re looking at, and even products people purchased instead of the one you’re looking at. Keeping detailed logs is what makes all of those features possible. Collect the data now, and figure out what to do with it later.
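
As a rough illustration of the kind of feature a view log makes possible, the sketch below answers two questions a bare counter never could: what a given viewer looked at recently, and what else the viewers of an entry looked at. It uses Python’s sqlite3 module. The viewer_id and viewed_at columns, the function names, and the sample rows are all assumptions for the example, and this is in no way a description of how Amazon builds its features.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entry_views ("
    "entry_id INTEGER NOT NULL, viewer_id TEXT, viewed_at TEXT NOT NULL)"
)
# Made-up sample data: (entry_id, viewer_id, viewed_at)
conn.executemany(
    "INSERT INTO entry_views VALUES (?, ?, ?)",
    [
        (1, "alice", "2024-01-01T09:00:00"),
        (2, "alice", "2024-01-01T09:10:00"),
        (2, "bob", "2024-01-01T09:20:00"),
        (3, "bob", "2024-01-01T09:30:00"),
        (1, "alice", "2024-01-02T08:00:00"),
    ],
)


def recently_viewed(conn, viewer_id, limit=5):
    """Entries a viewer has seen, most recently viewed first."""
    rows = conn.execute(
        "SELECT entry_id, MAX(viewed_at) AS last_seen FROM entry_views "
        "WHERE viewer_id = ? GROUP BY entry_id "
        "ORDER BY last_seen DESC LIMIT ?",
        (viewer_id, limit),
    )
    return [row[0] for row in rows]


def also_viewed(conn, entry_id):
    """Other entries seen by people who viewed this one, most shared first."""
    rows = conn.execute(
        """
        SELECT other.entry_id, COUNT(DISTINCT other.viewer_id) AS viewers
        FROM entry_views AS seen
        JOIN entry_views AS other
          ON other.viewer_id = seen.viewer_id
         AND other.entry_id != seen.entry_id
        WHERE seen.entry_id = ?
        GROUP BY other.entry_id
        ORDER BY viewers DESC, other.entry_id
        """,
        (entry_id,),
    )
    return [row[0] for row in rows]


print(recently_viewed(conn, "alice"))  # [1, 2]
print(also_viewed(conn, 2))            # [1, 3]
```

Neither query was planned when the rows were written, which is the point: the log was collected first and the use for it came later.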

The second thing you lose is an audit trail. A few years ago, I worked on a rewrite of a licensing system for a software company. When a customer purchased a number of licenses, a record was added to the database. When the customer came and generated those licenses, a column in that table was incremented. When the license count was equal to the number of licenses purchased, the system refused to generate any more licenses. The problem was that there was no audit trail. If a customer called and asked who had used up their licenses or when they were redeemed, the company couldn’t tell them.
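
Here is a hedged sketch of what the logging version of that licensing system might have looked like, again with Python’s sqlite3 module. The schema, the function names, and the sample customer are hypothetical, since the original system’s design is only described in outline above. Instead of incrementing a column, each generated license becomes a row, and the limit is enforced by counting rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE purchases (
        id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL,
        quantity INTEGER NOT NULL      -- number of licenses purchased
    );
    CREATE TABLE redemptions (         -- one row per generated license
        purchase_id INTEGER NOT NULL REFERENCES purchases(id),
        redeemed_by TEXT NOT NULL,
        redeemed_at TEXT NOT NULL      -- ISO 8601 timestamp
    );
""")
conn.execute(
    "INSERT INTO purchases (id, customer, quantity) VALUES (?, ?, ?)",
    (1, "Example Corp", 2),
)


def redeem(conn, purchase_id, redeemed_by, redeemed_at):
    """Generate a license if any remain; return True on success."""
    quantity = conn.execute(
        "SELECT quantity FROM purchases WHERE id = ?", (purchase_id,)
    ).fetchone()[0]
    used = conn.execute(
        "SELECT COUNT(*) FROM redemptions WHERE purchase_id = ?", (purchase_id,)
    ).fetchone()[0]
    if used >= quantity:
        return False  # all purchased licenses have been generated
    conn.execute(
        "INSERT INTO redemptions VALUES (?, ?, ?)",
        (purchase_id, redeemed_by, redeemed_at),
    )
    return True


def audit_trail(conn, purchase_id):
    """Who generated each license, and when."""
    return conn.execute(
        "SELECT redeemed_by, redeemed_at FROM redemptions "
        "WHERE purchase_id = ? ORDER BY redeemed_at",
        (purchase_id,),
    ).fetchall()


results = [
    redeem(conn, 1, "pat", "2024-03-01T10:00:00"),
    redeem(conn, 1, "sam", "2024-03-02T11:00:00"),
    redeem(conn, 1, "lee", "2024-03-03T12:00:00"),  # over the limit
]
print(results)               # [True, True, False]
print(audit_trail(conn, 1))
```

The sketch assumes a single writer. A real system would need to make the check and the insert atomic so that two simultaneous requests cannot both claim the last license.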

Obviously, the person who failed to record the individual licenses made a really terrible decision, since revenue and customer satisfaction were on the line. The larger point is that hoarding the data is, to me, always a better decision than failing to record something there’s a chance you may need later. Data storage is cheap. The only issue is how to handle reporting performance when you’ve logged tons of information. I generally file that in the category of problems that are nice to have.

3 Comments

  1. “The only issue is how to handle reporting performance when you’ve logged tons of information.”

    This approach brings up its own set of concerns, but the answer is to both log it and count it. Use the times_viewed counter to generate quick reports, check it periodically against your log data, and use the logs for more detailed (and long-running) reports.

  2. Logging is also often easier to offload and process in batch mode, whereas counting will usually require updating records in your live data, with all the contention and limited resources of a production system.

  3. The practice of collecting all possible data and seeing what to use it for later is one of the information age philosophies (enabled, as you point out, by cheap storage) that leads to all kinds of privacy issues.

    It represents, on a smaller scale, what underlies the TIA (or whatever it’s been renamed to) approach to terrorism detection/prevention – vacuum up all possible data and then figure out what to do with it/how to search it later.

    The need for audit and retrospective analysis is fundamentally in tension with the desire for minimization, consent, fair information practices, and privacy.

    (I don’t have any particular solution – just pointing out what our easy collection + cheap storage capabilities lead to.)
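
The first comment’s suggestion to both log and count can be sketched roughly as follows, using Python’s sqlite3 module. The times_viewed and entry_views names come from the post; the record_view and find_drift helpers and the sample data are assumptions made up for this example. Each view writes a log row and bumps the counter in one transaction, and a periodic check compares the counter against the log.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entries (
        id INTEGER PRIMARY KEY,
        times_viewed INTEGER NOT NULL DEFAULT 0   -- fast path for reports
    );
    CREATE TABLE entry_views (                    -- detailed log
        entry_id INTEGER NOT NULL REFERENCES entries(id),
        viewed_at TEXT NOT NULL
    );
""")
conn.executemany("INSERT INTO entries (id) VALUES (?)", [(1,), (2,)])


def record_view(conn, entry_id, viewed_at):
    """Write the log row and bump the counter in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO entry_views (entry_id, viewed_at) VALUES (?, ?)",
            (entry_id, viewed_at),
        )
        conn.execute(
            "UPDATE entries SET times_viewed = times_viewed + 1 WHERE id = ?",
            (entry_id,),
        )


def find_drift(conn):
    """Return (entry_id, counter, logged) wherever the counter disagrees with the log."""
    return conn.execute(
        """
        SELECT e.id, e.times_viewed, COUNT(v.entry_id) AS logged
        FROM entries AS e
        LEFT JOIN entry_views AS v ON v.entry_id = e.id
        GROUP BY e.id
        HAVING e.times_viewed != COUNT(v.entry_id)
        ORDER BY e.id
        """
    ).fetchall()


record_view(conn, 1, "2024-01-01T09:00:00")
record_view(conn, 1, "2024-01-01T09:05:00")
record_view(conn, 2, "2024-01-01T09:10:00")
clean = find_drift(conn)
print(clean)  # []

# Simulate a counter that has gone wrong, then run the periodic check again.
conn.execute("UPDATE entries SET times_viewed = 5 WHERE id = 2")
drifted = find_drift(conn)
print(drifted)  # [(2, 5, 1)]
```

When the check finds a mismatch, the log is the record to trust, and the counter can be rebuilt from it.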

