This week I’ve been working on a report. Originally, everything on the report was pulled from one table, but now it needs to include data from that original table plus some information from a second table associated with each record in the first. The relationship is one to many, and in some cases there may be no data in the second table at all, so I have to use an outer join, which returns a result set like this:

ID  name     attribute
--  -------  ------------
1   Rafe     programmer
1   Rafe     writer
2   Bob      lawyer
3   James
4   Michael  engineer
4   Michael  cook
4   Michael  photographer


That has to be converted to a data structure that includes single attributes for ID and name, and an array of attributes (which may be empty). When you add pagination, things get a little confusing. Let’s say I want to display the records three at a time. If the one to many relationship weren’t involved, I could just stick a LIMIT 3 clause in the MySQL query; that’s how the report works right now. Unfortunately, now that the outer join has entered the picture, LIMIT 3 limits the results to 3 rows, which contain only 2 complete records. LIMIT 5 would be even worse: you’d get all of records 1, 2, and 3, and part of record 4. Ouch.
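The conversion itself is the easy part. Here’s a minimal sketch in Python of collapsing the join rows into one entry per person with an array of attributes (the record shape is a hypothetical choice of mine, not what the report actually uses):

```python
from itertools import groupby

# The join rows from the example above, ordered by ID.
# attribute is None where the outer join found no matching row.
rows = [
    (1, "Rafe", "programmer"),
    (1, "Rafe", "writer"),
    (2, "Bob", "lawyer"),
    (3, "James", None),
    (4, "Michael", "engineer"),
    (4, "Michael", "cook"),
    (4, "Michael", "photographer"),
]

def group_records(rows):
    """Collapse join rows into one dict per person, with a list of attributes."""
    records = []
    for (pid, name), grp in groupby(rows, key=lambda r: (r[0], r[1])):
        attrs = [attr for _, _, attr in grp if attr is not None]
        records.append({"id": pid, "name": name, "attributes": attrs})
    return records
```

Note that `groupby` only works because the rows arrive sorted by ID, which the query’s ordering guarantees; James, with no rows in the second table, ends up with an empty attribute list.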

One option is to leave out the limit, pull back the entire result set, and handle the pagination in my application code. That would be fine except that we’re dealing with hundreds of thousands of records. In most cases they’re filtered, but I still don’t want to risk queries that could return thousands of results when I only want to deal with a small fraction of them.

Faced with a somewhat ugly problem (which would be abstracted away in the Java world by an ORM solution like Hibernate), I’ve been reluctant to start coding. Instead I’ve spent the whole week just thinking about the problem, talking to people about it, and doing research. Normally I have a bent toward procrastination, but in some cases it’s my friend. Rather than writing something that I’ll just throw away, it’s better to wait and really think about the problem before attacking.

I’ve thought of all sorts of bizarre approaches, many of which would take weeks to write and might not solve the problem anyway. Then it occurred to me that I just needed to use the LIMIT clause a bit differently. The plan was to fetch the results with a LIMIT clause just as I do now, iterate over the results and move the data into the new data structure I need, and then, if the query doesn’t return enough results to fill the page, go back and run the query again with a new limit and offset, continuing to iterate until the page is built.
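That plan might be sketched like this (the helper names are hypothetical, and a fake query function stands in for the real database; it assumes the join rows come back ordered by ID):

```python
def fill_page(run_query, page_size, batch_size=10):
    """Fetch join rows in LIMIT/OFFSET batches until a full page of
    complete records is assembled.

    run_query(limit, offset) must return (id, name, attribute) rows
    ordered by id; one record may span several rows."""
    records, offset = [], 0
    while True:
        rows = run_query(batch_size, offset)
        if not rows:
            break                      # ran out of data
        for pid, name, attr in rows:
            if not records or records[-1]["id"] != pid:
                records.append({"id": pid, "name": name, "attributes": []})
            if attr is not None:
                records[-1]["attributes"].append(attr)
        offset += len(rows)
        if len(records) > page_size:
            break                      # the page_size-th record is complete
    return records[:page_size]

# Fake query over the example result set, standing in for the real database.
DATA = [
    (1, "Rafe", "programmer"), (1, "Rafe", "writer"),
    (2, "Bob", "lawyer"), (3, "James", None),
    (4, "Michael", "engineer"), (4, "Michael", "cook"),
    (4, "Michael", "photographer"),
]

def run_query(limit, offset):
    return DATA[offset:offset + limit]
```

The loop deliberately fetches one record past the page boundary before trimming, so it never serves a partially assembled record from the end of a batch.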

This seemed like it would make sense, and it does solve the potential performance problem of retrieving thousands of records when you only need a few of them. It works well for the first page of results, but things get messy from there. Let’s say that to display 5 records on the first page, you use 12 rows of the result set. You store the offset between requests, and when the user hits the “next page” link, you run the query again with an offset of 12 so that it picks up at row 13. That’s fine, unless another record has been added to the database in the meantime. Say it takes up 2 rows; then your current result set is thrown off. This can happen even when the outer join isn’t involved, but there the only cost is that you’re seeing the set of 5 records from that moment in time, not the 5 records immediately following those on the first page. When a “record” consists of an undetermined number of rows in the result set, everything is thrown off: you’re back in partial-record territory, with no easy way to figure out that that’s what’s going on.

So this solution is insufficient. Two ways around the problem are to keep the primary keys around rather than the offset, or to work off of the timestamps on the records. Let’s look at each. When you produce a page of results, you can keep the IDs of the first and last items on the page. If you’re lucky, you use an auto-incrementing ID field of some kind, so you can assume that earlier items have IDs lower than the lowest ID on the page, and later items have IDs higher than the highest ID on the page. You can do the same thing with timestamps: save the timestamps of the first and last records on the page to serve as markers for the previous and next pages.
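The ID-based idea can be sketched quickly, using SQLite in Python as a stand-in for MySQL (the table and column names here are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table people (id integer primary key, name text)")
conn.executemany("insert into people values (?, ?)",
                 [(i, "person%d" % i) for i in range(1, 8)])

def fetch_page(conn, last_id=0, page_size=3):
    """Page by remembering the highest ID already seen, not a row offset."""
    return conn.execute(
        "select id, name from people where id > ? order by id limit ?",
        (last_id, page_size)).fetchall()

page1 = fetch_page(conn)
page2 = fetch_page(conn, last_id=page1[-1][0])
```

Because rows inserted later get higher auto-increment IDs, they can’t shift the boundaries of pages you’ve already served; the previous-page link works the same way in reverse, filtering on the page’s lowest ID.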

The glaring problem with this approach is that it only works if your report is ordered by insertion time. If you want to produce a report that displays 10 records per page out of 50 total, sorted by name, then IDs and timestamps are useless for paginating the results. In theory names would work as markers, but names may not be unique, and if they’re not, that approach fails too.

Yet another approach is to attack the problem with the IN clause. MySQL 4.1 supports subqueries, so you can theoretically write a query like this:

select people.id, people.name, attributes.attribute
from people left join attributes on people.id = attributes.person_id
where people.id in (select id from people order by id limit 10)
order by people.id

Unfortunately, that won’t work, because MySQL doesn’t allow LIMIT inside an IN subquery. However, you can run two queries: the first gets the list of IDs to use in the IN expression of the query above, in place of the subquery. That’s looking like the best solution so far. Anyone have any better ideas?
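The two-query version, again sketched with SQLite in Python as a stand-in for MySQL 4.1 and the example data from above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table people (id integer primary key, name text);
    create table attributes (person_id integer, attribute text);
""")
conn.executemany("insert into people values (?, ?)",
                 [(1, "Rafe"), (2, "Bob"), (3, "James"), (4, "Michael")])
conn.executemany("insert into attributes values (?, ?)",
                 [(1, "programmer"), (1, "writer"), (2, "lawyer"),
                  (4, "engineer"), (4, "cook"), (4, "photographer")])

def fetch_page(conn, limit, offset=0):
    # Query 1: pick exactly `limit` people. Pagination happens here,
    # where one row really is one record.
    ids = [r[0] for r in conn.execute(
        "select id from people order by id limit ? offset ?",
        (limit, offset))]
    if not ids:
        return []
    # Query 2: the outer join, restricted to just those people.
    placeholders = ",".join("?" * len(ids))
    return conn.execute(
        "select people.id, people.name, attributes.attribute "
        "from people left join attributes on people.id = attributes.person_id "
        "where people.id in (%s) order by people.id" % placeholders,
        ids).fetchall()

rows = fetch_page(conn, limit=3)   # people 1, 2, 3 -> 4 join rows
```

The first query guarantees a page of exactly `limit` complete records, however many join rows they expand into; the second query’s row count varies, but that no longer matters.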