­dice.jpgFor some reason my imaginary version of SQL has some nice operations to take a r­andom sample of rows from some table.  In reality, the most common query that I’ve seen is something like:

SELECT * FROM sometable ORDER BY RAND() LIMIT ­samplesize

In MySQL, this appears to do just what it says: generate a random number for every row then sort them and return the lowest few — which is prohibitively slow for any reasonably big table.

If you know the identifiers of rows in the table and they form a gapless sequence, the story is somewhat better;  you can generate samplesize random numbers from the range and select those identifiers.  Of course, this is generally not the case (and almost begs the question).  The best approach along these lines that I’ve seen requires maintaining a table that maps from a gapless sequence of integers to the identifiers of the table that you want to sample from using a bunch of triggers — http://jan.kneschke.de/projects/mysql/order-by-rand .

I’m also unsure of the right way to tackle this in databases like couchdb or the appengine datastore.  Maybe a similar tactic with views/query indexes? Anyone implemented anything similar?

Filed August 10th, 2008 under mysql, app engine, couchdb, development