For some reason my imaginary version of SQL has some nice operations to take a random sample of rows from some table. In reality, the most common query that I’ve seen is something like:
SELECT * FROM sometable ORDER BY RAND() LIMIT samplesize
In MySQL, this appears to do just what it says: generate a random number for every row then sort them and return the lowest few — which is prohibitively slow for any reasonably big table.
If you know the identifiers of rows in the table and they form a gapless sequence, the story is somewhat better; you can generate samplesize random numbers from the range and select those identifiers. Of course, this is generally not the case (and almost begs the question). The best approach along these lines that I’ve seen requires maintaining a table that maps from a gapless sequence of integers to the identifiers of the table that you want to sample from using a bunch of triggers — http://jan.kneschke.de/projects/mysql/order-by-rand .
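For what it’s worth, here is a minimal sketch of the gapless-sequence case, not a recommendation. It uses Python’s built-in sqlite3 module rather than MySQL so that it runs standalone, and it assumes an integer `id` column whose values are exactly 1..N. The `sample_rows` helper and the `payload` column are made up for illustration.

```python
import random
import sqlite3


def sample_rows(conn, table, sample_size, rng=random):
    """Return up to sample_size random rows from `table`.

    Assumes `table` has an integer `id` column whose values are exactly
    1..N with no gaps; with gaps, some picks would silently match nothing.
    The table name is interpolated into the SQL, so it must be trusted.
    """
    (max_id,) = conn.execute(f"SELECT MAX(id) FROM {table}").fetchone()
    if max_id is None:  # empty table
        return []
    k = min(sample_size, max_id)
    # Pick k distinct ids without touching the table at all.
    ids = rng.sample(range(1, max_id + 1), k)
    # One placeholder per id; fine for small samples, but databases cap
    # the number of bound parameters per statement.
    placeholders = ",".join("?" * k)
    query = f"SELECT * FROM {table} WHERE id IN ({placeholders})"
    return conn.execute(query, ids).fetchall()


# Toy table standing in for `sometable` from the query above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sometable (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO sometable (payload) VALUES (?)",
    [(f"row {i}",) for i in range(1000)],
)
rows = sample_rows(conn, "sometable", 5)
```

The lookups go through the primary key, so the cost scales with the sample size rather than the table size. The mapping-table trick from the link is what would let you keep that property when the real identifiers have gaps.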
I’m also unsure of the right way to tackle this in databases like couchdb or the appengine datastore. Maybe a similar tactic with views/query indexes? Anyone implemented anything similar?
A few quick notes if you haven’t picked them up elsewhere:
App Engine is open to everyone now, but still in preview release. They rolled out access to memcache, an image manipulation api and announced an expected pricing structure for resources beyond the free limits (details). In the “fireside chat” they suggested an alternate pricing model for non-profits and education, but with no specifics.
They mentioned a timeline of “by the end of the year” for rolling out billing. It’s worth noting it’s basically the only thing they would attach a timeline to, although their big priorities also seem to be offline processing and alternate language support.
The question of whether you will be charged for the size of the indices that are created to service your queries appears to be a strangely touchy issue. The word so far is they would “prefer not to” charge, but it appears this may be a matter of some internal debate?
There don’t seem to be any plans around offering versioned objects. Bigtable can do this, but they don’t have App Engine’s tables configured that way: versions older than the last committed one are garbage collected.
While I had hoped to get something out the door with the latest release that was cooler off the bat, it turned out to be a slightly bigger ball of wax than I was willing to wait for at this point. Sonali has been doing some great work thinking through a lot of the interfaces and we’ve had some very productive discussions about the initial experience that we want for users. I’m excited about what we’re aiming for, but it’s going to take a fair bit of rearranging to get there.
The biggest goal is to eliminate (or significantly lower) the barrier to seeing what is special about the site when you arrive. An anonymous user should be able to show up, see some feeds being displayed and immediately get the experience that currently takes about 4 long, boring and meaningless steps to reach (sign up, create a jug, add some feeds, add some filters). There should be something on the front page when you show up that is perhaps interesting and that you can immediately start refining without signing up or logging in. If you then sign up, you can save whatever you’ve customized.
Meanwhile, in the non-user-facing realm, I’ve been a bit (okay… maybe more than a bit) distracted from the main line of development trying to absorb what Google App Engine is all about, how much it has to offer to melkjug and how much would need to be done to take advantage of it in a significant way. I’m really excited about it, and I think that certain portions of the application map very well onto what Google is offering. Big props to Ian for appengine-monkey.
In a lot of ways, the changes I put in place to accommodate CouchDB have also paid off in fiddling around with the App Engine data API: in terms of managing data, and the constraints on finding it, I find them to be deeply similar. (P.S. looks like opencore isn’t alone in the hand-drawn diagram department.) Put together, CouchDB and App Engine are really starting to shape my concept of what melkjug needs to look like on the inside if it’s going to handle more than a handful of users.
They’re very strange and restrictive environments in some respects, but I think at least some of these constraints are valuable to have right in front of you to prevent you from doing things that just really aren’t going to scale. I’m no expert on doing things in these ways, and I’m certain I’m doing things stupidly, but I think I’m starting to see the light… I’ll post more on this along with some of the prototypes I’ve been playing around with at some point soon. Poke me if you want them sooner :)