• Scaling

  last modified December 20, 2006 by ianb

Luke has been doing load testing (December 2006).  Results are fairly good, but when requests take a long time to finish they can exhaust the pool of worker threads and keep fast requests from getting through.  As a result, a badly performing project could affect other projects, and a badly performing theme could slow down requests that have nothing to do with it.

After some discussion, several possible solutions emerged:

  1. The number of worker threads could be increased.  It's currently 10, which makes sense in a CPU-bound application.  This application is primarily IO bound (even when the IO is fast) so a larger number of threads might help generally.  But even a large pool can be exhausted with just a few poorly-performing requests.
  2. We can try to make sure that requests respond promptly, if not always accurately.  Setting up timeouts on subrequests would help this.  However, long-running requests are not uncommon, especially something like an edit in Plone.  We don't want to abort these requests if we don't have to.  So long as the pool of worker threads is not being exhausted, long-running requests are not a problem.
  3. So killing requests when we need to free up the pool could be helpful.  We would need to keep track of which requests are taking too long, then kill the thread (which is not particularly easy).  This page shows one way to kill a thread.  This might not work when a socket is hung; we'll have to test.  Per-socket timeouts are also tricky.
  4. The worst, and most likely, case is when the theme takes a long time to fetch.  We're already planning on caching the theme (regardless of any cache headers); even a modest cache, such as never requesting the theme more than once every 5 seconds, would help a great deal.  We could also move the theme fetching (and perhaps other external fetches, but not content) to another worker thread.
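
Option (3) needs some way to know which requests have been running too long.  A minimal sketch of that bookkeeping, assuming it is done in a WSGI middleware wrapped around the application; RequestTracker and overdue are hypothetical names, not existing code.  It only measures time spent inside the application call, not time spent iterating over the response body.

```python
import threading
import time


class RequestTracker:
    """WSGI middleware recording when each worker thread began its request.

    Hypothetical helper for illustration only.
    """

    def __init__(self, app, clock=time.monotonic):
        self.app = app
        self.clock = clock
        self._lock = threading.Lock()
        self._started = {}  # thread ident -> (start time, request path)

    def __call__(self, environ, start_response):
        ident = threading.get_ident()
        with self._lock:
            self._started[ident] = (self.clock(), environ.get("PATH_INFO", ""))
        try:
            return self.app(environ, start_response)
        finally:
            # The request is done (or failed); stop tracking this thread.
            with self._lock:
                self._started.pop(ident, None)

    def overdue(self, limit):
        """List (thread ident, path, elapsed seconds) for requests over limit."""
        now = self.clock()
        with self._lock:
            return [(ident, path, now - start)
                    for ident, (start, path) in self._started.items()
                    if now - start > limit]
```

A watchdog could call overdue() only when the pool is close to exhausted, so that long-running requests are left alone while there are still free workers.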

Moving to another thread

This outlines the theme fetching strategy (4) in more detail:
  1. One or a couple threads are dedicated to fetching external resources.  These are not the WSGI/request threads; this is a separate pool of threads.
  2. When fetching an external resource, if no cached version of the resource is available then it is fetched directly, and the page is saved in the cache.  The request is blocked during this time.  (Routing these fetches through the worker pool as well could avoid some duplicate requests, but it may not be important enough: duplicates can only happen when no cached page exists and several requests for the page suddenly come in.)
  3. When a cached version of the page is available, if the cache isn't stale it is used.
  4. If the cache is stale, then a job is added to the queue to fetch that resource.  The fetching is done in a separate thread.  The request thread waits for some limited amount of time for the job to be finished.  If it isn't finished in time, the stale cached version of the resource is used.  Later requests may see the new version of the page.  Multiple jobs to fetch the same resource won't go in the job queue -- as soon as one such job is there, everyone waits on that job.
This handles the situation when a page is very slow to fetch; the slow fetch will not tie up the pool of worker threads that serve requests.  It may cause the caches to become very stale if the fetching worker pool is wedged; some timeouts are still necessary there to keep the queue moving.
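
The four steps above could be sketched roughly as follows.  This is only an illustration under assumptions: ThemeCache, fetch, max_age and wait are hypothetical names, the fetch function is supplied by the caller, and the standard library's ThreadPoolExecutor stands in for the dedicated fetching threads.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ThemeCache:
    """Serve cached resources, refreshing stale ones in a separate pool."""

    def __init__(self, fetch, max_age=5.0, wait=0.5, workers=2,
                 clock=time.monotonic):
        self.fetch = fetch        # callable(url) -> body, may be slow
        self.max_age = max_age    # seconds before an entry counts as stale
        self.wait = wait          # how long a request waits for a refresh
        self.clock = clock
        self._pool = ThreadPoolExecutor(max_workers=workers)  # step 1
        self._lock = threading.Lock()
        self._cache = {}          # url -> (fetched_at, body)
        self._pending = {}        # url -> Future of the refresh in progress

    def _refresh(self, url):
        try:
            body = self.fetch(url)
            with self._lock:
                self._cache[url] = (self.clock(), body)
            return body
        finally:
            with self._lock:
                self._pending.pop(url, None)

    def get(self, url):
        with self._lock:
            entry = self._cache.get(url)
            if entry is not None and self.clock() - entry[0] <= self.max_age:
                return entry[1]   # step 3: cache is fresh
            job = self._pending.get(url)
            if entry is not None and job is None:
                # step 4: queue at most one refresh job per resource
                job = self._pending[url] = self._pool.submit(self._refresh, url)
        if entry is None:
            # step 2: nothing cached, fetch directly and block this request
            body = self.fetch(url)
            with self._lock:
                self._cache[url] = (self.clock(), body)
            return body
        try:
            return job.result(timeout=self.wait)
        except Exception:
            # Refresh timed out or failed: fall back to the stale copy.
            return entry[1]
```

Note that a timeout in get() only releases the request thread; the fetch itself keeps running in the pool, which is why the fetch function still needs its own timeout.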

Totally unavailable resources are still a problem.  For example, a failed DNS lookup.  There may be no cache available, and so every attempt to fetch that resource will block (case (2)).  Ideally caches will be long-lived -- saved to disk, and persistent over restarts.  This way even if a theme page is gone for days on end, the site we host will still be viewable and working.  (Some error message is necessary; where will it go?)  If we combine this with validation during setup (i.e., making sure the theme page really exists and is fetchable) then a totally unavailable theme should be uncommon.  We should still handle it, though.
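
One way a disk-backed cache with fallback might look, as a sketch only.  DiskCache and fetch_with_fallback are hypothetical names, the JSON file layout is an assumption, and where the error gets reported is left open, as above.

```python
import hashlib
import json
import os
import tempfile
import time


class DiskCache:
    """Keep one JSON file per resource so the cache survives restarts."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, url):
        name = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return os.path.join(self.directory, name + ".json")

    def save(self, url, body):
        record = {"url": url, "body": body, "fetched_at": time.time()}
        # Write to a temporary file and rename it into place, so a crash
        # never leaves a half-written cache entry behind.
        fd, tmp = tempfile.mkstemp(dir=self.directory, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(record, f)
        os.replace(tmp, self._path(url))

    def load(self, url):
        """Return (fetched_at, body), or None if nothing usable is stored."""
        try:
            with open(self._path(url), encoding="utf-8") as f:
                record = json.load(f)
        except (OSError, ValueError):
            return None
        return record["fetched_at"], record["body"]


def fetch_with_fallback(url, fetch, cache):
    """Return (body, used_stale_copy); re-raise only if no copy exists."""
    try:
        body = fetch(url)
    except Exception:
        entry = cache.load(url)
        if entry is None:
            raise
        return entry[1], True
    cache.save(url, body)
    return body, False
```

The second element of the result tells the caller that a stale copy was served, which is the point where an error message could be logged or shown.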

thread2 provides a useful interface for killing threads.
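
For reference, one mechanism CPython offers for this is the C API call PyThreadState_SetAsyncExc, reachable through ctypes.  The sketch below shows that mechanism directly; it is not a description of thread2's own interface, and RequestKilled, async_raise and demo are hypothetical names.  The exception is only delivered when the target thread next runs Python bytecode, so a thread blocked inside a hung socket call will not see it until that call returns, which matches the caveat in (3).

```python
import ctypes
import threading
import time


class RequestKilled(Exception):
    """Raised inside a worker thread that is being aborted."""


def async_raise(thread_ident, exc_type):
    """Ask CPython to raise exc_type in the thread with the given ident."""
    res = ctypes.pythonapi.PyThreadState_SetAsyncExc(
        ctypes.c_ulong(thread_ident), ctypes.py_object(exc_type))
    if res == 0:
        raise ValueError("no thread with that ident")
    if res > 1:
        # More than one thread state was modified; cancel the request.
        ctypes.pythonapi.PyThreadState_SetAsyncExc(
            ctypes.c_ulong(thread_ident), None)
        raise SystemError("PyThreadState_SetAsyncExc failed")


def demo():
    """Start a looping thread, kill it, and report what the thread saw."""
    seen = []
    ready = threading.Event()

    def worker():
        try:
            ready.set()
            while True:
                time.sleep(0.01)  # returns to the interpreter regularly
        except RequestKilled:
            seen.append("killed")

    thread = threading.Thread(target=worker)
    thread.start()
    ready.wait(2)
    async_raise(thread.ident, RequestKilled)
    thread.join(2)
    return seen
```

Because the worker catches the exception itself, it gets a chance to clean up; a request thread killed this way could use the same hook to return an error response.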