Code


Version 2 (modified by Paul Collier, 7 years ago)

Seems my copy-to-clipboard script is broken...

This is the original Django GSoC proposal. There have been quite a few revisions since, but I'm posting this first for reference.

Abstract

This addition to Django's ORM adds simple drop-in caching, compatible with nearly all existing QuerySet methods. It emphasizes performance and compatibility, and provides configuration options with sane defaults. All that is required for basic functionality is a suitable CACHE_BACKEND setting and the addition of .cache() to the appropriate QuerySet chains. It also speeds up the lookup of related objects, including those reached through generic relations.

Proposed Design

The QuerySet class grows two new methods to add object caching:

cache()

cache(timeout=None, prefix='qscache:', smart=False)

timeout defaults to the timeout specified in CACHE_BACKEND. prefix is applied in addition to CACHE_MIDDLEWARE_KEY_PREFIX.

Cache keys are calculated from the content-type id and instance id, to accommodate generic relations.
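As a minimal sketch of the key scheme described above: the function name make_cache_key, the site_prefix argument (standing in for CACHE_MIDDLEWARE_KEY_PREFIX), and the exact separator layout are assumptions for illustration, not part of the proposal.

```python
def make_cache_key(content_type_id, object_id, prefix="qscache:", site_prefix=""):
    """Build a cache key from a content-type id and an instance id.

    Hypothetical helper: the layout "<site_prefix><prefix><ct_id>:<obj_id>"
    is an assumption. Including the content-type id means two different
    models with the same primary key never collide.
    """
    return f"{site_prefix}{prefix}{content_type_id}:{object_id}"


# Same primary key, different content types, different keys.
article_key = make_cache_key(7, 42)
comment_key = make_cache_key(9, 42)
```

Because the key carries the content type, a generic foreign key (which stores exactly a content-type id and an object id) can be turned into a cache key without knowing the target model in advance.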

Internally, QuerySet grows some new attributes that affect how SQL is generated. When in effect, they cause the query to retrieve only the primary keys of the selected objects. in_bulk() uses the cache directly, although cache misses will still require database hits, as usual. Methods such as delete() and count() are largely unaffected by cache(), but methods such as distinct() are a more difficult case and will require some design decisions. Using extra(select=...) may also prove to be an unsolvable case.
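The primary-key-first lookup described above might look roughly like the following. This is a sketch under stated assumptions: a plain dict stands in for the cache backend, and fetch_objects, load_from_db, and key_for are hypothetical names rather than Django APIs.

```python
def fetch_objects(pks, cache, load_from_db, key_for):
    """Resolve a list of primary keys to objects, reading the cache first.

    pks          -- primary keys returned by the (cheap) pk-only query
    cache        -- dict-like stand-in for the cache backend
    load_from_db -- callable taking a list of pks, returning {pk: obj}
    key_for      -- callable mapping a pk to its cache key
    """
    found = {}
    missing = []
    for pk in pks:
        key = key_for(pk)
        if key in cache:
            found[pk] = cache[key]
        else:
            missing.append(pk)

    # Cache misses still cost one database hit, but only for the missing rows.
    if missing:
        for pk, obj in load_from_db(missing).items():
            cache[key_for(pk)] = obj
            found[pk] = obj

    # Preserve the ordering of the original pk query.
    return [found[pk] for pk in pks if pk in found]
```

The same routine would serve in_bulk(), since that method already starts from a list of primary keys.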

If values() has been used in the query, cache() takes precedence and creates the values dictionary from cache. If a list of fields is specified in values(), cache() will still perform the equivalent of a SELECT *. Perhaps another option could be added to allow retrieval of only the specified fields, which would break any regular cached lookup for that object.
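To illustrate how a values dictionary could be built from a fully cached object, here is a small sketch. The Article class and the values_from_cached helper are hypothetical; the real implementation would work from model field metadata rather than instance attributes.

```python
class Article:
    """Stand-in for a cached model instance (hypothetical)."""

    def __init__(self, pk, title, body):
        self.pk = pk
        self.title = title
        self.body = body


def values_from_cached(obj, fields=None):
    """Build a values()-style dict from a cached object.

    The whole object is always cached (the SELECT * equivalent); the
    requested field list only narrows what is returned to the caller.
    """
    row = dict(vars(obj))
    if fields is None:
        return row
    return {name: row[name] for name in fields}


cached = Article(1, "Caching", "Some text")
title_only = values_from_cached(cached, ["title"])
```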

select_related() is supported by the caching mechanism. The appropriate joins are still performed by the database; if joins were calculated with cached object foreign key values, cache misses could be very costly.

cache_generic()

cache_generic(field, timeout=None, prefix='qscache:', smart=False)

field is the name of the generic foreign key field.

Without database-specific trickery it is non-trivial to perform SQL JOINs with generic relations. Currently, a database query is required for each generic foreign key relationship. The cache framework, while unable to reduce the initial number of database hits, greatly alleviates load when lists of generic objects are required. Using this method still loads generic foreign keys lazily, but more quickly, and also uses objects cached with cache().
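The lazy, cache-backed generic lookup could be sketched as follows. GenericLoader is a hypothetical name, and both the cache and the database are simulated with dicts; the point is only that repeated lookups of the same target stop hitting the database.

```python
class GenericLoader:
    """Resolve generic foreign keys lazily, consulting the cache first."""

    def __init__(self, cache, db):
        self.cache = cache  # dict stand-in for the cache backend
        self.db = db        # {(content_type_id, object_id): obj}, stand-in for the database
        self.db_hits = 0

    def get(self, content_type_id, object_id):
        # Assumed key layout, shared with objects cached through cache().
        key = f"qscache:{content_type_id}:{object_id}"
        if key in self.cache:
            return self.cache[key]
        # The first lookup of each target still costs one query.
        self.db_hits += 1
        obj = self.db[(content_type_id, object_id)]
        self.cache[key] = obj
        return obj


loader = GenericLoader(cache={}, db={(7, 42): "article 42", (9, 42): "comment 42"})
first = loader.get(7, 42)
second = loader.get(7, 42)  # served from cache, no extra database hit
```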

Background logic

To achieve as much transparency as possible, the QuerySet methods quietly establish post_save and post_delete signal listeners the first time a model is cached. Object deletion is trivial. On object creation or modification, the preferred behavior is to create or update the cached key rather than simply deleting the key and letting the cache regenerate it; the rationale is that the object is most likely to be viewed immediately afterward, and caching it at post_save is cheap. However, some use cases may not suit this behavior, so it is likely subject to debate or may need a global setting.
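A rough sketch of the two listeners follows. The handlers use the (sender, instance, **kwargs) shape that Django signal receivers take, but everything else is an assumption: the cache is a plain dict, cache_key is a hypothetical helper, and the class name stands in for the content-type id.

```python
cache = {}  # dict stand-in for the cache backend


def cache_key(instance):
    # Assumption: the class name stands in for the content-type id.
    return f"qscache:{type(instance).__name__}:{instance.pk}"


def on_post_save(sender, instance, **kwargs):
    """Write-through: refresh the cached copy instead of just invalidating it."""
    cache[cache_key(instance)] = instance


def on_post_delete(sender, instance, **kwargs):
    """Drop the cached copy; a missing key is not an error."""
    cache.pop(cache_key(instance), None)


class Entry:
    """Stand-in for a model instance (hypothetical)."""

    def __init__(self, pk, title):
        self.pk = pk
        self.title = title


entry = Entry(1, "draft")
on_post_save(Entry, entry)    # the signal framework would normally call this
entry.title = "final"
on_post_save(Entry, entry)    # an update overwrites the cached copy
```

In a real project the handlers would be connected to the post_save and post_delete signals for each cached model the first time cache() is used on it.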

To reduce the number of cache misses, additional "smart" logic can be added. For example, the first time a model is registered to the cache signal listener, its model instances are expected to be uncached. In this case, rather than fetching only primary keys, the objects are retrieved as normal (and cached).

By storing the expiration time, the same full-fetch behavior can also take effect whenever the cached objects have likely timed out. All "smart" functionality is enabled using the smart keyword argument.
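One way the "smart" decision might be tracked is sketched below. SmartTracker and its method names are hypothetical, and the per-model expiry estimate is an assumption about how "storing the expiration time" could work.

```python
import time


class SmartTracker:
    """Decide between a pk-only query and a full fetch, per model."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.expires_at = {}  # model name -> estimated expiry timestamp

    def should_fetch_full(self, model_name, now=None):
        """True when the model was never cached or its entries have likely expired."""
        now = time.time() if now is None else now
        expiry = self.expires_at.get(model_name)
        return expiry is None or now >= expiry

    def mark_cached(self, model_name, now=None):
        """Record that the model's objects were just cached."""
        now = time.time() if now is None else now
        self.expires_at[model_name] = now + self.timeout


tracker = SmartTracker(timeout=300)
first_time = tracker.should_fetch_full("Article", now=1000)  # never cached yet
tracker.mark_cached("Article", now=1000)
while_fresh = tracker.should_fetch_full("Article", now=1100)
after_expiry = tracker.should_fetch_full("Article", now=1300)
```

When should_fetch_full returns True, the QuerySet would run the normal query and cache the results, avoiding a pk-only query followed by a miss for every row.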

Implementation Notes

  • All caching code lives in a contrib app at first. A custom QuerySet class derives from the official class, overriding where appropriate. A Manager class with an overridden get_query_set() is used for testing, and additional middleware and other supporting code are located in the same folder. Near or upon completion, the new code can be merged to trunk as Django proper. Hopefully the code will not be too invasive, but quite a few QuerySet methods will have to be hijacked.
  • If the transaction middleware is enabled, it is desirable to have the cache only update when the transaction succeeds. This is simple in implementation but will couple the transaction middleware to the cache if not designed properly. An additional middleware class can be created to handle this case; however, it will have to stipulate placement immediately after the TransactionMiddleware in settings.py, and might be confused with the existing CacheMiddleware.
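The transaction-aware behavior in the last point could be approached by buffering cache writes until the transaction outcome is known. The sketch below is an assumption about one possible design, not the proposal's implementation; TransactionalCache is a hypothetical name and a dict stands in for the cache backend.

```python
class TransactionalCache:
    """Buffer cache writes and apply them only when the transaction commits."""

    def __init__(self, backend):
        self.backend = backend  # dict stand-in for the real cache backend
        self.pending = []       # queued (operation, key, value) tuples

    def set(self, key, value):
        self.pending.append(("set", key, value))

    def delete(self, key):
        self.pending.append(("delete", key, None))

    def commit(self):
        """Apply queued operations in order, then clear the queue."""
        for op, key, value in self.pending:
            if op == "set":
                self.backend[key] = value
            else:
                self.backend.pop(key, None)
        self.pending.clear()

    def rollback(self):
        """Discard queued operations so the cache never sees failed writes."""
        self.pending.clear()


backend = {}
tx_cache = TransactionalCache(backend)
tx_cache.set("qscache:7:42", "article 42")
tx_cache.rollback()           # transaction failed: the backend stays empty
tx_cache.set("qscache:7:43", "article 43")
tx_cache.commit()             # transaction succeeded: the write is applied
```

A middleware placed after the TransactionMiddleware would call commit() or rollback() depending on how the request's transaction ended.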

Timeline

First Month

  • Write preliminary tests. Initial implementation of cache() for single objects. Support almost all typical QuerySet methods.
  • Devise a generic idiom for testing cache-related code. Work on aggregates; implement select_related(), values(), in_bulk() cases, and cache_generic() method.

Second Month

  • Work on signal dispatching, cache coherency. Write more tests and preliminary documentation.
  • Write "smart" cache logic. Explore other possible optimizations.

  • Add transaction support. Design decision needed about extra middleware.
  • Implement extra features if possible (distinct(), extra(select=...), ...)

Last Month

  • Write up documentation, extensive tests, and example code. Possibly move from contrib into the main cache module.
  • Refactor, especially if the new QuerySet has been released. Continue merging with changes to trunk and testing.
  • Allow for wiggle room, QuerySet refactoring work, cleanup, etc.