Version 18 (modified by wkornewald, 3 years ago)

started going into more detail on the required changes

This wiki page documents the requirements for supporting NoSQL (or non-relational) databases with Django.

This is not part of the official Django development efforts.

The Django-nonrel branch of Django already provides support for NoSQL, and it requires only minimal changes to Django's ORM. However, for the more interesting features like select_related(), Django's ORM needs to be refactored and simplified in several areas. Many of the sections on this page are described from the point of view of Django-nonrel, since much of the experience required for official NoSQL support has been gathered in the Django-nonrel project.

For the record, Django-nonrel has quite a few backends, already:

Also take a look at the feature comparison matrix for an overview of what is supported and what is missing. Database-specific features are sometimes provided by an automatically added manager. For example, MongoDB adds a manager which adds map-reduce and other MongoDB-specific features.

Representing result rows

Problem
SQLCompiler.results_iter() currently returns results as simple lists which represent rows, related selections, annotations, and extra selections. Since the ordering of the lists' entries matters, NoSQL backends have to use complex code to map their results (which are dicts) to a specifically ordered list. In the next step, Django converts each list back to a dict which is passed to the model constructor. The row format is especially inconvenient when combined with select_related() because NoSQL backends then have to collect all fields in the correct order and also take deferred fields into account. In short, the current results format is too SQL-specific.
Solution
Instead of returning lists, results_iter() should return more structured data. For example, each result could be wrapped in a dict like this:
yield {
    'result': {'id': 8, 'some_string_column': 'value', 'some_bool_column': True, ...},
    'related_selection': {
        'fk': {'id': 10, ...},
        'fk__user': {...},
        ...
    },
    'annotations': ...,
    'extra_select': ...,
}
Implementation status
This is not implemented in Django-nonrel.
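To illustrate why a dict-based format would be simpler for NoSQL backends, here is a minimal sketch that contrasts the two approaches. The helper names (to_ordered_row, to_structured_result) and the field list are hypothetical and not part of Django or Django-nonrel. The empty dicts for annotations and extra selections are placeholders, since the proposal above leaves their format open.

```python
# Hypothetical sketch: neither function is a Django or Django-nonrel API.

def to_ordered_row(document, field_order):
    """Roughly what a NoSQL backend has to do today: flatten a dict into
    a list whose positions match the field order Django expects."""
    return [document.get(name) for name in field_order]


def to_structured_result(document, related=None):
    """What the proposed format would allow: pass the dict through and
    attach related selections under their lookup path."""
    return {
        'result': dict(document),
        'related_selection': dict(related or {}),
        'annotations': {},   # placeholder, format not specified above
        'extra_select': {},  # placeholder, format not specified above
    }


doc = {'some_string_column': 'value', 'id': 8, 'some_bool_column': True}
row = to_ordered_row(doc, ['id', 'some_string_column', 'some_bool_column'])
structured = to_structured_result(doc, related={'fk': {'id': 10}})
```

With the structured form the backend never needs to know the field order, which is the part that gets complicated once related selections and deferred fields are involved.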

select_related()

Problem
Django's internal representation of select_related() depends on JOINs, which aren't supported on NoSQL DBs.
Solution
Django needs to provide a simpler internal representation of select_related() which allows the backend to easily retrieve the related models and their selected fields (so deferred fields aren't loaded unnecessarily).
Implementation status
Django-nonrel merely provides a connection.features.supports_select_related flag which tells QuerySet that the backend won't return additional data for the related models in the result rows (otherwise select_related() produces bad results full of None values). All NoSQL backends set this flag to False.
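One way such a simpler representation could look is a plain mapping from each foreign-key field to the related collection and the fields that should be loaded. The sketch below is purely illustrative: the related_spec layout, the load_with_related helper and the in-memory db dict are assumptions made up for this example, not an existing Django or Django-nonrel structure.

```python
# Hypothetical representation: foreign-key field name -> where the related
# entity lives and which of its fields were selected (not deferred).
related_spec = {
    'author': {'collection': 'users', 'fields': ['id', 'name']},
}


def load_with_related(db, collection, pk, spec):
    """Fetch one entity plus the selected fields of its related entities,
    using key lookups only (no JOIN)."""
    entity = dict(db[collection][pk])
    related = {}
    for fk_name, info in spec.items():
        fk_value = entity.get(fk_name)
        if fk_value is None:
            continue  # nullable foreign key without a target
        target = db[info['collection']][fk_value]
        related[fk_name] = {name: target[name] for name in info['fields']}
    return entity, related


# Toy in-memory "database" standing in for a NoSQL store.
db = {
    'posts': {1: {'id': 1, 'title': 'Hello', 'author': 7}},
    'users': {7: {'id': 7, 'name': 'ann', 'bio': 'a long deferred text'}},
}
entity, related = load_with_related(db, 'posts', 1, related_spec)
```

Here the deferred 'bio' field is never copied into the result because it is not listed in the spec.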

AutoField

Problem
Currently, AutoField assumes that it's always an integer. However, in several NoSQL DBs (MongoDB, SimpleDB, etc.) the primary key is a string.
Solution 1
Add a StringAutoField and require developers to explicitly use that. The disadvantage of this solution is that it becomes impossible to reuse existing Django models and NoSQL models become less portable even across NoSQL databases.
Solution 2 (preferred)
Change AutoField to support both integers and strings. Since some existing code assumes that an exception is raised when assigning a string to an AutoField, we could detect the installed backends and keep the old behavior (but additionally show a deprecation warning) when only SQL backends are in use. When using a NoSQL backend, the new behavior would be activated and AutoField would accept both integers and strings without raising an exception.
Additional notes
Portable code should never assume that the "pk" field is a number. If an entity uses a string pk the application should continue to work. This is currently a problem in Django's auth app (see further below).
Implementation status
This is already implemented in Django-nonrel, but it's missing the deprecation warning and the backwards-compatible mode for when only SQL backends are in use.
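The backwards-compatible mode from Solution 2 could be sketched roughly as follows. This is not Django-nonrel's code: autofield_to_python and the nosql_backend_in_use argument are hypothetical names, and a plain ValueError stands in for whatever exception the real field would raise.

```python
import warnings


def autofield_to_python(value, nosql_backend_in_use):
    """Validate a primary key value (hypothetical helper, not Django API)."""
    if value is None:
        return None
    if nosql_backend_in_use:
        # New behavior: integer and string keys are both valid.
        if isinstance(value, (int, str)):
            return value
        raise ValueError('Primary key must be an integer or a string.')
    # Old behavior: only values convertible to int are accepted.
    try:
        return int(value)
    except (TypeError, ValueError):
        warnings.warn(
            'AutoField will accept string keys in the future.',
            DeprecationWarning, stacklevel=2)
        raise ValueError('This value must be an integer.')
```

With a NoSQL backend, autofield_to_python('4d2f', True) returns the string unchanged. With only SQL backends, the same call with False warns and raises as before.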

INSERT vs UPDATE

Problem
Currently, Model.save_base() runs a check whether the pk already exists in the database. This check is necessary for SQL, but it's unnecessary and inefficient on many NoSQL DBs which have an "upsert" operation that inserts or overwrites the entry in the DB. App Engine also doesn't allow running queries within (optimistic) transactions, so the current save() method doesn't work on App Engine.
Solution
Django shouldn't distinguish between insert and update operations on DBs that don't require such a distinction. Instead of checking the DB, each model instance could get a constructor parameter that tells it whether it represents an existing entity or not.
Implementation status
This is already implemented in Django-nonrel, but some of Django's unit tests fail because they assume that the model constructor won't get additional parameters. Maybe an alternative solution is required.
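The idea can be sketched with a toy key-value store. The names below (Entity, UpsertStore, entity_exists, save) are hypothetical and only illustrate the principle. They are not the constructor parameter or the save logic that Django-nonrel actually uses.

```python
class Entity:
    def __init__(self, pk, data, entity_exists=False):
        self.pk = pk
        self.data = data
        # Hypothetical flag: True when the instance was loaded from the DB.
        self.entity_exists = entity_exists


class UpsertStore:
    """Toy store whose put() inserts or overwrites, like an upsert."""

    def __init__(self):
        self.rows = {}

    def put(self, pk, data):
        self.rows[pk] = dict(data)


def save(entity, store):
    """Save without querying the store for the pk first."""
    operation = 'update' if entity.entity_exists else 'insert'
    store.put(entity.pk, entity.data)
    entity.entity_exists = True
    return operation


store = UpsertStore()
fresh = Entity('a1', {'name': 'new'})
first = save(fresh, store)    # 'insert'
second = save(fresh, store)   # 'update'
```

The insert/update distinction is still available (e.g., for signals), but it comes from the instance itself rather than from an extra query.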

Counting

Problem
Counting is not a scalable operation on some DBs (esp. App Engine). The more entities you try to count, the longer the operation takes. In the worst case it times out. Django's Query.count() always tries to count everything instead of just a subset of the results, which is very inefficient. For example, sometimes you might only want to know whether there are more than 10 results, and the exact number only matters if it's less than or equal to 10. To work around timeouts on overly large counting operations, some backends might even artificially limit the maximum count to 1000 (e.g. on App Engine).
Solution
Allow passing an upper limit to the count operation. For instance, queryset.count(100) would never return a number larger than 100, even if there are more results.
Additional notes
A related problem is that it might be impossible to retrieve a large number of results (i.e., not just the count, but the actual entities) or even results beyond a certain offset (esp. on App Engine). Since it's impossible to count the whole result set in advance and possibly even iterate through the whole result set this feature affects all apps that do pagination (e.g., the admin interface). In order to allow paginating through the whole result set so-called cursors must be supported (see below).
Implementation status
Django-nonrel's App Engine backend currently just limits the maximum count to 1000. Other backends don't have a count() limit, but that might lead to inefficient queries.
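The semantics of a limited count can be sketched in plain Python. The count_up_to and has_more_than helpers are hypothetical names that operate on any iterable of results. They only illustrate the proposed behavior and are not a QuerySet API.

```python
from itertools import islice


def count_up_to(results, limit=None):
    """Count the items in an iterable, but stop after `limit` items."""
    if limit is None:
        return sum(1 for _ in results)
    return sum(1 for _ in islice(results, limit))


def has_more_than(results, n):
    """Answer "are there more than n results?" by reading at most n + 1."""
    return count_up_to(results, n + 1) > n
```

For the "more than 10 results" case from above, has_more_than(results, 10) reads at most 11 entities, no matter how large the result set is.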

ListField

NoSQL DBs use ListField in a lot of places. It is basically a replacement for ManyToManyField. BTW, some SQL DBs have a special array type which could also be supported via ListField.

This is already implemented in Django-nonrel.

SetField

Another useful type is SetField which stores a set instead of a list. On DBs that don't support sets this field can be emulated by storing a list, instead. This is the approach taken by Django-nonrel's App Engine backend.

This is already implemented in Django-nonrel.
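The list-based emulation boils down to a pair of conversion functions. The following is a minimal sketch with hypothetical function names, not the App Engine backend's actual code. Sorting is only done here to get a deterministic list and assumes that the items are mutually comparable.

```python
def set_to_db(value):
    """Convert a set to a list for DBs without a native set type."""
    return sorted(value)


def set_from_db(stored):
    """Convert the stored list back to a set (duplicates collapse)."""
    return set(stored)
```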

DictField

MongoDB and other databases use ListField in combination with DictField to completely replace ManyToManyField in a lot of cases. Django currently doesn't provide an API for querying the data within a DictField (especially if it's embedded in a ListField). Ideally, the query API would just use the foo__bar JOIN syntax.

The field is already implemented in Django-nonrel, but lookups aren't supported, yet.

EmbeddedModelField

This is a field which stores model instances like a "sub-table within a field". Internally, it's just a DictField which converts model instances to/from dicts. In addition to the DictField issues this field also has to call the embedded fields' conversion functions, which again requires special support if the JOIN syntax should be supported.

The field is already implemented in Django-nonrel, but lookups aren't supported, yet.
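The conversion step can be sketched without Django. In this example the Address class and its fields mapping (field name to a pair of to-DB and from-DB converters) are hypothetical stand-ins for a model and its fields' conversion functions. The real field works with Django's field API instead.

```python
import datetime


class Address:
    # Hypothetical stand-in for model fields:
    # field name -> (convert to DB value, convert from DB value)
    fields = {
        'city': (str, str),
        'since': (lambda d: d.isoformat(), datetime.date.fromisoformat),
    }

    def __init__(self, city, since):
        self.city = city
        self.since = since


def embedded_to_db(instance):
    """Convert an embedded instance to a dict of DB-ready values."""
    return {name: to_db(getattr(instance, name))
            for name, (to_db, _) in instance.fields.items()}


def embedded_from_db(model, data):
    """Rebuild an instance from the stored dict."""
    return model(**{name: from_db(data[name])
                    for name, (_, from_db) in model.fields.items()})


home = Address('Berlin', datetime.date(2010, 5, 1))
stored = embedded_to_db(home)
restored = embedded_from_db(Address, stored)
```

A lookup into the embedded data would have to run the same per-field conversion on the lookup value, which is why lookups need special support.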

BlobField

Many databases provide support for a raw binary data type. Many App Engine developers depend on this field to store file-like data because App Engine doesn't provide write access to the file system (there is a new Blobstore API, but that doesn't yet allow direct write access).

This is already implemented in Django-nonrel.

ImageField

Currently, ImageField depends on PIL. It might be necessary to provide a backend API for sandboxed platforms (like App Engine) that don't provide PIL support.

This is not implemented in Django-nonrel.

Serializers

Due to the lack of JOIN support on NoSQL DBs, Django fails to serialize any app's entities that have a ManyToManyField (e.g. django.contrib.auth). Instead of fetching whole entities, Django could fetch only the keys which are stored in the ForeignKey columns. That way, JOINs aren't required anymore.

This is already implemented in Django-nonrel.

Batch operations

For optimization purposes it's very important to allow batch-saving and batch-deleting a list of model instances (which, in the case of batch-deletion, is not exactly the same as QuerySet.delete() which first has to fetch the entities from the DB in order to delete them).

This is not implemented in Django-nonrel, but Vladimir Mihailenco has implemented a patch which could be integrated.

Transactions

Some backends don't support transactions at all (e.g., SimpleDB). Others (e.g., App Engine) only support optimistic transactions similar to SELECT ... FOR UPDATE (which isn't exactly the same as @commit_on_success because it really locks items for read/write access).

Django-nonrel currently doesn't provide any support for optimistic transactions.

Pagination

On some DBs it's inefficient to request entities using a large offset (queryset[5000:...]). E.g., App Engine's datastore doesn't actually support offsets. When you use an offset the datastore always starts from offset 0 and throws away all results you didn't request (which means you can't ever query e.g. for the 10000th result). Instead of integer offsets App Engine and SimpleDB provide some kind of "bookmark" which marks the query's current position in the result set. You can pass a bookmark to a query to move the cursor to a certain position in the result set and then query efficiently from there.

This also affects the pagination in the admin interface. Efficient "pagination" would only provide forward/backward navigation without any page numbering. This would also be a candidate for paginating via AJAX (e.g. like in Twitter).

Django-nonrel doesn't yet support bookmarks, but the App Engine backend provides a private API for them.
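Forward-only pagination with a bookmark can be sketched as follows. Real App Engine cursors are opaque values produced by the datastore. This example uses the last key of the previous page as a stand-in, and fetch_page is a hypothetical helper, not the App Engine backend's private API.

```python
import bisect


def fetch_page(sorted_keys, page_size, bookmark=None):
    """Return (page, next_bookmark) from a sorted list of unique keys.

    The bookmark is the last key of the previous page, so the next page
    starts right after it instead of skipping rows from offset 0.
    """
    start = 0 if bookmark is None else bisect.bisect_right(sorted_keys, bookmark)
    page = sorted_keys[start:start + page_size]
    has_more = start + page_size < len(sorted_keys)
    next_bookmark = page[-1] if page and has_more else None
    return page, next_bookmark


keys = list(range(1, 8))
page1, mark1 = fetch_page(keys, 3)
page2, mark2 = fetch_page(keys, 3, mark1)
page3, mark3 = fetch_page(keys, 3, mark2)
```

A next_bookmark of None signals the last page. Note that nothing in this scheme yields a total page count, which is why only forward/backward navigation is possible.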

Multi-table inheritance

Multi-table inheritance requires JOIN support, so this feature can't be fully supported. For convenience it would be nice to allow subclassing a non-abstract model, but only copying its fields as if it were abstract.

Auth password reset URLs

#14881, Patch

Problem
django.contrib.auth's password reset URLs contain a base36-encoded user ID (/reset/<user-id>/<token>/). Several NoSQL backends (MongoDB, SimpleDB, etc.) use string-based primary keys. The password reset feature breaks if the user ID (the primary key) is not an integer (because base36 can only express integers).
Solution
Encode the user ID in a URL-safe variant of base64. This is a backwards-incompatible change that breaks "old-style" password reset URLs, but backwards compatibility should be very easy to implement if required.
Implementation status
This is already implemented in Django-nonrel, but it's not yet backwards-compatible.
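A minimal sketch of the encoding, using Python's standard base64 module. The encode_uid and decode_uid names are hypothetical and this is not the code from the patch. Note that decoding always yields a string, so an integer pk has to be converted back by the caller.

```python
import base64


def encode_uid(pk):
    """Encode an integer or string pk as URL-safe base64 without padding."""
    raw = str(pk).encode('utf-8')
    return base64.urlsafe_b64encode(raw).decode('ascii').rstrip('=')


def decode_uid(token):
    """Reverse encode_uid(); re-adds the stripped '=' padding first."""
    padded = token + '=' * (-len(token) % 4)
    return base64.urlsafe_b64decode(padded.encode('ascii')).decode('utf-8')
```

Stripping the padding keeps '=' out of the URL. The URL-safe alphabet replaces '+' and '/' with '-' and '_', so the URL pattern for the user ID part has to allow those characters.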

Minor issues

The default ordering on permissions requires JOINs. This makes them unusable on NoSQL DBs.

The permission creation code uses an __in lookup with too many values. App Engine can only handle 30 values (except for the primary key, which can handle 500). This could be worked around, but the limitation was added for efficiency reasons (__in lookups are converted into a set of queries that are executed in parallel and then de-duplicated). Thus, it's not really a solution to just run several of those queries. Instead, the permission creation code should just fetch all permissions at once. Maybe this limitation will be removed in a later App Engine release when App Engine's new query mechanism goes live (which supports OR queries and gets rid of several other limitations).
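The "fetch all permissions at once" approach could look roughly like this. The permissions_to_create helper, the fetch_all_permissions callable and the (content type, codename) pairs are assumptions for illustration only and don't mirror the actual code in django.contrib.auth.

```python
def permissions_to_create(wanted, fetch_all_permissions):
    """Return the wanted (content_type, codename) pairs that don't exist yet.

    fetch_all_permissions is a hypothetical callable that returns every
    stored pair with one unfiltered query, so no __in lookup is needed.
    """
    existing = set(fetch_all_permissions())
    return [pair for pair in wanted if pair not in existing]


stored = [('auth.user', 'add_user'), ('auth.user', 'change_user')]
wanted = [('auth.user', 'add_user'), ('auth.user', 'delete_user')]
to_create = permissions_to_create(wanted, lambda: stored)
```

The filtering happens in memory, which should be acceptable as long as the total number of permissions stays small.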