This page documents the requirements for supporting NoSQL (or non-relational) databases with Django.
The Django-nonrel branch of Django already provides support for NoSQL and requires only minimal changes to Django's ORM. However, for the more interesting features like select_related(), Django's ORM needs to be refactored and simplified in several areas. This wiki page describes the required changes and the current limitations of Django-nonrel.
For the record, Django-nonrel has several backends:
- App Engine: djangoappengine
- MongoDB: django-mongodb-engine
- ElasticSearch: django-elasticsearch
- Cassandra: django_cassandra_backend
Representing result rows
SQLCompiler.results_iter() currently returns results as simple lists which represent rows. This adds unnecessary complexity to NoSQL backends: they have to map their results (which are dicts) to a specifically ordered list, and then Django takes that list and converts it back to a dict which gets passed to the model constructor. The row format is especially inconvenient when combined with select_related(), because then NoSQL backends have to collect all fields in the correct order and also take deferred fields into account.
Instead of returning lists, results_iter() should return more structured data. For example, each result could be wrapped as a dict like this:

    yield {
        'result': {'id': 8, 'some_string_column': 'value', 'some_bool_column': True, ...},
        'related_selection': {
            'fk': {'id': 10, ...},
            'fk__user': {...},
            ...
        },
        'annotations': ...,
        'extra_select': ...,
    }
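As a rough illustration of why the dict format is easier for backends, here is a minimal sketch. The helper wrap_result() and the Article class are hypothetical stand-ins, not part of Django or Django-nonrel. The point is only that a backend whose native results are dicts could hand them over without reordering anything.

```python
def wrap_result(row, related=None, annotations=None, extra_select=None):
    """Wrap a backend's native dict in the structured format proposed above.

    Hypothetical helper: the key names mirror the proposal, nothing more.
    """
    return {
        'result': dict(row),
        'related_selection': dict(related or {}),
        'annotations': dict(annotations or {}),
        'extra_select': dict(extra_select or {}),
    }


class Article:
    """Stand-in for a model class; a real model would be a Django Model."""

    def __init__(self, id, title):
        self.id = id
        self.title = title


# The backend's document goes straight to the constructor as keyword
# arguments; no intermediate ordered list is needed.
document = {'id': 8, 'title': 'NoSQL'}
wrapped = wrap_result(document, related={'author': {'id': 10}})
article = Article(**wrapped['result'])
```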
select_related()
Django implements this in a way that requires JOINs, so this doesn't work on non-relational DBs. Still, this feature should be supported by NoSQL backends. Django needs to provide an easier format for NoSQL backends and the result value should also be simplified, as described above in "Representing result rows".
AutoField
In some DB systems the primary key is a string. Currently, AutoField assumes that it's always an integer.
Implementing an auto-increment field in SimpleDB would be extremely difficult. I would say impossible, actually. The eventual consistency model just doesn't support it. For the persistence layers I have written on top of SimpleDB, I use a UUID (type 4) as the ID of the object. --garnaat
Conclusion: Portable code should never assume that the "pk" field is a number. If an entity uses a string pk the application should continue to work.
This is already implemented in Django-nonrel.
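A minimal sketch of the UUID approach mentioned in the comment above, using only Python's standard uuid module. The function names are hypothetical. The second helper shows the portable habit of treating the pk as an opaque value.

```python
import uuid


def new_string_pk():
    """Return a random (version 4) UUID as a string primary key."""
    return str(uuid.uuid4())


def same_entity(pk_a, pk_b):
    """Compare pks as opaque values.

    Portable code should only test pks for equality and never cast them
    to int or do arithmetic on them.
    """
    return pk_a == pk_b


pk = new_string_pk()
```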
ListField
NoSQL DBs use ListField in a lot of places. A ListField is basically a replacement for ManyToManyField. BTW, some SQL DBs have a special array type which could also be supported via ListField.
This is already implemented in Django-nonrel.
SetField
Another useful type is SetField, which stores a set instead of a list. On DBs that don't support sets this field can be emulated by storing a list instead. This is the approach taken by Django-nonrel's App Engine backend.
This is already implemented in Django-nonrel.
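The list-based emulation can be sketched as a pair of conversion functions. This is not Django-nonrel's actual code. The function names are hypothetical, and sorting the stored list is an extra assumption made here for deterministic output (it requires mutually comparable items).

```python
def set_to_db(value):
    """Convert a set (or any iterable) to a duplicate-free list for storage.

    Sorting is an assumption of this sketch, not a requirement of the
    emulation; it only makes the stored value deterministic.
    """
    return sorted(set(value))


def set_from_db(stored):
    """Convert the stored list back into a Python set."""
    return set(stored)


stored = set_to_db({3, 1, 2})
restored = set_from_db(stored)
```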
DictField
MongoDB and other databases use ListField in combination with DictField to completely replace ManyToManyField in a lot of cases. Django currently doesn't provide an API for querying the data within a DictField (especially if it's embedded in a ListField). Ideally, the query API would just use the foo__bar JOIN syntax.
The field is already implemented in Django-nonrel, but lookups aren't supported yet.
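To make the foo__bar idea concrete, here is a minimal in-memory sketch of how such a path could be resolved against nested dicts and lists. The function resolve_lookup() is hypothetical. A real backend would translate the path into its native query language instead of walking Python objects.

```python
def resolve_lookup(value, lookup):
    """Follow a 'foo__bar' path through nested dicts and return all matches.

    Lists met along the way are searched element by element, which covers
    the case of a DictField embedded in a ListField.
    """
    current = [value]
    for key in lookup.split('__'):
        next_level = []
        for item in current:
            candidates = item if isinstance(item, list) else [item]
            for candidate in candidates:
                if isinstance(candidate, dict) and key in candidate:
                    next_level.append(candidate[key])
        current = next_level
    return current


entity = {'comments': [{'author': 'ann'}, {'author': 'bob'}]}
authors = resolve_lookup(entity, 'comments__author')
```

A filter such as comments__author='bob' would then match if 'bob' is among the resolved values.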
EmbeddedModelField
This is a field which stores model instances like a "sub-table within a field". Internally, it's just a DictField which converts model instances to/from dicts. In addition to the DictField issues, this field also has to call the embedded fields' conversion functions, which again requires special support if the JOIN syntax should be supported.
The field is already implemented in Django-nonrel, but lookups aren't supported yet.
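The conversion between instances and dicts, including the per-field conversion step, might look roughly like the following sketch. The Visit class and its field_converters mapping are hypothetical and only stand in for a model and its fields' conversion functions. This is not how Django-nonrel implements the field.

```python
import datetime


class Visit:
    """Stand-in for an embedded model."""

    # Hypothetical mapping: field name -> (to_db, from_db) converters.
    field_converters = {
        'place': (str, str),
        'day': (datetime.date.isoformat, datetime.date.fromisoformat),
    }

    def __init__(self, place, day):
        self.place = place
        self.day = day


def embedded_to_dict(instance):
    """Convert an instance to a dict, applying each field's to_db converter."""
    return {
        name: to_db(getattr(instance, name))
        for name, (to_db, _from_db) in instance.field_converters.items()
    }


def embedded_from_dict(model_class, data):
    """Rebuild an instance, applying each field's from_db converter."""
    kwargs = {
        name: from_db(data[name])
        for name, (_to_db, from_db) in model_class.field_converters.items()
    }
    return model_class(**kwargs)


visit = Visit('Berlin', datetime.date(2011, 1, 2))
as_dict = embedded_to_dict(visit)
restored = embedded_from_dict(Visit, as_dict)
```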
Multi-table inheritance
Multi-table inheritance requires JOIN support, so this feature can't be fully supported.
On non-relational DBs it could be partially emulated with a ListField that stores the names of all models which it derives from. E.g., if model B derives from model A it would store model B in model A's table and add B's name (app_b) to the ListField.
On App Engine this adds deeper composite indexes, which is a problem when filtering against multiple ListFields and combining that with inequality filters or result ordering (exploding indexes). Thus, this should only be used at the second inheritance level (seen from the Model base class).
Problem: Model A doesn't know about model B, but since both of them live in the same table an A instance has to know about B's fields, so when A is saved it can preserve B's data (you can't modify only specific fields; you always replace the whole row). Either we always keep all data (which means you never free up data after schema changes unless you use a lower-level API) or we keep track of all derived models' fields and preserve those while removing all unused fields (e.g., A would know about B's fields and preserve them when saving). Probably the first solution is the safest.
This is not implemented in Django-nonrel.
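The first solution described above (always keep all data) can be sketched with a plain dict standing in for the shared table. The function name and the 'classes' key are hypothetical. The only point shown is that saving through model A merges A's values into the stored row instead of replacing the row outright.

```python
def save_preserving_unknown(table, pk, known_values):
    """Write known_values for pk while keeping stored keys the caller
    doesn't know about (e.g. fields that belong to a derived model).

    `table` is a dict standing in for the datastore: pk -> row dict.
    """
    row = dict(table.get(pk, {}))  # start from whatever is stored
    row.update(known_values)       # overwrite only the known fields
    table[pk] = row                # the whole row is still replaced at once
    return row


# Row written by model B, which derives from model A.
table = {1: {'classes': ['app_a', 'app_b'], 'title': 'old', 'b_only': 42}}

# Saving through model A, which only knows about 'title'.
save_preserving_unknown(table, 1, {'title': 'new'})
```

The trade-off is the one noted above: keys left over from old schema versions are never freed by this approach.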
INSERT vs UPDATE
Currently, Model.save_base() runs a check whether the pk already exists in the database. This check is necessary for SQL, but it's unnecessary and inefficient on many NoSQL DBs and it also conflicts with App Engine's optimistic transactions. Thus, Django should not distinguish between insert and update operations on DBs that don't require it.
This comes with a minor problem: Without that check, model instances have to track whether they were instantiated from the DB and thus exist in the DB or not. Otherwise the add parameter of Field.pre_save() won't work correctly and the post_save signal won't report correctly whether this is a new entity or not.
This is already implemented in Django-nonrel.
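The tracking described above can be sketched as follows. The Entity class, its _in_db flag and the dict-based store are hypothetical and are not Django-nonrel's implementation. They only show how "created" can be derived without querying the database first.

```python
class Entity:
    """Stand-in for a model instance that remembers where it came from."""

    def __init__(self, pk, **values):
        self.pk = pk
        self.values = values
        self._in_db = False  # hypothetical flag: not loaded from the DB

    @classmethod
    def from_db(cls, pk, values):
        """Build an instance from a stored row and mark it as existing."""
        obj = cls(pk, **values)
        obj._in_db = True
        return obj

    def save(self, store):
        """Write the entity with a single put and report whether it is new.

        No existence check is run against the store; 'created' comes from
        the instance's own state.
        """
        created = not self._in_db
        store[self.pk] = dict(self.values)
        self._in_db = True
        return created


store = {}
entity = Entity('a', name='x')
first_save = entity.save(store)   # new instance
second_save = entity.save(store)  # same instance, now known to exist
loaded = Entity.from_db('a', store['a'])
loaded_save = loaded.save(store)
```

The returned flag is what would feed the add parameter of Field.pre_save() and the created argument of the post_save signal.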
delete()
By default, on Model.delete() Django emulates ON DELETE CASCADE. On App Engine this is not possible because queries are disabled while running a transaction. Even without transactions this can be very inefficient on App Engine, SimpleDB, and other NoSQL DBs because Django has to run a lot of queries and retrieve a lot of model instances. Even worse, since this operation is so inefficient it can be absolutely impossible to retrieve all related entities if there are significantly more than 1000 entities (on GAE the 1000 results limit has been removed, but it's still not possible to retrieve e.g. 5000 results).
It should be possible for backends to override cascading deletes (e.g. on App Engine the backend might distribute the deletion across background tasks to handle the load).
For now, in Django-nonrel cascading deletes are completely disabled. This obviously is not a good long-term solution.
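One way a backend might distribute a cascading delete, sketched under assumptions: the enqueue callable stands in for whatever background task API the platform offers, and both function names and the batch size are hypothetical.

```python
def batched(keys, batch_size):
    """Yield consecutive slices of keys with at most batch_size items each."""
    for start in range(0, len(keys), batch_size):
        yield keys[start:start + batch_size]


def schedule_cascade_delete(related_keys, enqueue, batch_size=500):
    """Hand the keys of related entities to background tasks in batches.

    `enqueue` is a hypothetical callable that schedules one task per batch;
    each task would then delete its batch on its own.
    Returns the number of tasks scheduled.
    """
    scheduled = 0
    for batch in batched(list(related_keys), batch_size):
        enqueue(batch)
        scheduled += 1
    return scheduled


# A plain list stands in for the task queue.
tasks = []
task_count = schedule_cascade_delete(range(1200), tasks.append, batch_size=500)
```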
Transactions
Not all backends support transactions at all (e.g., SimpleDB). Some (e.g., App Engine) only support optimistic transactions similar to SELECT ... FOR UPDATE (which isn't exactly the same as @commit_on_success because it really locks items for read/write access). Also, not all backends support separate BEGIN TRANSACTION and END TRANSACTION operations, but instead only have an API for calling a complete function within a transaction.
Django-nonrel currently doesn't provide any support for optimistic transactions.
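The "call a complete function within a transaction" style of API could have roughly the following shape. This sketch is hypothetical throughout: ConflictError and run_in_transaction() are names invented here, and the actual begin and commit steps of a real backend are omitted. It only shows why the whole function has to be passed in: on a conflict the backend re-runs it from the start.

```python
class ConflictError(Exception):
    """Hypothetical error raised when an optimistic transaction collides."""


def run_in_transaction(func, *args, retries=3, **kwargs):
    """Run func as one unit and retry it when a conflict is reported.

    Assumes retries >= 1. A real backend would begin a transaction before
    calling func and commit it afterwards.
    """
    for attempt in range(1, retries + 1):
        try:
            return func(*args, **kwargs)
        except ConflictError:
            if attempt == retries:
                raise


calls = []


def flaky_update():
    """Fail with a conflict on the first call, succeed on the second."""
    calls.append(1)
    if len(calls) < 2:
        raise ConflictError()
    return 'done'


outcome = run_in_transaction(flaky_update)
```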
Pagination
On some DBs it's inefficient to request entities using a large offset (queryset[5000:...]). E.g., App Engine's datastore doesn't actually support offsets. When you use an offset the datastore always starts from offset 0 and throws away all results you didn't request (which means you can't ever query e.g. for the 10000th result). Instead of integer offsets App Engine and SimpleDB provide some kind of "bookmark" which marks the query's current position in the result set. You can pass a bookmark to a query to move the cursor to a certain position in the result set and then query efficiently from there.
Django-nonrel doesn't yet support bookmarks.
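The bookmark idea can be sketched with a sorted list standing in for the datastore's index. The function fetch_page() and the choice of "last key seen" as the bookmark are assumptions of this sketch. Real datastore bookmarks are opaque values produced by the backend.

```python
import bisect


def fetch_page(sorted_keys, page_size, bookmark=None):
    """Return one page of keys plus a bookmark for the next page.

    `sorted_keys` stands in for an ordered index. The bookmark is the last
    key of the previous page, so the next page is found by seeking to that
    position rather than by skipping `offset` results.
    """
    start = 0 if bookmark is None else bisect.bisect_right(sorted_keys, bookmark)
    page = sorted_keys[start:start + page_size]
    next_bookmark = page[-1] if page else bookmark
    return page, next_bookmark


keys = list(range(0, 100, 10))
first_page, bookmark = fetch_page(keys, 3)
second_page, bookmark_2 = fetch_page(keys, 3, bookmark)
```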
count()
Query.count() is problematic since a scalable count() method doesn't exist, at least on App Engine. It would be nice to be able to pass an upper limit like count(100), so if there are more than 100 results it will still return just 100.
Django-nonrel currently just limits the maximum count to 1000.
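The capped behaviour of a call like count(100) can be sketched over any iterator of results. The name limited_count() is hypothetical. The sketch only shows that counting stops as soon as the limit is reached, so the cost is bounded by the limit and not by the size of the result set.

```python
from itertools import islice


def limited_count(results, limit):
    """Count the results, but never consume or report more than `limit`."""
    return sum(1 for _ in islice(results, limit))


capped = limited_count(iter(range(250)), 100)
exact = limited_count(iter(range(42)), 100)
```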