wiki:NonSqlBackends

Context Navigation

Version 27 (modified by Waldemar Kornewald, 17 years ago) ( diff )
--

Here we collect the high-level refactorings required for non-sql (or non-relational) backends.

The Django ORM work for supporting non-relational DB backends and possibly other Django-internal changes are in: http://bitbucket.org/wkornewald/django-nonrel-multidb/

Related tickets: #10355

Specific backend specifications: AppEngine, SimpleDB

Basics

Database query operations (get, count, ... entities) are provided to Django models via managers. Managers basically wrap around QuerySet methods. Internally Django's QuerySet class uses sql.Query whose default is sql.BaseQuery. sql.Query can be overridden to use a custom Query class (the backend). So we end up writting a Query class in order to provide a specific backend. So anything done in Django's Model methods or QuerySet methods which cannot be done for a specific backend should be moved into the Query in order to allow backends to specify what to do in such cases. Additionally the backends Query class should not be referenced by sql.Query (cause it can have nothing to do with sql)

Emulation of SQL features

If we emulate certain SQL features (e.g., a JOIN), it must be possible to detect when something gets emulated. This is important because you might mistakenly develop code that doesn't scale when your database grows beyond a certain size. In settings.py we could have a flag which specifies whether to throw warnings for emulated features (ENABLE_DB_EMULATION_WARNINGS?). By default, warnings should be enabled, so you never miss potential sources of problems.

sql.subqueries

The query classes defined in this module are used by QuerySet and Model. Their use should be handled by sql.Query, instead, so non-relational backends can override them.

Fields

AutoField

In some DB systems the primary key is a string. Currently, AutoField assumes that it's always an Integer.

Implementing an auto-increment field in SimpleDB would be extremely difficult. I would say impossible, actually. The eventual consistency model just doesn't support it. For the persistence layers I have written on top of SimpleDB, I use a UUID (type 4) as the ID of the object. --garnaat

Conclusion: Portable code should never assume that the "pk" field is a number. If an entity uses a string pk the application should continue to work.

Model

Multi-table inheritance

In the current implementation, multi-table inheritance requires reads and writes across multiple tables. On non-relational DBs this has to be solved differently. For example, with a ListField type you could store all models which you derived from (inspired by App Engine's PolyModel). E.g., if model B derives from model A you'd store model B in the table for model A and add B's name (app_b) to the ListField.

On App Engine this adds deeper composite indexes which is a problem when filtering against multiple ListFields combining that with inequality filters or results ordering (exploding indexes). Thus, this should only be used at the second inheritance level (seen from Model base class).

Problem: Model A doesn't know about model B, but since both of them live in the same table an A instance has to know about B's fields, so when A is saved it can preserve B's data (you can't modify only specific fields; you always replace the whole row). Either we always keep all data (which means you never free up data after schema changes unless you use a lower-level API) or we keep track of all derived models' fields and preserve those while removing all unused fields (e.g., A would know about B's fields and preserve them when saving). Probably the first solution is the safest.

TODO: How do we store field data that doesn't exist in any model definition?

There are some broken assumptions here. Multi-table inheritance is, by definition, just syntactic sugar for one-to-one relations to other tables. This proposal suggests effectively converting multi-table inheritance to single-table inheritance, which breaks the only reason it exists: (a) a single table to query for common parent information and (b) the ability to extend third-party models without modification, whilst still updating the third-party tables. But only on app-engine, leading to different behaviour there than on relational storage systems, meaning nobody can make reasonable assumptions about how their app would run if they are required to have it work in a storage-agnostic fashion. We should fail properly for unsupported features, not have behavioural breakage by appearing to run, but not work the same. If a third-party storage engine cannot support linked tables by any kind of fakery, then apps using multi-table inheritance just can't run on those storage engines. Django shouldn't be crippled to support the lack of relational support in such storage engines. It's called an O*R*M, after all. This isn't a race for the bottom, trying to find the lowest common denominator. Not everything can run on every storage engine (otherwise the only possible option is to only design Django to work with CSV files, since everything else has some features that storage system doesn't). -- Malcolm Tredinnick

What's the problem exactly? I don't see a lot of cases where this would break assumptions. At the Model level it's fully abstracted away, anyway. As long as you don't make explicit queries against the underlying tables, but just work with the models it will be 100% the same behavior. Also, the alternative is that the code doesn't work, at all, on App Engine, so it's better to have 50% of all Django apps that use multi-table inheritance work correctly than 0%. -- wkornewald

save_base()

The distinction between insert and update should be done by sql.Query because not all backends make that distinction, at all. The check whether the pk already exists (which is part of the distinction) should be moved out, too, because it would be unnecessarily inefficient on those backends.

delete()

model.delete() collects all related objects using _collect_sub_objects(). For non-sql backends this is not always possible, for example, when running in a transaction on App Engine (only entities in the same entity group can be fetched from the datastore). This means that we can't guarantee referential integrity and we can't efficiently emulate SQL in this case. So model_instance.delete() should only delete the model_instance by calling sql.Query.delete(). This gives backends the flexibility to override sql.Query.delete() and make use of _collet_sub_objects() if desired and possible.

Transactions

Not all backends support transactions, at all (e.g., SimpleDB). Some (e.g., App Engine) only support transactions similar to "SELECT ... FOR UPDATE" (which isn't exactly the same as @commit_on_success because it really locks items for read/write access). Not all backends support a BEGIN/END TRANSACTION operation, but only provide an interface for calling a function transactionally (like the @commit_on_success decorator).

Queries

TODO: Let's wait for App Engine to support its new bookmarks mechanism before implementing this (key-based pagination doesn't work in all situations). (Is there a link to info about app engine bookmarks? I only found one sentence about cursors on the official roadmap.)

On App Engine you can only retrieve the first 1000 query results. There needs to be support for "bookmarks" which mark the next starting point.

On SimpleDB you can directly retrieve the bookmark of the Nth item and run the query from there.

TODO(mitch?): Find out if this is efficient (even for millions of items) or if it's better to provide bookmarks at a higher level.

SimpleDB provides it's own paging mechanism, returning a token with your result set if there are more results available. You then issue the same query again, passing the token as a parameter. This is by far the most efficient way to handle it in simpledb and scales very well. In boto, I wrapped that whole thing in a generator so you can seamlessly iterate over the entire resultset without dealing with the additional requests so if we wanted to handle paging at a higher level we could use that approach.

count()

query.count() will be problematic since a scalable count() method doesn't exist in app engine (does it exist in SimpleDB?). Perhaps this will be alleviated by cursors/bookmarks, but I suspect we'll have to address this some other way. An automatic sharded counter in the manager would allow for an "almost accurate" count in most situations. It would certainly be good enough for the most popular use case, "how many pages of results do I have?"

SimpleDB supports count() directly. --garnaat

The sharded counter can only be used for counting a very specific query, so you'd either have to specify all possible queries upfront or manage the counter manually. --wkornewald

Note: See TracWiki for help on using the wiki.

Download in other formats:

Plain Text

Issues