Code


Version 23 (modified by wkornewald, 5 years ago) (diff)

--

This wiki page collects design principles and best-practice which would be useful for an App Engine port of Django.

The DB backend itself and other App Engine code will be kept as a separate project/repository: http://bitbucket.org/gumptioncom/django-db-app-engine/

Related tickets: #10192, #10355

Parent wiki page: NonSqlBackends

Porting Django to App Engine: What's needed/different?

In order to understand this proposal you must have read the App Engine documentation. In order to simplify the port it might be possible to reuse a few parts from app-engine-patch.

The following might also apply to other cloud hosts which provide special database and communication interfaces.

Summary

At the Django level we need support for:

  • setting an owner model for ManyToManyFields (i.e., there is no intermediary table and the field's data could be stored on either the model defining the ManyToManyField or on the other model) and ModelForm should handle that case efficiently (i.e., try to save the relations together with the model instead of afterwards in save_m2m)
  • ListField (stores a list of values of a certain type; DB backends could use this internally for ManyToManyFields without an intermediary table) and BinaryField
  • batch save() and delete()
  • email backends
  • running Django from a zip package
  • Permission and ContentType should be fake models which are stored as a string in a CharField

Emulation of SQL features

If we emulate certain SQL features, it must be possible to detect when something gets emulated. This is important because you might mistakenly develop code that doesn't scale when your database grows beyond a certain size. In settings.py we could have a flag which specifies whether to throw warnings for emulated features (ENABLE_DB_EMULATION_WARNINGS?). By default, warnings should be enabled, so you never miss potential sources of problems.

Alternatively, one could have to activate emulation by calling a special function (Model.objects.with_emulation().filter(...)). That's more explicit and less error-prone.

Schemas

Since tables are flexible and don't have schema definitions running "manage.py syncdb" shouldn't be necessary. It can still be supported by emulating a remote DB connection.

Indexes

Queries with inequality filters or sort orders need special index rules, so Django features like the admin interface should have a fall-back mode in which you can't sort query results because the developer can hardly define all possible index rules, especially if the searched property is a list property (in which case you need multiple index rules for all possible numbers of search terms).

Possibly, when App Engine gets full-text search support there could be a fall-back to (or preference for?) running complex queries on the full-text index.

Keys, key_name, key id, parents

In order to stay compatible with normal Django code the id should be emulated with an AutoField and the key_name with a CharField(primary_key=True). Since key_names may not start with a number the actual string is transparently prefixed with a string that could be specified in Meta.gae_key_name_prefix, for example. The user won't notice this encoding when accessing the pk (or the CharField). By default, this prefix is just 'k'.

With this setup most applications should work without any modifications and automatically use key_name in a natural way.

Some applications might mix the use of ids and key_names within the same model. For these cases you can set the prefix to an empty string. This allows for working directly with raw ids and key_names (for simplicity, the pk will always be a string, even if it's an id).

Parents are a special App Engine feature which can't be mapped transparently to Django features, so they could be simulated in two different ways:

  • With a separate gae_parent field that is automatically added to every model or (if this isn't possible: that must be added manually). The pk itself only contains the id or key_name (without the prefix) and thus stays clean and doesn't have to be encoded in a special format. A special function allows for constructing parent paths: make_parent_path(ModelA, 23, ModelB, 'kname'). This function returns a human-readable string (e.g.: "ModelA|23|ModelB|kname") representing the path
  • With a special !GAEKeyField(primary_key=True) that holds an App Engine db.Key object or its str() value.

Portable code should never assume that the "pk" field is a number. If an entity uses a string pk (key_name) the application should continue to work.

TODO: Queries should somehow support ancestor conditions.

Every model should provide properties for: key, key_name, key_id, parent, parent_key (all prefixed with "gae_" to reduce name conflicts)

Transactions

Django could emulate transactions with the commit_on_success decorator. Manual transaction handling and checkpoints can't be implemented with App Engine's current API, though. We might ask Google for help. The problem with commit_on_success is that it should run only once, but App Engine requires that it runs multiple times if an error occurs. The worst that can happen is that someone uses a custom decorator which calls commit_on_success multiple times because this could quickly hit a request limit. Maybe Django should officially change commit_on_success to issue retries?

Datastore batch operations

Datastore writes are very expensive. App Engine provides batch operations for saving and deleting lots of model instances at once (no more than 500 entries, though). Django should provide such an API, too, so code can be optimized. Note that while Django does support a few batch operations they work at the DB level, so the model instances' save() and delete() methods are never called.

The API would be most flexible if it worked like a transaction handler where all save() calls within a function call are collected and then committed afterwards. The implementation wouldn't be trivial, though. It requires maintaining a cache of to-be-saved instances, so filter() calls can check the cache. Also, when a real transaction starts the cache must be flushed and disabled because in transactions we have to interact with the DB directly in order to lock an entity group. Instead of a decorator we could also provide a middleware, but this could lead to problems if, for instance, the view issues an http request (e.g., to start a task) and thus requires that the data has already been stored.

Certain batch operations can already be emulated with the existing API:

Getting lots of model instances by key:

Blog.objects.all().filter(pk__in=[key1, key2, ...])

Deleting a set of rows by key:

Blog.objects.all().filter(pk__in=[key1, key2, ...]).delete()

Changing existing rows:

Blog.objects.all().filter(pk__in=[key1, key2, ...]).update(title='My Blog')

Model relations and JOINs

Since JOINs don't work, Django should fall back to client-side JOIN emulation by issuing multiple queries. Of course, this only works with small datasets and it's inefficient, but that can be documented. It can still be a useful feature.

Many-to-many relations could be emulated with a ListProperty(db.Key), so you can at least issue simple queries, but this can quickly hit the 5000 index entries limit. The alternative of having an intermediate table is useless if you have to issue queries on the data and due to the query limit you wouldn't be able to retrieve more than 1000 related entities, anyway (well, that could be worked around with key-based sorting, but then you have to create an index and you might hit CPU limits if you check a lot of data in one request).

The problem with many-to-many relations is that, for example, ModelForm saves the model instance and and its many-to-many relations in separate steps. With ListProperty this would cause multiple write operations. Also, depending on where the many-to-many relation is defined the changes could affect multiple models at once. One solution is to use batch operations as described above, but this means that all existing many-to-many code has to be changed to use batch operations. An alternative is to change ModelForm and all other many-to-many code to allow for setting the ListProperty before save() is called.

Since this should be transaction-safe the field would have to be defined on a specific model, so that only one entity is affected when adding multiple relations. This means that Django has to make it easy to add new fields to existing models (i.e., add a ManyToManyField to model B, but store the data in the target model A) and it must have knowledge of the storage location of the many-to-many relations since we might not have an intermediate table.

Special field types

The following field types have to be ported to Django:

Zipimport

Django should work from within a zip package. This means at least extending find_commands(), so manage.py commands can work (app-engine-patch already does this). The media files and templates could be exported from the zip file (like it's currently done in app-engine-patch) if that is more efficient.

manage.py commands

Not all manage.py commands should be available on App Engine (e.g., the SQL-related commands). This could probably be detected at runtime based on the DB backend's capabilities. Some commands like "runserver" have to be replaced. This could possibly be done by adding an app to INSTALLED_APPS which redefines a few commands.

We also need an "official" deployment command to emulate "appcfg.py update" and similar commands for other cloud hosts.

Email support

In order to support email functionality it must be possible to provide email backends which handle the actual sending process. App Engine has a special Mail API.

File uploads

The file upload handling code should never assume that it has access to the file system. Instead, it should be supported that the file gets uploaded indirectly into the datastore (e.g., via POST to S3 and then Django just gets notified when the upload is finished). This means that imports of file system functions should be deferred as much as possible.

Permissions and content types

Since we shouldn't depend on manage.py syncdb, the Permission and ContentType models should be replaced with dynamically generated fake model instances (which is also an optimization). Since we can retrieve the list of defined models at runtime we can easily generate those two models at runtime, too. Internally, they could be stored as a simple string (e.g., 'user.can_add') and converted into fake models when the field is accessed. This might require creating a FakeModelField for holding this kind of model.

Future: denormalization

As an alternative to JOIN emulation, denormalization could be provided via a ForeignKey that gets told which attributes of the referenced entity have to be copied. The query would then be formulated as if it crossed a relation, but internally the copied data would be used. Of course, with denormalization when an attribute changes Django must update all affected entities referencing that attribute.

Data integrity could require modifying more model instances than allowed in a single request. A background process (or cron job) could be used to automatically clean up huge amounts of data inconsistency. This would require creating a cleanup task (maybe as a model) which could at the same time be used to correct inconsistent data on-the-fly. The cache backend could optimize this process.