This wiki page collects design principles and best practices that would be useful for an App Engine port of Django.
Porting Django to App Engine: What's needed/different?
The following might also apply to other cloud hosts which provide special database and communication interfaces.
Reminder: Datastore and request limitations
Requests can't take longer than 10 seconds, and you can't retrieve more than 1000 model instances at once from the datastore. It's also impossible to run more than 30 queries without hitting the 10-second request limit.
A single entity (actually: a whole entity group) can't handle more than 5 writes per second (writes = save or delete).
Unique properties can only be emulated via the key_name, but this means their values can't be changed afterwards, so we might have to fall back to non-guaranteed uniqueness. Since you can't issue queries from within a transaction, we can't even do a simple uniqueness check in all cases. We can probably only rely on checking at the ModelForm level, then. Alternatively, we have to document that you can't use transactions on models that have unique properties apart from the primary key (which can be emulated with the key_name and thus gives us a 100% uniqueness guarantee because it can be used in a transaction).
An entity may not have more than 5000 index entries.
All query filter rules are connected via the AND operator. The OR operator is not supported.
Transactions can only run on a single entity group and you can't run queries within a transaction.
Also not supported:
- JOINs (could be done manually for small datasets)
- sub-queries (ditto)
- DISTINCT queries (i.e., no queryset.dates(), etc.)
- referential integrity
Since tables are flexible and don't have schema definitions, running "manage.py syncdb" shouldn't be necessary.
Queries with inequality filters or sort orders need special index rules. Django features like the admin interface should therefore have a fall-back mode in which you can't sort query results, because the developer can hardly define all possible index rules. This is especially true if the searched property is a list property, in which case you need multiple index rules for all possible numbers of search terms.
Possibly, when App Engine gets full-text search support there could be a fall-back to (or preference for?) running complex queries on the full-text index.
Keys, key_name, key id, parents
Django should always assign a key_name to each newly created entity instead of letting App Engine choose a key id, so data can be exported from and imported into the datastore more easily and migrations to other providers become less problematic. A transaction can be used to ensure that no existing entity with the generated key_name gets overwritten.
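The check-then-create step described above could be sketched as follows. This is only an illustration under stated assumptions: the `store` dict stands in for the datastore, and `create_with_key_name`, `KeyNameTaken` and the "k" prefix are hypothetical names, not part of Django or App Engine.

```python
import uuid


class KeyNameTaken(Exception):
    """Raised when an entity with the requested key_name already exists."""


def create_with_key_name(store, entity, key_name=None):
    """Store `entity` under a key_name and return that key_name.

    `store` is a plain dict standing in for the datastore (assumption).
    """
    if key_name is None:
        # Hypothetical generated name: a fixed prefix plus a random hex
        # string, so the value does not depend on host-assigned ids.
        key_name = "k" + uuid.uuid4().hex
    # In a real backend this check and the write would have to run inside
    # one transaction, so no concurrent request can slip in between them.
    if key_name in store:
        raise KeyNameTaken(key_name)
    store[key_name] = entity
    return key_name
```

Because the key_name is chosen by the application, the same value can be reused when the data is exported and imported elsewhere.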
An interface to the underlying key and key id should be provided, too, but it shouldn't be recommended due to its problems.
The key_name and parent could be emulated with a CharField(primary_key=True) that internally prefixes the given string with a character. If you only need a key_name, it's sufficient to specify a string. If you also want to specify a parent, you could create a specially encoded string by passing the parent and (optionally) the desired key_name to a helper function and then passing the result to the CharField. This API would reduce the key_name and parent to a single pk property and stay compatible with existing Django code, which wouldn't work if we had separate pk and parent properties.
The pk property should return a URL-safe string that contains the key_name without the safety prefix (i.e., the value of the CharField(primary_key=True)) and the parent pk. This is more portable than using str(Key) because it doesn't contain the model and app name. Moreover, if you specified a pk manually, the URLs will be much nicer. Even if you don't specify a key_name, the URL is still shorter than the str(Key) version.
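One way to fold a parent pk and a key_name into a single pk string might look like the sketch below. The names `encode_pk` and `decode_pk`, the "!" separator and the percent-escaping scheme are assumptions made for illustration. They are not an existing Django or App Engine API.

```python
from urllib.parse import quote, unquote

SEPARATOR = "!"  # hypothetical separator between parent pk and key_name


def encode_pk(key_name, parent_pk=None):
    """Combine an optional parent pk and a key_name into one string."""
    # quote() with safe="" also escapes the separator, so "!" can only
    # appear between components, never inside one.
    part = quote(key_name, safe="")
    if parent_pk is None:
        return part
    return parent_pk + SEPARATOR + part


def decode_pk(pk):
    """Split a pk into (parent_pk or None, key_name)."""
    parent_pk, sep, part = pk.rpartition(SEPARATOR)
    return (parent_pk if sep else None), unquote(part)
```

Since the parent pk is itself an encoded pk, deeper ancestor paths nest naturally (for example "a!b!c"). A real implementation would need to check that the escaping survives whatever URL decoding the web server applies before the pk reaches Django.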
In order to optimize code it's useful to be able to get the pk value of a ForeignKey without dereferencing its entity.
Queries should support ancestor conditions.
Every model should provide these properties: key, key_name, key_id, parent, parent_key
Django could emulate transactions with the commit_on_success decorator. Manual transaction handling and checkpoints can't be implemented with App Engine's current API, though. We might ask Google for help. The problem with commit_on_success is that it should run only once, but App Engine's run_in_transaction runs up to three times if an error occurs. The worst that can happen is that someone uses a custom decorator which calls commit_on_success multiple times because this could quickly hit a request limit. Maybe Django should officially change commit_on_success to issue retries?
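If commit_on_success were changed to issue retries, the behaviour could resemble the following sketch. `TransactionConflict` and `commit_on_success_with_retries` are hypothetical names, and the actual begin/commit calls of a backend are left out. Only the retry logic is shown.

```python
import functools


class TransactionConflict(Exception):
    """Hypothetical stand-in for a datastore transaction collision error."""


def commit_on_success_with_retries(retries=3):
    """Run the wrapped function up to `retries` times, retrying on conflicts."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, retries + 1):
                try:
                    # A real backend would begin a transaction here and
                    # commit it once func has returned.
                    return func(*args, **kwargs)
                except TransactionConflict:
                    if attempt == retries:
                        raise  # give up after the last attempt
        return wrapper
    return decorator
```

The wrapped function has to be safe to run more than once, and nesting such decorators multiplies the number of attempts, which is the request-limit risk mentioned above.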
Datastore batch operations
Datastore writes are very expensive. App Engine provides batch operations for saving and deleting lots of model instances at once (no more than 500 entries, though). Django should provide such an API, too, so code can be optimized.
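A batch API would have to respect the 500-entry limit by splitting large lists into chunks. A minimal sketch, assuming a hypothetical `put_many` callable that stores one list of instances per datastore round trip:

```python
def chunked(items, size=500):
    """Yield successive lists of at most `size` items."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


def batch_save(instances, put_many, limit=500):
    """Save instances in as few batch calls as the limit allows.

    `put_many` is a hypothetical callable that stores a list of
    instances in one round trip. Returns the number of calls made.
    """
    calls = 0
    for chunk in chunked(instances, limit):
        put_many(chunk)
        calls += 1
    return calls
```

With 1200 instances this makes three calls (500, 500 and 200 entries) instead of 1200 individual writes.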
The API would be most flexible if it worked like a transaction handler where all save() calls within a function call are collected and then committed afterwards. The implementation wouldn't be trivial, though. It requires maintaining a cache of to-be-saved instances, so filter() calls can check the cache. Also, when a real transaction starts the cache must be flushed and disabled because in transactions we have to interact with the DB directly in order to lock an entity group. Instead of a decorator we could also provide a middleware, but this could lead to problems if, for instance, the view issues an http request (e.g., to start a task) and thus requires that the data has already been stored.
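The collect-then-commit idea could be sketched like this. `Entity`, its class-level `store` and `pending` attributes, and the `batch` context manager are hypothetical stand-ins. The cache lookup for filter() calls and the interaction with real transactions are deliberately omitted.

```python
import contextlib


class Entity:
    """Hypothetical stand-in for a model instance."""
    store = {}      # fake datastore: pk -> entity
    pending = None  # list of deferred instances while a batch is active

    def __init__(self, pk):
        self.pk = pk

    def save(self):
        if Entity.pending is not None:
            Entity.pending.append(self)   # defer the write
        else:
            Entity.store[self.pk] = self  # write immediately


@contextlib.contextmanager
def batch():
    """Collect save() calls and commit them together on successful exit."""
    Entity.pending = []
    try:
        yield Entity.pending
        # Reached only if the body raised no exception: one batch write
        # instead of one write per instance.
        Entity.store.update((e.pk, e) for e in Entity.pending)
    finally:
        Entity.pending = None
```

Code inside a `with batch():` block keeps calling save() as usual, so existing code would not need to change. The sketch does not handle nested batches.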
There are batch operations for getting lots of model instances by key. This could be emulated with
MyModel.objects.all().filter(pk__in=[key1, key2, ...])
Model relations and JOINs
Since JOINs don't work, Django should fall back to client-side JOIN emulation by issuing multiple queries. Of course, this only works with small datasets and it's inefficient, but that can be documented. It can still be a useful feature.
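Client-side JOIN emulation for a ForeignKey can be reduced to two steps: one query for the child rows and one batch get for the referenced entities. The sketch below uses plain dicts as rows, and `fetch_by_pks` is a hypothetical callable that performs the batch get.

```python
def join_on_foreign_key(children, fk_name, fetch_by_pks):
    """Pair each child row with the entity its foreign key refers to.

    `children` are dicts returned by a first query. `fetch_by_pks` is a
    hypothetical callable doing one batch get for a list of pks.
    """
    # Collect each distinct foreign key value only once.
    pks = sorted({child[fk_name] for child in children})
    parents = {parent["pk"]: parent for parent in fetch_by_pks(pks)}
    # A missing parent yields None (there is no referential integrity).
    return [(child, parents.get(child[fk_name])) for child in children]
```

This needs two datastore round trips regardless of the number of children, but everything is held in memory, so it is only suitable for small datasets.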
For efficiency it should be possible to retrieve only the pk values of a ForeignKey or ManyToManyField without loading the actual entities from the db.
Many-to-many relations could be emulated with a ListProperty(db.Key), so you can at least issue simple queries, but this can quickly hit the 5000 index entries limit. The alternative of having an intermediate table is useless if you have to issue queries on the data and due to the query limit you wouldn't be able to retrieve more than 1000 related entities, anyway (well, that could be worked around with key-based sorting, but then you have to create an index and you might hit CPU limits if you check a lot of data in one request).
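The key-based sorting workaround mentioned in parentheses could be sketched as follows. `fetch_page(after_key, limit)` is a hypothetical callable that returns up to `limit` (key, entity) pairs with a key greater than `after_key`, sorted by key.

```python
def iter_all(fetch_page, page_size=1000):
    """Yield every entity by repeatedly fetching pages ordered by key.

    `fetch_page(after_key, limit)` is a hypothetical callable returning
    at most `limit` (key, entity) pairs with key > after_key, sorted by
    key. `after_key` is None for the first page.
    """
    after_key = None
    while True:
        page = fetch_page(after_key, page_size)
        for key, entity in page:
            yield entity
        if len(page) < page_size:
            return  # a short page means nothing is left
        after_key = page[-1][0]  # continue after the last key seen
```

Each page is a separate query, so a long iteration can still run into the request time and CPU limits.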
The problem with many-to-many relations is that, for example, ModelForm saves the model instance and its many-to-many relations in separate steps. With ListProperty this would cause multiple write operations. Also, depending on where the many-to-many relation is defined, the changes could affect multiple models at once. One solution is to use batch operations as described above, but this means that all existing many-to-many code has to be changed to use batch operations. An alternative is to change ModelForm and all other many-to-many code to allow for setting the ListProperty before save() is called.
Since this should be transaction-safe the field would have to be defined on a specific model, so that only one entity is affected when adding multiple relations. This means that Django has to make it easy to add new fields to existing models (i.e., add a ManyToManyField to model B, but store the data in the target model A) and it must have knowledge of the storage location of the many-to-many relations since we might not have an intermediate table.
Special field types
The following field types have to be ported to Django:
Django should work from within a zip package. This means at least extending find_commands(), so manage.py commands can work (app-engine-patch already does this). The media files and templates could be exported from the zip file (like it's currently done in app-engine-patch) if that is more efficient.
Not all manage.py commands should be available on App Engine (e.g., the SQL-related commands). This could probably be detected at runtime based on the DB backend's capabilities. Some commands like "runserver" have to be replaced. This could possibly be done by adding an app to INSTALLED_APPS which redefines a few commands.
We also need an "official" deployment command to emulate "appcfg.py update" and similar commands for other cloud hosts.
In order to support email functionality it must be possible to provide email backends which handle the actual sending process. App Engine has a special Mail API.
The file upload handling code should never assume that it has access to the file system. Instead, it should assume that the file gets uploaded directly into the datastore or indirectly (e.g., via POST to S3 and then Django just gets notified when the upload is finished). This means that imports of file system functions should be deferred as much as possible.
Permissions and content types
Since we shouldn't depend on manage.py syncdb, the Permission and ContentType models should be replaced with dynamically generated fake model instances (which is also an optimization). Since we can retrieve the list of defined models at runtime we can easily generate those two models at runtime, too. Internally, they could be stored as a simple string (e.g., 'user.can_add') and converted into fake models when the field is accessed. This might require creating a FakeModelField for holding this kind of model.
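The string-to-fake-model conversion could be as small as the sketch below. `FakePermission`, `permission_from_string` and `permission_to_string` are hypothetical names, and the "app_label.codename" layout is an assumption based on the 'user.can_add' example.

```python
from collections import namedtuple

# Hypothetical lightweight replacement for a stored Permission row.
FakePermission = namedtuple("FakePermission", ["app_label", "codename"])


def permission_from_string(value):
    """Turn a stored string such as 'user.can_add' into a fake instance."""
    app_label, codename = value.split(".", 1)
    return FakePermission(app_label, codename)


def permission_to_string(perm):
    """Turn a fake instance back into the string that gets stored."""
    return "%s.%s" % (perm.app_label, perm.codename)
```

A FakeModelField could call the first function when the field is accessed and the second one before saving, so no Permission entities ever need to be created.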
As an alternative to JOIN emulation, denormalization could be provided via a ForeignKey that gets told which attributes of the referenced entity have to be copied. The query would then be formulated as if it crossed a relation, but internally the copied data would be used. Of course, with denormalization when an attribute changes Django must update all affected entities referencing that attribute.
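The copy-and-update cycle behind such a denormalizing ForeignKey could look like this sketch. Rows are plain dicts, and the `denormalize` and `propagate_change` helpers as well as the "fk__attr" naming of the copied values are assumptions for illustration.

```python
def denormalize(child, fk_name, parent, copy_attrs):
    """Store the reference and copy selected parent attributes onto child."""
    child[fk_name] = parent["pk"]
    for attr in copy_attrs:
        # Copied values live under a prefixed name, e.g. "author__name",
        # so a filter across the relation can use them directly.
        child["%s__%s" % (fk_name, attr)] = parent[attr]
    return child


def propagate_change(children, fk_name, parent, copy_attrs):
    """Refresh the copies on every child referencing `parent`.

    Returns the number of children that were updated.
    """
    updated = 0
    for child in children:
        if child.get(fk_name) == parent["pk"]:
            denormalize(child, fk_name, parent, copy_attrs)
            updated += 1
    return updated
```

A lookup such as author__name would then be answered from the copied "author__name" value without touching the referenced entity.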
Data integrity could require modifying more model instances than allowed in a single request. A background process (or cron job) could be used to automatically clean up huge amounts of data inconsistency. This would require creating a cleanup task (maybe as a model) which could at the same time be used to correct inconsistent data on-the-fly. The cache backend could optimize this process.