Opened 16 years ago

Closed 14 years ago

Last modified 8 years ago

#7052 closed New feature (fixed)

Add support for natural key serialization

Reported by: Jay Hargis Owned by: nobody
Component: Core (Serialization) Version: dev
Severity: Normal Keywords: auth_permission auth_content fixture import
Cc: daevaorn@…, bsn.dev@…, greg@…, Manuel Saelices, russ.amos@…, winhamwr@…, Carl Meyer, Michael Manfre, sanfordarmstrong@…, juan@…, simon@…, kmike84@…, chris@…, jashugan@…, robmadole@… Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

If you perform a fixture dump on auth, so as to have users in your test database to perform unit testing. When you try to run the test server, telling it to load the auth fixture, it *can fail. I turned the logging in PostgreSQL to log every query, and found that it was a foreign key violation that was causing the failure. Further investigation seemed to indicate that since I had altered data in auth_permission and django_content_type by hand, resulting in the sequence of primary keys becoming non sequential, this was causing the problem.

When these tables are rebuilt during test db creation, the id column of the rows do not end up with the same id's as they did in the database which created the fixture. I don't know if this is related, but as far as I could tell it was simply a matter of the id's in these two tables not being 100% sequential. I ended up dumping all the data out, then creating the database and repopulating it in a linear fashion (with respect specifically to those two tables). This fixed the problem and the test server happily created and populated those tables.

If you never alter these tables by hand, the id's should always remain sequential and this problem will not occur.

Attachments (7)

lookup-dictionaries-for-fixtures.patch (34.8 KB ) - added by juan@… 15 years ago.
lookup-dictionaries-for-fixtures-django-1.1.diff (35.4 KB ) - added by robmadole@… 15 years ago.
Same patch as juan@…, just fixed for 1.1 release
t7052-surrogate-key.diff (26.9 KB ) - added by Russell Keith-Magee 14 years ago.
First draft of a surrogate key based solution
t7052-surrogate-key.2.diff (42.9 KB ) - added by Russell Keith-Magee 14 years ago.
Second draft of surrogate-key based serialization.
t7052-surrogate-key.3.diff (44.4 KB ) - added by Russell Keith-Magee 14 years ago.
Third draft - slight tweak to the algorithm resolving dependencies.
t7052-rc1.diff (55.9 KB ) - added by Russell Keith-Magee 14 years ago.
RC1 of natural key lookup patch
t7052-rc2.diff (60.5 KB ) - added by Russell Keith-Magee 14 years ago.
RC2 of natural key lookup patch - adds flag to enable natural key use

Download all attachments as: .zip

Change History (49)

comment:1 by Russell Keith-Magee, 16 years ago

Triage Stage: UnreviewedAccepted

This is a problem I have known about for a while, although I can't recall ever seeing a ticket describing it. Strictly, it's not a bug - the serializers work fine, it's an issue with the consistency between your fixture and the content types that are auto-created in your database. However, it's a pretty common problem and an easy mistake to make, so it deserves a solution.

The best proposal I have thought of is to modify the syntax for foreign key references in serialized data to include a query capability; i.e., instead of just having a JSON fixture containing the literal foreign key content_type: 3, we would allow content_type: { class:'content_type', name:'blah' }, where the inner braces would be interpreted as a query and would resolve to the appropriate foriegn key at time of import. Similar syntax would be required for other serializers. Any other suggestions are also welcome.

comment:2 by Russell Keith-Magee, 16 years ago

Component: Unit test systemSerialization

This is a larger serialization issue, not just unit tests.

comment:3 by Jay Hargis, 16 years ago

This bug is still around, and bites me often but I have figured out that I can get it to work properly as long as I start any changes to the schema by importing a clean working fixture, performing the changes, exporting the fixture (I use XML), then import it to ensure that it will work the next time around.

comment:4 by Jay Hargis, 16 years ago

Tomorrow I am removing my custom permissions http://www.djangoproject.com/documentation/authentication/#custom-permissions

I can't reliably import fixtures and dealing with this bug has grown tiresome. I am guessing that very few people are using custom permissions or this bug would not have existed this long.

in reply to:  1 ; comment:5 by anonymous, 16 years ago

Replying to russellm:

...it's an issue with the consistency between your fixture and the content types that are auto-created in your database. However, it's a pretty common problem and an easy mistake to make, so it deserves a solution.

Actually, Django is creating the fixtures so any inconsistency would be the fault of the serializers. I have since learned that having a custom permission is the sole cause of the problem. My comment about 'if you never edit the data by hand' is NOT actually true.

in reply to:  4 ; comment:6 by Russell Keith-Magee, 16 years ago

Replying to jb0t:

I can't reliably import fixtures and dealing with this bug has grown tiresome. I am guessing that very few people are using custom permissions or this bug would not have existed this long.

Have you considered working on a fix yourself? This is an open source project - no problem gets fixed unless someone volunteers to fix it. Given enough time, I will probably get around to fixing this problem myself, but at the moment I'm a little preoccupied with fixing other bugs and getting v1.0 out the door. However if a suitable fix were to materialize, I would commit it very quickly.

This ticket contains a description of the problem, and a description of a fix that would be accepted. All that is left is for someone to do the work.

If you can't do the coding yourself, but you _REALLY_ need a fix, how about paying someone to do the work for you?

The only course of action that is _guaranteed_ to have no effect is complaining.

in reply to:  6 comment:7 by anonymous, 16 years ago

Replying to russellm:

Have you considered working on a fix yourself? This is an open source project - no problem gets fixed unless someone volunteers to fix it. Given enough time, I will probably get around to fixing this problem myself, but at the moment I'm a little preoccupied with fixing other bugs and getting v1.0 out the door. However if a suitable fix were to materialize, I would commit it very quickly.

This ticket contains a description of the problem, and a description of a fix that would be accepted. All that is left is for someone to do the work.

If you can't do the coding yourself, but you _REALLY_ need a fix, how about paying someone to do the work for you?

The only course of action that is _guaranteed_ to have no effect is complaining.


I would love to fix this problem myself, but it is over my head and important enough that someone who really knows more than I do be the one to correct it.

It seems that after numerous exports and imports, all starting from an original export created by Django, I have found that even when it succeeds the import, it (potentially) does not properly match content type id to permission.

comment:8 by Jay Hargis, 16 years ago

I managed to figure out a suitable way to deal with this bug until it gets fixed (wish i could fix it, but its just beyond me).

I perform a datadump with a usable indent size

python manage.py dumpdata auth app1 app2 --indent=4 --format=xml > initial_data.xml

then manually edit the xml file and remove all serialized permissions objects. this forces them to be recreated when the content types data is created, but keeps all user objects, which ultimately much of the rest of the data is related to.

comment:9 by Erik Allik, 16 years ago

Cc: eallik@… added

comment:10 by Shai Berger, 16 years ago

I think I have a usable workaround. The idea is simple: Allow each class to set its own content_type id.

To do this, we connect a function to the pre_save signal of ContentType; this allows us to modify ContentType instances before they are saved. We make the function take an id for the content_type from the model class.

At the top of the models.py file, after imports, add these lines:

# Make content_type ids consistent
from django.db.models.signals import pre_save
def set_content_type_id(sender, **kwargs):
    content_type = kwargs.get('instance')
    if content_type is None:
        raise Exception, "pre-save signal API changed -- fix me"
    cls = content_type.model_class()
    if getattr(cls,'content_type_id',False):
        content_type.pk = cls.content_type_id
pre_save.connect(set_content_type_id, sender=ContentType)

Now, in each model class, you can add an attribute like so

class MyModel(models.Model):
    ...
    content_type_id = 2001
    ...

class MyOtherModel(models.Model):
    ...
    content_type_id = 2002
    ....

Classes which do not set the content_type_id will have one assigned to them automatically (in sequence).

This is still not perfect -- it places the burden of allocating unique ids on the programmers, rather than the system, but it is still way better than having to manually edit dump files every time.

A note about picking ids: In most cases, you do not need to set ids for all models, and you have less than, say, 200 models, so ids over 1000 should be safe (you can use ids up to a billion or two, so there's really no problem there). Just be careful if your ids sequence is not reset when you recreate the database -- then your ids will just grow and grow. In such a case, I would allocate low numbers and define the sequence to start higher.

Another note: This workaround could lead to a real solution; if the ids, instead of being set by the user, were being set from some hashing of the model name. But there are several problems in this direction: One is finding a hash function that's platform-independent, and a much harder one is finding a conflict-resolution method that guarantees no change when models are added. These are hard problems, and for me the benefit doesn't seem to be worth the effort.

A final note: My first instinct was to add the content_type_id to a Meta nested class, rather than the class itself. Alas, Meta does not allow addition of arbitrary attributes. Perhaps the content_type framework needs a helper-class system like the admin framework's, but for just one attribute, it (again) isn't worth it.

Hope this helps,

Shai.

in reply to:  10 ; comment:11 by Russell Keith-Magee, 16 years ago

Replying to shai:

I think I have a usable workaround. The idea is simple: Allow each class to set its own content_type id.

Ok - so what content type should django-tagging use for a tag? How does the author of django-tagging prevent a clash with the content type for a blog entry in coltrane?

You have to remember that if you allow/force manual specification of content types, you have just created a global namespace where clashes are neither desirable, nor particularly resolvable (since there is no way to fix clashes without modifying the code for the app). We don't want to have to resort to the IANA to fix this sort of problem.

in reply to:  11 ; comment:12 by Shai Berger, 16 years ago

Replying to russellm:

Replying to shai:

I think I have a usable workaround. The idea is simple: Allow each class to set its own content_type id.

Ok - so what content type should django-tagging use for a tag? How does the author of django-tagging prevent a clash with the content type for a blog entry in coltrane?

I said "workaround", not "solution". You are right; this is not particularly useful for reusable, "component" applications (I can think of ways -- like using a "base value" for the app, configured at the server level, and setting the model content_type ids as offsets from this base value -- but this is still not a good way to do things).

However, for "system" applications -- applications that are not intended to be reused as components -- this is workable. It will make jb0t's life a lot easier.

you have just created a global namespace

Em, no. content_types did that. And while it did take care to avoid clashes, it didn't make the namespace properly (reproducibly) serializable. I'm just taking the other side of the trade-off -- clash avoidance responsibility to the user, in return for reproducible serializability.

But yes, I agree; this is still quite far from a proper, complete solution. It is a partial workaround only.

(It does hint to the real solution: Make the model name -- probably fully prefixed -- the pk of content_type. Yes, a string pk. Blasphemy. I know).

in reply to:  12 ; comment:13 by Russell Keith-Magee, 16 years ago

Replying to shai:

I said "workaround", not "solution". You are right; this is not particularly useful for reusable, "component" applications (I can think of ways -- like using a "base value" for the app, configured at the server level, and setting the model content_type ids as offsets from this base value -- but this is still not a good way to do things).

I have an idea how the solution will work, and I've articulated it before. All I need is the time to implement it. Implementing a problematic workaround will only serve to consume time that could be spent fixing the problem.

you have just created a global namespace

Em, no. content_types did that. And while it did take care to avoid clashes, it didn't make the namespace properly (reproducibly) serializable. I'm just taking the other side of the trade-off -- clash avoidance responsibility to the user, in return for reproducible serializability.

content_types namespaces are currently global in the per-app sense. This proposal makes them global in the 'whole globe' sense. Per-app, we can programatically avoiding collisions. We can't do that over the entire globe without the IANA or something similar. String PKs are a non-starter, if only because it means a massive backwards change.

in reply to:  13 comment:14 by Shai Berger, 16 years ago

Replying to russellm:

I have an idea how the solution will work, and I've articulated it before. All I need is the time to implement it. Implementing a problematic workaround will only serve to consume time that could be spent fixing the problem.

I must be missing something; with this workaround, I haven't asked you, or anyone else, to implement anything -- it's just a way for a substantial part of the developers to solve their day-to-day problems, until a real fix is done.


you have just created a global namespace

Em, no. content_types did that. And while it did take care to avoid clashes, it didn't make the namespace properly (reproducibly) serializable. I'm just taking the other side of the trade-off -- clash avoidance responsibility to the user, in return for reproducible serializability.

content_types namespaces are currently global in the per-app sense. This proposal makes them global in the 'whole globe' sense.

content_types ids are currently global in the global sense. That is what I meant.

Per-app, we can programatically avoiding collisions. We can't do that over the entire globe without the IANA or something similar.

Yes, I agreed that people who build their apps to be used as components shouldn't be doing this. You're rejecting my suggestion on the grounds that it doesn't solve the problem for all django apps, ignoring the fact that it is a good workaround for many. My own application, which triggered my need for this, is not going to be part of any system but my own; this is the situation with a substantial part of the django apps -- where IANA etc. are simply not an issue.

Again: I'm not asking you to do anything about this workaround except, well, stop opposing it, or explain why it's bad *for a single-system app*.

comment:15 by Alexander Koshelev, 15 years ago

Cc: daevaorn@… added

comment:16 by Eugene Pyvovarov, 15 years ago

Cc: bsn.dev@… added

comment:17 by Greg Turner, 15 years ago

Cc: greg@… added

Russellm, perhaps as part of your solution, it would be useful to make the PK of content_type be a hash, or literal, of "app_label.model_name", rather than an auto_increment. That would reduce the reliance that fixtures have on database-generated keys (not just in tests, but when they're used in new installations).

in reply to:  17 comment:18 by Russell Keith-Magee, 15 years ago

Replying to cogat:

Russellm, perhaps as part of your solution, it would be useful to make the PK of content_type be a hash, or literal, of "app_label.model_name", rather than an auto_increment. That would reduce the reliance that fixtures have on database-generated keys (not just in tests, but when they're used in new installations).

This is an interesting approach, but it has two problems. Firstly, it's backwards incompatible for anyone with an existing content_types deployment (which is essentially the entire Django userbase). Secondly, while this would work for content_types, it doesn't fix the general problem - referencing data that was created as part of a management hook, rather than a fixture. Even inside the Django codebase, there are examples of dynamically created database content (for example, auth permissions) that suffer from the same general problem. While changing the primary key to be a hash may be a plausible solution for content types, it's not really a good general solution.

comment:19 by Alex Gaynor, 15 years ago

TBH it sounds like a composite key makes most sense for both those examples. Those obviously aren't possible now, but for 1.2 or whenever it might be the best solution.

in reply to:  5 ; comment:20 by vw, 15 years ago

Replying to anonymous:

Replying to russellm:

...it's an issue with the consistency between your fixture and the content types that are auto-created in your database. However, it's a pretty common problem and an easy mistake to make, so it deserves a solution.

Actually, Django is creating the fixtures so any inconsistency would be the fault of the serializers. I have since learned that having a custom permission is the sole cause of the problem. My comment about 'if you never edit the data by hand' is NOT actually true.

Django is creating the fixtures, but I think the inconsistency is still a result of content types being automatically created when unit tests start. There is no way of guaranteeing that automatically created ids for those content types will match the content types in the source database. If those content types were also loaded as fixtures instead of automatically created, they would match. Is there any way to do this?

in reply to:  20 comment:21 by Russell Keith-Magee, 15 years ago

Replying to vw:

Django is creating the fixtures, but I think the inconsistency is still a result of content types being automatically created when unit tests start. There is no way of guaranteeing that automatically created ids for those content types will match the content types in the source database. If those content types were also loaded as fixtures instead of automatically created, they would match. Is there any way to do this?

Yes - serialize your content types as part of the fixture. However, this is not a particularly robust solution, because as soon as you add a new model, a new content type is added to the mix, and you can end up with weird duplicate key exceptions because the newest model isn't guaranteed to be the last content type that is created. Hence the only real solution here is to introduce dynamic lookup behavior into the fixture.

comment:22 by Manuel Saelices, 15 years ago

Cc: Manuel Saelices added

comment:23 by russ, 15 years ago

Cc: russ.amos@… added

comment:24 by Adam Nelson, 15 years ago

This could be solved in a series of steps:

  1. Give good error messages so people don't have to log queries in order to even find out what's happening.
  2. A doc change to lead people here so they know what's going on.
  3. Create a flag allowing these errors to be thrown and the runscript to continue.
  4. Fix the issue robustly as per comment 21

I would love to get involved but I'm not sure where to start on this one.

in reply to:  24 comment:25 by Russell Keith-Magee, 15 years ago

Replying to adamnelson:

This could be solved in a series of steps:

  1. Give good error messages so people don't have to log queries in order to even find out what's happening.

As I mentioned to you on the mailng list, you _should_ be getting error messages. If you aren't, this is itself a bug, and we would appreciate any help you can provide in working out why.

  1. A doc change to lead people here so they know what's going on.

We don't document bugs and other behaviour that is known to be faulty. This is why we have trac. If a bug exists, we aim to fix it. The documentation is our description of how things should work; Trac is our list of ways that we don't meet that target.

  1. Create a flag allowing these errors to be thrown and the runscript to continue.

The runscript _can't_ continue in this case. You have an inconsistent database, and it would be irresponsible to continue.

  1. Fix the issue robustly as per comment 21

I would love to get involved but I'm not sure where to start on this one.

I suspect I will be looking at this in the early phases of v1.2 development.

comment:26 by Adam Nelson, 15 years ago

Here's the ticket regarding the lack of error handling in django_extensions runscript:

http://code.google.com/p/django-command-extensions/issues/detail?id=93

Note that the traceback in comment 2 is only visible when manually running the Python script line by line - not when using the traceback tag or when running runscript generally.

As for continuing after data errors, I think it's reasonable to at least have an option to do that. INSERT IGNORE or REPLACE INTO would be the relevant commands in MySQL. Postgres has no equivalent except DELETE and then INSERT. A duplicate key error really isn't as bad as a table does not exist error or something like that. Even so, having the option to continue a script through even significant errors would be valuable for development based on real data sets, as long as the errors are reported. Keep in mind these aren't syntax errors or fatal errors, but data errors that could be handled with other methods - if there were an option to get through this part.

comment:27 by wwinham, 15 years ago

Cc: winhamwr@… added

comment:28 by Carl Meyer, 15 years ago

Cc: Carl Meyer added

comment:29 by Michael Manfre, 15 years ago

Cc: Michael Manfre added

comment:30 by sanfordarmstrong@…, 15 years ago

Cc: sanfordarmstrong@… added

in reply to:  1 comment:31 by juan@…, 15 years ago

Has patch: set

Replying to russellm:

This is a problem I have known about for a while, although I can't recall ever seeing a ticket describing it. Strictly, it's not a bug - the serializers work fine, it's an issue with the consistency between your fixture and the content types that are auto-created in your database. However, it's a pretty common problem and an easy mistake to make, so it deserves a solution.

The best proposal I have thought of is to modify the syntax for foreign key references in serialized data to include a query capability; i.e., instead of just having a JSON fixture containing the literal foreign key content_type: 3, we would allow content_type: { class:'content_type', name:'blah' }, where the inner braces would be interpreted as a query and would resolve to the appropriate foriegn key at time of import. Similar syntax would be required for other serializers. Any other suggestions are also welcome.

We've implemented your suggestion of dictionaries as ForeignKey (and ManyToManyField) with a small change.

Since the field knows which class it is, we don't put that into the lookup dictionary. All we do is specify which fields need to be dumped to resolve to a unique object in the database.

See the attached patch against Django 1.0, which includes tests and documentation.

comment:32 by anonymous, 15 years ago

Cc: juan@… simon@… added

comment:33 by Russell Keith-Magee, 15 years ago

Awesome, with great shiny bells on it! I'll do a detailed teardown once v1.1 is finalized, but from initial inspection, this looks like an excellent patch. Thanks guys!

As a minor administrative note - a patch of this size and complexity will definitely earn you a credit in the AUTHORS file... if I only knew your names :-) If you need to upload an updated patch, feel free to add your names to AUTHORS.

comment:34 by Mikhail Korobov, 15 years ago

Cc: kmike84@… added

comment:35 by anonymous, 15 years ago

Cc: chris@… added

comment:36 by anonymous, 15 years ago

Cc: jashugan@… added

by robmadole@…, 15 years ago

Same patch as juan@…, just fixed for 1.1 release

comment:37 by robmadole@…, 15 years ago

I had a lot of trouble getting this to patch cleanly (tried several releases 0.9, 1.0, and 1.1). I've included an updated patch (in svn diff format) that fixes the differences. This is against http://code.djangoproject.com/svn/django/tags/releases/1.1/.

comment:38 by robmadole, 15 years ago

Cc: robmadole@… added

by Russell Keith-Magee, 14 years ago

Attachment: t7052-surrogate-key.diff added

First draft of a surrogate key based solution

by Russell Keith-Magee, 14 years ago

Attachment: t7052-surrogate-key.2.diff added

Second draft of surrogate-key based serialization.

by Russell Keith-Magee, 14 years ago

Attachment: t7052-surrogate-key.3.diff added

Third draft - slight tweak to the algorithm resolving dependencies.

by Russell Keith-Magee, 14 years ago

Attachment: t7052-rc1.diff added

RC1 of natural key lookup patch

by Russell Keith-Magee, 14 years ago

Attachment: t7052-rc2.diff added

RC2 of natural key lookup patch - adds flag to enable natural key use

comment:39 by Russell Keith-Magee, 14 years ago

RC2 adds a --natural flag to dumpdata and a use_natural_keys argument to serialize() in order to avoid potential backwards incompatibilities. Thanks to mattimustang for the report.

comment:40 by Russell Keith-Magee, 14 years ago

Resolution: fixed
Status: newclosed

(In [11863]) Fixed #7052 -- Added support for natural keys in serialization.

comment:41 by Tim Graham, 8 years ago

Easy pickings: unset
Severity: Normal
Summary: auth fixture fails to import when running test serverAdd support for natural key serialization
Type: New feature
UI/UX: unset

comment:42 by Erik Allik, 8 years ago

Cc: eallik@… removed
Note: See TracTickets for help on using tickets.
Back to Top