Opened 9 years ago

Closed 9 years ago

#24558 closed New feature (fixed)

django-admin.py dumpdata should be deterministic for VCS and diff friendliness

Reported by: Geoffrey Fairchild
Owned by: Simon Charette
Component: Core (Management commands)
Version: dev
Severity: Normal
Keywords:
Cc:
Triage Stage: Ready for checkin
Has patch: yes
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

I have several projects in which I like to store fixtures that are shared between project instances (e.g., development and production instances). Since fixtures are just text files, they're easily stored in version control; I use git for all my projects.

The problem I'm experiencing is that the dumpdata management command outputs data in a different order every time it runs (I'm on Django 1.7.7, so other versions of Django may behave differently). I understand why it does this: a lot of the data are stored in dicts or sets, and neither makes any promises about ordering. This is a problem because, for large fixtures, the VCS perceives significant changes where there are none. Git is pretty smart, but if a 10 MB fixture is completely reordered, a diff can't show me what has actually changed. Additionally, if I re-dump all the data in my database, git will detect that the files are different even when the data haven't changed, because all the data are in a different order.

The feature I'd like is for the dumpdata command to be deterministic; that is, every time it runs, it should produce the same output for the same input. This could even be an option that's turned off by default. This will reduce VCS thrashing and improve the ability for us to diff fixtures in order to understand what's actually changed.

I imagine this could fairly easily be solved by tossing a few sorted statements in the right places, but I'm not sure.
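As a rough illustration of the sorting idea, here is a minimal sketch of a post-processing step that re-serializes an existing fixture with sorted keys, so that two dumps of the same data compare equal. It is a workaround applied to dumpdata output, not a change to dumpdata itself; the helper name normalize_fixture and the sample records are hypothetical.

```python
import json


def normalize_fixture(text):
    """Re-serialize fixture JSON with sorted keys and fixed indentation.

    Hypothetical helper: it only normalizes mapping key order and layout,
    it does not reorder the objects in the fixture.
    """
    data = json.loads(text)
    return json.dumps(data, sort_keys=True, indent=2) + "\n"


# Two made-up dumps holding the same data with different key ordering.
dump_a = '[{"pk": 1, "model": "swap.term", "fields": {"term": "a", "definition": "b"}}]'
dump_b = '[{"model": "swap.term", "fields": {"definition": "b", "term": "a"}, "pk": 1}]'

normalized_a = normalize_fixture(dump_a)
normalized_b = normalize_fixture(dump_b)
```

After normalization the two strings are identical, so a VCS would see no change between them.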

Attachments (2)

terms1.json (5.4 KB ) - added by Geoffrey Fairchild 9 years ago.
terms dump 1
terms2.json (5.4 KB ) - added by Geoffrey Fairchild 9 years ago.
terms dump 2

Change History (9)

comment:1 by Claude Paroz, 9 years ago

Agreed that deterministic ordering is a desirable goal. Now, can you tell us what sort of ordering changes you are seeing in your data? It should not be the objects themselves, as they are sorted by their primary key. Is it an issue with serializers.sort_dependencies, which determines the model ordering?

by Geoffrey Fairchild, 9 years ago

Attachment: terms1.json added

terms dump 1

by Geoffrey Fairchild, 9 years ago

Attachment: terms2.json added

terms dump 2

comment:2 by Geoffrey Fairchild, 9 years ago

Sure, so here's an example model I have in my swap app:

from django.db import models

class Term(models.Model):
    term = models.CharField(max_length=255)
    definition = models.TextField()
    reference = models.TextField()

I've attached the output from two consecutive dumps (./manage.py dumpdata swap.Term).

You can see that both files contain the same data. And as you say, the objects are indeed sorted by their primary key. The ordering issue is with the key-value pairs printed for each object. Each object is just a dictionary of keys and values, so no ordering is maintained between runs.
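One way to get a stable key order is to build each object's mapping in a fixed order using an ordered mapping type. The following is a sketch under that assumption, written independently of Django's serializers; FIELD_ORDER and serialize_term are hypothetical names, and the field list simply mirrors the Term model above.

```python
import json
from collections import OrderedDict

# Assumed field order, mirroring the declaration order of the Term model.
FIELD_ORDER = ("term", "definition", "reference")


def serialize_term(pk, values):
    """Build one fixture entry with a fixed key order (illustrative only)."""
    fields = OrderedDict((name, values[name]) for name in FIELD_ORDER)
    return OrderedDict([("model", "swap.term"), ("pk", pk), ("fields", fields)])


# The input order of the row does not affect the output order.
row = {"reference": "r", "term": "t", "definition": "d"}
output = json.dumps([serialize_term(1, row)])
```

Because json.dumps writes keys in the mapping's iteration order, every run emits "model", "pk", and "fields" in the same sequence. On Python 3.7 and later a plain dict also preserves insertion order, but OrderedDict makes the intent explicit and works on older interpreters.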

comment:3 by Simon Charette, 9 years ago

Owner: changed from nobody to Simon Charette
Status: new → assigned
Triage Stage: Unreviewed → Accepted
Version: 1.7 → master

comment:4 by Simon Charette, 9 years ago

Has patch: set
Needs tests: set

comment:5 by Simon Charette, 9 years ago

Needs tests: unset

comment:6 by Claude Paroz, 9 years ago

Triage Stage: Accepted → Ready for checkin

comment:7 by Simon Charette <charette.s@…>, 9 years ago

Resolution: fixed
Status: assigned → closed

In 5bc31234:

Fixed #24558 -- Made dumpdata mapping ordering deterministic.

Thanks to gfairchild for the report and Claude for the review.
