Opened 4 years ago

Closed 4 years ago

Last modified 4 years ago

#31869 closed New feature (wontfix)

Improving data migration using `dumpdata` and `loaddata`

Reported by: Matthijs Kooijman Owned by: nobody
Component: Core (Management commands) Version: 3.1
Severity: Normal Keywords:
Cc: matthijs@… Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

At first glance, using manage.py dumpdata and loaddata together seems like a great way to make a full copy of an existing Django installation (e.g. for migrating to a different server, or getting a local copy of your production data).

Documentation suggests this should be possible. An obvious way would be to do dumpdata on one system, followed by flush and loaddata on the other system.
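
For reference, this is the naive workflow I had in mind (just a sketch, the exact options can of course vary):

    # On the source system
    python manage.py dumpdata --indent 2 > dump.json

    # On the target system (same code and migrations applied)
    python manage.py flush --no-input
    python manage.py loaddata dump.json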

However, when you try it, you run into duplicate-key errors in the contenttypes and similar tables, things like:

MySQLdb._exceptions.IntegrityError: (1062, "Duplicate entry 'someapp-somemodel' for key 'django_content_type_app_label_model_76bd3d3b_uniq'")

What seems to happen is that flush (as documented) flushes all tables and then reruns "post-synchronization handlers", which create content-types and I think permissions and maybe other things as well. Since dumpdata does dump these tables, this creates a conflict.

Currently, I think you can prevent this by:

  • Making and importing a full database dump outside of Django (e.g. using mysqldump). This is a good way to guarantee a really identical copy (though there might be timezone issues with e.g. MySQL), but it is often less convenient and does not work across database types (e.g. dumping a remote MySQL database into a local SQLite database).
  • Using natural keys when dumping. The documentation for `dumpdata --natural-foreign` suggests using natural keys when contenttypes and permissions are involved. I believe this works because natural foreign keys allow any references to these tables to be matched against the auto-created rows in the target database. In addition, and I think the documentation does not make this explicit, you would also need to exclude contenttypes, permissions and any other auto-created models from the dumpdata, or also add --natural-primary, which I believe makes loaddata overwrite existing rows based on the natural primary key rather than adding new ones (see the sketch after this list).
    Having to manually exclude models is quite cumbersome for a quick dump-and-load cycle. Also, if the dumped database somehow contained *fewer* content types, permissions, etc. than the auto-created ones, the newly loaded database would still contain the extra ones. More generally, the loaded database is not an identical copy of the original one.
    I also had some issues with this approach, due to circular references in my natural keys, but I think this has since been fixed in git.
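
As a sketch of the natural-keys approach above (the exclude list is only what I needed; other auto-created models might need excluding too):

    # On the source system
    python manage.py dumpdata --natural-foreign --natural-primary \
        --exclude contenttypes --exclude auth.permission \
        --indent 2 > dump.json

    # On the target system (already migrated)
    python manage.py loaddata dump.json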

I wonder if we can make this process easier somehow?

One solution that springs to mind is to add a flush --no-handlers option (or something like that) to prevent running the "post-synchronization handlers". This should result in all tables that are dumped by dumpdata being empty (I think this means all tables are empty except for the migrations table). Then doing a dumpdata, flush --no-handlers and loaddata could, I think, produce an exact copy of the database, including matching primary keys.
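
To be explicit, with such a (currently non-existent) option the workflow would look something like this:

    # Hypothetical: flush --no-handlers does not exist today
    python manage.py dumpdata --indent 2 > dump.json    # on the source system
    python manage.py flush --no-input --no-handlers     # on the target: empty all tables, skip post_migrate handlers
    python manage.py loaddata dump.json                 # exact copy, including primary keys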

Or are there any other existing ways to make this easier that I missed and/or could be (better) documented?

Change History (7)

comment:1 by Matthijs Kooijman, 4 years ago

Just realized that I forgot to list one option, which I think is the most practical way to do this right now:

  • Instead of using manage.py flush, use a database-specific management tool (e.g. phpMyAdmin) to empty/truncate all tables *except* for migrations. Then just use dumpdata and loaddata, which should get you an identical import, including identical primary keys (see the sketch below). This works, but it still requires external tools, so it is harder to do quickly and/or automate; a manage.py-only solution would certainly be nicer.
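
For example, with MySQL, something along these lines (the table names are illustrative, and every table except django_migrations needs truncating):

    # On the target system, outside of Django
    mysql mydb -e "SET FOREIGN_KEY_CHECKS=0;
                   TRUNCATE TABLE django_content_type;
                   TRUNCATE TABLE auth_permission;
                   -- ...and so on for every table except django_migrations...
                   SET FOREIGN_KEY_CHECKS=1;"

    # Then load a plain dumpdata dump
    python manage.py loaddata dump.json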

comment:2 by Mariusz Felisiak, 4 years ago

Resolution: wontfix
Status: new → closed

At first glance, using manage.py dumpdata and loaddata together seems like a great way to make a full copy of an existing Django installation (e.g. for migrating to a different server, or getting a local copy of your production data). Documentation suggests this should be possible.

I don't see such suggestions in docs. Given that dumpdata isn't meant as a backup tool (databases have their own tools for that), I think we should not add more options to make this easier or document a more or less feasible scenario.

comment:3 by Matthijs Kooijman, 4 years ago

I don't see such suggestions in docs.

This is not explicit, but IMHO implied. There's a command to make a dump of the data for all applications, and there's a command that can load such dumps, which suggests to me that I can use this to make a copy of all my data into another Django instance. If not, maybe this should be made explicit in the docs.

Given that dumpdata isn't meant as a backup tool (databases have their own tool for that)

I was not talking about backups here (I agree that database-specific tools are better for this, though I could also see dumpdata as a second-best option), but about copying data from one instance to another. But that's just a matter of definition; I'm assuming that you would also consider that dumpdata is not meant for such data copying?

If so, then what is dumpdata meant for exactly? Is it only for creating test fixtures? Then why does it have a "dump all applications" option and/or why does it include contenttypes and other autogenerated data at all (even by default)?

And even if this copying is not the intention of dumpdata, is there any reason that its intended use case cannot be expanded to include this? It seems to me that it would be easy to do so, but maybe there are fundamental issues or things that prevent this from working reliably? If so, maybe it would be better to document that fact (maybe with some examples) to make this more explicit and correct user expectations. If you can point out things here, I'd be happy to submit a documentation PR for this.

in reply to:  3 comment:4 by Mariusz Felisiak, 4 years ago

I was not talking about backups here (I agree that database-specific tools are better for this, though I could also see dumpdata as a second-best option), but about copying data from one instance to another. But that's just a matter of definition; I'm assuming that you would also consider that dumpdata is not meant for such data copying?

Sure, you can use dumpdata to create a copy of data, but you're talking about a specific and advanced scenario (a combination of 3 management commands) for overriding an existing and non-empty instance. IMO, this is not something that we want/should support or document.

You've already mentioned in the ticket description that it's possible if you exclude Django apps that use the post_migrate signal and use natural keys in dumpdata; however, I don't think we should document this, because dumpdata isn't meant as a backup/clone tool.

comment:5 by Matthijs Kooijman, 4 years ago

Sure, you can use dumpdata to create a copy of data, but you're talking about a specific and advanced scenario (a combination of 3 management commands) for overriding an existing and non-empty instance. IMO, this is not something that we want/should support or document.

I'm not married to the use of three management commands; if just dumpdata and loaddata could achieve the same, that would also be perfect (and using these two commands together does not seem like an advanced scenario to me, really).

As for overriding an existing instance, I believe the issue would be no different when starting from a non-existent database. You would then have to run migrate to create the table structure, which also runs the same handlers to create e.g. contenttypes.

One more thing I had wanted to emphasize in my previous comment: Using external database tools is fine for backup and restore, but when moving data between database types, making SQL-based dumps is pointless AFAICS and some higher-level tool like dumpdata would be needed. If dumpdata is not the tool for this, is there an alternative? Or is changing the type of your database and/or moving data between instances with different databases simply not supported by Django then?

This does not seem like a weird or unconventional use case to me. One particular instance of this is wanting to import some data from a production server (typically MySQL or PostgreSQL) into a local development copy to reproduce some problem without having to debug on production. Of course, you can have a staging copy of your application for this, or set up MySQL/PostgreSQL locally, which certainly makes sense for big and professional deployments, but for small-scale applications this is a lot of extra work.
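
Concretely, what I would like to be able to do is roughly this (the exclude list is just illustrative, as discussed above):

    # On the production server (MySQL/PostgreSQL)
    python manage.py dumpdata --natural-foreign --natural-primary \
        --exclude contenttypes --exclude auth.permission > prod.json

    # Locally, against a freshly migrated SQLite development database
    python manage.py migrate
    python manage.py loaddata prod.json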

Apologies if I come across as stubborn, but I feel that just saying "This is unsupported" is insufficient in this case. Maybe this is extra annoying because I'm not the first one to suggest this and it is a recurring discussion for you, but if so, that would be even more indication that this use case should be better documented (even if just as not supported).

Again, I would consider filing a documentation PR to clarify that this use case is not supported, but I'm still grasping a little at what is and is not the intended use case for dumpdata and loaddata. What defines the line between supported and not supported?

in reply to:  5 comment:6 by Mariusz Felisiak, 4 years ago

Apologies if I come across as stubborn, but I feel that just saying "This is unsupported" is insufficient in this case. Maybe this is extra annoying because I'm not the first one to suggest this and it is a recurring discussion for you, but if so, that would be even more indication that this use case should be better documented (even if just as not supported).

We cannot document each unsupported scenario. It's also documented that "You should never include automatically generated objects in a fixture or other serialized data."

This does not seem like a weird or unconventional use case to me.

Moving data between different databases is always hard, because you can encounter multiple issues, e.g. inconsistent or non-existent data types, and much more (that's why we have many feature flags). Adding support for overriding an existing and non-empty instance with loaddata and dumpdata is really complicated, and IMO we shouldn't officially support it.

What defines the line between supported and not supported?

If something is not tested and documented, then it's not supported.

You can start a discussion on DevelopersMailingList if you don't agree.

comment:7 by Carlton Gibson, 4 years ago

Hi Matthijs — I think this is an interesting topic. A write up of exactly what's required to move between DBs might be a good starting point for a discussion. (Who knows whether we support it in the end or not, but it's something everyone has run into a few times I'd guess...)
