Dumpdata improvement suggestions
| Reported by: | Gwildor Sok | Owned by: | nobody |
| Component: | Core (Management commands) | Version: | master |
| Has patch: | no | Needs documentation: | no |
| Needs tests: | no | Patch needs improvement: | no |
dumpdata and loaddata are the standard built-in management commands for dumping a database to, and loading it from, something other than a big unparsable SQL file. Perhaps using them in their current form is not the best idea for big databases and a dedicated separate package should be used instead, but the fact is that these commands are currently part of Django, while their usability, from a user's standpoint, is lacking at best. While they do their job, they have a few big shortcomings that make them hard to use in their current form. These problems could be tackled with a few big tweaks, which would make these commands worthy again of being in Django natively.
The problems I found are from using the dumpdata command quite intensely over the course of the past two months, with resulting unindented JSON dumps ranging in the 300-400MB area. Instead of using the loaddata command, I built a custom compatibility parser for our project, but I reckon these problems hold true for the loaddata command as well.
In my opinion, the current usability problems with the commands are:
Complete lack of verbosity
In its current form, you have absolutely no clue whether the command is still running properly or what its progress is, which makes it hard to decide whether it has failed and you should kill it and try again. Because its default behaviour is to write the serialization result to the console, the command output is usually redirected to a separate file, making this even worse. Usually, the only feedback you get is the command finally stopping and returning control of your console.
Hence, something like this is extremely common:
$ ./manage.py dumpdata app1 app2 ... --format=json > dump.json
$
So, to clarify: between these two lines in the console, there could be an indefinite amount of time. With the dumps I spoke of, the command could last up to two hours, giving no indication at all about its state or progress. This is of course logical, because you are redirecting the output, but to me this is a major usability flaw. The only way you can check if the command is still correctly running is by monitoring the process to see if its memory usage is still increasing. Which actually brings me to my second point...
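One way to square "output is redirected" with "the user wants feedback" is to report progress on stderr while the dump itself goes to stdout. The sketch below is purely illustrative (`objects`, `serialize`, and `total` are placeholders for whatever dumpdata uses internally, not real Django APIs):

```python
import sys
import time

def dump_with_progress(objects, serialize, total, report_every=1000):
    """Write serialized rows to stdout while reporting progress on stderr.

    Because progress goes to stderr, redirecting stdout to a file
    (as in the transcript above) does not hide it.
    """
    start = time.time()
    for i, obj in enumerate(objects, 1):
        sys.stdout.write(serialize(obj))
        if i % report_every == 0 or i == total:
            elapsed = time.time() - start
            sys.stderr.write("Serialized %d/%d objects (%.0fs elapsed)\n"
                             % (i, total, elapsed))
```

With something like this, `./manage.py dumpdata ... > dump.json` would still produce a clean dump file, while the console shows the command is alive.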
Keeping everything in memory
During the serialization, the final result is built up in memory and returned as the final step of the command. To me, this is bad for a few reasons:
- When the command stops unexpectedly, you are left with nothing.
- It slows down your computer when you don't have enough memory.
- It slows itself down when you don't have enough memory.
- Possibly a lot of other things happen when you don't have enough memory (such as the process being killed by the OS? I'm not sure this ever happens in Python, but I see it a lot when trying to run big misconfigured Java programs).
This is especially annoying when you get the dreaded "Unable to serialize database" error and the command just stops right there (which could be a ticket of its own), without returning the result it has accumulated up to that point. Combined with the above-mentioned lack of verbosity, this makes the command very annoying to use in some circumstances, depending of course on the size and state of your database.
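The contrast between the two approaches can be sketched in a few lines. This is not Django's actual serializer code; `rows` stands in for whatever iterator yields model data, and the point is only that writing as you go leaves partial results on disk when the command dies:

```python
import json

def dump_all_in_memory(rows):
    """Roughly the current behaviour: build the whole result before
    returning it. A failure anywhere loses everything, and peak memory
    grows with the size of the database."""
    return json.dumps([row for row in rows])

def dump_streaming(rows, fh):
    """Alternative sketch: write each row as soon as it is serialized.
    If the command stops unexpectedly part-way through, everything up
    to that point is already in the file."""
    fh.write("[")
    for i, row in enumerate(rows):
        if i:
            fh.write(",\n")
        fh.write(json.dumps(row))
    fh.write("]\n")
```

The streaming variant keeps memory use flat regardless of database size, and an "Unable to serialize" failure would at least leave the already-serialized rows behind for inspection.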
Possible improvements to address these issues
Off the top of my head, I've come up with these suggestions for improvement, but of course diagnosing the problem correctly is the first and most important step, so I'm mainly listing these to get the discussion going and to end the ticket on a more optimistic note.
- add a mandatory argument to dumpdata for a filename to write the result to, or generate one automatically if it's not given (such as "dump_20130311_1337.json"). This makes redirecting the output of the command unnecessary, and opens up the ability to add verbosity.
- collect the number of models (and perhaps rows of data?) to dump, report this to the user, and give progress updates along the way. This would fully eliminate the "is it stuck?" question I often have when running the command.
- write each row of data as one JSON object on its own line in the file. Perhaps this could be added as a flag on both commands, just like indent already is? This would make it possible to read and write the dump file in a buffered manner, eliminating the need to load the entire result into memory. Now this is a tough one, because it's backwards incompatible if it's not added as a flag, and it requires rewriting the loaddata command as well.
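The one-object-per-line suggestion (essentially the "JSON Lines" convention) could look like the following sketch. The function names are hypothetical, not proposed APIs; the point is that both sides of the round trip only ever hold one row in memory:

```python
import json

def write_jsonl(rows, fh):
    """Write one JSON object per line. Each row can be serialized and
    written independently, so memory use stays flat during the dump."""
    for row in rows:
        fh.write(json.dumps(row) + "\n")

def read_jsonl(fh):
    """Lazily yield one row at a time. A loaddata-style consumer could
    iterate over this without ever holding the full dump in memory."""
    for line in fh:
        line = line.strip()
        if line:
            yield json.loads(line)
```

The backwards-incompatibility concern stands: such a file is not itself valid JSON, so existing tools expecting one big array would need the flag to stay opt-in.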
Like I said, these are just some pointers on how the problems mentioned could be addressed, but I reckon everyone has their own views on it. Thanks in advance for your time and effort.
First ticket I've created, so apologies in advance for any shortcomings.