Opened 13 months ago

Closed 13 months ago

Last modified 13 months ago

#22251 closed Cleanup/optimization (invalid)

Dumpdata improvement suggestions

Reported by: Gwildor Owned by: nobody
Component: Core (Management commands) Version: master
Severity: Normal Keywords:
Cc: Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

The dumpdata and loaddata commands are the standard built-in management commands for dumping and loading a database to something other than a big unparsable SQL file. Perhaps using them in their current form is not the best idea for big databases, and a dedicated separate package should be used for that, but the fact is that these commands are currently present in Django, while their usability, from a user's standpoint, is lacking at best. While they do their job, they have a few big shortcomings that make them hard to use in their current form. These problems could be tackled with a few big tweaks, which would make these commands worthy of their place in Django.

The problems I found are from using the dumpdata command quite intensely over the course of the past two months, with resulting unindented JSON dumps ranging in the 300-400MB area. Instead of using the loaddata command, I built a custom compatibility parser for our project, but I reckon these problems hold true for the loaddata command as well.

In my opinion, the current usability problems with the commands are:

Complete lack of verbosity

In its current form, you have absolutely no clue whether the command is still running properly (when deciding whether it has failed and you should kill it and try again) or what its progress is. Because its default behaviour is to write the serialization result to the console, the command output is usually redirected to a separate file, making this even worse. Usually, the only feedback you get is the command finishing and giving you control over your console again.

Hence, something like this is extremely common:

$ ./manage.py dumpdata app1 app2 ... --format=json > dump.json
$

So, to clarify: between these two lines in the console, there could be an indefinite amount of time. With the dumps I spoke of, the command could last up to two hours, giving no indication at all about its state or progress. This is of course logical, because you are redirecting the output, but to me this is a major usability flaw. The only way you can check if the command is still correctly running is by monitoring the process to see if its memory usage is still increasing. Which actually brings me to my second point...
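One way the verbosity gap could be closed without touching the redirected output (a sketch of mine, not anything dumpdata currently does; `dump_models` and the model names are hypothetical placeholders): shell redirection with `>` only captures stdout, so a long-running command can report progress on stderr while the serialized data still goes to stdout.

```python
import sys

def dump_models(model_names):
    """Write serialized output to stdout while reporting progress on stderr.

    The per-model "work" here is a placeholder; the point is only that
    stderr messages still reach the terminal under `> dump.json`.
    """
    total = len(model_names)
    for i, name in enumerate(model_names, start=1):
        sys.stderr.write("[%d/%d] dumping %s...\n" % (i, total, name))
        sys.stderr.flush()
        # placeholder for the real serialization work
        sys.stdout.write('{"model": "%s"}\n' % name)

dump_models(["app1.Author", "app1.Book"])
```

Run as `./script.py > dump.json`, the `[1/2] dumping ...` lines would stay visible on the console while `dump.json` receives only the data.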

Keeping everything in memory

During the serialization, the final result is built up in memory and returned as the final step of the command. To me, this is bad for a few reasons:

  • When the command stops unexpectedly, you are left with nothing.
  • It slows down your computer when you don't have enough memory.
  • It slows itself down when you don't have enough memory.
  • Possibly a lot of other things happen when you don't have enough memory (such as the process being killed by the OS? I'm not sure if this ever happens in Python, but I see it a lot when trying to run big misconfigured Java programs).

This is especially annoying when you get the dreaded "Unable to serialize database" error and the command just stops right there (which could be a ticket of its own), without returning the result it has accumulated up to that point. Combined with the above-mentioned lack of verbosity, this makes the command very annoying to use in some circumstances, depending of course on the size and state of your database.

Possible improvements to address these issues

Off the top of my head, I've come up with these suggestions for improvement, but of course diagnosing the problem correctly is the first and most important step, so I'm mainly listing these to get the discussion going and to end the ticket on a more optimistic note.

  • add a mandatory argument to dumpdata for a filename to write the result to, or generate one automatically if it's not given (such as "dump_20130311_1337.json"). This makes redirecting the output of the command unnecessary, and opens up the ability to add verbosity.
  • collect the number of models (and perhaps rows of data?) to dump, tell this to the user and give progress updates in between. This would fully eliminate the "is it stuck?" question I often have when running the command.
  • write one row of data as one JSON object per line in the file. Perhaps this could be added as a flag on both commands, just like indent already is? This would make it possible to read and write the dump file in a buffered manner, eliminating the need to load the entire result into memory. Now this is a tough one, because it's backwards incompatible if it's not added as a flag, and it requires rewriting the loaddata command as well.
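The one-object-per-line idea in the last bullet is essentially what is now called the JSON Lines format. As an illustrative sketch (the model name and fields below are made up, and the file path is arbitrary), writing and reading such a dump with a bounded memory footprint could look like this:

```python
import json
import os
import tempfile

def write_jsonl(rows, path):
    """Write each row as one JSON object per line; only a single
    row is ever held in memory at a time."""
    with open(path, "w") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")

def read_jsonl(path):
    """Yield one parsed object per line; the file is read with normal
    buffered iteration, never loaded wholesale."""
    with open(path) as fh:
        for line in fh:
            yield json.loads(line)

path = os.path.join(tempfile.gettempdir(), "dump.jsonl")
rows = [{"model": "app1.book", "pk": i, "fields": {"title": "t%d" % i}}
        for i in range(3)]
write_jsonl(rows, path)
assert list(read_jsonl(path)) == rows
```

Because both sides work a line at a time, a 400MB dump would never need 400MB of RAM on either the dump or the load side.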

Like I said, these are just some pointers on how the problems mentioned could be addressed, but I reckon everyone has their own views on it. Thanks in advance for your time and effort.

First ticket I've created, so apologies in advance for any shortcomings.

Change History (2)

comment:1 Changed 13 months ago by russellm

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Resolution set to invalid
  • Status changed from new to closed

Thanks for the ticket. For future reference, it's better to keep a ticket tied to a single idea. That may mean opening a bunch of smaller tickets, rather than making a single "big ball of mud" ticket. If each ticket is small and independently actionable, we can close off that ticket when it's done. "Big ball of mud" tickets end up living forever because it's very hard to say "this ticket is done".

Looking at the ideas in this ticket, you've got three separate proposals:

  • Add a filename option for dumpdata. This makes sense; -o options are common on command line tools producing tangible output.
  • Provide progress updates during dump/load. At a conceptual level, I agree this would be nice to have, but we're limited by the progress reporting tools of the underlying serialisation frameworks. If you've got any ideas on how this can be achieved, I'd love to hear them.
  • Stream output of dumpdata. I'm pretty sure this is already what is happening - see Ticket #5423 for the discussion around this change. If this isn't working as you expect, then we're going to need more details about the behaviour you're seeing, and what you expect (or would like to see).
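For context on what "streaming output" means here (this is my own sketch of the idea, not Django's actual serializer code): a stream-based serializer writes each object to a file-like stream as soon as it is encoded, so peak memory is bounded by one object rather than the whole dump. Django's `serializers.serialize()` accepts a `stream` argument for this purpose; the hypothetical helper below only illustrates the shape of the technique.

```python
import io
import json

def serialize_to_stream(objects, stream):
    """Encode each object and write it immediately to `stream`, producing
    a valid JSON array without ever holding the full result in memory."""
    stream.write("[")
    for i, obj in enumerate(objects):
        if i:
            stream.write(", ")
        stream.write(json.dumps(obj))
    stream.write("]")

buf = io.StringIO()
# A generator, so the input rows are also produced lazily.
serialize_to_stream(({"pk": i} for i in range(3)), buf)
```

If the command wires such a serializer directly to `self.stdout` (or an output file), memory usage stays flat no matter how large the dump is.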

I'm going to procedurally close this ticket because it's a "ball of mud". Please don't take that as a discouraging step - I'd encourage you to reopen as three independent tickets, one for each independent idea.

comment:2 Changed 13 months ago by Gwildor

I must say I was already anticipating that, so I wasn't surprised by your reaction, but I decided in the end that one big coherent story would better illustrate the point. But maybe the Django-developers group would have been a better place for that. Anyway, thanks for taking the time to read it!

I've opened separate tickets for the suggestions: #22257, #22258 and #22258.
