Opened 17 months ago

Last modified 12 days ago

#22259 new New feature

Per row result for dumpdata

Reported by: Gwildor Owned by: nobody
Component: Core (Management commands) Version: master
Severity: Normal Keywords:
Cc: anubhav9042@… Triage Stage: Someday/Maybe
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

In response to ticket #22251, I'm opening this as a separate issue as requested. You can read the motivation for this option there, but it essentially comes down to memory consumption. This was addressed in #5423 and improved drastically based on the results discussed in that ticket, but dumpdata still consumes a fair amount of memory and would benefit from further improvements. Besides that, in its current form, when the command stops unexpectedly nothing is saved, so you don't even have an incomplete file to use for development or testing purposes while you run the command again.

In its current form, dumpdata returns one big JSON object which loaddata has to read into memory and parse before it can start importing. By writing each row of data as a separate JSON object, with one resulting JSON object per line, loaddata could read the file line by line (Python's buffered file iteration) to reduce memory usage.
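The idea above can be sketched in plain Python, independent of Django. This is a minimal illustration of the proposed line-per-object format; `dump_rows` and `load_rows` are hypothetical names for this sketch, not Django APIs, and the rows are plain dicts standing in for serialized model instances:

```python
import io
import json

def dump_rows(rows, stream):
    # Write one JSON object per line instead of one big array.
    # Nothing is accumulated in memory, so if the dump crashes midway,
    # every line written so far is still valid and usable.
    for row in rows:
        stream.write(json.dumps(row, sort_keys=True) + "\n")

def load_rows(stream):
    # Iterating over a file object reads it line by line (buffered),
    # so only one row is ever parsed into memory at a time.
    for line in stream:
        yield json.loads(line)

# Round trip through an in-memory buffer:
buf = io.StringIO()
dump_rows([{"model": "Foo", "pk": 1, "fields": {}},
           {"model": "Foo", "pk": 2, "fields": {}}], buf)
buf.seek(0)
rows = list(load_rows(buf))
```

Because `load_rows` is a generator, the consumer can process and discard each row before the next one is even parsed, which is the memory win this ticket is after.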

Unfortunately, this feature is probably backwards incompatible, although it might be possible to do some fancy reading of the file in the loaddata command to detect its structure. If that's not possible, I reckon it's best to add a new flag to enable this feature.

Change History (5)

comment:1 Changed 17 months ago by aaugustin

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Triage Stage changed from Unreviewed to Someday/Maybe

Overall the idea makes sense, but we cannot accept this ticket without a plan to address the backwards incompatibility. Remember that dumpdata output may be read by tools other than Django.

comment:2 follow-up: Changed 17 months ago by anubhav9042

As mentioned in #22251, this is implemented in #5423. What else is required?

comment:3 in reply to: ↑ 2 Changed 16 months ago by Gwildor

Replying to anubhav9042:

As mentioned in #22251, this is implemented in #5423. What else is required?

What I'm talking about is a whole other way of representing the data. As it is now, one big JSON object is created with all the model instances in it, like so:

[
    {
        "model": "Foo",
        "pk": 1,
        "fields": {***}
    },
    {
        "model": "Foo",
        "pk": 2,
        "fields": {***}
    }
]

What I'm getting at is changing this so that one JSON object is created and streamed to the output per model instance, resulting in output along these lines:

{"model": "Foo", "pk": 1, "fields": {***}}
{"model": "Foo", "pk": 2, "fields": {***}}
{"model": "Bar", "pk": 1, "fields": {***}}

This has, in my opinion, the big advantage that the script that loads the data back does not have to read everything into memory before it can start processing.

At the moment, you have to do something like this (the read can of course be buffered):

with open(args[0], 'rb') as f:
    # The entire file is read and parsed before processing can begin.
    data = json.load(f)

for row in data:
    process(row)

But when the output option is added the way I proposed, you can do this:

with open(args[0], 'rb') as f:
    # Iterating over the file object yields one line at a time (buffered).
    for line in f:
        row = json.loads(line)
        process(row)

This way, you don't have to parse one big JSON object.

Another, much smaller advantage is that if the dumpdata command crashes, you would still have some output to use for testing or development purposes while you run the command again. As it is now, I believe you are left with nothing (I have never seen even an incomplete JSON object printed in the terminal, just the error and nothing more). Although this is a small and arguable advantage, I believe it would have a positive effect on the user-friendliness of the command.

I think the best way to proceed (if we want this) is to add a flag to the dumpdata command which enables this behaviour, and to support it in the loaddata command by making a reasonable guess based on reading the first line, falling back to the old behaviour when there is no certainty. Other tools then have the option to either do the same and support both formats, or to support only one of them (probably the current format, unless the new format turns out to be very popular).
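The "reasonable guess based on reading the first line" could be as simple as peeking at the first non-whitespace character: the classic format always starts with `[` (one big JSON array), while the proposed format starts each line with its own object. A minimal sketch, assuming a seekable text stream; `sniff_fixture_format` is a hypothetical helper name, not part of Django's loaddata:

```python
import io

def sniff_fixture_format(stream):
    # Peek at the first non-whitespace character without consuming it.
    # '[' means the classic single-array format; anything else is
    # assumed to be the proposed line-per-object format.
    pos = stream.tell()
    try:
        ch = stream.read(1)
        while ch.isspace():
            ch = stream.read(1)
    finally:
        # Rewind so the actual parser sees the full file.
        stream.seek(pos)
    return "array" if ch == "[" else "lines"
```

This keeps the old format working unchanged while letting loaddata transparently accept both, which is one way to address the backwards-compatibility concern raised in comment 1.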

comment:4 Changed 16 months ago by anubhav9042

  • Cc anubhav9042@… added