Opened 10 years ago

Closed 4 years ago

Last modified 4 years ago

#22259 closed New feature (duplicate)

Per row result for dumpdata

Reported by: Gwildor Sok Owned by: nobody
Component: Core (Management commands) Version: dev
Severity: Normal Keywords:
Cc: anubhav9042@…, Ian Foote Triage Stage: Someday/Maybe
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

In response of ticket #22251, I'm opening this as a separate issue like requested. You can read the need for this option there, but basically it has to do with memory consumption. This was addressed in #5423 and improved drastically based on the results talked about in the ticket, but dumpdata is still consuming a fair amount of memory, and would benefit from further improvements. Besides that, in its current form, when the command stops unexpectedly, nothing is saved and you don't have an incomplete file which you can use for development or testing purposes while you are running the command again.

In its current form, dumpdata is returning one big JSON object which loaddata has to read into memory and parse before it can start importing again. By writing one row of data in a separate JSON object for it and having one resulting JSON object per line, loaddata could use buffered file reading like Python's readlines function to reduce the memory usage.

Unfortunately, this feature is probably backwards incompatible, although it might be possible to do some fancy reading of the file in the loaddata command to check its file structure. If that's not possible, I reckon it's best to add a new flag to enable this feature.

Change History (8)

comment:1 by Aymeric Augustin, 10 years ago

Triage Stage: UnreviewedSomeday/Maybe

Overall the idea makes sense but we cannot accept this ticket without a plan to address the backwards-incompatibility. Remember that dumpdata output may be read by tools other by Django.

comment:2 by ANUBHAV JOSHI, 10 years ago

As mentioned in #22251, this is implemented in #5423. What else is required?

in reply to:  2 comment:3 by Gwildor Sok, 10 years ago

Replying to anubhav9042:

As mentioned in #22251, this is implemented in #5423. What else is required?

What I'm talking about is a whole other way of representing the data. As it is now, one big JSON object is created with all the model instances in it, like so:

[
    {
        'model': 'Foo',
        'pk': 1,
        'fields': {***},
    },
    {
        'model': 'Foo',
        'pk': 2,
        'fields': {***},
    },
]

What I'm going at, is changing this so one JSON object is created and streamed to the output per model instance, resulting in output along the lines as so:

{'model': 'Foo', 'pk': 1, 'fields': {***}}
{'model': 'Foo', 'pk': 2, 'fields': {***}}
{'model': 'Bar', 'pk': 1, 'fields': {***}}

This, in my opinion, has the big advantage that the script that loads the data again, does not have to load all the data into memory before it can start with processing the data.

At the moment, you have to do something like this (of course you can read buffered):

with open(args[0], 'rb') as f:
    fc = f.read()
    data = json.loads(fc)

for row in data:
    process(row)

But when the output option is added the way I proposed, you can do this:

with open(args[0], 'rb') as f:
    for row in f.readline():
        row = json.loads(row)
        process(row)

This way, you don't have to parse one big JSON object.

Another but way smaller advantage is that if the dumpdata command crashes, you would still have some output to use for testing or developing purposes while you wait while running the dumpdata command again. As it is now, I believe you are left with nothing (never seen an even incomplete JSON object printed in the terminal, just the error and nothing more). Although this is a small and arguable advantage, I believe this will have a positive effect on the user friendliness of using the command.

I think the best way to progress (if we want this) is to add a flag for the dumpdata command which enables this behaviour, and add support for it for the loaddata command by making a reasonable guess based on the reading the first line, falling back to the old behaviour if there is no certainty. I fear that other tools have the options to either do the same to support both formats, or simply support one and not the other (which will probably be the current format unless the new format turns out to be very popular).

comment:4 by ANUBHAV JOSHI, 10 years ago

Cc: anubhav9042@… added

comment:6 by Charlie Denton, 5 years ago

Testing this today, and I've been having success with this library for dumping-to/loading-from one-row-per-line JSON: https://github.com/superisaac/django-mljson-serializer

comment:7 by Ian Foote, 5 years ago

Cc: Ian Foote added

comment:8 by Mariusz Felisiak, 4 years ago

Resolution: duplicate
Status: newclosed

I think we can close this as a duplicate of #30190 since you can now use the jsonl format.

Last edited 4 years ago by Mariusz Felisiak (previous) (diff)
Note: See TracTickets for help on using tickets.
Back to Top