Opened 17 years ago

Closed 12 years ago

Last modified 12 years ago

#5423 closed New feature (fixed)

"dumpdata" should stream output one row at a time

Reported by: Adrian Holovaty Owned by: Ramiro Morales
Component: Core (Serialization) Version: dev
Severity: Normal Keywords: dumpdata fixtures memory
Cc: Jason McVetta, Torsten Bronger, kmike84@…, Tom Christie, bgneal@…, riccardo.magliocchetti@…, pelletier.thomas@…, znick, ben@…, dpmcgee@…, jann@…, hakon.erichsen@…, Chris Spencer, mmitar@…, wgordonw1@… Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

I attempted to run django-admin.py dumpdata on a database table with ~2,000 records, and it ran out of memory. This problem would be avoided if dumpdata output its dump one line at a time, like an iterator does.

Attachments (13)

5423.diff (798 bytes ) - added by Russell Keith-Magee 17 years ago.
Possible fix for problem
django_dumpdata_streamed_output.diff (5.4 KB ) - added by Bastian Kleineidam <calvin@…> 16 years ago.
streamed json output
django_dumpdata_streamed_output_2.diff (4.8 KB ) - added by Bastian Kleineidam <calvin@…> 16 years ago.
Cleanup up revision, using objects.iterator()
django_dumpdata_streamed_output_3.diff (6.5 KB ) - added by calvin@… 16 years ago.
Newest version, fixing modeltests
dumpdata_streamed_output_django1.0.diff (5.8 KB ) - added by captnswing 15 years ago.
fixed patch to work with django 1.0
dumpdata_streamed_output_django1.0-fix.diff (6.7 KB ) - added by Michał Sałaban 15 years ago.
Corrected bugs.
5423.1.diff (5.7 KB ) - added by Ramiro Morales 13 years ago.
Patch updated to current trunk status
5423.2.diff (5.9 KB ) - added by Ramiro Morales 13 years ago.
Another update, with fixes to JSON output so it is always syntactically correct.
json-postgresq-trunk.png (40.5 KB ) - added by Ramiro Morales 13 years ago.
Memory usage with dumpdata of 50,000 objects, Postgres, JSON format, unpatched trunk
json-postgresq-trunk-patched.png (26.3 KB ) - added by Ramiro Morales 13 years ago.
Memory usage with dumpdata of same 50,000 objects, Postgres, JSON format, trunk patched with 5423.2.diff
5423.3.diff (6.0 KB ) - added by pzinovkin 13 years ago.
Patch updated to current trunk
5423.4.diff (5.7 KB ) - added by Jann Kleen 12 years ago.
combined the original patch with 5423.3.diff
5423.5.diff (5.4 KB ) - added by Luke Plant 12 years ago.
Updated for trunk

Download all attachments as: .zip

Change History (75)

comment:1 by Adrian Holovaty, 17 years ago

Triage Stage: UnreviewedAccepted

comment:2 by Jacob, 17 years ago

Just as a point of reference, I dumped a table with 100k+ rows into YAML without problems... so there's obviously a situation-dependant quality to this one.

by Russell Keith-Magee, 17 years ago

Attachment: 5423.diff added

Possible fix for problem

comment:3 by Rob Hudson <treborhudson@…>, 17 years ago

During django sprint... tested before and after the patch and noticed no effect. The dump consumed about 100M of my free memory while it was churning, then dumped everything to screen and the 100M became free again.

comment:4 by Rob Hudson <treborhudson@…>, 17 years ago

After a bit more testing, I've discovered:

  • Using the default JSON format stream doesn't appear to stream -- it churns until all rows are in memory and then it dumps to stdout. (The XML serializer contrasts this by dumping to stdout immediately.)
  • Using --format=xml the stream option does in fact stream to stdout, but in the end it still consumes the same amount of memory. Perhaps there's a bug in that memory isn't being freed as it iterates?

comment:5 by Bastian Kleineidam <calvin@…>, 16 years ago

Has patch: set

Here is a patch that streams json data per object. It also uses a generator method for the object list instead of storing everything in a list beforehand. This should get rid of your memory problems (you are welcome to test this patch).

Note that yaml output has the same streaming problem which is not fixed here.

by Bastian Kleineidam <calvin@…>, 16 years ago

streamed json output

comment:6 by Bastian Kleineidam <calvin@…>, 16 years ago

Whoops, please ignore the date formatting changes from the patch.

comment:7 by Bastian Kleineidam <calvin@…>, 16 years ago

Patch needs improvement: set

comment:8 by Densetsu no Ero-sennin <densetsu.no.ero.sennin@…>, 16 years ago

I used this patch and changed model.objects.all() to model.objects.iterator() in django/core/management/commands/dumpdata.py. After that I was successfully able to dump a considerably big database (1Gb SQLite file). Python process resident memory size was at about 20M all the time.

comment:9 by Chris Beaven, 16 years ago

Thanks Densetsu. Looks pretty good from a quick skim over - someone else from this ticket want to review the patch and report back?

comment:10 by Bastian Kleineidam <calvin@…>, 16 years ago

Patch needs improvement: unset

The .iterator() thing was the missing piece, nice work. I am adding a cleaned up patch that does not have the date formatting changes from my local repository.

by Bastian Kleineidam <calvin@…>, 16 years ago

Cleanup up revision, using objects.iterator()

comment:11 by anonymous, 16 years ago

I have applied the latest patch 'django_dumpdata_streamed_output_2.diff' from 10/20/07, on the latest version from svn.
I have a huge database (> 300,000 entries - sqlite db)

The error is: "Error: Unable to serialize database: 'NoneType' object has no attribute 'strftime'"

by calvin@…, 16 years ago

Newest version, fixing modeltests

comment:12 by Éric St-Jean, 16 years ago

Even with django_dumpdata_streamed_output_3.diff, i still go OOM trying to dump tables with 250k rows in them to json.

comment:13 by Éric St-Jean, 16 years ago

Apologies to everyone: i'm an idiot. It does stream and doesn't eat memory, django_dumpdata_streamed_output_3.diff works fine with 250k objects... I had DEBUG=True.

by captnswing, 15 years ago

fixed patch to work with django 1.0

by Michał Sałaban, 15 years ago

Corrected bugs.

comment:14 by Michał Sałaban, 15 years ago

This patch has bugs:

  • Ordering models by model._meta.pk.attname fails on models where primary key is OneToOneField ('parent' vs 'parent_id').
  • JSON separator is not placed between model entries, due to first=True inside loop The attached patch corrects them.

comment:15 by Michał Sałaban, 15 years ago

I mean dumpdata_streamed_output_django1.0-fix.diff fixes the bugs mentioned above and there are no more bugs known to me :)

comment:16 by Martin, 15 years ago

I installed the patch, but still have to watch django eat up all the memory without any output, when I try to dump my database (>700.000 records) :(
Debug is False. Is there anything else where I could go wrong? What's the format with the least memory use?

comment:17 by Jason McVetta, 15 years ago

Cc: Jason McVetta added
Keywords: dumpdata loaddata fixtures memory added

It appears the latest patch has already been applied to trunk:

$ patch -p1 < ~/incoming/dumpdata_streamed_output_django1.0-fix.diff 
patching file django/core/management/commands/dumpdata.py
Reversed (or previously applied) patch detected!  Assume -R? [n]

I experienced no problems using dumpdata to export a ~160,000 record fixture, in both json and yaml formats. However, while I was able run loaddata on the json fixture without incident, attempting to load the yaml fixture quickly overwhelmed the memory on my 4GB workstation.

comment:18 by Peter Bengtsson, 15 years ago

@jmcvetta I don't think this has yet been added to trunk. Just checked today, 7 Apr 2009.
Glad to hear that you got it working with your patch but I don't think I can use that patch any more on the trunk since it appears to be very different.

comment:19 by phaesler, 15 years ago

The problem is not in dumpdata, it's in the Django Serialisation code. The core.serializers.python.Serializer doesn't start writing to the output stream until all objects to be serialised have been read into memory.

The attached patches won't work until the above is addressed.

comment:20 by phaesler, 15 years ago

I just opened this: #11258

comment:21 by Russell Keith-Magee, 15 years ago

Dumpdata is a light wrapper around the serializers. Any problem with dumpdata that goes deeper than command arguments not being parsed correctly points to a deeper issue with the serialiers themselves.

comment:22 by phaesler, 15 years ago

Component: django-admin.pySerialization
Patch needs improvement: set

OK, then we're agreed this is a Serializer issue.

One approach would be to have the format Serializers handle the "list of objects" logic manually. E.g. for JSON, have start_serialization write "to the stream; have end_object write a comma to stream if needed and dump the object tree to stream using simplejson; end_serialization needs to write the"; the existing objects member should be removed all together. I might have a bash at a patch this weekend.

comment:23 by phaesler, 15 years ago

Oops. Should have used preview. That should read:

One approach would be to have the format Serializers handle the "list of objects" logic manually. E.g. for JSON, have start_serialization write open-square-bracket to the stream; have end_object write a comma to stream if needed and dump the object tree to stream using simplejson; end_serialization needs to write the close-square-bracket; the existing objects member should be removed all together.

I might have a bash at a patch this weekend.

comment:24 by phaesler, 15 years ago

Bah - half these patches do exactly that. I'll shut up now.

comment:25 by Russell Keith-Magee, 15 years ago

#10301 marked as a duplicate. It has a patch that could be an alternative solution.

comment:26 by Torsten Bronger, 14 years ago

Cc: Torsten Bronger added

comment:27 by Ramiro Morales, 13 years ago

Owner: changed from nobody to Ramiro Morales
Status: newassigned

by Ramiro Morales, 13 years ago

Attachment: 5423.1.diff added

Patch updated to current trunk status

comment:28 by Ramiro Morales, 13 years ago

Patch needs improvement: unset

comment:29 by Ramiro Morales, 13 years ago

Keywords: loaddata removed

Removing loaddata keyword. Loaddata has its own ticket tracking memory its usage issues (#12007).

Please test the latest patch and report back your experiences. I have created a test environment to measure, graph and compare memory usage: https://bitbucket.org/cramm/serializers_memory_usage/src

by Ramiro Morales, 13 years ago

Attachment: 5423.2.diff added

Another update, with fixes to JSON output so it is always syntactically correct.

by Ramiro Morales, 13 years ago

Attachment: json-postgresq-trunk.png added

Memory usage with dumpdata of 50,000 objects, Postgres, JSON format, unpatched trunk

by Ramiro Morales, 13 years ago

Memory usage with dumpdata of same 50,000 objects, Postgres, JSON format, trunk patched with 5423.2.diff

comment:30 by Ramiro Morales, 13 years ago

Results obtained testing with the setup described at and downloadable from https://bitbucket.org/cramm/serializers_memory_usage/src (serialization of 50,000 instances of this model to JSON). Results with XML, YAMS and MySQL, SQLite3 are similar.

Unpatched trunk

Memory usage with dumpdata of 50,000 objects, Postgres, JSON format, unpatched trunk

Patched trunk

Memory usage with dumpdata of same 50,000 objects, Postgres, JSON format, trunk patched with 5423.2.diff

comment:31 by Boris Savelev <boris.savelev@…>, 13 years ago

I try dumpdata with latest patch

python manage.py dumpdata -a > dump.json

at the end of work from top:

446m 429m 3460 R 96.4 12.3   2:54.07 python manage.py dumpdata -a
ls -lha dump.json 
-rw-r--r-- 1 boris boris 77M Мар 10 17:50 dump.json

I din't think that it is ok.

Also, dumpdata doesn't flush data on disk while working
data flush only at the end

Last edited 13 years ago by Ramiro Morales (previous) (diff)

in reply to:  31 comment:32 by Ramiro Morales, 13 years ago

Replying to Boris Savelev <boris.savelev@…>:

Thanks for testing the patch. I've reformatted your comment so it is more readable. Can you tell us which database backend are you using?

comment:33 by Boris Savelev <boris.savelev@…>, 13 years ago

thanks for formating-)

I use MySQL backend.

comment:34 by Mikhail Korobov, 13 years ago

Cc: kmike84@… added

comment:35 by Tom Christie, 13 years ago

Cc: Tom Christie added

comment:36 by Gabriel Hurley, 13 years ago

Severity: Normal
Type: New feature

comment:37 by Brian Neal, 13 years ago

Cc: bgneal@… added

comment:38 by rm_, 13 years ago

Cc: riccardo.magliocchetti@… added
Easy pickings: unset

by pzinovkin, 13 years ago

Attachment: 5423.3.diff added

Patch updated to current trunk

comment:39 by pzinovkin, 13 years ago

Triage Stage: AcceptedReady for checkin
UI/UX: unset

comment:40 by anonymous, 13 years ago

Patch needs improvement: set
Triage Stage: Ready for checkinAccepted

Unfortunately the patch isn't ready, some comments and further testing showed that in the PostgreSQL case big ORM queries are still consuming too mucho memory.

I suspect the patch solves things in the Django side because it implements usage of QuerySet.iterator(). But we still need to add to e.g. our PostgreSQL backend the ability to signal psycopg2/RDBMS it shouldn't get all the queryset from the server, maybe using server-side cursors.

Last edited 12 years ago by Ramiro Morales (previous) (diff)

comment:42 by pelletier.thomas@…, 13 years ago

Cc: pelletier.thomas@… added

comment:43 by znick, 13 years ago

Cc: znick added

comment:44 by ben@…, 13 years ago

Cc: ben@… added

in reply to:  40 comment:45 by Dan McGee, 13 years ago

Replying to ramiro:

Unfortunately the patch isn't ready, some comments and further testing showed that in the PostgreSQL case big ORM queries are still consuming too mucho memory.

I suspect the patch solves things in the Django side because it implemenst usage of QuerySet.iterator(). But we still need to add to e.g. outr PostgreSQL backend the ability to signal psycopg2/RDBMS it shouldn't get all the queryset from the server, maybe using server-side cursors.

I'm going to respectfully disagree with your switch away from "Ready for checkin".

  1. This patch still improves the situation heavily, as it eliminates the double caching problems that existed in both database drivers and the Django querysets.
  2. Looking into server side cursors, I was going to attempt to implement this, but it turns out the Django ORM is built with a fundamental limitation- that you have one connection with one cursor to a given database, period. Take a look at django/db/backends/__init__.py, you will see the cursor() method always returns the single cached cursor object from the BaseDatabaseWrapper object. This works if every query is fully executed and the cursor is available for immediate reuse, but with any sort of iteration, you can't touch the shared cursor object and would need to create a new one solely for that purpose. This brings with it a whole new bag of worms- now you have to encapsulate the cursor and close it when done, among other things, which is a non-trivial addition to the current database code.

So in short, I think holding off on applying this is the wrong decision, and another bug should be opened, related to #13869 and this one, that adds full support for server side cursors and the management of them, or their equivalent, in every database backend. Then it should be a simple 1 or 2 line adjustment to make dumpdata take advantage of this new feature.

comment:46 by Dan McGee, 13 years ago

My last comment looks like it isn't totally correct- Django does give you a new cursor each time you ask for one. Apparently we never close cursors? Not sure if this is a potential problem in some DBMSs.

With that said, I still stand by my view that server-side cursors is fundamentally a different problem and should not hold this patch from going in.

comment:47 by Dan McGee, 13 years ago

New ticket opened with preliminary patch for server-side cursors: #16614.

comment:48 by Dan McGee, 13 years ago

Cc: dpmcgee@… added

by Jann Kleen, 12 years ago

Attachment: 5423.4.diff added

combined the original patch with 5423.3.diff

comment:49 by Jann Kleen, 12 years ago

When using patch 5423.1.diff, 5423.2.diff or 5423.3.diff, we're still saving all the data in a StreamIO object instead of writing it directly to sys.stdout, which caused huge memory usage again...
I combined 5423.diff and 5423.3.diff which solved the problem for me.

comment:50 by Jann Kleen, 12 years ago

Cc: jann@… added

comment:51 by pzinovkin, 12 years ago

Just tried new patch with mysql. For me it's still consumes about 250 megs. But it's way more better than "out of memory".

311m 257m 4040 R   79 25.8   4:36.47 python
236M 2011-10-02 11:28 dump.json

in reply to:  51 comment:52 by Jann Kleen, 12 years ago

Ah, I forgot to post my memory usage ... I have about 50 models with about 3.5m rows in total...

It results in a 2.2G json file and peaks at about 173m of used memory during the process.

We're using PostgreSQL by the way.

Replying to pzinovkin:

Just tried new patch with mysql. For me it's still consumes about 250 megs. But it's way more better than "out of memory".

311m 257m 4040 R   79 25.8   4:36.47 python
236M 2011-10-02 11:28 dump.json

comment:53 by hakon.erichsen@…, 12 years ago

Cc: hakon.erichsen@… added

comment:54 by Chris Spencer, 12 years ago

Cc: Chris Spencer added

comment:55 by Mitar, 12 years ago

Cc: mmitar@… added

by Luke Plant, 12 years ago

Attachment: 5423.5.diff added

Updated for trunk

comment:56 by Luke Plant, 12 years ago

Attempted to update for trunk - will one of the original authors please check this, and then mark it Ready-For-Checkin and I will commit.

I confirm that this helps a lot - one dumpdata process I tried with went from max 247MB to max 77MB.

I agree with toofishes that we should not wait for the much bigger changes that would be required to address the further caching problems with the DB drivers.

comment:57 by nickname123, 12 years ago

Cc: wgordonw1@… added

comment:58 by rm_, 12 years ago

As promised on IRC to cramm (xrmx here) here's some results with / without latest lukeplant's patch on top of django 1.4b1.
First my environment is a bit peculiar, i have a fixed limit of address space (the sum of all kind of memory) a process can use, in this case 96MB. So if my process consume more than 96MB it gets killed.

Before patch:

../venv14/bin/python manage.py dumpdata store > store.json
/venv14/lib/python2.6/site-packages/django/conf/__init__.py:75: DeprecationWarning: The ADMIN_MEDIA_PREFIX setting has been removed; use STATIC_URL instead.
  "use STATIC_URL instead.", DeprecationWarning)
Error: Unable to serialize database: 

process got killed so no backtrace printed, store.json is empty

After patch:

../venv14/bin/python manage.py dumpdata store > store.json
/venv14/lib/python2.6/site-packages/django/conf/__init__.py:75: DeprecationWarning: The ADMIN_MEDIA_PREFIX setting has been removed; use STATIC_URL instead.
  "use STATIC_URL instead.", DeprecationWarning)
$ ls -l store.json
17496440 2012-03-05 10:02 store.json

To conclude the patch makes a huge difference in my environment, from not working to working.

comment:59 by Simon Charette, 12 years ago

Patch needs improvement: unset
Triage Stage: AcceptedReady for checkin

The patch looks good to me. I tried running it against a database that use to exhaust all memory and never achieved to actually finish dumping the data and it succeed.

comment:60 by Claude Paroz <claude@…>, 12 years ago

Resolution: fixed
Status: assignedclosed

In [3b5083bee5e96539dec599106aece9889e70ce05]:

Fixed #5423 -- Made dumpdata output one row at a time.

This should prevent storing all rows in memory when big sets of
data are dumped.
See ticket for heroic contributors.

comment:61 by barry.melton@…, 12 years ago

With these changes, serialized objects now seem to start with "[ ,", which is wrong (including a comma, even on the first result.)

This patch allows me to export my data sets correctly, but does so with an invalid JSON schema, and does not allow me to import them.

in reply to:  61 comment:62 by Claude Paroz, 12 years ago

Replying to barry.melton@…:

With these changes, serialized objects now seem to start with "[ ,", which is wrong (including a comma, even on the first result.)

Looking at the test cases, I don't think your assertion is True. However, it is possible your use case is not covered by test cases. So please provide us more details so as we can try to reproduce the faulty JSON result.

Note: See TracTickets for help on using tickets.
Back to Top