Code

Opened 7 years ago

Closed 2 years ago

Last modified 2 years ago

#5423 closed New feature (fixed)

"dumpdata" should stream output one row at a time

Reported by: adrian Owned by: ramiro
Component: Core (Serialization) Version: master
Severity: Normal Keywords: dumpdata fixtures memory
Cc: jmcvetta, bronger, kmike84@…, tomchristie, bgneal@…, riccardo.magliocchetti@…, pelletier.thomas@…, znick, ben@…, dpmcgee@…, jann@…, hakon.erichsen@…, Cerin, mmitar@…, wgordonw1@… Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

I attempted to run django-admin.py dumpdata on a database table with ~2,000 records, and it ran out of memory. This problem would be avoided if dumpdata output its dump one line at a time, like an iterator does.

Attachments (13)

5423.diff (798 bytes) - added by russellm 7 years ago.
Possible fix for problem
django_dumpdata_streamed_output.diff (5.4 KB) - added by Bastian Kleineidam <calvin@…> 7 years ago.
streamed json output
django_dumpdata_streamed_output_2.diff (4.8 KB) - added by Bastian Kleineidam <calvin@…> 7 years ago.
Cleanup up revision, using objects.iterator()
django_dumpdata_streamed_output_3.diff (6.5 KB) - added by calvin@… 6 years ago.
Newest version, fixing modeltests
dumpdata_streamed_output_django1.0.diff (5.8 KB) - added by captnswing 6 years ago.
fixed patch to work with django 1.0
dumpdata_streamed_output_django1.0-fix.diff (6.7 KB) - added by emes 6 years ago.
Corrected bugs.
5423.1.diff (5.7 KB) - added by ramiro 3 years ago.
Patch updated to current trunk status
5423.2.diff (5.9 KB) - added by ramiro 3 years ago.
Another update, with fixes to JSON output so it is always syntactically correct.
json-postgresq-trunk.png (40.5 KB) - added by ramiro 3 years ago.
Memory usage with dumpdata of 50,000 objects, Postgres, JSON format, unpatched trunk
json-postgresq-trunk-patched.png (26.3 KB) - added by ramiro 3 years ago.
Memory usage with dumpdata of same 50,000 objects, Postgres, JSON format, trunk patched with 5423.2.diff
5423.3.diff (6.0 KB) - added by pzinovkin 3 years ago.
Patch updated to current trunk
5423.4.diff (5.7 KB) - added by JannKleen 3 years ago.
combined the original patch with 5423.3.diff
5423.5.diff (5.4 KB) - added by lukeplant 2 years ago.
Updated for trunk

Download all attachments as: .zip

Change History (75)

comment:1 Changed 7 years ago by adrian

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Triage Stage changed from Unreviewed to Accepted

comment:2 Changed 7 years ago by jacob

Just as a point of reference, I dumped a table with 100k+ rows into YAML without problems... so there's obviously a situation-dependant quality to this one.

Changed 7 years ago by russellm

Possible fix for problem

comment:3 Changed 7 years ago by Rob Hudson <treborhudson@…>

During django sprint... tested before and after the patch and noticed no effect. The dump consumed about 100M of my free memory while it was churning, then dumped everything to screen and the 100M became free again.

comment:4 Changed 7 years ago by Rob Hudson <treborhudson@…>

After a bit more testing, I've discovered:

  • Using the default JSON format stream doesn't appear to stream -- it churns until all rows are in memory and then it dumps to stdout. (The XML serializer contrasts this by dumping to stdout immediately.)
  • Using --format=xml the stream option does in fact stream to stdout, but in the end it still consumes the same amount of memory. Perhaps there's a bug in that memory isn't being freed as it iterates?

comment:5 Changed 7 years ago by Bastian Kleineidam <calvin@…>

  • Has patch set

Here is a patch that streams json data per object. It also uses a generator method for the object list instead of storing everything in a list beforehand. This should get rid of your memory problems (you are welcome to test this patch).

Note that yaml output has the same streaming problem which is not fixed here.

Changed 7 years ago by Bastian Kleineidam <calvin@…>

streamed json output

comment:6 Changed 7 years ago by Bastian Kleineidam <calvin@…>

Whoops, please ignore the date formatting changes from the patch.

comment:7 Changed 7 years ago by Bastian Kleineidam <calvin@…>

  • Patch needs improvement set

comment:8 Changed 7 years ago by Densetsu no Ero-sennin <densetsu.no.ero.sennin@…>

I used this patch and changed model.objects.all() to model.objects.iterator() in django/core/management/commands/dumpdata.py. After that I was successfully able to dump a considerably big database (1Gb SQLite file). Python process resident memory size was at about 20M all the time.

comment:9 Changed 7 years ago by SmileyChris

Thanks Densetsu. Looks pretty good from a quick skim over - someone else from this ticket want to review the patch and report back?

comment:10 Changed 7 years ago by Bastian Kleineidam <calvin@…>

  • Patch needs improvement unset

The .iterator() thing was the missing piece, nice work. I am adding a cleaned up patch that does not have the date formatting changes from my local repository.

Changed 7 years ago by Bastian Kleineidam <calvin@…>

Cleanup up revision, using objects.iterator()

comment:11 Changed 7 years ago by anonymous

I have applied the latest patch 'django_dumpdata_streamed_output_2.diff' from 10/20/07, on the latest version from svn.
I have a huge database (> 300,000 entries - sqlite db)

The error is: "Error: Unable to serialize database: 'NoneType' object has no attribute 'strftime'"

Changed 6 years ago by calvin@…

Newest version, fixing modeltests

comment:12 Changed 6 years ago by wiswaud

Even with django_dumpdata_streamed_output_3.diff, i still go OOM trying to dump tables with 250k rows in them to json.

comment:13 Changed 6 years ago by wiswaud

Apologies to everyone: i'm an idiot. It does stream and doesn't eat memory, django_dumpdata_streamed_output_3.diff works fine with 250k objects... I had DEBUG=True.

Changed 6 years ago by captnswing

fixed patch to work with django 1.0

Changed 6 years ago by emes

Corrected bugs.

comment:14 Changed 6 years ago by emes

This patch has bugs:

  • Ordering models by model._meta.pk.attname fails on models where primary key is OneToOneField ('parent' vs 'parent_id').
  • JSON separator is not placed between model entries, due to first=True inside loop The attached patch corrects them.

comment:15 Changed 6 years ago by emes

I mean dumpdata_streamed_output_django1.0-fix.diff fixes the bugs mentioned above and there are no more bugs known to me :)

comment:16 Changed 5 years ago by Martin

I installed the patch, but still have to watch django eat up all the memory without any output, when I try to dump my database (>700.000 records) :(
Debug is False. Is there anything else where I could go wrong? What's the format with the least memory use?

comment:17 Changed 5 years ago by jmcvetta

  • Cc jmcvetta added
  • Keywords dumpdata loaddata fixtures memory added

It appears the latest patch has already been applied to trunk:

$ patch -p1 < ~/incoming/dumpdata_streamed_output_django1.0-fix.diff 
patching file django/core/management/commands/dumpdata.py
Reversed (or previously applied) patch detected!  Assume -R? [n]

I experienced no problems using dumpdata to export a ~160,000 record fixture, in both json and yaml formats. However, while I was able run loaddata on the json fixture without incident, attempting to load the yaml fixture quickly overwhelmed the memory on my 4GB workstation.

comment:18 Changed 5 years ago by peterbe

@jmcvetta I don't think this has yet been added to trunk. Just checked today, 7 Apr 2009.
Glad to hear that you got it working with your patch but I don't think I can use that patch any more on the trunk since it appears to be very different.

comment:19 Changed 5 years ago by phaesler

The problem is not in dumpdata, it's in the Django Serialisation code. The core.serializers.python.Serializer doesn't start writing to the output stream until all objects to be serialised have been read into memory.

The attached patches won't work until the above is addressed.

comment:20 Changed 5 years ago by phaesler

I just opened this: #11258

comment:21 Changed 5 years ago by russellm

Dumpdata is a light wrapper around the serializers. Any problem with dumpdata that goes deeper than command arguments not being parsed correctly points to a deeper issue with the serialiers themselves.

comment:22 Changed 5 years ago by phaesler

  • Component changed from django-admin.py to Serialization
  • Patch needs improvement set

OK, then we're agreed this is a Serializer issue.

One approach would be to have the format Serializers handle the "list of objects" logic manually. E.g. for JSON, have start_serialization write "to the stream; have end_object write a comma to stream if needed and dump the object tree to stream using simplejson; end_serialization needs to write the?"; the existing objects member should be removed all together. I might have a bash at a patch this weekend.

comment:23 Changed 5 years ago by phaesler

Oops. Should have used preview. That should read:

One approach would be to have the format Serializers handle the "list of objects" logic manually. E.g. for JSON, have start_serialization write open-square-bracket to the stream; have end_object write a comma to stream if needed and dump the object tree to stream using simplejson; end_serialization needs to write the close-square-bracket; the existing objects member should be removed all together.

I might have a bash at a patch this weekend.

comment:24 Changed 5 years ago by phaesler

Bah - half these patches do exactly that. I'll shut up now.

comment:25 Changed 5 years ago by russellm

#10301 marked as a duplicate. It has a patch that could be an alternative solution.

comment:26 Changed 4 years ago by bronger

  • Cc bronger added

comment:27 Changed 3 years ago by ramiro

  • Owner changed from nobody to ramiro
  • Status changed from new to assigned

Changed 3 years ago by ramiro

Patch updated to current trunk status

comment:28 Changed 3 years ago by ramiro

  • Patch needs improvement unset

comment:29 Changed 3 years ago by ramiro

  • Keywords loaddata removed

Removing loaddata keyword. Loaddata has its own ticket tracking memory its usage issues (#12007).

Please test the latest patch and report back your experiences. I have created a test environment to measure, graph and compare memory usage: https://bitbucket.org/cramm/serializers_memory_usage/src

Changed 3 years ago by ramiro

Another update, with fixes to JSON output so it is always syntactically correct.

Changed 3 years ago by ramiro

Memory usage with dumpdata of 50,000 objects, Postgres, JSON format, unpatched trunk

Changed 3 years ago by ramiro

Memory usage with dumpdata of same 50,000 objects, Postgres, JSON format, trunk patched with 5423.2.diff

comment:30 Changed 3 years ago by ramiro

Results obtained testing with the setup described at and downloadable from https://bitbucket.org/cramm/serializers_memory_usage/src (serialization of 50,000 instances of this model to JSON). Results with XML, YAMS and MySQL, SQLite3 are similar.

Unpatched trunk

Memory usage with dumpdata of 50,000 objects, Postgres, JSON format, unpatched trunk

Patched trunk

Memory usage with dumpdata of same 50,000 objects, Postgres, JSON format, trunk patched with 5423.2.diff

comment:31 follow-up: Changed 3 years ago by Boris Savelev <boris.savelev@…>

I try dumpdata with latest patch

python manage.py dumpdata -a > dump.json

at the end of work from top:

446m 429m 3460 R 96.4 12.3   2:54.07 python manage.py dumpdata -a
ls -lha dump.json 
-rw-r--r-- 1 boris boris 77M Мар 10 17:50 dump.json

I din't think that it is ok.

Also, dumpdata doesn't flush data on disk while working
data flush only at the end

Last edited 3 years ago by ramiro (previous) (diff)

comment:32 in reply to: ↑ 31 Changed 3 years ago by ramiro

Replying to Boris Savelev <boris.savelev@…>:

Thanks for testing the patch. I've reformatted your comment so it is more readable. Can you tell us which database backend are you using?

comment:33 Changed 3 years ago by Boris Savelev <boris.savelev@…>

thanks for formating-)

I use MySQL backend.

comment:34 Changed 3 years ago by kmike

  • Cc kmike84@… added

comment:35 Changed 3 years ago by tomchristie

  • Cc tomchristie added

comment:36 Changed 3 years ago by gabrielhurley

  • Severity set to Normal
  • Type set to New feature

comment:37 Changed 3 years ago by gremmie

  • Cc bgneal@… added

comment:38 Changed 3 years ago by rm_

  • Cc riccardo.magliocchetti@… added
  • Easy pickings unset

Changed 3 years ago by pzinovkin

Patch updated to current trunk

comment:39 Changed 3 years ago by pzinovkin

  • Triage Stage changed from Accepted to Ready for checkin
  • UI/UX unset

comment:40 follow-up: Changed 3 years ago by anonymous

  • Patch needs improvement set
  • Triage Stage changed from Ready for checkin to Accepted

Unfortunately the patch isn't ready, some comments and further testing showed that in the PostgreSQL case big ORM queries are still consuming too mucho memory.

I suspect the patch solves things in the Django side because it implemenst usage of QuerySet.iterator(). But we still need to add to e.g. outr PostgreSQL backend the ability to signal psycopg2/RDBMS it shouldn't get all the queryset from the server, maybe using server-side cursors.

Version 1, edited 3 years ago by ramiro (previous) (next) (diff)

comment:42 Changed 3 years ago by pelletier.thomas@…

  • Cc pelletier.thomas@… added

comment:43 Changed 3 years ago by znick

  • Cc znick added

comment:44 Changed 3 years ago by ben@…

  • Cc ben@… added

comment:45 in reply to: ↑ 40 Changed 3 years ago by toofishes

Replying to ramiro:

Unfortunately the patch isn't ready, some comments and further testing showed that in the PostgreSQL case big ORM queries are still consuming too mucho memory.

I suspect the patch solves things in the Django side because it implemenst usage of QuerySet.iterator(). But we still need to add to e.g. outr PostgreSQL backend the ability to signal psycopg2/RDBMS it shouldn't get all the queryset from the server, maybe using server-side cursors.

I'm going to respectfully disagree with your switch away from "Ready for checkin".

  1. This patch still improves the situation heavily, as it eliminates the double caching problems that existed in both database drivers and the Django querysets.
  2. Looking into server side cursors, I was going to attempt to implement this, but it turns out the Django ORM is built with a fundamental limitation- that you have one connection with one cursor to a given database, period. Take a look at django/db/backends/__init__.py, you will see the cursor() method always returns the single cached cursor object from the BaseDatabaseWrapper object. This works if every query is fully executed and the cursor is available for immediate reuse, but with any sort of iteration, you can't touch the shared cursor object and would need to create a new one solely for that purpose. This brings with it a whole new bag of worms- now you have to encapsulate the cursor and close it when done, among other things, which is a non-trivial addition to the current database code.

So in short, I think holding off on applying this is the wrong decision, and another bug should be opened, related to #13869 and this one, that adds full support for server side cursors and the management of them, or their equivalent, in every database backend. Then it should be a simple 1 or 2 line adjustment to make dumpdata take advantage of this new feature.

comment:46 Changed 3 years ago by toofishes

My last comment looks like it isn't totally correct- Django does give you a new cursor each time you ask for one. Apparently we never close cursors? Not sure if this is a potential problem in some DBMSs.

With that said, I still stand by my view that server-side cursors is fundamentally a different problem and should not hold this patch from going in.

comment:47 Changed 3 years ago by toofishes

New ticket opened with preliminary patch for server-side cursors: #16614.

comment:48 Changed 3 years ago by toofishes

  • Cc dpmcgee@… added

Changed 3 years ago by JannKleen

combined the original patch with 5423.3.diff

comment:49 Changed 3 years ago by JannKleen

When using patch 5423.1.diff, 5423.2.diff or 5423.3.diff, we're still saving all the data in a StreamIO object instead of writing it directly to sys.stdout, which caused huge memory usage again...
I combined 5423.diff and 5423.3.diff which solved the problem for me.

comment:50 Changed 3 years ago by JannKleen

  • Cc jann@… added

comment:51 follow-up: Changed 3 years ago by pzinovkin

Just tried new patch with mysql. For me it's still consumes about 250 megs. But it's way more better than "out of memory".

311m 257m 4040 R   79 25.8   4:36.47 python
236M 2011-10-02 11:28 dump.json

comment:52 in reply to: ↑ 51 Changed 3 years ago by JannKleen

Ah, I forgot to post my memory usage ... I have about 50 models with about 3.5m rows in total...

It results in a 2.2G json file and peaks at about 173m of used memory during the process.

We're using PostgreSQL by the way.

Replying to pzinovkin:

Just tried new patch with mysql. For me it's still consumes about 250 megs. But it's way more better than "out of memory".

311m 257m 4040 R   79 25.8   4:36.47 python
236M 2011-10-02 11:28 dump.json

comment:53 Changed 3 years ago by hakon.erichsen@…

  • Cc hakon.erichsen@… added

comment:54 Changed 3 years ago by Cerin

  • Cc Cerin added

comment:55 Changed 2 years ago by mitar

  • Cc mmitar@… added

Changed 2 years ago by lukeplant

Updated for trunk

comment:56 Changed 2 years ago by lukeplant

Attempted to update for trunk - will one of the original authors please check this, and then mark it Ready-For-Checkin and I will commit.

I confirm that this helps a lot - one dumpdata process I tried with went from max 247MB to max 77MB.

I agree with toofishes that we should not wait for the much bigger changes that would be required to address the further caching problems with the DB drivers.

comment:57 Changed 2 years ago by nickname123

  • Cc wgordonw1@… added

comment:58 Changed 2 years ago by rm_

As promised on IRC to cramm (xrmx here) here's some results with / without latest lukeplant's patch on top of django 1.4b1.
First my environment is a bit peculiar, i have a fixed limit of address space (the sum of all kind of memory) a process can use, in this case 96MB. So if my process consume more than 96MB it gets killed.

Before patch:

../venv14/bin/python manage.py dumpdata store > store.json
/venv14/lib/python2.6/site-packages/django/conf/__init__.py:75: DeprecationWarning: The ADMIN_MEDIA_PREFIX setting has been removed; use STATIC_URL instead.
  "use STATIC_URL instead.", DeprecationWarning)
Error: Unable to serialize database: 

process got killed so no backtrace printed, store.json is empty

After patch:

../venv14/bin/python manage.py dumpdata store > store.json
/venv14/lib/python2.6/site-packages/django/conf/__init__.py:75: DeprecationWarning: The ADMIN_MEDIA_PREFIX setting has been removed; use STATIC_URL instead.
  "use STATIC_URL instead.", DeprecationWarning)
$ ls -l store.json
17496440 2012-03-05 10:02 store.json

To conclude the patch makes a huge difference in my environment, from not working to working.

comment:59 Changed 2 years ago by charettes

  • Patch needs improvement unset
  • Triage Stage changed from Accepted to Ready for checkin

The patch looks good to me. I tried running it against a database that use to exhaust all memory and never achieved to actually finish dumping the data and it succeed.

comment:60 Changed 2 years ago by Claude Paroz <claude@…>

  • Resolution set to fixed
  • Status changed from assigned to closed

In [3b5083bee5e96539dec599106aece9889e70ce05]:

Fixed #5423 -- Made dumpdata output one row at a time.

This should prevent storing all rows in memory when big sets of
data are dumped.
See ticket for heroic contributors.

comment:61 follow-up: Changed 2 years ago by barry.melton@…

With these changes, serialized objects now seem to start with "[ ,", which is wrong (including a comma, even on the first result.)

This patch allows me to export my data sets correctly, but does so with an invalid JSON schema, and does not allow me to import them.

comment:62 in reply to: ↑ 61 Changed 2 years ago by claudep

Replying to barry.melton@…:

With these changes, serialized objects now seem to start with "[ ,", which is wrong (including a comma, even on the first result.)

Looking at the test cases, I don't think your assertion is True. However, it is possible your use case is not covered by test cases. So please provide us more details so as we can try to reproduce the faulty JSON result.

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as closed
as The resolution will be set. Next status will be 'closed'
The resolution will be deleted. Next status will be 'new'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.