Context Navigation

← Previous Ticket
Next Ticket →

#16614 new New feature

Support server-side cursors for queryset iteration in database backends

Reported by:	Dan McGee	Owned by:	nobody
Component:	Database layer (models, ORM)	Version:	1.3
Severity:	Normal	Keywords:	memory cursors database
Cc:	dpmcgee@…, nikolai@…, trbs@…, benth, Simon Charette, macek@…, riccardo@…, axel.rau@…, clokep@…, josh.smeaton@…, olivier.tabone@…, mail@…	Triage Stage:	Accepted
Has patch:	no	Needs documentation:	no
Needs tests:	no	Patch needs improvement:	no
Easy pickings:	no	UI/UX:	no
Pull Requests:	~~7844~~ merged, ~~7587~~ merged

Description ¶

This is related to concerns raised in #5423 as well as documentation issues noted in #13869.

Attached is a very rough first cut of a possible patch that adds server-side iteration for all backends that need it, as well as turning it on by default in the results_iter() codepaths, which is only used in the iterator() methods of the various QuerySet classes defined in django.db.models.query.

Observations:

SQLite appears to do iteration right out of the box when using fetchmany(), so no changes are needed.
Oracle (via cx_Oracle) also does it right, fetching by default in batches of 50 when fetchmany() is used- see http://cx-oracle.sourceforge.net/html/cursor.html#Cursor.arraysize
PostgreSQL gets functionality for usage of named cursors (http://initd.org/psycopg/docs/connection.html#connection.cursor), which leave the result set on the server side. The default fetch size of sets fetched this way is 2000, but we always use GET_ITERATOR_CHUNK_SIZE, so adjusting this is not necessary. A named cursor can only be used once.
MySQL gets functionality for usage of SSCursor objects (http://mysql-python.sourceforge.net/MySQLdb-1.2.2/public/MySQLdb.cursors.SSCursor-class.html). This uses mysql_fetch_row() under the covers (http://dev.mysql.com/doc/refman/5.5/en/mysql-fetch-row.html). There are two cons here- it appears the library does not support batch fetches because it uses fetch_(single)_row under the covers, so this could result in significantly more chatter. MySQL also has no real concept of a cursor, so while you are using this server-side "cursor", you are not allowed to perform any other simultaneous queries.

Thoughts and feedback are welcome- I can imagine only enabling this by default for PostgreSQL only, as MySQL's implementation leaves something to be desired. I could also see never doing this by default and allowing it to be configured in DATABASE settings.

This was mildy tested with a random dumpdata operation on a random project using PostgreSQL. The max memory used by the dumpdata after applying this patch and the one from FS#5423, piping to /dev/null, went from 50MB to 26MB.

According to the ticket's flags, the next step(s) to move this issue forward are:

To provide a patch by sending a pull request. Claim the ticket when you start working so that someone else doesn't duplicate effort. Before sending a pull request, review your work against the patch review checklist. Check the "Has patch" flag on the ticket after sending a pull request and include a link to the pull request in the ticket comment when making that update. The usual format is: [https://github.com/django/django/pull/#### PR].

Attachments (1)

support-server-side-cursors.patch (6.7 KB ) - added by Dan McGee 14 years ago.

Download all attachments as: .zip

Change History (34)

by Dan McGee, 14 years ago

Attachment:	support-server-side-cursors.patch added

comment:1 by Dan McGee, 14 years ago

Cc:	dpmcgee@… added

comment:2 by nikolai@…, 14 years ago

Cc:	nikolai@… added

comment:3 by Aymeric Augustin, 14 years ago

Needs documentation:	set
Needs tests:	set
Triage Stage:	Unreviewed → Accepted

Based on slides 55-63 of http://thebuild.com/presentations/unbreaking-django.pdf I think it's a good idea.

comment:4 by Anssi Kääriäinen, 13 years ago

I did a very similar implementation based on discussions in https://groups.google.com/group/django-developers/browse_thread/thread/f19040e2e3229d7a

The main difference is that my version adds a queryset method .chunked(), which will use named cursor for PostgreSQL and will not fetch all the results in one go for SQLite3. SQLite3 has a problem if not fetching all the results in one go as updates done while iterating the results will be seen in the results.

The idea for the .chunked() method is that it will be documented as having backend-specific limitations which the .iterator() approach does not have. The abovementioned SQLite3 limitation is one, PostgreSQL has at least two:

As long as the cursor is open (you have a reference to the iterator in the code), server side resources will be tied. This might be important for cases where you open a lot of cursors and then iterate them in a template.
The named cursor is not iterable at all outside of a transaction (or you need to use WITH HOLD cursors, which will tie the resources even longer). This means named cursor will not be usable in autocommit mode.

I did not yet include anything for MySQL. There was some discussions about MySQL and it seems it could have some deadlocking problems.

The above limitations are not backwards compatible with current behavior of .iterator(). So, there might be some reason not to expose this as a default / settings behavior, but as a different queryset method you can use when you really need it. The conditions when you need this are exceptional, in normal HTML generation you will almost never want to use named cursors.

The patch is at: http://www.google.com/url?sa=D&q=https://github.com/akaariai/django/commit/8990e20df50ce110fe6ddbbdfed7a98987bb5835&usg=AFQjCNGtHz9sSHT0tOIWKXAZAypfvPIHyw

comment:5 by anonymous, 13 years ago

Has there been any movement on this issue? Do the core developers have any plans to merge this patch in the near future?

comment:6 by Luke Plant, 13 years ago

akaariai is now a core developer, and could do this if he is still interested. The idea as described in comment 4 sounds solid to me. It would also be fine to have this implemented for some backends and not others IMO.

The link for the patch no longer works. Hopefully Anssi has a record of it somewhere.

comment:7 by Anssi Kääriäinen, 13 years ago

I renamed the repo to django-old when Django was moved to github. Here is a working link: https://github.com/akaariai/django-old/commit/8990e20df50ce110fe6ddbbdfed7a98987bb5835

I can take care of final polish & commit, but I am not too interested in doing a lot of work on this right now. To get this committed making sure the patch works on current git HEAD, and writing some docs & tests is the way to get this into core.

comment:8 by trbs, 12 years ago

Cc:	trbs@… added

comment:9 by benth, 12 years ago

Cc:	benth added

comment:10 by Aymeric Augustin, 12 years ago

See also http://thebuild.com/blog/2010/12/13/very-large-result-sets-in-django-using-postgresql/

comment:11 by Simon Charette, 12 years ago

Cc:	Simon Charette added

comment:12 by Vlada Macek, 11 years ago

Cc:	macek@… added

Is it really in stage Accepted? :)

I think the missing server-side cursor support keeps the Django ORM down...

comment:13 by Riccardo Attilio Galli, 11 years ago

Cc:	riccardo@… added

comment:16 by Axel Rau, 9 years ago

Cc:	axel.rau@… added

comment:17 by Kevin Turner, 9 years ago

I've updated akaariai's patch from 2012 to server_side_cursor_16614.

(not yet ready to submit, still wants docs and tests)

comment:18 by Kevin Turner, 9 years ago

also, it renames the DatabaseFeatures can_use_chunked_reads to has_safe_chunked_reads, is that a private interface that can be renamed freely, or is it a public interface we need to cautious around?

comment:19 by Carl Meyer, 9 years ago

Technically that's not public API and we can change it if we need to, but it does cause problems for third-party database backends (and I think we've said that we'll mention such changes in the release notes?). So we should only change it if it's really valuable to do so. I'm not immediately seeing a gain in clarity from that change that would justify it, but maybe I'm missing some subtle reason why the new name is more accurate? Unless that's so, I'd lean towards leaving it unchanged.

comment:20 by Kevin Turner, 9 years ago

Okay, I've added a test, which is better than no tests, but still leaves some to be desired. That is, this test makes sure that a call to chunked() does not fail horribly, but it doesn't test the salient difference between chunked and iterator, that chunked uses a server-side cursor and doesn't load large result sets in to memory all at once.

Other things that it seems like might be valuable to test:

what if you have two chunked() queries at once, and alternate taking objects from each of them, do they successfully not get in each other's way?
that may be the same as: if you took the _named_cursor_idx out of the chunked_cursor name, what would break?

Considering the API, is it necessary to add a new QuerySet.chunked(), or can we change QuerySet.iterator() to have this behaviour by default? (because it's what I always assumed that iterator did until I ran out of RAM.)

It does change the run-time characteristics to some degree, but I'm having a hard time coming up with a situation where it would really break an existing use case.

comment:21 by Patrick Cloke, 8 years ago

Cc:	clokep@… added

comment:22 by Josh Smeaton, 8 years ago

Cc:	josh.smeaton@… added

comment:23 by Olivier Tabone, 8 years ago

Cc:	olivier.tabone@… added

comment:24 by Josh Smeaton, 8 years ago

For what it's worth I'm keen on the postgres implementation using SSC from the iterator method. That is, no need to implement another queryset method to support this. Testing will need to be done with and without transactions though.

comment:25 by François Freitag, 8 years ago

Cc:	mail@… added
Needs tests:	unset

I have written a draft for this feature based on previous patch by Anssi and Kevin.
PR

comment:26 by François Freitag, 8 years ago

Needs documentation:	unset

The proposed patch is an implementation detail of PostgreSQL. I believe that an entry in the release note would be acceptable.

This flag also prevents the ticket from entering the pool of patches needing review.

Last edited 8 years ago by François Freitag (previous) (diff)

comment:27 by Tim Graham <timograham@…>, 8 years ago

In ee1c1c69:

Made prefetch_related SQL inspection tests less brittle.

After refs #16614, integers might appear outside the WHERE clause.

comment:28 by Tim Graham, 8 years ago

Triage Stage:	Accepted → Ready for checkin

comment:29 by Tim Graham <timograham@…>, 8 years ago

In f3b7c05:

Refs #16614 -- Made QuerySet.iterator() use server-side cursors on PostgreSQL.

Thanks to Josh Smeaton for the idea of implementing server-side cursors
in PostgreSQL from the iterator method, and Anssi Kääriäinen and Kevin
Turner for their previous work. Also Simon Charette and Tim Graham for
review.

comment:30 by Tim Graham, 8 years ago

Has patch:	unset
Triage Stage:	Ready for checkin → Accepted

The ticket remains open for consideration of using server-side cursors on MySQL.

comment:31 by Tim Graham <timograham@…>, 8 years ago

In 6b6be69:

Refs #16614 -- Prevented database errors from being masked by cursor close.

When an error occurred during the cursor.execute statement, the cursor
is closed. This operation did not fail with client-side cursors. Now,
with server-side cursors, the close operation might fail (example
below). The original error should be raised, not the one raised by
cursor.close(), this is only clean-up code.

For example, one can attempt to create a named cursor for an invalid
query. psycopg will raise an error about the invalid query and the
server-side cursor will not be created on PostgreSQL. When the code
attempts to cursor.close(), it asks psycopg to close a cursor that was
not created. pyscopg raises a new error: psycopg2.OperationalError:
cursor "_django_curs_140365867840512_20" does not exist.

comment:32 by Tim Graham <timograham@…>, 8 years ago

In 05bdf4f4:

Refs #16614 -- Called _prepare_cursor() on every created cursor.

comment:33 by Sergey Fursov, 8 years ago

Hi!

I want to raise one case, which had our team after upgrading to 1.11 version:

our app connects to DB through pgbouncer in transaction pooling mode
after applying changes introduced in this ticket, we started receive errors about non-existent named cursors
this obviously related to the fact, that pgBouncer can't work with cursors run "WITH HOLD" option in transaction pooling mode

(more details here: https://groups.google.com/forum/#!topic/django-users/E39ycUilQ3c)

Our team really want to upgrade to 1.11, but in this case we have to remove pgBouncer as connection pooling tool and connect to postgres DB directly.

Looks like another possible option is wrapping all iterator() calls in their own transaction, but django internally use cursors in several places (e.g. in ModelChoiceIterator).

I think this possible problem should be documented in some way, and it would be great to describe some options for projects to migrate to new version.

(I'm sorry for cross-posting this problem here, but i guess our stack could be common for many other projects)

comment:34 by Josh Smeaton, 8 years ago

Hi Sergey,

This is bad. Most people that use pgbouncer would be using it in transaction mode, which makes running any query with .iterator() a breaking change with no option to skip server side cursors. It was actually my suggestion to transparently use server side cursors, but I never considered the impact on something like pgbouncer.

Can you please create a new ticket for this problem, and link it back here? I'm not sure how we'd go about making the situation right, but let's track that discussion on the other ticket.

comment:35 by Sergey Fursov, 8 years ago

Hi Josh,

Thanks for quick answer and confirming my concerns
Here is the new ticket for tracking discussion related to my problem - #28062

Note: See TracTickets for help on using tickets.

Download in other formats:

TracLink
Image macro
× Delete attachment

Issues