Opened 9 years ago
Last modified 9 years ago
#26530 new New feature
Batch operations on large querysets
Reported by: | Marc Tamlyn | Owned by: | nobody |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | | Triage Stage: | Accepted |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
It's a pretty common requirement to need to perform an operation on all elements of a large queryset - especially in data migrations, where you may not be able to express the updates needed purely as an update() call or similar.
It would be very handy to have a batching operation, something like Queryset.batch(lambda obj: obj.do_something(), batch_size=100). This would page through the queryset, performing the lambda on every object without holding the whole queryset in memory at once. Naming is obviously up for discussion. I must have written code to do something like this dozens of times.
This is quite similar to how a Paginator works, but the API for pagination feels a bit funny for doing this and it's not much less code than writing it yourself.
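As a rough illustration only (none of this is existing Django API; the helper name, its signature, MyModel and do_something are all hypothetical), the proposed operation could look something like a standalone function that pages through the queryset with slices:

    # Hedged sketch of the proposed behaviour as a plain function.
    def batch_apply(queryset, func, batch_size=100):
        """Call func on every object, loading only batch_size objects at a time."""
        count = queryset.count()
        for start in range(0, count, batch_size):
            # Each slice issues its own LIMIT/OFFSET query, so only one
            # batch of objects is held in memory at any point.
            for obj in queryset[start:start + batch_size]:
                func(obj)

    # Roughly equivalent to the ticket's example:
    # batch_apply(MyModel.objects.all(), lambda obj: obj.do_something(), batch_size=100)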
Change History (8)
comment:1 by , 9 years ago
comment:2 by , 9 years ago
There are a few ways of approaching it:
Copying roughly a paginator:
    count = qs.count()
    pointer = 0
    while pointer < count:
        for obj in qs[pointer:pointer + batch_size]:
            do_something(obj)
        pointer += batch_size
Basing it off e.g. a sequential id (the same approach can also apply to time series):
    pointer = 0
    while True:
        # work from oldest first so incoming objects during the run will get processed
        batch = qs.filter(id__gt=pointer).order_by('id')[:batch_size]
        if not batch:
            break
        for obj in batch:
            pointer = obj.id
            do_something(obj)
The operation should also ideally apply to a values or values_list queryset. This is a similar piece of code which doesn't have to worry about memory as much:
    user_ids = list(qs.values_list('id', flat=True))  # ids are small enough to hold in memory
    while user_ids:
        batch, user_ids = user_ids[:100], user_ids[100:]
        queue_task(batch)
My motivation for this patch is twofold - partly I'm bored of writing similar code when dealing with large querysets, but also I have seen many developers debugging issues with their code because they haven't realised that querysets of 10k+ objects in memory are problematic. Having an easy, documented API, with warnings about why you need this, should help people become aware of the issues and make it easy for them to fix them.
A better API suggestion could be for batch in qs.batch(size=100). This means quite possibly fixing your broken code is just changing one line.
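For illustration, that one-line change might look like the following, assuming the proposed (not yet existing) qs.batch() API and a placeholder do_something():

    # Before: iterating the queryset directly caches every object in memory.
    for obj in qs:
        do_something(obj)

    # After: with the proposed API, only one batch of 100 objects is held at a time.
    for batch in qs.batch(size=100):
        for obj in batch:
            do_something(obj)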
comment:3 by , 9 years ago
If the idea is to do something for each object, then
    for obj in qs.iterator():
        obj.do_something()
should give you a lot better memory efficiency. Of course, if using PostgreSQL, the driver will still fetch all the rows into memory.
A very good approach would be to finally tackle the named cursors issue. Then you could just do:
    for obj in qs.iterator(cursor_size=100):
        obj.do_something()
and be done with it. The problem with the named cursor approach is that some databases have more or less hard to overcome limitations of what can be done with the cursor, how transactions work and so on.
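For background, a named (server-side) cursor at the raw psycopg2 level looks roughly like this; the iterator(cursor_size=...) idea above would have the ORM drive something similar. The connection string and table name here are placeholders:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # placeholder connection settings
    # Giving the cursor a name makes psycopg2 create a server-side cursor,
    # so rows are streamed from the server instead of being fetched all at once.
    cur = conn.cursor(name="large_queryset_cursor")
    cur.itersize = 100  # rows fetched from the server per round trip while iterating
    cur.execute("SELECT * FROM myapp_mymodel")
    for row in cur:
        ...  # process one row at a time without holding the full result set
    cur.close()
    conn.close()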
If you really want batches of objects, then we probably need to use the pointer approach. Otherwise iterating through a large queryset will end up doing queries like select * from the_query offset 100000 limit 100, which is very inefficient, and concurrent modifications could end up introducing the same object in multiple batches.
I'm mildly in favor of adding this, as the addition to API surface isn't large, and there are a lot of ways to implement the batching in mildly wrong ways.
If we are going for this, then I think the API should be the for batch in qs.batch(size=100) one. The queryset should be ordered in such a way that the primary key is the only sorting criterion. We can change that later so that the primary key or some other unique key is a postfix of the order by, but that is a bit harder to do.
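A minimal sketch of what a pk-ordered, keyset-based qs.batch(size=100) could do under the hood, written here as a standalone generator (illustrative only; the name and default are assumptions, not an actual implementation):

    def batch(queryset, size=100):
        """Yield lists of objects in primary key order, size objects at a time.

        Each chunk is fetched with pk__gt=<last seen pk> rather than
        OFFSET/LIMIT, so later batches stay cheap and a row already yielded
        cannot reappear in a subsequent batch.
        """
        qs = queryset.order_by('pk')
        last_pk = None
        while True:
            chunk_qs = qs if last_pk is None else qs.filter(pk__gt=last_pk)
            chunk = list(chunk_qs[:size])
            if not chunk:
                return
            last_pk = chunk[-1].pk
            yield chunk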
comment:4 by , 9 years ago
Heh, I actually didn't know about iterator()... Good work reading the docs, Marc.
Batching for background tasks is definitely common though.
comment:5 by , 9 years ago
I'm ok with the batch(size=100) approach. I haven't had a need for this myself, and I'm not sure how common such cases are. But if this is something commonly needed, then I think having an ORM method that does this in a way that is both efficient and correct when run on a table that is modified concurrently is the way to go.
Thinking about this a bit more, batching on primary key ordering is the only safe approach if we want to definitely avoid seeing the same object in multiple batches. Otherwise, for example:
    for batch in qs.order_by('mod_date', 'pk').batch(size=1000):
        for obj in batch:
            obj.foo = 'bar'
            obj.mod_date = datetime.now()
            obj.save()
could end up in an infinite loop.
EDIT: If we had the named cursor based iterator available, then we could easily batch on any queryset without a risk of concurrent modifications causing issues.
comment:6 by , 9 years ago
The server-side cursors ticket is #16614. Should we mark this as a duplicate of that one?
comment:7 by , 9 years ago
It would be nice to see a native solution to this issue. We have had to do this on occasion (at G Adventures) and it is a technique used by django-haystack, for example. That project also calls reset_queries(), but I have not tested the effects of that call (just copied the pattern).
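The pattern referred to is, roughly, slice-based batching combined with django.db.reset_queries(). A hedged reconstruction (not django-haystack's actual code; the function name, slice size and do_something() are placeholders) looks like:

    from django.db import reset_queries

    def process_in_slices(qs, slice_size=1000):
        total = qs.count()
        for start in range(0, total, slice_size):
            for obj in qs[start:start + slice_size]:
                do_something(obj)  # placeholder for the real per-object work
            # With DEBUG=True Django logs every query on the connection;
            # clearing that log stops long batch runs from accumulating memory.
            reset_queries()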
comment:8 by , 9 years ago
Triage Stage: Unreviewed → Accepted
Could you give an example of the boilerplate code you intend to replace?