Opened 8 years ago

Last modified 8 years ago

#26530 new New feature

Batch operations on large querysets

Reported by: Marc Tamlyn
Owned by: nobody
Component: Database layer (models, ORM)
Version: dev
Severity: Normal
Keywords:
Cc:
Triage Stage: Accepted
Has patch: no
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

It's a pretty common requirement to perform an operation on every element of a large queryset - especially in data migrations, where you may not be able to express the changes purely as an update() call or similar.

It would be very handy to have a batching operation, something like QuerySet.batch(lambda obj: obj.do_something(), batch_size=100). This would page through the queryset, loading only one batch into memory at a time and calling the lambda on every object. Naming is obviously up for discussion. I must have written code to do something like this dozens of times.

This is quite similar to how a Paginator works, but the API for pagination feels a bit funny for doing this and it's not much less code than writing it yourself.
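
To make the shape of that API concrete, a minimal sketch of the same behaviour as a plain helper function might look like this; the name batch_process, its signature and MyModel are placeholders for discussion, not a proposed implementation:

def batch_process(queryset, func, batch_size=100):
    """Call func(obj) for every object, loading at most batch_size rows at a time."""
    total = queryset.count()
    for start in range(0, total, batch_size):
        # each slice issues its own LIMIT/OFFSET query
        for obj in queryset[start:start + batch_size]:
            func(obj)

# e.g. batch_process(MyModel.objects.all(), lambda obj: obj.do_something(), batch_size=100)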

Change History (8)

comment:1 by Tim Graham, 8 years ago

Could you give an example of the boilerplate code you intend to replace?

comment:2 by Marc Tamlyn, 8 years ago

There are a few ways of approaching it:

Roughly copying a paginator:

count = qs.count()
pointer = 0
while pointer < count:
    # each slice issues a fresh OFFSET/LIMIT query
    for obj in qs[pointer:pointer + batch_size]:
        do_something(obj)
    pointer += batch_size

Basing it off e.g. a sequential id (the same approach can also apply to time-series data):

pointer = 0
while True:
    # work from oldest first so incoming objects during the run will get processed
    batch = qs.filter(id__gt=pointer).order_by('id')[:batch_size]
    if not batch:
        break
    for obj in batch:
        pointer = obj.id
        do_something(obj)

The operation should ideally also apply to a values() or values_list() queryset; this is a similar piece of code which doesn't have to worry about memory as much:

# evaluate just the ids up front; holding only ids in memory is cheap
user_ids = list(qs.values_list('id', flat=True))
while user_ids:
    batch, user_ids = user_ids[:100], user_ids[100:]
    queue_task(batch)

My motivation for this patch is twofold - partly I'm bored of writing similar code when dealing with large querysets, but also I have seen many developers debugging issues with their code because they hadn't realised that loading 10k+ objects from a queryset into memory is problematic. Having an easy, documented API, with warnings about why you need it, should help people be aware of the issues and make them easy to fix.

A better API suggestion could be for batch in qs.batch(size=100). That way, fixing your broken code is quite possibly just a one-line change.

comment:3 by Anssi Kääriäinen, 8 years ago

If the idea is to do something for each object, then

for obj in qs.iterator():
    obj.do_something()

should give you much better memory efficiency. Of course, if you're using PostgreSQL, the driver will still fetch all the rows into client-side memory.

A very good approach would be to finally tackle the named cursors issue. Then you could just do:

for obj in qs.iterator(cursor_size=100):
    obj.do_something()

and be done with it. The problem with the named cursor approach is that some databases have limitations, which are more or less hard to overcome, on what can be done with the cursor, how transactions work, and so on.
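
For background on what the named cursor idea refers to, this is roughly what a server-side cursor looks like today when driven by hand with psycopg2 underneath Django's connection (PostgreSQL-specific; the SQL, the table name and process() are made up for illustration):

from django.db import connection, transaction

# A named cursor must live inside a transaction, so wrap the work in atomic().
with transaction.atomic():
    # connection.connection is the underlying psycopg2 connection.
    with connection.connection.cursor(name='batch_cursor') as cursor:
        cursor.itersize = 100  # rows fetched from the server per round trip
        cursor.execute('SELECT id, title FROM myapp_book ORDER BY id')
        for row in cursor:
            process(row)  # hypothetical per-row handler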

If you really want batches of objects, then we probably need to use the pointer approach. Otherwise iterating through a large queryset will end up doing queries like select * from the_query offset 100000 limit 100, which is very inefficient, and concurrent modifications could end up introducing the same object in multiple batches.

I'm mildly in favor of adding this, as the addition to the API surface isn't large, and there are a lot of ways to implement the batching subtly wrong.

If we are going for this, then I think the API should be the for batch in qs.batch(size=100) one. The queryset should be ordered so that the primary key is the only sorting criterion. We could relax that later so that the primary key or some other unique key only needs to be a suffix of the order by, but that is a bit harder to do.
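
For illustration, a rough sketch of that behaviour as a free-standing generator, keeping the primary key as the only ordering and paging by pk rather than by offset; the helper name, the size keyword and MyModel are placeholders, not a proposed implementation:

def batch(queryset, size=100):
    """Yield lists of objects, size at a time, ordered only by primary key."""
    qs = queryset.order_by('pk')
    last_pk = None
    while True:
        # keyset paging: continue after the last seen pk instead of using OFFSET
        page = qs if last_pk is None else qs.filter(pk__gt=last_pk)
        chunk = list(page[:size])
        if not chunk:
            break
        last_pk = chunk[-1].pk
        yield chunk

# Usage:
# for chunk in batch(MyModel.objects.all(), size=100):
#     for obj in chunk:
#         obj.do_something()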

comment:4 by Marc Tamlyn, 8 years ago

Heh, I actually didn't know about iterator()... Good work reading the docs, Marc.

Batching for background tasks is definitely common though.

comment:5 by Anssi Kääriäinen, 8 years ago

I'm ok with the batch(size=100) approach. I haven't had a need for this myself, and I'm not sure how common such cases are. But if this is something commonly needed, then I think having an ORM method that does this in a way that is both efficient and correct when run on a table modified concurrently is the way to go.

Thinking about this a bit more, batching on primary key ordering alone is the only safe approach if we want to be certain of not seeing the same object in multiple batches. Otherwise, for example:

for batch in qs.order_by('mod_date', 'pk').batch(size=1000):
    for obj in batch:
        obj.foo = 'bar'
        # bumping mod_date pushes the object to the end of the ordering,
        # so a later batch can pick the same object up again
        obj.mod_date = datetime.now()
        obj.save()

could end up in an infinite loop.

EDIT: If we had the named cursor based iterator available, then we could easily batch on any queryset without a risk of concurrent modifications causing issues.

Last edited 8 years ago by Anssi Kääriäinen

comment:6 by Tim Graham, 8 years ago

The server-side cursors ticket is #16614. Should we mark this as a duplicate of that one or is that just a prerequisite for this one?

Last edited 8 years ago by Tim Graham

comment:7 by Craig Nagy, 8 years ago

It would be nice to see a native solution to this issue. We have had to do this on occasion (at G Adventures), and it is a technique used by django-haystack, for example. That project also calls reset_queries(), but I have not tested the effects of that call (we just copied the pattern).
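
For reference, the pattern being described is roughly the following sketch, where qs and do_something() are placeholders; reset_queries() is relevant because Django logs every executed query on the connection when DEBUG is True, and that log itself grows over a long run:

from django.db import reset_queries

batch_size = 100
for start in range(0, qs.count(), batch_size):
    for obj in qs[start:start + batch_size]:
        do_something(obj)  # placeholder for the real per-object work
    # Clear the per-connection query log (only populated when DEBUG is True)
    # so it doesn't accumulate memory over a long-running job.
    reset_queries()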

comment:8 by Tim Graham, 8 years ago

Triage Stage: Unreviewed → Accepted