Opened 12 years ago

Closed 12 years ago

#17668 closed Bug (fixed)

prefetch_related does not work in in_bulk

Reported by: gurets@… Owned by: Luke Plant
Component: Database layer (models, ORM) Version: 1.4-alpha-1
Severity: Release blocker Keywords: prefetch_related, in_bulk
Cc: anssi.kaariainen@… Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Hi,
When using the in_bulk - prefetch_related does not work.

For example: in_bulk uses django-haystack. load data from its storage (solr, xapian..) and search objects in django db backend.

qs = Book.objects.all().prefetch_related('authors')

books = qs.in_bulk([1,2,3])

books at this time without prefetched authors.

Attachments (1)

17668.diff (3.6 KB ) - added by Anssi Kääriäinen 12 years ago.

Download all attachments as: .zip

Change History (11)

comment:1 by Anssi Kääriäinen, 12 years ago

Cc: anssi.kaariainen@… added
Triage Stage: UnreviewedAccepted

The problem is anything using the .iterator() of qs will not do the prefetching. This should be documented, as there is no possibility to both yield object at a time, and do prefetching. Maybe also an assert in the iterator about this would be warranted. Another option is to force the whole query to be iterated over and then doing the prefetch when calling .iterator() and using .prefetch_related() in the same query.

Marking as accepted. Maybe this should be a release blocker?

comment:2 by Anssi Kääriäinen, 12 years ago

Ok, I have a patch for this. It is now an error to use .iterator() together with .prefetch_related(). in_bulk does not use the iterator any more. It first forces the query into a list.

by Anssi Kääriäinen, 12 years ago

Attachment: 17668.diff added

comment:3 by Julien Phalip, 12 years ago

Severity: NormalRelease blocker

Thank you both for the report and patch. Marking this ticket as release blocker since this is a limitation/bug in a new feature.

comment:4 by gurets, 12 years ago

Thank you!

comment:5 by Luke Plant, 12 years ago

I'm not sure about this solution.

I agree that in_bulk does not need to use iterator() - it doesn't gain much by using it, since it immediately consumes the whole generator, it only avoids the creation of a lists. This part of the patch has a serious bug though - it should be list(qs) not list(self). In fact it can be simplified to just: return dict([(obj._get_pk_val(), obj) for obj in qs]) - no need to create a list we will throw away.

However, making iterator() blow up if prefetch_related() is used is more dubious. If you have a 3rd party library, or just some generic code, that uses your model (but is agnostic to your model), and uses iterator(), this patch will cause it to blow up if you used prefetch_related() on a default manager, for example. This patch effectively means that many instances of iterator() would need patching to be safe in the general case. I don't think people who use iterator() shouldn't have to worry about clearing prefetch_related() first - that's a nasty little gotcha that will clutter people's code.

So at the moment I think this should be a documentation fix for iterator(). The note should probably go on both iterator() and prefetch_related() docs.

comment:6 by Anssi Kääriäinen, 12 years ago

There are three possible choices for the iterator() + prefetch_related():

  1. Raise error.
  2. Favor prefetch: when iterator is used, fetch all the objects and related objects in one go, then yield them one at a time.
  3. Favor iterator: throw out the defined prefetch, and do the iterator properly. This is what is currently done.

I suggested option 1 already, but that wasn't favored. From the rest option 2 is in my opinion better than the currently used option 3. Just because the potential performance loss is less in that case. You would get the reduced memory usage given by .iterator() when it is possible to do that. If there is prefetches, they are done correctly, but you lose the memory efficiency of iterator(). For example, in_bulk would have worked perfectly without any changes to it with option 2.

Option 2 should be easy to implement, where the error is raised currently, do:

for obj in list(self):
    yield obj

I haven't actually tested this though...

Anyways, which option is chosen isn't that important. I could live with any of them.

Last edited 12 years ago by Anssi Kääriäinen (previous) (diff)

comment:7 by Luke Plant, 12 years ago

Owner: changed from nobody to Luke Plant

I thought I had replied here, but obviously not. Thanks for setting this down clearly.

I favour option 3 for the following reasons:

iterator() by definition always comes last, and therefore is always in the code that actually generates the query. It is confusing that your iterator() call gets effectively ignored because of something that got passed in to the code in front of you, over which you might have little control, and I think more confusing than the your prefetch_related() getting ignored because of something that comes later. There are many ways in which QuerySet calls can get ignored by something that comes later e.g. order_by() always works this way, prefetch_related() itself works this way if you do prefetch_related(None).

I also don't think that we can correctly judge the potential performance loss - there will almost certainly be cases which contradict our intuition. We should therefore just go on what is the most obvious API, rather than trying to second-guess the developer or the situation.

comment:8 by Anssi Kääriäinen, 12 years ago

I am OK with the above.

Should there be a way to see if there are prefetches? It would be cheap to add a way to see the current prefetches, this would give a chance to skip the .iterator() call if there are prefetches defined.

Or maybe just document the current behavior, fix in_bulk and be done with it. If somebody complains, add a way to check the prefetches then. I have a feeling I have wasted enough everybody's time on this issue already...

comment:9 by Luke Plant, 12 years ago

Type: UncategorizedBug

@akaariai: You haven't wasted anyone's time - your input is really appreciated!

I think we should probably wait until someone asks for a way to check for prefetches. I guess the whole QuerySet API is a bit lacking in this regard - I don't think we have a public API for checking other things like select_related and order_by etc. Perhaps there would be an API that covers all of those e.g. get_query_options() that returns a dictionary of things that have been set.

comment:10 by Luke Plant, 12 years ago

Resolution: fixed
Status: newclosed

In [17600]:

Fixed #17668 - prefetch_related does not work in in_bulk

Thanks to gurets for the report, and akaariai for the initial patch.

Note: See TracTickets for help on using tickets.
Back to Top