Opened 3 years ago

Closed 3 years ago

#17668 closed Bug (fixed)

prefetch_related does not work in in_bulk

Reported by: gurets@… Owned by: lukeplant
Component: Database layer (models, ORM) Version: 1.4-alpha-1
Severity: Release blocker Keywords: prefetch_related, in_bulk
Cc: anssi.kaariainen@… Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Hi,
When using the in_bulk - prefetch_related does not work.

For example: in_bulk uses django-haystack. load data from its storage (solr, xapian..) and search objects in django db backend.

qs = Book.objects.all().prefetch_related('authors')

books = qs.in_bulk([1,2,3])

books at this time without prefetched authors.

Attachments (1)

17668.diff (3.6 KB) - added by akaariai 3 years ago.

Download all attachments as: .zip

Change History (11)

comment:1 Changed 3 years ago by akaariai

  • Cc anssi.kaariainen@… added
  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Triage Stage changed from Unreviewed to Accepted

The problem is anything using the .iterator() of qs will not do the prefetching. This should be documented, as there is no possibility to both yield object at a time, and do prefetching. Maybe also an assert in the iterator about this would be warranted. Another option is to force the whole query to be iterated over and then doing the prefetch when calling .iterator() and using .prefetch_related() in the same query.

Marking as accepted. Maybe this should be a release blocker?

comment:2 Changed 3 years ago by akaariai

Ok, I have a patch for this. It is now an error to use .iterator() together with .prefetch_related(). in_bulk does not use the iterator any more. It first forces the query into a list.

Changed 3 years ago by akaariai

comment:3 Changed 3 years ago by julien

  • Severity changed from Normal to Release blocker

Thank you both for the report and patch. Marking this ticket as release blocker since this is a limitation/bug in a new feature.

comment:4 Changed 3 years ago by gurets

Thank you!

comment:5 Changed 3 years ago by lukeplant

I'm not sure about this solution.

I agree that in_bulk does not need to use iterator() - it doesn't gain much by using it, since it immediately consumes the whole generator, it only avoids the creation of a lists. This part of the patch has a serious bug though - it should be list(qs) not list(self). In fact it can be simplified to just: return dict([(obj._get_pk_val(), obj) for obj in qs]) - no need to create a list we will throw away.

However, making iterator() blow up if prefetch_related() is used is more dubious. If you have a 3rd party library, or just some generic code, that uses your model (but is agnostic to your model), and uses iterator(), this patch will cause it to blow up if you used prefetch_related() on a default manager, for example. This patch effectively means that many instances of iterator() would need patching to be safe in the general case. I don't think people who use iterator() shouldn't have to worry about clearing prefetch_related() first - that's a nasty little gotcha that will clutter people's code.

So at the moment I think this should be a documentation fix for iterator(). The note should probably go on both iterator() and prefetch_related() docs.

comment:6 Changed 3 years ago by akaariai

There are three possible choices for the iterator() + prefetch_related():

  1. Raise error.
  2. Favor prefetch: when iterator is used, fetch all the objects and related objects in one go, then yield them one at a time.
  3. Favor iterator: throw out the defined prefetch, and do the iterator properly. This is what is currently done.

I suggested option 1 already, but that wasn't favored. From the rest option 2 is in my opinion better than the currently used option 3. Just because the potential performance loss is less in that case. You would get the reduced memory usage given by .iterator() when it is possible to do that. If there is prefetches, they are done correctly, but you lose the memory efficiency of iterator(). For example, in_bulk would have worked perfectly without any changes to it with option 2.

Option 2 should be easy to implement, where the error is raised currently, do:

for obj in list(self):
    yield obj

I haven't actually tested this though...

Anyways, which option is chosen isn't that important. I could live with any of them.

Version 0, edited 3 years ago by akaariai (next)

comment:7 Changed 3 years ago by lukeplant

  • Owner changed from nobody to lukeplant

I thought I had replied here, but obviously not. Thanks for setting this down clearly.

I favour option 3 for the following reasons:

iterator() by definition always comes last, and therefore is always in the code that actually generates the query. It is confusing that your iterator() call gets effectively ignored because of something that got passed in to the code in front of you, over which you might have little control, and I think more confusing than the your prefetch_related() getting ignored because of something that comes later. There are many ways in which QuerySet calls can get ignored by something that comes later e.g. order_by() always works this way, prefetch_related() itself works this way if you do prefetch_related(None).

I also don't think that we can correctly judge the potential performance loss - there will almost certainly be cases which contradict our intuition. We should therefore just go on what is the most obvious API, rather than trying to second-guess the developer or the situation.

comment:8 Changed 3 years ago by akaariai

I am OK with the above.

Should there be a way to see if there are prefetches? It would be cheap to add a way to see the current prefetches, this would give a chance to skip the .iterator() call if there are prefetches defined.

Or maybe just document the current behavior, fix in_bulk and be done with it. If somebody complains, add a way to check the prefetches then. I have a feeling I have wasted enough everybody's time on this issue already...

comment:9 Changed 3 years ago by lukeplant

  • Type changed from Uncategorized to Bug

@akaariai: You haven't wasted anyone's time - your input is really appreciated!

I think we should probably wait until someone asks for a way to check for prefetches. I guess the whole QuerySet API is a bit lacking in this regard - I don't think we have a public API for checking other things like select_related and order_by etc. Perhaps there would be an API that covers all of those e.g. get_query_options() that returns a dictionary of things that have been set.

comment:10 Changed 3 years ago by lukeplant

  • Resolution set to fixed
  • Status changed from new to closed

In [17600]:

Fixed #17668 - prefetch_related does not work in in_bulk

Thanks to gurets for the report, and akaariai for the initial patch.

Note: See TracTickets for help on using tickets.
Back to Top