Opened 12 years ago

Closed 12 years ago

Last modified 12 years ago

#16937 closed New feature (fixed)

Prefetch related objects

Reported by: Luke Plant Owned by: nobody
Component: Database layer (models, ORM) Version: 1.3
Severity: Normal Keywords:
Cc: jdunck@…, cg@…, lists@…, django@… Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Related to #6432, here is a patch that adds a 'prefetch_related' method to QuerySet. This method causes the requested related objects to be retrieved in a single batch when the main QuerySet is evaluated.

The result is that the O(n) performance problem often related with M2M and reverse foreign key can be eliminated, at least for the simple but common case when you just want to use .all() and not do further filtering. Obviously it should be used with care, like select_related.

Attachments (11)

prefetch.diff (18.0 KB ) - added by Luke Plant 12 years ago.
Patch implementing ticket, with docs and tests
prefetch_2.diff (18.6 KB ) - added by Luke Plant 12 years ago.
Reworked patch
prefetch_3.diff (37.0 KB ) - added by Luke Plant 12 years ago.
added support for arbitrary depth, instead of single depth
prefetch_3.1.diff (37.0 KB ) - added by Luke Plant 12 years ago.
Fix to get tests to run when you run the whole test suite
prefetch_4.diff (37.0 KB ) - added by Luke Plant 12 years ago.
Made API for clearing consistent with QuerySet.defer
16937_additional_tests.diff (9.1 KB ) - added by Anssi Kääriäinen 12 years ago.
Additional tests
prefetch_5.diff (43.3 KB ) - added by Luke Plant 12 years ago.
Updated patch, see comments
16937_dict_join.diff (1.1 KB ) - added by Anssi Kääriäinen 12 years ago.
prefetch_6.diff (47.9 KB ) - added by Luke Plant 12 years ago.
Various updates, see last comment
prefetch_7.diff (49.4 KB ) - added by Luke Plant 12 years ago.
Added fix for traversing nullable relations
prefetch_1.3.1.diff (23.5 KB ) - added by snyderra@… 12 years ago.
for 1.3.1 stable release

Download all attachments as: .zip

Change History (42)

by Luke Plant, 12 years ago

Attachment: prefetch.diff added

Patch implementing ticket, with docs and tests

comment:1 by Luke Plant, 12 years ago

Triage Stage: UnreviewedDesign decision needed

comment:2 by Jeremy Dunck, 12 years ago

Cc: jdunck@… added

comment:3 by Anssi Kääriäinen, 12 years ago

Some comments on the approach taken in the patch. First a forewarning: I haven't actually tried the patch, just read through the code portion of it.

I wonder if there would be some possibility to have django/db/models/deletion.py and this feature share the related object collection code. Maybe there is some fundamental difference in how collecting needs to be done in each case, but on a quick look it seems there might be some possibility to avoid duplication of very similar code here, especially if depth > 1 is going to be supported.

This will not work on some backends if you have a lot of objects in your queryset. For example the SQLite backend has a limitation of 999 parameters to single SQL query, so if you have 1000 objects, prefetch_related will fail as you need to supply 1000 id values to the query.

If you try to do something stupid, like .prefetch_related('id'), the patch will give out a not so user friendly error message. The same goes if the field doesn't exists at all. In this case Django should give a helpful error message: "the available fields are ...", as when trying to do objs.filter(missing_field=val).

This is just personal opinion: I don't like the idea that the only method where this benefits the user is if you call .all() on the related manager. I would like more an approach where you would just fetch a list of objects to the model, and you could use your own filtering, ordering etc for the prefetch. The original related manager would still be available.

Something like:

# Simple case
Book.objects.prefetch_related('authors')
# Result: a list of books, each having a list of authors as obj.authors_prefetched
# Name the list
Book.objects.prefetch_related('authors', to_attr='authors_lst')
# The same as above, except now you would have the authors in obj.authors_lst
# Use a custom queryset
Book.objects.prefetch_related('authors', qs=Author.objects.filter(age__lte=40).order_by('-age'), to_attr='young_authors')
# Now each book has younger than 40 year old authors in attribute 'young_authors', ordered by their age
# If you want to prefetch multiple related objects, you need to chain the prefetch_related calls.

I am afraid that the patch as implemented will result in situation where you have either the option that only .all() is supported, which doesn't cover nearly all the use cases, or that Django would need to support filtering, ordering etc. in Python code, which is against the QuerySet philosophy.

Still one more point: I think m2m with intermediary model can point to other fields than the PK of the related model - if I am not mistaken the patch does not support this.

Other than these points, a big +1 for this feature. I end up coding something similar in nearly every project I do.

by Luke Plant, 12 years ago

Attachment: prefetch_2.diff added

Reworked patch

comment:4 by Luke Plant, 12 years ago

I had another go at the patch, above, and it is much cleaner now, although involves more changes to the related managers - but the changes make much more sense I think. I'll have a look at the deletion code, I'm not sure if it can easily be re-used. A lot of the duplication in my original patch has been removed, or at least moved to a much more sensible place.

I agree that better error messages would be good. In light of other changes I'm thinking of, they may wait.

I also agree that it would be nice to support different filtering. I think it could be supported using some kind of dictionary or keyword syntax to the same method, probably combined with Q objects e.g.

    Book.objects.prefetch_related(young_authors=('authors', Q(age__lte=40),
                                  old_authors=('authors', Q(age__gte=60))

Or we could allow full querysets in place of the Q objects.

Defined like that, it would be possible to add this later. I think that getting the simple case sorted out and correct is a priority, though, and useful enough on its own that the enhancements can wait.

I would also like to support depth > 1, and I think I would prioritise that over custom filtering, partly because it is harder, and partly because it seems more inconsistent with select_related to leave it single depth.

by Luke Plant, 12 years ago

Attachment: prefetch_3.diff added

added support for arbitrary depth, instead of single depth

comment:5 by Christopher Grebs, 12 years ago

Cc: cg@… added

by Luke Plant, 12 years ago

Attachment: prefetch_3.1.diff added

Fix to get tests to run when you run the whole test suite

comment:6 by Mailing List SVR <lists@…>, 12 years ago

Cc: lists@… added

comment:7 by mkai, 12 years ago

Cc: django@… added

comment:8 by Luke Plant, 12 years ago

For future reference…

I've got some thoughts about extended syntax which I'll put here while I remember.

If we allow for additional attributes to be added, with custom filtering etc., we may need attributes on top of those attributes - I might want to add 'young_authors' and 'young_authors__young_fans' or some such. If we do this, the order of processing these attributes becomes important. If ordering is important, keyword arguments are not a good fit, because order is lost. We could re-order ourselves, but I can imagine circumstances where the developer needs control over the order of the fields that are looked up. (See below for example).

So, we might want to consider a different syntax for extensions - perhaps passing tuples or some other struct instead of a string. I'm not wildly keen on a field plus keyword arguments to control it, followed by another call to prefetch_related() if we need another one - it means we clone the QuerySet a lot, and makes it hard to build up a list or dictionary that we can pass using *args or **kwargs syntax.

Example

Example where order is important:

Suppose we have:

 Foo <-> M2M <-> Bar <-> M2M <-> Baz
                 Bar <-> M2M <-> Goo

Suppose we do Foo.objects.prefetch_related("bars__bazs"), perhaps in a default manager.

Suppose also Foo has a primary_bar attribute, implemented as a property. This might be defined as the Bar object that has the most Baz objects attached, or some other DB based criterion. Since we are prefetching all Bar objects, it makes sense to do this in Python code:

    class Foo(Model):
        bars = ManyToManyField(Bar)

        @property
        def primary_bar(self):
            return sorted(list(self.bars.all()), key=lambda bar: -len(bar.bazs.all()))[0] # untested, you get the idea

Now, the prefetch_related code will allow primary_bar to be traversed - it doesn't care that it's not a real ForeignKey. And, for the primary_bar, we might want to prefetch all Goo objects, but not do that for other Bar objects. So we would do:

    Foo.objects.prefetch_related('bars__bazs',
                                 'primary_bar__goos')

For this, it is important that all the bars are prefetched before we evaluate primary_bar. Thus, order is important, and we need to leave it in the hands of the developer.

(Contrived example, I know, but I've seen code not very far from 'primary_bar', which is in desperate need of prefetch_related, so we don't want to limit usefulness by an API choice that is very hard to rectify).

comment:9 by Luke Plant, 12 years ago

Triage Stage: Design decision neededAccepted

I'm moving to 'Accepted', because I'm interpreting Alex's doubts on the mailing list as a '-0' rather than '-1', and I think I may have addressed them anyway.

To anyone on CC list - the latest patch is very much ready for testing, and that will really help get this in.

by Luke Plant, 12 years ago

Attachment: prefetch_4.diff added

Made API for clearing consistent with QuerySet.defer

comment:10 by Anssi Kääriäinen, 12 years ago

Would this work for future extensibility: If you need order, you will need to call multiple .prefetch_related() calls in the wanted order. There is no defined order for multiple prefetch_related arguments inside one call.

I also wonder is we would need to have keyword arguments to the prefetch_related call in the future. For example collapse:

Author.objects.prefetch_related('books__read_by', collapse=True)
# Fetches all the readers of author's books into author.books__read_by

If you combine the above with custom querysets, the following API possibilities come to mind:

filtering_qs = Readers.objects.filter(age__lte=20)

# A Opts class - uqly and hard to use programmatically, but easy to expand
Author.objects.prefetch_related('young_readers'=PrefetchOpts('books__read_by', qs=filtering_qs, collapse=True))

# One prefetch per one call - same notes as above
Author.objects.prefetch_related(path='books__read_by', to_attr='young_readers', qs=filtering_qs)

# The collapse applies globally. If you have a related field named 'collapse' you are in trouble :)
# This is hard to expand
Author.objects.prefetch_related(young_readers=('books__read_by', filtering_qs), collapse=True)

I am doing a review of the patch (v3.1) currently. The only big things IMHO are if the API is extensible, and if the prefetch should save the objects in obj.related_manager.all() or in a separate list. I guess in the latter question I am in minority... I will post the review once I have done some more testing.

comment:11 by Luke Plant, 12 years ago

Having to call prefetch_related() multiple times to achieve some effects is really not a very nice API, especially if you want to build a list of arguments to pass to it, as I mentioned before (and because it causes unnecessary cloning). An options class/namedtuple/tuple, used as a positional argument rather than keyword (because of the ordering issue I mentioned), seems the best option at the moment.

I don't think there is a problem regarding API extensibility. At the moment we are using a variable list of fields, and passing None to clear the list, which is consistent with lots of QuerySet methods, and allows keyword arguments if we really need them, or to pass in a struct of some kind instead of the string we use currently. We've got plenty of options.

comment:12 by Anssi Kääriäinen, 12 years ago

So the API might look something like this in the future:

Books.objects.prefetch_related(
    Opts('young_readers', path='books__read_by', qs=filtering_qs, collapse=True), 
    'another_related'
)
# or with tuple / namedtuple instead of Opts. The kwarg names are just placeholders.

I can live with that :)

Here is the review, there is a lot of stuff here, but most (all?) of these are issues that do not need to be dealt with at the moment. They can be resolved in later tickets. I will mark "patch needs improvement" but feel free to just unset it if you feel that these issues are better dealt in later tickets.

  • About db_for_read usage: Is this usage correct?
    [contrib/contentypes/generic.py:278 and db/models/fields/related.py:446]
    db = self._db or router.db_for_read(self.model, instance=instances[0])
    

Two possible problems: 1. the model and instance are of different classes, I don't know, maybe
this is OK? 2. Only the first instance of the list is given as a hint, is this OK? I don't know multidb too well, so it might be this is OK.

  • Related to above: no multidb tests.
  • Attached is a failing tests - it is related to having more than 1000 parameters in single SQL query. I think this should be fixed, but this can maybe be done in a later commit.
  • The patch assumes relation to the PK field of the model in m2m cases - this is not necessary the case. Attached is a failing test related to this and a successful test in foreign key case.
  • I don't understand what the following comment means, and there are some typos in it:
    [modeltests/prefetch_related:141-146] 
    The second prefetch_related lookup is a reverse foreign key. This means the query for it can 
    only be something like "first_time_authors__pk__in = [...]" and cannot return more rows then 
    the number of Author objects in the database.  This means that these objects will be reused 
    (since in our data we've arranged for there len(authors1) > Author.objects.count()) 
    
  • I don't like the usage of queryset len() to force the evaluation of the queryset. Better to have explicit method (.evaluate()) or something) and call it than rely on the implicit evaluation when len() is called. Maybe another ticket for this?
  • The current method of select_matching_instances can be slow. The method used here is effectively a nested loop, while using a dict would be considerably faster in many cases. The current method is O(nm) while dict would give O(n + m) but with higer memory cost. Currently fetching 1000 objects and prefetching 5 related objects for each takes 3 seconds on my laptop. Just fetching the objects in two different querysets takes 0.2 seconds. So the joining takes almost 2.8 seconds. In the end of this post is a profile of this run.
  • In models/query.py prefetch_related_objects there is this loop:
          i = 0
          while i < len(fields):
              field = fields[i]
    
    Am I missing something, or wouldn't "for field in fields" work here?
  • Same method as above: Is it good enough just to break here, or should the code raise an exception (not the AttributeError but something more helpful):
         good_objects = True
         for obj in obj_list:
             if not hasattr(obj, '_prefetched_objects_cache'):
                 try:
                     obj._prefetched_objects_cache = {}
                 except AttributeError:
                     # Must be in a QuerySet subclass that is not returning
                     # Model instances, either in Django or 3rd
                     # party. prefetch_related() doesn't make sense, so quit
                     # now.
                     good_objects = False
                     break
         if not good_objects:
             break
    

Is it really necessary to loop through the whole obj_list? One would assume that they are homogenous, so checking just obj_list[0] would be enough?

  • In models/query.py around line 1646 there is this snippet:
       for obj in instances:
           qs = getattr(obj, attname).all()
    
    I am afraid this is somewhat expensive, as this will result in the following code executed for each obj:
        return super(RelatedManager, self).get_query_set().using(db).filter(**(self.core_filters))
    
    That is one QuerySet construction and two clones for each object. This is one good reason not to use relatedmanager.all() for prefetch cache... :)
  • Some tests for multi-table inheritance added in the attachment.
  • I have only skimmed the domcumentation.

As noted in the beginning most of these issues are not blockers but things that might need work later. IMHO the code is high-quality, there are good tests and the documentation looks sane. Even if there are a lot of nitpicks above I think this is very close to be ready for commit.

The promised profile:

        1    0.055    0.055   22.647   22.647 speedtester.py:1(<module>)
55540/30018    0.091    0.000   21.001    0.001 {len}
        1    0.000    0.000   20.967   20.967 transaction.py:206(inner)
        1    0.011    0.011   20.966   20.966 speedtester.py:24(managed)
      2/1    0.000    0.000   20.949   20.949 query.py:90(__iter__)
      3/2    0.002    0.001   20.949   10.474 query.py:75(__len__)
        1    0.000    0.000   20.824   20.824 query.py:545(_prefetch_related_objects)
        1    0.002    0.002   20.824   20.824 query.py:1534(prefetch_related_objects)
        1    0.026    0.026   20.819   20.819 query.py:1627(_prefetch_one_level)
      999    9.552    0.010   18.146    0.018 related.py:451(select_matching_instances)
5011132/5010067    8.638    0.000    8.696    0.000 {getattr}
      999    0.021    0.000    2.016    0.002 related.py:457(all)
      999    0.005    0.000    1.992    0.002 manager.py:115(all)
      999    0.028    0.000    1.987    0.002 related.py:434(get_query_set)
    11988    0.525    0.000    1.554    0.000 base.py:279(__init__)
     2005    0.027    0.000    1.392    0.001 query.py:837(_clone)
     2005    0.111    0.000    1.349    0.001 query.py:233(clone)

comment:13 by Anssi Kääriäinen, 12 years ago

Patch needs improvement: set

Forgot to tick the box...

by Anssi Kääriäinen, 12 years ago

Attachment: 16937_additional_tests.diff added

Additional tests

in reply to:  12 ; comment:14 by Luke Plant, 12 years ago

Thanks very much for the review. Some replies:

Replying to akaariai:

  • About db_for_read usage: Is this usage correct?
    [contrib/contentypes/generic.py:278 and db/models/fields/related.py:446]
    db = self._db or router.db_for_read(self.model, instance=instances[0])
    

Two possible problems: 1. the model and instance are of different classes, I don't know, maybe
this is OK? 2. Only the first instance of the list is given as a hint, is this OK? I don't know multidb too well, so it might be this is OK.

  • Related to above: no multidb tests.

It is basically following the existing code in get_query_set(), which may be incorrect itself, but is also sending using instances that are not of the same type as the model given. The docs suggest this might be OK:

https://docs.djangoproject.com/en/dev/topics/db/multi-db/#topics-db-multi-db-hints

However, since there are many instances here, it is probably better not to send the 'instance' hint at all. Thank you for forcing the issue.

I don't think we should add multi-db tests unless there are specific multi-db issues we need to address. I also don't use multi-db at all, so I would be stabbing in the dark trying to find issues there.


  • Attached is a failing tests - it is related to having more than 1000 parameters in single SQL query. I think this should be fixed, but this can maybe be done in a later commit.

I think we can punt on this one. There are many cases where you can end up with more than 1000 parameters in a single query. If your DB can't do that, and you need to, don't use prefetch_related or get a better DB. If there is the demand, it should be possible to add a fix for this as an enhancement.

  • The patch assumes relation to the PK field of the model in m2m cases - this is not necessary the case. Attached is a failing test related to this and a successful test in foreign key case.

Thanks for this. I thought I was copying the existing code, but that code is only used for the auto-created 'through' model path.

  • I don't understand what the following comment means, and there are some typos in it:
    [modeltests/prefetch_related:141-146] 
    The second prefetch_related lookup is a reverse foreign key. This means the query for it can 
    only be something like "first_time_authors__pk__in = [...]" and cannot return more rows then 
    the number of Author objects in the database.  This means that these objects will be reused 
    (since in our data we've arranged for there len(authors1) > Author.objects.count()) 
    

I will try to clarify this. It is basically testing a side-effect of prefetching, namely that sometimes objects will get re-used. But we don't need a test for it, so maybe I'll just remove it.

  • I don't like the usage of queryset len() to force the evaluation of the queryset. Better to have explicit method (.evaluate()) or something) and call it than rely on the implicit evaluation when len() is called. Maybe another ticket for this?

The internals of QuerySet at this point are very delicate, and optimized. __iter__, __nonzero__, __len__ etc all depend heavily on the internals of each other in order to achieve the optimization required, and we have tests that ensure things are as we expect. I am happy to use the same method here, and I don't think that an evaluate() method would actually add clarity while preserving the optimisations we've done, because it is not as simple as 'evaluate' - what I mean is 'fill the internal cache', and I know that len() has exactly that behaviour.

  • The current method of select_matching_instances can be slow. The method used here is effectively a nested loop, while using a dict would be considerably faster in many cases. The current method is O(nm) while dict would give O(n + m) but with higer memory cost. Currently fetching 1000 objects and prefetching 5 related objects for each takes 3 seconds on my laptop. Just fetching the objects in two different querysets takes 0.2 seconds. So the joining takes almost 2.8 seconds. In the end of this post is a profile of this run.

Nice catch, and thanks for the figures, I'll see what I can do about this, unless you want to have a go.

  • In models/query.py prefetch_related_objects there is this loop:
          i = 0
          while i < len(fields):
              field = fields[i]
    
    Am I missing something, or wouldn't "for field in fields" work here?

The issue is that we are dynamically adding to fields inside the loop, and I thought we would get an error in this case, but I'm thinking of iteration over dictionary and mutating it at the same time. My tests show that a simple 'for' loop works here in Python 2.7 and 3, so that can be changed. Thanks!


  • Same method as above: Is it good enough just to break here, or should the code raise an exception (not the AttributeError but something more helpful):

I don't think we can while preserving backwards compatibility. If we don't fail silently we get exceptions with DateQuerySet and ValuesQuerySet. We could possibly fix them, but there could be 3rd party QuerySet subclasses that do similar things that I don't want to break.

     good_objects = True
     for obj in obj_list:
         if not hasattr(obj, '_prefetched_objects_cache'):
             try:
                 obj._prefetched_objects_cache = {}
             except AttributeError:
                 # Must be in a QuerySet subclass that is not returning
                 # Model instances, either in Django or 3rd
                 # party. prefetch_related() doesn't make sense, so quit
                 # now.
                 good_objects = False
                 break
     if not good_objects:
         break

Is it really necessary to loop through the whole obj_list? One would assume that they are homogenous, so checking just obj_list[0] would be enough?

If it fails on one, we can assume it will fail on all. But we still need to add the cache dictionary to each. We could, however, avoid doing this work on exactly the same list more than once.

  • In models/query.py around line 1646 there is this snippet:
       for obj in instances:
           qs = getattr(obj, attname).all()
    
    I am afraid this is somewhat expensive, as this will result in the following code executed for each obj:
        return super(RelatedManager, self).get_query_set().using(db).filter(**(self.core_filters))
    
    That is one QuerySet construction and two clones for each object. This is one good reason not to use relatedmanager.all() for prefetch cache... :)

The code is exactly the same as what you would otherwise execute - you will be doing [foo.bars.all() for foo in foos] at some point (otherwise why use prefetch_related?), and here we are just doing that up front. The next time you call the all() you get the pre-created version. So I don't think trying to optimise here is necessary. If we find a shortcut which doesn't significantly reduce the clarity, fine, but it can wait.

  • Some tests for multi-table inheritance added in the attachment.

Thank you!

in reply to:  14 comment:15 by Anssi Kääriäinen, 12 years ago

Replying to lukeplant:

It is basically following the existing code in get_query_set(), which may be incorrect itself, but is also sending using instances that are not of the same type as the model given. The docs suggest this might be OK:

https://docs.djangoproject.com/en/dev/topics/db/multi-db/#topics-db-multi-db-hints

However, since there are many instances here, it is probably better not to send the 'instance' hint at all. Thank you for forcing the issue.

I don't think we should add multi-db tests unless there are specific multi-db issues we need to address. I also don't use multi-db at all, so I would be stabbing in the dark trying to find issues there.

I think somebody who uses the multi-db feature should comment about this. Two people stabbing in the dark is not better than one people doing that. Come to think of it, it might be worse... :)

The first instance as a hint is probably somewhat useful as you can get the db used in the query from that. I guess you want to get that from somewhere.

[About too-may-parameters in SQL].

I think we can punt on this one. There are many cases where you can end up with more than 1000 parameters in a single query. If your DB can't do that, and you need to, don't use prefetch_related or get a better DB. If there is the demand, it should be possible to add a fix for this as an enhancement.

Agreed. I checked other core supported databases, and there doesn't seem to be any practical limits.

  • I don't like the usage of queryset len() to force the evaluation of the queryset. Better to have explicit method (.evaluate()) or something) and call it than rely on the implicit evaluation when len() is called. Maybe another ticket for this?

The internals of QuerySet at this point are very delicate, and optimized. __iter__, __nonzero__, __len__ etc all depend heavily on the internals of each other in order to achieve the optimization required, and we have tests that ensure things are as we expect. I am happy to use the same method here, and I don't think that an evaluate() method would actually add clarity while preserving the optimisations we've done, because it is not as simple as 'evaluate' - what I mean is 'fill the internal cache', and I know that len() has exactly that behaviour.

OK, this is then for sure food for another ticket, if even for that. I have to add this to my list of queryset annoyances, it already contains .clone() that does a lot more than cloning...

  • The current method of select_matching_instances can be slow. The method used here is effectively a nested loop, while using a dict would be considerably faster in many cases. The current method is O(nm) while dict would give O(n + m) but with higer memory cost. Currently fetching 1000 objects and prefetching 5 related objects for each takes 3 seconds on my laptop. Just fetching the objects in two different querysets takes 0.2 seconds. So the joining takes almost 2.8 seconds. In the end of this post is a profile of this run.

Nice catch, and thanks for the figures, I'll see what I can do about this, unless you want to have a go.

I can take a stab at this, I have done the hash-join-in-python too many times already...

  • In models/query.py around line 1646 there is this snippet:
       for obj in instances:
           qs = getattr(obj, attname).all()
    
    I am afraid this is somewhat expensive, as this will result in the following code executed for each obj:
        return super(RelatedManager, self).get_query_set().using(db).filter(**(self.core_filters))
    
    That is one QuerySet construction and two clones for each object. This is one good reason not to use relatedmanager.all() for prefetch cache... :)

The code is exactly the same as what you would otherwise execute - you will be doing [foo.bars.all() for foo in foos] at some point (otherwise why use prefetch_related?), and here we are just doing that up front. The next time you call the all() you get the pre-created version. So I don't think trying to optimise here is necessary. If we find a shortcut which doesn't significantly reduce the clarity, fine, but it can wait.

True. But this whole ticket is about optimizing the mentioned code path. On the other hand, the queryset creation takes about as much time as model initialization in the attached profile. Making queryset clone faster would be the key here, and I might have an idea or two about making that faster... In any case, this can be handled later if the need arises.

I will try to get the hash-join patch done in a hour or two.

IMHO it would be good to make a summary of the API extensibility discussion to django-developers and leave it there for a couple of days so that others can have a say. I think the current resolution is good enough, but in this case more eyes means better results.

Otherwise, close to commit ready :)

by Luke Plant, 12 years ago

Attachment: prefetch_5.diff added

Updated patch, see comments

comment:16 by Luke Plant, 12 years ago

I've updated the patch, pulling in most of your tests, and fixing the issues.

Relevant to your work, I've eliminated the 'select_matching_instances' method, so the nested loop now appears in the calling function, which should make your job easier. It turns out all the 'select_matching_instances' methods could be generalised to two attributes passed back from get_prefetch_query_set().

I think I've addressed everything else - or everything else I agreed to address :-)

Signing out for now.

comment:17 by Luke Plant, 12 years ago

Actually, I didn't fix the 'for' loop, and I didn't remove the tests for the instance-sharing. And I didn't add anything to avoid iterating over the same object list more than once to add '_prefetched_objects_cache = {}'. Monday!

comment:18 by anonymous, 12 years ago

Here is the hash-join patch. The test is a little diffferent this time (lost my original database...)

Profile without dict-join:

Loop 0, used 0:00:21.332133
Loop 1, used 0:00:21.286314
Total: 0:00:42.618382
         11312670 function calls (11220378 primitive calls) in 43.177 CPU seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.002    0.002   43.178   43.178 speedtester.py:1(<module>)
68813/17921    0.114    0.000   42.637    0.002 {len}
        1    0.011    0.011   42.636   42.636 transaction.py:206(inner)
        1    0.012    0.012   42.623   42.623 speedtester.py:25(managed)
      4/2    0.000    0.000   42.605   21.303 query.py:90(__iter__)
      6/4    0.004    0.001   42.605   10.651 query.py:75(__len__)
        2    0.000    0.000   42.357   21.178 query.py:545(_prefetch_related_objects)
        2    0.005    0.002   42.357   21.178 query.py:1534(prefetch_related_objects)
        2   19.538    9.769   42.270   21.135 query.py:1627(_prefetch_one_level)
10110574/10107823   17.424    0.000   17.629    0.000 {getattr}
     2008    0.042    0.000    4.144    0.002 related.py:449(all)
     2008    0.010    0.000    4.095    0.002 manager.py:115(all)
     2008    0.056    0.000    4.086    0.002 related.py:434(get_query_set)
     4026    0.054    0.000    2.797    0.001 query.py:837(_clone)
     4026    0.223    0.000    2.713    0.001 query.py:233(clone)
     2010    0.011    0.000    2.370    0.001 query.py:596(filter)
     2012    0.033    0.000    2.360    0.001 query.py:610(_filter_or_exclude)
48312/16104    0.737    0.000    2.292    0.000 copy.py:145(deepcopy)

With dict join:

Loop 0, used 0:00:02.813404
Loop 1, used 0:00:02.763317
Total: 0:00:05.576659
         1254598 function calls (1162306 primitive calls) in 6.158 CPU seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.002    0.002    6.159    6.159 speedtester.py:1(<module>)
68813/17921    0.114    0.000    5.595    0.000 {len}
        1    0.012    0.012    5.594    5.594 transaction.py:206(inner)
        1    0.012    0.012    5.582    5.582 speedtester.py:25(managed)
      4/2    0.000    0.000    5.563    2.782 query.py:90(__iter__)
      6/4    0.004    0.001    5.563    1.391 query.py:75(__len__)
        2    0.000    0.000    5.314    2.657 query.py:545(_prefetch_related_objects)
        2    0.005    0.003    5.314    2.657 query.py:1534(prefetch_related_objects)
        2    0.087    0.044    5.225    2.613 query.py:1627(_prefetch_one_level)
     2008    0.025    0.000    3.925    0.002 related.py:449(all)
     2008    0.009    0.000    3.897    0.002 manager.py:115(all)
     2008    0.048    0.000    3.888    0.002 related.py:434(get_query_set)
     4026    0.053    0.000    2.855    0.001 query.py:837(_clone)
     4026    0.211    0.000    2.774    0.001 query.py:233(clone)
48312/16104    0.739    0.000    2.362    0.000 copy.py:145(deepcopy)

Note that 2/3 of time is used in all().

Here is still another profile, this time the related objects are set in setattr(obj, attname + '_prefetched').

Loop 0, used 0:00:00.811442
Loop 1, used 0:00:00.756192
Total: 0:00:01.567570
         425294 function calls (397258 primitive calls) in 2.121 CPU seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.002    0.002    2.122    2.122 speedtester.py:1(<module>)
42709/17921    0.069    0.000    1.593    0.000 {len}
        1    0.005    0.005    1.578    1.578 transaction.py:206(inner)
        1    0.005    0.005    1.573    1.573 speedtester.py:25(managed)
      4/2    0.000    0.000    1.561    0.781 query.py:90(__iter__)
      6/4    0.004    0.001    1.561    0.390 query.py:75(__len__)
        2    0.000    0.000    1.315    0.657 query.py:545(_prefetch_related_objects)
        2    0.005    0.003    1.314    0.657 query.py:1534(prefetch_related_objects)
    12052    0.092    0.000    1.285    0.000 query.py:229(iterator)

Last, memory usage per style:

original (nest-loop style join): 39023Kb
dict-join: 39043Kb
no related manager: 27811Kb

The memory use is total memory use per process. This means that fetching a list of 1000 objects requires over 10Mb of additional memory for the queryset caching. I am still of the opinion that saving the prefetched objects in obj.books_prefetched instead of obj.books.all() would be a good move. More so because you can't actually do anything to the .all() queryset (filter, order etc) without losing the benefit of the prefetch.

by Anssi Kääriäinen, 12 years ago

Attachment: 16937_dict_join.diff added

comment:19 by Anssi Kääriäinen, 12 years ago

About the concerns of speed and memory usage because of QuerySet caching: I now think that this is not a problem, if in future prefetch_related will be extended in a way that you can avoid the caching. For example:

Book.objects.prefetch_related(R('authors', to_attr='authors_prefetched'))

Now you will have authors in an ordinary list, accessible in book.authors_prefetched. This way you will gain the no related manager memory usage and speed when needed, while in the normal cases you can use the convenient .prefetch_related('authors') and use the related manager.all().

That being said, I don't have any problems any more with the extensibility of the API, or with the implementation where the objects are accessible through related_manager.all(). Once the patch is updated with the minor corrections + the dict_join.diff patch I think this will be ready for commit.

I would like to work already on the custom QuerySet / other extensibility issues so that it could be included already in 1.4. Should I create a new ticket or is it too early? Or of course if Luke Plant has plans already for this work I am more than happy just to be a reviewer.

Sorry for being a bit insistent about the extensibility / related_manager.all() issues. At least there is hope the discussion has been useful for future work.

comment:20 by Anssi Kääriäinen, 12 years ago

I am doing some preliminary studying work for the extensibility patch, and while doing that I have some more nitpicking about small and really small things...

  • Why are prefetch_related_objects and _prefetch_one_level functions instead of methods of Query?
  • Assuming there is a good reason for this, prefetch_related_objects argument "fields" is not well named. related_lookups or something like that might be a better name. If the prefetch_related_objects would be a method, there would not be a need to pass this argument.
  • In query.prefetch_related, the comment
    If prefetch_related() is called with no arguments the list is cleared.
    

is stale. It should say that if called with None as the argument.

  • It would be more consistent to surrounding code to write the .prefetch_related as:
            clone = self._clone()
            if fields == (None,):
                clone._prefetch_related = set()
            else:
                clone._prefetch_related = self._prefetch_related.union(set(fields))
            return clone
    

And it would be defensive to copy the _prefetch_related in .clone() instead of relying the other code only using .union() or other new set returning methods. Or make the set an immutable set.

comment:21 by Anssi Kääriäinen, 12 years ago

And so it happened that I got a little carried away and actually implemented some extensibility already, two new features included:

  • You can define to which attr the prefetched results are saved, this can be used in additional lookups
    qs = Author.objects.all()
    qs.prefetch_related(R('books', to_attr='books_prefetched'),
        R('books_prefetched__readers', to_attr='readers_prefetched'))
    # qs[0].books_prefetched will have related prefetched books
    # qs[0].books_prefetched[0].readers_prefetched will have prefetched readers
    
  • You can use custom querysets
    qs.prefetch_related(
        R('books', to_attr='django_books',
         qs=Book.objects.filter(name__icontains='django').order_by('published_year')), 
        'django_books__readers')
    # qs[0].django_books will have all of the author's django books ordered by published year
    # qs[0].django_books[0].readers.all() will have all the readers of first book
    

I would like to still implement:

    qs.prefetch_related(
        R('books', to_attr='django_books',
          qs=Book.objects.filter(name__icontains='django').order_by('published_year')), 
        R('django_books__readers', to_attr='django_readers', collapse=True)
    )
    # qs[0].django_books will have all of the author's django books ordered by published year
    # qs[0].django_readers will have all the readers of all of the authors books
    # possible optimization is that you don't fetch the "middle" objects at all, by issuing kwarg 
    # skip=True to the first R-obj above

The work can be found from github branch https://github.com/akaariai/django/tree/prefetch_extensibility

There are no new test but the existing ones pass. No new documentation either. I would say that this is more than just a proof of concept, but at the moment not much more.

I posted this here because it would be possible to include more features in one go, that is, in this ticket. Even if it is seen better to first include only the original feature set, this work does confirm that there is no worries about extensibility, and the memory / speed problems do not matter if there is the ability to set the objects into a pure list instead of caching in the related manager if the user so chooses. I would also encourage to look at the new version of prefetch_related_objects.

in reply to:  20 comment:22 by Luke Plant, 12 years ago

Replying to akaariai:

I am doing some preliminary studying work for the extensibility patch, and while doing that I have some more nitpicking about small and really small things...

  • Why are prefetch_related_objects and _prefetch_one_level functions instead of methods of Query?

I thought it would be cleaner to have them outside the main class since it needs very little of the internals of the QuerySet - it is not involved in any of the main business of QuerySet - and it would otherwise be adding ~130 line to a class that is already over 900 lines long. I have found it is often easier to pull things out of a class earlier rather than later, and it leads to cleaner design, since you can't just go and access 'self'. The _prefetch_one_level function, especially, needs no access to 'self'. This is similar to the get_cached_row helper which doesn't need to be a method, even though it can't be re-used outside of that class.

We also have the fact that if you add something to 'self' in order to communicate between prefetch_related_objects and sub-procedures, which can be a very tempting path, you've potentially got big memory leaks. We can see clearly that prefetch_related_objects only adds to the result_cache instances (and further down the tree), and not anywhere else, because it has no access to the QuerySet 'self'.

  • Assuming there is a good reason for this, prefetch_related_objects argument "fields" is not well named. related_lookups or something like that might be a better name. If the prefetch_related_objects would be a method, there would not be a need to pass this argument.

Good call, fixed.

  • In query.prefetch_related, the comment
    If prefetch_related() is called with no arguments the list is cleared.
    

is stale. It should say that if called with None as the argument.

Thanks, fixed.

  • It would be more consistent to surrounding code to write the .prefetch_related as:
            clone = self._clone()
            if fields == (None,):
                clone._prefetch_related = set()
            else:
                clone._prefetch_related = self._prefetch_related.union(set(fields))
            return clone
    

And it would be defensive to copy the _prefetch_related in .clone() instead of relying the other code only using .union() or other new set returning methods. Or make the set an immutable set.

Actually, passing kwargs to _clone() is used in other places, where the change is a simply attribute that immutable. So due to your second point, and the fact that we actually need a list, I'm changing it, thanks. I've also added tests that demonstrate/test the ordering issue.

I've pulled in your dict-based join, thank you.

I've also fixed it so that '.count()' and '.exists()' work as expected (and any other QuerySet methods that can answer from the result_cache, like indexing), by moving changes to relmanager.get_query_set() instead of relmanager.all().

Regarding the expense of creating QuerySets which are not, I think it should be possible to address this with a some kind of QuerySetProxy which has a result cache, and possibly a few other methods like count() and exists(), and for everything else lazily creates the real QuerySet and delegates to it. It would probably involve adding a 'lazy' kwarg to relmanager.get_query_set and creating a subclass of LazyObject. However, I would actually prefer for this to go in without that optimization, so that in any subsequent debugging efforts are not hampered by an extra source of complication.

Finally, I would create a separate ticket for your enhancements, which sound great, but I haven't had time to check them out.

Thanks again for all your work on this.

by Luke Plant, 12 years ago

Attachment: prefetch_6.diff added

Various updates, see last comment

by Luke Plant, 12 years ago

Attachment: prefetch_7.diff added

Added fix for traversing nullable relations

comment:23 by Anssi Kääriäinen, 12 years ago

Has patch: set
Patch needs improvement: unset
Triage Stage: AcceptedReady for checkin
# in documentation
[list(pizza.topppings.filter(spicy=True) for pizza in pizzas]

Note the tree 'p' characters and missing closing ')'. Fixed in attached patch.

I am getting some deprecation warnings in tests: modeltests/prefetch_related/tests.py:172: DeprecationWarning: BaseException.message has been deprecated as of Python 2.6

I have fixed this by using str(cm.exception) instead of cm.exception.message

Otherwise I can't see anything wrong with the patch. Although there is still the issue with db_for_read hints, but I think that can be tackled later.

All tests passed on sqlite3. PostgreSQL passes the prefetch_related test, I don't have time to run full test suite on PostgreSQL just now. I have read through the documentation changes and they look good. I haven't actually built the documentation, though.

I have attached an updated patch which fixes the minor things mentioned above. Marking ready for checkin.

comment:24 by Anssi Kääriäinen, 12 years ago

My review was based on prefetch_6.diff. Now I think I am not going to attach yet another patch for this ticket, the changes mentioned above are trivial to make when reviewing this patch before commit. My "ready for checkin" review is still in effect.

comment:25 by Luke Plant, 12 years ago

Resolution: fixed
Status: newclosed

In [16930]:

Fixed #16937 - added QuerySet.prefetch_related to prefetch many related objects.

Many thanks to akaariai for lots of review and feedback, bug finding,
additional unit tests and performance testing.

comment:26 by mkai, 12 years ago

Thanks for your great work! I'm still amazed by the quality standards of the Django community. Also, the timing is very good - I'm currently refactoring my project with the goal of minimizing database queries and have already made good use of prefetch_related.

I've noticed that the code supports traversing over a GenericForeignKey relation and prefetching its related objects. Would it be possible to also support prefetching the objects that the generic foreign key points to? I'm thinking of a model that has a generic foreign key to a User object, for example. If 30 instances of this model are being traversed, and all 30 instances refer to the same User object, then the User object will be fetched from the database 30 times. I've seen ways around this (http://djangosnippets.org/snippets/1773/ for example), but, other than prefetch_related, all of them evaluate the queryset immediately (and none of them is in Django core).

Do you think prefetching the objects behind generic foreign keys (not their relations) could/ should be added to prefetch_related? Or does it rather belong into django.contrib.contenttypes, as prefetch_generic maybe?

in reply to:  26 comment:27 by Luke Plant, 12 years ago

Replying to mkai:

Do you think prefetching the objects behind generic foreign keys (not their relations) could/ should be added to prefetch_related? Or does it rather belong into django.contrib.contenttypes, as prefetch_generic maybe?

This may be possible. I do not want to add more code into Django core that special-cases anything in contenttypes - we want to go in the opposite direction to that. We have some work going on regarding composite fields - see https://www.djangoproject.com/weblog/2011/apr/25/gsoc/ and #373. After this is done, we are hoping that things like GenericForeignKey can be cleaned up. Then maybe we can find some API whereby prefetch_related (or possibly select_related) will be able to handle them better.

An alternative, as you say, might be something like 'prefetch_generic'. But whether this will be easy to implement (or use) if it is not part of core is another matter.

comment:28 by Luke Plant, 12 years ago

@mkai: I've created #17003 which covers your request.

comment:29 by Luke Plant, 12 years ago

@mkai - #17003 now has a patch which includes support for prefetching GenericForeignKey, it would be great if you could test it in your project that would be great.

by snyderra@…, 12 years ago

Attachment: prefetch_1.3.1.diff added

for 1.3.1 stable release

comment:30 by snyderra@…, 12 years ago

For those impatient like me, the prefetch_1.3.1.diff will apply to the 1.3.1 stable tar.gz and pass all the tests. I am using it on my system and it works as expected. To keep it at the 1.3.1 this is a the prefetch_2 and parts of prefetch_3

comment:31 by Luke Plant, 12 years ago

You should be aware if you are using that patch that the prefetch_2 and prefetch_3 diffs were quite far from ready, and even the API has changed a bit.

Note: See TracTickets for help on using tickets.
Back to Top