Code

Opened 7 years ago

Closed 5 years ago

Last modified 3 years ago

#5420 closed (fixed)

Allow database API users to specify the fields to exclude in a SELECT statement

Reported by: adrian Owned by: mtredinnick
Component: Database layer (models, ORM) Version: master
Severity: Keywords: qs-rf
Cc: msaelices, ferringb@…, marinho, semente@…, research@…, youngj@…, ikelly, alexkoshelev, flosch@… Triage Stage: Accepted
Has patch: yes Needs documentation: yes
Needs tests: no Patch needs improvement: yes
Easy pickings: UI/UX:

Description

This one will help people use their databases more efficiently.

The Django ORM should allow users to specify a list of field names to *exclude* from a QuerySet. If a user attempts to access one of those excluded fields on the resulting model instance, the field will be loaded lazily via a separate query.

This is useful when you know you absolutely will not need to use a particular field in your template, so there's no point in SELECTing that data. This saves memory, and it saves on bandwidth between the database server and the Web server.

Example:

class Person(models.Model):
    name = models.CharField(maxlength=32)
    age = models.IntegerField()
    hometown = models.CharField(maxlength=32)
    is_cool = models.BooleanField()

# My instinct is to call this hide(), but I'm sure there's a better name for it.
>>> p = Person.objects.hide('hometown', 'is_cool').get(name='John Lennon')
>>> p.id
3
>>> p.name
u'John Lennon'

# Does a query to get "hometown", because it was hidden from the QuerySet.
# 'SELECT hometown FROM person WHERE id=3;'
>>> p.hometown
u'Liverpool'

# Does a query to get "is_cool", because it was hidden from the QuerySet.
# 'SELECT is_cool FROM person WHERE id=3;'
>>> p.is_cool
True

In the case of lazily loaded fields, the lazy loading *only* applies to the particular field. E.g., when I accessed p.hometown in the above example, it did *not* also lazily load the rest of the hidden fields ("is_cool").

We should also provide the inverse of hide() -- perhaps called expose()? -- which would take a list of field names to *include* rather than *exclude*. This would be an opt-in instead of an opt-out.

Attachments (2)

queryset_fields_trunk.diff (29.7 KB) - added by dogada 6 years ago.
QuerySet patch: adds fields() and improves values)() methods
lazy.diff (48.8 KB) - added by adunar 6 years ago.
implementation of lazy fields

Download all attachments as: .zip

Change History (34)

comment:1 Changed 7 years ago by adrian

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Triage Stage changed from Unreviewed to Accepted

comment:2 Changed 7 years ago by durdinator

  • Owner changed from nobody to durdinator

comment:3 Changed 7 years ago by durdinator

  • Keywords qs-rf added

comment:4 Changed 7 years ago by durdinator

  • Owner changed from durdinator to nobody

comment:5 Changed 7 years ago by durdinator

Not much point tackling this now if the queryset-refactoring work is coming soon; it'd just break whatever patch was made for this.

comment:6 Changed 6 years ago by Alex

Should the "exposes" method replace values, as I see it the main difference is that exposes seems to still return an object, whereas values returns a dictionary.

comment:7 Changed 6 years ago by msaelices

  • Cc msaelices added

comment:8 Changed 6 years ago by (removed)

  • Cc ferringb@… added

comment:9 Changed 6 years ago by dogada

  • Owner changed from nobody to dogada
  • Status changed from new to assigned

Changed 6 years ago by dogada

QuerySet patch: adds fields() and improves values)() methods

comment:10 Changed 6 years ago by dogada

  • Has patch set

I added a patch http://code.djangoproject.com/attachment/ticket/5420/queryset_fields_trunk.diff that implements QuerySet.fields(*fields, **related_fields) and make possible to load only some from master and related models fields. It allows to tune various object list queries when we need only limited subset of all fields, improve general performance and decrease database load. As side effect of this patch support of selecting fields from related models in QuerySet.values() is implemented too. It was changed signature of this method from values(*fields) to values(*fields, **related_fields) but the change is backward compatible.

Patch doesn't implement lazy field loading, see more details about this and other issues at: http://www.mysoftparade.com/blog/django_orm_performance_patch/

comment:11 Changed 6 years ago by mtredinnick

  • Patch needs improvement set

Thanks for the patch. However, patches in this are of the code against trunk aren't very useful at the moment, since it's been hugely rewritten on the queryset-refactor branch and that is what will be merged into trunk. So if you want to tackle this problem, please write a patch against queryset-refactor. Patches against trunk aren't worth considering at this point in time, because they will be quite different from the final version.

It's a very complete patch, with tests and documentation, however there are some problems (aside from being against code that is scheduled for removal).

  1. The fields() addition probably isn't the nicest way to do this. Adrian called it exclude(), although probably defer() is a better name. You just pass this function a list of fields to exclude. Your fields() call seems to take more than just that: filters and all sorts of things. I think you've gone a little too closely to the raw SQL, rather than allowing the user to specify another modifier on the QuerySet (a list of fields not to include when the query is eventually executed). Your examples don't look as clear as the alternative approach.
  2. The fields we don't immediately load should be deferred (lazy-loaded), not omitted altogether. So when you access those attributes, they should be loaded transparently. This means you can load with some fields being deferred, but still safely pass that object around everywhere and if something else does need to access one of the expensive fields, it will be loaded on demand.
  3. There's a big change to ValuesQuerySet in here which seems unnecessary. This really shouldn't be touching that at all (it looks like you might be trying to use some common stuff, but really the commonality is in the normal QuerySet loading path). ValueQuerySets already have a way to not load particular fields (you just don't list them). No changes needed there.
  4. I'm fairly sure Jacob Kaplan-Moss is a fair way along writing the full version of the functionality needed by this ticket for queryset-refactor. So if you'd like to just wait a little bit, we should have something in-tree shortly that implements this particular feature.

So, thanks a lot (seriously!) for taking the time to write the patch. However, in its current form this won't be applied. As mentioned, though, I believe Jacob's pretty close to having something that works the way we're after, so hold tight and you won't have to write it all again.

comment:12 Changed 6 years ago by dogada

fields() is brother of values() - it share same concept and it's easy to switch output format from model instances to dictionaries and vise versa. IMHO it's better than have in API 2 different methods like show() and hide(). Also if you will use show() and hide() for model fields and values(*args) for dictionary fields it will look a bit strange.

I created a patch for our project because we have serious performance problems with generated by current Django ORM SQL-queries and just share it because I think my patch makes Django ORM much more effective than it is now. We can use our patch and wait for the alternative solution from Jacob Kaplan-Moss, but please make lazy loading of fields optional or by-request, because Django model instances already have signals-related overhead and adding new features to the models may shortly make Django models similar to slow J2EE Entity beans. Thanks.

comment:13 Changed 6 years ago by dogada

  • Owner dogada deleted
  • Status changed from assigned to new

comment:14 Changed 6 years ago by dcramer

http://groups.google.com/group/django-developers/browse_thread/thread/2d4a8f5a4b399ed0

I threw this up the other day. I much prefer an extension to values of some sort (as really, it means the same thing) than new methods for hide/expose. If nothing else, a .fields(), but then we're just replicating what values does (but with partial objects instead of a dictionary).

comment:15 Changed 6 years ago by marinho

  • Cc marinho added

comment:16 Changed 6 years ago by Guilherme M. Gondim <semente@…>

  • Cc semente@… added

comment:17 Changed 6 years ago by jacob

  • Owner set to jacob
  • Status changed from new to assigned

comment:18 Changed 6 years ago by jacob

  • milestone set to 1.0

comment:19 Changed 6 years ago by anonymous

  • Cc research@… added

comment:20 Changed 6 years ago by ubernostrum

  • milestone changed from 1.0 to post-1.0

Since this is a feature request, and we're past the feature-freeze point for 1.0... punt.

comment:21 Changed 6 years ago by adunar

  • Cc youngj@… added
  • Needs documentation set

A few months ago I patched Apture's internal version of Django to support lazy loading/saving of certain fields within Django models. However, I did it a different way than already discussed here.

In particular, our problem was that implicit Django-generated query sets would fetch big text fields that we didn't need. Example :

class Student(models.Model):
    name = models.CharField(max_length=32)
    year = models.IntegerField()
    thesis = models.TextField()

class FavoriteFood(models.Model):
    food = models.CharField(max_length=32)
    reason = models.CharField(max_length=128)
    student = models.ForeignKey(Student)

favorites = FavoriteFood.objects.filter(food='lasagna')
for favorite in favorites:
    print favorite.student.name, 
    # Django just loaded the student's entire thesis :(
    print "likes lasagna because", favorite.reason    

favorites = FavoriteFood.objects.filter(food='chicken enchiladas').select_related()
# Django just loaded a bunch of theses again :(

To solve this, we changed the client interface by adding a boolean 'lazy' parameter to the Field constructor, e.g.:

thesis = models.TextField(lazy=True)

This was implemented by putting a descriptor on lazy fields that keeps track of whether the field has been loaded yet and whether it has been modified. By using a descriptor instead of __setattr__, it doesn't really have an impact on performance for models that don't use lazy fields.

After we migrated our internal Django version to 1.0, I went back and cleaned up the lazy fields code and added support for changing the lazy fields on each query set. This turned out to be considerably harder to do in a way that doesn't degrade performance for clients who don't use lazy fields, which is probably why this ticket has been open for so long despite its obvious importance...

The client interface adds one function to the manager and query set, toggle_fields(fetch=None, lazy=None, fetch_only=None), where each argument can be an array of field names (or None):

# fetches name (assuming thesis was defined with lazy=True)
students = Student.objects.all().toggle_fields(lazy=['year']) 

# fetches name,year,thesis
students2 = Student.objects.all().toggle_fields(fetch=['thesis']) 

# fetches name, year
students3 = Student.objects.all().toggle_fields(fetch_only=['name','year']) 

thesis = students[0].thesis  # lazy-loads thesis
students[0].save()           # saves name, year
students[0].thesis = "Django is awesome"
students[0].save()           # saves name, year, thesis    

Does anyone have ideas for better names for toggle_fields/lazy/fetch/fetch_only? I think that hide and show/expose don't really fit here because the client can still get and set and save the field values whether it is lazy or not. Also, I thought that having one method with different parameters would follow Django's style better than adding 3 new methods. Another question is whether to allow lazy-loading and lazy-saving to be independent; e.g., to have a field that's always loaded but only saved when changed. It probably wouldn't be too hard to support this.

Internally, when a toggle_fields query is executed, the Django ORM dynamically creates a subclass of the model type that has a LazyDescriptor for each of the fields that are lazy for that query (but not the fields that were created with lazy=True). Unlike a typical model subclass, this one is very lightweight. It skips most of the code in ModelBase.__new__, and shares the same _meta object.

One drawback of dynamically creating subclasses is that they are harder to serialize (e.g. with the pickle module), but that's probably possible to support if desired.

Anyway, there shouldn't be much of a performance impact for people not using toggle_fields or lazy=True. In most cases the code checks to see if there are any lazy fields before doing anything different from before. There's just a bit of overhead from some extra conditional tests and function calls. I haven't actually run performance tests on that though.

As part of my patch, I cleaned up some code in db/models/sql/query.py. In particular, the code to get a column's SQL alias was duplicated in 6 places. Also, depending on whether the get_default_columns function was called for the base model or a related model, it took different parameters, did different things, and returned different values. So I split it into two separate functions. This refactoring makes the Query class easier to understand and easier to subclass without duplicating code. That seems like something Django should incorporate even if people don't like the other code in this patch.

One change I'm less sure about is in Query.setup_joins. Basically, when computing the join for a ForeignKey field like student_id, the old code would add a join to the student table even though it doesn't need to (because student_id is stored directly in the original row). Then, add_fields checked for this case and removed the join. I changed it so that setup_joins doesn't add the unnecessary join in the first place. Maybe there's some reason for doing this that I'm not seeing.

Anyway, we're already using this because it makes database access way faster for models that have big text fields that we don't read or write very often. My patch has some regression tests, but I haven't updated the documentation yet. If this is something Django would want to incorporate, I'd be happy to do some more work to improve the patch. Thoughts?

Changed 6 years ago by adunar

implementation of lazy fields

comment:22 Changed 6 years ago by mtredinnick

This patch seems to be trying to do multiple things at once, and, as usual, that makes reviewing harder.

The reason the ticket's still open is actually pretty simple: it was out of scope for 1.0 and subsequent to that we needed to wait until the feature-additions trunk was open, which has now happened. At some point, I'll finish up the patch I've got to implement this feature and commit it. It's not actually too hard to implement (and, yes, using descriptors is the way to do it -- we just insert a descriptor for any fields that aren't immediately loaded with a small query class and wrapper that knows how to load them). We've been more focused on bug fixing than feature additions in the past few weeks, which isn't too surprising. We've just been busy; it's not that hard.

You are implementing quite a different feature: fields that are always lazy loaded. That's a data modelling issue, not a retrieval issue (although pretty much this whole ticket is about working around data modelling that probably should be done differently in the first place, but c'est la vie). If you don't ever want them loaded with the main data, that suggests your data modelling has gone slightly awry and those fields shouldn't be part of the main model. Since foreign key relations are loaded lazily in the normal course of events, simply put the things you don't want loaded with the main model into another table and it will always lazy load upon access. No core changes required there (note that the defer() feature that this ticket introduces will have an extra database query for each field, whereas putting all your fields you don't want immediately, but might want later into a separate table means you could load them all at once). So I don't think we'll take this "lazy" option, as it's really implementing a different way to do data modelling that is better supported by splitting things up into different tables (or using defer() when it's introduced). You're really proposing something entirely different to the feature requested in this ticket.

It's difficult to tell which bits are your new code and which are the cleanups you talk about from the patch. If you could be motivated to split out into a clean patch and make a new ticket for them, I'll have a look. Note that some of the code repetition is for performance reasons (we're calling those functions tens of thousands of times, so saving a function call in the tight loops can be beneficial), but it looks like some of the stuff you've done in get_columns(), could certainly be useful.

Really, I think that if you want to go down this road, it needs a discussion on django-developers and proposal as a feature addition for 1.1. We've already accepted the deferred/immediate option that this ticket is about for 1.1 (since both Adrian and I are in agreement about it and no core maintainer's objected), but your thing is different. We don't like doing design in tickets (since it doesn't involve the right audience and the comments tend to be more noise than signal as false avenues are explored), so please do that on the mailing list.

It looks like quite a lot of work has gone into this, so you might well want to take it further. But that should be done via the mailing list first and then, if there's agreement about the feature, a new ticket with the patch. Thanks.

comment:23 Changed 6 years ago by mtredinnick

  • Owner changed from jacob to mtredinnick
  • Status changed from assigned to new

Reassigning back to me.

comment:25 Changed 6 years ago by ikelly

  • Cc ikelly added

comment:26 Changed 5 years ago by alexkoshelev

  • Cc alexkoshelev added

comment:27 Changed 5 years ago by anonymous

  • milestone post-1.0 deleted

Milestone post-1.0 deleted

comment:28 Changed 5 years ago by jacob

  • milestone set to 1.1 beta

comment:29 Changed 5 years ago by anonymous

  • Cc flosch@… added

comment:30 Changed 5 years ago by mtredinnick

(In [10083]) Fixed #10356 -- Added pure-Python inheritance for models (a.k.a proxy models).

Large portions of this are needed for #5420, so I implemented it fully.
Thanks to Ryan Kelly for an initial patch to get this started.

Refs #5420.

comment:31 Changed 5 years ago by mtredinnick

  • Resolution set to fixed
  • Status changed from new to closed

(In [10090]) Fixed #5420 -- Added support for delayed loading of model fields.

In extreme cases, some fields are expensive to load from the database
(e.g. GIS fields requiring conversion, or large text fields). This
commit adds defer() and only() methods to querysets that allow the
caller to specify which fields should not be loaded unless they are
accessed.

comment:32 Changed 3 years ago by jacob

  • milestone 1.1 beta deleted

Milestone 1.1 beta deleted

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as closed
as The resolution will be set. Next status will be 'closed'
The resolution will be deleted. Next status will be 'new'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.