
Opened 2 years ago

Last modified 11 months ago

#18556 new Cleanup/optimization

.remove() on a reverse foreign key executes too many queries

Reported by: Alex
Owned by: nobody
Component: Database layer (models, ORM)
Version: 1.4
Severity: Normal
Keywords:
Cc:
Triage Stage: Accepted
Has patch: yes
Needs documentation: no
Needs tests: no
Patch needs improvement: yes
Easy pickings: no
UI/UX: no

Description

.remove() on a reverse foreign key runs one query per object passed in, instead of a single query for the entire batch. Attached is a patch which fixes this. It's technically not backwards compatible because signals are no longer sent; however, one could send them artificially. Also, in the event of an error, none of the objects will be modified, whereas currently some of them will be.
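To make the difference concrete, here is a minimal sketch using the standard-library sqlite3 module rather than the ORM. The `child` table, its `parent_id` column, and both function names are made up for illustration; they are not taken from the attached patch.

```python
import sqlite3


def make_db():
    # Hypothetical schema: five child rows all pointing at parent 1.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE child (id INTEGER PRIMARY KEY, parent_id INTEGER)")
    conn.executemany(
        "INSERT INTO child (id, parent_id) VALUES (?, ?)",
        [(i, 1) for i in range(1, 6)],
    )
    return conn


def remove_one_by_one(conn, ids):
    """Roughly the behaviour reported here: one UPDATE per object."""
    queries = 0
    for pk in ids:
        conn.execute("UPDATE child SET parent_id = NULL WHERE id = ?", (pk,))
        queries += 1
    return queries


def remove_batched(conn, ids):
    """Roughly the proposed behaviour: a single UPDATE for the whole batch."""
    ids = list(ids)
    if not ids:
        return 0
    placeholders = ", ".join("?" for _ in ids)
    conn.execute(
        f"UPDATE child SET parent_id = NULL WHERE id IN ({placeholders})", ids
    )
    return 1
```

Both functions return the number of statements they issued, so removing three objects costs three statements with the loop and one with the batched form.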

Attachments (2)

t18556-remove-efficiency.diff (2.5 KB) - added by Alex 2 years ago.
18556.diff (3.3 KB) - added by timo 17 months ago.


Change History (11)

Changed 2 years ago by Alex

comment:1 Changed 2 years ago by akaariai

  • Has patch set

+1 for this change. There are already situations where the signals framework doesn't catch all changes. Using .update() in this situation seems natural. On the other hand, I do not actually use signals...

If we wanted to keep signals for this, then we should have some sort of pre/post update signal which one could use here. The signal would likely have an argument of "objs" which would be lazily fetched from the DB: if nobody accesses the objs, then no extra work is done.

The patch itself looks good to me.

Changed 17 months ago by timo

comment:2 Changed 17 months ago by timo

I think it makes sense not to send any signals for consistency with how clear() works, plus it's going to be backwards-incompatible to some extent with save() no longer being called. I've added documentation, including a note to the backwards-incompatible changes for 1.6.

comment:3 Changed 17 months ago by carljm

The closest parallel to remove() in API terms is not clear(), it's add(). There is no consistency gained by having add() continue to send signals and remove() not. If we're going to do this, the same approach should be used for both add() and remove() (there's also no reason for having an efficiency difference between them).

There's also another backwards incompatibility here; the passed-in object instances themselves are no longer updated. This looks trivial to fix; just restore the setattr line in the loop.

comment:4 Changed 17 months ago by timo

Thanks Carl, good point. I'll be happy to update add() as well. Are you +1 on removing signals for these operations or do you think it needs a discussion on django-developers?

comment:5 Changed 17 months ago by carljm

Yeah, I didn't comment on that in my first comment because I'm not sure :-) I think it will in all likelihood break real code if we stop sending these signals here. On the other hand, I also agree with Alex and Anssi that the current implementation is dumb, update() makes way more sense, and that it would be better to have an update signal.

I wouldn't say I'm a +1, but I'm not -1 either. Somewhere in the zero range, I guess :-) I'm not gonna stand in the way.

comment:6 Changed 17 months ago by akaariai

Another possibility is a fast path that uses .update() when there are no listeners, and a loop when there are listeners. The dual paths add very little complexity, and that complexity is worth it when it can save possibly large amounts of DB resources.
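A rough sketch of that dual-path idea, in plain Python so it runs without Django. `ToySignal`, `unlink`, `bulk_update` and `save_one` are hypothetical names for illustration only; Django's own `Signal` class does have a `has_listeners()` method that could play the role of the check here.

```python
class ToySignal:
    """Stand-in for a signal object; not Django's implementation."""

    def __init__(self):
        self.receivers = []

    def connect(self, receiver):
        self.receivers.append(receiver)

    def has_listeners(self):
        return bool(self.receivers)

    def send(self, **kwargs):
        for receiver in self.receivers:
            receiver(**kwargs)


def unlink(objs, signal, bulk_update, save_one):
    """Use one bulk update when nobody listens, else save objects one by one."""
    if not signal.has_listeners():
        # Fast path: a single query for the whole batch, no signals needed.
        bulk_update([obj["pk"] for obj in objs])
        return "fast"
    # Slow path: per-object save so listeners still see each change.
    for obj in objs:
        save_one(obj)
        signal.send(instance=obj)
    return "slow"
```

With no receivers connected, `unlink` calls `bulk_update` once; as soon as one receiver is connected, it falls back to the per-object loop.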

The code as written doesn't do just one query per obj, it does two: .save() will select, then update, for each object. This is a typical case where either using update_fields or applying #16649 would help.

comment:7 Changed 17 months ago by timo

For add(), it looks like we need to at least keep the possibility of executing save() for an object that hasn't been saved yet. See this test https://github.com/django/django/blob/master/tests/modeltests/many_to_one/tests.py#L49-L65

Here's what I came up with so far. Tests pass except for multiple_database. I'm guessing self.model.objects.filter(pk__in=ids).update(**{rel_field.name: self.instance}) needs a slightly different implementation to work with multi-db, but I'm not sure what that would be.

https://github.com/timgraham/django/commit/d6b2720138e53cde0e8f60471e8d92f444e87474

comment:8 Changed 17 months ago by akaariai

The multidb issue is likely due to having the object saved (thus PK set), but into a different DB.

Just checking that the pk is set isn't enough; consulting the object's _state.db and _state.adding would likely yield the correct result. If the DB is different from the current one, or the object is being added, then save; otherwise go to update() directly.
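As a sketch of that check, with a toy stand-in for the per-instance state object (`ToyState` and `needs_save` are hypothetical names; only the `db` and `adding` attributes mirror what is described above):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyState:
    """Stand-in for an instance's _state: which DB it came from, and
    whether it has not been saved yet."""

    db: Optional[str] = None
    adding: bool = True


def needs_save(state, target_db):
    """True if the object must go through save() instead of bulk update()."""
    return state.adding or state.db != target_db
```

Objects for which `needs_save()` is false could be collected and handled by a single update(); the rest would still be saved individually.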

This also seems to raise a possible race condition. Assume thread T1 fetches the object from the DB, then T2 deletes it, and then T1 issues add(). Before the patch, the object was resaved to the DB; after the patch, it remains deleted. In the add() case the correct behavior is to resave, so that after add() you can trust that the relation actually contains all the objects in the DB.

The race could most reliably be resolved by doing an UPDATE ... RETURNING PK. Check which PKs were updated; those that weren't must be resaved. Unfortunately RETURNING isn't available in MySQL or SQLite, so this idea would only apply to those DBs supporting RETURNING. Others could of course do an update(), then fetch values_list('pk'), and separately save the objects whose PKs are missing from that list.
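The fallback for databases without RETURNING might look something like this, again sketched with sqlite3 instead of the ORM. The `child` table and the `add_with_resave` name are assumptions made for the example, and the re-insert stands in for calling save() on the missing objects.

```python
import sqlite3


def add_with_resave(conn, parent_id, pks):
    """Bulk-update the rows, then re-insert any that no longer exist.

    Returns the list of primary keys that had to be re-inserted.
    """
    pks = list(pks)
    if not pks:
        return []
    placeholders = ", ".join("?" for _ in pks)
    # Step 1: one UPDATE for the whole batch.
    conn.execute(
        f"UPDATE child SET parent_id = ? WHERE id IN ({placeholders})",
        [parent_id, *pks],
    )
    # Step 2: find out which of the rows are actually still in the table.
    existing = {
        row[0]
        for row in conn.execute(
            f"SELECT id FROM child WHERE id IN ({placeholders})", pks
        )
    }
    # Step 3: "resave" the ones that were deleted behind our back.
    missing = [pk for pk in pks if pk not in existing]
    conn.executemany(
        "INSERT INTO child (id, parent_id) VALUES (?, ?)",
        [(pk, parent_id) for pk in missing],
    )
    return missing
```

This costs two queries plus one insert per missing row. A row deleted after the SELECT in step 2 would still be lost, so it narrows the race rather than closing it.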

Whether the race condition actually matters in real-world situations is another matter entirely...

comment:9 Changed 11 months ago by timo

  • Patch needs improvement set
