Opened 10 years ago

Last modified 21 hours ago

#21461 new New feature

Add pre_update and post_update signals

Reported by: loic84 Owned by:
Component: Database layer (models, ORM) Version: dev
Severity: Normal Keywords:
Cc: Simon Charette, Shai Berger, Ben Davis, Sardorbek Imomaliev, Olivier Dalang, Ülgen Sarıkavak Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

Quoting Anssi from the ML post https://groups.google.com/forum/#!msg/django-developers/tCzFMpBm5-c/XLHFY0awVJ8J:

The idea is that pre_update listeners get a queryset that isn't executed. Accessing that queryset might be costly, but if it isn't accessed, there isn't much cost of adding a pre_update signals. For post_update the signal handler would be given a list of PK values (the original queryset doesn't work for post_update, the update might cause different instances to be returned than was updated, consider qs.filter(deleted=False).update(deleted=True))

This feature will also provide an upgrade path for #21169 (for FK RelatedManager.remove() more specifically).

Change History (20)

comment:1 by loic84, 10 years ago

Needs documentation: set
Owner: changed from nobody to loic84
Patch needs improvement: set
Status: newassigned

comment:2 by loic84, 10 years ago

Needs documentation: unset
Patch needs improvement: unset

Now with docs, this should be ready for review.

comment:3 by Simon Charette, 10 years ago

IMO, until we add support for RETURNING (or equivalent on other backends) we should try to mimic the way delete works, that is.

If no signals are attached to (pre|post)_update we do a fast update.

If it's not the case we fetch the pk of the objects we're about to update and use the retrieved list to:

  1. Build the queryset passed to pre_update receivers (pk__in=pk_list);
  2. Issue the update query (WHERE pk IN pk_list);
  3. Pass the pk_list to post_update receivers (actually this could also be a queryset built with pk__in instead of the pk_list itself).

comment:4 by Anssi Kääriäinen, 10 years ago

I am not a big fan of first fetching pk_list, it suffers from these problems:

  • pk__in=hugelist is problematic (SQLite doesn't work if you have more than 1000 items in the list, PostgreSQL performs badly)
  • one extra query is needed
  • there is still no guarantee that the update is race free. The races would be different. Consider case qs.filter(count__lte=0).update(count=F('count') + 1) - two threads executing this statement in parallel could end up updating count from -1 to 1 while that isn't currently possible. In my opinion connecting to pre/post_update altering what .update() does is worse than pre/post update signals not seeing the exact same set of objects that were in fact updated. Notably if deletes happen the latter problem is still there for altered pre/post_update.
  • when we add support for RETURNING we can't optimize the implementation of pre_update

The reason why I voted for pk_list to post_update is that if the updated pk_list is large, the user doesn't need to execute pk__in=hugelist query. The query can be split to parts, or maybe the pk list is all that is needed. OTOH I can see how queryset would make the API more consistent. Maybe pass both?

comment:5 by loic84, 10 years ago

Latest attempt:

  • Document that there is no guarantee that the queryset arg for pre_update will exactly return the updated rows because of concurrency.
  • Ensure that pk_set for post_update is the exact set of updated models.
  • Use chunked updates on SQLite when there are post_update signals.

comment:6 by Simon Charette, 10 years ago

Cc: Simon Charette added

comment:7 by Anssi Kääriäinen, 10 years ago

The latest solution is close to what charettes suggested in comment:3. The biggest distinctions are that pre_update is given the to-be updated queryset instead of to-be-updated queryset.filter(pk_in=pk_set), and for post_update you don't get queryset, you just get pk_set.

The reason for going with pk_set is that in some cases the only thing you need is the set of primary keys, and if that happens to be a large set, then doing pk_in=pk_set query will be expensive (PostgreSQL doesn't like this type of query at all, and SQLite will not even execute a query with more than 1000 pk values.

A possible solution is to provide both the queryset (with pk_in=pk_set filtering pre-applied), *and* the pk_set to pre_/post_ update signals. That way the user has nice API if that is all that is needed, but can also just use the pk_set if that is possible.

comment:8 by Tim Graham, 10 years ago

Triage Stage: UnreviewedAccepted

comment:9 by Shai Berger, 10 years ago

Cc: Shai Berger added

Not sure about this, but from a short look at the code it seems like the pk_set is calculated before the update. That makes perfect sense, but breaks a little if the update changes pk values. I don't see any sensible behavior around this, but it should probably be documented.

comment:10 by loic84, 10 years ago

@shai you are totally right, the limitation on changing pk is something we already discussed but I agree it needs to be documented.

At this point @akaariai and I are not sure about how to proceed with this patch. Do we only introduce post_signal? Is there value in the pre_signal as currently implemented? Are there enough people who want update signals at all?

Opinion welcome.

comment:11 by Ben Davis, 10 years ago

I would love to have an update signal for the purpose of cache invalidation. Currently I use save/delete signals to detect when something has changed to update an expensive calculation. While pk_set idea solves some problems, it's not ideal for large result sets and has unpredictable consequences.

For example, let's say I don't care about the primary keys of objects that were updated, but rather the objects related to those that were updated. I could run a much less expensive query by doing, for example, queryset.distinct().values_list('related_id').

Personally I feel it's better to allow the user be explicit in what is passed around. I would propose that pre_update accept the queryset, and users have the option to return a dict of extra data that will be available to post_update receivers. For example:

@receiver(pre_update, sender=MyModel)
def on_pre_update(sender, update_fields, queryset, **kwargs):
    return {
        'related_pks': queryset.distinct().values_list('relatedmodel_id')
    }

@receiver(post_update, sender=MyModel)
def on_post_update(sender, update_feilds, extra_data, **kwargs):
    related_pks = extra_data.get('related_pks', [])
    for obj in RelatedModel.objects.filter(pk__in=related_pks):
        obj.do_something()

The implementation if QuerySet.update() would look like this:

def update(self, **kwargs):
  """
  Updates all elements in the current QuerySet, setting all the given
  fields to the appropriate values.
  """
  assert self.query.can_filter(), \
      "Cannot update a query once a slice has been taken."
  meta = self.model._meta
  self._for_write = True
  query = self.query.clone(sql.UpdateQuery)
  query.add_update_values(kwargs)
  extra_data = {}
  if not meta.auto_created:
      responses = signals.pre_update.send(
          sender=self.model, update_fields=kwargs, queryset=self,
          using=self.db)
      for rcvr, response in responses:
          extra_data.update(response)
  with transaction.atomic(using=self.db, savepoint=False):
      rows = query.get_compiler(self.db).execute_sql(CURSOR)
  self._result_cache = None
  if not meta.auto_created:
      signals.post_update.send(
          sender=self.model, update_fields=kwargs,
          extra_data=extra_data, using=self.db)
  return rows

It's a bit cleaner and addresses any performance concerns, as well as the issue with updated pks. Regarding comment:10, I think there use cases for both pre_update and post_update, either together or alone.

comment:12 by Ben Davis, 10 years ago

Just created a PR for my above suggestion, with updated docs. I'm assuming this would now be targeted for 1.8 since we missed the feature freeze.

PR: https://github.com/django/django/pull/2619
Commit: https://github.com/bendavis78/django/commit/62a9ea61639a4a6ba9fde0eabc9f546db0a40de8

comment:13 by Shai Berger, 10 years ago

While the idea in comment:11 is interesting, I think the current suggestion is lacking.

First of all, returning values from signals breaks with well-established assumptions, and I wouldn't haste to change that. Further, the suggested code assumes without checking that the returned values can be used to update a dictionary, and worse: Collapses all returned values into one dictionary, allowing different signal handlers to step on each other's toes with no way to find out about it.

I think that the general direction of letting the user specify which data is received has merit, but this should be implemented via the registration API -- that is, either defining different signals ('pre_update', 'pre_update_with_qset', 'pre_update_with_pks') or specifying different senders (something along the lines of @receiver(pre_update, sender=(MyModel, pre_update.WITH_QSET))). Then, update can calculate the required data only if it is, in fact, requested.

comment:14 by Tim Graham, 10 years ago

Patch needs improvement: set

comment:15 by Ben Davis, 10 years ago

Cc: Ben Davis added

Yeah, you're right, using the dict is a messy way to solve the problem. I guess idea was to put the onus of "type" handling on the pre_save & post_save. Ideally the update() method would be agnostic regarding the payloads being passed between the signals. I think you've got the right idea with using sender, though I don't think the API should be prescribing those types (ie pks vs querysets).

Not sure what the right answer is at this point -- maybe someone will come up with a more elegant approach.

comment:16 by Sardorbek Imomaliev, 4 years ago

Cc: Sardorbek Imomaliev added

comment:17 by Olivier Dalang, 3 years ago

Cc: Olivier Dalang added

comment:18 by Olivier Dalang, 3 years ago

For reference, here's a package implementing this : https://github.com/martinphellwig/django-query-signals
It includes also other bulk methods such as delete or bulk_create.

From the code, it looks like it suffers from edge case described above (the original queryset doesn't work for post_update, the update might cause different instances to be returned than was updated, consider qs.filter(deleted=False).update(deleted=True)), but nothing unfixable it seems.

comment:19 by Mariusz Felisiak, 12 months ago

Owner: loic84 removed
Status: assignednew

comment:20 by Ülgen Sarıkavak, 21 hours ago

Cc: Ülgen Sarıkavak added
Note: See TracTickets for help on using tickets.
Back to Top