Opened 2 years ago

Closed 2 years ago

Last modified 2 years ago

#33672 closed Cleanup/optimization (duplicate)

bulk_update - duplicate updates as side effect from batch_size

Reported by: jerch Owned by: nobody
Component: Database layer (models, ORM) Version: 4.0
Severity: Normal Keywords:
Cc: Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by jerch)

While trying to mimic bulk_update's behavior for fast_update, I stumbled over duplicate updates whose outcome depends on batch_size. To illustrate the issue:

Code highlighting:

>>> from exampleapp.models import Foo
>>> from django.db import connection
>>> Foo.objects.all().values('pk')
<QuerySet [{'pk': 1}, {'pk': 2}, {'pk': 3}, {'pk': 4}]>
>>> pks=[1,2,3,3,2,2,1,1,4]
>>> objs=[Foo(pk=pk) for pk in pks]
>>> Foo.objects.bulk_update(objs, ['f1'])  # within batch_size: all duplicates filtered
>>> connection.queries[-1]['sql']
'UPDATE "exampleapp_foo" SET "f1" = <CASE chain shortened...> WHERE "exampleapp_foo"."id" IN (1, 2, 3, 4)'
>>> Foo.objects.bulk_update(objs, ['f1'], batch_size=4)  # multiple batches
>>> connection.queries[-1]['sql']
'UPDATE "exampleapp_foo" SET "f1" = <CASE chain shortened...> WHERE "exampleapp_foo"."id" IN (4)'
>>> connection.queries[-2]['sql']
'UPDATE "exampleapp_foo" SET "f1" = <CASE chain shortened...> WHERE "exampleapp_foo"."id" IN (2, 1)'
>>> connection.queries[-3]['sql']
'UPDATE "exampleapp_foo" SET "f1" = <CASE chain shortened...> WHERE "exampleapp_foo"."id" IN (1, 2, 3)'

The example above requests a bulk update for a degenerate pk list with duplicates, [1, 2, 3, 3, 2, 2, 1, 1, 4]. batch_size now has a weird effect on which updates "get through":

  • all within one batch - update done [[1,2,3,4]]
  • scattered across batches with batch_size=4 - updates done [[1,2,3], [2,1], [4]]
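The batch-dependent survivors above can be reproduced outside of Django. The following is a minimal sketch (the function name `dedupe_per_batch` is illustrative, not a Django internal) of what per-batch deduplication does to a changeset with repeated pks:

```python
def dedupe_per_batch(pks, batch_size=None):
    """Split pks into batches, then keep only the first
    occurrence of each pk within its own batch (what an
    UPDATE ... WHERE id IN (...) per batch effectively does)."""
    batch_size = batch_size or len(pks)
    batches = [pks[i:i + batch_size] for i in range(0, len(pks), batch_size)]
    # dict.fromkeys preserves order while dropping duplicates
    return [list(dict.fromkeys(batch)) for batch in batches]

pks = [1, 2, 3, 3, 2, 2, 1, 1, 4]
print(dedupe_per_batch(pks))     # one batch:   [[1, 2, 3, 4]]
print(dedupe_per_batch(pks, 4))  # batches of 4: [[1, 2, 3], [2, 1], [4]]
```

Note how the duplicates at positions 4-7 survive into the second batch once the list is split, which is exactly the surprising part: the *same* changeset produces different UPDATE statements for different batch_size values.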

I understand the need to drop duplicates within a single UPDATE invocation (fast_update needs that as well), but isn't the current behavior surfacing a side effect of an implementation detail? In my opinion that behavior is hard for the user to anticipate or control.

Idea/suggestion:
How about treating duplicates uniformly and atomically at the bulk_update call level by either:

  • filtering duplicates across the whole changeset (automatic filtering might still surprise users)
  • or disallowing duplicates, i.e. raising for any duplicate in a changeset (I'd prefer this for code simplicity and stricter behavior)
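The second option could look roughly like the following. This is a hedged sketch, not Django code: the helper name `check_unique_pks` and the error message are my own, and a real patch would sit inside bulk_update before batching.

```python
from collections import Counter

def check_unique_pks(pks):
    """Raise if the changeset contains duplicate primary keys,
    so the outcome can never depend on batch_size."""
    dupes = sorted(pk for pk, n in Counter(pks).items() if n > 1)
    if dupes:
        raise ValueError(
            f"bulk_update() received objects with duplicate primary keys: {dupes}"
        )

check_unique_pks([1, 2, 3, 4])                    # fine, no duplicates
# check_unique_pks([1, 2, 3, 3, 2, 2, 1, 1, 4])  # would raise ValueError
```

The check is a single O(n) pass over the pks, which matches the observation in #29968 that failing loudly is not expensive.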

With such a change bulk_update would always exhibit the same behavior for a given changeset, regardless of batch_size.

(Categorized this as cleanup/optimization, since the unfortunate behavior is documented.)

Change History (4)

comment:1 by jerch, 2 years ago

Description: modified (diff)

comment:2 by Mariusz Felisiak, 2 years ago

Resolution: duplicate
Status: new → closed

Thanks for this ticket, however we decided that it is not worth additional complexity (see #29968) and it's enough to document this caveat.

Duplicate of #32388.

comment:3 by jerch, 2 years ago

Would you reconsider the decision if I provided the patch myself? (I'd need to apply small changes to bulk_update anyway if the fast implementation is to be done...)

Some reasoning from my side:

  • As a software dev I'm inclined to reduce mistakes as much as possible. That's even more true for middleware libraries like Django, where people have only limited control through API usage. Here the API only lets you play with batch_size, an argument whose primary purpose is definitely not to change update behavior. Thus it is a nasty side effect.
  • The first comment in https://code.djangoproject.com/ticket/29968, which brought up the performance costs, admits that this behavior is a mistake. And the original implementer answered with "... the behavior could be pretty confusing and it's not that expensive to fail loudly." How that led to just documenting the weird behavior instead of fixing it is a mystery to me. He even pointed out that the performance overhead would be small. (Yes it is; I've already tested that...)
  • This is the 3rd ticket about the same underlying issue (my apologies for not checking beforehand). Maybe that's an indicator that people are worried about it?
  • Last but not least: to make the fast update implementation congruent, I need reliable update behavior in bulk_update. To match its current behavior, I'd have to write a really ugly patch around the batch handling, which costs ~30% performance there, just to stay in line with partially broken behavior. Isn't it better to fix that broken behavior in the first place?

in reply to:  3 comment:4 by Mariusz Felisiak, 2 years ago

Replying to jerch:

Would you reconsider the decision if I provided the patch myself? (I'd need to apply small changes to bulk_update anyway if the fast implementation is to be done...)

I still believe that it's really niche, however I'd be happy to review your patch (please use Refs #32388 -- ... the original ticket).
