Opened 3 years ago

Last modified 3 years ago

#33672 closed Cleanup/optimization

bulk_update - duplicate updates as side effect from batch_size — at Initial Version

Reported by: jerch
Owned by: nobody
Component: Database layer (models, ORM)
Version: 4.0
Severity: Normal
Keywords:
Cc:
Triage Stage: Unreviewed
Has patch: no
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

While trying to mimic bulk_update behavior for fast_update, I stumbled over duplicate updates caused by different batch_size values. To illustrate the issue:

Code highlighting:

>>> from exampleapp.models import Foo
>>> from django.db import connection
>>> Foo.objects.all().values('pk')
<QuerySet [{'pk': 1}, {'pk': 2}, {'pk': 3}, {'pk': 4}]>
>>> pks=[1,2,3,3,2,2,1,1,4]
>>> objs=[Foo(pk=pk) for pk in pks]
>>> Foo.objects.bulk_update(objs, ['f1'])  # within batch_size: all duplicates filtered
>>> connection.queries[-1]['sql']
'UPDATE "exampleapp_foo" SET "f1" = <CASE chain shortened...> WHERE "exampleapp_foo"."id" IN (1, 2, 3, 4)'
>>> Foo.objects.bulk_update(objs, ['f1'], batch_size=4)  # multiple batches
>>> connection.queries[-1]['sql']
'UPDATE "exampleapp_foo" SET "f1" = <CASE chain shortened...> WHERE "exampleapp_foo"."id" IN (4)'
>>> connection.queries[-2]['sql']
'UPDATE "exampleapp_foo" SET "f1" = <CASE chain shortened...> WHERE "exampleapp_foo"."id" IN (2, 1)'
>>> connection.queries[-3]['sql']
'UPDATE "exampleapp_foo" SET "f1" = <CASE chain shortened...> WHERE "exampleapp_foo"."id" IN (1, 2, 3)'

The example above requests a bulk update for a degenerate pk list with duplicates, [1, 2, 3, 3, 2, 2, 1, 1, 4]. batch_size now has a surprising effect on which updates "get through" (see the sketch after this list):

  • all objects within one batch - updates issued for [[1, 2, 3, 4]]
  • scattered across batches with batch_size=4 - updates issued for [[1, 2, 3], [2, 1], [4]]
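
A minimal sketch of the apparent per-batch handling (not Django's actual implementation; effective_updates is a made-up helper) reproduces the pk lists seen in the queries above:

def effective_updates(pks, batch_size=None):
    """Return the pk list each UPDATE would target, assuming duplicates
    are only dropped within a single batch (first occurrence wins)."""
    batch_size = batch_size or len(pks)
    batches = [pks[i:i + batch_size] for i in range(0, len(pks), batch_size)]
    result = []
    for batch in batches:
        seen, deduped = set(), []
        for pk in batch:
            if pk not in seen:  # duplicates are only filtered inside the batch
                seen.add(pk)
                deduped.append(pk)
        result.append(deduped)
    return result

pks = [1, 2, 3, 3, 2, 2, 1, 1, 4]
print(effective_updates(pks))                # [[1, 2, 3, 4]] - one UPDATE, each pk touched once
print(effective_updates(pks, batch_size=4))  # [[1, 2, 3], [2, 1], [4]] - pks 1 and 2 are updated twice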

I understand the need to drop duplicates within a single UPDATE invocation (fast_update needs that as well), but isn't the current behavior surfacing a side effect of an implementation detail here? Imho that behavior is hard for the user to anticipate or control.

Idea/suggestion:
How about treating the changeset more uniformly/atomically at the bulk_update call level, by either:

  • filtering duplicates across the whole changeset (automatic filtering might still surprise users),
  • or disallowing duplicates in a changeset and raising an error (I would prefer this for code simplicity and clearer behavior; a sketch of such a check follows below).

With such a change bulk_update would always exhibit the same behavior for a given changeset, regardless of batch_size.
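
A minimal sketch of the second option, assuming the check runs once over the whole changeset before any batching (check_unique_pks is a made-up name, not an existing Django helper):

from collections import Counter

def check_unique_pks(objs):
    # Count pk occurrences across the whole changeset and reject duplicates
    # before any batching takes place.
    counts = Counter(obj.pk for obj in objs)
    duplicates = sorted(pk for pk, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(
            f"bulk_update() received objects with duplicate pks: {duplicates}"
        )

For the pk list above, [1, 2, 3, 3, 2, 2, 1, 1, 4], this would raise with duplicates [1, 2, 3].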

(Categorized this as cleanup/optimization, since the unfortunate behavior is documented.)

Change History (0)
