Opened 3 years ago
Last modified 3 years ago
#33672 closed Cleanup/optimization
bulk_update - duplicate updates as side effect from batch_size
| Reported by: | jerch | Owned by: | nobody |
|---|---|---|---|
| Component: | Database layer (models, ORM) | Version: | 4.0 |
| Severity: | Normal | Keywords: | |
| Cc: | | Triage Stage: | Unreviewed |
| Has patch: | no | Needs documentation: | no |
| Needs tests: | no | Patch needs improvement: | no |
| Easy pickings: | no | UI/UX: | no |
Description
While trying to mimic `bulk_update` behavior for `fast_update`, I stumbled over duplicate updates arising from different `batch_size` values. To illustrate the issue:
```python
>>> from exampleapp.models import Foo
>>> from django.db import connection
>>> Foo.objects.all().values('pk')
<QuerySet [{'pk': 1}, {'pk': 2}, {'pk': 3}, {'pk': 4}]>
>>> pks = [1, 2, 3, 3, 2, 2, 1, 1, 4]
>>> objs = [Foo(pk=pk) for pk in pks]
>>> Foo.objects.bulk_update(objs, ['f1'])  # within batch_size: all duplicates filtered
>>> connection.queries[-1]['sql']
'UPDATE "exampleapp_foo" SET "f1" = <CASE chain shortened...> WHERE "exampleapp_foo"."id" IN (1, 2, 3, 4)'
>>> Foo.objects.bulk_update(objs, ['f1'], batch_size=4)  # multiple batches
>>> connection.queries[-1]['sql']
'UPDATE "exampleapp_foo" SET "f1" = <CASE chain shortened...> WHERE "exampleapp_foo"."id" IN (4)'
>>> connection.queries[-2]['sql']
'UPDATE "exampleapp_foo" SET "f1" = <CASE chain shortened...> WHERE "exampleapp_foo"."id" IN (2, 1)'
>>> connection.queries[-3]['sql']
'UPDATE "exampleapp_foo" SET "f1" = <CASE chain shortened...> WHERE "exampleapp_foo"."id" IN (1, 2, 3)'
```
The example above requests a bulk update for a degenerate pk list with duplicates, `[1, 2, 3, 3, 2, 2, 1, 1, 4]`. `batch_size` now has a weird effect on which updates "get through":
- all within one batch: update done as `[[1, 2, 3, 4]]`
- scattered across batches with `batch_size=4`: updates done as `[[1, 2, 3], [2, 1], [4]]` (a minimal sketch of this batching follows the list)
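For clarity, here is a minimal pure-Python sketch of per-batch deduplication that reproduces the groupings above (`batched_unique_pks` is a hypothetical helper for illustration, not Django's actual implementation):

```python
def batched_unique_pks(pks, batch_size):
    """Split pks into batches, then drop duplicates within each batch only."""
    batches = [pks[i:i + batch_size] for i in range(0, len(pks), batch_size)]
    deduped = []
    for batch in batches:
        seen = []
        for pk in batch:
            if pk not in seen:  # duplicates are only dropped within a batch
                seen.append(pk)
        deduped.append(seen)
    return deduped

pks = [1, 2, 3, 3, 2, 2, 1, 1, 4]
print(batched_unique_pks(pks, len(pks)))  # [[1, 2, 3, 4]]           - one batch
print(batched_unique_pks(pks, 4))         # [[1, 2, 3], [2, 1], [4]] - three batches
```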
I understand the need to drop duplicates within a single UPDATE invocation (I need that for `fast_update` as well), but isn't the current behavior surfacing a side effect of an implementation detail here? Imho that behavior is hard for the user to anticipate or control.
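To see why duplicates matter within one statement: the SQL above shows the new values expressed as a CASE chain keyed on pk, and SQL evaluates WHEN branches in order, so a second assignment for the same pk can never fire. A simplified sketch of that conflict (assuming the `Foo`/`f1` model from the example and a text-valued `f1`, not the actual Django source):

```python
from django.db.models import Case, Value, When

from exampleapp.models import Foo

# With duplicate pks in one CASE chain, only the first matching WHEN
# applies; the later branch for pk=1 is dead and silently ignored.
Foo.objects.filter(pk__in=[1, 2]).update(
    f1=Case(
        When(pk=1, then=Value('a')),
        When(pk=2, then=Value('b')),
        When(pk=1, then=Value('c')),  # dead branch: pk=1 already matched above
    )
)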
Idea/suggestion:
How about treating this more uniformly/atomically at the `bulk_update` call level by either:
- filtering duplicates across the whole changeset (automatic filtering still might surprise users), or
- disallowing/raising on any duplicate in a changeset (I would prefer that, for code simplicity and stricter behavior; see the sketch after this list).
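A minimal sketch of the stricter variant, assuming a hypothetical `check_unique_pks` helper run once over the whole changeset before any batching (illustration only, not a proposed patch):

```python
def check_unique_pks(objs):
    """Raise on the first duplicate pk in a changeset, before any batching."""
    seen = set()
    for obj in objs:
        if obj.pk in seen:
            raise ValueError(f'bulk_update() received duplicate pk {obj.pk!r}')
        seen.add(obj.pk)

class Obj:  # stand-in for a model instance
    def __init__(self, pk):
        self.pk = pk

check_unique_pks([Obj(pk) for pk in [1, 2, 3, 4]])  # passes silently
check_unique_pks([Obj(pk) for pk in [1, 2, 3, 3]])  # raises ValueError
```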
With such a change, `bulk_update` would always exhibit the same behavior for a given changeset, regardless of `batch_size`.
(I categorized this as cleanup/optimization, since the unfortunate behavior is documented.)