Opened 2 months ago
Last modified 7 days ago
#36489 assigned Bug
OneToOneField + concurrent get_or_create results in wrong object in field cache
Reported by: | Brian Atkinson | Owned by: | Jason Hall |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | 5.2 |
Severity: | Normal | Keywords: | |
Cc: | Simon Charette, Mariusz Felisiak | Triage Stage: | Accepted |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description (last modified by )
When using get_or_create
and update_or_create
in a context where concurrent creations are happening, the caller that loses the creation race will end up with an incomplete object in the field cache.
Example models:
from django.db import models class Parent(models.Model): name = models.TextField() class Child(models.Model): parent = models.OneToOneField(Parent, on_delete=models.CASCADE)
In effect, we have two separate processes both expecting to be able to call Child.objects.get_or_create(parent=common_parent, defaults={...})
and have the resulting common_parent.child
be the correct result. We are seeing both processes fail the initial get call (https://github.com/django/django/blob/5.2.3/django/db/models/query.py#L946) and proceeding to attempt the create call (https://github.com/django/django/blob/5.2.3/django/db/models/query.py#L953). One of the processes succeeds on the insert, and the other will fail the insert. The process that fails the insert correctly issues the second get call (https://github.com/django/django/blob/5.2.3/django/db/models/query.py#L956) and returns the result.
The bug is that for the process that fails the insert, the object at common_parent.child
is not the value returned from the get_or_create. It seems to be the object that was created for insert, but wasn't cleared when the insert failed. This can be observed through the common_parent.child.id
field being None. This has caused data corruption and other downstream bugs due to the wrong data being cached.
Here is a test case that is a linear representation of the concurrent bug described above:
from unittest import mock from django.db.models import QuerySet from django.test import TestCase from sample.models import Child, Parent class TestBug(TestCase): def test_bug(self): parent = Parent.objects.create(name="Test name") triggered = False def effect(*args, **kwargs): nonlocal triggered if not triggered and kwargs.get("parent") == parent: triggered = True Child.objects.update_or_create( # Force a brand new instance (emulates racing thread). parent=Parent.objects.get(pk=parent.pk), defaults={}, ) return mock.DEFAULT with mock.patch.object( QuerySet, "_extract_model_params", autospec=True, wraps=QuerySet._extract_model_params, side_effect=effect, ): child, created = Child.objects.get_or_create(parent=parent, defaults={}) assert not created assert child.parent_id is not None assert parent.child.id is not None
At first glance this test case will seem quite janky. The patching is being done to emulate the effect of two racing threads by injecting a concurrent call to get_or_create
in the middle of the outer get_or_create
.
Change History (14)
comment:1 by , 2 months ago
Description: | modified (diff) |
---|
comment:2 by , 2 months ago
Cc: | added |
---|---|
Component: | Uncategorized → Database layer (models, ORM) |
Triage Stage: | Unreviewed → Accepted |
Type: | Uncategorized → Bug |
comment:3 by , 2 months ago
Also I don't think the update_or_create call is necessary, we can stick to two get_or_create I think?
You are correct, I missed cleaning that up from an earlier iteration of building the test case. We'd initially seen it with two racing update_or_create
, but upon investigation it is possible with just the underlying get_or_create
. I'm happy to update the example in the initial report with that cleanup if that would be helpful.
Thank you for taking the time to look into this.
comment:4 by , 2 months ago
Thank you for your very detailed report. I understand this is a tricky issue to diagnose, particularly because it should be rare under normal circumstances, but it's hard to determine what should be the right thing to do in this case as the ORM doesn't maintain an identity map (e.g. like SQLAlchemy session for example) so it has to balance a delicate sequence of cache clearing.
A naive approach could be the following but likely breaks usage of select_related("parent").get_or_create(parent=parent)
by forcing an extra query as it systematically clears the cache
-
django/db/models/query.py
diff --git a/django/db/models/query.py b/django/db/models/query.py index 63ab4a873a..b03f318076 100644
a b 18 18 IntegrityError, 19 19 NotSupportedError, 20 20 connections, 21 models, 21 22 router, 22 23 transaction, 23 24 ) … … def get_or_create(self, defaults=None, **kwargs): 984 984 return self.create(**params), True 985 985 except IntegrityError: 986 986 try: 987 return self.get(**kwargs), False987 obj = self.get(**kwargs) 988 988 except self.model.DoesNotExist: 989 989 pass 990 else: 991 get_field = self.model._meta.get_field 992 for field_name, value in kwargs.items(): 993 if not isinstance(value, models.Model): 994 continue 995 try: 996 field = get_field(field_name) 997 except exceptions.FieldDoesNotExist: 998 continue 999 if getattr(field, "one_to_one", False): 1000 field.remote_field.delete_cached_value(value) 1001 return obj, False 990 1002 raise 991 1003 992 1004 get_or_create.alters_data = True
while the ORM doesn't have a session wide concept of an identity map though it has one per queryset stored in the QuerySet.known_related_objects: dict[Field, dict]
attribute so it might be usable for this purpose? It seems like it would be coherent with how many related managers work by maintaining affinity (e.g. author.books.get_or_create
)
-
django/db/models/query.py
diff --git a/django/db/models/query.py b/django/db/models/query.py index 63ab4a873a..b03f318076 100644
a b 18 18 IntegrityError, 19 19 NotSupportedError, 20 20 connections, 21 models, 21 22 router, 22 23 transaction, 23 24 ) … … def get_or_create(self, defaults=None, **kwargs): 973 974 # The get() needs to be targeted at the write database in order 974 975 # to avoid potential transaction consistency problems. 975 976 self._for_write = True 977 get_field = self.model._meta.get_field 978 known_related_objects = {} 979 for field_name, value in kwargs.items(): 980 if not isinstance(value, models.Model): 981 continue 982 try: 983 field = get_field(field_name) 984 except exceptions.FieldDoesNotExist: 985 continue 986 if field.remote_field: 987 known_related_objects.setdefault(field, {})[value.pk] = value 988 qs = self 989 if known_related_objects: 990 qs = qs._clone() 991 for field, objects in known_related_objects.items(): 992 qs._known_related_objects.setdefault(field, {}).update(objects) 976 993 try: 977 return self.get(**kwargs), False994 return qs.get(**kwargs), False 978 995 except self.model.DoesNotExist: 979 996 params = self._extract_model_params(defaults, **kwargs) 980 997 # Try to create an object using passed params. 981 998 try: 982 999 with transaction.atomic(using=self.db): 983 1000 params = dict(resolve_callables(params)) 984 return self.create(**params), True1001 return qs.create(**params), True 985 1002 except IntegrityError: 986 1003 try: 987 return self.get(**kwargs), False1004 return qs.get(**kwargs), False 988 1005 except self.model.DoesNotExist: 989 1006 pass 990 1007 raise
it does bring a few questions though mainly should update_or_create
be updated as well and to what extent? Should defaults
and create_defaults
also be considered? Only when creation actually takes place or systematically?
comment:5 by , 2 months ago
The part that bit us is that the object left behind from the failed creation could have different field values from the actual row read from the database. The example above had much of that stripped, but imagine one of the fields provided during creation is context sensitive (example: UUID of the request), then the parent.child
could have data that differs from the returned row.
More concretely, consider adding the following to the Child table:
context_id = models.UUIDField(default=uuid.uuid4)
I would expect the following to hold:
assert child.context_id == parent.child.context_id
From my perspective as a user, I might prefer an additional read to populate that relation over that inequality not holding. At least I think that would be my preference. The comments about an "identity map" make me confident that there are a lot more details around how the APIs should be behaving that I'm not unaware of. Is there an assumption that the following should be true?
assert child is parent.child
Thanks for taking the time to consider this issue.
comment:6 by , 2 months ago
From my perspective as a user, I might prefer an additional read to populate that relation over that inequality not holding. At least I think that would be my preference.
Proper cache invalidation would effectively be the less risky option but it's far from trivial do to correctly. We can't systematically clear the cache as it might discard explicit relationship retrieval through select_related
as alluded above.
Is there an assumption that the following should be true?
assert child is parent.child
No, in particular when created is False
like it's the case here.
Think of it this way, get_or_create
is meant to be a concurrency safe hybrid between get
and create
so users don't have to re-implement it in an unsafe way every time the upsert pattern is needed.
In the case of get
the child is parent.child
assumption will never hold true, that is
>>> child = Child.objects.get(parent=parent) >>> child is parent.child False
as there is no session identity mapper. On the hand, because of implementation details, the assumption will hold true for create
>>> child = Child.objects.create(parent=parent) >>> child is parent.child True
which brings the question of what exactly should be done in cases where get_or_create
falls back to get
(created is False
) given it's meant to be an hybrid. Is it better to keep things as they are today and avoid introducing an exception when a to-one relationship is provided and the object already exists or is it better to align it to-many related managers which offer a way to tap-in this per-queryset identity mapper logic that to-one relationship cannot take advantage of because attribute access doesn't yield a manager
>>> author.books.create(title="2001: A Space Odyssey") >>> book, created = author.books.get_or_create(title="2001: A Space Odyssey") >>> created # get() was used. False >>> book.author is author True
I'll have to ponder on that one a bit because while I can see the appeal it introduces significant complexity with high risk of subtle breakages. If we are fixing this I'm of opinion that we should make it work for get_or_create
under all circumstances where get
is used and not only on the IntegrityError
fallback as that would be even more confusing than it is today.
comment:7 by , 6 weeks ago
Owner: | set to |
---|---|
Status: | new → assigned |
comment:8 by , 6 weeks ago
Has patch: | set |
---|
comment:9 by , 5 weeks ago
Patch needs improvement: | set |
---|
comment:10 by , 5 weeks ago
Triage Stage: | Accepted → Ready for checkin |
---|
comment:11 by , 5 weeks ago
Triage Stage: | Ready for checkin → Accepted |
---|
Please refer to the contributing documentation, you can't mark your own patch RFC.
comment:12 by , 5 weeks ago
Patch needs improvement: | unset |
---|
comment:14 by , 7 days ago
Natalia, Simon and Sarah - thank you for looking at this issue and the patch.
Jason - thanks for making the patch in the first place.
Given that the patch doesn't seem to fully solve the problem, and given that this issue affects all OneToOneFields and causes a subtle data corruption that could be hard to detect and diagnose, I have two questions:
- Is there a way to raise awareness of this issue and seek help in fixing? We would offer to give it a shot, but this is very deep ORM code and we don't feel like we have sufficient experience to attempt the patch.
- In the meantime, could you advise on the work around? My current thought is that we should use ForeignKey(unique=True) and eat the Django check warning https://github.com/django/django/blob/main/django/db/models/fields/related.py#L1079 This would force us to use
child_set.first()
but that's a small ergonomic price to pay. Do you have any thoughts on this approach or a better suggestion instead?
Thank you Brian Atkinson for taking the time to create this ticket. I have reviewed the test case that you provided, which is quite helpful to understand your report.
I was able to reproduce the issue as described. When a concurrent
get_or_create()
call results in anIntegrityError
, the model instance created for the first attempt is incorrectly cached on the reverse side of the one-to-one relation. This leads to inconsistent behavior, such as accessingparent.child
and getting an unsavedChild
instance withpk=None
.A few notes: I believe the issue is somewhere between the related descriptors and how the reverse 1-1 is set but I haven't been able to exactly pin point the precise place that needs fixing. Also I don't think the
update_or_create
call is necessary, we can stick to twoget_or_create
I think? or is there any value/difference to use one of each?