Opened 2 months ago

Last modified 7 days ago

#36489 assigned Bug

OneToOneField + concurrent get_or_create results in wrong object in field cache

Reported by: Brian Atkinson Owned by: Jason Hall
Component: Database layer (models, ORM) Version: 5.2
Severity: Normal Keywords:
Cc: Simon Charette, Mariusz Felisiak Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Brian Atkinson)

When using get_or_create and update_or_create in a context where concurrent creations are happening, the caller that loses the creation race will end up with an incomplete object in the field cache.

Example models:

from django.db import models

class Parent(models.Model):
    name = models.TextField()

class Child(models.Model):
    parent = models.OneToOneField(Parent, on_delete=models.CASCADE)

In effect, we have two separate processes both expecting to be able to call Child.objects.get_or_create(parent=common_parent, defaults={...}) and have the resulting common_parent.child be the correct result. We are seeing both processes fail the initial get call (https://github.com/django/django/blob/5.2.3/django/db/models/query.py#L946) and proceeding to attempt the create call (https://github.com/django/django/blob/5.2.3/django/db/models/query.py#L953). One of the processes succeeds on the insert, and the other will fail the insert. The process that fails the insert correctly issues the second get call (https://github.com/django/django/blob/5.2.3/django/db/models/query.py#L956) and returns the result.

The bug is that for the process that fails the insert, the object at common_parent.child is not the value returned from the get_or_create. It seems to be the object that was created for insert, but wasn't cleared when the insert failed. This can be observed through the common_parent.child.id field being None. This has caused data corruption and other downstream bugs due to the wrong data being cached.

Here is a test case that is a linear representation of the concurrent bug described above:

from unittest import mock

from django.db.models import QuerySet
from django.test import TestCase

from sample.models import Child, Parent

class TestBug(TestCase):
    def test_bug(self):
        parent = Parent.objects.create(name="Test name")

        triggered = False

        def effect(*args, **kwargs):
            nonlocal triggered
            if not triggered and kwargs.get("parent") == parent:
                triggered = True
                Child.objects.update_or_create(
                    # Force a brand new instance (emulates racing thread).
                    parent=Parent.objects.get(pk=parent.pk),
                    defaults={},
                )
            return mock.DEFAULT

        with mock.patch.object(
            QuerySet,
            "_extract_model_params",
            autospec=True,
            wraps=QuerySet._extract_model_params,
            side_effect=effect,
        ):
            child, created = Child.objects.get_or_create(parent=parent, defaults={})
            assert not created

        assert child.parent_id is not None
        assert parent.child.id is not None

At first glance this test case will seem quite janky. The patching is being done to emulate the effect of two racing threads by injecting a concurrent call to get_or_create in the middle of the outer get_or_create.

Change History (14)

comment:1 by Brian Atkinson, 2 months ago

Description: modified (diff)

comment:2 by Natalia Bidart, 2 months ago

Cc: Simon Charette Mariusz Felisiak added
Component: UncategorizedDatabase layer (models, ORM)
Triage Stage: UnreviewedAccepted
Type: UncategorizedBug

Thank you Brian Atkinson for taking the time to create this ticket. I have reviewed the test case that you provided, which is quite helpful to understand your report.

I was able to reproduce the issue as described. When a concurrent get_or_create() call results in an IntegrityError, the model instance created for the first attempt is incorrectly cached on the reverse side of the one-to-one relation. This leads to inconsistent behavior, such as accessing parent.child and getting an unsaved Child instance with pk=None.

A few notes: I believe the issue is somewhere between the related descriptors and how the reverse 1-1 is set but I haven't been able to exactly pin point the precise place that needs fixing. Also I don't think the update_or_create call is necessary, we can stick to two get_or_create I think? or is there any value/difference to use one of each?

comment:3 by Brian Atkinson, 2 months ago

Also I don't think the update_or_create call is necessary, we can stick to two get_or_create I think?

You are correct, I missed cleaning that up from an earlier iteration of building the test case. We'd initially seen it with two racing update_or_create, but upon investigation it is possible with just the underlying get_or_create. I'm happy to update the example in the initial report with that cleanup if that would be helpful.

Thank you for taking the time to look into this.

comment:4 by Simon Charette, 2 months ago

Thank you for your very detailed report. I understand this is a tricky issue to diagnose, particularly because it should be rare under normal circumstances, but it's hard to determine what should be the right thing to do in this case as the ORM doesn't maintain an identity map (e.g. like SQLAlchemy session for example) so it has to balance a delicate sequence of cache clearing.

A naive approach could be the following but likely breaks usage of select_related("parent").get_or_create(parent=parent) by forcing an extra query as it systematically clears the cache

  • django/db/models/query.py

    diff --git a/django/db/models/query.py b/django/db/models/query.py
    index 63ab4a873a..b03f318076 100644
    a b  
    1818    IntegrityError,
    1919    NotSupportedError,
    2020    connections,
     21    models,
    2122    router,
    2223    transaction,
    2324)
    def get_or_create(self, defaults=None, **kwargs):  
    984984                    return self.create(**params), True
    985985            except IntegrityError:
    986986                try:
    987                     return self.get(**kwargs), False
     987                    obj = self.get(**kwargs)
    988988                except self.model.DoesNotExist:
    989989                    pass
     990                else:
     991                    get_field = self.model._meta.get_field
     992                    for field_name, value in kwargs.items():
     993                        if not isinstance(value, models.Model):
     994                            continue
     995                        try:
     996                            field = get_field(field_name)
     997                        except exceptions.FieldDoesNotExist:
     998                            continue
     999                        if getattr(field, "one_to_one", False):
     1000                            field.remote_field.delete_cached_value(value)
     1001                    return obj, False
    9901002                raise
    9911003
    9921004    get_or_create.alters_data = True

while the ORM doesn't have a session wide concept of an identity map though it has one per queryset stored in the QuerySet.known_related_objects: dict[Field, dict] attribute so it might be usable for this purpose? It seems like it would be coherent with how many related managers work by maintaining affinity (e.g. author.books.get_or_create)

  • django/db/models/query.py

    diff --git a/django/db/models/query.py b/django/db/models/query.py
    index 63ab4a873a..b03f318076 100644
    a b  
    1818    IntegrityError,
    1919    NotSupportedError,
    2020    connections,
     21    models,
    2122    router,
    2223    transaction,
    2324)
    def get_or_create(self, defaults=None, **kwargs):  
    973974        # The get() needs to be targeted at the write database in order
    974975        # to avoid potential transaction consistency problems.
    975976        self._for_write = True
     977        get_field = self.model._meta.get_field
     978        known_related_objects = {}
     979        for field_name, value in kwargs.items():
     980            if not isinstance(value, models.Model):
     981                continue
     982            try:
     983                field = get_field(field_name)
     984            except exceptions.FieldDoesNotExist:
     985                continue
     986            if field.remote_field:
     987                known_related_objects.setdefault(field, {})[value.pk] = value
     988        qs = self
     989        if known_related_objects:
     990            qs = qs._clone()
     991            for field, objects in known_related_objects.items():
     992                qs._known_related_objects.setdefault(field, {}).update(objects)
    976993        try:
    977             return self.get(**kwargs), False
     994            return qs.get(**kwargs), False
    978995        except self.model.DoesNotExist:
    979996            params = self._extract_model_params(defaults, **kwargs)
    980997            # Try to create an object using passed params.
    981998            try:
    982999                with transaction.atomic(using=self.db):
    9831000                    params = dict(resolve_callables(params))
    984                     return self.create(**params), True
     1001                    return qs.create(**params), True
    9851002            except IntegrityError:
    9861003                try:
    987                     return self.get(**kwargs), False
     1004                    return qs.get(**kwargs), False
    9881005                except self.model.DoesNotExist:
    9891006                    pass
    9901007                raise

it does bring a few questions though mainly should update_or_create be updated as well and to what extent? Should defaults and create_defaults also be considered? Only when creation actually takes place or systematically?

comment:5 by Brian Atkinson, 2 months ago

The part that bit us is that the object left behind from the failed creation could have different field values from the actual row read from the database. The example above had much of that stripped, but imagine one of the fields provided during creation is context sensitive (example: UUID of the request), then the parent.child could have data that differs from the returned row.

More concretely, consider adding the following to the Child table:

context_id = models.UUIDField(default=uuid.uuid4)

I would expect the following to hold:

assert child.context_id == parent.child.context_id

From my perspective as a user, I might prefer an additional read to populate that relation over that inequality not holding. At least I think that would be my preference. The comments about an "identity map" make me confident that there are a lot more details around how the APIs should be behaving that I'm not unaware of. Is there an assumption that the following should be true?

assert child is parent.child

Thanks for taking the time to consider this issue.

comment:6 by Simon Charette, 2 months ago

From my perspective as a user, I might prefer an additional read to populate that relation over that inequality not holding. At least I think that would be my preference.

Proper cache invalidation would effectively be the less risky option but it's far from trivial do to correctly. We can't systematically clear the cache as it might discard explicit relationship retrieval through select_related as alluded above.

Is there an assumption that the following should be true?
assert child is parent.child

No, in particular when created is False like it's the case here.

Think of it this way, get_or_create is meant to be a concurrency safe hybrid between get and create so users don't have to re-implement it in an unsafe way every time the upsert pattern is needed.

In the case of get the child is parent.child assumption will never hold true, that is

>>> child = Child.objects.get(parent=parent)
>>> child is parent.child
False

as there is no session identity mapper. On the hand, because of implementation details, the assumption will hold true for create

>>> child = Child.objects.create(parent=parent)
>>> child is parent.child
True

which brings the question of what exactly should be done in cases where get_or_create falls back to get (created is False) given it's meant to be an hybrid. Is it better to keep things as they are today and avoid introducing an exception when a to-one relationship is provided and the object already exists or is it better to align it to-many related managers which offer a way to tap-in this per-queryset identity mapper logic that to-one relationship cannot take advantage of because attribute access doesn't yield a manager

>>> author.books.create(title="2001: A Space Odyssey")
>>> book, created = author.books.get_or_create(title="2001: A Space Odyssey")
>>> created  # get() was used.
False
>>> book.author is author
True

I'll have to ponder on that one a bit because while I can see the appeal it introduces significant complexity with high risk of subtle breakages. If we are fixing this I'm of opinion that we should make it work for get_or_create under all circumstances where get is used and not only on the IntegrityError fallback as that would be even more confusing than it is today.

comment:7 by Jason Hall, 6 weeks ago

Owner: set to Jason Hall
Status: newassigned

comment:8 by Jason Hall, 6 weeks ago

Has patch: set

comment:9 by Sarah Boyce, 5 weeks ago

Patch needs improvement: set

comment:10 by Jason Hall, 5 weeks ago

Triage Stage: AcceptedReady for checkin

comment:11 by Simon Charette, 5 weeks ago

Triage Stage: Ready for checkinAccepted

Please refer to the contributing documentation, you can't mark your own patch RFC.

comment:12 by Jason Hall, 5 weeks ago

Patch needs improvement: unset

comment:13 by Jason Hall, 5 weeks ago

I understand. Thank you.

comment:14 by Rushabh Doshi, 7 days ago

Natalia, Simon and Sarah - thank you for looking at this issue and the patch.
Jason - thanks for making the patch in the first place.

Given that the patch doesn't seem to fully solve the problem, and given that this issue affects all OneToOneFields and causes a subtle data corruption that could be hard to detect and diagnose, I have two questions:

  1. Is there a way to raise awareness of this issue and seek help in fixing? We would offer to give it a shot, but this is very deep ORM code and we don't feel like we have sufficient experience to attempt the patch.
  2. In the meantime, could you advise on the work around? My current thought is that we should use ForeignKey(unique=True) and eat the Django check warning https://github.com/django/django/blob/main/django/db/models/fields/related.py#L1079 This would force us to use child_set.first() but that's a small ergonomic price to pay. Do you have any thoughts on this approach or a better suggestion instead?
Note: See TracTickets for help on using tickets.
Back to Top