#26804 closed Bug (fixed)
update_or_create() updates all columns, which could cause a race condition
Reported by: | Jensen Cochran | Owned by: | Jensen Cochran |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Normal | Keywords: | update_or_create |
Cc: | Triage Stage: | Accepted | |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | yes |
Easy pickings: | no | UI/UX: | no |
Description
The update_or_create() helper function in Django is passed a 'defaults' kwarg, which contains the fields that we expect the function to update. When .save() is called in the function to do the update, it does not pass the update_fields parameter to .save(), and so the SQL generated to complete the update sets every column (except pk), even those not passed in 'defaults'.
This creates a potential race condition. In the period of time between when the function calls .get() to retrieve the object and when it calls .save(), some other process may have updated a column in that row that has nothing to do with this update. When save() is called, these changes will be overwritten. This is not an issue for the fields passed into 'defaults' since we expected those to be updated anyway. The main issue is that update_or_create updates fields/columns other than those passed in 'defaults', and does so in a way that could cause data loss.
Example:
class Authors(models.Model): id = models.AutoField(primary_key=True) name = models.CharField(max_length=128, unique=True) newest_release = models.DateTimeField() active = models.BooleanField(default=True) Authors.objects.update_or_create( name='Bob', defaults={ 'newest_release': timezone.now() }) # Django update_or_create code from https://github.com/django/django/blob/master/django/db/models/query.py def update_or_create(self, defaults=None, **kwargs): """ Looks up an object with the given kwargs, updating one with defaults if it exists, otherwise creates a new one. Returns a tuple (object, created), where created is a boolean specifying whether an object was created. """ defaults = defaults or {} lookup, params = self._extract_model_params(defaults, **kwargs) self._for_write = True try: obj = self.get(**lookup) except self.model.DoesNotExist: obj, created = self._create_object_from_params(lookup, params) if created: return obj, created for k, v in six.iteritems(defaults): setattr(obj, k, v) obj.save(using=self.db) return obj, False
Assume that the author Bob already exists in the database.
update_or_create retrieves the Author object using .get().
Meanwhile, another process runs the following SQL:
UPDATE authors SET active = FALSE WHERE id = 1234;
Then, update_or_create sets the values from 'defaults' on the object, then calls .save() on the object. Generated SQL would look something like
UPDATE authors SET name='Bob', active=TRUE, newest_release='#######'::timestamp, WHERE id = 1234"
Notice that both 'name' and 'active' are set in the SQL, even though they were not passed in defaults. Bob has been wrongly made active again, even though the update portion of the update_or_create() call had nothing to do with the 'active' field. Besides the data issues, it is also not efficient to have the database update every field/column when there is no need.
I have tested this behavior by putting a time.sleep() into update_or_create right after the .get() and updating the database during this sleep.
Once the sleep finished, my changes were overwritten by .save()
My initial thought to patch this was to change
obj.save(using=self.db)
in update_or_create into
obj.save(using=self.db, update_fields=defaults.keys())
However, this causes a single test to fail in generic_relations, since update_fields in .save() does not accept GenericForeignKey fields, which may be bug in and of itself. There are other ways to fix this like locking the row, but that would be less efficient and not fix the underlying issue.
ERROR: test_update_or_create_defaults (generic_relations.tests.GenericRelationsTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/lib64/python2.7/unittest/case.py", line 365, in run testMethod() File "/home/jensen/sandbox/django/tests/generic_relations/tests.py", line 540, in test_update_or_create_defaults tag, created = TaggedItem.objects.update_or_create(tag="shiny", defaults={'content_object': diamond}) File "/home/jensen/sandbox/django/django/db/models/manager.py", line 85, in manager_method return getattr(self.get_queryset(), name)(*args, **kwargs) File "/home/jensen/sandbox/django/django/db/models/query.py", line 493, in update_or_create obj.save(using=self.db, update_fields=defaults.keys()) File "/home/jensen/sandbox/django/django/db/models/base.py", line 782, in save % ', '.join(non_model_fields)) ValueError: The following fields do not exist in this model or are m2m fields: content_object
Change History (12)
comment:1 by , 8 years ago
Summary: | update_or_create() updates all columns → update_or_create() updates all columns, which could cause a race condition |
---|
comment:2 by , 8 years ago
Triage Stage: | Unreviewed → Accepted |
---|
comment:4 by , 8 years ago
Needs tests: | set |
---|
comment:5 by , 8 years ago
Needs tests: | unset |
---|
comment:6 by , 8 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:7 by , 8 years ago
Triage Stage: | Accepted → Ready for checkin |
---|
comment:8 by , 8 years ago
Patch needs improvement: | set |
---|---|
Triage Stage: | Ready for checkin → Accepted |
Awaiting resolution of #26884.
follow-up: 11 comment:10 by , 8 years ago
Would it make sense to avoid the same race condition on get_or_create ? Although an added select_for_update on a single get query which might be the use case on most of the cases (99% selects, 1% actually create), would reduce the performance of those gets by quite a lot since a select_for_update adds a lock/write to the row. So should probably be documented or come with an optional flag.
comment:11 by , 8 years ago
The race condition in get_or_create would be: get_or_create does not find a matching row, then another process inserts a row that get_or_create would have matched, then get_or_create inserts it's row. Now, two rows in the table exist with duplicate values (for the columns that were being searched against).
If you have a unique index in your database for the columns you passed to get_or_create and this race condition occurred, you would not have duplicate values inserted. Instead, you would get a db exception. So it is possible to avoid the overhead of the transaction in some cases, but only if we know that the columns are unique/unique_together.
No, actually locking the row is the only valid solution. Otherwise you still have a race condition -- when someone else updates the record, it is possible that your
update_or_create
should be creating a new row rather than updating the existing one. IMO the right fix is to change the rowinto