Code

Opened 11 months ago

Closed 9 months ago

Last modified 8 months ago

#20429 closed New feature (fixed)

Add QuerySet.update_or_create method

Reported by: tunixman Owned by: elektrrrus
Component: Database layer (models, ORM) Version: master
Severity: Normal Keywords:
Cc: timograham@… Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

It seems handy to have a function that would update an existing record in the database with the values of defaults if one is found, instead of just returning a flag to the caller like get_or_create does. We have a codebase using this, but it may also be useful to the world at large. The attached implementation makes a passing effort at not updating too much of the model, but should probably actually restrict itself to just fields and related objects instead of any attribute.

Attachments (1)

upsert.py (710 bytes) - added by tunixman 11 months ago.
Sample implementation of update_or_create.

Download all attachments as: .zip

Change History (33)

Changed 11 months ago by tunixman

Sample implementation of update_or_create.

comment:1 Changed 11 months ago by russellm

  • Needs documentation set
  • Needs tests set
  • Patch needs improvement set
  • Triage Stage changed from Unreviewed to Accepted

Upsert is a fairly common idea; seems worthwhile to me.

Regarding implementation: there's a lot of common ground between get_or_create and update_or_create, so I'm wondering if there should be some common base implementation, rather than layering one over the other.

comment:2 Changed 11 months ago by elektrrrus

  • Owner changed from nobody to elektrrrus
  • Status changed from new to assigned

comment:3 Changed 11 months ago by tunixman

I hadn't thought about that. That's a good idea I'd be willing to look into if needed.

comment:4 Changed 11 months ago by elektrrrus

comment:5 Changed 11 months ago by tunixman

Should we at all be concerned in the update code about limiting updated attributes to just fields? It seems like it could potentially allow some subtle and difficult to identify bugs if any updates are allowed.

comment:6 Changed 11 months ago by elektrrrus

Yes, You have right. I'll add checking if updated fields are model attributes.

comment:7 Changed 11 months ago by elektrrrus

I've changed approach according to HonzaKral proposition, currently utilizing update method on QuerySet.

comment:8 Changed 11 months ago by tunixman

Using update() won't call save on the created or updated object (https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.update), which is slightly different from the get_or_create() API (https://docs.djangoproject.com/en/dev/ref/models/querysets/#get-or-create). The difference may be something we should keep in mind when deciding on the final implementation. My preference is that they behave the same in this regard (i.e. save() is called with both functions).

comment:9 Changed 11 months ago by elektrrrus

Yes, i know. I will talk also tomorrow with other developers on sprints about this approach.
Maybe the best way would be to choose through some kwarg to use update or save?

Version 1, edited 11 months ago by elektrrrus (previous) (next) (diff)

comment:10 Changed 11 months ago by tunixman

Maybe calling it update_or_create is misleading, too, and calling it something like upsert is better. This would avoid the name collision with update on QuerySet, and then we can follow the semantics of get_or_create more closely. But I do think following the semantics of get_or_create are more important than other considerations. Callers will either expect save to have been called after updating to perform application-level consistency checks, or people will need to remember to call save after each update_or_create call, and being safe by default is important to me.

comment:12 Changed 11 months ago by elektrrrus

After short talk with @russellm decision is to follow get_or_create approach with calling save on model.
I'll update pull request shortly.

comment:13 Changed 11 months ago by elektrrrus

Pull request updated.

comment:14 Changed 11 months ago by russellm

Review notes on the pull request.

Also - when submitting a patch that you want reviewed, it helps if you uncheck the "Patch needs improvement" flag. The list of patches needing review is essentially the list with a patch, that doesn't need improvement, and doesn't need tests or docs. If you don't update the flags, your patch will likely end up being missed for review.

comment:15 Changed 11 months ago by elektrrrus

  • Patch needs improvement unset

comment:16 Changed 11 months ago by elektrrrus

@russelm, waitng for Your second review, i've pushed new commit with updates according Your sugessions(and also uncheck "Patch needs improvement" flag).

comment:17 Changed 11 months ago by wim@…

  • Patch needs improvement set

Is the current patch still at: https://github.com/django/django/pull/1132/files ?

Probably by accident documentation of the first and last method have been removed.

comment:18 Changed 10 months ago by timo

  • Cc timograham@… added
  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Summary changed from Implementation of update_or_create to Add QuerySet.update_or_create method

Updated pull request to address outstanding issues: https://github.com/django/django/pull/1248

comment:19 Changed 10 months ago by anonymous

  • Patch needs improvement set

Some backends like mysql can handle upsert natively, update_or_create should use that as advantage.

comment:20 Changed 10 months ago by timo

I imagine that would increase the complexity of the patch quite a bit. I wonder if we need that optimization for the first version of this? Is the anonymous commenter willing to implement it?

comment:21 Changed 10 months ago by timo

  • Patch needs improvement unset

As suggested in IRC, I used djangobench to test this change, but I didn't get any conclusive results (this is my first time using djangobench, so I'm not certain how to interpret the results). Sometimes djangobench report a "significant" difference (both slower and faster), other times there was no difference. Here are a couple results. I'm assuming the difference is negligible enough to ignore.

$ djangobench --control=master --experiment=ticket_20429 query_get_or_create -t 500

Running 'query_get_or_create' benchmark ...
Min: 0.000931 -> 0.000944: 1.0141x slower
Avg: 0.000997 -> 0.000979: 1.0187x faster
Significant (t=3.474695)
Stddev: 0.00011 -> 0.00005: 2.1908x smaller (N = 500)

Running 'query_get_or_create' benchmark ...
Min: 0.000935 -> 0.000947: 1.0130x slower
Avg: 0.000975 -> 0.001007: 1.0335x slower
Significant (t=-7.464501)
Stddev: 0.00006 -> 0.00008: 1.3357x larger (N = 500)

Running 'query_get_or_create' benchmark ...
Min: 0.000931 -> 0.000939: 1.0087x slower
Avg: 0.000999 -> 0.001006: 1.0076x slower
Not significant
Stddev: 0.00014 -> 0.00013: 1.0402x smaller (N = 500)

comment:22 Changed 10 months ago by loic84

As mentioned on IRC, I believe this method suffers from a race-condition (especially if used in a view) which should be documented.

Warning:

This method is prone to race-condition and can result in multiple rows being
inserted simultaneously if uniqueness is not enforced at the database level
(see either ``unique`` or ``unique_together``).

During a race where uniqueness is inforced, all simultaneous calls to
``create_or_update`` but one will raise an ``IntegrityError``.

Without uniqueness inforcement, any ``create_or_update`` call with the same set
of parameters that happens after a race will raise ``MultipleObjectsReturned``.

Like get_or_create, it might be good to recommend it as a convenience for scripts rather than something to use in a view.

Refs #12579

comment:23 Changed 10 months ago by tunixman

If you're not enforcing uniqueness constraints in the database you're going to have this race condition no matter what. Perhaps the documentation around creating models and the basics of the tutorials could be updated to reflect this. Documenting it as something endemic only to these methods does nothing to improve the data consistency of Django applications and would rather just discourage people from using these methods which are basically the codification of the proper way to handle these cases along side proper database implementation.

comment:24 Changed 10 months ago by loic84

The integrity issue is just a side effect of the race-condition, even if uniqueness is enforced at the database level, you still need to be aware of the potential 500 you will be getting.

I agree that any get() followed by create() in user code suffers from the same issue, but when you do it manually you have a chance to think about the potential problems. There is a layer of opacity when you use a queryset method like update_or_create() and you may assume that all problems have been solved by the framework. Eventually you get to shoot yourself in the foot, which results in tickets like #12579.

update() documents that it's free of race-condition, select_for_update() is obviously free of race-condition on supported backends, it wouldn't be too hard to assume that update_or_create() provides such guarantee as well.

I agree that the race-condition for these common patterns should be documented (if that's not already the case), but convenience methods like get_or_create() and update_or_create() also need to be clear about their implementation and potential consequences.

comment:25 Changed 10 months ago by timo

I've updated the patch with the documentation suggested by @loic84 and also to reflect the fact that this will have to go into 1.7 rather than 1.6.

comment:26 follow-up: Changed 10 months ago by tunixman

Race conditions are conditions that silently accept corruption, so database constraints are the exact opposite of race conditions. They inform the programmer that data integrity was preserved, rather than silently corrupting it. Dealing with those exceptions at that point properly is the right thing, because it prevents having to go through the database much later trying to understand why there are duplicates of some rows but not others, and why there are other more opaque exceptions (like .get() failing), or why relations to one item are actually to several different items. Those are race conditions, but loudly saying "I'm sorry I can't corrupt your data for you" is not a race condition, and I strongly think Django should be choosing the non-race-condition approach with loud and immediate errors rather than the race-condition approach with no errors at the time of corruption, but loud and opaque errors much later.

comment:27 in reply to: ↑ 26 Changed 10 months ago by loic84

  • Triage Stage changed from Accepted to Ready for checkin

Replying to tunixman:

I strongly think Django should be choosing the non-race-condition approach with loud and immediate errors rather than the race-condition approach with no errors at the time of corruption, but loud and opaque errors much later.

I'm not sure I understand how you would do that. If you are referring to the backend specific upsert, it could be added in the future.

@timo: Left a comment on the PR, other than that I'm happy with the current patch.

comment:28 Changed 10 months ago by tunixman

Can you take me off this ticket then? I'm through and don't need the notifications about it anymore.

comment:29 Changed 9 months ago by timo

  • Triage Stage changed from Ready for checkin to Accepted

I've updated the patch, changing the signature of get_or_create and update_or_create to (defaults=None, **kwargs). I'd like another person to review the patch before committing it.

comment:30 Changed 9 months ago by Tim Graham <timograham@…>

  • Resolution set to fixed
  • Status changed from assigned to closed

In 6272d2f155adb4f32ef129d57e9eb5493ebde6ed:

Fixed #20429 -- Added QuerySet.update_or_create

Thanks tunixman for the suggestion and Loic Bistuer for the review.

comment:31 Changed 8 months ago by Tim Graham <timograham@…>

In f7290581fe2106c08d97215ab93e27cf6b27e100:

Fixed a regression with get_or_create and virtual fields.

refs #20429

Thanks Simon Charette for the report and review.

comment:32 Changed 8 months ago by Andrew Godwin <andrew@…>

In 3f416f637918cc162877be95a59d50825b203089:

Fixed a regression with get_or_create and virtual fields.

refs #20429

Thanks Simon Charette for the report and review.

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as closed
as The resolution will be set. Next status will be 'closed'
The resolution will be deleted. Next status will be 'new'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.