Context Navigation

← Previous Ticket
Next Ticket →

#35793 assigned New feature

Add support for atomic upserts

Reported by:	Storm Heg	Owned by:	YashRaj1506
Component:	Database layer (models, ORM)	Version:
Severity:	Normal	Keywords:
Cc:	Storm Heg, Lily Foote, Simon Charette, John Speno	Triage Stage:	Accepted
Has patch:	no	Needs documentation:	no
Needs tests:	no	Patch needs improvement:	no
Easy pickings:	no	UI/UX:	no

Description

Hello,

I took over maintenance of a package called wagtail-ab-testing which performs a raw SQL query to insert or update (UPSERT) a model atomically (link to relevant source code)

This raw sql query confused me, but as it turns out it is there because this the Django ORM does not appear support atomic upserts without taking a lock on the row, which hurts performance

Specifically, the (performant) raw SQL for PostgreSQL looks like this:

INSERT INTO %s (ab_test_id, version, date, hour, participants, conversions)
VALUES (%ss, %ss, %%s, %%s, %%s, %%s)
ON CONFLICT (ab_test_id, version, date, hour)
  DO UPDATE SET participants = %s.participants + %%s, conversions = %s.conversions + %%s;

There are a few things to note here:

There is an unique constraint on combination of ab_test_id, version, date, and hour columns that prevents duplicate objects from being created.
The participants and conversions columns are updated atomically; That is: these columns are incremented by the given values, not set directly. This is important for atomicity. We don't want multiple concurrent database calls to overwrite each other.

The more common way to do this would be to use the `update_or_create` method, but this internally takes a lock on the row in case of updates, which is not acceptable because this is a hot code path with a lot of concurrent requests. We can’t afford to take a lock on the row.

I think it would be great if Django ORM supported atomic upserts without taking a lock on the row. I’m not sure what the api for that would look like, suggestions are welcome.

As for database support, PostgreSQL and SQLite support this syntax since PostgreSQL 9.5 and SQLite 3.24.0 respectively. MySQL and MariaDB do not support this natively, but it can be emulated using INSERT ... ON DUPLICATE KEY UPDATE syntax and creating a unique index on the columns.

Oracle apparently supports something similar using the vastly different MERGE statement syntax. That might be a bit of a challenge.

Looking for feedback on this idea. Is this something that would be worth supporting in Django? I’ll admit it’s a bit of a niche feature, but it’s a very useful one in certain cases. I would love to get rid of the raw SQL query in wagtail-ab-testing and have something that works across all supported databases.

Storm

Change History (10)

comment:1 by Sarah Boyce, 20 months ago

Resolution:	→ wontfix
Status:	new → closed

Hi Storm!

I think it might be worth asking on the Django forum for help/input and then confirming if something new would need adding to Django (and what that would look like).
We also ask contributors to get feedback from the community through a forum discussion first, so we can confirm there is community consensus that we should add this feature

If you get positive feedback from the forum, can you link it back here and reopen the ticket?

It feels like what you need might be possible using a combination of force_insert and F expressions, but I'm not sure if I followed completely and I could be wrong

comment:2 by Storm Heg, 19 months ago

Thanks for the reply Sarah – and again, great meeting you at DjangoCon US 2024!

I wasn't aware that feature requests usually go through the forum. I've moved my request over there: https://forum.djangoproject.com/t/add-support-for-atomic-upserts/35813

comment:3 by Lily Foote, 19 months ago

Cc:	Lily Foote added
Resolution:	wontfix
Status:	closed → new
Triage Stage:	Unreviewed → Accepted

As discussed in the forum post, the most likely way forward with this is to update bulk_create to support general expressions in update_fields.

comment:4 by Simon Charette, 19 months ago

As a clarification the ORM already supports "atomic upsert" through bulk_create (as defined by INSERT ON CONFLICT DO UPDATE) but lacks a way to set the update value to anything but the value provided by the excluded row on conflict.

One thing that we'll need to figure out though is what field references (e.g. F or any str passed to expressions) should resolve to. Should it be the excluded row (the value specified for associated field of the instance passed to bulk_create), should it be the value of conflicting row, or should we allow both by requiring F('excluded__field') and F('conflicting__field') to be specified?

In other words what should the following resolve to

PostView.objects.bulk_create(
    [PostView(post_id=post_id, user_id=user_id, count=0)],
    unique_fields=["post_id", "user_id"],
    unique_updates={
        "count": F("count") + 1
    }
)

INSERT INTO postview(post_id, user_id, count) VALUES(:post_id, :user_id, :count)
ON CONFLICT (post_id, user_id) DO UPDATE SET count = postview.update + 1

Stormheg's report requested that it defaults to the conflicting row (not the excluded one) and I think it makes the most sense but I'm unsure if all backends support that. Another argument for this using the conflicting row value over the excluded one is that the latter can be emulated with a Case(When(...)) if necessary (return distinct value based on unique key match) while the other way around is not possible. It's also easier to implement resolving to the INSERT table alias (while disallowing JOINs) as no alias re-pointing needs to take place and if a need eventually arise to support excluded references it would be straightforward to implement a ExcludedF(F) class that always resolve to the backend specific equivalent of Postgres's excluded.

Version 0, edited 19 months ago by Simon Charette (next)

comment:5 by Simon Charette, 19 months ago

Cc:	Simon Charette added

comment:6 by John Speno, 19 months ago

Cc:	John Speno added

comment:7 by YashRaj1506, 19 months ago

Owner:	set to YashRaj1506
Status:	new → assigned

comment:8 by YashRaj1506, 18 months ago

One thing i have realised that the bulk_create doesn't supports F expressions directly, so a part of this feature would also include how to incorporate the value for the current conflicting_row to the value of the dictionaries which we will be passing, cause the bulk_create will throw the errors at the tests. There can be two ways either get the values from the db in the current dictionary of update_fields, and when operations are done on the python side, the batched_insert will send the data to db or After the the bulk insert has been done , a separate code shall run to update those field values which raised conflicts...... but i dont think using update in a bulk_create is a nice approach? I need some opinions and advices on this..

Last edited 18 months ago by YashRaj1506 (previous) (diff)

comment:9 by Take Weiland, 9 months ago

I just found this ticket after I already made https://code.djangoproject.com/ticket/36567
I am copying the relevant info here.

All supported DBs (all except Oracle, for which update_conflicts is not implemented) support referring to the values that were about to be inserted.
PostgreSQL and SQLite explicitly state support for referring to the row that is conflicting and that will be updated (https://www.postgresql.org/docs/current/sql-insert.html, https://sqlite.org/lang_upsert.html)
MySQL does not state such support, however their examples show it being done (ON DUPLICATE KEY UPDATE c=c+1; from https://dev.mysql.com/doc/refman/9.4/en/insert-on-duplicate.html)
MariaDB does not state such support, however considering it started from MySQL I would be surprised if it is not supported.

My API bikeshedding was something like this:

MyModel.objects.bulk_create(
  objs,
  update_conflicts=True,
  update_fields={
    'field_b': Coalesce(F('field_b'), Inserted('field_b'))
  }
)

I use Inserted here, but it could just as well be Excluded.

Last edited 9 months ago by Take Weiland (previous) (diff)

comment:10 by Simon Charette, 9 months ago

Thanks for the backends support investigation, that's quite useful!

I use Inserted here, but it could just as well be Excluded.

I think Excluded would be preferable as its closer to the SQL naming and refers to the conflicting row. I feel like Inserted is ambiguous as it's effectively tentatively inserted but failed.

Last edited 9 months ago by Simon Charette (previous) (diff)

Note: See TracTickets for help on using tickets.

Download in other formats:

Issues