Opened 10 years ago

Closed 6 years ago

Last modified 3 years ago

#23646 closed New feature (fixed)

Add QuerySet.bulk_update() to to efficiently update many models

Reported by: Brillgen Developers Owned by: Tom Forbes
Component: Database layer (models, ORM) Version: 1.9
Severity: Normal Keywords:
Cc: Tom Forbes Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Brillgen Developers)

Update
django-bulk-upate (​https://github.com/aykut/django-bulk-update) which could/should be integrated into django core since we have a bulk_create and this corresponds well with the save (could be called bulk save if it makes sense). App works with an identical api to bulk_create and so should be a good candidate to be considered for inclusion in core.

Database backends like Postgresql support doing multiple value update for the same field in different rows using a single update query. Please see this sample below:

http://stackoverflow.com/questions/18797608/update-multiple-rows-in-same-query-using-postgresql

Example:
Currently:

Books.objects.all().update(price=10)

Proposed (syntax can vary):

Books.objects.all().update({'pk': 1, 'price': 10}, {'pk': 2, 'price': 25}])

Idea is to do it in sql for performance reasons but still use the ORM and not use a raw database connection.

However, such an interface is not exposed via the Django ORM. Would this be accepted as a patch? if so, would a list dictionary with the various fields:values to be updated be a good way to provide the input to ORM

Change History (18)

comment:1 by Brillgen Developers, 10 years ago

Description: modified (diff)

comment:2 by Russell Keith-Magee, 10 years ago

Resolution: wontfix
Status: newclosed

The link you've provided suggests that the ORM example you've provided would be interpreted as:

UPDATE books_book 
SET 
 price = 10 where pk = 1,
 price = 25 where pk = 2

So... what in the ORM statement identifies which column is to be used in the WHERE clause, and which is to be updated? As currently expressed, it's ambiguous. You'd need to modify the statement to be something like:

Books.objects.all().update('pk', values={'pk': 1, 'price': 10}, {'pk': 2, 'price': 25})

which nominates which values in the data are for the select.

However, even with those changes, I'm not sure I see why doing this as a single statement version would be preferable to:

Book.objects.filter(pk=1).update(price=10)
Book.objects.filter(pk=2).update(price=25)

AFAIK, it's no faster on the database side (although feel free to prove me wrong on that), and I don't find a "one liner API" especially expressive or clear for this feature.

For that reason, I'm marking this wontfix. If you feel the idea is still worth pursuing, feel free to start a discussion on django-developers and make your case. (I know you started one on django-users, but in this case, you need the eyeballs from -dev).

comment:3 by Shai Berger, 10 years ago

On an old project, ~10 years ago and not Django based, we used a "bulk update" procedure that did:

1) Bulk-insert the new records into a temporary table
2) update in one statement using a join (SQL Server lets you do that)

On hundreds of records, this was significantly faster than updating one-by-one (I no longer have access to any hard data, and it's irrelevant anyways).

Anyway, I suspect a better API would involve passing a collection of objects and selecting just the fields to update: Book.objects.update_many(books, 'price')

comment:4 by Brillgen Developers, 9 years ago

Resolution: wontfix
Status: closednew

Reopening even though it was closed by a core team member because another core member has reported the significant performance improvement.

Also, there is an app called django-bulk-upate (https://github.com/aykut/django-bulk-update) which could/should be integrated into django core since we have a bulk_create and this corresponds well with the save (could be called bulk save if it makes sense). App works with an identical api to bulk_create and so should be a good candidate to be considered for inclusion in core.

comment:5 by Brillgen Developers, 9 years ago

Description: modified (diff)
Summary: query set sql update to change different values by different keysquery set sql update to change different values by different keys (include django-bulk-update in core)
Version: 1.71.9

comment:6 by Tim Graham, 9 years ago

Triage Stage: UnreviewedAccepted

Tentatively accepting, pending discussion on the DevelopersMailingList to approve the idea and API.

comment:7 by Tim Graham, 7 years ago

Has patch: set

#29037 is a duplicate with a patch.

comment:8 by Tom Forbes, 7 years ago

Cc: Tom Forbes added
Owner: changed from nobody to Tom Forbes
Status: newassigned

Sorry for the duplicate, I'm not sure how my search missed this. I've got a patch that implements this at a basic level, there are some small optimizations that could be done at a later stage (including adding Postgres specific syntax).

comment:9 by Tim Graham, 6 years ago

Patch needs improvement: set
Summary: query set sql update to change different values by different keys (include django-bulk-update in core)Add QuerySet.bulk_save() to to efficiently update many models

comment:10 by Tom Forbes, 6 years ago

Patch needs improvement: unset

I've made the changes requested by the review. I also expanded the tests a fair bit and ran into one case where setting a model field to an expression (e.g F() or a Function) would fail, because the field values where passed to Value().

This got me thinking, what's the 'correct' result of Value(F('id')), or Value(Lower('field'))? Currently it fails in weird ways when resolved with an Expression, but should it perhaps return the resolved Expression instead? Or should we error loudly in this case?

comment:11 by Tim Graham, 6 years ago

Patch needs improvement: set

The documentation says, "A Value() object represents the smallest possible component of an expression: a simple value. When you need to represent the value of an integer, boolean, or string within an expression, you can wrap that value within a Value()." I don't think Value() is meant to wrap other expressions.

comment:12 by Tom Forbes, 6 years ago

Patch needs improvement: unset

I've moved the tests around as requested and changed the documentation.

I also had an idea for a potential optimization by using simple case statements (https://code.djangoproject.com/ticket/29710) after reading the Oracle documentation where they make a clear distinction between the two (simple or searched).

Version 0, edited 6 years ago by Tom Forbes (next)

comment:13 by Simon Charette, 6 years ago

Summary: Add QuerySet.bulk_save() to to efficiently update many modelsAdd QuerySet.bulk_update() to to efficiently update many models

As mentioned on the PR I think it'd be worth investigating (in a follow up ticket) if we could push the primary conditional update logic to the compiler level (UpdateQuery/SQLUpdateCompiler) in order to use VALUES instead of CASE(WHEN)) on backends that support it. Given this API is meant to be used for large number of objects I assume using VALUES would perform significantly better because it'd be easier for query planners to inspect the query.

FWIW UpdateQuery already has an update_batch method used by model deletion but it doesn't support per-primary key values.

in reply to:  13 comment:14 by Tom Forbes, 6 years ago

Replying to Simon Charette:

As mentioned on the PR I think it'd be worth investigating (in a follow up ticket) if we could push the primary conditional update logic to the compiler level (UpdateQuery/SQLUpdateCompiler) in order to use VALUES instead of CASE(WHEN)) on backends that support it. Given this API is meant to be used for large number of objects I assume using VALUES would perform significantly better because it'd be easier for query planners to inspect the query.

FWIW UpdateQuery already has an update_batch method used by model deletion but it doesn't support per-primary key values.

Yep, in #29037 I noted that Postgres has specific syntax for it and I was going to open a ticket once (or if!) this is merged. I'm not sure what other databases support this, and I spent today looking at how to implement it. It requires a custom join on VALUES which I'm really not sure how to implement. If you have any idea I'd love a point in the right direction.

comment:15 by Simon Charette, 6 years ago

It requires a custom join on VALUES which I'm really not sure how to implement. If you have any idea I'd love a point in the right direction.

Hmm I'll have to dig into it a bit more as well but I was hoping it would be possible without introducing a new type of JOIN by having SQLUpdateCompiler do the heavy lifting.

I suppose having a look at how QuerySet.extra(tables, where) is implemented could guide us towards implementing it at the UpdateQuery level though some form of JOIN.

From my local tests it seems like we could use UNION ALL to replace VALUES on MySQL at least.

UPDATE table, (
    SELECT 1 id, 2 first, 3 second
    UNION ALL SELECT 2, 3, 4
    UNION ALL ...
) AS table_update
SET table.first = table_update.first, table.second = table_update.second
WHERE table.id IN (...) AND table.id = table_update.id

I haven't tested performance on SQLite and MySQL but the query seem to perform significantly faster (3-5x) on PostgreSQL using VALUES() instead of CASE(WHEN) when updating two columns on 1k rows of a table with 10k rows. The performance difference was getting larger as more columns were updated (I tried 1 to 5).

comment:16 by Tim Graham <timograham@…>, 6 years ago

Resolution: fixed
Status: assignedclosed

In 9cbdb440:

Fixed #23646 -- Added QuerySet.bulk_update() to efficiently update many models.

in reply to:  15 comment:17 by Tom Forbes, 6 years ago

Thank you for this! I've added a ticket to support this: https://code.djangoproject.com/ticket/29771#ticket

Replying to Simon Charette:

It requires a custom join on VALUES which I'm really not sure how to implement. If you have any idea I'd love a point in the right direction.

Hmm I'll have to dig into it a bit more as well but I was hoping it would be possible without introducing a new type of JOIN by having SQLUpdateCompiler do the heavy lifting.

I suppose having a look at how QuerySet.extra(tables, where) is implemented could guide us towards implementing it at the UpdateQuery level though some form of JOIN.

From my local tests it seems like we could use UNION ALL to replace VALUES on MySQL at least.

UPDATE table, (
    SELECT 1 id, 2 first, 3 second
    UNION ALL SELECT 2, 3, 4
    UNION ALL ...
) AS table_update
SET table.first = table_update.first, table.second = table_update.second
WHERE table.id IN (...) AND table.id = table_update.id

I haven't tested performance on SQLite and MySQL but the query seem to perform significantly faster (3-5x) on PostgreSQL using VALUES() instead of CASE(WHEN) when updating two columns on 1k rows of a table with 10k rows. The performance difference was getting larger as more columns were updated (I tried 1 to 5).

comment:18 by Mariusz Felisiak <felisiak.mariusz@…>, 3 years ago

In d35ce682:

Fixed #33506 -- Made QuerySet.bulk_update() perform atomic writes against write database.

The lack of _for_write = True assignment in bulk_update prior to
accessing self.db resulted in the db_for_read database being used to
wrap batched UPDATEs in a transaction.

Also tweaked the batch queryset creation to also ensure they are
executed against the same database as the opened transaction under all
circumstances.

Refs #23646, #33501.

Note: See TracTickets for help on using tickets.
Back to Top