Opened 5 years ago

Last modified 14 months ago

#31202 assigned Cleanup/optimization

Bulk update suffers from poor performance with large numbers of models and columns

Reported by: Tom Forbes Owned by: Tom Forbes
Component: Database layer (models, ORM) Version: dev
Severity: Normal Keywords:
Cc: Mikuláš Poul, Florian Demmer, John Speno, Akash Kumar Sen Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Tom Forbes)

A user has reported seeing extremely slow update times when using bulk_update. With the django-bulk-update package, which does not use the expressions API and constructs raw SQL directly (https://github.com/aykut/django-bulk-update/blob/master/django_bulk_update/helper.py#L202), an update with 100,000 objects and and 10 fields takes 24 seconds. With the built in bulk_update it takes 2 minutes and 24 seconds - 6x as slow.

The user has provided a reproduction case here: https://github.com/mikicz/bulk-update-tests/blob/master/apps/something/models.py and https://github.com/mikicz/bulk-update-tests/blob/master/apps/something/test_bulk_update.py

From an initial look at the profiling that has been provided (https://github.com/aykut/django-bulk-update/files/4060369/inbuilt_pyinstrument.txt) it seems a lot of overhead comes from building the SQL query rather than executing it - I think if we can improve the performance it could speed up other workloads.

See: https://github.com/aykut/django-bulk-update/issues/75#issuecomment-576886385

Change History (15)

comment:1 by Tom Forbes, 5 years ago

Description: modified (diff)
Owner: changed from nobody to Tom Forbes
Status: newassigned

comment:2 by Mikuláš Poul, 5 years ago

Cc: Mikuláš Poul added

comment:3 by Simon Charette, 5 years ago

Triage Stage: UnreviewedAccepted

From looking at the profiles the performance issue is likely due to the fact Django uses Case, When, and Value expressions that needs to be resolved instead of raw SQL like the third party package does.

I suspect using RawSQL with CASE/WHEN SQL templates instead of Case/When expressions would significantly speed up the current implementation.

Last edited 5 years ago by Simon Charette (previous) (diff)

comment:4 by Mikuláš Poul, 5 years ago

I was running some more experiments earlier, just now got to analysing them: it doesn't seem like it depends on the type of field, at least in the couple of types I have in my project specifically (see https://github.com/mikicz/bulk-update-tests/blob/master/single_field_comparison.pdf). The performance is linear, in both number of objects to be updated and number of fields to be updated (see https://github.com/mikicz/bulk-update-tests/blob/master/two_fields_comparison.pdf and https://github.com/mikicz/bulk-update-tests/blob/master/all_fields_comparison.pdf).
The code that was running the experiment is in the experiment app.

comment:5 by Tom Forbes, 5 years ago

I've been taking a look at this today. It's been a while since I've poked around inside the expressions API, but I can see a couple of issues:

  1. We generate a _lot_ of allocations during the whole expression resolution process. 40% of the time is spent allocating according to vmprof
  2. "resolve_expression" for the Case statement is very copy happy. Each resolution copies all the inner nodes, which then copy themselves.
  3. We apparently needlessly call "resolve_expression" *twice* for each bulk update. On the first invocation we call [it from this method](https://github.com/django/django/blob/3259983f569151232d8e3b0c3d0de3a858c2b265/django/db/models/sql/subqueries.py#L105-L109) which replaces the Q objects with WhereNodes. We then resolve *again* during execution [from this with statement](https://github.com/django/django/blob/master/django/db/models/query.py#L769-L770).

I'll keep investigating, but after naively disabling the first resolution locally everything worked as expected, which implies for the happy path this might be wasted.

comment:6 by Florian Demmer, 5 years ago

Cc: Florian Demmer added

comment:7 by Hannes Ljungberg, 4 years ago

One suggestion is to use the UPDATE..FROM-syntax along with VALUES on backends which supports it. For example:

UPDATE mytable
SET
  mytext = myvalues.mytext,
  myint = myvalues.myint
FROM (
  VALUES
    (1, 'textA', 99),
    (2, 'textB', 88),
) AS myvalues (pk, mytext, myint)
WHERE mytable.pk = myvalues.pk

I know PostgreSQL supports this syntax but I haven't looked into the others.

comment:8 by Chris Jerdonek, 4 years ago

Here's another approach I thought would be worth mentioning. PostgreSQL supports COPY FROM for copying data from a file (or stream) to a table. For bulk_update() then, you could use COPY FROM to copy the data to a temp table, and then do the update from that. I wonder if that would be faster than even django-bulk-update's approach, in part because little SQL would be needed.

comment:9 by jerch, 3 years ago

I have started to implement an UPDATE FROM VALUES variant in https://github.com/netzkolchose/django-computedfields/blob/master/computedfields/fast_update.py, as I needed faster updates in the lib.

First, the speedup is huge, it performs 10 - 25 times better than the current bulk_update implementation.

But ofc, there are several issues with this:

  • only supported by newer DBMS versions (SQLITE 3.33+, MariaDB 10.3+, Mysql 8, Oracle unclear)
  • not official ANSI SQL, thus the backends need their very own SQL templates
  • MariaDB and Mysql8 have different templates, which is not covered by Django's backend distinction
  • not easy to fit into the ORM's SQL compiler templates, thus I went with string formatting for now
  • F expressions cannot be used in VALUES at all

For Postgres imho COPY FROM would be even faster, but I did not do it due to the needed complexity with totally different code paths for just postgres.

comment:10 by jerch, 3 years ago

Some early tests with COPY FROM indicate, that it even outperforms UPDATE FROM VALUES by far (at least 4 times faster). But the code is even more degenerated and violates django paradigms in almost every aspect. This is probably nothing for django itself, but maybe a third party lib, that people can use if they are aware of basic restrictions.

What I find bothersome with COPY FROM:

  • needs dealing with temporary tables, columns partially bootstrapped from target tables
  • might need own cast/escape rules for more complex field types (have not yet looked at psycopg3's copy cursor mechanics)
  • might not work for all field types / payloads (this most likely depends on alot on the used transport format/delimiters)
  • impact of index on pk field of temp table uncertain (whether index creation outweighs the index gain on the final update)

Yeah well, this needs alot more investigation, before it will be useable with a nice interface...

Version 0, edited 3 years ago by jerch (next)

comment:11 by jerch, 3 years ago

Some update on COPY FROM:

Did a first playground implementation just to see the effects on performance, see https://gist.github.com/jerch/fd0fae0107ce7b153b7540111b2e89ab. The chart over there shows the mean runtime of 100 runs of x updated records in a 100k table, updating just one integer field per record. The perf tests were done with plain postgres 14 docker image, with no tweaking of any settings.

The implementation copy_update is not yet optimized for perf or neatly integrated yet, it is just to get an idea, where the ballpark for COPY FROM would end up. bulk_update is django's current implementation, django-bulk-update is from here: https://github.com/aykut/django-bulk-update, fast_update is my early impl of direct UPDATE FROM VALUES from above.

Some observations from that:

  • bulk_update shows much worse runtime behavior than all others (thus accounting is stopped early)
  • django-bulk-update can keep up a bit longer, but then shows the same worse runtime growth (prolly due to the same SQL logic used?). This gets really worse for updates >5k (not shown).
  • copy_update has much higher setup costs (1 to 256 updated records, kinda expected due to temp table and index creation)
  • between 4096 to 8192 updates copy_update starts to pay off compared to fast_update, at 32k updates it is almost twice as fast
  • not shown in the charts: for higher update counts it keeps gaining ground (almost being 4 times faster for 1M update records)
  • There is something going on between 256-512 updates, as almost all implementations show a steep jump there (postgres b-tree depth change? did not investigate that further...)

Some early insights from that:

  • As already stated above in an earlier comment, django's bulk_update is currently pretty wasteful, as it even drops far behind django-bulk-update, which uses the same SQL update strategy.
  • fast_update is the winner in small to medium update counts, up to ~10k.
  • copy_update starts to shine for update counts >10k.
Last edited 3 years ago by jerch (previous) (diff)

comment:12 by jerch, 3 years ago

Made some progress in https://github.com/netzkolchose/django-fast-update, which contains fast_update for Postgres, sqlite, MariaDB and Mysql8, as well as copy_update for postgres. The package is currently alpha (though quite complete in CI tests), would be glad to get some feedback and some real world tests.

Some notes about the implementations - I kinda gave up on closer integration of fast_update with ORM internals, this needs someone with deeper knowledge and time to actually do it. copy_update is more a PoC with additional constraints (like values' transport repr relies alot on correct __str__ output), imho this could be realized more straight forward with psycopg3's superior COPY support, once django moved there.

comment:13 by John Speno, 3 years ago

Cc: John Speno added

comment:14 by Taylor H, 2 years ago

Hey everyone, just wanted to mention that I created the following library for very fast model updates and inserts in Postgres. We have been using it in production for over 2 years.

https://github.com/cedar-team/django-bulk-load

More details about the library:
https://decode.cedar.com/fast-django-model-inserts-with-postgres/

comment:15 by Akash Kumar Sen, 14 months ago

Cc: Akash Kumar Sen added
Note: See TracTickets for help on using tickets.
Back to Top