Opened 5 years ago
Last modified 13 months ago
#31202 assigned Cleanup/optimization
Bulk update suffers from poor performance with large numbers of models and columns
Reported by: | Tom Forbes | Owned by: | Tom Forbes |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | Mikuláš Poul, Florian Demmer, John Speno, Akash Kumar Sen | Triage Stage: | Accepted |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description (last modified by )
A user has reported seeing extremely slow update times when using bulk_update. With the django-bulk-update package, which does not use the expressions API and constructs raw SQL directly (https://github.com/aykut/django-bulk-update/blob/master/django_bulk_update/helper.py#L202), an update with 100,000 objects and 10 fields takes 24 seconds. With the built-in bulk_update it takes 2 minutes and 24 seconds - 6x as slow.
The user has provided a reproduction case here: https://github.com/mikicz/bulk-update-tests/blob/master/apps/something/models.py and https://github.com/mikicz/bulk-update-tests/blob/master/apps/something/test_bulk_update.py
From an initial look at the profiling that has been provided (https://github.com/aykut/django-bulk-update/files/4060369/inbuilt_pyinstrument.txt) it seems a lot of overhead comes from building the SQL query rather than executing it - I think if we can improve the performance it could speed up other workloads.
See: https://github.com/aykut/django-bulk-update/issues/75#issuecomment-576886385
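For a concrete picture of where the time goes: bulk_update compiles each batch into a single UPDATE whose SET clause is a CASE expression with one WHEN branch per object, built from Case/When/Value expression objects. A minimal, purely illustrative sketch of that SQL shape (not Django's actual code path; table and column names are made up):

```python
# Illustrative only: mirrors the CASE/WHEN SQL shape that bulk_update
# compiles to. Django builds one Case/When/Value expression tree per
# (object, field) pair, which is where much of the Python-side cost lies.

def case_when_update_sql(table, pk_col, field, rows):
    """Build the CASE/WHEN UPDATE for one column.

    rows: list of (pk, value) pairs; values are passed as parameters.
    """
    whens = " ".join(f"WHEN {pk_col} = %s THEN %s" for _ in rows)
    params = [p for pair in rows for p in pair]
    pks = [pk for pk, _ in rows]
    sql = (
        f"UPDATE {table} SET {field} = CASE {whens} END "
        f"WHERE {pk_col} IN ({', '.join(['%s'] * len(pks))})"
    )
    return sql, params + pks

sql, params = case_when_update_sql("mytable", "id", "myint", [(1, 99), (2, 88)])
```

With 100,000 objects and 10 fields this shape means a million WHEN branches per statement, so both expression resolution and SQL string building scale with objects × fields.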
Change History (15)
comment:1 by , 5 years ago
Description: | modified (diff) |
---|---|
Owner: | changed from | to
Status: | new → assigned |
comment:2 by , 5 years ago
Cc: | added |
---|
comment:3 by , 5 years ago
Triage Stage: | Unreviewed → Accepted |
---|
comment:4 by , 5 years ago
I was running some more experiments earlier, just now got to analysing them: it doesn't seem like it depends on the type of field, at least in the couple of types I have in my project specifically (see https://github.com/mikicz/bulk-update-tests/blob/master/single_field_comparison.pdf). The performance is linear, in both number of objects to be updated and number of fields to be updated (see https://github.com/mikicz/bulk-update-tests/blob/master/two_fields_comparison.pdf and https://github.com/mikicz/bulk-update-tests/blob/master/all_fields_comparison.pdf).
The code that was running the experiment is in the experiment app.
comment:5 by , 5 years ago
I've been taking a look at this today. It's been a while since I've poked around inside the expressions API, but I can see a couple of issues:
- We generate a _lot_ of allocations during the whole expression resolution process. 40% of the time is spent allocating according to vmprof
- "resolve_expression" for the Case statement is very copy happy. Each resolution copies all the inner nodes, which then copy themselves.
- We apparently needlessly call "resolve_expression" *twice* for each bulk update. On the first invocation we call [it from this method](https://github.com/django/django/blob/3259983f569151232d8e3b0c3d0de3a858c2b265/django/db/models/sql/subqueries.py#L105-L109) which replaces the Q objects with WhereNodes. We then resolve *again* during execution [from this with statement](https://github.com/django/django/blob/master/django/db/models/query.py#L769-L770).
I'll keep investigating, but after naively disabling the first resolution locally everything worked as expected, which implies for the happy path this might be wasted.
comment:6 by , 5 years ago
Cc: | added |
---|
comment:7 by , 4 years ago
One suggestion is to use the UPDATE..FROM syntax along with VALUES on backends which support it. For example:
UPDATE mytable
SET mytext = myvalues.mytext, myint = myvalues.myint
FROM (
    VALUES
        (1, 'textA', 99),
        (2, 'textB', 88)
) AS myvalues (pk, mytext, myint)
WHERE mytable.pk = myvalues.pk
I know PostgreSQL supports this syntax but I haven't looked into the others.
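A hedged sketch of how such a statement could be generated for PostgreSQL with parametrized placeholders; the helper name and structure below are invented for illustration, not a proposed Django API:

```python
# Hypothetical generator for the UPDATE .. FROM (VALUES ...) form suggested
# above, targeting PostgreSQL. Values are passed as %s parameters.

def values_update_sql(table, pk_col, cols, rows):
    """rows: list of tuples (pk, col1, col2, ...)."""
    assignments = ", ".join(f"{c} = myvalues.{c}" for c in cols)
    row_ph = "(" + ", ".join(["%s"] * (len(cols) + 1)) + ")"
    values = ", ".join([row_ph] * len(rows))
    sql = (
        f"UPDATE {table} SET {assignments} "
        f"FROM (VALUES {values}) AS myvalues ({pk_col}, {', '.join(cols)}) "
        f"WHERE {table}.{pk_col} = myvalues.{pk_col}"
    )
    params = [p for row in rows for p in row]
    return sql, params

sql, params = values_update_sql(
    "mytable", "pk", ["mytext", "myint"],
    [(1, "textA", 99), (2, "textB", 88)],
)
```

Note that with every value as a bare placeholder, PostgreSQL may need explicit casts in the VALUES list for some column types; this sketch ignores that.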
comment:8 by , 4 years ago
Here's another approach I thought would be worth mentioning. PostgreSQL supports COPY FROM for copying data from a file (or stream) to a table. For bulk_update() then, you could use COPY FROM to copy the data to a temp table, and then do the update from that. I wonder if that would be faster than even django-bulk-update's approach, in part because little SQL would be needed.
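A rough sketch of that temp-table flow, assuming a PostgreSQL backend; the table and helper names are invented, and in practice the COPY step would go through something like psycopg2's cursor.copy_expert with the stream below:

```python
# Sketch of the COPY-into-temp-table approach described above, expressed as
# the statement sequence a PostgreSQL backend could run. All names made up.
import io

def copy_update_statements(table, pk_col, cols):
    all_cols = [pk_col] + cols
    return [
        # 1. temp table matching the target table's column types
        f"CREATE TEMP TABLE tmp_upd (LIKE {table}) ON COMMIT DROP",
        # 2. bulk-load the new values via COPY (stream, not SQL literals)
        f"COPY tmp_upd ({', '.join(all_cols)}) FROM STDIN",
        # 3. one set-based update joining on the pk
        f"UPDATE {table} SET "
        + ", ".join(f"{c} = tmp_upd.{c}" for c in cols)
        + f" FROM tmp_upd WHERE {table}.{pk_col} = tmp_upd.{pk_col}",
    ]

def rows_as_copy_stream(rows):
    """Serialize rows into COPY's default tab-separated text format
    (naive str() conversion; real escaping rules are more involved)."""
    return io.StringIO("\n".join("\t".join(map(str, r)) for r in rows) + "\n")

stmts = copy_update_statements("mytable", "pk", ["mytext", "myint"])
```

The point of the approach is that the SQL stays tiny and constant-size; only the COPY stream grows with the number of rows.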
comment:9 by , 3 years ago
I have started to implement an UPDATE FROM VALUES variant in https://github.com/netzkolchose/django-computedfields/blob/master/computedfields/fast_update.py, as I needed faster updates in the lib.
First, the speedup is huge, it performs 10 - 25 times better than the current bulk_update implementation.
But ofc, there are several issues with this:
- only supported by newer DBMS versions (SQLite 3.33+, MariaDB 10.3+, MySQL 8, Oracle unclear)
- not official ANSI SQL, thus the backends need their very own SQL templates
- MariaDB and MySQL 8 have different templates, which is not covered by Django's backend distinction
- not easy to fit into the ORM's SQL compiler templates, thus I went with string formatting for now
- F expressions cannot be used in VALUES at all
For Postgres, imho, COPY FROM would be even faster, but I did not do it due to the complexity of a totally different code path just for Postgres.
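To illustrate the per-backend template divergence mentioned in the list above (an assumption based on my reading of the respective docs, worth double-checking): MySQL 8's VALUES statement requires ROW() constructors, while PostgreSQL and SQLite accept plain parenthesized tuples. A hypothetical template table:

```python
# Hypothetical per-backend row templates for a 3-column VALUES list; the
# vendor keys and the MySQL ROW() requirement are assumptions to verify
# against each backend's documentation.
ROW_TEMPLATES = {
    "postgresql": "(%s, %s, %s)",
    "sqlite": "(%s, %s, %s)",
    "mysql": "ROW(%s, %s, %s)",  # MySQL 8.0.19+ VALUES syntax
}

def values_clause(vendor, n_rows):
    # One placeholder tuple per updated row, joined into a VALUES list.
    return "VALUES " + ", ".join([ROW_TEMPLATES[vendor]] * n_rows)
```

This is exactly the kind of difference Django's mysql backend would have to special-case for MariaDB.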
comment:10 by , 3 years ago
Some early tests with COPY FROM indicate that it even outperforms UPDATE FROM VALUES by far (at least 4 times faster). But the code is even more degenerate and violates django paradigms in almost every aspect. This is probably nothing for django itself, but maybe a third party lib that people can use if they are aware of basic restrictions.
What I find bothersome with COPY FROM:
- needs dealing with temporary tables, columns partially bootstrapped from target tables
- might need own cast/escape rules for more complex field types (have not yet looked at psycopg3's copy cursor mechanics)
- might not work for all field types / payloads (this most likely depends a lot on the used transport format/delimiters)
- impact of index on pk field of temp table uncertain (whether index creation outweighs the index gain on the final update)
Yeah well, this needs a lot more investigation before it will be usable with a nice interface...
comment:11 by , 3 years ago
Some update on COPY FROM:
Did a first playground implementation just to see the effects on performance, see https://gist.github.com/jerch/fd0fae0107ce7b153b7540111b2e89ab. The chart over there shows the mean runtime of 100 runs of x updated records in a 100k table, updating just one integer field per record. The perf tests were done with plain postgres 14 docker image, with no tweaking of any settings.
The implementation copy_update is not yet optimized for perf or neatly integrated, it is just to get an idea where the ballpark for COPY FROM would end up. bulk_update is django's current implementation, django-bulk-update is from here: https://github.com/aykut/django-bulk-update, fast_update is my early impl of direct UPDATE FROM VALUES from above.
Some observations from that:
- bulk_update shows much worse runtime behavior than all others (thus accounting is stopped early)
- django-bulk-update can keep up a bit longer, but then shows the same worse runtime growth (prolly due to the same SQL logic used?). This gets really worse for updates >5k (not shown).
- copy_update has much higher setup costs (1 to 256 updated records, kinda expected due to temp table and index creation)
- between 4096 to 8192 updates copy_update starts to pay off compared to fast_update, at 32k updates it is almost twice as fast
- not shown in the charts: for higher update counts it keeps gaining ground (almost being 4 times faster for 1M update records)
- There is something going on between 256-512 updates, as almost all implementations show a steep jump there (postgres b-tree depth change? did not investigate that further...)
Some early insights from that:
- As already stated above in an earlier comment, django's bulk_update is currently pretty wasteful, as it even drops far behind django-bulk-update, which uses the same SQL update strategy.
- fast_update is the winner in small to medium update counts, up to ~10k.
- copy_update starts to shine for update counts >10k.
comment:12 by , 3 years ago
Made some progress in https://github.com/netzkolchose/django-fast-update, which contains fast_update for Postgres, SQLite, MariaDB and MySQL 8, as well as copy_update for Postgres. The package is currently alpha (though quite complete in CI tests), would be glad to get some feedback and some real world tests.
Some notes about the implementations - I kinda gave up on closer integration of fast_update with ORM internals, this needs someone with deeper knowledge and time to actually do it. copy_update is more a PoC with additional constraints (like the values' transport repr relying a lot on correct __str__ output), imho this could be realized more straightforwardly with psycopg3's superior COPY support, once django moves there.
comment:13 by , 3 years ago
Cc: | added |
---|
comment:14 by , 2 years ago
Hey everyone, just wanted to mention that I created the following library for very fast model updates and inserts in Postgres. We have been using it in production for over 2 years.
https://github.com/cedar-team/django-bulk-load
More details about the library:
https://decode.cedar.com/fast-django-model-inserts-with-postgres/
comment:15 by , 13 months ago
Cc: | added |
---|
From looking at the profiles, the performance issue is likely due to the fact that Django uses Case, When, and Value expressions that need to be resolved, instead of raw SQL like the third party package does. I suspect using RawSQL with CASE/WHEN SQL templates instead of Case/When expressions would significantly speed up the current implementation.
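As a hedged illustration of that suggestion (helper and names invented; whether update() accepts a RawSQL built this way for every field type would need verifying): precompute the CASE/WHEN SQL and its parameter list once per field, rather than building one Case/When/Value object per row.

```python
# Sketch of the RawSQL idea: build the per-field CASE/WHEN template as a
# plain (sql, params) pair, skipping per-row expression objects entirely.

def raw_case_sql(pk_col, rows):
    """rows: (pk, value) pairs for one field -> (sql, params) for RawSQL."""
    sql = "CASE " + " ".join(f"WHEN {pk_col} = %s THEN %s" for _ in rows) + " END"
    params = [p for pair in rows for p in pair]
    return sql, params

sql, params = raw_case_sql("id", [(1, "a"), (2, "b")])
# In Django this would then be used roughly as (untested assumption):
#   qs.update(myfield=RawSQL(sql, params))
```

This keeps the generated SQL identical to today's bulk_update output while avoiding the expression-resolution and copying overhead profiled above.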