Context Navigation

← Previous Ticket
Next Ticket →

#31202 assigned Cleanup/optimization

Bulk update suffers from poor performance with large numbers of models and columns

Reported by:	Tom Forbes	Owned by:	Tom Forbes
Component:	Database layer (models, ORM)	Version:	dev
Severity:	Normal	Keywords:
Cc:	Mikuláš Poul, Florian Demmer, John Speno, Akash Kumar Sen	Triage Stage:	Accepted
Has patch:	no	Needs documentation:	no
Needs tests:	no	Patch needs improvement:	no
Easy pickings:	no	UI/UX:	no
Pull Requests:	How to create a pull request

Description (last modified by Tom Forbes) ¶

A user has reported seeing extremely slow update times when using bulk_update. With the django-bulk-update package, which does not use the expressions API and constructs raw SQL directly (https://github.com/aykut/django-bulk-update/blob/master/django_bulk_update/helper.py#L202), an update with 100,000 objects and and 10 fields takes 24 seconds. With the built in bulk_update it takes 2 minutes and 24 seconds - 6x as slow.

The user has provided a reproduction case here: https://github.com/mikicz/bulk-update-tests/blob/master/apps/something/models.py and https://github.com/mikicz/bulk-update-tests/blob/master/apps/something/test_bulk_update.py

From an initial look at the profiling that has been provided (https://github.com/aykut/django-bulk-update/files/4060369/inbuilt_pyinstrument.txt) it seems a lot of overhead comes from building the SQL query rather than executing it - I think if we can improve the performance it could speed up other workloads.

See: https://github.com/aykut/django-bulk-update/issues/75#issuecomment-576886385

According to the ticket's flags, the next step(s) to move this issue forward are:

To provide a patch by sending a pull request. Claim the ticket when you start working so that someone else doesn't duplicate effort. Before sending a pull request, review your work against the patch review checklist. Check the "Has patch" flag on the ticket after sending a pull request and include a link to the pull request in the ticket comment when making that update. The usual format is: [https://github.com/django/django/pull/#### PR].

Change History (17)

comment:1 by Tom Forbes, 5 years ago

Description:	modified (diff)
Owner:	changed from nobody to Tom Forbes
Status:	new → assigned

comment:2 by Mikuláš Poul, 5 years ago

Cc:	Mikuláš Poul added

comment:3 by Simon Charette, 5 years ago

Triage Stage:	Unreviewed → Accepted

From looking at the profiles the performance issue is likely due to the fact Django uses Case, When, and Value expressions that needs to be resolved instead of raw SQL like the third party package does.

I suspect using RawSQL with CASE/WHEN SQL templates instead of Case/When expressions would significantly speed up the current implementation.

Last edited 5 years ago by Simon Charette (previous) (diff)

comment:4 by Mikuláš Poul, 5 years ago

I was running some more experiments earlier, just now got to analysing them: it doesn't seem like it depends on the type of field, at least in the couple of types I have in my project specifically (see https://github.com/mikicz/bulk-update-tests/blob/master/single_field_comparison.pdf). The performance is linear, in both number of objects to be updated and number of fields to be updated (see https://github.com/mikicz/bulk-update-tests/blob/master/two_fields_comparison.pdf and https://github.com/mikicz/bulk-update-tests/blob/master/all_fields_comparison.pdf).
The code that was running the experiment is in the experiment app.

comment:5 by Tom Forbes, 5 years ago

I've been taking a look at this today. It's been a while since I've poked around inside the expressions API, but I can see a couple of issues:

We generate a _lot_ of allocations during the whole expression resolution process. 40% of the time is spent allocating according to vmprof
"resolve_expression" for the Case statement is very copy happy. Each resolution copies all the inner nodes, which then copy themselves.
We apparently needlessly call "resolve_expression" *twice* for each bulk update. On the first invocation we call [it from this method](https://github.com/django/django/blob/3259983f569151232d8e3b0c3d0de3a858c2b265/django/db/models/sql/subqueries.py#L105-L109) which replaces the Q objects with WhereNodes. We then resolve *again* during execution [from this with statement](https://github.com/django/django/blob/master/django/db/models/query.py#L769-L770).

I'll keep investigating, but after naively disabling the first resolution locally everything worked as expected, which implies for the happy path this might be wasted.

comment:6 by Florian Demmer, 5 years ago

Cc:	Florian Demmer added

comment:7 by Hannes Ljungberg, 4 years ago

One suggestion is to use the UPDATE..FROM-syntax along with VALUES on backends which supports it. For example:

UPDATE mytable
SET
  mytext = myvalues.mytext,
  myint = myvalues.myint
FROM (
  VALUES
    (1, 'textA', 99),
    (2, 'textB', 88),
) AS myvalues (pk, mytext, myint)
WHERE mytable.pk = myvalues.pk

I know PostgreSQL supports this syntax but I haven't looked into the others.

comment:8 by Chris Jerdonek, 4 years ago

Here's another approach I thought would be worth mentioning. PostgreSQL supports COPY FROM for copying data from a file (or stream) to a table. For bulk_update() then, you could use COPY FROM to copy the data to a temp table, and then do the update from that. I wonder if that would be faster than even django-bulk-update's approach, in part because little SQL would be needed.

comment:9 by jerch, 3 years ago

I have started to implement an UPDATE FROM VALUES variant in https://github.com/netzkolchose/django-computedfields/blob/master/computedfields/fast_update.py, as I needed faster updates in the lib.

First, the speedup is huge, it performs 10 - 25 times better than the current bulk_update implementation.

But ofc, there are several issues with this:

only supported by newer DBMS versions (SQLITE 3.33+, MariaDB 10.3+, Mysql 8, Oracle unclear)
not official ANSI SQL, thus the backends need their very own SQL templates
MariaDB and Mysql8 have different templates, which is not covered by Django's backend distinction
not easy to fit into the ORM's SQL compiler templates, thus I went with string formatting for now
F expressions cannot be used in VALUES at all

For Postgres imho COPY FROM would be even faster, but I did not do it due to the needed complexity with totally different code paths for just postgres.

comment:10 by jerch, 3 years ago

Some early tests with COPY FROM indicate, that it even outperforms UPDATE FROM VALUES by far (at least 4 times faster). But the code is even more degenerated and violates django paradigms in almost every aspect. This is probably nothing for django itself, but maybe a third party lib, that people can use if they are aware of basic restrictions.

What I find bothersome with COPY FROM:

needs dealing with temporary tables, columns partially bootstrapped from target tables
might need own cast/escape rules for more complex field types (have not yet looked at psycopg3's copy cursor mechanics)
might not work for all field types / payloads (this most likely depends alot on the used transport format/delimiters)
impact of index on pk field of temp table uncertain (whether index creation outweighs the index gain on the final update)

Yeah well, this needs alot more investigation, before it will be useable with a nice interface...

Last edited 3 years ago by jerch (previous) (diff)

comment:11 by jerch, 3 years ago

Some update on COPY FROM:

Did a first playground implementation just to see the effects on performance, see https://gist.github.com/jerch/fd0fae0107ce7b153b7540111b2e89ab. The chart over there shows the mean runtime of 100 runs of x updated records in a 100k table, updating just one integer field per record. The perf tests were done with plain postgres 14 docker image, with no tweaking of any settings.

The implementation copy_update is not yet optimized for perf or neatly integrated yet, it is just to get an idea, where the ballpark for COPY FROM would end up. bulk_update is django's current implementation, django-bulk-update is from here: https://github.com/aykut/django-bulk-update, fast_update is my early impl of direct UPDATE FROM VALUES from above.

Some observations from that:

bulk_update shows much worse runtime behavior than all others (thus accounting is stopped early)
django-bulk-update can keep up a bit longer, but then shows the same worse runtime growth (prolly due to the same SQL logic used?). This gets really worse for updates >5k (not shown).
copy_update has much higher setup costs (1 to 256 updated records, kinda expected due to temp table and index creation)
between 4096 to 8192 updates copy_update starts to pay off compared to fast_update, at 32k updates it is almost twice as fast
not shown in the charts: for higher update counts it keeps gaining ground (almost being 4 times faster for 1M update records)
There is something going on between 256-512 updates, as almost all implementations show a steep jump there (postgres b-tree depth change? did not investigate that further...)

Some early insights from that:

As already stated above in an earlier comment, django's bulk_update is currently pretty wasteful, as it even drops far behind django-bulk-update, which uses the same SQL update strategy.
fast_update is the winner in small to medium update counts, up to ~10k.
copy_update starts to shine for update counts >10k.

Last edited 3 years ago by jerch (previous) (diff)

comment:12 by jerch, 3 years ago

Made some progress in https://github.com/netzkolchose/django-fast-update, which contains fast_update for Postgres, sqlite, MariaDB and Mysql8, as well as copy_update for postgres. The package is currently alpha (though quite complete in CI tests), would be glad to get some feedback and some real world tests.

Some notes about the implementations - I kinda gave up on closer integration of fast_update with ORM internals, this needs someone with deeper knowledge and time to actually do it. copy_update is more a PoC with additional constraints (like values' transport repr relies alot on correct __str__ output), imho this could be realized more straight forward with psycopg3's superior COPY support, once django moved there.

comment:13 by John Speno, 3 years ago

Cc:	John Speno added

comment:14 by Taylor H, 3 years ago

Hey everyone, just wanted to mention that I created the following library for very fast model updates and inserts in Postgres. We have been using it in production for over 2 years.

https://github.com/cedar-team/django-bulk-load

More details about the library:
https://decode.cedar.com/fast-django-model-inserts-with-postgres/

comment:15 by Akash Kumar Sen, 18 months ago

Cc:	Akash Kumar Sen added

comment:16 by Adam Sołtysik, 2 months ago

Even though the thread specifically mentions "large numbers of columns", performance issues are noticeable even with something as simple as ManyToManyField.

Let's say I have 1 million users to add to a group. Django's group.users.add(*user_ids) takes 38 seconds, while the same SQL query built directly in Python takes 10 seconds. Similarly, to remove all the users, group.users.remove(*user_ids) takes 7.5 seconds, while a raw SQL query takes 2 seconds. This is a ~4x performance difference (not even considering how much of the time is used by the DB), and it's not the biggest I've seen after rewriting some other queries. I'm struggling to imagine what could be taking so long for Django to build those.

In our project we also tried using the django-bulk-load library, as suggested above. It's certainly faster, but there are still some issues. First, the library still requires creating Django object instances, which is another known bottleneck (discussed e.g. in https://forum.djangoproject.com/t/how-to-avoid-the-overhead-of-model-instances-in-bulk-create/25538). Second, the COPY FROM approach actually turns out to be slower than a direct INSERT INTO in our Postgres database. Overall, our bare SQL queries ended up being 2x-3x faster than operations performed with django-bulk-load, which seems to be worth the slight increase in code length.

comment:17 by Simon Charette, 2 months ago

Just to make it clear, this ticket is specifically about bulk_update poor performance which neither related managers add or remove methods make use of.

In the case of group.users.add, assuming no m2m_changed signal is connected, it should result in a single bulk_create call which effectively suffers from the instance creation tax discussed on the forum.

Not saying that bulk_create couldn't be made faster, in fact in should be significantly faster on Postgres in 5.2, but that this is likely not the right place to discuss it as this ticket is focused on a very different set of issues that bulk_update is facing.

Note: See TracTickets for help on using tickets.

Download in other formats:

Issues