Opened 7 months ago

Last modified 4 months ago

#31202 assigned Cleanup/optimization

Bulk update suffers from poor performance with large numbers of models and columns

Reported by: Tom Forbes Owned by: Tom Forbes
Component: Database layer (models, ORM) Version: master
Severity: Normal Keywords:
Cc: Mikuláš Poul, Florian Demmer Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Tom Forbes)

A user has reported seeing extremely slow update times when using bulk_update. With the django-bulk-update package, which does not use the expressions API and constructs raw SQL directly (https://github.com/aykut/django-bulk-update/blob/master/django_bulk_update/helper.py#L202), an update with 100,000 objects and and 10 fields takes 24 seconds. With the built in bulk_update it takes 2 minutes and 24 seconds - 6x as slow.

The user has provided a reproduction case here: https://github.com/mikicz/bulk-update-tests/blob/master/apps/something/models.py and https://github.com/mikicz/bulk-update-tests/blob/master/apps/something/test_bulk_update.py

From an initial look at the profiling that has been provided (https://github.com/aykut/django-bulk-update/files/4060369/inbuilt_pyinstrument.txt) it seems a lot of overhead comes from building the SQL query rather than executing it - I think if we can improve the performance it could speed up other workloads.

See: https://github.com/aykut/django-bulk-update/issues/75#issuecomment-576886385

Change History (6)

comment:1 Changed 7 months ago by Tom Forbes

Description: modified (diff)
Owner: changed from nobody to Tom Forbes
Status: newassigned

comment:2 Changed 7 months ago by Mikuláš Poul

Cc: Mikuláš Poul added

comment:3 Changed 7 months ago by Simon Charette

Triage Stage: UnreviewedAccepted

From looking at the profiles the performance issue is likely due to the fact Django uses Case, When, and Value expressions that needs to be resolved instead of raw SQL like the third party package does.

I suspect using RawSQL with CASE/WHEN SQL templates instead of Case/When expressions would significantly speed up the current implementation.

Last edited 7 months ago by Simon Charette (previous) (diff)

comment:4 Changed 7 months ago by Mikuláš Poul

I was running some more experiments earlier, just now got to analysing them: it doesn't seem like it depends on the type of field, at least in the couple of types I have in my project specifically (see https://github.com/mikicz/bulk-update-tests/blob/master/single_field_comparison.pdf). The performance is linear, in both number of objects to be updated and number of fields to be updated (see https://github.com/mikicz/bulk-update-tests/blob/master/two_fields_comparison.pdf and https://github.com/mikicz/bulk-update-tests/blob/master/all_fields_comparison.pdf).
The code that was running the experiment is in the experiment app.

comment:5 Changed 6 months ago by Tom Forbes

I've been taking a look at this today. It's been a while since I've poked around inside the expressions API, but I can see a couple of issues:

  1. We generate a _lot_ of allocations during the whole expression resolution process. 40% of the time is spent allocating according to vmprof
  2. "resolve_expression" for the Case statement is very copy happy. Each resolution copies all the inner nodes, which then copy themselves.
  3. We apparently needlessly call "resolve_expression" *twice* for each bulk update. On the first invocation we call [it from this method](https://github.com/django/django/blob/3259983f569151232d8e3b0c3d0de3a858c2b265/django/db/models/sql/subqueries.py#L105-L109) which replaces the Q objects with WhereNodes. We then resolve *again* during execution [from this with statement](https://github.com/django/django/blob/master/django/db/models/query.py#L769-L770).

I'll keep investigating, but after naively disabling the first resolution locally everything worked as expected, which implies for the happy path this might be wasted.

comment:6 Changed 4 months ago by Florian Demmer

Cc: Florian Demmer added
Note: See TracTickets for help on using tickets.
Back to Top