Opened 7 months ago

Closed 6 months ago

Last modified 6 months ago

#35399 closed Cleanup/optimization (duplicate)

Reduce the "Case-When" sequence for a bulk_update when the values for a certain field are the same.

Reported by: Willem Van Onsem Owned by: nobody
Component: Database layer (models, ORM) Version: 5.0
Severity: Normal Keywords: db, bulk_update, case, when
Cc: Triage Stage: Unreviewed
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Django's .bulk_update(..) seems to work with sequences of Case-When *per* field and *per* record to update. In other words, if we have three items we want to update with pk being 1, 4 and 5, and we have a field yn and the records for pk 1 and 5 have value y, whereas the one for pk=4 it is n, then this will make a query that looks like:

UPDATE my_table
SET yn=CASE WHEN id=1 THEN 'y' WHEN id=4 THEN 'n' WHEN id=5 THEN 'y' END
WHERE id IN (1,4,5)

It thus builds a long sequence of Case-Whens, which is not efficient.

There are for most databases solutions with a temporary table that is then upserted into the main table. The Django ORM could probably use existent tools for this where we first make a temporary model, create it with the migration model, insert in bulk, then upsert with a query, and finally remove the temporary table. But a problem with this is, we might not have privileges to create an extra table in that database.

But a low-hanging optimization is to first *group* the values together that we can group together. This can be done with a defaultdict that maps the value to a list of primary keys for that value. In case the value is not hashable, we can fallback on the original case-when logic. But for values like strings, this would reduce the query to:

UPDATE my_table
SET yn=CASE WHEN id IN (1, 5) THEN 'y' WHEN id=4 THEN 'n' END
WHERE id IN (1,4,5)

That being said, I think bulk_updates should still be done in a more efficient manner, perhaps moving it to the engines, since a lot of databases allow the creation of temporary tables to make upserts a lot more efficient.

Attachments (1)

group_the_case-whens_into_sub-chunks_.patch (1.9 KB ) - added by Willem Van Onsem 7 months ago.

Download all attachments as: .zip

Change History (9)

by Willem Van Onsem, 7 months ago

comment:1 by Simon Charette, 7 months ago

Resolution: duplicate
Status: newclosed

This was discussed in #35124 recently and at lead to this discussion about the complexity of and cost of hashing certain values quickly. Unless you can provide benchmarks that this is actually faster I think time is better invested towards #29771 and #31202.

comment:2 by Willem Van Onsem, 6 months ago

I created a small testcase that has a speedup of 53.2% ±4.073% (p=95%):

def benching(n=100, m=10):
	def benchmark():
		EventStatistics.objects.bulk_update(es, fields=('scans_in',))
	def benchmark2():
		EventStatistics.objects.bulk_update2(es, fields=('scans_in',))

	es = list(EventStatistics.objects.order_by('?')[:n])
	for e in es:
		e.scans_in = random.randint(0, m)
	without = timeit.timeit(benchmark, number=15)

	es = list(EventStatistics.objects.order_by('?')[:n])
	for e in es:
		e.scans_in = random.randint(0, 10)
	_with = timeit.timeit(benchmark2, number=15)

	speedup = round(100*_with/without)  # percentage of the new one compared to the old
	print(m, n, without, _with, f'{speedup}%')
	return speedup

Where I added the bulk_update2 method to the QuerySet. The code is the proposed change, whereas bulk_update is the original one.

Here scans_in is an IntegerField, m is the number of different values (well randomly determined, so an upper bound), n is the number of records to update. I used random records to avoid that these are all located near each other, and furthermore used different ones for the two, to avoid that the indexes for example are cached, or some other speedup. Every test ran 15 times with the timeit module with the following results (first two columns are m and n, then the timings in seconds, and finally the time of the speedup, compared to the original):

10 100 0.23681159999978263 0.1789207000110764 76%
10 500 1.103456400000141 0.5554370999889215 50%
10 1000 2.4681710000004387 1.3000370000081602 53%
10 5000 13.48228929999459 6.651690699989558 49%
10 10000 33.66046780000033 15.352994000000763 46%
20 100 0.25399490000563674 0.17354469999554567 68%
20 500 1.041674799998873 0.5452321000047959 52%
20 1000 2.2131319000036456 1.3127163000026485 59%
20 5000 26.524631100008264 10.656895499996608 40%
20 10000 36.67004110000562 14.241265099990414 39%
50 100 0.21403420000569895 0.13851690001320094 65%
50 500 1.2294649999967078 0.6158274999907007 50%
50 1000 2.1560433999984525 1.1467072000086773 53%
50 5000 14.895891699998174 7.622537400005967 51%
50 10000 34.32338119999622 15.66809479999938 46%
100 100 0.24263419999624602 0.15937580000900198 66%
100 500 1.225836999990861 0.634526800000458 52%
100 1000 2.245729800008121 1.1381603000045288 51%
100 5000 13.832920399989234 6.757217299993499 49%
100 10000 39.29845810000552 19.19906909999554 49%

or as a table with the "speedup" (smaller is a stronger speedup):

---  ---  ---  ----  ----  -----
x    100  500  1000  5000  10000
10   76%  50%  53%   49%   46%
20   68%  52%  59%   40%   39%
50   65%  50%  53%   51%   46%
100  66%  52%  51%   49%   49%
---  ---  ---  ----  ----  -----

I also ran these in the opposite direction, to make sure it was not that the first queries somehow let the database respond faster (because the PostgreSQL database probably contains code pages, indexes, etc. that could be in the cache). Then we achieve the following results:

10 100 0.16778360000171233 0.26211979999789037 156%
10 500 0.575189399998635 1.1075479999999516 193%
10 1000 1.0994390999985626 2.76130669999111 251%
10 5000 6.549138200003654 14.862665100008599 227%
10 10000 14.595775299996603 41.87046229999396 287%
20 100 0.19783109999843873 0.26739639999868814 135%
20 500 0.8581540000013774 1.7171591000078479 200%
20 1000 1.173938799998723 2.096230099996319 179%
20 5000 7.9105590999970445 15.384056599999894 194%
20 10000 16.410469500013278 36.77300239999022 224%
50 100 0.22504739998839796 0.22487839999666903 100%
50 500 0.6390990999934729 1.1038762999960454 173%
50 1000 1.2254422000114573 2.0602585000015097 168%
50 5000 6.814096200003405 13.404201700002886 197%
50 10000 18.954565500011086 41.927377500003786 221%
100 100 0.2481432000058703 0.22796000000380445 92%
100 500 0.7834087000082945 1.3490471999975853 172%
100 1000 1.5713398999942 2.1639006000041263 138%
100 5000 7.152072100012447 13.975013000002946 195%
100 10000 16.15594609999971 36.78454979999515 228%

or as a table with the parameters (greater is a stronger speedup) we get 186.5% ±21.048% (p=95%):

---  ----  ----  ----  ----  -----
x    100   500   1000  5000  10000
10   156%  193%  251%  227%  287%
20   135%  200%  179%  194%  224%
50   100%  173%  168%  197%  221%
100  92%   172%  138%  195%  228%
---  ----  ----  ----  ----  -----

Finally I also moved the generation or random numbers into the testcase, this because if the values are once updated, this might reduce the query time, since it does not require writing to a disk, so:

def benching(n=100, m=10):
	def benchmark():
		es = list(EventStatistics.objects.order_by('?')[:n])
		for e in es:
			e.scans_in = random.randint(0, m)
		EventStatistics.objects.bulk_update(es, fields=('scans_in',))
	def benchmark2():
		es = list(EventStatistics.objects.order_by('?')[:n])
		for e in es:
			e.scans_in = random.randint(0, 10)
		EventStatistics.objects.bulk_update2(es, fields=('scans_in',))

	without = timeit.timeit(benchmark, number=15)
	_with = timeit.timeit(benchmark2, number=15)

	speedup = round(100*_with/without)  # percentage of the new one compared to the old
	print(m, n, without, _with, f'{speedup}%')
	return speedup

This will "flatten out" the speedup, since the part where we assign values also counts as time, and this is the same for both versions. With the original bulk_update first, we get the following results (76.55% ±10.293%, p=95%):

10 100 2.384173699989333 2.2892083000042476 96%
10 500 3.9282099999982165 2.949110200002906 75%
10 1000 5.0259445999981835 5.3350902999955 106%
10 5000 20.10881380000501 11.846710900004837 59%
10 10000 50.26291349998792 24.37554049999744 48%
20 100 1.7966537000029348 2.1353922999987844 119%
20 500 2.8477611999987857 3.5514096999977482 125%
20 1000 4.702102600000217 3.8196782000013627 81%
20 5000 19.417894899990642 12.532883399995626 65%
20 10000 45.69321430000127 20.617325500003062 45%
50 100 1.9369860000006156 1.8516613000101643 96%
50 500 3.1590656000043964 2.6837016999925254 85%
50 1000 4.705449500004761 2.9762405999936163 63%
50 5000 17.380039900002885 10.148739099997329 58%
50 10000 40.880203600012464 20.52081229999021 50%
100 100 1.7478849000035552 1.6249813000031281 93%
100 500 3.1730380000080913 2.2462558999977773 71%
100 1000 4.052220900004613 3.3273870999983046 82%
100 5000 17.978366199997254 11.050199799996335 61%
100 10000 38.403917099989485 20.502074199990602 53%
---  ----  ----  ----  ----  -----
x    100   500   1000  5000  10000
10   96%   75%   106%  59%   48%
20   119%  125%  81%   65%   45%
50   96%   85%   63%   58%   50%
100  93%   71%   82%   61%   53%
---  ----  ----  ----  ----  -----

and with the "improved" version first (142.9% ±19.563%, p=95%):

10 100 2.5029061999957776 1.902948799994192 76%
10 500 2.3854384000005666 3.409526500006905 143%
10 1000 3.062690600010683 4.10313469999528 134%
10 5000 10.875457200003439 17.695201499998802 163%
10 10000 21.629492600011872 48.996730700004264 227%
20 100 2.12547889999405 2.262495199989644 106%
20 500 2.699372699993546 3.4100137000059476 126%
20 1000 3.840364799994859 4.572843499990995 119%
20 5000 10.895366700002342 19.433951699989848 178%
20 10000 22.678348500005086 42.50577279999561 187%
50 100 3.069858699993347 2.5085584000044037 82%
50 500 2.7575027999992017 3.3038902999978745 120%
50 1000 3.572029700007988 4.635487000006833 130%
50 5000 10.223477299994556 20.650777300004847 202%
50 10000 30.925161999999546 56.15645150000637 182%
100 100 3.82010420000006 2.949311400006991 77%
100 500 3.6471351999934996 6.216737800001283 170%
100 1000 6.9234703999973135 9.566761500012944 138%
100 5000 26.763996299996506 25.526094300003024 95%
100 10000 25.562641399999848 51.87259610000183 203%
---  ----  ----  ----  ----  -----
x    100   500   1000  5000  10000
10   76%   143%  134%  163%  227%
20   106%  126%  119%  178%  187%
50   82%   120%  130%  202%  182%
100  77%   170%  138%  95%   203%
---  ----  ----  ----  ----  -----

Some extra ideas to optimize is to put the Case with the largest number of pks first, since that is then the first one that will be taken, avoiding to have to walk over all cases by the database, which is essentially linear search.

comment:3 by Willem Van Onsem, 6 months ago

Resolution: duplicate
Status: closednew

comment:4 by Simon Charette, 6 months ago

Thank you for taking the time to perform the benchmarks, I think you missed the point brought up about hashing being expensive (or simply not usable) on non-literal values though.

In other words int.__hash__ is pretty fast but Expression.__hash__ isn't and you need to perform an implicit one to gather values in defaultdict(list). Since bulk_update support both expressions and literal assignment the benchmark must be run against both to be representative.

Try running them with assignments of F("scans_in") + Value(random.randint(0, m), output_field=IntegerField()) instead and you should see a slowdown.

The need for _hashability_ also poses another problem. What should be done with values that are not hashable such as JSONField assignment of dicts?

objects = [
    Object(pk=1, data={"foo": "bar"}),
    Object(pk=2, data={"foo": "bar"}),
]
Object.objects.bulk_update(es, fields=["data"])

When all of these edge cases are accounted for (we'll need a route for non-hashable values mixed with hashable values) I still believe that time would be better spend not using CASE / WHEN at all and favor investing efforts towards #29771 and #31202.

Last edited 6 months ago by Simon Charette (previous) (diff)

comment:5 by Natalia Bidart, 6 months ago

Resolution: duplicate
Status: newclosed
Type: New featureCleanup/optimization

Hello Willem, while the benchmarks are useful, considering Simon's responses and Django's usual procedures for accepting changes, I believe it would be best to continue the conversation in the forum topic that has already been created for this issue. This way, more members of the community will receive notifications from the forum, unlike ticket reports:

https://forum.djangoproject.com/t/could-bulk-update-aggregate-the-pks-in-when-clauses/27093

comment:6 by Willem Van Onsem, 6 months ago

Hi, that is already in the changeset: in case the item is not hashable, it will raise a type error: we catch that, and then fallback on the original way of doing it, without much delay. So it is more a boosting mechanism that when fails, is "disabled" when running through it.

The hash of an expression indeed is take more time than the one of an int, but the idea is that this often will still pay off, since the cascade of Case - When essentially forces the database into linear search, while it *might* be possible that a certain database finds a way, if that is not the case, it will first look at the first When, if that fails, the next one, etc. so for ~1000 Whens, it will on average take 500 items per row.

I agree that a temporary table is probably more ideal.

Last edited 6 months ago by Willem Van Onsem (previous) (diff)

comment:7 by Simon Charette, 6 months ago

it will raise a type error: we catch that, and then fallback on the original way of doing it, without much delay.

Creating and catching exceptions is not cheap either so I'm not sure we can qualify it as without much delay. I don't see any benchmark backing it is what I mean.

the idea is that this often will still pay off, since the cascade of Case - When essentially forces the database into linear search

I see what you mean with the database level linear scan but my concerns is that the extra amount time we spend massaging the data on the Python side pre-emptively in hope of preventing the database level slowdown from happening is going to cause more harm than good.

You benchmarked against are the happy path where a there's a single field to update, the values are all literals, the values are all hashable. I think that a proper solution should ensure that the Python layer tax of organizing the data doesn't outweigh the expected database level benefits.

Having separate benchmarks for query / SQL generation in both variants of bulk_update vs database level execution of the SQL on supported backends would help in taking more advised decision here. All of that can take place on the forum though.

comment:8 by Willem Van Onsem, 6 months ago

Well typically hashing should run linear with the data, so Value(1) should indeed take more time than 1, but not dramatically more (by the way, it already hashes Value(1), since it first checks the if not hasattr(attr, "resolve_expression") first, and wraps it into a Value, so that is where the current benmarks originate from. If we would have done =Value(random.randint(0, 10)) for this benchmark, it would make no difference. From the moment it encounters a hash error, it sets the dictionary to None, and thus no longer hashes, and saves these cycles, it thus will stop looking for hashes if one of the items can not be hashed.

But probably the main reason why it would be very strange that the hashing would increase time significantly is that in order to generate an SQL counterpart of some expression (like F('a') + Value(1), it will *also* take linear). So essentially *building* the SQL query with *all* Case / When` items, will always take approximately the same effort as hashing all these items, since both run linear in the "size" of the SQL expressions (or at least, that is a reasonable assumption), so we will lose that effort anyway. It will thus probably at most ~double the amount of effort to generate the query, if there are (close) to no duplicate expressions available, and my experience is that building the query itself, often is not really the bottleneck: once to generate the hashes, and once to generate the query. If there are duplicate expressions, it saves also on generate parts of the SQL query, which again will probably not have much impact in a positive or negative way, since that is almost never the bottleneck.

I will try to run some benchmarks with F('scans_in') + Value(random.randint(0, 10)) expressions tomorrow.

Last edited 6 months ago by Willem Van Onsem (previous) (diff)
Note: See TracTickets for help on using tickets.
Back to Top