Models.save() refactoring: check updated rows to determine action
|Reported by:||akaariai||Owned by:||nobody|
|Component:||Database layer (models, ORM)||Version:||1.3|
|Cc:||anssi.kaariainen@…, marti@…||Triage Stage:||Ready for checkin|
|Has patch:||yes||Needs documentation:||no|
|Needs tests:||no||Patch needs improvement:||no|
The attached patch implements a new logic for models.save(). Instead of the old
SELECT if found: UPDATE else: INSERT The patch uses UPDATE if rows modified: DONE else: INSERT
The logic is naturally more complex when tacking in account force_update, force_insert and all the corner cases.
As a teaser: load 1000 objects from db, update field, save. Almost 50% faster now. (Remember to run in one transaction).
The gain is that when the save() results in an update, there is no need for an extra select. In my somewhat limited tests, with PostgreSQL and 10000 saves of a simple model in single transaction the result was almost 50% speed up. The speedup is similar when using SQLite3.
For the insert case, when there is no PK value for the model, there is no effect. The result is an insert as always.
For the insert case where there is a PK value defined, and there is no matching pk value in the DB, the idea results in a speed loss. The reason is that when doing the UPDATE which doesn't update anything, the data must be transferred to the db. The speed loss is very small for models which have not too many / wide fields, and according to my quick testing, it is somewhere from 10 to 20% with very wide (100000 chars) models. This is localhost setup, so if there is a slow (as in throughput) network involved, I would imagine this would result in pretty much exactly 50% speed loss. There is patch attached to this ticket mitigating this ticket. In the end of this post there is an explanation how to mitigate this problem.
There is one big catch: It is currently very well documented that Django does a select and then based on that an update or an insert. I can't see changing that could break applications (famous last words) so this is just a documentation issue. The other downside is that this makes save_base more complicated. Especially the patch no2, we are sometimes selecting and sometimes updating as first action, and it is not easy to guess the path taken.
The speedtests I did are basically different ways of saving models with pk set and not set. I will post some benchmarks if this is not thrown out on backwards compatibility / complexity claims. I don't have any benchmarks worth posting at the moment.
All tests passed using SQLite3, running PostgreSQL tests currently. No new tests added. Needs documentation, but I might not be the best person to do that. MultiDB testing would be welcome. I hope to get some feedback. Negative or positive...
To work around the unnecessary data transfer problem, I used the model._state.adding variable. The idea is simple. If we are adding (the model did not come from the db) and there is a pk value, we don't expect there to be any data matching for our primary key. So we use the old select check in this case to avoid the potential network traffic. Based on the select, set force_insert or force_update to get the desired action. This mimics exactly what was done before, expect for error messages in obscure cases. Those can be fixed, too.
When the _state.adding is False (the model did come from the db) we expect the save() to result in an update. So we use the above described method.
There is still one situation where that guess can fail. Multidb. Save the model first in master db, then in slave dbs. The pk is set and adding=False, but we are still adding in reality. For this there is a check for self._state.db <> using. I would except that in these situations people would use force_insert=True, but I am no multidb person so not sure about that.
I did a third small optimization. The idea is again simple. When saving a inheriting model, the parents are saved first (there is no change here). If the parent was inserted, we know this model must also be inserted. It is not possible for a model to exist without it's parent. This way we don't need to do any checking, we can force_insert. For update the same doesn't work: it is possible that the parent exists already, and we are "extending" it at the moment, saving a new model. This optimization can result also in almost 50% speedup in the best case, although the cases are more rare. I don't know any downsides to this optimization.
There are three patches. The first is the base idea, the second is the _state.adding trick and the third is the parent inheritance thingy. These must be applied in order.
Sorry for the long post. I was going to do just the update and check rows returned idea. But why not do just one small thing more... :)
Change History (21)
comment:1 Changed 5 years ago by aaugustin
- Needs documentation unset
- Needs tests unset
- Patch needs improvement unset
- Triage Stage changed from Unreviewed to Design decision needed
comment:6 Changed 4 years ago by akaariai
- Triage Stage changed from Design decision needed to Ready for checkin
comment:9 Changed 4 years ago by akaariai
- Patch needs improvement set
- Triage Stage changed from Ready for checkin to Accepted
comment:10 Changed 3 years ago by akaariai
- Patch needs improvement unset
- Triage Stage changed from Accepted to Ready for checkin
comment:11 Changed 3 years ago by Anssi Kääriäinen <akaariai@…>
- Resolution set to fixed
- Status changed from new to closed