#16649 closed Cleanup/optimization (fixed)
Models.save() refactoring: check updated rows to determine action
Reported by: | Anssi Kääriäinen | Owned by: | nobody |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | 1.3 |
Severity: | Normal | Keywords: | |
Cc: | anssi.kaariainen@…, marti@… | Triage Stage: | Ready for checkin |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
The attached patch implements a new logic for models.save(). Instead of the old
SELECT if found: UPDATE else: INSERT The patch uses UPDATE if rows modified: DONE else: INSERT
The logic is naturally more complex when tacking in account force_update, force_insert and all the corner cases.
As a teaser: load 1000 objects from db, update field, save. Almost 50% faster now. (Remember to run in one transaction).
The gain is that when the save() results in an update, there is no need for an extra select. In my somewhat limited tests, with PostgreSQL and 10000 saves of a simple model in single transaction the result was almost 50% speed up. The speedup is similar when using SQLite3.
For the insert case, when there is no PK value for the model, there is no effect. The result is an insert as always.
For the insert case where there is a PK value defined, and there is no matching pk value in the DB, the idea results in a speed loss. The reason is that when doing the UPDATE which doesn't update anything, the data must be transferred to the db. The speed loss is very small for models which have not too many / wide fields, and according to my quick testing, it is somewhere from 10 to 20% with very wide (100000 chars) models. This is localhost setup, so if there is a slow (as in throughput) network involved, I would imagine this would result in pretty much exactly 50% speed loss. There is patch attached to this ticket mitigating this ticket. In the end of this post there is an explanation how to mitigate this problem.
There is one big catch: It is currently very well documented that Django does a select and then based on that an update or an insert. I can't see changing that could break applications (famous last words) so this is just a documentation issue. The other downside is that this makes save_base more complicated. Especially the patch no2, we are sometimes selecting and sometimes updating as first action, and it is not easy to guess the path taken.
The speedtests I did are basically different ways of saving models with pk set and not set. I will post some benchmarks if this is not thrown out on backwards compatibility / complexity claims. I don't have any benchmarks worth posting at the moment.
All tests passed using SQLite3, running PostgreSQL tests currently. No new tests added. Needs documentation, but I might not be the best person to do that. MultiDB testing would be welcome. I hope to get some feedback. Negative or positive...
To work around the unnecessary data transfer problem, I used the model._state.adding variable. The idea is simple. If we are adding (the model did not come from the db) and there is a pk value, we don't expect there to be any data matching for our primary key. So we use the old select check in this case to avoid the potential network traffic. Based on the select, set force_insert or force_update to get the desired action. This mimics exactly what was done before, expect for error messages in obscure cases. Those can be fixed, too.
When the _state.adding is False (the model did come from the db) we expect the save() to result in an update. So we use the above described method.
There is still one situation where that guess can fail. Multidb. Save the model first in master db, then in slave dbs. The pk is set and adding=False, but we are still adding in reality. For this there is a check for self._state.db <> using. I would except that in these situations people would use force_insert=True, but I am no multidb person so not sure about that.
I did a third small optimization. The idea is again simple. When saving a inheriting model, the parents are saved first (there is no change here). If the parent was inserted, we know this model must also be inserted. It is not possible for a model to exist without it's parent. This way we don't need to do any checking, we can force_insert. For update the same doesn't work: it is possible that the parent exists already, and we are "extending" it at the moment, saving a new model. This optimization can result also in almost 50% speedup in the best case, although the cases are more rare. I don't know any downsides to this optimization.
There are three patches. The first is the base idea, the second is the _state.adding trick and the third is the parent inheritance thingy. These must be applied in order.
Sorry for the long post. I was going to do just the update and check rows returned idea. But why not do just one small thing more... :)
Attachments (3)
Change History (21)
by , 13 years ago
Attachment: | save_no_select.diff added |
---|
by , 13 years ago
Attachment: | save_adding_hint.diff added |
---|
When saving, choose between select or update path based on model._state
by , 13 years ago
Attachment: | save_inherit.diff added |
---|
Optimization for inherited model insertion, now the correct version
comment:1 by , 13 years ago
Triage Stage: | Unreviewed → Design decision needed |
---|
The idea makes sense and I think it's worth studying.
However, Trac isn't the best place to reach an audience for comments, I suggest writing to django-developers if you haven't done so yet.
comment:2 by , 13 years ago
akaariai, can you back up a few steps and explain *why* you think this is a good idea? As far as I can tell, the speed benefits are just a trade-off: you trade create performance for update performance. Either call we make is going to adversely affect some users, so that's exactly why the force_update
/_insert
arguments exist. As it stands, I'm fairly well against the idea: I can't see the benefit in making a fundamental change like this, and I *can* see a lot of downside to changing the behavior of save()
. More information might change my mind, though.
Thanks!
comment:3 by , 13 years ago
In summary, update case now requires just one query while previously it required 2. Insert case still requires 2 queries. Unfortunately the update -> insert way can be a little bit more expensive than doing select -> insert. This is due to the need to transfer the data to the DB two times. There are still reasons why I think this is an optimization:
- I think we will be gaining more from the update case speedup than we are going to lose in the insert case. Needs benchmarks.
- In the common case of an autofield pk we are even more likely to gain. Autofield gets its value on first save and should not change after that. If the field is set to a value it is very likely we are going to do an update. If it is not set, we are going to do an insert directly.
- When loading fixtures, or when saving the same model to another DB we are likely going to do an insert. In this case we can use a heuristic: if we are adding the object to the DB (instance._state.adding=True) or if we are saving to a different DB (instance._state.db <> using) we use the old method of SELECT -> insert / update.
All three above combined, I don't think there are many cases where losses are expected. The only case I can think of is reusing an object in a loop while changing its primary key, something like "obj = Obj(); for i in range(0, 100): obj.pk=i; obj.save()" - but I don't think that is a case we should optimize for. In fact, I think that should either result in error or update of the PK value, but that is another ticket's problem.
Of course, the best optimization would be to use database specific methods (MERGE or INSERT ON DUPLICATE KEY UPDATE). This would allow for single query save in every case. Unfortunately there is no such method for PostgreSQL, at least not any method I know of. It might make sense to investigate this route first.
Sorry for making the original post so long, there are a lot of ideas mixed in one ticket. I will try to be a better trac-citizen in the future :)
comment:4 by , 12 years ago
A bit more information about what the approach suggested is about. Assume you load a model from DB, change some fields and save it. We currently do:
SELECT - load from DB [change fields]; save() In save: SELECT - assert that the PK exists in the DB if yes: UPDATE else: INSERT
The second select above is usually redundant. If we know the object is loaded from DB (self._state.adding == False), and we are saving it to the same DB it was loaded from (self._state.db == the db we are saving to), then it is very likely the object exists in the DB. So, instead we should do:
In save: if same db, not adding: exists = UPDATE else: exists = SELECT if exists: UPDATE if not exists: INSERT
So, in the case the model is already in the DB, we do just a single update instead of select + update. If it so happens it doesn't exist (which should be very unlikely), we do update with zero rows modified and still know to do an insert.
This does lead to behaviour change on MySQL when there is a concurrent insert to the same table. Currently the SELECT sees nothing, and the insert will lead to integrity error. After this, the update will block until the inserting transaction is committed, and then update the matching row. However, on PostgreSQL the update sees nothing, so there is no behaviour change there. For MySQL users this will be better than what we have now. For PostgreSQL users there is no change from behaviour standpoint.
comment:5 by , 12 years ago
Sorry, the explanations above are a little confusing. Still another try:
s = SomeModel.objects.get(pk=someval) s.somecol = someval s.save()
Here save is implemented as SELECT - if not found INSERT - else UPDATE. The SELECT here is redundant, we have the information that SELECT was just done in model._state. So, we can do directly an UPDATE. If it happens so that nothing is updated (likely because of concurrent delete) then we will still do the INSERT and things work as expected.
I have done a full refactor of model.save_base(), this includes 4 parts:
- What this ticket deals with - trying directly to UPDATE if the model was fetched from the same DB we are saving to. Also assorted cleanup to make this possible.
- Cleanup to proxy parents handling (the logic is somewhat ugly currently)
- A bug fix for #17341 (do not commit transaction after every parent model save)
- Splitting save_base into parts so that the saving logic is easier to follow
Above, the bug fix is of course needed, and the proxy parents handling cleanup is IMO also needed.
I can't see any downsides to trying UPDATE directly as done in the patch. This should result in clear performance improvement in most cases. There is a problem that we have documented very explicitly the sequence of SQL commands .save() does but I don't think that documentation should be considered part of the public API. So, docs update is needed.
The refactoring of .save_base() into parts is a stylistic question. I surprisingly prefer the refactored way. Some examples of things that are hard to see from the current writing of the code:
- Which signals are sent for which models
- The actual hard work is done in the if not meta.proxy: branch. However, it is impossible that meta.proxy is True here. And it isn't exactly easy to see this.
- The origin parameter is a bit weird - it is None except for the outermost table save.
Still, I'd like to get a confirmation that others prefer the new writing, too.
EDIT: forgot to mention that all tests pass on all core databases.
comment:6 by , 12 years ago
Triage Stage: | Design decision needed → Ready for checkin |
---|
Squash-rebase of the above branch available from: https://github.com/akaariai/django/compare/django:master...model_save_refactor_squash
I am moving this out of DDN - I have written to django-developers and I haven't seen any -1 (or +1 for that matter) when I suggested this change. I take this as nobody opposing the change. I think the only reason to oppose this feature is that something might break. As far as I know nothing should break, but the only way to find out for sure is moving this in and getting real world usage of the new save() way.
A final review (if only looking for typos & stale comments) is welcome. Another thing I'd still like to get reviewed is the structuring of the new code - do the method names make sense, are the splitpoints sensible?
comment:9 by , 12 years ago
Patch needs improvement: | set |
---|---|
Triage Stage: | Ready for checkin → Accepted |
Thanks for the comments! The Finnish dialect of English has special grammar...
comment:10 by , 12 years ago
Patch needs improvement: | unset |
---|---|
Triage Stage: | Accepted → Ready for checkin |
Patch updated, https://github.com/akaariai/django/compare/django:master...model_save_refactor_squash. I removed the "if from same DB" check. Now the patch just always tries UPDATE first if the model has PK set. It will be easy enough to change the behaviour later on if that is needed.
EDIT: stupid mistakes in comment fixed...
comment:11 by , 12 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
follow-up: 13 comment:12 by , 12 years ago
While using update_fields on one of my projects I noticed a possible reason to revert the way update count is used in this ticket's committed patch. On PostgreSQL, if there is a trigger for the updated table, and this trigger returns NULL in some conditions, then the UPDATE statement is seen as affecting zero rows. This will result in Django seeing zero rows updated, and this again will result in error when using update_fields as the update didn't seem to affect any rows.
For plain .save() in 1.6 the result will be IntegrityError - the update doesn't seem to affect any rows, but there is a row matching the update condition in the DB, so Django will try to UPDATE, see no rows matched, and then try to INSERT -> boom.
The end result is that code that worked in 1.5 will not work in 1.6 if there are ON UPDATE triggers which return NULL. I am not sure about how common this condition will be. It will be easy to change the detection in save_base() from UPDATE - if not updated - INSERT to SELECT - if found UPDATE - else INSERT. But I would like to wait and see if this problem is common enough to worry about...
comment:13 by , 11 years ago
Cc: | added |
---|
We hit the problem mentioned in comment 12 while trying out Django 1.6 beta. We are using the PostgreSQL built-in suppress_redundant_updates_trigger trigger on some tables to save I/O when saving unchanged rows. With Django 1.6 this now causes IntegrityErrors.
No big deal for us, we can drop the trigger when we migrate to 1.6, but I can imagine there are situations where trigger logic is a necessary part of the application. Maybe there should be a workaround for those poor users, like a per-model option to fall back to the old save behavior?
comment:14 by , 11 years ago
Maybe something like select_on_save or somesuch _meta option. I'd like to keep the default False as trying direct update is the correct thing to do most of the time. I will check how hard this is to do.
comment:15 by , 11 years ago
I've added a crude patch for "select_on_save" option. Docs missing, maybe more tests needed, too. See https://github.com/akaariai/django/commit/1ca398b0352b11868ec236f92cf17b6ce82ff88c
EDIT: If we want to do something about this a new ticket is a good idea. Also, the only currently known issue is PostgreSQL's ON UPDATE trigger that returns NULL.
save() without select