Opened 13 years ago

Closed 12 years ago

Last modified 11 years ago

#16649 closed Cleanup/optimization (fixed)

Models.save() refactoring: check updated rows to determine action

Reported by: Anssi Kääriäinen Owned by: nobody
Component: Database layer (models, ORM) Version: 1.3
Severity: Normal Keywords:
Cc: anssi.kaariainen@…, marti@… Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

The attached patch implements a new logic for models.save(). Instead of the old

SELECT
if found:
    UPDATE
else:
    INSERT

The patch uses
UPDATE
if rows modified:
    DONE
else:
    INSERT

The logic is naturally more complex when tacking in account force_update, force_insert and all the corner cases.

As a teaser: load 1000 objects from db, update field, save. Almost 50% faster now. (Remember to run in one transaction).

The gain is that when the save() results in an update, there is no need for an extra select. In my somewhat limited tests, with PostgreSQL and 10000 saves of a simple model in single transaction the result was almost 50% speed up. The speedup is similar when using SQLite3.

For the insert case, when there is no PK value for the model, there is no effect. The result is an insert as always.

For the insert case where there is a PK value defined, and there is no matching pk value in the DB, the idea results in a speed loss. The reason is that when doing the UPDATE which doesn't update anything, the data must be transferred to the db. The speed loss is very small for models which have not too many / wide fields, and according to my quick testing, it is somewhere from 10 to 20% with very wide (100000 chars) models. This is localhost setup, so if there is a slow (as in throughput) network involved, I would imagine this would result in pretty much exactly 50% speed loss. There is patch attached to this ticket mitigating this ticket. In the end of this post there is an explanation how to mitigate this problem.

There is one big catch: It is currently very well documented that Django does a select and then based on that an update or an insert. I can't see changing that could break applications (famous last words) so this is just a documentation issue. The other downside is that this makes save_base more complicated. Especially the patch no2, we are sometimes selecting and sometimes updating as first action, and it is not easy to guess the path taken.

The speedtests I did are basically different ways of saving models with pk set and not set. I will post some benchmarks if this is not thrown out on backwards compatibility / complexity claims. I don't have any benchmarks worth posting at the moment.

All tests passed using SQLite3, running PostgreSQL tests currently. No new tests added. Needs documentation, but I might not be the best person to do that. MultiDB testing would be welcome. I hope to get some feedback. Negative or positive...


To work around the unnecessary data transfer problem, I used the model._state.adding variable. The idea is simple. If we are adding (the model did not come from the db) and there is a pk value, we don't expect there to be any data matching for our primary key. So we use the old select check in this case to avoid the potential network traffic. Based on the select, set force_insert or force_update to get the desired action. This mimics exactly what was done before, expect for error messages in obscure cases. Those can be fixed, too.

When the _state.adding is False (the model did come from the db) we expect the save() to result in an update. So we use the above described method.

There is still one situation where that guess can fail. Multidb. Save the model first in master db, then in slave dbs. The pk is set and adding=False, but we are still adding in reality. For this there is a check for self._state.db <> using. I would except that in these situations people would use force_insert=True, but I am no multidb person so not sure about that.

I did a third small optimization. The idea is again simple. When saving a inheriting model, the parents are saved first (there is no change here). If the parent was inserted, we know this model must also be inserted. It is not possible for a model to exist without it's parent. This way we don't need to do any checking, we can force_insert. For update the same doesn't work: it is possible that the parent exists already, and we are "extending" it at the moment, saving a new model. This optimization can result also in almost 50% speedup in the best case, although the cases are more rare. I don't know any downsides to this optimization.

There are three patches. The first is the base idea, the second is the _state.adding trick and the third is the parent inheritance thingy. These must be applied in order.

Sorry for the long post. I was going to do just the update and check rows returned idea. But why not do just one small thing more... :)

Attachments (3)

save_no_select.diff (2.5 KB ) - added by Anssi Kääriäinen 13 years ago.
save() without select
save_adding_hint.diff (2.1 KB ) - added by Anssi Kääriäinen 13 years ago.
When saving, choose between select or update path based on model._state
save_inherit.diff (3.1 KB ) - added by Anssi Kääriäinen 13 years ago.
Optimization for inherited model insertion, now the correct version

Download all attachments as: .zip

Change History (21)

by Anssi Kääriäinen, 13 years ago

Attachment: save_no_select.diff added

save() without select

by Anssi Kääriäinen, 13 years ago

Attachment: save_adding_hint.diff added

When saving, choose between select or update path based on model._state

by Anssi Kääriäinen, 13 years ago

Attachment: save_inherit.diff added

Optimization for inherited model insertion, now the correct version

comment:1 by Aymeric Augustin, 13 years ago

Triage Stage: UnreviewedDesign decision needed

The idea makes sense and I think it's worth studying.

However, Trac isn't the best place to reach an audience for comments, I suggest writing to django-developers if you haven't done so yet.

comment:2 by Jacob, 13 years ago

akaariai, can you back up a few steps and explain *why* you think this is a good idea? As far as I can tell, the speed benefits are just a trade-off: you trade create performance for update performance. Either call we make is going to adversely affect some users, so that's exactly why the force_update/_insert arguments exist. As it stands, I'm fairly well against the idea: I can't see the benefit in making a fundamental change like this, and I *can* see a lot of downside to changing the behavior of save(). More information might change my mind, though.

Thanks!

comment:3 by Anssi Kääriäinen, 13 years ago

In summary, update case now requires just one query while previously it required 2. Insert case still requires 2 queries. Unfortunately the update -> insert way can be a little bit more expensive than doing select -> insert. This is due to the need to transfer the data to the DB two times. There are still reasons why I think this is an optimization:

  • I think we will be gaining more from the update case speedup than we are going to lose in the insert case. Needs benchmarks.
  • In the common case of an autofield pk we are even more likely to gain. Autofield gets its value on first save and should not change after that. If the field is set to a value it is very likely we are going to do an update. If it is not set, we are going to do an insert directly.
  • When loading fixtures, or when saving the same model to another DB we are likely going to do an insert. In this case we can use a heuristic: if we are adding the object to the DB (instance._state.adding=True) or if we are saving to a different DB (instance._state.db <> using) we use the old method of SELECT -> insert / update.

All three above combined, I don't think there are many cases where losses are expected. The only case I can think of is reusing an object in a loop while changing its primary key, something like "obj = Obj(); for i in range(0, 100): obj.pk=i; obj.save()" - but I don't think that is a case we should optimize for. In fact, I think that should either result in error or update of the PK value, but that is another ticket's problem.

Of course, the best optimization would be to use database specific methods (MERGE or INSERT ON DUPLICATE KEY UPDATE). This would allow for single query save in every case. Unfortunately there is no such method for PostgreSQL, at least not any method I know of. It might make sense to investigate this route first.

Sorry for making the original post so long, there are a lot of ideas mixed in one ticket. I will try to be a better trac-citizen in the future :)

comment:4 by Anssi Kääriäinen, 12 years ago

A bit more information about what the approach suggested is about. Assume you load a model from DB, change some fields and save it. We currently do:

SELECT - load from DB
[change fields]; save()
In save:
    SELECT - assert that the PK exists in the DB
    if yes:
        UPDATE
    else:
        INSERT

The second select above is usually redundant. If we know the object is loaded from DB (self._state.adding == False), and we are saving it to the same DB it was loaded from (self._state.db == the db we are saving to), then it is very likely the object exists in the DB. So, instead we should do:

In save:
    if same db, not adding:
        exists = UPDATE
    else:
        exists = SELECT
        if exists:
            UPDATE
    if not exists:
        INSERT

So, in the case the model is already in the DB, we do just a single update instead of select + update. If it so happens it doesn't exist (which should be very unlikely), we do update with zero rows modified and still know to do an insert.

This does lead to behaviour change on MySQL when there is a concurrent insert to the same table. Currently the SELECT sees nothing, and the insert will lead to integrity error. After this, the update will block until the inserting transaction is committed, and then update the matching row. However, on PostgreSQL the update sees nothing, so there is no behaviour change there. For MySQL users this will be better than what we have now. For PostgreSQL users there is no change from behaviour standpoint.

comment:5 by Anssi Kääriäinen, 12 years ago

Sorry, the explanations above are a little confusing. Still another try:

s = SomeModel.objects.get(pk=someval)
s.somecol = someval
s.save()

Here save is implemented as SELECT - if not found INSERT - else UPDATE. The SELECT here is redundant, we have the information that SELECT was just done in model._state. So, we can do directly an UPDATE. If it happens so that nothing is updated (likely because of concurrent delete) then we will still do the INSERT and things work as expected.

I have done a full refactor of model.save_base(), this includes 4 parts:

  • What this ticket deals with - trying directly to UPDATE if the model was fetched from the same DB we are saving to. Also assorted cleanup to make this possible.
  • Cleanup to proxy parents handling (the logic is somewhat ugly currently)
  • A bug fix for #17341 (do not commit transaction after every parent model save)
  • Splitting save_base into parts so that the saving logic is easier to follow

Above, the bug fix is of course needed, and the proxy parents handling cleanup is IMO also needed.

I can't see any downsides to trying UPDATE directly as done in the patch. This should result in clear performance improvement in most cases. There is a problem that we have documented very explicitly the sequence of SQL commands .save() does but I don't think that documentation should be considered part of the public API. So, docs update is needed.

The refactoring of .save_base() into parts is a stylistic question. I surprisingly prefer the refactored way. Some examples of things that are hard to see from the current writing of the code:

  • Which signals are sent for which models
  • The actual hard work is done in the if not meta.proxy: branch. However, it is impossible that meta.proxy is True here. And it isn't exactly easy to see this.
  • The origin parameter is a bit weird - it is None except for the outermost table save.

Still, I'd like to get a confirmation that others prefer the new writing, too.

EDIT: forgot to mention that all tests pass on all core databases.

Last edited 12 years ago by Anssi Kääriäinen (previous) (diff)

comment:6 by Anssi Kääriäinen, 12 years ago

Triage Stage: Design decision neededReady for checkin

Squash-rebase of the above branch available from: https://github.com/akaariai/django/compare/django:master...model_save_refactor_squash

I am moving this out of DDN - I have written to django-developers and I haven't seen any -1 (or +1 for that matter) when I suggested this change. I take this as nobody opposing the change. I think the only reason to oppose this feature is that something might break. As far as I know nothing should break, but the only way to find out for sure is moving this in and getting real world usage of the new save() way.

A final review (if only looking for typos & stale comments) is welcome. Another thing I'd still like to get reviewed is the structuring of the new code - do the method names make sense, are the splitpoints sensible?

comment:7 by Anssi Kääriäinen, 12 years ago

I updated the branch with some last minute polish. Will commit soon.

comment:8 by Tim Graham, 12 years ago

I hope you won't find my grammar nitpicks too exacting. :-)

comment:9 by Anssi Kääriäinen, 12 years ago

Patch needs improvement: set
Triage Stage: Ready for checkinAccepted

Thanks for the comments! The Finnish dialect of English has special grammar...

comment:10 by Anssi Kääriäinen, 12 years ago

Patch needs improvement: unset
Triage Stage: AcceptedReady for checkin

Patch updated, https://github.com/akaariai/django/compare/django:master...model_save_refactor_squash. I removed the "if from same DB" check. Now the patch just always tries UPDATE first if the model has PK set. It will be easy enough to change the behaviour later on if that is needed.

EDIT: stupid mistakes in comment fixed...

Last edited 12 years ago by Anssi Kääriäinen (previous) (diff)

comment:11 by Anssi Kääriäinen <akaariai@…>, 12 years ago

Resolution: fixed
Status: newclosed

In 6b4834952dcce0db5cbc1534635c00ff8573a6d8:

Fixed #16649 -- Refactored save_base logic

Model.save() will use UPDATE - if not updated - INSERT instead of
SELECT - if found UPDATE else INSERT. This should save a query when
updating, but will cost a little when inserting model with PK set.

Also fixed #17341 -- made sure .save() commits transactions only after
the whole model has been saved. This wasn't the case in model
inheritance situations.

The save_base implementation was refactored into multiple methods.
A typical chain for inherited save is:
save_base()

_save_parents(self)

for each parent:

_save_parents(parent)
_save_table(parent)

_save_table(self)

comment:12 by Anssi Kääriäinen, 12 years ago

While using update_fields on one of my projects I noticed a possible reason to revert the way update count is used in this ticket's committed patch. On PostgreSQL, if there is a trigger for the updated table, and this trigger returns NULL in some conditions, then the UPDATE statement is seen as affecting zero rows. This will result in Django seeing zero rows updated, and this again will result in error when using update_fields as the update didn't seem to affect any rows.

For plain .save() in 1.6 the result will be IntegrityError - the update doesn't seem to affect any rows, but there is a row matching the update condition in the DB, so Django will try to UPDATE, see no rows matched, and then try to INSERT -> boom.

The end result is that code that worked in 1.5 will not work in 1.6 if there are ON UPDATE triggers which return NULL. I am not sure about how common this condition will be. It will be easy to change the detection in save_base() from UPDATE - if not updated - INSERT to SELECT - if found UPDATE - else INSERT. But I would like to wait and see if this problem is common enough to worry about...

in reply to:  12 comment:13 by Marti Raudsepp, 11 years ago

Cc: marti@… added

We hit the problem mentioned in comment 12 while trying out Django 1.6 beta. We are using the PostgreSQL built-in suppress_redundant_updates_trigger trigger on some tables to save I/O when saving unchanged rows. With Django 1.6 this now causes IntegrityErrors.

No big deal for us, we can drop the trigger when we migrate to 1.6, but I can imagine there are situations where trigger logic is a necessary part of the application. Maybe there should be a workaround for those poor users, like a per-model option to fall back to the old save behavior?

comment:14 by Anssi Kääriäinen, 11 years ago

Maybe something like select_on_save or somesuch _meta option. I'd like to keep the default False as trying direct update is the correct thing to do most of the time. I will check how hard this is to do.

comment:15 by Anssi Kääriäinen, 11 years ago

I've added a crude patch for "select_on_save" option. Docs missing, maybe more tests needed, too. See https://github.com/akaariai/django/commit/1ca398b0352b11868ec236f92cf17b6ce82ff88c

Version 0, edited 11 years ago by Anssi Kääriäinen (next)

comment:16 by Anssi Kääriäinen, 11 years ago

I added a new ticket for the incompatibility issue, see #20988.

comment:17 by Anssi Kääriäinen <akaariai@…>, 11 years ago

In e973ee6a9879969b8ae05bb7ff681172cc5386a5:

Fixed #20988 -- Added model meta option select_on_save

The option can be used to force pre 1.6 style SELECT on save behaviour.
This is needed in case the database returns zero updated rows even if
there is a matching row in the DB. One such case is PostgreSQL update
trigger that returns NULL.

Reviewed by Tim Graham.

Refs #16649

comment:18 by Anssi Kääriäinen <akaariai@…>, 11 years ago

In 76e38a21777243fec58c640d617bb71a251c5ba1:

[1.6.x] Fixed #20988 -- Added model meta option select_on_save

The option can be used to force pre 1.6 style SELECT on save behaviour.
This is needed in case the database returns zero updated rows even if
there is a matching row in the DB. One such case is PostgreSQL update
trigger that returns NULL.

Reviewed by Tim Graham.

Refs #16649

Backport of e973ee6a9879969b8ae05bb7ff681172cc5386a5 from master

Conflicts:

django/db/models/options.py
tests/basic/tests.py

Note: See TracTickets for help on using tickets.
Back to Top