Opened 6 years ago

Last modified 3 months ago

#9519 new New feature

Add QuerySet.bulk_delete() that issues only a single SQL query

Reported by: Tarken Owned by: nobody
Component: Database layer (models, ORM) Version: master
Severity: Normal Keywords: database, queryset, delete
Cc: tobias, kmike84@…, dev@…, kkumler, denilsonsa@…, zachborboa Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Example:

from django.db import models

class MyModel(models.Model):
    my_id = models.IntegerField(primary_key=True)
    col1 = models.IntegerField()
    col2 = models.CharField(max_length=1)
    col3 = models.TextField()
    
    class Meta:
        unique_together = ('my_id', 'col1', 'col2') # "Fake" a multi-part primary key

# This works for creating all 3, as force_insert is used in create()
MyModel.objects.create(my_id=1, col1=5, col2='a', col3='foo')
MyModel.objects.create(my_id=1, col1=5, col2='b', col3='bar')
MyModel.objects.create(my_id=1, col1=10, col2='a', col3='baz')

MyModel.objects.filter(my_id=1, col1=5).delete()

This deletes all of the objects created above, since deletion is done only based on model._meta.pk. This is fine, except when you are using multi-part primary keys and emulating that pkey as suggested in #373.

In my opinion, the delete operation should be executed with respect to the filter(), extra(), etc which were previously performed on the QuerySet.

This is, in a way, a part of ticket #373, but I believe it can be viewed as a completely separate issue. In some cases, you do want to delete based on columns other than the primary key. Why build a list of all of those primary keys when you've already specified all of the qualifiers in a filter()?

Change History (29)

comment:1 Changed 6 years ago by jacob

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Triage Stage changed from Unreviewed to Design decision needed

comment:2 Changed 6 years ago by tobias

  • Cc tobias added

This is an issue for me as well. Imagine the cycle:

1) SELECT 100 of the objects you want to delete (this can take awhile depending on the WHERE clause)

2) DELETE those objects by ID

Now imagine it repeated 10000 times. I'm trying to delete millions of rows from a table, so the current way .delete() works is completely unusable.

IMHO .delete should just use the FROM table, JOINs (if any), and WHERE clause from the queryset itself.

comment:3 Changed 5 years ago by tobias

  • Owner changed from nobody to tobias
  • Status changed from new to assigned

comment:4 Changed 5 years ago by tobias

  • Summary changed from QuerySet.delete() should operate with respect to previous query operations to QuerySet.delete() should issue only a single SQL query

comment:5 Changed 4 years ago by cdestigter

+1 for this. The way it is currently implemented, all the objects that are to be deleted are pulled into memory, which is a nightmare for >10000 objects.

This could be implemented as QuerySet.delete_raw() or QuerySet.delete(raw=True) without breaking the existing API.

comment:6 Changed 4 years ago by erikrose

Me too.

comment:7 Changed 4 years ago by lukeplant

  • Severity set to Normal
  • Type set to New feature

comment:8 Changed 4 years ago by natevw

  • Easy pickings unset
  • UI/UX unset

+1

The docs led me to believe Django would bulk delete using DELETE...WHERE instead of having to stream every column from a bazillion rows.

comment:9 Changed 4 years ago by carljm

  • Summary changed from QuerySet.delete() should issue only a single SQL query to Add QuerySet.bulk_delete() that issues only a single SQL query
  • Triage Stage changed from Design decision needed to Accepted
  • Version changed from 1.0 to SVN

.delete() has to be implemented the way it is in order to support cascade-deletion. If Django didn't do cascade-deletion in the ORM it would be much more difficult to write cross-database-compatible ORM code. So the implementation of .delete() is not going to change (it does do the cascade more efficiently now, since #7539 was fixed, than it did when this ticket was filed).

The issue reported originally here (that deletion doesn't work with duplicate PKs) is a non-issue. Django's ORM assumes PKs are unique, and the database will enforce that unless your database is broken or you've manually modified the tables. That's not going to change, regardless of what hacky workarounds were suggested on #373.

All that said, I'd be open to a new .bulk_delete() method (parallel to the .bulk_create() method that was recently added) that would do deletion as requested here. It would have a number of caveats, primarily that it wouldn't do FK cascades, so you'd have to have your database set up correctly to handle that natively (ON DELETE CASCADE or whatnot) or you might get integrity errors.

natevw, if you think the documentation for the current .delete() method could better explain its limitations, that's a separate issue - please file a new ticket for it, ideally with a patch to the docs.

comment:10 Changed 4 years ago by carljm

After some discussion with Alex in IRC: the deletion code can reliably tell whether cascade is needed or not - if you have no FKs pointing to the model, or all of them are set to on_delete=DO_NOTHING, you don't need cascade. If you do, there's no point in giving you an easy way to shoot yourself in the foot with an IntegrityError. So however we spell the "bulk delete" option, it should detect whether a bulk delete is actually possible (i.e. no cascade needed); if it's not possible, it should either raise an error or fall back to the current deletion approach.

Between those two, I lean towards raising an error. The conditions that determine whether bulk delete is possible are deterministic and in the programmer's full control: it depends only on the models, not on the state of the database or anything else. So requesting a bulk delete on a model that needs to cascade deletes is simply an error, and it should fail fast so you can just fix your code to not request the bulk delete. That way you can know that if you have a working call to bulk delete, it is actually doing a bulk delete (else it would be failing).

We can't just quietly "do the right thing" under the covers because there is one unavoidable semantic difference: the pre_delete and post_delete signals. We currently send them for all deleted instances, and "bulk delete" will have no way of doing that. That will just have to be documented as a limitation of the bulk delete.

comment:11 Changed 4 years ago by tobias

  • Owner changed from tobias to nobody
  • Status changed from assigned to new

comment:12 Changed 3 years ago by kmike

  • Cc kmike84@… added

comment:13 Changed 3 years ago by kmike

Probably incomplete implementation/workaround that works for me:

from django.db import transaction
from django.db.models.sql.subqueries import DeleteQuery

def truncate_queryset(qs):
    """
    Deletes all records matched by queryset using

        DELETE from table WHERE <condition>

    query without fetching PK values for all items in original queryset.
    """

    delete_query = qs.query.clone(DeleteQuery)

    # transaction management code is copied from QuerySet.update
    if not transaction.is_managed(using=qs.db):
        transaction.enter_transaction_management(using=qs.db)
        forced_managed = True
    else:
        forced_managed = False
    try:
        delete_query.get_compiler(qs.db).execute_sql(None)
        if forced_managed:
            transaction.commit(using=qs.db)
        else:
            transaction.commit_unless_managed(using=qs.db)
    finally:
        if forced_managed:
            transaction.leave_transaction_management(using=qs.db)

comment:14 Changed 3 years ago by brillgen

  • Cc dev@… added

comment:15 Changed 3 years ago by akaariai

Note that a regular .delete() can avoid the fetching if there are no cascades, and there are no signals. We currently have no API to ask if somebody is listening for a signal, but we could easily add one - pre_delete.has_listeners(SomeModel) for example. This way we could automatically do the right thing in .delete() without need for a new operation. And, we could optimize deletion of cascades, too.

If there are signals but we want to skip sending those, we could just add a .delete(signals=False) option for that. I am -1 on ignoring cascades. This will lead to integrity errors. Or worse, there are DBs which do not enforce foreign keys.

comment:16 Changed 2 years ago by kkumler

  • Cc kkumler added

comment:17 follow-up: Changed 2 years ago by akaariai

The idea in comment:15 has been implemented - is there still need for separate bulk_delete()?

comment:18 in reply to: ↑ 17 Changed 2 years ago by ptone

Replying to akaariai:

The idea in comment:15 has been implemented - is there still need for separate bulk_delete()?

I agree about ignoring cascades being a non-starter - if you need that level of manipulation, drop to SQL.

The question is whether there are times you have a non-cascading model, you have signals, but you want to just delete a batch of objects. This is analogous to update method not firing signals.

whether this is bulk_delete or delete(signals=False) might be a minor point, I'd lean slightly toward a separate method as being cleaner to document as a special case, and to not introduce the concept of signals=False as being something that people would request be a general pattern for other methods.

If we wanted to offer some disabling signals more generally, perhaps we could look into a context manager for that.

comment:19 Changed 2 years ago by akaariai

API idea: How about qs.values().delete() doing a no-signals delete? We could also implement the cascades by fetching the PK values to be deleted. This should make .values().delete() faster and less memory hungry, yet cascades are implemented.

To me this API seems correct - values() is doing "fetch the raw values only, not objects" so values().delete() is doing "do the deletes by fetching raw values only, not objects".

Currently .values().delete() doesn't work at all (except accidentally they do for fast-path deletes...).

The disabling signals idea is dangerous if one can do global "signals disabled" call. For example "with disable_signals(): do_something()" will break model.__init__ in do_something() for models having ImageFields or GenericForeignKeys. To be safe the only way to disable the signals should be by listing the signal target functions explicitly. In addition, the disabling should be thread-safe and really fast (at least for the no-signals-disabled case). Model.__init__ is limited somewhat by signal.send speed already. So, my initial feeling is to not do this.

I would love it if there was some way to tell the DB to set foreign key constraints to ON DELETE CASCADE temporarily. But, there doesn't seem to be any way to do this in SQL.

comment:20 Changed 2 years ago by ptone

Yes - the global signal killer context manager is a bad idea - I was thinking of the narrow context of limiting test interactions - where it could be useful - but would be a bad choice for public API.

My only hangup on .values().delete() is that I think of values returning data - not objects, so should be decoupled (conceptually) from the DB - and so delete seems to leak back to the DB. That is a statement of my conceptual consistency issue - not that it wouldn't be a clean way to implement it on the internals side.

Basically trying to muck with signals enable/disable in any public API is sticky business. I still think a bulk_delete with lots of warning in the docs is the simplest way forward - if there is a good way forward at all.

comment:21 Changed 2 years ago by akaariai

API proposal for bulk_delete():

  1. Does not send signals.
  2. Does "fast-path" deletion like in normal deletion where possible.
  3. [maybe] If cascades are needed, then cascading is implemented by fetching just the needed values from the DB, not as objects but as raw values.

The problem is that no. 3 above seems to be somewhat messy to implement. So, maybe just error out if cascades are needed instead?

comment:22 Changed 2 years ago by carljm

If I'm not mistaken, the need for cascade is entirely determined by the model schema; for a given call to bulk_delete, the need for cascade won't change at runtime. In other words, you're not going to have code calling bulk_delete that usually works but occasionally fails. So it seems fine to me to omit (3) and just raise an error if bulk_delete is used on a model requiring cascade. Implementing (3) sounds to me like a lot of additional deletion code and bug surface area, for little benefit.

Last edited 2 years ago by carljm (previous) (diff)

comment:23 Changed 2 years ago by akaariai

We can always add cascades to bulk_delete if needed, so the suggested API is:

  • no signals
  • error on cascade
  • due to the above items we can do fast-path deletion always

Sadly this means no bulk_delete for inheritance cases.

comment:24 follow-up: Changed 2 years ago by charettes

This is sad indeed, was hoping to use this to delete inherited instance while avoiding base instance deletion.

comment:25 in reply to: ↑ 24 Changed 2 years ago by carljm

I consider inheritance to be a different case from cascade (both in concept, because you have a conceptually single object that spans rows in multiple tables, and in that it's implemented as a special case, the only case where you might 'cascade' in the reverse direction across an FK/OneToOne).

I could see perhaps implementing something like (3) for inheritance only, not for other cascades. Though I'm just as happy not allowing bulk_delete in inheritance cases.

In any case, I don't think the behavior charettes was hoping for is preferable. bulk_delete should either delete both parent and child rows in inheritance cases, or it should refuse to do anything at all. Deleting child rows only violates the inheritance abstraction. It should either just be done in raw SQL, or if it's really got convincing use-cases it should get a more explicit API, not occur as a surprising side-effect of bulk_delete.

comment:26 Changed 2 years ago by akaariai

Do we still want this? The deletion code is now intelligent enough to do fast-path deletion where possible. We could improve it by having "no_signals" arg, and making it only fetch values() for cascade columns if there is no need for sending the signals (latter is likely tricky to do).

We should not have a delete which does not respect cascades along foreign keys. Some DBs do not enforce the foreign keys so this would be a real gun-pointed-at-foot.

I'm inclined to close this as wontfix...

comment:27 Changed 20 months ago by tobias

I like the new logic in 1.5, but I still think you should be able to apply the DO_NOTHING logic on a per-query basis, rather than having to use whatever policy you set in the ForeignKey declaration across the project as a whole. For example, if I know I'm going to be deleting millions of rows, I'll want to enable this, but for normal use at just about every other point in my project, the default behavior is preferable. Could we just add something like a force_fast option to the delete() method that basically did the same thing as having DO_NOTHING on every foreign key in the model being deleted?

For others who stumble upon this ticket and wonder how the new code in 1.5 works, here's a link to the docs:

https://docs.djangoproject.com/en/1.5/ref/models/querysets/#django.db.models.query.QuerySet.delete

comment:28 Changed 6 months ago by denilsonsa

  • Cc denilsonsa@… added

comment:29 Changed 3 months ago by zachborboa

  • Cc zachborboa added
Note: See TracTickets for help on using tickets.
Back to Top