Opened 15 years ago

Closed 10 months ago

Last modified 10 months ago

#9519 closed New feature (wontfix)

Add QuerySet.bulk_delete() that issues only a single SQL query

Reported by: Joey Wilhelm Owned by: Akash Kumar Sen
Component: Database layer (models, ORM) Version: dev
Severity: Normal Keywords: database, queryset, delete
Cc: Tobias McNulty, kmike84@…, dev@…, kkumler, denilsonsa@…, Zach Borboa, Akash Kumar Sen Triage Stage: Unreviewed
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

Example:

from django.db import models

class MyModel(models.Model):
    my_id = models.IntegerField(primary_key=True)
    col1 = models.IntegerField()
    col2 = models.CharField(max_length=1)
    col3 = models.TextField()
    
    class Meta:
        unique_together = ('my_id', 'col1', 'col2') # "Fake" a multi-part primary key

# This works for creating all 3, as force_insert is used in create()
MyModel.objects.create(my_id=1, col1=5, col2='a', col3='foo')
MyModel.objects.create(my_id=1, col1=5, col2='b', col3='bar')
MyModel.objects.create(my_id=1, col1=10, col2='a', col3='baz')

MyModel.objects.filter(my_id=1, col1=5).delete()

This deletes all of the objects created above, since deletion is done only based on model._meta.pk. This is fine, except when you are using multi-part primary keys and emulating that pkey as suggested in #373.

In my opinion, the delete operation should be executed with respect to the filter(), extra(), etc which were previously performed on the QuerySet.

This is, in a way, a part of ticket #373, but I believe it can be viewed as a completely separate issue. In some cases, you do want to delete based on columns other than the primary key. Why build a list of all of those primary keys when you've already specified all of the qualifiers in a filter()?

Change History (40)

comment:1 by Jacob, 15 years ago

Triage Stage: UnreviewedDesign decision needed

comment:2 by Tobias McNulty, 14 years ago

Cc: Tobias McNulty added

This is an issue for me as well. Imagine the cycle:

1) SELECT 100 of the objects you want to delete (this can take awhile depending on the WHERE clause)

2) DELETE those objects by ID

Now imagine it repeated 10000 times. I'm trying to delete millions of rows from a table, so the current way .delete() works is completely unusable.

IMHO .delete should just use the FROM table, JOINs (if any), and WHERE clause from the queryset itself.

comment:3 by Tobias McNulty, 14 years ago

Owner: changed from nobody to Tobias McNulty
Status: newassigned

comment:4 by Tobias McNulty, 14 years ago

Summary: QuerySet.delete() should operate with respect to previous query operationsQuerySet.delete() should issue only a single SQL query

comment:5 by Craig de Stigter, 13 years ago

+1 for this. The way it is currently implemented, all the objects that are to be deleted are pulled into memory, which is a nightmare for >10000 objects.

This could be implemented as QuerySet.delete_raw() or QuerySet.delete(raw=True) without breaking the existing API.

comment:6 by erikrose, 13 years ago

Me too.

comment:7 by Luke Plant, 13 years ago

Severity: Normal
Type: New feature

comment:8 by Nathan Vander Wilt, 13 years ago

Easy pickings: unset
UI/UX: unset

+1

The docs led me to believe Django would bulk delete using DELETE...WHERE instead of having to stream every column from a bazillion rows.

comment:9 by Carl Meyer, 13 years ago

Summary: QuerySet.delete() should issue only a single SQL queryAdd QuerySet.bulk_delete() that issues only a single SQL query
Triage Stage: Design decision neededAccepted
Version: 1.0SVN

.delete() has to be implemented the way it is in order to support cascade-deletion. If Django didn't do cascade-deletion in the ORM it would be much more difficult to write cross-database-compatible ORM code. So the implementation of .delete() is not going to change (it does do the cascade more efficiently now, since #7539 was fixed, than it did when this ticket was filed).

The issue reported originally here (that deletion doesn't work with duplicate PKs) is a non-issue. Django's ORM assumes PKs are unique, and the database will enforce that unless your database is broken or you've manually modified the tables. That's not going to change, regardless of what hacky workarounds were suggested on #373.

All that said, I'd be open to a new .bulk_delete() method (parallel to the .bulk_create() method that was recently added) that would do deletion as requested here. It would have a number of caveats, primarily that it wouldn't do FK cascades, so you'd have to have your database set up correctly to handle that natively (ON DELETE CASCADE or whatnot) or you might get integrity errors.

natevw, if you think the documentation for the current .delete() method could better explain its limitations, that's a separate issue - please file a new ticket for it, ideally with a patch to the docs.

comment:10 by Carl Meyer, 13 years ago

After some discussion with Alex in IRC: the deletion code can reliably tell whether cascade is needed or not - if you have no FKs pointing to the model, or all of them are set to on_delete=DO_NOTHING, you don't need cascade. If you do, there's no point in giving you an easy way to shoot yourself in the foot with an IntegrityError. So however we spell the "bulk delete" option, it should detect whether a bulk delete is actually possible (i.e. no cascade needed); if it's not possible, it should either raise an error or fall back to the current deletion approach.

Between those two, I lean towards raising an error. The conditions that determine whether bulk delete is possible are deterministic and in the programmer's full control: it depends only on the models, not on the state of the database or anything else. So requesting a bulk delete on a model that needs to cascade deletes is simply an error, and it should fail fast so you can just fix your code to not request the bulk delete. That way you can know that if you have a working call to bulk delete, it is actually doing a bulk delete (else it would be failing).

We can't just quietly "do the right thing" under the covers because there is one unavoidable semantic difference: the pre_delete and post_delete signals. We currently send them for all deleted instances, and "bulk delete" will have no way of doing that. That will just have to be documented as a limitation of the bulk delete.

comment:11 by Tobias McNulty, 13 years ago

Owner: changed from Tobias McNulty to nobody
Status: assignednew

comment:12 by Mikhail Korobov, 12 years ago

Cc: kmike84@… added

comment:13 by Mikhail Korobov, 12 years ago

Probably incomplete implementation/workaround that works for me:

from django.db import transaction
from django.db.models.sql.subqueries import DeleteQuery

def truncate_queryset(qs):
    """
    Deletes all records matched by queryset using

        DELETE from table WHERE <condition>

    query without fetching PK values for all items in original queryset.
    """

    delete_query = qs.query.clone(DeleteQuery)

    # transaction management code is copied from QuerySet.update
    if not transaction.is_managed(using=qs.db):
        transaction.enter_transaction_management(using=qs.db)
        forced_managed = True
    else:
        forced_managed = False
    try:
        delete_query.get_compiler(qs.db).execute_sql(None)
        if forced_managed:
            transaction.commit(using=qs.db)
        else:
            transaction.commit_unless_managed(using=qs.db)
    finally:
        if forced_managed:
            transaction.leave_transaction_management(using=qs.db)

comment:14 by Brillgen Developers, 12 years ago

Cc: dev@… added

comment:15 by Anssi Kääriäinen, 12 years ago

Note that a regular .delete() can avoid the fetching if there are no cascades, and there are no signals. We currently have no API to ask if somebody is listening for a signal, but we could easily add one - pre_delete.has_listeners(SomeModel) for example. This way we could automatically do the right thing in .delete() without need for a new operation. And, we could optimize deletion of cascades, too.

If there are signals but we want to skip sending those, we could just add a .delete(signals=False) option for that. I am -1 on ignoring cascades. This will lead to integrity errors. Or worse, there are DBs which do not enforce foreign keys.

comment:16 by kkumler, 11 years ago

Cc: kkumler added

comment:17 by Anssi Kääriäinen, 11 years ago

The idea in comment:15 has been implemented - is there still need for separate bulk_delete()?

in reply to:  17 comment:18 by Preston Holmes, 11 years ago

Replying to akaariai:

The idea in comment:15 has been implemented - is there still need for separate bulk_delete()?

I agree about ignoring cascades being a non-starter - if you need that level of manipulation, drop to SQL.

The question is whether there are times you have a non-cascading model, you have signals, but you want to just delete a batch of objects. This is analogous to update method not firing signals.

whether this is bulk_delete or delete(signals=False) might be a minor point, I'd lean slightly toward a separate method as being cleaner to document as a special case, and to not introduce the concept of signals=False as being something that people would request be a general pattern for other methods.

If we wanted to offer some disabling signals more generally, perhaps we could look into a context manager for that.

comment:19 by Anssi Kääriäinen, 11 years ago

API idea: How about qs.values().delete() doing a no-signals delete? We could also implement the cascades by fetching the PK values to be deleted. This should make .values().delete() faster and less memory hungry, yet cascades are implemented.

To me this API seems correct - values() is doing "fetch the raw values only, not objects" so values().delete() is doing "do the deletes by fetching raw values only, not objects".

Currently .values().delete() doesn't work at all (except accidentally they do for fast-path deletes...).

The disabling signals idea is dangerous if one can do global "signals disabled" call. For example "with disable_signals(): do_something()" will break model.__init__ in do_something() for models having ImageFields or GenericForeignKeys. To be safe the only way to disable the signals should be by listing the signal target functions explicitly. In addition, the disabling should be thread-safe and really fast (at least for the no-signals-disabled case). Model.__init__ is limited somewhat by signal.send speed already. So, my initial feeling is to not do this.

I would love it if there was some way to tell the DB to set foreign key constraints to ON DELETE CASCADE temporarily. But, there doesn't seem to be any way to do this in SQL.

comment:20 by Preston Holmes, 11 years ago

Yes - the global signal killer context manager is a bad idea - I was thinking of the narrow context of limiting test interactions - where it could be useful - but would be a bad choice for public API.

My only hangup on .values().delete() is that I think of values returning data - not objects, so should be decoupled (conceptually) from the DB - and so delete seems to leak back to the DB. That is a statement of my conceptual consistency issue - not that it wouldn't be a clean way to implement it on the internals side.

Basically trying to muck with signals enable/disable in any public API is sticky business. I still think a bulk_delete with lots of warning in the docs is the simplest way forward - if there is a good way forward at all.

comment:21 by Anssi Kääriäinen, 11 years ago

API proposal for bulk_delete():

  1. Does not send signals.
  2. Does "fast-path" deletion like in normal deletion where possible.
  3. [maybe] If cascades are needed, then cascading is implemented by fetching just the needed values from the DB, not as objects but as raw values.

The problem is that no. 3 above seems to be somewhat messy to implement. So, maybe just error out if cascades are needed instead?

comment:22 by Carl Meyer, 11 years ago

If I'm not mistaken, the need for cascade is entirely determined by the model schema; for a given call to bulk_delete, the need for cascade won't change at runtime. In other words, you're not going to have code calling bulk_delete that usually works but occasionally fails. So it seems fine to me to omit (3) and just raise an error if bulk_delete is used on a model requiring cascade. Implementing (3) sounds to me like a lot of additional deletion code and bug surface area, for little benefit.

Last edited 11 years ago by Carl Meyer (previous) (diff)

comment:23 by Anssi Kääriäinen, 11 years ago

We can always add cascades to bulk_delete if needed, so the suggested API is:

  • no signals
  • error on cascade
  • due to the above items we can do fast-path deletion always

Sadly this means no bulk_delete for inheritance cases.

comment:24 by Simon Charette, 11 years ago

This is sad indeed, was hoping to use this to delete inherited instance while avoiding base instance deletion.

in reply to:  24 comment:25 by Carl Meyer, 11 years ago

I consider inheritance to be a different case from cascade (both in concept, because you have a conceptually single object that spans rows in multiple tables, and in that it's implemented as a special case, the only case where you might 'cascade' in the reverse direction across an FK/OneToOne).

I could see perhaps implementing something like (3) for inheritance only, not for other cascades. Though I'm just as happy not allowing bulk_delete in inheritance cases.

In any case, I don't think the behavior charettes was hoping for is preferable. bulk_delete should either delete both parent and child rows in inheritance cases, or it should refuse to do anything at all. Deleting child rows only violates the inheritance abstraction. It should either just be done in raw SQL, or if it's really got convincing use-cases it should get a more explicit API, not occur as a surprising side-effect of bulk_delete.

comment:26 by Anssi Kääriäinen, 11 years ago

Do we still want this? The deletion code is now intelligent enough to do fast-path deletion where possible. We could improve it by having "no_signals" arg, and making it only fetch values() for cascade columns if there is no need for sending the signals (latter is likely tricky to do).

We should not have a delete which does not respect cascades along foreign keys. Some DBs do not enforce the foreign keys so this would be a real gun-pointed-at-foot.

I'm inclined to close this as wontfix...

comment:27 by Tobias McNulty, 11 years ago

I like the new logic in 1.5, but I still think you should be able to apply the DO_NOTHING logic on a per-query basis, rather than having to use whatever policy you set in the ForeignKey declaration across the project as a whole. For example, if I know I'm going to be deleting millions of rows, I'll want to enable this, but for normal use at just about every other point in my project, the default behavior is preferable. Could we just add something like a force_fast option to the delete() method that basically did the same thing as having DO_NOTHING on every foreign key in the model being deleted?

For others who stumble upon this ticket and wonder how the new code in 1.5 works, here's a link to the docs:

https://docs.djangoproject.com/en/1.5/ref/models/querysets/#django.db.models.query.QuerySet.delete

comment:28 by Denilson Figueiredo de Sá, 9 years ago

Cc: denilsonsa@… added

comment:29 by Zach Borboa, 9 years ago

Cc: Zach Borboa added

comment:30 by Craig de Stigter, 8 years ago

For those googling, kmike's workaround above is quite out of date. Since django 1.5+ you can use:

    queryset._raw_delete(queryset.db)

Private API, be careful, yadda yadda

Last edited 8 years ago by Craig de Stigter (previous) (diff)

comment:31 by Marcin Nowak, 7 years ago

Emitting additional bulk_post/pre_delete signal with queryset instance as an instance argument will be good enough. Just leave current logic but add additional signals. I'd like to connect handler(s) which will delete related objects also in bulk. Thank you.

comment:32 by Akash Kumar Sen, 10 months ago

Cc: Akash Kumar Sen added
Owner: changed from nobody to Akash Kumar Sen
Status: newassigned

https://docs.djangoproject.com/en/4.2/topics/db/queries/#topics-db-queries-delete

As referred in the documentation,

Keep in mind that this will, whenever possible, be executed purely in SQL, and so the delete() methods of individual object instances will not necessarily be called during the process. If you’ve provided a custom delete() method on a model class and want to ensure that it is called, you will need to “manually” delete instances of that model (e.g., by iterating over a QuerySet and calling delete() on each object individually) rather than using the bulk delete() method of a QuerySet.

The current delete() method for a queryset already deletes objects in bulk. Do we need another bulk_delete() method for the same?

Last edited 10 months ago by Mariusz Felisiak (previous) (diff)

in reply to:  32 ; comment:33 by Mariusz Felisiak, 10 months ago

Replying to Akash Kumar Sen:

The current delete() method for a queryset already deletes objects in bulk. Do we need another bulk_delete() method for the same?

Have you checked all comments and how it actually works? delete() collects objects PKs (direct and related), deletes, sends signals, etc. This issue is about adding bulk_delete() method which will bypass all these steps and perform DELETE based on the queryset filters.

in reply to:  33 comment:34 by Akash Kumar Sen, 10 months ago

Replying to Mariusz Felisiak:

Have you checked all comments and how it actually works? delete() collects objects PKs (direct and related), deletes, sends signals, etc. This issue is about adding bulk_delete() method which will bypass all these steps and perform DELETE based on the queryset filters.

I tried to compare the number of queries with the analogy of bulk_create . But now the need seems reasonable, will try to create a patch with the following tasks performed.

  1. Introduce a new method called bulk_delete() in the QuerySet API.
  2. The bulk_delete() method should operate on the queryset and issue a single SQL query for deletion.
  3. The deletion should respect the previous query operations performed on the queryset(or take them as kwargs) such as filters and extra conditions.
  4. To optimize performance, the bulk_delete() method can take advantage of a "fast-path" deletion, similar to the existing delete operation and should not send any signals to avoid unnecessary overhead.
  5. If cascading deletes are needed, the bulk_delete() method should handle them by fetching only the necessary values from the database, without instantiating full objects.

comment:35 by Akash Kumar Sen, 10 months ago

After some research I would like to amend my earlier proposition to the following:

  • It does not make sense to delete an object only, ignoring the related models.
  • The current deletion already takes care of the deletion in constant number of queries for cascading relations.
  • So if we can just ignore the signals and perform fast deletion of the models those have pre/post save signals on deletion that will be enough.

Let me know if there is any possible flaws in my proposition.

comment:36 by Akash Kumar Sen, 10 months ago

Has patch: set

comment:37 by Akash Kumar Sen, 10 months ago

Patch needs improvement: set

comment:38 by Simon Charette, 10 months ago

I think comment:26 is still very relevant today even if it was made 10 years ago.

making it only fetch values() for cascade columns if there is no need for sending the signals (latter is likely tricky to do).

This was implemented in #30191 and further improvements were made recently as well such as #33928.

All the proposed method bulk_delete method does is copy over delete implementation and allow the deletion collector to ignore signals when determining if the fast path can be taken so I don't think it warrants a dedicated method and it feels like the usage of the bulk_ prefix is slightly misleading.

Given support for database level on_delete is likely to land sooner than later (#21961) how relevant would a new delete(ignore_signals) flag given the Pandora box in opens terms of supporting ignoring signals in all other interfaces the ORM supports (Model.save, Model.delete, etc).

Last edited 10 months ago by Simon Charette (previous) (diff)

comment:39 by Mariusz Felisiak, 10 months ago

Resolution: wontfix
Status: assignedclosed
Triage Stage: AcceptedUnreviewed

I agree with Ansi and Simon, the database level on_delete is coming which makes this ticket less relevant and not worth the complexity.

Akash, thanks for all your efforts.

comment:40 by Akash Kumar Sen, 10 months ago

That sounds reasonable. My initial proposition in comment:32 was also to close this. As the maximum it can achieve is ignoring the signals, because deletion is optimized in all the other cases. So it does not deserve a new method with or without the bulk_ prefix.

Last edited 10 months ago by Akash Kumar Sen (previous) (diff)
Note: See TracTickets for help on using tickets.
Back to Top