#9519 closed New feature (wontfix)
Add QuerySet.bulk_delete() that issues only a single SQL query
Reported by: | Joey Wilhelm | Owned by: | Akash Kumar Sen |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Normal | Keywords: | database, queryset, delete |
Cc: | Tobias McNulty, kmike84@…, dev@…, kkumler, denilsonsa@…, Zach Borboa, Akash Kumar Sen | Triage Stage: | Unreviewed |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | yes |
Easy pickings: | no | UI/UX: | no |
Description
Example:
from django.db import models class MyModel(models.Model): my_id = models.IntegerField(primary_key=True) col1 = models.IntegerField() col2 = models.CharField(max_length=1) col3 = models.TextField() class Meta: unique_together = ('my_id', 'col1', 'col2') # "Fake" a multi-part primary key # This works for creating all 3, as force_insert is used in create() MyModel.objects.create(my_id=1, col1=5, col2='a', col3='foo') MyModel.objects.create(my_id=1, col1=5, col2='b', col3='bar') MyModel.objects.create(my_id=1, col1=10, col2='a', col3='baz') MyModel.objects.filter(my_id=1, col1=5).delete()
This deletes all of the objects created above, since deletion is done only based on model._meta.pk. This is fine, except when you are using multi-part primary keys and emulating that pkey as suggested in #373.
In my opinion, the delete operation should be executed with respect to the filter(), extra(), etc which were previously performed on the QuerySet.
This is, in a way, a part of ticket #373, but I believe it can be viewed as a completely separate issue. In some cases, you do want to delete based on columns other than the primary key. Why build a list of all of those primary keys when you've already specified all of the qualifiers in a filter()?
Change History (40)
comment:1 by , 16 years ago
Triage Stage: | Unreviewed → Design decision needed |
---|
comment:2 by , 15 years ago
Cc: | added |
---|
comment:3 by , 15 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:4 by , 14 years ago
Summary: | QuerySet.delete() should operate with respect to previous query operations → QuerySet.delete() should issue only a single SQL query |
---|
comment:5 by , 14 years ago
+1 for this. The way it is currently implemented, all the objects that are to be deleted are pulled into memory, which is a nightmare for >10000 objects.
This could be implemented as QuerySet.delete_raw() or QuerySet.delete(raw=True) without breaking the existing API.
comment:7 by , 14 years ago
Severity: | → Normal |
---|---|
Type: | → New feature |
comment:8 by , 13 years ago
Easy pickings: | unset |
---|---|
UI/UX: | unset |
+1
The docs led me to believe Django would bulk delete using DELETE...WHERE instead of having to stream every column from a bazillion rows.
comment:9 by , 13 years ago
Summary: | QuerySet.delete() should issue only a single SQL query → Add QuerySet.bulk_delete() that issues only a single SQL query |
---|---|
Triage Stage: | Design decision needed → Accepted |
Version: | 1.0 → SVN |
.delete() has to be implemented the way it is in order to support cascade-deletion. If Django didn't do cascade-deletion in the ORM it would be much more difficult to write cross-database-compatible ORM code. So the implementation of .delete() is not going to change (it does do the cascade more efficiently now, since #7539 was fixed, than it did when this ticket was filed).
The issue reported originally here (that deletion doesn't work with duplicate PKs) is a non-issue. Django's ORM assumes PKs are unique, and the database will enforce that unless your database is broken or you've manually modified the tables. That's not going to change, regardless of what hacky workarounds were suggested on #373.
All that said, I'd be open to a new .bulk_delete() method (parallel to the .bulk_create() method that was recently added) that would do deletion as requested here. It would have a number of caveats, primarily that it wouldn't do FK cascades, so you'd have to have your database set up correctly to handle that natively (ON DELETE CASCADE or whatnot) or you might get integrity errors.
natevw, if you think the documentation for the current .delete() method could better explain its limitations, that's a separate issue - please file a new ticket for it, ideally with a patch to the docs.
comment:10 by , 13 years ago
After some discussion with Alex in IRC: the deletion code can reliably tell whether cascade is needed or not - if you have no FKs pointing to the model, or all of them are set to on_delete=DO_NOTHING, you don't need cascade. If you do, there's no point in giving you an easy way to shoot yourself in the foot with an IntegrityError
. So however we spell the "bulk delete" option, it should detect whether a bulk delete is actually possible (i.e. no cascade needed); if it's not possible, it should either raise an error or fall back to the current deletion approach.
Between those two, I lean towards raising an error. The conditions that determine whether bulk delete is possible are deterministic and in the programmer's full control: it depends only on the models, not on the state of the database or anything else. So requesting a bulk delete on a model that needs to cascade deletes is simply an error, and it should fail fast so you can just fix your code to not request the bulk delete. That way you can know that if you have a working call to bulk delete, it is actually doing a bulk delete (else it would be failing).
We can't just quietly "do the right thing" under the covers because there is one unavoidable semantic difference: the pre_delete and post_delete signals. We currently send them for all deleted instances, and "bulk delete" will have no way of doing that. That will just have to be documented as a limitation of the bulk delete.
comment:11 by , 13 years ago
Owner: | changed from | to
---|---|
Status: | assigned → new |
comment:12 by , 13 years ago
Cc: | added |
---|
comment:13 by , 13 years ago
Probably incomplete implementation/workaround that works for me:
from django.db import transaction from django.db.models.sql.subqueries import DeleteQuery def truncate_queryset(qs): """ Deletes all records matched by queryset using DELETE from table WHERE <condition> query without fetching PK values for all items in original queryset. """ delete_query = qs.query.clone(DeleteQuery) # transaction management code is copied from QuerySet.update if not transaction.is_managed(using=qs.db): transaction.enter_transaction_management(using=qs.db) forced_managed = True else: forced_managed = False try: delete_query.get_compiler(qs.db).execute_sql(None) if forced_managed: transaction.commit(using=qs.db) else: transaction.commit_unless_managed(using=qs.db) finally: if forced_managed: transaction.leave_transaction_management(using=qs.db)
comment:14 by , 12 years ago
Cc: | added |
---|
comment:15 by , 12 years ago
Note that a regular .delete() can avoid the fetching if there are no cascades, and there are no signals. We currently have no API to ask if somebody is listening for a signal, but we could easily add one - pre_delete.has_listeners(SomeModel
) for example. This way we could automatically do the right thing in .delete() without need for a new operation. And, we could optimize deletion of cascades, too.
If there are signals but we want to skip sending those, we could just add a .delete(signals=False) option for that. I am -1 on ignoring cascades. This will lead to integrity errors. Or worse, there are DBs which do not enforce foreign keys.
comment:16 by , 12 years ago
Cc: | added |
---|
follow-up: 18 comment:17 by , 12 years ago
The idea in comment:15 has been implemented - is there still need for separate bulk_delete()?
comment:18 by , 12 years ago
Replying to akaariai:
The idea in comment:15 has been implemented - is there still need for separate bulk_delete()?
I agree about ignoring cascades being a non-starter - if you need that level of manipulation, drop to SQL.
The question is whether there are times you have a non-cascading model, you have signals, but you want to just delete a batch of objects. This is analogous to update method not firing signals.
whether this is bulk_delete or delete(signals=False) might be a minor point, I'd lean slightly toward a separate method as being cleaner to document as a special case, and to not introduce the concept of signals=False as being something that people would request be a general pattern for other methods.
If we wanted to offer some disabling signals more generally, perhaps we could look into a context manager for that.
comment:19 by , 12 years ago
API idea: How about qs.values().delete() doing a no-signals delete? We could also implement the cascades by fetching the PK values to be deleted. This should make .values().delete() faster and less memory hungry, yet cascades are implemented.
To me this API seems correct - values() is doing "fetch the raw values only, not objects" so values().delete() is doing "do the deletes by fetching raw values only, not objects".
Currently .values().delete() doesn't work at all (except accidentally they do for fast-path deletes...).
The disabling signals idea is dangerous if one can do global "signals disabled" call. For example "with disable_signals(): do_something()" will break model.__init__
in do_something() for models having ImageFields or GenericForeignKeys. To be safe the only way to disable the signals should be by listing the signal target functions explicitly. In addition, the disabling should be thread-safe and really fast (at least for the no-signals-disabled case). Model.__init__
is limited somewhat by signal.send speed already. So, my initial feeling is to not do this.
I would love it if there was some way to tell the DB to set foreign key constraints to ON DELETE CASCADE temporarily. But, there doesn't seem to be any way to do this in SQL.
comment:20 by , 12 years ago
Yes - the global signal killer context manager is a bad idea - I was thinking of the narrow context of limiting test interactions - where it could be useful - but would be a bad choice for public API.
My only hangup on .values().delete() is that I think of values returning data - not objects, so should be decoupled (conceptually) from the DB - and so delete seems to leak back to the DB. That is a statement of my conceptual consistency issue - not that it wouldn't be a clean way to implement it on the internals side.
Basically trying to muck with signals enable/disable in any public API is sticky business. I still think a bulk_delete with lots of warning in the docs is the simplest way forward - if there is a good way forward at all.
comment:21 by , 12 years ago
API proposal for bulk_delete():
- Does not send signals.
- Does "fast-path" deletion like in normal deletion where possible.
- [maybe] If cascades are needed, then cascading is implemented by fetching just the needed values from the DB, not as objects but as raw values.
The problem is that no. 3 above seems to be somewhat messy to implement. So, maybe just error out if cascades are needed instead?
comment:22 by , 12 years ago
If I'm not mistaken, the need for cascade is entirely determined by the model schema; for a given call to bulk_delete
, the need for cascade won't change at runtime. In other words, you're not going to have code calling bulk_delete
that usually works but occasionally fails. So it seems fine to me to omit (3) and just raise an error if bulk_delete
is used on a model requiring cascade. Implementing (3) sounds to me like a lot of additional deletion code and bug surface area, for little benefit.
comment:23 by , 12 years ago
We can always add cascades to bulk_delete if needed, so the suggested API is:
- no signals
- error on cascade
- due to the above items we can do fast-path deletion always
Sadly this means no bulk_delete for inheritance cases.
follow-up: 25 comment:24 by , 12 years ago
This is sad indeed, was hoping to use this to delete inherited instance while avoiding base instance deletion.
comment:25 by , 12 years ago
I consider inheritance to be a different case from cascade (both in concept, because you have a conceptually single object that spans rows in multiple tables, and in that it's implemented as a special case, the only case where you might 'cascade' in the reverse direction across an FK/OneToOne).
I could see perhaps implementing something like (3) for inheritance only, not for other cascades. Though I'm just as happy not allowing bulk_delete
in inheritance cases.
In any case, I don't think the behavior charettes was hoping for is preferable. bulk_delete
should either delete both parent and child rows in inheritance cases, or it should refuse to do anything at all. Deleting child rows only violates the inheritance abstraction. It should either just be done in raw SQL, or if it's really got convincing use-cases it should get a more explicit API, not occur as a surprising side-effect of bulk_delete
.
comment:26 by , 12 years ago
Do we still want this? The deletion code is now intelligent enough to do fast-path deletion where possible. We could improve it by having "no_signals" arg, and making it only fetch values() for cascade columns if there is no need for sending the signals (latter is likely tricky to do).
We should not have a delete which does not respect cascades along foreign keys. Some DBs do not enforce the foreign keys so this would be a real gun-pointed-at-foot.
I'm inclined to close this as wontfix...
comment:27 by , 11 years ago
I like the new logic in 1.5, but I still think you should be able to apply the DO_NOTHING
logic on a per-query basis, rather than having to use whatever policy you set in the ForeignKey
declaration across the project as a whole. For example, if I know I'm going to be deleting millions of rows, I'll want to enable this, but for normal use at just about every other point in my project, the default behavior is preferable. Could we just add something like a force_fast
option to the delete()
method that basically did the same thing as having DO_NOTHING
on every foreign key in the model being deleted?
For others who stumble upon this ticket and wonder how the new code in 1.5 works, here's a link to the docs:
https://docs.djangoproject.com/en/1.5/ref/models/querysets/#django.db.models.query.QuerySet.delete
comment:28 by , 10 years ago
Cc: | added |
---|
comment:29 by , 10 years ago
Cc: | added |
---|
comment:30 by , 9 years ago
For those googling, kmike's workaround above is quite out of date. Since django 1.5+ you can use:
queryset._raw_delete(queryset.db)
Private API, be careful, yadda yadda
comment:31 by , 8 years ago
Emitting additional bulk_post/pre_delete signal with queryset instance as an instance
argument will be good enough. Just leave current logic but add additional signals. I'd like to connect handler(s) which will delete related objects also in bulk. Thank you.
follow-up: 33 comment:32 by , 18 months ago
Cc: | added |
---|---|
Owner: | changed from | to
Status: | new → assigned |
https://docs.djangoproject.com/en/4.2/topics/db/queries/#topics-db-queries-delete
As referred in the documentation,
Keep in mind that this will, whenever possible, be executed purely in SQL, and so the delete() methods of individual object instances will not necessarily be called during the process. If you’ve provided a custom delete() method on a model class and want to ensure that it is called, you will need to “manually” delete instances of that model (e.g., by iterating over a QuerySet and calling delete() on each object individually) rather than using the bulk delete() method of a QuerySet.
The current delete()
method for a queryset already deletes objects in bulk. Do we need another bulk_delete()
method for the same?
follow-up: 34 comment:33 by , 18 months ago
Replying to Akash Kumar Sen:
The current
delete()
method for a queryset already deletes objects in bulk. Do we need anotherbulk_delete()
method for the same?
Have you checked all comments and how it actually works? delete()
collects objects PKs (direct and related), deletes, sends signals, etc. This issue is about adding bulk_delete()
method which will bypass all these steps and perform DELETE
based on the queryset filters.
comment:34 by , 18 months ago
Replying to Mariusz Felisiak:
Have you checked all comments and how it actually works?
delete()
collects objects PKs (direct and related), deletes, sends signals, etc. This issue is about addingbulk_delete()
method which will bypass all these steps and performDELETE
based on the queryset filters.
I tried to compare the number of queries with the analogy of bulk_create
. But now the need seems reasonable, will try to create a patch with the following tasks performed.
- Introduce a new method called bulk_delete() in the QuerySet API.
- The bulk_delete() method should operate on the queryset and issue a single SQL query for deletion.
- The deletion should respect the previous query operations performed on the queryset(or take them as kwargs) such as filters and extra conditions.
- To optimize performance, the bulk_delete() method can take advantage of a "fast-path" deletion, similar to the existing delete operation and should not send any signals to avoid unnecessary overhead.
- If cascading deletes are needed, the bulk_delete() method should handle them by fetching only the necessary values from the database, without instantiating full objects.
comment:35 by , 17 months ago
After some research I would like to amend my earlier proposition to the following:
- It does not make sense to delete an object only, ignoring the related models.
- The current deletion already takes care of the deletion in constant number of queries for cascading relations.
- So if we can just ignore the signals and perform fast deletion of the models those have pre/post save signals on deletion that will be enough.
Let me know if there is any possible flaws in my proposition.
comment:37 by , 17 months ago
Patch needs improvement: | set |
---|
comment:38 by , 17 months ago
I think comment:26 is still very relevant today even if it was made 10 years ago.
making it only fetch values() for cascade columns if there is no need for sending the signals (latter is likely tricky to do).
This was implemented in #30191 and further improvements were made recently as well such as #33928.
All the proposed method bulk_delete
method does is copy over delete
implementation and allow the deletion collector to ignore signals when determining if the fast path can be taken so I don't think it warrants a dedicated method and it feels like the usage of the bulk_
prefix is slightly misleading.
Given support for database level on_delete
is likely to land sooner than later (#21961) how relevant would a new delete(ignore_signals)
flag given the Pandora box in opens terms of supporting ignoring signals in all other interfaces the ORM supports (Model.save
, Model.delete
, etc).
comment:39 by , 17 months ago
Resolution: | → wontfix |
---|---|
Status: | assigned → closed |
Triage Stage: | Accepted → Unreviewed |
I agree with Ansi and Simon, the database level on_delete
is coming which makes this ticket less relevant and not worth the complexity.
Akash, thanks for all your efforts.
comment:40 by , 17 months ago
That sounds reasonable. My initial proposition in comment:32 was also to close this. As the maximum it can achieve is ignoring the signals, because deletion is optimized in all the other cases. So it does not deserve a new method with or without the bulk_
prefix.
This is an issue for me as well. Imagine the cycle:
1) SELECT 100 of the objects you want to delete (this can take awhile depending on the WHERE clause)
2) DELETE those objects by ID
Now imagine it repeated 10000 times. I'm trying to delete millions of rows from a table, so the current way .delete() works is completely unusable.
IMHO .delete should just use the FROM table, JOINs (if any), and WHERE clause from the queryset itself.