Opened 12 years ago
Last modified 2 years ago
#19527 new New feature
Allow QuerySet.bulk_create() to set the primary key of its objects
Reported by: | Vlada Macek | Owned by: | nobody |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Normal | Keywords: | oracle QuerySet.bulk_create |
Cc: | charette.s@…, gwahl@…, w2lkm2n@…, mszamot@…, Akos Ladanyi, Shai Berger, acrefoot@…, benth, dev@…, Mariusz Felisiak | Triage Stage: | Accepted |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
(Sorry if I missed a pre-existing ticket for this.)
There is an IMO big caveat of bulk_create(): It does not set primary key of objects it creates.
In my current project, using bulk_create would improve performance, but as I often need an id of the newly created objects, I have to save them one by one.
I think it would be perfect if bulk_create starts to set the primary keys of the objects in the given sequence. At least postgresql supports it ("RETURNING id" clause gives us a set of id's).
Attachments (3)
Change History (50)
comment:1 by , 12 years ago
comment:2 by , 12 years ago
Thank you. In this case when:
- not all DBs support PK returning in a crystal way,
- PK assigning takes time making it less-than-fastest way,
- PK assigning is not always required by the developer
then, IMHO, an option bulk_create(..., set_primary_keys=False) makes most sense:
- bulk_create() would always use the fastest multiple-tuple creation,
- the developer will be allowed to use set_primary_keys=True only in case the underlying DB supports it in a ways the Django devs will have agreed to implement (not too hacky ways).
I thought it could be as hacky as it possibly is already implemented for some DBs in save().
comment:3 by , 12 years ago
Triage Stage: | Unreviewed → Design decision needed |
---|
comment:4 by , 12 years ago
Cc: | added |
---|---|
Version: | 1.4 → master |
I'd like to work on a patch for this if it gets accepted since it's step toward bulk creation of inherited models.
IMHO the set_primary_keys=False
also makes more sense.
We could just raise a NotImplementedError
if set_primary_key=True and not connection.features.can_return_id_from_insert
just like what distinct
does when specifying fields on a unsupported backend.
comment:5 by , 12 years ago
I keep thinking that what we *really* want is some form of bulk_save(). This should work even if the models are modified instead of created. The definition would be something along the lines of "make sure the instances given (must be of same class) are saved, preferably by fastest way possible". Primary keys are returned. Signals are sent (maybe with an opt-out of "send_signals=False"). This could mean that the implementation for some backends is loop over instances, save each one separately. If there is no need to send signals, some backends could implement the operation in single query (at least PostgreSQL 9.1+ with writeable CTEs, Oracle with merge(?)).
Still, this doesn't mean a better bulk_create wouldn't be useful. But even here I think we want to have a form of bulk_create where signals are sent, and primary keys are set. Why? This makes the form of bulk_create() much more useful for 3rd party apps and even for django-core.
The user wants to create the models in fastest way possible. If the bulk_create() isn't possible for some model, what is the user going to do? Well, they need to loop over the models and save() them separately. So, IMO, we could do that as well directly in bulk_create. If the definition is create the models in bulk queries, if backend doesn't support this, then raise an error, it means many 3rd party apps can't use bulk_create at all. So they will need to loop the models just in case the backend doesn't support the operation, losing the potential performance gain for backends that could do the operation faster.
A discussion at django-developers might be useful. The big question is how we want to define bulk_create. Is it "create the models using bulk queries, otherwise fail" or is it "create the models as fast as possible"? My idea is that we want some way to "create the models as fast as possible, send_signals=True/False, set_primary_keys=True/False". It might mean the fastest way possible is loop over models, save each with force_insert=True.
comment:6 by , 12 years ago
Cc: | added |
---|
Whether it's implemented as bulk_save() or bulk_create(), I love it.
comment:7 by , 12 years ago
Either bulk_create
with set_primary_keys
or bulk_save
sound really good, I'd love to have any of this features.
comment:9 by , 12 years ago
Triage Stage: | Design decision needed → Accepted |
---|
by , 12 years ago
Attachment: | django_patch_bulk_create_sets_pk-1.5.1.patch added |
---|
A simple implementation of a bulk_create that sets primary keys on inserted objects. Only tested with psychopg2. Applies to git tag "1.5.1".
by , 12 years ago
Attachment: | bulk_create_and_create_schema_django_v1.5.1.patch added |
---|
(UPDATED) A simple implementation of a bulk_create that sets primary keys on inserted objects. Only tested with psychopg2. Applies to git tag "1.5.1".
comment:10 by , 11 years ago
A few notes:
- I'm not sure why the schema stuff is in this patch?
for i in range(len(ids))
would be better written withizip
(found in six I believe)- Needs tests!
- Docs should be updated to reflect this.
comment:11 by , 11 years ago
Needs tests: | set |
---|---|
Patch needs improvement: | set |
comment:12 by , 11 years ago
Cc: | added |
---|
comment:13 by , 11 years ago
Cc: | added |
---|
comment:14 by , 11 years ago
Cc: | added |
---|
by , 11 years ago
Attachment: | django_patch_bulk_create_sets_pk-22198e6c.patch added |
---|
A simple implementation of a bulk_create that sets primary keys on inserted objects. Includes a test and slightly modified docs. Only tested on postgres_psycopg2, so I need someone to test it on the Oracle backend. Applies to 22198e6c (current master)
comment:15 by , 11 years ago
Alex et. al:
I've added a test, used zip, and modified the list of caveats on bulk_create in the docs. I'd like to get this merged into master, so I'd like someone with Oracle to run the bulk_create tests and confirm if it works or not.
Let me know if there's something else that I can do to help get this patch approved.
comment:16 by , 11 years ago
Cc: | added |
---|---|
Keywords: | oracle added |
This won't work on Oracle, because Oracle has a different implementation of RETURNING; where on PostgreSQL, INSERT RETURNING creates records that can be fetched, Oracle requires the statement to provide output variables, and those variables are filled in already in the statement execution.
Work is being done now to support more arbitrary RETURNING clauses, in #21454; I'll try to make sure the Oracle fixes there also fit this use case.
comment:17 by , 11 years ago
I'm not sure what you mean by
Oracle requires the statement to provide output variables, and those variables are filled in already in the statement execution.
It seems like return_insert_id
is used the same way for Oracle and Postgres in django/db/models/sql/compiler.py
comment:18 by , 11 years ago
Look at your fetch_returned_insert_ids()
function in BaseDatabaseOperations
; that function is obviously wrong for Oracle, and other code is also (though less obviously) missing.
It's true that return_insert_id
is used the same way, but it does something different; where for Postgres it just adds the clause, for Oracle it also binds an output variable. That difference also requires different handling later, when fetching the value (for Postgres you need a fetch()
, for Oracle you just read the bound variable).
For bulk, life on Oracle gets harder -- since you're now returning an array of values, you need to prepare an array variable to accept it; you cannot use the same return_insert_id
for the bulk and single case. In particular, for the bulk case, you need to pass in the number of values to be returned.
Hope this makes things clearer,
Shai.
comment:19 by , 11 years ago
Cc: | added |
---|
comment:20 by , 11 years ago
Cc: | added |
---|
comment:21 by , 10 years ago
Can we please release the postgres support first since it should work?
There's an effort to improve native postgres support and it should be a part of it in 1.8 in my opinion.
comment:22 by , 10 years ago
Cc: | added |
---|
comment:23 by , 9 years ago
Keywords: | QuerySet.bulk_create added |
---|---|
Summary: | bulk_create() can set the primary key → Allow QuerySet.bulk_create() to set the primary key of its objects |
comment:24 by , 9 years ago
Hi again! This week, we're finalizing the work to open-source an app that depends on this patch.
What needs to happen to get this into mainline as soon as we can?
comment:25 by , 9 years ago
Sending a pull request with an updated patch would be a good first step. Also giving it a "self-review" (or having a colleague review it) using the PatchReviewChecklist would be great.
comment:26 by , 9 years ago
Has patch: | set |
---|---|
Needs tests: | unset |
A pull request to add support for this on PostgreSQL needs a rebase.
comment:27 by , 9 years ago
Patch needs improvement: | unset |
---|
comment:28 by , 9 years ago
Patch needs improvement: | set |
---|
comment:29 by , 9 years ago
Patch needs improvement: | unset |
---|
comment:31 by , 9 years ago
I guess there should be
+.. versionchanged:: 1.10
instead of
+.. versionchanged:: 1.0
in the patch.
comment:33 by , 9 years ago
Has patch: | unset |
---|
(leaving the ticket open for the Oracle implementation)
follow-up: 35 comment:34 by , 9 years ago
We should also consider if we want an implementation for MySQL and SQLite.
On SQLite there is no concurrency when inserting to given table, so we can guess the inserted pk values from the last pk value in the table after the insert. On MySQL we might need to pre-allocate the ids.
A safe-but-slow implementation is to resort to one item at time insert if primary keys are requested.
There is a lot of work to having a bulk_create with primary keys for all backends, but this would allow usage of bulk_create in cases where we can't do that now.
comment:35 by , 9 years ago
Replying to akaariai:
A safe-but-slow implementation is to resort to one item at time insert if primary keys are requested.
If I may, I'd prefer to trust users they know their situation and limitations. No magic fallbacks. I love Django for its doing what it's saying.
IMHO once the user demands bulk insert, then fast bulk insert should always be done and when PKs are requested on an unsupporting backend, I'd expect an not-implemented error with the explanation and recommendation raised. This get caught during development, right?
Personally, I don't mind at all if at these not essential cases, the framework transparently admits that all backends are not the same.
Many thanks to everyone involved so far! I'm looking forward to the feature much!
comment:36 by , 9 years ago
I see your point.
The problem is that 3rd party apps are forced to skip bulk_create for all backends because the app might be used on backend that doesn't support fetching primary keys. Thus users of postgres pay for limitations in MySQL.
The solution might be a method like maybe_bulk_create() - insert data as fast as possible, with bulk_create() if the backend supports returning primary keys, with loop and save() otherwise.
comment:37 by , 9 years ago
Thank you for enlightening.
I tend to see this backend compat feature to rather configure the existing call, so personally I see the way of adding a new API call as merely complicating the stuff more. Many newbies are already frowning upon these parts of ORM I guess.
So why not to allow both, the compatibility requirements of 3rd party apps as well as efficient modern feature to those who can afford it and want to avoid one by one inserts?
I'd propose s/t like this so the behavior stays the same:
def bulk_create(..., allow_fallback=True):
comment:39 by , 9 years ago
The chain.from_iterable()
addition broke Oracle:
File "/home/tim/code/django/django/db/models/sql/compiler.py", line 1038, in as_sql return [(" ".join(result), tuple(chain.from_iterable(params)))] TypeError: 'Oracle_datetime' object is not iterable
params
is something like: ['admin', u'0001_initial', Oracle_datetime(2016, 3, 4, 12, 1, 9, 38631), <django.db.backends.oracle.utils.InsertIdVar object at 0x7fb0343a9710>]
.
comment:41 by , 8 years ago
Isn't the feature a part of just released Django 1.10 allowing this ticket to be closed?
But mainly: BIG thanks for everyone involved! :-)
comment:42 by , 8 years ago
Per comment:33, "(leaving the ticket open for the Oracle implementation)".
comment:44 by , 8 years ago
With the implementation of this new feature, shouldn't the language of https://docs.djangoproject.com/en/1.10/ref/models/instances/#explicitly-specifying-auto-primary-key-values also be changed in addition to the changes made to https://docs.djangoproject.com/en/1.10/ref/models/querysets/ ?
I'd offer to do it via a pull request but I'm not sure I know enough about how all this is implemented to write it correctly. In fact, if I'm wrong about the need to update the language of https://docs.djangoproject.com/en/1.10/ref/models/instances/#explicitly-specifying-auto-primary-key-values, that would be proof I don't understand! ;-)
comment:45 by , 7 years ago
Cc: | added |
---|
comment:46 by , 3 years ago
It's also supported on SQLite 3.35+ (see 93ed71e05802a47774b52503cdc3442686d686c1) and MariaDB 10.5+ (see 98abc0c90e0eb7fe3a71cfa360962cf59605f1d3).
Out of the core DBs Postgres and Oracle support RETURNING, and it is easy to deduce the value when using SQLite as there aren't concurrent transactions.
When using MySQL it seems the best approach is to manually allocate a chunk from the sequence, assign the values to objs and then save them. Alternative is to not support this option at all when using MySQL.
It might make some sense to alter the bulk_create() definition to "create the given objects in the fastest way available". This could mean the fastest way is to issue separate query for each object. For app developers this would be handy, one could rely on bulk_create() in every mass creation situation. It might not be faster than a for loop with a .save(), but it would be always available.