Opened 3 years ago

Last modified 10 days ago

#19527 new New feature

Allow QuerySet.bulk_create() to set the primary key of its objects

Reported by: Tuttle Owned by: nobody
Component: Database layer (models, ORM) Version: master
Severity: Normal Keywords: oracle QuerySet.bulk_create
Cc: charette.s@…, gwahl@…, w2lkm2n@…, mszamot@…, AkosLadanyi, shai, acrefoot@…, benth, dev@… Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: yes Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

(Sorry if I missed a pre-existing ticket for this.)

There is an IMO big caveat of bulk_create(): It does not set primary key of objects it creates.

In my current project, using bulk_create would improve performance, but as I often need an id of the newly created objects, I have to save them one by one.

I think it would be perfect if bulk_create starts to set the primary keys of the objects in the given sequence. At least postgresql supports it ("RETURNING id" clause gives us a set of id's).

Attachments (3)

django_patch_bulk_create_sets_pk-1.5.1.patch (6.5 KB) - added by acrefoot@… 2 years ago.
A simple implementation of a bulk_create that sets primary keys on inserted objects. Only tested with psychopg2. Applies to git tag "1.5.1".
bulk_create_and_create_schema_django_v1.5.1.patch (8.8 KB) - added by acrefoot@… 2 years ago.
(UPDATED) A simple implementation of a bulk_create that sets primary keys on inserted objects. Only tested with psychopg2. Applies to git tag "1.5.1".
django_patch_bulk_create_sets_pk-22198e6c.patch (8.5 KB) - added by acrefoot@… 20 months ago.
A simple implementation of a bulk_create that sets primary keys on inserted objects. Includes a test and slightly modified docs. Only tested on postgres_psycopg2, so I need someone to test it on the Oracle backend. Applies to 22198e6c (current master)

Download all attachments as: .zip

Change History (28)

comment:1 Changed 3 years ago by akaariai

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

Out of the core DBs Postgres and Oracle support RETURNING, and it is easy to deduce the value when using SQLite as there aren't concurrent transactions.

When using MySQL it seems the best approach is to manually allocate a chunk from the sequence, assign the values to objs and then save them. Alternative is to not support this option at all when using MySQL.

It might make some sense to alter the bulk_create() definition to "create the given objects in the fastest way available". This could mean the fastest way is to issue separate query for each object. For app developers this would be handy, one could rely on bulk_create() in every mass creation situation. It might not be faster than a for loop with a .save(), but it would be always available.

comment:2 Changed 3 years ago by Tuttle

Thank you. In this case when:

  • not all DBs support PK returning in a crystal way,
  • PK assigning takes time making it less-than-fastest way,
  • PK assigning is not always required by the developer

then, IMHO, an option bulk_create(..., set_primary_keys=False) makes most sense:

  1. bulk_create() would always use the fastest multiple-tuple creation,
  2. the developer will be allowed to use set_primary_keys=True only in case the underlying DB supports it in a ways the Django devs will have agreed to implement (not too hacky ways).

I thought it could be as hacky as it possibly is already implemented for some DBs in save().

comment:3 Changed 3 years ago by aaugustin

  • Triage Stage changed from Unreviewed to Design decision needed

comment:4 Changed 3 years ago by charettes

  • Cc charette.s@… added
  • Version changed from 1.4 to master

I'd like to work on a patch for this if it gets accepted since it's step toward bulk creation of inherited models.

IMHO the set_primary_keys=False also makes more sense.

We could just raise a NotImplementedError if set_primary_key=True and not connection.features.can_return_id_from_insert just like what distinct does when specifying fields on a unsupported backend.

comment:5 Changed 3 years ago by akaariai

I keep thinking that what we *really* want is some form of bulk_save(). This should work even if the models are modified instead of created. The definition would be something along the lines of "make sure the instances given (must be of same class) are saved, preferably by fastest way possible". Primary keys are returned. Signals are sent (maybe with an opt-out of "send_signals=False"). This could mean that the implementation for some backends is loop over instances, save each one separately. If there is no need to send signals, some backends could implement the operation in single query (at least PostgreSQL 9.1+ with writeable CTEs, Oracle with merge(?)).

Still, this doesn't mean a better bulk_create wouldn't be useful. But even here I think we want to have a form of bulk_create where signals are sent, and primary keys are set. Why? This makes the form of bulk_create() much more useful for 3rd party apps and even for django-core.

The user wants to create the models in fastest way possible. If the bulk_create() isn't possible for some model, what is the user going to do? Well, they need to loop over the models and save() them separately. So, IMO, we could do that as well directly in bulk_create. If the definition is create the models in bulk queries, if backend doesn't support this, then raise an error, it means many 3rd party apps can't use bulk_create at all. So they will need to loop the models just in case the backend doesn't support the operation, losing the potential performance gain for backends that could do the operation faster.

A discussion at django-developers might be useful. The big question is how we want to define bulk_create. Is it "create the models using bulk queries, otherwise fail" or is it "create the models as fast as possible"? My idea is that we want some way to "create the models as fast as possible, send_signals=True/False, set_primary_keys=True/False". It might mean the fastest way possible is loop over models, save each with force_insert=True.

comment:6 Changed 3 years ago by gwahl@…

  • Cc gwahl@… added

Whether it's implemented as bulk_save() or bulk_create(), I love it.

comment:7 Changed 3 years ago by irodrigo17@…

Either bulk_create with set_primary_keys or bulk_save sound really good, I'd love to have any of this features.

comment:8 Changed 2 years ago by campos.ddc

Yes please!

comment:9 Changed 2 years ago by aaugustin

  • Triage Stage changed from Design decision needed to Accepted

Changed 2 years ago by acrefoot@…

A simple implementation of a bulk_create that sets primary keys on inserted objects. Only tested with psychopg2. Applies to git tag "1.5.1".

Changed 2 years ago by acrefoot@…

(UPDATED) A simple implementation of a bulk_create that sets primary keys on inserted objects. Only tested with psychopg2. Applies to git tag "1.5.1".

comment:10 Changed 2 years ago by Alex

A few notes:

  • I'm not sure why the schema stuff is in this patch?
  • for i in range(len(ids)) would be better written with izip (found in six I believe)
  • Needs tests!
  • Docs should be updated to reflect this.

comment:11 Changed 2 years ago by anonymous

  • Needs tests set
  • Patch needs improvement set

comment:12 Changed 2 years ago by Walkman

  • Cc w2lkm2n@… added

comment:13 Changed 2 years ago by mszamot@…

  • Cc mszamot@… added

comment:14 Changed 22 months ago by AkosLadanyi

  • Cc AkosLadanyi added

Changed 20 months ago by acrefoot@…

A simple implementation of a bulk_create that sets primary keys on inserted objects. Includes a test and slightly modified docs. Only tested on postgres_psycopg2, so I need someone to test it on the Oracle backend. Applies to 22198e6c (current master)

comment:15 Changed 20 months ago by acrefoot@…

Alex et. al:

I've added a test, used zip, and modified the list of caveats on bulk_create in the docs. I'd like to get this merged into master, so I'd like someone with Oracle to run the bulk_create tests and confirm if it works or not.

Let me know if there's something else that I can do to help get this patch approved.

comment:16 Changed 20 months ago by shai

  • Cc shai added
  • Keywords oracle added

This won't work on Oracle, because Oracle has a different implementation of RETURNING; where on PostgreSQL, INSERT RETURNING creates records that can be fetched, Oracle requires the statement to provide output variables, and those variables are filled in already in the statement execution.

Work is being done now to support more arbitrary RETURNING clauses, in #21454; I'll try to make sure the Oracle fixes there also fit this use case.

comment:17 Changed 20 months ago by acrefoot@…

I'm not sure what you mean by

Oracle requires the statement to provide output variables, and those variables are filled in already in the statement execution.

It seems like return_insert_id is used the same way for Oracle and Postgres in django/db/models/sql/compiler.py

comment:18 Changed 20 months ago by shai

Look at your fetch_returned_insert_ids() function in BaseDatabaseOperations; that function is obviously wrong for Oracle, and other code is also (though less obviously) missing.

It's true that return_insert_id is used the same way, but it does something different; where for Postgres it just adds the clause, for Oracle it also binds an output variable. That difference also requires different handling later, when fetching the value (for Postgres you need a fetch(), for Oracle you just read the bound variable).

For bulk, life on Oracle gets harder -- since you're now returning an array of values, you need to prepare an array variable to accept it; you cannot use the same return_insert_id for the bulk and single case. In particular, for the bulk case, you need to pass in the number of values to be returned.

Hope this makes things clearer,

Shai.

comment:19 Changed 19 months ago by acrefoot@…

  • Cc acrefoot@… added

comment:20 Changed 17 months ago by benth

  • Cc benth added

comment:21 Changed 13 months ago by the_drow

Can we please release the postgres support first since it should work?
There's an effort to improve native postgres support and it should be a part of it in 1.8 in my opinion.

comment:22 Changed 6 months ago by brillgen

  • Cc dev@… added

comment:23 Changed 3 weeks ago by timgraham

  • Keywords QuerySet.bulk_create added
  • Summary changed from bulk_create() can set the primary key to Allow QuerySet.bulk_create() to set the primary key of its objects

comment:24 Changed 10 days ago by acrefoot

Hi again! This week, we're finalizing the work to open-source an app that depends on this patch.

What needs to happen to get this into mainline as soon as we can?

comment:25 Changed 10 days ago by timgraham

Sending a pull request with an updated patch would be a good first step. Also giving it a "self-review" (or having a colleague review it) using the PatchReviewChecklist would be great.

Note: See TracTickets for help on using tickets.
Back to Top