Opened 12 years ago

Last modified 2 years ago

#19527 new New feature

Allow QuerySet.bulk_create() to set the primary key of its objects

Reported by: Vlada Macek Owned by: nobody
Component: Database layer (models, ORM) Version: dev
Severity: Normal Keywords: oracle QuerySet.bulk_create
Cc: charette.s@…, gwahl@…, w2lkm2n@…, mszamot@…, Akos Ladanyi, Shai Berger, acrefoot@…, benth, dev@…, Mariusz Felisiak Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

(Sorry if I missed a pre-existing ticket for this.)

There is an IMO big caveat of bulk_create(): It does not set primary key of objects it creates.

In my current project, using bulk_create would improve performance, but as I often need an id of the newly created objects, I have to save them one by one.

I think it would be perfect if bulk_create starts to set the primary keys of the objects in the given sequence. At least postgresql supports it ("RETURNING id" clause gives us a set of id's).

Attachments (3)

django_patch_bulk_create_sets_pk-1.5.1.patch (6.5 KB ) - added by acrefoot@… 11 years ago.
A simple implementation of a bulk_create that sets primary keys on inserted objects. Only tested with psychopg2. Applies to git tag "1.5.1".
bulk_create_and_create_schema_django_v1.5.1.patch (8.8 KB ) - added by acrefoot@… 11 years ago.
(UPDATED) A simple implementation of a bulk_create that sets primary keys on inserted objects. Only tested with psychopg2. Applies to git tag "1.5.1".
django_patch_bulk_create_sets_pk-22198e6c.patch (8.5 KB ) - added by acrefoot@… 11 years ago.
A simple implementation of a bulk_create that sets primary keys on inserted objects. Includes a test and slightly modified docs. Only tested on postgres_psycopg2, so I need someone to test it on the Oracle backend. Applies to 22198e6c (current master)

Download all attachments as: .zip

Change History (50)

comment:1 by Anssi Kääriäinen, 12 years ago

Out of the core DBs Postgres and Oracle support RETURNING, and it is easy to deduce the value when using SQLite as there aren't concurrent transactions.

When using MySQL it seems the best approach is to manually allocate a chunk from the sequence, assign the values to objs and then save them. Alternative is to not support this option at all when using MySQL.

It might make some sense to alter the bulk_create() definition to "create the given objects in the fastest way available". This could mean the fastest way is to issue separate query for each object. For app developers this would be handy, one could rely on bulk_create() in every mass creation situation. It might not be faster than a for loop with a .save(), but it would be always available.

comment:2 by Vlada Macek, 12 years ago

Thank you. In this case when:

  • not all DBs support PK returning in a crystal way,
  • PK assigning takes time making it less-than-fastest way,
  • PK assigning is not always required by the developer

then, IMHO, an option bulk_create(..., set_primary_keys=False) makes most sense:

  1. bulk_create() would always use the fastest multiple-tuple creation,
  2. the developer will be allowed to use set_primary_keys=True only in case the underlying DB supports it in a ways the Django devs will have agreed to implement (not too hacky ways).

I thought it could be as hacky as it possibly is already implemented for some DBs in save().

comment:3 by Aymeric Augustin, 12 years ago

Triage Stage: UnreviewedDesign decision needed

comment:4 by Simon Charette, 12 years ago

Cc: charette.s@… added
Version: 1.4master

I'd like to work on a patch for this if it gets accepted since it's step toward bulk creation of inherited models.

IMHO the set_primary_keys=False also makes more sense.

We could just raise a NotImplementedError if set_primary_key=True and not connection.features.can_return_id_from_insert just like what distinct does when specifying fields on a unsupported backend.

comment:5 by Anssi Kääriäinen, 12 years ago

I keep thinking that what we *really* want is some form of bulk_save(). This should work even if the models are modified instead of created. The definition would be something along the lines of "make sure the instances given (must be of same class) are saved, preferably by fastest way possible". Primary keys are returned. Signals are sent (maybe with an opt-out of "send_signals=False"). This could mean that the implementation for some backends is loop over instances, save each one separately. If there is no need to send signals, some backends could implement the operation in single query (at least PostgreSQL 9.1+ with writeable CTEs, Oracle with merge(?)).

Still, this doesn't mean a better bulk_create wouldn't be useful. But even here I think we want to have a form of bulk_create where signals are sent, and primary keys are set. Why? This makes the form of bulk_create() much more useful for 3rd party apps and even for django-core.

The user wants to create the models in fastest way possible. If the bulk_create() isn't possible for some model, what is the user going to do? Well, they need to loop over the models and save() them separately. So, IMO, we could do that as well directly in bulk_create. If the definition is create the models in bulk queries, if backend doesn't support this, then raise an error, it means many 3rd party apps can't use bulk_create at all. So they will need to loop the models just in case the backend doesn't support the operation, losing the potential performance gain for backends that could do the operation faster.

A discussion at django-developers might be useful. The big question is how we want to define bulk_create. Is it "create the models using bulk queries, otherwise fail" or is it "create the models as fast as possible"? My idea is that we want some way to "create the models as fast as possible, send_signals=True/False, set_primary_keys=True/False". It might mean the fastest way possible is loop over models, save each with force_insert=True.

comment:6 by gwahl@…, 12 years ago

Cc: gwahl@… added

Whether it's implemented as bulk_save() or bulk_create(), I love it.

comment:7 by irodrigo17@…, 12 years ago

Either bulk_create with set_primary_keys or bulk_save sound really good, I'd love to have any of this features.

comment:8 by campos.ddc, 12 years ago

Yes please!

comment:9 by Aymeric Augustin, 11 years ago

Triage Stage: Design decision neededAccepted

by acrefoot@…, 11 years ago

A simple implementation of a bulk_create that sets primary keys on inserted objects. Only tested with psychopg2. Applies to git tag "1.5.1".

by acrefoot@…, 11 years ago

(UPDATED) A simple implementation of a bulk_create that sets primary keys on inserted objects. Only tested with psychopg2. Applies to git tag "1.5.1".

comment:10 by Alex Gaynor, 11 years ago

A few notes:

  • I'm not sure why the schema stuff is in this patch?
  • for i in range(len(ids)) would be better written with izip (found in six I believe)
  • Needs tests!
  • Docs should be updated to reflect this.

comment:11 by anonymous, 11 years ago

Needs tests: set
Patch needs improvement: set

comment:12 by György Kiss, 11 years ago

Cc: w2lkm2n@… added

comment:13 by mszamot@…, 11 years ago

Cc: mszamot@… added

comment:14 by Akos Ladanyi, 11 years ago

Cc: Akos Ladanyi added

by acrefoot@…, 11 years ago

A simple implementation of a bulk_create that sets primary keys on inserted objects. Includes a test and slightly modified docs. Only tested on postgres_psycopg2, so I need someone to test it on the Oracle backend. Applies to 22198e6c (current master)

comment:15 by acrefoot@…, 11 years ago

Alex et. al:

I've added a test, used zip, and modified the list of caveats on bulk_create in the docs. I'd like to get this merged into master, so I'd like someone with Oracle to run the bulk_create tests and confirm if it works or not.

Let me know if there's something else that I can do to help get this patch approved.

comment:16 by Shai Berger, 11 years ago

Cc: Shai Berger added
Keywords: oracle added

This won't work on Oracle, because Oracle has a different implementation of RETURNING; where on PostgreSQL, INSERT RETURNING creates records that can be fetched, Oracle requires the statement to provide output variables, and those variables are filled in already in the statement execution.

Work is being done now to support more arbitrary RETURNING clauses, in #21454; I'll try to make sure the Oracle fixes there also fit this use case.

comment:17 by acrefoot@…, 11 years ago

I'm not sure what you mean by

Oracle requires the statement to provide output variables, and those variables are filled in already in the statement execution.

It seems like return_insert_id is used the same way for Oracle and Postgres in django/db/models/sql/compiler.py

comment:18 by Shai Berger, 11 years ago

Look at your fetch_returned_insert_ids() function in BaseDatabaseOperations; that function is obviously wrong for Oracle, and other code is also (though less obviously) missing.

It's true that return_insert_id is used the same way, but it does something different; where for Postgres it just adds the clause, for Oracle it also binds an output variable. That difference also requires different handling later, when fetching the value (for Postgres you need a fetch(), for Oracle you just read the bound variable).

For bulk, life on Oracle gets harder -- since you're now returning an array of values, you need to prepare an array variable to accept it; you cannot use the same return_insert_id for the bulk and single case. In particular, for the bulk case, you need to pass in the number of values to be returned.

Hope this makes things clearer,

Shai.

comment:19 by acrefoot@…, 11 years ago

Cc: acrefoot@… added

comment:20 by benth, 10 years ago

Cc: benth added

comment:21 by Omer Katz, 10 years ago

Can we please release the postgres support first since it should work?
There's an effort to improve native postgres support and it should be a part of it in 1.8 in my opinion.

comment:22 by Brillgen Developers, 10 years ago

Cc: dev@… added

comment:23 by Tim Graham, 9 years ago

Keywords: QuerySet.bulk_create added
Summary: bulk_create() can set the primary keyAllow QuerySet.bulk_create() to set the primary key of its objects

comment:24 by acrefoot, 9 years ago

Hi again! This week, we're finalizing the work to open-source an app that depends on this patch.

What needs to happen to get this into mainline as soon as we can?

comment:25 by Tim Graham, 9 years ago

Sending a pull request with an updated patch would be a good first step. Also giving it a "self-review" (or having a colleague review it) using the PatchReviewChecklist would be great.

comment:26 by Tim Graham, 9 years ago

Has patch: set
Needs tests: unset

A pull request to add support for this on PostgreSQL needs a rebase.

comment:27 by Tim Graham, 9 years ago

Patch needs improvement: unset

Two different update pull requests: #5917 and #5936

comment:28 by Tim Graham, 9 years ago

Patch needs improvement: set

comment:29 by Tim Graham, 9 years ago

Patch needs improvement: unset

comment:30 by Tim Graham <timograham@…>, 9 years ago

In 04240b23:

Refs #19527 -- Allowed QuerySet.bulk_create() to set the primary key of its objects.

PostgreSQL support only.

Thanks Vladislav Manchev and alesasnouski for working on the patch.

comment:31 by Vlada Macek, 9 years ago

I guess there should be

+.. versionchanged:: 1.10

instead of

+.. versionchanged:: 1.0

in the patch.

comment:32 by Tim Graham <timograham@…>, 9 years ago

In 1d17bb4:

Refs #19527 -- Fixed typo in docs/ref/models/querysets.txt.

comment:33 by Tim Graham, 9 years ago

Has patch: unset

(leaving the ticket open for the Oracle implementation)

comment:34 by Anssi Kääriäinen, 9 years ago

We should also consider if we want an implementation for MySQL and SQLite.

On SQLite there is no concurrency when inserting to given table, so we can guess the inserted pk values from the last pk value in the table after the insert. On MySQL we might need to pre-allocate the ids.

A safe-but-slow implementation is to resort to one item at time insert if primary keys are requested.

There is a lot of work to having a bulk_create with primary keys for all backends, but this would allow usage of bulk_create in cases where we can't do that now.

in reply to:  34 comment:35 by Vlada Macek, 9 years ago

Replying to akaariai:

A safe-but-slow implementation is to resort to one item at time insert if primary keys are requested.

If I may, I'd prefer to trust users they know their situation and limitations. No magic fallbacks. I love Django for its doing what it's saying.

IMHO once the user demands bulk insert, then fast bulk insert should always be done and when PKs are requested on an unsupporting backend, I'd expect an not-implemented error with the explanation and recommendation raised. This get caught during development, right?

Personally, I don't mind at all if at these not essential cases, the framework transparently admits that all backends are not the same.

Many thanks to everyone involved so far! I'm looking forward to the feature much!

comment:36 by Anssi Kääriäinen, 9 years ago

I see your point.

The problem is that 3rd party apps are forced to skip bulk_create for all backends because the app might be used on backend that doesn't support fetching primary keys. Thus users of postgres pay for limitations in MySQL.

The solution might be a method like maybe_bulk_create() - insert data as fast as possible, with bulk_create() if the backend supports returning primary keys, with loop and save() otherwise.

comment:37 by Vlada Macek, 9 years ago

Thank you for enlightening.

I tend to see this backend compat feature to rather configure the existing call, so personally I see the way of adding a new API call as merely complicating the stuff more. Many newbies are already frowning upon these parts of ORM I guess.

So why not to allow both, the compatibility requirements of 3rd party apps as well as efficient modern feature to those who can afford it and want to avoid one by one inserts?

I'd propose something like this so the behavior stays the same:

def bulk_create(..., allow_fallback=True):
Last edited 9 years ago by Tim Graham (previous) (diff)

comment:38 by Anssi Kääriäinen, 9 years ago

Yes, that API is completely fine for me.

comment:39 by Tim Graham, 9 years ago

The chain.from_iterable() addition broke Oracle:

  File "/home/tim/code/django/django/db/models/sql/compiler.py", line 1038, in as_sql
    return [(" ".join(result), tuple(chain.from_iterable(params)))]
TypeError: 'Oracle_datetime' object is not iterable

params is something like: ['admin', u'0001_initial', Oracle_datetime(2016, 3, 4, 12, 1, 9, 38631), <django.db.backends.oracle.utils.InsertIdVar object at 0x7fb0343a9710>].

comment:40 by Tim Graham <timograham@…>, 9 years ago

In 359be446:

Refs #19527 -- Fixed SQL compiler regression causing Oracle failure.

comment:41 by Vlada Macek, 8 years ago

Isn't the feature a part of just released Django 1.10 allowing this ticket to be closed?

But mainly: BIG thanks for everyone involved! :-)

comment:42 by Tim Graham, 8 years ago

Per comment:33, "(leaving the ticket open for the Oracle implementation)".

comment:43 by Vlada Macek, 8 years ago

I'm sorry, Tim.

comment:44 by Malik A. Rumi, 8 years ago

With the implementation of this new feature, shouldn't the language of https://docs.djangoproject.com/en/1.10/ref/models/instances/#explicitly-specifying-auto-primary-key-values also be changed in addition to the changes made to https://docs.djangoproject.com/en/1.10/ref/models/querysets/ ?

I'd offer to do it via a pull request but I'm not sure I know enough about how all this is implemented to write it correctly. In fact, if I'm wrong about the need to update the language of https://docs.djangoproject.com/en/1.10/ref/models/instances/#explicitly-specifying-auto-primary-key-values, that would be proof I don't understand! ;-)

comment:45 by Mariusz Felisiak, 7 years ago

Cc: Mariusz Felisiak added

comment:46 by Mariusz Felisiak, 3 years ago

It's also supported on SQLite 3.35+ (see 93ed71e05802a47774b52503cdc3442686d686c1) and MariaDB 10.5+ (see 98abc0c90e0eb7fe3a71cfa360962cf59605f1d3).

comment:47 by Mariusz Felisiak, 2 years ago

#34011 was a duplicate with the implementation idea for MySQL.

Note: See TracTickets for help on using tickets.
Back to Top