Opened 7 years ago

Closed 6 years ago

Last modified 5 years ago

#28668 closed New feature (fixed)

Add ON CONFLICT IGNORE support to QuerySet.bulk_create()

Reported by: Tom Forbes Owned by: Tom Forbes
Component: Database layer (models, ORM) Version: dev
Severity: Normal Keywords:
Cc: Дилян Палаузов Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Tim Graham)

When using bulk_create it would be nice to support ON CONFLICT DO NOTHING, which allows existing rows to be included in the bulk_create call, e.g:

Comment.objects.create(name='test', text='test')
Comment.objects.bulk_create(Comment(name='test', text='test'), on_conflict=IGNORE)  # Does not throw an exception

All the databases we support have syntax for this, as well as updating/replacing fields.

This is a spin-off of #28641, just including the ON CONFLICT idea which is backwards-compatible and would be easier to implement than the others.

Attachments (3)

on_conflict_postgresql.patch (6.5 KB ) - added by Дилян Палаузов 6 years ago.
this adds INSERT ON CONFLICT IGNORE to Postgresql, where only IDs of the actually inserted rows are returned
on_conflict_postgresql_signals_ids.patch (13.3 KB ) - added by Дилян Палаузов 6 years ago.
In addition to the other file, this one add support for sending post_save for the newly created objects and retrieving the ids of the objects provided as parameters to bulk_create, that already existed in the database.
on_conflict_ignore.patch (43.0 KB ) - added by Дилян Палаузов 6 years ago.
The on_conflict_ignore.patch adds on_conflict='ignore' to QuerySet.bulk_create and empowers bulk_create optionally to send post_save signals, when the ID of the inserted object is known to bulk_create. For postgresql bulk_create offers in addition to retrieve the IDs of the newly inserted objects, when using on_conflict='ignore' and, with a second query, to find the PKs pf the supplied objs that were already in the databse. When supported by the backend (so not Oracle), on_conflict='ignore' is added to bulk_create() in django.db.models.fields.create_forward_many_to_many_manager.ManyRelatedManager._add_items, making ManyRelatedManager.add() thread-safe, possibly resolving #19544.

Download all attachments as: .zip

Change History (27)

comment:1 by Tom Forbes, 7 years ago

Description: modified (diff)

comment:2 by Tim Graham, 7 years ago

Description: modified (diff)
Summary: Add support for ON CONFLICT to bulk_createAdd ON CONFLICT support to QuerySet.bulk_create()
Triage Stage: UnreviewedAccepted

comment:3 by Tom Forbes, 7 years ago

Owner: changed from nobody to Tom Forbes
Status: newassigned

If in the future we want to add support for ON CONFLICT UPDATE we should design the API with that in mind.

Perhaps have DO_NOTHING as a sentinel object, so in the future we can pass in something else to describe updating (like a dictionary or somesuch?).

comment:4 by Tom Forbes, 7 years ago

Has patch: set

comment:5 by Дилян Палаузов, 7 years ago

Add documentation what bulk_create() is supposed to return.

I assume this not going to work, as Postgresql has can_return_ids_from_bulk_insert, but 'INSERT ON CONFLICT DO NOTHING RETURNING id' returns only the ids of the inserted rows, so the assert will fail. In particular not inserted rows are not returned.

-                ids = self._batched_insert(objs_without_pk, fields, batch_size)
+                ids = self._batched_insert(objs_without_pk, fields, batch_size, on_conflict=on_conflict)
	         if connection.features.can_return_ids_from_bulk_insert:
		     assert len(ids) == len(objs_without_pk)

comment:6 by Tom Forbes, 7 years ago

Good point, I'll fix it up tomorrow. There are some other issues I've found tonight as well.

comment:7 by Дилян Палаузов, 7 years ago

Cc: Дилян Палаузов added

comment:8 by Tom Forbes, 7 years ago

I fixed that assertion by skipping returning ID's from the insert if on_conflict is used. They are incompatible, as you cannot tell which models where inserted and which where skipped.

comment:9 by Дилян Палаузов, 7 years ago

For a table like CREATE TABLE t (id SERIAL PRIMARY KEY, name VARCHAR(10) UNIQUE, comment VARCHAR(10)); it is possible to tell which models were inserted with a query like

WITH
  to_be_inserted AS (SELECT * FROM (VALUES ('name12', 'comment12'), ('name5', 'comment5'), ('name6', 'comment6')) as g(name, comment)),
  successfully_inserted AS (
      INSERT INTO t ("name", "comment" ) SELECT *
        FROM to_be_inserted ON CONFLICT DO NOTHING RETURNING *)
SELECT s.id FROM to_be_inserted AS b
  LEFT JOIN successfully_inserted AS s ON (b.name = s.name AND b.comment = s.comment);

where to_be_inserted contains the values that are going into the database. The returned column contains NULL for values that were presented in the database, and the id for the inserted rows.

With the proposed changes

-                if connection.features.can_return_ids_from_bulk_insert:
+                if connection.features.can_return_ids_from_bulk_insert and not on_conflict == 'ignore':
                      assert len(ids) == len(objs_without_pk)
                  for obj_without_pk, pk in zip(objs_without_pk, ids):
                      obj_without_pk.pk = pk

the for-loop will not work, as the amount of ids is not the same as the amount of obj_without_pk.

comment:10 by Chris Beck, 7 years ago

Is there any consideration of this being included in a future 1.11.x release? I'd really like to be able to use bulk_create for a fire-hose of data that we're getting from a 3rd party who don't guarantee uniqueness so at the moment our only solution is to remove the unique_together attribute and enforce it at the reporting level.

comment:11 by Tim Graham, 7 years ago

No, new features aren't backported to stable branches. See our supported versions policy.

comment:12 by Дилян Палаузов, 6 years ago

Let's start with the most complex case: calling .objects.bulk_create(..., on_conflict='ignore') where the caller wants to determine which of the supplied objects were actually inserted (in order to call signals on them). We can start with the attached minimalistic patch, which does this for Postgresql as only RDBMS allowing efficient handling of this most complex case, and see how to implement it in the other RDBMs.

Unfortunately, the sequence counter is automatically increased for each object provided as parameter, even if at the end the object was already in the database.

I propose extending the logic for bulk_create to send implicit post_save signals for Postgresql.

by Дилян Палаузов, 6 years ago

this adds INSERT ON CONFLICT IGNORE to Postgresql, where only IDs of the actually inserted rows are returned

by Дилян Палаузов, 6 years ago

In addition to the other file, this one add support for sending post_save for the newly created objects and retrieving the ids of the objects provided as parameters to bulk_create, that already existed in the database.

comment:13 by Tom Forbes, 6 years ago

Hey, I recommend consolidating your patches into a single PR on Github and linking it here, that's the development process Django now follows.

comment:14 by Дилян Палаузов, 6 years ago

How essential is it, that objects inserted in the database, have their ._state.adding and ._state.db set? The code from bulk_create(..., on_conflict='ignore') below is not called when ids=[], which happens when database does not return the IDs.

                for obj_without_pk, pk in zip(objs_without_pk, ids):
                    obj_without_pk.pk = pk
                    obj_without_pk._state.adding = False
                    obj_without_pk._state.db = self.db

by Дилян Палаузов, 6 years ago

Attachment: on_conflict_ignore.patch added

The on_conflict_ignore.patch adds on_conflict='ignore' to QuerySet.bulk_create and empowers bulk_create optionally to send post_save signals, when the ID of the inserted object is known to bulk_create. For postgresql bulk_create offers in addition to retrieve the IDs of the newly inserted objects, when using on_conflict='ignore' and, with a second query, to find the PKs pf the supplied objs that were already in the databse. When supported by the backend (so not Oracle), on_conflict='ignore' is added to bulk_create() in django.db.models.fields.create_forward_many_to_many_manager.ManyRelatedManager._add_items, making ManyRelatedManager.add() thread-safe, possibly resolving #19544.

comment:15 by Tom Forbes, 6 years ago

Hey, please create a branch on Github and submit a Pull Request with these changes if you want them to be reviewed and tested.

in reply to:  15 comment:16 by Shlomo Anglister, 6 years ago

Replying to Tom Forbes:

Hey, please create a branch on Github and submit a Pull Request with these changes if you want them to be reviewed and tested.

Anyway to contact the contributor to expedite this?

comment:17 by Дилян Палаузов, 6 years ago

The contributor does not work currently with Django. To expedite the process you can follow what was stated in the last comment.

comment:18 by Tim Graham <timograham@…>, 6 years ago

Resolution: fixed
Status: assignedclosed

In f1fbef6:

Fixed #28668 -- Allowed QuerySet.bulk_create() to ignore insert conflicts.

comment:19 by Дилян Палаузов, 6 years ago

It is good to see progress here but can this ticket be kept open until (for Postgres) also the possibility is implemented to return the IDs, and the possibility tonsend post_save?

To what I remember .add() had also concurrent deficiencies, which could be solved with ON CONFLICT IGNORE, I am nkt sure they are addressed with the fix.

comment:20 by Tim Graham, 6 years ago

Summary: Add ON CONFLICT support to QuerySet.bulk_create()Add ON CONFLICT IGNORE support to QuerySet.bulk_create()

I think we should open new tickets for further enhancements.

comment:21 by Simon Charette, 6 years ago

... also the possibility is implemented to return the IDs

How would you expect this to work? Are you aware that ON CONFLICT IGNORE RETURNING * won't return conflicting rows?

Would you expect pk to only be assigned to objects that didn't cause conflict? That seems like a weird API to me but I can't think of anything better.

Last edited 6 years ago by Simon Charette (previous) (diff)

comment:22 by Дилян Палаузов, 6 years ago

When I wrote initially on this, I meant having the knkwledge of which records were inserted or nkt, when possible. I don't see a point to open a new ticket, proving all the history on the matter is already here.

I expect it to work as in tbe proposed patch.

comment:23 by Дилян Палаузов, 6 years ago

Resolution: fixed
Status: closednew

comment:24 by Tom Forbes, 6 years ago

Resolution: fixed
Status: newclosed

Hey Дилян, I think this ticket is considered fixed now. If you wish to add backend specific support for returning the pk's of inserted records we should do this as another ticket, and it no doubt will need a bit of discussion around the implementation. There is not much of value here that we could not carry over to a new ticket.

I'm happy to make a new ticket if you do not wish to?

Version 0, edited 6 years ago by Tom Forbes (next)
Note: See TracTickets for help on using tickets.
Back to Top