Context Navigation

← Previous Ticket
Next Ticket →

#36526 closed Cleanup/optimization (fixed)

Document memory usage of bulk_update and ways to batch updates.

Reported by:	Anže Pečar	Owned by:	Natalia Bidart
Component:	Database layer (models, ORM)	Version:	5.2
Severity:	Normal	Keywords:
Cc:		Triage Stage:	Ready for checkin
Has patch:	yes	Needs documentation:	no
Needs tests:	no	Patch needs improvement:	no
Easy pickings:	no	UI/UX:	no

Description (last modified by Anže Pečar)

I recently tried to update a large number of objects with:

things = list(Thing.objects.all()) # A large number of objects e.g. > 1_000_000
Thing.objects.bulk_update(things, ["description"], batch_size=300)

The first line above fits into the available memory (~2GB in my case), but the second line caused a SIGTERM, even though I had an additional 2GB of available memory. This was a bit surprising as I wasn't expecting bulk_update to use this much memory since all the objects to update were already loaded.

My solution was:

for batch in batched(things, 300):
     Thing.objects.bulk_update(batch, ["description"], batch_size=300)

The first example bulk_update used 2.8GB of memory, but in the second example, it only used 62MB.

A GitHub repository that reproduces the problem with memray results.

As we can see from the memray flamegraph the majority of the memory in my example (2.1GB) is used to prepare the when statement for all the batches before executing them. If we change this to generate the when statement only for the current batch the memory consumption is going to be greatly reduced. I'd be happy to contribute this patch unless there are concerns on adding more compute between update queries and making the transaction longer. Let me know :)

This might be related to https://code.djangoproject.com/ticket/31202, but I decided to open a new issue because I wouldn't mind waiting longer for bulk_update to complete, but the SIGTERM surprised me.

Change History (22)

comment:1 by Anže Pečar, 4 months ago

Description:	modified (diff)

comment:2 by Anže Pečar, 4 months ago

Description:	modified (diff)

comment:3 by Anže Pečar, 4 months ago

Description:	modified (diff)

comment:4 by Anže Pečar, 4 months ago

Component:	Uncategorized → Database layer (models, ORM)

comment:5 by Jason Hall, 4 months ago

Owner:	set to Jason Hall
Status:	new → assigned

comment:6 by Jason Hall, 4 months ago

Has patch:	set

comment:7 by Simon Charette, 4 months ago

I feel like we should likely close this one in favor of #31202 particularly because the proposed solution of of batching the resolving of the expression while the transaction is opened is likely going to cause more harm than good until the query resolving and compilation of the batches is made faster.

comment:8 by Jason Hall, 4 months ago

Got it -- thanks for the feedback.

comment:9 by Natalia Bidart, 4 months ago

Resolution:	→ duplicate
Status:	assigned → closed
Type:	Uncategorized → Cleanup/optimization

Thank you Anže Pečar for taking the time to create this report, and thank you Jason Hall for your interest in fixing this.

Simon, I agree with you this report is a dupe of #31202, but I also wonder if we should consider a similar note to the one added in #28231 for bulk_create and more efficient batching?

follow-up: 15 comment:10 by Anže Pečar, 4 months ago

Natalia Bidart, I pointed out in my initial description that the two issues are related but I am still not fully convinced that they are duplicates. In my case I was updating a large number of objects for several hours and it wouldn't have made a difference if the query took an extra hour or two. What did make a difference was that the script was killed with a SIGTERM when the container ran out of memory. :(

Could we reopen until we fully understand what the performance impact of the code changes proposed from Jason Hall? I made a quick benchmark earlier today and Jason's solution with the longer transaction ended up being 6% slower (29.76s vs 28s) but I wanted to also test it on a dataset with more columns as was the example in #31202.

follow-up: 12 comment:11 by Simon Charette, 4 months ago

I also wonder if we should consider a similar note to the one added in #28231 for bulk_create and more efficient batching?

Natalia, I think we should yes.

My immediate reaction when reviewing the ticket was to have a look at the bulk_update documentation and it's effectively not entirely clear what batching refers to (query batching vs objects materialization). 52aa26e6979ba81b00f1593d5ee8c5c73aaa6391 made it very clear that manual generator slicing must be used to prevent evaluation.

in reply to: 11 comment:12 by Natalia Bidart, 3 months ago

Patch needs improvement:	set
Resolution:	duplicate
Status:	closed → new
Triage Stage:	Unreviewed → Accepted

Replying to Simon Charette:

I also wonder if we should consider a similar note to the one added in #28231 for bulk_create and more efficient batching?

Natalia, I think we should yes.

My immediate reaction when reviewing the ticket was to have a look at the bulk_update documentation and it's effectively not entirely clear what batching refers to (query batching vs objects materialization). 52aa26e6979ba81b00f1593d5ee8c5c73aaa6391 made it very clear that manual generator slicing must be used to prevent evaluation.

Reopening with the goal to add the clarification in the docs similar to what bulk_create has.

comment:13 by Natalia Bidart, 3 months ago

Summary:	bulk_update uses more memory than expected → Documment memory usage of bulk_update and ways to batch updates.

comment:14 by Natalia Bidart, 3 months ago

Summary:	Documment memory usage of bulk_update and ways to batch updates. → Document memory usage of bulk_update and ways to batch updates.

in reply to: 10 comment:15 by Natalia Bidart, 3 months ago

Replying to Anže Pečar:

Natalia Bidart, I pointed out in my initial description that the two issues are related but I am still not fully convinced that they are duplicates. In my case I was updating a large number of objects for several hours and it wouldn't have made a difference if the query took an extra hour or two. What did make a difference was that the script was killed with a SIGTERM when the container ran out of memory. :(

Could we reopen until we fully understand what the performance impact of the code changes proposed from Jason Hall? I made a quick benchmark earlier today and Jason's solution with the longer transaction ended up being 6% slower (29.76s vs 28s) but I wanted to also test it on a dataset with more columns as was the example in #31202.

Thanks for the clarification. As Simon noted in the PR, performing the expression resolution while the transaction is open (which is known to take significant time) is likely to be more harmful than the memory overhead it incurs.

Since manual batching already works around the memory issue, we don't plan to pursue changes to defer expression resolution within bulk_update other than what is covered in #31202. I'll work on a patch to improve/extend the docs.

comment:16 by Natalia Bidart, 3 months ago

Owner:	changed from Jason Hall to Natalia Bidart
Patch needs improvement:	unset
Status:	new → assigned

comment:17 by Natalia Bidart, 3 months ago

comment:18 by Jacob Walls, 2 months ago

Patch needs improvement:	set

comment:19 by Natalia Bidart, 2 months ago

Patch needs improvement:	unset
Triage Stage:	Accepted → Ready for checkin

comment:20 by nessita <124304+nessita@…>, 2 months ago

Resolution:	→ fixed
Status:	assigned → closed

In 608d3eb:

Fixed #36526 -- Doc'd QuerySet.bulk_update() memory usage when batching.

Thanks Simon Charette for the review.

comment:21 by Natalia <124304+nessita@…>, 2 months ago

In 953163e:

[6.0.x] Fixed #36526 -- Doc'd QuerySet.bulk_update() memory usage when batching.

Thanks Simon Charette for the review.

Backport of 608d3ebc8889863d43be1090d634b9507fe4a85e from main.

comment:22 by Natalia <124304+nessita@…>, 2 months ago

In 80b9c8f5:

[5.2.x] Fixed #36526 -- Doc'd QuerySet.bulk_update() memory usage when batching.

Thanks Simon Charette for the review.

Backport of 608d3ebc8889863d43be1090d634b9507fe4a85e from main.

Note: See TracTickets for help on using tickets.

Download in other formats:

Issues

Context Navigation

#36526 closed Cleanup/optimization (fixed)

Document memory usage of bulk_update and ways to batch updates.

Description (last modified by Anže Pečar)

Change History (22)

comment:1 by Anže Pečar, 4 months ago

comment:2 by Anže Pečar, 4 months ago

comment:3 by Anže Pečar, 4 months ago

comment:4 by Anže Pečar, 4 months ago

comment:5 by Jason Hall, 4 months ago

comment:6 by Jason Hall, 4 months ago

comment:7 by Simon Charette, 4 months ago

comment:8 by Jason Hall, 4 months ago

comment:9 by Natalia Bidart, 4 months ago

follow-up: 15 comment:10 by Anže Pečar, 4 months ago

follow-up: 12 comment:11 by Simon Charette, 4 months ago

in reply to: 11 comment:12 by Natalia Bidart, 3 months ago

comment:13 by Natalia Bidart, 3 months ago

comment:14 by Natalia Bidart, 3 months ago

in reply to: 10 comment:15 by Natalia Bidart, 3 months ago

comment:16 by Natalia Bidart, 3 months ago

comment:17 by Natalia Bidart, 3 months ago

comment:18 by Jacob Walls, 2 months ago

comment:19 by Natalia Bidart, 2 months ago

comment:20 by nessita <124304+nessita@…>, 2 months ago

comment:21 by Natalia <124304+nessita@…>, 2 months ago

comment:22 by Natalia <124304+nessita@…>, 2 months ago

Download in other formats:

Django Links

Learn More

Get Involved

Get Help

Follow Us

Support Us