Opened 5 years ago

Last modified 13 months ago

#11572 new Cleanup/optimization

Very high memory usage by big sitemaps

Reported by: riklaunim Owned by: nobody
Component: contrib.sitemaps Version: master
Severity: Normal Keywords:
Cc: simon@… Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

I'm using Python 2.5, Django 1.x, Nginx/FastCGI hosting at megiteam.pl. The site has a big sitemap: 9K elements, a 1.7 MB sitemap.xml file. The sitemap is done by the book with the Sitemap framework:

from django.contrib.sitemaps import Sitemap

class MainMap(Sitemap):
    changefreq = "never"
    priority = 0.5

    def items(self):
        # JobOffer is the site's own model (its import is not shown here).
        return JobOffer.objects.filter(published=True, inactive=False)

    def lastmod(self, obj):
        return obj.published_at

The problem is that memstat -v after requesting sitemap.xml shows memory usage jumping from 6 MB right after a restart to 105 MB, and it stays at that level for every subsequent request of the sitemap (where the file is 1.7 MB). If I limit the query to 1000 elements I get ~22 MB memory usage.

Attachments (0)

Change History (9)

comment:1 Changed 5 years ago by ubernostrum

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Resolution set to invalid
  • Status changed from new to closed

I'm not sure what the bug is here; querying thousands of objects in one go (as you're doing when you fetch the list of items) should be expected to increase memory use. You may want to look into alternative query methods which allow you to conserve memory, but in general this is going to be more memory-intensive as the number of objects involved grows.
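For illustration only (this is not part of the original comment), one such alternative is QuerySet.iterator(); note that the stock Sitemap class paginates the result of items(), so a bare generator may not drop in directly:

# Hypothetical sketch of a memory-conscious query over the reporter's JobOffer
# model: .only() restricts the columns fetched, and .iterator() streams rows
# instead of caching every model instance for the life of the request.
offers = (JobOffer.objects
          .filter(published=True, inactive=False)
          .only("published_at")
          .iterator())
for offer in offers:
    emit_url_entry(offer)  # hypothetical per-item processing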

comment:2 Changed 4 years ago by esaj

  • Resolution invalid deleted
  • Status changed from closed to reopened

How about we modify the implementation to stream the response instead of rendering the template into a potentially very large string on every request?

comment:3 Changed 4 years ago by ubernostrum

  • Resolution set to invalid
  • Status changed from reopened to closed

Your memory problems are mostly due to the overhead of all those model objects; the rendered template won't be nearly as big. At any rate, there are other tickets open regarding streaming HTTP responses, which I'd advise you to look at.

comment:4 Changed 3 years ago by EmilStenstrom

@ubernostrum: This issue took down our site and we've been debugging it for a week straight. Who would have thought a sitemap could make the Apache process consume so much memory that it eventually gets killed?

Anyway. I believe two things should be done to prevent this from happening to sites that grow:

1) Lower the default number of elements that go into a single sitemap, and let people raise it back to the old value with an override (see the sketch at the end of this comment).

2) The memory is not released after the sitemap has been generated (I think). Tested on Windows by watching the memory used by python.exe before and after generating the sitemap (13 MB before, 87 MB after). The RSS of the process on an Ubuntu server likewise jumps by 90 MB when the sitemap is accessed and does not go back down afterwards.

I think this bug should be opened again.
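For context (again, not part of the original comment), contrib.sitemaps already exposes a per-page cap through the limit attribute used by its paginator, so a large sitemap can be split across an index; the value below is purely illustrative:

# Hypothetical sketch: capping URLs per sitemap page (the framework's default
# cap is far higher; 1000 is illustrative). The pages are then exposed through
# django.contrib.sitemaps.views.index alongside django.contrib.sitemaps.views.sitemap.
from django.contrib.sitemaps import Sitemap

class PagedMainMap(Sitemap):
    limit = 1000  # at most 1000 URLs rendered per request

    def items(self):
        return JobOffer.objects.filter(published=True, inactive=False)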

comment:5 Changed 19 months ago by simon29

  • Cc simon@… added
  • Component changed from Contrib apps to contrib.sitemaps
  • Easy pickings unset
  • Resolution invalid deleted
  • Severity set to Normal
  • Status changed from closed to reopened
  • Triage Stage changed from Unreviewed to Design decision needed
  • Type set to Cleanup/optimization
  • UI/UX unset
  • Version changed from 1.0 to master

Big sitemaps have crashed my site(s) too. Hate to reopen, but the current approach is inadequate. Sitemaps need to stream by default.

comment:6 Changed 19 months ago by simon29

Just remembered, I wrote this a while back. I've put it up on github --
https://github.com/s29/django-fastsitemaps

Maybe streaming could be included in core sitemaps as an option.

comment:7 Changed 15 months ago by anonymous

Setting aside building a string vs streaming HTTP, surely it is a bug that the memory is never released for the life of the process, no?

comment:8 Changed 13 months ago by aaugustin

  • Status changed from reopened to new

comment:9 Changed 13 months ago by aaugustin

  • Triage Stage changed from Design decision needed to Accepted

Streaming responses are now supported in core. But as pointed out earlier in the comments, this isn't the problem; the problem is pulling objects from the database. So I'm not sure there's actually much to be gained on this side.

Currently, as soon as you access the first object of a queryset, the entire queryset is brought into memory. This might be optimized with server side cursors, but that's another story (with its own share of problems).
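As a point of reference (not part of the original comment), a streaming sitemap view could look roughly like the sketch below; the view name, the XML it emits, and the assumption that JobOffer defines get_absolute_url() are all illustrative, and a real sitemap also needs absolute URLs (scheme and domain), which are elided here:

# Hypothetical sketch: streaming the XML instead of building one big string.
# As noted above, this does not by itself reduce the cost of the queryset.
from django.http import StreamingHttpResponse

def stream_sitemap(request):
    def generate():
        yield '<?xml version="1.0" encoding="UTF-8"?>\n'
        yield '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        offers = JobOffer.objects.filter(published=True, inactive=False).iterator()
        for offer in offers:
            yield "  <url><loc>%s</loc><lastmod>%s</lastmod></url>\n" % (
                offer.get_absolute_url(), offer.published_at.strftime("%Y-%m-%d"))
        yield "</urlset>\n"
    return StreamingHttpResponse(generate(), content_type="application/xml")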

I agree that there's some room for optimization here. To move forward, this ticket needs:

  • a concrete proposal (a link to a project with similar goals isn't sufficient): which parts do you want to integrate exactly, and how do you guarantee backwards compatibility?
  • a benchmark proving the benefits (see the sketch below)
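A minimal sketch of what such a benchmark could measure, with assumed helper names; ru_maxrss is reported in kilobytes on Linux:

# Hypothetical benchmark sketch: compare peak RSS with and without a candidate patch.
import resource

def peak_rss_kb():
    # ru_maxrss is kilobytes on Linux (bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss_kb()
render_sitemap()  # hypothetical call exercising the sitemap code path under test
after = peak_rss_kb()
print("peak RSS grew by %d kB" % (after - before))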
