Opened 15 years ago

Closed 10 years ago

#11572 closed Cleanup/optimization (worksforme)

Very high memory usage by big sitemaps

Reported by: Piotr Maliński
Owned by: nobody
Component: contrib.sitemaps
Version: dev
Severity: Normal
Keywords:
Cc: simon@…
Triage Stage: Accepted
Has patch: no
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

I'm using Python 2.5, Django 1.X, and Nginx/FastCGI hosting at megiteam.pl. The site has a big sitemap: 9K elements in a 1.7 MB sitemap.xml file. The sitemap is done by the book with the Sitemap framework:

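from django.contrib.sitemaps import Sitemap

# JobOffer is the reporter's own model; import it from your app's models module.


class MainMap(Sitemap):
    changefreq = "never"
    priority = 0.5

    def items(self):
        # Every published, active offer becomes one sitemap entry.
        return JobOffer.objects.filter(published=True, inactive=False)

    def lastmod(self, obj):
        return obj.published_at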

The problem is that memstat -v, run after requesting sitemap.xml, shows memory usage jumping from 6 MB just after a restart to 105 MB, and it stays at that level on every subsequent request for the sitemap (the file itself is 1.7 MB). If I limit the query to 1000 elements, I get ~22 MB memory usage.

Change History (10)

comment:1 by James Bennett, 15 years ago

Resolution: invalid
Status: new → closed

I'm not sure what the bug is here; querying thousands of objects in one go (as you're doing when you fetch the list of items) should be expected to increase memory use. You may want to look into alternative query methods which allow you to conserve memory, but in general this is going to be more memory-intensive as the number of objects involved grows.

comment:2 by Jason Davies, 15 years ago

Resolution: invalid
Status: closed → reopened

How about we modify the implementation to stream the response, instead of rendering the template into potentially quite a large string for every request?
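A minimal sketch of what "streaming" could mean at the HTTP level, assuming the sitemap is emitted by a generator rather than rendered into one string. The names `iter_sitemap_xml` and `fake_entries` and the (location, lastmod) pair layout are hypothetical, chosen only for illustration; this is not how contrib.sitemaps is implemented.

```python
from xml.sax.saxutils import escape


def iter_sitemap_xml(entries):
    """Yield a sitemap document piece by piece instead of building one string.

    `entries` is any iterable of (location, lastmod) string pairs.
    """
    yield '<?xml version="1.0" encoding="UTF-8"?>\n'
    yield '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    for location, lastmod in entries:
        # Each <url> element is produced on demand, so only one entry's
        # markup needs to exist at a time.
        yield "<url><loc>%s</loc><lastmod>%s</lastmod></url>\n" % (
            escape(location),
            escape(lastmod),
        )
    yield "</urlset>\n"


def fake_entries(count):
    """Stand-in for database rows, generated lazily (hypothetical data)."""
    for i in range(count):
        yield ("https://example.com/offer/%d/?a=1&b=2" % i, "2009-07-28")


chunks = list(iter_sitemap_xml(fake_entries(3)))
```

In a Django view, a generator like this could be handed to `StreamingHttpResponse` in releases that provide it. Note that this only avoids the large rendered string; it does nothing about the cost of the objects that feed it.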

comment:3 by James Bennett, 15 years ago

Resolution: invalid
Status: reopened → closed

Your memory problems are mostly due to the overhead of all those model objects; the rendered template won't be nearly as big. At any rate, there are other tickets open regarding streaming HTTP responses, which I'd advise you to look at.

comment:4 by EmilStenstrom, 14 years ago

@ubernostrum: This issue took down our site and we've been debugging it for a week straight. Who would have thought a sitemap could make the Apache process consume so much memory that it finally gets killed?

Anyway. I believe two things should be done to prevent this from happening to sites that grow:

1) Lower the default number of elements that get inserted into a sitemap. Let people raise this value back to the old default with an override.

2) The memory is not released after the sitemap has been generated (I think). Tested on Windows by looking at the memory used by python.exe before and after generating the sitemap (13 MB before, 87 MB after). Also, looking at the RSS memory on an Ubuntu server, it jumps 90 MB when accessing the sitemap, and does not go down when it's done.
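Point 1 amounts to splitting the sitemap into smaller pages. In Django the `Sitemap` class exposes a `limit` attribute (50000 by default) that plays this role together with the sitemap index view. The page arithmetic can be sketched in plain Python as follows; the helper names `page_count` and `page_slice` are hypothetical.

```python
import math


def page_count(total_items, limit):
    """Number of sitemap pages needed for total_items at `limit` per page."""
    return max(1, math.ceil(total_items / limit))


def page_slice(page, limit):
    """Slice selecting the items of a 1-based page number."""
    start = (page - 1) * limit
    return slice(start, start + limit)


items = list(range(9000))  # stand-in for 9K sitemap entries
pages = page_count(len(items), 1000)
second_page = items[page_slice(2, 1000)]
```

With a lower limit, each request only has to build one page's worth of objects, at the cost of more requests from the crawler.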

I think this bug should be opened again.

comment:5 by Simon Litchfield, 12 years ago

Cc: simon@… added
Component: Contrib apps → contrib.sitemaps
Easy pickings: unset
Resolution: invalid
Severity: Normal
Status: closed → reopened
Triage Stage: Unreviewed → Design decision needed
Type: Cleanup/optimization
UI/UX: unset
Version: 1.0 → master

Big sitemaps have crashed my site(s) too. Hate to reopen, but the current approach is inadequate. Sitemaps need to stream by default.

comment:6 by Simon Litchfield, 12 years ago

Just remembered, I wrote this a while back. I've put it up on GitHub:
https://github.com/s29/django-fastsitemaps

Maybe streaming could be included in core sitemaps as an option.

comment:7 by anonymous, 12 years ago

Setting aside building a string vs streaming HTTP, surely it is a bug that the memory is never released for the life of the process, no?

comment:8 by Aymeric Augustin, 12 years ago

Status: reopened → new

comment:9 by Aymeric Augustin, 12 years ago

Triage Stage: Design decision needed → Accepted

Streaming responses are now supported in core. But as pointed out earlier in the comments, this isn't the problem; the problem is pulling objects from the database. So I'm not sure there's actually much to be gained on this side.

Currently, as soon as you access the first object of a queryset, the entire queryset is brought into memory. This might be optimized with server-side cursors, but that's another story (with its own share of problems).
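To illustrate the difference between loading every row at once and pulling rows in batches, here is a rough sketch using the standard-library `sqlite3` module and the DB-API `fetchmany` call. The `offer` table and the `iter_rows` helper are hypothetical, and SQLite runs in-process, so this shows only the chunked-fetch pattern, not a true server-side cursor.

```python
import sqlite3


def iter_rows(cursor, chunk_size=500):
    """Yield rows from an executed DB-API cursor in fixed-size batches."""
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        for row in rows:
            yield row


# Hypothetical table standing in for the job-offer model.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE offer (id INTEGER PRIMARY KEY, published_at TEXT)")
conn.executemany(
    "INSERT INTO offer (published_at) VALUES (?)",
    [("2009-07-28",) for _ in range(9000)],
)

cursor = conn.execute("SELECT id, published_at FROM offer ORDER BY id")
count = 0
for offer_id, published_at in iter_rows(cursor):
    count += 1  # a real view would emit one <url> element here
```

With the Django ORM, the closest built-in is probably `QuerySet.iterator()`, which avoids filling the queryset's result cache. Whether the database driver still buffers the full result set on the client depends on the backend.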

I agree that there's some room for optimization here. To move forward, this ticket needs:

  • a concrete proposal: a link to a project with similar goals isn't sufficient. Which parts do you want to integrate exactly, and how do you guarantee backwards compatibility?
  • a benchmark proving the benefits
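As a starting point for such a benchmark, the sketch below compares peak Python-level allocations when all objects are held in a list versus when they are consumed one at a time. The `Row` class and helper names are hypothetical stand-ins for model instances, and `tracemalloc` measures allocations made through Python's allocator rather than process RSS, so the numbers are indicative only.

```python
import tracemalloc


class Row:
    """Hypothetical stand-in for a model instance."""

    def __init__(self, pk):
        self.pk = pk
        self.url = "/offer/%d/" % pk
        self.published_at = "2009-07-28"


def make_rows(count):
    for pk in range(count):
        yield Row(pk)


def render_all_at_once(count):
    rows = list(make_rows(count))  # every object alive at the same time
    return sum(len(row.url) for row in rows)


def render_lazily(count):
    # Objects are created and discarded one at a time.
    return sum(len(row.url) for row in make_rows(count))


def peak_bytes(func, *args):
    """Return the peak traced allocation, in bytes, while func runs."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


eager_peak = peak_bytes(render_all_at_once, 9000)
lazy_peak = peak_bytes(render_lazily, 9000)
print("eager: %d bytes, lazy: %d bytes" % (eager_peak, lazy_peak))
```

A benchmark for this ticket would need to replace `Row` with real model instances and also record process-level memory, since that is what the reports above describe.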

comment:10 by Aymeric Augustin, 10 years ago

Resolution: worksforme
Status: new → closed

Reading this ticket again, there's just some handwaving about magical "streaming" that's going to fix everything, but it isn't clear whether that's HTTP streaming or server-side database cursors...

In the absence of a concrete proposal, I'm going to close this ticket as "needsinfo". Please reopen if you can suggest implementation changes or documentation changes.
