Opened 4 years ago

Closed 2 months ago

#13543 closed New feature (needsinfo)

Enhancements to the Sitemaps Framework so it works better for large sites

Reported by: mlissner
Owned by: nobody
Component: contrib.sitemaps
Version: master
Severity: Normal
Keywords:
Cc:
Triage Stage: Someday/Maybe
Has patch: no
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

I've been using the sitemaps framework on my site for a little while now, and it occurs to me that, while the current implementation is good, maybe there are ways it could be improved. Forgive me if any of this has been mentioned elsewhere, but I did some looking, and didn't find it, so hopefully not, and hopefully this is the right place for such a discussion.

The problems I see with the sitemaps are three:

  1. Generating a sitemap takes a LOT of IO, DB and CPU.
  2. There is no way to trigger an update when certain pages change.
  3. There is no good caching mechanism for them.

I'll explain. The first problem for sitemaps is that they are generally large pages with LOTS of calls to the DB. The max for Google is 50k pages/sitemap, which means at least 50k things pulled from the DB. If you have custom date-modified and custom priority fields for each of these, that's 150k records from the DB. Bam, one of your DB threads is tied up. If you have an indexed sitemap set up, it's even possible for a crawler to request a bunch of your sitemaps simultaneously, in which case, bam, all your DB threads are tied up.

The second problem is for sites that create content that doesn't change very often. As an owner of such a site, my sitemaps don't ever change aside from the last page in the index, which is where the new content is listed. All the older sitemaps almost never change, so they almost never need to be regenerated. Occasionally they do, but very rarely. This is the case for lots of types of websites:

  • blogs
  • news sites (ahem, lawrence.com)
  • e-commerce, where new products come, but old ones are largely static
  • reference sites (like mine)
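To make the "only the last page changes" point concrete, here is a small sketch. It assumes items are listed in creation order, are never deleted, and are split into fixed-size pages of 50,000 (the per-sitemap limit mentioned above). The function names are hypothetical and not part of the sitemaps framework.

```python
PAGE_SIZE = 50_000  # per-sitemap URL limit mentioned in the description


def page_for(position, page_size=PAGE_SIZE):
    """Return the 1-based sitemap page holding the item at a 0-based position."""
    return position // page_size + 1


def pages_touched_by_append(old_count, new_count, page_size=PAGE_SIZE):
    """Return the pages whose contents change when the item count grows
    from old_count to new_count (append-only, creation order assumed)."""
    if new_count <= old_count:
        return []
    first = page_for(old_count, page_size)      # page receiving the first new item
    last = page_for(new_count - 1, page_size)   # page receiving the last new item
    return list(range(first, last + 1))


# Ten new items on a site with 120,000 existing items only affect page 3.
print(pages_touched_by_append(120_000, 120_010))
```

Under these assumptions, pages 1 and 2 would never need to be regenerated unless an older item is edited or removed.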

The third problem I mentioned above is that there is no good caching mechanism for the sitemaps. They can be cached by one of the caching mechanisms listed in the caching documentation, but since they can be quite big, and since they often don't change, using a third of your RAM-based cache for your sitemaps is not a great option. Since there's no way to choose a different cache backend for a different page on the site, this becomes challenging.

So... I'm not overflowing with ideas on how to change these things, but I thought I would mention them formally here and see what discussion ensues. My only two ideas both center on a caching mechanism of some kind:

  • assuming that the database or the filesystem is the best type of cache for these (since they're large and mostly static), a method to easily set the cache backend for sitemaps would be incredibly useful.
  • having a system of triggers for sitemap regeneration would also be amazing, so that rather than actually generating the sitemap whenever it is pinged, instead it could be generated only when there is a change. (I suppose this could be configured in the views that create new content, but that seems a little hackish.)
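As a rough illustration of the two ideas above combined, here is a framework-agnostic sketch: each rendered sitemap page is kept in a file on disk, and it is only rebuilt after something explicitly invalidates it. The class `FileCachedSitemap`, its methods, and the `fetch_urls` callable are all hypothetical names made up for this example; none of this is an existing Django API.

```python
import os
import tempfile
from xml.sax.saxutils import escape


class FileCachedSitemap:
    """Hypothetical helper: stores each rendered sitemap page as a file."""

    def __init__(self, cache_dir, fetch_urls):
        self.cache_dir = cache_dir
        self.fetch_urls = fetch_urls  # callable: page number -> iterable of URLs
        self.render_count = 0         # counts expensive renders, for illustration
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, page):
        return os.path.join(self.cache_dir, f"sitemap-{page}.xml")

    def _render(self, page):
        # The expensive step: in a real site this is where the DB gets hit.
        self.render_count += 1
        entries = "".join(
            f"<url><loc>{escape(url)}</loc></url>" for url in self.fetch_urls(page)
        )
        return (
            '<?xml version="1.0" encoding="UTF-8"?>'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
            f"{entries}</urlset>"
        )

    def get(self, page):
        """Serve the cached file if present; otherwise render and store it."""
        path = self._path(page)
        try:
            with open(path, encoding="utf-8") as f:
                return f.read()
        except FileNotFoundError:
            pass
        xml = self._render(page)
        # Write to a temporary file, then rename, so readers never see a partial file.
        fd, tmp_path = tempfile.mkstemp(dir=self.cache_dir)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(xml)
        os.replace(tmp_path, path)
        return xml

    def invalidate(self, page):
        """The 'trigger': drop the cached page so the next request rebuilds it."""
        try:
            os.remove(self._path(page))
        except FileNotFoundError:
            pass
```

In a Django project, one plausible way to wire up the trigger would be to call `invalidate()` from a `post_save` signal receiver on the relevant model rather than from each view, though how a changed object maps to a page number is left open here.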

I'm curious what people's thoughts on this are, and perhaps what their solutions are. The current sitemap framework is great in its simplicity, but I don't think it works all that well for real sitemaps on big sites. Maybe I'm missing something.

Attachments (0)

Change History (6)

comment:1 Changed 4 years ago by russellm

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Triage Stage changed from Unreviewed to Someday/Maybe

Sure. And a pony :-)

Trac isn't the best place to have an open discussion on a topic like this; if you want to discuss details, take it to django-developers. However, I'm certainly interested in improvements to sitemaps to deal with huge sitemaps.

Marking this someday/maybe because it's a vague proposal; show us a patch and we'll move it to something more concrete.

comment:2 Changed 3 years ago by gabrielhurley

  • Component changed from Contrib apps to contrib.sitemaps

comment:3 Changed 3 years ago by julien

  • Severity set to Normal
  • Type set to New feature

comment:4 Changed 2 years ago by aaugustin

  • UI/UX unset

Change UI/UX from NULL to False.

comment:5 Changed 2 years ago by aaugustin

  • Easy pickings unset

Change Easy pickings from NULL to False.

comment:6 Changed 2 months ago by aaugustin

  • Resolution set to needsinfo
  • Status changed from new to closed

Almost four years later, this still needs a concrete and actionable proposal.
