Enhancements to the Sitemaps Framework so it works better for large sites
|Reported by:||Mike Lissner||Owned by:||nobody|
|Has patch:||no||Needs documentation:||no|
|Needs tests:||no||Patch needs improvement:||no|
I've been using the sitemaps framework on my site for a little while now, and it occurs to me that, while the current implementation is good, there may be ways to improve it. Forgive me if any of this has been mentioned elsewhere. I did some looking and didn't find it, so hopefully this is the right place for the discussion.
The problems I see with the sitemaps are three:
- Generating a sitemap takes a LOT of IO, DB and CPU.
- There is no way to trigger an update when certain pages change.
- There is no good caching mechanism for them.
I'll explain. The first problem for sitemaps is that they are generally large pages with LOTS of calls to the DB. The max for Google is 50K URLs per sitemap, which means at least 50K things pulled from the DB. If you have custom date-modified and custom priority fields for each of these, that's 150K records from the DB. Bam, one of your DB threads is tied up. If you have an indexed sitemap set up, a crawler can even request a bunch of your sitemaps simultaneously, in which case, bam, all your DB threads are tied up.
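To put rough numbers on this, here is a minimal sketch of the pagination arithmetic, assuming the 50K-per-sitemap limit mentioned above. The function names are hypothetical and not part of the sitemaps framework; they only show how many separate sitemap pages a crawler could request for a given number of URLs, and which page a given item lands on.

```python
import math

# Per-sitemap URL limit cited above; treated here as a plain constant.
MAX_URLS_PER_SITEMAP = 50_000


def sitemap_page_count(total_urls: int, per_page: int = MAX_URLS_PER_SITEMAP) -> int:
    """Return how many sitemap pages are needed to list total_urls URLs."""
    if total_urls < 0 or per_page <= 0:
        raise ValueError("total_urls must be >= 0 and per_page must be > 0")
    # Even an empty site still serves one (empty) sitemap page.
    return max(1, math.ceil(total_urls / per_page))


def page_for_position(position: int, per_page: int = MAX_URLS_PER_SITEMAP) -> int:
    """Return the 1-based sitemap page holding the URL at a 0-based position."""
    if position < 0 or per_page <= 0:
        raise ValueError("position must be >= 0 and per_page must be > 0")
    return position // per_page + 1
```

For a site with 120,000 URLs this gives three sitemap pages, each of which is a separate, expensive request if a crawler fetches them all at once.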
The second problem affects sites whose content rarely changes once it is published. As an owner of such a site, my sitemaps don't ever change aside from the last page in the index, which is where the new content is listed. All the older sitemaps almost never change, so they almost never need to be regenerated. Occasionally they do, but very rarely. This is the case for lots of types of websites:
- news sites (ahem, lawrence.com)
- e-commerce, where new products come, but old ones are largely static
- reference sites (like mine)
The third problem I mentioned above is that there is no good caching mechanism for the sitemaps. They can be cached by one of the caching mechanisms listed in the caching documentation, but since they can be quite big, and since they often don't change, using a third of your RAM-based cache for your sitemaps is not a great option. Since there's no way to choose a different cache backend for a different page on the site, this becomes challenging.
So... I'm not exactly brimming with ideas on how to change these things, but I thought I would mention them formally here and see what discussion ensued. My only two ideas both center on a caching mechanism of some kind:
- assuming that the database or the filesystem is the best type of cache for these (since they're large and mostly static), a method to easily set the cache backend for sitemaps would be incredibly useful.
- having a system of triggers for sitemap regeneration would also be amazing, so that rather than generating the sitemap every time it is requested, it could be generated only when there is a change. (I suppose this could be configured in the views that create new content, but that seems a little hackish.)
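The two ideas above can be combined in a rough sketch like the following. It is not an existing Django API: `SitemapFileCache`, its methods, and the `render` callback are all hypothetical names, and the code uses only the Python standard library. Each rendered sitemap page is stored as a file on disk, and a page is only re-rendered after something explicitly invalidates it.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


class SitemapFileCache:
    """Hypothetical file-backed cache: one file per rendered sitemap page."""

    def __init__(self, directory: str, render: Callable[[str, int], str]):
        # render(section, page) stands in for the expensive sitemap generation.
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.render = render

    def _path(self, section: str, page: int) -> Path:
        # Assumes section names are simple slugs that are safe in filenames.
        return self.directory / f"sitemap-{section}-{page}.xml"

    def get(self, section: str, page: int) -> str:
        """Return the cached XML, rendering and storing it on a miss."""
        path = self._path(section, page)
        try:
            return path.read_text(encoding="utf-8")
        except FileNotFoundError:
            pass
        xml = self.render(section, page)
        # Write to a temporary file, then rename, so readers never see a partial file.
        fd, tmp_name = tempfile.mkstemp(dir=self.directory, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(xml)
        os.replace(tmp_name, path)
        return xml

    def invalidate(self, section: str, page: int) -> None:
        """Trigger: drop one cached page so the next request regenerates it."""
        self._path(section, page).unlink(missing_ok=True)
```

In a Django project, `invalidate` could plausibly be called from a receiver connected to the `post_save` signal rather than from the views themselves, which would avoid the hackish feel mentioned above. For a site like mine, that receiver would usually only need to invalidate the last page of the index.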
I'm curious what people's thoughts on this are, and perhaps what their solutions are. The current sitemap framework is great in its simplicity, but for real sitemaps on big sites I don't think it works all that well. Maybe I'm missing something.