Opened 6 years ago

Closed 3 years ago

#10793 closed Bug (fixed)

Sitemap implementation means it cannot be updated automatically

Reported by: tenkujin Owned by: gnosek
Component: contrib.sitemaps Version: 1.4-alpha-1
Severity: Normal Keywords:
Cc: Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

The Sitemap class uses an attribute _paginator which stores all the URLs. The method items() is thus called only once for the creation of the paginator, and subsequent invocations of the sitemap view do not call it. If the items() method is meant to return dynamically changing items (which is a common case), changes are not reflected in the sitemap. The only way to fix this is to either restart Django (not acceptable), override the get_urls() method and update the _paginator (not nice), or re-implement the Sitemap object to not have such semantics (violates DRY).

Since this paginator is a hack to accommodate Google's arbitrary 50k URL limit, I think it would be best to have a Sitemap which can be updated at runtime, and an optional GoogleSitemap for this particular search engine.

Attachments (4)

sitemaps.patch (839 bytes) - added by Denis Martinez <deuns.martinez@…> 6 years ago.
proposed solution for dynamic sitemaps
paginator.diff (1.7 KB) - added by krzysiumed 3 years ago.
Dynamic paginator creation, checking for DB changes and time limit
paginator_doc.diff (3.0 KB) - added by krzysiumed 3 years ago.
Added documentation to previous patch, moved checking code to separate method.
10793.diff (1.8 KB) - added by gnosek 3 years ago.

Download all attachments as: .zip

Change History (19)

comment:1 Changed 6 years ago by julianb

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

"subsequent invocations of the sitemap view do not call it"

Are you sure? AFAIK Django treats every request as a completely new one.

comment:2 follow-up: Changed 6 years ago by arthurk

The 50k limit is not arbitrary. Google, like MSN and Yahoo, use the sitemaps.org standard which specifies the 50k limit.

See http://sitemaps.org/protocol.php#index for more Information.

comment:3 in reply to: ↑ 2 Changed 6 years ago by tenkujin

Replying to arthurk:

The 50k limit is not arbitrary. Google, like MSN and Yahoo, use the sitemaps.org standard which specifies the 50k limit.

See http://sitemaps.org/protocol.php#index for more Information.

Still, by storing the paginator in the self._paginator field and not providing a way to clear it (signals? cache? method?) means we need to fiddle with the internals of Django sitemaps simply to have dynamic sitemaps. I think the pagination logic should somehow check to see whether the query set it uses has changed since the last time the paginator was created and assigned to _paginator

Changed 6 years ago by Denis Martinez <deuns.martinez@…>

proposed solution for dynamic sitemaps

comment:4 Changed 6 years ago by Denis Martinez <deuns.martinez@…>

Here's a solution to the problem, which I'm currently using on my website.
I applied the above patch to django 1.1 and I modified urls.py to look like this:

urlpatterns = patterns(
    '',
    url(r'^sitemap.xml$', 'django.contrib.sitemaps.views.index', {'sitemaps': get_sitemaps}, name='sitemap_index'),
    url(r'^sitemap-(?P<section>.+)\.xml$', 'django.contrib.sitemaps.views.sitemap', {'sitemaps': get_sitemaps}, name='sitemap'),
    )

where get_sitemaps is a function which returns the sitemaps directory.

def get_sitemaps():
    sitemaps = {}
    sitemaps['blogs'] = BlogSitemap
    ......
    return sitemaps

Using this method my sitemaps are instanciated on every request, and I get a new paginator, and expected results.

comment:5 Changed 6 years ago by xdissent

The approach in sitemaps.patch invalidates the current documentation, and doesn't take into consideration the fact that an previously instantiated Sitemap instance may be used instead of a subclass. I don't know if the pagination should really be handled by the Sitemap class at all, since it seems like a more view specific attribute. Since ditching the class based pagination would require a lot of view changes, I've been using a subclass of Sitemap whose paginator attribute is instantiated each time it is accessed. When using the current sitemap views, the paginator will only be accessed once, so only one paginator is initialized for each request (luckily).

For reference, I believe this bug only shows up if you add a Sitemap instance to the sitemaps dict. Using classes only shouldn't be a problem since the short-lived instance will be recreated each time the view is called.

Should we dumb down Sitemap and move pagination into the views? It seems like that would be more flexible.

comment:6 Changed 6 years ago by jacob

  • Triage Stage changed from Unreviewed to Accepted

comment:7 Changed 4 years ago by gabrielhurley

  • Component changed from Contrib apps to contrib.sitemaps

comment:8 Changed 4 years ago by SmileyChris

  • Severity set to Normal
  • Type set to Bug

Changed 3 years ago by krzysiumed

Dynamic paginator creation, checking for DB changes and time limit

comment:9 Changed 3 years ago by krzysiumed

  • Easy pickings unset
  • Has patch set
  • UI/UX unset

We check time to life of paginator. If it is older than 10 minutes (Sitemap.reload_delta), we force recreation of new paginator. If it is older than 1 minute (Sitemap.counter_delta) we check whether number of items in DB has changed and recreate paginator if so.

comment:10 Changed 3 years ago by krzysiumed

  • Version changed from 1.0 to 1.4-alpha-1

Changed 3 years ago by krzysiumed

Added documentation to previous patch, moved checking code to separate method.

comment:11 Changed 3 years ago by oinopion

  • Needs tests set

comment:12 Changed 3 years ago by gnosek

  • Owner changed from nobody to gnosek
  • Status changed from new to assigned

Changed 3 years ago by gnosek

comment:13 Changed 3 years ago by adam_przybyla

  • Needs tests unset
  • Triage Stage changed from Accepted to Ready for checkin

comment:14 Changed 3 years ago by gnosek

The Paginator object is needlessly cached for the whole lifetime of the Sitemap object, which may be the whole lifetime of the server.

After discussing with jezdez, we considered the potential performance impact (evaluating len(sitemap.items()) at every request acceptable for the time being. Fixing this properly would require Paginators to lazily determine the number of pages only when actually needed, which is quite outside of the scope of this ticket.

comment:15 Changed 3 years ago by jezdez

  • Resolution set to fixed
  • Status changed from assigned to closed

In [17468]:

Fixed #10793 -- Stopped caching paginator instances in sitemap classes to prevent stale sitemaps. Thanks, gnosek, krzysiumed and adam_przybyla.

Note: See TracTickets for help on using tickets.
Back to Top