
Opened 2 years ago

Closed 12 months ago

#18351 closed New feature (fixed)

Sitemaps should use the X-Robots-Tag HTTP header

Reported by: mlissner
Owned by: mlissner
Component: contrib.sitemaps
Version: 1.4
Severity: Normal
Keywords: decorator, sitemap.xml, robots
Cc:
Triage Stage: Accepted
Has patch: yes
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: yes
UI/UX: no

Description

Major search engines currently support three ways that content can be blocked:

  1. On an HTML page, you can provide a robots meta tag that says nocrawl or noindex (or both).
  2. You can send an X-Robots-Tag HTTP header that says nocrawl or noindex (or both).
  3. You can list the resource in your robots.txt file.
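For reference, the three mechanisms look roughly like this (the paths and values below are illustrative, not taken from this ticket):

```text
<!-- 1. robots meta tag in the page's HTML -->
<meta name="robots" content="noindex">

# 2. HTTP response header
X-Robots-Tag: noindex

# 3. robots.txt entry
User-agent: *
Disallow: /private/
```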

The distinction between nocrawl (including robots.txt) and noindex is subtle but important.

*Nocrawl* means that crawlers should stay out and not even visit the page. Robots.txt entries and the nocrawl tags accomplish this. Contrary to the *extremely* common belief, placing a resource in robots.txt or putting a nocrawl meta tag on it will *not* prevent it from showing up in search results. The reason is that if Google or Bing knows about a page, that page will show up in search results until it is looked into further. Later, when that happens, the crawler will detect that it's blocked by robots.txt or by a nocrawl directive. If so, it won't crawl the page (as requested), *but* the page will continue to appear in search results unless there's a *noindex* flag.

Here's a short video from Matt Cutts (Google employee) explaining this oddity: http://www.youtube.com/watch?v=KBdEwpRQRD0

And Microsoft has it documented here: http://www.bing.com/community/site_blogs/b/webmaster/archive/2009/08/21/prevent-a-bot-from-getting-lost-in-space-sem-101.aspx
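The behavior described above can be summarized as a toy decision function (my simplification for illustration, not anything from a search engine's actual specification):

```python
def appears_in_results(known_to_engine, blocked_by_robots_txt, has_noindex):
    """Toy model of the crawl/index logic described above.

    A page the engine knows about stays in results unless the crawler
    could actually fetch it AND found a noindex directive. A robots.txt
    block prevents the fetch, so the noindex is never seen.
    """
    if not known_to_engine:
        return False
    if blocked_by_robots_txt:
        # Can't crawl, so can't see any noindex flag: the page stays listed.
        return True
    return not has_noindex
```

Note the counterintuitive case: a known page that is both blocked by robots.txt and marked noindex still appears in results, because the block hides the noindex from the crawler.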

*Noindex* means: please crawl the page, but do not include it in the index. This is what we should be using on our sitemaps. Since we don't currently send the noindex HTTP header, sitemaps made with Django will appear in search results even though they're pretty much useless there. You can see this with clever Google searches for things like [ sitemap.xml site:django-site.com ].

This oddity causes an additional problem because the only way to prevent a page from appearing in Bing or Google is currently:

  • to include it in your sitemap so that it will be crawled as soon as possible; and
  • to place a noindex tag on the page or resource itself.

The site I run has strict requirements when it comes to this fun topic, and there are a lot of people who believe robots.txt works, so I've written up my findings on this: http://michaeljaylissner.com/blog/respecting-privacy-while-providing-hundreds-of-thousands-of-public-documents

I'll write up a patch (my first) to fix this, and will submit it shortly.

Attachments (1)

add_http_tag.diff (1.8 KB) - added by mlissner 2 years ago.
Removed noimageindex, which doesn't make sense in this context.


Change History (11)

comment:1 Changed 2 years ago by mlissner

  • Needs documentation unset
  • Needs tests unset
  • Owner changed from nobody to mlissner
  • Patch needs improvement unset
  • Status changed from new to assigned

comment:2 Changed 2 years ago by mlissner

  • Has patch set

I've added a patch to implement the above. Would love review.

Changed 2 years ago by mlissner

Removed noimageindex, which doesn't make sense in this context.

comment:3 Changed 2 years ago by mlissner

Re-read this and figured I'd better summarize: this patch makes it so that Django sitemaps don't appear in search results.

comment:4 Changed 23 months ago by andrewgodwin

  • Triage Stage changed from Unreviewed to Ready for checkin

comment:5 Changed 21 months ago by mlissner

This is marked as ready for checkin. Is there anything I need to do to get this pushed into core?

comment:6 Changed 20 months ago by agestart@…

  • Has patch unset
  • Keywords decorator, sitemap.xml, robots added

Maybe something like this decorator?

from functools import wraps

def x_robots_tag(func):
    @wraps(func)
    def inner(request, *args, **kwargs):
        response = func(request, *args, **kwargs)
        response['X-Robots-Tag'] = 'noindex, noodp, noarchive'
        return response
    return inner


    url(r'^sitemap\.xml$', x_robots_tag(index), {'sitemaps': sitemaps}),

comment:7 Changed 19 months ago by claudep

  • Triage Stage changed from Ready for checkin to Accepted

Tests are missing (trivial). Also, I don't understand why you transformed the TemplateResponse into an HttpResponse. What problem does that solve?
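For what it's worth, a minimal dependency-free check of the decorator idea from comment:6 might look like this (FakeResponse is a made-up stand-in for HttpResponse, which also supports setting headers via item access):

```python
from functools import wraps

def x_robots_tag(func):
    # Same pattern as the decorator suggested above, using stdlib functools.
    @wraps(func)
    def inner(request, *args, **kwargs):
        response = func(request, *args, **kwargs)
        response['X-Robots-Tag'] = 'noindex, noodp, noarchive'
        return response
    return inner

class FakeResponse(dict):
    """Minimal stand-in for HttpResponse: headers are set via item access."""

@x_robots_tag
def fake_sitemap_view(request):
    return FakeResponse()

# The wrapped view's response carries the header.
assert fake_sitemap_view(None)['X-Robots-Tag'] == 'noindex, noodp, noarchive'
```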

comment:8 Changed 13 months ago by aaugustin

  • Component changed from Core (Other) to contrib.sitemaps

comment:9 Changed 12 months ago by morty

  • Has patch set

I've created a branch containing tests and a fix based on the suggested use of a decorator.

https://github.com/morty/django/tree/ticket_18351

comment:10 Changed 12 months ago by Claude Paroz <claude@…>

  • Resolution set to fixed
  • Status changed from assigned to closed

In 66c83dce074b48342dbfd0d9039c76b8949f0833:

Fixed #18351 -- Added X-Robots-Tag header to sitemaps

Thanks Michael Lissner for the report and initial patch, and
Tom Mortimer-Jones for working on the patch.
