Opened 18 years ago

Closed 18 years ago

Last modified 17 years ago

#2934 closed enhancement (fixed)

[patch] validators.isExistingURL is frequently wrong

Reported by: jdunck@… Owned by: Adrian Holovaty
Component: Validators Version: 0.95
Severity: normal Keywords:
Cc: Triage Stage: Unreviewed
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

The existing isExistingURL validator uses urllib2's default user agent string, which is commonly rejected by servers.

Similarly, the validator fails if a 301 or 302 is returned, though a 401 is accepted as passing.

I think it's better to claim to support all sorts of responses, allow a configurable user agent (via settings) and accept 301,302 as valid. As a philosophical issue, we could perhaps loop on 301,302, calling it a failure after a certain number of tries, but then you might fall into a cookied tarpit which is valid, but requires a cookie store. Semi-aside, hey, httplib2 is nice.

Sorry, no patch; I'm on 0.91 and can't easily diff w/ trunk. Even so, here's my local isExistingURL:

def isExistingURL(field_data, all_data):
    import urllib2
    try:
        headers = {
            "Accept" : "text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5",
            "Accept-Language" : "en-us,en;q=0.5",
            "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.7",
            "Connection" : "close",
            "User-Agent":URL_FETCH_USER_AGENT
            }
        req = urllib2.Request(field_data,None, headers)
        u = urllib2.urlopen(req)
    except ValueError:
        raise ValidationError, _("Invalid URL: %s") % field_data
    except urllib2.HTTPError, e:
        # 401s are valid; they just mean authorization is required.
        # 301 and 302 are redirects; they just mean look somewhere else.
        if str(e.code) not in ('401','301','302'):
            raise ValidationError, _("The URL %s is a broken link.") % field_data
    except: # urllib2.URLError, httplib.InvalidURL, etc.
        raise ValidationError, _("The URL %s is a broken link.") % field_data

Change History (6)

comment:1 by jdunck@…, 18 years ago

Summary: validators.isExistingURL is frequently wrong[patch] validators.isExistingURL is frequently wrong

Claiming it as a patch since applying it is straight-forward.

May I suggest Firefox as the global_settings.py value of URL_FETCH_USER_AGENT?

comment:2 by Jacob, 18 years ago

By default I don't think we should make Django lie about who it is...

comment:3 by Jacob, 18 years ago

Resolution: fixed
Status: newclosed

(In [4035]) Fixed #2934: greatly improved the accuracy if the isExistingURL check. Also introduced a new setting, URL_VALIDATOR_USER_AGENT, which is the User-Agent that the validator will use to check for URL existance. Thanks, Jeremy.

comment:4 by Jeremy Dunck <jdunck@…>, 18 years ago

FYI, I had to use a "real" user agent string-- there are, unbelievably, sites out there which throw a 500 if they don't recognize the UA string.

comment:5 by Jacob, 18 years ago

Yeah, I know that -- my banking site, for one, which broke when FF2.0 came out -- but I just feel skeezy putting a real UA in the default Django settings.

comment:6 by (none), 17 years ago

milestone: Version 1.0

Milestone Version 1.0 deleted

Note: See TracTickets for help on using tickets.
Back to Top