Opened 4 years ago

Closed 2 years ago

Last modified 2 years ago

#15152 closed Bug (fixed)

Common middleware raises UnicodeDecodeError if receives non-ASCII QUERY_STRING from buggy web server

Reported by: Loststylus Owned by: aaugustin
Component: Core (Other) Version: 1.2
Severity: Normal Keywords: common middleware
Cc: Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

FlaxCrawler seems to like my site and visits it very often, always getting a 500 error.

Here's the common traceback:

Traceback (most recent call last):

 File "/usr/local/lib/python2.6/dist-packages/django/core/handlers/base.py", line 80, in get_response
   response = middleware_method(request)

 File "/usr/local/lib/python2.6/dist-packages/django/middleware/common.py", line 79, in process_request
   newurl += '?' + request.META['QUERY_STRING']

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 3: ordinal not in range(128)


<WSGIRequest
GET:<QueryDict: {u'q': [u'\u0427\u0430\u0439\u043a\u0430']}>,
POST:<QueryDict: {}>,
COOKIES:{},
META:{'CONTENT_LENGTH': '',
 'CONTENT_TYPE': '',
 'HTTP_ACCEPT': 'text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2',
 'HTTP_ACCEPT_ENCODING': 'gzip,defalte',
 'HTTP_ACCEPT_LANGUAGE': 'ru,en-us;q=0.7,en;q=0.3',
 'HTTP_CACHE_CONTROL': 'no-cache',
 'HTTP_CONNECTION': 'close',
 'HTTP_HOST': '{sorry, i've got that censored out}',
 'HTTP_PRAGMA': 'no-cache',
 'HTTP_USER_AGENT': 'FlaxCrawler/1.0',
 'PATH_INFO': u'/articles/ajaxsearch',
 'QUERY_STRING': 'q=\xd0\xa7\xd0\xb0\xd0\xb9\xd0\xba\xd0\xb0',
 'REMOTE_ADDR': '92.241.173.132',
 'REQUEST_METHOD': 'GET',
 'SCRIPT_NAME': u'',
 'SERVER_NAME': '{sorry, i've got that censored out}',
 'SERVER_PORT': '80',
 'SERVER_PROTOCOL': 'HTTP/1.1',
 'wsgi.errors': <flup.server.fcgi_base.TeeOutputStream object at 0x2634850>,
 'wsgi.input': <flup.server.fcgi_base.InputStream object at 0x2634610>,
 'wsgi.multiprocess': True,
 'wsgi.multithread': False,
 'wsgi.run_once': False,
 'wsgi.url_scheme': 'http',
 'wsgi.version': (1, 0)}>

The major problem I see here is that the developer cannot do anything to catch the error :(

Change History (18)

comment:2 Changed 4 years ago by lukeplant

There is no "specially formatted unicode-like" query string here - it is a straightforward UTF-8 encoded string.

The strange thing here is that non-ASCII characters are ending up in META['QUERY_STRING']. With browsers, non-ASCII characters get percent-encoded. So the request is simply wrong: this is definitely a bug in the crawler (but that is irrelevant).

The next question is whether this is a bug in the web server, which appears to be flup. Looking at the spec for QUERY_STRING in CGI (http://ken.coar.org/cgi/draft-coar-cgi-v11-03.txt) which is the basis of the WSGI spec (http://www.python.org/dev/peps/pep-0333/#environ-variables), the value of QUERY_STRING should not contain these values.

So AFAICS, this is a bug in flup, because it should never pass on values like these. That doesn't mean we shouldn't fix it in Django to stop 500 errors being produced. The best behaviour would be to return a '400 Malformed request' error if QUERY_STRING has any non-ASCII chars. We probably don't want to do that in the bit of code that is raising this exception, but rather somewhere like WSGIRequest.__init__ or BaseHandler.get_response. However, that would add overhead to every request, so I'm not sure what to do.

There is a way to catch this at the developer level - install an exception middleware. You could also install a request middleware that checked that no invalid chars were in QUERY_STRING.
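As an illustration of the percent-encoding point, here is a minimal sketch (written for Python 3) comparing what a well-behaved client would send for the search term in the dump with the raw bytes the crawler actually sent. `is_ascii_query_string` is a hypothetical helper introduced only for this example; it is not part of Django.

```python
from urllib.parse import parse_qs, urlencode

# The search term from the request dump (GET['q']).
term = '\u0427\u0430\u0439\u043a\u0430'

# A well-behaved client encodes the term as UTF-8 and then
# percent-encodes every non-ASCII byte, so the result is pure ASCII.
encoded = urlencode({'q': term})

# The raw QUERY_STRING bytes that the crawler sent instead.
raw_from_crawler = b'q=\xd0\xa7\xd0\xb0\xd0\xb9\xd0\xba\xd0\xb0'


def is_ascii_query_string(raw):
    """Hypothetical check: True if the raw query string is ASCII-only."""
    try:
        raw.decode('ascii')
    except UnicodeDecodeError:
        return False
    return True


print(encoded)
print(is_ascii_query_string(encoded.encode('ascii')))
print(is_ascii_query_string(raw_from_crawler))
```

Both forms carry the same UTF-8 bytes; only the percent-encoded one is valid in QUERY_STRING.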

comment:3 Changed 4 years ago by Loststylus

Oh, thank you for your response, I'll try to catch it via middleware.

I think the WSGIRequest constructor seems like the proper place to check for that.

comment:4 Changed 4 years ago by Loststylus

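Temporary workaround (the middleware should be added before the common middleware):

from django.http import HttpResponse


class RequestCheckMiddleware(object):
    """Reject requests whose QUERY_STRING contains non-ASCII bytes."""

    def process_request(self, request):
        try:
            # Under Python 2, interpolating a bytestring into a unicode
            # string decodes it as ASCII and raises on non-ASCII bytes.
            u'%s' % request.META.get('QUERY_STRING', '')
        except UnicodeDecodeError:
            response = HttpResponse()
            response.status_code = 400  # Bad Request
            return response
        # Returning None lets the remaining middleware run as usual.
        return None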
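
For the ordering requirement, a settings fragment along these lines should work. This is only a sketch: `myproject.middleware` is a hypothetical module path standing in for wherever the class above is saved, and the rest of the middleware list is assumed to stay as it was.

```python
# settings.py (fragment)
MIDDLEWARE_CLASSES = (
    # Hypothetical path; must come before CommonMiddleware so the
    # malformed request is rejected before the redirect URL is built.
    'myproject.middleware.RequestCheckMiddleware',
    'django.middleware.common.CommonMiddleware',
    # remaining middleware unchanged
)
```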

comment:5 Changed 4 years ago by russellm

  • Triage Stage changed from Unreviewed to Accepted

Accepted on the basis that we could do something here, but I agree with Luke - we don't want to pay a big price because a handful of servers can't implement the spec correctly.

comment:6 Changed 4 years ago by ramiro

  • Summary changed from Common middleware raises UnicodeDecodeError if receives specially formatted unicode-like query string to Common middleware raises UnicodeDecodeError if receives non-ASCII QUERY_STRING from buggy web server

comment:7 Changed 4 years ago by lrekucki

  • Severity set to Normal
  • Type set to Bug

comment:8 Changed 4 years ago by anonymous

By the way, the problem often shows up when just using the IE10 beta.

comment:9 Changed 3 years ago by jacob

  • milestone 1.3 deleted

Milestone 1.3 deleted

comment:11 Changed 3 years ago by aaugustin

  • UI/UX unset

Change UI/UX from NULL to False.

comment:12 Changed 3 years ago by aaugustin

  • Easy pickings unset

Change Easy pickings from NULL to False.

comment:13 Changed 3 years ago by fjsj

I got the same error.
My server is Apache with mod_wsgi. The client seems to be Internet Explorer 9.

Traceback (most recent call last):

 File "/usr/local/lib/python2.7/dist-packages/Django-1.3-py2.7.egg/django/core/handlers/base.py", line 89, in get_response
   response = middleware_method(request)

 File "/usr/local/lib/python2.7/dist-packages/Django-1.3-py2.7.egg/django/middleware/common.py", line 89, in process_request
   newurl += '?' + request.META['QUERY_STRING']

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 58: ordinal not in range(128)


<WSGIRequest
GET:<QueryDict: {u'datetime': [u'--------------'], u'time_group_id': [u'---'], u'speciality': [u'Psicologia (Dist\xfarbios emocionais e de personalidade)'], u'office': [u'--------------------------------------------'], u'health_insurance': [u'----------']}>,
POST:<QueryDict: {}>,
META:{
 'GATEWAY_INTERFACE': 'CGI/1.1',
 'HTTP_ACCEPT': 'text/html, application/xhtml+xml, */*',
 'HTTP_ACCEPT_LANGUAGE': 'pt-BR',
 'HTTP_CONNECTION': 'Keep-Alive',
 'HTTP_USER_AGENT': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)',
 'HTTP_VIA': '1.1 SVGWF02',
 'QUERY_STRING': 'health_insurance=----------&speciality=Psicologia%20(Dist\xc3\xbarbios%20emocionais%20e%20de%20personalidade)&office=--------------------------------------------&time_group_id=---&datetime=----------------',
 'REQUEST_METHOD': 'GET',
 'SERVER_PROTOCOL': 'HTTP/1.1',
 'SERVER_SOFTWARE': 'Apache',
 'mod_wsgi.callable_object': 'application',
 'mod_wsgi.handler_script': '',
 'mod_wsgi.input_chunked': '0',
 'mod_wsgi.listener_host': '',
 'mod_wsgi.process_group': '',
 'mod_wsgi.request_handler': 'wsgi-script',
 'mod_wsgi.script_reloading': '1',
 'mod_wsgi.version': (3, 3),
 'wsgi.errors': <mod_wsgi.Log object at 0xba62c0c0>,
 'wsgi.file_wrapper': <built-in method file_wrapper of mod_wsgi.Adapter object at 0xba4b2608>,
 'wsgi.input': <mod_wsgi.Input object at 0xba41cd90>,
 'wsgi.multiprocess': True,
 'wsgi.multithread': False,
 'wsgi.run_once': False,
 'wsgi.url_scheme': 'http',
 'wsgi.version': (1, 1)}>

comment:14 Changed 3 years ago by anonymous

I have the same error:

Traceback (most recent call last):

  File "/usr/lib/python2.6/site-packages/django/core/handlers/base.py", line 89, in get_response
    response = middleware_method(request)

  File "/usr/lib/python2.6/site-packages/django/middleware/common.py", line 89, in process_request
    newurl += '?' + request.META['QUERY_STRING']

UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 4: ordinal not in range(128)

Some additional data

 'HTTP_USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)',
 'HTTP_VIA': '1.0 niiri.kharkov.com (squid/3.0.STABLE7), 1.0 wwwniiri (squid/3.2.0.12)',
 'HTTP_X_FORWARDED_FOR': 'unknown, 172.16.0.4, 82.117.230.71',
 'PATH_INFO': u'/ru/products/tag/N',
 'QUERY_STRING': 'N??\xb0???\xbb??????/fancybox/fancy_loading.png',
 'REMOTE_PORT': '27370',
 'REQUEST_METHOD': 'GET',
 'REQUEST_URI': '/ru/products/tag/N?N??\xb0???\xbb??????/fancybox/fancy_loading.png',

comment:15 Changed 3 years ago by anonymous

Same recurrent error here, just after updating to 1.4:

Traceback (most recent call last):
  File "/usr/lib/python2.7/wsgiref/handlers.py", line 85, in run
    self.result = application(self.environ, self.start_response)
  File "/usr/lib/python2.7/site-packages/Django-1.4-py2.7.egg/django/contrib/staticfiles/handlers.py", line 67, in __call__
    return self.application(environ, start_response)
  File "/usr/lib/python2.7/site-packages/Django-1.4-py2.7.egg/django/core/handlers/wsgi.py", line 241, in __call__
    response = self.get_response(request)
  File "/usr/lib/python2.7/site-packages/Django-1.4-py2.7.egg/django/core/handlers/base.py", line 146, in get_response
    response = debug.technical_404_response(request, e)
  File "/usr/lib/python2.7/site-packages/Django-1.4-py2.7.egg/django/views/debug.py", line 432, in technical_404_response
    'reason': smart_str(exception, errors='replace'),
  File "/usr/lib/python2.7/site-packages/Django-1.4-py2.7.egg/django/utils/encoding.py", line 116, in smart_str
    return str(s)
  File "/usr/lib/python2.7/site-packages/Django-1.4-py2.7.egg/django/core/urlresolvers.py", line 185, in __repr__
    return smart_str(u'<%s %s %s>' % (self.__class__.__name__, self.name, self.regex.pattern))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

I tried adding a utf8 coding declaration at the top of urlresolvers.py and a utf8 default in settings.py: no effect.

comment:16 Changed 2 years ago by aaugustin

  • Needs documentation unset
  • Needs tests unset
  • Owner changed from nobody to aaugustin
  • Patch needs improvement unset

comment:17 Changed 2 years ago by KyleMac

With Apache + mod_wsgi, Bingbot is causing this error on one of my servers.

comment:18 Changed 2 years ago by aaugustin

This happens because the URL produced by reversing is a unicode string and the query string is a bytestring.

(Interestingly, this bug doesn't exist under Python 3.)
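
The failure can be reproduced outside Django. The sketch below is written for Python 3, where the implicit ASCII decode that Python 2 performs when joining a unicode string and a bytestring has to be spelled out. `python2_style_concat` is a hypothetical helper for this illustration, not Django code, and the redirect target with a trailing slash is an assumption based on the first request dump.

```python
# Raw QUERY_STRING bytes from the first report.
query_string = b'q=\xd0\xa7\xd0\xb0\xd0\xb9\xd0\xba\xd0\xb0'

# Assumed redirect target (a text string, like the URL built by the middleware).
new_url = '/articles/ajaxsearch/'


def python2_style_concat(text, raw):
    """Emulate Python 2's `unicode + str`: the bytes are decoded as ASCII."""
    return text + raw.decode('ascii')


error = None
try:
    # Mirrors: newurl += '?' + request.META['QUERY_STRING']
    python2_style_concat(new_url, b'?' + query_string)
except UnicodeDecodeError as exc:
    error = exc

print(error)
```

The decode fails at position 3, as in the first traceback: the `?` prefix shifts the 0xd0 byte from index 2 to index 3. Under Python 3, PEP 3333 requires WSGI servers to pass QUERY_STRING as a native string, so the two types are never mixed.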

comment:19 Changed 2 years ago by Aymeric Augustin <aymeric.augustin@…>

  • Resolution set to fixed
  • Status changed from new to closed

In be6522561f01aa2a0b503fb35f35c9fd34c5110f:

[1.5.x] Fixed #15152 -- Avoided crash of CommonMiddleware on broken querystring

Backport of 973f539 from master.

comment:20 Changed 2 years ago by Aymeric Augustin <aymeric.augustin@…>

In 973f539ab83bb46645f2f711190735c66a246797:

Fixed #15152 -- Avoided crash of CommonMiddleware on broken querystring
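
The commit itself is not reproduced on this page. As a rough illustration only, one way to avoid such a crash is to append the query string when it is ASCII-only and drop it otherwise. The sketch below is written for Python 3 with the query string as bytes; `append_query_string` is a hypothetical helper, and dropping the query string is an assumption about the approach, not a description of the committed change.

```python
def append_query_string(url, raw_query_string):
    """Hypothetical helper: append an ASCII-only query string, else drop it."""
    if not raw_query_string:
        return url
    try:
        return url + '?' + raw_query_string.decode('ascii')
    except UnicodeDecodeError:
        # Only a broken client or server produces such a query string;
        # dropping it avoids answering with a 500 error.
        return url


print(append_query_string('/articles/ajaxsearch/', b'q=test'))
print(append_query_string('/articles/ajaxsearch/', b'q=\xd0\xa7'))
```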
