Opened 12 years ago

Closed 10 years ago

#18456 closed Bug (fixed)

HttpRequest.get_full_path does not escape # sign in the url

Reported by: vlad.shcherbina@… Owned by: Unai Zalakain
Component: HTTP handling Version: dev
Severity: Normal Keywords:
Cc: jim@… Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Suppose request was made with the url '/some%23thing?param=1', then get_full_path() would return '/some#thing?param=1' which I believe is incorrect as 'thing?param=1' is now in anchor part.

Change History (10)

comment:1 by Aymeric Augustin, 12 years ago

Component: UncategorizedHTTP handling
Triage Stage: UnreviewedDesign decision needed

request.get_full_path() returns the path with the query string (if there is one). The result doesn't have any particular encoding or escaping applied. The docs would mention it otherwise :)

This is a bit pathological when the path includes a "?". In this case, the "?" must have been escaped in the original representation of the URL (written in the HTML or typed in the address bar). Otherwise it would have been interpreted as the beginning of the query string. However, since request.get_full_path() is just path + '?' + query string, you can't tell the difference between a "?" in the path and the "?" that marks the beginning of the query string in its output.

"#" is less of a problem. Browsers don't includes fragments (that's the official name for "anchor") in requests, so if you have a # in the output of request.get_full_path(), it was in the path or the query string. It can't be the anchor.

To sum up with an example, if the escaped URL is /%3Ffoo%23bar?baz#quux, then request.get_full_path() is /?foo#bar?baz.


So your question boils down to: should path be URL-encoded when building request.get_full_path()?

To be honest I'm not sure.

Note that django.utils.encoding.iri_to_uri() won't do the job because it keeps # and ? unchanged.

Related ticket: #11522

comment:2 by Jim Garrison, 12 years ago

Cc: jim@… added

I have always found request.get_full_path() to be less than useful when dealing with tricky corner cases. It doesn't escape anything itself, but if you escape its output, it also escapes the '?' used to assemble the path and the query string:

>>> from django.http import HttpRequest
>>> from django.utils.http import urlquote
>>> request = HttpRequest()
>>> request.path = '/'
>>> request.META['QUERY_STRING'] = 'q=a'
>>> request.get_full_path()
'/?q=a'
>>> urlquote(request.get_full_path())
u'/%3Fq%3Da'

In cases involving both query strings and unicode characters in urls, I have found that it is best simply to avoid request.get_full_path() altogether. Instead, in my own projects I have defined a method called request.get_escaped_full_path() as follows:

    def get_escaped_full_path(self):
        return '%s%s' % (iri_to_uri(urlquote(self.path)), self.META.get('QUERY_STRING', '') and ('?' + iri_to_uri(self.META.get('QUERY_STRING', ''))) or '')

This turns out to be much more predictable (and useful).

comment:3 by Aymeric Augustin, 12 years ago

Resolution: wontfix
Status: newclosed

In fact, the webserver urldecodes the requested path before putting it in the WSGI environ, and urldecoding isn't reversible.

Therefore, it's impossible to retrieve the URL that was actually sent by the browser, short of parsing it and reconstructing it with reverse -- which, as of a few days ago, urlencodes properly.

in reply to:  3 ; comment:4 by Jim Garrison, 12 years ago

Replying to aaugustin:

In fact, the webserver urldecodes the requested path before putting it in the WSGI environ, and urldecoding isn't reversible.

Therefore, it's impossible to retrieve the URL that was actually sent by the browser, short of parsing it and reconstructing it with reverse -- which, as of a few days ago, urlencodes properly.

I think there is a bigger issue to consider here, and I hope you will reconsider having marked this ticket as wontfix.

As I mentioned in comment:2, get_full_path() as currently implemented is useless on sites that contain paths that need escaping along with query strings. The question comes down to: should one urlencode the result of get_full_path(), or not? The documentation endorses neither way, and actually, neither answer is fully correct. Consider the case where one is attempting to use get_full_path() as a way to construct a URL to place in an HTML anchor tag, or a URL that can be used as a 302 redirect. If the path has a query string, then calling urlquote(request.get_fullpath()) will mistakenly urlencode the '?' character. If one chooses not to escape the result and just use get_full_path() directly (which is, e.g., what the django admin does), then the link will be broken if the URL contains characters that need to be escaped.

As such, for any conceivable use case of HttpRequest.get_full_path(), the only behavior that makes consistent sense is to urlencode the path. This is true whether one is using it to e.g. construct an anchor tag with, or in a 301/302 redirect. In fact, in neither of these use cases (nor in any can I think of) does it even matter whether one is able to reconstruct the precise url that was sent by the browser.

in reply to:  4 comment:5 by Aymeric Augustin, 12 years ago

Resolution: wontfix
Status: closednew
Triage Stage: Design decision neededAccepted

It's difficult to say something about "every conceivable use case of get_full_path()". For instance, it could be used for logging; and one wants to log "/foo:bar/", not "%2Ffoo%3Abar%2F" :)

Reopening to see if it would make sense to escape specifically significant characters in the path of an URL (?, #, ;, maybe others).

comment:6 by Unai Zalakain, 11 years ago

Owner: changed from nobody to Unai Zalakain
Status: newassigned

According to https://www.ietf.org/rfc/rfc2396.txt

   The path may consist of a sequence of path segments separated by a
   single slash "/" character.  Within a path segment, the characters
   "/", ";", "=", and "?" are reserved.  Each path segment may include a
   sequence of parameters, indicated by the semicolon ";" character.
   The parameters are not significant to the parsing of relative
   references.

I would escape all ";", "=" and "?" characters. The fragment isn't even contemplated because it's not strictly part of the URI:

   When a URI reference is used to perform a retrieval action on the
   identified resource, the optional fragment identifier, separated from
   the URI by a crosshatch ("#") character, consists of additional
   reference information to be interpreted by the user agent after the
   retrieval action has been successfully completed.  As such, it is not
   part of a URI, but is often used in conjunction with a URI.

Personally, I consider the possible logging clarity problems less important than the problems arising from HttpRequest.get_full_path() bad behavior. If needed, logging could use some other function to print out the URI.

Last edited 11 years ago by Unai Zalakain (previous) (diff)

comment:7 by Unai Zalakain, 11 years ago

It seems that Django doesn't (in principle) log the successful requests anywhere (the PATH_INFO WSGI environ is what gets logged both in development and production). The only place where get_full_path() is intended for human consumption is django.middleware.common.BrokenLinkEmailsMiddleware.process_response where it gets mailed. I would say that, in this case, it might be even interesting to have the "correct" yet difficult to understand path printed.

comment:8 by Unai Zalakain, 11 years ago

Has patch: set

comment:9 by Berker Peksag, 10 years ago

Here is the revised pull request: https://github.com/django/django/pull/3454

Changes:

  • Fixed merge conflict
  • Rewrote the tests in tests/utils_tests/test_encoding.py
  • Added a test to tests/requests/tests.py
  • Updated the versionadded directive in docs/ref/utils.txt
  • Added a release note

comment:10 by Tim Graham <timograham@…>, 10 years ago

Resolution: fixed
Status: assignedclosed

In c548c8d0d1112d2c4bace2eebf73ede35300d842:

Fixed #18456 -- Added path escaping to HttpRequest.get_full_path().

Note: See TracTickets for help on using tickets.
Back to Top