Code

Opened 2 years ago

Last modified 8 months ago

#18456 assigned Bug

HttpRequest.get_full_path does not escape # sign in the url

Reported by: vlad.shcherbina@… Owned by: unaizalakain
Component: HTTP handling Version: master
Severity: Normal Keywords:
Cc: jim@… Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Suppose request was made with the url '/some%23thing?param=1', then get_full_path() would return '/some#thing?param=1' which I believe is incorrect as 'thing?param=1' is now in anchor part.

Attachments (0)

Change History (8)

comment:1 Changed 2 years ago by aaugustin

  • Component changed from Uncategorized to HTTP handling
  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Triage Stage changed from Unreviewed to Design decision needed

request.get_full_path() returns the path with the query string (if there is one). The result doesn't have any particular encoding or escaping applied. The docs would mention it otherwise :)

This is a bit pathological when the path includes a "?". In this case, the "?" must have been escaped in the original representation of the URL (written in the HTML or typed in the address bar). Otherwise it would have been interpreted as the beginning of the query string. However, since request.get_full_path() is just path + '?' + query string, you can't tell the difference between a "?" in the path and the "?" that marks the beginning of the query string in its output.

"#" is less of a problem. Browsers don't includes fragments (that's the official name for "anchor") in requests, so if you have a # in the output of request.get_full_path(), it was in the path or the query string. It can't be the anchor.

To sum up with an example, if the escaped URL is /%3Ffoo%23bar?baz#quux, then request.get_full_path() is /?foo#bar?baz.


So your question boils down to: should path be URL-encoded when building request.get_full_path()?

To be honest I'm not sure.

Note that django.utils.encoding.iri_to_uri() won't do the job because it keeps # and ? unchanged.

Related ticket: #11522

comment:2 Changed 2 years ago by garrison

  • Cc jim@… added

I have always found request.get_full_path() to be less than useful when dealing with tricky corner cases. It doesn't escape anything itself, but if you escape its output, it also escapes the '?' used to assemble the path and the query string:

>>> from django.http import HttpRequest
>>> from django.utils.http import urlquote
>>> request = HttpRequest()
>>> request.path = '/'
>>> request.META['QUERY_STRING'] = 'q=a'
>>> request.get_full_path()
'/?q=a'
>>> urlquote(request.get_full_path())
u'/%3Fq%3Da'

In cases involving both query strings and unicode characters in urls, I have found that it is best simply to avoid request.get_full_path() altogether. Instead, in my own projects I have defined a method called request.get_escaped_full_path() as follows:

    def get_escaped_full_path(self):
        return '%s%s' % (iri_to_uri(urlquote(self.path)), self.META.get('QUERY_STRING', '') and ('?' + iri_to_uri(self.META.get('QUERY_STRING', ''))) or '')

This turns out to be much more predictable (and useful).

comment:3 follow-up: Changed 16 months ago by aaugustin

  • Resolution set to wontfix
  • Status changed from new to closed

In fact, the webserver urldecodes the requested path before putting it in the WSGI environ, and urldecoding isn't reversible.

Therefore, it's impossible to retrieve the URL that was actually sent by the browser, short of parsing it and reconstructing it with reverse -- which, as of a few days ago, urlencodes properly.

comment:4 in reply to: ↑ 3 ; follow-up: Changed 16 months ago by garrison

Replying to aaugustin:

In fact, the webserver urldecodes the requested path before putting it in the WSGI environ, and urldecoding isn't reversible.

Therefore, it's impossible to retrieve the URL that was actually sent by the browser, short of parsing it and reconstructing it with reverse -- which, as of a few days ago, urlencodes properly.

I think there is a bigger issue to consider here, and I hope you will reconsider having marked this ticket as wontfix.

As I mentioned in comment:2, get_full_path() as currently implemented is useless on sites that contain paths that need escaping along with query strings. The question comes down to: should one urlencode the result of get_full_path(), or not? The documentation endorses neither way, and actually, neither answer is fully correct. Consider the case where one is attempting to use get_full_path() as a way to construct a URL to place in an HTML anchor tag, or a URL that can be used as a 302 redirect. If the path has a query string, then calling urlquote(request.get_fullpath()) will mistakenly urlencode the '?' character. If one chooses not to escape the result and just use get_full_path() directly (which is, e.g., what the django admin does), then the link will be broken if the URL contains characters that need to be escaped.

As such, for any conceivable use case of HttpRequest.get_full_path(), the only behavior that makes consistent sense is to urlencode the path. This is true whether one is using it to e.g. construct an anchor tag with, or in a 301/302 redirect. In fact, in neither of these use cases (nor in any can I think of) does it even matter whether one is able to reconstruct the precise url that was sent by the browser.

comment:5 in reply to: ↑ 4 Changed 16 months ago by aaugustin

  • Resolution wontfix deleted
  • Status changed from closed to new
  • Triage Stage changed from Design decision needed to Accepted

It's difficult to say something about "every conceivable use case of get_full_path()". For instance, it could be used for logging; and one wants to log "/foo:bar/", not "%2Ffoo%3Abar%2F" :)

Reopening to see if it would make sense to escape specifically significant characters in the path of an URL (?, #, ;, maybe others).

comment:6 Changed 8 months ago by unaizalakain

  • Owner changed from nobody to unaizalakain
  • Status changed from new to assigned

According to https://www.ietf.org/rfc/rfc2396.txt

   The path may consist of a sequence of path segments separated by a
   single slash "/" character.  Within a path segment, the characters
   "/", ";", "=", and "?" are reserved.  Each path segment may include a
   sequence of parameters, indicated by the semicolon ";" character.
   The parameters are not significant to the parsing of relative
   references.

I would escape all ";", "=" and "?" characters. The fragment isn't even contemplated because it's not strictly part of the URI:

   When a URI reference is used to perform a retrieval action on the
   identified resource, the optional fragment identifier, separated from
   the URI by a crosshatch ("#") character, consists of additional
   reference information to be interpreted by the user agent after the
   retrieval action has been successfully completed.  As such, it is not
   part of a URI, but is often used in conjunction with a URI.

Personally, I consider the possible logging clarity problems less important than the problems arising from HttpRequest.get_full_path() bad behavior. If needed, logging could use some other function to print out the URI.

Last edited 8 months ago by unaizalakain (previous) (diff)

comment:7 Changed 8 months ago by unaizalakain

It seems that Django doesn't (in principle) log the successful requests anywhere (the PATH_INFO WSGI environ is what gets logged both in development and production). The only place where get_full_path() is intended for human consumption is django.middleware.common.BrokenLinkEmailsMiddleware.process_response where it gets mailed. I would say that, in this case, it might be even interesting to have the "correct" yet difficult to understand path printed.

comment:8 Changed 8 months ago by unaizalakain

  • Has patch set

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as assigned
The owner will be changed from unaizalakain to anonymous. Next status will be 'assigned'
The ticket will be disowned. Next status will be 'new'
as The resolution will be set. Next status will be 'closed'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.