Opened 12 years ago
Closed 10 years ago
#18456 closed Bug (fixed)
HttpRequest.get_full_path does not escape # sign in the url
Reported by: | Owned by: | Unai Zalakain | |
---|---|---|---|
Component: | HTTP handling | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | jim@… | Triage Stage: | Accepted |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Suppose request was made with the url '/some%23thing?param=1', then get_full_path() would return '/some#thing?param=1' which I believe is incorrect as 'thing?param=1' is now in anchor part.
Change History (10)
comment:1 by , 12 years ago
Component: | Uncategorized → HTTP handling |
---|---|
Triage Stage: | Unreviewed → Design decision needed |
comment:2 by , 12 years ago
Cc: | added |
---|
I have always found request.get_full_path()
to be less than useful when dealing with tricky corner cases. It doesn't escape anything itself, but if you escape its output, it also escapes the '?
' used to assemble the path and the query string:
>>> from django.http import HttpRequest >>> from django.utils.http import urlquote >>> request = HttpRequest() >>> request.path = '/' >>> request.META['QUERY_STRING'] = 'q=a' >>> request.get_full_path() '/?q=a' >>> urlquote(request.get_full_path()) u'/%3Fq%3Da'
In cases involving both query strings and unicode characters in urls, I have found that it is best simply to avoid request.get_full_path()
altogether. Instead, in my own projects I have defined a method called request.get_escaped_full_path()
as follows:
def get_escaped_full_path(self): return '%s%s' % (iri_to_uri(urlquote(self.path)), self.META.get('QUERY_STRING', '') and ('?' + iri_to_uri(self.META.get('QUERY_STRING', ''))) or '')
This turns out to be much more predictable (and useful).
follow-up: 4 comment:3 by , 12 years ago
Resolution: | → wontfix |
---|---|
Status: | new → closed |
In fact, the webserver urldecodes the requested path before putting it in the WSGI environ, and urldecoding isn't reversible.
Therefore, it's impossible to retrieve the URL that was actually sent by the browser, short of parsing it and reconstructing it with reverse
-- which, as of a few days ago, urlencodes properly.
follow-up: 5 comment:4 by , 12 years ago
Replying to aaugustin:
In fact, the webserver urldecodes the requested path before putting it in the WSGI environ, and urldecoding isn't reversible.
Therefore, it's impossible to retrieve the URL that was actually sent by the browser, short of parsing it and reconstructing it with
reverse
-- which, as of a few days ago, urlencodes properly.
I think there is a bigger issue to consider here, and I hope you will reconsider having marked this ticket as wontfix.
As I mentioned in comment:2, get_full_path()
as currently implemented is useless on sites that contain paths that need escaping along with query strings. The question comes down to: should one urlencode the result of get_full_path()
, or not? The documentation endorses neither way, and actually, neither answer is fully correct. Consider the case where one is attempting to use get_full_path()
as a way to construct a URL to place in an HTML anchor tag, or a URL that can be used as a 302 redirect. If the path has a query string, then calling urlquote(request.get_fullpath())
will mistakenly urlencode the '?' character. If one chooses not to escape the result and just use get_full_path()
directly (which is, e.g., what the django admin does), then the link will be broken if the URL contains characters that need to be escaped.
As such, for any conceivable use case of HttpRequest.get_full_path()
, the only behavior that makes consistent sense is to urlencode the path. This is true whether one is using it to e.g. construct an anchor tag with, or in a 301/302 redirect. In fact, in neither of these use cases (nor in any can I think of) does it even matter whether one is able to reconstruct the precise url that was sent by the browser.
comment:5 by , 12 years ago
Resolution: | wontfix |
---|---|
Status: | closed → new |
Triage Stage: | Design decision needed → Accepted |
It's difficult to say something about "every conceivable use case of get_full_path()
". For instance, it could be used for logging; and one wants to log "/foo:bar/", not "%2Ffoo%3Abar%2F" :)
Reopening to see if it would make sense to escape specifically significant characters in the path of an URL (?, #, ;, maybe others).
comment:6 by , 11 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
According to https://www.ietf.org/rfc/rfc2396.txt
The path may consist of a sequence of path segments separated by a single slash "/" character. Within a path segment, the characters "/", ";", "=", and "?" are reserved. Each path segment may include a sequence of parameters, indicated by the semicolon ";" character. The parameters are not significant to the parsing of relative references.
I would escape all ";", "=" and "?" characters. The fragment isn't even contemplated because it's not strictly part of the URI:
When a URI reference is used to perform a retrieval action on the identified resource, the optional fragment identifier, separated from the URI by a crosshatch ("#") character, consists of additional reference information to be interpreted by the user agent after the retrieval action has been successfully completed. As such, it is not part of a URI, but is often used in conjunction with a URI.
Personally, I consider the possible logging clarity problems less important than the problems arising from HttpRequest.get_full_path()
bad behavior. If needed, logging could use some other function to print out the URI.
comment:7 by , 11 years ago
It seems that Django doesn't (in principle) log the successful requests anywhere (the PATH_INFO
WSGI environ is what gets logged both in development and production). The only place where get_full_path()
is intended for human consumption is django.middleware.common.BrokenLinkEmailsMiddleware.process_response
where it gets mailed. I would say that, in this case, it might be even interesting to have the "correct" yet difficult to understand path printed.
comment:9 by , 10 years ago
Here is the revised pull request: https://github.com/django/django/pull/3454
Changes:
- Fixed merge conflict
- Rewrote the tests in tests/utils_tests/test_encoding.py
- Added a test to tests/requests/tests.py
- Updated the versionadded directive in docs/ref/utils.txt
- Added a release note
comment:10 by , 10 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
request.get_full_path()
returns the path with the query string (if there is one). The result doesn't have any particular encoding or escaping applied. The docs would mention it otherwise :)This is a bit pathological when the path includes a "?". In this case, the "?" must have been escaped in the original representation of the URL (written in the HTML or typed in the address bar). Otherwise it would have been interpreted as the beginning of the query string. However, since
request.get_full_path()
is justpath + '?' + query string
, you can't tell the difference between a "?" in the path and the "?" that marks the beginning of the query string in its output."#" is less of a problem. Browsers don't includes fragments (that's the official name for "anchor") in requests, so if you have a # in the output of
request.get_full_path()
, it was in the path or the query string. It can't be the anchor.To sum up with an example, if the escaped URL is
/%3Ffoo%23bar?baz#quux
, thenrequest.get_full_path()
is/?foo#bar?baz
.So your question boils down to: should
path
be URL-encoded when buildingrequest.get_full_path()
?To be honest I'm not sure.
Note that
django.utils.encoding.iri_to_uri()
won't do the job because it keeps # and ? unchanged.Related ticket: #11522