Opened 6 years ago

Closed 6 years ago

Last modified 6 years ago

#28679 closed Bug (fixed)

urlencode does not decode bytes before passing them to Python's urlencode

Reported by: François Freitag Owned by: François Freitag
Component: Utilities Version: 2.0
Severity: Release blocker Keywords: urlencode bytes
Cc: Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

urlencode should decode bytes objects before to pass them to python's urlencode. Instead, it's currently calling str on the bytes, passing the bytes.__str__ representation of the bytes (b'...') to urlencode.

# A dictionary with bytes value
result = http.urlencode({'a': b'abc'}, doseq=True)
self.assertEqual(result, 'a=abc')
# Test result on 2.0a1
AssertionError: 'a=b%27abc%27' != 'a=abc' 

Change History (11)

comment:1 by François Freitag, 6 years ago

Owner: changed from nobody to François Freitag
Status: newassigned

comment:2 by François Freitag, 6 years ago

comment:3 by Claude Paroz, 6 years ago

Severity: NormalRelease blocker
Triage Stage: UnreviewedAccepted
Type: UncategorizedBug

Sure something needs to be fixed (sorry, I introduced the regression in [fee42fd99ee470528858c2ccb3621135c30ec262]).

I just wonder if we should use force_text or rather let bytes be bytes, as the Python urlencode seems to accept both strings or bytes.

comment:4 by Jon Dufresne, 6 years ago

What is the use case for passing bytes to urlencode()?

As URLs are strings, I'd expect their query parameters to also be strings (or coercable to a string). At first glance, this feels like the bytes data is being passed through the system as if it were a string instead of being deocded/encoded at its endpoints.

comment:5 by François Freitag, 6 years ago

Python is treating bytes differently than other types, perhaps Django should maintain that behaviour?

IMHO, decoding user-encoded bytes (assuming utf-8) and re-encoding is not useful and may cause harm to users who decided to encode their data using another encoding. I tend to agree with Claude on special-casing bytes.

Last edited 6 years ago by François Freitag (previous) (diff)

in reply to:  4 comment:6 by Claude Paroz, 6 years ago

Replying to Jon Dufresne:

What is the use case for passing bytes to urlencode()?

I had the same question initially. But then I saw that Python 3 urlencode was also mentioning bytes as possible input, so I guess there are valid use cases for that.

comment:7 by Jon Dufresne, 6 years ago

Python is ​bytes differently than other types, perhaps Django should maintain that behaviour?

I see. Is this the only motivation or is there a use case? While I might disagree with Python's choice here, I understand the argument of following upstream and the wider community.

IMHO, decoding user-encoded bytes (assuming utf-8) and re-encoding is not useful and may cause harm to users who decided to encode their data using another encoding. I tend to agree with Claude on special-casing bytes.

This is precisely what force_text() is doing. If decoding as utf-8 is causing harm, it hasn't been avoided.

By using force_text() there is an implicit assumption that the bytes data is utf-8 encoded. While that may be true for a large majority of the cases, there is no reason to believe this is true for all cases. Having the bytes data properly decoded as it is received would avoid this assumption.

comment:8 by François Freitag, 6 years ago

My use case is a library passing zipped xml, base64 encoded over GET parameters. The output of base64.b64encode is bytes, and I think Django should not change that (even though it does not matter in my use case).

I updated the PR to simply pass bytes "as is" to Python's urlencode. I agree that calling force_text on the value is not the best option.

Is this the only motivation or is there a use case?

I do not have a use case for Django not decoding bytes, but I find it safer than decoding a content that does not need it. I prefer Django being consistent with Python. The added value of Django's urlencode is the understanding of QueryDict, I do not expect it to re-encode the bytes I've given.

comment:9 by Jon Dufresne, 6 years ago

I updated the PR to simply pass bytes "as is" to Python's urlencode.

This approach makes sense to me. Thanks for explaining.

comment:10 by Tim Graham <timograham@…>, 6 years ago

Resolution: fixed
Status: assignedclosed

In 41be8586:

Fixed #28679 -- Fixed urlencode()'s handling of bytes.

Regression in fee42fd99ee470528858c2ccb3621135c30ec262.

Thanks Claude Paroz, Jon Dufresne, and Tim Graham for the guidance.

comment:11 by Tim Graham <timograham@…>, 6 years ago

In 51998a29:

[2.0.x] Fixed #28679 -- Fixed urlencode()'s handling of bytes.

Regression in fee42fd99ee470528858c2ccb3621135c30ec262.

Thanks Claude Paroz, Jon Dufresne, and Tim Graham for the guidance.

Backport of 41be85862d9067a809ccf3707d2a22dfef23d99a from master

Note: See TracTickets for help on using tickets.
Back to Top