Opened 16 months ago
Closed 15 months ago
#34709 closed Bug (fixed)
charset should be ignored for the application/x-www-form-urlencoded content type.
Reported by: | Mariusz Felisiak | Owned by: | Mariusz Felisiak |
---|---|---|---|
Component: | HTTP handling | Version: | 4.2 |
Severity: | Normal | Keywords: | |
Cc: | Markus Holtermann, Simon Charette, Adam Johnson, Shamil Abdulaev, Shai Berger | Triage Stage: | Ready for checkin |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description (last modified by )
The charset parameter is used in the application/x-www-form-urlencoded content type. However, per the current spec (check out RFC 1866), the content type application/x-www-form-urlencoded does not have a charset parameter and should be treated as UTF-8.
Thanks Eki Xu for the report.
Change History (20)
comment:2 by , 16 months ago
Description: | modified (diff) |
---|
comment:3 by , 16 months ago
The current behavior, even if not correct, is documented and tested. Should we use a deprecation path?
follow-up: 5 comment:4 by , 16 months ago
I think it would be difficult to provide a sensible deprecation path (visible by devs, I mean), do you have a plan in mind?
follow-up: 8 comment:5 by , 16 months ago
Replying to Claude Paroz:
I think it would be difficult to provide a sensible deprecation path (visible by devs, I mean), do you have a plan in mind?
We could raise a warning when self._encoding is not utf-8, that it will be ignored in Django 6.0. I'm just not sure it's worth doing.
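For illustration only, the warning idea could look roughly like the sketch below. It is not taken from Django or from the patch: the warning class, the function name, and the message wording are all hypothetical, and the encoding is passed in as a plain argument rather than read from self._encoding.

```python
import warnings


class FormCharsetDeprecationWarning(DeprecationWarning):
    """Hypothetical warning category; not a real Django class."""


def warn_if_non_utf8(encoding):
    """Warn when a form-urlencoded request declares a non-UTF-8 charset.

    `encoding` stands in for the request's declared charset (or None).
    Returns True if a warning was emitted, False otherwise.
    """
    if encoding is None:
        return False
    if encoding.strip().lower() in {"utf-8", "utf8"}:
        return False
    warnings.warn(
        f"charset={encoding!r} on application/x-www-form-urlencoded "
        "requests will be ignored in a future release.",
        FormCharsetDeprecationWarning,
        stacklevel=2,
    )
    return True
```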
comment:6 by , 16 months ago
Cc: | added |
---|
Greetings to you! I would like to tackle this bug and solve it!)
comment:7 by , 16 months ago
Shamil, this ticket is already assigned to me. We're discussing an acceptable approach.
follow-up: 9 comment:8 by , 16 months ago
Replying to Mariusz Felisiak:
We could raise a warning when self._encoding is not utf-8, that it will be ignored in Django 6.0. I'm just not sure it's worth doing.
The warning will probably only be ever raised on production servers, so hardly visible in practice.
Did you explore what will happen in practice if an incoming request is encoded in a different encoding and we try to decode it with utf-8? Server error (500)? Bad request (400)?
comment:9 by , 16 months ago
Replying to Claude Paroz:
Did you explore what will happen in practice if an incoming request is encoded in a different encoding and we try to decode it with utf-8? Server error (500)? Bad request (400)?
I was only able to achieve a badly decoded string, no crash 🤔
follow-up: 11 comment:10 by , 16 months ago
Indeed, by default parse_qsl is calling decode() with errors='replace', which will produce � (U+FFFD, the official REPLACEMENT CHARACTER) for invalid UTF-8 sequences. I'm still not convinced it will be a real improvement over the current situation…
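A small sketch of that behavior, using the standard library's urllib.parse.parse_qsl directly rather than any Django code path. The query string is an assumed example: "café" percent-encoded as ISO-8859-1, so the byte 0xE9 is not valid UTF-8.

```python
from urllib.parse import parse_qsl

# Assumed example input: "café" percent-encoded by a client using ISO-8859-1.
latin1_query = "name=caf%E9"

# Defaults are encoding="utf-8" and errors="replace": no exception is raised,
# the invalid byte is silently replaced with U+FFFD.
lenient = parse_qsl(latin1_query)

# Decoding with the charset the client actually used recovers the value.
correct = parse_qsl(latin1_query, encoding="iso-8859-1")

# errors="strict" turns the silent substitution into a UnicodeDecodeError.
try:
    parse_qsl(latin1_query, errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(lenient, correct, strict_failed)
```

This matches the "badly decoded string, no crash" observation: with the defaults, the caller has no signal that the input was not UTF-8.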
follow-up: 12 comment:11 by , 16 months ago
Replying to Claude Paroz:
Indeed, by default parse_qsl is calling decode() with errors='replace', which will produce � (U+FFFD, the official REPLACEMENT CHARACTER) for invalid UTF-8 sequences. I'm still not convinced it will be a real improvement over the current situation…
We could raise 400 instead of silently ignoring a custom charset.
comment:12 by , 16 months ago
Replying to Mariusz Felisiak:
We could raise 400 instead of silently ignoring a custom charset.
Sure, I'd prefer that, failing loudly, instead of silently getting wrongly-encoded input.
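As a rough sketch of the "fail loudly" option, assuming nothing about how the actual patch is written: the BadRequest class and form_encoding function below are hypothetical stand-ins, and header parsing is done with the standard library's email.message.Message.

```python
from email.message import Message

FORM_URLENCODED = "application/x-www-form-urlencoded"


class BadRequest(Exception):
    """Hypothetical stand-in for an error that maps to a 400 response."""


def form_encoding(content_type_header):
    """Return the encoding to decode the body with, or raise BadRequest.

    For form-urlencoded bodies only UTF-8 is accepted; other content types
    keep whatever charset they declare (None if absent).
    """
    msg = Message()
    msg["Content-Type"] = content_type_header
    charset = msg.get_content_charset()  # lowercased, or None if not declared
    if msg.get_content_type() != FORM_URLENCODED:
        return charset
    if charset is not None and charset != "utf-8":
        raise BadRequest(
            f"{FORM_URLENCODED} requests must be UTF-8 encoded, got {charset!r}."
        )
    return "utf-8"
```

With this shape, a request declaring charset=koi8-r is rejected up front instead of being decoded into replacement characters.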
follow-up: 17 comment:15 by , 15 months ago
While starting to review the patch, and looking for more recent considerations than the old 1866 RFC, I read https://url.spec.whatwg.org/#application/x-www-form-urlencoded which is worth a read. Quoting a note:
A legacy server-oriented implementation might have to support encodings other than UTF-8 as well as have special logic for tuples of which the name is _charset. Such logic is not described here as only UTF-8 is conforming.
I don't necessarily re-question our previous discussions/decisions, however we might be prepared to receive some complaints as it may be that non-conforming agents start to produce BadRequest errors. Difficult to say before going to production!
comment:16 by , 15 months ago
Triage Stage: | Accepted → Ready for checkin |
---|
follow-up: 18 comment:17 by , 15 months ago
Cc: | added |
---|
Replying to Claude Paroz:
While starting to review the patch, and looking for more recent considerations than the old 1866 RFC, I read https://url.spec.whatwg.org/#application/x-www-form-urlencoded which is worth a read. Quoting a note:
A legacy server-oriented implementation might have to support encodings other than UTF-8 as well as have special logic for tuples of which the name is _charset. Such logic is not described here as only UTF-8 is conforming.
I don't necessarily re-question our previous discussions/decisions, however we might be prepared to receive some complaints as it may be that non-conforming agents start to produce BadRequest errors. Difficult to say before going to production!
Unfortunately, I don't see a way to support this with a loud crash at the same time.
comment:18 by , 15 months ago
I found some issues/PRs to remove charset for the application/x-www-form-urlencoded content type in other libraries. Even explicitly passing charset=utf-8 caused issues. As far as I'm aware, we can move it forward:
See related tickets #5076 and #14035.