Opened 2 months ago
Last modified 2 months ago
#36483 closed Cleanup/optimization
IntegerField will accept non-ASCII digits, which leads to the same page appearing at many URLs — at Initial Version
Reported by: | Morgan Wahl | Owned by: | |
---|---|---|---|
Component: | Core (URLs) | Version: | 5.2 |
Severity: | Normal | Keywords: | |
Cc: | Morgan Wahl | Triage Stage: | Unreviewed |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Hello,
I was recently surprised to find that a simple detail view URL with a model ID in it was also accessible at a URL using "full width" digit characters. For example, the page at "/pizza/123" could also be returned from "/pizza/１２３". Those are the Unicode characters U+FF11 U+FF12 U+FF13. It turns out this is ultimately because the model IntegerField is using int to get an integer from the string that was originally in the URL. And I was surprised to find Python's int constructor uses unicodedata.decimal (or some equivalent) to translate from characters in a string to decimal digits.
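A minimal way to reproduce the behavior described above in a plain Python shell, without Django. The variable names are illustrative only:

```python
import unicodedata

# FULLWIDTH DIGIT ONE, TWO, THREE (U+FF11 U+FF12 U+FF13)
fullwidth = "\uff11\uff12\uff13"

# int() accepts characters that carry the Unicode decimal-digit property,
# not only ASCII "0" through "9".
parsed = int(fullwidth)
assert parsed == 123

# unicodedata.decimal() reports the decimal value assigned to each character.
digits = [unicodedata.decimal(ch) for ch in fullwidth]
assert digits == [1, 2, 3]

# The string counts as decimal, but it is not ASCII.
assert fullwidth.isdecimal()
assert not fullwidth.isascii()
```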
That was a cool accidental feature to discover, but now I'm concerned about URL canonicalization. Python 3.13.3 accepts _68_ different characters for each digit. This means the same content is hypothetically accessible from many, many URLs. I've heard that can make a site look spammy to search engines. And maybe this could be an element of a security hole if something is assuming there is only one URL for a given page.
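The per-digit count can be checked with a short script along these lines. It is a sketch that assumes `unicodedata.decimal` matches what `int` accepts, as suggested above. The result depends on the Unicode database bundled with the interpreter, so it can differ between Python versions:

```python
import sys
import unicodedata
from collections import Counter


def count_decimal_variants() -> Counter:
    """Count how many code points map to each decimal digit value 0-9."""
    counts = Counter()
    for code_point in range(sys.maxunicode + 1):
        # The second argument is returned for characters with no decimal value.
        value = unicodedata.decimal(chr(code_point), None)
        if value is not None:
            counts[value] += 1
    return counts


counts = count_decimal_variants()
for digit in range(10):
    print(digit, counts[digit])
```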
The SEO problem could be addressed by setting a <link rel=canonical> in the page to point to Pizza.objects.get(pk=id).get_absolute_url() or some similar logic, or you could address the problem as a whole by setting up redirects or 404 responses, but all those approaches require a separate implementation for every view, since the view code ultimately doesn't know which parts of the URL are going to be treated as values of an IntegerField.
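As a rough illustration of the check that a redirect or 404 would need, here is a sketch in plain Python. The helper names `ascii_digits` and `canonical_path` are hypothetical and are not part of Django. The sketch assumes the path has already been percent-decoded, and it normalizes every digit in the path rather than only the ID segment:

```python
import unicodedata
from typing import Optional


def ascii_digits(raw: str) -> str:
    """Replace each Unicode decimal digit in raw with its ASCII equivalent."""
    out = []
    for ch in raw:
        value = unicodedata.decimal(ch, None)
        # Characters without a decimal value are kept unchanged.
        out.append(ch if value is None else str(value))
    return "".join(out)


def canonical_path(path: str) -> Optional[str]:
    """Return the ASCII-digit form of path if it differs, otherwise None."""
    normalized = ascii_digits(path)
    return normalized if normalized != path else None


# A view could redirect (or return a 404) when a canonical form exists.
target = canonical_path("/pizza/\uff11\uff12\uff13")
assert target == "/pizza/123"
assert canonical_path("/pizza/123") is None
```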
Possible solutions I can think of are either:
- Make some mechanism to very easily canonicalize URLs, by allowing users to somehow mark this situation explicitly in the URLconf, and then Django can set a property on the request object with the "canonicalized" URL. Then redirects or 404s or <link> tags could be implemented just once for all such URLs. (Redirects and 404s in a middleware, <link> tags in a base template.)
- Don't just pass strings to int in the model IntegerField. Instead only allow strings with ASCII digits to be used.
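The second option could look roughly like the following. This is a sketch of the idea, not a proposed patch. `parse_ascii_int` is a hypothetical helper, and it is stricter than `int` in other ways too, since it rejects surrounding whitespace, a leading `+`, and underscores:

```python
import re

# [0-9] is spelled out on purpose: in str patterns, \d also matches
# non-ASCII decimal digits unless the re.ASCII flag is set.
_ASCII_INT = re.compile(r"-?[0-9]+")


def parse_ascii_int(value: str) -> int:
    """Convert value to int, accepting only ASCII digits and an optional minus sign."""
    if not _ASCII_INT.fullmatch(value):
        raise ValueError(f"not an ASCII integer: {value!r}")
    return int(value)


assert parse_ascii_int("123") == 123

try:
    parse_ascii_int("\uff11\uff12\uff13")
except ValueError:
    rejected = True
else:
    rejected = False
assert rejected
```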