#36483 closed Cleanup/optimization (duplicate)
IntegerField will accept non-ASCII digits, which leads to the same page appearing at many URLs
Reported by: | Morgan Wahl | Owned by: | |
---|---|---|---|
Component: | Core (URLs) | Version: | 5.2 |
Severity: | Normal | Keywords: | |
Cc: | Morgan Wahl | Triage Stage: | Unreviewed |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description (last modified by )
Hello,
I was recently surprised to find that a simple detail view URL with a model ID in it was also accessible at a URL using "full width" digit characters. For example the page at "/pizza/123" could also be returned from "/pizza/123". That's the Unicode characters U+FF11 U+FF12 U+FF13. It turns out this is ultimately because the model IntegerField
is using int
to get an integer from the string that was originally in the URL. And I was surprised to find Python's int
constructor uses unicodedata.decimal
(or some equivalent) to translate from characters in a string to decimal digits.
That was a cool accidental feature to discover, however now I'm concerned about URL canonicalization. Python 3.13.3 accepts _68_ different characters for each digit. This means the same content is hypothetically accessible from many, many URLs. I've heard that can make a site look spammy to search engines. And maybe this could be an element of a security hole if something is assuming there is only one URL for a given page.
The SEO problem could be addressed by setting a <link rel=canonical>
in the page to point to Pizza.objects.get(pk=id).get_absolute_url()
or some similar logic, or you could address the problem as a whole by setting up redirects or 404 responses, but all those approaches require a separate implementation for every view, since Django's code ultimately doesn't know which parts of the URL are going to be treated as values of a IntegerField
. (Django's DetailView
could implement some logic for this, however in reality there are lots of other situations where people take a string from the URL use it to look up a record with an IntegerField
.)
Possible solutions I can think of are either:
- make some mechanism to very easily canonicalize URLs, by allowing users to somehow mark this situation explicitly in the URL conf, and then Django can set a property on the request object with the "canonicalized" URL. Then redirects or 404s or <link> tags could be implemented just once for all such URLs. (Redirects and 404s in a middleware, <link> tags in a base template.)
- Don't just pass strings to
int
in the modelIntegerField
. Instead only allow strings with ASCII digits to be used.
Change History (6)
comment:1 by , 2 months ago
Description: | modified (diff) |
---|
comment:3 by , 2 months ago
Component: | Uncategorized → Core (URLs) |
---|---|
Type: | Bug → Cleanup/optimization |
Hello Morgan Wahl, thank you for your ticket.
The suggested option 2 would be backwards incompatible so I don't think that's an option we want to pursue. As to whether this needs fixing at all, I'm not entirely sure. I understand the potential cache and SEO invalidation issues, so I have shared this report with the security team to evaluate possible options.
I will update the ticket when I get more information.
comment:4 by , 2 months ago
Morgan, could you please provide details of how you are defining your view and urlconf? There was previous work in #28268 that would ensure that modern usages of URL definitions would not allow for these fullwidth unicode digits. So patterns like this:
path("something/<int:pk>/", something_detail)
re_path(r"^something/(?P<pk>[0-9]+)/$", something_detail)
properly raise 404 for non-ascii digits.
comment:5 by , 2 months ago
Resolution: | → wontfix |
---|---|
Status: | new → closed |
Ah! That was a good question. In the case where I ran into this, I see we're doing
re_path(r'^(?P<pk>\d+)/$', browse_views.material, name='browse.material'),
This particular code was written sometime before 2013, so path
was not an option, and was originally running on Python 2, where \d
only matches ASCII unless you passed the unicode flag.
I see we just need to use path
with <int:...>
or limit the regex.
Thanks for the help and sorry to take up your time!
comment:6 by , 2 months ago
Resolution: | wontfix → duplicate |
---|
Thank you for your quick reply Morgan! I'll adjust the resolution slightly to mark this as duplicate of #28268.
(Fix typo in description.)