Opened 2 months ago

Last modified 2 months ago

#36483 closed Cleanup/optimization

IntegerField will accept non-ASCII digits, which leads to the same page appearing at many URLs — at Initial Version

Reported by: Morgan Wahl
Owned by:
Component: Core (URLs)
Version: 5.2
Severity: Normal
Keywords:
Cc: Morgan Wahl
Triage Stage: Unreviewed
Has patch: no
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

Hello,

I was recently surprised to find that a simple detail view URL with a model ID in it was also accessible at a URL using "full width" digit characters. For example, the page at "/pizza/123" could also be returned from "/pizza/１２３". Those are the Unicode characters U+FF11 U+FF12 U+FF13. It turns out this is ultimately because the model IntegerField uses int to get an integer from the string that was originally in the URL. And I was surprised to find that Python's int constructor uses unicodedata.decimal (or some equivalent) to translate from characters in a string to decimal digits.
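A minimal sketch of the behavior described above, using only the standard library. The variable names are illustrative and nothing here is Django code.

```python
import unicodedata

# The three full-width digits named in the ticket: U+FF11 U+FF12 U+FF13.
fullwidth = "\uff11\uff12\uff13"

# int() accepts them and yields the same integer as the ASCII string "123".
parsed = int(fullwidth)

# unicodedata.decimal() reports the decimal value of each character.
digits = [unicodedata.decimal(ch) for ch in fullwidth]

print(parsed)                # 123
print(digits)                # [1, 2, 3]
print(fullwidth == "123")    # False: the strings differ, only the parsed value matches
print(fullwidth.isascii())   # False
```

So two distinct URL strings can reach the same integer primary key once the captured text is passed through int.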

That was a cool accidental feature to discover; however, now I'm concerned about URL canonicalization. Python 3.13.3 accepts 68 different characters for each digit. This means the same content is hypothetically accessible from many, many URLs. I've heard that can make a site look spammy to search engines. And maybe this could be an element of a security hole if something is assuming there is only one URL for a given page.
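The per-digit count can be checked on a given interpreter with a short script. This is a sketch that simply tries every code point. The result depends on the Unicode database bundled with the Python version, so no particular number is assumed here beyond the 68 reported above for Python 3.13.3.

```python
import sys
from collections import Counter


def digit_value(ch):
    """Return the value int() assigns to a single character, or None."""
    try:
        return int(ch)
    except ValueError:
        # Raised for every character int() does not treat as a decimal digit.
        return None


# Tally how many distinct characters int() maps to each value 0..9.
counts = Counter()
for codepoint in range(sys.maxunicode + 1):
    value = digit_value(chr(codepoint))
    if value is not None:
        counts[value] += 1

for value in range(10):
    print(value, counts[value])
```

With n accepted characters per digit, a k-digit ID has n**k spellings, which is why the number of alternative URLs grows so quickly.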

The SEO problem could be addressed by setting a <link rel=canonical> in the page that points to Pizza.objects.get(pk=id).get_absolute_url() or some similar logic. Alternatively, you could address the problem as a whole by setting up redirects or 404 responses. However, all of those approaches require a separate implementation for every view, since the view code ultimately doesn't know which parts of the URL are going to be treated as values of an IntegerField.

Possible solutions I can think of are either:

  1. Make some mechanism to easily canonicalize URLs by allowing users to mark this situation explicitly in the URLconf, so that Django can set a property on the request object with the "canonicalized" URL. Redirects, 404s, or <link> tags could then be implemented just once for all such URLs (redirects and 404s in a middleware, <link> tags in a base template).
  2. Don't just pass strings to int in the model IntegerField. Instead, only allow strings made up of ASCII digits.
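Both ideas can be sketched in plain Python. The helper names below (parse_ascii_int, canonical_form, is_canonical) are hypothetical and are not part of Django. They only illustrate the check each option would need. Note that the pattern uses [0-9] rather than \d, because \d also matches non-ASCII decimal digits in str patterns.

```python
import re

# Optional sign followed by ASCII digits only.
_ASCII_INT = re.compile(r"[+-]?[0-9]+")


def parse_ascii_int(text):
    """Option 2: convert text to int, rejecting non-ASCII digits (hypothetical helper)."""
    if _ASCII_INT.fullmatch(text) is None:
        raise ValueError(f"not an ASCII integer literal: {text!r}")
    return int(text)


def canonical_form(text):
    """Option 1: return the single canonical spelling of an integer string."""
    # int() raises ValueError if text is not an integer at all.
    return str(int(text))


def is_canonical(text):
    """True if text is already spelled the canonical way."""
    return canonical_form(text) == text


fullwidth = "\uff11\uff12\uff13"
print(canonical_form(fullwidth))   # 123
print(is_canonical(fullwidth))     # False, so a redirect or 404 could be issued
print(is_canonical("123"))         # True
print(parse_ascii_int("123"))      # 123
```

A middleware or base template following option 1 could compare the requested segment with canonical_form and redirect when they differ. This comparison also flags spellings such as "0123", which may or may not be desired.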

Change History (0)
