Opened 2 months ago

Last modified 2 months ago

#36483 closed Cleanup/optimization

IntegerField will accept non-ASCII digits, which leads to the same page appearing at many URLs — at Version 2

Reported by: Morgan Wahl Owned by:
Component: Core (URLs) Version: 5.2
Severity: Normal Keywords:
Cc: Morgan Wahl Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Morgan Wahl)

Hello,

I was recently surprised to find that a simple detail view URL with a model ID in it was also accessible at a URL using "full width" digit characters. For example the page at "/pizza/123" could also be returned from "/pizza/123". That's the Unicode characters U+FF11 U+FF12 U+FF13. It turns out this is ultimately because the model IntegerField is using int to get an integer from the string that was originally in the URL. And I was surprised to find Python's int constructor uses unicodedata.decimal (or some equivalent) to translate from characters in a string to decimal digits.

That was a cool accidental feature to discover, however now I'm concerned about URL canonicalization. Python 3.13.3 accepts _68_ different characters for each digit. This means the same content is hypothetically accessible from many, many URLs. I've heard that can make a site look spammy to search engines. And maybe this could be an element of a security hole if something is assuming there is only one URL for a given page.

The SEO problem could be addressed by setting a <link rel=canonical> in the page to point to Pizza.objects.get(pk=id).get_absolute_url() or some similar logic, or you could address the problem as a whole by setting up redirects or 404 responses, but all those approaches require a separate implementation for every view, since Django's code ultimately doesn't know which parts of the URL are going to be treated as values of a IntegerField. (Django's DetailView could implement some logic for this, however in reality there are lots of other situations where people take a string from the URL use it to look up a record with an IntegerField.)

Possible solutions I can think of are either:

  1. make some mechanism to very easily canonicalize URLs, by allowing users to somehow mark this situation explicitly in the URL conf, and then Django can set a property on the request object with the "canonicalized" URL. Then redirects or 404s or <link> tags could be implemented just once for all such URLs. (Redirects and 404s in a middleware, <link> tags in a base template.)
  2. Don't just pass strings to int in the model IntegerField. Instead only allow strings with ASCII digits to be used.

Change History (2)

comment:1 by Morgan Wahl, 2 months ago

Description: modified (diff)

(Fix typo in description.)

comment:2 by Morgan Wahl, 2 months ago

Description: modified (diff)

(Clarified possible workarounds.)

Note: See TracTickets for help on using tickets.
Back to Top