Opened 2 months ago

Closed 2 months ago

Last modified 2 months ago

#36483 closed Cleanup/optimization (duplicate)

IntegerField will accept non-ASCII digits, which leads to the same page appearing at many URLs

Reported by: Morgan Wahl Owned by:
Component: Core (URLs) Version: 5.2
Severity: Normal Keywords:
Cc: Morgan Wahl Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Morgan Wahl)

Hello,

I was recently surprised to find that a simple detail view URL with a model ID in it was also accessible at a URL using "full width" digit characters. For example the page at "/pizza/123" could also be returned from "/pizza/123". That's the Unicode characters U+FF11 U+FF12 U+FF13. It turns out this is ultimately because the model IntegerField is using int to get an integer from the string that was originally in the URL. And I was surprised to find Python's int constructor uses unicodedata.decimal (or some equivalent) to translate from characters in a string to decimal digits.

That was a cool accidental feature to discover, however now I'm concerned about URL canonicalization. Python 3.13.3 accepts _68_ different characters for each digit. This means the same content is hypothetically accessible from many, many URLs. I've heard that can make a site look spammy to search engines. And maybe this could be an element of a security hole if something is assuming there is only one URL for a given page.

The SEO problem could be addressed by setting a <link rel=canonical> in the page to point to Pizza.objects.get(pk=id).get_absolute_url() or some similar logic, or you could address the problem as a whole by setting up redirects or 404 responses, but all those approaches require a separate implementation for every view, since Django's code ultimately doesn't know which parts of the URL are going to be treated as values of a IntegerField. (Django's DetailView could implement some logic for this, however in reality there are lots of other situations where people take a string from the URL use it to look up a record with an IntegerField.)

Possible solutions I can think of are either:

  1. make some mechanism to very easily canonicalize URLs, by allowing users to somehow mark this situation explicitly in the URL conf, and then Django can set a property on the request object with the "canonicalized" URL. Then redirects or 404s or <link> tags could be implemented just once for all such URLs. (Redirects and 404s in a middleware, <link> tags in a base template.)
  2. Don't just pass strings to int in the model IntegerField. Instead only allow strings with ASCII digits to be used.

Change History (6)

comment:1 by Morgan Wahl, 2 months ago

Description: modified (diff)

(Fix typo in description.)

comment:2 by Morgan Wahl, 2 months ago

Description: modified (diff)

(Clarified possible workarounds.)

comment:3 by Natalia Bidart, 2 months ago

Component: UncategorizedCore (URLs)
Type: BugCleanup/optimization

Hello Morgan Wahl, thank you for your ticket.

The suggested option 2 would be backwards incompatible so I don't think that's an option we want to pursue. As to whether this needs fixing at all, I'm not entirely sure. I understand the potential cache and SEO invalidation issues, so I have shared this report with the security team to evaluate possible options.

I will update the ticket when I get more information.

comment:4 by Natalia Bidart, 2 months ago

Morgan, could you please provide details of how you are defining your view and urlconf? There was previous work in #28268 that would ensure that modern usages of URL definitions would not allow for these fullwidth unicode digits. So patterns like this:

  • path("something/<int:pk>/", something_detail)
  • re_path(r"^something/(?P<pk>[0-9]+)/$", something_detail)

properly raise 404 for non-ascii digits.

comment:5 by Morgan Wahl, 2 months ago

Resolution: wontfix
Status: newclosed

Ah! That was a good question. In the case where I ran into this, I see we're doing

re_path(r'^(?P<pk>\d+)/$', browse_views.material, name='browse.material'),

This particular code was written sometime before 2013, so path was not an option, and was originally running on Python 2, where \d only matches ASCII unless you passed the unicode flag.

I see we just need to use path with <int:...> or limit the regex.

Thanks for the help and sorry to take up your time!

comment:6 by Natalia Bidart, 2 months ago

Resolution: wontfixduplicate

Thank you for your quick reply Morgan! I'll adjust the resolution slightly to mark this as duplicate of #28268.

Note: See TracTickets for help on using tickets.
Back to Top