Opened 4 hours ago

Last modified 3 hours ago

#36131 new Bug

URLValidator not correctly validating URLs

Reported by: Ludwig Kraatz Owned by:
Component: Core (Other) Version: 5.1
Severity: Normal Keywords: URL Validator
Cc: Ludwig Kraatz Triage Stage: Unreviewed
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: yes UI/UX: no

Description

Abstract

An URL is a way of describing a Resource.

https://resource -> is a valid URL.

Why do i raise this as issue

An URL resource-descriptor is constructed like that [RFC 3986#section-3]:

         foo://example.com:8042/over/there?name=ferret#nose
         \_/   \______________/\_________/ \_________/ \__/
          |           |            |            |        |
       scheme     authority       path        query   fragment

so: scheme, authority, rest...

The issue in djangos URLValidation I want to address, is a over-specification and 'selective circumvention of wrongful parsing' when it comes to the -host- compnent of the authority part.

What djangos URLValidator currently does:
host_re = "( FQDN-REGEX | localhost )"

Basically, django parses IP-OR-FQDN-OR-LOCALHOST-URLs.

This is basically the 'selective circumvention of wrongful parsing' i mentioned earlier. By ”| localhost" the URL field "feels" more okay, because all the obvious URLs on localhost that exist, now pass. But there is so much more than "localhost" besides FQDN as used for "(global) DNS URLs".

The RFC also acknowledges this. It is recommending using a syntax for hosts that conforms to the DNS syntax.

https://datatracker.ietf.org/doc/html/rfc3986#section-3.2.2

A host identified by a registered name is a sequence of characters
   usually intended for lookup within a locally defined host or service
   name registry, though the URI's scheme-specific semantics may require
   that a specific registry (or fixed name table) be used instead.  The
   most common name registry mechanism is the Domain Name System (DNS).
   A registered name intended for lookup in the DNS uses the syntax

   defined in Section 3.5 of [RFC1034] and Section 2.1 of [RFC1123].
   Such a name consists of a sequence of domain labels separated by ".",
   each domain label starting and ending with an alphanumeric character
   and possibly also containing "-" characters.  The rightmost domain
   label of a fully qualified domain name in DNS may be followed by a
   single "." and should be if it is necessary to distinguish between
   the complete domain name and some local domain.

      reg-name    = *( unreserved / pct-encoded / sub-delims )

   If the URI scheme defines a default for host, then that default
   applies when the host subcomponent is undefined or when the
   registered name is empty (zero length).  For example, the "file" URI
   scheme is defined so that no authority, an empty host, and
   "localhost" all mean the end-user's machine, whereas the "http"
   scheme considers a missing authority or empty host invalid.

   This specification does not mandate a particular registered name
   lookup technology and therefore does not restrict the syntax of reg-
   name beyond what is necessary for interoperability.  Instead, it
   delegates the issue of registered name syntax conformance to the
   operating system of each application performing URI resolution, and
   that operating system decides what it will allow for the purpose of
   host identification.  A URI resolution implementation might use DNS,
   host tables, yellow pages, NetInfo, WINS, or any other system for
   lookup of registered names.  However, a globally scoped naming
   system, such as DNS fully qualified domain names, is necessary for
   URIs intended to have global scope.  URI producers should use names
   that conform to the DNS syntax, even when use of DNS is not
   immediately apparent, and should limit these names to no more than
   255 characters in length.

What is said in many ways:

  • local host resolution is completely okay.
  • no "." is required as, a sequence (which is not further specified to length restrictions) can consist of 1, which would lack a "." seperator
  • host names that are -compatible-, are valid.

[RFC 6762 Multicast DNS # Section 3]

It is unimportant whether a name ending with ".local." occurred
   because the user explicitly typed in a fully qualified domain name
   ending in ".local.", or because the user entered an unqualified
   domain name and the host software appended the suffix ".local."
   because that suffix appears in the user's search list.

It is stated clearly, that a user can describe a resource with the implication, that if its not a fully qualified domain name, the TLD .local is to be assumed. As such - the URL, which is what the user would be referencing, was to be able to deal with more non-FQDN than just "localhost". This is in the context of Multicast DNS, which seems more than close enough to be considered relevant, when talking about URLs - as the URL RFC was so closely described around DNS.

[RFC 3986 URI/URL # 1.1]

 URIs that
   identify in relation to the end-user's local context should only be
   used when the context itself is a defining aspect of the resource,
   such as when an on-line help manual refers to a file on the end-
   user's file system (e.g., "file:///etc/hosts").
  • clearly states, that URI's are valid, even if they clearly only 'make sense' in a end-users local context.

As such - restricting django URLs to only Fully Qualified Domain Names/IPs, (except localhost.. for whatever reason except inconsitency :-* ) - is a restriction that contradicts that notion.

What i am proposing:

fully allowing for URLs as per rfc3986#section-3.2.2 - with a regex solution for localhost (and whatever else is possible) instead of a hardcoded < "magicnumber"-80%-"solution" >

To be Commited to django repository and pull requested. My earlier pull request is more - a starting point for discussion.

Why this is necessary & usefull:

Single-label URLs might be used

  • in intranet situations
  • for URLs that represent services / schemes that do not comply to FQDNaming conventions
  • for local testing (local DNS resolution that is not based on FQDN)
  • mDNS [RFC 6762] solutions, operating under .local TLD (which as of that RFC can be ommitted in a local context)
  • the django validator is named URLValidator, not FQDN_IP_LOCALHOST_URLValidator

Further notes:

i already submitted a pull request - which probably isn't mature enough.. given i did not even check which tests would break..

but - there was one test, that should not have broken:

FAIL: test_urlfield_clean_invalid (forms_tests.field_tests.test_urlfield.URLFieldTest.test_urlfield_clean_invalid) [<object object at 0x000001C1038C1760>] (value='foo')

URL <= "foo" should not be valid, even with my little changes, replacing 'localhost' with hostname_re

It feels like there are some (- -) missing - but i did not check.. i focused on providing a more solid ticket first..
So - if i am not mistaken, there is another issue besides what i propose. It seems, limiting hosts via FQDN was the thing, preventing missing URI-scheme's to be rejected by the validator, not a correct validation of uri-schemes themselves.

PS

its kindof late - i might polish this ticket tomorrow. if you feel like i'm drunk or disorganized - its just my brain thats screaming for relief. sry.

Change History (2)

comment:1 by Ludwig Kraatz, 4 hours ago

Has patch: set
Patch needs improvement: set

comment:2 by Ludwig Kraatz, 3 hours ago

https://github.com/django/django/pull/19096

Test cofirmed: URL Validation fails for "localhost" => which will be accepted, even though its not a valid URL. (it lacks scheme://)

Bottom line: whole regEx / Validation seems off on multiple layers.

Note: See TracTickets for help on using tickets.
Back to Top