Opened 4 hours ago
Last modified 3 hours ago
#36131 new Bug
URLValidator not correctly validating URLs
Reported by: | Ludwig Kraatz | Owned by: | |
---|---|---|---|
Component: | Core (Other) | Version: | 5.1 |
Severity: | Normal | Keywords: | URL Validator |
Cc: | Ludwig Kraatz | Triage Stage: | Unreviewed |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | yes |
Easy pickings: | yes | UI/UX: | no |
Description
Abstract
A URL is a way of describing a resource.
https://resource is a valid URL.
Why do I raise this as an issue?
A URL resource descriptor is constructed like this [RFC 3986#section-3]:
      foo://example.com:8042/over/there?name=ferret#nose
      \_/   \______________/\_________/ \_________/ \__/
       |           |            |            |        |
    scheme     authority       path        query   fragment
so: scheme, authority, rest...
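To illustrate that decomposition, here is a minimal sketch using Python's standard-library `urllib.parse.urlsplit` on the RFC's own example URL. It only shows the component structure; it is not how Django's URLValidator works internally.

```python
from urllib.parse import urlsplit

# The example URL from RFC 3986, section 3.
parts = urlsplit("foo://example.com:8042/over/there?name=ferret#nose")

print(parts.scheme)    # scheme
print(parts.netloc)    # authority (host plus optional port)
print(parts.hostname)  # host component of the authority
print(parts.port)      # port component of the authority
print(parts.path)      # path
print(parts.query)     # query
print(parts.fragment)  # fragment
```

The host component (`parts.hostname`) is the part this ticket is concerned with.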
The issue in Django's URL validation that I want to address is an over-specification and a 'selective circumvention of wrongful parsing' when it comes to the host component of the authority part.
What Django's URLValidator currently does:
host_re = "( FQDN-REGEX | localhost )"
Basically, Django parses IP-OR-FQDN-OR-LOCALHOST URLs.
This is the 'selective circumvention of wrongful parsing' I mentioned earlier. With "| localhost", the URL field "feels" more okay, because all the obvious URLs on localhost now pass. But besides the FQDNs used for "(global) DNS URLs", there is much more than just "localhost".
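The effect of an "FQDN or localhost" host pattern can be sketched with the standard `re` module. The pattern below is a simplified, ASCII-only approximation written for this illustration; the names `LABEL`, `TLD`, `FQDN_OR_LOCALHOST` and `host_matches` are hypothetical, and Django's actual pattern is more involved.

```python
import re

# Simplified, ASCII-only stand-in for an "FQDN | localhost" host pattern.
# This is an assumption for illustration, not Django's actual regex.
LABEL = r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?"
TLD = r"\.(?!-)(?:[a-z-]{2,63})(?<!-)\.?"
FQDN_OR_LOCALHOST = re.compile(
    r"(?:" + LABEL + r"(?:\." + LABEL + r")*" + TLD + r"|localhost)\Z",
    re.IGNORECASE,
)


def host_matches(host: str) -> bool:
    """Return True if host is a dotted name with a TLD, or exactly 'localhost'."""
    return FQDN_OR_LOCALHOST.match(host) is not None


for host in ("example.com", "localhost", "intranet", "printer"):
    print(host, host_matches(host))
```

Under this sketch, "localhost" is the only single-label host that passes; "intranet" and "printer" are rejected even though they are syntactically the same kind of name.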
The RFC also acknowledges this. It recommends, rather than mandates, a host syntax that conforms to the DNS syntax.
https://datatracker.ietf.org/doc/html/rfc3986#section-3.2.2
A host identified by a registered name is a sequence of characters usually intended for lookup within a locally defined host or service name registry, though the URI's scheme-specific semantics may require that a specific registry (or fixed name table) be used instead. The most common name registry mechanism is the Domain Name System (DNS). A registered name intended for lookup in the DNS uses the syntax defined in Section 3.5 of [RFC1034] and Section 2.1 of [RFC1123]. Such a name consists of a sequence of domain labels separated by ".", each domain label starting and ending with an alphanumeric character and possibly also containing "-" characters. The rightmost domain label of a fully qualified domain name in DNS may be followed by a single "." and should be if it is necessary to distinguish between the complete domain name and some local domain.
    reg-name = *( unreserved / pct-encoded / sub-delims )
If the URI scheme defines a default for host, then that default applies when the host subcomponent is undefined or when the registered name is empty (zero length). For example, the "file" URI scheme is defined so that no authority, an empty host, and "localhost" all mean the end-user's machine, whereas the "http" scheme considers a missing authority or empty host invalid.
This specification does not mandate a particular registered name lookup technology and therefore does not restrict the syntax of reg-name beyond what is necessary for interoperability. Instead, it delegates the issue of registered name syntax conformance to the operating system of each application performing URI resolution, and that operating system decides what it will allow for the purpose of host identification. A URI resolution implementation might use DNS, host tables, yellow pages, NetInfo, WINS, or any other system for lookup of registered names. However, a globally scoped naming system, such as DNS fully qualified domain names, is necessary for URIs intended to have global scope.
URI producers should use names that conform to the DNS syntax, even when use of DNS is not immediately apparent, and should limit these names to no more than 255 characters in length.
What this says, in several ways:
- Local host resolution is completely okay.
- No "." is required: a sequence of domain labels (with no minimum length specified) can consist of a single label, which has no "." separator.
- Host names that are compatible with the DNS syntax are valid.
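The generic `reg-name` rule quoted above can be turned into a small check. This is a sketch of the RFC 3986 grammar only (`unreserved`, `pct-encoded` and `sub-delims` as defined in the RFC); the names `REG_NAME` and `is_reg_name` are hypothetical and nothing here is taken from Django.

```python
import re

# reg-name = *( unreserved / pct-encoded / sub-delims )   (RFC 3986, section 3.2.2)
UNRESERVED = r"A-Za-z0-9\-._~"   # ALPHA / DIGIT / "-" / "." / "_" / "~"
SUB_DELIMS = r"!$&'()*+,;="      # "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
REG_NAME = re.compile(
    r"(?:[" + UNRESERVED + SUB_DELIMS + r"]|%[0-9A-Fa-f]{2})*\Z"
)


def is_reg_name(host: str) -> bool:
    """Return True if host matches the generic reg-name grammar.

    Note: the grammar allows the empty string; whether an empty host is
    acceptable is left to the URI scheme.
    """
    return REG_NAME.match(host) is not None


for host in ("resource", "localhost", "example.com", "bad host"):
    print(host, is_reg_name(host))
```

The grammar itself places no requirement on a "." being present, which is the point of the second bullet above.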
[RFC 6762 Multicast DNS # Section 3]
It is unimportant whether a name ending with ".local." occurred because the user explicitly typed in a fully qualified domain name ending in ".local.", or because the user entered an unqualified domain name and the host software appended the suffix ".local." because that suffix appears in the user's search list.
This states clearly that a user can describe a resource by an unqualified name, with the host software appending the suffix .local when that suffix is in the user's search list. As such, the URL, which is what the user would be referencing, should be able to deal with more non-FQDN hosts than just "localhost". This is in the context of Multicast DNS, which seems close enough to be relevant when talking about URLs, since the URI RFC is so closely built around DNS.
[RFC 3986 URI/URL # 1.1]
URIs that identify in relation to the end-user's local context should only be used when the context itself is a defining aspect of the resource, such as when an on-line help manual refers to a file on the end-user's file system (e.g., "file:///etc/hosts").
- This clearly states that URIs are valid even if they only 'make sense' in an end-user's local context.
As such, restricting Django URLs to only fully qualified domain names and IPs (except localhost, for no apparent reason other than inconsistency :-* ) is a restriction that contradicts that notion.
What I am proposing:
Fully allowing URLs as per rfc3986#section-3.2.2, with a regex solution for localhost (and whatever else is possible) instead of a hardcoded "magic number" 80% solution.
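One way such a regex solution could look is sketched below: any sequence of one or more DNS-style labels is accepted, so "localhost" no longer needs to be special-cased. This is an ASCII-only assumption for illustration, not the content of the pull request; the names `LABEL`, `DNS_STYLE_HOST` and `is_dns_style_host` are hypothetical.

```python
import re

# One DNS-style label: starts and ends with an alphanumeric character,
# may contain "-" in between, at most 63 characters long.
LABEL = r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?"

# One or more labels separated by ".", with an optional trailing dot.
DNS_STYLE_HOST = re.compile(
    LABEL + r"(?:\." + LABEL + r")*\.?\Z", re.IGNORECASE
)


def is_dns_style_host(host: str) -> bool:
    """Return True for single-label or multi-label DNS-style host names."""
    # RFC 3986 suggests limiting names to no more than 255 characters.
    return len(host) <= 255 and DNS_STYLE_HOST.match(host) is not None


for host in ("localhost", "resource", "example.com", "-bad", "a..b"):
    print(host, is_dns_style_host(host))
```

A real patch would also have to keep support for non-ASCII labels and decide how strict to be about the last label.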
To be committed to the Django repository via a pull request. My earlier pull request is more of a starting point for discussion.
Why this is necessary & useful:
Single-label URLs might be used
- in intranet situations
- for URLs that represent services / schemes that do not comply with FQDN naming conventions
- for local testing (local DNS resolution that is not based on FQDN)
- mDNS [RFC 6762] solutions, operating under the .local TLD (which, as of that RFC, can be omitted in a local context)
- the Django validator is named URLValidator, not FQDN_IP_LOCALHOST_URLValidator
Further notes:
I already submitted a pull request, which probably isn't mature enough, given that I did not even check which tests would break.
But there was one test that should not have broken:
FAIL: test_urlfield_clean_invalid (forms_tests.field_tests.test_urlfield.URLFieldTest.test_urlfield_clean_invalid) [<object object at 0x000001C1038C1760>] (value='foo')
The URL "foo" should not be valid, even with my small change of replacing 'localhost' with hostname_re.
It feels like there are some (- -) missing, but I did not check; I focused on providing a more solid ticket first.
So, if I am not mistaken, there is another issue besides what I propose. It seems that limiting hosts to FQDNs was what caused URLs with a missing scheme to be rejected by the validator, rather than a correct validation of the URI scheme itself.
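To make the distinction concrete, a scheme check that does not depend on the host pattern could look roughly like the sketch below. It follows the `scheme` rule from RFC 3986, section 3.1, plus the "://" separator; the names `SCHEME_PREFIX` and `has_scheme_prefix` are hypothetical, and this is not taken from Django's code or from the pull request.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )   (RFC 3986, section 3.1)
# followed by "://" to introduce the authority.
SCHEME_PREFIX = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*://")


def has_scheme_prefix(value: str) -> bool:
    """Return True if value starts with a syntactically valid 'scheme://'."""
    return SCHEME_PREFIX.match(value) is not None


for value in ("foo", "localhost", "http://foo", "https://resource"):
    print(value, has_scheme_prefix(value))
```

With a check like this applied on its own, "foo" and "localhost" are rejected for lacking a scheme regardless of which host names are allowed.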
PS
It's kind of late; I might polish this ticket tomorrow. If you feel like I'm drunk or disorganized, it's just my brain that's screaming for relief. Sorry.
Change History (2)
comment:1 by , 4 hours ago
Has patch: | set |
---|---|
Patch needs improvement: | set |
comment:2 by , 3 hours ago
https://github.com/django/django/pull/19096
Test confirmed: URL validation fails for "localhost", which is accepted even though it's not a valid URL (it lacks scheme://).
Bottom line: the whole regex / validation seems off on multiple layers.
https://github.com/django/django/pull/19095