Opened 5 months ago

Last modified 4 months ago

#35533 assigned Bug

urlize() makes a bit of a mess of links embedded in Markdown

Reported by: Simon Willison
Owned by: DongwookKim0823
Component: Utilities
Version: 5.1
Severity: Normal
Keywords:
Cc: Adam Johnson
Triage Stage: Accepted
Has patch: yes
Needs documentation: no
Needs tests: no
Patch needs improvement: yes
Easy pickings: no
UI/UX: no

Description

I have this input text:

Annotated versions of talks I have given, with extensive notes and additional links. Here's [how I make these](https://simonwillison.net/2023/Aug/6/annotated-presentations/).

After running through Django's urlize helper function I got this:

... Here&#x27;s [how I make <a href="http://these](https://simonwillison.net/2023/Aug/6/annotated-presentations/)" rel="nofollow">these](https://simonwillison.net/2023/Aug/6/annotated-presentations/)</a>

Attachments (2)

screenshot-of-markdown.jpg (60.7 KB) - added by Simon Willison 5 months ago.
changes.patch (11.1 KB) - added by DongwookKim0823 5 months ago.


Change History (12)

comment:1 by Simon Willison, 5 months ago

The ideal output for this example would be:

Annotated versions of talks I have given, with extensive notes and additional links. Here's [how I make these](<a href="https://simonwillison.net/2023/Aug/6/annotated-presentations/" rel="nofollow">https://simonwillison.net/2023/Aug/6/annotated-presentations/</a>).
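The target shape described above can be sketched with the standard library alone. This is only an illustration of the desired output, not Django's implementation or the attached patch: `link_markdown_targets` and `_MD_LINK` are hypothetical names, and the sketch assumes the only thing to linkify is an `http(s)` URL sitting inside the parentheses of a Markdown inline link.

```python
import html
import re

# Hypothetical pattern: a Markdown inline link "[label](http(s)://target)".
_MD_LINK = re.compile(r"\[([^\]]*)\]\((https?://[^\s)]+)\)")


def link_markdown_targets(text):
    """Wrap only the URL inside "(...)" in an anchor, leaving "[label](" intact.

    Hypothetical helper for illustration; it does not escape the rest of
    the text the way urlize(autoescape=True) would.
    """

    def repl(match):
        label, url = match.group(1), match.group(2)
        safe = html.escape(url)  # escape the URL for use in the attribute and body
        return f'[{label}](<a href="{safe}" rel="nofollow">{safe}</a>)'

    return _MD_LINK.sub(repl, text)


sample = (
    "Here's [how I make these]"
    "(https://simonwillison.net/2023/Aug/6/annotated-presentations/)."
)
print(link_markdown_targets(sample))
```

The Markdown brackets and parentheses stay outside the anchor, so the link itself is no longer broken.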

Alternatively, not URLizing at all would be preferable to URLizing in a way that produces broken links.

I think what's happening here is the logic that looks for non-protocol links that include or end in .net may be kicking in, and deciding that the following is the URL that should be linked:

these](https://simonwillison.net/2023/Aug/6/annotated-presentations/)
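A small sketch of how such a heuristic could produce this result. The pattern below is an assumed, simplified stand-in for a "bare domain ending in a well-known TLD" check; it is written for illustration and should not be read as the exact expression Django uses. `BARE_DOMAIN` and `guess_href` are hypothetical names.

```python
import re

# Assumed stand-in for a protocol-less link heuristic: a word that does not
# start with "http", contains ".com", ".net", etc., and then ends or
# continues with a path.
BARE_DOMAIN = re.compile(
    r"^(?!http)\w[^@]+\.(com|edu|gov|int|mil|net|org)($|/.*)$",
    re.IGNORECASE,
)


def guess_href(word):
    """Return the href such a heuristic would build, or None if no match."""
    if BARE_DOMAIN.match(word):
        # The heuristic assumes a bare domain and glues a scheme on the front.
        return "http://" + word
    return None


word = "these](https://simonwillison.net/2023/Aug/6/annotated-presentations/)"
print(guess_href(word))
```

Because the whitespace-delimited word starts with `these](` rather than `http`, it passes the "not already a URL" check, and the `.net/` later in the word is enough to make the whole word look like a bare domain with a path.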

It's hard to suggest a fix for this. Ideally the code would "notice" that these](https://simonwillison.net/2023 is not a valid URL, but instead the logic is deciding that it's probably valid but should have http:// glued on the start.

Maybe we could have code that notices that http://these](https://simonwillison.net/2023/Aug/6/annotated-presentations/) is NOT a valid URL - you cannot have ]( in the middle of the hostname portion - and hence decides not to URLize it at all.
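One way this check could look, as a minimal sketch under stated assumptions: `has_plausible_host` is a hypothetical helper, not an existing Django function, and it only validates the hostname portion as dot-separated labels of letters, digits and hyphens. It deliberately ignores userinfo, IPv6 literals and internationalized domain names.

```python
import re

# Hypothetical helpers for illustration only.
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")
_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def has_plausible_host(candidate):
    """Return True if the host part of a candidate URL looks like a hostname."""
    # Strip a scheme only when the candidate actually starts with one.
    rest = _SCHEME.sub("", candidate, count=1)
    # The host runs up to the first "/", "?" or "#".
    host = re.split(r"[/?#]", rest, maxsplit=1)[0]
    host, _, port = host.partition(":")
    if port and not port.isdigit():
        return False
    # Every dot-separated label must be plain letters, digits and hyphens,
    # so characters such as "]" or "(" cause a rejection.
    return all(_LABEL.fullmatch(label) for label in host.split("."))


bad = "http://these](https://simonwillison.net/2023/Aug/6/annotated-presentations/)"
good = "https://simonwillison.net/2023/Aug/6/annotated-presentations/"
print(has_plausible_host(bad), has_plausible_host(good))
```

A caller could skip linkifying any candidate for which this returns False, which matches the "not URLizing at all" fallback mentioned earlier.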

(Background: the reason I'm seeing this is that my Django SQL Dashboard software tries to URLize text it displays, but has no way of knowing if a database column contains Markdown - this broken example came from https://simonwillison.net/dashboard/tags-with-descriptions/ )


by Simon Willison, 5 months ago

Attachment: screenshot-of-markdown.jpg added

comment:3 by Sarah Boyce, 5 months ago

Triage Stage: Unreviewed → Accepted

Thank you for the report! Replicated 👍

comment:4 by Sarah Boyce, 5 months ago

Cc: Adam Johnson added

comment:5 by DongwookKim0823, 5 months ago

Owner: changed from nobody to DongwookKim0823
Status: new → assigned

comment:6 by Vaarun Sinha, 5 months ago

Hey DongwookKim0823, are you actively working on this? Do you require any help? I would love to help.

in reply to:  6 comment:7 by DongwookKim0823, 5 months ago

Replying to Vaarun Sinha:

I'm currently working on this. I'll leave a comment if I need any help. Thank you! :)

comment:8 by DongwookKim0823, 5 months ago

Has patch: set

comment:9 by Mariusz Felisiak, 5 months ago

Patch needs improvement: set

by DongwookKim0823, 5 months ago

Attachment: changes.patch added

in reply to:  6 comment:10 by DongwookKim0823, 4 months ago

Replying to Vaarun Sinha:

I made some progress on this issue, but I think collaborating on a better solution would be great. I'd love to work together and find the best approach.
