Opened 2 years ago

Closed 22 months ago

#20568 closed Bug (fixed)

templatetag truncatewords_html split words containing HTML entities

Reported by: yann0@…
Owned by: jaap3
Component: Utilities
Version: master
Severity: Normal
Keywords:
Cc: bmispelon@…
Triage Stage: Accepted
Has patch: yes
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

I'm working on an English / French website, and when I use truncatewords_html on French text with special characters like "é, è, à, etc." (which are very common), it splits words in half at those characters.

Example:
Depuis mars 2008, le programme RECYC-FRIGO d’Hydro-Québec vous permet de vous débarrasser d’un vieil appareil, réfrigérateur ou congélateur, facilement [...]

becomes:
Depuis mars 2008, le programme RECYC-FRIGO d’Hydro-Québec vous permet de vous débarrasser d’un vieil appareil, r ...

Change History (6)

comment:1 Changed 2 years ago by bmispelon

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Resolution set to worksforme
  • Status changed from new to closed

Hi,

I cannot reproduce the issue you're describing.
I tried the following code with both 1.3 and master but it seems to be working correctly for me:

>>> from django.template.defaultfilters import truncatewords_html
>>> s = u"Depuis mars 2008, le programme RECYC-FRIGO d’Hydro-Québec vous permet de vous débarrasser d’un vieil appareil, réfrigérateur ou congélateur, facilement [...]"
>>> truncatewords_html(s, 18)
u'Depuis mars 2008, le programme RECYC-FRIGO d\u2019Hydro-Qu\xe9bec vous permet de vous d\xe9barrasser d\u2019un vieil appareil, r\xe9frig\xe9rateur ...'

I'm closing this ticket as worksforme.
Could you please reopen it with an example of a piece of code that shows the issue you're having?

Thanks.

comment:2 Changed 2 years ago by jaap3

  • Resolution worksforme deleted
  • Status changed from closed to new

I can reproduce it, but only if I convert the special characters to HTML entities first. I think that might be the actual cause:

>>> s = u'Depuis mars 2008, le programme RECYC-FRIGO d\u2019Hydro-Qu&eacute;bec vous permet de vous d&eacute;barrasser d\u2019un vieil appareil, r&eacute;frig&eacute;rateur ou cong&eacute;lateur, facilement'
>>> truncatewords_html(s, 8)
u'Depuis mars 2008, le programme RECYC-FRIGO d\u2019Hydro-Qu ...'

comment:3 Changed 2 years ago by bmispelon

  • Cc bmispelon@… added
  • Summary changed from templatetag truncatewords_html split words on special caracters to templatetag truncatewords_html split words containing HTML entities
  • Triage Stage changed from Unreviewed to Accepted
  • Version changed from 1.3 to master

Hi,

Thanks for reopening this, there does appear to be an issue.

I made some quick tests and it seems that this behavior has always been present.

The problem seems to be that the regexp used to split words [1] doesn't consider a & to be part of a word, hence the behavior.
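
To illustrate the diagnosis above, here is a minimal sketch of how a word pattern that excludes & breaks an entity into pieces. The pattern `WORD_RE` and the helper `split_words` are hypothetical simplifications for illustration, not Django's actual regexp:

```python
import re

# Hypothetical, simplified word pattern: a word character followed by
# word characters or hyphens. "&" and ";" are not word characters.
WORD_RE = re.compile(r"\w[\w-]*")


def split_words(text):
    """Return every substring the simplified pattern treats as a word."""
    return WORD_RE.findall(text)


# The literal character stays inside one word...
print(split_words("Hydro-Québec"))
# ...but the entity form is broken at "&" and ";".
print(split_words("Hydro-Qu&eacute;bec"))
```

With the entity form, the first "word" ends at `Hydro-Qu`, which matches the truncation point reported in comment:2.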

comment:4 Changed 2 years ago by jaap3

What about converting HTML entities back to characters before applying the regex? I just whipped up a quick proof of concept that seems to work fine (and uses only stdlib code):

>>> import xml.sax.saxutils
>>> import htmlentitydefs
>>> entity2unicode = dict([('&%s;' % k, unichr(v)) for k, v in htmlentitydefs.name2codepoint.items()])
>>> truncatewords_html(xml.sax.saxutils.unescape(s, entity2unicode), 8)
u'Depuis mars 2008, le programme RECYC-FRIGO d\u2019Hydro-Qu\xe9bec ...'
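
The proof of concept above is Python 2 code (`htmlentitydefs`, `unichr`). A rough Python 3 sketch of the same idea follows, using the standard library's `html.unescape`. The helper `truncate_words` and the pattern `WORD_RE` are hypothetical stand-ins for illustration and are not Django's implementation:

```python
import html
import re

# Hypothetical simplified word pattern; not Django's actual regexp.
WORD_RE = re.compile(r"\w[\w-]*")


def truncate_words(text, num, ellipsis=" ..."):
    """Cut text right after its num-th word match (illustrative helper)."""
    end = None
    for count, match in enumerate(WORD_RE.finditer(text), start=1):
        if count == num:
            end = match.end()
        elif count > num:
            # There is at least one more word, so truncation is needed.
            return text[:end] + ellipsis
    return text  # num or fewer words: nothing to cut


s = "le programme Hydro-Qu&eacute;bec vous permet"

# Entity left in place: the third "word" ends in the middle of Québec.
print(truncate_words(s, 3))
# Entity converted to a character first: the word stays whole.
print(truncate_words(html.unescape(s), 3))
```

One trade-off of unescaping first is that the output contains literal characters instead of the original entities, and `html.unescape` also turns `&lt;` and `&amp;` into `<` and `&`, which may not be wanted when the result is rendered as HTML.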

comment:5 Changed 2 years ago by jaap3

  • Has patch set
  • Owner changed from nobody to jaap3
  • Status changed from new to assigned

I noticed that the django.utils.text module already has an unescape_entities function, so I created this pull request:

https://github.com/django/django/pull/1332

comment:6 Changed 22 months ago by Tim Graham <timograham@…>

  • Resolution set to fixed
  • Status changed from assigned to closed

In 40b95a24ae159b6600457a23d6c2779a18037b7b:

Fixed #20568 -- truncatewords_html no longer splits words containing HTML entities.

Thanks yann0 at hotmail.com for the report.
