Context Navigation

← Previous Ticket
Next Ticket →

#36896 closed Cleanup/optimization (needsinfo)

Optimize TruncateCharsHTMLParser.process() to avoid redundant sum() calculation

Reported by:	Tarek Nakkouch	Owned by:	absyol
Component:	Utilities	Version:	6.0
Severity:	Normal	Keywords:
Cc:		Triage Stage:	Unreviewed
Has patch:	yes	Needs documentation:	no
Needs tests:	no	Patch needs improvement:	no
Easy pickings:	no	UI/UX:	no

Description

The TruncateCharsHTMLParser.process() method in django/utils/text.py recalculates sum(len(p) for p in self.output) every time it processes a text chunk. For HTML with multiple text nodes, this repeatedly iterates over the growing output list unnecessarily.

def process(self, data):
    self.processed_chars += len(data)
    if (self.processed_chars == self.length) and (
        sum(len(p) for p in self.output) + len(data) == len(self.rawdata)
    ):
        self.output.append(data)
        raise self.TruncationCompleted
    output = escape("".join(data[: self.remaining]))
    return data, output

Suggested optimization

Cache the output length as self.output_len and increment it when appending to self.output:

Initialize self.output_len = 0 in TruncateHTMLParser.__init__()
Increment in handle_starttag(), handle_endtag(), handle_data(), feed(), and process()
Replace sum(len(p) for p in self.output) with self.output_len

This eliminates redundant iteration over already-processed output.

Change History (9)

comment:1 by absyol, 6 weeks ago

Owner:	set to absyol
Status:	new → assigned

comment:2 by absyol, 6 weeks ago

Hello, I am new here (and to open source contributions in general). Please feel feel free to let me know if there is anything I can improve in.

I do agree that iterating through every item in self.output is redundant and we can do it while adding to the list. I have a solution right now that implements a helper function to append a items in a list to self.output and compute self.output_length on the fly. Each modification to self.output will be replaced with this helper function so extending the class later will be less error-prone.

    def update_output_fields(self, outputs):
        for output in outputs:
            self.output.append(output)
            self.output_length += len(output)

When modifying one of the other functions, we can call it like the following:

    def handle_data(self, data):
        data, output = self.process(data)
        data_len = len(data)
        if self.remaining < data_len:
            self.remaining = 0
            
            # call here
            self.update_output_fields([add_truncation_text(output, self.replacement)])
            raise self.TruncationCompleted
        self.remaining -= data_len

        # call here
        self.update_output_fields([output])

We take a list as input to support feed():

    def feed(self, data):
        try:
            super().feed(data)
        except self.TruncationCompleted:
            self.update_output_fields([f"</{tag}>" for tag in self.tags])
            self.tags.clear()
            self.reset()
        else:
            # No data was handled.
            self.reset()

Would this be sufficient for this optimization?

comment:3 by absyol, 6 weeks ago

Triage Stage:	Unreviewed → Accepted

comment:4 by absyol, 6 weeks ago

Has patch:	set

comment:5 by absyol, 6 weeks ago

Here is the pull request - https://github.com/django/django/pull/20624

comment:6 by JaeHyuckSa, 5 weeks ago

Triage Stage:	Accepted → Unreviewed

Please note that the Triage Stage is generally handled during the maintainers’ review process, so please try to avoid modifying this section.

comment:7 by Jacob Walls, 5 weeks ago

I would offer a small addendum: community triage is welcome in my opinion, but unless the issue is completely "cut and dry", at least a sentence with your findings is helpful. The next steps box above says, "For new features or cleanups: give a second opinion of the proposal."

comment:8 by Jacob Walls, 5 weeks ago

Resolution:	→ needsinfo
Status:	assigned → closed

Do you have a benchmark showing an improvement? Checking the provided PR against a script provided to the Security Team in a prior report shows that this degrades performance.

comment:9 by absyol, 5 weeks ago

Thank you so much for the feedback. I ran a benchmark on an HTML string greater than 1000 characters and I don't see much of an improvement from my changes. I apologize for moving quickly and will do more verification on the problem before working on the fix moving forward.

Note: See TracTickets for help on using tickets.

Download in other formats:

Issues