Opened 3 weeks ago

Closed 2 weeks ago

Last modified 2 weeks ago

#36896 closed Cleanup/optimization (needsinfo)

Optimize TruncateCharsHTMLParser.process() to avoid redundant sum() calculation

Reported by: Tarek Nakkouch Owned by: absyol
Component: Utilities Version: 6.0
Severity: Normal Keywords:
Cc: Triage Stage: Unreviewed
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

The TruncateCharsHTMLParser.process() method in django/utils/text.py recalculates sum(len(p) for p in self.output) every time it processes a text chunk. For HTML with multiple text nodes, this repeatedly iterates over the growing output list unnecessarily.

def process(self, data):
    self.processed_chars += len(data)
    if (self.processed_chars == self.length) and (
        sum(len(p) for p in self.output) + len(data) == len(self.rawdata)
    ):
        self.output.append(data)
        raise self.TruncationCompleted
    output = escape("".join(data[: self.remaining]))
    return data, output

Suggested optimization

Cache the output length as self.output_len and increment it when appending to self.output:

  • Initialize self.output_len = 0 in TruncateHTMLParser.__init__()
  • Increment in handle_starttag(), handle_endtag(), handle_data(), feed(), and process()
  • Replace sum(len(p) for p in self.output) with self.output_len

This eliminates redundant iteration over already-processed output.

Change History (9)

comment:1 by absyol, 3 weeks ago

Owner: set to absyol
Status: newassigned

comment:2 by absyol, 3 weeks ago

Hello, I am new here (and to open source contributions in general). Please feel feel free to let me know if there is anything I can improve in.

I do agree that iterating through every item in self.output is redundant and we can do it while adding to the list. I have a solution right now that implements a helper function to append a items in a list to self.output and compute self.output_length on the fly. Each modification to self.output will be replaced with this helper function so extending the class later will be less error-prone.

    def update_output_fields(self, outputs):
        for output in outputs:
            self.output.append(output)
            self.output_length += len(output)

When modifying one of the other functions, we can call it like the following:

    def handle_data(self, data):
        data, output = self.process(data)
        data_len = len(data)
        if self.remaining < data_len:
            self.remaining = 0
            
            # call here
            self.update_output_fields([add_truncation_text(output, self.replacement)])
            raise self.TruncationCompleted
        self.remaining -= data_len

        # call here
        self.update_output_fields([output])

We take a list as input to support feed():

    def feed(self, data):
        try:
            super().feed(data)
        except self.TruncationCompleted:
            self.update_output_fields([f"</{tag}>" for tag in self.tags])
            self.tags.clear()
            self.reset()
        else:
            # No data was handled.
            self.reset()

Would this be sufficient for this optimization?

comment:3 by absyol, 3 weeks ago

Triage Stage: UnreviewedAccepted

comment:4 by absyol, 3 weeks ago

Has patch: set

comment:6 by JaeHyuckSa, 2 weeks ago

Triage Stage: AcceptedUnreviewed

Please note that the Triage Stage is generally handled during the maintainers’ review process, so please try to avoid modifying this section.

comment:7 by Jacob Walls, 2 weeks ago

I would offer a small addendum: community triage is welcome in my opinion, but unless the issue is completely "cut and dry", at least a sentence with your findings is helpful. The next steps box above says, "For new features or cleanups: give a second opinion of the proposal."

comment:8 by Jacob Walls, 2 weeks ago

Resolution: needsinfo
Status: assignedclosed

Do you have a benchmark showing an improvement? Checking the provided PR against a script provided to the Security Team in a prior report shows that this degrades performance.

comment:9 by absyol, 2 weeks ago

Thank you so much for the feedback. I ran a benchmark on an HTML string greater than 1000 characters and I don't see much of an improvement from my changes. I apologize for moving quickly and will do more verification on the problem before working on the fix moving forward.

Note: See TracTickets for help on using tickets.
Back to Top