#36896 closed Cleanup/optimization (needsinfo)
Optimize TruncateCharsHTMLParser.process() to avoid redundant sum() calculation
| Reported by: | Tarek Nakkouch | Owned by: | absyol |
|---|---|---|---|
| Component: | Utilities | Version: | 6.0 |
| Severity: | Normal | Keywords: | |
| Cc: | Triage Stage: | Unreviewed | |
| Has patch: | yes | Needs documentation: | no |
| Needs tests: | no | Patch needs improvement: | no |
| Easy pickings: | no | UI/UX: | no |
Description
The TruncateCharsHTMLParser.process() method in django/utils/text.py recalculates sum(len(p) for p in self.output) every time it processes a text chunk. For HTML with multiple text nodes, this repeatedly iterates over the growing output list unnecessarily.
```python
def process(self, data):
    self.processed_chars += len(data)
    if (self.processed_chars == self.length) and (
        sum(len(p) for p in self.output) + len(data) == len(self.rawdata)
    ):
        self.output.append(data)
        raise self.TruncationCompleted
    output = escape("".join(data[: self.remaining]))
    return data, output
```
Suggested optimization
Cache the output length as self.output_len and increment it when appending to self.output:
- Initialize `self.output_len = 0` in `TruncateHTMLParser.__init__()`
- Increment in `handle_starttag()`, `handle_endtag()`, `handle_data()`, `feed()`, and `process()`
- Replace `sum(len(p) for p in self.output)` with `self.output_len`
This eliminates redundant iteration over already-processed output.
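To illustrate the pattern in isolation, here is a toy sketch (this is not Django's `TruncateHTMLParser`; `OutputBuffer` and its `append()` are hypothetical names used only to show the cached-counter idea from the proposal):

```python
class OutputBuffer:
    """Toy illustration: maintain a running total of appended chunk
    lengths instead of recomputing sum(len(p) for p in output)."""

    def __init__(self):
        self.output = []
        self.output_len = 0  # cached counter, as suggested in the ticket

    def append(self, chunk):
        self.output.append(chunk)
        self.output_len += len(chunk)  # O(1) update instead of an O(n) sum


buf = OutputBuffer()
for chunk in ["<p>", "Hello", "</p>"]:
    buf.append(chunk)

# The cached counter stays consistent with the recomputed sum.
assert buf.output_len == sum(len(p) for p in buf.output) == 12
print(buf.output_len)
```

The invariant is that `output_len` must be updated at every site that mutates `output`, which is why the ticket lists all the handler methods above.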
Change History (9)
comment:1 by , 3 weeks ago
| Owner: | set to |
|---|---|
| Status: | new → assigned |
comment:2 by , 3 weeks ago
comment:3 by , 3 weeks ago
| Triage Stage: | Unreviewed → Accepted |
|---|
comment:4 by , 3 weeks ago
| Has patch: | set |
|---|
comment:6 by , 2 weeks ago
| Triage Stage: | Accepted → Unreviewed |
|---|
Please note that the Triage Stage is generally handled during the maintainers’ review process, so please try to avoid modifying this section.
comment:7 by , 2 weeks ago
I would offer a small addendum: community triage is welcome in my opinion, but unless the issue is completely "cut and dry", at least a sentence with your findings is helpful. The next steps box above says, "For new features or cleanups: give a second opinion of the proposal."
comment:8 by , 2 weeks ago
| Resolution: | → needsinfo |
|---|---|
| Status: | assigned → closed |
Do you have a benchmark showing an improvement? Checking the provided PR against a script provided to the Security Team in a prior report shows that this degrades performance.
comment:9 by , 2 weeks ago
Thank you so much for the feedback. I ran a benchmark on an HTML string greater than 1000 characters and I don't see much of an improvement from my changes. I apologize for moving quickly and will do more verification on the problem before working on the fix moving forward.
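For reference, a standalone micro-benchmark of the two strategies can be sketched as follows (the chunk count and sizes are illustrative assumptions, not the Security Team's script; `recompute_each_append` mimics the current `sum()` call and `cached_counter` mimics the proposed counter):

```python
import timeit

# Illustrative workload: 1000 chunks of 10 characters each.
chunks = ["x" * 10] * 1000

def recompute_each_append():
    # Recompute the total length on every append: O(n) per append.
    output, seen = [], 0
    for chunk in chunks:
        output.append(chunk)
        seen = sum(len(p) for p in output)
    return seen

def cached_counter():
    # Maintain a running counter: O(1) per append.
    output, total = [], 0
    for chunk in chunks:
        output.append(chunk)
        total += len(chunk)
    return total

# Both strategies must agree on the final length.
assert recompute_each_append() == cached_counter()

for fn in (recompute_each_append, cached_counter):
    t = timeit.timeit(fn, number=20)
    print(f"{fn.__name__}: {t:.4f}s")
```

In this synthetic shape the cached counter wins because the recomputation is quadratic overall, but as noted above, the real parser may see few text nodes per input, so a benchmark on representative HTML is what matters.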
Hello, I am new here (and to open source contributions in general). Please feel free to let me know if there is anything I can improve on.
I do agree that iterating through every item in `self.output` is redundant and we can do it while adding to the list. I have a solution right now that implements a helper function to append items from a list to `self.output` and compute `self.output_length` on the fly. Each modification to `self.output` will be replaced with this helper function, so extending the class later will be less error-prone.

```python
def update_output_fields(self, outputs):
    for output in outputs:
        self.output.append(output)
        self.output_length += len(output)
```

When modifying one of the other functions, we can call it like the following:
```python
def handle_data(self, data):
    data, output = self.process(data)
    data_len = len(data)
    if self.remaining < data_len:
        self.remaining = 0
        # call here
        self.update_output_fields([add_truncation_text(output, self.replacement)])
        raise self.TruncationCompleted
    self.remaining -= data_len
    # call here
    self.update_output_fields([output])
```

We take a list as input to support `feed()`:

```python
def feed(self, data):
    try:
        super().feed(data)
    except self.TruncationCompleted:
        self.update_output_fields([f"</{tag}>" for tag in self.tags])
        self.tags.clear()
        self.reset()
    else:
        # No data was handled.
        self.reset()
```

Would this be sufficient for this optimization?