Opened 3 weeks ago

Last modified 3 weeks ago

#36897 assigned Cleanup/optimization

Improve repercent_broken_unicode() performance

Reported by: Tarek Nakkouch
Owned by: beestarkdev
Component: Utilities
Version: 6.0
Severity: Normal
Keywords:
Cc: Harsh007
Triage Stage: Accepted
Has patch: yes
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

The repercent_broken_unicode() function in django/utils/encoding.py has performance issues when processing URLs with many consecutive invalid UTF-8 bytes. The bottleneck is due to raising an exception for each invalid byte and creating intermediate bytes objects through concatenation.

changed_parts = []
while True:
    try:
        path.decode()
    except UnicodeDecodeError as e:
        repercent = quote(path[e.start : e.end], safe=b"/#%[]=:;$&()+,!?*@'~")
        # creates new bytes object
        changed_parts.append(path[: e.start] + repercent.encode())
        path = path[e.end :]
    else:
        return b"".join(changed_parts) + path

Suggested optimization

The simplest solution is to append byte parts separately to the list instead of concatenating them with the + operator, avoiding creation of intermediate bytes objects. This provides ~40% improvement while keeping the same exception-based approach:

changed_parts = []
while True:
    try:
        path.decode()
    except UnicodeDecodeError as e:
        repercent = quote(path[e.start : e.end], safe=b"/#%[]=:;$&()+,!?*@'~")
        changed_parts.append(path[: e.start])
        changed_parts.append(repercent.encode())
        path = path[e.end :]
    else:
        changed_parts.append(path)
        return b"".join(changed_parts)
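As a quick sanity check that the change preserves behavior, both loops can be wrapped in functions and compared on the same inputs. This is a minimal sketch: the names `repercent_concat`, `repercent_append`, and `SAFE` are hypothetical and only serve the comparison, and the timing at the end is illustrative rather than a reproduction of the ~40% figure.

```python
import timeit
from urllib.parse import quote

# Same safe set as in the snippets above (hypothetical constant name).
SAFE = b"/#%[]=:;$&()+,!?*@'~"


def repercent_concat(path: bytes) -> bytes:
    """Loop from the description: concatenates prefix and escaped bytes."""
    changed_parts = []
    while True:
        try:
            path.decode()
        except UnicodeDecodeError as e:
            repercent = quote(path[e.start : e.end], safe=SAFE)
            changed_parts.append(path[: e.start] + repercent.encode())
            path = path[e.end :]
        else:
            return b"".join(changed_parts) + path


def repercent_append(path: bytes) -> bytes:
    """Suggested variant: appends the two parts separately."""
    changed_parts = []
    while True:
        try:
            path.decode()
        except UnicodeDecodeError as e:
            repercent = quote(path[e.start : e.end], safe=SAFE)
            changed_parts.append(path[: e.start])
            changed_parts.append(repercent.encode())
            path = path[e.end :]
        else:
            changed_parts.append(path)
            return b"".join(changed_parts)


if __name__ == "__main__":
    samples = [b"/plain", b"/caf\xe9/\xff\xfe", b"\xe2\x82", b"a\xff" * 50]
    for sample in samples:
        # Both variants must produce byte-identical output.
        assert repercent_concat(sample) == repercent_append(sample)
    payload = b"a\xff" * 5000  # many invalid bytes, as in the report
    for func in (repercent_concat, repercent_append):
        print(func.__name__, timeit.timeit(lambda: func(payload), number=5))
```

Only the way the pieces are collected differs between the two functions; the number of exceptions raised is the same in both.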

Alternatively, a manual UTF-8 validation approach could eliminate the exception overhead entirely by scanning the input byte by byte and checking it against the UTF-8 byte patterns, so that invalid sequences are identified without raising exceptions. This would reduce processing time by ~80%, though the implementation is more complex.
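One way the exception-free scan could look is sketched below. This is an assumption about the approach, not the code from any patch: the helper names `_utf8_sequence_length` and `repercent_scan` are hypothetical, and the byte ranges follow the well-formed UTF-8 table in the Unicode Standard. It relies on the fact that every byte the decoder would reject is >= 0x80 and is therefore always escaped as `%XX` by `quote()`, so escaping one rejected byte at a time should give the same output as the loops above.

```python
def _utf8_sequence_length(data: bytes, i: int) -> int:
    """Return the length of the well-formed UTF-8 sequence at data[i], or 0."""
    lead = data[i]
    if lead < 0x80:
        return 1  # ASCII
    # Number of continuation bytes and allowed range of the first one.
    if 0xC2 <= lead <= 0xDF:
        extra, low, high = 1, 0x80, 0xBF
    elif lead == 0xE0:
        extra, low, high = 2, 0xA0, 0xBF  # rejects overlong forms
    elif lead == 0xED:
        extra, low, high = 2, 0x80, 0x9F  # rejects surrogates
    elif 0xE1 <= lead <= 0xEF:
        extra, low, high = 2, 0x80, 0xBF
    elif lead == 0xF0:
        extra, low, high = 3, 0x90, 0xBF  # rejects overlong forms
    elif 0xF1 <= lead <= 0xF3:
        extra, low, high = 3, 0x80, 0xBF
    elif lead == 0xF4:
        extra, low, high = 3, 0x80, 0x8F  # rejects values above U+10FFFF
    else:
        return 0  # stray continuation byte, 0xC0/0xC1, or 0xF5..0xFF
    if i + extra >= len(data):
        return 0  # sequence is truncated by the end of the input
    if not low <= data[i + 1] <= high:
        return 0
    for j in range(i + 2, i + extra + 1):
        if not 0x80 <= data[j] <= 0xBF:
            return 0
    return extra + 1


def repercent_scan(path: bytes) -> bytes:
    """Percent-encode every byte that is not part of valid UTF-8."""
    parts = []
    run_start = 0  # start of the current run of valid bytes
    i = 0
    n = len(path)
    while i < n:
        length = _utf8_sequence_length(path, i)
        if length:
            i += length
            continue
        parts.append(path[run_start:i])  # flush the valid run
        parts.append(b"%%%02X" % path[i])  # escape the invalid byte
        i += 1
        run_start = i
    parts.append(path[run_start:])
    return b"".join(parts)


print(repercent_scan(b"/caf\xe9/\xff\xfe"))  # b'/caf%E9/%FF%FE'
```

A pure-Python per-byte loop like this trades exception overhead for interpreter overhead on every byte, so any speedup would need to be measured on both mostly valid and mostly invalid inputs before drawing conclusions.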

Change History (7)

comment:1 by Tarek Nakkouch, 3 weeks ago

Summary: Optimize repercent_broken_unicode() performance → Improve repercent_broken_unicode() performance

comment:2 by Harsh007, 3 weeks ago

Cc: Harsh007 added
Owner: set to Harsh007
Status: new → assigned

comment:3 by Harsh007, 3 weeks ago

Owner: Harsh007 removed
Status: assigned → new

comment:4 by beestarkdev, 3 weeks ago

Owner: set to beestarkdev
Status: new → assigned

comment:5 by beestarkdev, 3 weeks ago

Triage Stage: Unreviewed → Accepted

comment:6 by beestarkdev, 3 weeks ago

Has patch: set

comment:7 by beestarkdev, 3 weeks ago

Hi, here is the pull request: https://github.com/django/django/pull/20626

This is my first time ever contributing to open source so please feel free to give me feedback if there's anything I can improve on.

I have attempted to add another optimization to this function in addition to the recommendation here. I will post the testing/benchmarking methodologies as well in the pull request for full transparency. Thank you!
