Opened 106 minutes ago

Last modified 101 minutes ago

#36897 new Cleanup/optimization

Improve repercent_broken_unicode() performance

Reported by: Tarek Nakkouch
Owned by:
Component: Utilities
Version: 6.0
Severity: Normal
Keywords:
Cc:
Triage Stage: Unreviewed
Has patch: no
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

The repercent_broken_unicode() function in django/utils/encoding.py performs poorly on URLs with many consecutive invalid UTF-8 bytes. Each invalid byte raises and handles a separate UnicodeDecodeError, and each iteration builds an intermediate bytes object through concatenation:

changed_parts = []
while True:
    try:
        path.decode()
    except UnicodeDecodeError as e:
        repercent = quote(path[e.start : e.end], safe=b"/#%[]=:;$&()+,!?*@'~")
        # The + concatenation below creates a new intermediate bytes object
        # on every iteration.
        changed_parts.append(path[: e.start] + repercent.encode())
        path = path[e.end :]
    else:
        return b"".join(changed_parts) + path
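To reproduce the slowdown described above, a minimal timing sketch might look like the following. The names repercent_current and time_invalid_run and the input sizes are illustrative assumptions, not part of Django; the loop is copied from the snippet above and wrapped in a function so it can be timed. Note that path[e.end :] also copies the remaining bytes on every iteration, which may contribute to the cost alongside the exception handling.

```python
import timeit
from urllib.parse import quote


def repercent_current(path):
    """Copy of the loop quoted in the ticket, wrapped in a function for timing."""
    changed_parts = []
    while True:
        try:
            path.decode()
        except UnicodeDecodeError as e:
            repercent = quote(path[e.start : e.end], safe=b"/#%[]=:;$&()+,!?*@'~")
            changed_parts.append(path[: e.start] + repercent.encode())
            path = path[e.end :]
        else:
            return b"".join(changed_parts) + path


def time_invalid_run(n, repeat=3):
    """Best-of-`repeat` time in seconds for a path with n consecutive invalid bytes."""
    payload = b"/" + b"\xff" * n  # 0xFF can never appear in valid UTF-8
    timings = timeit.repeat(lambda: repercent_current(payload), number=1, repeat=repeat)
    return min(timings)


if __name__ == "__main__":
    # Hypothetical sizes; doubling n makes any non-linear growth easy to spot.
    for n in (1_000, 2_000, 4_000, 8_000):
        print(n, f"{time_invalid_run(n):.4f}s")
```

If the time more than doubles each time n doubles, the loop is scaling worse than linearly for this kind of input.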

Suggested optimization

The simplest fix is to append the byte parts to the list separately instead of concatenating them with the + operator, which avoids creating intermediate bytes objects. This gives a ~40% improvement while keeping the same exception-based approach:

changed_parts = []
while True:
    try:
        path.decode()
    except UnicodeDecodeError as e:
        repercent = quote(path[e.start : e.end], safe=b"/#%[]=:;$&()+,!?*@'~")
        changed_parts.append(path[: e.start])
        changed_parts.append(repercent.encode())
        path = path[e.end :]
    else:
        changed_parts.append(path)
        return b"".join(changed_parts)
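As a sanity check that the change preserves behavior, the two loops can be wrapped in functions and compared on a few inputs. This is only a sketch: the function names, the SAFE constant, and the sample paths are assumptions made for illustration, and the comparison says nothing about speed.

```python
from urllib.parse import quote

SAFE = b"/#%[]=:;$&()+,!?*@'~"


def repercent_concat(path):
    """Current loop from the ticket: concatenates before appending."""
    changed_parts = []
    while True:
        try:
            path.decode()
        except UnicodeDecodeError as e:
            repercent = quote(path[e.start : e.end], safe=SAFE)
            changed_parts.append(path[: e.start] + repercent.encode())
            path = path[e.end :]
        else:
            return b"".join(changed_parts) + path


def repercent_separate(path):
    """Suggested loop: appends each part on its own and joins once at the end."""
    changed_parts = []
    while True:
        try:
            path.decode()
        except UnicodeDecodeError as e:
            repercent = quote(path[e.start : e.end], safe=SAFE)
            changed_parts.append(path[: e.start])
            changed_parts.append(repercent.encode())
            path = path[e.end :]
        else:
            changed_parts.append(path)
            return b"".join(changed_parts)


# Illustrative inputs: empty, ASCII, valid UTF-8, invalid runs, truncated
# sequence, and an encoded surrogate (which strict UTF-8 rejects).
SAMPLES = [
    b"",
    b"/plain/ascii",
    b"/caf\xc3\xa9",
    b"/\xff\xfe\xfd",
    b"/a\xe2\x82/b",
    b"\xed\xa0\x80",
]


def outputs_match(samples=SAMPLES):
    """Return True if both loops produce identical bytes for every sample."""
    return all(repercent_concat(s) == repercent_separate(s) for s in samples)
```

Calling outputs_match() on a larger or randomized corpus would give more confidence than these few hand-picked samples.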

Alternatively, manual UTF-8 validation could eliminate the exception overhead entirely: scan the path byte by byte and check UTF-8 byte patterns to identify invalid sequences without raising exceptions. This would reduce processing time by ~80%, though the implementation is more complex.
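One possible shape for such a scanner is sketched below. It is not the reporter's implementation: the function names are hypothetical, the byte ranges follow the well-formed UTF-8 table in the Unicode Standard, and it assumes that percent-encoding every byte that does not begin a well-formed sequence yields the same output as the exception-based loop. It is not benchmarked here, so the timing figure above remains the reporter's claim.

```python
def _valid_sequence_length(data, i, n):
    """Return the length of the well-formed UTF-8 sequence at data[i], or 0."""
    b0 = data[i]
    if b0 < 0x80:
        return 1  # ASCII
    # Pick the sequence length and the allowed range for the second byte.
    if 0xC2 <= b0 <= 0xDF:
        need, lo, hi = 2, 0x80, 0xBF
    elif b0 == 0xE0:
        need, lo, hi = 3, 0xA0, 0xBF  # excludes overlong 3-byte forms
    elif b0 == 0xED:
        need, lo, hi = 3, 0x80, 0x9F  # excludes surrogates
    elif 0xE1 <= b0 <= 0xEF:
        need, lo, hi = 3, 0x80, 0xBF
    elif b0 == 0xF0:
        need, lo, hi = 4, 0x90, 0xBF  # excludes overlong 4-byte forms
    elif b0 == 0xF4:
        need, lo, hi = 4, 0x80, 0x8F  # excludes code points above U+10FFFF
    elif 0xF1 <= b0 <= 0xF3:
        need, lo, hi = 4, 0x80, 0xBF
    else:
        return 0  # 0x80..0xC1 and 0xF5..0xFF never start a sequence
    if i + need > n:
        return 0  # truncated at the end of the input
    if not lo <= data[i + 1] <= hi:
        return 0
    for k in range(i + 2, i + need):
        if not 0x80 <= data[k] <= 0xBF:
            return 0
    return need


def repercent_scan(path):
    """Percent-encode bytes that are not part of well-formed UTF-8 (sketch)."""
    parts = []
    n = len(path)
    i = 0
    run_start = 0  # start of the current run of valid bytes
    while i < n:
        length = _valid_sequence_length(path, i, n)
        if length:
            i += length
            continue
        if run_start < i:
            parts.append(path[run_start:i])
        parts.append(b"%%%02X" % path[i])  # e.g. 0xFF becomes b"%FF"
        i += 1
        run_start = i
    if not parts:
        return path  # nothing to encode
    parts.append(path[run_start:])
    return b"".join(parts)
```

For example, repercent_scan(b"/caf\xc3\xa9/\xff") leaves the valid two-byte sequence alone and returns b"/caf\xc3\xa9/%FF". Before relying on a scanner like this, it would be worth comparing its output against the existing function on randomized input.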

Change History (1)

comment:1 by Tarek Nakkouch, 101 minutes ago

Summary: Optimize repercent_broken_unicode() performance → Improve repercent_broken_unicode() performance