Improve repercent_broken_unicode() performance
The repercent_broken_unicode() function in django/utils/encoding.py performs poorly when processing URLs with many consecutive invalid UTF-8 bytes. The bottleneck comes from raising a UnicodeDecodeError for each invalid sequence and from building intermediate bytes objects through concatenation:
changed_parts = []
while True:
    try:
        path.decode()
    except UnicodeDecodeError as e:
        repercent = quote(path[e.start : e.end], safe=b"/#%[]=:;$&()+,!?*@'~")
        # creates a new intermediate bytes object on every iteration
        changed_parts.append(path[: e.start] + repercent.encode())
        path = path[e.end :]
    else:
        return b"".join(changed_parts) + path
Suggested optimization
The simplest solution is to append the byte parts to the list separately instead of concatenating them with the + operator, avoiding the creation of intermediate bytes objects. This yields roughly a 40% improvement while keeping the same exception-based approach:
changed_parts = []
while True:
    try:
        path.decode()
    except UnicodeDecodeError as e:
        repercent = quote(path[e.start : e.end], safe=b"/#%[]=:;$&()+,!?*@'~")
        changed_parts.append(path[: e.start])
        changed_parts.append(repercent.encode())
        path = path[e.end :]
    else:
        changed_parts.append(path)
        return b"".join(changed_parts)
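To make the difference concrete, here is a hedged, self-contained sketch that places the two variants side by side and times them on a worst-case input. The function names (repercent_concat, repercent_append) and the test path are hypothetical, chosen for the comparison; the actual Django function also does more than this excerpt shows.

```python
import timeit
from urllib.parse import quote

SAFE = b"/#%[]=:;$&()+,!?*@'~"


def repercent_concat(path):
    # Original approach: builds an intermediate bytes object per invalid
    # sequence via `+` before appending.
    changed_parts = []
    while True:
        try:
            path.decode()
        except UnicodeDecodeError as e:
            repercent = quote(path[e.start : e.end], safe=SAFE)
            changed_parts.append(path[: e.start] + repercent.encode())
            path = path[e.end :]
        else:
            return b"".join(changed_parts) + path


def repercent_append(path):
    # Optimized approach: appends the parts separately and joins once.
    changed_parts = []
    while True:
        try:
            path.decode()
        except UnicodeDecodeError as e:
            repercent = quote(path[e.start : e.end], safe=SAFE)
            changed_parts.append(path[: e.start])
            changed_parts.append(repercent.encode())
            path = path[e.end :]
        else:
            changed_parts.append(path)
            return b"".join(changed_parts)


if __name__ == "__main__":
    # A path with many consecutive invalid bytes (hypothetical worst case).
    path = b"/test/" + b"\xff" * 1000
    assert repercent_concat(path) == repercent_append(path)
    for fn in (repercent_concat, repercent_append):
        elapsed = timeit.timeit(lambda: fn(path), number=50)
        print(f"{fn.__name__}: {elapsed:.3f}s")
```

Both variants produce identical output; only the allocation pattern differs, which is where the reported speedup comes from.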
Alternatively, a manual UTF-8 validation approach could eliminate the exception overhead entirely by scanning the input byte by byte and checking lead/continuation patterns to identify invalid sequences without raising exceptions. This would reduce processing time by ~80%, though the implementation is more complex.
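As a rough illustration of that idea, the sketch below (the function names are hypothetical, not Django's implementation) walks the bytes once, using each lead byte to determine the expected sequence length and percent-encoding any byte that does not start a complete, well-formed sequence. Note this simplified check accepts some ill-formed input (overlong encodings, surrogates) that bytes.decode() rejects; a production version would need the full per-lead-byte range checks for the second byte (e.g. E0 requires A0-BF).

```python
from urllib.parse import quote

SAFE = b"/#%[]=:;$&()+,!?*@'~"


def _utf8_seq_len(lead):
    # Expected sequence length for a UTF-8 lead byte; 0 means invalid lead.
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def repercent_broken_unicode_scan(path):
    # Scan once; percent-encode any byte that does not start a complete
    # UTF-8 sequence. No exceptions are raised along the way.
    out = []
    i = start = 0
    n = len(path)
    while i < n:
        length = _utf8_seq_len(path[i])
        if (
            length
            and i + length <= n
            and all(0x80 <= b <= 0xBF for b in path[i + 1 : i + length])
        ):
            i += length  # valid sequence, keep scanning
        else:
            out.append(path[start:i])
            out.append(quote(path[i : i + 1], safe=SAFE).encode())
            i += 1
            start = i
    out.append(path[start:])
    return b"".join(out)
```

Valid input passes through in a single pass with no exception machinery, which is where the larger speedup would come from; the trade-off is duplicating validation logic that the codec already implements.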
Change History (7)
- Summary: Optimize repercent_broken_unicode() performance → Improve repercent_broken_unicode() performance
- Cc: Harsh007 added
- Owner: set to Harsh007
- Status: new → assigned
- Owner: Harsh007 removed
- Status: assigned → new
- Owner: set to beestarkdev
- Status: new → assigned
- Triage Stage: Unreviewed → Accepted
Hi, here is the pull request: https://github.com/django/django/pull/20626
This is my first time ever contributing to open source, so please feel free to give me feedback if there's anything I can improve on.
I have attempted to add another optimization to this function in addition to the recommendation here. I will post the testing/benchmarking methodology in the pull request as well for full transparency. Thank you!