id	summary	reporter	owner	description	type	status	component	version	severity	resolution	keywords	cc	stage	has_patch	needs_docs	needs_tests	needs_better_patch	easy	ui_ux
36897	Improve repercent_broken_unicode() performance	Tarek Nakkouch	beestarkdev	"The `repercent_broken_unicode()` function in `django/utils/encoding.py` has performance issues when processing URLs with many consecutive invalid UTF-8 bytes. The bottleneck is due to raising an exception for each invalid byte and creating intermediate bytes objects through concatenation.

{{{#!python
changed_parts = []
while True:
    try:
        path.decode()
    except UnicodeDecodeError as e:
        repercent = quote(path[e.start : e.end], safe=b""/#%[]=:;$&()+,!?*@'~"")
        # The concatenation below builds a new intermediate bytes object on every iteration.
        changed_parts.append(path[: e.start] + repercent.encode())
        path = path[e.end :]
    else:
        return b"""".join(changed_parts) + path
}}}

== Suggested optimization ==

The simplest fix is to append each byte part to the list separately instead of concatenating with the `+` operator, which avoids creating an intermediate bytes object on every iteration. This gives a ~40% improvement while keeping the same exception-based approach:

{{{#!python
changed_parts = []
while True:
    try:
        path.decode()
    except UnicodeDecodeError as e:
        repercent = quote(path[e.start : e.end], safe=b""/#%[]=:;$&()+,!?*@'~"")
        changed_parts.append(path[: e.start])
        changed_parts.append(repercent.encode())
        path = path[e.end :]
    else:
        changed_parts.append(path)
        return b"""".join(changed_parts)
}}}
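
To check that the two loops above behave identically and to time them locally, a minimal self-contained sketch follows. The function names `repercent_concat` and `repercent_append` and the sample input are hypothetical and chosen only for illustration; they are not part of Django. No timings are claimed here, since results depend on the input and the machine.

{{{#!python
import timeit
from urllib.parse import quote

# Same safe set as in the snippets above.
SAFE = b'/#%[]=:;$&()+,!?*@\'~'


def repercent_concat(path):
    # Current approach: one bytes concatenation per invalid sequence.
    changed_parts = []
    while True:
        try:
            path.decode()
        except UnicodeDecodeError as e:
            repercent = quote(path[e.start : e.end], safe=SAFE)
            changed_parts.append(path[: e.start] + repercent.encode())
            path = path[e.end :]
        else:
            return b''.join(changed_parts) + path


def repercent_append(path):
    # Suggested approach: append the parts separately and join once.
    changed_parts = []
    while True:
        try:
            path.decode()
        except UnicodeDecodeError as e:
            repercent = quote(path[e.start : e.end], safe=SAFE)
            changed_parts.append(path[: e.start])
            changed_parts.append(repercent.encode())
            path = path[e.end :]
        else:
            changed_parts.append(path)
            return b''.join(changed_parts)


if __name__ == '__main__':
    # Hypothetical worst case: many consecutive invalid UTF-8 bytes.
    sample = b'/path/' + b'\xff' * 5000 + b'/caf\xc3\xa9'
    assert repercent_concat(sample) == repercent_append(sample)
    for func in (repercent_concat, repercent_append):
        elapsed = timeit.timeit(lambda: func(sample), number=5)
        print(func.__name__, round(elapsed, 3))
}}}

Both functions should return the same bytes for any input, so the assertion doubles as a regression check when trying other variants.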

Alternatively, a manual UTF-8 validation approach could eliminate exception overhead entirely by scanning byte-by-byte and checking UTF-8 patterns to identify invalid sequences without raising exceptions. This would reduce processing time by ~80% though the implementation is more complex."	Cleanup/optimization	assigned	Utilities	6.0	Normal			Harsh007	Accepted	1	0	0	0	0	0