Improve repercent_broken_unicode() performance
The repercent_broken_unicode() function in django/utils/encoding.py performs poorly when processing URLs with many consecutive invalid UTF-8 bytes. The bottleneck comes from raising a UnicodeDecodeError for each invalid sequence and from building intermediate bytes objects through concatenation:
changed_parts = []
while True:
    try:
        path.decode()
    except UnicodeDecodeError as e:
        repercent = quote(path[e.start : e.end], safe=b"/#%[]=:;$&()+,!?*@'~")
        # creates a new intermediate bytes object on every iteration
        changed_parts.append(path[: e.start] + repercent.encode())
        path = path[e.end :]
    else:
        return b"".join(changed_parts) + path
Suggested optimization
The simplest solution is to append the byte parts to the list separately instead of concatenating them with the + operator, avoiding the creation of intermediate bytes objects. This yields roughly a 40% improvement while keeping the same exception-based approach:
changed_parts = []
while True:
    try:
        path.decode()
    except UnicodeDecodeError as e:
        repercent = quote(path[e.start : e.end], safe=b"/#%[]=:;$&()+,!?*@'~")
        changed_parts.append(path[: e.start])
        changed_parts.append(repercent.encode())
        path = path[e.end :]
    else:
        changed_parts.append(path)
        return b"".join(changed_parts)
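To make the difference concrete, here is a hedged, self-contained sketch that places the two variants side by side and times them on a worst-case input. The function names (repercent_concat, repercent_append) and the test path are hypothetical, chosen for the comparison; the actual Django function also does more than this excerpt shows.

```python
import timeit
from urllib.parse import quote

SAFE = b"/#%[]=:;$&()+,!?*@'~"


def repercent_concat(path):
    # Original approach: builds an intermediate bytes object per invalid
    # sequence via `+` before appending.
    changed_parts = []
    while True:
        try:
            path.decode()
        except UnicodeDecodeError as e:
            repercent = quote(path[e.start : e.end], safe=SAFE)
            changed_parts.append(path[: e.start] + repercent.encode())
            path = path[e.end :]
        else:
            return b"".join(changed_parts) + path


def repercent_append(path):
    # Optimized approach: appends the parts separately and joins once.
    changed_parts = []
    while True:
        try:
            path.decode()
        except UnicodeDecodeError as e:
            repercent = quote(path[e.start : e.end], safe=SAFE)
            changed_parts.append(path[: e.start])
            changed_parts.append(repercent.encode())
            path = path[e.end :]
        else:
            changed_parts.append(path)
            return b"".join(changed_parts)


if __name__ == "__main__":
    # A path with many consecutive invalid bytes (hypothetical worst case).
    path = b"/test/" + b"\xff" * 1000
    assert repercent_concat(path) == repercent_append(path)
    for fn in (repercent_concat, repercent_append):
        elapsed = timeit.timeit(lambda: fn(path), number=50)
        print(f"{fn.__name__}: {elapsed:.3f}s")
```

Both variants produce identical output; only the allocation pattern differs, which is where the reported speedup comes from.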
Alternatively, a manual UTF-8 validation approach could eliminate the exception overhead entirely by scanning the input byte by byte and checking lead/continuation patterns to identify invalid sequences without raising exceptions. This would reduce processing time by ~80%, though the implementation is more complex.
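As a rough illustration of that idea, the sketch below (the function names are hypothetical, not Django's implementation) walks the bytes once, using each lead byte to determine the expected sequence length and percent-encoding any byte that does not start a complete, well-formed sequence. Note this simplified check accepts some ill-formed input (overlong encodings, surrogates) that bytes.decode() rejects; a production version would need the full per-lead-byte range checks for the second byte (e.g. E0 requires A0-BF).

```python
from urllib.parse import quote

SAFE = b"/#%[]=:;$&()+,!?*@'~"


def _utf8_seq_len(lead):
    # Expected sequence length for a UTF-8 lead byte; 0 means invalid lead.
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def repercent_broken_unicode_scan(path):
    # Scan once; percent-encode any byte that does not start a complete
    # UTF-8 sequence. No exceptions are raised along the way.
    out = []
    i = start = 0
    n = len(path)
    while i < n:
        length = _utf8_seq_len(path[i])
        if (
            length
            and i + length <= n
            and all(0x80 <= b <= 0xBF for b in path[i + 1 : i + length])
        ):
            i += length  # valid sequence, keep scanning
        else:
            out.append(path[start:i])
            out.append(quote(path[i : i + 1], safe=SAFE).encode())
            i += 1
            start = i
    out.append(path[start:])
    return b"".join(out)
```

Valid input passes through in a single pass with no exception machinery, which is where the larger speedup would come from; the trade-off is duplicating validation logic that the codec already implements.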
Change History (7)
- Summary: Optimize repercent_broken_unicode() performance → Improve repercent_broken_unicode() performance
- Cc: Harsh007 added
- Owner: set to Harsh007
- Status: new → assigned
- Owner: Harsh007 removed
- Status: assigned → new
- Owner: set to beestarkdev
- Status: new → assigned
- Triage Stage: Unreviewed → Accepted
Hi, here is the pull request: https://github.com/django/django/pull/20626
This is my first time ever contributing to open source, so please feel free to give me feedback if there's anything I can improve on.
I have attempted to add another optimization to this function in addition to the recommendation here. I will post the testing/benchmarking methodology in the pull request as well for full transparency. Thank you!