Opened 106 minutes ago
Last modified 101 minutes ago
#36897 new Cleanup/optimization
Improve repercent_broken_unicode() performance
| Reported by: | Tarek Nakkouch | Owned by: | |
|---|---|---|---|
| Component: | Utilities | Version: | 6.0 |
| Severity: | Normal | Keywords: | |
| Cc: | | Triage Stage: | Unreviewed |
| Has patch: | no | Needs documentation: | no |
| Needs tests: | no | Patch needs improvement: | no |
| Easy pickings: | no | UI/UX: | no |
Description
The repercent_broken_unicode() function in django/utils/encoding.py has performance issues when processing URLs with many consecutive invalid UTF-8 bytes. The bottleneck is due to raising an exception for each invalid byte and creating intermediate bytes objects through concatenation.
```python
changed_parts = []
while True:
    try:
        path.decode()
    except UnicodeDecodeError as e:
        repercent = quote(path[e.start : e.end], safe=b"/#%[]=:;$&()+,!?*@'~")
        # creates new bytes object
        changed_parts.append(path[: e.start] + repercent.encode())
        path = path[e.end :]
    else:
        return b"".join(changed_parts) + path
```
Suggested optimization
The simplest solution is to append byte parts separately to the list instead of concatenating them with the + operator, avoiding creation of intermediate bytes objects. This provides ~40% improvement while keeping the same exception-based approach:
```python
changed_parts = []
while True:
    try:
        path.decode()
    except UnicodeDecodeError as e:
        repercent = quote(path[e.start : e.end], safe=b"/#%[]=:;$&()+,!?*@'~")
        changed_parts.append(path[: e.start])
        changed_parts.append(repercent.encode())
        path = path[e.end :]
    else:
        changed_parts.append(path)
        return b"".join(changed_parts)
```
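To check the claimed speedup locally, a small timing harness along these lines could be used. This is only a sketch: the function names `repercent_concat`, `repercent_append`, and `compare`, the `SAFE` constant, and the payload are hypothetical and chosen for illustration, and no particular timing result is implied.

```python
import timeit
from urllib.parse import quote

# Same safe set as in the snippets above.
SAFE = b"/#%[]=:;$&()+,!?*@'~"


def repercent_concat(path):
    """Current approach: one bytes concatenation per invalid sequence."""
    changed_parts = []
    while True:
        try:
            path.decode()
        except UnicodeDecodeError as e:
            repercent = quote(path[e.start : e.end], safe=SAFE)
            changed_parts.append(path[: e.start] + repercent.encode())
            path = path[e.end :]
        else:
            return b"".join(changed_parts) + path


def repercent_append(path):
    """Suggested approach: append the two parts separately."""
    changed_parts = []
    while True:
        try:
            path.decode()
        except UnicodeDecodeError as e:
            repercent = quote(path[e.start : e.end], safe=SAFE)
            changed_parts.append(path[: e.start])
            changed_parts.append(repercent.encode())
            path = path[e.end :]
        else:
            changed_parts.append(path)
            return b"".join(changed_parts)


def compare(payload, number=5):
    """Return (concat_seconds, append_seconds) for `number` runs of each."""
    # Both variants must produce identical output before timing means anything.
    assert repercent_concat(payload) == repercent_append(payload)
    t_concat = timeit.timeit(lambda: repercent_concat(payload), number=number)
    t_append = timeit.timeit(lambda: repercent_append(payload), number=number)
    return t_concat, t_append


if __name__ == "__main__":
    # Hypothetical worst case: many consecutive invalid UTF-8 bytes.
    payload = b"/" + b"\xff" * 10_000
    t_concat, t_append = compare(payload)
    print(f"concat: {t_concat:.4f}s  append: {t_append:.4f}s")
```

Results will vary with payload shape and Python version, so it is worth trying inputs that mix valid text with invalid bytes as well as the all-invalid case.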
Alternatively, a manual UTF-8 validation approach could eliminate exception overhead entirely by scanning byte-by-byte and checking UTF-8 patterns to identify invalid sequences without raising exceptions. This would reduce processing time by ~80%, though the implementation is more complex.
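As a point of comparison, here is a minimal sketch of one exception-free variant. It is not the manual byte scanner described above. Instead it swaps in Python's built-in `surrogateescape` error handler, which decodes each undecodable byte to a lone surrogate in the range U+DC80 to U+DCFF, and then rewrites those surrogates as percent escapes. The function name `repercent_surrogateescape` and the `_ESCAPED_BYTE` pattern are hypothetical. The sketch assumes that invalid bytes are always >= 0x80, so the `safe` set passed to `quote()` never affects them. It has not been benchmarked here.

```python
import re

# Lone surrogates produced by the "surrogateescape" handler: one per
# undecodable byte, with code point 0xDC00 + byte value.
_ESCAPED_BYTE = re.compile("[\udc80-\udcff]")


def _percent_encode(match):
    # Recover the original byte value and format it as %XX (uppercase hex),
    # which is the form urllib.parse.quote() emits for bytes >= 0x80.
    return "%%%02X" % (ord(match.group()) - 0xDC00)


def repercent_surrogateescape(path):
    """Percent-encode invalid UTF-8 bytes in `path` without raising
    UnicodeDecodeError for each invalid sequence."""
    text = path.decode("utf-8", "surrogateescape")
    return _ESCAPED_BYTE.sub(_percent_encode, text).encode("utf-8")


print(repercent_surrogateescape(b"/\xff\xfe"))  # b'/%FF%FE'
```

The design trade-off is that the whole path is decoded and re-encoded once, rather than being sliced repeatedly. Any variant like this would still need to be checked against the existing test suite before replacing the current loop.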
Change History (1)
comment:1 by , 101 minutes ago
| Summary: | Optimize repercent_broken_unicode() performance → Improve repercent_broken_unicode() performance |
|---|