Opened 12 years ago
Closed 12 years ago
#21573 closed Cleanup/optimization (fixed)
Cache regular expression for normalizing newlines in django.utils.text.normalize_newlines
| Reported by: | Vajrasky Kok | Owned by: | Vajrasky Kok |
|---|---|---|---|
| Component: | Utilities | Version: | dev |
| Severity: | Normal | Keywords: | performance |
| Cc: | sky.kok@… | Triage Stage: | Ready for checkin |
| Has patch: | yes | Needs documentation: | no |
| Needs tests: | no | Patch needs improvement: | no |
| Easy pickings: | yes | UI/UX: | no |
Description
This is the method:
def normalize_newlines(text):
return force_text(re.sub(r'\r\n|\r|\n', '\n', text))
normalize_newlines = allow_lazy(normalize_newlines, six.text_type)
It should be:
_newlines_re = re.compile(r'\r\n|\r|\r')
def normalize_newlines(text):
return force_text(_newlines_re.sub('\n', text))
normalize_newlines = allow_lazy(normalize_newlines, six.text_type)
Things to be considered:
- Do we use this method often in Django? If not, caching may not be necessary.
- From http://docs.python.org/2/library/re.html#re.compile: The compiled versions of the most recent patterns passed to re.match(), re.search() or re.compile() are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.
- The performance gain maybe negligible.
Change History (6)
comment:1 by , 12 years ago
| Cc: | added |
|---|---|
| Has patch: | set |
| Owner: | changed from to |
| Status: | new → assigned |
comment:2 by , 12 years ago
I ran a small benchmark [1] with four variants:
- The original function
- The same function but with a simpler regexp (
r'\r\n|\r'instead ofr'\r\n|\r|\n') - The original function with a compiled regexp
- A function using
str.replaceinstead ofre.sub
I tried each variant on 3 different sizes of text: small (~20 lines), medium (~200 lines), and big (~2000 lines).
Here are the results (less is better):
| Small | Medium | Big | |
|---|---|---|---|
| original | 19.2 usec | 141 usec | 1.65 msec |
| simple_re | 7.42 usec | 14.8 usec | 135 usec |
| compiled_re | 6.15 usec | 13.8 usec | 146 usec |
| str_replace | 5.29 usec | 16.4 usec | 130 usec |
The str.replace looks like a clear winner to me so unless there are objections, I'll commit that one.
comment:3 by , 12 years ago
| Triage Stage: | Unreviewed → Accepted |
|---|
(Oops, forgot to change the status of the ticket)
comment:4 by , 12 years ago
Objection. I got different result in my machine (32 bit) with Fedora Core 18.
On Python 3.4 (Yeah!!!!):
| Function | Small | Medium | Big |
| original: | 95 usec | 625 usec | 7.52 msec |
| simple_re: | 33.6 usec | 57 usec | 281 usec |
| compiled_re: | 27.2 usec | 50.7 usec | 273 usec |
| str_replace: | 27.1 usec | 51 usec | 303 usec |
But on Python 2.7, the compiled_re is still the winner:
| Function | Small | Medium | Big |
| original: | 34.3 usec | 231 usec | 2.48 msec |
| simple_re: | 17.6 usec | 49.6 usec | 343 usec |
| compiled_re: | 14.6 usec | 46.5 usec | 342 usec |
| str_replace: | 14 usec | 53.3 usec | 414 usec |
comment:5 by , 12 years ago
| Triage Stage: | Accepted → Ready for checkin |
|---|
Alright, let's go with compiled_re then (they're all in the same ballpark anyway).
comment:6 by , 12 years ago
| Resolution: | → fixed |
|---|---|
| Status: | assigned → closed |
Here is the PR: https://github.com/django/django/pull/2046
The unit test for normalize_newlines can be found in #21572.