Opened 10 years ago

Closed 10 years ago

#21573 closed Cleanup/optimization (fixed)

Cache regular expression for normalizing newlines in django.utils.text.normalize_newlines

Reported by: Vajrasky Kok Owned by: Vajrasky Kok
Component: Utilities Version: dev
Severity: Normal Keywords: performance
Cc: sky.kok@… Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: yes UI/UX: no

Description

This is the method:

def normalize_newlines(text):
    return force_text(re.sub(r'\r\n|\r|\n', '\n', text))
normalize_newlines = allow_lazy(normalize_newlines, six.text_type)

It should be:

_newlines_re = re.compile(r'\r\n|\r|\r')
def normalize_newlines(text):
    return force_text(_newlines_re.sub('\n', text))
normalize_newlines = allow_lazy(normalize_newlines, six.text_type)

Things to be considered:

  • Do we use this method often in Django? If not, caching may not be necessary.
  • From http://docs.python.org/2/library/re.html#re.compile: The compiled versions of the most recent patterns passed to re.match(), re.search() or re.compile() are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.
  • The performance gain maybe negligible.

Change History (6)

comment:1 by Vajrasky Kok, 10 years ago

Cc: sky.kok@… added
Has patch: set
Owner: changed from nobody to Vajrasky Kok
Status: newassigned

Here is the PR: https://github.com/django/django/pull/2046

The unit test for normalize_newlines can be found in #21572.

comment:2 by Baptiste Mispelon, 10 years ago

I ran a small benchmark [1] with four variants:

  • The original function
  • The same function but with a simpler regexp (r'\r\n|\r' instead of r'\r\n|\r|\n')
  • The original function with a compiled regexp
  • A function using str.replace instead of re.sub

I tried each variant on 3 different sizes of text: small (~20 lines), medium (~200 lines), and big (~2000 lines).

Here are the results (less is better):

Small Medium Big
original 19.2 usec 141 usec 1.65 msec
simple_re 7.42 usec 14.8 usec 135 usec
compiled_re 6.15 usec 13.8 usec 146 usec
str_replace 5.29 usec 16.4 usec 130 usec

The str.replace looks like a clear winner to me so unless there are objections, I'll commit that one.

[1] https://gist.github.com/bmispelon/7841287

comment:3 by Baptiste Mispelon, 10 years ago

Triage Stage: UnreviewedAccepted

(Oops, forgot to change the status of the ticket)

comment:4 by Vajrasky Kok, 10 years ago

Objection. I got different result in my machine (32 bit) with Fedora Core 18.

On Python 3.4 (Yeah!!!!):

FunctionSmallMediumBig
original:95 usec625 usec7.52 msec
simple_re:33.6 usec57 usec281 usec
compiled_re:27.2 usec50.7 usec273 usec
str_replace:27.1 usec51 usec303 usec

But on Python 2.7, the compiled_re is still the winner:

FunctionSmallMediumBig
original:34.3 usec231 usec2.48 msec
simple_re:17.6 usec49.6 usec343 usec
compiled_re:14.6 usec46.5 usec342 usec
str_replace:14 usec53.3 usec414 usec

comment:5 by Baptiste Mispelon, 10 years ago

Triage Stage: AcceptedReady for checkin

Alright, let's go with compiled_re then (they're all in the same ballpark anyway).

comment:6 by Baptiste Mispelon <bmispelon@…>, 10 years ago

Resolution: fixed
Status: assignedclosed

In a7cf48a2b7878bf98879f2963ab9a7d0e6493d9a:

Fixed #21573 -- Improved performance of utils.text.normalize_newlines.

Note: See TracTickets for help on using tickets.
Back to Top