Opened 3 years ago
Closed 3 years ago
#32824 closed Cleanup/optimization (fixed)
Potential micro-optimisations for NodeList.render
Reported by: | Keryn Knight | Owned by: | Keryn Knight |
---|---|---|---|
Component: | Template system | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | | Triage Stage: | Ready for checkin |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
The current implementation of NodeList.render looks like:

```python
def render(self, context):
    bits = []
    for node in self:
        if isinstance(node, Node):
            bit = node.render_annotated(context)
        else:
            bit = node
        bits.append(str(bit))
    return mark_safe(''.join(bits))
```
which, given a single template render (template + code to be attached, but it's basically the same fakery as in #32814), results in 41 calls (cProfile output, showing only the portions relevant to this ticket):
```
   41/1    0.000    0.000    0.007    0.007 base.py:956(render)
    591    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
    191    0.000    0.000    0.000    0.000 {built-in method builtins.callable}
115/113    0.000    0.000    0.000    0.000 safestring.py:53(mark_safe)
```
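For reference, a sketch of how these numbers can be gathered (the actual template/context construction lives in the to-be-attached file; for the 1000-render figures below, the render call is simply wrapped in a loop):

```python
# A sketch only: `template` and `context` are assumed to already exist in
# __main__, since cProfile.run() executes its statement in that namespace.
import cProfile
import pstats

cProfile.run("template.render(context)", "render.prof")
pstats.Stats("render.prof").sort_stats("cumulative").print_stats(20)
```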
Given those numbers are too minuscule to register, and for the purposes of surfacing the most time-consuming parts (pushing things like the importlib machinery etc. to the bottom of the profile output), rendering the same template 1000 times gives us:
```
4897753 function calls (4539046 primitive calls) in 2.832 seconds

 41000/1000    0.193    0.000    2.671    0.003 base.py:956(render)
70045/68045    0.042    0.000    0.095    0.000 safestring.py:50(mark_safe)
     128045    0.053    0.000    0.053    0.000 safestring.py:26(__init__)
     159432    0.023    0.000    0.023    0.000 {built-in method builtins.hasattr}
     114077    0.012    0.000    0.012    0.000 {built-in method builtins.callable}
     686582    0.076    0.000    0.076    0.000 {built-in method builtins.isinstance}
     318184    0.036    0.000    0.036    0.000 {method 'append' of 'list' objects}
```
(To ensure the SafeString.__init__ calls are registered, I manually created the method, so the timings for it aren't correct, but the counts will be.)
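For illustration, the stub was something along these lines (hypothetical reconstruction; the real SafeString does its construction in str's __new__, so cProfile normally has no __init__ frame to count):

```python
from django.utils.safestring import SafeString

def _counting_init(self, *args, **kwargs):
    # Does nothing useful: it exists purely so the profiler records one
    # call per SafeString instantiation. Timings through this frame are
    # inflated by the extra call; only the counts are meaningful.
    super(SafeString, self).__init__()

SafeString.__init__ = _counting_init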
Proposed change
The primary change I'm suggesting is deviating from using mark_safe to directly returning a new SafeString. As far as I know, the result of ''.join(...) is always of type str, regardless of whether any of the underlying bits are themselves SafeString etc., so for each call to NodeList.render we'd be saving one hasattr(...) test and one callable(...) test, which would change the 1000 runs to:
```
4774747 function calls (4416040 primitive calls) in 2.710 seconds

 41000/1000    0.194    0.000    2.669    0.003 base.py:956(render)
29045/27045    0.019    0.000    0.063    0.000 safestring.py:50(mark_safe)
     128045    0.048    0.000    0.048    0.000 safestring.py:26(__init__)
     118432    0.019    0.000    0.019    0.000 {built-in method builtins.hasattr}
      73077    0.008    0.000    0.008    0.000 {built-in method builtins.callable}
     686582    0.073    0.000    0.073    0.000 {built-in method builtins.isinstance}
     318184    0.038    0.000    0.038    0.000 {method 'append' of 'list' objects}
```
You can see that whilst the timings don't change substantially, it is a good deal fewer calls.
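Concretely, this first change on its own amounts to swapping the last line of the method (a sketch; Node and the surrounding class are as in django/template/base.py):

```python
from django.utils.safestring import SafeString

def render(self, context):
    bits = []
    for node in self:
        if isinstance(node, Node):
            bit = node.render_annotated(context)
        else:
            bit = node
        bits.append(str(bit))
    # ''.join() always returns a plain str, never a SafeString, so
    # mark_safe's hasattr()/callable() checks can never take the
    # "already safe" or "lazy object" branches here; constructing
    # SafeString directly skips both tests.
    return SafeString(''.join(bits))
```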
Worth investigating?
But possibly more interesting than that is the rest of the method. As far as I can tell from running the test suite (which, to be fair, does include a boatload of skips being output), the line if isinstance(node, Node): is always true -- that is, there's no test anywhere for what happens if you somehow, somewhere include a non-Node instance in a NodeList. If that's the case (which, again, it may not be, given the skips) then there are two possibilities:

- add a test demonstrating what should happen when a non-Node appears in a NodeList, and canonically accept it as part of DTL (a sketch of such a test follows this list);
- take the stance that NodeLists are inherently composed of only nodes (or at least things with render_annotated(context)) and sweep away the isinstance check.
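For illustration, a test for option 1 might look something like this hypothetical sketch (the test name and placement are my own invention):

```python
from django.template import Context
from django.template.base import NodeList, TextNode
from django.test import SimpleTestCase

class NonNodeInNodeListTests(SimpleTestCase):
    def test_non_node_is_stringified(self):
        # Pins down the currently-untested `else: bit = node` branch:
        # a bare str mixed into the NodeList is str()'d and joined in.
        nodelist = NodeList([TextNode("hello "), "world"])
        self.assertEqual(nodelist.render(Context()), "hello world")
```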
If option 2 were taken, the final method could be (setting aside any intermediate deprecation period, if one were even necessary):
```python
def render(self, context):
    bits = []
    append = bits.append
    for node in self:
        append(str(node.render_annotated(context)))
    return SafeString(''.join(bits))
```
which, compounded with the change from mark_safe to SafeString proposed above, would give (again, 1000 renders):
```
4575748 function calls (4217041 primitive calls) in 2.685 seconds

 41000/1000    0.152    0.000    2.567    0.003 base.py:956(render)
29045/27045    0.018    0.000    0.062    0.000 safestring.py:53(mark_safe)
     128045    0.048    0.000    0.048    0.000 safestring.py:26(__init__)
     118432    0.019    0.000    0.019    0.000 {built-in method builtins.hasattr}
      73077    0.008    0.000    0.008    0.000 {built-in method builtins.callable}
     487582    0.058    0.000    0.058    0.000 {built-in method builtins.isinstance}
     318184    0.039    0.000    0.039    0.000 {method 'append' of 'list' objects}
```
for a total saving of 199,000 isinstance checks, 41,000 hasattr checks and 41,000 callable checks (from a starting point of 4,897,753 function and 4,539,046 primitive calls, down to 4,575,748 and 4,217,041 respectively).
Were that final method implementation the target/goal, it may be OK to move from stacking up an intermediate list (the bits variable) to a generator expression fed into ''.join(...), but briefly using memory_profiler to @profile the method shows it's all 0 allocations anyway (presumably down to references & gc detection?).
Additional notes
Yappi is slower than cProfile but suggests the same small wins (amplified in both cases by the profiling machinery slowing everything down).
The difference (when measured via yappi, and reported via PyCharm's handy profile visualiser) between the original method and the ideal replacement would be a decrease from 7.6% of time being spent (directly) on this method to 5.7%.
The difference from just removing the mark_safe call in favour of the SafeString suggestion is obviously less pronounced, but yappi suggests that the time spent in mark_safe goes from 1.9% of time to 0.9%.
Those proportions are roughly the same for cProfile, give or take.
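For reproducibility, the yappi figures came from something along these lines (a sketch; render() is assumed to be the same synthetic render loop used for the cProfile numbers):

```python
import yappi

yappi.set_clock_type("wall")
yappi.start()
render()  # hypothetical: the synthetic 1000-render loop
yappi.stop()
yappi.get_func_stats().sort("ttot").print_all()
```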
Attachments (2)
Change History (10)
by , 3 years ago
Attachment: | textnode_test2.py added |
---|
yappi/cprofile/timeit oriented template rendering (1 complex template, fake/minimal context)
comment:1 by , 3 years ago
Tim McCurrach proposed directly returning SafeString in several places in #32568.
Regarding the rest of the proposal, NodeList is documented to consist of Node objects, so option 2 sounds like the right track to me:
> `parser.parse()` takes a tuple of names of block tags *to parse until*. It returns an instance of `django.template.NodeList`, which is a list of all `Node` objects that the parser encountered *before* it encountered any of the tags named in the tuple.
comment:2 by , 3 years ago
Triage Stage: | Unreviewed → Accepted |
---|
We don't even need bits, so maybe:

```python
def render(self, context):
    return SafeString(''.join([
        str(node.render_annotated(context))
        for node in self
    ]))
```
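As a quick standalone sanity check (independent of Django, hypothetical micro-benchmark) of why the list comprehension wins over loop-and-append — it avoids a method call per element:

```python
import timeit

setup = "items = range(1000)"
loop = """
bits = []
append = bits.append
for i in items:
    append(str(i))
s = ''.join(bits)
"""
comp = "s = ''.join([str(i) for i in items])"

# The comprehension version typically comes out ahead: no per-element
# append() call, and CPython builds the list with a specialised opcode.
print(timeit.timeit(loop, setup=setup, number=10_000))
print(timeit.timeit(comp, setup=setup, number=10_000))
```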
comment:3 by , 3 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
I think I'm happier still with that change; I wasn't proposing it because I couldn't previously get memory_profiler to play ball, but (at least in the pathological case of a stupid number of child nodes) it does skip the intermediate list, which has some small measurable effect on runtime performance.
Using a small stub for benchmarking, via ipython:
```
from nodelist_render_test import render
from django.template.base import NodeList
%load_ext memory_profiler
%mprun -f NodeList.render render()

Line #    Mem usage    Increment   Occurences   Line Contents
============================================================
   957     58.8 MiB     51.1 MiB           10   def render(self, context):
   958                                          #     return SafeString(''.join([
   959                                          #         str(node.render_annotated(context))
   960                                          #         for node in self
   961                                          #     ]))
   962     58.8 MiB      0.0 MiB           10       bits = []
   963     58.8 MiB      0.0 MiB           10       append = bits.append
   964     59.5 MiB      0.0 MiB      1000010       for node in self:
   965     59.5 MiB      7.4 MiB      1000000           append(str(node.render_annotated(context)))
   966     59.6 MiB      1.1 MiB           10       return SafeString(''.join(bits))
```
and:
```
%timeit render()
216 ms ± 3.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
timeit is consistently between 215ms & 225ms when repeatedly running the above (rather than using -nX00).
Switching to the generator expression version, now that I've got memory_profiler to report something:
```
%mprun -f NodeList.render render()

Line #    Mem usage    Increment   Occurences   Line Contents
============================================================
   957     58.7 MiB     51.3 MiB           10   def render(self, context):
   958     58.9 MiB -11945.8 MiB      1000040       return SafeString(''.join([
   959     58.8 MiB -11939.6 MiB      1000000           str(node.render_annotated(context))
   960     58.8 MiB -25012.3 MiB      1000010           for node in self
   961                                              ]))
   962                                          #     bits = []
   963                                          #     append = bits.append
   964                                          #     for node in self:
   965                                          #         append(str(node.render_annotated(context)))
   966                                          #     return SafeString(''.join(bits))
```
Dunno what to make of the deallocations, but I guess zeroes or minus numbers are better than the 7.4MiB from the run above... maybe I really am generating 11GB of nodes in this synthetic benchmark :)
and timeit:
```
%timeit render()
202 ms ± 1.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
timeit was consistently around 200 to 205ms repeatedly running it.
by , 3 years ago
Attachment: | nodelist_render_test.py added |
---|
synthetic test for isolating just this method, with memory profiler etc
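The attachment itself isn't reproduced here, but a minimal equivalent stub might look like this (hypothetical reconstruction; 100,000 nodes rendered 10 times matches the occurrence counts in the mprun output above):

```python
# Hypothetical stand-in for nodelist_render_test.py: one big NodeList so
# that NodeList.render dominates any profile or memory trace of render().
from django.conf import settings

settings.configure()  # likely unnecessary for this code path, but harmless

from django.template import Context
from django.template.base import NodeList, TextNode

nodelist = NodeList(TextNode("x") for _ in range(100_000))
context = Context()

def render():
    for _ in range(10):  # 10 calls, matching the "Occurences ... 10" rows
        nodelist.render(context)
```

Used from ipython as in the transcripts above: %load_ext memory_profiler, then %mprun -f NodeList.render render() and %timeit render().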
comment:4 by , 3 years ago
Oh and for completeness, the original method before any of these proposed changes, for historic reference:
```
%mprun -f NodeList.render render()

Line #    Mem usage    Increment   Occurences   Line Contents
============================================================
   957     62.3 MiB    -43.6 MiB           10   def render(self, context):
   958     62.3 MiB      0.0 MiB           10       bits = []
   959     72.8 MiB -4336382.0 MiB     1000010       for node in self:
   960     72.8 MiB -4336466.3 MiB     1000000           if isinstance(node, Node):
   961                                                       bit = node.render_annotated(context)
   962                                                   else:
   963     72.8 MiB -4333316.3 MiB     1000000               bit = node
   964     72.8 MiB -4333210.0 MiB     1000000           bits.append(str(bit))
   965     72.8 MiB     10.1 MiB           10       return mark_safe(''.join(bits))
```
and the timeit (remember this is the pathological case in terms of nodes to iterate over):
```
578 ms ± 4.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
consistently between 580 and 615ms across repeated runs.
comment:5 by , 3 years ago
Has patch: | set |
---|
comment:6 by , 3 years ago
Sigh. Apologies for the noise, but I went for a coffee and realised I was misrepresenting the performance of the original method (the last set of benchmarks above) by not ensuring my FakeNode was a subclass of Node (and thus render_annotated wasn't being called, so the string being returned was the class repr, inflating things massively)... and that's not useful to anyone, so here's a repeat which should more accurately demonstrate the change, for the historic record:
```
Line #    Mem usage    Increment   Occurences   Line Contents
============================================================
   956     60.2 MiB     52.9 MiB           10   def render(self, context):
   957     60.2 MiB      0.0 MiB           10       bits = []
   958     60.4 MiB -11945.9 MiB      1000010       for node in self:
   959     60.4 MiB -11946.1 MiB      1000000           if isinstance(node, Node):
   960     60.4 MiB -11946.1 MiB      1000000               bit = node.render_annotated(context)
   961                                                  else:
   962                                                      bit = node
   963     60.4 MiB -11939.6 MiB      1000000           bits.append(str(bit))
   964     60.5 MiB      1.0 MiB           10       return mark_safe(''.join(bits))
```
and the timeit:
```
306 ms ± 2.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
(consistently between 305 and 315ms across repeated runs)
comment:7 by , 3 years ago
Triage Stage: | Accepted → Ready for checkin |
---|