#32919 closed Cleanup/optimization (fixed)
Micro-optimisations for lexing & parsing of templates.
Reported by: | Keryn Knight | Owned by: | Keryn Knight |
---|---|---|---|
Component: | Template system | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | pope1ni | Triage Stage: | Ready for checkin |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
I'm bundling all of these proposed changes into one ticket because on their own they're all very small things and none of them move the needle individually. I do have a patchset for the changes I'm describing below which I'll tidy up and put into a PR for discussion.
Outside of looking at the pstats output format, the only proper benchmarking I've done is with timeit ... on a laptop. And unfortunately the bits being discussed are all quick enough that it's easy enough to foresee that the differences are within the margin of error, especially given the results can be affected by the CPU throttling, or other tasks running concurrently (though I've tried to avoid that as much as possible). It's entirely possibly that I got the baseline results during a throttled period accidentally, etc. So I'm going to -- possibly wrongly -- lean on the pstats output. But by all accounts, the theory should be sound, because it's "do less work"...
Note that DEBUG=False
for all of these, but it's irrelevant because I'm not going through the template loaders so there's no cached Loader in play (I'm just not getting the DebuxLexer as a result). All of the work is based on instantiating a Template from a string, never getting as far as rendering it.
Baseline
To get a baseline of where things are, I've read the contents of the admin/change_list.html
into a variable, to get the following:
In [3]: %timeit -r5 -n5000 Template(changelist) 2.08 ms ± 29.1 µs per loop (mean ± std. dev. of 5 runs, 5000 loops each)
to try and boost the signal to noise of ms fluctuations, we'll also do:
def runtemplate(source=changelist, n=1): for x in range(n): t = Template(source)
and get the baseline for that:
In [2]: %timeit -r5 -n1 runtemplate(n=5000) 10.5 s ± 335 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
From there, I've cprofiled (well, yappi actually) runtemplate(n=1000)
to amortize any other setup/import/machinery costs and boost the lexing & parsing parts into the spotlight. I've cherry picked the relevant things that I'm putting forth for change, and for the purpose of extracting the distinct get_nodes_by_type
methods being run, copy-pasted the default Node.get_nodes_by_type
(sigh) to each.
I've added in "percentage of time" in at the end of each line, calculated as (tottime / overall time) * 100
(eg: Lexer.create_token
is (1.418 / 30.018) * 100
) ... these are ideally going to go down in the general case, but with each subsequent change may go back up because something else changed position, so are a guide more than anything.
6776637 function calls (6946213 primitive calls) in 30.018 seconds Ordered by: internal time ncalls tottime percall cumtime percall filename:lineno(function) 1000/31000 2.410 0.002 24.801 0.001 django/template/base.py:445(Parser.parse) (8%) 160000 1.418 0.000 2.436 0.000 django/template/base.py:367(Lexer.create_token) (4.7%) 687159 1.292 0.000 1.293 0.000 builtins:0(isinstance) (4.3%) 2000/56000 1.199 0.001 5.095 0.000 django/template/base.py:962(NodeList.get_nodes_by_type) (4%) 131000 1.100 0.000 1.835 0.000 django/template/base.py:512(Parser.extend_nodelist) (3.7%) 1000 1.047 0.001 4.206 0.004 django/template/base.py:353(Lexer.tokenize) (3.5%) 59000 0.911 0.000 3.220 0.000 django/template/base.py:768(Variable.__init__) (3%) 423338 0.853 0.000 0.853 0.000 __builtin__:0(<method 'startswith' of 'str' objects>) (2.8%) 140000 0.820 0.000 1.345 0.000 django/template/base.py:972(TextNode.get_nodes_by_type) (2.7%) 371343/371373 0.770 0.000 1.235 0.000 builtins:0(getattr) (2.6%) 24000 0.141 0.000 0.231 0.000 django/template/base.py:1024(VariableNode.get_nodes_by_type) (0.5%) 10000 0.059 0.000 0.097 0.000 django/templatetags/i18n.py:72(TranslateNode.get_nodes_by_type) (0.2%) 6000 0.036 0.000 0.059 0.000 django/template/defaulttags.py:434(URLNode.get_nodes_by_type) (0.1%) 4000 0.024 0.000 0.039 0.000 django/templatetags/static.py:95(StaticNode.get_nodes_by_type) (0.1%) 2000 0.012 0.000 0.020 0.000 django/template/defaulttags.py:361(LoadNode.get_nodes_by_type) (0%) 2000 0.012 0.000 0.020 0.000 django/template/library.py:186(SimpleNode.get_nodes_by_type) (0%) 2000 0.012 0.000 0.020 0.000 django/template/defaulttags.py:53(CsrfTokenNode.get_nodes_by_type) (0%)
After changes
So that you don't need to read all the rest of this just to find out the difference, I'll front-load the final/ideal results too. At the end there's an easier to read summary of the changes, too.
5682638 function calls (5852221 primitive calls) in 25.186 seconds Ordered by: internal time ncalls tottime percall cumtime percall filename:lineno(function) 1000/31000 2.000 0.002 21.065 0.001 django/template/base.py:446(Parser.parse) (7.9%) 2000/56000 1.160 0.001 3.711 0.000 django/template/base.py:966(NodeList.get_nodes_by_type) (4.6%) 556159 1.014 0.000 1.015 0.000 builtins:0(isinstance) (4%) 1000 1.007 0.001 3.088 0.003 django/template/base.py:353(Lexer.tokenize) (4%) 160000 0.892 0.000 1.387 0.000 django/template/base.py:367(Lexer.create_token) (3.5%) 131000 0.815 0.000 1.292 0.000 django/template/base.py:514(Parser.extend_nodelist) (3.2%) 59000 0.748 0.000 3.081 0.000 django/template/base.py:770(Variable.__init__) (2.9%) 140000 0.528 0.000 0.778 0.000 django/template/base.py:976(TextNode.get_nodes_by_type) (2.1%) 181343/181373 0.342 0.000 0.343 0.000 builtins:0(getattr) (1.4%) 107338 0.222 0.000 0.222 0.000 __builtin__:0(<method 'startswith' of 'str' objects>) (0.9%) 14000 0.142 0.000 0.243 0.000 django/template/defaulttags.py:280(IfNode.__init__) (0.5%) 51000 0.101 0.000 0.101 0.000 django/template/defaulttags.py:288(IfNode.__iter__) (0.4%) 24000 0.091 0.000 0.134 0.000 django/template/base.py:1028(VariableNode.get_nodes_by_type) (0.4%) 10000 0.038 0.000 0.056 0.000 django/templatetags/i18n.py:72(TranslateNode.get_nodes_by_type) (0.2%) ... other get_nodes_by_type are all the same miniscule parts ...
and the timeit output:
In [3]: %timeit -r5 -n5000 Template(changelist) 1.82 ms ± 20 µs per loop (mean ± std. dev. of 5 runs, 5000 loops each) In [4]: %timeit -r5 -n1 runtemplate(n=5000) 9.38 s ± 186 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
Proposed changes
And now I'm going to describe the various bits, hopefully maybe possibly convince you with stats, and ideally avoid adding in much in the way of extra complexity.
From here on out, the percentages listed are the difference from the previous execution, rather than the baseline.
Optimising Lexer.create_token
Currently the lexer has many repeated calls to token_string.startswith(...)
to determine what kind of token to make. Hoisting the first 2 chars as token_start = token_string[0:2]
and then doing token_start == ...
is notionally faster because it's the hot loop for lexing:
ncalls tottime percall cumtime percall filename:lineno(function) 160000 0.892 0.000 1.380 0.000 django/template/base.py:367(Lexer.create_token) (4.7% -> 3.2%) 165338 0.331 0.000 0.331 0.000 __builtin__:0(<method 'startswith' of 'str' objects> (2.8% -> 1.2%)
Obviously indexing into a string isn't free, it's just not trackable as a method call, but startswith
has the extra ergonomics for passing in a tuple
of values to check, which we don't need. The lexer already makes nods to micro-optimisation being desirable, with the comment: We could do len(BLOCK_TAG_START) to be more "correct", but we've hard-coded the 2s here for performance. And it's not like the TAG_START values are going to change anytime, anyway.
way back in 234b7e372000381657250fe133cbcfa90ffbb2a9 so I think this is a safe proposal for the same reasons.
Optimising Parser.parse
The method currently makes mention of Use the raw values here for TokenType.* for a tiny performance boost.
We can micro-optimise it more, because for each of the 3 if branches the opcodes are:
34 LOAD_FAST 3 (token) 36 LOAD_ATTR 3 (token_type) 38 LOAD_ATTR 4 (value) 40 LOAD_CONST 2 (0 or 1 or 2) 42 COMPARE_OP 2 (==)
By setting token_type = token.token_type.value
and then re-using that variable, it bcomes
34 LOAD_FAST 3 (token) 36 LOAD_ATTR 3 (token_type) 38 LOAD_ATTR 4 (value) 40 STORE_FAST 4 (token_type)
followed by, for each comparison:
42 LOAD_FAST 4 (token_type) 44 LOAD_CONST 2 (0 or 1 or 2) 46 COMPARE_OP 2 (==)
which makes a small difference, at least when instrumented, because again it's the hot loop:
ncalls tottime percall cumtime percall filename:lineno(function) 1000/31000 1.894 0.002 22.146 0.001 django/template/base.py:446(Parser.parse) (8% -> 7.2%)
Removing isinstance from Parser.extend_nodelist
Currently the method includes the line if isinstance(nodelist, NodeList) and not isinstance(node, TextNode):
but the only places the method is used, it's re-using nodelist = NodeList()
and I'm fairly certain that a 3rd party using the method and returning something which isn't a NodeList
would error further in due to missing extension methods. So it's always true.
django3.pstat ncalls tottime percall cumtime percall filename:lineno(function) 556159 0.997 0.000 0.998 0.000 builtins:0(isinstance) (4.7% -> 3.9%) 131000 0.793 0.000 1.263 0.000 django/template/base.py:514(Parser.extend_nodelist) (4.0% -> 3.1%)
Opting nodes out of having a nodelist
Slightly less sure of this one, it's sound in terms of passing the test suite, and makes sense from what I know about Nodes/NodeLists and their nature and how/when they're shared...
All nodes currently inherit child_nodelists = ('nodelist',)
if none is given, but AFAICT a bunch of nodes can't & don't actually have that attribute. This default definition causes get_nodes_by_type
to run and execute the following:
nodelist = getattr(self, attr, None) if nodelist: ...
By declaring child_nodelists = ()
for those where it's not possible those lines can be avoided. Those Node
subclasses I've noted in the template and adapted then behave like this:
django4.pstat ncalls tottime percall cumtime percall filename:lineno(function) 2000/56000 1.139 0.001 4.148 0.000 django/template/base.py:964(NodeList.get_nodes_by_type) (4.5%, different implementation) 12000/70000 0.633 0.000 3.870 0.000 django/template/base.py:938(BlockNode.get_nodes_by_type) (2.5%, unaffected/untouched) 181343/181373 0.391 0.000 0.837 0.000 builtins:0(getattr) (2.8% -> 1.5%) 140000 0.519 0.000 0.770 0.000 django/template/base.py:974(TextNode.get_nodes_by_type) (3.0% -> 2.1%) 24000 0.089 0.000 0.132 0.000 django/template/base.py:1026(VariableNode.get_nodes_by_type) (0.5% -> 0.4%) 10000 0.038 0.000 0.056 0.000 django/templatetags/i18n.py:72(TranslateNode.get_nodes_by_type) (0.2%) ... others are miniscule or same ...
Assuming my understanding is correct (it may not be!), there are almost certainly other nodes which can omit the implicit child nodelist, these are just the ones I have accounted for, but they're also probably the lion's share of the cost anyway (with TextNode being highest)
Optimising Variable.init
Like with the micro-optimisations to Lexer.create_token
by hoisting the first and last characters of the given string to variables, and avoiding startswith, we can see some small improvements:
after variable node init:
django5.pstat ncalls tottime percall cumtime percall filename:lineno(function) 59000 0.773 0.000 3.013 0.000 django/template/base.py:770(Variable.__init__) (3.6% -> 2.9%) 107338 0.228 0.000 0.228 0.000 __builtin__:0(<method 'startswith' of 'str' objects>) (1.3% -> 0.9%)
Summary of profiled changes
Remember these timings are intentionally inflated by rendering the same str
into a Template
1000 times...
First using Yappi, which is what I was using throughout, precisely because it's slower and more accurate WRT asyncio etc.
Yappi percentage changes
Based solely on the work done, rather than secondary affects:
Parser.parse: 8% -> 7.9% Lexer.create_token: 4.7% -> 3.5% isinstance: 4.3% -> 4% Parser.extend_nodelist: 3.7% -> 3.2% Variable.__init__: 3% -> 2.9% str.startswith: 2.8% -> 0.9% getattr: 2.6% -> 1.4% Node.get_nodes_by_type: 6.1% -> 5.4%
Yappi timing changes:
More important than the swapping around of percentages, based on the own time/tottime:
Parser.parse: 2.41s -> 2s Lexer.create_token: 1.41s -> 0.9s isinstance: 1.29s -> 1s Parser.extend_nodelist: 1.1s -> 0.81s Variable.__init__: 0.9s -> 0.75s str.startswith: 0.85s -> 0.22s getattr: 0.77s -> 0.34s get_nodes_by_type: 1.80s -> 1.35s
cProfile percentage changes
Parser.parse: 9% -> 8.5% Lexer.create_token: 4.1% -> 3.8% isinstance: 1.9% -> 1.8% Parser.extend_nodelist: 2.8% -> 2.5% Variable.__init__: 3% -> 2.9% str.startswith: 2.1% -> 0.8% getattr: 1.3% -> 0.7% Node.get_nodes_by_type: 3.2% -> 2.8%
cProfile timing changes:
Total time: 4.148s -> 3.833s Parser.parse: 370ms -> 320ms Lexer.create_token: 170ms -> 140ms isinstance: 77ms -> 69ms Parser.extend_nodelist: 118ms -> 97ms Variable.__init__: 240ms -> 230ms str.startswith: 86ms -> 29ms getattr: 52ms -> 27ms get_nodes_by_type: 131ms -> 107ms
Numbers should all be correct, though I'll admit I've re-written and tried to edit this for clarity a couple of times and it got long enough that I could've pasted a wrong number somewhere. I've also rebased from main
once previously so ... yeah. Scientific method may out of the window.
Change History (11)
comment:1 by , 3 years ago
Has patch: | set |
---|---|
Owner: | changed from | to
Patch needs improvement: | set |
Status: | new → assigned |
comment:2 by , 3 years ago
Cc: | added |
---|---|
Triage Stage: | Unreviewed → Accepted |
Hey Keryn. Thanks for this.
Really nice analysis, and the PR looks reasonable in terms of comprehensibility/maintainability, so a speed-up to the DTL is +1.
I'll cc Nick, as it strikes me as the kind of thing he likes :)
comment:3 by , 3 years ago
Patch needs improvement: | unset |
---|---|
Triage Stage: | Accepted → Ready for checkin |
comment:5 by , 3 years ago
I noticed that some additional, related improvements can be made to Lexer.create_token()
. For example, in_tag
is currently checked twice, when it can be checked once. Also, token_start
only needs to be defined when in_tag
is true. (Both seen here: https://github.com/django/django/blob/edf184dd067e676962e8e1fc1a50397b810461d6/django/template/base.py#L373-L382 )
comment:6 by , 3 years ago
Hey Chris, there's a PR https://github.com/django/django/pull/14686 for ticket #23356 which looks like it's similarly playing around with in_tag
... maybe there's enough overlap with that (which regrettably now needs rebasing because of me!) to deal with? I've only had a cursory glance at the diff though.
comment:7 by , 3 years ago
Thanks for the heads up, Keryn. That PR is a lot more expansive than what I had in mind. I was considering changing only create_token()
. I think it could help if the smaller changes I had in mind were considered separately. It would decouple the changes and make the other PR easier to evaluate as it would be starting from a better state.
comment:8 by , 3 years ago
I just added a PR that builds on Keryn's changes, along the lines of my comments above:
https://github.com/django/django/pull/14739
comment:9 by , 3 years ago
If you're curious, I just filed a ticket (#32986) regarding something I noticed about Lexer.create_token()
when working on the above PR. It might lead to another small optimization or fix.
PR for any pending discussion, if this is deemed appropriate, is: https://github.com/django/django/pull/14627