Opened 3 years ago
Closed 3 years ago
#32892 closed Cleanup/optimization (fixed)
Optimisation for parse_datetime by preferring datetime.fromisoformat for well-formed values.
Reported by: | Keryn Knight | Owned by: | Keryn Knight |
---|---|---|---|
Component: | Utilities | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | Triage Stage: | Ready for checkin | |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
With 4.0+ having a minimum supported version of python 3.8
an opportunity to speedup the parse_datetime
(and possibly parse_date
) function may exist.
As of python 3.7, the datetime
class includes a fromisoformat
method which handles the happy path much faster than the regex based method used by parse_datetime
The values it cannot parse, from the whole Django test suite, are given as examples below (duplicates remain). For those cases (eg: the recently handled #32727) it still requires going through the regex variant:
'2010-1-10 14:49:00' '2006-10-25 14:30:45.0002' '2014-09-23T22:34Z' '2014-09-23T22:34Z' '2014-09-23T28:23' '2011-09-01T10:20:30Z' '2012-4-9 4:8:16' '2012-04-23T09:15:00Z' '2012-4-9 4:8:16-0320' '2012-04-23T10:20:30.400+02' '2012-04-23T10:20:30.400-02' '2012-04-23T10:20:30,400-02' '2012-04-23T10:20:30.400 +0230' '2012-04-23T10:20:30,400 +00' '2012-04-23T10:20:30 -02' '2012-04-56T09:15:90' '2011-10-32 10:10'
The only difference in output as far as I can tell is that the timezone
instance attached to the tzinfo
parameter sets the _name
attribute to the fixed offset (str
) where the stdlib one doesn't set it, and thus it's None
. I think that's fine, but YMMV.
On the happy path, using the regex version of parse_datetime
as is:
'2021-06-30 10:42:12.638719' -> 5.99 µs ± 88.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) '2008-04-02 12:00:00' -> 5.57 µs ± 50.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) '2011-11-04T00:05:23+04:00' -> 8.39 µs ± 98.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) '2011-11-04 00:05:23.283+00:00' -> 8.81 µs ± 93.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
vs using fromisoformat
:
'2021-06-30 10:42:12.638719' -> 227 ns ± 0.935 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each '2008-04-02 12:00:00' -> 216 ns ± 2.41 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) '2011-11-04T00:05:23+04:00' -> 266 ns ± 0.612 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) '2011-11-04 00:05:23.283+00:00' -> 234 ns ± 2.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
when combining both, such as:
def parse_datetime(value): """ ... """ try: return datetime.datetime.fromisoformat(value) except (TypeError, ValueError): match = datetime_re.match(value) if match: ...
it's easy enough to spot which formats end up hitting the regex case:
'2021-06-30 10:42:12.638719' -> 230 ns ± 1.57 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) '2008-04-02 12:00:00' -> 215 ns ± 2.45 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) '2011-11-04T00:05:23+04:00' -> 270 ns ± 4.28 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) '2011-11-04 00:05:23.283+00:00' -> 234 ns ± 1.23 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) '2012-04-23T10:20:30.400 +0230' -> 10.3 µs ± 281 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) '2012-04-23T10:20:30 -02' -> 9.14 µs ± 136 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Most of the speedup gain is going to be found when using sqlite3 as a DB backend, because it has convertery UDF type things for them, and generally speaking Django stores and parses them in the well-formed format:
Database.register_converter("datetime", decoder(parse_datetime)) Database.register_converter("timestamp", decoder(parse_datetime))
when using cProfile
to execute tuple(User.objects.all())
with 101 users, parse_datetime
accounts for 10+% of the overall time (with Model.__init__
being the only bigger component), ordered by tottime (column 2), for the purposes of amortising the noise elsewhere, the call has been executed 1000
times in a loop:
101000 0.451 0.000 0.628 0.000 django/db/models/base.py:406(__init__) 102000 0.379 0.000 0.837 0.000 django/utils/dateparse.py:98(parse_datetime) 102000 0.202 0.000 0.699 0.000 django/db/models/sql/compiler.py:1128(apply_converters) 102000 0.158 0.000 0.158 0.000 django/utils/dateparse.py:124(<dictcomp>)
but after adapting as above, parse_datetime
is about 1%, same ordering:
101000 0.433 0.000 0.606 0.000 django/db/models/base.py:406(__init__) 102000 0.198 0.000 0.682 0.000 django/db/models/sql/compiler.py:1128(apply_converters) 102000 0.031 0.000 0.048 0.000 django/utils/dateparse.py:98(parse_datetime)
(the dictcomp {k: int(v) for k, v in kw.items() if v is not None}
is never accounted for because it never executes)
It would be possible to restrict the variants passed through to the fromisoformat
branch further, because the minimum acceptable length string is 10
(for YYYY-MM-DD
) and then there's ... I dunno, some other number of valid options for the time parsing (HH[:MM[:SS[.fff[fff]]]][+HH:MM[:SS[.ffffff]]]
- eg, the tz string is only 5
8
or 15
chars long), but I doubt it's worth fussing over that.
I suspect the same is true for parse_date
which could use date.fromisoformat
if the input value is exactly 10
long, and fall back to the regex method for the optional leading-zeroes date_re
allows for.
Zero tests seem to fail with the adapted method proposed above.
Change History (8)
comment:1 by , 3 years ago
Triage Stage: | Unreviewed → Accepted |
---|
comment:2 by , 3 years ago
Has patch: | set |
---|---|
Owner: | changed from | to
Status: | new → assigned |
I just created a patch based on ticket.
comment:3 by , 3 years ago
Ah, I've just come back to this to set the has patch info and I see there's already one :/
Oh well, here's mine anyway: https://github.com/django/django/pull/14582 (which, to be fair, looks to have failed cos of flake8, aren't I great? :))
FWIW, across the test suite, the parse_date
and parse_time
values which weren't handled by the builtin fromisoformat
for the appropriate type are:
2009-7-8 2009-5-4 2012-4-9 2012-04-56 10:20:30,400 4:8:16 09:15:90 2011-13-10 2011-10-32 2011-10-32 25:50
by the look of it, which at least for the dates is entirely to be expected due to the stricter nature of the stdlib handler (must have 2 digit padded months/days etc)
comment:4 by , 3 years ago
Sorry, @Keryn for that. I've just assigned it to myself because it didn't own by anybody.
BTW, @Mariusz, feel free to close my PR if you want to continue with @Keryn PR.
comment:5 by , 3 years ago
Owner: | changed from | to
---|
comment:6 by , 3 years ago
Hey Hasan, absolutely no worries. I was surprised to be in a race (and losing!) , but we're all here for the same reason :)
For historical note, here's the equivalent timings for parse_date
and parse_time
with similar (expected) improvements for the happy path:
Before, using the regex method only. Any and all variants are roughly the same 3µs timing
parse_time('00:05:23+04:00') -> 3.56 µs ± 50 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) parse_time('00:05: ') -> 3 µs ± 41.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) parse_date('2000-01-01') -> 3.14 µs ± 28.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
After, using the regex method as a fallback and the builtin stricter format by preference:
parse_time('00:05:23+04:00') -> 1.05 µs ± 2.65 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) parse_time('00:05: ') -> 3.66 µs ± 15.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) parse_time('00:05:00') -> 987 ns ± 6.68 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) parse_time('00:05:00.001') -> 1.03 µs ± 12.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) parse_date('2000-01-01') -> 228 ns ± 2.89 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each
Which handily suggests that the most expensive part of parsing the whole iso string is the time+timezone handling, as you might expect.
comment:7 by , 3 years ago
Triage Stage: | Accepted → Ready for checkin |
---|
Sounds reasonable, thanks.