Opened 4 years ago

Closed 4 years ago

#32892 closed Cleanup/optimization (fixed)

Optimisation for parse_datetime by preferring datetime.fromisoformat for well-formed values.

Reported by: Keryn Knight Owned by: Keryn Knight
Component: Utilities Version: dev
Severity: Normal Keywords:
Cc: Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

With 4.0+ having a minimum supported version of python 3.8 an opportunity to speedup the parse_datetime (and possibly parse_date) function may exist.

As of python 3.7, the datetime class includes a fromisoformat method which handles the happy path much faster than the regex based method used by parse_datetime

The values it cannot parse, from the whole Django test suite, are given as examples below (duplicates remain). For those cases (eg: the recently handled #32727) it still requires going through the regex variant:

'2010-1-10 14:49:00'
'2006-10-25 14:30:45.0002'
'2014-09-23T22:34Z'
'2014-09-23T22:34Z'
'2014-09-23T28:23'
'2011-09-01T10:20:30Z'
'2012-4-9 4:8:16'
'2012-04-23T09:15:00Z'
'2012-4-9 4:8:16-0320'
'2012-04-23T10:20:30.400+02'
'2012-04-23T10:20:30.400-02'
'2012-04-23T10:20:30,400-02'
'2012-04-23T10:20:30.400 +0230'
'2012-04-23T10:20:30,400 +00'
'2012-04-23T10:20:30   -02'
'2012-04-56T09:15:90'
'2011-10-32 10:10'

The only difference in output as far as I can tell is that the timezone instance attached to the tzinfo parameter sets the _name attribute to the fixed offset (str) where the stdlib one doesn't set it, and thus it's None. I think that's fine, but YMMV.

On the happy path, using the regex version of parse_datetime as is:

'2021-06-30 10:42:12.638719'        -> 5.99 µs ± 88.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
'2008-04-02 12:00:00'               -> 5.57 µs ± 50.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
'2011-11-04T00:05:23+04:00'         -> 8.39 µs ± 98.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
'2011-11-04 00:05:23.283+00:00'     -> 8.81 µs ± 93.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

vs using fromisoformat:

'2021-06-30 10:42:12.638719'        -> 227 ns ± 0.935 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each
'2008-04-02 12:00:00'               -> 216 ns ± 2.41 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
'2011-11-04T00:05:23+04:00'         -> 266 ns ± 0.612 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
'2011-11-04 00:05:23.283+00:00'     -> 234 ns ± 2.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

when combining both, such as:

def parse_datetime(value):
    """ ... """
    try:
        return datetime.datetime.fromisoformat(value)
    except (TypeError, ValueError):
        match = datetime_re.match(value)
        if match:
            ...

it's easy enough to spot which formats end up hitting the regex case:

'2021-06-30 10:42:12.638719'        -> 230 ns ± 1.57 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
'2008-04-02 12:00:00'               -> 215 ns ± 2.45 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
'2011-11-04T00:05:23+04:00'         -> 270 ns ± 4.28 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
'2011-11-04 00:05:23.283+00:00'     -> 234 ns ± 1.23 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
'2012-04-23T10:20:30.400 +0230'     -> 10.3 µs ± 281 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
'2012-04-23T10:20:30   -02'         -> 9.14 µs ± 136 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Most of the speedup gain is going to be found when using sqlite3 as a DB backend, because it has convertery UDF type things for them, and generally speaking Django stores and parses them in the well-formed format:

Database.register_converter("datetime", decoder(parse_datetime))
Database.register_converter("timestamp", decoder(parse_datetime))

when using cProfile to execute tuple(User.objects.all()) with 101 users, parse_datetime accounts for 10+% of the overall time (with Model.__init__ being the only bigger component), ordered by tottime (column 2), for the purposes of amortising the noise elsewhere, the call has been executed 1000 times in a loop:

101000    0.451    0.000    0.628    0.000 django/db/models/base.py:406(__init__)
102000    0.379    0.000    0.837    0.000 django/utils/dateparse.py:98(parse_datetime)
102000    0.202    0.000    0.699    0.000 django/db/models/sql/compiler.py:1128(apply_converters)
102000    0.158    0.000    0.158    0.000 django/utils/dateparse.py:124(<dictcomp>)

but after adapting as above, parse_datetime is about 1%, same ordering:

101000    0.433    0.000    0.606    0.000 django/db/models/base.py:406(__init__)
102000    0.198    0.000    0.682    0.000 django/db/models/sql/compiler.py:1128(apply_converters)
102000    0.031    0.000    0.048    0.000 django/utils/dateparse.py:98(parse_datetime)

(the dictcomp {k: int(v) for k, v in kw.items() if v is not None} is never accounted for because it never executes)

It would be possible to restrict the variants passed through to the fromisoformat branch further, because the minimum acceptable length string is 10 (for YYYY-MM-DD) and then there's ... I dunno, some other number of valid options for the time parsing (HH[:MM[:SS[.fff[fff]]]][+HH:MM[:SS[.ffffff]]] - eg, the tz string is only 5 8 or 15 chars long), but I doubt it's worth fussing over that.

I suspect the same is true for parse_date which could use date.fromisoformat if the input value is exactly 10 long, and fall back to the regex method for the optional leading-zeroes date_re allows for.

Zero tests seem to fail with the adapted method proposed above.

Change History (8)

comment:1 by Mariusz Felisiak, 4 years ago

Triage Stage: UnreviewedAccepted

Sounds reasonable, thanks.

comment:2 by Hasan Ramezani, 4 years ago

Has patch: set
Owner: changed from nobody to Hasan Ramezani
Status: newassigned

I just created a patch based on ticket.

comment:3 by Keryn Knight, 4 years ago

Ah, I've just come back to this to set the has patch info and I see there's already one :/
Oh well, here's mine anyway: https://github.com/django/django/pull/14582 (which, to be fair, looks to have failed cos of flake8, aren't I great? :))

FWIW, across the test suite, the parse_date and parse_time values which weren't handled by the builtin fromisoformat for the appropriate type are:

2009-7-8
2009-5-4
2012-4-9
2012-04-56
10:20:30,400
4:8:16
09:15:90
2011-13-10
2011-10-32
2011-10-32
25:50

by the look of it, which at least for the dates is entirely to be expected due to the stricter nature of the stdlib handler (must have 2 digit padded months/days etc)

comment:4 by Hasan Ramezani, 4 years ago

Sorry, @Keryn for that. I've just assigned it to myself because it didn't own by anybody.

BTW, @Mariusz, feel free to close my PR if you want to continue with @Keryn PR.

comment:5 by Jacob Walls, 4 years ago

Owner: changed from Hasan Ramezani to Keryn Knight

comment:6 by Keryn Knight, 4 years ago

Hey Hasan, absolutely no worries. I was surprised to be in a race (and losing!) , but we're all here for the same reason :)


For historical note, here's the equivalent timings for parse_date and parse_time with similar (expected) improvements for the happy path:

Before, using the regex method only. Any and all variants are roughly the same 3µs timing

parse_time('00:05:23+04:00')    -> 3.56 µs ± 50 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
parse_time('00:05: ')           -> 3 µs ± 41.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
parse_date('2000-01-01')        -> 3.14 µs ± 28.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

After, using the regex method as a fallback and the builtin stricter format by preference:

parse_time('00:05:23+04:00')    -> 1.05 µs ± 2.65 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
parse_time('00:05: ')           -> 3.66 µs ± 15.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
parse_time('00:05:00')          -> 987 ns ± 6.68 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
parse_time('00:05:00.001')      -> 1.03 µs ± 12.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
parse_date('2000-01-01')        -> 228 ns ± 2.89 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each

Which handily suggests that the most expensive part of parsing the whole iso string is the time+timezone handling, as you might expect.

comment:7 by Mariusz Felisiak, 4 years ago

Triage Stage: AcceptedReady for checkin

comment:8 by Mariusz Felisiak <felisiak.mariusz@…>, 4 years ago

Resolution: fixed
Status: assignedclosed

In f35ab74:

Fixed #32892 -- Optimized django.utils.dateparse functions by using fromisoformat().

Note: See TracTickets for help on using tickets.
Back to Top