Opened 4 years ago

Last modified 10 months ago

#23242 new New feature

Add an option to avoid slow date_hierarchy queries on a big tables

Reported by: Alexander Klimenko Owned by: nobody
Component: contrib.admin Version: master
Severity: Normal Keywords: date_hierarchy datetimes dates
Cc: Hervé Cauwelier Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

Because of using DISTINCT query date_hierarchy incredibly slows down big tables rendering (~106 records),
May be it is more convenient to simply iterate between min and max values rather than using dates or datetimes queryset methods?

Change History (11)

comment:1 Changed 4 years ago by Tim Graham

Resolution: needsinfo
Status: newclosed

I'm not sure where the distinct query is that you're referring to. I also don't understand your proposed solution. Please reopen if you can provide more details (or even a patch).

comment:2 Changed 4 years ago by Aymeric Augustin

Easy pickings: unset

When the date_hierarchy option is used, the admin shows a list of available years, months or days. The implementation is designed to show only dates for which objects exist.

It provides a better UX for sparsely populated tables, where iterating between min and max dates would show links to pages with no objects, but performs badly on large tables.

In general, the admin isn't designed for dealing with large amounts of data. This choice can be traced back to Django being designed for editorial sites.

Certainly, the situation could be improved, but how?

comment:3 Changed 4 years ago by Hervé Cauwelier

Cc: Hervé Cauwelier added
Resolution: needsinfo
Status: closednew

Here is an example of such a request:

SELECT DISTINCT DATE_TRUNC('day', "messages_message"."sent_at" AT TIME ZONE E'Europe/Paris') FROM "messages_message" WHERE (EXTRACT('month' FROM "messages_message"."sent_at" AT TIME ZONE E'Europe/Paris') = 9 AND "messages_message"."sent_at" BETWEEN '2014-01-01T00:00:00+01:00'::timestamptz and '2014-12-31T23:59:59.999999+01:00'::timestamptz AND "messages_message"."sent_at" IS NOT NULL) ORDER BY 1 ASC

The explain/analyze on ~6 million entries:

 Sort  (cost=62488.59..62488.60 rows=1 width=8) (actual time=18791.784..18791.784 rows=1 loops=1)
   Sort Key: (date_trunc('day'::text, timezone('Europe/Paris'::text, sent_at)))
   Sort Method: quicksort  Memory: 25kB
   ->  HashAggregate  (cost=62488.57..62488.58 rows=1 width=8) (actual time=18791.465..18791.466 rows=1 loops=1)
         ->  Bitmap Heap Scan on messages_message  (cost=856.42..62488.18 rows=153 width=8) (actual time=1276.004..17066.794 rows=6135296 loops=1)
               Recheck Cond: ((sent_at >= '2014-01-01 00:00:00+01'::timestamp with time zone) AND (sent_at <= '2014-12-31 23:59:59.999999+01'::timestamp with time zone) AND (sent_at IS NOT NULL))
               Filter: (date_part('month'::text, timezone('Europe/Paris'::text, sent_at)) = 9::double precision)
               ->  Bitmap Index Scan on messages_message_sent_at_id_idx  (cost=0.00..856.38 rows=30655 width=0) (actual time=1274.039..1274.039 rows=6136069 loops=1)
                     Index Cond: ((sent_at >= '2014-01-01 00:00:00+01'::timestamp with time zone) AND (sent_at <= '2014-12-31 23:59:59.999999+01'::timestamp with time zone) AND (sent_at IS NOT NULL))
 Total runtime: 18791.933 ms

comment:4 Changed 4 years ago by Tim Graham

@herve, do you have ideas for how to improve the situation? Absent that, I don't see a reason to keep this ticket open besides maybe documenting "Don't use date_hierarchy if you have millions of entries."

comment:5 Changed 4 years ago by Aymeric Augustin

The alternative would be an option to show all months and days regardless of whether items exist on these dates.

date_hierarchy_existing_only = True  # default

The year part can be handled with range(min, max) which will be fast if the field is indexed.

comment:6 Changed 4 years ago by Tim Graham

Summary: slow date_hierarchy on a big tablesAdd an option to avoid slow date_hierarchy queries on a big tables
Triage Stage: UnreviewedAccepted
Type: Cleanup/optimizationNew feature

comment:8 Changed 3 years ago by Tim Graham

Patch needs improvement: set

comment:9 Changed 10 months ago by Aymeric Augustin

As discussed in this django-developers discussion, there's a third-party solution in django-admin-lightweight-date-hierarchy.

comment:10 Changed 10 months ago by Simon Charette

While it's still possible that getting rid of the the DISTINCT generated by dates() might help I wonder, by looking at the provided PostgreSQL plan in comment:3, if the actual culprit of the slow query was the EXTRACT('month') all along as related in #28933.

I also wonder if LIMIT'ing the results to 12 in in the case of year&month might help the database engine engine figure out that it can stop the DISTINCT processing once it retrieved 12 results. The ORDER BY clause would have to be dropped here or wrapped around the LIMITed query.

Another option would to stop using dates (or allow passing a range to it) for year&month and year&month&day filtering and use a combination of generate_series and EXISTS to speed up the query.

e.g. for the year&month case (on a DateField for simplicitiy but nothing prevent us from using the same mechanism on DateTimeField with a few adjustments)

SELECT EXTRACT('month' from serie.date)
FROM generate_series('2017-01-01'::date, '2017-12-01'::date, '1 month') AS serie(date)
WHERE EXISTS (
    SELECT 1
    FROM event
    WHERE event.date >= serie.date AND event.date < (serie.date + interval '1' month)
)

This seems to perform very well on a large events table with an indexed datetime column containing 17M rows.

Last edited 10 months ago by Simon Charette (previous) (diff)

comment:11 Changed 10 months ago by Simon Charette

FWIW it looks like the SELECT DISTINCT column FROM large_result_set LIMIT small_threshold optimization was suggested on pgsql-hackers a few years ago but never made it to the TODO.

Note: See TracTickets for help on using tickets.
Back to Top