Context Navigation

← Previous Ticket
Next Ticket →

#23242 new New feature

Add an option to avoid slow date_hierarchy queries on a big tables

Reported by:	Alexander Klimenko	Owned by:	nobody
Component:	contrib.admin	Version:	dev
Severity:	Normal	Keywords:	date_hierarchy datetimes dates
Cc:	Hervé Cauwelier	Triage Stage:	Accepted
Has patch:	yes	Needs documentation:	no
Needs tests:	no	Patch needs improvement:	yes
Easy pickings:	no	UI/UX:	no

Description ¶

Because of using DISTINCT query date_hierarchy incredibly slows down big tables rendering (~10^{6 records),

May be it is more convenient to simply iterate between min and max values rather than using dates or datetimes queryset methods?}

According to the ticket's flags, the next step(s) to move this issue forward are:

To improve the patch as described in the pull request review comments or on this ticket, then uncheck "Patch needs improvement".
If creating a new pull request, include a link to the pull request in the ticket comment when making that update. The usual format is: [https://github.com/django/django/pull/#### PR].

Change History (11)

comment:1 by Tim Graham, 11 years ago

Resolution:	→ needsinfo
Status:	new → closed

I'm not sure where the distinct query is that you're referring to. I also don't understand your proposed solution. Please reopen if you can provide more details (or even a patch).

comment:2 by Aymeric Augustin, 11 years ago

Easy pickings:	unset

When the date_hierarchy option is used, the admin shows a list of available years, months or days. The implementation is designed to show only dates for which objects exist.

It provides a better UX for sparsely populated tables, where iterating between min and max dates would show links to pages with no objects, but performs badly on large tables.

In general, the admin isn't designed for dealing with large amounts of data. This choice can be traced back to Django being designed for editorial sites.

Certainly, the situation could be improved, but how?

comment:3 by Hervé Cauwelier, 11 years ago

Cc:	Hervé Cauwelier added
Resolution:	needsinfo
Status:	closed → new

Here is an example of such a request:

SELECT DISTINCT DATE_TRUNC('day', "messages_message"."sent_at" AT TIME ZONE E'Europe/Paris') FROM "messages_message" WHERE (EXTRACT('month' FROM "messages_message"."sent_at" AT TIME ZONE E'Europe/Paris') = 9 AND "messages_message"."sent_at" BETWEEN '2014-01-01T00:00:00+01:00'::timestamptz and '2014-12-31T23:59:59.999999+01:00'::timestamptz AND "messages_message"."sent_at" IS NOT NULL) ORDER BY 1 ASC

The explain/analyze on ~6 million entries:

 Sort  (cost=62488.59..62488.60 rows=1 width=8) (actual time=18791.784..18791.784 rows=1 loops=1)
   Sort Key: (date_trunc('day'::text, timezone('Europe/Paris'::text, sent_at)))
   Sort Method: quicksort  Memory: 25kB
   ->  HashAggregate  (cost=62488.57..62488.58 rows=1 width=8) (actual time=18791.465..18791.466 rows=1 loops=1)
         ->  Bitmap Heap Scan on messages_message  (cost=856.42..62488.18 rows=153 width=8) (actual time=1276.004..17066.794 rows=6135296 loops=1)
               Recheck Cond: ((sent_at >= '2014-01-01 00:00:00+01'::timestamp with time zone) AND (sent_at <= '2014-12-31 23:59:59.999999+01'::timestamp with time zone) AND (sent_at IS NOT NULL))
               Filter: (date_part('month'::text, timezone('Europe/Paris'::text, sent_at)) = 9::double precision)
               ->  Bitmap Index Scan on messages_message_sent_at_id_idx  (cost=0.00..856.38 rows=30655 width=0) (actual time=1274.039..1274.039 rows=6136069 loops=1)
                     Index Cond: ((sent_at >= '2014-01-01 00:00:00+01'::timestamp with time zone) AND (sent_at <= '2014-12-31 23:59:59.999999+01'::timestamp with time zone) AND (sent_at IS NOT NULL))
 Total runtime: 18791.933 ms

comment:4 by Tim Graham, 11 years ago

@herve, do you have ideas for how to improve the situation? Absent that, I don't see a reason to keep this ticket open besides maybe documenting "Don't use date_hierarchy if you have millions of entries."

comment:5 by Aymeric Augustin, 11 years ago

The alternative would be an option to show all months and days regardless of whether items exist on these dates.

date_hierarchy_existing_only = True  # default

The year part can be handled with range(min, max) which will be fast if the field is indexed.

comment:6 by Tim Graham, 11 years ago

Summary:	slow date_hierarchy on a big tables → Add an option to avoid slow date_hierarchy queries on a big tables
Triage Stage:	Unreviewed → Accepted
Type:	Cleanup/optimization → New feature

comment:7 by Gavin Wahl, 10 years ago

Has patch:	set

https://github.com/django/django/pull/5213

comment:8 by Tim Graham, 10 years ago

Patch needs improvement:	set

comment:9 by Aymeric Augustin, 7 years ago

As discussed in this django-developers discussion, there's a third-party solution in django-admin-lightweight-date-hierarchy.

comment:10 by Simon Charette, 7 years ago

While it's still possible that getting rid of the the DISTINCT generated by dates() might help I wonder, by looking at the provided PostgreSQL plan in comment:3, if the actual culprit of the slow query was the EXTRACT('month') all along as related in #28933.

I also wonder if LIMIT'ing the results to 12 in in the case of year&month might help the database engine engine figure out that it can stop the DISTINCT processing once it retrieved 12 results. The ORDER BY clause would have to be dropped here or wrapped around the LIMITed query.

Another option would to stop using dates (or allow passing a range to it) for year&month and year&month&day filtering and use a combination of generate_series and EXISTS to speed up the query.

e.g. for the year&month case (on a DateField for simplicitiy but nothing prevent us from using the same mechanism on DateTimeField with a few adjustments)

SELECT EXTRACT('month' from serie.date)
FROM generate_series('2017-01-01'::date, '2017-12-01'::date, '1 month') AS serie(date)
WHERE EXISTS (
    SELECT 1
    FROM event
    WHERE event.date >= serie.date AND event.date < (serie.date + interval '1' month)
)

This seems to perform very well on a large events table with an indexed datetime column containing 17M rows.

Last edited 7 years ago by Simon Charette (previous) (diff)

comment:11 by Simon Charette, 7 years ago

FWIW it looks like the SELECT DISTINCT column FROM large_result_set LIMIT small_threshold optimization was suggested on pgsql-hackers a few years ago but never made it to the TODO.

Note: See TracTickets for help on using tickets.

Download in other formats:

Issues