Opened 10 years ago
Last modified 7 years ago
#23242 new New feature
Add an option to avoid slow date_hierarchy queries on a big tables
Reported by: | Alexander Klimenko | Owned by: | nobody |
---|---|---|---|
Component: | contrib.admin | Version: | dev |
Severity: | Normal | Keywords: | date_hierarchy datetimes dates |
Cc: | Hervé Cauwelier | Triage Stage: | Accepted |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | yes |
Easy pickings: | no | UI/UX: | no |
Description
Because of using DISTINCT query date_hierarchy incredibly slows down big tables rendering (~106 records),
May be it is more convenient to simply iterate between min and max values rather than using dates or datetimes queryset methods?
Change History (11)
comment:1 by , 10 years ago
Resolution: | → needsinfo |
---|---|
Status: | new → closed |
comment:2 by , 10 years ago
Easy pickings: | unset |
---|
When the date_hierarchy
option is used, the admin shows a list of available years, months or days. The implementation is designed to show only dates for which objects exist.
It provides a better UX for sparsely populated tables, where iterating between min and max dates would show links to pages with no objects, but performs badly on large tables.
In general, the admin isn't designed for dealing with large amounts of data. This choice can be traced back to Django being designed for editorial sites.
Certainly, the situation could be improved, but how?
comment:3 by , 10 years ago
Cc: | added |
---|---|
Resolution: | needsinfo |
Status: | closed → new |
Here is an example of such a request:
SELECT DISTINCT DATE_TRUNC('day', "messages_message"."sent_at" AT TIME ZONE E'Europe/Paris') FROM "messages_message" WHERE (EXTRACT('month' FROM "messages_message"."sent_at" AT TIME ZONE E'Europe/Paris') = 9 AND "messages_message"."sent_at" BETWEEN '2014-01-01T00:00:00+01:00'::timestamptz and '2014-12-31T23:59:59.999999+01:00'::timestamptz AND "messages_message"."sent_at" IS NOT NULL) ORDER BY 1 ASC
The explain/analyze on ~6 million entries:
Sort (cost=62488.59..62488.60 rows=1 width=8) (actual time=18791.784..18791.784 rows=1 loops=1) Sort Key: (date_trunc('day'::text, timezone('Europe/Paris'::text, sent_at))) Sort Method: quicksort Memory: 25kB -> HashAggregate (cost=62488.57..62488.58 rows=1 width=8) (actual time=18791.465..18791.466 rows=1 loops=1) -> Bitmap Heap Scan on messages_message (cost=856.42..62488.18 rows=153 width=8) (actual time=1276.004..17066.794 rows=6135296 loops=1) Recheck Cond: ((sent_at >= '2014-01-01 00:00:00+01'::timestamp with time zone) AND (sent_at <= '2014-12-31 23:59:59.999999+01'::timestamp with time zone) AND (sent_at IS NOT NULL)) Filter: (date_part('month'::text, timezone('Europe/Paris'::text, sent_at)) = 9::double precision) -> Bitmap Index Scan on messages_message_sent_at_id_idx (cost=0.00..856.38 rows=30655 width=0) (actual time=1274.039..1274.039 rows=6136069 loops=1) Index Cond: ((sent_at >= '2014-01-01 00:00:00+01'::timestamp with time zone) AND (sent_at <= '2014-12-31 23:59:59.999999+01'::timestamp with time zone) AND (sent_at IS NOT NULL)) Total runtime: 18791.933 ms
comment:4 by , 10 years ago
@herve, do you have ideas for how to improve the situation? Absent that, I don't see a reason to keep this ticket open besides maybe documenting "Don't use date_hierarchy
if you have millions of entries."
comment:5 by , 10 years ago
The alternative would be an option to show all months and days regardless of whether items exist on these dates.
date_hierarchy_existing_only = True # default
The year part can be handled with range(min, max) which will be fast if the field is indexed.
comment:6 by , 10 years ago
Summary: | slow date_hierarchy on a big tables → Add an option to avoid slow date_hierarchy queries on a big tables |
---|---|
Triage Stage: | Unreviewed → Accepted |
Type: | Cleanup/optimization → New feature |
comment:8 by , 9 years ago
Patch needs improvement: | set |
---|
comment:9 by , 7 years ago
As discussed in this django-developers discussion, there's a third-party solution in django-admin-lightweight-date-hierarchy.
comment:10 by , 7 years ago
While it's still possible that getting rid of the the DISTINCT
generated by dates()
might help I wonder, by looking at the provided PostgreSQL plan in comment:3, if the actual culprit of the slow query was the EXTRACT('month')
all along as related in #28933.
I also wonder if LIMIT
'ing the results to 12
in in the case of year&month
might help the database engine engine figure out that it can stop the DISTINCT
processing once it retrieved 12 results. The ORDER BY
clause would have to be dropped here or wrapped around the LIMIT
ed query.
Another option would to stop using dates
(or allow passing a range to it) for year&month
and year&month&day
filtering and use a combination of generate_series
and EXISTS
to speed up the query.
e.g. for the year&month
case (on a DateField
for simplicitiy but nothing prevent us from using the same mechanism on DateTimeField
with a few adjustments)
SELECT EXTRACT('month' from serie.date) FROM generate_series('2017-01-01'::date, '2017-12-01'::date, '1 month') AS serie(date) WHERE EXISTS ( SELECT 1 FROM event WHERE event.date >= serie.date AND event.date < (serie.date + interval '1' month) )
This seems to perform very well on a large events table with an indexed datetime
column containing 17M rows.
comment:11 by , 7 years ago
FWIW it looks like the SELECT DISTINCT column FROM large_result_set LIMIT small_threshold
optimization was suggested on pgsql-hackers a few years ago but never made it to the TODO.
I'm not sure where the distinct query is that you're referring to. I also don't understand your proposed solution. Please reopen if you can provide more details (or even a patch).