Opened 6 years ago
Closed 5 years ago
#30207 closed Cleanup/optimization (needsinfo)
Optimise paginator for tables with massive records
Reported by: | M. Javidan Darugar | Owned by: | M. Javidan Darugar |
---|---|---|---|
Component: | Core (Other) | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | Can Sarıgöl | Triage Stage: | Unreviewed |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
I had a problem with the Paginator class when slicing a huge dataset. With millions of records it can take 30 seconds to reach the last page, so I came up with a solution to overcome this problem.
The problem is that, to slice the data bottom:top, the current implementation generates a query like SELECT ID, COL_1, ..., COL_N, ... WHERE ..., which places a huge burden on the database when slicing the data. To overcome this we can instead select only the primary keys, do the slicing step on them, and then fetch the records whose pk is in that sliced list. A very simple but very efficient solution. I improved our project's performance significantly using this approach: from 30 seconds down to only 2-3 seconds for 8 million records.
In this ticket I propose the same approach to improve Django's Paginator class.
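The two-query technique described above can be sketched with plain SQL. This is an illustration only, not the ticket's actual patch; SQLite is used as a stand-in database, and the table and column names (`item`, `payload`) are invented for the example:

```python
# Two-step pagination: slice only the primary keys first (a light,
# index-friendly query), then fetch full rows for just that page.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO item (payload) VALUES (?)",
                 [(f"row-{i}",) for i in range(100)])

def page_pks_then_rows(conn, page, per_page):
    offset = (page - 1) * per_page
    # Step 1: slice only the primary keys.
    pks = [r[0] for r in conn.execute(
        "SELECT id FROM item ORDER BY id LIMIT ? OFFSET ?",
        (per_page, offset))]
    # Step 2: fetch the full rows whose pk is in the sliced list.
    qmarks = ",".join("?" * len(pks))
    return conn.execute(
        f"SELECT id, payload FROM item WHERE id IN ({qmarks}) ORDER BY id",
        pks).fetchall()

rows = page_pks_then_rows(conn, page=3, per_page=10)  # ids 21..30
```

Note that the second query must re-apply the ordering, since an IN filter does not preserve the order of the pk list.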
Change History (9)
comment:1 by , 6 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
follow-up: 7 comment:2 by , 6 years ago
follow-up: 6 comment:3 by , 6 years ago
- I also use a paginator hack for the admin. It works for Postgres only, and only when no filters are applied; this could be a good built-in optimization:
```python
from django.core.paginator import Paginator
from django.db import connection


class LargeTablePaginator(Paginator):
    """
    Warning: PostgreSQL-only hack.

    Overrides the count method of QuerySet objects to get an estimate
    instead of an actual count when not filtered. However, this estimate
    can be stale and hence is not fit for situations where the count of
    objects actually matters.
    """
    def _get_count(self):
        if getattr(self, '_count', None) is not None:
            return self._count
        query = self.object_list.query
        if not query.where:
            try:
                cursor = connection.cursor()
                cursor.execute(
                    "SELECT reltuples FROM pg_class WHERE relname = %s",
                    [query.model._meta.db_table],
                )
                self._count = int(cursor.fetchone()[0])
            except Exception:
                self._count = super(LargeTablePaginator, self)._get_count()
        else:
            self._count = super(LargeTablePaginator, self)._get_count()
        return self._count

    count = property(_get_count)
```
This gives maximum speed on a table of any size.
- If the query is ordered only by pk, it can be seriously optimized when the max pk of the previous page is known and you don't need to go back a page: just use SELECT ... WHERE pk < last_max_id. The pk index will then be used, giving maximum speed.
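The second suggestion above is keyset ("seek") pagination. A minimal sketch, using SQLite as a stand-in and ascending pk order (`pk > last_max_id`) rather than the descending form in the comment; the table and names are illustrative:

```python
# Keyset pagination: instead of OFFSET, remember the last pk of the
# previous page and filter on it, so the pk index seeks straight to
# the start of the next page instead of scanning past skipped rows.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO item (payload) VALUES (?)",
                 [(f"row-{i}",) for i in range(100)])

def next_page(conn, last_max_id, per_page):
    return conn.execute(
        "SELECT id, payload FROM item WHERE id > ? ORDER BY id LIMIT ?",
        (last_max_id, per_page)).fetchall()

page1 = next_page(conn, last_max_id=0, per_page=10)  # ids 1..10
page2 = next_page(conn, page1[-1][0], per_page=10)   # ids 11..20
```

The trade-off, as the comment notes, is that you can only step forward from a known position; arbitrary page jumps still need OFFSET.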
comment:4 by , 6 years ago
Resolution: | → needsinfo |
---|---|
Status: | assigned → closed |
The technical justification seems lacking both here and on the PR.
comment:5 by , 5 years ago
Resolution: | needsinfo |
---|---|
Status: | closed → new |
I am reopening this ticket because it is not solved yet. If I reply to messages with some delay, it is because I have a full-time job and rarely have time to work on a side project, but I still love contributing to the Django community.
Some of the comments on the GitHub branch were about the naming of variables or function arguments, which suggests there is no naming standard, since everyone can propose something different.
Another comment was irrelevant to what I proposed: even if Postgres has a new feature, that does not mean it is right for Django, because Django's ORM should work with other databases as well.
After four months of using the same approach in our project, the performance is still very good for paginating almost 8 million records on the production server.
comment:6 by , 5 years ago
What do you mean by index? A table index? For a table with such huge data, the index itself can be very large; in our case indexing used almost 18 GB of disk space. Indexing is a very useful technique, but not always: it depends on the problem and the situation.
Replying to Nikolas:
- I also use a paginator hack for the admin. It works for Postgres only, and only when no filters are applied; this could be a good built-in optimization:
```python
from django.core.paginator import Paginator
from django.db import connection


class LargeTablePaginator(Paginator):
    """
    Warning: PostgreSQL-only hack.

    Overrides the count method of QuerySet objects to get an estimate
    instead of an actual count when not filtered. However, this estimate
    can be stale and hence is not fit for situations where the count of
    objects actually matters.
    """
    def _get_count(self):
        if getattr(self, '_count', None) is not None:
            return self._count
        query = self.object_list.query
        if not query.where:
            try:
                cursor = connection.cursor()
                cursor.execute(
                    "SELECT reltuples FROM pg_class WHERE relname = %s",
                    [query.model._meta.db_table],
                )
                self._count = int(cursor.fetchone()[0])
            except Exception:
                self._count = super(LargeTablePaginator, self)._get_count()
        else:
            self._count = super(LargeTablePaginator, self)._get_count()
        return self._count

    count = property(_get_count)
```
This gives maximum speed on a table of any size.
- If the query is ordered only by pk, it can be seriously optimized when the max pk of the previous page is known and you don't need to go back a page: just use SELECT ... WHERE pk < last_max_id. The pk index will then be used, giving maximum speed.
comment:7 by , 5 years ago
Well, it is a bit surprising to me as well that many developers don't know about this, because it is a very basic fact about database internals; maybe it is because most developers focus on the code base rather than on raw queries.
Replying to Simon Charette:
Hello there,
Could you provide more details about which database backend is sped up with this technique? I'm a bit surprised that two queries perform faster than one in this case, because even if the first one would allow the use of an index-only scan, the tuples still have to be fetched in the end. The query planner should be smart enough to chain the operations in a single query. I suspect something else might be at play here, such as the query to count the number of results.
By the way, the issue title should be changed to reflect what the optimization technique is as right now it's pretty vague.
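Simon's point that the planner can chain both steps in a single query is often expressed as the "deferred join" pattern: the pk slice becomes a derived table joined back to the base table. This is my illustration, not something proposed in the ticket; SQLite stand-in, invented names:

```python
# Single-query form of the pk-slice technique: slice the pks in a
# subquery, then join back to the base table for the full rows, all
# in one statement the planner can optimize as a whole.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO item (payload) VALUES (?)",
                 [(f"row-{i}",) for i in range(100)])

rows = conn.execute(
    """
    SELECT t.id, t.payload
    FROM item AS t
    JOIN (SELECT id FROM item ORDER BY id LIMIT 10 OFFSET 20) AS page
      ON t.id = page.id
    ORDER BY t.id
    """).fetchall()  # ids 21..30, same page as the two-query version
```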
comment:8 by , 5 years ago
Cc: | added |
---|
comment:9 by , 5 years ago
Resolution: | → needsinfo |
---|---|
Status: | new → closed |
Hi. At this point, the issue tracker here is not the appropriate place for this discussion.
Please see TicketClosingReasons/DontReopenTickets.
Rather, a message to the DevelopersMailingList explaining the issue and the proposed change would be needed to progress.
A couple of points to make it easier:
- This kind of thing has come up previously: search here and the mailing list history to find the discussions. Summarizing that history may help.
- Wrapping your suggestion in a custom paginator class and providing a sample project, with some benchmarking data maybe, so people can explore your solution, would be useful.
- In general, the proficiency level of contributors to the ORM is very high. They know how databases work. Assume that.
Thanks.