Opened 3 weeks ago
Last modified 8 days ago
#36732 assigned Cleanup/optimization
Sitemaps with i18n=True fetch the entire queryset of items before paginating
| Reported by: | Julien Palard | Owned by: | Augusto Pontes |
|---|---|---|---|
| Component: | contrib.sitemaps | Version: | 5.2 |
| Severity: | Normal | Keywords: | sitemap, memory |
| Cc: | Julien Palard, Roxane | Triage Stage: | Accepted |
| Has patch: | yes | Needs documentation: | yes |
| Needs tests: | yes | Patch needs improvement: | no |
| Easy pickings: | no | UI/UX: | no |
Description
Hi!
It's a bit like: https://code.djangoproject.com/ticket/11572
When enabling i18n on a sitemap, `Sitemap._items()` uses the following code:

```python
# Create (item, lang_code) tuples for all items and languages.
# This is necessary to paginate with all languages already considered.
items = [
    (item, lang_code)
    for item in self.items()
    for lang_code in self.get_languages_for_item(item)
]
```
The list comprehension loads the whole table, times the number of languages, into memory.
We tried with a table containing two million elements and two languages: it does not fit in 16GB of memory and the process gets killed.
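The cost can be reproduced without Django -- pairing every item with every language in a list comprehension materializes all tuples at once, whereas a generator would yield them on demand. A minimal sketch with stand-in data (plain Python, no queryset involved):

```python
# Stand-ins for self.items() (a lazy queryset) and the configured languages.
items = range(1_000)
languages = ["en", "fr"]

# Eager, as in Sitemap._items(): builds every (item, lang_code) pair up front.
# With 2 million rows and 2 languages that is 4 million tuples in memory,
# on top of the model instances themselves.
eager = [(item, lang) for item in items for lang in languages]

# Lazy alternative: the same pairs, produced one at a time on demand.
lazy = ((item, lang) for item in items for lang in languages)

print(len(eager))        # 2000
print(next(iter(lazy)))  # (0, 'en')
```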
Change History (13)
comment:1 by , 3 weeks ago
comment:2 by , 3 weeks ago
| Description: | modified (diff) |
|---|
comment:3 by , 3 weeks ago
| Cc: | added |
|---|---|
| Summary: | Sitemaps with i18n=True load the whole table in memory → Sitemaps with i18n=True fetch the entire queryset of items despite self.limit |
| Triage Stage: | Unreviewed → Accepted |
| Type: | Cleanup/optimization → Bug |
Thanks for the report. #33662 introduced eager evaluation of self.items() (if i18n is in use) instead of letting self.items() (a queryset) remain unevaluated until paged by the paginator.
Tentatively accepting as a bug that we should at least respect self.limit.
Could be as simple as something like this (could limit even further by doing some language arithmetic):
```diff
diff --git a/django/contrib/sitemaps/__init__.py b/django/contrib/sitemaps/__init__.py
index c0c0ac77a7..faf60f346b 100644
--- a/django/contrib/sitemaps/__init__.py
+++ b/django/contrib/sitemaps/__init__.py
@@ -185,7 +185,7 @@ class GenericSitemap(Sitemap):
 
     def items(self):
         # Make sure to return a clone; we don't want premature evaluation.
-        return self.queryset.filter()
+        return self.queryset[:self.limit]
 
     def lastmod(self, item):
         if self.date_field is not None:
```
I'm not seeing what you initially suggested about a cartesian product of items and languages -- if the order of the for loops in the comprehension were reversed, then I'd see the problem, but it's not, so it's the same item. Am I missing something?
```
(Pdb) items
[(<I18nTestModel: I18nTestModel object (1)>, 'en'), (<I18nTestModel: I18nTestModel object (1)>, 'pt')]
(Pdb) items[0][0] is items[1][0]
True
```
comment:4 by , 3 weeks ago
| Owner: | set to |
|---|---|
| Status: | new → assigned |
comment:5 by , 3 weeks ago
> I'm not seeing what you initially suggested about a cartesian product
You're right, it's not a cartesian product database-side, it's only a cartesian product Python-side, so 8 bytes per object reference, i.e. ~15 MB per language for my 2 million objects -- nothing compared to the 16GB+ that it takes currently.
This can't be as simple as `return self.queryset[:self.limit]` because the limit is not an overall limit but a per-page limit, so putting the limit there implies a single page (the pagination happens over the result of `_items()`: `return paginator.Paginator(self._items(), self.limit)`).
The approach of doing arithmetic to query the right slice of items, depending on the requested page *and* the number of languages, breaks if `get_languages_for_item` uses its `item` argument.
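For the special case where every item has the same known number of languages, that slice arithmetic could look like the sketch below. `item_slice_for_page` is an invented helper name, not Django API, and it assumes exactly `n_languages` pairs per item:

```python
from math import ceil

def item_slice_for_page(page, limit, n_languages):
    # Page `page` (1-indexed) covers (item, lang_code) pairs with indices
    # [(page - 1) * limit, page * limit); map those back to item indices,
    # assuming every item yields exactly n_languages pairs.
    first_pair = (page - 1) * limit
    last_pair = page * limit  # exclusive
    return first_pair // n_languages, ceil(last_pair / n_languages)

# Page 1 of 100 pairs with 2 languages needs the first 50 items...
print(item_slice_for_page(1, 100, 2))  # (0, 50)
# ...and page 2 with limit 3 needs items 1 and 2 (pairs 3, 4 and 5).
print(item_slice_for_page(2, 3, 2))    # (1, 3)
```

A queryset slice like `self.items()[start:stop]` would then stay lazy until the page is rendered, but this only holds when every item really has the same number of languages.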
comment:6 by , 3 weeks ago
Right, okay. Maybe we can pursue checking whether `get_languages_for_item` has been overridden and, if not, avoid materializing the items before paginating.
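One plausible way to detect such an override (a sketch with minimal stand-in classes, not Django's actual code) is to compare the function looked up on the subclass with the one defined on the base class:

```python
class Sitemap:
    """Minimal stand-in for django.contrib.sitemaps.Sitemap."""

    def get_languages_for_item(self, item):
        return ["en", "fr"]


class CustomSitemap(Sitemap):
    def get_languages_for_item(self, item):
        # Per-item languages: the case that defeats slice arithmetic.
        return ["en"]


def uses_default_languages(sitemap):
    # An override replaces the attribute on the subclass, so comparing the
    # looked-up function with the base class's function detects it.
    return type(sitemap).get_languages_for_item is Sitemap.get_languages_for_item


print(uses_default_languages(Sitemap()))        # True
print(uses_default_languages(CustomSitemap()))  # False
```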
comment:7 by , 3 weeks ago
| Summary: | Sitemaps with i18n=True fetch the entire queryset of items despite self.limit → Sitemaps with i18n=True fetch the entire queryset of items before paginating |
|---|
comment:8 by , 3 weeks ago
| Type: | Bug → Cleanup/optimization |
|---|
comment:9 by , 3 weeks ago
Thinking about this over lunch today -- I haven't checked the code or tried an implementation, but:
The limit is just an upper bound, so when a URL has no translation we could simply not add it to the page, leaving some pages not completely full, which could be OK.
It would greatly ease pagination in case `get_languages_for_item` is overridden.
The "risk": on a site with many languages, say 20, and a low limit, say 100 URLs per page, but with very few URLs translated, say less than 1%, we would serve only 5 URLs per page, leaving 95 "empty spots" for future translations. This could be considered a bug.
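The arithmetic behind that scenario, spelled out (assuming the paginator reserves worst-case room for every item to appear in all languages):

```python
n_languages = 20   # many languages
limit = 100        # low limit: URL slots per sitemap page

# Reserving worst-case room (every item could appear in all languages)
# means each page can hold at most limit // n_languages items.
items_per_page = limit // n_languages   # 5

# With fewer than 1% of URLs translated, each item contributes ~1 URL,
# so a page carries ~5 URLs and leaves ~95 "empty spots".
urls_per_page = items_per_page
empty_slots = limit - urls_per_page

print(items_per_page, urls_per_page, empty_slots)  # 5 5 95
```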
comment:11 by , 2 weeks ago
| Owner: | set to |
|---|---|
| Status: | new → assigned |
comment:12 by , 8 days ago
| Has patch: | set |
|---|---|
| Needs documentation: | set |
| Needs tests: | set |
While the default implementation:

```python
def get_languages_for_item(self, item):
    """Languages for which this item is displayed."""
    return self._languages()
```

does not use `item`, something (non-trivial) could be implemented to slice the queryset at the right spot (depending on the number of languages), and the total number of elements could be computed by a simple multiplication.
But as long as `get_languages_for_item` uses `item`, all of this breaks.
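Under that assumption (a fixed language list, so `get_languages_for_item` ignores `item`), the idea could be sketched as a lazy sequence that a paginator can page over: length by multiplication, and slicing mapped back to a slice of the underlying items. The class name and structure here are invented for illustration, not the proposed patch:

```python
from math import ceil

class LazyItemLanguagePairs:
    """Hypothetical lazy sequence of (item, lang_code) pairs.

    Assumes every item has exactly the same languages, and that callers
    (like Django's Paginator) index it with integer-bounded slices.
    """

    def __init__(self, items, languages):
        self.items = items          # any sequence supporting len() and slicing
        self.languages = languages

    def __len__(self):
        # Total computed by simple multiplication, as suggested above.
        return len(self.items) * len(self.languages)

    def __getitem__(self, key):
        start, stop = key.start or 0, key.stop
        k = len(self.languages)
        # Fetch only the items covering the requested pair range...
        wanted = self.items[start // k : ceil(stop / k)]
        pairs = [(it, lang) for it in wanted for lang in self.languages]
        # ...then trim the partial pairs at the front and back.
        offset = start - (start // k) * k
        return pairs[offset : offset + (stop - start)]

seq = LazyItemLanguagePairs(list(range(5)), ["en", "pt"])
print(len(seq))   # 10
print(seq[3:6])   # [(1, 'pt'), (2, 'en'), (2, 'pt')]
```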