Context Navigation

← Previous Ticket
Next Ticket →

#36732 assigned Cleanup/optimization

Sitemaps with i18n=True fetch the entire queryset of items before paginating

Reported by:	Julien Palard	Owned by:	Augusto Pontes
Component:	contrib.sitemaps	Version:	5.2
Severity:	Normal	Keywords:	sitemap, memory
Cc:	Julien Palard, Roxane	Triage Stage:	Accepted
Has patch:	yes	Needs documentation:	yes
Needs tests:	yes	Patch needs improvement:	no
Easy pickings:	no	UI/UX:	no

Description (last modified by Julien Palard)

Hi!

It's a bit like: https://code.djangoproject.com/ticket/11572

when enabling i18n on a sitemap, _items uses the following code:

            # Create (item, lang_code) tuples for all items and languages.
            # This is necessary to paginate with all languages already considered.
            items = [
                (item, lang_code)
                for item in self.items()
                for lang_code in self.get_languages_for_item(item)
            ]

The list comprehension loads the whole table times the number of languages in memory.

We tried with a table containing two millions elements and two languages: it does not fit in 16GB of memory and the process gets killed.

Change History (13)

comment:1 by Julien Palard, 4 months ago

While:

    def get_languages_for_item(self, item):
        """Languages for which this item is displayed."""
        return self._languages()

does not use item, something (non trivial) could be implemented to slice the queryset at the right spot (depending on the number of languages), and the total number of elements could be computed by a simple multiplication.

But as long as get_languages_for_item uses item all of this break.

comment:2 by Julien Palard, 4 months ago

Description:	modified (diff)

comment:3 by Jacob Walls, 3 months ago

Cc:	Roxane added
Summary:	Sitemaps with i18n=True load the whole table in memory → Sitemaps with i18n=True fetch the entire queryset of items despite self.limit
Triage Stage:	Unreviewed → Accepted
Type:	Cleanup/optimization → Bug

Thanks for the report. #33662 introduced eager evaluation of self.items() (if i18n is in use) instead of letting self.items() (a queryset) remain unevaluated until paged by the paginator.

Tentatively accepting as a bug that we should at least respect self.limit.

Could be as simple as something like this (could limit even further by doing some language arithmetic):

django/contrib/sitemaps/init.py

diff --git a/django/contrib/sitemaps/__init__.py b/django/contrib/sitemaps/__init__.py
index c0c0ac77a7..faf60f346b 100644

                class GenericSitemap(Sitemap):
     def items(self):
         # Make sure to return a clone; we don't want premature evaluation.
         return self.queryset.filter()
+        return self.queryset[:self.limit]
     def lastmod(self, item):
         if self.date_field is not None:

I'm not seeing what you initially suggested about a cartesian product of items and languages -- if the order of the for loops in the comprehension were reversed, then I'd see the problem, but it's not, so it's the same item. Am I missing something?

(Pdb) items
[(<I18nTestModel: I18nTestModel object (1)>, 'en'), (<I18nTestModel: I18nTestModel object (1)>, 'pt')]
(Pdb) items[0][0] is items[1][0]
True

comment:4 by Varun Kasyap Pentamaraju, 3 months ago

Owner:	set to Varun Kasyap Pentamaraju
Status:	new → assigned

comment:5 by Julien Palard, 3 months ago

I'm not seeing what you initially suggested about a cartesian product

You're right, it's not a cartesian product database-side, it's only a carthesian product Python-side, so 8 bytes per object, so ~15 MB per language for my 2 million objects, that's nothing compared to the 16GB+ that it takes currently.

This can't be as simple as return self.queryset[:self.limit] because the limit is not a limit overall but a limit per page, so putting the limit here imples a single page (the pagination occurs over the result of items: return paginator.Paginator(self._items(), self.limit)).

The path of doing arithmetics to be able to query the right slice of items depending on the requested page *and* the number of languages breaks if get_languages_for_item uses its item argument.

Last edited 3 months ago by Julien Palard (previous) (diff)

comment:6 by Jacob Walls, 3 months ago

Right, okay. Maybe we can pursue checking if get_languages_for_item has been overridden and if not, avoid the materialization of the items before paginating.

comment:7 by Jacob Walls, 3 months ago

Summary:	Sitemaps with i18n=True fetch the entire queryset of items despite self.limit → Sitemaps with i18n=True fetch the entire queryset of items before paginating

comment:8 by Jacob Walls, 3 months ago

Type:	Bug → Cleanup/optimization

comment:9 by Julien Palard, 3 months ago

Thinking of while eating today, did not checked the code or tried an implementation, but:

The limit being just a limit, so in cases when an URL has no translation we could just not add it to the page, leaving some pages not comptely full, but this could be OK.

It would greatly ease pagination in case get_languages_for_item is overloaded.

The "risk" being: in a site with many languages, say 20, and a low limit, say 100 URLs per page, but very few URLs being translated, say less than 1%, then we would give only 5 URLs per page, leaving 95 « empty spots » for future translations. This could be considered as a bug.

comment:10 by Varun Kasyap Pentamaraju, 3 months ago

Owner:	Varun Kasyap Pentamaraju removed
Status:	assigned → new

Deassigned. Thanks

comment:11 by Augusto Pontes, 3 months ago

Owner:	set to Augusto Pontes
Status:	new → assigned

comment:12 by Augusto Pontes, 3 months ago

Has patch:	set
Needs documentation:	set
Needs tests:	set

comment:13 by Augusto Pontes, 3 months ago

https://github.com/django/django/pull/20268

Note: See TracTickets for help on using tickets.

Download in other formats:

Issues