Opened 3 weeks ago

Last modified 8 days ago

#36732 assigned Cleanup/optimization

Sitemaps with i18n=True fetch the entire queryset of items before paginating

Reported by: Julien Palard Owned by: Augusto Pontes
Component: contrib.sitemaps Version: 5.2
Severity: Normal Keywords: sitemap, memory
Cc: Julien Palard, Roxane Triage Stage: Accepted
Has patch: yes Needs documentation: yes
Needs tests: yes Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Julien Palard)

Hi!

It's a bit like: https://code.djangoproject.com/ticket/11572

when enabling i18n on a sitemap, _items uses the following code:

            # Create (item, lang_code) tuples for all items and languages.
            # This is necessary to paginate with all languages already considered.
            items = [
                (item, lang_code)
                for item in self.items()
                for lang_code in self.get_languages_for_item(item)
            ]

The list comprehension loads the whole table times the number of languages in memory.

We tried with a table containing two millions elements and two languages: it does not fit in 16GB of memory and the process gets killed.

Change History (13)

comment:1 by Julien Palard, 3 weeks ago

While:

    def get_languages_for_item(self, item):
        """Languages for which this item is displayed."""
        return self._languages()

does not use item, something (non trivial) could be implemented to slice the queryset at the right spot (depending on the number of languages), and the total number of elements could be computed by a simple multiplication.

But as long as get_languages_for_item uses item all of this break.

comment:2 by Julien Palard, 3 weeks ago

Description: modified (diff)

comment:3 by Jacob Walls, 3 weeks ago

Cc: Roxane added
Summary: Sitemaps with i18n=True load the whole table in memorySitemaps with i18n=True fetch the entire queryset of items despite self.limit
Triage Stage: UnreviewedAccepted
Type: Cleanup/optimizationBug

Thanks for the report. #33662 introduced eager evaluation of self.items() (if i18n is in use) instead of letting self.items() (a queryset) remain unevaluated until paged by the paginator.

Tentatively accepting as a bug that we should at least respect self.limit.

Could be as simple as something like this (could limit even further by doing some language arithmetic):

  • django/contrib/sitemaps/__init__.py

    diff --git a/django/contrib/sitemaps/__init__.py b/django/contrib/sitemaps/__init__.py
    index c0c0ac77a7..faf60f346b 100644
    a b class GenericSitemap(Sitemap):  
    185185
    186186    def items(self):
    187187        # Make sure to return a clone; we don't want premature evaluation.
    188         return self.queryset.filter()
     188        return self.queryset[:self.limit]
    189189
    190190    def lastmod(self, item):
    191191        if self.date_field is not None:

I'm not seeing what you initially suggested about a cartesian product of items and languages -- if the order of the for loops in the comprehension were reversed, then I'd see the problem, but it's not, so it's the same item. Am I missing something?

(Pdb) items
[(<I18nTestModel: I18nTestModel object (1)>, 'en'), (<I18nTestModel: I18nTestModel object (1)>, 'pt')]
(Pdb) items[0][0] is items[1][0]
True

comment:4 by Varun Kasyap Pentamaraju, 3 weeks ago

Owner: set to Varun Kasyap Pentamaraju
Status: newassigned

comment:5 by Julien Palard, 3 weeks ago

I'm not seeing what you initially suggested about a cartesian product

You're right, it's not a cartesian product database-side, it's only a carthesian product Python-side, so 8 bytes per object, so ~15 MB per language for my 2 million objects, that's nothing compared to the 16GB+ that it takes currently.

This can't be as simple as return self.queryset[:self.limit] because the limit is not a limit overall but a limit per page, so putting the limit here imples a single page (the pagination occurs over the result of items: return paginator.Paginator(self._items(), self.limit)).

The path of doing arithmetics to be able to query the right slice of items depending on the requested page *and* the number of languages breaks if get_languages_for_item uses its item argument.

Last edited 3 weeks ago by Julien Palard (previous) (diff)

comment:6 by Jacob Walls, 3 weeks ago

Right, okay. Maybe we can pursue checking if get_languages_for_item has been overridden and if not, avoid the materialization of the items before paginating.

comment:7 by Jacob Walls, 3 weeks ago

Summary: Sitemaps with i18n=True fetch the entire queryset of items despite self.limitSitemaps with i18n=True fetch the entire queryset of items before paginating

comment:8 by Jacob Walls, 3 weeks ago

Type: BugCleanup/optimization

comment:9 by Julien Palard, 3 weeks ago

Thinking of while eating today, did not checked the code or tried an implementation, but:

The limit being just a limit, so in cases when an URL has no translation we could just not add it to the page, leaving some pages not comptely full, but this could be OK.

It would greatly ease pagination in case get_languages_for_item is overloaded.

The "risk" being: in a site with many languages, say 20, and a low limit, say 100 URLs per page, but very few URLs being translated, say less than 1%, then we would give only 5 URLs per page, leaving 95 « empty spots » for future translations. This could be considered as a bug.

comment:10 by Varun Kasyap Pentamaraju, 2 weeks ago

Owner: Varun Kasyap Pentamaraju removed
Status: assignednew

Deassigned. Thanks

comment:11 by Augusto Pontes, 2 weeks ago

Owner: set to Augusto Pontes
Status: newassigned

comment:12 by Augusto Pontes, 8 days ago

Has patch: set
Needs documentation: set
Needs tests: set
Note: See TracTickets for help on using tickets.
Back to Top