Opened 3 years ago

Closed 2 months ago

Last modified 2 months ago

#32831 closed Cleanup/optimization (fixed)

Some cache tests are flaky (e.g. cache.tests.BaseMemcachedTests.test_touch)

Reported by: Chris Jerdonek Owned by: Wassef Ben Ahmed
Component: Core (Cache system) Version: dev
Severity: Normal Keywords:
Cc: Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Chris Jerdonek)

The following tests occasionally fail in CI:

  • cache.tests.PyLibMCCacheTests.test_touch
  • cache.tests.PyMemcacheCacheTests.test_touch
Traceback (most recent call last):
  File "/home/jenkins/workspace/pull-requests-bionic/database/sqlite3/label/bionic-pr/python/python3.8/tests/cache/tests.py", line 481, in test_touch
    self.assertIs(cache.touch('expire1'), True)
AssertionError: False is not True

The line that fails is here: https://github.com/django/django/blob/ed3af3ff4b3cfb72de598f1b39a1028ba3826efb/tests/cache/tests.py#L481

Both test classes inherit from BaseMemcachedTests, which inherits from BaseCacheTests.

Change History (23)

comment:1 by Chris Jerdonek, 3 years ago

Description: modified (diff)

comment:2 by Mariusz Felisiak, 3 years ago

Triage Stage: UnreviewedAccepted

Yes these tests are flaky (and probably 2-3 more cache tests). We tried to fix them in the past but with no success (see PR).

comment:3 by Chris Jerdonek, 3 years ago

Description: modified (diff)

comment:4 by Chris Jerdonek, 3 years ago

Could this be due to different concrete subclasses of BaseMemcachedTests stepping on each other during a parallel run? I ask because different test classes are still using the same cache key.

Last edited 3 years ago by Chris Jerdonek (previous) (diff)

comment:5 by Chris Jerdonek, 3 years ago

Another cache-related flaky test is: cache.tests.CacheMiddlewareTest.test_cache_page_timeout on this line.

Traceback (most recent call last):
  File "C:\Jenkins\workspace\pull-requests-windows\database\sqlite3\label\windows\python\Python39\tests\cache\tests.py", line 2449, in test_cache_page_timeout
    self.assertEqual(
AssertionError: b'Hello World 1' != b'Hello World 2'
Last edited 3 years ago by Chris Jerdonek (previous) (diff)

comment:6 by Chris Jerdonek, 3 years ago

A possible pattern for fixing issues like this would be to keep retrying the assertion in a while loop, with a small sleep in between each attempt, and a relatively large cumulative amount of time before giving up. That would make it as pass as quickly as possible, without waiting longer than necessary. I'm thinking of something like--

    def assert_with_retries(self, assert_func, *args,  interval_time=0.1, total_time=10, **kwargs):
        start = time.time()
        while True:
            try:
                assert_func(*args, **kwargs)
            except Exception as exc:
                if time.time() - start > total_time:
                    raise RuntimeError(f'Total retry time exceeded.') from exc
            else:
                break
            time.sleep(interval_time)
Last edited 3 years ago by Chris Jerdonek (previous) (diff)

comment:7 by Chris Jerdonek, 3 years ago

Something else that occurred to me is that it could only help to start each test by asserting that each key being used isn't set, and to end each test by clearing the key in a finally block. The former would help to show if a test isolation issue is at fault.

comment:8 by Chris Jerdonek, 3 years ago

Summary: Flaky test in CI: cache.tests.BaseMemcachedTests.test_touchSome cache tests are flaky (e.g. cache.tests.BaseMemcachedTests.test_touch)

comment:9 by Chris Jerdonek, 3 years ago

Owner: changed from nobody to Chris Jerdonek
Status: newassigned

comment:10 by Chris Jerdonek, 3 years ago

Another flaky test failure: cache.tests.MemcachedCacheTests.test_forever_timeout() at this line-- https://github.com/django/django/blob/1697098795707bd39501d9ceddd3d9be93f8d549/tests/cache/tests.py#L606
(again with pull-requests-bionic/database=spatialite,label=bionic-pr,python=python3.8).

Last edited 3 years ago by Chris Jerdonek (previous) (diff)

comment:11 by Chris Jerdonek, 3 years ago

Owner: Chris Jerdonek removed
Status: assignednew

comment:12 by Natalia Bidart, 10 months ago

I started some debugging efforts in this forum post and landed a fix for test_touch when using the filebased cache backend.

comment:13 by Natalia Bidart, 7 months ago

Cache flaky test in main-random:

 > git init /home/jenkins/workspace/main-random/database/spatialite/label/focal/python/python3.12 # timeout=10
...
Using shuffle seed: 5744808359 (generated)
...
======================================================================
FAIL [0.004s]: test_lru_get (cache.tests.LocMemCacheTests.test_lru_get)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/workspace/main-random/database/spatialite/label/focal/python/python3.12/django/test/utils.py", line 446, in inner
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jenkins/workspace/main-random/database/spatialite/label/focal/python/python3.12/tests/cache/tests.py", line 1406, in test_lru_get
    self.assertEqual(cache.get(key), key)
AssertionError: None != 0

----------------------------------------------------------------------
Ran 17769 tests in 1613.662s

FAILED (failures=1, skipped=1160, expected failures=5)

comment:14 by Natalia Bidart, 7 months ago

Another one, different but similar:

...
Using shuffle seed: 5526860306 (generated)
...
======================================================================
ERROR [0.003s]: test_lru_incr (cache.tests.LocMemCacheTests.test_lru_incr)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/workspace/main-random/database/spatialite/label/focal/python/python3.12/django/test/utils.py", line 446, in inner
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jenkins/workspace/main-random/database/spatialite/label/focal/python/python3.12/tests/cache/tests.py", line 1433, in test_lru_incr
    self.assertEqual(cache.incr(key), key + 1)
                     ^^^^^^^^^^^^^^^
  File "/home/jenkins/workspace/main-random/database/spatialite/label/focal/python/python3.12/django/core/cache/backends/locmem.py", line 70, in incr
    raise ValueError("Key '%s' not found" % key)
ValueError: Key ':1:0' not found

----------------------------------------------------------------------
Ran 17776 tests in 1566.550s

FAILED (errors=1, skipped=1161, expected failures=5)

comment:15 by Simon Charette, 7 months ago

I get a sense that a lot of these failures are due to tests using their same global store at the same time as Chris pointed out. A strong supporting signal of that is cache.clear is being called at setUp time which (e.g. in the case of memcached backend tests) could cause keys to be unexpectedly evicted.

A solution here could be to run these tests cases in they own separate key space (e.g. with a cache key prefix or with two distinct memcached instances) or to serialize them to ensure that no tests share the same resource (e.g. using a multiprocessing.lock acquired in setUp and released in tearDown).

There might be more too it though as I'm unsure what the djangci setup looks like. Is there one memcached / Redis instance per job or are they shared amongst them?

Last edited 7 months ago by Simon Charette (previous) (diff)

comment:16 by Wassef Ben Ahmed, 4 months ago

Owner: set to Wassef Ben Ahmed
Status: newassigned

comment:17 by Wassef Ben Ahmed, 4 months ago

to my understanding,

so far we don't know which tests exactly are flaky.
mainly the cases mentioned above, and anything that uses a shared medium.

as @Simon Charette suggested this could be done by figuring out where to set locks, which he also pointed out that it might get tricky depending on how djangoci works (I have no clue)
we might need to handle lock depending on the platform running the test or use something like "portalocker" which covers cross-platform locks but forces the adoption of a new dependency?

expanding on a previous answer by @Chris Jerdonek on setting assertion retries, why not set it at a unit level instead of individual assertion?

how you'd typically do with some pytest extensions... And we'll be assured the whole test passed without having to reason line-by-line on what went wrong?

implementation goes something like:

from functools import wraps
def retry(retries=3, delay=1):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            attempts = 0
            while attempts < retries:
                try:
                    return func(*args, **kwargs)
                except AssertionError as e:
                    attempts += 1
                    if attempts >= retries:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator

only catching/retrying on "AssertionError" so other exceptions are not permitted.

which I'm still not sure about? (maybe a shared medium causes you to pluck a None or something...)

Last edited 4 months ago by Wassef Ben Ahmed (previous) (diff)

comment:18 by Wassef Ben Ahmed, 4 months ago

Has patch: set

comment:19 by Natalia Bidart, 3 months ago

Patch needs improvement: set

comment:20 by Wassef Ben Ahmed, 3 months ago

Patch needs improvement: unset

comment:21 by Natalia Bidart, 2 months ago

Triage Stage: AcceptedReady for checkin

comment:22 by GitHub <noreply@…>, 2 months ago

Resolution: fixed
Status: assignedclosed

In 957c54d9:

Fixed #32831 -– Allowed cache tests to be retried via a new "retry" decorator.

comment:23 by Natalia <124304+nessita@…>, 2 months ago

In 33cd593:

[5.1.x] Fixed #32831 -– Allowed cache tests to be retried via a new "retry" decorator.

Backport of 957c54d945fedb58febff05e4ce82377cc43f9f4 from main.

Note: See TracTickets for help on using tickets.
Back to Top