Opened 3 years ago

Last modified 10 days ago

#32831 new Cleanup/optimization

Some cache tests are flaky (e.g. cache.tests.BaseMemcachedTests.test_touch)

Reported by: Chris Jerdonek Owned by:
Component: Core (Cache system) Version: dev
Severity: Normal Keywords:
Cc: Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Chris Jerdonek)

The following tests occasionally fail in CI:

  • cache.tests.PyLibMCCacheTests.test_touch
  • cache.tests.PyMemcacheCacheTests.test_touch
Traceback (most recent call last):
  File "/home/jenkins/workspace/pull-requests-bionic/database/sqlite3/label/bionic-pr/python/python3.8/tests/cache/tests.py", line 481, in test_touch
    self.assertIs(cache.touch('expire1'), True)
AssertionError: False is not True

The line that fails is here: https://github.com/django/django/blob/ed3af3ff4b3cfb72de598f1b39a1028ba3826efb/tests/cache/tests.py#L481

Both test classes inherit from BaseMemcachedTests, which inherits from BaseCacheTests.

Change History (15)

comment:1 by Chris Jerdonek, 3 years ago

Description: modified (diff)

comment:2 by Mariusz Felisiak, 3 years ago

Triage Stage: UnreviewedAccepted

Yes these tests are flaky (and probably 2-3 more cache tests). We tried to fix them in the past but with no success (see PR).

comment:3 by Chris Jerdonek, 3 years ago

Description: modified (diff)

comment:4 by Chris Jerdonek, 3 years ago

Could this be due to different concrete subclasses of BaseMemcachedTests stepping on each other during a parallel run? I ask because different test classes are still using the same cache key.

Last edited 3 years ago by Chris Jerdonek (previous) (diff)

comment:5 by Chris Jerdonek, 3 years ago

Another cache-related flaky test is: cache.tests.CacheMiddlewareTest.test_cache_page_timeout on this line.

Traceback (most recent call last):
  File "C:\Jenkins\workspace\pull-requests-windows\database\sqlite3\label\windows\python\Python39\tests\cache\tests.py", line 2449, in test_cache_page_timeout
    self.assertEqual(
AssertionError: b'Hello World 1' != b'Hello World 2'
Last edited 3 years ago by Chris Jerdonek (previous) (diff)

comment:6 by Chris Jerdonek, 3 years ago

A possible pattern for fixing issues like this would be to keep retrying the assertion in a while loop, with a small sleep in between each attempt, and a relatively large cumulative amount of time before giving up. That would make it as pass as quickly as possible, without waiting longer than necessary. I'm thinking of something like--

    def assert_with_retries(self, assert_func, *args,  interval_time=0.1, total_time=10, **kwargs):
        start = time.time()
        while True:
            try:
                assert_func(*args, **kwargs)
            except Exception as exc:
                if time.time() - start > total_time:
                    raise RuntimeError(f'Total retry time exceeded.') from exc
            else:
                break
            time.sleep(interval_time)
Last edited 3 years ago by Chris Jerdonek (previous) (diff)

comment:7 by Chris Jerdonek, 3 years ago

Something else that occurred to me is that it could only help to start each test by asserting that each key being used isn't set, and to end each test by clearing the key in a finally block. The former would help to show if a test isolation issue is at fault.

comment:8 by Chris Jerdonek, 3 years ago

Summary: Flaky test in CI: cache.tests.BaseMemcachedTests.test_touchSome cache tests are flaky (e.g. cache.tests.BaseMemcachedTests.test_touch)

comment:9 by Chris Jerdonek, 3 years ago

Owner: changed from nobody to Chris Jerdonek
Status: newassigned

comment:10 by Chris Jerdonek, 3 years ago

Another flaky test failure: cache.tests.MemcachedCacheTests.test_forever_timeout() at this line-- https://github.com/django/django/blob/1697098795707bd39501d9ceddd3d9be93f8d549/tests/cache/tests.py#L606
(again with pull-requests-bionic/database=spatialite,label=bionic-pr,python=python3.8).

Last edited 3 years ago by Chris Jerdonek (previous) (diff)

comment:11 by Chris Jerdonek, 3 years ago

Owner: Chris Jerdonek removed
Status: assignednew

comment:12 by Natalia Bidart, 3 months ago

I started some debugging efforts in this forum post and landed a fix for test_touch when using the filebased cache backend.

comment:13 by Natalia Bidart, 4 weeks ago

Cache flaky test in main-random:

 > git init /home/jenkins/workspace/main-random/database/spatialite/label/focal/python/python3.12 # timeout=10
...
Using shuffle seed: 5744808359 (generated)
...
======================================================================
FAIL [0.004s]: test_lru_get (cache.tests.LocMemCacheTests.test_lru_get)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/workspace/main-random/database/spatialite/label/focal/python/python3.12/django/test/utils.py", line 446, in inner
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jenkins/workspace/main-random/database/spatialite/label/focal/python/python3.12/tests/cache/tests.py", line 1406, in test_lru_get
    self.assertEqual(cache.get(key), key)
AssertionError: None != 0

----------------------------------------------------------------------
Ran 17769 tests in 1613.662s

FAILED (failures=1, skipped=1160, expected failures=5)

comment:14 by Natalia Bidart, 11 days ago

Another one, different but similar:

...
Using shuffle seed: 5526860306 (generated)
...
======================================================================
ERROR [0.003s]: test_lru_incr (cache.tests.LocMemCacheTests.test_lru_incr)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/workspace/main-random/database/spatialite/label/focal/python/python3.12/django/test/utils.py", line 446, in inner
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jenkins/workspace/main-random/database/spatialite/label/focal/python/python3.12/tests/cache/tests.py", line 1433, in test_lru_incr
    self.assertEqual(cache.incr(key), key + 1)
                     ^^^^^^^^^^^^^^^
  File "/home/jenkins/workspace/main-random/database/spatialite/label/focal/python/python3.12/django/core/cache/backends/locmem.py", line 70, in incr
    raise ValueError("Key '%s' not found" % key)
ValueError: Key ':1:0' not found

----------------------------------------------------------------------
Ran 17776 tests in 1566.550s

FAILED (errors=1, skipped=1161, expected failures=5)

comment:15 by Simon Charette, 10 days ago

I get a sense that a lot of these failures are due to tests using their same global store at the same time as Chris pointed out. A strong supporting signal of that is cache.clear is being called at setUp time which (e.g. in the case of memcached backend tests) could cause keys to be unexpectedly evicted.

A solution here could be to run these tests cases in they own separate key space (e.g. with a cache key prefix or with two distinct memcached instances) or to serialize them to ensure that no tests share the same resource (e.g. using a multiprocessing.lock acquired in setUp and released in tearDown).

There might be more too it though as I'm unsure what the djangci setup looks like. Is there one memcached / Redis instance per job or are they shared amongst them?

Last edited 10 days ago by Simon Charette (previous) (diff)
Note: See TracTickets for help on using tickets.
Back to Top