Opened 6 years ago

Closed 6 years ago

#28977 closed New feature (fixed)

Change Local Memory Cache to Use LRU

Reported by: Grant Jenks Owned by: Grant Jenks
Component: Core (Cache system) Version: dev
Severity: Normal Keywords:
Cc: Adam Johnson, josh.smeaton@… Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

The current local-memory cache (locmem) in Django uses a pseudo-random culling strategy. Instead of culling at random, the OrderedDict data type can be used to implement an LRU eviction policy. A prototype of the approach already exists in functools.lru_cache, and Python 3's OrderedDict.move_to_end and OrderedDict.popitem make the implementation straightforward.
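
To illustrate the idea, here is a minimal sketch of LRU eviction built on OrderedDict (illustrative only, not the actual patch; a real backend also needs timeouts, pickling, and locking):

from collections import OrderedDict

class LRUCache:
    def __init__(self, max_entries=300):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        # Every hit reorders the key to the most-recently-used position (the front).
        self._data.move_to_end(key, last=False)
        return self._data[key]

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key, last=False)
        if len(self._data) > self.max_entries:
            # popitem() defaults to last=True, i.e. the least recently used entry.
            self._data.popitem()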

I'm willing to work on a pull request that changes locmem to use an LRU eviction strategy, but I wanted to first check whether it would be accepted. I did some research to find a good reason for locmem's random culling strategy but did not find one.

There's also a bit of prior art at https://pypi.python.org/pypi/django-lrucache-backend.

Change History (13)

comment:1 by Tim Graham, 6 years ago

Triage Stage: Unreviewed → Someday/Maybe

I'm not sure how to evaluate the pros and cons of that change. I guess the idea should be raised on the DevelopersMailingList.

comment:2 by Grant Jenks, 6 years ago

Owner: changed from nobody to Grant Jenks
Status: new → assigned

comment:3 by Grant Jenks, 6 years ago

I have created the example changes at https://github.com/grantjenks/django/tree/ticket_28977. The only thing I remain unsure about is whether django.utils.synch.RWLock needs to be deprecated. The commit is at https://github.com/grantjenks/django/commit/b06574f6713d4b7d367d7a11e0268fb62f5fd1d1.

I will ask for feedback on the developers mailing list next.

comment:4 by Grant Jenks, 6 years ago

Has patch: set
Triage Stage: Someday/Maybe → Accepted

Pull request created at https://github.com/django/django/pull/9555

Django developers mailing list thread at https://groups.google.com/forum/#!topic/django-developers/Gz2XqtoYmNk

Summarizing the two responses: Josh Smeaton (author of django-lrucache-backend, the project cited in the description) was in favor of providing a better default option, and Adam Johnson was also +1. Both were also interested in providing a way to disable cache key validation. I'm +0 on that change as long as the default is to validate the key, but I'd rather not add it myself.

I did some very simple benchmarking locally:

Here's the performance of cache.get for the current implementation:

$ python manage.py shell
Python 3.6.3 (default, Oct  5 2017, 22:47:21) 
Type 'copyright', 'credits' or 'license' for more information
IPython 6.2.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from django.core.cache import cache

In [2]: cache.set(b'foo', b'bar')

In [3]: %timeit cache.get(b'foo')
14.2 µs ± 109 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

And here's the performance of cache.get for the new implementation:

$ python manage.py shell
Python 3.6.3 (default, Oct  5 2017, 22:47:21) 
Type 'copyright', 'credits' or 'license' for more information
IPython 6.2.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from django.core.cache import cache

In [2]: cache.set(b'foo', b'bar')

In [3]: %timeit cache.get(b'foo')
6.29 µs ± 140 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

I also tried Josh's django-lrucache-backend implementation:

$ python manage.py shell
Python 3.6.3 (default, Oct  5 2017, 22:47:21) 
Type 'copyright', 'credits' or 'license' for more information
IPython 6.2.1 -- An enhanced Interactive Python. Type '?' for help.

In [3]: from django.core.cache import caches

In [4]: cache = caches['local']

In [5]: cache.set(b'foo', b'bar')

In [6]: %timeit cache.get(b'foo')
10.1 µs ± 135 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

It's not a great benchmark, but it's encouraging. The new implementation appears to be faster than both the current implementation and django-lrucache-backend. I haven't profiled, but I would guess that collections.OrderedDict and threading.RLock, both of which are introduced by these changes, are simply very fast.

comment:5 by Grant Jenks, 6 years ago

Josh asked on django-developers if I would run more extensive benchmarks against the proposed changes. Here are the results for the locmem backend using the python-diskcache benchmark harness. Note that these were generated with only one process, as locmem does no cross-process synchronization.

The current implementation benchmark results:

========= ========= ========= ========= ========= ========= ========= =========
Timings for locmem
-------------------------------------------------------------------------------
   Action     Count      Miss    Median       P90       P99       Max     Total
========= ========= ========= ========= ========= ========= ========= =========
      get     88986     16526  12.159us  19.789us  28.849us 154.734us   1.266s 
      set      9030         0  14.067us  15.974us  30.994us 150.919us 134.134ms
   delete       983         0  11.921us  13.828us  26.941us  32.663us  12.563ms
    Total     98999                                                     1.413s 
========= ========= ========= ========= ========= ========= ========= =========

And the new implementation benchmark results:

========= ========= ========= ========= ========= ========= ========= =========
Timings for locmem
-------------------------------------------------------------------------------
   Action     Count      Miss    Median       P90       P99       Max     Total
========= ========= ========= ========= ========= ========= ========= =========
      get     88986     16526   5.007us   5.245us  14.067us  80.109us 449.330ms
      set      9030         0   5.960us   7.153us  16.689us 111.103us  58.852ms
   delete       983         0   4.053us   5.007us   8.821us  18.835us   4.200ms
    Total     98999                                                   512.381ms
========= ========= ========= ========= ========= ========= ========= =========

Displayed are four percentiles of timing: median (50th), 90th, 99th, and max (100th). The results are better across the board, with a roughly 50% speedup being typical.

comment:6 by Grant Jenks, 6 years ago

One more set of benchmarks. The results above were gathered in a single-process, single-thread environment, so the locks never contend with one another for access. I thought it would be useful to observe performance under contention, and my development machine has eight logical processors, so here are the results in a single-process, eight-thread environment.
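
A rough sketch of the kind of threaded loop that produces this contention (an assumed setup for illustration, not the actual python-diskcache harness):

import threading
import time
from django.core.cache import cache

def worker(n_ops):
    # Hammer a warm key from several threads so the backend's lock contends.
    for _ in range(n_ops):
        cache.get(b'foo')

def run(threads=8, n_ops=100000):
    cache.set(b'foo', b'bar')
    pool = [threading.Thread(target=worker, args=(n_ops,)) for _ in range(threads)]
    start = time.perf_counter()
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return time.perf_counter() - start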

The current implementation:

========= ========= ========= ========= ========= ========= ========= =========
Timings for locmem
-------------------------------------------------------------------------------
   Action     Count      Miss    Median       P90       P99       Max     Total
========= ========= ========= ========= ========= ========= ========= =========
      get    712648     85412  14.067us 863.075us   2.461ms  14.337ms 160.973s 
      set     71226         0 204.563us 238.180us 282.288us   7.419ms  12.356s 
   delete      8118         0 201.941us 235.081us 281.811us 410.080us   1.395s 
    Total    791992                                                   174.724s 
========= ========= ========= ========= ========= ========= ========= =========

The new implementation:

========= ========= ========= ========= ========= ========= ========= =========
Timings for locmem
-------------------------------------------------------------------------------
   Action     Count      Miss    Median       P90       P99       Max     Total
========= ========= ========= ========= ========= ========= ========= =========
      get    712637     84222 146.151us 165.224us 201.225us   9.591ms 106.855s 
      set     71241         0 149.012us 167.847us 204.802us   9.604ms  10.875s 
   delete      8114         0 144.958us 163.078us 195.026us 552.893us   1.200s 
    Total    791992                                                   118.930s 
========= ========= ========= ========= ========= ========= ========= =========

Overall the results display less variation, but the median "get" time is about 10x slower (14 microseconds vs 146 microseconds). Despite that large slowdown at the median, the total time for "get" operations is roughly a third lower (about 161 seconds vs 107 seconds). The "set" and "delete" operations have comparable performance.

The LRU eviction policy turns every read into a write, which obsoletes the django.utils.synch.RWLock used in the current implementation. The RWLock allowed multiple readers concurrent access, but at the cost of additional overhead. That is why, in the "no contention" scenario, the new implementation's threading.RLock shows less overhead. In the "high contention" scenario, the RWLock's concurrent reads give the current implementation an advantage at the median but a much larger penalty at the 99th percentile.
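
To make the locking change concrete, here is a rough sketch (illustrative names, not Django's actual code) of why a plain threading.RLock suffices once reads reorder the dict:

import threading
from collections import OrderedDict

_lock = threading.RLock()
_cache = OrderedDict()

def lru_get(key, default=None):
    # The lock is exclusive even for reads, because every hit mutates the
    # ordering of the dict; a shared/exclusive RWLock buys nothing here.
    with _lock:
        if key not in _cache:
            return default
        _cache.move_to_end(key, last=False)
        return _cache[key]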

I think these tradeoffs are easy to accept. Note that the benchmark does not include a miss-rate penalty; the LRU eviction policy will almost certainly have a lower miss rate than the current random eviction policy.

comment:7 by Adam Johnson, 6 years ago

Cc: Adam Johnson added

comment:8 by Josh Smeaton, 6 years ago

Cc: josh.smeaton@… added

comment:9 by Carlton Gibson, 6 years ago

Triage Stage: Accepted → Ready for checkin

comment:10 by Tim Graham, 6 years ago

Patch needs improvement: set
Triage Stage: Ready for checkin → Accepted

Some test coverage is missing as noted on the PR.

comment:11 by Grant Jenks, 6 years ago

Pull request updated to exercise the cache.incr code path.

comment:12 by Carlton Gibson, 6 years ago

Patch needs improvement: unset
Triage Stage: Accepted → Ready for checkin

comment:13 by Tim Graham <timograham@…>, 6 years ago

Resolution: fixed
Status: assigned → closed

In d38a3169:

Fixed #28977 -- Changed local-memory cache to use LRU culling.

LRU culling turns every read into a kind of write to the cache: cache keys
are moved to the first position in the OrderedDict when they are retrieved.
The RWLock which permitted multiple readers while prioritizing a single
writer is obsolete since all accesses are now writes.
