Opened 6 years ago

Last modified 4 months ago

#11331 new Cleanup/optimization

Memcached backend closes connection after every request

Reported by: booink@… Owned by: nobody
Component: Core (Cache system) Version: 1.0
Severity: Normal Keywords:
Cc: mjmalone@…, daevaorn@…, harm.verhagen+django@…, rwillmer, trbs@…, jeremy.orem@… Triage Stage: Accepted
Has patch: yes Needs documentation: yes
Needs tests: yes Patch needs improvement: yes
Easy pickings: no UI/UX: no


Fix for #5133 kills servers in production.


The patch takes care of connections being kept open, but it introduces another problem: the need to open one or more TCP connections on every single request.

With a simple loop, you can make a system run out of sockets easily: after a socket is closed, its port cannot be reused for quite a while (one to four minutes, depending on the OS). If enough sockets get stuck in the TIME_WAIT state, the server simply fails to connect to memcached and starts serving everything from the db again; that's not something you want to see on a site with enough traffic to need a memcached installation.
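For a sense of scale, here is a back-of-envelope calculation of when connect-per-request exhausts the client's ephemeral ports. The numbers are assumptions (Linux's default port range and a 60-second TIME_WAIT); both vary by OS and configuration:

```python
# How fast does connect-per-request exhaust ephemeral ports?
# Assumed: Linux's default net.ipv4.ip_local_port_range (32768-60999)
# and a 60 s TIME_WAIT. Both vary by OS and configuration.
ephemeral_ports = 60999 - 32768 + 1        # ~28k usable client ports
time_wait_secs = 60
max_sustained_rate = ephemeral_ports / time_wait_secs
print(int(max_sustained_rate))             # roughly 470 connections/sec
```

Above that sustained rate of new connections from one client host, every ephemeral port is stuck in TIME_WAIT and connect() starts failing.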

In my opinion, the cure is worse than the disease. There's an easy workaround available for the original problem: restart workers after a certain number of requests. With maximum-requests=500 on a 5-thread daemon process (mod_wsgi, times 20 processes), we never go over 100 connections on our memcached server, which is started with the default cap of 1024 connections. If you run mod_python, use MaxRequestsPerChild.
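The workaround above maps to mod_wsgi configuration roughly like this (a sketch using the process/thread counts mentioned above; the daemon process name is illustrative):

```apache
# Recycle each daemon process after 500 requests so leaked memcached
# connections are reclaimed (counts from the setup described above).
WSGIDaemonProcess mysite processes=20 threads=5 maximum-requests=500
WSGIProcessGroup mysite
```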

My current solution is to just noop the whole fix with one line in any .py: django.core.cache.backends.memcached.CacheClass.close = lambda x: None. It might be an idea to make this configurable, so people can choose between disconnecting after every request and keeping the connection open until process restart.
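The no-op patch above, sketched against a stand-in class so the pattern is runnable without Django installed (a real deployment would patch django.core.cache.backends.memcached.CacheClass itself):

```python
# Stand-in for django.core.cache.backends.memcached.CacheClass; only
# the monkeypatch pattern is the point here, not the class itself.
class CacheClass:
    connections_closed = 0

    def close(self):
        # The real method disconnects from memcached after each request.
        CacheClass.connections_closed += 1

# The one-liner from the comment above: turn close() into a no-op.
CacheClass.close = lambda x: None

cache = CacheClass()
cache.close()  # no longer tears down the connection
```

After the patch, the request-finished signal still calls close(), but the underlying connection survives into the next request.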

Change History (12)

comment:1 Changed 6 years ago by mmalone

  • Cc mjmalone@… added
  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

Unfortunately this is the way TCP works. There are, however, a number of ways to address this problem without changing anything in Django. In any case, the dangling connection problem (#5133) is substantially worse since the sockets in TIME_WAIT will eventually become usable again.

It would be helpful if you could provide some version information for the libraries, servers, and environment you're using. In particular, what is your MSL, and how long are your sockets left in TIME_WAIT? Also, what version of memcached/cmemcached/python-memcache are you using (I'm just speculating, but it's possible that sockets aren't being closed properly)? A typical Linux install has a 30s MSL and 60s TIME_WAIT. With this configuration you'd need to be handling hundreds of requests per second on a single server for this to be an issue. Django is fast, but to handle that volume of requests either your app is trivial or your hardware is a lot more powerful than mine ;). Is this an issue you've seen in production?

It would also be useful if you could provide a short script that will reproduce this problem.

Another option would be to send a quit message to the server instead of closing the socket. This would cause the server to initiate the close. Since the server isn't using an ephemeral port for each socket it should be able to manage lots of sockets in TIME_WAIT without suffering socket exhaustion.
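The quit idea can be sketched like this. memcached's text protocol has a quit command, after which the server closes the socket, so TIME_WAIT accumulates on the server's side (where no ephemeral port is consumed). A tiny fake server stands in for memcached here to keep the example self-contained; the real effect on TIME_WAIT shows up in netstat, not in Python:

```python
import socket
import threading

# Fake memcached: accept one connection, close it when the client
# sends "quit", so the *server* initiates the close.
def fake_memcached(listener):
    conn, _ = listener.accept()
    if conn.recv(64).strip() == b"quit":
        conn.close()  # server-side close, as memcached would do

listener = socket.socket()
listener.bind(("127.0.0.1", 0))   # ephemeral port
listener.listen(1)
t = threading.Thread(target=fake_memcached, args=(listener,))
t.start()

client = socket.create_connection(listener.getsockname())
client.sendall(b"quit\r\n")       # ask the server to hang up
eof = client.recv(1)              # b"" means the server closed first
client.close()
t.join()
listener.close()
```

Because the client reads EOF before closing its own end, the four-minute TIME_WAIT lands on the server's listening address rather than burning a client ephemeral port.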

If you're running Linux, here are some tuning parameters that may be helpful:

net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1


Finally, I think newer versions of memcached support UDP, which would also provide a solution to this problem (although I'm not sure what support is like in the python libraries).

comment:2 Changed 6 years ago by booink@…

I've managed to trigger it on multiple systems so far. Since Python really doesn't care where it runs, we develop on Windows and test in a VM with Ubuntu Server (8.10 x64); deployment will be on Ubuntu 8.04 LTS x64. We're using python-memcached.

I run memcached locally with -vv, so every event is printed to the console. When I started using it, I was already wondering about the "<104 connection closed." / "<104 new client connection" spam, but I didn't give it any thought, since it was already much faster than "good enough".

The problem started when we added feeds: we made sure that a cache hit meant 0 queries to the database. That, plus no template rendering (simplejson.dumps, pprint, or xml.etree magic), means you get huge throughput when benchmarked in a loop; it probably costs about as much CPU as serving hello world out of memcached. Example url:

I think TIME_WAIT on both the VM and Windows is a minute; it takes about that long to recover from the out-of-sockets condition. When netstat -an | grep TIME_WAIT | wc -l reaches around 10k, connections start to fail, with increasing frequency the longer I let JMeter run. Oh, and that netstat -an command sometimes returns 0 when there are too many sockets open under Windows; I bet they never stress tested that :P

Finally, there was no beefy hardware involved in triggering this:

Local: runserver (/me hides), standard Core 2 Duo desktop, 1 request thread in JMeter.
VM: Apache 2.2 / mod_wsgi 2.5 / Python 2.5 / Django 1.0.2, VirtualBox, 1 CPU. 1 request thread; JMeter runs on the VM host.

On production, we would probably run out of sockets much quicker, with 2x quad-core Xeons (Core 2 generation) on each machine, in theory. In practice, our traffic isn't that high yet: during peak there were usually 3k sockets stuck in TIME_WAIT. That number is reduced by half now, with ~1.3k in TIME_WAIT (I blame the lack of pconnect in the psycopg2 backend, and port 80), 150 ESTABLISHED, and 200 FIN_WAIT1/2. (Output from netstat -an --tcp | awk '/tcp/ {print $6}' | sort | uniq -c.)

Fun fact: during benchmarks, the Python process, memcached, and JMeter use equal amounts of resources. It's incredible how efficient the Python side of the setup is; 75 requests for a single thread across the whole combination is just silly.

@tuning: preferably not. It works fine without the disconnect_all() call after every request, and I've seen a few horror-story threads about how that tuning messes up clients behind a NAT and breaks random things; I'd prefer not to find out about that in production.

comment:3 Changed 6 years ago by Alex

  • Triage Stage changed from Unreviewed to Design decision needed

comment:4 Changed 6 years ago by jhenry

I am also experiencing this problem while using cmemcached. My solution was to remove the signal added in #5133 in my local branch, since I don't have problems with connection limits in a prefork multiprocess deployment. The proper solution, though, would seem to be smarter connection pooling, instead of closing and reopening a connection to the memcached server on every single request.

comment:5 Changed 5 years ago by julien

  • Severity set to Normal
  • Type set to Cleanup/optimization

comment:6 Changed 4 years ago by alexkoshelev

  • Cc daevaorn@… added
  • Easy pickings unset

comment:7 Changed 4 years ago by harm

  • Cc harm.verhagen+django@… added

comment:8 Changed 4 years ago by rwillmer

  • Cc rwillmer added
  • UI/UX unset

comment:9 Changed 3 years ago by anonymous

  • Cc trbs@… added

comment:10 Changed 3 years ago by aaugustin

  • Triage Stage changed from Design decision needed to Accepted

I recently added support for persistent database connections, I guess we could do something similar for cache connections.
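A minimal sketch of what persistent cache connections could look like: one client cached per process, created lazily, instead of a reconnect per request. FakeClient and get_client are illustrative names, not Django's actual implementation:

```python
# Per-process client reuse, the alternative to close-per-request.
# FakeClient stands in for a real memcached client; the caching
# pattern is the point, not the client itself.
class FakeClient:
    instances = 0

    def __init__(self, servers):
        FakeClient.instances += 1
        self.servers = servers

_client = None

def get_client(servers=("127.0.0.1:11211",)):
    """Return a long-lived client, creating it only on first use."""
    global _client
    if _client is None:
        _client = FakeClient(servers)
    return _client

a = get_client()   # first "request": connects
b = get_client()   # later "requests": reuse the same client
```

Thread safety and dead-connection detection are the hard parts such a scheme would still need to address, just as with persistent database connections.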

comment:11 Changed 2 years ago by jeremy.orem@…

  • Cc jeremy.orem@… added

comment:12 Changed 4 months ago by MarkusH

  • Has patch set
  • Needs documentation set
  • Needs tests set
  • Patch needs improvement set