Opened 7 years ago

Closed 3 weeks ago

#11331 closed Cleanup/optimization (fixed)

Memcached backend closes connection after every request

Reported by: booink@…
Owned by: Ed Morley
Component: Core (Cache system)
Version: 1.0
Severity: Normal
Keywords: pylibmc
Cc: mjmalone@…, daevaorn@…, harm.verhagen+django@…, Rachel Willmer, trbs@…, jeremy.orem@…, emorley@…
Triage Stage: Accepted
Has patch: yes
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

Fix for http://code.djangoproject.com/ticket/5133 kills servers in production.

--

Copy of http://code.djangoproject.com/ticket/5133#comment:16

The patch takes care of connections kept open, but it introduces another problem - the need to open one or more TCP connections on every single request.

With a simple loop, you can easily make a system run out of sockets: after a socket is closed, that port cannot be reused for an eternity, ranging from 1 to 4 minutes depending on the OS. If enough sockets get stuck in the TIME_WAIT state, the server simply fails to connect to memcached and starts serving everything from the db again - that's not something you want to see on a site with sufficient traffic to need a memcached installation.

In my opinion, the cure is worse than the disease. There's an easy workaround available for the original problem: restart workers after a certain number of requests. With max-request=500 on a 5-thread daemon process (mod_wsgi, times 20 processes), we never go over 100 connections on our memcached server, started with the default cap of 1024 connections. If you run mod_python, use MaxRequestsPerChild.

My current solution is to just noop the whole fix with one line in any .py: django.core.cache.backends.memcached.CacheClass.close = lambda x: None. It might be an idea to make it configurable so people can choose between disconnecting after every request and keeping the connection open until process restart.
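The workaround above can be sketched without Django installed. In this minimal sketch, `CacheClass` and `FakeClient` are hypothetical stand-ins for the real backend and its memcached client, and the replacement lambda accepts `**kwargs` on the assumption that the close hook may be called with keyword arguments:

```python
# Hypothetical stand-ins: the real class lives in
# django.core.cache.backends.memcached; only the close() hook matters here.
class FakeClient:
    def __init__(self):
        self.disconnects = 0

    def disconnect_all(self):
        # The real client would close its sockets here.
        self.disconnects += 1


class CacheClass:
    def __init__(self):
        self._cache = FakeClient()

    def close(self, **kwargs):
        # Behaviour introduced by the #5133 fix: drop connections on close.
        self._cache.disconnect_all()


cache = CacheClass()
cache.close()
before = cache._cache.disconnects   # one disconnect recorded

# The workaround: replace close() on the class with a no-op.
CacheClass.close = lambda self, **kwargs: None
cache.close()
after = cache._cache.disconnects    # unchanged, the connection stays open
```

Because the patch is applied to the class, it affects every instance in the process, which is why a single line in any imported module is enough.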

Change History (15)

comment:1 Changed 7 years ago by Michael Malone

Cc: mjmalone@… added
Needs documentation: unset
Needs tests: unset
Patch needs improvement: unset

Unfortunately this is the way TCP works. There are, however, a number of ways to address this problem without changing anything in Django. In any case, the dangling connection problem (#5133) is substantially worse since the sockets in TIME_WAIT will eventually become usable again.

It would be helpful if you could provide some version information for the libraries, servers, and environment you're using. In particular, what is your MSL and how long are your sockets left in TIME_WAIT? Also, what version of memcached/cmemcached/python-memcache are you using? (I'm just speculating, but it's possible that sockets aren't being closed properly.) A typical Linux install has a 30s MSL and 60s TIME_WAIT. With this configuration you'd need to be handling hundreds of requests per second on a single server for this to be an issue. Django is fast, but to handle that volume of requests either your app is trivial or your hardware is a lot more powerful than mine ;). Is this an issue you've seen in production?
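The "hundreds of requests per second" estimate can be checked with a rough back-of-the-envelope calculation. This is a sketch under stated assumptions: the port range below is an assumed default for older Linux kernels, not a value reported in this ticket, and it assumes one new connection to a single memcached address per request.

```python
# Assumed values, not measurements from this ticket.
ASSUMED_PORT_RANGE = (32768, 61000)  # assumed default ip_local_port_range
WIDENED_PORT_RANGE = (1024, 65535)   # the widened range suggested in this comment
TIME_WAIT_SECONDS = 60               # TIME_WAIT duration quoted above


def max_sustained_connects_per_second(port_range, time_wait_seconds):
    """Rate at which ephemeral ports are used up as fast as TIME_WAIT frees them."""
    low, high = port_range
    usable_ports = high - low + 1
    return usable_ports / time_wait_seconds


default_rate = max_sustained_connects_per_second(ASSUMED_PORT_RANGE, TIME_WAIT_SECONDS)
widened_rate = max_sustained_connects_per_second(WIDENED_PORT_RANGE, TIME_WAIT_SECONDS)
print(f"default range: about {default_rate:.0f} new connections/s")
print(f"widened range: about {widened_rate:.0f} new connections/s")
```

Under these assumptions the ceiling is a few hundred new connections per second, and a request that opens several connections reaches it proportionally sooner.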

It would also be useful if you could provide a short script that will reproduce this problem.

Another option would be to send a quit message to the server instead of closing the socket. This would cause the server to initiate the close. Since the server isn't using an ephemeral port for each socket it should be able to manage lots of sockets in TIME_WAIT without suffering socket exhaustion.
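A minimal sketch of the quit-based approach, assuming the memcached text protocol's `quit` command: the client sends `quit`, waits for the server to close its end, and only then closes its own socket, so the server becomes the side that initiates the close. The `quit_and_wait` helper and the toy server are hypothetical and exist only to make the example self-contained; a real client library may expose this differently.

```python
import socket
import threading


def quit_and_wait(sock, timeout=2.0):
    """Ask the server to close first, then close our end once EOF arrives."""
    sock.settimeout(timeout)
    sock.sendall(b"quit\r\n")
    try:
        while sock.recv(4096):  # recv() returns b"" once the server has closed
            pass
    finally:
        sock.close()


def _toy_server(listener, seen):
    """Stand-in for memcached: closes the connection after reading 'quit'."""
    conn, _ = listener.accept()
    with conn:
        data = b""
        while not data.endswith(b"quit\r\n"):
            chunk = conn.recv(4096)
            if not chunk:
                break
            data += chunk
        seen.append(data)


listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # let the OS pick a free port
listener.listen(1)
seen = []
thread = threading.Thread(target=_toy_server, args=(listener, seen))
thread.start()

client = socket.create_connection(listener.getsockname())
quit_and_wait(client)
thread.join()
listener.close()
```

The extra round trip is the cost of this design: the client has to wait for the server's close before releasing its socket.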

If you're running Linux, here are some tuning parameters that may be helpful:

net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1

(See also, http://www.ietf.org/rfc/rfc1337.txt)

Finally, I think newer versions of memcached support UDP, which would also provide a solution to this problem (although I'm not sure what support is like in the python libraries).

comment:2 Changed 7 years ago by booink@…

I've managed to trigger it on multiple systems so far - as Python really doesn't care about where it runs, we develop on Windows and test in a VM with Ubuntu Server (8.10 x64); deployment will be on Ubuntu 8.04 LTS x64. We're using python-memcached.

I run memcached locally with -vv, so every event is printed to console. When I started using it, I was already wondering about the part with "<104 connection closed." "<104 new client connection" spam, but I didn't give it any thought as it was much faster than "good enough" already.

The problem is when we added feeds, we made sure that a cache hit meant 0 queries to the database - that, plus no template rendering (simplejson.dumps, pprint or xml.etree magic) meant that you get a huge throughput when benchmarked in a loop. It probably costs as much cpu as getting hello world over memcached. Example url: http://worldoflogs.com/feeds/guilds/7/raids/?t=plain

I think TIME_WAIT on both the VM and Windows is a minute; it takes about that long to recover from the out-of-sockets condition. When netstat -an | grep TIME_WAIT | wc -l reaches around 10k, connections start to fail, with increasing likelihood the longer I let JMeter run. Oh, and that netstat -an command sometimes returns 0 under Windows when there are too many of them open; I bet they never stress tested that :P

Finally, there was no beefy hardware involved triggering this:

Local: manage.py runserver (/me hides), standard core 2 duo desktop, 1 request thread in JMeter.
VM: apache 2.2 / mod_wsgi 2.5 / python 2.5 / django 1.02, VirtualBox, 1 cpu. 1 request thread, JMeter runs on the VM host.

On production, we would in theory run out of sockets much quicker, with 2x quad core Xeons (Core 2 generation) on each machine. In practice, our traffic isn't this high yet: during the peak, there were usually 3k sockets stuck in TIME_WAIT. That number is reduced by half now, with ~1.3k in TIME_WAIT (I blame no pconnect in the psycopg2 backend and port 80), 150 ESTABLISHED and 200 FIN_WAIT1/2. (output from netstat -an --tcp | awk '/tcp/ {print $6}' | sort | uniq -c)
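For reference, the per-state tally produced by the netstat/awk pipeline can be sketched in Python. The sample output below is made up for illustration and is not data from this ticket; the function assumes the usual Linux netstat layout, where the state is the sixth column.

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Tally the state column (6th field) of every line that starts with 'tcp'."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states


# Hypothetical sample, shaped like `netstat -an --tcp` output on Linux.
SAMPLE = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 127.0.0.1:11211         0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:50412         127.0.0.1:11211         TIME_WAIT
tcp        0      0 127.0.0.1:50413         127.0.0.1:11211         TIME_WAIT
tcp        0      0 127.0.0.1:50414         127.0.0.1:11211         ESTABLISHED
"""

counts = count_tcp_states(SAMPLE)
for state, n in sorted(counts.items()):
    print(n, state)
```

In practice the input could come from something like `subprocess.run(["netstat", "-an", "--tcp"], capture_output=True, text=True).stdout`, assuming netstat is installed on the host.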

Fun fact: during benchmarks, the python process, memcached and JMeter use equal amounts of resources -- it's incredible how efficient the python side of the setup is, 75 requests for a single thread for the whole combination is just silly.

@tuning: preferably not - it works fine without the disconnect_all() call after every request. I've seen a few horror-story threads about how it messes up clients behind a NAT and breaks random things; I'd prefer not to find out about that in production.

comment:3 Changed 7 years ago by Alex Gaynor

Triage Stage: Unreviewed → Design decision needed

comment:4 Changed 7 years ago by jhenry

I am also experiencing this problem while using cmemcached. My solution was to remove the signal applied in #5133 in my local branch since I do not have problems with connection limits in a prefork multiprocess deployment, but the proper solution would seem to be smarter connection pooling instead of closing and reopening a connection to the memcached server on every single request.

comment:5 Changed 5 years ago by Julien Phalip

Severity: Normal
Type: Cleanup/optimization

comment:6 Changed 5 years ago by Alexander Koshelev

Cc: daevaorn@… added
Easy pickings: unset

comment:7 Changed 5 years ago by harm

Cc: harm.verhagen+django@… added

comment:8 Changed 5 years ago by Rachel Willmer

Cc: Rachel Willmer added
UI/UX: unset

comment:9 Changed 4 years ago by anonymous

Cc: trbs@… added

comment:10 Changed 4 years ago by Aymeric Augustin

Triage Stage: Design decision needed → Accepted

I recently added support for persistent database connections; I guess we could do something similar for cache connections.

comment:11 Changed 3 years ago by jeremy.orem@…

Cc: jeremy.orem@… added

comment:12 Changed 16 months ago by Markus Holtermann

Has patch: set
Needs documentation: set
Needs tests: set
Patch needs improvement: set

comment:13 Changed 4 weeks ago by Ed Morley

Cc: emorley@… added
Owner: changed from nobody to Ed Morley
Status: new → assigned

I'm going to open a new PR for this - however one question:

We definitely need to disconnect_all() for python-memcached (since it's why this behaviour was added in #5133), and definitely don't want it for pylibmc (see pylibmc owner's comment here: https://github.com/django/django/pull/4866#issue-88649865) -- but what about third party backends? (eg python-binary-memcached and pymemcache)

ie: should I move the disconnect_all() from the base class to the python-memcached backend (MemcachedCache), or make it a no-op only for the PyLibMCCache backend?

Thanks!

comment:14 Changed 3 weeks ago by Ed Morley

Keywords: pylibmc added
Needs documentation: unset
Needs tests: unset
Patch needs improvement: unset

The question in comment 13 was discussed in https://github.com/django/django/pull/4866#issuecomment-242985539 onwards, with the conclusion being that we should special-case pylibmc rather than python-memcached.

PR

comment:15 Changed 3 weeks ago by Tim Graham <timograham@…>

Resolution: fixed
Status: assigned → closed

In f02dbbe1:

Fixed #11331 -- Stopped closing pylibmc connections after each request.

libmemcached manages its own connections, so isn't affected by refs #5133.
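The shape of this resolution (keep disconnect_all() in the base class, special-case pylibmc) can be sketched as follows. This is an illustrative sketch and not the committed code: `RecordingClient` is a hypothetical stand-in for a memcached client library, and the classes only mimic the backend names mentioned in this ticket.

```python
class RecordingClient:
    """Hypothetical client that counts disconnect_all() calls."""

    def __init__(self):
        self.disconnect_calls = 0

    def disconnect_all(self):
        self.disconnect_calls += 1


class BaseMemcachedCache:
    def __init__(self, client):
        self._cache = client

    def close(self, **kwargs):
        # Default behaviour from #5133: drop connections at the end of a request.
        self._cache.disconnect_all()


class MemcachedCache(BaseMemcachedCache):
    """python-memcached backend: inherits the disconnect-on-close behaviour."""


class PyLibMCCache(BaseMemcachedCache):
    def close(self, **kwargs):
        # libmemcached manages its own connections, so leave them open.
        pass


python_memcached = MemcachedCache(RecordingClient())
pylibmc_backend = PyLibMCCache(RecordingClient())
for backend in (python_memcached, pylibmc_backend):
    backend.close()
```

With this layout, third-party backends that subclass the base class keep the safe disconnect-on-close default unless they opt out, which matches the conclusion reached in comment 14.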
