#21597 closed Bug (wontfix)

(2006, 'MySQL server has gone away') in django1.6 when wait_timeout passed

Reported by: ekeydar@… Owned by: aaugustin
Component: Database layer (models, ORM) Version: 1.6
Severity: Normal Keywords: mysql
Cc: timo, victorgama, markus.magnuson@… Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description (last modified by aaugustin)

EDIT -- THE SOLUTION TO THIS PROBLEM IS EXPLAINED IN COMMENT 29. JUST DO WHAT IT SAYS. THANK YOU!


In django 1.6, when the wait_timeout passed (of mysql), then DB access cause the (2006, 'MySQL server has gone away') error.
This was not the case in django 1.5.1

I've noticed this error when using workers that run the django code (using gearman).

To reproduce:

Set the timeout to low value by editing /etc/mysql/my.cnf
add the following under [mysqld]

wait_timeout = 10
interactive_timeout = 10

Then

% python manage.py shell

>>> # access DB 
>>> import django.contrib.auth.models
>>> print list(django.contrib.auth.models.User.objects.all())
>>> import time
>>> time.sleep(15)
>>> print list(django.contrib.auth.models.User.objects.all())

Now you get the error.

Simple solution I found on the web is to call django.db.close_connection() before the access

>>> import django.db
>>> django.db.close_connection()
>>> print list(django.contrib.auth.models.User.objects.all())

works ok.

Attachments (1)

error.txt (1.8 KB) - added by anonymous 16 months ago.

Download all attachments as: .zip

Change History (45)

Changed 16 months ago by anonymous

comment:1 Changed 16 months ago by aaugustin

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

Are you using persistent connections (CONN_MAX_AGE > 0)?

comment:2 Changed 16 months ago by ekeydar@…

Not using this.

comment:3 Changed 16 months ago by aaugustin

  • Description modified (diff)

This might be a side effect of using autocommit. Since Django 1.5 kept a transaction open in that case, MySQL couldn't close the connection. Now it can.

Could you try the following snippet to confirm my hypothesis?

>>> import django.contrib.auth.models
>>> from django.db import transaction
>>> with transaction.atomic():
...     print list(django.contrib.auth.models.User.objects.all())
...     import time
...     time.sleep(15)
...     print list(django.contrib.auth.models.User.objects.all())

You shouldn't get a timeout when you run this on Django 1.6.

Last edited 16 months ago by aaugustin (previous) (diff)

comment:4 Changed 16 months ago by anonymous

I ran into the same issue after upgrading to Django 1.6. In my case it affects long-running (multiple days) WorkerThreads, although I am not using any wait_timeout setting. It looks like the connection just times out without Django noticing/handling it.

Code which is triggering this:

cursor = connections['legacydb'].cursor()
cursor.execute(query)
result = cursor.fetchall()
cursor.close()

The above code works for a few hours and after a while the OperationalError 2006 is triggered on cursor.execute(). This behavior was not present in early Django versions and is a critical regression (aka: code which used to work now raises an Exception → severity should be set to release blocker).

BTW: The legacydb in the example does not support transactions (MyISAM).

comment:5 Changed 16 months ago by claudep

AFAIK the default timeout on MySQL is 28800 seconds (8 hours).
http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html#sysvar_wait_timeout

comment:6 Changed 16 months ago by claudep

  • Triage Stage changed from Unreviewed to Accepted

@aaugustin: even with transaction.atomic() (as of comment:3), I'm able to reproduce the failure.

comment:7 Changed 16 months ago by claudep

Query comparison (as in OP example, without using transaction.atomic()):

Django 1.5Django 1.6
562 Connect	root@localhost on test_
562 Query	SET NAMES utf8
562 Query	set autocommit=0
562 Query	SET SQL_AUTO_IS_NULL = 0


562 Query	SELECT `auth_user`.`id`, ... FROM `auth_user`

563 Connect	root@localhost on test_
563 Query	SET NAMES utf8
563 Query	set autocommit=0
563 Query	SET SQL_AUTO_IS_NULL = 0

563 Query	SELECT `auth_user`.`id`,... FROM `auth_user`
560 Connect	root@localhost on test_
560 Query	SET NAMES utf8
560 Query	set autocommit=0
560 Query	SET SQL_AUTO_IS_NULL = 0
560 Query	set autocommit=1
560 Query	set autocommit=0
560 Query	SELECT `auth_user`.`id`, ... FROM `auth_user`

561 Connect	root@localhost on test_
561 Query	SET NAMES utf8
561 Query	set autocommit=0
561 Query	SET SQL_AUTO_IS_NULL = 0
561 Query	set autocommit=1
561 Query	SELECT `auth_user`.`id`, ... FROM `auth_user`

comment:8 Changed 16 months ago by aigarius

Could this actually be a failure to re-establish a connection to MySQL after receiving the 2006 error? I get the same symptoms in production now when a MySQL server is restarted - all workers start getting 2006 errors and do not recover. Only gunicorn restart helps then. You could reproduce it locally by calling "print list(django.contrib.auth.models.User.objects.all())" every second and then restarting MySQL server. The connection does not recover even after the server is back up and accepting connections. This was tested with CONN_MAX_AGE > 0 but lower than wait_timeout on the MySQL server.

comment:9 Changed 16 months ago by aigarius

For now we are using the following workaround for long-running workers:

from django.db import connection
...
def is_connection_usable():
    try:
        connection.connection.ping()
    except:
        return False
    else:
        return True
...

def do_work():
    while(True): # Endless loop that keeps the worker going (simplified)
        if not is_connection_usable():
            connection.close()
        try:
            do_a_bit_of_work()
        except:
            logger.exception("Something bad happened, trying again")
            sleep(1)

comment:10 Changed 15 months ago by err

Maybe we can validate that connection is usable in ensure_connection ?

something like this:

*** Django-1.6.1/django/db/backends/__init__.py 2014-01-23 16:57:15.927687924 +0400
--- /usr/local/lib/python2.7/dist-packages/django/db/backends/__init__.py       2014-01-23 16:56:21.000000000 +0400
***************
*** 119,125 ****
          """
          Guarantees that a connection to the database is established.
          """
!         if self.connection is None or not self.is_usable():
              with self.wrap_database_errors:
                  self.connect()
  
--- 119,125 ----
          """
          Guarantees that a connection to the database is established.
          """
!         if self.connection is None:
              with self.wrap_database_errors:
                  self.connect()

are there any caveats?

comment:11 Changed 15 months ago by err

or even close connection (if not usable) in ensure_connection

comment:12 Changed 14 months ago by andreis

Hi,
to me it seems to be a bug. Old django would close every connection right away, django 1.6 checks with CONN_MAX_AGE:

  1. It gets CONN_MAX_AGE from DATABASES, sets close_at:
max_age = self.settings_dict['CONN_MAX_AGE']
self.close_at = None if max_age is None else time.time() + max_age
  1. Actually the code above affects close_if_unusable_or_obsolete, which closes the connection if 'self.close_at is not None and time.time() >= self.close_at'
  2. close_if_unusable_or_obsolete itself is being called by close_old_connections, which in turn is a request handler for signals.request_started and signals.request_finished.

We have a worker, which is effectively a django app but but it doesn't process any HTTP requests. In fact that makes all connections persistent because close_old_connections never gets called.

Please advise.

Thanks

comment:13 Changed 14 months ago by jeroen.pulles@…

Hi,

Without transactions you hit the Gone Away if the sleep is longer than MySQL's wait_timeout:

mysql> set global wait_timeout=10; 
>>> import django.contrib.auth.models
>>> import time
>>> print list(django.contrib.auth.models.User.objects.all())
>>> time.sleep(15)
>>> print list(django.contrib.auth.models.User.objects.all())

According to MySQL/python documentation this should not be a problem. If you add the INTERACTIVE bit to the client connection flags in db/backends/mysql/base.py, you regain the MySQL drivers' auto-reconnect feature and everything works as before (I think that in Django 1.5 you ran into trouble with a transaction that ran longer than wait_timeout too).

from kwargs['client_flag'] = CLIENT.FOUND_ROWS

to kwargs['client_flag'] = CLIENT.FOUND_ROWS | CLIENT.INTERACTIVE

I haven't looked into the origins of this line, but maybe it is the real culprit for the recent Gone Away issues.

comment:14 Changed 14 months ago by andreis

Hi Jeroen!
It seems like adding CLIENT.INTERACTIVE flag just tells the driver to switch from checking on wait_timeout to taking interactive_timeout into account. I set interactive_timeout=10 and was able to reproduce this problem.
Both of these values are 8 hours by default, but once your code has been inactive for that long, mysql drops the connection and the client fails next time it tries to access some data. It looks perfectly right to catch this error in the code, call django.db.close_connection() every time or whatever, but I think that maybe connection persistence logic needs a bit of fine-tuning so that we can control persistence without relying on signals.request_started/request_finished.

comment:15 Changed 13 months ago by andreis

Hey folks! Any thoughts on this matter?

Thanks

Last edited 13 months ago by andreis (previous) (diff)

comment:16 Changed 13 months ago by jeroen.pulles@…

I've checked with my existing long running processes on Django 1.5 installations, running the same mysql-python and libmysqlclient.so, with tcpdump:

They do set the interactive flag on the MySQL connection. That explains for me why I never experienced the Gone Away's before. Django doesn't notice that the underlying connection went away and came back.

I haven't had enough time to find out (dig down deep enough) what makes this flag appear on the connection in the 1.5 situation and what changed in 1.6 that is relevant to this problem. (a) My suspicion is that it isn't directly related to the connection persistence mechanism. (b) To me it doesn't seem to be in any way related to the transactions mechanisms: My transactions happen fast enough and it's fine that things break if the transaction takes longer than wait_timeout (e.g. more than two minutes); The same application that works fine in 1.5 also works in 1.6 with the interactive flag set.

JeroenP

comment:17 Changed 13 months ago by err

  • Owner changed from nobody to err
  • Status changed from new to assigned

comment:18 Changed 13 months ago by err

https://github.com/django/django/pull/2454 I've submitted pull request

comment:19 Changed 13 months ago by timo

  • Cc timo added
  • Has patch set

When I run the MySQL tests using djangocore-box, I get the traceback below when the test suite concludes. The PR above resolves this error.

Traceback (most recent call last):
  File "/django/tests/runtests.py", line 374, in <module>
    options.failfast, args)
  File "/django/tests/runtests.py", line 216, in django_tests
    test_labels or get_installed(), extra_tests=extra_tests)
  File "/django/django/test/runner.py", line 149, in run_tests
    self.teardown_databases(old_config)
  File "/django/django/test/runner.py", line 124, in teardown_databases
    connection.creation.destroy_test_db(old_name, self.verbosity)
  File "/django/django/db/backends/creation.py", line 452, in destroy_test_db
    self._destroy_test_db(test_database_name, verbosity)
  File "/django/django/db/backends/creation.py", line 466, in _destroy_test_db
    % self.connection.ops.quote_name(test_database_name))
  File "/django/django/db/backends/utils.py", line 59, in execute
    return self.cursor.execute(sql, params)
  File "/django/django/db/utils.py", line 94, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)
  File "/django/django/db/backends/utils.py", line 57, in execute
    return self.cursor.execute(sql)
  File "/django/django/db/backends/mysql/base.py", line 128, in execute
    return self.cursor.execute(query, args)
  File "/home/vagrant/.virtualenvs/py2.7/local/lib/python2.7/site-packages/MySQLdb/cursors.py", line 174, in execute
    self.errorhandler(self, exc, value)
  File "/home/vagrant/.virtualenvs/py2.7/local/lib/python2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
django.db.utils.OperationalError: (2006, 'MySQL server has gone away')

comment:20 Changed 13 months ago by aaugustin

  • Patch needs improvement set

I believe this patch breaks the transactional integrity guarantees made by transaction.atomic.

Django mustn't attempt to reconnect until the connection has been dropped properly, possibly after waiting for the exit of an atomic block.

comment:21 Changed 13 months ago by anonymous

A fix for this is still badly required, as this is a horrible regression. Please mark this bug as a 'release blocker', as it breaks existing code and makes upgrading to Django 1.6 impossible for quite a few people.

comment:22 follow-up: Changed 12 months ago by aaugustin

  • Severity changed from Normal to Release blocker

Unless I missed something, this only affects people that have management commands running for more than 8 hours (or MySQL's connection timeout if it has been changed from the default). Is that correct?

comment:23 Changed 12 months ago by andreis

Hi, this absolutely true for me. However I've managed to come up with a workaround, we've wrapped some of our code with decorators which implicitly call close_old_connections() routine, so that django now respects CONN_MAX_AGE parameter to some extent.

comment:24 in reply to: ↑ 22 Changed 12 months ago by anonymous

Replying to aaugustin:

Unless I missed something, this only affects people that have management commands running for more than 8 hours (or MySQL's connection timeout if it has been changed from the default). Is that correct?

As far as I can tell I think this is correct. At least this is the place where I was hitting this bug in my environment (before downgrading to 1.5 again because of this).

comment:25 Changed 12 months ago by jazeps.basko@…

Maybe this helps someone like me who uses Django and SQLAlchemy. I was experiencing this problem every morning with an API (not used for several hours at night) which had CONN_MAX_AGE=3600 and MySQL server with wait_timeout = 28800. I even changed settings which are used by manage.py (to run migrations and collect static assets) to not use persistent connections CONN_MAX_AGE=0, but the problem persisted. Then I noticed that this is actually happening in the code where SQLAlchemy accesses DB, so I googled and found this post: http://nathansnoggin.blogspot.co.uk/2013/11/integrating-sqlalchemy-into-django.html Followed the instructions in the blog post. Now that my SQLAlchemy does not handle connections (just uses whatever Django supplies) the problem is gone.

comment:26 Changed 12 months ago by matteius@…

Having the same problem in Django 1.6.3 in a process that is run indefinitely via a management command. I am attempting a build now where at the start of each loop I call:

db.close_old_connections()

We'll see if this solves the problem or not.

comment:27 Changed 12 months ago by aaugustin

My biggest problem is that I have no idea why this worked on 1.5.x :(

comment:28 Changed 12 months ago by aaugustin

  • Owner changed from err to aaugustin

comment:29 Changed 12 months ago by aaugustin

  • Resolution set to wontfix
  • Status changed from assigned to closed

Actually this is the intended behavior after #15119. See that ticket for the rationale.


If you hit this problem and don't want to understand what's going on, don't reopen this ticket, just do this:

  • RECOMMENDED SOLUTION: close the connection with from django.db import connection; connection.close() when you know that your program is going to be idle for a long time.
  • CRAPPY SOLUTION: increase wait_timeout so it's longer than the maximum idle time of your program.

In this context, idle time is the time between two successive database queries.

comment:30 Changed 11 months ago by pembo13

Could someone give a little more context to this issue? I ran into it the very first morning after upgrading from 1.5.x to 1.6.x, I am guessing because the app was idle for a few hours (2am to 8am). I had 'CONN_MAX_AGE' set to 15mins, but have had wait_timeout and interactive_timeout set to 5mins.

So in my case, I don't exactly know my program is going to be idle for a long time, though my wait_timeout didn't seem that low. I've preemptively adjusted my wait_timeout to 30mins and my interactive_timeout to 60mins, but a bit more explanation of the issue. I don't yet see the relation to #15119.

Also, this issue should probably be mentioned in the docs somewhere.

Last edited 11 months ago by pembo13 (previous) (diff)

comment:31 Changed 11 months ago by jeroen.pulles@…

My preferred solution is to increase the wait_timeout to 86400 (24hr) on sessions from processes that are long lived. Otherwise I have to do a close before any blocking call to other systems, e.g. redis, which may block for a month or return in a split second. These blocking calls are mostly in loops; Under load they are repeatedly called. I am not about to add a connection.close() call in those code paths. (And this is where the MySQL reconnect behavior worked fine, it only kicked in whén there was a timeout).

http://dev.mysql.com/doc/refman/5.6/en/auto-reconnect.html

comment:32 Changed 11 months ago by victorgama

  • Cc victorgama added

I just opened an pull request that may address this issue: https://github.com/django/django/pull/2740
@aaugustin can you review it?

comment:33 Changed 11 months ago by aaugustin

I believe this patch breaks the transactional integrity guarantees made by transaction.atomic.

Django mustn't attempt to reconnect until the connection has been dropped properly, possibly after waiting for the exit of an atomic block.

(Yes, I've just copy-pasted comment 20, because that's the answer to all naive reconnection attempts. If this bug was that easy to fix, I would have done it.)

comment:34 Changed 10 months ago by pembo13

What is the correct solution for management commands that may take longer than the regular web queries? This issue was closed with more questions than answers.

comment:35 Changed 10 months ago by aaugustin

I already answered in comment 29.

If people read the answer instead of asking the same question again and again, it would remain more visible.

Last edited 10 months ago by aaugustin (previous) (diff)

comment:36 Changed 10 months ago by aaugustin

  • Description modified (diff)

comment:37 Changed 10 months ago by aaugustin

  • Description modified (diff)

comment:38 Changed 10 months ago by germanoguerrini

I know I'll be hated for this, but it actually happens even for regular web queries. We installed 1.6.5 on one of our production servers yesterday and we received a hundred or so tracebacks from some of our views. I can't do the math, but it's probably less then 1%. Still, with 1.5 we had no issues.

To give you an idea, the problem arises both from select and insert statements. They could take maybe fractions of second. I triple checked MySql variables and wait_timeout is 28800, while connection_timeout is 10 seconds (which, by the way, should raise a Lost connection to MySQL server).

I tried switching from CONN_MAX_AGE = 0 to CONN_MAX_AGE = 2 and nothing changed, so maybe it's not related with the new persistent connection mechanism.

The traceback is slightly different from the one in comment 19:

Traceback (most recent call last):

  [...]

  File "/home/django/VIRTUALENVS/multi/lib/python2.7/site-packages/django/db/models/query.py", line 96, in __iter__
    self._fetch_all()

  File "/home/django/VIRTUALENVS/multi/lib/python2.7/site-packages/django/db/models/query.py", line 857, in _fetch_all
    self._result_cache = list(self.iterator())

  File "/home/django/VIRTUALENVS/multi/lib/python2.7/site-packages/django/db/models/query.py", line 220, in iterator
    for row in compiler.results_iter():

  File "/home/django/VIRTUALENVS/multi/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 713, in results_iter
    for rows in self.execute_sql(MULTI):

  File "/home/django/VIRTUALENVS/multi/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 786, in execute_sql
    cursor.execute(sql, params)

  File "/home/django/VIRTUALENVS/multi/lib/python2.7/site-packages/django/db/backends/util.py", line 53, in execute
    return self.cursor.execute(sql, params)

  File "/home/django/VIRTUALENVS/multi/lib/python2.7/site-packages/django/db/utils.py", line 99, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)

  File "/home/django/VIRTUALENVS/multi/lib/python2.7/site-packages/django/db/backends/util.py", line 53, in execute
    return self.cursor.execute(sql, params)

  File "/home/django/VIRTUALENVS/multi/lib/python2.7/site-packages/django/db/backends/mysql/base.py", line 124, in execute
    return self.cursor.execute(query, args)

  File "/home/django/VIRTUALENVS/multi/lib/python2.7/site-packages/MySQLdb/cursors.py", line 205, in execute
    self.errorhandler(self, exc, value)

  File "/home/django/VIRTUALENVS/multi/lib/python2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue

OperationalError: (2006, 'MySQL server has gone away')

comment:39 Changed 10 months ago by aaugustin

  • Resolution wontfix deleted
  • Status changed from closed to new

That's a different issue and a good reason to reopen the ticket until we figure it out.

Do you have any way to check if MySQL drops connections while a view function is executing? In that case, Django 1.5 reopens a connection automatically (possibly breaking transactional integrity) while Django 1.6 raises an error.

comment:40 Changed 10 months ago by germanoguerrini

I was able to reproduce the error. First of all, while the global wait_timeout was 28800, the session value was just 30 seconds.
That means the if the time between two queries in the same view is longer than that, or if a view runs just one query but after 30 seconds from its invocation (for example for a very long computation in a middleware or a forced sleep) it raises the infamous MySQL has gone away.

For example:

def my_view(request):
    import time
    time.sleep(31)
    print User.objects.get(pk=1)
    return HttpResponse()

and I think this is the expected behavior.
The problem is that we run some pretty sophisticated monitoring application (namely New Relic) and the longest trace I can see is a 3.5 seconds view which was probably a glitch from MySQL (it took 3.1 seconds to select a user using his pk).
For now, we solved the most severe view error by closing the connection at the beginning of it (forcing the creation of a new one), but I'll keep monitoring the situation and I'll get back to you because I'm starting to think that it has to do with our particular configuration. Django 1.6 has been out for too long for us being the first to notice that.

comment:41 Changed 10 months ago by giuliettamasina

  • Cc markus.magnuson@… added

comment:42 Changed 10 months ago by aaugustin

  • Severity changed from Release blocker to Normal

It has been suggested to call close_if_unusable_or_obsolete before handling each HTTP request. However, I don't think that makes much sense. If the connection to MySQL times out between two HTTP requests, you should either stop using persistent connections (with CONN_MAX_AGE = 0) or increase wait_timeout. I don't like the idea of adding overhead to Django to compensate for inconsistent settings.

What I don't understand is that you tried CONN_MAX_AGE = 0 and that didn't help. I fail to see why closing the connection before each request would help if closing it after each request does nothing. Theoretically closing after a request is a safer behavior than closing before the next request -- you never run into timeouts. Could you double-check your tests?

comment:43 Changed 10 months ago by germanoguerrini

We closed the connection at the beginning of the view, not at the beginning of the request. So, presumably, the timeout happened inside a middleware but it didn't bubble up before the first query in the view.
We ended up increasing wait_timeout (we kept it at a fairly low value as we had a large number of sleeping queries) and that solved the issue.

As far as I'm concerned I think that is exactly the expected behavior and as such the ticket can be closed.

comment:44 Changed 10 months ago by aaugustin

  • Resolution set to wontfix
  • Status changed from new to closed

Cool. Thanks for taking the time to investigate and report your results!

Note: See TracTickets for help on using tickets.
Back to Top