Code

Opened 4 months ago

Last modified 3 days ago

#21597 assigned Bug

(2006, 'MySQL server has gone away') in django1.6 when wait_timeout passed

Reported by: ekeydar@… Owned by: err
Component: Database layer (models, ORM) Version: 1.6
Severity: Release blocker Keywords: mysql
Cc: timo Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description (last modified by aaugustin)

In django 1.6, when the wait_timeout passed (of mysql), then DB access cause the (2006, 'MySQL server has gone away') error.
This was not the case in django 1.5.1

I've noticed this error when using workers that run the django code (using gearman).

To reproduce:

Set the timeout to low value by editing /etc/mysql/my.cnf
add the following under [mysqld]

wait_timeout = 10
interactive_timeout = 10

Then

% python manage.py shell

>>> # access DB 
>>> import django.contrib.auth.models
>>> print list(django.contrib.auth.models.User.objects.all())
>>> import time
>>> time.sleep(15)
>>> print list(django.contrib.auth.models.User.objects.all())

Now you get the error.

Simple solution I found on the web is to call django.db.close_connection() before the access

>>> import django.db
>>> django.db.close_connection()
>>> print list(django.contrib.auth.models.User.objects.all())

works ok.

Attachments (1)

error.txt (1.8 KB) - added by anonymous 4 months ago.

Download all attachments as: .zip

Change History (26)

Changed 4 months ago by anonymous

comment:1 Changed 4 months ago by aaugustin

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

Are you using persistent connections (CONN_MAX_AGE > 0)?

comment:2 Changed 4 months ago by ekeydar@…

Not using this.

comment:3 Changed 4 months ago by aaugustin

  • Description modified (diff)

This might be a side effect of using autocommit. Since Django 1.5 kept a transaction open in that case, MySQL couldn't close the connection. Now it can.

Could you try the following snippet to confirm my hypothesis?

>>> import django.contrib.auth.models
>>> from django.db import transaction
>>> with transaction.atomic():
...     print list(django.contrib.auth.models.User.objects.all())
...     import time
...     time.sleep(15)
...     print list(django.contrib.auth.models.User.objects.all())

You shouldn't get a timeout when you run this on Django 1.6.

Last edited 4 months ago by aaugustin (previous) (diff)

comment:4 Changed 4 months ago by anonymous

I ran into the same issue after upgrading to Django 1.6. In my case it affects long-running (multiple days) WorkerThreads, although I am not using any wait_timeout setting. It looks like the connection just times out without Django noticing/handling it.

Code which is triggering this:

cursor = connections['legacydb'].cursor()
cursor.execute(query)
result = cursor.fetchall()
cursor.close()

The above code works for a few hours and after a while the OperationalError 2006 is triggered on cursor.execute(). This behavior was not present in early Django versions and is a critical regression (aka: code which used to work now raises an Exception → severity should be set to release blocker).

BTW: The legacydb in the example does not support transactions (MyISAM).

comment:5 Changed 4 months ago by claudep

AFAIK the default timeout on MySQL is 28800 seconds (8 hours).
http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html#sysvar_wait_timeout

comment:6 Changed 4 months ago by claudep

  • Triage Stage changed from Unreviewed to Accepted

@aaugustin: even with transaction.atomic() (as of comment:3), I'm able to reproduce the failure.

comment:7 Changed 4 months ago by claudep

Query comparison (as in OP example, without using transaction.atomic()):

Django 1.5Django 1.6
562 Connect	root@localhost on test_
562 Query	SET NAMES utf8
562 Query	set autocommit=0
562 Query	SET SQL_AUTO_IS_NULL = 0


562 Query	SELECT `auth_user`.`id`, ... FROM `auth_user`

563 Connect	root@localhost on test_
563 Query	SET NAMES utf8
563 Query	set autocommit=0
563 Query	SET SQL_AUTO_IS_NULL = 0

563 Query	SELECT `auth_user`.`id`,... FROM `auth_user`
560 Connect	root@localhost on test_
560 Query	SET NAMES utf8
560 Query	set autocommit=0
560 Query	SET SQL_AUTO_IS_NULL = 0
560 Query	set autocommit=1
560 Query	set autocommit=0
560 Query	SELECT `auth_user`.`id`, ... FROM `auth_user`

561 Connect	root@localhost on test_
561 Query	SET NAMES utf8
561 Query	set autocommit=0
561 Query	SET SQL_AUTO_IS_NULL = 0
561 Query	set autocommit=1
561 Query	SELECT `auth_user`.`id`, ... FROM `auth_user`

comment:8 Changed 4 months ago by aigarius

Could this actually be a failure to re-establish a connection to MySQL after receiving the 2006 error? I get the same symptoms in production now when a MySQL server is restarted - all workers start getting 2006 errors and do not recover. Only gunicorn restart helps then. You could reproduce it locally by calling "print list(django.contrib.auth.models.User.objects.all())" every second and then restarting MySQL server. The connection does not recover even after the server is back up and accepting connections. This was tested with CONN_MAX_AGE > 0 but lower than wait_timeout on the MySQL server.

comment:9 Changed 4 months ago by aigarius

For now we are using the following workaround for long-running workers:

from django.db import connection
...
def is_connection_usable():
    try:
        connection.connection.ping()
    except:
        return False
    else:
        return True
...

def do_work():
    while(True): # Endless loop that keeps the worker going (simplified)
        if not is_connection_usable():
            connection.close()
        try:
            do_a_bit_of_work()
        except:
            logger.exception("Something bad happened, trying again")
            sleep(1)

comment:10 Changed 3 months ago by err

Maybe we can validate that connection is usable in ensure_connection ?

something like this:

*** Django-1.6.1/django/db/backends/__init__.py 2014-01-23 16:57:15.927687924 +0400
--- /usr/local/lib/python2.7/dist-packages/django/db/backends/__init__.py       2014-01-23 16:56:21.000000000 +0400
***************
*** 119,125 ****
          """
          Guarantees that a connection to the database is established.
          """
!         if self.connection is None or not self.is_usable():
              with self.wrap_database_errors:
                  self.connect()
  
--- 119,125 ----
          """
          Guarantees that a connection to the database is established.
          """
!         if self.connection is None:
              with self.wrap_database_errors:
                  self.connect()

are there any caveats?

comment:11 Changed 3 months ago by err

or even close connection (if not usable) in ensure_connection

comment:12 Changed 2 months ago by andreis

Hi,
to me it seems to be a bug. Old django would close every connection right away, django 1.6 checks with CONN_MAX_AGE:

  1. It gets CONN_MAX_AGE from DATABASES, sets close_at:
max_age = self.settings_dict['CONN_MAX_AGE']
self.close_at = None if max_age is None else time.time() + max_age
  1. Actually the code above affects close_if_unusable_or_obsolete, which closes the connection if 'self.close_at is not None and time.time() >= self.close_at'
  2. close_if_unusable_or_obsolete itself is being called by close_old_connections, which in turn is a request handler for signals.request_started and signals.request_finished.

We have a worker, which is effectively a django app but but it doesn't process any HTTP requests. In fact that makes all connections persistent because close_old_connections never gets called.

Please advise.

Thanks

comment:13 Changed 2 months ago by jeroen.pulles@…

Hi,

Without transactions you hit the Gone Away if the sleep is longer than MySQL's wait_timeout:

mysql> set global wait_timeout=10; 
>>> import django.contrib.auth.models
>>> import time
>>> print list(django.contrib.auth.models.User.objects.all())
>>> time.sleep(15)
>>> print list(django.contrib.auth.models.User.objects.all())

According to MySQL/python documentation this should not be a problem. If you add the INTERACTIVE bit to the client connection flags in db/backends/mysql/base.py, you regain the MySQL drivers' auto-reconnect feature and everything works as before (I think that in Django 1.5 you ran into trouble with a transaction that ran longer than wait_timeout too).

from kwargs['client_flag'] = CLIENT.FOUND_ROWS

to kwargs['client_flag'] = CLIENT.FOUND_ROWS | CLIENT.INTERACTIVE

I haven't looked into the origins of this line, but maybe it is the real culprit for the recent Gone Away issues.

comment:14 Changed 2 months ago by andreis

Hi Jeroen!
It seems like adding CLIENT.INTERACTIVE flag just tells the driver to switch from checking on wait_timeout to taking interactive_timeout into account. I set interactive_timeout=10 and was able to reproduce this problem.
Both of these values are 8 hours by default, but once your code has been inactive for that long, mysql drops the connection and the client fails next time it tries to access some data. It looks perfectly right to catch this error in the code, call django.db.close_connection() every time or whatever, but I think that maybe connection persistence logic needs a bit of fine-tuning so that we can control persistence without relying on signals.request_started/request_finished.

comment:15 Changed 5 weeks ago by andreis

Hey folks! Any thoughts on this matter?

Thanks

Last edited 5 weeks ago by andreis (previous) (diff)

comment:16 Changed 4 weeks ago by jeroen.pulles@…

I've checked with my existing long running processes on Django 1.5 installations, running the same mysql-python and libmysqlclient.so, with tcpdump:

They do set the interactive flag on the MySQL connection. That explains for me why I never experienced the Gone Away's before. Django doesn't notice that the underlying connection went away and came back.

I haven't had enough time to find out (dig down deep enough) what makes this flag appear on the connection in the 1.5 situation and what changed in 1.6 that is relevant to this problem. (a) My suspicion is that it isn't directly related to the connection persistence mechanism. (b) To me it doesn't seem to be in any way related to the transactions mechanisms: My transactions happen fast enough and it's fine that things break if the transaction takes longer than wait_timeout (e.g. more than two minutes); The same application that works fine in 1.5 also works in 1.6 with the interactive flag set.

JeroenP

comment:17 Changed 4 weeks ago by err

  • Owner changed from nobody to err
  • Status changed from new to assigned

comment:18 Changed 4 weeks ago by err

https://github.com/django/django/pull/2454 I've submitted pull request

comment:19 Changed 4 weeks ago by timo

  • Cc timo added
  • Has patch set

When I run the MySQL tests using djangocore-box, I get the traceback below when the test suite concludes. The PR above resolves this error.

Traceback (most recent call last):
  File "/django/tests/runtests.py", line 374, in <module>
    options.failfast, args)
  File "/django/tests/runtests.py", line 216, in django_tests
    test_labels or get_installed(), extra_tests=extra_tests)
  File "/django/django/test/runner.py", line 149, in run_tests
    self.teardown_databases(old_config)
  File "/django/django/test/runner.py", line 124, in teardown_databases
    connection.creation.destroy_test_db(old_name, self.verbosity)
  File "/django/django/db/backends/creation.py", line 452, in destroy_test_db
    self._destroy_test_db(test_database_name, verbosity)
  File "/django/django/db/backends/creation.py", line 466, in _destroy_test_db
    % self.connection.ops.quote_name(test_database_name))
  File "/django/django/db/backends/utils.py", line 59, in execute
    return self.cursor.execute(sql, params)
  File "/django/django/db/utils.py", line 94, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)
  File "/django/django/db/backends/utils.py", line 57, in execute
    return self.cursor.execute(sql)
  File "/django/django/db/backends/mysql/base.py", line 128, in execute
    return self.cursor.execute(query, args)
  File "/home/vagrant/.virtualenvs/py2.7/local/lib/python2.7/site-packages/MySQLdb/cursors.py", line 174, in execute
    self.errorhandler(self, exc, value)
  File "/home/vagrant/.virtualenvs/py2.7/local/lib/python2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
django.db.utils.OperationalError: (2006, 'MySQL server has gone away')

comment:20 Changed 4 weeks ago by aaugustin

  • Patch needs improvement set

I believe this patch breaks the transactional integrity guarantees made by transaction.atomic.

Django mustn't attempt to reconnect until the connection has been dropped properly, possibly after waiting for the exit of an atomic block.

comment:21 Changed 11 days ago by anonymous

A fix for this is still badly required, as this is a horrible regression. Please mark this bug as a 'release blocker', as it breaks existing code and makes upgrading to Django 1.6 impossible for quite a few people.

comment:22 follow-up: Changed 11 days ago by aaugustin

  • Severity changed from Normal to Release blocker

Unless I missed something, this only affects people that have management commands running for more than 8 hours (or MySQL's connection timeout if it has been changed from the default). Is that correct?

comment:23 Changed 11 days ago by andreis

Hi, this absolutely true for me. However I've managed to come up with a workaround, we've wrapped some of our code with decorators which implicitly call close_old_connections() routine, so that django now respects CONN_MAX_AGE parameter to some extent.

comment:24 in reply to: ↑ 22 Changed 9 days ago by anonymous

Replying to aaugustin:

Unless I missed something, this only affects people that have management commands running for more than 8 hours (or MySQL's connection timeout if it has been changed from the default). Is that correct?

As far as I can tell I think this is correct. At least this is the place where I was hitting this bug in my environment (before downgrading to 1.5 again because of this).

comment:25 Changed 3 days ago by jazeps.basko@…

Maybe this helps someone like me who uses Django and SQLAlchemy. I was experiencing this problem every morning with an API (not used for several hours at night) which had CONN_MAX_AGE=3600 and MySQL server with wait_timeout = 28800. I even changed settings which are used by manage.py (to run migrations and collect static assets) to not use persistent connections CONN_MAX_AGE=0, but the problem persisted. Then I noticed that this is actually happening in the code where SQLAlchemy accesses DB, so I googled and found this post: http://nathansnoggin.blogspot.co.uk/2013/11/integrating-sqlalchemy-into-django.html Followed the instructions in the blog post. Now that my SQLAlchemy does not handle connections (just uses whatever Django supplies) the problem is gone.

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as assigned
The owner will be changed from err to anonymous. Next status will be 'assigned'
The ticket will be disowned. Next status will be 'new'
as The resolution will be set. Next status will be 'closed'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.