Opened 11 years ago

Closed 11 years ago

Last modified 11 years ago

#19707 closed Bug (fixed)

TransactionMiddleware leaks transaction state

Reported by: Anssi Kääriäinen Owned by: nobody
Component: Database layer (models, ORM) Version: 1.4
Severity: Release blocker Keywords:
Cc: Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

TransactionMiddleware process_exception and process_response are written in a style that will leak transaction state on error. process_response for example:

        if transaction.is_managed():
            if transaction.is_dirty():
                transaction.commit()
            transaction.leave_transaction_management()

now, what happens if transaction.commit() throws an error? transaction.leave_transaction_management() is never called, and the connection's transaction_state leaks an "managed" entry.

This can cause a "funny" situation where SessionMiddleware doesn't commit transactions if the SessionMiddleware is above TransactionMiddleware in the middleware stack. This is user visible in that logins do nothing - they don't give any errors, but they don't log in either.

A simple test-project attached. You need to run this under PostgreSQL so that deferred foreign keys can cause an error on commit(). And, you need to use runserver --nothreading as otherwise every request gets a new thread and a new connection with clear transaction state.

Run syncdb and then runserver --nothreading. Go to localhost:8000/view/, then go to localhost:8000/admin/ and try to log in. You can also add from django.db import connection; print connection.transaction_state to process_response to see better what happens.

I have hit this same issue under apache + mod_wsgi so this doesn't happen only in development server without threading.

The bug is present at least in versions 1.4+ of Django. I am marking this as release blocker as this is a crashing bug (logins are impossible when condition hit), and I plan to backpatch this to 1.4-master.

The fix should also be easy: add try-finally to process_response/process_exception.

Attachments (3)

txmiddleware_example.tar.gz (5.5 KB ) - added by Anssi Kääriäinen 11 years ago.
ticket_19707.diff (7.0 KB ) - added by Anssi Kääriäinen 11 years ago.
ticket_19707-2.diff (10.9 KB ) - added by Anssi Kääriäinen 11 years ago.

Download all attachments as: .zip

Change History (23)

by Anssi Kääriäinen, 11 years ago

Attachment: txmiddleware_example.tar.gz added

comment:1 by Anssi Kääriäinen, 11 years ago

Triage Stage: UnreviewedAccepted

I tested some more, and another way to cause this issue in 1.4 is to have "assert 1 == 0" in the testapp.views.view() and run under settings.DEBUG = False. The code will run request middleware for the 500 response, see that 500.html isn't available and fails to run neither exception or response middleware at all after that failure. If 500.html is present (1.5+ has this) the response middleware is called correctly. So, transaction state is again leaked and even try-finally in the exception or response middleware wont help as they aren't called at all (.

Another case that could cause this is erroneous user code doing enter_transaction_management, but no exit. I think it would be a good idea to clear connection.transaction_state in close_connections() (called on response_finished signal). Doing so in connection.close() isn't correct, as transaction_state isn't bound to single open connection, it is bound to the connection wrapper. Multiple open - close cycles are possible within single enter_transaction_management. And, this would also cause failures in such cases as:

enter_transaction_management()
do_something_with_connection
connection.close()
leave_transaction_management()

Some related tickets: #12250 deals about situations where process_response/exception isn't called. #19645 is about adding tests for TransactionMiddleware.

comment:2 by Anssi Kääriäinen, 11 years ago

Patch attached. There are many hacks related to transaction handling special cases in testing setup. The patch does fix both error cases mentioned above.

There might be some state still stored in the connection from request to request, but at least the stuck transaction should be handled correctly now.

by Anssi Kääriäinen, 11 years ago

Attachment: ticket_19707.diff added

comment:3 by Anssi Kääriäinen, 11 years ago

A cleaned up implementation is attached. The tests are still ugly, but they do catch the errors.

Should we log possible problems in transaction.abort()? And could there be situations where del connections[alias] isn't safe to do. We haven't ever deleted connections before. Still, in situations where del connections[alias] is called the connection will be in unknown state. It is hard to see how things could get worse than that (famous last words...).

This one needs a review, as the plan is to backpatch the code to 1.4. The testing code will not apply cleanly, but the changes to django.* will apply to 1.4.

by Anssi Kääriäinen, 11 years ago

Attachment: ticket_19707-2.diff added

comment:4 by Aymeric Augustin, 11 years ago

Patch needs improvement: set

If you want to catch all exceptions, use except Exception:, not except:. You don't want to catch MemoryError.

Otherwise the patch looks reasonable.

comment:5 by Anssi Kääriäinen, 11 years ago

OK, didn't know that about except and except Exception.

The only possible regression I can see is if somebody is using close_connection inside transaction management block expecting that only the connection will be closed, yet the transaction management state remains. Doing so seems to be on the verge of abuse, and even then close_connection isn't public API. Still, the name is now misleading, but I don't think it is a good idea to alter the name, at least not in the backbranches.

For backpatching it seems easiest to just backpatch all the TransactionMiddleware tests, even those added in #19645. Backpatching tests shouldn't hurt anybody...

comment:6 by Anssi Kääriäinen <akaariai@…>, 11 years ago

Resolution: fixed
Status: newclosed

In a4e97cf315142e61bb4bc3ed8259b95d8586d09c:

Fixed #19707 -- Reset transaction state after requests

comment:7 by Anssi Kääriäinen <akaariai@…>, 11 years ago

In 60186aa2e5db5900b96cbc3dac0ad23b72a37e80:

[1.5.x] Fixed #19707 -- Reset transaction state after requests

Backpatch of a4e97cf315142e61bb4bc3ed8259b95d8586d09c

comment:8 by Anssi Kääriäinen <akaariai@…>, 11 years ago

In 9918b3f50212bfb408eebc8f9965beb061baf840:

[1.4.x] Fixed #19707 -- Reset transaction state after requests

Backpatch of a4e97cf315142e61bb4bc3ed8259b95d8586d09c.

comment:9 by Anssi Kääriäinen, 11 years ago

The fix has caused a regression for 1.4.x when running tests under postgis, specifically these two tests: gis requests. For some reason a connection will be left open. The reason is linked to del connections[conn] in close_connection(). Also, if the requests.test_request_finished_failed_connection assigns the old conn to connections["DEFAULT_DB_ALIAS"] then everything will work.

I don't yet know what exactly is happening. Still, it seems that the del connections[conn] is a bad idea, at least for 1.4. Maybe it would be better to try to clean the transaction state variables of the connection directly. But, for example postgresql has .isolation_level variable which should be cleared to proper state so that the options['autocommit'] feature will continue to work properly. This is done by leave_transaction_management(). Cleaning the state otherwise than by doing leave_transaction_managment() is risky, but maybe it is still the least risky approach...

I will continue investigating this tomorrow.

comment:10 by Carl Meyer, 11 years ago

I haven't spent as much time looking at this as you have, Anssi, but it seems to me that catching the case in the signal handler where abort()/close() raise an error is outside the reasonable scope of what Django can handle, and that it would be safer to just allow that error to raise normally and force the developer to deal with that condition, as before. If an error was raised in abort/close, there's no particular reason to think that removing our reference to the connection will actually close it successfully, so the current approach seems to me quite likely to leave connections hanging open (just as it is doing in the tests you mention).

comment:11 by Anssi Kääriäinen, 11 years ago

The problem is close_connection is called as last steps of request handling, and thus the developer has zero chance to handle the error. If the connection state isn't cleaned, then you can get into situations where logins work otherwise perfectly except the session save is rolled back... Thus users can log in without error, only their login doesn't work for the next request. So, the error isn't limited to one request, this will turn the Django installation into a state where it doesn't work properly any more.

This isn't only theoretical. The reason I started to investigate this was that I had a site where connections were dropped during night, and the result was that during next day logins would randomly fail. Not the nicest situation to debug.

Even if Django can't properly handle the error, at least it can try to recover into state where next requests will work. The developer can't do this, the code isn't under the devs control. So, "force the developer to deal with that condition" is equivalent to saying that the developer should restart the service.

The underlying problem is that when connection.close() or _rollback() fails, you can't trust that leave_transaction_management() will work either. And, calling leave_transaction_management() repeatedly is the only way that connections know how to unwind the transaction state stack. For example, the PostgreSQL's autocommit feature works by counting the nested calls. So, the safest way to clean the connection's state is to throw the whole wrapper away and then next request will continue with a new wrapper.

Now, even after all this I think that at least 1.4 should not delete connections from django.db.connections. Instead, it should do something like this:

  1. Try to close the connection
  2. If that succeeds, then unwind the management stack by leave_transaction_managemet().
  3. If close doesn't succeed, then just try to set the variables to correct values and throw connection.connection away. Garbage collector can then deal with the internal connection.connection closing.

The reason is that 1.4 doesn't seem to handle connection deletion from django.db.connections properly. Also, there might be some user code that can't handle connection deletion.

No. 3. above is hard to write in a way that is guaranteed to work for all connections, but still it seems safer than introducing connection deletion in a point release.

I just checked and the problematic test itself isn't leaving any connections open. Somehow, somewhere the old connection the test removed from django.db.connections is used again. I am not exactly sure where that happens (test teardown is the likely source of this).

This is a situation where there are only bad options: Connection deletion is risky, so is manual transaction state cleanup. And doing nothing at all can leave the service in broken state.

comment:12 by Carl Meyer, 11 years ago

I may be missing something, but it seems to me that in this latest comment you are not distinguishing clearly between the various distinct failure cases. The specific case that I am suggesting we not try to handle is the case where:

  1. The request finishes without leaving transaction management (i.e. the 500.html or unbalanced-enter-transaction-management situations you describe in your first comment).
  1. AND rolling back the transaction / leaving transaction management raises an error (AFAICS nobody has presented any real-life situation where this would occur yet).

To be clear, I am not proposing that we should roll back this entire patch. I am in agreement with all of the fixes in this patch (including the addition of a call to "abort()" in the request_finished handler) _except_ for the "except Exception" clause in the request_finished handler. It seems to me that if we simply removed that clause, all of the real-life scenarios you described so far would still be fixed.

I think that the only correct fix for case 1 is for the developer to fix their code (although I also think we should consider whether the 500.html issue is Django bug in its own right), so I am skeptical of hiding that condition from the developer. And I think that the combination of both 1 and 2 is so unlikely (and almost certainly indicative of a bug in the codebase) that it is preferable to let the developer see the error than to hide it and try to recover.

comment:13 by Carl Meyer, 11 years ago

In any case, I think that even if we keep the except Exception clause and attempt some form of cleanup there, we must add a raise to the end of it so that the error is not hidden.

comment:14 by Anssi Kääriäinen, 11 years ago

One way a rollback might fail is if a connection goes away in the middle of a request. I just tried (using the attached test project), and if a connection goes away between the testapp.views.view create() and finishing the request, then the rollback will fail on MySQL. And, if there is no try-except, then the connection will leak transaction state. psycopg2 actually manages this situation correctly and the rollback() and close() will succeed. As a complete shocker mysql doesn't. Now, dropped db connections are rare, but they are something Django should survive. Unfortunately the DB API 2 has nothing guaranteeing that rollback() and/or close() will not raise exceptions in this case.

(To reproduce, place a breakpoint after the testapp.views.view create(), shut down the dbserver, continue the request).

Raise is a good idea, but only after all the connections have been looped through, otherwise the later connections might leak state. Would log.exception() to request log be good enough?

comment:15 by Anssi Kääriäinen, 11 years ago

More testing, and another option seems possible. Even if the connection state cleaning doesn't succeed on the request that caused the error, it will succeed in the next request that has a working connection available. So, a request could still leak state to next request, but the next successful request would then restore the connection to clear state (after finish).

This seems good enough to me, what I hate is leaving the service in permanent erroneous state, and that is avoided.

Still another option: how about doing the cleanup before request instead after the request?

comment:16 by Carl Meyer, 11 years ago

Good point about later connections. Logging an error to request log should still send an error email in the default config (since that's exactly what handle_uncaught_exception does), so yes that seems fine.

I agree that Django should be able to recover from a dropped DB connection. What we are talking about here is a clearly misconfigured/broken codebase (i.e. missing 500.html or unbalanced enter-transaction-management) _plus_ a dropped DB connection. I think in that rare situation the highest priority should be alerting that the codebase is broken, and cleanup should only be attempted if we are confident it can be done correctly.

What is really needed for reliable cleanup is some way to call leave_transaction_management and tell it "don't try to use the database connection, it might be broken, just clean up the transaction state in the wrapper." Allowing for this might be a more invasive future change. For 1.4.x I agree that letting the next request clean it up is good enough.

comment:17 by Anssi Kääriäinen, 11 years ago

I think I will be fine with doing the same for 1.5, too. It wouldn't be a bad idea to do a bigger cleanup to transaction management in 1.6, there are also other problems in the tx management code (see for example the _dirty flag handling...). Of course, 1.6 changes depend on available developer time.

I haven't tested broken connections before, and it seems 500.html wont save you when dealing with broken connections. And, it is somewhat easy to have unbalanced enter/leave tx calls, for example it is possible some middlewares are skipped if another middleware raises an exception. And an exception from middleware is somewhat likely in case db connections do not work...

So, the current plan: remove the try-except from the request_finished signal for 1.4 and 1.5, adjust tests as needed. The connection state wont be guaranteed to be cleaned immediately after badly failed requests, but it will be cleaned eventually.

Even with this adjustment I am afraid this ticket has still some potential for regressions in testing conditions.

comment:18 by Anssi Kääriäinen <akaariai@…>, 11 years ago

In fafee74306e869c23edcef864f9d816565a5c4a2:

Removed try-except in django.db.close_connection()

The reason was that the except clause needed to remove a connection
from the django.db.connections dict, but other parts of Django do not
expect this to happen. In addition the except clause was silently
swallowing the exception messages.

Refs #19707, special thanks to Carl Meyer for pointing out that this
approach should be taken.

comment:19 by Anssi Kääriäinen <akaariai@…>, 11 years ago

In 743263a1058bb54617446507cbb645aab571ac93:

[1.5.x] Removed try-except in django.db.close_connection()

The reason was that the except clause needed to remove a connection
from the django.db.connections dict, but other parts of Django do not
expect this to happen. In addition the except clause was silently
swallowing the exception messages.

Refs #19707, special thanks to Carl Meyer for pointing out that this
approach should be taken.

comment:20 by Anssi Kääriäinen <akaariai@…>, 11 years ago

In dec7dd99f095f938bc4306a0260d5b131935ad82:

[1.4.x] Removed try-except in django.db.close_connection()

The reason was that the except clause needed to remove a connection
from the django.db.connections dict, but other parts of Django do not
expect this to happen. In addition the except clause was silently
swallowing the exception messages.

Refs #19707, special thanks to Carl Meyer for pointing out that this
approach should be taken.

Note: See TracTickets for help on using tickets.
Back to Top