#19707 closed Bug (fixed)
TransactionMiddleware leaks transaction state
Reported by: | Anssi Kääriäinen | Owned by: | nobody |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | 1.4 |
Severity: | Release blocker | Keywords: | |
Cc: | Triage Stage: | Accepted | |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | yes |
Easy pickings: | no | UI/UX: | no |
Description
TransactionMiddleware process_exception and process_response are written in a style that will leak transaction state on error. process_response for example:
if transaction.is_managed(): if transaction.is_dirty(): transaction.commit() transaction.leave_transaction_management()
now, what happens if transaction.commit() throws an error? transaction.leave_transaction_management() is never called, and the connection's transaction_state leaks an "managed" entry.
This can cause a "funny" situation where SessionMiddleware doesn't commit transactions if the SessionMiddleware is above TransactionMiddleware in the middleware stack. This is user visible in that logins do nothing - they don't give any errors, but they don't log in either.
A simple test-project attached. You need to run this under PostgreSQL so that deferred foreign keys can cause an error on commit(). And, you need to use runserver --nothreading as otherwise every request gets a new thread and a new connection with clear transaction state.
Run syncdb and then runserver --nothreading. Go to localhost:8000/view/, then go to localhost:8000/admin/ and try to log in. You can also add from django.db import connection; print connection.transaction_state to process_response to see better what happens.
I have hit this same issue under apache + mod_wsgi so this doesn't happen only in development server without threading.
The bug is present at least in versions 1.4+ of Django. I am marking this as release blocker as this is a crashing bug (logins are impossible when condition hit), and I plan to backpatch this to 1.4-master.
The fix should also be easy: add try-finally to process_response/process_exception.
Attachments (3)
Change History (23)
by , 12 years ago
Attachment: | txmiddleware_example.tar.gz added |
---|
comment:1 by , 12 years ago
Triage Stage: | Unreviewed → Accepted |
---|
comment:2 by , 12 years ago
Patch attached. There are many hacks related to transaction handling special cases in testing setup. The patch does fix both error cases mentioned above.
There might be some state still stored in the connection from request to request, but at least the stuck transaction should be handled correctly now.
by , 12 years ago
Attachment: | ticket_19707.diff added |
---|
comment:3 by , 12 years ago
A cleaned up implementation is attached. The tests are still ugly, but they do catch the errors.
Should we log possible problems in transaction.abort()? And could there be situations where del connections[alias] isn't safe to do. We haven't ever deleted connections before. Still, in situations where del connections[alias] is called the connection will be in unknown state. It is hard to see how things could get worse than that (famous last words...).
This one needs a review, as the plan is to backpatch the code to 1.4. The testing code will not apply cleanly, but the changes to django.* will apply to 1.4.
by , 12 years ago
Attachment: | ticket_19707-2.diff added |
---|
comment:4 by , 12 years ago
Patch needs improvement: | set |
---|
If you want to catch all exceptions, use except Exception:
, not except:
. You don't want to catch MemoryError
.
Otherwise the patch looks reasonable.
comment:5 by , 12 years ago
OK, didn't know that about except and except Exception.
The only possible regression I can see is if somebody is using close_connection inside transaction management block expecting that only the connection will be closed, yet the transaction management state remains. Doing so seems to be on the verge of abuse, and even then close_connection isn't public API. Still, the name is now misleading, but I don't think it is a good idea to alter the name, at least not in the backbranches.
For backpatching it seems easiest to just backpatch all the TransactionMiddleware tests, even those added in #19645. Backpatching tests shouldn't hurt anybody...
comment:6 by , 12 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
comment:9 by , 12 years ago
The fix has caused a regression for 1.4.x when running tests under postgis, specifically these two tests: gis requests
. For some reason a connection will be left open. The reason is linked to del connections[conn]
in close_connection(). Also, if the requests.test_request_finished_failed_connection assigns the old conn to connections["DEFAULT_DB_ALIAS"]
then everything will work.
I don't yet know what exactly is happening. Still, it seems that the del connections[conn] is a bad idea, at least for 1.4. Maybe it would be better to try to clean the transaction state variables of the connection directly. But, for example postgresql has .isolation_level variable which should be cleared to proper state so that the options['autocommit']
feature will continue to work properly. This is done by leave_transaction_management(). Cleaning the state otherwise than by doing leave_transaction_managment() is risky, but maybe it is still the least risky approach...
I will continue investigating this tomorrow.
comment:10 by , 12 years ago
I haven't spent as much time looking at this as you have, Anssi, but it seems to me that catching the case in the signal handler where abort()
/close()
raise an error is outside the reasonable scope of what Django can handle, and that it would be safer to just allow that error to raise normally and force the developer to deal with that condition, as before. If an error was raised in abort/close, there's no particular reason to think that removing our reference to the connection will actually close it successfully, so the current approach seems to me quite likely to leave connections hanging open (just as it is doing in the tests you mention).
comment:11 by , 12 years ago
The problem is close_connection is called as last steps of request handling, and thus the developer has zero chance to handle the error. If the connection state isn't cleaned, then you can get into situations where logins work otherwise perfectly except the session save is rolled back... Thus users can log in without error, only their login doesn't work for the next request. So, the error isn't limited to one request, this will turn the Django installation into a state where it doesn't work properly any more.
This isn't only theoretical. The reason I started to investigate this was that I had a site where connections were dropped during night, and the result was that during next day logins would randomly fail. Not the nicest situation to debug.
Even if Django can't properly handle the error, at least it can try to recover into state where next requests will work. The developer can't do this, the code isn't under the devs control. So, "force the developer to deal with that condition" is equivalent to saying that the developer should restart the service.
The underlying problem is that when connection.close() or _rollback() fails, you can't trust that leave_transaction_management() will work either. And, calling leave_transaction_management() repeatedly is the only way that connections know how to unwind the transaction state stack. For example, the PostgreSQL's autocommit feature works by counting the nested calls. So, the safest way to clean the connection's state is to throw the whole wrapper away and then next request will continue with a new wrapper.
Now, even after all this I think that at least 1.4 should not delete connections from django.db.connections. Instead, it should do something like this:
- Try to close the connection
- If that succeeds, then unwind the management stack by leave_transaction_managemet().
- If close doesn't succeed, then just try to set the variables to correct values and throw connection.connection away. Garbage collector can then deal with the internal connection.connection closing.
The reason is that 1.4 doesn't seem to handle connection deletion from django.db.connections properly. Also, there might be some user code that can't handle connection deletion.
No. 3. above is hard to write in a way that is guaranteed to work for all connections, but still it seems safer than introducing connection deletion in a point release.
I just checked and the problematic test itself isn't leaving any connections open. Somehow, somewhere the old connection the test removed from django.db.connections is used again. I am not exactly sure where that happens (test teardown is the likely source of this).
This is a situation where there are only bad options: Connection deletion is risky, so is manual transaction state cleanup. And doing nothing at all can leave the service in broken state.
comment:12 by , 12 years ago
I may be missing something, but it seems to me that in this latest comment you are not distinguishing clearly between the various distinct failure cases. The specific case that I am suggesting we not try to handle is the case where:
- The request finishes without leaving transaction management (i.e. the 500.html or unbalanced-enter-transaction-management situations you describe in your first comment).
- AND rolling back the transaction / leaving transaction management raises an error (AFAICS nobody has presented any real-life situation where this would occur yet).
To be clear, I am not proposing that we should roll back this entire patch. I am in agreement with all of the fixes in this patch (including the addition of a call to "abort()" in the request_finished handler) _except_ for the "except Exception" clause in the request_finished handler. It seems to me that if we simply removed that clause, all of the real-life scenarios you described so far would still be fixed.
I think that the only correct fix for case 1 is for the developer to fix their code (although I also think we should consider whether the 500.html issue is Django bug in its own right), so I am skeptical of hiding that condition from the developer. And I think that the combination of both 1 and 2 is so unlikely (and almost certainly indicative of a bug in the codebase) that it is preferable to let the developer see the error than to hide it and try to recover.
comment:13 by , 12 years ago
In any case, I think that even if we keep the except Exception
clause and attempt some form of cleanup there, we must add a raise
to the end of it so that the error is not hidden.
comment:14 by , 12 years ago
One way a rollback might fail is if a connection goes away in the middle of a request. I just tried (using the attached test project), and if a connection goes away between the testapp.views.view create() and finishing the request, then the rollback will fail on MySQL. And, if there is no try-except, then the connection will leak transaction state. psycopg2 actually manages this situation correctly and the rollback() and close() will succeed. As a complete shocker mysql doesn't. Now, dropped db connections are rare, but they are something Django should survive. Unfortunately the DB API 2 has nothing guaranteeing that rollback() and/or close() will not raise exceptions in this case.
(To reproduce, place a breakpoint after the testapp.views.view create(), shut down the dbserver, continue the request).
Raise is a good idea, but only after all the connections have been looped through, otherwise the later connections might leak state. Would log.exception() to request log be good enough?
comment:15 by , 12 years ago
More testing, and another option seems possible. Even if the connection state cleaning doesn't succeed on the request that caused the error, it will succeed in the next request that has a working connection available. So, a request could still leak state to next request, but the next successful request would then restore the connection to clear state (after finish).
This seems good enough to me, what I hate is leaving the service in permanent erroneous state, and that is avoided.
Still another option: how about doing the cleanup before request instead after the request?
comment:16 by , 12 years ago
Good point about later connections. Logging an error to request log should still send an error email in the default config (since that's exactly what handle_uncaught_exception
does), so yes that seems fine.
I agree that Django should be able to recover from a dropped DB connection. What we are talking about here is a clearly misconfigured/broken codebase (i.e. missing 500.html or unbalanced enter-transaction-management) _plus_ a dropped DB connection. I think in that rare situation the highest priority should be alerting that the codebase is broken, and cleanup should only be attempted if we are confident it can be done correctly.
What is really needed for reliable cleanup is some way to call leave_transaction_management and tell it "don't try to use the database connection, it might be broken, just clean up the transaction state in the wrapper." Allowing for this might be a more invasive future change. For 1.4.x I agree that letting the next request clean it up is good enough.
comment:17 by , 12 years ago
I think I will be fine with doing the same for 1.5, too. It wouldn't be a bad idea to do a bigger cleanup to transaction management in 1.6, there are also other problems in the tx management code (see for example the _dirty flag handling...). Of course, 1.6 changes depend on available developer time.
I haven't tested broken connections before, and it seems 500.html wont save you when dealing with broken connections. And, it is somewhat easy to have unbalanced enter/leave tx calls, for example it is possible some middlewares are skipped if another middleware raises an exception. And an exception from middleware is somewhat likely in case db connections do not work...
So, the current plan: remove the try-except from the request_finished signal for 1.4 and 1.5, adjust tests as needed. The connection state wont be guaranteed to be cleaned immediately after badly failed requests, but it will be cleaned eventually.
Even with this adjustment I am afraid this ticket has still some potential for regressions in testing conditions.
I tested some more, and another way to cause this issue in 1.4 is to have "assert 1 == 0" in the testapp.views.view() and run under settings.DEBUG = False. The code will run request middleware for the 500 response, see that 500.html isn't available and fails to run neither exception or response middleware at all after that failure. If 500.html is present (1.5+ has this) the response middleware is called correctly. So, transaction state is again leaked and even try-finally in the exception or response middleware wont help as they aren't called at all (.
Another case that could cause this is erroneous user code doing enter_transaction_management, but no exit. I think it would be a good idea to clear connection.transaction_state in close_connections() (called on response_finished signal). Doing so in connection.close() isn't correct, as transaction_state isn't bound to single open connection, it is bound to the connection wrapper. Multiple open - close cycles are possible within single enter_transaction_management. And, this would also cause failures in such cases as:
Some related tickets: #12250 deals about situations where process_response/exception isn't called. #19645 is about adding tests for TransactionMiddleware.