Opened 16 years ago

Closed 14 years ago

#8616 closed (fixed)

File-based sessions dropped with SESSION_SAVE_EVERY_REQUEST = True

Reported by: warren@… Owned by: Jacob
Component: contrib.sessions Version: dev
Severity: Keywords: session, file, race
Cc: wiser0 Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: yes Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

When using django.contrib.sessions.backends.file with SESSION_SAVE_EVERY_REQUEST = True, repeated requests for the same session that arrive very close together will cause corruption in the session file and result in a dropped session. The recent changes in the session framework went a long ways in resolving this issue, but not quite far enough.

To resolve the problem, I tried using django.core.files.locks to lock the file when open for reading or writing. However, this proved not sufficiently atomic. The problem can be resolved using the atomic file locking flags when opening the session file for reading or writing. Unfortunately, these are not available on all platforms. The attached patch uses them if they are available. This fixes the problem on platforms that support the atomic file locking flags.

Attachments (5)

session_file_read_write_locking.diff (1.4 KB ) - added by warren@… 16 years ago.
cross_platform_session_file_read_write_locking.diff (2.7 KB ) - added by warren@… 16 years ago.
session_file_io_race_fix.diff (3.8 KB ) - added by warren@… 16 years ago.
I/O Race condition fix that does not rely on file locking
name_error_fix.diff (754 bytes ) - added by warren@… 16 years ago.
Fix potential NameError in save().
file.py.diff (719 bytes ) - added by wiser0 14 years ago.
Windows specific patch for zero length session files.

Download all attachments as: .zip

Change History (38)

by warren@…, 16 years ago

comment:1 by Malcolm Tredinnick, 16 years ago

Can you please explain the race you are trying to fix here. I want to make sure we're trying to solve the right problem.

comment:2 by Jacob, 16 years ago

Triage Stage: UnreviewedAccepted

comment:3 by warren@…, 16 years ago

The problem (I think) I am seeing is this:

  1. Request A opens the file for reading, reads the session data, and closes the file.
  2. Request A opens the session file for writing, truncating the file.
  3. Request A starts to write the session data to the file.
  4. Request B opens the file for reading, reads the partial data from the file, and closes the file.
  5. Request A finishes writing the session data to the file.
  6. Request B attempts to decode the partial session data and fails.
  7. Request B creates a new empty session.
  8. Code executed by Request B and subsequent requests fails because of missing session data.

I encountered this problem before the recent session changes, so the chain of events above may be incorrect. The core of the issue is (I think) that while the session data is being written to the file, other requests may read the file before the write is complete.

This seems to manifest itself only when all of these conditions exist:

  • SESSION_SAVE_EVERY_REQUEST = True causes the session data to be written on every request.
  • Django is not running under the development server so the requests are not serialized.
  • Multiple requests from the same session arrive very close together (as can happen with AJAX requests).

Before the recent session changes, I had tried using the django.core.files.locks module to lock the file appropriately (shared for read, exclusive for write) immediately after the file was opened, but I still got failures (not sure why). It was only when I used the locking flags to os.open() that the problem went away. Unfortunately, the O_SHLOCK and O_EXLOCK flags seem to be BSD specific, so while they are available on my Mac, they may not be under other flavors of *nix. I looked at Python 2.5.1 running on RHEL 4 and they were not there. More recent version of Linux may or may not have them.

It is possible that other factors, that may have been fixed by the recent session changes, caused the failures when I tried the django file locking before. I'll give it another go.

comment:4 by warren@…, 16 years ago

*Previous post changed to use wiki formatting*

The problem (I think) I am seeing is this:

  1. Request A opens the file for reading, reads the session data, and closes the file.
  2. Request A opens the session file for writing, truncating the file.
  3. Request A starts to write the session data to the file.
  4. Request B opens the file for reading, reads the partial data from the file, and closes the file.
  5. Request A finishes writing the session data to the file.
  6. Request B attempts to decode the partial session data and fails.
  7. Request B creates a new empty session.
  8. Code executed by Request B and subsequent requests fails because of missing session data.

I encountered this problem before the recent session changes, so the chain of events above may be incorrect. The core of the issue is (I think) that while the session data is being written to the file, other requests may read the file before the write is complete.

This seems to manifest itself only when all of these conditions exist:

  • SESSION_SAVE_EVERY_REQUEST = True causes the session data to be written on every request.
  • Django is not running under the development server so the requests are not serialized.
  • Multiple requests from the same session arrive very close together (as can happen with AJAX requests).

Before the recent session changes, I had tried using the django.core.files.locks module to lock the file appropriately (shared for read, exclusive for write) immediately after the file was opened, but I still got failures (not sure why). It was only when I used the locking flags to os.open() that the problem went away. Unfortunately, the O_SHLOCK and O_EXLOCK flags seem to be BSD specific, so while they are available on my Mac, they may not be under other flavors of *nix. I looked at Python 2.5.1 running on RHEL 4 and they were not there. More recent version of Linux may or may not have them.

It is possible that other factors, that may have been fixed by the recent session changes, caused the failures when I tried the django file locking before. I'll give it another go.

comment:5 by warren@…, 16 years ago

Summary: Session file corrupted with SESSION_SAVE_EVERY_REQUEST = TrueFile-based sessions dropped with SESSION_SAVE_EVERY_REQUEST = True

On further research it appears that the session file is not corrupted, but just appears to be corrupted because it is read at the same time another request is writing it. Because the failed attempt by the second request to decode the session data results in the generation of a new session key, the (now empty) session data is written to a new file, not the old one. Requests that use the original session id get the original session data intact, so long as another request is not in the process of writing it.


comment:6 by warren@…, 16 years ago

I tried using django.core.files.locks again and got similar results as before. As I think about it, perhaps this is what is happening:

  1. Request A opens the file for writing, truncating the file.
  2. Request B opens the file for reading.
  3. Request B obtains a shared lock on the file.
  4. Request B reads the empty file.
  5. Request B closes the file, releasing the shared lock.
  6. Request A obtains an exclusive lock on the file, writes the session data, and closes the file, releasing the exclusive lock.
  7. Request B is left with no data from the empty file it read.

Apparently, using the O_SHLOCK and O_EXLOCK flags prevents the file from being truncated until the exclusive lock has been obtained. Can this behavior be duplicated somehow?

comment:7 by warren@…, 16 years ago

I have resolved the issue with using django.core.files.locks to lock the session file for reading and writing.

The solution is to separate the locking semantics into a separate lock file that is opened and locked before opening the session file and closed after closing the session file. This allows the non-atomicity of opening and locking a file to become irrelevant, since no I/O operations are actually performed on the lock file.

I am submitting a second patch that implements this alternate locking method. This patch should work cross-platform, though I have only tested it on Mac OSX.

comment:8 by Malcolm Tredinnick, 16 years ago

Owner: changed from nobody to Malcolm Tredinnick
Status: newassigned

comment:9 by Malcolm Tredinnick, 16 years ago

It's tempting to engage in a bikeshed painting operation here just because I thought of a different way, but since you've tested this and have a failing case to work against, I'm going with your patch.

In case this does give problems down the track, though, here's an alternate solution: we write the new session to a different filename, so the only contention is on the write side (which blocks due to the exclusive write lock). Then we rename it to the original name upon completion. Renaming is atomic (in the sense that nobody will read only part of a renamed file) across platforms.

comment:10 by warren@…, 16 years ago

I actually like the renaming solution better. My solution leaves the 0-byte lock file hanging around, which is messy. It may also be missed by session-sweepers if their pattern is really strict.

I'll work on a renaming solution.

comment:11 by Malcolm Tredinnick, 16 years ago

Warren, don't worry about it. I'm already working on, because testing showed a few other concerns: we needed to unlink the empty files anyway, so the performance of your patch wasn't better (if we don't unlink, we use up inodes and dirents and eventually fill up the limit of the number of files in the directory). Things were starting to get messy. So I've nearly got the moving version working fine (it's only a few lines) and that saves the need for the unlinking step.

So let's not duplicate work here, I'm already doing this, in between other juggling. But thanks for the sanity check. :-)

comment:12 by Malcolm Tredinnick, 16 years ago

Resolution: fixed
Status: assignedclosed

(In [8688]) Fixed #8616 -- Fixed a race condition in the file-based session backend.
Thanks to warren@… for the patch.

comment:13 by Malcolm Tredinnick, 16 years ago

I ended up reverting to the locking solution. The problem is that we still have to reserve the final destination name in the case of "must create a new session" and it began to get nearly as complicated. Your patch seems to work well, so thanks for all the effort on this.

comment:14 by anonymous, 16 years ago

Resolution: fixed
Status: closedreopened

I thought a lot about this last night and came to the same conclusion.

I just updated to the latest trunk and my test app doesn't work at all now. It seems to not be saving the session data at all.

I have isolated the breakage to rev 8688, which is where you fixed this ticket.

I have further isolated the problem to the unlinking of the lock files. The load() method expects the lock file to exist when it opens it (mode = "rb"). I changed the mode to "wb", just like the save() method, but no joy.

As I think more about this, unlinking the lock file creates yet another race condition. The whole point of the lock file is that it is a shared locking context. Having each request unlink it destroys the shared context and may create the situation where there are several lock file handles in existence at once with only one of them being linked to the lock file name on the file system. How this is causing my session data to be lost is beyond me.

I understand the issue with increased inode usage with the 0-byte lock files hanging around, but I don't see any easy way around it. I sure wish the O_SHLOCK and O_EXLOCK flags existed on all platforms. That is the elegant solution.

Since the problem we are trying to solve only manifests itself under certain circumstances, maybe we should make the locking optional. However, I'm inclined to err on the side of caution and guarantee session file integrity under *any* conditions.

Your call.

in reply to:  14 comment:15 by warren@…, 16 years ago

Oops. The previous comment was mine. I forgot to update the username field.

in reply to:  14 comment:16 by warren@…, 16 years ago

Replying to anonymous:

As I think more about this, unlinking the lock file creates yet another race condition. The whole point of the lock file is that it is a shared locking context. Having each request unlink it destroys the shared context and may create the situation where there are several lock file handles in existence at once with only one of them being linked to the lock file name on the file system. How this is causing my session data to be lost is beyond me.

I mis-spoke. The situation I was thinking about is where several lock *files* exist because they are open but another thread/process has unlinked them. I don't know enough about how the low-level I/O works to know if this is a possibility.

comment:17 by warren@…, 16 years ago

I just got an email from another user thanking me for the patch that broke file-based sessions. :-(

I appreciate the sole attribution, but not when the patch is not completely mine. :-)

Revision 8688 should probably be reverted while we come up with a better solution to the original problem.

Since I can reliably reproduce the original race condition, perhaps I should test future patches before they are committed to trunk.

I don't have the environment to run the django regression tests. In my application, django is not using a database at all (though may application is). Do the tests cover the basic functionality of file-based sessions? If so, and they are passing, perhaps this other user and I have a configuration combination that is not covered by the tests.

comment:18 by Malcolm Tredinnick, 16 years ago

A few things here.

Firstly, sorry that you got mail from somebody without the manners to understand how Open Source works. Unfortunately, that happens from time to time. Unpleasant, I agree.

I don't really understand your argument about unlinking. We release the lock before unlinking and unlinking just reduces the link count, so the file isn't removed until the last user has gone away (and unlocked it first). So I don't quite see the failure mode you're talking about.

Leaving 0 bytes files around is simply not an option. It will eventually fill up the directory's dirent records to the point that you cannot create any more files, which is very uncool. We can't replace one bug with a different one.

I'll back out the patch, try to understand what's going on and possibly resurrect the "move" version, ugly as it may be. Since you do have a nice failing case, I'll attach the patch here when I'm done and get a thumbs up from you before it goes in. I couldn't repeat the issue yesterday at all.

comment:19 by Malcolm Tredinnick, 16 years ago

(In [8707]) Reverted #8688 for now, since it merely introduced different bugs, rather than
fixing the problem. We have a plan B (and plan C, if needed), so this will be
fixed in a different way.

Refs #8616.

comment:20 by Jacob, 16 years ago

Owner: changed from Malcolm Tredinnick to Jacob
Status: reopenednew

comment:21 by Jacob, 16 years ago

Status: newassigned

comment:22 by warren@…, 16 years ago

Replying to mtredinnick:

Leaving 0 bytes files around is simply not an option. It will eventually fill up the directory's dirent records to the point that you cannot create any more files, which is very uncool. We can't replace one bug with a different one.

I agree that we cannot replace one bug with a different one and I definitely understand your concern about leaving the lock files around indefinitely.

However, with my patch, the django-controlled lifetime of the lock files was the same as the session file itself. I unlinked them in the delete() method. So, the number of files would grow twice as fast as before, but the growth would be linear. The only caveat I can think of is the one I mentioned earlier: external session sweeper code that ignores the lock file. I tried to mitigate this by purposely keeping the lock file name very similar to the lock itself.

I'm doing some more thinking about why the unlinking is causing problems. I suspect a race condition, but I'm not versed enough in the low-level I/O stuff to know for sure.

I agree that my patch was not perfect. Your judicious addition of try:finally blocks was definitely needed. Were we to proceed with the long-lived lock file solution, the delete() method would need to insure that an exception generated by unlinking the session file did not cause the unlinking of the lock file to never be attempted.

In the end, though, if we can make the rename solution workable, that would be preferable.

in reply to:  22 comment:23 by warren@…, 16 years ago

Replying to warren@wandrsmith.net:

However, with my patch, the django-controlled lifetime of the lock files was the same as the session file itself. I unlinked them in the delete() method. So, the number of files would grow twice as fast as before, but the growth would be linear. The only caveat I can think of is the one I mentioned earlier: external session sweeper code that ignores the lock file. I tried to mitigate this by purposely keeping the lock file name very similar to the lock itself.

That last sentence should be "...very similar to the *session file* itself."

comment:24 by Jacob, 16 years ago

Warren, since you're the one with the test case that can reproduce this, can you give the "move" method a shot and post a patch? I'm going to take over for Malcolm on this, but it'll be easier for me if you get something working that I can review -- it's really hard to do this "blind".

comment:25 by warren@…, 16 years ago

Sure. I'll see what I can come up with. I did some thinking on this last night, so I have a head start.

I've done some more thinking on why unlinking the lock file caused problems. I don't have a deep understanding of low-level I/O or how the locking is implemented. However, my shallow understanding is that there are several parts to a file when we have it open:

  • The file descriptor in memory.
  • The file itself.
  • The link to that file in the file system.
  • Perhaps others I am ignorant of...

I'm not sure where the lock information is actually stored. The man page for flock() says that the lock is on the file, not the file descriptor. But what does it mean by "file"?. The fact that the locking call requires a file descriptor and not a file path seems to indicate that the lock information is part of the file itself. If so, I can envision this scenario:

  • Request A.load() opens lockfile1, creating it.
  • Request A.load() acquires a shared lock on lockfile1.
  • Request A.load() opens the session file, reads the session data, and closes it.
  • Request A.load() closes lockfile1, releasing the lock.
    • Request B.load() opens lockfile1.
    • Request B.load() acquires a shared lock on lockfile1.
  • Request A.load() unlinks lockfile1. This will succeed on UNIX, but probably fail on Windows because the file is in use by Request B. On UNIX, this removes the file system link to lockfile1. Subsequent attempts to open it without creating it will fail. However, because Request B still has an open reference to it, the original file is not yet completely gone.
  • Request A.save() opens lockfile2, creating it (because the link to lockfile1 has been removed).
  • Request A.save() acquires an exclusive lock on lockfile2 (because the shared lock that Request B is holding is on lockfile1, not lockfile2).
  • Request A.save() opens the session file for writing, truncating it.
    • Request B.load() opens the session file, reads the (now empty) session data, and closes it.
    • Request B.load() attempts to decode the session data and fails.

From here on I don't have a good grasp of everything that will happen and why the results are so much more severe (at least in my application) than the original problem.

However, if my assumptions are correct, unlinking the lock file can create the situation where there are multiple simultaneous locking contexts that are unaware of each other, thus defeating the whole purpose of the lock.

In the end, I would prefer that we not have to do any locking at all. I'll try to fix the original problem using os.rename() in the save() method.

by warren@…, 16 years ago

I/O Race condition fix that does not rely on file locking

comment:26 by warren@…, 16 years ago

I have added a patch that does not depend on file locking.

I have tested it against my application that was failing and it seems to be as reliable as the locking solution and slightly faster.

However, it is possible that I may have broken some of the recent fixes to other problems.

Please look the code over carefully and test it as much as is feasible in other contexts.

comment:27 by Jacob, 16 years ago

Resolution: fixed
Status: assignedclosed

(In [8750]) Fixed #8616 (again): prevent a race condition in the session file backend. Many thanks to Warren Smith for help and the eventual fix.

comment:28 by warren@…, 16 years ago

Resolution: fixed
Status: closedreopened

I recognize the value of the slight modifications made to my patch. An exception should not be generated if all goes well.

However, there is a slight problem with the changes in #8750. Because the "renamed" variable is initialized after the try:finally around the encoding and writing of the file, if there is any problem there, we will get "NameError: name 'renamed' is not defined" in the finally: block of the enclosing try:.

I am submitting a patch that fixes this.

by warren@…, 16 years ago

Attachment: name_error_fix.diff added

Fix potential NameError in save().

in reply to:  28 comment:29 by warren@…, 16 years ago

Replying to warren@wandrsmith.net:

However, there is a slight problem with the changes in #8750.

Oops. Make that [8750].

comment:30 by Jacob, 16 years ago

Resolution: fixed
Status: reopenedclosed

Fixed in [8812].

comment:31 by wiser0, 14 years ago

milestone: 1.0
Resolution: fixed
Status: closedreopened
Triage Stage: AcceptedUnreviewed
Version: SVN1.1

On Windows (XP) with Python 2.6.3, the session files were always zero bytes in length. I created the following patch:

--- D:/python26/Lib/site-packages/django-1.1-py2.6.egg/django/contrib/sessions/backends/file.py	Tue Oct 20 16:58:59 2009
+++ D:/python26/Lib/site-packages/django-1.1-py2.6.egg/django/contrib/sessions/backends/file_NEW.py	Tue Oct 20 16:58:34 2009
@@ -119,6 +119,11 @@
                     os.write(output_file_fd, self.encode(session_data))
                 finally:
                     os.close(output_file_fd)
+                
+                # On Windows, old file must be removed or rename() will fail.
+                if os.name == 'nt':
+                    os.unlink(session_file_name)
+
                 os.rename(output_file_name, session_file_name)
                 renamed = True
             finally:

Apologies if this is not the correct place for this. It is my first attempt at contributing to Django.

  • Randy

by wiser0, 14 years ago

Attachment: file.py.diff added

Windows specific patch for zero length session files.

comment:32 by wiser0, 14 years ago

Cc: wiser0 added
Needs tests: set

comment:33 by Karen Tracey, 14 years ago

Resolution: fixed
Status: reopenedclosed
Triage Stage: UnreviewedAccepted
Version: 1.1SVN

This problem was fixed. #9084 is open for the issue of file-based sessions not working on Windows.

Note: See TracTickets for help on using tickets.
Back to Top