Opened 7 years ago

Closed 6 years ago

#8616 closed (fixed)

File-based sessions dropped with SESSION_SAVE_EVERY_REQUEST = True

Reported by: warren@… Owned by: jacob
Component: contrib.sessions Version: master
Severity: Keywords: session, file, race
Cc: wiser0 Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: yes Patch needs improvement: no
Easy pickings: UI/UX:

Description

When using django.contrib.sessions.backends.file with SESSION_SAVE_EVERY_REQUEST = True, repeated requests for the same session that arrive very close together will cause corruption in the session file and result in a dropped session. The recent changes in the session framework went a long ways in resolving this issue, but not quite far enough.

To resolve the problem, I tried using django.core.files.locks to lock the file when open for reading or writing. However, this proved not sufficiently atomic. The problem can be resolved using the atomic file locking flags when opening the session file for reading or writing. Unfortunately, these are not available on all platforms. The attached patch uses them if they are available. This fixes the problem on platforms that support the atomic file locking flags.

Attachments (5)

session_file_read_write_locking.diff (1.4 KB) - added by warren@… 7 years ago.
cross_platform_session_file_read_write_locking.diff (2.7 KB) - added by warren@… 7 years ago.
session_file_io_race_fix.diff (3.8 KB) - added by warren@… 7 years ago.
I/O Race condition fix that does not rely on file locking
name_error_fix.diff (754 bytes) - added by warren@… 7 years ago.
Fix potential NameError in save().
file.py.diff (719 bytes) - added by wiser0 6 years ago.
Windows specific patch for zero length session files.

Download all attachments as: .zip

Change History (38)

Changed 7 years ago by warren@…

comment:1 Changed 7 years ago by mtredinnick

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

Can you please explain the race you are trying to fix here. I want to make sure we're trying to solve the right problem.

comment:2 Changed 7 years ago by jacob

  • Triage Stage changed from Unreviewed to Accepted

comment:3 Changed 7 years ago by warren@…

The problem (I think) I am seeing is this:

  1. Request A opens the file for reading, reads the session data, and closes the file.
  2. Request A opens the session file for writing, truncating the file.
  3. Request A starts to write the session data to the file.
  4. Request B opens the file for reading, reads the partial data from the file, and closes the file.
  5. Request A finishes writing the session data to the file.
  6. Request B attempts to decode the partial session data and fails.
  7. Request B creates a new empty session.
  8. Code executed by Request B and subsequent requests fails because of missing session data.

I encountered this problem before the recent session changes, so the chain of events above may be incorrect. The core of the issue is (I think) that while the session data is being written to the file, other requests may read the file before the write is complete.

This seems to manifest itself only when all of these conditions exist:

  • SESSION_SAVE_EVERY_REQUEST = True causes the session data to be written on every request.
  • Django is not running under the development server so the requests are not serialized.
  • Multiple requests from the same session arrive very close together (as can happen with AJAX requests).

Before the recent session changes, I had tried using the django.core.files.locks module to lock the file appropriately (shared for read, exclusive for write) immediately after the file was opened, but I still got failures (not sure why). It was only when I used the locking flags to os.open() that the problem went away. Unfortunately, the O_SHLOCK and O_EXLOCK flags seem to be BSD specific, so while they are available on my Mac, they may not be under other flavors of *nix. I looked at Python 2.5.1 running on RHEL 4 and they were not there. More recent version of Linux may or may not have them.

It is possible that other factors, that may have been fixed by the recent session changes, caused the failures when I tried the django file locking before. I'll give it another go.

comment:4 Changed 7 years ago by warren@…

*Previous post changed to use wiki formatting*

The problem (I think) I am seeing is this:

  1. Request A opens the file for reading, reads the session data, and closes the file.
  2. Request A opens the session file for writing, truncating the file.
  3. Request A starts to write the session data to the file.
  4. Request B opens the file for reading, reads the partial data from the file, and closes the file.
  5. Request A finishes writing the session data to the file.
  6. Request B attempts to decode the partial session data and fails.
  7. Request B creates a new empty session.
  8. Code executed by Request B and subsequent requests fails because of missing session data.

I encountered this problem before the recent session changes, so the chain of events above may be incorrect. The core of the issue is (I think) that while the session data is being written to the file, other requests may read the file before the write is complete.

This seems to manifest itself only when all of these conditions exist:

  • SESSION_SAVE_EVERY_REQUEST = True causes the session data to be written on every request.
  • Django is not running under the development server so the requests are not serialized.
  • Multiple requests from the same session arrive very close together (as can happen with AJAX requests).

Before the recent session changes, I had tried using the django.core.files.locks module to lock the file appropriately (shared for read, exclusive for write) immediately after the file was opened, but I still got failures (not sure why). It was only when I used the locking flags to os.open() that the problem went away. Unfortunately, the O_SHLOCK and O_EXLOCK flags seem to be BSD specific, so while they are available on my Mac, they may not be under other flavors of *nix. I looked at Python 2.5.1 running on RHEL 4 and they were not there. More recent version of Linux may or may not have them.

It is possible that other factors, that may have been fixed by the recent session changes, caused the failures when I tried the django file locking before. I'll give it another go.

comment:5 Changed 7 years ago by warren@…

  • Summary changed from Session file corrupted with SESSION_SAVE_EVERY_REQUEST = True to File-based sessions dropped with SESSION_SAVE_EVERY_REQUEST = True

On further research it appears that the session file is not corrupted, but just appears to be corrupted because it is read at the same time another request is writing it. Because the failed attempt by the second request to decode the session data results in the generation of a new session key, the (now empty) session data is written to a new file, not the old one. Requests that use the original session id get the original session data intact, so long as another request is not in the process of writing it.


comment:6 Changed 7 years ago by warren@…

I tried using django.core.files.locks again and got similar results as before. As I think about it, perhaps this is what is happening:

  1. Request A opens the file for writing, truncating the file.
  2. Request B opens the file for reading.
  3. Request B obtains a shared lock on the file.
  4. Request B reads the empty file.
  5. Request B closes the file, releasing the shared lock.
  6. Request A obtains an exclusive lock on the file, writes the session data, and closes the file, releasing the exclusive lock.
  7. Request B is left with no data from the empty file it read.

Apparently, using the O_SHLOCK and O_EXLOCK flags prevents the file from being truncated until the exclusive lock has been obtained. Can this behavior be duplicated somehow?

comment:7 Changed 7 years ago by warren@…

I have resolved the issue with using django.core.files.locks to lock the session file for reading and writing.

The solution is to separate the locking semantics into a separate lock file that is opened and locked before opening the session file and closed after closing the session file. This allows the non-atomicity of opening and locking a file to become irrelevant, since no I/O operations are actually performed on the lock file.

I am submitting a second patch that implements this alternate locking method. This patch should work cross-platform, though I have only tested it on Mac OSX.

comment:8 Changed 7 years ago by mtredinnick

  • Owner changed from nobody to mtredinnick
  • Status changed from new to assigned

comment:9 Changed 7 years ago by mtredinnick

It's tempting to engage in a bikeshed painting operation here just because I thought of a different way, but since you've tested this and have a failing case to work against, I'm going with your patch.

In case this does give problems down the track, though, here's an alternate solution: we write the new session to a different filename, so the only contention is on the write side (which blocks due to the exclusive write lock). Then we rename it to the original name upon completion. Renaming is atomic (in the sense that nobody will read only part of a renamed file) across platforms.

comment:10 Changed 7 years ago by warren@…

I actually like the renaming solution better. My solution leaves the 0-byte lock file hanging around, which is messy. It may also be missed by session-sweepers if their pattern is really strict.

I'll work on a renaming solution.

comment:11 Changed 7 years ago by mtredinnick

Warren, don't worry about it. I'm already working on, because testing showed a few other concerns: we needed to unlink the empty files anyway, so the performance of your patch wasn't better (if we don't unlink, we use up inodes and dirents and eventually fill up the limit of the number of files in the directory). Things were starting to get messy. So I've nearly got the moving version working fine (it's only a few lines) and that saves the need for the unlinking step.

So let's not duplicate work here, I'm already doing this, in between other juggling. But thanks for the sanity check. :-)

comment:12 Changed 7 years ago by mtredinnick

  • Resolution set to fixed
  • Status changed from assigned to closed

(In [8688]) Fixed #8616 -- Fixed a race condition in the file-based session backend.
Thanks to warren@… for the patch.

comment:13 Changed 7 years ago by mtredinnick

I ended up reverting to the locking solution. The problem is that we still have to reserve the final destination name in the case of "must create a new session" and it began to get nearly as complicated. Your patch seems to work well, so thanks for all the effort on this.

comment:14 follow-ups: Changed 7 years ago by anonymous

  • Resolution fixed deleted
  • Status changed from closed to reopened

I thought a lot about this last night and came to the same conclusion.

I just updated to the latest trunk and my test app doesn't work at all now. It seems to not be saving the session data at all.

I have isolated the breakage to rev 8688, which is where you fixed this ticket.

I have further isolated the problem to the unlinking of the lock files. The load() method expects the lock file to exist when it opens it (mode = "rb"). I changed the mode to "wb", just like the save() method, but no joy.

As I think more about this, unlinking the lock file creates yet another race condition. The whole point of the lock file is that it is a shared locking context. Having each request unlink it destroys the shared context and may create the situation where there are several lock file handles in existence at once with only one of them being linked to the lock file name on the file system. How this is causing my session data to be lost is beyond me.

I understand the issue with increased inode usage with the 0-byte lock files hanging around, but I don't see any easy way around it. I sure wish the O_SHLOCK and O_EXLOCK flags existed on all platforms. That is the elegant solution.

Since the problem we are trying to solve only manifests itself under certain circumstances, maybe we should make the locking optional. However, I'm inclined to err on the side of caution and guarantee session file integrity under *any* conditions.

Your call.

comment:15 in reply to: ↑ 14 Changed 7 years ago by warren@…

Oops. The previous comment was mine. I forgot to update the username field.

comment:16 in reply to: ↑ 14 Changed 7 years ago by warren@…

Replying to anonymous:

As I think more about this, unlinking the lock file creates yet another race condition. The whole point of the lock file is that it is a shared locking context. Having each request unlink it destroys the shared context and may create the situation where there are several lock file handles in existence at once with only one of them being linked to the lock file name on the file system. How this is causing my session data to be lost is beyond me.

I mis-spoke. The situation I was thinking about is where several lock *files* exist because they are open but another thread/process has unlinked them. I don't know enough about how the low-level I/O works to know if this is a possibility.

comment:17 Changed 7 years ago by warren@…

I just got an email from another user thanking me for the patch that broke file-based sessions. :-(

I appreciate the sole attribution, but not when the patch is not completely mine. :-)

Revision 8688 should probably be reverted while we come up with a better solution to the original problem.

Since I can reliably reproduce the original race condition, perhaps I should test future patches before they are committed to trunk.

I don't have the environment to run the django regression tests. In my application, django is not using a database at all (though may application is). Do the tests cover the basic functionality of file-based sessions? If so, and they are passing, perhaps this other user and I have a configuration combination that is not covered by the tests.

comment:18 Changed 7 years ago by mtredinnick

A few things here.

Firstly, sorry that you got mail from somebody without the manners to understand how Open Source works. Unfortunately, that happens from time to time. Unpleasant, I agree.

I don't really understand your argument about unlinking. We release the lock before unlinking and unlinking just reduces the link count, so the file isn't removed until the last user has gone away (and unlocked it first). So I don't quite see the failure mode you're talking about.

Leaving 0 bytes files around is simply not an option. It will eventually fill up the directory's dirent records to the point that you cannot create any more files, which is very uncool. We can't replace one bug with a different one.

I'll back out the patch, try to understand what's going on and possibly resurrect the "move" version, ugly as it may be. Since you do have a nice failing case, I'll attach the patch here when I'm done and get a thumbs up from you before it goes in. I couldn't repeat the issue yesterday at all.

comment:19 Changed 7 years ago by mtredinnick

(In [8707]) Reverted #8688 for now, since it merely introduced different bugs, rather than
fixing the problem. We have a plan B (and plan C, if needed), so this will be
fixed in a different way.

Refs #8616.

comment:20 Changed 7 years ago by jacob

  • Owner changed from mtredinnick to jacob
  • Status changed from reopened to new

comment:21 Changed 7 years ago by jacob

  • Status changed from new to assigned

comment:22 follow-up: Changed 7 years ago by warren@…

Replying to mtredinnick:

Leaving 0 bytes files around is simply not an option. It will eventually fill up the directory's dirent records to the point that you cannot create any more files, which is very uncool. We can't replace one bug with a different one.

I agree that we cannot replace one bug with a different one and I definitely understand your concern about leaving the lock files around indefinitely.

However, with my patch, the django-controlled lifetime of the lock files was the same as the session file itself. I unlinked them in the delete() method. So, the number of files would grow twice as fast as before, but the growth would be linear. The only caveat I can think of is the one I mentioned earlier: external session sweeper code that ignores the lock file. I tried to mitigate this by purposely keeping the lock file name very similar to the lock itself.

I'm doing some more thinking about why the unlinking is causing problems. I suspect a race condition, but I'm not versed enough in the low-level I/O stuff to know for sure.

I agree that my patch was not perfect. Your judicious addition of try:finally blocks was definitely needed. Were we to proceed with the long-lived lock file solution, the delete() method would need to insure that an exception generated by unlinking the session file did not cause the unlinking of the lock file to never be attempted.

In the end, though, if we can make the rename solution workable, that would be preferable.

comment:23 in reply to: ↑ 22 Changed 7 years ago by warren@…

Replying to warren@wandrsmith.net:

However, with my patch, the django-controlled lifetime of the lock files was the same as the session file itself. I unlinked them in the delete() method. So, the number of files would grow twice as fast as before, but the growth would be linear. The only caveat I can think of is the one I mentioned earlier: external session sweeper code that ignores the lock file. I tried to mitigate this by purposely keeping the lock file name very similar to the lock itself.

That last sentence should be "...very similar to the *session file* itself."

comment:24 Changed 7 years ago by jacob

Warren, since you're the one with the test case that can reproduce this, can you give the "move" method a shot and post a patch? I'm going to take over for Malcolm on this, but it'll be easier for me if you get something working that I can review -- it's really hard to do this "blind".

comment:25 Changed 7 years ago by warren@…

Sure. I'll see what I can come up with. I did some thinking on this last night, so I have a head start.

I've done some more thinking on why unlinking the lock file caused problems. I don't have a deep understanding of low-level I/O or how the locking is implemented. However, my shallow understanding is that there are several parts to a file when we have it open:

  • The file descriptor in memory.
  • The file itself.
  • The link to that file in the file system.
  • Perhaps others I am ignorant of...

I'm not sure where the lock information is actually stored. The man page for flock() says that the lock is on the file, not the file descriptor. But what does it mean by "file"?. The fact that the locking call requires a file descriptor and not a file path seems to indicate that the lock information is part of the file itself. If so, I can envision this scenario:

  • Request A.load() opens lockfile1, creating it.
  • Request A.load() acquires a shared lock on lockfile1.
  • Request A.load() opens the session file, reads the session data, and closes it.
  • Request A.load() closes lockfile1, releasing the lock.
    • Request B.load() opens lockfile1.
    • Request B.load() acquires a shared lock on lockfile1.
  • Request A.load() unlinks lockfile1. This will succeed on UNIX, but probably fail on Windows because the file is in use by Request B. On UNIX, this removes the file system link to lockfile1. Subsequent attempts to open it without creating it will fail. However, because Request B still has an open reference to it, the original file is not yet completely gone.
  • Request A.save() opens lockfile2, creating it (because the link to lockfile1 has been removed).
  • Request A.save() acquires an exclusive lock on lockfile2 (because the shared lock that Request B is holding is on lockfile1, not lockfile2).
  • Request A.save() opens the session file for writing, truncating it.
    • Request B.load() opens the session file, reads the (now empty) session data, and closes it.
    • Request B.load() attempts to decode the session data and fails.

From here on I don't have a good grasp of everything that will happen and why the results are so much more severe (at least in my application) than the original problem.

However, if my assumptions are correct, unlinking the lock file can create the situation where there are multiple simultaneous locking contexts that are unaware of each other, thus defeating the whole purpose of the lock.

In the end, I would prefer that we not have to do any locking at all. I'll try to fix the original problem using os.rename() in the save() method.

Changed 7 years ago by warren@…

I/O Race condition fix that does not rely on file locking

comment:26 Changed 7 years ago by warren@…

I have added a patch that does not depend on file locking.

I have tested it against my application that was failing and it seems to be as reliable as the locking solution and slightly faster.

However, it is possible that I may have broken some of the recent fixes to other problems.

Please look the code over carefully and test it as much as is feasible in other contexts.

comment:27 Changed 7 years ago by jacob

  • Resolution set to fixed
  • Status changed from assigned to closed

(In [8750]) Fixed #8616 (again): prevent a race condition in the session file backend. Many thanks to Warren Smith for help and the eventual fix.

comment:28 follow-up: Changed 7 years ago by warren@…

  • Resolution fixed deleted
  • Status changed from closed to reopened

I recognize the value of the slight modifications made to my patch. An exception should not be generated if all goes well.

However, there is a slight problem with the changes in #8750. Because the "renamed" variable is initialized after the try:finally around the encoding and writing of the file, if there is any problem there, we will get "NameError: name 'renamed' is not defined" in the finally: block of the enclosing try:.

I am submitting a patch that fixes this.

Changed 7 years ago by warren@…

Fix potential NameError in save().

comment:29 in reply to: ↑ 28 Changed 7 years ago by warren@…

Replying to warren@wandrsmith.net:

However, there is a slight problem with the changes in #8750.

Oops. Make that [8750].

comment:30 Changed 7 years ago by jacob

  • Resolution set to fixed
  • Status changed from reopened to closed

Fixed in [8812].

comment:31 Changed 6 years ago by wiser0

  • milestone 1.0 deleted
  • Resolution fixed deleted
  • Status changed from closed to reopened
  • Triage Stage changed from Accepted to Unreviewed
  • Version changed from SVN to 1.1

On Windows (XP) with Python 2.6.3, the session files were always zero bytes in length. I created the following patch:

--- D:/python26/Lib/site-packages/django-1.1-py2.6.egg/django/contrib/sessions/backends/file.py	Tue Oct 20 16:58:59 2009
+++ D:/python26/Lib/site-packages/django-1.1-py2.6.egg/django/contrib/sessions/backends/file_NEW.py	Tue Oct 20 16:58:34 2009
@@ -119,6 +119,11 @@
                     os.write(output_file_fd, self.encode(session_data))
                 finally:
                     os.close(output_file_fd)
+                
+                # On Windows, old file must be removed or rename() will fail.
+                if os.name == 'nt':
+                    os.unlink(session_file_name)
+
                 os.rename(output_file_name, session_file_name)
                 renamed = True
             finally:

Apologies if this is not the correct place for this. It is my first attempt at contributing to Django.

  • Randy

Changed 6 years ago by wiser0

Windows specific patch for zero length session files.

comment:32 Changed 6 years ago by wiser0

  • Cc wiser0 added
  • Needs tests set

comment:33 Changed 6 years ago by kmtracey

  • Resolution set to fixed
  • Status changed from reopened to closed
  • Triage Stage changed from Unreviewed to Accepted
  • Version changed from 1.1 to SVN

This problem was fixed. #9084 is open for the issue of file-based sessions not working on Windows.

Note: See TracTickets for help on using tickets.
Back to Top