Opened 16 years ago
Last modified 13 days ago
#12157 new Cleanup/optimization
FileSystemStorage does file I/O inefficiently, despite providing options to permit larger blocksizes
| Reported by: | alecmuffett | Owned by: | nobody |
|---|---|---|---|
| Component: | File uploads/storage | Version: | 1.1 |
| Severity: | Normal | Keywords: | io, FileSystemStorage, buffering, performance |
| Cc: | alec.muffett@… | Triage Stage: | Accepted |
| Has patch: | no | Needs documentation: | no |
| Needs tests: | no | Patch needs improvement: | no |
| Easy pickings: | no | UI/UX: | no |
Description
FileSystemStorage contains the following:
def _open(self, name, mode='rb'):
    return File(open(self.path(name), mode))
...which is used to open files stored as FileFields on Django models.
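For context, the usual access path looks something like this (a minimal sketch; the storage location and filename are illustrative):

from django.core.files.storage import FileSystemStorage

storage = FileSystemStorage()
# Storage.open() delegates to the _open() shown above, so the
# returned File wraps a plain built-in open() with default buffering.
f = storage.open('file.jpg')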
If the programmer decides to hack through the file by using (for instance) the django.core.files.base.File.chunks() method:
def chunks(self, chunk_size=None):
    """
    Read the file and yield chunks of ``chunk_size`` bytes (defaults to
    ``UploadedFile.DEFAULT_CHUNK_SIZE``).
    """
    if not chunk_size:
        chunk_size = self.__class__.DEFAULT_CHUNK_SIZE
    if hasattr(self, 'seek'):
        self.seek(0)
    # Assume the pointer is at zero...
    counter = self.size
    while counter > 0:
        yield self.read(chunk_size)
        counter -= chunk_size
...the programmer would expect self.read() - which drops through to django.core.files.base.File.read() - to honour its arguments, so that the I/O occurs in DEFAULT_CHUNK_SIZE blocks (currently 64 KB); however, DTrace shows otherwise:
29830/0xaf465d0: open_nocancel("file.jpg\0", 0x0, 0x1B6) = 5 0
29830/0xaf465d0: fstat(0x5, 0xB007DB60, 0x1B6) = 0
29830/0xaf465d0: fstat64(0x5, 0xB007E1E4, 0x1B6) = 0 0
29830/0xaf465d0: lseek(0x5, 0x0, 0x1) = 0 0
29830/0xaf465d0: lseek(0x5, 0x0, 0x0) = 0 0
29830/0xaf465d0: stat("file.jpg\0", 0xB007DF7C, 0x0) = 0 0
29830/0xaf465d0: write_nocancel(0x1, "65536 113762\n\0", 0xD) = 13 0
29830/0xaf465d0: mmap(0x0, 0x11000, 0x3, 0x1002, 0x3000000, 0x0) = 0x7C5000 0
29830/0xaf465d0: read_nocancel(0x5, "\377\330\377\340\0", 0x1000) = 4096 0
29830/0xaf465d0: read_nocancel(0x5, "\333\035eS[\026+\360\215Q\361'I\304c`\352\v4M\272C\201\273\261\377\0", 0x1000) = 4096 0
...
...(many more 4kb reads elided)...
...
29830/0xaf465d0: sendto(0x4, 0x7C5014, 0x10000) = 65536 0
...i.e. reading in chunks of 4 KB (the stdio default on OS X) but writing in 64 KB blocks.
The reason this occurs is that "open(self.path(name), mode)" is used to open the file, which invokes the default libc stdio buffering - much smaller than the 64 KB requested by the programmer.
This can be kludged-around by hacking the open() statement:
def _open(self, name, mode='rb'):
    return File(open(self.path(name), mode, 65536))  # use a larger buffer
...or by avoiding the stdio file()/open() calls altogether and using os.open() instead (see the sketch below).
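A minimal sketch of the os.open() route, assuming read-only access (the subclass name and the 64 KB figure are illustrative, not a tested patch):

import os

from django.core.files import File
from django.core.files.storage import FileSystemStorage

class LargeBufferStorage(FileSystemStorage):
    def _open(self, name, mode='rb'):
        # Open the descriptor directly, then wrap it with an explicit
        # 64 KB buffer rather than the small stdio default.
        # (Read-only sketch: a real patch would map ``mode`` onto the
        # appropriate os.O_* flags.)
        fd = os.open(self.path(name), os.O_RDONLY)
        return File(os.fdopen(fd, mode, 65536))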
In the meantime this means that Django is not handling FileSystemStorage reads efficiently.
It is not easy to determine whether this general stdio-buffering issue impacts other parts of Django's performance.
Change History (7)
comment:1 by , 16 years ago
| Triage Stage: | Unreviewed → Accepted |
|---|
comment:2 by , 16 years ago
comment:3 by , 15 years ago
| Severity: | → Normal |
|---|---|
| Type: | → Cleanup/optimization |
comment:4 by , 15 years ago
| Has patch: | unset |
|---|
comment:7 by , 13 days ago
Obviously this is an old ticket, but I was looking over the existing notes and the codebase to see what's possible:
Part of the complexity in implementing this ticket properly seems to revolve around the lack of a parameter (or setting?) to pass a buffering policy from the Storage.open API through to the internal Storage._open method. Developers also often don't call Storage.open directly, and instead rely on the lazy-open pattern from FieldFile, etc. By the time File.chunks is called with its chunk_size parameter (or default setting), it seems too late to adjust/affect the buffering policy, since the file was already opened and returned inside the File instance; see the illustration below.
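For example (assuming a hypothetical model with a FileField named attachment):

obj = MyModel.objects.get(pk=1)
# Iterating chunks() opens the file lazily via the storage backend;
# chunk_size only controls how much each iteration yields, not the
# buffer size the file was opened with.
for chunk in obj.attachment.chunks(chunk_size=65536):
    process(chunk)  # hypothetical consumer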
If we tried to just set the buffering parameter inside FileSystemStorage._open as previously suggested, there's also the challenge of (possibly) adjusting the buffering depending on the file mode used. Based on https://docs.python.org/3/library/functions.html#open, open()'s buffering policy/heuristic varies between binary and text mode in ways that a fixed (say, 64 KB) buffer doesn't account for.
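For example, per those docs, the same buffering argument means different things depending on the mode:

open(path, 'rb', buffering=0)      # unbuffered; binary mode only
open(path, 'rb', buffering=65536)  # fixed-size buffer, binary mode
open(path, 'r', buffering=1)       # line buffering; text mode only
open(path, 'r')                    # heuristic default (device block size
                                   # or io.DEFAULT_BUFFER_SIZE)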
If a model FieldFile is used to access the underlying storage file, it often calls Storage.open implicitly. However, developers can call FieldFile.open explicitly too. To provide buffering there, I think we'd have to expand the Storage.open and File.open API definitions as well; a rough sketch follows. I'm not sure of the appetite for this kind of change, though.
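A rough sketch of what threading such a parameter through might look like (this is not Django's actual API; the buffering parameter is hypothetical):

from django.core.files import File
from django.core.files.storage import FileSystemStorage

class BufferedFileSystemStorage(FileSystemStorage):
    def open(self, name, mode='rb', buffering=-1):
        # Hypothetical: extend the public API so callers (including
        # FieldFile.open) could forward a buffering policy...
        return self._open(name, mode, buffering=buffering)

    def _open(self, name, mode='rb', buffering=-1):
        # ...and pass it straight through to the built-in open(),
        # where -1 means "use Python's default heuristic".
        return File(open(self.path(name), mode, buffering))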
Anyway, just thinking out loud and maybe it'll spark new thoughts, or at least recap the current state.
#9632 has a patch which may or may not affect this issue.