Code

Opened 6 years ago

Last modified 12 months ago

#8149 new Cleanup/optimization

UploadedFile doesn't iterate over lines with \r line endings

Reported by: mrmachine Owned by: nobody
Component: File uploads/storage Version: master
Severity: Normal Keywords: UploadedFile
Cc: real.human@…, timograham@… Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

UploadedFile.__iter__ loops over chunks and converts each chunk to a StringIO object and then iterates over that. Iterating over StringIO objects only splits on \n, and I'm not sure why we need StringIO at all. Couldn't we just do for line in chunk.splitlines() (which does split on \r and \n).

Attachments (2)

UploadedFile-r8225.diff (593 bytes) - added by mrmachine 6 years ago.
fix iteration of UploadedFile objects on newlines.
UploadedFile-r8226.diff (711 bytes) - added by mrmachine 6 years ago.
Updated for r8654 and fixed regular expression to not match empty strings.

Download all attachments as: .zip

Change History (16)

Changed 6 years ago by mrmachine

fix iteration of UploadedFile objects on newlines.

comment:1 Changed 6 years ago by mrmachine

  • Has patch set
  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

comment:2 Changed 6 years ago by mrmachine

  • Cc real.human@… added

comment:3 Changed 6 years ago by mrmachine

Looks like using splitlines will strip the line endings and break binary uploads. Here's a new patch that iterates through a call to re.findall with a simple regular expression instead of a StringIO object.

comment:4 Changed 6 years ago by mrmachine

Here's a use-case for this change. Upload a CSV file (which can have \r or \r\n line endings), and pass the UploadedFile object to csv.reader and you get Error: newline inside string when you try to iterate over it. When working with real files on the local file system you can open them with universal newline support, which is not available with StringIO or cStringIO objects.

comment:5 Changed 6 years ago by ericholscher

  • Triage Stage changed from Unreviewed to Accepted

Changed 6 years ago by mrmachine

Updated for r8654 and fixed regular expression to not match empty strings.

comment:6 Changed 6 years ago by mtredinnick

  • Resolution set to wontfix
  • Status changed from new to closed

Universal newline handling only makes sense for files generated/stored on the local system, where the newline convention is known. The UploadedFile class, pretty much by definition, is for data that has come from, and been generated on, a remote system that we know nothing about. So trying to use the appropriate line-ending handling is an unattainable goal for that situation. #8656 is the ticket to make sure this is documented.

comment:7 Changed 3 years ago by jacob

  • milestone 1.0 deleted

Milestone 1.0 deleted

comment:8 Changed 23 months ago by mrmachine

  • Component changed from Forms to File uploads/storage
  • Easy pickings unset
  • Resolution wontfix deleted
  • Severity set to Normal
  • Status changed from closed to reopened
  • Type set to Uncategorized
  • UI/UX unset

I'm re-opening this ticket because I have found a much simpler implementation, and this is still a problem for anyone who needs to parse uploaded text files line-by-line (e.g. CSV) when they can't know what line terminators end-users are using in their files. Excel 2011 on OS X is at least one commonly used program that still saves CSV data with \r line terminators by default.

Originally, there was a discussion between Malcolm and Jacob on IRC and the strongest objection at the time to fixing this was the ugly regular expression in my previous patch. This was also during a busy time, and it was decided that this should be a documentation issue.

I don't think the explanation given above when this ticket was wontfixed actually makes much sense. Uploaded file data which has come from and been generated on a remote system that we know nothing about is precisely the reason why we *need* universal newline support when iterating those files line-by-line.

I'm not sure how I missed it before, but when I said that we couldn't just use splitlines() because it consumed the line terminators, it seems I was wrong. splitlines() has a keepends argument that will actually do exactly what we need without any ugly regular expressions. Perhaps this wasn't the case back then, when we still had to support Python 2.3 and 2.4. I couldn't find confirmation in the Python 2.3 and 2.4 docs, but I did confirm that splitlines(True) works in Python 2.5 which is now our minimum version number.

I've fixed this in a branch on GitHub, and opened a pull request. All tests pass under OS X Lion with Python 2.7.2 and SQLite.

https://github.com/thirstydigital/django/tree/tickets/8149-universal-newline-uploads

https://github.com/django/django/pull/320

comment:9 Changed 23 months ago by claudep

  • Needs tests set

This __iter__ method appears to be untested.

Last edited 23 months ago by claudep (previous) (diff)

comment:10 Changed 23 months ago by mrmachine

Sorry. Don't know why I thought this didn't need tests. New pull request with test.

https://github.com/django/django/pull/323

comment:11 Changed 16 months ago by mrmachine

Updated versionchanged note to 1.6. Any feedback?

comment:12 Changed 16 months ago by aaugustin

  • Type changed from Uncategorized to Cleanup/optimization

comment:13 Changed 16 months ago by aaugustin

  • Status changed from reopened to new

comment:14 Changed 12 months ago by timo

  • Cc timograham@… added
  • Needs tests unset
  • Patch needs improvement set

The test doesn't pass on Python 3.

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as new
The owner will be changed from nobody to anonymous. Next status will be 'assigned'
as The resolution will be set. Next status will be 'closed'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.