Opened 14 years ago

Closed 14 years ago

Last modified 14 years ago

#12578 closed (invalid)

multipartparser.Parser does not accept non-canonical bare CR and bare LF

Reported by: Jacob Fenwick Owned by: nobody
Component: HTTP handling Version: 1.1
Severity: Keywords: jython
Cc: Leo Soto M. Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

According to RFC 2616, Section 3.7.1:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7.1
"HTTP applications MUST accept CRLF, bare CR, and bare LF as being representative of a line break in text media received via HTTP."

The Parser object in multipartparser can only parse canonical CRLF because of this line:
http://code.djangoproject.com/browser/django/trunk/django/http/multipartparser.py#L553

As a result, any data coming through the WSGI gateway that does not conform to canonical CRLF but is still considered valid by RFC 2616 be processed incorrectly.

Attachments (5)

multipartparser.diff (543 bytes ) - added by Jacob Fenwick 14 years ago.
code to print chunks to file
multipartdjangojythonwindows.hex (16.7 KB ) - added by Jacob Fenwick 14 years ago.
hex output from chunks printed to file on django-jython on runserver
multiparttomcatjythonwindows.hex (8.0 KB ) - added by Jacob Fenwick 14 years ago.
hex output from chunks printed out on django-jython on tomcat
multipartpythonwindows.hex (8.4 KB ) - added by Jacob Fenwick 14 years ago.
multipartpythonosx.hex (1.6 KB ) - added by Jacob Fenwick 14 years ago.

Download all attachments as: .zip

Change History (14)

comment:1 by Leo Soto M., 14 years ago

Component: Core frameworkHTTP handling
Keywords: jython added

Just want to point out that this problem is actually causing serious problems to people deploying Django apps on Jython on Windows platforms.

comment:2 by Leo Soto M., 14 years ago

Cc: Leo Soto M. added

comment:3 by Karen Tracey, 14 years ago

More details would be helpful here. From a brief look at the code referenced, I'm not at all sure that changing the one line identified would fix any problem. (Perhaps that wasn't intended to be implied by identifying one line, but that is how I read it). Further in that routine (line 579, for example), CRLF is used for splitting lines. So I suspect more than a single-line change would be needed to fix this.

I'm also not entirely sure the code in Django here is wrong, per the RFC. The name of this routine is parse_boundary_stream, implying to me it is responsible only for splitting a multipart message based on boundary markers. The very next section of the RFC cited (http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7.2) says, in reference to multipart types: "The message body is itself a protocol element and MUST therefore use only CRLF to represent line breaks between body-parts." Thus it sounds perfectly legit to me for Django to be requiring CRLF in this specific case.

So, I'd really like to get a better understanding of the exact form of the data that is causing problems here.

comment:4 by Leo Soto M., 14 years ago

Resolution: invalid
Status: newclosed

You are right Karen, Django is doing the right thing and something else is to blame. Marking the ticket as invalid.

comment:5 by Karen Tracey, 14 years ago

Note even if Django is correct per the RFC, if there are non-compliant clients in the wild that Django can't deal with because it is strictly enforcing the RFC then we might want to look at relaxing the Django implementation here. Particularly if there are other server side implementations that these clients can work properly with. Adhering to standards is great, but if it gets in the way of interoperating with real clients that don't have issues with other servers then I don't know we'd want to stick with a stance of "we're correct per the RFC". Hence the request for more information. It'd be good to know what these clients are (Barney's home-grown client used by exactly 8 machines on the internet or something a little more popular?) and some information about servers that they currently don't have trouble with.

comment:6 by Leo Soto M., 14 years ago

The problem occurs on deployments of Django 1.1 on Jython 2.5.1 with Tomcat as the application server on Windows systems.

The full thread on the observed problem is on <http://groups.google.com/group/django-jython-dev/browse_thread/thread/c1d5d718f8d322b1>. It may be a Jython bug, though but we haven't be able to confirm it yet.

comment:7 by Karen Tracey, 14 years ago

Ah, so it's not some client that is sending LF-only in the POST data, since I gather the clients in use here are standard browsers, which will be sending CRLFs. Interesting it only happens on Windows, where CRLF is the "normal" line ending. It sounds like some code that is passing along the post data is translating CRLFs to just LFs. Besides causing this problem that behavior has the potential for corrupting binary files uploaded to a Django server. All the code touching this data before feeding it to the Django code needs to be treating it as binary, not text, and not doing any type of line normalization.

by Jacob Fenwick, 14 years ago

Attachment: multipartparser.diff added

code to print chunks to file

by Jacob Fenwick, 14 years ago

hex output from chunks printed to file on django-jython on runserver

by Jacob Fenwick, 14 years ago

hex output from chunks printed out on django-jython on tomcat

by Jacob Fenwick, 14 years ago

Attachment: multipartpythonwindows.hex added

by Jacob Fenwick, 14 years ago

Attachment: multipartpythonosx.hex added

comment:8 by Jacob Fenwick, 14 years ago

I'm not sure what the real problem is, so rather than open a new ticket I will ask some questions here in the hope that someone can answer them.

The issue is that in on Windows, in Django-Jython on Tomcat, multipart data is not parsed correctly by the LazyStream in multipartparser.py.

As far as I can tell, this happens because at http://code.djangoproject.com/browser/django/trunk/django/http/multipartparser.py#L553 because the character CRLFCRLF is not found.

I did some experiments. I POSTed some data using the same app on four different Django platforms.
Here are the platforms I tested on, and the associated data that was output as hex:

Django on Python on runserver on Windows - multipartpythonwindows.hex
Django on Jython on runserver on Windows - multipartdjangojythonwindows.hex
Django on Jython on Tomcat on Windows - multiparttomcatjythonwindows.hex
Django on Python on runserver on OS X - multipartpythonosx.hex (note: I used a different Django app, but I believe the result would have been the same)

The data was dumped using the code in multipartparser.diff

Note: I ran the files I generated through hexdump -C file to generate the hex files from the data files I created.

According to http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7.2 the message body of the multipart should use CRLF as a line break between body-parts. If you look at multipartpythonwindows.hex you will see that what actually happens is all CRs are replaced with CRCR. This means that when the cited line in multipartparser.py is looking for CRLFCRLF, it instead finds CRCRLFCRCRLF. I would have thought this would fail, but it does not! It works correctly. This is the normal operating procedure of Django on Python, as far as I can tell.

In multipartdjangojythonwindows.hex you will see that the pattern of CRs being replaced by CRCR still occurs. This means that Django on Jython running on runserver works the same as Django on Python running on runserver, and as a result, still works.

Now we go to multiparttomcatjythonwindows.hex. In this file, the CRLF comes as you would expect it in RFC 2616. In this case, chunk.find fails, which is the root of the problem.

Finally, compare the CR characters of multipartpythonosx.hex. You will see it does not duplicate CR the same way multiparttomcatjythonwindows does. And yet it works.

These are my questions:

  1. Where is that extra CR coming from and why is it required? Why does this not result in failure?
  2. Could there be something wrong with my method of data collection as specified in multipartparser.diff?
  3. kmtracy previously said "All the code touching this data before feeding it to the Django code needs to be treating it as binary, not text, and not doing any type of line normalization." Is there a way I can check whether the data is binary or text in Python to verify this is the case?

I'm sorry if this is not the correct avenue to be asking these questions. If there is a better one, please point me in that direction.

comment:9 by Karen Tracey, 14 years ago

To answer your first question, the extra CRs in the debug data are coming from the way it was collected:

This code:

f = open('c:/chunk.txt', 'a') 
f.write(chunk) 
f.close() 

on Windows, will transform any LFs in chunk to CRLF. So where chunk originally had CRLF, what gets written to the file is CRCRLF. What was actually fed to the Django parsing code (chunk) has the correct CRLF so you don't see a POST/parse failure.

The problem with the debug data collection code is it does not open the file in binary mode. See: http://docs.python.org/tutorial/inputoutput.html#reading-and-writing-files. To record exactly what is in chunk, the file needs to be opened in binary mode.

You can see this behavior in action in a Python shell:

D:\tmp>python
Python 2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> txtfile = file.open('xyz.txt','a')
>>> txtfile = open('xyz.txt','a')
>>> txtfile.write('This is a line terminated by CRLF already\r\n')
>>> txtfile.close()
>>> binfile = open('xyz.txt','rb')
>>> data = binfile.read()
>>> data
'This is a line terminated by CRLF already\r\r\n'
>>>

Note I believe it is the reverse of this problem that is causing the failure with Jython/Windows. Some code somewhere is reading the data as a text file, instead of a binary file, and the existing CRLFs are being turned into just plain LFs:

>>> binfile = open('abc.txt','ab')
>>> binfile.write('This is a line terminated by CRLF already\r\n')
>>> binfile.close()
>>> txtfile_wrong = open('abc.txt','r')
>>> txtfile_wrong.read()
'This is a line terminated by CRLF already\n'
>>>

I don't know where this is happening because I've got only the haziest idea of all the pieces involved in running Django under Jython/tomcat. It doesn't necessarily have to be happening in Python code either -- if I remember right Java file I/O can do similar things. But my Java memories are pretty dim since I haven't had to use it in a while.

For the case that looks like it should work but is failing -- bare LFs in the chunk data have been turned into CRLFs in the debug data due to writing the file as text. If you write the file as binary, then I expect you'll see the chunk data contains just LFs, which is why the parsing is failing.

Note: See TracTickets for help on using tickets.
Back to Top