#22088 new Bug

XML deserializer strips leading whitespace on loaddata

Reported by:	Joseph-django@…	Owned by:
Component:	Core (Serialization)	Version:	1.6
Severity:	Normal	Keywords:	xml deserialization
Cc:	numerodix@…	Triage Stage:	Accepted
Has patch:	no	Needs documentation:	no
Needs tests:	no	Patch needs improvement:	no
Easy pickings:	no	UI/UX:	no

Description ¶

If an object instance has a character field and the value of that field starts with the tab character, loaddata removes that tab character when the loaded fixture is in XML format.

Note that the XML dump data does not strip this leading tab character. Also note that both the JSON dump and load data preserve the tab character.

I have not tested this with other whitespace characters. This can be easily reproduced by creating a simple model:

from django.db import models

class Foobar(models.Model):
    name = models.CharField(max_length=20)

Then create an instance of that model (e.g., in the Django shell) with a name value such as "\tBaz", and run manage.py dumpdata with --format=xml.

Once the fixture has been generated, remove the existing instance (by deleting it, flushing the app data, or any other method) and then use manage.py loaddata to load the fixture. The instance's name no longer contains the tab character.


Change History (9)

comment:1 by Martin Matusiak, 11 years ago

Cc:	numerodix@… added
Owner:	changed from nobody to Martin Matusiak
Status:	new → assigned

comment:2 by Martin Matusiak, 11 years ago

I can reproduce it.

follow-up: 5 comment:3 by Martin Matusiak, 11 years ago

Trying with a few special characters:

Foobar.objects.create(name=' bar')
Foobar.objects.create(name='\abar')
Foobar.objects.create(name='\bbar')
Foobar.objects.create(name='\fbar')
Foobar.objects.create(name='\nbar')
Foobar.objects.create(name='\rbar')
Foobar.objects.create(name='\tbar')
Foobar.objects.create(name='\vbar')

XML pretty printer corrupts this completely:

    <object pk="9" model="app.foobar">
        <field type="CharField" name="name"> bar</field>
    </object>
    <object pk="10" model="app.foobar">
        <field type="CharField" name="name">bar</field>
    </object>
    <object pk="11" model="app.foobar">
        <field type="CharField" name="name"bar</field>
    </object>
    <object pk="12" model="app.foobar">
        <field type="CharField" name="name">
                                            bar</field>
    </object>
    <object pk="13" model="app.foobar">
        <field type="CharField" name="name">
bar</field>
    </object>
    <object pk="14" model="app.foobar">
bar</field>eld type="CharField" name="name">
    </object>
    <object pk="15" model="app.foobar">
        <field type="CharField" name="name">    bar</field>
    </object>
    <object pk="16" model="app.foobar">
        <field type="CharField" name="name">
                                            bar</field>
    </object>

In terse mode it's more likely to be correct when loaded again, but clearly this needs fixing:

<object pk="9" model="app.foobar"><field type="CharField" name="name"> bar</field></object><object pk="10" model="app.foobar"><field type="CharField" name="name">bar</field></object><object pk="11" model="app.foobar"><field type="CharField" name="name"bar</field></object><object pk="12" model="app.foobar"><field type="CharField" name="name">
                               bar</field></object><object pk="13" model="app.foobar"><field type="CharField" name="name">
bar</field></object><object pk="15" model="app.foobar"><field type="CharField" name="name">     bar</field></object><object pk="16" model="app.foobar"><field type="CharField" name="name">
                                                     bar</field></object>

So that's dumpdata at fault, not loaddata.
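A rough way to check which of the characters above can survive a round trip at all, sketched with the standard library only. It assumes that xml.sax.saxutils.XMLGenerator (the base class of SimplerXMLGenerator) and xml.dom.minidom behave closely enough to Django's serializer stack for this purpose; round_trip is a hypothetical helper, not part of Django.

```python
import io
from xml.dom import minidom
from xml.parsers.expat import ExpatError
from xml.sax.saxutils import XMLGenerator


def round_trip(value):
    """Write value as element text, then parse it back.

    Returns the parsed text, or None if the parser rejects the document.
    Hypothetical helper for illustration only.
    """
    buf = io.StringIO()
    gen = XMLGenerator(buf, encoding="utf-8")
    gen.startDocument()
    gen.startElement("field", {})
    gen.characters(value)  # escapes &, < and > but leaves control characters raw
    gen.endElement("field")
    gen.endDocument()
    try:
        doc = minidom.parseString(buf.getvalue().encode("utf-8"))
    except ExpatError:
        return None
    return "".join(
        node.data
        for node in doc.documentElement.childNodes
        if node.nodeType == node.TEXT_NODE
    )


for ch in " \a\b\f\n\r\t\v":
    print(repr(ch), "->", repr(round_trip(ch + "bar")))
```

In this sketch the space, \n and \t come back intact, \r comes back as \n because XML parsers normalize line endings, and \a, \b, \f and \v are rejected by the parser since XML 1.0 does not allow those control characters.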

comment:4 by Martin Matusiak, 11 years ago

Another experiment: hand-edit the XML dump file using HTML escapes (1) to see if loaddata will load it correctly. No, that doesn't work either.

So:
Problem 1: The xml serializers (SimplerXMLGenerator/pulldom) do not round trip these characters.
Problem 2: Even if they did, a tab character would be stripped due to:

value = field.to_python(getInnerText(field_node).strip())

core.serializers.xml_serializer.py:214

(1) http://mail-archives.apache.org/mod_mbox/xmlgraphics-fop-users/200406.mbox/%3C40C5E61C.9040304@hotmail.com%3E
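To make Problem 2 concrete, here is a minimal sketch of what the .strip() call does to a parsed field value. The inner_text function is a simplified stand-in for the serializer's getInnerText helper (an assumption, not the actual Django code), and the XML snippet is hand-written rather than real dumpdata output.

```python
from xml.dom import minidom


def inner_text(node):
    """Simplified stand-in for getInnerText: concatenate all text below node."""
    parts = []
    for child in node.childNodes:
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE):
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            parts.append(inner_text(child))
    return "".join(parts)


# The Python string literal puts a real tab character in front of "Baz".
doc = minidom.parseString('<field type="CharField" name="name">\tBaz</field>')

raw = inner_text(doc.documentElement)  # what the parser hands back
stripped = raw.strip()                 # what the quoted line passes to to_python

print(repr(raw), repr(stripped))
```

The parser itself keeps the tab here; it is only lost once strip() is applied, so both problems would need addressing for a full round trip.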

in reply to: 3 comment:5 by Daniele Procida, 11 years ago

Triage Stage:	Unreviewed → Accepted

Replying to numerodix:

    <object pk="11" model="app.foobar">
        <field type="CharField" name="name"bar</field>
    </object>

Wow. If that's what you're getting in the dumped text file, that's remarkable.

Last edited 11 years ago by Daniele Procida

comment:6 by Martin Matusiak, 11 years ago

What we could do is try to wrap any special characters in a CDATA section. So the xml would look like this:

<object pk="2" model="app.foobar"><field type="CharField" name="name"><![CDATA[\t]]>bar</field></object>

The deserializer then gives us the tab character escaped:

u'\\tbar'

So we'd have to strip the escape.

But this feels very ad-hoc and messy.
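A small sketch of how a parser treats CDATA content, using xml.dom.minidom as a stand-in for the deserializer's pulldom-based parsing (an assumption). field_text is a hypothetical helper. It compares a literal backslash followed by "t" with a real tab character, and also tries a bell character, which XML 1.0 does not allow even inside CDATA.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def field_text(xml):
    """Concatenate the text and CDATA children of the root element."""
    doc = minidom.parseString(xml)
    return "".join(child.data for child in doc.documentElement.childNodes)


# Raw string: the CDATA section holds a backslash followed by the letter "t".
literal = field_text(r"<field><![CDATA[\t]]>bar</field>")

# Normal string: the CDATA section holds a real tab character.
real_tab = field_text("<field><![CDATA[\t]]>bar</field>")

# A bell character (\x07) inside CDATA is still not well-formed XML.
try:
    field_text("<field><![CDATA[\x07]]>bar</field>")
    bell_accepted = True
except ExpatError:
    bell_accepted = False

print(repr(literal), repr(real_tab), bell_accepted)
```

In this sketch the literal form comes back as a backslash and a "t", matching the escaped value shown above, while a real tab inside CDATA comes back as a tab. CDATA on its own would not help with characters such as \a or \b, which the parser rejects either way.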

comment:7 by jarshwah, 11 years ago

Shouldn't all straight up text (CharField/TextField) be wrapped in a CDATA though? Then there's no need to look for special characters at all. Are there any downsides to wrapping all text values in a CDATA? Unsure why the deserialiser would escape the tab though (I haven't looked into it), but that is sort of a separate - related - issue.

comment:8 by Martin Matusiak, 11 years ago

@smeaton, I think it probably should, yes. The downside would be that you're adding 12 bytes to every value even though most strings would not need it.

One could optimize that by wrapping only strings that need to be wrapped, according to a character range or something like that.

This serializer is also used by the syndication feed framework btw. It would be nice to fix the problem in both places at once.

comment:9 by Jacob Walls, 4 years ago

Owner:	Martin Matusiak removed
Status:	assigned → new
