Django

Code

Ticket #3370 (closed: fixed)

Opened 1 year ago

Last modified 1 year ago

[patch] newforms: form.save() raises UnicodeEncodeError when form contains any non latin characters

Reported by: anton@khalikov.ru Assigned to: adrian
Milestone: Component: django.newforms
Version: SVN Keywords: newforms utf8 unicode-branch
Cc: Maniac@SoftwareManiacs.Org, jm.bugtracking@gmail.com, densetsu.no.ero.sennin@gmail.com Triage Stage: Accepted
Has patch: 1 Needs documentation: 0
Needs tests: 1 Patch needs improvement: 1

Description

Hello everyone

I've found the following bug with newforms: when one uses form_for_model/form_for_instance methods and then does save() and form contains any non-latin (national) characters, UnicodeEncodeError? is raised.

Example (console encoding = ru_RU.UTF-8)

>>> from myproject.models import Payment
>>> from django import newforms as forms

Let's try form_for_model:

>>> PaymentForm1 = forms.models.form_for_model(Payment)
>>> form1 = PaymentForm1({'description': 'превед now', 'event_date': '2007-01-26', 'user': '1', 'pay_type': 'cash', 'amount':'50.3'})
>>> form1.is_valid()
True
>>> form1.save()
Traceback (most recent call last):
  File "<console>", line 1, in ?
  File "/usr/lib/python2.4/site-packages/django/newforms/models.py", line 25, in model_save
    obj.save()
  File "/usr/lib/python2.4/site-packages/netangels/models/payment.py", line 26, in save
    super(Payment, self).save()
  File "/usr/lib/python2.4/site-packages/django/db/models/base.py", line 204, in save
    ','.join(placeholders)), db_values)
  File "/usr/lib/python2.4/site-packages/MySQLdb/cursors.py", line 148, in execute
    query = query % db.literal(args)
  File "/usr/lib/python2.4/site-packages/MySQLdb/connections.py", line 232, in literal
    return self.escape(o, self.encoders)
  File "/usr/lib/python2.4/site-packages/MySQLdb/connections.py", line 179, in unicode_literal
    return db.literal(u.encode(unicode_literal.charset))
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-5: ordinal not in range(256)

now let's try form_for_instance

>>> payment = Payment.objects.get(pk=2)
>>> PaymentForm2 = forms.models.form_for_instance(payment)
>>> form2 = PaymentForm2({'description': 'превед now', 'event_date': '2007-01-26', 'user': '1', 'pay_type': 'cash', 'amount':'50.3'})
>>> form2.is_valid()
True
>>> form2.save()
Traceback (most recent call last):
  File "<console>", line 1, in ?
  File "/usr/lib/python2.4/site-packages/django/newforms/models.py", line 52, in save
    return save_instance(self, instance, commit)
  File "/usr/lib/python2.4/site-packages/django/newforms/models.py", line 46, in save_instance
    instance.save()
  File "/usr/lib/python2.4/site-packages/netangels/models/payment.py", line 26, in save
    super(Payment, self).save()
  File "/usr/lib/python2.4/site-packages/django/db/models/base.py", line 184, in save
    db_values + [pk_val])
  File "/usr/lib/python2.4/site-packages/MySQLdb/cursors.py", line 148, in execute
    query = query % db.literal(args)
  File "/usr/lib/python2.4/site-packages/MySQLdb/connections.py", line 232, in literal
    return self.escape(o, self.encoders)
  File "/usr/lib/python2.4/site-packages/MySQLdb/connections.py", line 179, in unicode_literal
    return db.literal(u.encode(unicode_literal.charset))
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-5: ordinal not in range(256)

and one more little example:

>>> form2.clean_data['description'] = 'превед in unicode'
>>> form2.save()

it works because description is of type 'str', but contains Unicode characters.

Attachments

mysql-utf.patch (0.5 kB) - added by anton@khalikov.ru on 01/26/07 00:43:27.
patch that fixes the issue (not well tested)
mysql-utf8-complete.patch (1.1 kB) - added by anton@khalikov.ru on 01/26/07 01:02:15.
This patch fixes the issue
mysql-utf8-complete.2.patch (1.3 kB) - added by anton@khalikov.ru on 01/26/07 02:57:54.
Better patch that fix the issue and should work correctly with MySQL 4.0
mysql-utf8-complete.3.patch (1.2 kB) - added by anton@khalikov.ru on 01/26/07 03:16:04.
this works well for me
models.py.diff (0.7 kB) - added by Esaj on 02/15/07 08:32:10.
Another possible fix.
models.py.2.diff (0.7 kB) - added by Esaj on 02/16/07 08:54:31.
Fix bug in previous patch…
models-utf8.diff (0.9 kB) - added by Dirk Datzert <dummy@habmalnefrage.de> on 04/01/07 03:51:55.
my variant patch

Change History

01/26/07 00:43:27 changed by anton@khalikov.ru

  • attachment mysql-utf.patch added.

patch that fixes the issue (not well tested)

01/26/07 00:45:45 changed by anton@khalikov.ru

  • needs_better_patch changed.
  • needs_tests changed.
  • needs_docs changed.

I've added the patch that fixes the UnicodeEncodeError? raising but after save() I get:

Traceback (most recent call last):
  File "<console>", line 1, in ?
  File "/usr/lib/python2.4/site-packages/django/db/models/base.py", line 80, in __repr__
    return '<%s: %s>' % (self.__class__.__name__, self)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)

this is because self contains unicode string with russian characters

01/26/07 01:02:15 changed by anton@khalikov.ru

  • attachment mysql-utf8-complete.patch added.

This patch fixes the issue

01/26/07 01:04:52 changed by anonymous

  • summary changed from newforms: form.save() raises UnicodeEncodeError when form contains any non latin characters to [patch] newforms: form.save() raises UnicodeEncodeError when form contains any non latin characters.

01/26/07 01:05:18 changed by anonymous

  • has_patch set to 1.
  • needs_tests set to 1.

01/26/07 01:57:33 changed by Michael Radziej <mir@noris.de>

This basically looks it's deeper in the database wrapper, not in newforms. Isn't this a duplicate of #1356 together with #3314?

Can you please check this and report back? Thanks!

In any case, this patch needs to be checked for mysql-4.0.

01/26/07 02:47:42 changed by anton@khalikov.ru

Michael, #1356 - yes this does the same thing as a part of my patches. Only thing is missed in both patches is that we should not change charset if we use MySQL 4.0 and I think it needs to be checked.

About #3314 - it is good patch but it does not fix the following problem:

>>> from myproject.models import *
>>> t = Tariff.objects.get(pk=1)
>>> t
Traceback (most recent call last):
  File "<console>", line 1, in ?
  File "/usr/lib/python2.4/site-packages/django/db/models/base.py", line 80, in __repr__
    return '<%s: %s>' % (self.__class__.__name__, self)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-7: ordinal not in range(128)

where str(t) looks like this:

    def __str__(self):
        return self.name

and name contains unicode non-latin characters

I'm going to upload a better patch that fixes my problems and that should work ok with mysql 4.0

01/26/07 02:57:54 changed by anton@khalikov.ru

  • attachment mysql-utf8-complete.2.patch added.

Better patch that fix the issue and should work correctly with MySQL 4.0

01/26/07 03:15:28 changed by anton@khalikov.ru

Sorry guys but

self.connection.charset = 'utf8'

does NOT fix the problem for me, only passing 'charset': 'utf8' to kwargs does .... So I'm sending the last patch (I hope) that works for me but it is not tested against MySQL 4.0

01/26/07 03:16:04 changed by anton@khalikov.ru

  • attachment mysql-utf8-complete.3.patch added.

this works well for me

01/26/07 03:22:21 changed by Simon G. <dev@simon.net.nz>

  • keywords set to newforms utf8.
  • stage changed from Unreviewed to Accepted.

01/26/07 03:53:14 changed by Ivan Sagalaev <Maniac@SoftwareManiacs.Org>

  • cc set to Maniac@SoftwareManiacs.Org.

1. This patch won't work for legacy databases configured to non-utf-8 encodings (and they aren't rare at all). You can't just hard-code 'utf8' all over the place.

2. In fact this 'bug' is not a bug at all. Newforms now use unicode data looking forward to ongoing unicodification. Until it happens people are expected to convert unicode to needed charset explicitly. I don't think it should be 'fixed' in a hurry, only for one backend and by hard-coding a workaround for some special case.

3. Ticket #952 is an actual fix for this very issue, it proposes a separate setting for setting database charset in order to Django could encode unicode data itself.

To keep things sane I suggest marking it as a dupe of #952.

01/26/07 04:19:27 changed by Michael Radziej <mir@noris.de>

  • needs_better_patch set to 1.
  • stage changed from Accepted to Design decision needed.

We have a bit of chaos here ... Tickets #3370, #1356 and probably #952 all are about this problem, all are accepted, and #3370 and #1356 have very similar patches. I ask everybody to continue discussion in django-developers ("unicode issues in multiple tickets"), and I ask the authors of these three tickets to work together to find out how to proceed.

As long as it's not clear which path to take, I mark all bugs as "design decision needed." (I assume that the other reviews were not aware of the competing tickets.)

http://groups.google.com/group/django-developers/browse_thread/thread/4b71be8257d42faf

01/26/07 04:27:23 changed by anton@khalikov.ru

Ivan, I think you are wrong. Firstly, take a look at django/db/backends/mysql/base.py, there is:

            if self.connection.get_server_info() >= '4.1':
                cursor.execute("SET NAMES 'utf8'")

you can see that utf8 is already hardcoded there. #952 could be good for legacy charset support BUT if you take a look into newforms code you find there that all data is converted to unicode before to into db so I think newforms won't be compatible with non-unicode databases at all.

The reason why I started to do this patch is simple: I have a big project coded in windows-1251 which uses MySQL 4.1 with cp1251 encoding and since I started to migrate parts to newforms I found that either newforms use unicode in clean_data and all apps based on newforms must be coded utf8 (and I can't change this, decision was made without asking people like me), or these new apps won't work for any national characters in db. #952 will just break newforms. I am saying this because I have had django with similar to #952 patch applied - newforms don't work with it.

01/26/07 04:30:25 changed by Michael Radziej <mir@noris.de>

Anton, could you please post this to the new thread? The discussions need to be merged to get anywhere.

02/14/07 03:35:43 changed by Michael Radziej <mir@noris.de>

related for psycopg1: #3492

02/15/07 08:32:10 changed by Esaj

  • attachment models.py.diff added.

Another possible fix.

02/15/07 08:34:49 changed by Esaj

Currently, newforms uses unicode and Django's database layer doesn't, so we should convert the data at the boundary (presumably converting it using the default charset is the best approach). See attached patch.

02/15/07 10:36:41 changed by anton@khalikov.ru

Esaj, it is better to add patches that move database layer to unicode too, than trying to do what you do. Please read our discussion in django-devel. Btw I have a complete patch for common db layer and mysql backend that moves everything to unicode. Dunno if it is required.

02/15/07 12:20:57 changed by Ivan Sagalaev <Maniac@SoftwareManiacs.Org>

Anton, moving db backends to unicode is certainly better but it's an incomparable amount of work. Along with documenting, testing on different versions of db libraries it may take months. This patch just fixes things as they are now and it doesn't hurt.

02/15/07 23:20:01 changed by anton@khalikov.ru

Ivan, this work needs to be started anyway, isn't it ? You can think I have it a half done.

02/16/07 08:54:31 changed by Esaj

  • attachment models.py.2.diff added.

Fix bug in previous patch...

04/01/07 03:50:53 changed by Dirk Datzert <dummy@habmalnefrage.de>

Hi,

I came across the same error and found a good solution where newforms is doing the save call.

Regards, Dirk

04/01/07 03:51:55 changed by Dirk Datzert <dummy@habmalnefrage.de>

  • attachment models-utf8.diff added.

my variant patch

04/22/07 08:07:57 changed by anonymous

  • cc changed from Maniac@SoftwareManiacs.Org to Maniac@SoftwareManiacs.Org, jm.bugtracking@gmail.com.

05/14/07 04:32:32 changed by anonymous

  • cc changed from Maniac@SoftwareManiacs.Org, jm.bugtracking@gmail.com to Maniac@SoftwareManiacs.Org, jm.bugtracking@gmail.com, densetsu.no.ero.sennin@gmail.com.

05/14/07 09:44:42 changed by mtredinnick

  • keywords changed from newforms utf8 to newforms utf8 unicode-branch.
  • stage changed from Design decision needed to Accepted.

This has been fixed on the unicode branch. I'll close the ticket when that branch is merged with trunk.

07/04/07 07:11:05 changed by mtredinnick

  • status changed from new to closed.
  • resolution set to fixed.

(In [5609]) Merged Unicode branch into trunk (r4952:5608). This should be fully backwards compatible for all practical purposes.

Fixed #2391, #2489, #2996, #3322, #3344, #3370, #3406, #3432, #3454, #3492, #3582, #3690, #3878, #3891, #3937, #4039, #4141, #4227, #4286, #4291, #4300, #4452, #4702


Add/Change #3370 ([patch] newforms: form.save() raises UnicodeEncodeError when form contains any non latin characters)




Change Properties
Action