Code

Opened 7 years ago

Closed 7 years ago

Last modified 7 years ago

#3976 closed (wontfix)

[unicode] Unicode models

Reported by: Ivan Sagalaev <Maniac@…> Owned by: adrian
Component: Database layer (models, ORM) Version: master
Severity: Keywords:
Cc: mtredinnick, Maniac@… Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: UI/UX:

Description

Models has to output their data from __unicode__ method, not __str__ as it is used now. I have a working code that does roughly this: changes current __str__ to __unicode__ and defines a __str__ as unicode(self).encode(settings.DEFAULT_CHARSET).

The big part is documenting the change: users should use __unicode__ instead of __str__ for their model classes. If there's no principal objections I could make a patch.

Attachments (0)

Change History (4)

comment:1 Changed 7 years ago by mtredinnick

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Resolution set to wontfix
  • Status changed from new to closed

I think this is creating unnecessary work for the developer when it won't be required in a number of cases.

Firstly, models can quite happily use the str() method. It just has to return UTF-8 bytestrings. So many models will not need any changes at all. It's not really practical to say that we are only going to handle unicode strings. We also have to be able to handle bytestrings -- we just have to have a convention about the encoding of those bytestrings and for consistency with the rest of Python, we'll use UTF-8.

Secondly, encoding to settings.DEFAULT_CHARSET is wrong (it's something Django does a lot at the moment, but I'm slowly burying all of them in the unicode branch), because that will return a bytestring that is not necessarily UTF-8 and you will have no way of knowing its encoding. So we would have inconsistent internal bytestring encoding. The key phrase in the unicode branch is consistency. :-)

Remember that Model.__str__ is not only used for template output. It can also be used in logs or other forms of output that is not necessarily controlled by Django. We should only convert to DEFAULT_CHARSET at the last minute (e.g. Template.render()).

We do need to document a couple of coding practices and assumptions (e.g. all bytestrings passed to Django are assumed to be in UTF-8). However, it shouldn't be too painful.

I'm going to close this as wontfix, because I think the key point is not quite right. I'm keeping a list of doc updates required (it's only two points at the moment) and will update the Wiki page shortly so that we don't lose them.

comment:2 Changed 7 years ago by Ivan Sagalaev <Maniac@…>

I think this is creating unnecessary work for the developer when it won't be required in a number of cases.

Only for data that is known to be in ASCII. But for more interesting data I'd say in most cases some kind of attention is required because people use model fields in __str__ and these fields now changed their type from str to unicode. Thus developers should anyway check all their __str__'s and:

  • either replace it with __unicode__ changing all '...' to u'...' inside
  • or add .encode('utf-8') to self.field_name's

We can encourage people (in docs) to use the first approach since it's better for consistency and for performance (slightly). And we can provide a default __str__ for models that looks like this:

def __str__(self):
  if hasattr(self, '__unicode__'):
    return unicode(self).encode('utf-8')
  # current __str__ implementation with class name

Leaving models with only __str__ asks for troubles since simple unicode(instance) will try to decode it without utf-8 and there's no guaranty that everyone and everywhere will use smart_unicode.

What do you think?

P.S. I agree on changing settings.DEFAULT_CHARSET to utf-8 everywhere except HTTP IO.

P.P.S. I don't know if it's better to discuss this in django-developers?

comment:3 Changed 7 years ago by mtredinnick

Their code is already broken if they are sending around random bytestrings. We won't make it any less broken by interpreting the str() strings as UTF-8 bytestrings. In fact, we will catch the errors faster. People will also possibly want to write {{{unicode_}} methods; that's a given. But we shouldn't force them if they don't want to. The developer knows (or should know) how their class behaves. We leave it up to them to make a decision.

I understand the change you want to make (it's exactly the same as the StrAndUnicode class, for example), but I don't think it's a good idea/necessary to do it by default.

comment:4 Changed 7 years ago by Ivan Sagalaev <Maniac@…>

Ok, agreed on not prematurely fixing current __str__'s.

People will also possibly want to write {{{unicode}} methods

Yes. But they will want to use str(obj) somewhere and it won't automatically pick up their unicode. This is why I'm suggesting an addition in current Model.str to pick it up and convert using smart_unicode. It won't break anything and is useful.

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as closed
as The resolution will be set. Next status will be 'closed'
The resolution will be deleted. Next status will be 'new'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.