Code

Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#15778 closed Bug (fixed)

Command createsuperuser fails under some system user names

Reported by: Hynek Cernoch <hynek@…> Owned by: Alex Gaynor
Component: contrib.auth Version: 1.3
Severity: Normal Keywords:
Cc: Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: yes Patch needs improvement: no
Easy pickings: no UI/UX:

Description

Commands 'createsuperuser' and 'syncdb' can not create a superuser account no way because of database error,
if a system account username contains an 8-bit character.
It fails when the name is automatically searched in the database, even if the user wants to write an ascii username manually.
This is typical for usernames created by Microsoft Windows.

File "C:\Python26\lib\site-packages\django\contrib\auth\management\commands\createsuperuser.py", line 72, in handle
  User.objects.get(username=default_username)
...
File "C:\Python26\lib\site-packages\django\db\backends\sqlite3\base.py", line 234, in execute
  return Database.Cursor.execute(self, query, params)
DatabaseError : You must not use 8-bit bytestrings unless you use a text_factory that can interpret 
8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch 
your application to Unicode strings.

Versions: Django 1.3, Python 2.6.4 windows, Sqlite3 3.5.9, dbapi 2.4.1

It is easier to fix it once then to circumvent it twice.

The middle part of the patch:

-            default_username = getpass.getuser().replace(' ', '').lower()
+            default_username = str(getpass.getuser().decode('ascii', 'ignore')).replace(' ', '').lower()

Attachments (2)

django-createsuperuser-8bit-username-bug.diff (707 bytes) - added by Hynek Cernoch <hynek@…> 3 years ago.
createsuperuser.diff (1.5 KB) - added by hynekcer 3 years ago.

Download all attachments as: .zip

Change History (28)

Changed 3 years ago by Hynek Cernoch <hynek@…>

comment:1 Changed 3 years ago by julien

  • Needs documentation unset
  • Needs tests set
  • Patch needs improvement unset

Could you provide some tests replicating this bug?

comment:2 Changed 3 years ago by Hynek Cernoch <hynek@…>

To reproduce

"Adrián" and "Júlia" are nice frequent Spain names. Microsoft recommended to enter a localized full user name and full company name in Windows installation guides in the past. Localized user names are still very frequent in our country and I got from my company a preinstalled computer with a localized user name. The user name returned by "getpass.getuser()" by Python on Windows is usually encoded in any of possible one byte character sets, typically 'cp1252' for West Europe.

The most convincing and the only complete test would be to create such account on Windows, but probably you mean by the requirement "test" any possible simplified verification. So, I wrote a short temporary patch to Django:

--- a/django/contrib/auth/management/commands/createsuperuser.py  2011-02-22 12:33:04.000000000 +0100
+++ b/django/contrib/auth/management/commands/createsuperuser.py  2011-04-08 18:51:04.682233473 +0200
@@ -59,6 +59,7 @@
         # Try to determine the current system user's username to use as a default.
         try:
             default_username = getpass.getuser().replace(' ', '').lower()
+            default_username = 'J\xfalia'.lower()  # the name 'Julia' with accented 'u' - fails
         except (ImportError, KeyError):
             # KeyError will be raised by os.getpwuid() (called by getuser())
             # if there is no corresponding entry in the /etc/passwd file

startproject, startapp, ENGINE is 'django.db.backends.sqlite3'
python manage.py createsuperuser

Python 2.6
This synthetic test anytime fails. (even on Linux, although all possible usernames are in ascii on Linux)

DatabaseError:
"You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings."

Python 2.5
never fails. (probable because its '_sqlite3.so' does not contains that typical error message string.) It normally displays:

Username (Leave blank to use u'j\xfalia'):

Function getpass.getuser() also depends on the precompiled package.
Python from http://www.python.org/ftp/python/2.6.4/python-2.6.4.msi is the described above, it gets international characters (fails). 
Python from Cygwin strips international characters from the username in Windows.

comment:3 Changed 3 years ago by kmtracey

  • Triage Stage changed from Unreviewed to Accepted

The arcane error message from sqlilte means a non-utf8 bytestring was passed as default_username on the call to User.objects.get(username=default_username). 'J\xfalia' is the latin1 (same as Windows cp1252 for this case) encoding for "Júlia".
If instead you passed in 'J\xc3\xbalia', the utf-8 encoding, the User.objects.get call would work. Due to other code in this area it would not be accepted as a unsername, but you'd get past that exception.

This error from sqlite is new with 2.6, see #7921 for details and the explanation of why you can pass utf-8 encoded bytestrings. Django adapted to the change with 2.6 by installing an adapter to convert all bytestrings passed down to the database to unicode, assuming they have a utf-8 encoding. If they don't and the attempt to decode from utf-8 fails, you will see the error message you are seeing.

If there was some way to know the encoding of the bytestring returned by getpass.getuser() then the best thing would be to use that known encoding to transform the bytestring into a unicode object.

comment:4 Changed 3 years ago by Hynek Cernoch <hynek@…>

Yes, but that was a somewhat different situation.
I would prefer the original proposed solution to strip international characters from the default_username, because an evantually conversion to unicode would only postpone the problem for later. (Convert something correctly and display that correctly, what would be finally must rejected?)

I thing that a simple solution can be incorporaded in any bugfix release, it need not to wait for Django 1.4

(I know a better solution with striping the accents, not dropping characters, but it is not important now at all, imho. The encoding is not 'utf-8'. It is sys.getfilesystemencoding(), usually 'mbcs', which is a generic name for differnent windows default encodings of the actual installation.)

comment:5 Changed 3 years ago by kmtracey

  • Patch needs improvement set

If we are going to fix it, we should make some effort to fix it reasonably. If I were named Júlia I'd find it a bit irritating for the system to suggest a "good" username for me was jlia. It's not that hard to "translate" accented unicode characters to their non-accented ASCII equivalents, using unicodedata.normalize:

>>> import unicodedata
>>> yipes = u'¡¢£¥§©ª«¬®¯°±²³µ¶·¹º»¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'
>>> unicodedata.normalize('NFKD', yipes).encode('ascii', 'ignore')
'a 231oAAAAAACEEEEIIIINOOOOOUUUUYaaaaaaceeeeiiiinooooouuuuyy'
>>> 

Not perfect, but far better than stripping accented characters entirely.

comment:6 Changed 3 years ago by hynekcer

This is much better. :-) I preferred a simple solution and strong arguments such as example names similar to a member of the community and the person who requested a test :-)

I thought so: Without a patch or with any failing patch the person probably lost several hours, because the simple bypass "./manage.py syncdb --noinput; ./manage.py createsuperuser --username=USERNAME" can be easily missed. With a simple patch he lost only several minutes, because unfortunately only a journalist is able to notice every missing letter. (Maybe a user can not login first.)

I am not sure enough that "getpass.getuser()" returns results in the encoding "sys.getfilesystemencoding()" for every language and version of Windows and Python. Otherwise the conversion fails. I only hope that Microsoft was never so proud to implement the right to left picture alphabets to the home directory paths. We should use some "decode(..., 'ignore')".

I suggest:

default_username = getpass.getuser().decode(sys.getfilesystemencoding(), 'ignore')
default_username = unicodedata.normalize('NFKD', default_username) \
                       .encode('ascii', 'ignore').replace(' ', '').lower()

Hopefully the "unicodedata.normalize('NFKD'...)" does not raise an exception, otherwise it should be added to captured exceptions, not only (ImportError, KeyError).

comment:7 Changed 3 years ago by hynekcer

Alternative: If some character can not be decoded (maybe bad detected encoding perhaps for a non-latin alphabet), better is to give no suggestions:

  try:
-     default_username = getpass.getuser().replace(' ', '').lower()
- except (ImportError, KeyError):
+     default_username = getpass.getuser().decode(sys.getfilesystemencoding())
+     default_username = unicodedata.normalize('NFKD', default_username) \
+                            .encode('ascii', 'ignore').replace(' ', '').lower()
+ except (ImportError, KeyError, UnicodeDecodeError):
+     # UnicodeDecodeError - We are not sure what will do a non-latin Windows version.

comment:8 Changed 3 years ago by kmtracey

#15162 reported this as well.

Changed 3 years ago by hynekcer

comment:9 Changed 3 years ago by hynekcer

#15162 has similar guess of the encoding: locale.getdefaultlocale()[1]

  • It returns more explicit names e.g 'cp1250', 'cp1252' than sys.getfilesystemencoding() which returns 'mbcs'.
  • Decoded characters are the same for both methods. Encoding 'cp1250' is a subset defined for 251 characters, 'mbcs' is defined for all 256 characters.

Conclusion: I think that better is:

locale.getdefaultlocale()[1]

Characters from non-default encoding are replaced by question marks before any conversion by 'getpass.getuser()'. E. g. 'ů' on West Europe Windows or vice-versa 'ù' on East Europe Windows. Better is no suggestion than a bad one. Therefore I add:

+            if not RE_VALID_USERNAME.match(default_username):
+                default_username = ''

My final patch createsuperuser.diff is ready.

comment:10 Changed 3 years ago by anonymous

  • Easy pickings unset
  • Owner changed from nobody to Alex Gaynor
  • Patch needs improvement unset

Mister Gaynor, please write the unit test for this patch. Thank you.

comment:11 Changed 3 years ago by anonymous

The test can be either formal only and very complicated, because it is not easy to modify output of getpass.getuser() and sys.getfilesystemencoding() or even simulate type of operating system.

This is similar requirement like julien asked tests replicating this bug. I would advice good manual testing and checkout.

Do you want really rely on tests like ?

   self.assertEqual('ABC'.lower(), 'abc')

comment:12 Changed 3 years ago by anonymous

I have tested manually polite refusal with expecting suggesting no username for little obscure cases on Greek three weeks ago. (Greek would not be possible to be tested without buing Greek Windows and installing it totally blind. Greek keyboard was not enough because Greek characters are replaced by ascii '?' by getuser() on non Greek Windows, which is not the real case.) Do test only 'Júlia' ('J\xfalia') or 'Adrián'.

Believe that testing is much more complicated than good fixing. I read piece of Windows documentation due to testing, but Django development is not reading how business plan of Microsoft has blemished Windows between the lines.

comment:13 Changed 3 years ago by anonymous

What about using the django.utils.encoding.smart_str() function to deal with the encoding of a non-ASCII username?

comment:14 Changed 3 years ago by anonymous

  • Patch needs improvement set

The patch is bad.

What .encode('ascii', 'ignore') does is that it ignores the UnicodeError. So what is the point of catching the UnicodeDecodeError (a subclass of UnicodeError) that would never get raised? Then the code block default_username = will never get executed. I suggest fixing the 'ignore' positional argument to 'strict'.

comment:15 Changed 3 years ago by anonymous

The patch is bad.

What .encode('ascii', 'ignore') does is that it ignores the UnicodeError. So what is the point of catching the UnicodeDecodeError (a subclass of UnicodeError) that would never get raised? Then the code block default_username = '' will never get executed. I suggest removing the 'ignore' positional argument; if it is removed, that defaults to 'strict', which catches the UnicodeError -- which is what we want.

comment:16 Changed 3 years ago by anonymous

DAMN! I was wrong! Please don't follow my instructions.

comment:17 Changed 3 years ago by anonymous

  • Patch needs improvement unset

The patch is very good. Please apply it.

comment:18 Changed 3 years ago by anonymous

  • Needs tests unset

I have created a new user account on my Windows 7 operating system and have added the omega Greek character into the user account name. My account was named userΩtest. And when I reached the point of having to type "yes" at the Django createsuperuser prompt, the prompt was like this: Username (leave blank to use 'userotest'). So the patch worked. The Greek letter "Ω" (omega) was successfully converted to an "o" character. This patch is bulletproof. You could mock getpass, but what is the point? Please incorporate this patch already.

comment:19 Changed 3 years ago by anonymous

Yes, the createsuperuser.diff patch is the final patch and is tested with very well results and should to be commited to the trunk ASAP.

comment:20 Changed 3 years ago by anonymous

  • milestone 1.3 deleted
  • Needs tests set

Freddie, we're not putting it in without unit tests.

comment:21 Changed 3 years ago by SmileyChris

(that comment was me, and the previous ones were Freddie__ from irc, who has some entitlement issues to work through)

comment:22 Changed 3 years ago by kmtracey

I am finding the repeated demands to get this in NOW off-putting but I'm also a bit puzzled by the insistence on tests -- testing behavior that requires a current system username value that contains a non-ascii char seems a bit outside of the scope of where we would normally require tests? For this particular case I would have trusted the reports of people who can recreate the issue, and manual review of the code (which I don't have time for at the moment).

comment:23 follow-up: Changed 3 years ago by SmileyChris

  • Resolution set to fixed
  • Status changed from new to closed

In [16182]:

Fixes #15778 -- createsuperuser fails on international characters in system user names. Thanks for the patch, Hynek Cernoch.

comment:24 in reply to: ↑ 23 Changed 3 years ago by jezdez

Replying to SmileyChris:

In [16182]:

Fixes #15778 -- createsuperuser fails on international characters in system user names. Thanks for the patch, Hynek Cernoch.

What's up with the :params: and :returns: stuff in the docstrings?

comment:25 Changed 3 years ago by hynekcer

Thanks for fixing. I acknowledge that it is satisfactory closed.

It is not important to catch UnicodeDecodeError in get_default_username in [16182]. It can be safely removed, which improves readibility:

 def get_default_username(check_db=True):
 ...
     default_username = get_system_username()
-    try:
-        default_username = unicodedata.normalize('NFKD', default_username)\
-            .encode('ascii', 'ignore').replace(' ', '').lower()
-    except UnicodeDecodeError:
-        return ''
+    default_username = unicodedata.normalize('NFKD', default_username)\
+        .encode('ascii', 'ignore').replace(' ', '').lower()
     if not RE_VALID_USERNAME.match(default_username):
         return ''

Catching of UnicodeDecodeError was only important in the original patch due to decode(locale.getdefaultlocale...), not due to other commands, which can be demonstrated:

# Verificaion that `unicodedata.normalize` does not raises anythyng
# for the whole range of valid unicode characters. It is also clear from the documentation.
import unicodedata
for i in range(0x110000):
    dummy = unicodedata.normalize('NFKD', unichr(i)).encode('ascii', 'ignore')
  • Answer to objection comment:14: The prove or disprove of possibility UnicodeDecodeError after getpass.getuser().decode is not easy, because important things are outside of free software. It is safer to catch it.
  • Objection comment:13 Why? Because the proposed solution would not work at all.
  • Comments comment:11 and comment:12 was me - patch author and reporter.

comment:26 Changed 3 years ago by SmileyChris

Thanks for the follow-up hynekcer. I'll just leave it in there since it doesn't hurt anyone (and it'd be better to still catch a decoding issue on some strange setup than choke).

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as closed
as The resolution will be set. Next status will be 'closed'
The resolution will be deleted. Next status will be 'new'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.