Opened 3 years ago

Last modified 4 months ago

#18392 new New feature

Use utf8mb4 encoding with MySQL 5.5

Reported by: EmilStenstrom Owned by: nobody
Component: Database layer (models, ORM) Version: 1.4
Severity: Normal Keywords: utf8mb4 mysql
Cc: kitsunde@…, django@…, cvrebert, moggers87, flisky, cmawebsite@… Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: yes Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

Background:
I just ran into a problem with iPhone emoji characters being saved into MySQL. The text was cut off after the first emoji character. After some research I found which explains how it works: http://mzsanford.wordpress.com/2010/12/28/mysql-and-unicode/

The recommendation is to use MySQL 5.5, and the "utf8mb4" encoding.

Suggestion:
Make "utf8mb4" the default encoding for MySQL 5.5 and up.

Change History (24)

comment:1 Changed 3 years ago by akaariai

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Triage Stage changed from Unreviewed to Accepted

To me it seems the character encoding is set on connect to 'charset': 'utf8'. Is it enough to change this from 'utf8' to 'utf8mb4' to change this default. If not, where should this default encoding be defined? Does this mean that any text column must be VARCHAR(N) CHARACTER SET utf8mb4 on creation?

I am accepting this as to me the change sounds valid.

comment:2 Changed 3 years ago by EmilStenstrom

I don't know enough about MySQL and the ORM to answer your questions, I hope someone else does.

comment:3 Changed 3 years ago by EmilStenstrom

Ok, there's some trickiness here. Indexes in InnoDB tables can't be longer than 255 chars with utf8, but only 191 chars with utf8mb4. This means that the default indexes that Django makes for CharField(max_length=255) is too long, and will break things (break what? I'm running a migration that converts all my tables to utf8mb4 automatically, and setting utf8mb4 on a long charfield breaks because the index is too long).

From the official docs:

"InnoDB has a maximum index length of 767 bytes, so for utf8 or utf8mb4 columns, you can index a maximum of 255 or 191 characters, respectively. If you currently have utf8 columns with indexes longer than 191 characters, you will need to index a smaller number of characters. In an InnoDB table, these column and index definitions are legal:

col1 VARCHAR(500) CHARACTER SET utf8, INDEX (col1(255))

To use utf8mb4 instead, the index must be smaller:

col1 VARCHAR(500) CHARACTER SET utf8mb4, INDEX (col1(191))"

From: http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-upgrading.html

comment:4 Changed 3 years ago by akaariai

  • Triage Stage changed from Accepted to Design decision needed

I am marking this as design decision needed. This will need a good solution which guarantees trouble-free upgrades for current users. While it would be nice to have full UTF8 support with MySQL, I don't think it is worth risking a breakage for existing users.

So, looking for good solutions here. Does MySQL infer the used character set from some runtime variable (perhaps something set at CREATE DATABASE time)? If so, we could just check the used charset for the db, and use that for test database creation and index creation. If so, this would be a simple solution to this problem.

comment:5 Changed 3 years ago by kmtracey

I don't believe setting the connection charset to this new utf8mb4 encoding (if we're talking to a server that supports it) would cause any problem with the indexing: that's a problem that comes into play when you change the database charset itself. Whether we need to change that isn't clear to me. It's possible that these new characters (stored in a DB with utf8mb4 encoding) would be transmitted over a connection that has charset set simply to utf8. If so then there's really no point in making any connection charset change. So one open question here is: do you get an error or incorrect behavior trying to read/write these characters over a utf8 connection created by Django when operating on a database with utf8mb4 encoding?

Traditionally Django has not gone beyond advising in the documentation what charset to use (uft8) for the database. Django doesn't attempt to set the charset for stuff it creates to utf8, it just uses the default charset for the DB, which is set before Django ever gets involved. Given this new MySQL 5.5 support for "more better" unicode we probably need to update the docs to mention the new option for database charset.

Having the index creation fail for a too-long CharField is a nuisance. I know we've had tickets before that dealt with this issue, but I don't recall offhand what the status is. Django tries to disallow creation of a CharField where the index creation will fail, but MySQL makes it incredibly difficult to figure out what the right value for "max allowed" is. This new encoding just makes for more of a mess there.

comment:6 Changed 3 years ago by EmilStenstrom

Maybe this ticket just needs some exploration before going further: A simple test that tries to save and their fetch some 4-byte unicode characters to MySQL... I think that would reveal what the minimal changes are, for full unicode support to be possible.

comment:7 follow-up: Changed 3 years ago by rogeliorv

  • Keywords hack utf8mb4 mysql added

As proof of concept I tried inserting into mysql the following values in a varchar.

😄😃😊☺😉😍😘😚

And got a "Warning: Incorrect string value" from mysql (which python treats as an exception)

As a way to test it. The hack consists in adding self.query('SET NAMES utf8mb4') in MySQLdb.connections in Connection.set_character_set function as shown here: http://pastebin.com/MW5BgRgP

Of course the correct way would be to change this in django when setting up the cursor connection.


comment:8 in reply to: ↑ 7 ; follow-up: Changed 3 years ago by EmilStenstrom

Replying to rogeliorv:

As a way to test it. The hack consists in adding self.query('SET NAMES utf8mb4') in MySQLdb.connections in Connection.set_character_set function as shown here: http://pastebin.com/MW5BgRgP

Of course the correct way would be to change this in django when setting up the cursor connection.

Did your hack remove the exception? What was the rationale behind the hack? What's the next step?

Last edited 3 years ago by EmilStenstrom (previous) (diff)

comment:9 in reply to: ↑ 8 Changed 3 years ago by rogeliorv

  • Has patch set
  • Keywords hack removed
  • Needs tests set
  • Patch needs improvement set

Replying to EmilStenstrom:

Replying to rogeliorv:

As a way to test it. The hack consists in adding self.query('SET NAMES utf8mb4') in MySQLdb.connections in Connection.set_character_set function as shown here: http://pastebin.com/MW5BgRgP

Of course the correct way would be to change this in django when setting up the cursor connection.

Did your hack remove the exception? What was the rationale behind the hack? What's the next step?

Yes, the hack removed the exception. The rationale followed was to make the mysql client to use a certain encoding.

The next step is to make django's mysql connections to use utf8mb4 by default or otherwise make it more configurable. Since utf8bm4 is utf8 compatible, there should be no extra changes in that regard.

To achieve this django.db.base.cursor should be changed in class DatabaseWrapper function _cursor, (complete function definition here http://pastebin.com/A6dMEMd4):

kwargs = {

"conv": django_conversions,
"charset": "utf8mb4",
"use_unicode": True,

}

Unfortunately this won't work unless we also change MySQLdb.connections class Connection function set_character_set:

Change the two bottom lines to (complete function definition here: http://pastebin.com/AMN1B8za)

#Hack so data can be decoded/encoded using python's utf8 since
# python does not know about mysql utf8mb4

if charset == 'utf8mb4':

charset = 'utf8


self.string_decoder.charset = charset

self.unicode_literal.charset = charset

This will guarantee you can use special characets like 😄😃😊☺😉😍😘😚

Unlike the previous hack, which worked on reading/writing data, this patch only allows me to read data in utfmb4 format, but now I've hit an error on insertion/creation where I get 'Cursor' object has no attribute '_last_executed'. I will report evidence on this error as I find it. All your help regarding this error is appreciated.

You can reach me via twitter, @rogeliorv

comment:10 Changed 3 years ago by kmichel_wgs

As a workaround, you can make python understand 'utf8mb4' as an alias for 'utf8':

import codecs
codecs.register(lambda name: codecs.lookup('utf8') if name == 'utf8mb4' else None)

comment:11 Changed 3 years ago by kitsunde

  • Cc kitsunde@… added

comment:12 Changed 3 years ago by evax

There is a fix for this issue in the upcoming MySQLdb-python release (https://github.com/farcepest/MySQLdb1/tree/utf8mb4)

This should be used in together with the appropriate OPTIONS in the database config:

DATABASES = {
        'default': {
            'ENGINE': 'django.db.backends.mysql',
            'OPTIONS': {'charset': 'utf8mb4'},
            (...)

comment:13 Changed 3 years ago by evax

The fix mentioned above has been merged to master ( https://github.com/farcepest/MySQLdb1) and released (http://pypi.python.org/pypi/MySQL-python/1.2.4b5)

comment:14 Changed 3 years ago by streeter

  • Cc django@… added

comment:15 Changed 3 years ago by akaariai

  • Type changed from Uncategorized to New feature

comment:16 Changed 2 years ago by aaugustin

  • Triage Stage changed from Design decision needed to Accepted

This is just another case of MySQL being purposefully and irreversibly brain-damaged. What it calls utf8 in actually a non-standard 3-bytes encoding unrelated to utf8.

Django should give the option to use utf8mb4, and maybe recommend it in the docs (if it works well; non-default features of MySQL rarely work well).

The comments above suggest this isn't possible right now, although I'm still confused as to the nature of the problem. Accepting on this basis.

comment:17 Changed 2 years ago by cvrebert

  • Cc cvrebert added

comment:18 Changed 22 months ago by timo

#21308 describes an issue with running tests on MySQL when charset=utf8mb4 that I've marked as a duplicate of this.

comment:19 follow-up: Changed 22 months ago by anonymous

This InnoDB restriction was fixed in MySQL 5.5.14. You have to set the following:

innodb_large_prefix=ON
innodb_file_per_table=ON
innodb_file_format=Barracuda

and CREATE TABLE or ALTER TABLE with the ROW_FORMAT=DYNAMIC attribute.

comment:20 in reply to: ↑ 19 ; follow-up: Changed 22 months ago by moggers87

  • Cc moggers87 added

Replying to anonymous:
*snip*

and CREATE TABLE or ALTER TABLE with the ROW_FORMAT=DYNAMIC attribute.


I'm not aware of any way to specify options like ROW_FORMAT=DYNAMIC for create/alter statements in Django other than specifying SET ROW_FORMAT=DYNAMIC with 'init_command' - but that adds overhead to each connection (and feels a bit hacky imo)

comment:21 in reply to: ↑ 20 Changed 16 months ago by donturner

Replying to moggers87:

Replying to anonymous:
*snip*

and CREATE TABLE or ALTER TABLE with the ROW_FORMAT=DYNAMIC attribute.


I'm not aware of any way to specify options like ROW_FORMAT=DYNAMIC for create/alter statements in Django other than specifying SET ROW_FORMAT=DYNAMIC with 'init_command' - but that adds overhead to each connection (and feels a bit hacky imo)

Having run into the same issue my conclusion is that the only proper solution is to switch to PostgreSQL. It would appear that MySQL has extremely bad support for indexing on utf8mb4 fields.

comment:22 Changed 14 months ago by flisky

  • Cc flisky added

comment:23 Changed 7 months ago by collinanderson

  • Cc cmawebsite@… added

One solution would be to reduce the INDEX size to 191 for mysql, like the example above:

col1 VARCHAR(500) CHARACTER SET utf8mb4, INDEX (col1(191))"

comment:24 Changed 4 months ago by aigarius

https://mathiasbynens.be/notes/mysql-utf8mb4 has more information on this. Ideally it would be wonderful if Django users could set an option and have Django migration backend generate all the code for the migration to the full utf8mb4 encoding for all databases, tables, columns and corresponding changes to all indexes as well. And then on next major version this option could become the new default.

Note: See TracTickets for help on using tickets.
Back to Top