Opened 12 years ago

Last modified 3 years ago

#18392 new New feature

Make MySQL backend default to utf8mb4 encoding

Reported by: EmilStenstrom Owned by: nobody
Component: Database layer (models, ORM) Version: 1.4
Severity: Normal Keywords: utf8mb4 mysql
Cc: kitsunde@…, django@…, cvrebert, moggers87, flisky, cmawebsite@…, contact@…, Jonas Trappenberg, ask@…, django@…, emorley@…, clokep@…, Adam Johnson Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

Background:
I just ran into a problem with iPhone emoji characters being saved into MySQL. The text was cut off after the first emoji character. After some research I found which explains how it works: ​http://mzsanford.wordpress.com/2010/12/28/mysql-and-unicode/

The recommendation is to use MySQL 5.5, and the "utf8mb4" encoding.

Suggestion:
Make "utf8mb4" the default encoding for MySQL 5.5 and up.

Change History (49)

comment:1 by Anssi KÀÀriÀinen, 12 years ago

Triage Stage: Unreviewed β†’ Accepted

To me it seems the character encoding is set on connect to 'charset': 'utf8'. Is it enough to change this from 'utf8' to 'utf8mb4' to change this default. If not, where should this default encoding be defined? Does this mean that any text column must be VARCHAR(N) CHARACTER SET utf8mb4 on creation?

I am accepting this as to me the change sounds valid.

comment:2 by EmilStenstrom, 12 years ago

I don't know enough about MySQL and the ORM to answer your questions, I hope someone else does.

comment:3 by EmilStenstrom, 12 years ago

Ok, there's some trickiness here. Indexes in InnoDB tables can't be longer than 255 chars with utf8, but only 191 chars with utf8mb4. This means that the default indexes that Django makes for CharField(max_length=255) is too long, and will break things (break what? I'm running a migration that converts all my tables to utf8mb4 automatically, and setting utf8mb4 on a long charfield breaks because the index is too long).

From the official docs:

"InnoDB has a maximum index length of 767 bytes, so for utf8 or utf8mb4 columns, you can index a maximum of 255 or 191 characters, respectively. If you currently have utf8 columns with indexes longer than 191 characters, you will need to index a smaller number of characters. In an InnoDB table, these column and index definitions are legal:

col1 VARCHAR(500) CHARACTER SET utf8, INDEX (col1(255))

To use utf8mb4 instead, the index must be smaller:

col1 VARCHAR(500) CHARACTER SET utf8mb4, INDEX (col1(191))"

From: ​http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-upgrading.html

comment:4 by Anssi KÀÀriÀinen, 12 years ago

Triage Stage: Accepted β†’ Design decision needed

I am marking this as design decision needed. This will need a good solution which guarantees trouble-free upgrades for current users. While it would be nice to have full UTF8 support with MySQL, I don't think it is worth risking a breakage for existing users.

So, looking for good solutions here. Does MySQL infer the used character set from some runtime variable (perhaps something set at CREATE DATABASE time)? If so, we could just check the used charset for the db, and use that for test database creation and index creation. If so, this would be a simple solution to this problem.

comment:5 by Karen Tracey, 12 years ago

I don't believe setting the connection charset to this new utf8mb4 encoding (if we're talking to a server that supports it) would cause any problem with the indexing: that's a problem that comes into play when you change the database charset itself. Whether we need to change that isn't clear to me. It's possible that these new characters (stored in a DB with utf8mb4 encoding) would be transmitted over a connection that has charset set simply to utf8. If so then there's really no point in making any connection charset change. So one open question here is: do you get an error or incorrect behavior trying to read/write these characters over a utf8 connection created by Django when operating on a database with utf8mb4 encoding?

Traditionally Django has not gone beyond advising in the documentation what charset to use (uft8) for the database. Django doesn't attempt to set the charset for stuff it creates to utf8, it just uses the default charset for the DB, which is set before Django ever gets involved. Given this new MySQL 5.5 support for "more better" unicode we probably need to update the docs to mention the new option for database charset.

Having the index creation fail for a too-long CharField is a nuisance. I know we've had tickets before that dealt with this issue, but I don't recall offhand what the status is. Django tries to disallow creation of a CharField where the index creation will fail, but MySQL makes it incredibly difficult to figure out what the right value for "max allowed" is. This new encoding just makes for more of a mess there.

comment:6 by EmilStenstrom, 12 years ago

Maybe this ticket just needs some exploration before going further: A simple test that tries to save and their fetch some 4-byte unicode characters to MySQL... I think that would reveal what the minimal changes are, for full unicode support to be possible.

comment:7 by rogeliorv, 12 years ago

Keywords: hack utf8mb4 mysql added

As proof of concept I tried inserting into mysql the following values in a varchar.

πŸ˜„πŸ˜ƒπŸ˜Šβ˜ΊπŸ˜‰πŸ˜πŸ˜˜πŸ˜š

And got a "Warning: Incorrect string value" from mysql (which python treats as an exception)

As a way to test it. The hack consists in adding self.query('SET NAMES utf8mb4') in MySQLdb.connections in Connection.set_character_set function as shown here: ​http://pastebin.com/MW5BgRgP

Of course the correct way would be to change this in django when setting up the cursor connection.


in reply to:  7 ; comment:8 by EmilStenstrom, 12 years ago

Replying to rogeliorv:

As a way to test it. The hack consists in adding self.query('SET NAMES utf8mb4') in MySQLdb.connections in Connection.set_character_set function as shown here: ​http://pastebin.com/MW5BgRgP

Of course the correct way would be to change this in django when setting up the cursor connection.

Did your hack remove the exception, or what was the rationale behind it? What's the next step here?

Version 0, edited 12 years ago by EmilStenstrom (next)

in reply to:  8 comment:9 by rogeliorv, 12 years ago

Has patch: set
Keywords: hack removed
Needs tests: set
Patch needs improvement: set

Replying to EmilStenstrom:

Replying to rogeliorv:

As a way to test it. The hack consists in adding self.query('SET NAMES utf8mb4') in MySQLdb.connections in Connection.set_character_set function as shown here: ​http://pastebin.com/MW5BgRgP

Of course the correct way would be to change this in django when setting up the cursor connection.

Did your hack remove the exception? What was the rationale behind the hack? What's the next step?

Yes, the hack removed the exception. The rationale followed was to make the mysql client to use a certain encoding.

The next step is to make django's mysql connections to use utf8mb4 by default or otherwise make it more configurable. Since utf8bm4 is utf8 compatible, there should be no extra changes in that regard.

To achieve this django.db.base.cursor should be changed in class DatabaseWrapper function _cursor, (complete function definition here ​http://pastebin.com/A6dMEMd4):

'kwargs = {
  "conv": django_conversions,
  "charset": "utf8mb4",
  "use_unicode": True,
}

Unfortunately this won't work unless we also change MySQLdb.connections class Connection function set_character_set:

Change the two bottom lines to (complete function definition here: ​http://pastebin.com/AMN1B8za)

#Hack so data can be decoded/encoded using python's utf8 since
# python does not know about mysql utf8mb4

''if charset == 'utf8mb4':''
    ''charset = 'utf8'''
        
''self.string_decoder.charset = charset''

''self.unicode_literal.charset = charset''

This will guarantee you can use special characters like πŸ˜„πŸ˜ƒπŸ˜Šβ˜ΊπŸ˜‰πŸ˜πŸ˜˜πŸ˜š

Unlike the previous hack, which worked on reading/writing data, this patch only allows me to read data in utfmb4 format, but now I've hit an error on insertion/creation where I get 'Cursor' object has no attribute '_last_executed'. I will report evidence on this error as I find it. All your help regarding this error is appreciated.

You can reach me via twitter, @rogeliorv

Last edited 8 years ago by Tim Graham (previous) (diff)

comment:10 by Kevin Michel, 12 years ago

As a workaround, you can make python understand 'utf8mb4' as an alias for 'utf8':

import codecs
codecs.register(lambda name: codecs.lookup('utf8') if name == 'utf8mb4' else None)

comment:11 by Kit Sunde, 11 years ago

Cc: kitsunde@… added

comment:12 by evax, 11 years ago

There is a fix for this issue in the upcoming MySQLdb-python release (​https://github.com/farcepest/MySQLdb1/tree/utf8mb4)

This should be used in together with the appropriate OPTIONS in the database config:

DATABASES = {
        'default': {
            'ENGINE': 'django.db.backends.mysql',
            'OPTIONS': {'charset': 'utf8mb4'},
            (...)

comment:13 by evax, 11 years ago

The fix mentioned above has been merged to master ( ​https://github.com/farcepest/MySQLdb1) and released (​http://pypi.python.org/pypi/MySQL-python/1.2.4b5)

comment:14 by Chris Streeter, 11 years ago

Cc: django@… added

comment:15 by Anssi KÀÀriÀinen, 11 years ago

Type: Uncategorized β†’ New feature

comment:16 by Aymeric Augustin, 11 years ago

Triage Stage: Design decision needed β†’ Accepted

This is just another case of MySQL being purposefully and irreversibly brain-damaged. What it calls utf8 in actually a non-standard 3-bytes encoding unrelated to utf8.

Django should give the option to use utf8mb4, and maybe recommend it in the docs (if it works well; non-default features of MySQL rarely work well).

The comments above suggest this isn't possible right now, although I'm still confused as to the nature of the problem. Accepting on this basis.

comment:17 by cvrebert, 11 years ago

Cc: cvrebert added

comment:18 by Tim Graham, 10 years ago

#21308 describes an issue with running tests on MySQL when charset=utf8mb4 that I've marked as a duplicate of this.

comment:19 by anonymous, 10 years ago

This ​InnoDB restriction was fixed in MySQL 5.5.14. You have to set the following:

​innodb_large_prefix=ON
innodb_file_per_table=ON
innodb_file_format=Barracuda

and CREATE TABLE or ALTER TABLE with the ROW_FORMAT=DYNAMIC attribute.

in reply to:  19 ; comment:20 by moggers87, 10 years ago

Cc: moggers87 added

Replying to anonymous:
*snip*

and CREATE TABLE or ALTER TABLE with the ROW_FORMAT=DYNAMIC attribute.


I'm not aware of any way to specify options like ROW_FORMAT=DYNAMIC for create/alter statements in Django other than specifying SET ROW_FORMAT=DYNAMIC with 'init_command' - but that adds overhead to each connection (and feels a bit hacky imo)

in reply to:  20 comment:21 by donturner, 10 years ago

Replying to moggers87:

Replying to anonymous:
*snip*

and CREATE TABLE or ALTER TABLE with the ROW_FORMAT=DYNAMIC attribute.


I'm not aware of any way to specify options like ROW_FORMAT=DYNAMIC for create/alter statements in Django other than specifying SET ROW_FORMAT=DYNAMIC with 'init_command' - but that adds overhead to each connection (and feels a bit hacky imo)

Having run into the same issue my conclusion is that the only proper solution is to switch to PostgreSQL. It would appear that MySQL has extremely bad support for indexing on utf8mb4 fields.

comment:22 by flisky, 10 years ago

Cc: flisky added

comment:23 by Collin Anderson, 9 years ago

Cc: cmawebsite@… added

One solution would be to reduce the INDEX size to 191 for mysql, like the example above:

col1 VARCHAR(500) CHARACTER SET utf8mb4, INDEX (col1(191))"

comment:24 by Aigars Mahinovs, 9 years ago

​https://mathiasbynens.be/notes/mysql-utf8mb4 has more information on this. Ideally it would be wonderful if Django users could set an option and have Django migration backend generate all the code for the migration to the full utf8mb4 encoding for all databases, tables, columns and corresponding changes to all indexes as well. And then on next major version this option could become the new default.

in reply to:  24 comment:25 by Jonas Trappenberg, 8 years ago

Replying to aigarius:

[...] have Django migration backend generate all the code for the migration [...]

Even the utf8 to utf8mb4 migration, which would be the easiest in terms of required changes, may cause data loss if you have indexes that are longer than 191 characters and should thus probably be inspected and fixed up individually and manually. For that reason, it seems impossible to me to automate. That being said, clear migration instructions would probably be helpful for many users.

Maybe this could be another backend-specific setting, similar to what was implemented for integer types in 1506c71a95cd7f58fbc6363edf2ef742c58d2487? Although, it only applying to index columns may make it way more painful to implement.

comment:26 by Collin Anderson, 8 years ago

For a first step, I propose a setting in DATABASES OPTIONS that tells Django the maximum index size to use when creating new indexes. I think it should default to 191. That way Django won't ever reduce the size of existing indexes, only new indexes.

If we have that setting, it's at least _possible_ to use utf8mb4. We could then down the road have Django try to default to using utf8mb4 encoding for new databases/tables/columns.

(We need this for #20846.)

Ohh and it _is_ possible to have indexes longer than 191, it just requires changing your db configuration, so Django should allow longer than 191 if it's told that it's ok.

Last edited 8 years ago by Collin Anderson (previous) (diff)

comment:27 by Tim Graham, 8 years ago

Will that setting work nicely with migrations though? I think we need to know the index names for some operations like AlterField. It seems problematic if we have a way that users can vary the index names without updating existing names.

comment:28 by Collin Anderson, 8 years ago

I was thinking it wouldn't actually change the name of the index, but I haven't actually looked at the code. :)

comment:29 by Tim Graham, 8 years ago

In case it's not clear, here's the problematic scenario I see:

  • Indexes are created with max index name length=191 and truncated accordingly.
  • Developer increases max index name length setting to 200 (no changes to index names in the database).
  • Now migrations can't operate on existing indexes because the 200 character index names it generates aren't what's in the database.

comment:30 by Collin Anderson, 8 years ago

I'm not thinking of limiting the _name_ of the index. The issue is "the maximum number of characters that can be indexed".

comment:31 by Tim Graham, 8 years ago

Thanks Collin, in that case your proposal makes more sense to me. It could be nice to get a consensus from more MySQL users though.

comment:32 by ernestoalejo, 8 years ago

Cc: contact@… added

comment:33 by Jonas Trappenberg, 8 years ago

Cc: Jonas Trappenberg added

comment:34 by Ask Solem Hoel, 8 years ago

Cc: ask@… added

comment:35 by Keryn Knight, 8 years ago

Cc: django@… added

By way of consensus: I agree that if possible, going forward, Django ought to seek to use utf8mb4, because for better or worse, the world is becoming more mobile oriented, and with it more emoji-laden; the non-mb4 charsets/collations choke on such things, which is just plain unfortunate.

If a suitable backwards-compatible, migration-friendly patch can't be achieved, it should at least be mentioned in the docs somewhere, IMHO (searching for utf8mb4 or emoji currently yields no results, so I assume there are none squirreled away). Whether that mention is a recommendation or not is probably still unclear.

comment:36 by EmilStenstrom, 8 years ago

Slightly off-topic but maybe interesting: The Wordpress team did convert all their users from utf8 to utf8mb4 according to this blog post: ​https://make.wordpress.org/core/2015/04/02/the-utf8mb4-upgrade/.

Looking at the code it seems they just dropped and recreated all the indexes: ​https://github.com/WordPress/WordPress/blob/master/wp-admin/includes/upgrade.php#L2687

And then converted all the tables one by one: ​https://github.com/WordPress/WordPress/blob/master/wp-admin/includes/upgrade.php#L1951

Last edited 8 years ago by EmilStenstrom (previous) (diff)

comment:37 by Collin Anderson, 8 years ago

Yes, I based my proposal off of what WordPress did. WordPress limited the length of the index without limiting the length of the field itself. Django currently doesn't have that option.

https://code.djangoproject.com/ticket/18392#comment:26

comment:38 by Ed Morley, 8 years ago

Cc: emorley@… added

comment:39 by Patrick Cloke, 8 years ago

Cc: clokep@… added

comment:40 by Tim Graham, 7 years ago

The skips for the tests added in 1a9f6db5ffd2d5e71d73340ab59476572e05a728 should be removed or modified when completing this ticket. Should we skip them conditionally based on the test charset or should we require running the MySQL tests with utf8mb4?

comment:41 by Claude Paroz, 7 years ago

I suggest to begin with a very minimal patch like ​this PR, which will at least allow users to begin converting some database columns to utf8mb4 through custom migrations, and use these columns in their code (where indexing doesn't come in their way).

Working on index issues can come later, and will be needed to run the Django test suite with utf8mb4.

comment:42 by Claude Paroz, 7 years ago

Oh, now I realize that utf8mb4 can also be set in DATABASES OPTIONS. Still, using it in Django by default is a strong signal.

comment:43 by Claude Paroz, 7 years ago

Needs tests: unset
Patch needs improvement: unset

comment:44 by Tim Graham, 7 years ago

Patch needs improvement: set
Summary: Use utf8mb4 encoding with MySQL 5.5 β†’ Make MySQL backend default to utf8mb4 encoding

There's an outstanding issue to fix on the pull request and Claude said, "I'm not sure if I'll have time to continue working on this, so if anyone wants to take this patch further, feel free!"

comment:45 by Kelly Campbell, 6 years ago

As a workaround, I came up with this monkey patch that limits the index size. We put this in our migrations/init.py:

from django.db.models.fields import CharField

def _create_index_sql(self, model, fields, suffix="", sql=None):
    """
    Return the SQL statement to create the index for one or several fields.
    `sql` can be specified if the syntax differs from the standard (GIS
    indexes, ...).
    """
    tablespace_sql = self._get_index_tablespace_sql(model, fields)
    idx_columns = []
    for field in fields:
        c = field.column
        if isinstance(field, CharField):
            if field.max_length > 255:
                idx_columns.append(self.quote_name(c) + '(255)')
            else:
                idx_columns.append(self.quote_name(c))
        else:
            idx_columns.append(self.quote_name(c))
    columns = [field.column for field in fields]
    sql_create_index = sql or self.sql_create_index
    return sql_create_index % {
        "table": self.quote_name(model._meta.db_table),
        "name": self.quote_name(self._create_index_name(model, columns, suffix=suffix)),
        "using": "",
        "columns": ", ".join(column for column in idx_columns),
        "extra": tablespace_sql,
    }


from django.db.backends.mysql.schema import DatabaseSchemaEditor
DatabaseSchemaEditor._create_index_sql = _create_index_sql

comment:46 by Arthur Pemberton, 5 years ago

Can anyone clarify the process to migrate the default Django generated MySQL schema to a utf8mb4 friendly one?

comment:47 by Adam Johnson, 4 years ago

Cc: Adam Johnson added

For all - Django-MySQL has a system to check to warn if you're not usnig utf8mb4: ​https://django-mysql.readthedocs.io/en/latest/checks.html#django-mysql-w003-utf8mb4

Replying to Arthur Pemberton:

Can anyone clarify the process to migrate the default Django generated MySQL schema to a utf8mb4 friendly one?

See this post: ​https://mathiasbynens.be/notes/mysql-utf8mb4 .

comment:48 by lambdaq, 3 years ago

I submitted a utf8mb4 fix on github and was redirected here to this old relic.

​https://github.com/django/django/pull/14563

Lots of stuff had changed since the issue was first open nine years ago. My two cents:

  1. In v5.7, innodb indexes no longer limits the 767 bytes hardcap on utf8mb4 indexes.

​https://dev.mysql.com/doc/relnotes/mysql/5.7/en/news-5-7-7.html#mysqld-5-7-7-feature

When innodb_file_format is set to Barracuda, innodb_large_prefix=ON allows index key prefixes longer than 767 bytes (up to 3072 bytes) for tables that use a Compressed or Dynamic row format.

  1. MySQL 8 will default use db-wide utf8mb4

​https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-1.html#mysqld-8-0-1-charset

Important Change: The default character set has changed from latin1 to utf8mb4. These system variables are affected:

IMHO Django's decision to hack MySQL's default settings to less supported utf8mb3 (aka the utf8) would be unwise. Maybe it was a proper compromise 9 years ago, but it will be a liability in the years to come.

Maybe at least we can add some notes to alarm the readers?

CREATE DATABASE <dbname> CHARACTER SET utf8;

This would be a huge misleading mistake on official Django doc.

@felixxm @pope1ni

Last edited 3 years ago by lambdaq (previous) (diff)

in reply to:  48 comment:49 by Nick Pope, 3 years ago

Replying to lambdaq:

Lots of stuff had changed since the issue was first open nine years ago.

Yes, it has. But it is still not necessarily straightforward.

  1. In v5.7, innodb indexes no longer limits the 767 bytes hardcap on utf8mb4 indexes.

​https://dev.mysql.com/doc/relnotes/mysql/5.7/en/news-5-7-7.html#mysqld-5-7-7-feature

When innodb_file_format is set to Barracuda, innodb_large_prefix=ON allows index key prefixes longer than 767 bytes (up to 3072 bytes) for tables that use a Compressed or Dynamic row format.

Based on your linked release notes, MySQL 5.7.7 changed the defaults of some settings to the following:

innodb_file_format=Barracuda  # Previous default was Antelope
innodb_large_prefix=ON        # Previous default was OFF

These allow indexing strings up to 768 characters instead of 191 characters for utf8mb4 which should eliminate the problem of . I don't think we need to worry about that new upper limit being an issue as 255 was the previous cap anyway with utf8 (a.k.a. utf8mb3).

It should be noted that these options are also deprecated as of that release and removed in ​8.0.0. The release notes also state that using non-default values for the above settings in MySQL 5.7.7+ will log a deprecation warning.

The underlying problem, however, is what ​row format is used. It is necessary for the row format of a table to be COMPRESSED or DYNAMIC for large key index support. These were unavailable with the previous default file format configuration (Antelope) and only REDUNDANT and COMPACT could be used.

Note that even in MySQL 8.0.0+ all of these ​row formats can be used, will not be changed automatically during upgrade and thus the migration of existing projects from utf8mb3 β†’ utf8mb4 (with 255 β†’ 191 characters) is still a potential problem.

That said, the default value of ​innodb_default_row_format added in MySQL 5.7.9 is DYNAMIC. (See ​here also.) Note that in 5.6 ​the default value was COMPACT.

I found the following ​upgrade details which state:

In MySQL 5.7.9, DYNAMIC replaces COMPACT as the implicit default row format for InnoDB tables. A new configuration option, innodb_default_row_format, specifies the default InnoDB row format. Permitted values include DYNAMIC (the default), COMPACT, and REDUNDANT.

After upgrading to 5.7.9, any new tables that you create use the row format defined by innodb_default_row_format unless you explicitly define a row format (ROW_FORMAT).

For existing tables that do not explicitly define a ROW_FORMAT option or that use ROW_FORMAT=DEFAULT, any operation that rebuilds a table also silently changes the row format of the table to the format defined by innodb_default_row_format. Otherwise, existing tables retain their current row format setting. For more information, see Defining the Row Format of a Table.

So maybe things are not as bad as they seem after all.

  1. MySQL 8 will default use db-wide utf8mb4

​https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-1.html#mysqld-8-0-1-charset

Important Change: The default character set has changed from latin1 to utf8mb4. These system variables are affected:

IMHO Django's decision to hack MySQL's default settings to less supported utf8mb3 (aka the utf8) would be unwise. Maybe it was a proper compromise 9 years ago, but it will be a liability in the years to come.

Maybe at least we can add some notes to alarm the readers?

CREATE DATABASE <dbname> CHARACTER SET utf8;

This would be a huge misleading mistake on official Django doc.

@felixxm @pope1ni

It would be nice to sort this out, not that I'm volunteering. You should reignite the discussion on the DevelopersMailingList if you want to take this on.

Given how things have changed maybe there is an easy path forward now and we could consider the following:

  • Require MySQL 5.7.9+. We can't drop 5.7 entirely as it is supported until October 2023. See SupportedDatabaseVersions.
  • Add a system check to ensure that the configuration options are set to expected values:
    # For MySQL < 8.0.0:
    innodb_file_format=Barracuda
    innodb_large_prefix=ON
    # For all versions:
    innodb_default_row_format=DYNAMIC
    
  • Add a system check that the ROW_FORMAT of all tables is DEFAULT, DYNAMIC, or COMPRESSED. (See ​here.)
  • Change the documentation and connection configuration to use utf8mb4.
  • Add plenty of warnings into the release notes detailing the restrictions and how to fix any issues in preparation for upgrading Django.

One other question is how this all affects MariaDB which is supported by django.db.backends.mysql. I haven't looked into that.

Note: See TracTickets for help on using tickets.
Back to Top