- How to make MySQL handle UTF-8 properly
- 15 Answers
- How to migrate from utf8 to utf8mb4 in MySQL
- utf8mb4_general_ci or utf8mb4_unicode_ci
- Configuring the utf8mb4 character set for MySQL
- Character set and collation for databases, tables, and columns in MySQL
- Repairing and optimizing all tables
- Example migration for Yii2
How to make MySQL handle UTF-8 properly
One of the responses to a question I asked yesterday suggested that I should make sure my database can handle UTF-8 characters correctly. How can I do this with MySQL?
I really hope we get a comprehensive answer, covering various MySQL versions, incompatibilities, etc.
@EdwardZ.Yang: MySQL 4.1 introduced CHARACTER SETs; 5.1.24 messed with the collation of German sharp-s (ß), which was rectified by adding another collation in 5.1.62 (arguably making things worse); 5.5.3 filled out utf8 with the new charset utf8mb4.
It’s worth pointing out that most of these answers are just plain wrong. Do not use utf8. It only supports characters of up to 3 bytes. The correct character set you should use in MySQL is utf8mb4.
15 Answers
Short answer — You should almost always be using the utf8mb4 charset and utf8mb4_unicode_ci collation.
ALTER DATABASE dbname CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
Original Answer:
MySQL 4.1 and above has a default character set of UTF-8. You can verify this in your my.cnf file; remember to set both client and server (default-character-set and character-set-server).
If you have existing data that you wish to convert to UTF-8, dump your database, and import it back as UTF-8 making sure:
- use SET NAMES utf8 before you query/insert into the database
- use DEFAULT CHARSET=utf8 when creating new tables
- at this point your MySQL client and server should be in UTF-8 (see my.cnf). Remember that any languages you use (such as PHP) must be UTF-8 as well. Some versions of PHP will use their own MySQL client library, which may not be UTF-8 aware.
If you do want to migrate existing data, remember to back up first! Lots of weird chopping of data can happen when things don’t go as planned!
My understanding is that utf8 within MySQL only refers to a small subset of full Unicode. You should use utf8mb4 instead to force full support. See mathiasbynens.be/notes/mysql-utf8mb4 «For a long time, I was using MySQL’s utf8 charset for databases, tables, and columns, assuming it mapped to the UTF-8 encoding described above.»
MySQL has never had a default character set of UTF-8. 4.1 and 5.x up to the latest 5.7 all use latin1 and latin1_swedish_ci for the default charset and collation. See the «Server Character Set and Collation» page in the MySQL manual for confirmation: dev.mysql.com/doc/refman/5.1/en/charset-server.html
@TimTisdall You need not worry about utf8mb4 taking extra storage when most text is ASCII. Although char strings are preallocated, varchar strings are not; see the last few lines on this documentation page. For example, char(10) will pessimistically reserve 40 bytes under utf8mb4, but varchar(10) will allocate bytes in keeping with the variable-length encoding.
@Kevin I think you misread that. I think the maximum row length is 64k. You can only make a utf8mb4 field 1/4 of that because it had to reserve that amount of space. So, even if it’s ASCII you can only insert 16k characters.
@TimTisdall Oh, you’re talking about upper bounds. Yes, those are lower. Fortunately, current versions of mysql will automatically upgrade from varchar(n) to the text data type if you attempt to alter a varchar(n) field to larger than the feasible byte size (while issuing a warning). An index will also have a lower worst-case upper bound, and that may present other problems.
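The arithmetic behind those upper bounds can be sketched roughly as follows. This is only an illustration: it assumes MySQL's 65,535-byte row size limit, a table whose only column is a NOT NULL VARCHAR with a 2-byte length prefix, and the worst-case bytes per character of each charset. The function names are made up for this example.

```python
# Worst-case storage arithmetic for MySQL string columns.
# Helper names are hypothetical; the constants are the assumptions stated above.
MAX_ROW_BYTES = 65535      # MySQL row size limit
LENGTH_PREFIX_BYTES = 2    # long VARCHAR columns store a 2-byte length prefix

MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3, "utf8mb4": 4}


def max_varchar_chars(charset: str) -> int:
    """Largest n for VARCHAR(n) when that column is the only one in the row."""
    return (MAX_ROW_BYTES - LENGTH_PREFIX_BYTES) // MAX_BYTES_PER_CHAR[charset]


def actual_bytes(text: str) -> int:
    """Bytes the text itself needs in a variable-length UTF-8 encoding."""
    return len(text.encode("utf-8"))


if __name__ == "__main__":
    for charset in MAX_BYTES_PER_CHAR:
        print(charset, max_varchar_chars(charset))
    # ASCII text still costs one byte per character, whatever the declared charset.
    print(actual_bytes("hello"), actual_bytes("héllo"), actual_bytes("😀"))
```

So the declared charset lowers the ceiling on n, but the bytes actually stored for a varchar value follow the text, which is the point both commenters are circling.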
To make this ‘permanent’, in my.cnf:

    [client]
    default-character-set=utf8

    [mysqld]
    character-set-server = utf8
To check, go to the client and show some variables:
SHOW VARIABLES LIKE 'character_set%';
Verify that they’re all utf8, except ..._filesystem, which should be binary, and ..._dir, which points somewhere in the MySQL installation.
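If you would rather not eyeball the list, the check can be scripted. The sketch below is a hypothetical helper, not part of any MySQL client library: it assumes you have already fetched the rows of SHOW VARIABLES LIKE 'character_set%' into a plain dict, and the sample values are invented.

```python
# Hypothetical checker for the output of SHOW VARIABLES LIKE 'character_set%'.
def unexpected_charsets(variables: dict, expected: str = "utf8") -> dict:
    """Return the variables whose value is not what the check above expects."""
    problems = {}
    for name, value in variables.items():
        if name.endswith("_dir"):
            continue  # a filesystem path, not a charset
        if name == "character_set_system":
            continue  # fixed by the server, not configurable
        wanted = "binary" if name.endswith("_filesystem") else expected
        if value != wanted:
            problems[name] = value
    return problems


if __name__ == "__main__":
    # Invented sample rows, shaped like the SHOW VARIABLES result.
    sample = {
        "character_set_client": "latin1",
        "character_set_connection": "utf8",
        "character_set_filesystem": "binary",
        "character_set_system": "utf8",
        "character_sets_dir": "/usr/share/mysql/charsets/",
    }
    print(unexpected_charsets(sample))
```

Pass expected="utf8mb4" if that is the charset you are aiming for.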
It didn’t work in my case but I created file my.cf in /etc with given content anyway. I used create table my_name(field_name varchar(25) character set utf8);
The «SHOW VARIABLES LIKE 'character_set%';» command revealed the problem with my connection. Thanks!
MySQL 4.1 and above has a default character set that it calls utf8 but which is actually only a subset of UTF-8 (allows only three-byte characters and smaller).
Use utf8mb4 as your charset if you want «full» UTF-8.
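To see which text is affected, here is a small sketch (the function names are mine, not anything from MySQL). It relies only on the fact that characters outside the Basic Multilingual Plane take four bytes in UTF-8, which is exactly what the three-byte utf8 charset cannot store.

```python
def utf8_byte_lengths(text: str) -> list:
    """Pair each character with the number of bytes it needs in UTF-8."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


def needs_utf8mb4(text: str) -> bool:
    """True if any character lies above U+FFFF, i.e. needs 4 bytes in UTF-8."""
    return any(ord(ch) > 0xFFFF for ch in text)


if __name__ == "__main__":
    print(utf8_byte_lengths("aß語😀"))
    print(needs_utf8mb4("Straße, Łódź, 日本語"))  # all within three bytes
    print(needs_utf8mb4("I 😀 MySQL"))            # the emoji needs four
```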
Definitely agree, this is the only correct answer. utf8 doesn’t include chars like emoticons. utf8mb4 does. Check this for more info on how to update: mathiasbynens.be/notes/mysql-utf8mb4
@Basti — Mostly correct (latin1 was the default until just recently), and not complete (does not discuss correctly inserting/selecting utf8-encoded data, nor displaying in html).
Respectfully, @RickJames, Basti said «so far» — I don’t remember seeing your answer when I posted this.
Alas, there are about 5 distinctly different symptoms of utf8 problems, and about 4 things that programmers do wrong to cause trouble. Most answers point out only one thing that may need fixing. The original question was a broad one, so the answer needed all 4. Perhaps Basti was familiar with one symptom for which your one aspect was the solution.
As an aside, I’d like to pause a moment and give the MySQL team a really good, hard stare. o_o WTF were you guys thinking? Do you realize how much confusion you’ve sown by creating a codepage in your program called «utf8» that isn’t actually UTF-8? Goddamn assholes.
The short answer: Use utf8mb4 in 4 places:
- The bytes in your client are utf8, not latin1/cp1251/etc.
- SET NAMES utf8mb4 or something equivalent when establishing the client’s connection to MySQL
- CHARACTER SET utf8mb4 on all tables/columns — except columns that are strictly ascii/hex/country_code/zip_code/etc.
- <meta charset=UTF-8> if you are outputting to HTML. (Yes, the spelling is different here.)
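The first item on that list is the one people tend to skip, so here is a rough illustration of what goes wrong when UTF-8 bytes are treated as latin1 somewhere along the way. It is a sketch only: Python's latin-1 codec stands in for MySQL's latin1 (which is closer to cp1252), and the function names are invented.

```python
def misread_as_latin1(text: str) -> str:
    """Encode as UTF-8, then decode the same bytes as latin1 (a typical mismatch)."""
    return text.encode("utf-8").decode("latin-1")


def undo_misread(garbled: str) -> str:
    """Reverse the mismatch above, assuming the text was garbled exactly once."""
    return garbled.encode("latin-1").decode("utf-8")


if __name__ == "__main__":
    garbled = misread_as_latin1("é")
    print(garbled)                # the familiar two-character mojibake: Ã©
    print(undo_misread(garbled))  # back to é
```

If what you see on screen looks like the garbled form, suspect a disagreement between the client bytes and the connection charset before blaming the column definition.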
The above links provide the «detailed canonical answer is required to address all the concerns». — There is a space limit on this forum.
In addition to CHARACTER SET utf8mb4 containing «all» the world’s characters, COLLATION utf8mb4_unicode_520_ci is arguably the ‘best all-around’ collation to use. (There are also Turkish, Spanish, etc., collations for those who want the nuances in those languages.)
@Louis: And as I implied, Spanish and Turkish (as well as Polish) users may not be happy. «Best all-around» tends to hurt everyone some. MySQL 8.0 has an even newer «best» collation: utf8mb4_0900_ai_ci. Alas, again L=Ł.
The charset is a property of the database (default) and the table. You can have a look (MySQL commands):
    show create database foo;
    > CREATE DATABASE `foo` /*!40100 DEFAULT CHARACTER SET latin1 */

    show create table foo.bar;
    > lots of stuff ending with
    > ) ENGINE=InnoDB AUTO_INCREMENT=252 DEFAULT CHARSET=latin1

In other words, it’s quite easy to check your database charset or change it:
ALTER TABLE `foo`.`bar` CHARACTER SET utf8mb4; /* was: utf8 */
I followed Javier’s solution, but I added some different lines in my.cnf:
    [mysqld]
    skip-character-set-client-handshake
    collation_server=utf8_unicode_ci
    character_set_server=utf8
I found this idea here: http://dev.mysql.com/doc/refman/5.0/en/charset-server.html in the first/only user comment on the bottom of the page. He mentions that skip-character-set-client-handshake has some importance.
This unloved, zero-vote answer was the only thing that helped me! So it gets my vote, that’s for darn sure. skip-character-set-client-handshake was the key.
To change the character set encoding to UTF-8 for the database itself, run the following ALTER DATABASE command at the mysql> prompt, replacing DBNAME with the database name:
ALTER DATABASE DBNAME CHARACTER SET utf8 COLLATE utf8_general_ci;
Set your database collation to UTF-8 then apply table collation to database default.
To make Perl handle UTF-8 correctly, use a utf8mb4 collation in MySQL, add the mysql_enable_utf8mb4 attribute to the DBI connection, and run the SQL command «SET NAMES utf8mb4» after connecting.
    #!/usr/bin/perl
    use DBI;

    print "Content-type: text/html; charset=UTF-8\n\n";

    #use utf8;
    #use open ':utf8';
    #binmode STDOUT, ":utf8";
    #binmode STDIN , ":utf8";
    #use encoding 'utf8';

    # $database, $servername, $port, $username and $password are set elsewhere.
    our $dbh = DBI->connect(
        "DBI:mysql:database=$database;host=$servername;port=$port",
        $username, $password,
        { PrintError => 0, mysql_enable_utf8mb4 => 1 }
    ) || die;

    # Make the connection itself use utf8mb4.
    $dbh->do("SET NAMES utf8mb4");
You can configure this through the MySQL settings. Parts of this answer may go beyond the question, but they may still help you.
How to configure the character set and collation.
For applications that store data using the default MySQL character set and collation ( latin1, latin1_swedish_ci ), no special configuration should be needed. If applications require data storage using a different character set or collation, you can configure character set information several ways:
- Specify character settings per database. For example, applications that use one database might require utf8, whereas applications that use another database might require sjis.
- Specify character settings at server startup. This causes the server to use the given settings for all applications that do not make other arrangements.
- Specify character settings at configuration time, if you build MySQL from source. This causes the server to use the given settings for all applications, without having to specify them at server startup.
The examples shown here set the utf8 character set and, for completeness, the utf8_general_ci collation as well.
Specify character settings per database
CREATE DATABASE new_db DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
Specify character settings at server startup
    [mysqld]
    character-set-server=utf8
    collation-server=utf8_general_ci
Specify character settings at MySQL configuration time
    shell> cmake . -DDEFAULT_CHARSET=utf8 \
               -DDEFAULT_COLLATION=utf8_general_ci
To see the values of the character set and collation system variables that apply to your connection, use these statements:
    SHOW VARIABLES LIKE 'character_set%';
    SHOW VARIABLES LIKE 'collation%';
How to migrate from utf8 to utf8mb4 in MySQL
If you are running MySQL 5.5.3 or later, you should use the utf8mb4 character set instead of utf8. This is mentioned here and here.
Consequently, there is no longer any need to use either utf8_general_ci or utf8_unicode_ci.
utf8mb4_general_ci or utf8mb4_unicode_ci
The utf8mb4_unicode_ci collation is currently the recommended choice for MySQL databases and tables.
Configuring the utf8mb4 character set for MySQL
Based on the above, we need to configure MySQL's main character set parameters.
In the MySQL configuration file (my.ini on Windows, my.cnf on Linux), change the character set to utf8mb4:
    [client]
    default-character-set = utf8mb4

    [mysql]
    default-character-set = utf8mb4

    [mysqld]
    character-set-client-handshake = FALSE
    init_connect = 'SET collation_connection = utf8mb4_unicode_ci'
    init_connect = 'SET NAMES utf8mb4'
    character-set-server = utf8mb4
    collation-server = utf8mb4_unicode_ci
Check that the applied settings work correctly:
SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
    +--------------------------+--------------------+
    | Variable_name            | Value              |
    +--------------------------+--------------------+
    | character_set_client     | utf8mb4            |
    | character_set_connection | utf8mb4            |
    | character_set_database   | utf8mb4            |
    | character_set_filesystem | binary             |
    | character_set_results    | utf8mb4            |
    | character_set_server     | utf8mb4            |
    | character_set_system     | utf8               |
    | collation_connection     | utf8mb4_general_ci |
    | collation_database       | utf8mb4_unicode_ci |
    | collation_server         | utf8mb4_unicode_ci |
    +--------------------------+--------------------+
    10 rows in set, 1 warning (0.00 sec)
Character set and collation for databases, tables, and columns in MySQL
Queries for changing the character set and collation of a database, table, or column to utf8mb4.
For a database:
ALTER DATABASE [db_name] CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
For a table:
ALTER TABLE [table_name] CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
For columns:
ALTER TABLE [table_name] CHANGE [column_name] [column_name] VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
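The 191 in that column definition is not arbitrary. A likely explanation, sketched below, is the InnoDB index key prefix limit: this assumes the 767-byte limit of the older COMPACT and REDUNDANT row formats (the DYNAMIC format allows 3072 bytes), and the helper name is made up for the example.

```python
# Assumed limit: 767 bytes per index key prefix (InnoDB COMPACT/REDUNDANT row formats).
INDEX_PREFIX_LIMIT_BYTES = 767


def max_indexed_chars(bytes_per_char: int, limit: int = INDEX_PREFIX_LIMIT_BYTES) -> int:
    """Longest fully indexable VARCHAR, in characters, for a worst-case charset width."""
    return limit // bytes_per_char


if __name__ == "__main__":
    print(max_indexed_chars(3))  # utf8: the familiar VARCHAR(255) still fits
    print(max_indexed_chars(4))  # utf8mb4: the same byte budget covers fewer characters
```

That is why an indexed VARCHAR(255) column under utf8 is often shortened to VARCHAR(191) when moving to utf8mb4.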
Repairing and optimizing all tables
After upgrading the MySQL server and changing the character set and collation as described, you need to repair and optimize all databases and tables. To do so, you can run the following queries for each table:

    REPAIR TABLE [table_name];
    OPTIMIZE TABLE [table_name];

Or use the mysqlcheck command:
$ mysqlcheck -u root -p --auto-repair --optimize --all-databases
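If you prefer to run the per-table queries by hand, they can be generated rather than typed. The following is a minimal sketch under a few assumptions: the table list is supplied by you (the names below are placeholders), the helper names are invented, and the script only builds SQL text and does not connect to MySQL.

```python
def quote_identifier(name: str) -> str:
    """Backtick-quote a MySQL identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def migration_statements(tables) -> list:
    """Build the conversion, repair and optimize statements for each table."""
    statements = []
    for table in tables:
        quoted = quote_identifier(table)
        statements.append(
            f"ALTER TABLE {quoted} CONVERT TO CHARACTER SET utf8mb4 "
            "COLLATE utf8mb4_unicode_ci;"
        )
        statements.append(f"REPAIR TABLE {quoted};")
        statements.append(f"OPTIMIZE TABLE {quoted};")
    return statements


if __name__ == "__main__":
    # Placeholder table names; replace them with your own.
    for statement in migration_statements(["post", "user"]):
        print(statement)
```

Review the generated statements and take a backup before running them against real data.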
Example migration for Yii2
In this example we change the character set of the content column in the post table:
    /**
     * @return void
     * @throws \yii\db\Exception
     */
    public function safeUp()
    {
        $sql = "ALTER TABLE `post` CHANGE `content` `content` MEDIUMTEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci";
        Yii::$app->db->createCommand($sql)->execute();
    }

    /**
     * @return void
     * @throws \yii\db\Exception
     */
    public function safeDown()
    {
        $sql = "ALTER TABLE `post` CHANGE `content` `content` MEDIUMTEXT CHARACTER SET utf8 COLLATE utf8_unicode_ci";
        Yii::$app->db->createCommand($sql)->execute();
    }