Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/38363566/

Date: 2020-08-31 21:55:21  Source: igfitidea

Trouble with UTF-8 characters; what I see is not what I stored

Tags: mysql, unicode, utf-8, character-encoding, mariadb

Asked by Rick James

I tried to use UTF-8 and ran into trouble.

I have tried so many things; here are the results I have gotten:

  • ???? instead of Asian characters. Even for European text, I got Se?or for Señor.
  • Strange gibberish (Mojibake?) such as SeÃ±or or æ–°æµªæ–°é—» for 新浪新闻.
  • Black diamonds, such as Se�or.
  • Finally, I got into a situation where the data was lost, or at least truncated: Se for Señor.
  • Even when I got text to look right, it did not sort correctly.

What am I doing wrong? How can I fix the code? Can I recover the data, if so, how?

Answered by Rick James

This problem plagues the participants of this site, and many others.

You have listed the five main cases of CHARACTER SET troubles.

Best Practice

Going forward, it is best to use CHARACTER SET utf8mb4 and COLLATION utf8mb4_unicode_520_ci. (There is a newer version of the Unicode collation in the pipeline.)

utf8mb4 is a superset of utf8 in that it also handles 4-byte UTF-8 codes, which are needed by Emoji and some Chinese characters.

Outside of MySQL, "UTF-8" refers to encodings of all sizes (1 to 4 bytes), hence it is effectively the same as MySQL's utf8mb4, not utf8.

I will try to use those spellings and capitalizations to distinguish inside versus outside MySQL in the following.
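
To make the utf8 versus utf8mb4 difference concrete, here is a minimal Python sketch that counts the UTF-8 bytes per character. The helper names utf8_width and fits_in_mysql_utf8 are made up for this illustration; the only assumption about MySQL is that its utf8 stores at most 3 bytes per character.

```python
# Count how many bytes each character needs in UTF-8.
# Assumption: MySQL's utf8 holds at most 3 bytes per character, utf8mb4 up to 4.

def utf8_width(ch: str) -> int:
    """Return the number of bytes `ch` occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))

def fits_in_mysql_utf8(text: str) -> bool:
    """Hypothetical helper: True if every character needs 3 bytes or fewer."""
    return all(utf8_width(ch) <= 3 for ch in text)

if __name__ == "__main__":
    # "\U0001F47D" is an Emoji; it is the one that needs utf8mb4.
    for sample in ["n", "ñ", "新", "\U0001F47D"]:
        print(repr(sample), utf8_width(sample), fits_in_mysql_utf8(sample))
```

Anything for which fits_in_mysql_utf8 returns False is the kind of text that calls for utf8mb4.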

Overview of what you should do

  • Have your editor, etc. set to UTF-8.
  • HTML forms should start like <form accept-charset="UTF-8">.
  • Have your bytes encoded as UTF-8.
  • Establish UTF-8 as the encoding being used in the client.
  • Have the column/table declared CHARACTER SET utf8mb4. (Check with SHOW CREATE TABLE.)
  • Put <meta charset=UTF-8> at the beginning of the HTML.
  • Stored Routines acquire the current charset/collation. They may need rebuilding.
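
The SHOW CREATE TABLE check from the list above can be partly automated. The following is only a sketch: SAMPLE_DDL is a made-up example of what SHOW CREATE TABLE might print, and the function names are hypothetical. It pulls out every charset named in the definition so that columns still declared as something other than utf8mb4 stand out.

```python
import re

# Hypothetical SHOW CREATE TABLE output; table and column names are invented.
SAMPLE_DDL = """CREATE TABLE `tbl` (
  `id` int NOT NULL,
  `name` varchar(100) CHARACTER SET latin1 DEFAULT NULL,
  `note` text
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4"""

def declared_charsets(ddl: str) -> set:
    """Collect every charset named as CHARACTER SET x, CHARSET=x or CHARSET x."""
    pattern = r"(?:CHARACTER SET|CHARSET)\s*=?\s*(\w+)"
    return {name.lower() for name in re.findall(pattern, ddl, flags=re.IGNORECASE)}

def non_utf8mb4_charsets(ddl: str) -> set:
    """Charsets in the definition that are something other than utf8mb4."""
    return declared_charsets(ddl) - {"utf8mb4"}

if __name__ == "__main__":
    print(non_utf8mb4_charsets(SAMPLE_DDL))  # {'latin1'}
```

Here the table default is fine, but the name column overrides it with latin1, which is exactly the sort of thing that is easy to miss by eye.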

UTF-8 all the way through

More details for computer languages (and its following sections)

Test the data

Viewing the data with a tool or with SELECT cannot be trusted. Too many such clients, especially browsers, try to compensate for incorrect encodings and show you correct text even if the database is mangled. So, pick a table and column that has some non-English text and do:

SELECT col, HEX(col) FROM tbl WHERE ...

The HEX for correctly stored UTF-8 will be

  • For a blank space (in any language): 20
  • For English: 4x, 5x, 6x, or 7x
  • For most of Western Europe, accented letters should be Cxyy
  • Cyrillic, Hebrew, and Farsi/Arabic: Dxyy
  • Most of Asia: Exyyzz
  • Emoji and some of Chinese: F0yyzzww
  • More details
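
If you want to check a HEX(col) value without decoding it by eye, a small Python sketch like the one below may help. The function names are made up for this illustration. to_hex shows the hex that correctly stored UTF-8 should produce, and decode_hex_as_utf8 tells you whether a hex string copied from MySQL is valid UTF-8 at all.

```python
def to_hex(text: str) -> str:
    """Hex of the UTF-8 encoding, in the uppercase style that HEX() prints."""
    return text.encode("utf-8").hex().upper()

def decode_hex_as_utf8(hex_string: str):
    """Return the text if the hex is valid UTF-8, else None."""
    try:
        return bytes.fromhex(hex_string).decode("utf-8")
    except UnicodeDecodeError:
        return None

if __name__ == "__main__":
    print(to_hex("ñ"))                          # C3B1   (Cxyy)
    print(to_hex("я"))                          # D18F   (Dxyy)
    print(to_hex("新"))                         # E696B0 (Exyyzz)
    print(to_hex("\U0001F47D"))                 # F09F91BD
    print(decode_hex_as_utf8("5365C3B16F72"))   # Señor
    print(decode_hex_as_utf8("5365F16F72"))     # None: F1 is a lone latin1 byte
```

Note that double-encoded data is still valid UTF-8, so a successful decode here does not rule that case out; compare the hex length against to_hex of the text you expected.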

Specific causes and fixes of the problems seen

Truncated text (Se for Señor):

  • The bytes to be stored are not encoded as utf8mb4. Fix this.
  • Also, check that the connection during reading is UTF-8.
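
Truncation can be imitated outside MySQL. The sketch below is not MySQL's actual code path, just an approximation: it takes latin1 bytes, treats them as if they were UTF-8, and keeps only what comes before the first invalid byte. The function name valid_utf8_prefix is made up.

```python
def valid_utf8_prefix(raw: bytes) -> str:
    """Keep only the text before the first byte sequence that is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first offending byte.
        return raw[:err.start].decode("utf-8")

if __name__ == "__main__":
    latin1_bytes = "Señor".encode("latin-1")   # b'Se\xf1or': ñ is the single byte F1
    print(valid_utf8_prefix(latin1_bytes))     # Se
```

The byte F1 cannot stand alone in UTF-8, so everything from that point on is dropped, which matches the Se for Señor symptom.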

Black Diamonds with question marks (Se�or for Señor); one of these cases exists:

Case 1 (original bytes were not UTF-8):

  • The bytes to be stored are not encoded as utf8. Fix this.
  • The connection (or SET NAMES) for the INSERT and the SELECT was not utf8/utf8mb4. Fix this.
  • Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).

Case 2 (original bytes were UTF-8):

  • The connection (or SET NAMES) for the SELECT was not utf8/utf8mb4. Fix this.
  • Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).

Black diamonds occur only when the browser is set to <meta charset=UTF-8>.
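
The black diamond is the Unicode replacement character U+FFFD. A rough way to see where it comes from, assuming the page is being read as UTF-8 while the bytes delivered are latin1, is this sketch (the function name is made up):

```python
def as_utf8_reader_shows(raw: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for bytes that are not valid UTF-8."""
    return raw.decode("utf-8", errors="replace")

if __name__ == "__main__":
    latin1_bytes = "Señor".encode("latin-1")   # what a non-UTF-8 connection hands over
    shown = as_utf8_reader_shows(latin1_bytes)
    print(shown)                               # Se�or
    print(shown == "Se\ufffdor")               # True
```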

Question Marks (regular ones, not black diamonds) (Se?or for Señor):

  • The bytes to be stored are not encoded as utf8/utf8mb4. Fix this.
  • The column in the database is not CHARACTER SET utf8 (or utf8mb4). Fix this. (Use SHOW CREATE TABLE.)
  • Also, check that the connection during reading is UTF-8.
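
Regular question marks appear when a character has to be converted into a character set that has no slot for it. The sketch below imitates that with Python's codecs; it is an analogy, not MySQL's conversion routine, and the function name is made up.

```python
def store_in_charset(text: str, charset: str) -> str:
    """Convert to `charset`, replacing anything it cannot represent with '?'."""
    return text.encode(charset, errors="replace").decode(charset)

if __name__ == "__main__":
    print(store_in_charset("Señor", "ascii"))       # Se?or
    print(store_in_charset("新浪新闻", "latin-1"))   # ????
    print(store_in_charset("Señor", "utf-8"))       # Señor
```

Once the '?' has been written, the original character is gone, which is why this case cannot be repaired afterwards.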

Mojibake (SeÃ±or for Señor): (This discussion also applies to Double Encoding, which is not necessarily visible.)

  • The bytes to be stored need to be UTF-8-encoded. Fix this.
  • The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
  • The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
  • HTML should start with <meta charset=UTF-8>.
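
Mojibake is what you get when correct UTF-8 bytes are read as if they were a one-byte-per-character set. A minimal sketch, assuming cp1252 as the wrongly assumed charset (that choice and the function names are assumptions for this illustration):

```python
def mojibake(text: str, wrong_charset: str = "cp1252") -> str:
    """Encode as UTF-8, then misread those bytes as `wrong_charset`."""
    return text.encode("utf-8").decode(wrong_charset)

def undo_mojibake(garbled: str, wrong_charset: str = "cp1252") -> str:
    """Reverse the misreading; works only if no bytes were lost along the way."""
    return garbled.encode(wrong_charset).decode("utf-8")

if __name__ == "__main__":
    print(mojibake("Señor"))                   # SeÃ±or
    print(mojibake("新浪新闻"))                 # æ–°æµªæ–°é—»
    print(undo_mojibake(mojibake("Señor")))    # Señor
```

The round trip shows why this case is usually recoverable: every original byte is still there, just displayed under the wrong interpretation.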

If the data looks correct, but won't sort correctly, then either you have picked the wrong collation, or there is no collation that suits your need, or you have Double Encoding.

Double Encoding can be confirmed by doing the SELECT .. HEX .. described above.

é should come back C3A9, but instead shows C383C2A9
The Emoji 👽 should come back F09F91BD, but comes back C3B0C5B8E28098C2BD

That is, the hex is about twice as long as it should be. This is caused by converting from latin1 (or whatever) to utf8, then treating those bytes as if they were latin1 and repeating the conversion. The sorting (and comparing) does not work correctly because it is, for example, sorting as if the string were SeÃ±or.
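
The two hex values quoted above can be reproduced with a short sketch. It assumes cp1252 stands in for what MySQL calls latin1, and the function name is made up for this illustration.

```python
def double_encoded_hex(text: str, wrong_charset: str = "cp1252") -> str:
    """Encode to UTF-8, misread the bytes as `wrong_charset`, encode to UTF-8 again."""
    once = text.encode("utf-8")            # the correct bytes
    misread = once.decode(wrong_charset)   # each byte taken as its own character
    twice = misread.encode("utf-8")        # the conversion repeated
    return twice.hex().upper()

if __name__ == "__main__":
    print("é".encode("utf-8").hex().upper())   # C3A9
    print(double_encoded_hex("é"))             # C383C2A9
    print(double_encoded_hex("\U0001F47D"))    # C3B0C5B8E28098C2BD
```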

Fixing the Data, where possible

For Truncation and Question Marks, the data is lost.

For Mojibake / Double Encoding, ...

For Black Diamonds, ...

The Fixes are listed here. (5 different fixes for 5 different situations; pick carefully): http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases

Answered by castro_pereira

I had similar issues with 2 of my projects after a server migration. After searching and trying a lot of solutions, I came across this one:

mysqli_set_charset($con,"utf8");

After adding this line to my config file everything works fine!

I found this solution for mysqli at https://www.w3schools.com/PHP/func_mysqli_set_charset.asp when I was looking to solve an insert from an HTML query.

Good luck!

Answered by Ashish Bhatt

I was also searching for a fix for the same issue; it took me nearly a month to find an appropriate solution. First of all, you will have to update your database so that every CHARACTER SET and COLLATION is utf8mb4, or at least one that supports UTF-8 data.

For Java:

While making the JDBC connection, add useUnicode=yes&characterEncoding=UTF-8 to the connection URL as parameters and it will work.

For Python:

Before querying the database, try enforcing this over the cursor:

    cursor.execute("SET NAMES utf8mb4")
    cursor.execute("SET CHARACTER SET utf8mb4")
    cursor.execute("SET character_set_connection=utf8mb4")

If it does not work, happy hunting for the right solution.

Answered by SIDU

Funny how you answer your own question :)

  1. Set your code IDE language to UTF8

  2. Add a charset declaration (for example <meta charset=UTF-8>) to the header of the web page where you collect form data.

  3. Check your MySQL table definition looks like this:

    CREATE TABLE your_table (
      ...
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8
    
  4. If you are using PDO, make sure

    $options = array(PDO::MYSQL_ATTR_INIT_COMMAND=>'SET NAMES utf8'); 
    $dbL = new PDO($pdo, $user, $pass, $options);
    

If you already have a large database with the above problem, you can try SIDU to export it with the correct charset and import it back as UTF8. Good luck!

Answered by paul

Depending on how the server is set up, you have to change the encoding accordingly. From what you said, utf8 should work best; however, if you are getting weird characters, it might help to change the web page's encoding to ANSI. This helped me when I was setting up PHP MySQLi. This might help you understand more: https://superuser.com/questions/762473/ansi-to-utf-8-in-notepad