MySQL UTF-8: General? Bin? Unicode?

Notice: this page is taken from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverFlow original: http://stackoverflow.com/questions/2344118/

Time: 2020-08-31 15:24:06  Source: igfitidea

UTF-8: General? Bin? Unicode?

Tags: mysql, utf-8, collation

Asked by Dolph

I'm trying to figure out what collation I should be using for various types of data. 100% of the content I will be storing is user-submitted.

My understanding is that I should be using UTF-8 General CI (Case-Insensitive) instead of UTF-8 Binary. However, I can't find a clear distinction between UTF-8 General CI and UTF-8 Unicode CI.

  1. Should I be storing user-submitted content in UTF-8 General or UTF-8 Unicode CI columns?
  2. What type of data would UTF-8 Binary be applicable to?

Answered by Sagi

In general, utf8_general_ci is faster than utf8_unicode_ci, but less correct.

Here is the difference:

For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages "ß" is equal to "ss". utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.

Quoted from: http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html
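
The difference between one-to-one comparison and expansions in the quoted passage can be illustrated outside MySQL. The sketch below is only an analogy: it assumes Python's built-in full case folding as a stand-in for a collation that supports expansions, and it does not reproduce how MySQL implements either collation. The function names are hypothetical.

```python
def one_to_one_equal(a: str, b: str) -> bool:
    """Compare character by character after lowercasing each character.

    Mimics a comparison limited to one-to-one mappings: strings of
    different lengths can never compare as equal.
    """
    if len(a) != len(b):
        return False
    return all(x.lower() == y.lower() for x, y in zip(a, b))


def expansion_equal(a: str, b: str) -> bool:
    """Compare after full case folding, which may expand one character
    into several (for example, 'ß' folds to 'ss')."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    # 'Straße' has 6 characters, 'STRASSE' has 7.
    print(one_to_one_equal("Straße", "STRASSE"))
    print(expansion_equal("Straße", "STRASSE"))
```

The one-to-one version cannot treat the two spellings as equal because it never maps a single character to a pair, which is the limitation the quote attributes to utf8_general_ci.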

For a more detailed explanation, please read the following post from the MySQL forums: http://forums.mysql.com/read.php?103,187048,188748

As for utf8_bin: both utf8_general_ci and utf8_unicode_ci perform case-insensitive comparison. In contrast, utf8_bin is case-sensitive (among other differences), because it compares the binary values of the characters.
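
A rough way to see what comparing binary values implies is to compare and sort strings by their UTF-8 bytes. This is a sketch in Python rather than MySQL, with hypothetical helper names, and the case-insensitive variant uses Python's case folding only as an approximation of a _ci collation.

```python
def binary_equal(a: str, b: str) -> bool:
    """Equal only when the UTF-8 byte sequences are identical."""
    return a.encode("utf-8") == b.encode("utf-8")


def binary_sorted(values):
    """Sort by raw UTF-8 bytes, so uppercase ASCII sorts before lowercase."""
    return sorted(values, key=lambda s: s.encode("utf-8"))


def case_insensitive_sorted(values):
    """Sort by case-folded text, ignoring letter case."""
    return sorted(values, key=str.casefold)


if __name__ == "__main__":
    print(binary_equal("a", "A"))
    # Precomposed U+00E9 versus 'e' followed by combining acute U+0301:
    # they look the same but have different bytes.
    print(binary_equal("\u00e9", "e\u0301"))
    print(binary_sorted(["b", "A", "a", "B"]))
    print(case_insensitive_sorted(["b", "A", "a", "B"]))
```

Byte-wise comparison affects ordering as well as equality, which is one of the "other differences" besides case sensitivity.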

Answered by Alex Hepp

You should also be aware that with utf8_general_ci, when a varchar field is used as a unique or primary index, inserting two values like 'a' and 'á' gives a duplicate key error.
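
The duplicate key behaviour follows from the index treating values that compare as equal as the same key. Here is a minimal sketch of that idea in Python, assuming a simplified accent-and-case-insensitive key built with unicodedata; the key function and the dict-based "index" are hypothetical illustrations, not MySQL's actual collation tables.

```python
import unicodedata


def accent_case_insensitive_key(value: str) -> str:
    """Build a comparison key that ignores accents and letter case."""
    # NFD splits 'á' into 'a' plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", value)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def insert_unique(index: dict, value: str) -> None:
    """Insert into a toy unique index keyed by the comparison key."""
    key = accent_case_insensitive_key(value)
    if key in index:
        raise ValueError(
            f"Duplicate entry {value!r} (conflicts with {index[key]!r})"
        )
    index[key] = value


if __name__ == "__main__":
    index = {}
    insert_unique(index, "a")
    try:
        insert_unique(index, "á")
    except ValueError as exc:
        print(exc)
```

A binary comparison would use the raw bytes as the key instead, so 'a' and 'á' would be stored as two distinct entries.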

Answered by Rick James

  • utf8_bin compares the bits blindly. No case folding, no accent stripping.
  • utf8_general_ci compares one character with one character. It does case folding and accent stripping, but no 2-character comparisons: ij is not equal to ĳ in this collation.
  • utf8_*_ci is a set of language-specific rules, but otherwise like unicode_ci. Some special cases: Ç, Č, ch, ll
  • utf8_unicode_ci follows an old Unicode standard for comparisons. ij = ĳ, but ae != æ
  • utf8_unicode_520_ci follows a newer Unicode standard. ae = æ

See the collation chart for details on what is equal to what in various utf8 collations.

utf8, as defined by MySQL, is limited to the 1- to 3-byte utf8 codes. This leaves out Emoji and some Chinese characters. So you should really switch to utf8mb4 if you want to go much beyond Europe.
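
The 3-byte limit corresponds to code points up to U+FFFF; anything above that needs 4 bytes in UTF-8. The Python sketch below, with hypothetical helper names, flags the characters in a string that would not fit in such a 3-byte encoding. It only inspects code points and makes no call to MySQL.

```python
from typing import List

# Highest code point that UTF-8 can encode in at most 3 bytes.
MAX_THREE_BYTE_CODE_POINT = 0xFFFF


def four_byte_chars(text: str) -> List[str]:
    """Return the characters that need 4 bytes in UTF-8."""
    return [ch for ch in text if ord(ch) > MAX_THREE_BYTE_CODE_POINT]


def fits_three_byte_utf8(text: str) -> bool:
    """True when every character encodes to at most 3 bytes in UTF-8."""
    return not four_byte_chars(text)


if __name__ == "__main__":
    samples = [
        "héllo",                  # Latin with an accent
        "中文",                    # common Chinese characters
        "grin \U0001F601",        # an emoji above U+FFFF
        "rare han \U0002080E",    # a CJK Extension B character
    ]
    for sample in samples:
        print(repr(sample), fits_three_byte_utf8(sample))
```

A check like this could be run on user input before deciding whether existing utf8 columns are at risk, though switching to utf8mb4 removes the need for it.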

The above points apply to utf8mb4, after a suitable spelling change. Going forward, utf8mb4 and utf8mb4_unicode_520_ci are preferred.

  • utf16 and utf32 are variants on utf8; there is virtually no use for them.
  • ucs2 is closer to "Unicode" than "utf8"; there is virtually no use for it.

Answered by vitalii

Really, I tested saving values like 'é' and 'e' in a column with a unique index, and they cause a duplicate error with both 'utf8_unicode_ci' and 'utf8_general_ci'. You can save them only in a 'utf8_bin' collated column.

And the MySQL docs (at http://dev.mysql.com/doc/refman/5.7/en/charset-applications.html) use the 'utf8_general_ci' collation in their examples:

[mysqld]
character-set-server=utf8
collation-server=utf8_general_ci

Answered by Marwann

Accepted answer is outdated.

If you use MySQL 5.5.3+, use utf8mb4_unicode_ci instead of utf8_unicode_ci to ensure the characters typed by your users won't give you errors.

utf8mb4 supports emojis, for example, whereas utf8 might give you hundreds of encoding-related bugs like:

Incorrect string value: '\xF0\x9F\x98\x81…' for column 'data' at row 1
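
The four bytes shown in that message can be decoded to see which character triggered it. This is a small Python sketch with a hypothetical helper name; it uses only the bytes visible before the truncation in the error text.

```python
from typing import Tuple


def describe_utf8_bytes(raw: bytes) -> Tuple[str, int]:
    """Decode a single UTF-8 encoded character.

    Returns its code point in U+XXXX notation and its encoded length
    in bytes.
    """
    ch = raw.decode("utf-8")
    if len(ch) != 1:
        raise ValueError("expected the bytes of exactly one character")
    return f"U+{ord(ch):04X}", len(raw)


if __name__ == "__main__":
    # The visible bytes from the error message above.
    code_point, byte_length = describe_utf8_bytes(b"\xF0\x9F\x98\x81")
    print(code_point, byte_length)
```

The result is a code point above U+FFFF that takes 4 bytes, which is why a 3-byte utf8 column rejects it.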
