MySQL UTF-8: General? Bin? Unicode?

Question

Asked by Dolph

I'm trying to figure out what collation I should be using for various types of data. 100% of the content I will be storing is user-submitted.

My understanding is that I should be using UTF-8 General CI (Case-Insensitive) instead of UTF-8 Binary. However, I can't find a clear distinction between UTF-8 General CI and UTF-8 Unicode CI.

Should I be storing user-submitted content in UTF-8 General or UTF-8 Unicode CI columns?
What type of data would UTF-8 Binary be applicable to?

Answer 1

Answered by Sagi

In general, utf8_general_ci is faster than utf8_unicode_ci, but less correct.

Here is the difference:

For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.

Quoted from: http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html
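
The one-to-one versus expansion distinction in the quoted passage can be illustrated outside the database. The following is only a toy sketch in Python: the two lookup tables and the function names are made up for illustration and are not MySQL's real collation data. They mirror the behaviour the MySQL manual describes, where "ß" compares equal to "s" under utf8_general_ci and to "ss" under utf8_unicode_ci.

```python
# Toy weight tables (assumptions for illustration, NOT MySQL's collation data).
# One-to-one style, like utf8_general_ci: each character maps to one character.
ONE_TO_ONE = {"ß": "s", "ä": "a", "ö": "o", "ü": "u"}
# Expansion style, like utf8_unicode_ci: one character may map to several.
WITH_EXPANSION = {"ß": "ss", "ä": "a", "ö": "o", "ü": "u"}


def sort_key(text, table):
    """Lower-case the text, then replace each character using the table."""
    return "".join(table.get(ch, ch) for ch in text.lower())


def equal_under(a, b, table):
    """Compare two strings by their toy collation keys."""
    return sort_key(a, table) == sort_key(b, table)


print(equal_under("Straße", "STRASSE", ONE_TO_ONE))      # False
print(equal_under("Straße", "STRASSE", WITH_EXPANSION))  # True
print(equal_under("Straße", "strase", ONE_TO_ONE))       # True
```

A one-to-one table can never make strings of different lengths compare equal, which is why the general collation is cheaper but "slightly less correct" for words like these.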

For more detailed explanation, please read the following post from MySQL forums: http://forums.mysql.com/read.php?103,187048,188748

As for utf8_bin: Both utf8_general_ci and utf8_unicode_ci perform case-insensitive comparison. In contrast, utf8_bin is case-sensitive (among other differences), because it compares the binary values of the characters.

Answer 2

Answered by Alex Hepp

You should also be aware that with utf8_general_ci, when a varchar field is used as a unique or primary index, inserting two values like 'a' and 'á' gives a duplicate key error.
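
Why this happens can be sketched in Python. This is a rough approximation, not MySQL's actual algorithm: `approx_ci_key` and `ToyUniqueIndex` are hypothetical names, and the key function simply strips combining accents and lower-cases the text, which is close to what a case- and accent-insensitive collation does for these simple cases.

```python
import unicodedata


def approx_ci_key(text):
    """Rough stand-in for a case- and accent-insensitive collation key."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.lower()


class ToyUniqueIndex:
    """Minimal model of a unique index: it stores keys, not raw values."""

    def __init__(self, key_func):
        self.key_func = key_func
        self.keys = set()

    def insert(self, value):
        key = self.key_func(value)
        if key in self.keys:
            raise ValueError(f"Duplicate entry {value!r}")
        self.keys.add(key)


# Accent-insensitive index: 'a' and 'á' collapse to the same key.
ci_index = ToyUniqueIndex(approx_ci_key)
ci_index.insert("a")
try:
    ci_index.insert("á")
except ValueError as exc:
    print(exc)  # Duplicate entry 'á'

# Binary-style index: the raw UTF-8 bytes differ, so both values fit.
bin_index = ToyUniqueIndex(lambda s: s.encode("utf-8"))
bin_index.insert("a")
bin_index.insert("á")
```

The point of the sketch is that uniqueness is decided by the collation's notion of equality, not by the stored bytes, so the choice of collation changes which values a unique index will accept.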

Answer 3

Answered by Rick James

utf8_bin compares the bits blindly. No case folding, no accent stripping.
utf8_general_ci compares one character with one character. It does case folding and accent stripping, but no 2-character comparisons: ij is not equal to ĳ in this collation.
utf8_*_ci is a set of language-specific rules, but otherwise like unicode_ci. Some special cases: Ç, Č, ch, ll
utf8_unicode_ci follows an old Unicode standard for comparisons. ij = ĳ, but ae != æ
utf8_unicode_520_ci follows a newer Unicode standard. ae = æ

See the collation chart for details on what is equal to what in the various utf8 collations.

utf8, as defined by MySQL, is limited to the 1- to 3-byte UTF-8 codes. This leaves out Emoji and some Chinese characters. So you should really switch to utf8mb4 if you want to go much beyond Europe.
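
The 3-byte limit is easy to check in Python. The sketch below is only an illustration; the helper names `utf8_length` and `fits_three_byte_utf8` are made up here. It relies on the fact that any code point above U+FFFF needs 4 bytes in UTF-8.

```python
def utf8_length(ch):
    """Number of bytes a single character takes when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def fits_three_byte_utf8(text):
    """True if every character has a code point of U+FFFF or below,
    which means it encodes to at most 3 bytes in UTF-8."""
    return all(ord(ch) <= 0xFFFF for ch in text)


SAMPLES = [
    "a",           # U+0061, ASCII letter
    "\u00e9",      # U+00E9, e with acute accent
    "\u4e2d",      # U+4E2D, a common Chinese character
    "\U0001F601",  # U+1F601, an emoji
    "\U00020000",  # U+20000, a rarer Chinese character (CJK Extension B)
]

for ch in SAMPLES:
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")

print(fits_three_byte_utf8("caf\u00e9 \u4e2d"))  # True
print(fits_three_byte_utf8("hi \U0001F601"))     # False
```

A check like `fits_three_byte_utf8` could be used to audit existing input, but switching the column to utf8mb4 removes the restriction instead of working around it.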

The above points apply to utf8mb4, after a suitable spelling change. Going forward, utf8mb4 and utf8mb4_unicode_520_ci are preferred.

utf16 and utf32 are variants on utf8; there is virtually no use for them.
ucs2 is closer to "Unicode" than "utf8"; there is virtually no use for it.

Answer 4

Answered by vitalii

Really, I tested saving values like 'é' and 'e' in a column with a unique index, and they cause a duplicate error with both 'utf8_unicode_ci' and 'utf8_general_ci'. You can save them only in a 'utf8_bin' collated column.

And the MySQL docs (at http://dev.mysql.com/doc/refman/5.7/en/charset-applications.html) set the 'utf8_general_ci' collation in their examples:

[mysqld]
character-set-server=utf8
collation-server=utf8_general_ci

Answer 5

Answered by Marwann

The accepted answer is outdated.

If you use MySQL 5.5.3+, use utf8mb4_unicode_ci instead of utf8_unicode_ci to ensure the characters typed by your users won't give you errors.

utf8mb4 supports emojis, for example, whereas utf8 might give you hundreds of encoding-related bugs like:

Incorrect string value: '\xF0\x9F\x98\x81…' for column 'data' at row 1
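
To see which character triggered an error like this, the bytes shown in the message can be decoded. The following is a small Python sketch, assuming the four visible bytes form one complete UTF-8 sequence; `describe_bytes` is a hypothetical helper name, and anything after the "…" in the message is not known and is left out.

```python
import unicodedata


def describe_bytes(raw):
    """Decode UTF-8 bytes holding one character and return
    (code point label, Unicode name, byte count)."""
    char = raw.decode("utf-8")
    label = f"U+{ord(char):04X}"
    name = unicodedata.name(char, "<unnamed>")
    return label, name, len(raw)


# The four bytes quoted in the error message above.
raw = b"\xF0\x9F\x98\x81"
label, name, size = describe_bytes(raw)
print(label, name, size)
```

The result is a code point above U+FFFF that takes 4 bytes, which is exactly the kind of character a 3-byte utf8 column rejects and a utf8mb4 column accepts.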
