MySQL UTF-8: General? Bin? Unicode?
Original question: http://stackoverflow.com/questions/2344118/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
UTF-8: General? Bin? Unicode?
Asked by Dolph
I'm trying to figure out what collation I should be using for various types of data. 100% of the content I will be storing is user-submitted.
My understanding is that I should be using UTF-8 General CI (Case-Insensitive) instead of UTF-8 Binary. However, I can't find a clear distinction between UTF-8 General CI and UTF-8 Unicode CI.
- Should I be storing user-submitted content in UTF-8 General or UTF-8 Unicode CI columns?
- What type of data would UTF-8 Binary be applicable to?
Answered by Sagi
In general, utf8_general_ci is faster than utf8_unicode_ci, but less correct.
Here is the difference:
For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.
Quoted from: http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html
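To make the "expansion" idea from the quote concrete, here is a rough Python analogy. It is only a sketch: Python's str.lower() and str.casefold() are not MySQL's collation algorithms, and the function names simple_equal and expanding_equal are made up for illustration. The point is just that a comparison without expansions cannot treat "ß" as "ss", while one with expansions can.

```python
def simple_equal(a: str, b: str) -> bool:
    """Comparison without expansions: lower() leaves 'ß' as a single 'ß'."""
    return a.lower() == b.lower()


def expanding_equal(a: str, b: str) -> bool:
    """Comparison with expansions: casefold() maps 'ß' to 'ss'."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    # Without expansions the two spellings differ; with expansions they match.
    print(simple_equal("Straße", "STRASSE"))
    print(expanding_equal("Straße", "STRASSE"))
```

Roughly speaking, utf8_general_ci behaves like the first function for this case and utf8_unicode_ci like the second.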
For more detailed explanation, please read the following post from MySQL forums: http://forums.mysql.com/read.php?103,187048,188748
As for utf8_bin: both utf8_general_ci and utf8_unicode_ci perform case-insensitive comparison. In contrast, utf8_bin is case-sensitive (among other differences), because it compares the binary values of the characters.
Answered by Alex Hepp
You should also be aware that with utf8_general_ci, when a varchar field is used as a unique or primary index, inserting two values like 'a' and 'á' gives a duplicate key error.
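The duplicate key effect can be imitated in plain Python. This is a sketch under assumptions, not how MySQL implements its indexes: folded_key approximates a case-insensitive, accent-insensitive collation by stripping combining marks, binary_key approximates a binary collation by comparing raw UTF-8 bytes, and first_duplicate is a hypothetical stand-in for a unique index.

```python
import unicodedata


def folded_key(value: str) -> str:
    """Approximate a _ci collation key: drop accents, then fold case."""
    decomposed = unicodedata.normalize("NFD", value)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def binary_key(value: str) -> bytes:
    """Approximate a _bin collation key: the raw UTF-8 bytes."""
    return value.encode("utf-8")


def first_duplicate(values, key):
    """Return the first value whose key was already seen, or None."""
    seen = set()
    for value in values:
        k = key(value)
        if k in seen:
            return value
        seen.add(k)
    return None


if __name__ == "__main__":
    values = ["a", "\u00e1"]  # 'a' and 'á'
    print(first_duplicate(values, folded_key))  # 'á' collides with 'a'
    print(first_duplicate(values, binary_key))  # no collision
```

With the folded key the second value is rejected as a duplicate, which mirrors the error described above; with the binary key both values are accepted.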
Answered by Rick James
- utf8_bin compares the bits blindly. No case folding, no accent stripping.
- utf8_general_ci compares one character with one character. It does case folding and accent stripping, but no 2-character comparisons: ij is not equal to ĳ in this collation.
- utf8_*_ci is a set of language-specific rules, but otherwise like unicode_ci. Some special cases: Ç, Č, ch, ll.
- utf8_unicode_ci follows an old Unicode standard for comparisons. ij = ĳ, but ae != æ.
- utf8_unicode_520_ci follows a newer Unicode standard. ae = æ.
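The ij and ae cases in the list above have a loose parallel in Python's Unicode normalization, which may help to see why they differ. This is only an analogy: NFKD normalization is not the weight table MySQL uses, and compat_fold is a made-up helper name. The ligature ĳ (U+0133) has a compatibility decomposition into "ij", whereas æ (U+00E6) has no decomposition at all, so matching it to "ae" needs extra collation rules.

```python
import unicodedata


def compat_fold(text: str) -> str:
    """Fold text using compatibility decomposition plus case folding."""
    return unicodedata.normalize("NFKD", text).casefold()


if __name__ == "__main__":
    # U+0133 (ĳ) decomposes to the two letters 'i' and 'j'.
    print(compat_fold("\u0133") == "ij")
    # U+00E6 (æ) has no decomposition, so it does not become 'ae'.
    print(compat_fold("\u00e6") == "ae")
```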
See the collation chart for details on what is equal to what in various utf8 collations.
utf8, as defined by MySQL, is limited to the 1- to 3-byte utf8 codes. This leaves out Emoji and some of Chinese. So you should really switch to utf8mb4 if you want to go much beyond Europe.
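A quick way to see which characters fall outside the 3-byte limit is to count their UTF-8 bytes. The sketch below uses only Python's standard string encoding; fits_in_three_bytes is a hypothetical helper name, and the sample characters are my own picks for illustration.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes the character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_three_bytes(text: str) -> bool:
    """True if every character of text encodes to at most 3 UTF-8 bytes."""
    return all(utf8_len(ch) <= 3 for ch in text)


if __name__ == "__main__":
    samples = {
        "Latin e with acute": "\u00e9",
        "common CJK ideograph": "\u4e2d",
        "CJK Extension B ideograph": "\U00020000",
        "emoji": "\U0001F600",
    }
    for label, ch in samples.items():
        print(label, utf8_len(ch), fits_in_three_bytes(ch))
```

Anything for which the helper returns False would need utf8mb4 rather than MySQL's 3-byte utf8.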
The above points apply to utf8mb4, after suitable spelling change. Going forward, utf8mb4 and utf8mb4_unicode_520_ci are preferred.
- utf16 and utf32 are variants on utf8; there is virtually no use for them.
- ucs2 is closer to "Unicode" than "utf8"; there is virtually no use for it.
Answered by vitalii
Really, I tested saving values like 'é' and 'e' in a column with a unique index, and they cause a duplicate error on both 'utf8_unicode_ci' and 'utf8_general_ci'. You can save them only in a 'utf8_bin' collated column.
And the MySQL docs (http://dev.mysql.com/doc/refman/5.7/en/charset-applications.html) set the 'utf8_general_ci' collation in their examples:
[mysqld]
character-set-server=utf8
collation-server=utf8_general_ci
Answered by Marwann
Accepted answer is outdated.
If you use MySQL 5.5.3+, use utf8mb4_unicode_ci instead of utf8_unicode_ci to ensure the characters typed by your users won't give you errors.
utf8mb4 supports emojis, for example, whereas utf8 might give you hundreds of encoding-related bugs like:
Incorrect string value: ‘\xF0\x9F\x98\x81…' for column ‘data' at row 1
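If you hit this error, it can help to decode the bytes it reports. A small sketch with Python's standard library, using the four bytes visible in the message above (the rest of the value is truncated in the message and is not reconstructed here):

```python
import unicodedata

# The four bytes shown at the start of the error message.
raw = b"\xF0\x9F\x98\x81"

ch = raw.decode("utf-8")

if __name__ == "__main__":
    # A 4-byte sequence decodes to a single code point above U+FFFF,
    # which is exactly what MySQL's 3-byte utf8 cannot store.
    print(len(raw), hex(ord(ch)), unicodedata.name(ch))
```

The bytes decode to one emoji code point, so the column rejecting it is using the 3-byte utf8 character set rather than utf8mb4.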