MySQL utf8_general_ci 和 utf8_unicode_ci 有什么区别?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/1036454/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
What are the differences between utf8_general_ci and utf8_unicode_ci?
提问by reconbot
Possible Duplicate:
What's the difference between utf8_general_ci and utf8_unicode_ci
I've got two options for unicode that look promising for a mysql database.
我有两个 unicode 选项,它们看起来对 mysql 数据库很有希望。
utf8_general_ci unicode (multilingual), case-insensitive
utf8_unicode_ci unicode (multilingual), case-insensitive
Can you please explain what is the difference between utf8_general_ci and utf8_unicode_ci? What are the effects of choosing one over the other when designing a database?
你能解释一下 utf8_general_ci 和 utf8_unicode_ci 有什么区别吗?在设计数据库时选择一个而不是另一个有什么影响?
回答by Timotei
utf8_general_ci
is a very simple — and on Unicode, very broken — collation, one that gives incorrect resultson general Unicode text. What it does is:
utf8_general_ci
是一个非常简单的 - 在 Unicode 上,非常破碎 - 排序规则,它在一般 Unicode 文本上给出错误的结果。它的作用是:
- converts to Unicode normalization form D for canonical decomposition
- removes any combining characters
- converts to upper case
- 转换为 Unicode 规范化形式 D 以进行规范分解
- 删除任何组合字符
- 转换为大写
This does not work correctly on Unicode, because it does not understand Unicode casing. Unicode casing alone is much more complicated than an ASCII-minded approach can handle. For example:
这在 Unicode 上不能正常工作,因为它不理解 Unicode 大小写。Unicode 大小写本身比 ASCII 方法可以处理的要复杂得多。例如:
- The lowercase of “?” is “?”, but the uppercase of “?” is “SS”.
- There are two lowercase Greek sigmas, but only one uppercase one; consider “Σ?συφο?”.
- Letters like “?” do not decompose to an “o” plus a diacritic, meaning that it won't correctly sort.
- “?”的小写 是“?”,但是“?”的大写 是“SS”。
- 有两个小写的希腊西格玛,但只有一个大写的;考虑“Σ?συφο?”。
- 像“?”这样的字母 不要分解为“o”加上变音符号,这意味着它不会正确排序。
There are many other subtleties.
还有许多其他微妙之处。
utf8_unicode_ci
uses the standard Unicode Collation Algorithm, supports so called expansions and ligatures, for example: German letter ? (U+00DF LETTER SHARP S) is sorted near "ss" Letter ? (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near "OE".
utf8_unicode_ci
使用标准的Unicode Collation Algorithm,支持所谓的扩展和连字,例如: German letter ? (U+00DF LETTER SHARP S) 在“ss”字母附近排序?(U+0152 LATIN CAPITAL LIGATURE OE) 排序在“OE”附近。
utf8_general_ci
does not support expansions/ligatures, it sorts
all these letters as single characters, and sometimes in a wrong order.
utf8_general_ci
不支持扩展/连字,它将所有这些字母排序为单个字符,有时排序错误。
utf8_unicode_ci
is generallymore accurate for all scripts. For example, on Cyrillic block:utf8_unicode_ci
is fine for all these languages: Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian. While utf8_general_ci is fine only for Russian and Bulgarian subset of Cyrillic. Extra letters used in Belarusian, Macedonian, Serbian, and Ukrainian are sorted not well.
utf8_unicode_ci
是一般所有脚本更准确。例如,在 Cyrillic 块上:utf8_unicode_ci
适用于所有这些语言:俄语、保加利亚语、白俄罗斯语、马其顿语、塞尔维亚语和乌克兰语。虽然 utf8_general_ci 仅适用于西里尔文的俄语和保加利亚语子集。白俄罗斯语、马其顿语、塞尔维亚语和乌克兰语中使用的额外字母排序不好。
The cost of utf8_unicode_ci
is that it is a littlebit
slower than utf8_general_ci
. But that's the price you pay for correctness. Either you can have a fast answer that's wrong, or a very slightly slower answer that's right. Your choice.
It is very difficult to ever justify giving wrong answers, so it's best to assume that utf8_general_ci
doesn't exist and to always use utf8_unicode_ci
. Well, unless you want wrong answers.
成本utf8_unicode_ci
是,它是一个稍微有点慢utf8_general_ci
。但这就是您为正确性付出的代价。您可以有一个快速的错误答案,或者一个非常慢的正确答案。你的选择。很难证明给出错误答案是正确的,因此最好假设它utf8_general_ci
不存在并始终使用utf8_unicode_ci
. 好吧,除非你想要错误的答案。
Source: http://forums.mysql.com/read.php?103,187048,188748#msg-188748
来源:http: //forums.mysql.com/read.php?103,187048,188748#msg-188748
回答by Gumbo
From Unicode Character Setsin the MySQL documentation:
来自MySQL 文档中的Unicode 字符集:
For any Unicode character set, operations performed using the
_general_ci
collation are faster than those for the_unicode_ci
collation. For example, comparisons for theutf8_general_ci
collation are faster, but slightly less correct, than comparisons forutf8_unicode_ci
. The reason for this is thatutf8_unicode_ci
supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “?
” is equal to “ss
”.utf8_unicode_ci
also supports contractions and ignorable characters.utf8_general_ci
is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.
对于任何 Unicode 字符集,使用
_general_ci
排序规则执行的操作比使用排序规则执行的操作更快_unicode_ci
。例如,utf8_general_ci
归类的比较比 的比较快,但正确性稍差utf8_unicode_ci
。这样做的原因是utf8_unicode_ci
支持扩展等映射;也就是说,当一个字符与其他字符的组合相等时。例如,在德语和其他一些语言中,“?
” 等于“ss
”。utf8_unicode_ci
还支持收缩和可忽略的字符。utf8_general_ci
是不支持扩展、收缩或可忽略字符的旧排序规则。它只能在字符之间进行一对一的比较。