MySQL: What is the difference between utf8_general_ci and utf8_unicode_ci?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/766809/

Date: 2020-08-31 13:08:41  Source: igfitidea

What's the difference between utf8_general_ci and utf8_unicode_ci?

Tags: mysql, unicode, utf-8, collation, character-set

Asked by KahWee Teng

Between utf8_general_ci and utf8_unicode_ci, are there any differences in terms of performance?

Answered by thomasrutter

These two collations are both for the UTF-8 character encoding. The differences are in how text is sorted and compared.

Note: In MySQL you have to use utf8mb4 rather than utf8. Confusingly, utf8 is a flawed UTF-8 implementation from early MySQL versions which remains only for backward compatibility. The fixed version was given the name utf8mb4.
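
As a rough sketch of what this note means in practice, a table can be declared with utf8mb4 from the start, or an existing utf8 table can be converted. The table name users and its columns are hypothetical, chosen only for illustration, and converting a large production table may rebuild it, so treat this as an outline rather than a migration recipe.

```sql
-- Hypothetical table; the name and columns are assumptions for illustration.
CREATE TABLE users (
  id   INT NOT NULL PRIMARY KEY,
  name VARCHAR(100) NOT NULL
)
ENGINE = InnoDB
DEFAULT CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;

-- Convert an existing table (and its text columns) to utf8mb4.
ALTER TABLE users
  CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Make the client connection use utf8mb4 as well.
SET NAMES utf8mb4;
```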

Note: Newer versions of MySQL have updated Unicode sorting rules, available under names such as utf8mb4_0900_ai_ci for equivalent rules based on Unicode 9.0, and with no equivalent _general variant. People reading this now should probably use one of these newer collations instead of either _unicode or _general. Much of what's written below is not of much interest anymore if you can use one of the newer collations instead.
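
To see whether a given server offers these newer collations, one option is to list them. This is a small sketch that assumes MySQL 8.0 or later; on older servers the query should simply return no rows.

```sql
-- List the Unicode 9.0 based collations available for utf8mb4.
SHOW COLLATION
WHERE `Charset` = 'utf8mb4'
  AND `Collation` LIKE 'utf8mb4_0900%';
```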

Key differences

  • utf8mb4_unicode_ci is based on the official Unicode rules for universal sorting and comparison, which sorts accurately in a wide range of languages.

  • utf8mb4_general_ci is a simplified set of sorting rules which aims to do as well as it can while taking many short-cuts designed to improve speed. It does not follow the Unicode rules and will result in undesirable sorting or comparison in some situations, such as when using particular languages or characters.

    On modern servers, this performance boost will be all but negligible. It was devised in a time when servers had a tiny fraction of the CPU performance of today's computers.

Benefits of utf8mb4_unicode_ci over utf8mb4_general_ci

utf8mb4_unicode_ci, which uses the Unicode rules for sorting and comparison, employs a fairly complex algorithm for correct sorting in a wide range of languages and when using a wide range of special characters. These rules need to take into account language-specific conventions; not everybody sorts their characters in what we would call 'alphabetical order'.

As far as Latin (i.e. "European") languages go, there is not much difference between the Unicode sorting and the simplified utf8mb4_general_ci sorting in MySQL, but there are still a few differences:

  • For example, the Unicode collation sorts "ß" like "ss", and "Œ" like "OE", as people using those characters would normally want, whereas utf8mb4_general_ci sorts them as single characters (presumably like "s" and "e" respectively).

  • Some Unicode characters are defined as ignorable, which means they shouldn't count toward the sort order and the comparison should move on to the next character instead. utf8mb4_unicode_ci handles these properly.
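
The "ß" case can be checked directly with literal comparisons. This is a minimal sketch that assumes the connection character set is utf8mb4 (hence the SET NAMES); the column aliases are made up for readability. Going by the description above, the first comparison should be true and the second false, but it is worth confirming on your own server version.

```sql
SET NAMES utf8mb4;

SELECT 'ß' = 'ss' COLLATE utf8mb4_unicode_ci AS unicode_eszett_eq_ss,
       'ß' = 'ss' COLLATE utf8mb4_general_ci AS general_eszett_eq_ss,
       'ß' = 's'  COLLATE utf8mb4_general_ci AS general_eszett_eq_s;
```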

In non-Latin languages, such as Asian languages or languages with different alphabets, there may be a lot more differences between Unicode sorting and the simplified utf8mb4_general_ci sorting. The suitability of utf8mb4_general_ci will depend heavily on the language used. For some languages, it'll be quite inadequate.

What should you use?

There is almost certainly no reason to use utf8mb4_general_ci anymore, as we have left behind the point where CPU speed is low enough that the performance difference would be important. Your database will almost certainly be limited by other bottlenecks than this.

In the past, some people recommended using utf8mb4_general_ci except when accurate sorting was going to be important enough to justify the performance cost. Today, that performance cost has all but disappeared, and developers are treating internationalization more seriously.

There's an argument to be made that if speed is more important to you than accuracy, you may as well not do any sorting at all. It's trivial to make an algorithm faster if you do not need it to be accurate. So, utf8mb4_general_ci is a compromise that's probably not needed for speed reasons and probably also not suitable for accuracy reasons.

One other thing I'll add is that even if you know your application only supports the English language, it may still need to deal with people's names, which can often contain characters used in other languages in which it is just as important to sort correctly. Using the Unicode rules for everything helps add peace of mind that the very smart Unicode people have worked very hard to make sorting work properly.

What the parts mean

Firstly, ci is for case-insensitive sorting and comparison. This means it's suitable for textual data, and case is not important. The other types of collation are cs (case-sensitive) for textual data where case is important, and bin, for where the encoding needs to match, bit for bit, which is suitable for fields which are really encoded binary data (including, for example, Base64). Case-sensitive sorting leads to some weird results and case-sensitive comparison can result in duplicate values differing only in letter case, so case-sensitive collations are falling out of favor for textual data. If case is significant to you, then otherwise ignorable punctuation and so on is probably also significant, and a binary collation might be more appropriate.
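
The difference between a ci collation and a bin collation can be sketched with two literal comparisons. This assumes the connection character set is utf8mb4, and the aliases are illustrative only. The case-insensitive comparison treats the two strings as equal, while the binary one does not.

```sql
SET NAMES utf8mb4;

SELECT 'abc' = 'ABC' COLLATE utf8mb4_unicode_ci AS ci_equal,
       'abc' = 'ABC' COLLATE utf8mb4_bin        AS bin_equal;
```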

Next, unicode or general refers to the specific sorting and comparison rules, in particular the way text is normalized or compared. There are many different sets of rules for the utf8mb4 character encoding, with unicode and general being two that attempt to work well in all possible languages rather than one specific one. The differences between these two sets of rules are the subject of this answer. Note that unicode uses rules from Unicode 4.0. Recent versions of MySQL add the rulesets unicode_520 using rules from Unicode 5.2, and 0900 (dropping the "unicode_" part) using rules from Unicode 9.0.

And lastly, utf8mb4 is of course the character encoding used internally. In this answer I'm talking only about Unicode-based encodings.

Answered by nightcoder

I wanted to know what the performance difference is between using utf8_general_ci and utf8_unicode_ci, but I did not find any benchmarks listed on the internet, so I decided to create benchmarks myself.

I created a very simple table with 500,000 rows:

CREATE TABLE test(
  ID INT(11) DEFAULT NULL,
  Description VARCHAR(20) DEFAULT NULL
)
ENGINE = INNODB
CHARACTER SET utf8
COLLATE utf8_general_ci;

Then I filled it with random data by running this stored procedure:

CREATE PROCEDURE randomizer()
BEGIN
  DECLARE i INT DEFAULT 0;
  DECLARE random CHAR(20) ;
  theloop: loop
    SET random = CONV(FLOOR(RAND() * 99999999999999), 20, 36);
    INSERT INTO test VALUES (i+1, random);
    SET i=i+1;
    IF i = 500000 THEN
      LEAVE theloop;
    END IF;
  END LOOP theloop;
END

Then I created the following stored procedures to benchmark a simple SELECT, SELECT with LIKE, and sorting (SELECT with ORDER BY):

CREATE PROCEDURE benchmark_simple_select()
BEGIN
  DECLARE i INT DEFAULT 0;
  theloop: loop
    SELECT *
    FROM test
    WHERE Description = 'test' COLLATE utf8_general_ci;
    SET i = i + 1;
    IF i = 30 THEN
      LEAVE theloop;
    END IF;
  END LOOP theloop;
END;

CREATE PROCEDURE benchmark_select_like()
BEGIN
  DECLARE i INT DEFAULT 0;
  theloop: loop
    SELECT *
    FROM test
    WHERE Description LIKE '%test' COLLATE utf8_general_ci;
    SET i = i + 1;
    IF i = 30 THEN
      LEAVE theloop;
    END IF;
  END LOOP theloop;
END;

CREATE PROCEDURE benchmark_order_by()
BEGIN
  DECLARE i INT DEFAULT 0;
  theloop: loop
    SELECT *
    FROM test
    WHERE ID > FLOOR(1 + RAND() * (400000 - 1))
    ORDER BY Description COLLATE utf8_general_ci LIMIT 1000;
    SET i = i + 1;
    IF i = 10 THEN
      LEAVE theloop;
    END IF;
  END LOOP theloop;
END;

In the stored procedures above the utf8_general_ci collation is used, but of course during the tests I used both utf8_general_ci and utf8_unicode_ci.

I called each stored procedure 5 times for each collation (5 times for utf8_general_ci and 5 times for utf8_unicode_ci) and then calculated the average values.
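
The answer does not say how each call was timed. One possible way, shown here purely as an assumption and not as the author's method, is to record a microsecond timestamp before the call and take the difference afterwards. The user variable name @started is made up, and NOW(6) requires MySQL 5.6.4 or later.

```sql
-- Record the start time with microsecond precision.
SET @started = NOW(6);

-- Run one of the benchmark procedures defined above.
CALL benchmark_simple_select();

-- Elapsed wall-clock time in milliseconds for this single call.
SELECT TIMESTAMPDIFF(MICROSECOND, @started, NOW(6)) / 1000 AS elapsed_ms;
```

Repeating this five times per collation and averaging the elapsed_ms values would mirror the procedure described above. Note that the measured time includes sending the result sets to the client.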

My results are:

benchmark_simple_select()

  • with utf8_general_ci: 9,957 ms
  • with utf8_unicode_ci: 10,271 ms

In this benchmark, utf8_unicode_ci is slower than utf8_general_ci by 3.2%.

benchmark_select_like()

  • with utf8_general_ci: 11,441 ms
  • with utf8_unicode_ci: 12,811 ms

In this benchmark, utf8_unicode_ci is slower than utf8_general_ci by 12%.

benchmark_order_by()

  • with utf8_general_ci: 11,944 ms
  • with utf8_unicode_ci: 12,887 ms

In this benchmark, utf8_unicode_ci is slower than utf8_general_ci by 7.9%.
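
The three percentages appear to be the extra time taken by utf8_unicode_ci relative to the utf8_general_ci time. As a quick check of that reading, the arithmetic can be reproduced from the millisecond figures listed above (the column aliases are illustrative):

```sql
-- Slowdown of utf8_unicode_ci relative to utf8_general_ci, in percent.
SELECT ROUND(100 * (10271 - 9957)  / 9957,  1) AS simple_select_pct,
       ROUND(100 * (12811 - 11441) / 11441, 1) AS select_like_pct,
       ROUND(100 * (12887 - 11944) / 11944, 1) AS order_by_pct;
```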

Answered by Michael Madsen

This post describes it very nicely.

In short: utf8_unicode_ci uses the Unicode Collation Algorithm as defined in the Unicode standards, whereas utf8_general_ci is a simpler sort order which results in "less accurate" sorting results.

Answered by Dana the Sane

See the MySQL manual, Unicode Character Sets section:

For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.

So to summarize, utf8_general_ci uses a smaller and less correct (according to the standard) set of comparisons than utf8_unicode_ci, which should implement the entire standard. The general_ci set will be faster because there is less computation to do.

Answered by simhumileco

In brief:

If you need a better sorting order, use utf8_unicode_ci (this is the preferred method),

but if you are primarily interested in performance, use utf8_general_ci, knowing that it is a little outdated.

The differences in terms of performance are very slight.

Answered by Kamil Kiełczewski

Some details (PL)

As we can read here (Peter Gulutzan), there is a difference in sorting/comparing the Polish letter "Ł" (L with stroke, html esc: &#321;) (lower case: "ł", html esc: &#322;). We have the following assumption:

utf8_polish_ci      Ł greater than L and less than M
utf8_unicode_ci     Ł greater than L and less than M
utf8_unicode_520_ci Ł equal to L
utf8_general_ci     Ł greater than Z

In the Polish language, the letter Ł comes after the letter L and before M. None of these collations is better or worse; it depends on your needs.
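
To check the table above against a particular server, one option is to sort a few letters under each collation in turn. This sketch assumes the connection character set is utf8mb4 and uses the utf8mb4_ counterparts of the collations listed; whether they behave identically to the utf8_ versions is an assumption worth verifying. The derived table and column names are made up.

```sql
SET NAMES utf8mb4;

-- Swap in utf8mb4_unicode_ci, utf8mb4_unicode_520_ci or utf8mb4_general_ci
-- to compare where Ł lands relative to L, M and Z.
SELECT letter
FROM (
  SELECT 'L' AS letter UNION ALL
  SELECT 'Ł' UNION ALL
  SELECT 'M' UNION ALL
  SELECT 'Z'
) AS letters
ORDER BY letter COLLATE utf8mb4_polish_ci;
```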

Answered by Adam

There are two big differences: sorting and character matching.

Sorting:

  • utf8mb4_general_ci removes all accents and sorts character by character, which may create incorrect sort results.
  • utf8mb4_unicode_ci sorts accurately.

Character Matching

They match characters differently.

For example, in utf8mb4_unicode_ci you have i != ı, but in utf8mb4_general_ci it holds that ı = i.

For example, imagine you have a row with name="Yılmaz". Then

select id from users where name='Yilmaz';

would return the row if the collation is utf8mb4_general_ci, but if the collation is utf8mb4_unicode_ci it would not return the row!
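
The dotless ı (U+0131) case can be tried without a table by comparing literals under each collation. This is a sketch that assumes the connection character set is utf8mb4; the aliases are illustrative. The same COLLATE clause can also be appended to the comparison in a WHERE clause to override a column's collation for a single query.

```sql
SET NAMES utf8mb4;

SELECT 'Yılmaz' = 'Yilmaz' COLLATE utf8mb4_general_ci AS general_match,
       'Yılmaz' = 'Yilmaz' COLLATE utf8mb4_unicode_ci AS unicode_match;
```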

On the other hand, we have that a = ª and ß = ss in utf8mb4_unicode_ci, which is not the case in utf8mb4_general_ci. So imagine you have a row with name="ªßi", then

select id from users where name='assi';

would return the row if the collation is utf8mb4_unicode_ci, but would not return a row if the collation is set to utf8mb4_general_ci.

A full list of matches for each collation may be found here.

Answered by DavidH

According to this post, there is a considerable performance benefit on MySQL 5.7 when using utf8mb4_general_ci instead of utf8mb4_unicode_ci: https://www.percona.com/blog/2019/02/27/charset-and-collation-settings-impact-on-mysql-performance/
