What do MySQL character sets and collations actually mean?
Notice: this page reproduces a popular Stack Overflow question and its answers, which are provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/341273/
What does character set and collation mean exactly?
Asked by Sander Versluys
I can read the MySQL documentation and it's pretty clear. But how does one decide which character set to use? On what data does collation have an effect?
I'm asking for an explanation of the two and how to choose them.
Answered by Dan Esparza
From MySQL docs:
A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. Let's make the distinction clear with an example of an imaginary character set.
Suppose that we have an alphabet with four letters: 'A', 'B', 'a', 'b'. We give each letter a number: 'A' = 0, 'B' = 1, 'a' = 2, 'b' = 3. The letter 'A' is a symbol, the number 0 is the encoding for 'A', and the combination of all four letters and their encodings is a character set.
Now, suppose that we want to compare two string values, 'A' and 'B'. The simplest way to do this is to look at the encodings: 0 for 'A' and 1 for 'B'. Because 0 is less than 1, we say 'A' is less than 'B'. Now, what we've just done is apply a collation to our character set. The collation is a set of rules (only one rule in this case): "compare the encodings." We call this simplest of all possible collations a binary collation.
But what if we want to say that the lowercase and uppercase letters are equivalent? Then we would have at least two rules: (1) treat the lowercase letters 'a' and 'b' as equivalent to 'A' and 'B'; (2) then compare the encodings. We call this a case-insensitive collation. It's a little more complex than a binary collation.
In real life, most character sets have many characters: not just 'A' and 'B' but whole alphabets, sometimes multiple alphabets or eastern writing systems with thousands of characters, along with many special symbols and punctuation marks. Also in real life, most collations have many rules: not just case insensitivity but also accent insensitivity (an "accent" is a mark attached to a character as in German 'Ö') and multiple-character mappings (such as the rule that 'Ö' = 'OE' in one of the two German collations).
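The imaginary four-letter character set above can be sketched in a few lines of Python. This is only an illustration of the idea, not how MySQL implements collations; the names ENCODING, binary_key and case_insensitive_key are made up for the example.

```python
# Toy character set from the text: four symbols and their encodings.
ENCODING = {"A": 0, "B": 1, "a": 2, "b": 3}


def binary_key(s):
    """Binary collation: compare the encodings directly."""
    return [ENCODING[ch] for ch in s]


def case_insensitive_key(s):
    """Case-insensitive collation: treat lowercase as uppercase, then compare encodings."""
    return [ENCODING[ch.upper()] for ch in s]


words = ["b", "A", "a", "B"]

binary_sorted = sorted(words, key=binary_key)
ci_sorted = sorted(words, key=case_insensitive_key)

print(binary_sorted)  # ['A', 'B', 'a', 'b']
print(ci_sorted)      # ['A', 'a', 'b', 'B'] (equal keys keep their input order)

# Under the case-insensitive rules 'a' and 'A' compare as equal; under binary rules they do not.
print(case_insensitive_key("a") == case_insensitive_key("A"))  # True
print(binary_key("a") == binary_key("A"))                      # False
```

The same stored data gives a different order and different equality results depending only on which set of rules is applied.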
Answered by mat
A character encoding is a way to encode characters so that they fit in memory. That is, if the charset is ISO-8859-15, the euro symbol, €, will be encoded as 0xa4, and in UTF-8, it will be 0xe282ac.
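The two byte sequences for the euro sign can be checked with Python's built-in codecs. This is a small sketch outside MySQL itself; the variable names are only for illustration.

```python
euro = "\u20ac"  # the euro sign

# Encode the same character with two different character encodings.
latin9_bytes = euro.encode("iso-8859-15")
utf8_bytes = euro.encode("utf-8")

print(latin9_bytes.hex())  # a4
print(utf8_bytes.hex())    # e282ac
```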
The collation is how to compare characters. In latin9, there are letters such as e é è ê f. If sorted by their binary representation, they will go e f è é ê, but if the collation is set to, for example, French, you'll have them in the order you thought they would be, which is that e é è ê are all equal, and then f.
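The difference between the two orderings can be sketched in Python. The accent-insensitive key below is an assumption made for illustration: it strips combining marks after Unicode decomposition. That is a rough stand-in for an accent-insensitive collation, not the actual French collation rules. In ISO-8859-15 the accented letters are è = 0xE8, é = 0xE9, ê = 0xEA, which is what drives the binary order.

```python
import unicodedata

# f, ê, e, è, é in deliberately shuffled order.
letters = ["f", "\u00ea", "e", "\u00e8", "\u00e9"]


def binary_key(ch):
    """Binary ordering: compare the raw latin9 (ISO-8859-15) bytes."""
    return ch.encode("iso-8859-15")


def accent_insensitive_key(ch):
    """Illustrative accent-insensitive ordering: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


binary_order = sorted(letters, key=binary_key)
folded_order = sorted(letters, key=accent_insensitive_key)

print(binary_order)  # ['e', 'f', 'è', 'é', 'ê']
print(folded_order)  # ['ê', 'e', 'è', 'é', 'f'] (the four e-variants compare equal)
```

Because the four e-variants share one key, Python's stable sort leaves them in input order and places f after all of them.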
Answered by erickson
A character set is a subset of all written glyphs. A character encoding specifies how those characters are mapped to numeric values. Some character encodings, like UTF-8 and UTF-16, can encode any character in the Universal Character Set. Others, like US-ASCII or ISO-8859-1, can only encode a small subset, since they use 7 and 8 bits per character, respectively. Because many standards specify both a character set and a character encoding, the term "character set" is often substituted freely for "character encoding".
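A quick way to see which encodings can represent which characters is to try encoding them. The helper can_encode below is a hypothetical name used only for this sketch, and the sample characters are arbitrary.

```python
def can_encode(text, encoding):
    """Return True if every character in text is representable in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


e_acute = "\u00e9"  # é, present in ISO-8859-1 but not in US-ASCII
euro = "\u20ac"     # €, absent from both US-ASCII and ISO-8859-1
han = "\u6f22"      # 漢, a CJK character

print(can_encode(e_acute, "ascii"))    # False
print(can_encode(e_acute, "iso-8859-1"))  # True
print(can_encode(euro, "iso-8859-1"))  # False
print(can_encode(han, "utf-8"))        # True
print(can_encode(han, "utf-16"))       # True
```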
A collation comprises rules that specify how characters can be compared for sorting. Collation rules can be locale-specific: the proper order of two characters varies from language to language.
Choosing a character set and collation comes down to whether your application is internationalized or not. If not, what locale are you targeting?
In order to choose what character set you want to support, you have to consider your application. If you are storing user-supplied input, it might be hard to foresee all the locales in which your software will eventually be used. To support them all, it might be best to support the UCS (Unicode) from the start. However, there is a cost to this; many western European characters will now require two bytes of storage per character instead of one.
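The storage cost mentioned above can be measured directly, assuming the Unicode data is stored as UTF-8 (a database's actual on-disk format may differ). A minimal sketch:

```python
e_acute = "\u00e9"  # é, a common western European accented letter

latin1_size = len(e_acute.encode("iso-8859-1"))
utf8_size = len(e_acute.encode("utf-8"))
ascii_letter_utf8_size = len("e".encode("utf-8"))

print(latin1_size)             # 1
print(utf8_size)               # 2
print(ascii_letter_utf8_size)  # 1
```

Unaccented ASCII letters still take one byte in UTF-8; only the characters outside ASCII grow.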
Choosing the right collation can help performance if your database uses the collation to create an index, and later uses that index to provide sorted results. However, since collation rules are often locale-specific, that index will be worthless if you need to sort results according to the rules of another locale.
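The interaction between an index and its collation can be sketched with SQLite, used here only because it ships with Python. It is a stand-in, not MySQL, and MySQL's planner and collation names differ. The table people and index idx_name_nocase are hypothetical. The sketch checks whether the query plan needs an extra sort step (SQLite reports this as a temporary B-tree) when the ORDER BY collation does or does not match the index collation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?)",
    [("bob",), ("Alice",), ("alice",), ("Bob",)],
)
# Index built with SQLite's case-insensitive NOCASE collation.
conn.execute("CREATE INDEX idx_name_nocase ON people (name COLLATE NOCASE)")


def plan(sql):
    """Return the detail column of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(str(row[-1]) for row in rows)


# Sorting with the index's collation: the index order can be reused.
same_collation = plan("SELECT name FROM people ORDER BY name COLLATE NOCASE")
# Sorting with a different collation: the index order is of no use for the sort.
other_collation = plan("SELECT name FROM people ORDER BY name COLLATE BINARY")

print("same collation needs extra sort: ", "TEMP B-TREE" in same_collation)
print("other collation needs extra sort:", "TEMP B-TREE" in other_collation)
```

The exact plan wording varies between SQLite versions, so the sketch only looks for the sort step rather than matching the full plan text.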
Answered by simhumileco
I suggest using utf8mb4_unicode_ci, which is based on the Unicode standard for sorting and comparison and sorts accurately in a very wide range of languages.