What is the best collation to use for MySQL with PHP?
Notice: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep it under the same license and attribute it to the original authors (not me): Stack Overflow
Source: http://stackoverflow.com/questions/367711/
Asked by Darryl Hein
I'm wondering if there is a "best" choice of collation in MySQL for a general website where you aren't 100% sure what will be entered. I understand that the encoding should be the same everywhere: MySQL, Apache, the HTML, and anything inside PHP.
In the past I have set PHP to output UTF-8, but which collation does this match in MySQL? I'm thinking it's one of the UTF-8 ones, but I have used utf8_unicode_ci, utf8_general_ci, and utf8_bin before.
Accepted answer by Eran Galperin
The main differences are sorting accuracy (when comparing characters in a given language) and performance. The only special one is utf8_bin, which compares characters by their binary values.
utf8_general_ci is somewhat faster than utf8_unicode_ci, but less accurate (for sorting). The language-specific utf8 collations (such as utf8_swedish_ci) contain additional language rules that make them the most accurate for sorting in those languages. Most of the time I use utf8_unicode_ci (I prefer accuracy to small performance improvements), unless I have a good reason to prefer a specific language.
You can read more about the specific Unicode character sets in the MySQL manual: http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html
Answer by Vegard Larsen
Actually, you probably want to use utf8_unicode_ci or utf8_general_ci.
- utf8_general_ci sorts by stripping away all accents and sorting as if the text were ASCII.
- utf8_unicode_ci uses the Unicode sort order, so it sorts correctly in more languages.
However, if you are only using this to store English text, these shouldn't differ.
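To make the sorting difference described above concrete, here is a minimal Python sketch. It does not call MySQL; it only imitates the idea. The helper names `strip_accents` and `accent_folded_key` are hypothetical, and the accent folding is a rough stand-in for what a `_ci` collation does, not MySQL's actual weight tables.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose each character (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_folded_key(text: str) -> str:
    """Hypothetical sort key: accents removed, case ignored."""
    return strip_accents(text).casefold()


words = ["zebra", "\u00e9clair", "apple", "Eclipse"]  # "\u00e9" is e-acute

# Plain code point order, roughly what a _bin collation gives you.
binary_order = sorted(words)

# Accent- and case-insensitive order, in the spirit of the _ci collations.
folded_order = sorted(words, key=accent_folded_key)

print(binary_order)
print(folded_order)
```

In the binary ordering the accented word lands after "zebra" because e-acute has a higher code point than "z", and "Eclipse" comes first because uppercase letters precede lowercase ones. The folded key places both words where a reader would expect them.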
Answer by Guus
Be very, very aware of this problem that can occur when using utf8_general_ci.
MySQL will not distinguish between some characters in SELECT statements if the utf8_general_ci collation is used. This can lead to very nasty bugs, especially where usernames are involved. Depending on the implementation that uses the database tables, this problem could allow malicious users to create a username matching an administrator account.
This problem shows up at the very least in early 5.x versions; I'm not sure whether this behaviour was changed later.
I'm no DBA, but to avoid this problem, I always go with utf8_bin instead of a case-insensitive collation.
The script below demonstrates the problem by example.
-- first, create a sandbox to play in
CREATE DATABASE `sandbox`;
use `sandbox`;
-- next, make sure that your client connection uses the same
-- character set and collation as the one we're going to test next:
SET NAMES utf8 COLLATE utf8_general_ci;
-- now, create the table and fill it with values
CREATE TABLE `test` (`key` VARCHAR(16), `value` VARCHAR(16) )
CHARACTER SET utf8 COLLATE utf8_general_ci;
INSERT INTO `test` VALUES ('Key ONE', 'value'), ('Key TWO', 'valúe');
-- (verify)
SELECT * FROM `test`;
-- now, expose the problem/bug:
SELECT * FROM test WHERE `value` = 'value';
--
-- Note that we get BOTH keys here! MySQL's utf8 collations that are
-- case insensitive (ending with _ci) do not distinguish between
-- the two values!
--
-- collate 'utf8_bin' doesn't have this problem, as I'll show next:
--
-- first, reset the client connection charset/collation
SET NAMES utf8 COLLATE utf8_bin;
-- next, convert the values that we've previously inserted in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;
-- now, re-check for the bug
SELECT * FROM test WHERE `value` = 'value';
--
-- Note that we get just one key now, as you'd expect.
--
-- This problem appears to be specific to utf8. Next, I'll try to
-- do the same with the 'latin1' charset:
--
-- first, reset the client connection charset/collation
SET NAMES latin1 COLLATE latin1_general_ci;
-- next, convert the values that we've previously inserted
-- in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET latin1 COLLATE latin1_general_ci;
-- now, re-check for the bug
SELECT * FROM test WHERE `value` = 'value';
--
-- Again, only one key is returned (expected). This shows
-- that the problem with utf8/utf8_general_ci isn't present
-- in latin1/latin1_general_ci
--
-- To complete the example, I'll check with the binary collation
-- of latin1 as well:
-- first, reset the client connection charset/collation
SET NAMES latin1 COLLATE latin1_bin;
-- next, convert the values that we've previously inserted in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET latin1 COLLATE latin1_bin;
-- now, re-check for the bug
SELECT * FROM test WHERE `value` = 'value';
--
-- Again, only one key is returned (expected).
--
-- Finally, I'll re-introduce the problem in the exact same
-- way (for any sceptics out there):
-- first, reset the client connection charset/collation
SET NAMES utf8 COLLATE utf8_general_ci;
-- next, convert the values that we've previously inserted in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
-- now, re-check for the problem/bug
SELECT * FROM test WHERE `value` = 'value';
--
-- Two keys.
--
DROP DATABASE sandbox;
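The effect the script demonstrates can also be imitated outside the database. The following Python sketch is an illustration under stated assumptions, not MySQL's implementation: `folded` is a hypothetical key that strips accents and case, standing in for an accent-insensitive `_ci` comparison, while the default key compares strings exactly, as a `_bin` collation would.

```python
import unicodedata


def folded(text: str) -> str:
    """Hypothetical _ci-like key: drop accents, then ignore case."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def exact(text: str) -> str:
    """_bin-like key: the string itself, compared as-is."""
    return text


# Same two rows as the `test` table in the script ("\u00fa" is u-acute).
rows = [("Key ONE", "value"), ("Key TWO", "val\u00fae")]


def select_keys_where_value(needle: str, key=exact) -> list:
    """Return the keys whose value equals the needle under the given key."""
    return [k for k, v in rows if key(v) == key(needle)]


ci_matches = select_keys_where_value("value", key=folded)
bin_matches = select_keys_where_value("value")

print(ci_matches)
print(bin_matches)
```

With the folded key both rows match, which is the username collision risk described in this answer. With exact comparison only the first row matches.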
Answer by Jeremy Postlethwaite
It is best to use the character set utf8mb4 with the collation utf8mb4_unicode_ci.
The character set utf8 supports only a small share of UTF-8 code points, about 6% of possible characters. utf8 only supports the Basic Multilingual Plane (BMP). There are 16 other planes. Each plane contains 65,536 characters. utf8mb4 supports all 17 planes.
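The numbers above can be checked with a few lines of Python. This is only a sketch of the arithmetic; `fits_in_mysql_utf8` is a hypothetical helper that relies on the fact that BMP characters need at most 3 bytes in UTF-8, while characters from the other planes need 4.

```python
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000  # 65,536

# Share of all code points that lie in the Basic Multilingual Plane.
bmp_share = CODE_POINTS_PER_PLANE / (PLANES * CODE_POINTS_PER_PLANE)


def utf8_length(ch: str) -> int:
    """Number of bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_mysql_utf8(text: str) -> bool:
    """True if every character needs at most 3 bytes, i.e. lies in the BMP."""
    return all(utf8_length(ch) <= 3 for ch in text)


samples = {
    "A": "A",
    "e-acute": "\u00e9",
    "euro sign": "\u20ac",
    "pile of poo": "\U0001F4A9",  # outside the BMP
}
byte_lengths = {name: utf8_length(ch) for name, ch in samples.items()}

print(f"BMP share: {bmp_share:.1%}")
print(byte_lengths)
```

Any string for which `fits_in_mysql_utf8` returns False contains at least one character that the 3-byte utf8 character set cannot store.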
When a column uses utf8, MySQL truncates 4-byte UTF-8 characters, resulting in corrupted data.
The utf8mb4 character set was introduced in MySQL 5.5.3 on 2010-03-24.
Some of the changes required to use the new character set are not trivial:
- Changes may need to be made in your application database adapter.
- Changes will need to be made to my.cnf, including setting the character set, the collation, and switching innodb_file_format to Barracuda.
- SQL CREATE statements may need to include ROW_FORMAT=DYNAMIC. DYNAMIC is required for indexes on VARCHAR(192) and larger.
NOTE: Switching from Antelope to Barracuda may require restarting the MySQL service more than once. innodb_file_format_max does not change until after the MySQL service has been restarted with innodb_file_format = barracuda.
MySQL uses the old Antelope InnoDB file format by default (as of the 5.6 release tested below). Barracuda supports dynamic row formats, which you will need if you do not want to hit these SQL errors when creating indexes and keys after you switch to the utf8mb4 charset:
- #1709 - Index column size too large. The maximum column size is 767 bytes.
- #1071 - Specified key was too long; max key length is 767 bytes
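The 767-byte limit in these errors explains why VARCHAR(192) is the threshold mentioned earlier. The sketch below works through the arithmetic in Python. The 767 figure comes from the error messages; the 3072-byte figure for large prefixes with DYNAMIC rows is an assumption taken from the InnoDB documentation rather than from this answer, and the function names are hypothetical.

```python
ANTELOPE_MAX_KEY_BYTES = 767       # limit quoted in errors #1709 and #1071
LARGE_PREFIX_MAX_KEY_BYTES = 3072  # assumed limit with innodb_large_prefix + DYNAMIC

UTF8_BYTES_PER_CHAR = 3      # MySQL "utf8" reserves up to 3 bytes per character
UTF8MB4_BYTES_PER_CHAR = 4   # utf8mb4 reserves up to 4 bytes per character


def max_indexable_chars(max_key_bytes: int, bytes_per_char: int) -> int:
    """Longest column (in characters) whose full index fits the byte limit."""
    return max_key_bytes // bytes_per_char


def index_fits(varchar_length: int, bytes_per_char: int, max_key_bytes: int) -> bool:
    """True if an index over the whole column stays within the byte limit."""
    return varchar_length * bytes_per_char <= max_key_bytes


print(max_indexable_chars(ANTELOPE_MAX_KEY_BYTES, UTF8_BYTES_PER_CHAR))
print(max_indexable_chars(ANTELOPE_MAX_KEY_BYTES, UTF8MB4_BYTES_PER_CHAR))
print(index_fits(640, UTF8MB4_BYTES_PER_CHAR, ANTELOPE_MAX_KEY_BYTES))
print(index_fits(640, UTF8MB4_BYTES_PER_CHAR, LARGE_PREFIX_MAX_KEY_BYTES))
```

Under these assumptions a utf8mb4 column can be fully indexed up to 191 characters with the old limit, so a VARCHAR(640) column needs either the larger limit or a prefix index such as the first 128 characters.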
The following scenario has been tested on MySQL 5.6.17: By default, MySQL is configured like this:
SHOW VARIABLES;
innodb_large_prefix = OFF
innodb_file_format = Antelope
Stop your MySQL service and add the options to your existing my.cnf:
[client]
default-character-set= utf8mb4
[mysqld]
explicit_defaults_for_timestamp = true
innodb_large_prefix = true
innodb_file_format = barracuda
innodb_file_format_max = barracuda
innodb_file_per_table = true
# Character collation
character_set_server=utf8mb4
collation_server=utf8mb4_unicode_ci
Example SQL CREATE statement:
CREATE TABLE Contacts (
id INT AUTO_INCREMENT NOT NULL,
ownerId INT DEFAULT NULL,
created timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
modified timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
contact VARCHAR(640) NOT NULL,
prefix VARCHAR(128) NOT NULL,
first VARCHAR(128) NOT NULL,
middle VARCHAR(128) NOT NULL,
last VARCHAR(128) NOT NULL,
suffix VARCHAR(128) NOT NULL,
notes MEDIUMTEXT NOT NULL,
INDEX IDX_CA367725E05EFD25 (ownerId),
INDEX created (created),
INDEX modified_idx (modified),
INDEX contact_idx (contact),
PRIMARY KEY(id)
) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci ENGINE = InnoDB ROW_FORMAT=DYNAMIC;
- You can see error #1709 generated for INDEX contact_idx (contact) if ROW_FORMAT=DYNAMIC is removed from the CREATE statement.
NOTE: Changing the index to cover only the first 128 characters of contact eliminates the requirement for using Barracuda with ROW_FORMAT=DYNAMIC:
INDEX contact_idx (contact(128)),
Also note: when the field size is given as VARCHAR(128), that is not 128 bytes. You can have 128 4-byte characters or 128 1-byte characters.
This INSERT statement should contain the 4-byte 'poo' character in the second row:
INSERT INTO `Contacts` (`id`, `ownerId`, `created`, `modified`, `contact`, `prefix`, `first`, `middle`, `last`, `suffix`, `notes`) VALUES
(1, NULL, '0000-00-00 00:00:00', '2014-08-25 03:00:36', '1234567890', '12345678901234567890', '1234567890123456789012345678901234567890', '1234567890123456789012345678901234567890', '12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678', '', ''),
(2, NULL, '0000-00-00 00:00:00', '2014-08-25 03:05:57', 'poo', '12345678901234567890', '', '', '', '', ''),
(3, NULL, '0000-00-00 00:00:00', '2014-08-25 03:05:57', 'poo', '12345678901234567890', '', '', '123', '', '');
You can see the amount of space used by the last column:
mysql> SELECT BIT_LENGTH(`last`), CHAR_LENGTH(`last`) FROM `Contacts`;
+--------------------+---------------------+
| BIT_LENGTH(`last`) | CHAR_LENGTH(`last`) |
+--------------------+---------------------+
|               1024 |                 128 | -- All characters are ASCII
|               4096 |                 128 | -- All characters are 4 bytes
|               4024 |                 128 | -- 3 characters are ASCII, 125 are 4 bytes
+--------------------+---------------------+
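The figures in this result can be reproduced without a database. The Python sketch below builds three 128-character strings matching the row comments and measures them the way BIT_LENGTH and CHAR_LENGTH would for UTF-8 data. The strings themselves are made up for illustration; they are not the values from the INSERT statement.

```python
POO = "\U0001F4A9"  # a 4-byte character in UTF-8


def bit_length(text: str) -> int:
    """Bits occupied by the UTF-8 encoding, like MySQL's BIT_LENGTH()."""
    return len(text.encode("utf-8")) * 8


def char_length(text: str) -> int:
    """Number of characters, like MySQL's CHAR_LENGTH()."""
    return len(text)


all_ascii = "1" * 128          # 128 one-byte characters
all_four_byte = POO * 128      # 128 four-byte characters
mixed = "123" + POO * 125      # 3 one-byte and 125 four-byte characters

measurements = [
    (bit_length(s), char_length(s)) for s in (all_ascii, all_four_byte, mixed)
]
for bits, chars in measurements:
    print(bits, chars)
```

Each string has 128 characters, yet the storage ranges from 128 to 512 bytes, which is the point of the note about VARCHAR(128) not meaning 128 bytes.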
In your database adapter, you may want to set the charset and collation for your connection:
SET NAMES 'utf8mb4' COLLATE 'utf8mb4_unicode_ci'
In PHP, this would be set via \PDO::MYSQL_ATTR_INIT_COMMAND.
Answer by Tomalak
Collations affect how data is sorted and how strings are compared to each other. That means you should use the collation that most of your users expect.
Example from the documentation for the Unicode character sets:
"utf8_general_ci also is satisfactory for both German and French, except that 'ß' is equal to 's', and not to 'ss'. If this is acceptable for your application, then you should use utf8_general_ci because it is faster. Otherwise, use utf8_unicode_ci because it is more accurate."
So it depends on your expected user base and on how much you need correct sorting. For an English user base, utf8_general_ci should suffice; for other languages, like Swedish, special collations have been created.
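The sharp s case from the quoted documentation can be imitated in Python. This is a sketch under assumptions: `unicode_ci_like_key` relies on Python's `str.casefold()`, which follows Unicode full case folding and expands the sharp s to "ss", while `general_ci_like_key` is a hypothetical simplification that maps it to a single "s" as the quote describes. Neither function reproduces MySQL's real collation tables.

```python
SHARP_S = "\u00df"  # German sharp s


def unicode_ci_like_key(text: str) -> str:
    """Full case folding: the sharp s expands to 'ss'."""
    return text.casefold()


def general_ci_like_key(text: str) -> str:
    """Hypothetical one-to-one folding: the sharp s becomes a single 's'."""
    return text.replace(SHARP_S, "s").casefold()


street = "Stra" + SHARP_S + "e"

# Under full folding the word equals the "ss" spelling.
print(unicode_ci_like_key(street) == unicode_ci_like_key("Strasse"))

# Under the one-to-one folding it equals the single "s" spelling instead.
print(general_ci_like_key(street) == general_ci_like_key("Strase"))
print(general_ci_like_key(street) == general_ci_like_key("Strasse"))
```

For German text the first behaviour is usually what users expect, which is the accuracy trade-off the documentation mentions.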
Answer by Phil
Essentially, it depends on how you think of a string.
I always use utf8_bin because of the problem highlighted by Guus. In my opinion, as far as the database is concerned, a string is still just a string. A string is a number of UTF-8 characters. A character has a binary representation, so why does the database need to know the language you're using? Usually, people will be constructing databases for systems with scope for multilingual sites. This is the whole point of using UTF-8 as a character set. I'm a bit of a purist, but I think the bug risks heavily outweigh the slight advantage you may get on indexing. Any language-related rules should be handled at a much higher level than the DBMS.
In my book, "value" should never in a million years be equal to "valúe".
If I want to store a text field and do a case-insensitive search, I will use the MySQL string function LOWER() together with the PHP function strtolower().
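The approach in the last paragraph, exact comparison plus explicit lowercasing, can be sketched in Python as follows. The function name is hypothetical, and Python's `str.lower()` stands in for LOWER() and strtolower(); the point is only that lowering both sides ignores case while keeping accents significant.

```python
def matches_ignoring_case(stored: str, needle: str) -> bool:
    """Lowercase both sides, then compare exactly (accents still matter)."""
    return stored.lower() == needle.lower()


print(matches_ignoring_case("Value", "value"))        # differs only in case
print(matches_ignoring_case("val\u00fae", "value"))   # differs by an accent
```

The first comparison succeeds and the second fails, so "value" and "valúe" stay distinct even though the search ignores case.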
Answer by mepcotterell
For UTF-8 textual information, you should use utf8_general_ci because...
- utf8_bin: compares strings by the binary value of each character in the string
- utf8_general_ci: compares strings using general language rules and using case-insensitive comparisons
In other words, it should make searching and indexing the data faster, more efficient, and more useful.
Answer by George Lund
The accepted answer fairly definitively suggests using utf8_unicode_ci, and whilst that's great for new projects, I wanted to relate my recent contrary experience in case it saves anyone some time.
Because utf8_general_ci is the default collation for Unicode in MySQL, if you want to use utf8_unicode_ci you end up having to specify it in a lot of places.
For example, all client connections have not only a default charset (which makes sense to me) but also a default collation (i.e. the collation will always default to utf8_general_ci for Unicode).
So if you use utf8_unicode_ci for your fields, your scripts that connect to the database will likely need to be updated to mention the desired collation explicitly; otherwise queries using text strings can fail when your connection is using the default collation.
The upshot is that when converting an existing system of any size to Unicode/utf8, you may end up being forced to use utf8_general_ci because of the way MySQL handles defaults.
Answer by SEoF
For the case highlighted by Guus, I would strongly suggest using utf8_unicode_cs (case sensitive, strict matching, ordering correctly for the most part) instead of utf8_bin (strict matching, incorrect ordering).
If the field is intended to be searched, as opposed to matched for a user, then use utf8_general_ci or utf8_unicode_ci. Both are case-insensitive; one will match loosely ('ß' is equal to 's', and not to 'ss'). There are also language-specific versions, like utf8_german_ci, where the loose matching is more suitable for the specified language.
[Edit - nearly 6 years later]
I no longer recommend the "utf8" character set on MySQL, and instead recommend the "utf8mb4" character set. The two match almost entirely, but utf8mb4 allows for a little (a lot) more Unicode characters.
Realistically, MySQL should have updated the "utf8" character set and its collations to match the UTF-8 specification; instead, it added a separate character set with its own collations so as not to change storage for those already using the incomplete "utf8" character set.
Answer by jiv-e
I found these collation charts helpful: http://collation-charts.org/mysql60/. I'm not sure which one corresponds to utf8_general_ci, though.
For example, here is the chart for utf8_swedish_ci. It shows which characters it treats as the same: http://collation-charts.org/mysql60/mysql604.utf8_swedish_ci.html

