标准化 MySQL 查询中的重音字符
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/2302813/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
normalizing accented characters in MySQL queries
提问by George Armhold
I'd like to be able to do queries that normalize accented characters, so that for example:
我希望能够进行标准化重音字符的查询,例如:
é, è, and ê
are all treated as 'e', in queries using '=' and 'like'. I have a row with username field set to 'rené', and I'd like to be able to match on it with both 'rene' and 'rené'.
在使用 '=' 和 'like' 的查询中都被视为 'e'。我有一行用户名字段设置为“ rené”,我希望能够将其与“ rene”和“ rené”进行匹配。
I'm attempting to do this with the 'collate' clause in MySQL 5.0.8. I get the following error:
我正在尝试使用 MySQL 5.0.8 中的“collate”子句来做到这一点。我收到以下错误:
mysql> select * from User where username = 'rené' collate utf8_general_ci;
ERROR 1253 (42000): COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'latin1'
FWIW, my table was created with:
FWIW,我的表是用以下方法创建的:
CREATE TABLE `User` (
`id` bigint(19) NOT NULL auto_increment,
`username` varchar(32) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `uniqueUsername` (`username`)
) ENGINE=InnoDB AUTO_INCREMENT=56790 DEFAULT CHARSET=utf8
回答by sfussenegger
The reason for the error is not the table but the characterset of your input, i.e. the 'rené' in your query. The behaviour depends on the character_set_connectionvariable:
错误的原因不是表格,而是您输入的字符集,即查询中的“rené”。行为取决于character_set_connection变量:
The character set used for literals that do not have a character set introducer and for number-to-string conversion.
用于没有字符集介绍器的文字和用于数字到字符串转换的字符集。
Using the MySQL Client, change it using SET NAMES
:
使用 MySQL 客户端,使用SET NAMES
以下命令更改它:
A SET NAMES 'charset_name' statement is equivalent to these three statements:
SET NAMES 'charset_name' 语句等效于以下三个语句:
SET character_set_client = charset_name;
SET character_set_results = charset_name;
SET character_set_connection = charset_name;
(from http://dev.mysql.com/doc/refman/5.5/en/charset-connection.html)
(来自http://dev.mysql.com/doc/refman/5.5/en/charset-connection.html)
Example output:
示例输出:
mysql> set names latin1;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from User where username = 'rené' collate utf8_general_ci;
ERROR 1253 (42000): COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'latin1'
mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from User where username = 'rené' collate utf8_general_ci;
Empty set (0.00 sec)
Altenatively, use can explicitly set the character set using a 'character set introducer':
或者, use 可以使用“字符集介绍器”显式设置字符集:
mysql> set names latin1;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from User where username = _utf8'rené' collate utf8_general_ci;
Empty set (0.00 sec)
I know this question is pretty old but since Google led me here for a related question, I though it still deserves an answer :)
我知道这个问题已经很老了,但由于谷歌把我带到了一个相关的问题,我认为它仍然值得回答:)
回答by Tatu Ulmanen
I'd suggest that you save the normalized versions to your table in addition with the real username. Changing the encoding on the fly can be expensive, and you have to do the conversion again for every row on every search.
除了真实用户名之外,我建议您将规范化版本保存到您的表中。动态更改编码可能会很昂贵,并且您必须在每次搜索时为每一行再次进行转换。
If you're using PHP, you can use iconv()to handle the conversion:
如果您使用的是 PHP,则可以使用iconv()来处理转换:
$username = 'rené';
$normalized = iconv('UTF-8', 'ASCII//TRANSLIT', $string);
Then you'd just save both versions and use the normalized version for searching and normal username for display. Comparing and selecting will be alot faster from the normalized column, provided that you normalize the search string also:
然后您只需保存两个版本并使用规范化版本进行搜索和使用普通用户名进行显示。如果您还对搜索字符串进行了标准化,则从标准化列中进行比较和选择的速度会快很多:
$search = mysql_real_escape_string(iconv('UTF-8', 'ASCII//TRANSLIT', $_GET['search']));
mysql_query("SELECT * FROM User WHERE normalized LIKE '%".$search."%'");
Of course this method might not be viable if you have several columns that need normalizations, but in your specific case this might work allright.
当然,如果您有几个需要标准化的列,这种方法可能不可行,但在您的特定情况下,这可能没问题。
回答by Felipe Buccioni
回答by José Nobile
$normalized = iconv('UTF-8', 'ASCII//TRANSLIT', $string);
is a perfect php solution, but in mysql? CONVERT?
是一个完美的 php 解决方案,但在 mysql 中?转变?
in mysql
在 mysql
SELECT 'álvaro José' as accented, (CONVERT ('álvaro José' USING ascii)) as notaccented
Produce:
生产:
álvaro José ?lvaro Jos?
The accented words is not converted to no accented words, it is not equivalent a translit of iconv.
重音词不会转换为无重音词,它不等同于 iconv 的转译。
RegExp don't work with UTF-8.
RegExp 不适用于 UTF-8。
Not any solution.
没有任何解决办法。
回答by Frank Forte
Does a search using Englishcharacters return results with foreign characters? I wrote the following script to compare collations in MySQL 5.7 (Should also work for MariaDB 10.2+):
使用英文字符搜索是否返回带有外文字符的结果?我编写了以下脚本来比较 MySQL 5.7 中的排序规则(也适用于 MariaDB 10.2+):
$db->query('CREATE TABLE IF NOT EXISTS test (name varchar(20))
Engine=InnoDB character set utf8mb4 collate utf8mb4_unicode_520_ci');
$db->query('CREATE TABLE IF NOT EXISTS test2 (name varchar(20))
Engine=InnoDB character set utf8mb4 collate utf8mb4_unicode_ci');
$db->query("insert into test values('?ove 520')");
$db->query("insert into test2 values('?ove 520')");
$types = ['utf8mb4_unicode_520_ci', 'utf8mb4_unicode_ci'];
$tables = ['test' => 'utf8mb4_unicode_520_ci', 'test2' => 'utf8mb4_unicode_ci'];
foreach($types as $n)
{
foreach($tables as $ta => $tc)
{
$db->query("SET NAMES 'utf8mb4' COLLATE '$n'");
$res = $db->query("Select * from $ta where name like 'Love%'"); // ? equal
echo "\ntable $ta($tc), names($n): ".$res->fetchColumn(0);
}
}
Here are the results:
结果如下:
table test(utf8mb4_unicode_520_ci), names(utf8mb4_unicode_520_ci): Łove 520
table test2(utf8mb4_unicode_ci), names(utf8mb4_unicode_520_ci):
table test(utf8mb4_unicode_520_ci), names(utf8mb4_unicode_ci): Łove 520
table test2(utf8mb4_unicode_ci), names(utf8mb4_unicode_ci):
(Note: I ran the script from the command line, so it appears as Łove 520 instead of ?ove 520)
(注意:我从命令行运行脚本,所以它显示为 ┼üove 520 而不是 ?ove 520)
It appears that L == ? when the table collation is utf8mb4_unicode_520_ci, regardless of the connection collation. However, it is notequivalent if you only use utf8mb4_unicode_ci.
看起来 L == ? 当表排序规则为 utf8mb4_unicode_ 520_ci 时,无论连接排序规则如何。但是,如果仅使用 utf8mb4_unicode_ci ,则不等效。