PHP: How to sort an array of UTF-8 strings?
Source: http://stackoverflow.com/questions/120334/
Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Stefan Gehrig
I currently have no clue how to sort an array that contains UTF-8 encoded strings in PHP. The array comes from an LDAP server, so sorting via a database (which would be no problem) is not an option. The following does not work on my Windows development machine (although I'd think that this should at least be a possible solution):
$array=array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');
$oldLocal=setlocale(LC_COLLATE, "0");
var_dump(setlocale(LC_COLLATE, 'German_Germany.65001'));
usort($array, 'strcoll');
var_dump(setlocale(LC_COLLATE, $oldLocal));
var_dump($array);
The output is:
string(20) "German_Germany.65001"
string(1) "C"
array(6) {
[0]=>
string(6) "Birnen"
[1]=>
string(9) "Ungetiere"
[2]=>
string(6) "Äpfel"
[3]=>
string(5) "Apfel"
[4]=>
string(9) "Ungetüme"
[5]=>
string(11) "Österreich"
}
This is complete nonsense. Using 1252 as the codepage for setlocale() gives a different output, but still a plainly wrong one:
string(19) "German_Germany.1252"
string(1) "C"
array(6) {
[0]=>
string(11) "Österreich"
[1]=>
string(6) "Äpfel"
[2]=>
string(5) "Apfel"
[3]=>
string(6) "Birnen"
[4]=>
string(9) "Ungetüme"
[5]=>
string(9) "Ungetiere"
}
Is there a locale-aware way to sort an array of UTF-8 strings?
Just noted that this seems to be a PHP-on-Windows problem, as the same snippet with de_DE.utf8 used as the locale works on a Linux machine. Nevertheless, a solution for this Windows-specific problem would be nice...
Accepted answer by Stefan Gehrig
In the end, this problem cannot be solved in a simple way without recoding the strings (UTF-8 → Windows-1252 or ISO-8859-1), as suggested by ΤΖΩΤΖΙΟΥ, due to an apparent PHP bug discovered by Huppie. To summarize the problem, I created the following code snippet, which clearly demonstrates that the problem lies in the strcoll() function when using the Windows UTF-8 codepage 65001.
function traceStrColl($a, $b) {
    $outValue=strcoll($a, $b);
    echo "$a $b $outValue\r\n";
    return $outValue;
}
$locale=(defined('PHP_OS') && stristr(PHP_OS, 'win')) ? 'German_Germany.65001' : 'de_DE.utf8';
$string="ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜabcdefghijklmnopqrstuvwxyzäöüß";
$array=array();
for ($i=0; $i<mb_strlen($string, 'UTF-8'); $i++) {
    $array[]=mb_substr($string, $i, 1, 'UTF-8');
}
$oldLocale=setlocale(LC_COLLATE, "0");
var_dump(setlocale(LC_COLLATE, $locale));
usort($array, 'traceStrColl');
setlocale(LC_COLLATE, $oldLocale);
var_dump($array);
The result is:
string(20) "German_Germany.65001"
a B 2147483647
[...]
array(59) {
[0]=>
string(1) "c"
[1]=>
string(1) "B"
[2]=>
string(1) "s"
[3]=>
string(1) "C"
[4]=>
string(1) "k"
[5]=>
string(1) "D"
[6]=>
string(2) "?"
[7]=>
string(1) "E"
[8]=>
string(1) "g"
[...]
The same snippet works without any problems on a Linux machine, producing the following output:
string(10) "de_DE.utf8"
a B -1
[...]
array(59) {
[0]=>
string(1) "a"
[1]=>
string(1) "A"
[2]=>
string(2) "ä"
[3]=>
string(2) "Ä"
[4]=>
string(1) "b"
[5]=>
string(1) "B"
[6]=>
string(1) "c"
[7]=>
string(1) "C"
[...]
The snippet also works when using Windows-1252 (ISO-8859-1) encoded strings (of course the mb_* encodings and the locale must be changed then).
I filed a bug report on bugs.php.net: Bug #46165 strcoll() does not work with UTF-8 strings on Windows. If you experience the same problem, you can give your feedback to the PHP team on the bug-report page (two other, probably related, bugs have been classified as bogus; I don't think that this bug is bogus ;-).
Thanks to all of you.
Answer by Delian Krustev
$a = array( 'Кръстев', 'Делян1', 'делян1', 'Делян2', 'делян3', 'кръстев' );
$col = new \Collator('bg_BG');
$col->asort( $a );
var_dump( $a );
Prints:
array
2 => string 'делян1' (length=11)
1 => string 'Делян1' (length=11)
3 => string 'Делян2' (length=11)
4 => string 'делян3' (length=11)
5 => string 'кръстев' (length=14)
0 => string 'Кръстев' (length=14)
The Collator class is defined in the PECL intl extension. It is distributed with the PHP 5.3 sources but might be disabled in some builds. E.g. in Debian it is in the package php5-intl.
Collator::compare is useful for usort.
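To illustrate the usort remark, here is a minimal sketch applied to the German words from the question. It assumes the intl extension is loaded; the 'de_DE' locale and the variable names are my own choices for illustration, not something taken from the answer above.

```php
<?php
// Assumption: the intl extension is available; 'de_DE' is an illustrative locale.
$words = ['Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich'];

$collator = new Collator('de_DE');

// Collator::compare() returns -1, 0 or 1 (false on error), which fits usort().
usort($words, function (string $a, string $b) use ($collator): int {
    return (int) $collator->compare($a, $b);
});

print_r($words);
```

Unlike asort(), usort() reindexes the array, so use Collator::asort() when the original keys matter.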
Answer by Stefan Gehrig
Update on this issue:
Even though the discussion around this problem suggested that we might have discovered a PHP bug in strcoll() and/or setlocale(), this is clearly not the case. The problem is rather a limitation of the Windows CRT implementation of setlocale() (PHP's setlocale() is just a thin wrapper around the CRT call). The following is a quotation from the MSDN page "setlocale, _wsetlocale":
The set of available languages, country/region codes, and code pages includes all those supported by the Win32 NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8. If you provide a code page like UTF-7 or UTF-8, setlocale will fail, returning NULL. The set of language and country/region codes supported by setlocale is listed in Language and Country/Region Strings.
It is therefore impossible to use locale-aware string operations within PHP on Windows when strings are multi-byte encoded.
Answer by tzot
This is a very complex issue, since UTF-8 encoded data can contain any Unicode character (i.e. characters from many 8-bit encodings which collate differently in different locales).
Perhaps if you converted your UTF-8 data into Unicode (I'm not familiar with PHP's Unicode functions, sorry), normalized it into NFD or NFKD, and then sorted on code points, you might get a collation that makes sense to you (i.e. "A" before "Ä").
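The normalization idea could be sketched roughly as follows. This assumes the intl extension, which provides the Normalizer class; the variable names are hypothetical. Note that this is plain code point ordering, not a real collation: all uppercase letters still sort before all lowercase ones.

```php
<?php
// Assumption: the intl extension (Normalizer class) is available.
$fruits = ['Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich'];

// Decompose each string (NFD), so "Ä" becomes "A" plus a combining diaeresis.
$sortKeys = array_map(function (string $s): string {
    return Normalizer::normalize($s, Normalizer::FORM_D);
}, $fruits);

// Sort the keys bytewise (for valid UTF-8 this equals code point order)
// and reorder the original strings along with them.
array_multisort($sortKeys, SORT_ASC, SORT_STRING, $fruits);

print_r($fruits);
```

Sorting on separate keys keeps the original (precomposed) strings intact, so nothing has to be converted back afterwards.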
Check the links I provided.
EDIT: since you mention that your input data are clear (I assume they all fall within the "windows-1252" codepage), you should do the following conversion: UTF-8 → Unicode → Windows-1252, and then sort the Windows-1252 encoded data using a "CP1252" locale.
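A rough sketch of that recode-sort-recode approach, assuming the mbstring extension is available. The function name and the candidate locale names are hypothetical, and which locales actually exist depends on the system, so the resulting order may differ from machine to machine.

```php
<?php
// Sketch: recode UTF-8 to Windows-1252, sort with a single-byte locale, recode back.
// sortViaCp1252() and the locale names passed to it are illustrative assumptions.
function sortViaCp1252(array $utf8Strings, array $localeCandidates): array
{
    // Convert each string to the single-byte Windows-1252 encoding.
    $recoded = array_map(function (string $s): string {
        return mb_convert_encoding($s, 'Windows-1252', 'UTF-8');
    }, $utf8Strings);

    $oldLocale = setlocale(LC_COLLATE, '0');
    // setlocale() tries each candidate in turn and returns false if none works.
    if (setlocale(LC_COLLATE, $localeCandidates) === false) {
        // No matching locale: fall back to plain byte order.
        sort($recoded, SORT_STRING);
    } else {
        usort($recoded, 'strcoll');
        setlocale(LC_COLLATE, $oldLocale);
    }

    // Convert the sorted strings back to UTF-8.
    return array_map(function (string $s): string {
        return mb_convert_encoding($s, 'UTF-8', 'Windows-1252');
    }, $recoded);
}

$sortedWords = sortViaCp1252(
    ['Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich'],
    ['German_Germany.1252', 'de_DE.ISO-8859-1', 'de_DE']
);
print_r($sortedWords);
```

This only works for characters that exist in Windows-1252; anything else is lost in the first conversion.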
Answer by Huppie
Using your example with codepage 1252 worked perfectly fine here on my windows development machine.
$array=array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');
$oldLocal=setlocale(LC_COLLATE, "0");
var_dump(setlocale(LC_COLLATE, 'German_Germany.1252'));
usort($array, 'strcoll');
var_dump(setlocale(LC_COLLATE, $oldLocal));
var_dump($array);
...snip...
This was with PHP 5.2.6, by the way.
The example above is wrong: it uses ASCII encoding instead of UTF-8. I did trace the strcoll() calls, and look at what I found:
function traceStrColl($a, $b) {
    $outValue = strcoll($a, $b);
    echo "$a $b $outValue\r\n";
    return $outValue;
}
$array=array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');
setlocale(LC_COLLATE, 'German_Germany.65001');
usort($array, 'traceStrColl');
print_r($array);
gives:
Ungetüme Äpfel 2147483647
Ungetüme Birnen 2147483647
Ungetüme Apfel 2147483647
Ungetüme Ungetiere 2147483647
Österreich Ungetüme 2147483647
Äpfel Ungetiere 2147483647
Äpfel Birnen 2147483647
Apfel Äpfel 2147483647
Ungetiere Birnen 2147483647
I did find some bug reports which have been flagged as bogus... I suppose your best bet is to file a bug report, though...
Answer by leymannx
I found the following helper function, which converts all letters of a string to ASCII letters, very helpful here.
function _all_letters_to_ASCII($string) {
    return strtr(utf8_decode($string),
        utf8_decode('ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
        'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
}
After that, a simple array_multisort() gives you what you want.
$array = array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');
$reference_array = $array;
foreach ($reference_array as $key => &$value) {
    $value = _all_letters_to_ASCII($value);
}
var_dump($reference_array);
array_multisort($reference_array, $array);
var_dump($array);
Of course you can make the helper function fit more advanced needs. But for now, it looks pretty good.
array(6) {
[0]=> string(6) "Birnen"
[1]=> string(5) "Apfel"
[2]=> string(8) "Ungetume"
[3]=> string(5) "Apfel"
[4]=> string(9) "Ungetiere"
[5]=> string(10) "Osterreich"
}
array(6) {
[0]=> string(5) "Apfel"
[1]=> string(6) "Äpfel"
[2]=> string(6) "Birnen"
[3]=> string(11) "Österreich"
[4]=> string(9) "Ungetiere"
[5]=> string(9) "Ungetüme"
}
Answer by Friedrich Siever
I am confronted with the same problem with German "Umlaute". After some research, this worked for me:
$laender =array("Österreich", "Schweiz", "England", "France", "Ägypten");
$laender = array_map("utf8_decode", $laender);
setlocale(LC_ALL,"de_DE@euro", "de_DE", "deu_deu");
sort($laender, SORT_LOCALE_STRING);
$laender = array_map("utf8_encode", $laender);
print_r($laender);
The result:
Array
(
[0] => Ägypten
[1] => England
[2] => France
[3] => Österreich
[4] => Schweiz
)
Answer by troelskn
Your collation needs to match the character set. Since your data is UTF-8 encoded, you should use a UTF-8 collation. It could be named differently on different platforms, but a good guess would be de_DE.utf8.
On UNIX systems, you can get a list of currently installed locales with the command:
locale -a
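From within PHP, one way to cope with the differing names is to hand setlocale() several candidates and check what it returns. This is only a sketch: the candidate names below are guesses for typical Linux and Windows systems, not a definitive list.

```php
<?php
// Hypothetical candidate names; which ones exist depends on the operating system.
$candidates = ['de_DE.utf8', 'de_DE.UTF-8', 'German_Germany.1252'];

$previousLocale = setlocale(LC_COLLATE, '0');

// setlocale() accepts an array and returns the first locale it could set,
// or false if none of the candidates is available.
$chosenLocale = setlocale(LC_COLLATE, $candidates);

if ($chosenLocale === false) {
    echo "None of the candidate locales is available\n";
} else {
    echo "Using locale: $chosenLocale\n";
    // Restore the previous collation locale once done.
    setlocale(LC_COLLATE, $previousLocale);
}
```

Checking the return value matters because a failed setlocale() call silently leaves the previous locale in place.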

