Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/3800292/

Working with files and utf8 in PHP

Tags: php, file-io, unicode, utf-8

Asked by Gerardo Marset

Let's say I have a file called foo.txt encoded in UTF-8:

aoeu
qjkx
ñpyf

And I want to get an array (one line per index) containing every line of that file that consists only of the letters a, o, e, u, ñ, p, y and f.

I wrote the following code (also saved as UTF-8):

$allowed_letters=array("a","o","e","u","ñ","p","y","f");

$lines=array();
$f=fopen("foo.txt","r");
while(!feof($f)){
    $line=fgets($f);
    foreach(preg_split("//",$line,-1,PREG_SPLIT_NO_EMPTY) as $letter){
        if(!in_array($letter,$allowed_letters)){
            $line="";
        }
    }
    if($line!=""){
        $lines[]=$line;
    }
}
fclose($f);

However, after that, the $lines array only has the aoeu line in it.
This seems to be because, somehow, the "ñ" in $allowed_letters is not the same as the "ñ" in foo.txt.
Also, if I print an "ñ" read from the file, a question mark appears, but if I print it directly with print "ñ";, it works.
How can I make it work?

Answered by Yanick Rochon

If you are running Windows, the OS does not save files in UTF-8 by default, but in a legacy code page (cp1251 or something similar). You need to either save the file as UTF-8 explicitly or run each line through utf8_encode() before performing your check, i.e.:

$line=utf8_encode(fgets($f));
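
Note that utf8_encode() assumes its input is ISO-8859-1 and is deprecated as of PHP 8.2. The same idea can be sketched with mb_convert_encoding(), assuming the mbstring extension is loaded and that the file really is Windows-1252 (the source encoding here is a guess, since the answer itself is not sure which code page applies). The helper name to_utf8 is hypothetical:

```php
<?php
// Hypothetical helper: convert one line from an ASSUMED legacy encoding to UTF-8.
// Lines that are already valid UTF-8 are returned unchanged. A legacy-encoded
// line can occasionally look like valid UTF-8 by coincidence, so this check is
// a heuristic, not a guarantee.
function to_utf8(string $line, string $from = 'Windows-1252'): string
{
    if (mb_check_encoding($line, 'UTF-8')) {
        return $line;
    }
    return mb_convert_encoding($line, 'UTF-8', $from);
}
```

Inside the loop this would be used as $line=to_utf8(fgets($f)); once fgets() is known to have returned a string.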

If you are sure that the file is UTF-8 encoded, is your PHP file also UTF-8 encoded?

If everything is UTF-8, then this is what you need:

foreach(preg_split("//u",$line,-1,PREG_SPLIT_NO_EMPTY) as $letter){
   // ...
}

(append the u modifier to the pattern so that it works on Unicode characters)
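
To see what the modifier changes, here is a small sketch that splits the same string with and without it. The sample string is an assumption chosen to mirror the third line of foo.txt, and the ñ is written as an escape so the result does not depend on how this snippet itself is saved:

```php
<?php
// "\u{00F1}" is ñ (UTF-8 bytes C3 B1); the escape syntax needs PHP 7.0+.
$line = "\u{00F1}pyf";

// Without the u modifier the empty pattern matches between bytes.
$bytes = preg_split('//', $line, -1, PREG_SPLIT_NO_EMPTY);

// With the u modifier it matches between code points.
$chars = preg_split('//u', $line, -1, PREG_SPLIT_NO_EMPTY);

printf("%d pieces without u, %d pieces with u\n", count($bytes), count($chars));
// prints "5 pieces without u, 4 pieces with u"
```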

However, let me suggest a faster way to perform your check:

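$allowed_letters=array("a","o","e","u","ñ","p","y","f");

$lines=array();
$f=fopen("foo.txt","r");
// fgets() returns false at end of file, so test its result directly.
while(($line=fgets($f))!==false){
    $line=rtrim($line);
    // Split into UTF-8 characters; str_split() would cut ñ into two bytes.
    $letters=preg_split("//u",$line,-1,PREG_SPLIT_NO_EMPTY);
    // Keep the line only if it is non-empty and every character is allowed.
    if ($letters && count(array_intersect($letters,$allowed_letters))==count($letters)) {
        $lines[]=$line;
    }
}
fclose($f);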

(to allow space characters as well, add " " to $allowed_letters and change rtrim($line) to rtrim($line,"\r\n") so that only the line ending is stripped)

Answered by bobince

In UTF-8, ñ is encoded as two bytes. Normally in PHP all string operations are byte-based, so when you preg_split the input, it splits the first byte and the second byte into separate array items. Neither byte on its own will match the two bytes together as found in $allowed_letters, so it will never match ñ.
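
A short sketch of that byte-level view, using the same allowed letters as the question. The variable names are illustrative only, and ñ is again written as an escape:

```php
<?php
$enye = "\u{00F1}";                 // ñ, written as an escape (PHP 7.0+)
$allowed_letters = array("a", "o", "e", "u", $enye, "p", "y", "f");

echo bin2hex($enye), "\n";          // c3b1: two bytes
echo strlen($enye), "\n";           // 2: strlen() counts bytes, not characters

// A byte-wise split yields "\xC3" and "\xB1"; neither equals the two-byte string.
$first_byte_allowed  = in_array($enye[0], $allowed_letters, true);
$second_byte_allowed = in_array($enye[1], $allowed_letters, true);
var_dump($first_byte_allowed, $second_byte_allowed);   // both false
```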

As Yanick posted, the solution is to add the u modifier. This makes PHP's regex engine treat both the pattern and the input line as Unicode characters instead of bytes. It's lucky that PHP has special Unicode support here; elsewhere PHP's Unicode support is extremely spotty.

A simpler and quicker way than splitting would be to compare each line against a character-group regex. Again, this must be a regex with the u modifier.

if(preg_match('/^[aoeuñpyf]+$/u', $line))
    $lines[]= $line;
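
Put together, the whole task might look like the sketch below. The function name filter_lines and its parameters are hypothetical, and reading the file with file() instead of fopen()/fgets() is a substitution made here for brevity, not part of the answer:

```php
<?php
// Hypothetical helper: return the lines of $path that consist solely of
// characters from $allowed. Both the file and $allowed are assumed to be UTF-8.
function filter_lines(string $path, string $allowed): array
{
    if ($allowed === '') {
        return [];                  // an empty character group is not a valid pattern
    }
    // preg_quote() escapes characters that are special in a pattern;
    // its second argument also escapes the "/" delimiter.
    $pattern = '/^[' . preg_quote($allowed, '/') . ']+$/u';

    // FILE_IGNORE_NEW_LINES strips the line ending from each entry.
    $lines = file($path, FILE_IGNORE_NEW_LINES);
    if ($lines === false) {
        return [];
    }
    // preg_grep() keeps the entries that match; array_values() reindexes from 0.
    return array_values(preg_grep($pattern, $lines));
}
```

For the file in the question this would be called as filter_lines("foo.txt", "aoeuñpyf").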

Answered by M2tM

It sounds like you've already got your answer, but it is important to recognize that Unicode characters can be stored in multiple ways. Unicode normalization is a process which can help ensure comparisons work as expected.
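
As a hedged illustration, ñ can be stored either as one precomposed code point or as "n" followed by a combining tilde, and the two are different byte strings. The sketch below assumes the intl extension, which provides the Normalizer class; the helper name to_nfc is hypothetical and falls back to returning its input when intl is missing:

```php
<?php
// Two spellings of ñ: one precomposed code point, and "n" plus a combining tilde.
$composed   = "\u{00F1}";
$decomposed = "n\u{0303}";

// Hypothetical helper: normalize to NFC when the intl extension is available,
// otherwise return the input unchanged.
function to_nfc(string $s): string
{
    if (!class_exists('Normalizer')) {
        return $s;
    }
    $normalized = Normalizer::normalize($s, Normalizer::FORM_C);
    return $normalized === false ? $s : $normalized;
}

var_dump($composed === $decomposed);   // bool(false): the raw bytes differ
```

With intl loaded, passing both the file's lines and the allowed letters through to_nfc() before comparing them should make the two spellings compare equal.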
