PHP: Detect file encoding in PHP

Notice: this page is a side-by-side Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/505562/

Time: 2020-08-24 22:55:56  Source: igfitidea

Detect file encoding in PHP

Tags: php, utf-8, character-encoding

Asked by nickf

I have a script which combines a number of files into one, and it breaks when one of the files is UTF-8 encoded. I figure that I should be using the utf8_decode() function when reading the files, but I don't know how to tell which ones need decoding.

My code is basically:

$output = '';
foreach ($files as $filename) {
    $output .= file_get_contents($filename) . "\n";
}
file_put_contents('combined.txt', $output);

Currently, at the start of a UTF-8 file, it adds these characters in the output: ï»¿

Answered by Ben Blank

Try using the mb_detect_encoding function. This function will examine your string and attempt to "guess" what its encoding is. You can then convert it as desired. As cbrulak suggested, however, you're probably better off converting to UTF-8 rather than from it, to preserve the data you're transmitting.
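
A minimal sketch of that approach applied to the question's loop, assuming the inputs are limited to a short, known list of encodings. The function name combineAsUtf8 and the candidate list are illustrative assumptions, not part of the original answer. mb_detect_encoding is still only a guess, so keep the candidate list as narrow as you can.

```php
<?php
// Hypothetical helper: concatenate files, converting each one to UTF-8 first.
// $candidates is an assumed list of encodings the inputs may use, tried in order.
function combineAsUtf8(array $files, array $candidates = ['UTF-8', 'Windows-1252']): string
{
    $output = '';
    foreach ($files as $filename) {
        $content = file_get_contents($filename);
        if ($content === false) {
            continue; // unreadable file: skip it
        }
        // Strict mode (third argument true) rejects byte sequences
        // that are invalid in a candidate encoding.
        $enc = mb_detect_encoding($content, $candidates, true);
        if ($enc !== false && $enc !== 'UTF-8') {
            $content = mb_convert_encoding($content, 'UTF-8', $enc);
        }
        $output .= $content . "\n";
    }
    return $output;
}
```

With this in place, the last line of the question's script would become file_put_contents('combined.txt', combineAsUtf8($files)).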

Answered by powtac

To make sure that the output is UTF-8, no matter what kind of input it was, I use this check:

// Convert when $output is not valid UTF-8, or when it does not survive
// a round trip through UTF-32 unchanged.
if (!mb_check_encoding($output, 'UTF-8')
    OR !($output === mb_convert_encoding(mb_convert_encoding($output, 'UTF-32', 'UTF-8'), 'UTF-8', 'UTF-32'))) {

    $output = mb_convert_encoding($output, 'UTF-8', 'pass');
}

// $output is now safely converted to UTF-8!
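
A simpler variant of the same idea, as a sketch: validate with mb_check_encoding and convert only when the bytes are not already valid UTF-8. The helper name ensureUtf8 and the Windows-1252 fallback are assumptions for illustration; substitute whatever legacy encoding your inputs actually use.

```php
<?php
// Hypothetical helper: return $text unchanged if it is valid UTF-8,
// otherwise treat it as $fallback (an assumed legacy encoding) and convert.
function ensureUtf8(string $text, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}
```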

Answered by yanek1988m

The mb_detect_encoding function should be your last choice, because it can return the wrong encoding. The Linux command file -i /path/myfile.txt works great. In PHP you could use:

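function _detectFileEncoding($filepath) {
    // Validate $filepath before passing it to the shell.
    if (!is_file($filepath)) {
        return null;
    }
    $output = array();
    // escapeshellarg() keeps spaces and shell metacharacters in the path from being interpreted.
    exec('file -i ' . escapeshellarg($filepath), $output);
    if (isset($output[0])) {
        $ex = explode('charset=', $output[0]);
        return isset($ex[1]) ? trim($ex[1]) : null;
    }
    return null;
}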
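
If shelling out is not an option, PHP's fileinfo extension bundles libmagic, the library behind the file command. A minimal sketch, assuming the extension is enabled. The function name detectFileEncodingFinfo is hypothetical, and the result is still a heuristic guess.

```php
<?php
// Hypothetical helper: ask fileinfo (libmagic) for the charset of a file.
// Returns a lowercased charset name such as "utf-8" or "us-ascii", or null on failure.
function detectFileEncodingFinfo(string $filepath): ?string
{
    if (!is_file($filepath) || !is_readable($filepath)) {
        return null;
    }
    $finfo = new finfo(FILEINFO_MIME_ENCODING);
    $charset = $finfo->file($filepath);
    return $charset === false ? null : strtolower($charset);
}
```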

Answered by PapaKai

This is my solution which worked like a charm:

//check string strict for encoding out of list of supported encodings
$enc = mb_detect_encoding($str, mb_list_encodings(), true);

if ($enc===false){
    //could not detect encoding
}
else if ($enc!=="UTF-8"){
    $str = mb_convert_encoding($str, "UTF-8", $enc);
}
else {
    //UTF-8 detected
}

Answered by jgpATs2w

For Linux servers, I use this command:

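$file = 'your/file.ext';
$arg = escapeshellarg($file);
// \$2 and \$from are escaped so that PHP leaves them for awk and the shell.
exec("from=`file -bi $arg | awk -F'=' '{print \$2}'` && iconv -f \$from -t utf-8 $arg -o $arg");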

Answered by akakargul

Scans the whole file and finds a matching encoding from mb_list_encodings, with good performance.

function detectFileEncoding($filePath){

    $fopen = fopen($filePath, 'r');

    $row = fgets($fopen);
    $encodings = mb_list_encodings();
    $encoding = mb_detect_encoding($row, "UTF-8, ASCII, Windows-1252, Windows-1254"); // these are my favorite encodings

    if ($encoding !== false) {
        // Drop the already-detected encoding from the fallback candidates.
        $key = array_search($encoding, $encodings);
        if ($key !== false)
            unset($encodings[$key]);
        $encodings = array_values($encodings);
    }

    $encKey = 0;
    while ($row = fgets($fopen)) {
        if ($encoding == false) {
            if (!isset($encodings[$encKey])) {
                break; // no candidates left
            }
            $encoding = $encodings[$encKey++];
        }

        if (!mb_check_encoding($row, $encoding)) {
            // This candidate failed on the current line: start over with the next one.
            $encoding = false;
            rewind($fopen);
        }
    }

    fclose($fopen);
    return $encoding;
}

Answered by Amereservant

I recently encountered this issue and the mb_convert_encoding() function output was UTF-8.

I took a look at the response headers and found nothing mentioning the encoding type. I then found the question "Set HTTP header to UTF-8 using PHP", which proposes the following:

<?php
header('Content-Type: text/html; charset=utf-8');

After adding that to the top of the PHP file, all of the funky characters went away and the page rendered as it should. I am not sure if that's the issue the original poster was asking about, but I found this while trying to solve the issue myself and figured I'd share.

Answered by cbrulak

How are you going to handle the non-ASCII characters from the UTF-8 or 16 or 32 file?

I ask because I think you may have a design issue here.

I would convert your output file into UTF-8 (or 16 or 32) instead of the other way around.

Then you won't have this problem.

Have you also considered the security issues that may arise from converting an escaped UTF-8 code? See this comment:

Detecting multi-byte encoding

Figure out what encoding your source file is in, then convert it to UTF-8, and you should be good to go.
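
One cheap way to figure out the encoding is to look for a byte order mark first, since UTF-8, UTF-16 and UTF-32 files often start with one. A sketch, with a hypothetical decodeByBom helper and an assumed UTF-8 default when no mark is found. It also strips the mark, so it doesn't end up in the middle of the combined output.

```php
<?php
// Hypothetical helper: detect a Unicode BOM, strip it, and return UTF-8 text.
// $default is an assumed encoding for input that carries no BOM.
function decodeByBom(string $bytes, string $default = 'UTF-8'): string
{
    // UTF-32 marks are listed before UTF-16 because the UTF-32LE mark
    // starts with the same two bytes as the UTF-16LE mark.
    $boms = [
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFE\xFF"         => 'UTF-16BE',
        "\xFF\xFE"         => 'UTF-16LE',
    ];
    foreach ($boms as $bom => $encoding) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            $body = substr($bytes, strlen($bom));
            return $encoding === 'UTF-8'
                ? $body
                : mb_convert_encoding($body, 'UTF-8', $encoding);
        }
    }
    // No BOM: fall back to the assumed default encoding.
    return $default === 'UTF-8'
        ? $bytes
        : mb_convert_encoding($bytes, 'UTF-8', $default);
}
```

Input without a mark still needs one of the detection approaches from the other answers; the default here is only a placeholder.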
