php 如何将像“\u00ed”这样的 Unicode 转义序列解码为正确的 UTF-8 编码字符?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/2934563/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-25 08:11:13  来源:igfitidea点击:

How to decode Unicode escape sequences like "\u00ed" to proper UTF-8 encoded characters?

phpunicodeutf-8escapingdecoding

提问by Docstero

Is there a function in PHP that can decode Unicode escape sequences like "\u00ed" to "í" and all other similar occurrences?

PHP 中是否有一个函数可以解码 Unicode 转义序列,如“ \u00ed”到“ í”以及所有其他类似事件?

I found similar question herebut is doesn't seem to work.

我在这里发现了类似的问题但似乎不起作用。

回答by Gumbo

Try this:

尝试这个:

$str = preg_replace_callback('/\\u([0-9a-fA-F]{4})/', function ($match) {
    return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE');
}, $str);

In case it's UTF-16 based C/C++/Java/Json-style:

如果是基于 UTF-16 的 C/C++/Java/Json 样式:

$str = preg_replace_callback('/\\u([0-9a-fA-F]{4})/', function ($match) {
    return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UTF-16BE');
}, $str);

回答by 2BJ

print_r(json_decode('{"t":"\u00ed"}')); // -> stdClass Object ( [t] => í )

回答by Rabin Lama Dong

PHP 7+

PHP 7+

As of PHP 7, you can use the Unicode codepoint escape syntaxto do this.

从 PHP 7 开始,您可以使用Unicode 代码点转义语法来执行此操作。

echo "\u{00ed}";outputs í.

echo "\u{00ed}";输出í

回答by masakielastic

$str = '\u0063\u0061\u0074'.'\ud83d\ude38';
$str2 = '\u0063\u0061\u0074'.'\ud83d';

// U+1F638
var_dump(
    "cat\xF0\x9F\x98\xB8" === escape_sequence_decode($str),
    "cat\xEF\xBF\xBD" === escape_sequence_decode($str2)
);

function escape_sequence_decode($str) {

    // [U+D800 - U+DBFF][U+DC00 - U+DFFF]|[U+0000 - U+FFFF]
    $regex = '/\\u([dD][89abAB][\da-fA-F]{2})\\u([dD][c-fC-F][\da-fA-F]{2})
              |\\u([\da-fA-F]{4})/sx';

    return preg_replace_callback($regex, function($matches) {

        if (isset($matches[3])) {
            $cp = hexdec($matches[3]);
        } else {
            $lead = hexdec($matches[1]);
            $trail = hexdec($matches[2]);

            // http://unicode.org/faq/utf_bom.html#utf16-4
            $cp = ($lead << 10) + $trail + 0x10000 - (0xD800 << 10) - 0xDC00;
        }

        // https://tools.ietf.org/html/rfc3629#section-3
        // Characters between U+D800 and U+DFFF are not allowed in UTF-8
        if ($cp > 0xD7FF && 0xE000 > $cp) {
            $cp = 0xFFFD;
        }

        // https://github.com/php/php-src/blob/php-5.6.4/ext/standard/html.c#L471
        // php_utf32_utf8(unsigned char *buf, unsigned k)

        if ($cp < 0x80) {
            return chr($cp);
        } else if ($cp < 0xA0) {
            return chr(0xC0 | $cp >> 6).chr(0x80 | $cp & 0x3F);
        }

        return html_entity_decode('&#'.$cp.';');
    }, $str);
}

回答by Nemo Noman

This is a sledgehammer approach to replacing raw UNICODE with HTML. I haven't seen any other place to put this solution, but I assume others have had this problem.

这是用 HTML 替换原始 UNICODE 的大锤方法。我还没有看到任何其他地方可以放置此解决方案,但我认为其他人也遇到过这个问题。

Apply this str_replace function to the RAW JSON, before doing anything else.

在执行任何其他操作之前,将此 str_replace 函数应用于RAW JSON

function unicode2html($str){
    $i=65535;
    while($i>0){
        $hex=dechex($i);
        $str=str_replace("\u$hex","&#$i;",$str);
        $i--;
     }
     return $str;
}

This won't take as long as you think, and this will replace ANY unicode with HTML.

这不会像您想象的那么长,这将用 HTML 替换任何 unicode。

Of course this can be reduced if you know the unicode types that are being returned in the JSON.

当然,如果您知道 JSON 中返回的 unicode 类型,则可以减少这种情况。

For example my code was getting lots of arrows and dingbat unicode. These are between 8448 an 11263. So my production code looks like:

例如,我的代码有很多箭头和 dingbat unicode。它们介于 8448 和 11263 之间。所以我的生产代码如下所示:

$i=11263;
while($i>08448){
    ...etc...

You can look up the blocks of Unicode by type here: http://unicode-table.com/en/If you know you're translating Arabic or Telegu or whatever, you can just replace those codes, not all 65,000.

您可以在此处按类型查找 Unicode 块:http: //unicode-table.com/en/ 如果您知道要翻译阿拉伯语或 Telegu 或其他任何内容,则可以替换这些代码,而不是全部 65,000。

You could apply this same sledgehammer to simple encoding:

您可以将相同的大锤应用于简单编码:

 $str=str_replace("\u$hex",chr($i),$str);

回答by orel

fix json values, it's add \ before u{xxx} to all +" "

修复 json 值,它在 u{xxx} 之前添加 \ 到所有 +" "

  $item = preg_replace_callback('/"(.+?)":"(u.+?)",/', function ($matches) {
        $matches[2] = preg_replace('/(u)/', '\u', $matches[2]);
            $matches[2] = preg_replace('/(")/', '&quot;', $matches[2]); 
            $matches[2] = json_decode('"' . $matches[2] . '"'); 
            return '"' . $matches[1] . '":"' . $matches[2] . '",';
        }, $item);

回答by jianyong

There is also a solution:
http://www.welefen.com/php-unicode-to-utf8.html

还有一个解决方案:http:
//www.welefen.com/php-unicode-to-utf8.html

function entity2utf8onechar($unicode_c){
    $unicode_c_val = intval($unicode_c);
    $f=0x80; // 10000000
    $str = "";
    // U-00000000 - U-0000007F:   0xxxxxxx
    if($unicode_c_val <= 0x7F){         $str = chr($unicode_c_val);     }     //U-00000080 - U-000007FF:  110xxxxx 10xxxxxx
    else if($unicode_c_val >= 0x80 && $unicode_c_val <= 0x7FF){         $h=0xC0; // 11000000
        $c1 = $unicode_c_val >> 6 | $h;
        $c2 = ($unicode_c_val & 0x3F) | $f;
        $str = chr($c1).chr($c2);
    } else if($unicode_c_val >= 0x800 && $unicode_c_val <= 0xFFFF){         $h=0xE0; // 11100000
        $c1 = $unicode_c_val >> 12 | $h;
        $c2 = (($unicode_c_val & 0xFC0) >> 6) | $f;
        $c3 = ($unicode_c_val & 0x3F) | $f;
        $str=chr($c1).chr($c2).chr($c3);
    }
    //U-00010000 - U-001FFFFF:  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    else if($unicode_c_val >= 0x10000 && $unicode_c_val <= 0x1FFFFF){         $h=0xF0; // 11110000
        $c1 = $unicode_c_val >> 18 | $h;
        $c2 = (($unicode_c_val & 0x3F000) >>12) | $f;
        $c3 = (($unicode_c_val & 0xFC0) >>6) | $f;
        $c4 = ($unicode_c_val & 0x3F) | $f;
        $str = chr($c1).chr($c2).chr($c3).chr($c4);
    }
    //U-00200000 - U-03FFFFFF:  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    else if($unicode_c_val >= 0x200000 && $unicode_c_val <= 0x3FFFFFF){         $h=0xF8; // 11111000
        $c1 = $unicode_c_val >> 24 | $h;
        $c2 = (($unicode_c_val & 0xFC0000)>>18) | $f;
        $c3 = (($unicode_c_val & 0x3F000) >>12) | $f;
        $c4 = (($unicode_c_val & 0xFC0) >>6) | $f;
        $c5 = ($unicode_c_val & 0x3F) | $f;
        $str = chr($c1).chr($c2).chr($c3).chr($c4).chr($c5);
    }
    //U-04000000 - U-7FFFFFFF:  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    else if($unicode_c_val >= 0x4000000 && $unicode_c_val <= 0x7FFFFFFF){         $h=0xFC; // 11111100
        $c1 = $unicode_c_val >> 30 | $h;
        $c2 = (($unicode_c_val & 0x3F000000)>>24) | $f;
        $c3 = (($unicode_c_val & 0xFC0000)>>18) | $f;
        $c4 = (($unicode_c_val & 0x3F000) >>12) | $f;
        $c5 = (($unicode_c_val & 0xFC0) >>6) | $f;
        $c6 = ($unicode_c_val & 0x3F) | $f;
        $str = chr($c1).chr($c2).chr($c3).chr($c4).chr($c5).chr($c6);
    }
    return $str;
}
function entities2utf8($unicode_c){
    $unicode_c = preg_replace("/\&\#([\da-f]{5})\;/es", "entity2utf8onechar('\1')", $unicode_c);
    return $unicode_c;
}