PHP: how to decode Unicode escape sequences like "\u00ed" to properly UTF-8 encoded characters?

Question

Asked by Docstero

Is there a function in PHP that can decode Unicode escape sequences like "\u00ed" to "í" and all other similar occurrences?

I found a similar question here, but it doesn't seem to work.

Answer 1

Answered by Gumbo

Try this:

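// Four backslashes in the PHP string give \\u in the regex, i.e. a literal backslash followed by "u".
$str = preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', function ($match) {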
    return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE');
}, $str);

If the escapes are UTF-16 based (C/C++/Java/JSON style):

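// Four backslashes in the PHP string give \\u in the regex, i.e. a literal backslash followed by "u".
$str = preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', function ($match) {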
    return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UTF-16BE');
}, $str);
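
Both variants above convert one \uXXXX escape at a time, so a surrogate pair such as \ud83d\ude38 reaches mb_convert_encoding in two separate halves. A minimal sketch of one way around that, assuming the mbstring extension is loaded, is to match a whole run of consecutive escapes and convert the run in a single call. The function name decode_u_escapes is hypothetical and not part of the answer.

```php
<?php
// Hypothetical helper (not from the original answer): decode runs of \uXXXX escapes.
function decode_u_escapes(string $str): string
{
    // Match one or more consecutive \uXXXX escapes so surrogate pairs stay together.
    return preg_replace_callback('/(?:\\\\u[0-9a-fA-F]{4})+/', function (array $m): string {
        // Strip the "\u" prefixes, leaving a plain hex string of UTF-16BE code units.
        $hex = str_replace('\u', '', $m[0]);
        // pack('H*') turns the hex digits into raw bytes; then convert UTF-16BE to UTF-8.
        return mb_convert_encoding(pack('H*', $hex), 'UTF-8', 'UTF-16BE');
    }, $str);
}
```

Text outside the escapes is passed through untouched, so the function can be applied to a whole string.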

Answer 2

Answered by 2BJ

print_r(json_decode('{"t":"\u00ed"}')); // -> stdClass Object ( [t] => í )
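
The same idea can be applied to a bare string by wrapping it in double quotes so that it becomes a JSON string literal. This is only a sketch: it assumes the input already follows JSON string-escaping rules (no unescaped double quotes, raw control characters, or stray backslashes), and the name json_unescape is hypothetical.

```php
<?php
// Hypothetical helper: decode JSON-style escapes in a bare string.
// Assumes $s is valid as the inside of a JSON string literal.
function json_unescape(string $s): ?string
{
    $decoded = json_decode('"' . $s . '"');
    // json_decode returns null when the wrapped text is not valid JSON.
    return is_string($decoded) ? $decoded : null;
}
```

Because json_decode understands surrogate pairs, characters outside the Basic Multilingual Plane are handled as well; malformed input yields null rather than a partially decoded string.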

Answer 3

Answered by Rabin Lama Dong

PHP 7+

As of PHP 7, you can use the Unicode codepoint escape syntax to do this.

echo "\u{00ed}"; outputs í.
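
One caveat worth illustrating: this syntax is resolved by the PHP parser inside double-quoted (or heredoc) literals, so it does not decode escape sequences that arrive at runtime, for example from a file or an HTTP response. A small sketch of the difference (the variable names are illustrative only):

```php
<?php
// Double-quoted literal: the parser replaces \u{00ed} with the UTF-8 bytes for í.
$literal = "\u{00ed}";
// Single-quoted literal stands in for runtime data: the escape is left as plain text.
$runtime = '\u{00ed}';

var_dump(strlen($literal)); // int(2)
var_dump(strlen($runtime)); // int(8)
```

For runtime data, one of the decoding functions from the other answers is still needed.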

Answer 4

Answered by masakielastic

$str = '\u0063\u0061\u0074'.'\ud83d\ude38';
$str2 = '\u0063\u0061\u0074'.'\ud83d';

// U+1F638
var_dump(
    "cat\xF0\x9F\x98\xB8" === escape_sequence_decode($str),
    "cat\xEF\xBF\xBD" === escape_sequence_decode($str2)
);

function escape_sequence_decode($str) {

    // [U+D800 - U+DBFF][U+DC00 - U+DFFF]|[U+0000 - U+FFFF]
    // Four backslashes in the PHP string give \\u in the regex (a literal backslash, then "u").
    $regex = '/\\\\u([dD][89abAB][\da-fA-F]{2})\\\\u([dD][c-fC-F][\da-fA-F]{2})
              |\\\\u([\da-fA-F]{4})/sx';

    return preg_replace_callback($regex, function($matches) {

        if (isset($matches[3])) {
            $cp = hexdec($matches[3]);
        } else {
            $lead = hexdec($matches[1]);
            $trail = hexdec($matches[2]);

            // http://unicode.org/faq/utf_bom.html#utf16-4
            $cp = ($lead << 10) + $trail + 0x10000 - (0xD800 << 10) - 0xDC00;
        }

        // https://tools.ietf.org/html/rfc3629#section-3
        // Characters between U+D800 and U+DFFF are not allowed in UTF-8
        if ($cp > 0xD7FF && 0xE000 > $cp) {
            $cp = 0xFFFD;
        }

        // https://github.com/php/php-src/blob/php-5.6.4/ext/standard/html.c#L471
        // php_utf32_utf8(unsigned char *buf, unsigned k)

        if ($cp < 0x80) {
            return chr($cp);
        } else if ($cp < 0xA0) {
            return chr(0xC0 | $cp >> 6).chr(0x80 | $cp & 0x3F);
        }

        return html_entity_decode('&#'.$cp.';');
    }, $str);
}
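
The surrogate arithmetic used above can be checked in isolation. The sketch below assumes PHP 7.2+ with mbstring, and uses mb_chr in place of the html_entity_decode call to produce the UTF-8 bytes; the helper name surrogates_to_codepoint is hypothetical.

```php
<?php
// Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
// Algebraically the same as ($lead << 10) + $trail + 0x10000 - (0xD800 << 10) - 0xDC00.
function surrogates_to_codepoint(int $lead, int $trail): int
{
    return 0x10000 + (($lead - 0xD800) << 10) + ($trail - 0xDC00);
}

$cp = surrogates_to_codepoint(0xD83D, 0xDE38);
printf("U+%X\n", $cp);                    // prints "U+1F638"
echo bin2hex(mb_chr($cp, 'UTF-8')), "\n"; // prints "f09f98b8"
```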

Answer 5

Answered by Nemo Noman

This is a sledgehammer approach to replacing raw UNICODE with HTML. I haven't seen any other place to put this solution, but I assume others have had this problem.

Apply this str_replace function to the RAW JSON, before doing anything else.

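function unicode2html($str){
    // Walk every code point from U+FFFF down to U+0001.
    $i=65535;
    while($i>0){
        // JSON escapes always have four hex digits, so zero-pad
        // (dechex alone would give "ed" instead of "00ed").
        $hex=sprintf('%04x',$i);
        // str_ireplace also catches upper-case hex digits such as \u00ED.
        $str=str_ireplace("\u$hex","&#$i;",$str);
        $i--;
    }
    return $str;
}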

This won't take as long as you think, and it will replace any \uXXXX escape with the corresponding HTML numeric entity.

Of course the range can be reduced if you know which Unicode blocks are being returned in the JSON.

For example, my code was getting lots of arrow and dingbat characters. These are between 8448 and 11263, so my production code looks like:

$i=11263;
// No leading zero: 08448 would be read as an (invalid) octal literal.
while($i>=8448){
    ...etc...

You can look up the blocks of Unicode by type here: http://unicode-table.com/en/ If you know you're translating Arabic or Telugu or whatever, you can just replace those codes, not all 65,000.

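You could apply this same sledgehammer to decode straight to characters. Note that chr() only produces a single byte (values 0 to 255), so for UTF-8 output use mb_chr() (PHP 7.2+ with mbstring) instead:

 $str=str_replace("\u$hex",mb_chr($i,'UTF-8'),$str);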
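
If the loop turns out to be too slow, the same replacement can be done in one pass with a regular expression. This is a sketch of an alternative rather than the approach described in this answer, and the name unicode2html_regex is hypothetical.

```php
<?php
// Hypothetical single-pass variant: turn each \uXXXX escape into an HTML numeric entity.
function unicode2html_regex(string $str): string
{
    // \\\\u in the PHP string is \\u in the regex: a literal backslash followed by "u".
    return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', function (array $m): string {
        // hexdec converts the four hex digits to the decimal code point.
        return '&#' . hexdec($m[1]) . ';';
    }, $str);
}
```

For instance, the escape \u2192 (a rightwards arrow) becomes the entity &#8594;. Like the loop, this treats each escape independently, so surrogate pairs are not combined.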

Answer 6

Answered by orel

This fixes JSON values whose Unicode escapes have lost their backslash: it adds a \ before each u in every quoted value that starts with u, then decodes the value.

$item = preg_replace_callback('/"(.+?)":"(u.+?)",/', function ($matches) {
    $matches[2] = preg_replace('/(u)/', '\u', $matches[2]);
    $matches[2] = preg_replace('/(")/', '&quot;', $matches[2]);
    $matches[2] = json_decode('"' . $matches[2] . '"');
    return '"' . $matches[1] . '":"' . $matches[2] . '",';
}, $item);

Answer 7

Answered by jianyong

There is also a solution:
http://www.welefen.com/php-unicode-to-utf8.html

function entity2utf8onechar($unicode_c){
    $unicode_c_val = intval($unicode_c);
    $f=0x80; // 10000000
    $str = "";
    // U-00000000 - U-0000007F:  0xxxxxxx
    if($unicode_c_val <= 0x7F){
        $str = chr($unicode_c_val);
    }
    // U-00000080 - U-000007FF:  110xxxxx 10xxxxxx
    else if($unicode_c_val >= 0x80 && $unicode_c_val <= 0x7FF){
        $h=0xC0; // 11000000
        $c1 = $unicode_c_val >> 6 | $h;
        $c2 = ($unicode_c_val & 0x3F) | $f;
        $str = chr($c1).chr($c2);
    }
    // U-00000800 - U-0000FFFF:  1110xxxx 10xxxxxx 10xxxxxx
    else if($unicode_c_val >= 0x800 && $unicode_c_val <= 0xFFFF){
        $h=0xE0; // 11100000
        $c1 = $unicode_c_val >> 12 | $h;
        $c2 = (($unicode_c_val & 0xFC0) >> 6) | $f;
        $c3 = ($unicode_c_val & 0x3F) | $f;
        $str = chr($c1).chr($c2).chr($c3);
    }
    // U-00010000 - U-001FFFFF:  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    else if($unicode_c_val >= 0x10000 && $unicode_c_val <= 0x1FFFFF){
        $h=0xF0; // 11110000
        $c1 = $unicode_c_val >> 18 | $h;
        $c2 = (($unicode_c_val & 0x3F000) >> 12) | $f;
        $c3 = (($unicode_c_val & 0xFC0) >> 6) | $f;
        $c4 = ($unicode_c_val & 0x3F) | $f;
        $str = chr($c1).chr($c2).chr($c3).chr($c4);
    }
    // U-00200000 - U-03FFFFFF:  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    else if($unicode_c_val >= 0x200000 && $unicode_c_val <= 0x3FFFFFF){
        $h=0xF8; // 11111000
        $c1 = $unicode_c_val >> 24 | $h;
        $c2 = (($unicode_c_val & 0xFC0000) >> 18) | $f;
        $c3 = (($unicode_c_val & 0x3F000) >> 12) | $f;
        $c4 = (($unicode_c_val & 0xFC0) >> 6) | $f;
        $c5 = ($unicode_c_val & 0x3F) | $f;
        $str = chr($c1).chr($c2).chr($c3).chr($c4).chr($c5);
    }
    // U-04000000 - U-7FFFFFFF:  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    else if($unicode_c_val >= 0x4000000 && $unicode_c_val <= 0x7FFFFFFF){
        $h=0xFC; // 11111100
        $c1 = $unicode_c_val >> 30 | $h;
        $c2 = (($unicode_c_val & 0x3F000000) >> 24) | $f;
        $c3 = (($unicode_c_val & 0xFC0000) >> 18) | $f;
        $c4 = (($unicode_c_val & 0x3F000) >> 12) | $f;
        $c5 = (($unicode_c_val & 0xFC0) >> 6) | $f;
        $c6 = ($unicode_c_val & 0x3F) | $f;
        $str = chr($c1).chr($c2).chr($c3).chr($c4).chr($c5).chr($c6);
    }
    return $str;
}
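function entities2utf8($unicode_c){
    // The /e modifier was removed in PHP 7, so use a callback instead.
    // Numeric entities such as &#237; are decimal, so match digits only.
    $unicode_c = preg_replace_callback('/&#(\d+);/', function ($m) {
        return entity2utf8onechar($m[1]);
    }, $unicode_c);
    return $unicode_c;
}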
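
For comparison, PHP's built-in html_entity_decode also converts decimal numeric entities to UTF-8, which may be enough when no custom handling is needed. A minimal sketch (the sample string and variable names are illustrative only):

```php
<?php
// &#233; is é (U+00E9); &#128568; is U+1F638, which lies outside the BMP.
$html = 'caf&#233; &#128568;';
// Request UTF-8 output explicitly instead of relying on the default_charset setting.
$utf8 = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');
echo bin2hex($utf8), "\n";
```

Unlike the hand-written encoder above, the built-in function only emits valid UTF-8, so it never produces the obsolete 5-byte and 6-byte forms.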
