Notice: this page is a side-by-side Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/6058394/

Time: 2020-08-25 23:14:13  Source: igfitidea

Unicode character in PHP string

php unicode

Asked by Telaclavo

This question looks embarrassingly simple, but I haven't been able to find an answer.

What is the PHP equivalent to the following C# line of code?

string str = "\u1000";

This sample creates a string with a single Unicode character whose "Unicode numeric value" is 1000 in hexadecimal (4096 in decimal).

That is, in PHP, how can I create a string with a single Unicode character whose "Unicode numeric value" is known?

Answer by Stefan Gehrig

Because JSON directly supports the \uXXXX syntax, the first thing that comes to mind is:

$unicodeChar = '\u1000';
echo json_decode('"'.$unicodeChar.'"');

Another option would be to use mb_convert_encoding():

// Note: mbstring's 'HTML-ENTITIES' encoding is deprecated as of PHP 8.2
echo mb_convert_encoding('&#x1000;', 'UTF-8', 'HTML-ENTITIES');

or make use of the direct mapping between UTF-16BE (big endian) and the Unicode codepoint:

echo mb_convert_encoding("\x10\x00", 'UTF-8', 'UTF-16BE');
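
The UTF-16BE trick above only maps one-to-one for code points up to U+FFFF. A minimal sketch of the same idea that also covers higher code points, assuming the mbstring extension is loaded and PHP 7.2+ (for mb_chr()); the helper name codepoint_to_utf8_mb is made up for illustration and is not part of the original answer:

```php
<?php
// Hypothetical helper: turn a numeric code point into a UTF-8 string.
function codepoint_to_utf8_mb(int $codepoint): string
{
    // pack('N') produces a 32-bit big-endian integer, i.e. one UTF-32BE code unit.
    return mb_convert_encoding(pack('N', $codepoint), 'UTF-8', 'UTF-32BE');
}

echo bin2hex(codepoint_to_utf8_mb(0x1000)), "\n"; // e18080
echo bin2hex(mb_chr(0x1F600, 'UTF-8')), "\n";     // f09f9880
```

mb_chr() does the same job in a single call; the pack()-based variant is mainly useful for seeing what happens underneath.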

Answer by Blackhole

PHP 7.0.0 has introduced the "Unicode codepoint escape" syntax.

It's now possible to write Unicode characters easily by using a double-quoted or a heredoc string, without calling any function.

$unicodeChar = "\u{1000}";
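
A short sketch of how the escape behaves in the different string types, assuming PHP 7.0 or later; the variable names are only for illustration:

```php
<?php
$double = "\u{1000}";   // interpreted: one code point, stored as three UTF-8 bytes
$single = '\u{1000}';   // not interpreted: eight literal ASCII characters
$heredoc = <<<EOT
\u{1000}
EOT;

var_dump(bin2hex($double));     // string(6) "e18080"
var_dump(strlen($single));      // int(8)
var_dump($heredoc === $double); // bool(true)
```

The escape always produces UTF-8 bytes, regardless of the encoding the script file itself is saved in.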

Answer by Pacerier

I wonder why no one has mentioned this yet, but you can do an almost equivalent version using escape sequences in double quoted strings:

\x[0-9A-Fa-f]{1,2}

The sequence of characters matching the regular expression is a character in hexadecimal notation.

ASCII example:

<?php
    echo("\x48\x65\x6C\x6C\x6F\x20\x57\x6F\x72\x6C\x64\x21");
?>

Hello World!

So for your case, all you need to do is $str = "\x30\xA2"; but these are bytes, not characters. For code points in the Basic Multilingual Plane, the numeric value of the code point coincides with its UTF-16 big-endian byte representation, so we can print it out directly as such:

<?php
    header('content-type:text/html;charset=utf-16be');
    echo("\x30\xA2");
?>

If you are using a different encoding, you'll need to alter the bytes accordingly (mostly done with a library, though possible by hand too).

UTF-16 little endian example:

<?php
    header('content-type:text/html;charset=utf-16le');
    echo("\xA2\x30");
?>

UTF-8 example:

<?php
    header('content-type:text/html;charset=utf-8');
    echo("\xE3\x82\xA2");
?>
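
For the "by hand" route to UTF-8, here is a rough sketch of the standard bit-shuffling, assuming PHP 7+ for the type hints. The function name utf8_from_codepoint is hypothetical and not from any library:

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8 bytes by hand.
function utf8_from_codepoint(int $cp): string
{
    // Reject values outside Unicode and the UTF-16 surrogate range.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('Not a Unicode scalar value');
    }
    if ($cp < 0x80) {            // 1 byte: 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {           // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(utf8_from_codepoint(0x30A2)), "\n"; // e382a2
```

The result for U+30A2 matches the "\xE3\x82\xA2" bytes used in the UTF-8 example above.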

There is also the pack function, but you can expect it to be slow.

Answer by Gumbo

PHP does not know these Unicode escape sequences. But as unknown escape sequences remain unaffected, you can write your own function that converts such Unicode escape sequences:

function unicodeString($str, $encoding=null) {
    // Default to mbstring's current internal encoding.
    if (is_null($encoding)) $encoding = mb_internal_encoding();
    // Legacy variant: create_function() was deprecated in PHP 7.2 and removed in PHP 8.0.
    return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/u', create_function('$match', 'return mb_convert_encoding(pack("H*", $match[1]), '.var_export($encoding, true).', "UTF-16BE");'), $str);
}

Or with an anonymous function expression instead of create_function:

function unicodeString($str, $encoding=null) {
    // Default to mbstring's current internal encoding.
    if (is_null($encoding)) $encoding = mb_internal_encoding();
    // Four backslashes in the PHP literal yield \\u in the regex, which matches a literal "\u".
    return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/u', function($match) use ($encoding) {
        return mb_convert_encoding(pack('H*', $match[1]), $encoding, 'UTF-16BE');
    }, $str);
}

Its usage:

$str = unicodeString("\u1000");

Answer by flori

html_entity_decode('&#x30a8;', 0, 'UTF-8');

This works too. However, the json_decode() solution is a lot faster (around 50 times).

Answer by Hamid Sarfraz

Try Portable UTF-8:

$str = utf8_chr( 0x1000 );
$str = utf8_chr( '\u1000' );
$str = utf8_chr( 4096 );

All work exactly the same way. You can get the codepoint of a character with utf8_ord(). Read more about Portable UTF-8.

Answer by Timo Tijhof

As mentioned by others, PHP 7 introduces direct support for the \u Unicode escape syntax.

As also mentioned by others, the only other way to obtain a string value from a sensible Unicode character description in PHP is to convert it from something else (e.g. JSON parsing, HTML parsing or some other form). But this comes at a run-time performance cost.

However, there is one other option. You can encode the character directly in PHP with \x binary escaping. The \x escape syntax is also supported in PHP 5.

This is especially useful if you prefer not to enter the character directly in a string in its natural form, for example if it is an invisible control character or some other hard-to-detect whitespace.

First, a proof example:

// Unicode Character 'HAIR SPACE' (U+200A)
$htmlEntityChar = "&#8202;";
$realChar = html_entity_decode($htmlEntityChar);
$phpChar = "\xE2\x80\x8A";
echo 'Proof: ';
var_dump($realChar === $phpChar); // bool(true)

Note that, as mentioned by Pacerier in another answer, this binary code is specific to one character encoding. In the above example, \xE2\x80\x8A is the binary encoding of U+200A in UTF-8.

The next question is, how do you get from U+200A to \xE2\x80\x8A?

Below is a PHP script that generates the escape sequence for any character. It accepts a JSON string, an HTML entity, or anything else, as long as you can first turn it into a native string.

function str_encode_utf8binary($str) {
    /** @author Krinkle 2018 */
    $output = '';
    foreach (str_split($str) as $octet) {
        // Emit each byte as a zero-padded, two-digit hex escape for PHP's \x syntax
        $output .= sprintf('\x%02x', ord($octet));
    }
    return $output;
}

function str_convert_html_to_utf8binary($str) {
    return str_encode_utf8binary(html_entity_decode($str));
}
function str_convert_json_to_utf8binary($str) {
    return str_encode_utf8binary(json_decode($str));
}

// Example for raw string: Unicode Character 'INFINITY' (U+221E)
echo str_encode_utf8binary('∞') . "\n";
// \xe2\x88\x9e

// Example for HTML: Unicode Character 'HAIR SPACE' (U+200A)
echo str_convert_html_to_utf8binary('&#8202;') . "\n";
// \xe2\x80\x8a

// Example for JSON: Unicode Character 'HAIR SPACE' (U+200A)
echo str_convert_json_to_utf8binary('"\u200a"') . "\n";
// \xe2\x80\x8a
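
If you only target PHP 7+, the same idea can emit \u{...} code point escapes instead of raw bytes. This is a rough sketch that assumes valid UTF-8 input and the mbstring extension (mb_ord() needs PHP 7.2+); the function name str_encode_codepoint_escapes is made up for this example:

```php
<?php
// Hypothetical counterpart to str_encode_utf8binary(): one \u{...} escape per code point.
function str_encode_codepoint_escapes(string $str): string
{
    $output = '';
    // With the /u modifier, the empty pattern splits between code points, not bytes.
    // Assumption: $str is valid UTF-8 (preg_split() returns false otherwise).
    foreach (preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        $output .= sprintf('\u{%04x}', mb_ord($char, 'UTF-8'));
    }
    return $output;
}

// Unicode Character 'INFINITY' (U+221E), given here as its UTF-8 bytes
echo str_encode_codepoint_escapes("\xe2\x88\x9e") . "\n";
// \u{221e}
```

Unlike the \x form, the generated escape does not depend on the target encoding being UTF-8 in your head: it names the code point itself, and PHP produces the UTF-8 bytes.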