PHP:将 unicode 代码点转换为 UTF-8
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/1805802/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
PHP: Convert unicode codepoint to UTF-8
提问by Anthony
I have my data in this format: U+597Dor like this U+6211. I want to convert them to UTF-8 (original characters are 好 and 我). How can I do it?
我有这种格式的数据:U+597D或者像这样U+6211。我想将它们转换为 UTF-8(原始字符是好和我)。我该怎么做?
回答by Mez
$utf8string = html_entity_decode(preg_replace("/U\+([0-9A-F]{4})/", "&#x\1;", $string), ENT_NOQUOTES, 'UTF-8');
is probably the simplest solution.
可能是最简单的解决方案。
回答by velcrow
function utf8($num)
{
if($num<=0x7F) return chr($num);
if($num<=0x7FF) return chr(($num>>6)+192).chr(($num&63)+128);
if($num<=0xFFFF) return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
if($num<=0x1FFFFF) return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128).chr(($num&63)+128);
return '';
}
function uniord($c)
{
$ord0 = ord($c{0}); if ($ord0>=0 && $ord0<=127) return $ord0;
$ord1 = ord($c{1}); if ($ord0>=192 && $ord0<=223) return ($ord0-192)*64 + ($ord1-128);
$ord2 = ord($c{2}); if ($ord0>=224 && $ord0<=239) return ($ord0-224)*4096 + ($ord1-128)*64 + ($ord2-128);
$ord3 = ord($c{3}); if ($ord0>=240 && $ord0<=247) return ($ord0-240)*262144 + ($ord1-128)*4096 + ($ord2-128)*64 + ($ord3-128);
return false;
}
utf8() and uniord() try to mirror the chr() and ord() functions on php:
utf8() 和 uniord() 尝试在 php 上镜像 chr() 和 ord() 函数:
echo utf8(0x6211)."\n";
echo uniord(utf8(0x6211))."\n";
echo "U+".dechex(uniord(utf8(0x6211)))."\n";
//In your case:
$wo='U+6211';
$hao='U+597D';
echo utf8(hexdec(str_replace("U+","", $wo)))."\n";
echo utf8(hexdec(str_replace("U+","", $hao)))."\n";
output:
输出:
我
25105
U+6211
我
好
回答by Rabin Lama Dong
PHP 7+
PHP 7+
As of PHP 7, you can use the Unicode codepoint escape syntaxto do this.
从 PHP 7 开始,您可以使用Unicode 代码点转义语法来执行此操作。
echo "\u{597D}";outputs 好.
echo "\u{597D}";输出好。
回答by John Slegers
I just wrote a polyfillfor missing multibyte versions of ordand chrwith the following in mind:
我只是写了一个polyfill缺少多字节版本的ord和chr考虑到以下几点:
It defines functions
mb_ordandmb_chronly if they don't already exist. If they do exist in your framework or some future version of PHP, the polyfill will be ignored.It uses the widely used
mbstringextension to do the conversion. If thembstringextension is not loaded, it will use theiconvextension instead.
它定义了函数,
mb_ord并且mb_chr仅当它们不存在时才定义。如果它们确实存在于您的框架或 PHP 的某个未来版本中,则 polyfill 将被忽略。它使用广泛使用的
mbstring扩展来进行转换。如果mbstring未加载扩展名,它将改用iconv扩展名。
I also added functions for HTMLentities encoding / decoding and encoding / decoding to JSON format as well as some demo code for how to use these functions
我还添加了 HTMLentities 编码/解码和编码/解码为 JSON 格式的函数以及一些关于如何使用这些函数的演示代码
Code
代码
if (!function_exists('codepoint_encode')) {
function codepoint_encode($str) {
return substr(json_encode($str), 1, -1);
}
}
if (!function_exists('codepoint_decode')) {
function codepoint_decode($str) {
return json_decode(sprintf('"%s"', $str));
}
}
if (!function_exists('mb_internal_encoding')) {
function mb_internal_encoding($encoding = NULL) {
return ($from_encoding === NULL) ? iconv_get_encoding() : iconv_set_encoding($encoding);
}
}
if (!function_exists('mb_convert_encoding')) {
function mb_convert_encoding($str, $to_encoding, $from_encoding = NULL) {
return iconv(($from_encoding === NULL) ? mb_internal_encoding() : $from_encoding, $to_encoding, $str);
}
}
if (!function_exists('mb_chr')) {
function mb_chr($ord, $encoding = 'UTF-8') {
if ($encoding === 'UCS-4BE') {
return pack("N", $ord);
} else {
return mb_convert_encoding(mb_chr($ord, 'UCS-4BE'), $encoding, 'UCS-4BE');
}
}
}
if (!function_exists('mb_ord')) {
function mb_ord($char, $encoding = 'UTF-8') {
if ($encoding === 'UCS-4BE') {
list(, $ord) = (strlen($char) === 4) ? @unpack('N', $char) : @unpack('n', $char);
return $ord;
} else {
return mb_ord(mb_convert_encoding($char, 'UCS-4BE', $encoding), 'UCS-4BE');
}
}
}
if (!function_exists('mb_htmlentities')) {
function mb_htmlentities($string, $hex = true, $encoding = 'UTF-8') {
return preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) use ($hex) {
return sprintf($hex ? '&#x%X;' : '&#%d;', mb_ord($match[0]));
}, $string);
}
}
if (!function_exists('mb_html_entity_decode')) {
function mb_html_entity_decode($string, $flags = null, $encoding = 'UTF-8') {
return html_entity_decode($string, ($flags === NULL) ? ENT_COMPAT | ENT_HTML401 : $flags, $encoding);
}
}
How to use
如何使用
echo "\nGet string from numeric DEC value\n";
var_dump(mb_chr(25105));
var_dump(mb_chr(22909));
echo "\nGet string from numeric HEX value\n";
var_dump(mb_chr(0x6211));
var_dump(mb_chr(0x597D));
echo "\nGet numeric value of character as DEC int\n";
var_dump(mb_ord('我'));
var_dump(mb_ord('好'));
echo "\nGet numeric value of character as HEX string\n";
var_dump(dechex(mb_ord('我')));
var_dump(dechex(mb_ord('好')));
echo "\nEncode / decode to DEC based HTML entities\n";
var_dump(mb_htmlentities('我好', false));
var_dump(mb_html_entity_decode('我好'));
echo "\nEncode / decode to HEX based HTML entities\n";
var_dump(mb_htmlentities('我好'));
var_dump(mb_html_entity_decode('我好'));
echo "\nUse JSON encoding / decoding\n";
var_dump(codepoint_encode("我好"));
var_dump(codepoint_decode('\u6211\u597d'));
Output
输出
Get string from numeric DEC value
string(3) "我"
string(3) "好"
Get string from numeric HEX value
string(3) "我"
string(3) "好"
Get numeric value of character as DEC string
int(25105)
int(22909)
Get numeric value of character as HEX string
string(4) "6211"
string(4) "597d"
Encode / decode to DEC based HTML entities
string(16) "我好"
string(6) "我好"
Encode / decode to HEX based HTML entities
string(16) "我好"
string(6) "我好"
Use JSON encoding / decoding
string(12) "\u6211\u597d"
string(6) "我好"
回答by Php'Regex
<?php
function chr_utf8($n,$f='C*'){
return $n<(1<<7)?chr($n):($n<1<<11?pack($f,192|$n>>6,1<<7|191&$n):
($n<(1<<16)?pack($f,224|$n>>12,1<<7|63&$n>>6,1<<7|63&$n):
($n<(1<<20|1<<16)?pack($f,240|$n>>18,1<<7|63&$n>>12,1<<7|63&$n>>6,1<<7|63&$n):'')));
}
$your_input='U+597D';
echo (chr_utf8(hexdec(ltrim($your_input,'U+'))));
// Output 好
If you want to use a callback function you can try it :
如果你想使用回调函数,你可以试试:
<?php
// Note: function chr_utf8 shown above is required
$your_input='U+597DU+6211';
$result=preg_replace_callback('#U\+([a-f0-9]+)#i',function($a){return chr_utf8(hexdec($a[1]));},$your_input);
echo $result;
// Output 好我
Check it in https://eval.in/748187
回答by eleg
mb_convert_encoding(
preg_replace("/U\+([0-9A-F]*)/"
,"&#x\1;"
,'U+597DU+6211'
)
,"UTF-8"
,"HTML-ENTITIES"
);
works fine, too.
工作正常,也是。
回答by Dor
With the aid of the following table:
借助下表:
http://en.wikipedia.org/wiki/UTF-8#Description
http://en.wikipedia.org/wiki/UTF-8#Description
can't be simpler :)
不能更简单:)
Simply mask the unicode numbers according to which range they fit in.
只需根据它们适合的范围屏蔽 unicode 数字。
回答by Tschallacka
I was in the position I needed to filter specific characters without affecting the html because I was using a wysiwig editor, but people copy pasting from word would add some nice unrenderable characters to the content.
我处于需要过滤特定字符而不影响 html 的位置,因为我使用的是 wysiwig 编辑器,但是人们从 word 复制粘贴会向内容添加一些不错的不可渲染字符。
My solution boils down to simple replacement lists.
我的解决方案归结为简单的替换列表。
class ReplaceIllegal {
public static $find = array ( 0 => '\x0', 1 => '\x1', 2 => '\x2', 3 => '\x3', 4 => '\x4', 5 => '\x5', 6 => '\x6', 7 => '\x7', 8 => '\x8', 9 => '\x9', 10 => '\xA', 11 => '\xB', 12 => '\xC', 13 => '\xD', 14 => '\xE', 15 => '\xF', 16 => '\x10', 17 => '\x11', 18 => '\x12', 19 => '\x13', 20 => '\x14', 21 => '\x15', 22 => '\x16', 23 => '\x17', 24 => '\x18', 25 => '\x19', 26 => '\x1A', 27 => '\x1B', 28 => '\x1C', 29 => '\x1D', 30 => '\x1E', 31 => '\x80', 32 => '\x81', 33 => '\x82', 34 => '\x83', 35 => '\x84', 36 => '\x85', 37 => '\x86', 38 => '\x87', 39 => '\x88', 40 => '\x89', 41 => '\x8A', 42 => '\x8B', 43 => '\x8C', 44 => '\x8D', 45 => '\x8E', 46 => '\x8F', 47 => '\x90', 48 => '\x91', 49 => '\x92', 50 => '\x93', 51 => '\x94', 52 => '\x95', 53 => '\x96', 54 => '\x97', 55 => '\x98', 56 => '\x99', 57 => '\x9A', 58 => '\x9B', 59 => '\x9C', 60 => '\x9D', 61 => '\x9E', 62 => '\x9F', 63 => '\xA0', 64 => '\xA1', 65 => '\xA2', 66 => '\xA3', 67 => '\xA4', 68 => '\xA5', 69 => '\xA6', 70 => '\xA7', 71 => '\xA8', 72 => '\xA9', 73 => '\xAA', 74 => '\xAB', 75 => '\xAC', 76 => '\xAD', 77 => '\xAE', 78 => '\xAF', 79 => '\xB0', 80 => '\xB1', 81 => '\xB2', 82 => '\xB3', 83 => '\xB4', 84 => '\xB5', 85 => '\xB6', 86 => '\xB7', 87 => '\xB8', 88 => '\xB9', 89 => '\xBA', 90 => '\xBB', 91 => '\xBC', 92 => '\xBD', 93 => '\xBE', 94 => '\xBF', 95 => '\xC0', 96 => '\xC1', 97 => '\xC2', 98 => '\xC3', 99 => '\xC4', 100 => '\xC5', 101 => '\xC6', 102 => '\xC7', 103 => '\xC8', 104 => '\xC9', 105 => '\xCA', 106 => '\xCB', 107 => '\xCC', 108 => '\xCD', 109 => '\xCE', 110 => '\xCF', 111 => '\xD0', 112 => '\xD1', 113 => '\xD2', 114 => '\xD3', 115 => '\xD4', 116 => '\xD5', 117 => '\xD6', 118 => '\xD7', 119 => '\xD8', 120 => '\xD9', 121 => '\xDA', 122 => '\xDB', 123 => '\xDC', 124 => '\xDD', 125 => '\xDE', 126 => '\xDF', 127 => '\xE0', 128 => '\xE1', 129 => '\xE2', 130 => '\xE3', 131 => '\xE4', 132 => '\xE5', 133 => '\xE6', 134 => '\xE7', 135 => '\xE8', 136 => '\xE9', 137 => '\xEA', 138 => '\xEB', 139 => '\xEC', 140 => '\xED', 141 => '\xEE', 142 => '\xEF', 143 => '\xF0', 144 => '\xF1', 145 => '\xF2', 146 => '\xF3', 147 => '\xF4', 148 => '\xF5', 149 => '\xF6', 150 => '\xF7', 151 => '\xF8', 152 => '\xF9', 153 => '\xFA', 154 => '\xFB', 155 => '\xFC', 156 => '\xFD', 157 => '\xFE', );
private static $replace = array ( 0 => '�', 1 => '', 2 => '', 3 => '', 4 => '', 5 => '', 6 => '', 7 => '', 8 => '', 9 => '	', 10 => ' ', 11 => '', 12 => '', 13 => ' ', 14 => '', 15 => '', 16 => '', 17 => '', 18 => '', 19 => '', 20 => '', 21 => '', 22 => '', 23 => '', 24 => '', 25 => '', 26 => '', 27 => '', 28 => '', 29 => '', 30 => '', 31 => '€', 32 => '', 33 => '‚', 34 => 'ƒ', 35 => '„', 36 => '…', 37 => '†', 38 => '‡', 39 => 'ˆ', 40 => '‰', 41 => 'Š', 42 => '‹', 43 => 'Œ', 44 => '', 45 => 'Ž', 46 => '', 47 => '', 48 => '‘', 49 => '’', 50 => '“', 51 => '”', 52 => '•', 53 => '–', 54 => '—', 55 => '˜', 56 => '™', 57 => 'š', 58 => '›', 59 => 'œ', 60 => '', 61 => 'ž', 62 => 'Ÿ', 63 => ' ', 64 => '¡', 65 => '¢', 66 => '£', 67 => '¤', 68 => '¥', 69 => '¦', 70 => '§', 71 => '¨', 72 => '©', 73 => 'ª', 74 => '«', 75 => '¬', 76 => '­', 77 => '®', 78 => '¯', 79 => '°', 80 => '±', 81 => '²', 82 => '³', 83 => '´', 84 => 'µ', 85 => '¶', 86 => '·', 87 => '¸', 88 => '¹', 89 => 'º', 90 => '»', 91 => '¼', 92 => '½', 93 => '¾', 94 => '¿', 95 => 'À', 96 => 'Á', 97 => 'Â', 98 => 'Ã', 99 => 'Ä', 100 => 'Å', 101 => 'Æ', 102 => 'Ç', 103 => 'È', 104 => 'É', 105 => 'Ê', 106 => 'Ë', 107 => 'Ì', 108 => 'Í', 109 => 'Î', 110 => 'Ï', 111 => 'Ð', 112 => 'Ñ', 113 => 'Ò', 114 => 'Ó', 115 => 'Ô', 116 => 'Õ', 117 => 'Ö', 118 => '×', 119 => 'Ø', 120 => 'Ù', 121 => 'Ú', 122 => 'Û', 123 => 'Ü', 124 => 'Ý', 125 => 'Þ', 126 => 'ß', 127 => 'à', 128 => 'á', 129 => 'â', 130 => 'ã', 131 => 'ä', 132 => 'å', 133 => 'æ', 134 => 'ç', 135 => 'è', 136 => 'é', 137 => 'ê', 138 => 'ë', 139 => 'ì', 140 => 'í', 141 => 'î', 142 => 'ï', 143 => 'ð', 144 => 'ñ', 145 => 'ò', 146 => 'ó', 147 => 'ô', 148 => 'õ', 149 => 'ö', 150 => '÷', 151 => 'ø', 152 => 'ù', 153 => 'ú', 154 => 'û', 155 => 'ü', 156 => 'ý', 157 => 'þ', );
/*
* replace illegal characters for escaped html character but don't touch anything else.
*/
public static function getSaveValue($value) {
return str_replace(self::$find, self::$replace, $value);
}
public static function makeIllegal($find,$replace) {
self::$find[] = $find;
self::$replace[] = $replace;
}
}
回答by Claudio Garaycochea
This worked fine for me. If you have a string "Letters u00e1 u00e9 etc." replace by "Letters á é".
这对我来说很好。如果您有一个字符串“Letters u00e1 u00e9 etc.” 替换为“字母áé”。
function unicode2html($str){
// Set the locale to something that's UTF-8 capable
setlocale(LC_ALL, 'en_US.UTF-8');
// Convert the codepoints to entities
$str = preg_replace("/u([0-9a-fA-F]{4})/", "&#x\1;", $str);
// Convert the entities to a UTF-8 string
return iconv("UTF-8", "ISO-8859-1//TRANSLIT", $str);
}

