Original source: http://stackoverflow.com/questions/2227921/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Simplest way to get a complete list of all the UTF-8 whitespace characters in PHP
Asked by Ivan Krechetov
In PHP, what's the most elegant way to get the complete list (array of strings) of all the Unicode whitespace characters, encoded in utf8?
I need that to generate test data.
Accepted answer by devio
This email contains a list of all Unicode whitespace characters encoded in UTF-8, UTF-16, and HTML.
Edit:
Originally answered Feb 9 '10 (!). Really guys, if the information is outdated, you can add your own answer, rather than complain. Just google for the URL mentioned in my answer, and earn some rep:
The mail has been archived here (took me seconds), and the whitespace table is even mentioned in the introduction:
// Keys are Unicode character names; values are the UTF-8 byte sequences.
static $whitespace = array(
    "SPACE" => "\x20",
    "NO-BREAK SPACE" => "\xc2\xa0",
    "OGHAM SPACE MARK" => "\xe1\x9a\x80",
    "EN QUAD" => "\xe2\x80\x80",
    "EM QUAD" => "\xe2\x80\x81",
    "EN SPACE" => "\xe2\x80\x82",
    "EM SPACE" => "\xe2\x80\x83",
    "THREE-PER-EM SPACE" => "\xe2\x80\x84",
    "FOUR-PER-EM SPACE" => "\xe2\x80\x85",
    "SIX-PER-EM SPACE" => "\xe2\x80\x86",
    "FIGURE SPACE" => "\xe2\x80\x87",
    "PUNCTUATION SPACE" => "\xe2\x80\x88",
    "THIN SPACE" => "\xe2\x80\x89",
    "HAIR SPACE" => "\xe2\x80\x8a",
    "ZERO WIDTH SPACE" => "\xe2\x80\x8b",
    "NARROW NO-BREAK SPACE" => "\xe2\x80\xaf",
    "MEDIUM MATHEMATICAL SPACE" => "\xe2\x81\x9f",
    "IDEOGRAPHIC SPACE" => "\xe3\x80\x80",
);
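Any row of a table like this can be sanity-checked by decoding the bytes and comparing the result with the Unicode name. The following is a minimal sketch in Python rather than PHP, chosen only for convenience; `sample` and `check_row` are hypothetical names, and only a few rows from the array above are copied in.

```python
import unicodedata

# A few rows copied from the array above (name -> UTF-8 bytes).
sample = {
    "SPACE": b"\x20",
    "NO-BREAK SPACE": b"\xc2\xa0",
    "OGHAM SPACE MARK": b"\xe1\x9a\x80",
    "NARROW NO-BREAK SPACE": b"\xe2\x80\xaf",
    "IDEOGRAPHIC SPACE": b"\xe3\x80\x80",
}

def check_row(name, raw):
    """Return (code point, True if the bytes decode to the named character)."""
    char = raw.decode("utf-8")  # raises UnicodeDecodeError on invalid bytes
    return ord(char), unicodedata.name(char, "") == name

for name, raw in sample.items():
    code_point, ok = check_row(name, raw)
    print(f"U+{code_point:04X} {name}: {'ok' if ok else 'MISMATCH'}")
```

The check assumes each value encodes exactly one character, which holds for every row in the array.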
Answer by cegfault
Years later, this question still has top results on Google when looking for Unicode whitespace characters. devio's answer is great, but incomplete. As of this writing (October 2017) Wikipedia has a list of whitespace characters here: https://en.wikipedia.org/wiki/Whitespace_character
This list specifies 25 code points, whereas the currently accepted answer lists 18. Including the seven other code points, the list is:
U+0009 character tabulation
U+000A line feed
U+000B line tabulation
U+000C form feed
U+000D carriage return
U+0020 space
U+0085 next line
U+00A0 no-break space
U+1680 ogham space mark
U+180E mongolian vowel separator
U+2000 en quad
U+2001 em quad
U+2002 en space
U+2003 em space
U+2004 three-per-em space
U+2005 four-per-em space
U+2006 six-per-em space
U+2007 figure space
U+2008 punctuation space
U+2009 thin space
U+200A hair space
U+200B zero width space
U+200C zero width non-joiner
U+200D zero width joiner
U+2028 line separator
U+2029 paragraph separator
U+202F narrow no-break space
U+205F medium mathematical space
U+2060 word joiner
U+3000 ideographic space
U+FEFF zero width non-breaking space
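The list above has 31 entries, more than the 25 mentioned in the text, because it also includes related characters such as U+200B ZERO WIDTH SPACE that do not carry the Unicode White_Space property. One rough way to separate the two groups is to ask Python which entries `str.isspace()` accepts. This is only a sketch: `isspace()` follows Python's own definition (based on general category and bidirectional class), which is an approximation of White_Space and depends on the Unicode version bundled with the interpreter. `CODE_POINTS` and `split_by_isspace` are hypothetical names.

```python
# Code points copied from the list above.
CODE_POINTS = [
    0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0020, 0x0085, 0x00A0,
    0x1680, 0x180E, 0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005,
    0x2006, 0x2007, 0x2008, 0x2009, 0x200A, 0x200B, 0x200C, 0x200D,
    0x2028, 0x2029, 0x202F, 0x205F, 0x2060, 0x3000, 0xFEFF,
]

def split_by_isspace(code_points):
    """Partition code points by whether Python's str.isspace() accepts them."""
    spaces, others = [], []
    for cp in code_points:
        (spaces if chr(cp).isspace() else others).append(cp)
    return spaces, others

spaces, others = split_by_isspace(CODE_POINTS)
print(len(CODE_POINTS), "entries in total")
print("isspace() is true for: ", " ".join(f"U+{cp:04X}" for cp in spaces))
print("isspace() is false for:", " ".join(f"U+{cp:04X}" for cp in others))
```

For test data, both groups are probably worth keeping, since the zero-width characters are a common source of invisible-input bugs even though they are not whitespace in the strict sense.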
Answer by prewett
http://en.wikipedia.org/wiki/Space_%28punctuation%29#Spaces_in_Unicode
Unfortunately, it doesn't give UTF-8, but it does have the character in the web page, so you could cut and paste into your editor (if it saves in UTF-8). Alternatively, http://www.fileformat.info/info/unicode/char/180E/index.htm gives UTF-8 (replace "180E" with the hex code point you are looking up).
This also gives a couple extra characters that @devio's excellent answer misses.
Answer by j-a
0x9 b'\t'
0xa b'\n'
0xb b'\x0b'
0xc b'\x0c'
0xd b'\r'
0x20 b' '
0x85 b'\xc2\x85'
0xa0 b'\xc2\xa0'
0x1680 b'\xe1\x9a\x80'
0x180e b'\xe1\xa0\x8e'
0x2000 b'\xe2\x80\x80'
0x2001 b'\xe2\x80\x81'
0x2002 b'\xe2\x80\x82'
0x2003 b'\xe2\x80\x83'
0x2004 b'\xe2\x80\x84'
0x2005 b'\xe2\x80\x85'
0x2006 b'\xe2\x80\x86'
0x2007 b'\xe2\x80\x87'
0x2008 b'\xe2\x80\x88'
0x2009 b'\xe2\x80\x89'
0x200a b'\xe2\x80\x8a'
0x200b b'\xe2\x80\x8b'
0x200c b'\xe2\x80\x8c'
0x200d b'\xe2\x80\x8d'
0x2028 b'\xe2\x80\xa8'
0x2029 b'\xe2\x80\xa9'
0x202f b'\xe2\x80\xaf'
0x205f b'\xe2\x81\x9f'
0x2060 b'\xe2\x81\xa0'
0x3000 b'\xe3\x80\x80'
0xfeff b'\xef\xbb\xbf'
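The rows above look like Python `bytes` literals, although the answer does not say how they were generated. A minimal sketch of one way to produce such rows, and to render each one as a double-quoted PHP string with `\xNN` escapes like those in the accepted answer, might look like this. `utf8_bytes` and `php_literal` are hypothetical helper names, and only a handful of code points from the table are used.

```python
def utf8_bytes(code_point):
    """Encode a single Unicode code point as UTF-8 bytes."""
    return chr(code_point).encode("utf-8")

def php_literal(raw):
    """Render bytes as a double-quoted PHP string using \\xNN escapes."""
    return '"' + "".join(f"\\x{b:02x}" for b in raw) + '"'

# A few of the code points from the table above.
for cp in (0x9, 0x85, 0x1680, 0x202F, 0xFEFF):
    raw = utf8_bytes(cp)
    print(hex(cp), raw, php_literal(raw))
```

Pasting the generated literals into a PHP `array(...)` gives the array of strings the question asks for, without depending on the encoding the source file is saved in.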

