Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/2227921/

Simplest way to get a complete list of all the UTF-8 whitespace characters in PHP

Tags: php, utf-8, whitespace, space

Asked by Ivan Krechetov

In PHP, what's the most elegant way to get the complete list (array of strings) of all the Unicode whitespace characters, encoded in utf8?

I need that to generate test data.

Accepted answer by devio

This email contains a list of all Unicode whitespace characters encoded in UTF-8, UTF-16, and HTML.

edit

Originally answered Feb 9 '10 (!). Really guys, if the information is outdated, you can add your own answer, rather than complain. Just google for the URL mentioned in my answer, and earn some rep:

The mail has been archived here (took me seconds), and the whitespace table is even mentioned in the introduction.

static $whitespace = array(
    "SPACE" => "\x20",
    "NO-BREAK SPACE" => "\xc2\xa0",
    "OGHAM SPACE MARK" => "\xe1\x9a\x80",
    "EN QUAD" => "\xe2\x80\x80",
    "EM QUAD" => "\xe2\x80\x81",
    "EN SPACE" => "\xe2\x80\x82",
    "EM SPACE" => "\xe2\x80\x83",
    "THREE-PER-EM SPACE" => "\xe2\x80\x84",
    "FOUR-PER-EM SPACE" => "\xe2\x80\x85",
    "SIX-PER-EM SPACE" => "\xe2\x80\x86",
    "FIGURE SPACE" => "\xe2\x80\x87",
    "PUNCTUATION SPACE" => "\xe2\x80\x88",
    "THIN SPACE" => "\xe2\x80\x89",
    "HAIR SPACE" => "\xe2\x80\x8a",
    "ZERO WIDTH SPACE" => "\xe2\x80\x8b",
    "NARROW NO-BREAK SPACE" => "\xe2\x80\xaf",
    "MEDIUM MATHEMATICAL SPACE" => "\xe2\x81\x9f",
    "IDEOGRAPHIC SPACE" => "\xe3\x80\x80",
);
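
Since the question asks for a plain array of strings to generate test data, here is a minimal sketch of one way to use the table. It copies only three of the 18 entries to stay short, and the `$chars` and `$samples` names are illustrative assumptions, not part of the original answer. It also shows the `"\u{...}"` escape available in PHP 7.0 and later, which produces the same bytes as the `\x` escapes.

```php
<?php
// Excerpt of the table above (three of the 18 entries), for illustration only.
$whitespace = array(
    "SPACE" => "\x20",
    "NO-BREAK SPACE" => "\xc2\xa0",
    "IDEOGRAPHIC SPACE" => "\xe3\x80\x80",
);

// Drop the names to get a plain list of UTF-8 strings.
$chars = array_values($whitespace);

// Hypothetical test inputs: "a", one whitespace character, then "b".
$samples = array();
foreach ($whitespace as $name => $char) {
    $samples[$name] = "a" . $char . "b";
}

// PHP 7.0+ code point escapes yield the same UTF-8 bytes as the \x escapes.
var_dump("\u{00A0}" === $whitespace["NO-BREAK SPACE"]);    // bool(true)
var_dump("\u{3000}" === $whitespace["IDEOGRAPHIC SPACE"]); // bool(true)
```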

Answer by cegfault

Years later, this question is still a top result on Google when searching for Unicode whitespace characters. devio's answer is great, but incomplete. As of this writing (October 2017), Wikipedia has a list of whitespace characters here: https://en.wikipedia.org/wiki/Whitespace_character

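This list specifies 25 code points, whereas the currently accepted answer lists 18. Including the seven other code points, the list is given below. Note that it has 31 entries: 25 carry the Unicode White_Space property, and the other six (U+180E, U+200B, U+200C, U+200D, U+2060, U+FEFF) are related characters that do not.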

U+0009  character tabulation
U+000A  line feed
U+000B  line tabulation
U+000C  form feed
U+000D  carriage return
U+0020  space
U+0085  next line
U+00A0  no-break space
U+1680  ogham space mark
U+180E  mongolian vowel separator
U+2000  en quad
U+2001  em quad
U+2002  en space
U+2003  em space
U+2004  three-per-em space
U+2005  four-per-em space
U+2006  six-per-em space
U+2007  figure space
U+2008  punctuation space
U+2009  thin space
U+200A  hair space
U+200B  zero width space
U+200C  zero width non-joiner
U+200D  zero width joiner
U+2028  line separator
U+2029  paragraph separator
U+202F  narrow no-break space
U+205F  medium mathematical space
U+2060  word joiner
U+3000  ideographic space
U+FEFF  zero width non-breaking space
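
The code points above can be turned into the array of UTF-8 strings the question asks for. The following is a minimal sketch, assuming PHP 7.2 or later with the mbstring extension (for `mb_chr`); the `$ranges` layout and the `U+XXXX` array keys are illustrative choices, not something the answer prescribes.

```php
<?php
// Code points from the list above, written as [first, last] ranges.
$ranges = array(
    array(0x0009, 0x000D),
    array(0x0020, 0x0020),
    array(0x0085, 0x0085),
    array(0x00A0, 0x00A0),
    array(0x1680, 0x1680),
    array(0x180E, 0x180E),
    array(0x2000, 0x200D),
    array(0x2028, 0x2029),
    array(0x202F, 0x202F),
    array(0x205F, 0x205F),
    array(0x2060, 0x2060),
    array(0x3000, 0x3000),
    array(0xFEFF, 0xFEFF),
);

// Map "U+XXXX" => UTF-8 encoded string for every code point in the ranges.
$whitespace = array();
foreach ($ranges as list($first, $last)) {
    for ($cp = $first; $cp <= $last; $cp++) {
        $whitespace[sprintf("U+%04X", $cp)] = mb_chr($cp, "UTF-8");
    }
}

echo count($whitespace), "\n";             // 31
echo bin2hex($whitespace["U+2028"]), "\n"; // e280a8
```

Calling `array_values($whitespace)` gives the plain list of strings if the keys are not needed.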

Answer by prewett

http://en.wikipedia.org/wiki/Space_%28punctuation%29#Spaces_in_Unicode

Unfortunately, it doesn't give UTF-8, but it does have the character in the web page, so you could cut and paste into your editor (if it saves in UTF-8). Alternatively, http://www.fileformat.info/info/unicode/char/180E/index.htm gives UTF-8 (replace "180E" with the hex code point you are looking up, which for these characters is the same as the UTF-16 value).

This also gives a couple extra characters that @devio's excellent answer misses.
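
Instead of looking up each UTF-8 sequence by hand, the bytes can be computed from the hex code point. The sketch below is one way to do that without the mbstring extension; `utf8_encode_codepoint` is a hypothetical helper name, and it assumes PHP 7.0 or later for the type declarations. It does not reject surrogates or values above U+10FFFF.

```php
<?php
// Hypothetical helper: encode one Unicode code point as a UTF-8 byte string.
// No validation: surrogates and values above 0x10FFFF are not rejected.
function utf8_encode_codepoint(int $cp): string
{
    if ($cp < 0x80) {
        // 1 byte: 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(utf8_encode_codepoint(0x180E)), "\n"; // e1a08e
echo bin2hex(utf8_encode_codepoint(0x00A0)), "\n"; // c2a0
```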

Answer by j-a

0x9 b'\t'
0xa b'\n'
0xb b'\x0b'
0xc b'\x0c'
0xd b'\r'
0x20 b' '
0x85 b'\xc2\x85'
0xa0 b'\xc2\xa0'
0x1680 b'\xe1\x9a\x80'
0x180e b'\xe1\xa0\x8e'
0x2000 b'\xe2\x80\x80'
0x2001 b'\xe2\x80\x81'
0x2002 b'\xe2\x80\x82'
0x2003 b'\xe2\x80\x83'
0x2004 b'\xe2\x80\x84'
0x2005 b'\xe2\x80\x85'
0x2006 b'\xe2\x80\x86'
0x2007 b'\xe2\x80\x87'
0x2008 b'\xe2\x80\x88'
0x2009 b'\xe2\x80\x89'
0x200a b'\xe2\x80\x8a'
0x200b b'\xe2\x80\x8b'
0x200c b'\xe2\x80\x8c'
0x200d b'\xe2\x80\x8d'
0x2028 b'\xe2\x80\xa8'
0x2029 b'\xe2\x80\xa9'
0x202f b'\xe2\x80\xaf'
0x205f b'\xe2\x81\x9f'
0x2060 b'\xe2\x81\xa0'
0x3000 b'\xe3\x80\x80'
0xfeff b'\xef\xbb\xbf'
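
The listing above appears to be Python bytes-literal output. A comparable listing can be produced in PHP; this is a minimal sketch, assuming PHP 7.2 or later with mbstring, where `escape_bytes` is a hypothetical helper name and only three of the code points are shown. Unlike the listing above, it escapes every byte, including printable ones.

```php
<?php
// Hypothetical helper: render each byte of a string as a \xNN escape.
function escape_bytes(string $bytes): string
{
    $out = "";
    foreach (str_split($bytes) as $byte) {
        $out .= sprintf("\\x%02x", ord($byte));
    }
    return $out;
}

// Three sample code points from the listing above.
foreach (array(0x85, 0x180e, 0xfeff) as $cp) {
    printf("0x%x %s\n", $cp, escape_bytes(mb_chr($cp, "UTF-8")));
}
// Prints:
// 0x85 \xc2\x85
// 0x180e \xe1\xa0\x8e
// 0xfeff \xef\xbb\xbf
```

The escaped sequences can be pasted between double quotes in PHP source to get the corresponding UTF-8 string.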