JavaScript Unicode string: Chinese characters but no punctuation
Original question: http://stackoverflow.com/questions/21109011/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Asked by resle
I am trying to filter a Unicode string using JavaScript. The string could contain mixed characters. Example: 我的中文不好。我是意大利人。你知道吗?
Ultimately, the string may contain:
- Chinese characters
- Chinese punctuation
- ANSI characters and punctuation
I need to keep only the Chinese characters. Any hints?
Answered by Brett Zamir
You can see the relevant blocks at http://www.unicode.org/reports/tr38/#BlockListing or http://www.unicode.org/charts/.
If you are excluding compatibility characters (ones which should no longer be used), as well as strokes, radicals, and Enclosed CJK Letters and Months, the following ought to cover it (I've added the individual JavaScript equivalent expressions afterward):
- CJK Unified Ideographs (4E00-9FCC)
[\u4E00-\u9FCC]
- CJK Unified Ideographs Extension A (3400-4DB5)
[\u3400-\u4DB5]
- CJK Unified Ideographs Extension B (20000-2A6D6)
[\ud840-\ud868][\udc00-\udfff]|\ud869[\udc00-\uded6]
- CJK Unified Ideographs Extension C (2A700-2B734)
\ud869[\udf00-\udfff]|[\ud86a-\ud86c][\udc00-\udfff]|\ud86d[\udc00-\udf34]
- CJK Unified Ideographs Extension D (2B740-2B81D)
\ud86d[\udf40-\udfff]|\ud86e[\udc00-\udc1d]
- 12 characters within the CJK Compatibility Ideographs (F900-FA6D/FA70-FAD9) but which are actually CJK unified ideographs
[\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]
...so, a regex to grab the Chinese characters would be:
/[\u4E00-\u9FCC\u3400-\u4DB5\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]|[\ud840-\ud868][\udc00-\udfff]|\ud869[\udc00-\uded6\udf00-\udfff]|[\ud86a-\ud86c][\udc00-\udfff]|\ud86d[\udc00-\udf34\udf40-\udfff]|\ud86e[\udc00-\udc1d]/
Because there are so many CJK (Chinese-Japanese-Korean) characters, Unicode was expanded to handle characters beyond the "Basic Multilingual Plane" (so-called "astral" characters). The CJK Unified Ideographs Extensions B-D consist of such astral characters, so their ranges are more complicated: they have to be encoded using surrogate pairs in UTF-16 systems like JavaScript. A surrogate pair consists of a high surrogate and a low surrogate, neither of which is valid by itself, but which together form a single character (despite having a string length of 2).
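To make the surrogate-pair explanation concrete, here is a minimal sketch showing how one astral ideograph occupies two UTF-16 code units, and how the surrogate values used in the Extension B range can be derived. The helper name `toSurrogatePair` is an assumption made for illustration; it is not part of the original answer.

```javascript
// U+20000 is the first code point of CJK Unified Ideographs Extension B.
const astral = String.fromCodePoint(0x20000);

// One character, but two UTF-16 code units.
const length = astral.length;
const high = astral.charCodeAt(0).toString(16); // high surrogate
const low = astral.charCodeAt(1).toString(16);  // low surrogate

// Hypothetical helper (not from the answer): derive the pair by hand.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const hi = 0xd800 + (offset >> 10);   // top 10 bits
  const lo = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [hi, lo];
}

// U+2A6D6 is the end of the Extension B range listed above.
const endOfExtB = toSurrogatePair(0x2a6d6).map(n => n.toString(16));
console.log(length, high, low, endOfExtB); // 2 d840 dc00 [ 'd869', 'ded6' ]
```

The computed values line up with the bounds of the Extension B expression: it starts at \ud840\udc00 and ends at \ud869\uded6.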
While it would probably be easier for replacement purposes to express this as a pattern for the non-Chinese characters (to replace them with the empty string), I provided the expression for the Chinese characters instead so that it would be easier to track in case you needed to add or remove blocks.
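One way to use the positive expression without rewriting it as its complement is to collect every match and join the results. The following is only a sketch: the "g" flag and the helper name `keepChinese` are additions for illustration and do not appear in the original answer.

```javascript
// The answer's pattern, with the "g" flag added so that match() returns every hit.
const chineseChars = /[\u4E00-\u9FCC\u3400-\u4DB5\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]|[\ud840-\ud868][\udc00-\udfff]|\ud869[\udc00-\uded6\udf00-\udfff]|[\ud86a-\ud86c][\udc00-\udfff]|\ud86d[\udc00-\udf34\udf40-\udfff]|\ud86e[\udc00-\udc1d]/g;

// Hypothetical helper name: keeps only the characters matched above.
function keepChinese(text) {
  // match() returns null when nothing matches, hence the fallback to [].
  return (text.match(chineseChars) || []).join("");
}

const cleaned = keepChinese("我的中文不好。我是意大利人。你知道吗?");
console.log(cleaned); // 我的中文不好我是意大利人你知道吗
```

Because the alternation consumes a whole surrogate pair at a time, astral ideographs come through intact rather than being split into two halves.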
Update September 2017
As of ES6, one may express the regular expressions without resorting to surrogates by using the "u" flag along with the code point inside the new escape sequence with brackets, e.g., /^[\u{20000}-\u{2A6D6}]*$/u for "CJK Unified Ideographs Extension B".
Note that Unicode too has progressed to include "CJK Unified Ideographs Extension E" ([\u{2B820}-\u{2CEAF}]) and "CJK Unified Ideographs Extension F" ([\u{2CEB0}-\u{2EBEF}]).
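As a quick sanity check, the surrogate-pair form and the code point form of the Extension B range should agree on the same inputs. This is a small sketch; the variable names and the sample strings are chosen here for illustration.

```javascript
// Extension B written with surrogate pairs (pre-ES6 style)...
const extBSurrogates = /^(?:[\ud840-\ud868][\udc00-\udfff]|\ud869[\udc00-\uded6])*$/;
// ...and with code point escapes plus the "u" flag.
const extBCodePoints = /^[\u{20000}-\u{2A6D6}]*$/u;

const samples = [
  String.fromCodePoint(0x20000), // first code point of Extension B
  String.fromCodePoint(0x2a6d6), // last code point listed in the answer
  String.fromCodePoint(0x2a6d7), // just past the listed range
  "中",                          // BMP ideograph, not in Extension B
];

// Each entry is [surrogate form result, code point form result].
const results = samples.map(s => [extBSurrogates.test(s), extBCodePoints.test(s)]);
console.log(results);
```

Both forms accept the two in-range samples and reject the other two; the code point version is simply easier to read and to check against the Unicode charts.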
For ES2018, it appears that Unicode property escapes will be able to simplify things even further. (As finalized, ES2018 accepts only General_Category, Script, Script_Extensions, and binary properties inside \p{...}, so engines reject the Block-based patterns below with a SyntaxError.) Per http://2ality.com/2017/07/regexp-unicode-property-escapes.html, it looked like one would be able to do:
/^(\p{Block=CJK Unified Ideographs}|\p{Block=CJK Unified Ideographs Extension A}|\p{Block=CJK Unified Ideographs Extension B}|\p{Block=CJK Unified Ideographs Extension C}|\p{Block=CJK Unified Ideographs Extension D}|\p{Block=CJK Unified Ideographs Extension E}|\p{Block=CJK Unified Ideographs Extension F}|[\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29])+$/u
Since the shorter aliases from http://unicode.org/Public/UNIDATA/PropertyAliases.txt and http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt can also be used for these blocks, you could shorten this to the following (apparently you can also change underscores to spaces, or change the casing, if desired):
/^(\p{Blk=CJK}|\p{Blk=CJK_Ext_A}|\p{Blk=CJK_Ext_B}|\p{Blk=CJK_Ext_C}|\p{Blk=CJK_Ext_D}|\p{Blk=CJK_Ext_E}|\p{Blk=CJK_Ext_F}|[\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29])+$/u
And if we wanted to improve readability, we could document the falsely labeled compatibility characters using named capture groups (see http://2ality.com/2017/05/regexp-named-capture-groups.html):
/^(\p{Blk=CJK}|\p{Blk=CJK_Ext_A}|\p{Blk=CJK_Ext_B}|\p{Blk=CJK_Ext_C}|\p{Blk=CJK_Ext_D}|\p{Blk=CJK_Ext_E}|\p{Blk=CJK_Ext_F}|(?<CJKFalseCompatibilityUnifieds>[\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]))+$/u
And since, per http://unicode.org/reports/tr44/#Unified_Ideograph, the "Unified_Ideograph" property (alias "UIdeo") appears to cover all of our unified ideographs while excluding symbols/punctuation and compatibility characters, the following may be all you need if you don't have to pick and choose among the blocks above (JavaScript writes binary properties without a "=yes" value):
/^\p{Unified_Ideograph}*$/u
or in shorthand:
/^\p{UIdeo}*$/u
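For stripping rather than validating, the negated escape \P{...} (capital P) can be used to delete everything that is not a unified ideograph. This is a sketch that assumes an engine with ES2018 property escapes and uses the bare binary-property form \p{Unified_Ideograph}; the helper name `stripNonIdeographs` is hypothetical.

```javascript
// \P{...} (capital P) is the negation: anything that is NOT a unified ideograph.
const notUnifiedIdeograph = /\P{Unified_Ideograph}/gu;

// Hypothetical helper name, not from the answer.
function stripNonIdeographs(text) {
  return text.replace(notUnifiedIdeograph, "");
}

const onlyIdeographs = stripNonIdeographs("hello! 42 我的中文不好。我是意大利人。");
console.log(onlyIdeographs); // 我的中文不好我是意大利人

// The anchored form validates a whole string instead of filtering it.
const isAllIdeographs = s => /^\p{Unified_Ideograph}*$/u.test(s);
console.log(isAllIdeographs("中文"), isAllIdeographs("中文。")); // true false
```

This is the "express it as the non-Chinese characters" approach mentioned earlier in the answer, with the property escape doing the bookkeeping of the block ranges.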
Answered by T.J. Crowder
There's no shortcut. You'll have to construct an expression with either the character class(es) you want to retain or the character classes you want to remove, and then process that.
The Unicode Consortium provides code charts (with an index), such as the PDF for CJK Symbols and Punctuation, for the various ranges defined by the standard. Since they frequently have long runs of contiguous code points, you can put them in a character class relatively easily.
Answered by jdunning
As of Chrome 64, Firefox 79, and Safari 11.1, the simplest regex to test whether a character is Chinese is /\p{Script=Han}/u. The \p specifies a Unicode property escape, and Script=Han matches any character whose script property is Han (Chinese).
So you could keep just the Chinese characters in a string like this:
console.log(
  // Spread iterates by code point, so surrogate pairs stay intact (split("") would break them apart).
  [..."hello! 42 我的中文不好。我是意大利人。你知道吗?"]
    .filter(char => /\p{Script=Han}/u.test(char))
    .join("")
);
Answered by tutturu
Rather than inventing your own solution, you could probably use the unicode-data module (one of the modules generated by it, to be precise), which is essentially a JavaScript interface to the UnicodeData.txt database (akin to the unicodedata standard module in Python, if that rings a bell).
Answered by Twifty
A copy-and-paste solution. It uses ES6's unicode flag and covers all current extensions, up to Extension F, and the ideographs.
const character_xp = new RegExp(String.raw`
[\u{FA0E}\u{FA0F}\u{FA11}\u{FA13}\u{FA14}\u{FA1F}\u{FA21}\u{FA23}\u{FA24}\u{FA27}-\u{FA29}]
|[\u{4E00}-\u{9FCC}]
|[\u{3400}-\u{4DB5}]
|[\u{20000}-\u{2A6D6}]
|[\u{2A700}-\u{2B734}]
|[\u{2B740}-\u{2B81D}]
|[\u{2B820}-\u{2CEAF}]
|[\u{2CEB0}-\u{2EBEF}]
`.replace(/\s+/g, ''), "u")
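Because the "u" flag lets every range be written as a plain code point range, the same ranges can also be folded into one negated character class, which makes stripping a single replace call. This variant is a sketch and not part of the original answer: the name `non_character_xp` and the added "g" flag are assumptions made for illustration.

```javascript
// Same ranges as above, collapsed into one negated class ([^...]).
// Whitespace is stripped before compiling, as in the answer's snippet.
const non_character_xp = new RegExp(String.raw`[^
  \u{FA0E}\u{FA0F}\u{FA11}\u{FA13}\u{FA14}\u{FA1F}\u{FA21}\u{FA23}\u{FA24}\u{FA27}-\u{FA29}
  \u{4E00}-\u{9FCC}
  \u{3400}-\u{4DB5}
  \u{20000}-\u{2A6D6}
  \u{2A700}-\u{2B734}
  \u{2B740}-\u{2B81D}
  \u{2B820}-\u{2CEAF}
  \u{2CEB0}-\u{2EBEF}
]`.replace(/\s+/g, ""), "gu");

// Replace everything outside the listed ranges with the empty string.
const stripped = "我的中文不好。I am Italian! 你知道吗?".replace(non_character_xp, "");
console.log(stripped); // 我的中文不好你知道吗
```

Without the "u" flag this would not work, since a character class cannot contain a surrogate pair as a single unit.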