jQuery regular expression with the Cyrillic alphabet
Original URL: http://stackoverflow.com/questions/18471159/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Regular expression with the Cyrillic alphabet
Asked by Jiří Valoušek
I have a jQuery function that counts words in a textarea field. It also excludes all words enclosed in [[[triple brackets]]]. It works well with Latin characters, but it fails on Cyrillic sentences. I suspect the error is in the regular expression part:
$(field).val().replace(/\[\[\[[^\]]*\]\]\]/g, '').match(/\b/g);
Example with both kinds of phrases: http://jsfiddle.net/A3cEG/2/
I need to count all words, including Cyrillic ones, not only Latin words. How can I do that?
Answered by p.s.w.g
JavaScript (at least the versions most widely used) does not fully support Unicode. That is to say, \w matches only Latin letters, decimal digits, and underscores ([a-zA-Z0-9_]), and \b matches the boundary between a word character and a non-word character. Cyrillic letters are therefore not word characters, and \b finds no boundaries around them.
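A small check illustrates this behavior. The sample strings below are made up for illustration and are not taken from the question.

```javascript
// Sample inputs (hypothetical, not from the original question).
const latin = 'hello world';
const cyrillic = 'привет мир';

// \b finds a boundary at the start and end of each Latin word.
const latinBoundaries = latin.match(/\b/g);
// No character in the Cyrillic string counts as \w, so there are no boundaries.
const cyrillicBoundaries = cyrillic.match(/\b/g);

console.log(latinBoundaries.length); // 4
console.log(cyrillicBoundaries);     // null
console.log(/\w/.test('я'));         // false
```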
To find all words in an input string using Latin or Cyrillic, you'd have to do something like this:
.match(/[\wа-я]+/ig); // where а is the Cyrillic а.
Or if you prefer:
.match(/[\w\u0430-\u044f]+/ig);
Of course, this probably means you need to tweak your code a little, since this pattern matches whole words rather than word boundaries. Note that [а-я] matches any letter in the 'basic Cyrillic alphabet' (the range U+0430 to U+044F used above). To match letters outside this range, extend the character set as needed; for example, to also match the Russian Ё/ё, use [а-яё].
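Putting the pieces together, a word counter along these lines might look as follows. This is a minimal sketch, not the asker's actual function: the name countWords is hypothetical, and the closing bracket is escaped in the bracket-stripping pattern.

```javascript
// Hypothetical helper; the name countWords is not from the original post.
function countWords(text) {
  // Drop everything enclosed in [[[triple brackets]]].
  const stripped = text.replace(/\[{3}[^\]]*\]{3}/g, '');
  // Match runs of Latin word characters or basic Cyrillic letters (plus ё).
  // The i flag extends the Cyrillic range to uppercase letters.
  const words = stripped.match(/[\wа-яё]+/gi);
  // match() returns null when nothing matches, so fall back to 0.
  return words ? words.length : 0;
}

console.log(countWords('Hello [[[skip me]]] мир')); // 2
```

With jQuery, this could be called as countWords($(field).val()).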
Also note that your triple-bracket pattern can be simplified to:
.replace(/\[{3}[^\]]*\]{3}/g, '') // ] must be escaped: in JavaScript, [^] alone matches any character
Alternatively, you might want to look at the XRegExp project (an open-source project that adds new features to the base JavaScript regular expression engine) and its Unicode addon.
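Newer JavaScript engines (ES2018 and later) also accept Unicode property escapes natively when the u flag is set, which may make an addon unnecessary. A minimal sketch, assuming such an engine is available; the function name countUnicodeWords is hypothetical and the pattern is one possible choice rather than the answer's own:

```javascript
// Assumes an ES2018+ engine; \p{...} with the u flag fails to parse in older ones.
function countUnicodeWords(text) {
  // Drop everything enclosed in [[[triple brackets]]].
  const stripped = text.replace(/\[{3}[^\]]*\]{3}/g, '');
  // \p{L} matches a letter in any script, \p{N} a number in any script.
  const words = stripped.match(/[\p{L}\p{N}_]+/gu);
  return words ? words.length : 0;
}

console.log(countUnicodeWords('Hello [[[skip]]] мир ёж')); // 3
```

Unlike an explicit range such as [а-яё], this also covers letters used in other Cyrillic-based alphabets.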
Answered by Dubaua
Beware of using a range of Cyrillic letters; it may include unwanted characters. Here is a bulletproof regexp that contains only Cyrillic letters:
/^[аАбБвВгГдДеЕёЁжЖзЗиИйЙкКлЛмМнНоОпПрРсСтТуУфФхХцЦчЧшШщЩъЪыЫьЬэЭюЮяЯ]+$/
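Because of the ^ and $ anchors, this pattern checks whether a whole string consists of the listed letters; it does not find words inside a longer text. A minimal sketch of both uses, with the same letter list; the names RUSSIAN_LETTERS, isRussianWord and findRussianWords are hypothetical:

```javascript
// The explicit letter list from the pattern above.
const RUSSIAN_LETTERS = 'аАбБвВгГдДеЕёЁжЖзЗиИйЙкКлЛмМнНоОпПрРсСтТуУфФхХцЦчЧшШщЩъЪыЫьЬэЭюЮяЯ';

// Anchored: true only if the whole string is made of the listed letters.
const wholeString = new RegExp('^[' + RUSSIAN_LETTERS + ']+$');
// Unanchored with the g flag: finds each run of listed letters in a text.
const eachWord = new RegExp('[' + RUSSIAN_LETTERS + ']+', 'g');

function isRussianWord(s) {
  return wholeString.test(s);
}

function findRussianWords(text) {
  // match() returns null when nothing matches, so fall back to an empty array.
  return text.match(eachWord) || [];
}

console.log(isRussianWord('Привет'));                         // true
console.log(findRussianWords('Привет, мир! hello').length);   // 2
```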