Original question: http://stackoverflow.com/questions/10590098/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverflow
Javascript RegExp + Word boundaries + unicode characters
Asked by user1394520
I am building a search and I am going to use JavaScript autocomplete with it. I am from Finland (Finnish language), so I have to deal with some special characters like ä, ö and å.
When the user types text into the search input field, I try to match the text against the data.
Here is a simple example that does not work correctly if the user types, for example, "ää". The same happens with "äl".
var title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";
// Does not work
var searchterm = "äl";
// Does not work
//var searchterm = "ää";
// Works
//var searchterm = "wi";
// "\\b" in the string literal becomes \b (word boundary) in the pattern
if ( new RegExp("\\b"+searchterm, "gi").test(title) ) {
    $("#result").html("Match: ("+searchterm+"): "+title);
} else {
    $("#result").html("nothing found with term: "+searchterm);
}
So how can I get the ä, ö and å characters to work with a JavaScript regex?
I think I should use Unicode codes, but how should I do that? The codes for those characters are: [\u00C4,\u00E4,\u00C5,\u00E5,\u00D6,\u00F6]
=> ÄäÅåÖö
Answered by mowwwalker
There appears to be a problem with regex and the word boundary \b matching the beginning of a string whose first character is outside the ASCII range.
Instead of using \b, try using (?:^|\\s)
var title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";
// Did not work with \b
var searchterm = "äl";
// Did not work with \b
//var searchterm = "ää";
// Works
//var searchterm = "wi";
// "\\s" in the string literal becomes \s (whitespace) in the pattern
if ( new RegExp("(?:^|\\s)"+searchterm, "gi").test(title) ) {
    $("#result").html("Match: ("+searchterm+"): "+title);
} else {
    $("#result").html("nothing found with term: "+searchterm);
}
Breakdown:
(?: Parentheses () form a capture group in regex. Parentheses that start with a question mark and colon, (?:, form a non-capturing group. They just group the terms together.
^ The caret symbol matches the beginning of a string.
| The bar is the "or" operator.
\s Matches whitespace (it appears as \\s in the string because we have to escape the backslash).
) Closes the group.
So instead of using \b, which matches word boundaries and doesn't work for Unicode characters, we use a non-capturing group that matches the beginning of the string OR whitespace.
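The whitespace-based pattern above can be wrapped in a small helper. This is only a sketch: matchesAtWordStart and sampleTitle are hypothetical names, and the helper assumes the search term contains no regex metacharacters.

```javascript
// Hypothetical helper: true if `term` starts a whitespace-delimited word in `text`.
// Assumes `term` contains no regex metacharacters.
function matchesAtWordStart(text, term) {
  // "\\s" in the string literal becomes \s in the pattern.
  var re = new RegExp("(?:^|\\s)" + term, "i");
  return re.test(text);
}

var sampleTitle = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";
console.log(matchesAtWordStart(sampleTitle, "äl")); // true
console.log(matchesAtWordStart(sampleTitle, "ää")); // true
console.log(matchesAtWordStart(sampleTitle, "wi")); // true
console.log(matchesAtWordStart(sampleTitle, "lk")); // false, "lk" only occurs mid-word
```

Note that the group consumes the leading whitespace character, so the matched text starts one character before the term unless the term is at the beginning of the string.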
Answered by Noah Freitas
The \b assertion in JavaScript RegEx is really only useful with plain ASCII text. \b is a shortcut for the boundary between the \w and \W sets, or between \w and the beginning or end of the string. These character sets only take into account ASCII "word" characters: \w is equal to [a-zA-Z0-9_] and \W is the negation of that class.
This makes the RegEx character classes largely useless for dealing with any real language.
\s should work for what you want to do, provided that search terms are only delimited by whitespace.
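A quick way to see the ASCII-only behaviour described above is to test the classes directly. This is a small illustration; isAsciiWordChar is a hypothetical helper name.

```javascript
// \w only covers [a-zA-Z0-9_], so "ä" counts as a non-word character.
function isAsciiWordChar(ch) {
  return /^\w$/.test(ch);
}

console.log(isAsciiWordChar("a")); // true
console.log(isAsciiWordChar("ä")); // false

// A space and "ä" are both \W, so there is no \b boundary between them.
console.log(/\bäl/i.test("on älkää")); // false
// An ASCII word start still works.
console.log(/\bwi/i.test("word with")); // true
// With whitespace as the delimiter, the Finnish word is found.
console.log(/(?:^|\s)äl/i.test("on älkää")); // true
```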
Answered by max masetti
This question is old, but I think I found a better solution for word boundaries in regular expressions with Unicode letters. Using the XRegExp library, you can implement a valid \b boundary by expanding this:
XRegExp('(?=^|$|[^\p{L}])')
The result is 4000+ characters long, but it seems to perform quite well.
Some explanation: (?= ) is a zero-length lookahead that looks for the beginning or end of the string or a non-letter Unicode character. The most important thing is the lookahead, because \b doesn't capture anything: it is simply true or false.
Answered by micnic
I would recommend using XRegExp when you have to work with a specific set of Unicode characters. The author of this library has mapped all kinds of regional character sets, which makes working with different languages easier.
Answered by andrefs
\b is a shortcut for the transition between a letter and a non-letter character, or vice versa.
Updating and improving on max_masseti's answer:
With Unicode property escapes, introduced in ES2018 and enabled by the /u flag, you can now use \p{L} to represent any Unicode letter, and \P{L} (notice the uppercase P) to represent anything but a letter.
EDIT: Previous version was incomplete.
As such:
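const text = 'A Fé, o Império, e as terras viciosas';
// The u flag is required for \p{L} and \P{L} to be treated as property escapes
text.split(/(?<=\p{L})(?=\P{L})|(?<=\P{L})(?=\p{L})/u);
// ['A', ' ', 'Fé', ', ', 'o', ' ', 'Império', ', ', 'e', ' ', 'as', ' ', 'terras', ' ', 'viciosas']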
We're using a lookbehind (?<=...) to find a letter and a lookahead (?=...) to find a non-letter, or vice versa.
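The lookaround pair above needs a character on both sides, so it never fires at the very start or end of the string. One possible variation, shown here as a sketch, uses negative lookarounds so those edges count as boundaries too. UNICODE_BOUNDARY and markBoundaries are hypothetical names, and unlike \b this version treats only letters (not digits or underscores) as word characters.

```javascript
// Letter-based boundary: a position with a letter on exactly one side.
// Negative lookarounds also succeed at the start and end of the string.
var UNICODE_BOUNDARY = "(?<!\\p{L})(?=\\p{L})|(?<=\\p{L})(?!\\p{L})";

// Inserts "|" at every boundary so the positions are visible.
function markBoundaries(text) {
  return text.replace(new RegExp(UNICODE_BOUNDARY, "gu"), "|");
}

console.log(markBoundaries("on älkää")); // "|on| |älkää|"
console.log(markBoundaries("Fé,"));      // "|Fé|,"
```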
Answered by apsillers
I noticed something really weird with \b when using Unicode:
/\bo/.test("pop"); // false (obviously)
/\bä/.test("päp"); // true (what..?)
/\Bo/.test("pop"); // true
/\Bä/.test("päp"); // false (what..?)
It appears that the meanings of \b and \B are reversed, but only when used with non-ASCII Unicode? There might be something deeper going on here, but I'm not sure what it is.
In any case, it seems that the word boundary is the issue, not the Unicode characters themselves. Perhaps you should just replace \b with (^|[\s\\/-_&]), as that seems to work correctly. (Make your list of symbols more comprehensive than mine, though.)
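The apparent reversal follows from \w being ASCII-only: "p" is a word character and "ä" is not, so a boundary sits between them. The sketch below also checks the suggested character class, where "/-_" is read as a range; rangeClass and literalClass are hypothetical names, and moving the hyphen to the end is one way to keep it literal.

```javascript
// "p" is a \w character and "ä" is not, so \b sees a boundary between them.
console.log(/\w/.test("p"));   // true
console.log(/\w/.test("ä"));   // false
console.log(/\bä/.test("päp")); // true

// Inside a class, "/-_" is a range from "/" (U+002F) to "_" (U+005F),
// which includes digits and uppercase letters.
var rangeClass = /[\s\\/-_&]/;
console.log(rangeClass.test("A")); // true

// Putting the hyphen last keeps it literal.
var literalClass = /[\s\\/_&-]/;
console.log(literalClass.test("A")); // false
console.log(literalClass.test("-")); // true
```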
Answered by Ed.
What you are looking for is the Unicode word boundaries standard:
http://unicode.org/reports/tr29/tr29-9.html#Word_Boundaries
There is a JavaScript implementation here (unicodejs.wordbreak.js)
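Current JavaScript runtimes also expose word segmentation through the built-in Intl.Segmenter, which can serve the same purpose without an extra library. The sketch below assumes a runtime that provides Intl.Segmenter; wordsOf and anyWordStartsWith are hypothetical helper names, not part of the library mentioned above.

```javascript
// Assumes a runtime that provides Intl.Segmenter.
// Returns the word-like segments of `text` for the given locale.
function wordsOf(text, locale) {
  var segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  var words = [];
  for (var part of segmenter.segment(text)) {
    // isWordLike is false for spaces and punctuation.
    if (part.isWordLike) words.push(part.segment);
  }
  return words;
}

// True if any word in `text` starts with `term`, ignoring case.
function anyWordStartsWith(text, term, locale) {
  var needle = term.toLowerCase();
  return wordsOf(text, locale).some(function (word) {
    return word.toLowerCase().startsWith(needle);
  });
}

console.log(wordsOf("tämä on ääkköstesti, älkää ihmetelkö", "fi"));
console.log(anyWordStartsWith("tämä on ääkköstesti", "ää", "fi")); // true
console.log(anyWordStartsWith("tämä on ääkköstesti", "kk", "fi")); // false
```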
Answered by Heitor Chang
My idea is to search with codes representing the Finnish letters
new RegExp("\\b"+asciiOnly(searchterm), "gi").test(asciiOnly(title))
My original idea was to use plain encodeURI but the % sign seemed to interfere with the regexp.
I wrote a crude function using encodeURI to encode every character with a code over 128, removing its % signs and adding 'QQ' at the beginning. It is not the best marker, but I couldn't get a non-alphanumeric one to work.
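The function itself is not shown above, so the following is only a guess at what such an asciiOnly could look like, based on the description: non-ASCII characters are percent-encoded with encodeURI, the % signs are dropped, and "QQ" is prefixed. It assumes well-formed strings and text that does not already contain the "QQ" marker.

```javascript
// Sketch of asciiOnly as described: replace each non-ASCII character with
// "QQ" followed by its percent-encoded UTF-8 bytes, minus the "%" signs.
function asciiOnly(s) {
  return Array.from(s, function (ch) {
    if (ch.codePointAt(0) < 128) return ch;
    return "QQ" + encodeURI(ch).replace(/%/g, "");
  }).join("");
}

var finnishTitle = "tämä on ääkköstesti älkää ihmetelkö";
console.log(asciiOnly("äl")); // "QQC3A4l"
console.log(new RegExp("\\b" + asciiOnly("äl"), "gi").test(asciiOnly(finnishTitle))); // true
console.log(new RegExp("\\b" + asciiOnly("lk"), "gi").test(asciiOnly(finnishTitle))); // false
```

Because the encoded form consists only of \w characters, \b behaves as it does for ASCII words. One limitation of this sketch: the i flag no longer equates "Ä" with "ä", since the two encode to different byte sequences.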
Answered by Antonín Slejška
I had a similar problem, but I had to replace an array of terms. None of the solutions I found worked if two terms were next to each other in the text (because their boundaries overlapped). So I had to use a slightly modified approach:
var text = "Ještě. že; \"už\" à. Fürs, 'anlässlich' že že že.";
var terms = ["à","anlässlich","Fürs","už","Ještě", "že"];
var replaced = [];
var order = 0;
// Wrap each term in explicit delimiter groups instead of \b
for (var i = 0; i < terms.length; i++) {
    terms[i] = "(^\|[ \n\r\t.,;'\"\+!?-])(" + terms[i] + ")([ \n\r\t.,;'\"\+!?-]+\|$)";
}
var re = new RegExp(terms.join("|"), "");
// Replace one match per pass so that adjacent terms can share a delimiter
while (true) {
    var replacedString = "";
    text = text.replace(re, function replacer(match){
        var beginning = match.match("^[ \n\r\t.,;'\"\+!?-]+");
        if (beginning == null) beginning = "";
        var ending = match.match("[ \n\r\t.,;'\"\+!?-]+$");
        if (ending == null) ending = "";
        replacedString = match.replace(beginning,"");
        replacedString = replacedString.replace(ending,"");
        replaced.push(replacedString);
        return beginning+"{{"+order+"}}"+ending;
    });
    if (replacedString == "") break;
    order += 1;
}
See the code in a fiddle: http://jsfiddle.net/antoninslejska/bvbLpdos/1/
The regular expression is inspired by: http://breakthebit.org/post/3446894238/word-boundaries-in-javascripts-regular
I can't say that I find the solution elegant...
Answered by Manthos
The correct answer to the question is given by andrefs. I will only rewrite it more clearly, after putting all the required pieces together.
For ASCII text, you can use \b to match a word boundary both at the start and at the end of a pattern. When using Unicode text, you need to use 2 different patterns to do the same:
- Use (?<=^|\P{L}) to match the start of the string or a word boundary before the main pattern.
- Use (?=\P{L}|$) to match the end of the string or a word boundary after the main pattern.
- Additionally, use (?i) at the beginning of everything, to make all of the matching case-insensitive. (JavaScript does not accept a leading (?i); pass the i flag instead, together with the u flag that \P{L} requires.)
So the resulting answer is: (?i)(?<=^|\P{L})xxx(?=\P{L}|$), where xxx is your main pattern. This would be the equivalent of (?i)\bxxx\b for ASCII text.
For your code to work, you now need to do the following:
- Assign the pattern or words you want to find to your variable "searchterm".
- Escape the variable's contents. For example, replace '\' with '\\', and do the same for any reserved special character of regex, like '\^', '\$', '\/', etc. Check here for a question on how to do this.
- Insert the variable's contents into the pattern above, in place of "xxx", by simply using the string.replace() method.
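Put together, the steps above might look like the sketch below. escapeRegExp and wholeWordRegExp are hypothetical helper names; the term is concatenated into the pattern rather than substituted with string.replace(), and the i and u flags stand in for the leading (?i), which JavaScript's RegExp does not accept.

```javascript
// Escape regex metacharacters so the term is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a case-insensitive, Unicode-aware whole-word pattern for `term`.
function wholeWordRegExp(term) {
  return new RegExp("(?<=^|\\P{L})" + escapeRegExp(term) + "(?=\\P{L}|$)", "iu");
}

var sentence = "tämä on ääkköstesti älkää ihmetelkö";
console.log(wholeWordRegExp("älkää").test(sentence)); // true
console.log(wholeWordRegExp("ÄLKÄÄ").test(sentence)); // true, thanks to the i flag
console.log(wholeWordRegExp("äl").test(sentence));    // false, "äl" is not a whole word
console.log(wholeWordRegExp("a.b").test("axb"));      // false, the dot is escaped
```

For prefix matching as in the original autocomplete question, dropping the trailing (?=\P{L}|$) lookahead would let "äl" match the start of "älkää".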