Original question: http://stackoverflow.com/questions/4043307/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Why is this regex not working for German words?
Asked by Rakesh Juyal
I am trying to break the following sentence into words and wrap each word in a span.
<p class="german_p big">Das ist ein schönes Armband</p>
I followed this: How to get a word under cursor using JavaScript?
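// Wrap every word of each paragraph in its own <span>.
$('p').each(function() {
    var $this = $(this);
    // $1 re-inserts the word captured by (\w+).
    $this.html($this.text().replace(/\b(\w+)\b/g, "<span>$1</span>"));
});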
The only problem I am facing is that, after wrapping the words in spans, the resulting HTML looks like this:
<p class="german_p big"><span>Das</span> <span>ist</span> <span>ein</span> <span>sch</span>ö<span>nes</span> <span>Armband</span>.</p>
So schönes is broken into three pieces: sch, ö and nes. Why is this happening? What would be the correct regex for this?
Accepted answer by kijin
\w only matches A-Z, a-z, 0-9, and _ (underscore).
You could use something like \S+ to match all non-space characters, including non-ASCII characters like ö. This might or might not work depending on how the rest of your string is formatted.
Reference: http://www.javascriptkit.com/javatutors/redev2.shtml
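The difference between the two patterns can be checked directly. The following is a minimal sketch, not part of the original answer; wrapNonSpace is a hypothetical helper name, and the sentence is the one from the question.

```javascript
// \w is ASCII-only, so \b reports a word boundary on both sides of "ö".
const sentence = "Das ist ein schönes Armband";

// Pattern from the question: "schönes" is split around the ö.
const asciiWords = sentence.match(/\b\w+\b/g);

// \S+ takes every run of non-whitespace characters instead.
const nonSpaceRuns = sentence.match(/\S+/g);

// Same wrapping as in the question, using \S+ and $& (the whole match).
function wrapNonSpace(text) {
  return text.replace(/\S+/g, "<span>$&</span>");
}

console.log(asciiWords);
console.log(nonSpaceRuns);
console.log(wrapNonSpace(sentence));
```

Note that \S+ also swallows punctuation attached to a word, so "Armband." would end up inside the span as a whole. That is the "depending on how the rest of your string is formatted" caveat in practice.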
Answer by tchrist
Unicode in Javascript Regexen
Like Java itself, Javascript doesn't support Unicode in its \w, \d, and \b regex shortcuts. This is (arguably) a bug in Java and Javascript. Even if one manages through casuistry or obstinacy to argue that it is not a bug, it's sure a big gotcha. Kinda bites, really.
The problem is that those popular regex shortcuts only apply to 7-bit ASCII, whether in Java or in Javascript. This restriction is painfully 1970s-ish; it makes absolutely no sense in the 21st century. This blog posting from this past March makes a good argument for fixing this problem in Javascript.
It would be really nice if some public-spirited soul would please add Javascript to this Wikipedia page that compares the supported regex features in various languages.
This page says that Javascript doesn't support any Unicode properties at all. That same site has a table that's a lot more detailed than the Wikipedia page I mention above. For Javascript features, look under its ECMA column.
However, that table is in some cases at least five years out of date, so I can't completely vouch for it. It's a good start, though.
Unicode Support in Other Languages
Ruby, Python, Perl, and PCRE all offer ways to extend \w to mean what it is supposed to mean, but the two J-thingies do not.
In Java, however, there is a good workaround available. There, you can use \pL to mean any character that has the Unicode General_Category=Letter property. That means you can always emulate a proper \w using [\pL\p{Nd}_].
Indeed, there's even an advantage to writing it that way, because it keeps you aware that you're adding decimal numbers and the underscore character to the character class. With a simple \w, people sometimes forget this is going on.
I don't believe that this workaround is available in Javascript, though. Unicode properties like these can also be used in Perl and PCRE, and in Ruby 1.9, but not in Python.
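Engines that implement ECMAScript 2018 do accept \p{...} property escapes when the regex carries the u flag, which makes the Java-style emulation of \w available in JavaScript as well. A minimal sketch, assuming such an engine; wrapUnicodeWords is a hypothetical helper name:

```javascript
// Assumes ES2018 Unicode property escapes; the "u" flag is required for \p{...}.
// [\p{L}\p{Nd}_] mirrors the Java emulation of \w: letters, decimal digits, underscore.
const unicodeWord = /[\p{L}\p{Nd}_]+/gu;

// Wrap each Unicode-aware "word" in a span; $& stands for the whole match.
function wrapUnicodeWords(text) {
  return text.replace(unicodeWord, "<span>$&</span>");
}

console.log(wrapUnicodeWords("Das ist ein schönes Armband."));
```

Because the character class is matched greedily, no \b anchors are needed, and punctuation such as the trailing period stays outside the spans. Text that stores ö as o plus a combining diaeresis would still be split, since combining marks are not in this class.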
The only Unicode properties current Java supports are the one- and two-character general properties like \pN and \p{Lu} and the block properties like \p{InAncientSymbols}, but not scripts like \p{IsGreek}, etc.
The future JDK7 will finally get around to adding scripts. Even then Java still won't support most of the Unicode properties, though, not even critical ones like \p{WhiteSpace} or handy ones like \p{Dash} and \p{Quotation_Mark}.
SIGH! To understand just how limited Java's property support is, merely compare it with Perl. Perl supports 1633 Unicode properties as of 2007's 5.10 release, and 2478 of them as of this year's 5.12 release. I haven't counted them for ancient releases, but Perl started supporting Unicode properties back during the last millennium.
Lame as Java is, it's still better than Javascript, because Javascript doesn't support any Unicode properties whatsoCENSOREDever. I'm afraid that Javascript's paltry 7-bit mindset makes it pretty close to unusable for Unicode. This is a tremendously huge gaping hole in the language that's extremely difficult to account for given its target domain.
Sorry 'bout that.
Answer by hqx5
To include all the Latin-1 Supplement letters, such as ü and ò, you can use:
[\w\u00C0-\u00ff]
However, there are even more unusual characters in the Latin Extended-A and Latin Extended-B Unicode blocks. To include those as well, you can use:
[\w\u00C0-\u024f]
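Applied to the question's wrapping problem, the wider class might be used as follows. This is a sketch under the assumption that the text only contains Latin-script letters; wrapLatinWords is a hypothetical helper name.

```javascript
// ASCII \w plus everything from Latin-1 Supplement letters through Latin Extended-B.
// The \b anchors are dropped on purpose: \b is still ASCII-only and would fail
// next to a leading or trailing accented letter such as the Ä in "Ärger".
var latinWord = /[\w\u00C0-\u024f]+/g;

// Wrap each run of matching characters in a span; $& is the whole match.
function wrapLatinWords(text) {
  return text.replace(latinWord, "<span>$&</span>");
}

console.log(wrapLatinWords("Das ist ein schönes Armband."));
console.log(wrapLatinWords("Ärger über Öl"));
```

One caveat: the range \u00C0-\u00ff also contains the multiplication sign (U+00D7) and the division sign (U+00F7), so those two symbols are treated as word characters here.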
Answer by XViD
You can also use
/\b([äöüÄÖÜß\w]+)\b/g
instead of
/\b(\w+)\b/g
in order to handle the umlauts.
Answer by Wooble
\w and \b are not Unicode-aware in JavaScript; they only match ASCII word/boundary characters. If your use cases all allow splitting on whitespace, you can use \s/\S, which are Unicode-aware.
Answer by annakata
As others note, the \w shortcut is not very useful for non-Latin character sets. If you need to match other text ranges, you should use hex* notation (Ref1) (Ref2) for the appropriate range.
* Could be hex, octal, or Unicode escapes; you'll often see these collectively referred to as hex notation.
Answer by Dave
The \b's will also not work correctly. It is possible to use the XRegExp library's \p{L} token for Unicode support; however, there is still no \b support, so you won't be able to find the word boundaries. It would be nice to provide \b support by doing lookbehinds/lookaheads with \P{L}, using the following implementation:
http://blog.stevenlevithan.com/archives/mimic-lookbehind-javascript
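The lookaround idea can be sketched without any library on engines that natively support ES2018 lookbehind and \p{...} escapes. That native support is an assumption here, and this is not the mimicked lookbehind from the linked post; WORD_CHAR and wholeWord are illustrative names.

```javascript
// Assumes native ES2018 lookbehind and Unicode property escapes ("u" flag required).
// WORD_CHAR is the Unicode-aware counterpart of \w: letters, decimal digits, underscore.
const WORD_CHAR = "[\\p{L}\\p{Nd}_]";

// Build a regex matching `word` only when no word character touches it on
// either side, i.e. a Unicode-aware stand-in for \bword\b.
function wholeWord(word) {
  // Escape regex syntax characters so the word is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(`(?<!${WORD_CHAR})${escaped}(?!${WORD_CHAR})`, "gu");
}

const text = "über überall Fieber";
console.log(text.match(wholeWord("über"))); // Unicode-aware boundaries
console.log(text.match(/\büber\b/g));       // ASCII-only \b for comparison
```

The comparison line shows why plain \b is not enough: because ü is not an ASCII word character, \b finds no boundary in front of a word that starts with it.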
Answer by Joan-Diego Rodriguez
While JavaScript regexes don't support Unicode properties natively, you could use this library to work around it: http://xregexp.com/

