在 JavaScript 中,除非单词在排除单词列表中,否则如何使用正则表达式进行匹配?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/8854817/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me):
StackOverFlow
In JavaScript, how can I use regex to match unless words are in a list of excluded words?
提问by Jake Wilson
How do I use regex to match any word (\w) except a list of certain words? For example:
如何使用正则表达式匹配除某些单词列表之外的任何单词 (\w)?例如:
I want to match the words use and utilize and any word that follows them, except when the following word is something or fish.
我想匹配单词 use 和 utilize 以及紧跟其后的任何单词,除非后面的单词是 something 或 fish。
use this <-- match
utilize that <-- match
use something <-- don't want to match this
utilize fish <-- don't want to match this
How do I specify a list of words I don't want to match against?
如何指定不想匹配的单词列表?
回答by murgatroid99
You can use a negative lookahead to determine that the word you are about to match is not a particular thing. You can use the following regex to do this:
您可以使用否定前瞻来确定您将要匹配的单词不是特定的事物。您可以使用以下正则表达式来执行此操作:
(use|utilize)\s(?!fish|something)(\w+)
This will match "use" or "utilize" followed by a space, and then if the following word is not "fish" or "something", it will match that next word.
这将匹配后跟一个空格的“use”或“utilize”,然后如果接下来的单词不是“fish”或“something”,它将匹配下一个单词。
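For illustration only, here is a minimal sketch of how this pattern might be used from JavaScript; the variable name re is just an example:
仅作说明,下面是在 JavaScript 中使用该模式的一个最小示例;变量名 re 只是举例:
var re = /(use|utilize)\s(?!fish|something)(\w+)/;
re.test('use this');       // true
re.test('utilize that');   // true
re.test('use something');  // false
re.test('utilize fish');   // false
re.exec('use this');       // ["use this", "use", "this"]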
回答by Todd A. Jacobs
Don't Hard-Code Your Regular Expressions
不要硬编码你的正则表达式
Rather than trying to put all your search and exclusion terms into a single, hard-coded regular expression, it's often more maintainable (and certainly more readable) to use short-circuit evaluation to select strings that match desirable terms, and then reject strings that contain undesirable terms.
与其尝试将所有搜索词和排除词塞进单个硬编码的正则表达式,不如使用短路求值先选出匹配所需搜索词的字符串,再拒绝其中包含排除词的字符串,这样通常更易于维护(当然也更易读)。
You can then encapsulate this testing into a function that returns a Boolean value based on the run-time values of your arrays. For example:
然后,您可以将此测试封装到一个函数中,该函数根据数组的运行时值返回一个布尔值。例如:
'use strict';
// Join arrays of terms with the alternation operator.
var searchTerms = new RegExp(['use', 'utilize'].join('|'));
var excludedTerms = new RegExp(['fish', 'something'].join('|'));
// Return true if a string contains only valid search terms without any
// excluded terms.
var isValidStr = function (str) {
    return (searchTerms.test(str) && !excludedTerms.test(str));
};
isValidStr('use fish'); // false
isValidStr('utilize hammer'); // true
回答by Cfreak
This should do it:
这样应该就可以了:
/(?:use|utilize)\s+(?!something|fish)\w+/
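A small usage sketch, assuming you want to pull all matches out of a longer string with the global flag (the sample text below is made up):
一个简单的用法示例,假设您想借助全局标志从一段较长的字符串中取出所有匹配(下面的示例文本是虚构的):
var re = /(?:use|utilize)\s+(?!something|fish)\w+/g;
var text = 'use this, utilize that, use something, utilize fish';
text.match(re); // ["use this", "utilize that"]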
回答by Kaerber
Some people, when confronted with a problem, think “I know, I'll use regular expressions.”
Now they have two problems.
有些人在遇到问题时会想“我知道,我会使用正则表达式”。
现在他们有两个问题。
Regular expressions are suited to match regular sequences of symbols, not words. Any lexer+parser would be much more suitable. For example, the grammar for this task will look very simple in Antlr. If you can't afford the learning curve behind lexers/parsers (they are pretty easy for your given task), then splitting your text into words with a regular expression, and then a simple search with a look-ahead of 1, would be enough.
正则表达式适合匹配规则的符号序列,而不是单词。任何词法分析器+解析器都会更合适。例如,此任务的语法在 Antlr 中看起来非常简单。如果您负担不起词法分析器/解析器背后的学习曲线(它们对于您的给定任务非常容易),那么使用正则表达式将文本拆分为单词,然后使用前瞻为 1 的简单搜索就足够了。
Regular expressions with words get very complex very fast. They are hard to read and hard to maintain.
带有单词的正则表达式很快就会变得非常复杂。它们难以阅读且难以维护。
Update: Thanks for all the downvotes. Here's an example of what I meant.
更新:感谢所有反对票。这是我的意思的一个例子。
import re

# Split the text into a flat list of word tokens.
def Tokenize( text ):
    return re.findall( r"\w+", text )

# Yield (word, next word) pairs where the word is in the white list
# and the word right after it is not in the black list.
def ParseWhiteListedWordThenBlackListedWord( tokens, whiteList, blackList ):
    for i in range( 0, len( tokens ) - 1 ):
        if tokens[i] in whiteList and tokens[i + 1] not in blackList:
            yield ( tokens[i], tokens[i + 1] )
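For comparison, the same tokenize-then-scan idea sketched in JavaScript, since the original question is about JS; the names below are illustrative and not part of the original answer:
作为对照,下面用 JavaScript 粗略实现同样的先分词、再扫描的思路(原问题针对的是 JS);其中的命名仅作示意,并非原答案的一部分:
// Split the text into a flat list of word tokens.
function tokenize(text) {
    return text.match(/\w+/g) || [];
}

// Collect [word, next word] pairs where the word is whitelisted
// and the word that follows it is not blacklisted.
function parseWhiteThenNotBlack(tokens, whiteList, blackList) {
    var pairs = [];
    for (var i = 0; i < tokens.length - 1; i++) {
        if (whiteList.indexOf(tokens[i]) !== -1 &&
                blackList.indexOf(tokens[i + 1]) === -1) {
            pairs.push([tokens[i], tokens[i + 1]]);
        }
    }
    return pairs;
}

parseWhiteThenNotBlack(tokenize('use this utilize fish'),
        ['use', 'utilize'], ['something', 'fish']);
// [["use", "this"]]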
Here is some performance testing:
下面是一些性能测试:
>>> timeit.timeit( 'oldtime()', 'from __main__ import oldtime', number=1 )
0.02636446265387349
>>> timeit.timeit( 'oldtime()', 'from __main__ import oldtime', number=1000 )
28.80968123656703
>>> timeit.timeit( 'newtime()', 'from __main__ import newtime', number=100 )
44.24506212427741
>>> timeit.timeit( 'newtime11()', 'from __main__ import newtime11', number=1 ) +
timeit.timeit( 'newtime13()', 'from __main__ import newtime13', number=1000 )
103.07938725936083
>>> timeit.timeit( 'newtime11()', 'from __main__ import newtime11', number=1 ) +
timeit.timeit( 'newtime12()', 'from __main__ import newtime12', number=1000 )
0.3191265909927097
Some notes: the tests ran over the English text of Pride and Prejudice by Jane Austen; the first words were 'Mr' and 'my', the second words were 'Bennet' and 'dear'.
一些说明:测试是在简·奥斯汀《傲慢与偏见》的英文文本上进行的;第一组词是 'Mr' 和 'my',第二组词是 'Bennet' 和 'dear'。
oldtime() is the regular expression. newtime() is the Tokenizer+Parser; mind that it was run 100 times, not 1000, so a comparable time for it would be ~442.
oldtime() 是正则表达式版本。newtime() 是 Tokenizer+Parser 版本;注意它只运行了 100 次而不是 1000 次,因此可比的时间约为 442。
The next two test are to simulate repeated runs of Parser over the same text, as you reuse Tokenizer results.
接下来的两个测试是在重复使用 Tokenizer 结果时模拟 Parser 对同一文本的重复运行。
newtime11() is Tokenizer only. newtime13() is Parser with results converted to list (to simulate traversal of the results). newtime12() is just Parser.
newtime11() 只是 Tokenizer。newtime13() 是将结果转换为列表的解析器(以模拟结果的遍历)。newtime12() 只是解析器。
Well, regular expressions are faster, by quite a lot in the case of a single pass, even in the case of generator (the bulk of the time is spent tokenizing text, in case of Tokenizer+Parser). But generator expressions are extremely fast when you can reuse tokenized text and evaluate parser results lazily.
好吧,正则表达式更快,在单次传递的情况下快了很多,即使是在生成器的情况下(在 Tokenizer+Parser 的情况下,大部分时间都花在标记文本上)。但是当您可以重用标记化文本并懒惰地评估解析器结果时,生成器表达式非常快。
There is quite a bit of performance optimization possible, but it'll complicate the solution, possibly to the point where regular expressions are to become the best implementation.
有相当多的性能优化是可能的,但它会使解决方案复杂化,可能会导致正则表达式成为最佳实现。
The tokenizer+parser approach has both advantages and disadvantages:
- the structure of the solution is more complex (more elements) but each element is simpler
- elements are easy to test, including automatic testing
- it IS slow, but it gets better when you reuse the same text and evaluate the results lazily
- due to generators and lazy evaluation, some work may be avoided
- it is trivial to change the white list and/or the black list
- it is trivial to have several white lists, several black lists and/or their combinations
- it is trivial to add new parsers that reuse the tokenizer results
分词器+解析器方法既有优点也有缺点:
- 解决方案的结构更复杂(元素更多),但每个元素更简单
- 各个元素易于测试,包括自动化测试
- 它确实慢,但在重用同一文本并惰性求值结果时会好很多
- 得益于生成器和惰性求值,可以避免一些工作
- 修改白名单和/或黑名单非常容易
- 维护多个白名单、多个黑名单和/或它们的组合非常容易
- 添加重用分词结果的新解析器非常容易
Now to that thorny "You Ain't Gonna Need It" question. You aren't going to need the solution to the original question either, unless it is part of a bigger task. And that bigger task should dictate the best approach.
现在来谈那个棘手的“你不会需要它”(You Ain't Gonna Need It)问题。除非原始问题是某个更大任务的一部分,否则你也不会需要它的解决方案。而那个更大的任务应该决定最佳方案。
Update: There is a good discussion of regular expressions in lexing and parsing at http://commandcenter.blogspot.ru/2011/08/regular-expressions-in-lexing-and.html. I'll summarize it with a quote:
更新:http://commandcenter.blogspot.ru/2011/08/regular-expressions-in-lexing-and.html 上对词法分析和语法分析中的正则表达式有很好的讨论。我引用其中一段话来总结:
Encouraging regular expressions as a panacea for all text processing problems is not only lazy and poor engineering, it also reinforces their use by people who shouldn't be using them at all.
鼓励将正则表达式作为解决所有文本处理问题的灵丹妙药不仅是懒惰和糟糕的工程,而且还加强了那些根本不应该使用它们的人的使用。