JavaScript: regex to match all instances not inside quotes
Source: http://stackoverflow.com/questions/6462578/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Regex to match all instances not inside quotes
Question by Azmisov
From this q/a, I deduced that matching all instances of a given regex not inside quotes is impossible. That is, it can't match escaped quotes (ex: "this whole \"match\" should be taken"). If there is a way to do it that I don't know about, that would solve my problem.
If not, however, I'd like to know if there is any efficient alternative that could be used in JavaScript. I've thought about it a bit, but can't come up with any elegant solutions that would work in most, if not all, cases.
Specifically, I just need the alternative to work with .split() and .replace() methods, but if it could be more generalized, that would be the best.
For example, given the input string:
+bar+baz"not+or\"+or+\"this+"foo+bar+
replacing + with #, but not inside quotes, would return:
#bar#baz"not+or\"+or+\"this+"foo#bar#
Answer by Jens
Actually, you can match all instances of a regex not inside quotes for any string in which each opening quote is closed again. Say, as in your example above, you want to match \+.
The key observation here is that a character is outside quotes if it is followed by an even number of quotes. This can be modeled as a look-ahead assertion:
\+(?=([^"]*"[^"]*")*[^"]*$)
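As a rough illustration, the look-ahead above can be dropped straight into a global replace. This is only a sketch for input without escaped quotes; the variable names (outsideQuotes, sample, result) are made up for the example.

```javascript
// A "+" counts as outside quotes when an even number of double quotes
// (possibly zero) follows it up to the end of the string.
var outsideQuotes = /\+(?=([^"]*"[^"]*")*[^"]*$)/g;

var sample = '+bar+baz"not+these+"foo+bar+';
var result = sample.replace(outsideQuotes, '#');

console.log(result); // #bar#baz"not+these+"foo#bar#
```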
Now, you'd like to not count escaped quotes. This gets a little more complicated. Instead of [^"]*, which advances to the next quote, you need to consider backslashes as well and use [^"\\]*. After you arrive at either a backslash or a quote, you need to ignore the next character if you encountered a backslash, or else advance to the next unescaped quote. That looks like (\\.|"([^"\\]*\\.)*[^"\\]*"). Combined, you arrive at
\+(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$)
I admit it is a little cryptic. =)
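To see the combined expression at work, here is a small sketch that runs it against the input from the question. The variable names are illustrative, and the backslashes are doubled only because the input is written as a JavaScript string literal.

```javascript
// Escape-aware look-ahead: skip "\x" pairs and whole quoted strings,
// then require that no unbalanced quote remains before the end.
var escapedAware = /\+(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$)/g;

// Actual text: +bar+baz"not+or\"+or+\"this+"foo+bar+
var input = '+bar+baz"not+or\\"+or+\\"this+"foo+bar+';
var output = input.replace(escapedAware, '#');

console.log(output); // #bar#baz"not+or\"+or+\"this+"foo#bar#
```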
Answer by zx81
Azmisov, resurrecting this question because you said you were looking for "any efficient alternative that could be used in JavaScript" and "any elegant solutions that would work in most, if not all, cases".
There happens to be a simple, general solution that wasn't mentioned.
Compared with alternatives, the regex for this solution is amazingly simple:
"[^"]+"|(\+)
The idea is that we match but ignore anything within quotes to neutralize that content (on the left side of the alternation). On the right side, we capture into Group 1 all the + that were not neutralized, and the replace function examines Group 1. Here is full working code:
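<script>
var subject = '+bar+baz"not+these+"foo+bar+';
var regex = /"[^"]+"|(\+)/g;
var replaced = subject.replace(regex, function(m, group1) {
    // Group 1 is unset when the quoted-string branch matched: keep that text as-is
    if (!group1) return m;
    // Otherwise an unquoted + was captured: replace it
    else return "#";
});
document.write(replaced);
</script>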
You can use the same principle to match or split. See the question and article in the reference, which will also point you to code samples.
Hope this gives you a different idea of a very general way to do this. :)
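For splitting, one possible way to apply the same principle is to walk the matches with exec and cut only where Group 1 is set. This is a sketch, not code from the referenced article; splitOutsideQuotes is a hypothetical helper name.

```javascript
// Split on "+" characters that are not inside double quotes.
function splitOutsideQuotes(subject) {
  var regex = /"[^"]*"|(\+)/g;
  var parts = [];
  var start = 0;
  var m;
  while ((m = regex.exec(subject)) !== null) {
    // Quoted strings match the left branch and leave Group 1 unset,
    // so they never produce a cut.
    if (m[1] !== undefined) {
      parts.push(subject.slice(start, m.index));
      start = m.index + m[0].length;
    }
  }
  parts.push(subject.slice(start));
  return parts;
}

console.log(splitOutsideQuotes('+bar+baz"not+these+"foo+bar+'));
// [ '', 'bar', 'baz"not+these+"foo', 'bar', '' ]
```

Like String.prototype.split, this keeps the empty strings produced by a leading or trailing delimiter.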
What about Empty Strings?
The above is a general answer to showcase the technique. It can be tweaked depending on your exact needs. If you worry that your text might contain empty strings, just change the quantifier inside the string-capture expression from + to *:
"[^"]*"|(\+)
See demo.
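A small sketch of why the quantifier matters, using a made-up input that contains an empty string. With +, the empty "" is not consumed as a string, so the regex falls out of step with the real quote pairs. The helper name replacePlus is hypothetical.

```javascript
// Replace every captured "+" with "#", leaving ignored matches untouched.
function replacePlus(subject, regex) {
  return subject.replace(regex, function (m, group1) {
    return group1 === undefined ? m : '#';
  });
}

var withEmpty = '+""+"a+b"+';

// "+" quantifier: "" is skipped, then "+" (quote, plus, quote) is taken as a string.
var strict = replacePlus(withEmpty, /"[^"]+"|(\+)/g);
// "*" quantifier: "" is consumed as an empty string, so pairing stays correct.
var lenient = replacePlus(withEmpty, /"[^"]*"|(\+)/g);

console.log(strict);  // #""+"a#b"#
console.log(lenient); // #""#"a+b"#
```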
What about Escaped Quotes?
Again, the above is a general answer to showcase the technique. Not only can the "ignore this match" regex be refined to your needs, you can also add multiple expressions to ignore. For instance, if you want to make sure escaped quotes are adequately ignored, you can start by adding an alternation \\"| in front of the other two in order to match (and ignore) straggling escaped double quotes.
Next, within the section "[^"]*" that captures the content of double-quoted strings, you can add an alternation to ensure escaped double quotes are matched before their " has a chance to turn into a closing sentinel, turning it into "(?:\\"|[^"])*"
The resulting expression has three branches:
- \\" to match and ignore
- "(?:\\"|[^"])*" to match and ignore
- (\+) to match, capture and handle
Note that in other regex flavors, we could do this job more easily with lookbehind, but JS didn't support it when this answer was written (lookbehind assertions were added to JavaScript in ES2018).
The full regex becomes:
\\"|"(?:\\"|[^"])*"|(\+)
See regex demo and full script.
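As a sketch of how the three-branch regex might be used (not the original "full script", which is not reproduced here), the snippet below runs it on the input from the question. Variable names are illustrative, and the backslashes are doubled only because of string-literal escaping.

```javascript
// Branch 1: stray escaped quote (ignored)
// Branch 2: quoted string that may contain \" (ignored)
// Branch 3: unquoted "+" (captured in Group 1 and replaced)
var fullRegex = /\\"|"(?:\\"|[^"])*"|(\+)/g;

// Actual text: +bar+baz"not+or\"+or+\"this+"foo+bar+
var question = '+bar+baz"not+or\\"+or+\\"this+"foo+bar+';
var answer = question.replace(fullRegex, function (m, group1) {
  return group1 === undefined ? m : '#';
});

console.log(answer); // #bar#baz"not+or\"+or+\"this+"foo#bar#
```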
Reference
Answer by Mike Samuel
You can do it in three steps.
- Use a regex global replace to extract all string body contents into a side-table.
- Do your comma translation
- Use a regex global replace to swap the string bodies back
Code below
// Step 1: move every quoted string (quotes included) into a side table,
// leaving a numbered placeholder such as "0" behind
var sideTable = [];
myString = myString.replace(
    /"(?:[^"\\]|\\.)*"/g,
    function (_) {
      var index = sideTable.length;
      sideTable[index] = _;
      return '"' + index + '"';
    });
// Step 2, replace commas with newlines
myString = myString.replace(/,/g, "\n");
// Step 3, swap the string bodies back
myString = myString.replace(/"(\d+)"/g,
    function (_, index) {
      return sideTable[index];
    });
If you run that after setting
myString = '{:a "ab,cd, efg", :b "ab,def, egf,", :c "Conjecture"}';
you should get
{:a "ab,cd, efg"
 :b "ab,def, egf,"
 :c "Conjecture"}
It works, because after step 1,
myString = '{:a "0", :b "1", :c "2"}'
sideTable = ['"ab,cd, efg"', '"ab,def, egf,"', '"Conjecture"'];
so the only commas in myString are outside strings. Step 2 then turns commas into newlines:
myString = '{:a "0"\n :b "1"\n :c "2"}'
Finally we replace the strings that only contain numbers with their original content.
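The three steps can be wrapped into a reusable function, sketched below. The name replaceOutsideStrings and its parameters are hypothetical, and the sketch assumes the pattern and replacement never create or alter text that looks like a placeholder such as "0".

```javascript
// Hypothetical helper: applies pattern -> replacement only outside
// double-quoted strings, using the side-table approach described above.
function replaceOutsideStrings(input, pattern, replacement) {
  var sideTable = [];
  // Step 1: park each quoted string (quotes included) in the side table.
  var masked = input.replace(/"(?:[^"\\]|\\.)*"/g, function (match) {
    sideTable.push(match);
    return '"' + (sideTable.length - 1) + '"';
  });
  // Step 2: the real replacement now only sees unquoted text.
  var replaced = masked.replace(pattern, replacement);
  // Step 3: swap the numbered placeholders back for the original strings.
  return replaced.replace(/"(\d+)"/g, function (match, index) {
    return sideTable[index];
  });
}

// Actual text: +bar+baz"not+or\"+or+\"this+"foo+bar+
var fromQuestion = '+bar+baz"not+or\\"+or+\\"this+"foo+bar+';
console.log(replaceOutsideStrings(fromQuestion, /\+/g, '#'));
// #bar#baz"not+or\"+or+\"this+"foo#bar#
```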
Answer by Marius
Although the answer by zx81 seems to be the cleanest and best-performing one, it needs these fixes to correctly catch the escaped quotes:
var subject = '+bar+baz"not+or\\"+or+\\"this+"foo+bar+';
and
var regex = /"(?:[^"\\]|\\.)*"|(\+)/g;
Also the already mentioned "group1 === undefined" or "!group1". The second fix in particular (the regex) seems important to actually take everything asked in the original question into account.
It should be mentioned though that this method implicitly requires the string to not have escaped quotes outside of unescaped quote pairs.
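The following sketch illustrates that limitation with a made-up input in which an escaped quote appears before any real string. It also shows that putting zx81's \\"| branch in front seems to cover this case; replaceUnquotedPlus is a hypothetical helper name.

```javascript
// Replace every captured "+" with "#", leaving ignored matches untouched.
function replaceUnquotedPlus(subject, regex) {
  return subject.replace(regex, function (m, group1) {
    return group1 === undefined ? m : '#';
  });
}

// Actual text: a\"+b"c+d"  (the first quote is escaped and opens nothing)
var stray = 'a\\"+b"c+d"';

// Without a branch for stray escaped quotes, \" is mistaken for an opening quote.
var withoutBranch = replaceUnquotedPlus(stray, /"(?:[^"\\]|\\.)*"|(\+)/g);
// With the extra branch in front, the stray \" is consumed and ignored first.
var withBranch = replaceUnquotedPlus(stray, /\\"|"(?:[^"\\]|\\.)*"|(\+)/g);

console.log(withoutBranch); // a\"+b"c#d"
console.log(withBranch);    // a\"#b"c+d"
```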