Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow, original question at http://stackoverflow.com/questions/366202/

Time: 2020-08-11 13:52:30

Regex for splitting a string using space when not surrounded by single or double quotes

Tags: java, regex, split

Asked by carlsz

I'm new to regular expressions and would appreciate your help. I'm trying to put together an expression that will split the example string using all spaces that are not surrounded by single or double quotes. My last attempt looks like this: (?!") and isn't quite working. It's splitting on the space before the quote.

Example input:

This is a string that "will be" highlighted when your 'regular expression' matches something.

Desired output:

This
is
a
string
that
will be
highlighted
when
your
regular expression
matches
something.

Note that "will be" and 'regular expression' retain the space between the words.

Accepted answer by Jan Goyvaerts

I don't understand why all the others are proposing such complex regular expressions or such long code. Essentially, you want to grab two kinds of things from your string: sequences of characters that aren't spaces or quotes, and sequences of characters that begin and end with a quote, with no quotes in between, for two kinds of quotes. You can easily match those things with this regular expression:

[^\s"']+|"([^"]*)"|'([^']*)'

I added the capturing groups because you don't want the quotes in the list.

This Java code builds the list, adding the capturing group if it matched to exclude the quotes, and adding the overall regex match if the capturing group didn't match (an unquoted word was matched).

List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("[^\\s\"']+|\"([^\"]*)\"|'([^']*)'");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
    if (regexMatcher.group(1) != null) {
        // Add double-quoted string without the quotes
        matchList.add(regexMatcher.group(1));
    } else if (regexMatcher.group(2) != null) {
        // Add single-quoted string without the quotes
        matchList.add(regexMatcher.group(2));
    } else {
        // Add unquoted word
        matchList.add(regexMatcher.group());
    }
} 

If you don't mind having the quotes in the returned list, you can use much simpler code:

List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("[^\\s\"']+|\"[^\"]*\"|'[^']*'");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
    matchList.add(regexMatcher.group());
} 
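For readers who want to run the accepted approach end to end, here is a minimal self-contained sketch. The class name `QuoteAwareTokenizer` and the method name `tokenize` are made up for illustration; the pattern and the group handling follow the snippet above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class; only the regex and the group logic come from the answer.
class QuoteAwareTokenizer {
    // Unquoted word | "double quoted" (group 1) | 'single quoted' (group 2)
    private static final Pattern TOKEN =
            Pattern.compile("[^\\s\"']+|\"([^\"]*)\"|'([^']*)'");

    static List<String> tokenize(String subjectString) {
        List<String> matchList = new ArrayList<>();
        Matcher m = TOKEN.matcher(subjectString);
        while (m.find()) {
            if (m.group(1) != null) {
                matchList.add(m.group(1));      // double-quoted text, quotes dropped
            } else if (m.group(2) != null) {
                matchList.add(m.group(2));      // single-quoted text, quotes dropped
            } else {
                matchList.add(m.group());       // plain unquoted word
            }
        }
        return matchList;
    }

    public static void main(String[] args) {
        String input = "This is a string that \"will be\" highlighted "
                + "when your 'regular expression' matches something.";
        for (String token : tokenize(input)) {
            System.out.println(token);
        }
    }
}
```

Run against the example input from the question, this prints one token per line, with "will be" and "regular expression" each kept together and their quotes removed.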

Answer by Jonathan Lonowski

It'll probably be easier to search the string, grabbing each part, vs. split it.

The reason is that you can have it split at the spaces before and after "will be", but I can't think of any way to tell a split to ignore the space inside the quotes.

(a Java sketch; the import lines belong at the top of the file)

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String string = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";

// Search for a double-quoted group (escaped quotes allowed) or a run of non-space characters
Pattern regex = Pattern.compile("\"(\\\\\"|(?!\").)+\"|[^ ]+");
List<String> result = new ArrayList<>();

Matcher matcher = regex.matcher(string);
while (matcher.find()) { // each find() call progresses to the next "word"
    result.add(matcher.group());
}

Also, capturing single quotes could lead to issues:

"Foo's Bar 'n Grill"

//=>

"Foo"
"s Bar "
"n"
"Grill"
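One possible way to soften the apostrophe problem, which is not part of this answer and is offered only as an assumption-laden sketch, is to treat a quote as a delimiter only when it opens right after whitespace (or at the start) and closes right before whitespace (or at the end). The class name `BoundaryAwareTokenizer` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical variation: quotes only delimit a token when they sit at token boundaries.
class BoundaryAwareTokenizer {
    // (?<!\S) = not preceded by a non-space character; (?!\S) = not followed by one.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<!\\S)\"([^\"]*)\"(?!\\S)|(?<!\\S)'([^']*)'(?!\\S)|\\S+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                tokens.add(m.group(1));   // double-quoted token without quotes
            } else if (m.group(2) != null) {
                tokens.add(m.group(2));   // single-quoted token without quotes
            } else {
                tokens.add(m.group());    // anything else, apostrophes included
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Foo's Bar 'n Grill"));
    }
}
```

With this variation the apostrophes in "Foo's Bar 'n Grill" stay inside their words. The trade-off is that a closing quote followed directly by punctuation, such as "will be", with a trailing comma, is no longer recognized as quoted.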

Answer by Zach Scrivena

String.split() is not helpful here because there is no way to distinguish between spaces within quotes (don't split) and those outside (split). Matcher.lookingAt() is probably what you need:

String str = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
str = str + " "; // add trailing space
int len = str.length();
Matcher m = Pattern.compile("((\"[^\"]+?\")|('[^']+?')|([^\\s]+?))\\s++").matcher(str);

for (int i = 0; i < len; i++)
{
    m.region(i, len);

    if (m.lookingAt())
    {
        String s = m.group(1);

        if ((s.startsWith("\"") && s.endsWith("\"")) ||
            (s.startsWith("'") && s.endsWith("'")))
        {
            s = s.substring(1, s.length() - 1);
        }

        System.out.println(i + ": \"" + s + "\"");
        i += (m.group(0).length() - 1);
    }
}

which produces the following output:

0: "This"
5: "is"
8: "a"
10: "string"
17: "that"
22: "will be"
32: "highlighted"
44: "when"
49: "your"
54: "regular expression"
75: "matches"
83: "something."

Answer by rmeador

I'm reasonably certain this is not possible using regular expressions alone. Checking whether something is contained inside some other tag is a parsing operation. This seems like the same problem as trying to parse XML with a regex -- it can't be done correctly. You may be able to get your desired outcome by repeatedly applying a non-greedy, non-global regex that matches the quoted strings, then once you can't find anything else, split it at the spaces... that has a number of problems, including keeping track of the original order of all the substrings. Your best bet is to just write a really simple function that iterates over the string and pulls out the tokens you want.
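The "really simple function" suggested here could look roughly like the following sketch. It is one possible reading, not the answerer's code: the class name `ManualTokenizer` is invented, and the choices to merge quoted text with adjacent unquoted text and to let an unterminated quote run to the end of the input are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written tokenizer: one pass over the characters, no regex.
class ManualTokenizer {
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        char quote = 0;           // 0 means "not inside quotes"
        boolean inToken = false;  // lets an empty quoted string count as a token

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (quote != 0) {
                if (c == quote) {
                    quote = 0;            // closing quote: leave quoted mode
                } else {
                    current.append(c);    // spaces are kept while inside quotes
                }
            } else if (c == '"' || c == '\'') {
                quote = c;                // opening quote: remember which kind
                inToken = true;
            } else if (Character.isWhitespace(c)) {
                if (inToken) {            // whitespace outside quotes ends a token
                    tokens.add(current.toString());
                    current.setLength(0);
                    inToken = false;
                }
            } else {
                current.append(c);
                inToken = true;
            }
        }
        if (inToken) {
            tokens.add(current.toString()); // flush the last token
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("This is a string that \"will be\" highlighted"));
    }
}
```

Because tokens are appended as they are found, the original order of the substrings is preserved without any extra bookkeeping.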

Answer by Jay

There are several questions on StackOverflow that cover this same question in various contexts using regular expressions. For instance:

UPDATE: Sample regex to handle single and double quoted strings. Ref: How can I split on a string except when inside quotes?

m/('.*?'|".*?"|\S+)/g 

Tested this with a quick Perl snippet and the output was as reproduced below. Also works for empty strings or whitespace-only strings if they are between quotes (not sure if that's desired or not).

This
is
a
string
that
"will be"
highlighted
when
your
'regular expression'
matches
something.

Note that this does include the quote characters themselves in the matched values, though you can remove that with a string replace, or modify the regex to not include them. I'll leave that as an exercise for the reader or another poster for now, as 2am is way too late to be messing with regular expressions anymore ;)
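As a rough Java counterpart to the Perl pattern above, the sketch below matches the same three alternatives and then strips a matching pair of surrounding quotes in code, which is one way to do the "exercise for the reader". The class name `PerlStyleTokenizer` and the stripping rule are assumptions, not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical Java port of the pattern ('.*?'|".*?"|\S+), with quote stripping added.
class PerlStyleTokenizer {
    private static final Pattern TOKEN = Pattern.compile("'.*?'|\".*?\"|\\S+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            String t = m.group();
            // A token counts as quoted when it starts and ends with the same quote character.
            boolean quoted = t.length() >= 2
                    && (t.charAt(0) == '"' || t.charAt(0) == '\'')
                    && t.charAt(t.length() - 1) == t.charAt(0);
            tokens.add(quoted ? t.substring(1, t.length() - 1) : t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("your 'regular expression' matches \"will be\""));
    }
}
```

Since the last alternative is `\S+`, a word with an embedded apostrophe such as don't is matched as a single token here.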

Answer by mcrumley

If you want to allow escaped quotes inside the string, you can use something like this:

(?:(['"])(.*?)(?<!\\)(?>\\\\)*\1|([^\s]+))

Quoted strings will be group 2, single unquoted words will be group 3.

You can try it on various strings here: http://www.fileformat.info/tool/regex.htm or http://gskinner.com/RegExr/
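To try the escaped-quote idea from Java, a sketch along these lines may help. It assumes the pattern is meant to use a lookbehind for a backslash, an atomic group of backslash pairs, and a `\1` backreference for the closing quote; the class name `EscapedQuoteTokenizer` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of the pattern (?:(['"])(.*?)(?<!\\)(?>\\\\)*\1|([^\s]+))
class EscapedQuoteTokenizer {
    // Group 1: the opening quote, group 2: quoted content, group 3: unquoted word.
    private static final Pattern TOKEN = Pattern.compile(
            "(?:(['\"])(.*?)(?<!\\\\)(?>\\\\\\\\)*\\1|([^\\s]+))");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Group 2 is set whenever the quoted alternative matched, even if empty.
            tokens.add(m.group(2) != null ? m.group(2) : m.group(3));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The input contains backslash-escaped double quotes inside a quoted string.
        System.out.println(tokenize("say \"he said \\\"hi\\\"\" now"));
    }
}
```

Note that the escaping backslashes remain in the captured text; removing them would be a separate post-processing step.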

Answer by Marcus Andromeda

(?<!\G".{0,99999})\s|(?<=\G".{0,99999}")\s

This will match the spaces not surrounded by double quotes. I have to use min,max {0,99999} because Java doesn't support * and + in lookbehind.

Answer by Eric Woodruff

I liked Marcus's approach; however, I modified it so that I could allow text near the quotes and support both " and ' quote characters. For example, I needed a="some value" not to be split into [a=, "some value"].

(?<!\G\S{0,99999}["'].{0,99999})\s|(?<=\G\S{0,99999}".{0,99999}"\S{0,99999})\s|(?<=\G\S{0,99999}'.{0,99999}'\S{0,99999})\s

Answer by pascals

A couple hopefully helpful tweaks on Jan's accepted answer:

(['"])((?:\\\1|.)+?)\1|([^\s"']+)
  • Allows escaped quotes within quoted strings
  • Avoids repeating the pattern for the single and double quote; this also simplifies adding more quoting symbols if needed (at the expense of one more capturing group)

Answer by iRon

The regex from Jan Goyvaerts is the best solution I have found so far, but it also creates empty (null) matches, which he excludes in his program. These empty matches also appear in regex testers (e.g. rubular.com). If you turn the search around (first look for the quoted parts and then for the space-separated words), you can do it in one go with:

("[^"]*"|'[^']*'|[\S]+)+