Java: How to split a comma-separated string while ignoring escaped commas?

Note: this page is a Chinese/English translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/820172/

Time: 2020-08-11 19:52:27  Source: igfitidea

How to split a comma separated String while ignoring escaped commas?

Tags: java, regex, csv

Asked by arturh

I need to write an extended version of the StringUtils.commaDelimitedListToStringArray function that takes an additional parameter: the escape char.

So calling my function:

commaDelimitedListToStringArray("test,test\\,test\\,test,test", "\\")

should return:

["test", "test,test,test", "test"]



My current attempt is to use String.split() to split the string with a regular expression:

String[] array = str.split("[^\\\\],");

But the returned array is:

["tes", "test\,test\,tes", "test"]

Any ideas?

Accepted answer by matt b

The regular expression

[^\\],

means "match a character which is not a backslash, followed by a comma". This is why sequences such as t, match: t is a character which is not a backslash, so it is consumed as part of the delimiter.
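The behavior described above can be reproduced with a small sketch. The class and method names (ConsumingSplitDemo, splitConsuming) are made up for illustration; the pattern is the one from the question, written with the double escaping a Java string literal needs.

```java
import java.util.Arrays;

class ConsumingSplitDemo {
    // Splits on "one non-backslash character followed by a comma".
    // The character class consumes the character before the comma,
    // so that character is dropped from the resulting element.
    static String[] splitConsuming(String s) {
        return s.split("[^\\\\],");
    }

    public static void main(String[] args) {
        // Raw value: test,test\,test\,test,test
        String input = "test,test\\,test\\,test,test";
        // prints [tes, test\,test\,tes, test]
        System.out.println(Arrays.toString(splitConsuming(input)));
    }
}
```

The trailing t of each element before an unescaped comma is lost because it is part of the match, not because of how split() works.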

I think you need to use some sort of negative lookbehind, to match a , which is not preceded by a \ without consuming the preceding character, something like

(?<!\\),

(BTW, note that I have purposely not double-escaped the backslashes for a Java string literal, to keep this readable.)

Answer by cletus

Try:

String[] array = str.split("(?<!\\\\),");

Basically this says: split on a comma, except where that comma is preceded by a backslash (the backslash has to be escaped once for the regex and once more for the Java string literal). This is called a zero-width negative lookbehind assertion.
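A minimal sketch of the lookbehind split, assuming the same input as the question. LookbehindSplitDemo and splitUnescaped are hypothetical names; the -1 limit is an addition here so that trailing empty elements are kept.

```java
import java.util.Arrays;

class LookbehindSplitDemo {
    // Splits on every comma that is not immediately preceded by a backslash.
    // The lookbehind is zero-width, so no character is consumed besides the comma.
    static String[] splitUnescaped(String s) {
        return s.split("(?<!\\\\),", -1);
    }

    public static void main(String[] args) {
        // Raw value: test,test\,test\,test,test
        String input = "test,test\\,test\\,test,test";
        // prints [test, test\,test\,test, test]
        System.out.println(Arrays.toString(splitUnescaped(input)));
    }
}
```

Note that the escape characters are still present in the elements; removing them is a separate step.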

Answer by arturh

For future reference, here is the complete method I ended up with:

public static String[] commaDelimitedListToStringArray(String str, String escapeChar) {
    // these characters need to be escaped in a regular expression
    String regularExpressionSpecialChars = "/.*+?|()[]{}\\";

    String escapedEscapeChar = escapeChar;

    // if the escape char for our comma separated list needs to be escaped 
    // for the regular expression, escape it using the \ char
    if(regularExpressionSpecialChars.indexOf(escapeChar) != -1) 
        escapedEscapeChar = "\\" + escapeChar;

    // see http://stackoverflow.com/questions/820172/how-to-split-a-comma-separated-string-while-ignoring-escaped-commas
    String[] temp = str.split("(?<!" + escapedEscapeChar + "),", -1);

    // remove the escapeChar for the end result
    String[] result = new String[temp.length];
    for(int i=0; i<temp.length; i++) {
        result[i] = temp[i].replaceAll(escapedEscapeChar + ",", ",");
    }

    return result;
}
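As an alternative to maintaining a list of special characters by hand, the escape string can be passed through Pattern.quote, which makes any string match literally. The following is a sketch of that idea, not the method from the answer above; QuotedEscapeSplit and its split method are hypothetical names.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class QuotedEscapeSplit {
    // Splits on commas not preceded by the escape string, then removes
    // the escape string in front of the remaining (escaped) commas.
    static String[] split(String str, String escape) {
        // Pattern.quote makes the escape string literal inside the regex,
        // whatever characters it contains.
        String quoted = Pattern.quote(escape);
        String[] parts = str.split("(?<!" + quoted + "),", -1);
        for (int i = 0; i < parts.length; i++) {
            // String.replace works on literal text, so no regex escaping is needed.
            parts[i] = parts[i].replace(escape + ",", ",");
        }
        return parts;
    }

    public static void main(String[] args) {
        // prints [test, test,test,test, test]
        System.out.println(Arrays.toString(
                split("test,test\\,test\\,test,test", "\\")));
    }
}
```

Like the lookbehind approach in general, this sketch does not handle an escaped escape character in front of a comma.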

Answer by boumbh

As matt b said, [^\\], will interpret the character preceding the comma as part of the delimiter.

"test\\\,test\\,test\,test,test"
  -(split)->
["test\\\,test\\,test\,tes" , "test"]

As drvdijk said, (?<!\\), will misinterpret escaped backslashes.

"test\\\,test\\,test\,test,test"
  -(split)->
["test\\\,test\\,test\,test" , "test"]
  -(unescape commas)->
["test\\,test\,test,test" , "test"]

I would expect to be able to escape backslashes as well:

"test\\\,test\\,test\,test,test"
  -(split)->
["test\\\,test\\" , "test\,test" , "test"]
  -(unescape commas and backslashes)->
["test\,test\" , "test,test" , "test"]

drvdijk suggested (?<=(?<!\\\\)(\\\\\\\\){0,100}), which works well for lists whose elements end with up to 100 escaped backslashes. That is plenty, but why have a limit at all? Is there a more efficient way (isn't lookbehind greedy)? And what about invalid strings?
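For reference, the bounded lookbehind quoted above can be tried as follows. This is only a sketch built around the pattern as quoted in this answer; BoundedLookbehindDemo and splitOnUnescapedCommas are hypothetical names, and the -1 limit is an addition to keep trailing empty elements.

```java
import java.util.Arrays;

class BoundedLookbehindDemo {
    // Splits on a comma preceded by an even number of backslashes
    // (0 to 100 pairs), i.e. a comma that is not itself escaped.
    static String[] splitOnUnescapedCommas(String s) {
        return s.split("(?<=(?<!\\\\)(\\\\\\\\){0,100}),", -1);
    }

    public static void main(String[] args) {
        // Raw value: test\\\,test\\,test\,test,test
        String input = "test\\\\\\,test\\\\,test\\,test,test";
        // prints [test\\\,test\\, test\,test, test]
        System.out.println(Arrays.toString(splitOnUnescapedCommas(input)));
    }
}
```

The elements still contain their escape sequences, and nothing here rejects invalid input such as a backslash followed by an ordinary character.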

I searched for a while for a generic solution, then wrote one myself. The idea is to use a pattern that matches the list elements (instead of matching the delimiter).

My answer does not take the escape character as a parameter; it is fixed to the backslash.

public static List<String> commaDelimitedListStringToStringList(String list) {
    // Check the validity of the list
    // ex: "te\st" is not valid, backslash should be escaped
    if (!list.matches("^(([^\\\\,]|\\\\,|\\\\\\\\)*(,|$))+")) {
        // Could also raise an exception
        return null;
    }
    // Matcher for the list elements
    Matcher matcher = Pattern
            .compile("(?<=(^|,))([^\\\\,]|\\\\,|\\\\\\\\)*(?=(,|$))")
            .matcher(list);
    ArrayList<String> result = new ArrayList<String>();
    while (matcher.find()) {
        // Unescape the list element
        result.add(matcher.group().replaceAll("\\\\([\\\\,])", "$1"));
    }
    return result;
}

Description of the pattern (unescaped):

(?<=(^|,)) the element is preceded by the start of the string or a ,

([^\\,]|\\,|\\\\)* the element itself, composed of \, or \\ or characters which are neither \ nor ,

(?=(,|$)) the element is followed by the end of the string or a ,

The pattern may be simplified.

Even with the three passes (matches + find + replaceAll), this method seems faster than the one suggested by drvdijk. It could be optimized further by writing a dedicated parser.
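A dedicated parser of the kind mentioned above might look like the following single-pass sketch. It is an illustration, not part of the original answer: EscapedListParser and parse are hypothetical names, the escape character is fixed to the backslash, and invalid input returns null to mirror the regex version.

```java
import java.util.ArrayList;
import java.util.List;

class EscapedListParser {
    // Parses a comma-separated list in which \, is a literal comma and
    // \\ is a literal backslash. Returns null for invalid input
    // (a backslash followed by anything else, or a trailing backslash).
    static List<String> parse(String list) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < list.length(); i++) {
            char c = list.charAt(i);
            if (c == '\\') {
                if (i + 1 >= list.length()) {
                    return null; // dangling escape character
                }
                char next = list.charAt(++i);
                if (next != '\\' && next != ',') {
                    return null; // only \\ and \, are valid escapes
                }
                current.append(next); // keep the escaped character only
            } else if (c == ',') {
                result.add(current.toString()); // unescaped comma ends the element
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        result.add(current.toString()); // last element (may be empty)
        return result;
    }

    public static void main(String[] args) {
        // Raw value: test\\\,test\\,test\,test,test
        // prints [test\,test\, test,test, test]
        System.out.println(parse("test\\\\\\,test\\\\,test\\,test,test"));
    }
}
```

Validation, splitting and unescaping happen in the same loop, so the input is scanned once instead of three times.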

Also, why have an escape character at all if only one character is special? It could simply be doubled:

public static List<String> commaDelimitedListStringToStringList2(String list) {
    if (!list.matches("^(([^,]|,,)*(,|$))+")) {
        return null;
    }
    Matcher matcher = Pattern.compile("(?<=(^|,))([^,]|,,)*(?=(,|$))")
                    .matcher(list);
    ArrayList<String> result = new ArrayList<String>();
    while (matcher.find()) {
        result.add(matcher.group().replaceAll(",,", ","));
    }
    return result;
}