Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/1757065/

Date: 2020-08-12 22:21:27  Source: igfitidea

Java: splitting a comma-separated string but ignoring commas in quotes

Tags: java, regex, string

Asked by Jason S

I have a string vaguely like this:

foo,bar,c;qual="baz,blurb",d;junk="quux,syzygy"

that I want to split by commas, but I need to ignore commas in quotes. How can I do this? It seems a regexp approach fails; I suppose I can manually scan and enter a different mode when I see a quote, but it would be nice to use preexisting libraries. (Edit: I guess I meant libraries that are already part of the JDK or part of a commonly used library like Apache Commons.)

the above string should split into:

foo
bar
c;qual="baz,blurb"
d;junk="quux,syzygy"

Note: this is NOT a CSV file; it's a single string contained in a file with a larger overall structure.

Accepted answer by Bart Kiers

Try:

public class Main { 
    public static void main(String[] args) {
        String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
        String[] tokens = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);
        for(String t : tokens) {
            System.out.println("> "+t);
        }
    }
}

Output:

> foo
> bar
> c;qual="baz,blurb"
> d;junk="quux,syzygy"

In other words: split on the comma only if that comma has zero, or an even number of quotes ahead of it.
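
To make the "even number of quotes ahead" rule concrete, here is a minimal sketch that applies the same condition without a regex. The class and method names (QuoteParitySplit, quotesAhead, split) are made up for illustration, and the sketch assumes the quotes in the input are balanced. It rescans the rest of the string for every comma, so it is meant to explain the rule rather than to be fast.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original answer.
class QuoteParitySplit {
    // Counts the double quotes in s from index 'from' to the end.
    static int quotesAhead(String s, int from) {
        int count = 0;
        for (int i = from; i < s.length(); i++) {
            if (s.charAt(i) == '"') count++;
        }
        return count;
    }

    // Splits at every comma that has zero or an even number of quotes after it.
    static List<String> split(String s) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == ',' && quotesAhead(s, i + 1) % 2 == 0) {
                tokens.add(s.substring(start, i));
                start = i + 1;
            }
        }
        tokens.add(s.substring(start)); // last token, possibly empty
        return tokens;
    }
}
```

A comma inside a quoted section always has an odd number of quotes after it (the closing quote plus any later pairs), which is why it is skipped.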

Or, a bit friendlier for the eyes:

public class Main { 
    public static void main(String[] args) {
        String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";

        String otherThanQuote = " [^\"] ";
        String quotedString = String.format(" \" %s* \" ", otherThanQuote);
        String regex = String.format("(?x) "+ // enable comments, ignore white spaces
                ",                         "+ // match a comma
                "(?=                       "+ // start positive look ahead
                "  (?:                     "+ //   start non-capturing group 1
                "    %s*                   "+ //     match 'otherThanQuote' zero or more times
                "    %s                    "+ //     match 'quotedString'
                "  )*                      "+ //   end group 1 and repeat it zero or more times
                "  %s*                     "+ //   match 'otherThanQuote'
                "  $                       "+ // match the end of the string
                ")                         ", // stop positive look ahead
                otherThanQuote, quotedString, otherThanQuote);

        String[] tokens = line.split(regex, -1);
        for(String t : tokens) {
            System.out.println("> "+t);
        }
    }
}

which produces the same as the first example.

EDIT

As mentioned by @MikeFHay in the comments:

I prefer using Guava's Splitter, as it has saner defaults (see the discussion above about empty matches being trimmed by String#split()), so I did:

Splitter.on(Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"))
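
The trimming that the comment refers to can be reproduced with the JDK alone. The sketch below is only an illustration (SplitLimitDemo and its method names are hypothetical): String#split without a limit argument discards trailing empty strings, while a negative limit keeps them.

```java
import java.util.Arrays;

// Hypothetical demo class; only String#split from the JDK is used.
class SplitLimitDemo {
    static final String REGEX = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    // Default behaviour: trailing empty strings are removed from the result.
    static String[] withoutLimit(String s) {
        return s.split(REGEX);
    }

    // A negative limit keeps every token, including trailing empty ones.
    static String[] keepTrailing(String s) {
        return s.split(REGEX, -1);
    }

    public static void main(String[] args) {
        String line = "a,\"b,c\",,";
        System.out.println(Arrays.toString(withoutLimit(line)));
        System.out.println(Arrays.toString(keepTrailing(line)));
    }
}
```

For the input in main, the first call returns two tokens and the second returns four, the last two being empty.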

Answer by Woot4Moo

I would do something like this:

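// Sketch: currentString and currentStringIndex are assumed to be defined by the caller.
boolean foundQuote = false;

// Remember whether the character at the current index is a quote.
if (currentString.charAt(currentStringIndex) == '"')
{
   foundQuote = true;
}

if (foundQuote)
{
   // inside a quoted section: do nothing
}
else
{
  String[] split = currentString.split(",");
}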

Answer by Stefan Kendall

Rather than use lookahead and other crazy regex, just pull out the quotes first. That is, for every quote grouping, replace that grouping with __IDENTIFIER_1 or some other indicator, and store the original grouping in a map of string to string, keyed by that indicator.

After you split on comma, replace all mapped identifiers with the original string values.
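
The answer gives no code, so the following is only one possible sketch of the idea. The class name, the exact placeholder format (a numbered __IDENTIFIER_n__ with a closing suffix so that one key is never a prefix of another), and the assumption that the input never contains placeholder-like text are all mine.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the "mask quoted parts, split, restore" approach.
class PlaceholderSplit {
    private static final Pattern QUOTED = Pattern.compile("\"[^\"]*\"");

    static List<String> split(String s) {
        // Step 1: replace each quoted group with a unique placeholder.
        Map<String, String> saved = new HashMap<>();
        StringBuffer masked = new StringBuffer();
        Matcher m = QUOTED.matcher(s);
        int n = 0;
        while (m.find()) {
            String key = "__IDENTIFIER_" + (n++) + "__";
            saved.put(key, m.group());
            m.appendReplacement(masked, Matcher.quoteReplacement(key));
        }
        m.appendTail(masked);

        // Step 2: split on commas; none of them is inside quotes any more.
        List<String> tokens = new ArrayList<>();
        for (String token : masked.toString().split(",", -1)) {
            // Step 3: put the original quoted text back.
            for (Map.Entry<String, String> e : saved.entrySet()) {
                token = token.replace(e.getKey(), e.getValue());
            }
            tokens.add(token);
        }
        return tokens;
    }
}
```

The trade-off is two extra passes over the string and the risk of a placeholder colliding with real data, in exchange for a trivial split step.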

Answer by Matthew Sowders

Try a lookaround like (?!\"),(?!\"). This should match commas that are not surrounded by ".

Answer by djna

You're in that annoying boundary area where regexps almost won't do (as Bart has pointed out, escaped quotes would make life hard), and yet a full-blown parser seems like overkill.

If you are likely to need greater complexity any time soon I would go looking for a parser library. For example this one

Answer by Jason S

I was impatient and chose not to wait for answers... for reference it doesn't look that hard to do something like this (which works for my application, I don't need to worry about escaped quotes, as the stuff in quotes is limited to a few constrained forms):

final static private Pattern splitSearchPattern = Pattern.compile("[\",]"); 
private List<String> splitByCommasNotInQuotes(String s) {
    if (s == null)
        return Collections.emptyList();

    List<String> list = new ArrayList<String>();
    Matcher m = splitSearchPattern.matcher(s);
    int pos = 0;
    boolean quoteMode = false;
    while (m.find())
    {
        String sep = m.group();
        if ("\"".equals(sep))
        {
            quoteMode = !quoteMode;
        }
        else if (!quoteMode && ",".equals(sep))
        {
            int toPos = m.start(); 
            list.add(s.substring(pos, toPos));
            pos = m.end();
        }
    }
    if (pos < s.length())
        list.add(s.substring(pos));
    return list;
}

(exercise for the reader: extend to handling escaped quotes by looking for backslashes also.)
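
One way the exercise might be approached, as a sketch only: let the search pattern also match a backslash followed by any character, and ignore that pair. The class name and the choice to treat a backslash as escaping whatever follows it (inside or outside quotes) are my assumptions, not something the answer specifies. Like the method above, it drops a trailing empty token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical extension of splitByCommasNotInQuotes that honours backslash escapes.
class EscapeAwareSplit {
    // The escape alternative comes first, so \" and \, are consumed as a pair.
    private static final Pattern SEARCH = Pattern.compile("\\\\.|[\",]");

    static List<String> split(String s) {
        List<String> list = new ArrayList<>();
        if (s == null) return list;
        Matcher m = SEARCH.matcher(s);
        int pos = 0;
        boolean quoteMode = false;
        while (m.find()) {
            String sep = m.group();
            if (sep.charAt(0) == '\\') {
                continue; // escaped character: neither a quote toggle nor a separator
            }
            if ("\"".equals(sep)) {
                quoteMode = !quoteMode;
            } else if (!quoteMode) {
                // an unquoted, unescaped comma ends the current token
                list.add(s.substring(pos, m.start()));
                pos = m.end();
            }
        }
        if (pos < s.length()) {
            list.add(s.substring(pos));
        }
        return list;
    }
}
```

The tokens are returned verbatim, so the backslashes stay in the output; removing them would be a separate step.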

Answer by Fabian Steeg

While I do like regular expressions in general, for this kind of state-dependent tokenization I believe a simple parser (which in this case is much simpler than that word might make it sound) is probably a cleaner solution, in particular with regards to maintainability, e.g.:

String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
List<String> result = new ArrayList<String>();
int start = 0;
boolean inQuotes = false;
for (int current = 0; current < input.length(); current++) {
    if (input.charAt(current) == '\"') inQuotes = !inQuotes; // toggle state
    boolean atLastChar = (current == input.length() - 1);
    if(atLastChar) result.add(input.substring(start));
    else if (input.charAt(current) == ',' && !inQuotes) {
        result.add(input.substring(start, current));
        start = current + 1;
    }
}

If you don't care about preserving the commas inside the quotes, you could simplify this approach (no handling of the start index, no last-character special case) by replacing the commas in quotes with something else and then splitting at commas:

String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
StringBuilder builder = new StringBuilder(input);
boolean inQuotes = false;
for (int currentIndex = 0; currentIndex < builder.length(); currentIndex++) {
    char currentChar = builder.charAt(currentIndex);
    if (currentChar == '\"') inQuotes = !inQuotes; // toggle state
    if (currentChar == ',' && inQuotes) {
        builder.setCharAt(currentIndex, ';'); // or '?', and replace later
    }
}
List<String> result = Arrays.asList(builder.toString().split(","));

Answer by Marcin Kosinski

I would not recommend Bart's regex answer; I find a parsing solution better in this particular case (as Fabian proposed). I tried both the regex solution and my own parsing implementation and found that:

  1. Parsing is much faster than splitting with the lookahead regex: ~20 times faster for short strings, ~40 times faster for long strings.
  2. The regex split fails to find the empty string after the last comma. That was not in the original question, though; it was my requirement.

My solution and test below.

String tested = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\",";
long start = System.nanoTime();
String[] tokens = tested.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
long timeWithSplitting = System.nanoTime() - start;

start = System.nanoTime(); 
List<String> tokensList = new ArrayList<String>();
boolean inQuotes = false;
StringBuilder b = new StringBuilder();
for (char c : tested.toCharArray()) {
    switch (c) {
    case ',':
        if (inQuotes) {
            b.append(c);
        } else {
            tokensList.add(b.toString());
            b = new StringBuilder();
        }
        break;
    case '\"':
        inQuotes = !inQuotes;
        // intentional fall-through: the quote itself is kept in the token
    default:
        b.append(c);
        break;
    }
}
tokensList.add(b.toString());
long timeWithParsing = System.nanoTime() - start;

System.out.println(Arrays.toString(tokens));
System.out.println(tokensList.toString());
System.out.printf("Time with splitting:\t%10d\n",timeWithSplitting);
System.out.printf("Time with parsing:\t%10d\n",timeWithParsing);

Of course, you are free to change the switch to else-ifs in this snippet if you are uncomfortable with its ugliness. Note the lack of a break in the case for the quote character: it deliberately falls through so that the quote is appended. StringBuilder was chosen over StringBuffer by design to increase speed, since thread safety is irrelevant here.
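
For readers who prefer the else-if form mentioned above, here is a sketch of how the same loop might look. The class and method names are hypothetical, and the StringBuilder is reused with setLength(0) instead of being reallocated, which is a small change from the snippet.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical if/else rewrite of the switch-based parser.
class IfElseSplit {
    static List<String> split(String tested) {
        List<String> tokens = new ArrayList<>();
        StringBuilder b = new StringBuilder();
        boolean inQuotes = false;
        for (char c : tested.toCharArray()) {
            if (c == ',' && !inQuotes) {
                // separator outside quotes: finish the current token
                tokens.add(b.toString());
                b.setLength(0);
            } else {
                if (c == '"') {
                    inQuotes = !inQuotes; // toggle state, the quote is still appended
                }
                b.append(c);
            }
        }
        tokens.add(b.toString()); // last token, empty if the input ends with a comma
        return tokens;
    }
}
```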

Answer by Holger

The simplest approach is not to match the delimiters (i.e. commas) with complex additional logic that matches what is actually intended (the data, which might be quoted strings) just to exclude false delimiters, but rather to match the intended data in the first place.

The pattern consists of two alternatives: a quoted string ("[^"]*" or ".*?") or everything up to the next comma ([^,]+). To support empty cells, we have to allow the unquoted item to be empty and to consume the next comma, if any, and use the \\G anchor:

Pattern p = Pattern.compile("\\G\"(.*?)\",?|([^,]*),?");

The pattern also contains two capturing groups to get either the quoted string's content or the plain content.

Then, with Java 9, we can get an array as

String[] a = p.matcher(input).results()
    .map(m -> m.group(m.start(1)<0? 2: 1))
    .toArray(String[]::new);

whereas older Java versions need a loop like

for(Matcher m = p.matcher(input); m.find(); ) {
    String token = m.group(m.start(1)<0? 2: 1);
    System.out.println("found: "+token);
}

Adding the items to a List or an array is left as an exercise to the reader.
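
A possible take on that exercise, offered as a sketch rather than as the answer's own code. The class name and the expectMore guard are my additions: because the unquoted alternative may match the empty string, the matcher can report one more zero-length match at the end of the input, so this version stops as soon as a match did not consume a trailing comma.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper collecting the matched cells into a List (works before Java 9).
class CellCollector {
    private static final Pattern P = Pattern.compile("\\G\"(.*?)\",?|([^,]*),?");

    static List<String> cells(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(input);
        boolean expectMore = true; // true at the start and after a consumed comma
        while (expectMore && m.find()) {
            // group 1 holds quoted content, group 2 holds unquoted content
            out.add(m.group(m.start(1) < 0 ? 2 : 1));
            // another cell follows only if this match ended with a comma
            expectMore = m.end() > m.start() && input.charAt(m.end() - 1) == ',';
        }
        return out;
    }
}
```

With this guard, an input that ends in a comma still yields a final empty cell, while an input that ends in a value does not get an extra one.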

For Java 8, you can use the results() implementation of this answer to do it like the Java 9 solution.

For mixed content with embedded strings, like in the question, you can simply use

Pattern p = Pattern.compile("\\G((\"(.*?)\"|[^,])*),?");

But then, the strings are kept in their quoted form.
