Java String Tokenizer: split string by comma and ignore comma in double quotes

Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/19241010/

Tags: java, regex, string

Asked by Shashi

I have a string like below -


value1, value2, value3, value4, "value5, 1234", value6, value7, "value8", value9, "value10, 123.23"


If I tokenize the above string, I get comma-separated tokens. But I would like to tell the string tokenizer to ignore commas inside double quotes while splitting. How can I do this?

Thanks in advance,
Shashi

Accepted answer by Ravi Thapliyal

Use a CSV parser like OpenCSV to take care of things like commas in quoted elements, values that span multiple lines, etc. automatically. You can also use the library to serialize your text back to CSV.

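import java.io.StringReader;
import com.opencsv.CSVReader; // au.com.bytecode.opencsv.CSVReader in OpenCSV 2.x

String str = "value1, value2, value3, value4, \"value5, 1234\", " +
        "value6, value7, \"value8\", value9, \"value10, 123.23\"";

CSVReader reader = new CSVReader(new StringReader(str));

String[] tokens;
// readNext() returns one parsed line per call and null at end of input;
// it declares checked exceptions (such as IOException) that the caller must handle
while ((tokens = reader.readNext()) != null) {
    System.out.println(tokens[0]); // value1
    System.out.println(tokens[4]); // value5, 1234
    System.out.println(tokens[9]); // value10, 123.23
}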

Answer by Ivan Mushketyk

You can use several approaches:


  1. Write code that searches for commas and maintains a state indicating whether a particular comma is in quotes or not.
  2. Tokenize by the double-quote symbol and then tokenize the strings in the resulting array by the comma symbol (make sure you tokenize only the strings with indexes 0, 2, 4, etc., since they were not in double quotes in the original string).
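
The first approach can be sketched roughly as below. This is only a minimal illustration, not code from the answer: the class name QuoteAwareSplitter and the method split are made up for this example, and the sketch assumes quotes are balanced and never escaped (no "" inside a quoted value).

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAwareSplitter {

    // Hypothetical helper (not from any library): walks the line once,
    // tracking whether the current character is inside double quotes.
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (char c : line.toCharArray()) {
            if (c == '"') {
                // Toggle the state; the quote character itself is dropped.
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                // A comma outside quotes ends the current field.
                fields.add(current.toString().trim());
                current.setLength(0);
            } else {
                // Everything else, including commas inside quotes, is kept.
                current.append(c);
            }
        }
        // The last field has no trailing comma, so add it after the loop.
        fields.add(current.toString().trim());
        return fields;
    }

    public static void main(String[] args) {
        String line = "value1, value2, \"value5, 1234\", value6";
        for (String field : split(line)) {
            System.out.println(field);
        }
    }
}
```

Unlike a version that skips empty buffers, this sketch keeps empty fields (for example "a,,b" yields three fields), which may or may not be what you want.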

Answer by Bohemian

You just need one line and the right regex:


String[] values = input.replaceAll("^\"", "").split("\"?(,|$)(?=(([^\"]*\"){2})*[^\"]*$) *\"?");

This also neatly trims off the wrapping double quotes for you, including the final quote!

Note: an interesting edge case arises when the first term is quoted; it requires the extra step of trimming the leading quote using replaceAll().

Here's some test code:


String input= "\"value1, value2\", value3, value4, \"value5, 1234\", " +
    "value6, value7, \"value8\", value9, \"value10, 123.23\"";
String[] values = input.replaceAll("^\"", "").split("\"?(,|$)(?=(([^\"]*\"){2})*[^\"]*$) *\"?");
for (String s : values)
    System.out.println(s);

Output:


value1, value2
value3
value4
value5, 1234
value6
value7
value8
value9
value10, 123.23

Answer by Sumedh Kapoor

Without any third-party library dependency, the following code can also parse the fields as required:

import java.util.ArrayList;
import java.util.List;

public class CSVSpliter {

  public static void main(String[] args) {
    String inputStr = "value1, value2, value3, value4, \"value5, 1234\", value6, value7, \"value8\", value9, \"value10, 123.23\"";

    List<String> splitStringList = new ArrayList<>();
    boolean insideDoubleQuotes = false;
    StringBuilder field = new StringBuilder();

    for (int i = 0; i < inputStr.length(); i++) {
        char c = inputStr.charAt(i);
        if (c == '"' && !insideDoubleQuotes) {
            // opening quote: a quoted field starts here
            insideDoubleQuotes = true;
        } else if (c == '"') {
            // closing quote: the quoted field is complete
            insideDoubleQuotes = false;
            splitStringList.add(field.toString().trim());
            field.setLength(0);
        } else if (c == ',' && !insideDoubleQuotes) {
            // a comma outside quotes ends the field; the comma right after
            // a closing quote finds an empty buffer and is ignored
            if (field.length() > 0) {
                splitStringList.add(field.toString().trim());
            }
            // clear the field for the next word
            field.setLength(0);
        } else {
            field.append(c);
        }
    }
    // add the last field when the input does not end with a closing quote
    if (field.toString().trim().length() > 0) {
        splitStringList.add(field.toString().trim());
    }
    for (String str : splitStringList) {
        System.out.println("Split fields: " + str);
    }
  }
}

This will give the following output:


Split fields: value1
Split fields: value2
Split fields: value3
Split fields: value4
Split fields: value5, 1234
Split fields: value6
Split fields: value7
Split fields: value8
Split fields: value9
Split fields: value10, 123.23


Answer by Reza

String delimiter = ",";

String v = "value1, value2, value3, value4, \"value5, 1234\", value6, value7, \"value8\", value9, \"value10, 123.23\"";

String[] a = v.split(delimiter + "(?=(?:(?:[^\"]*+\"){2})*+[^\"]*+$)");
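
The lookahead in this pattern only lets a comma match when an even number of double quotes follows it, that is, when the comma sits outside any quoted value. The split itself leaves the leading spaces and the wrapping quotes in each token, so a cleanup pass is usually wanted. The sketch below shows one possible way to do that; the class name LookaheadSplit and the method splitAndClean are invented for this illustration, and balanced, unescaped quotes are assumed.

```java
import java.util.ArrayList;
import java.util.List;

class LookaheadSplit {

    // Hypothetical helper: splits with the lookahead regex from the answer,
    // then trims each token and strips one pair of wrapping double quotes.
    static List<String> splitAndClean(String line) {
        // Split on commas followed by an even number of quotes up to end of input.
        String[] raw = line.split(",(?=(?:(?:[^\"]*+\"){2})*+[^\"]*+$)");
        List<String> cleaned = new ArrayList<>();
        for (String token : raw) {
            String t = token.trim();
            if (t.length() >= 2 && t.startsWith("\"") && t.endsWith("\"")) {
                t = t.substring(1, t.length() - 1);
            }
            cleaned.add(t);
        }
        return cleaned;
    }

    public static void main(String[] args) {
        String v = "value1, value2, \"value5, 1234\", value6";
        for (String s : splitAndClean(v)) {
            System.out.println(s);
        }
    }
}
```

Keep in mind that String.split drops trailing empty strings, so empty fields at the end of the line are not returned.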

Answer by Igor Baikalov

I'm allergic to regex; why not double-split as someone suggested?


    String str = "value1, value2, value3, value4, \"value5, 1234\", value6, value7, \"value8\", value9, \"value10, 123.23\"";
    boolean quoted = false;
    for(String q : str.split("\"")) {
        if(quoted)
            System.out.println(q.trim());
        else
            for(String s : q.split(","))
                if(!s.trim().isEmpty())
                    System.out.println(s.trim());
        quoted = !quoted;
    }