Java: Splitting on comma outside quotes
Original question: http://stackoverflow.com/questions/18893390/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Splitting on comma outside quotes
Asked by Jakob Mathiasen
My program reads a line from a file. This line contains comma-separated text like:
123,test,444,"don't split, this",more test,1
I would like the result of a split to be this:
123
test
444
"don't split, this"
more test
1
If I use String.split(","), I get this:
123
test
444
"don't split
this"
more test
1
In other words: the comma in the substring "don't split, this" is not a separator. How do I deal with this?
Accepted answer by Rohit Jain
You can try out this regex:
str.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
This splits the string on a , that is followed by an even number of double quotes. In other words, it splits on a comma outside the double quotes. This will work provided you have balanced quotes in your string.
Explanation:
, // Split on comma
(?= // Followed by
(?: // Start a non-capture group
[^"]* // 0 or more non-quote characters
" // 1 quote
[^"]* // 0 or more non-quote characters
" // 1 quote
)* // 0 or more repetition of non-capture group (multiple of 2 quotes will be even)
[^"]* // Finally 0 or more non-quotes
$ // Till the end (This is necessary, else every comma will satisfy the condition)
)
You can even write it like this in your code, using the (?x) modifier with your regex. The modifier ignores any whitespace in your regex, so it becomes easier to read a regex broken into multiple lines, like so:
String[] arr = str.split("(?x) " +
        ", " +           // Split on comma
        "(?= " +         // Followed by
        " (?: " +        // Start a non-capture group
        " [^\"]* " +     // 0 or more non-quote characters
        " \" " +         // 1 quote
        " [^\"]* " +     // 0 or more non-quote characters
        " \" " +         // 1 quote
        " )* " +         // 0 or more repetition of non-capture group (multiple of 2 quotes will be even)
        " [^\"]* " +     // Finally 0 or more non-quotes
        " $ " +          // Till the end (This is necessary, else every comma will satisfy the condition)
        ") "             // End look-ahead
);
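To try the look-ahead split end to end, a minimal sketch might look like the following. The class name LookaheadSplitDemo and the helper splitOutsideQuotes are made up for illustration; only the regex itself comes from the answer above, and the sketch assumes the input has balanced double quotes.

```java
public class LookaheadSplitDemo {
    // A comma followed by an even number of double quotes up to the end of the input
    static final String OUTSIDE_QUOTES = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    // Hypothetical helper: splits a line on commas that are not inside double quotes
    static String[] splitOutsideQuotes(String line) {
        return line.split(OUTSIDE_QUOTES);
    }

    public static void main(String[] args) {
        String line = "123,test,444,\"don't split, this\",more test,1";
        for (String field : splitOutsideQuotes(line)) {
            System.out.println(field);
        }
    }
}
```

For the sample line from the question this prints six fields, with "don't split, this" kept as one field, quotes included.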
Answer by stefan.schwetschke
You can do this very easily without a complex regular expression:
- Split on the character ". You get a list of Strings.
- Process each string in the list: split every string at an even position in the list (indexing from zero) on "," (you get a list inside a list), and leave every odd-positioned string alone (putting it directly into a list inside the list).
- Join the list of lists, so you get a single list.
If you want to handle quoting of '"', you have to adapt the algorithm a little (joining some parts that were incorrectly split, or changing the splitting to a simple regexp), but the basic structure stays.
So basically it is something like this:
public class SplitTest {
    public static void main(String[] args) {
        final String splitMe = "123,test,444,\"don't split, this\",more test,1";
        // Pieces at odd indices were inside quotes; pieces at even indices were outside
        final String[] splitByQuote = splitMe.split("\"");
        final String[][] splitByComma = new String[splitByQuote.length][];
        for (int i = 0; i < splitByQuote.length; i++) {
            String part = splitByQuote[i];
            if (i % 2 == 0) {
                splitByComma[i] = part.split(",");
            } else {
                splitByComma[i] = new String[] { part };
            }
        }
        for (String[] parts : splitByComma) {
            for (String part : parts) {
                // Skip the empty token left when a comma directly follows a closing quote
                if (!part.isEmpty()) {
                    System.out.println(part);
                }
            }
        }
    }
}
This would be much cleaner with lambdas, I promise!
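One way the lambda version hinted at here might look, as a sketch using Java 8 streams. The class name SplitByQuoteStreams and the method split are made up for illustration. Dropping empty strings is an assumption added to suppress the empty token that appears when a comma directly follows a closing quote; it would also drop genuinely empty fields.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SplitByQuoteStreams {
    static List<String> split(String line) {
        // Pieces at odd indices were inside quotes; pieces at even indices were outside
        String[] byQuote = line.split("\"");
        return IntStream.range(0, byQuote.length)
                .mapToObj(i -> i % 2 == 0
                        ? byQuote[i].split(",")            // outside quotes: split on commas
                        : new String[] { byQuote[i] })     // inside quotes: keep as one piece
                .flatMap(Arrays::stream)                   // join the list of lists
                .filter(s -> !s.isEmpty())                 // assumption: discard empty tokens
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        split("123,test,444,\"don't split, this\",more test,1")
                .forEach(System.out::println);
    }
}
```

As with the loop version, the quote characters themselves are consumed by the first split, so the quoted field is printed without its surrounding quotes.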
Answer by Abhijith Nagarajan
Please see the code snippet below. This code only considers the happy path. Change it according to your requirements.
// Needs: import java.util.LinkedList; import java.util.List;
public static String[] splitWithEscape(final String str, char split,
        char escapeCharacter) {
    final List<String> list = new LinkedList<String>();
    char[] cArr = str.toCharArray();
    boolean isEscape = false; // true while inside an escaped (quoted) section
    StringBuilder sb = new StringBuilder();
    for (char c : cArr) {
        if (isEscape && c != escapeCharacter) {
            // Inside the escaped section: keep everything, including separators
            sb.append(c);
        } else if (c != split && c != escapeCharacter) {
            sb.append(c);
        } else if (c == escapeCharacter) {
            if (!isEscape) {
                isEscape = true;
                // Flush any text collected before the opening escape character
                if (sb.length() > 0) {
                    list.add(sb.toString());
                    sb = new StringBuilder();
                }
            } else {
                isEscape = false;
            }
        } else if (c == split) {
            list.add(sb.toString());
            sb = new StringBuilder();
        }
    }
    if (sb.length() > 0) {
        list.add(sb.toString());
    }
    String[] strArr = new String[list.size()];
    return list.toArray(strArr);
}
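The same character-scanning idea can be reduced to a single in-quotes flag. The following is a sketch of that variant, not the answer's method: the class name QuoteAwareScanner is made up, and unlike the snippet above it keeps empty fields and does not start a new field at a quote character. Like the snippet above, it drops the quote characters from the output.

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteAwareScanner {
    static List<String> split(String line, char separator, char quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : line.toCharArray()) {
            if (c == quote) {
                inQuotes = !inQuotes;            // toggle state, drop the quote itself
            } else if (c == separator && !inQuotes) {
                fields.add(current.toString());  // separator outside quotes ends the field
                current.setLength(0);
            } else {
                current.append(c);               // ordinary character, or separator inside quotes
            }
        }
        fields.add(current.toString());          // last field (may be empty)
        return fields;
    }

    public static void main(String[] args) {
        split("123,test,444,\"don't split, this\",more test,1", ',', '"')
                .forEach(System.out::println);
    }
}
```

Because every separator outside quotes closes a field, an input such as a,,b yields three fields, the middle one empty.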
Answer by zx81
Why Split when you can Match?
Resurrecting this question because for some reason, the easy solution wasn't mentioned. Here is our beautifully compact regex:
"[^"]*"|[^,]+
This will match all the desired fragments (see demo).
Explanation
- With "[^"]*", we match complete "double-quoted strings"
- or |
- we match [^,]+, any characters that are not a comma.
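In Java, matching instead of splitting could be sketched with a plain Matcher.find() loop, which also works before Java 9. The class name MatchInsteadOfSplit and the method fields are made up for illustration; the pattern is the one given above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchInsteadOfSplit {
    // Either a complete double-quoted string, or a run of non-comma characters
    private static final Pattern FIELD = Pattern.compile("\"[^\"]*\"|[^,]+");

    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            out.add(m.group());   // each match is one field
        }
        return out;
    }

    public static void main(String[] args) {
        fields("123,test,444,\"don't split, this\",more test,1")
                .forEach(System.out::println);
    }
}
```

Since [^,]+ needs at least one character, empty fields (as in a,,b) produce no match and are silently skipped.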
A possible refinement is to improve the string side of the alternation to allow the quoted strings to include escaped quotes.
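One possible form of that refinement, as a sketch: it assumes quotes inside a quoted string are escaped with a backslash (\"), which the answer does not specify. The class name EscapedQuoteMatch and the method fields are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapedQuoteMatch {
    // Quoted side: any character except quote or backslash, or a backslash followed by any character.
    // Regex: "(?:[^"\\]|\\.)*"|[^,]+
    private static final Pattern FIELD =
            Pattern.compile("\"(?:[^\"\\\\]|\\\\.)*\"|[^,]+");

    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // The actual input text is: a,"say \"hi\", ok",b
        fields("a,\"say \\\"hi\\\", ok\",b").forEach(System.out::println);
    }
}
```

For that input the middle field comes back as one match, with its escaped quotes and inner comma intact. Inputs that escape quotes by doubling them would need a different quoted-side pattern.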
Answer by LAFK says Reinstate Monica
Building upon @zx81's answer, because the matching idea is really nice, I've added the Java 9 results() call, which returns a Stream. Since the OP wanted to use split, I've collected to String[], as split does.
Be careful if you have spaces after your comma separators (a, b, "c,d"). In that case you need to change the pattern.
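One way the pattern might be changed for that case, as a sketch: the unquoted side is required to start with a character that is neither a comma nor whitespace, so the matcher steps over the space and can still see the opening quote. The class name SpacesAfterCommas and the method fields are made up for illustration, and this is only one of several possible adjustments. It uses results(), so it needs Java 9 or higher.

```java
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

public class SpacesAfterCommas {
    // Quoted string, or an unquoted run that does not start with a comma or whitespace
    private static final Pattern FIELD = Pattern.compile("\"[^\"]*\"|[^,\\s][^,]*");

    static String[] fields(String line) {
        return FIELD.matcher(line)
                .results()
                .map(MatchResult::group)
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        for (String field : fields("a, b, \"c,d\"")) {
            System.out.println(field);
        }
    }
}
```

For a, b, "c,d" this yields the three fields a, b and "c,d". Whitespace before a comma is not trimmed by this pattern and stays attached to the preceding field.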
Jshell demo
$ jshell
-> String so = "123,test,444,\"don't split, this\",more test,1";
| Added variable so of type String with initial value "123,test,444,"don't split, this",more test,1"
-> Pattern.compile("\"[^\"]*\"|[^,]+").matcher(so).results();
| Expression value is: java.util.stream.ReferencePipeline$Head@2038ae61
| assigned to temporary variable of type java.util.stream.Stream<MatchResult>
-> .map(MatchResult::group).toArray(String[]::new);
| Expression value is: [Ljava.lang.String;@6b09bb57
| assigned to temporary variable of type String[]
-> Arrays.stream().forEach(System.out::println);
123
test
444
"don't split, this"
more test
1
Code
String so = "123,test,444,\"don't split, this\",more test,1";
Pattern.compile("\"[^\"]*\"|[^,]+")
.matcher(so)
.results()
.map(MatchResult::group)
.toArray(String[]::new);
Explanation
- Regex "[^"]" matches: a quote, anything but a quote, a quote.
- Regex "[^"]*" matches: a quote, anything but a quote 0 (or more) times, a quote.
- That regex needs to go first to "win", otherwise matching anything but a comma 1 or more times (that is, [^,]+) would "win".
- results() requires Java 9 or higher.
- It returns Stream<MatchResult>, which I map using a group() call and collect to an array of Strings. A parameterless toArray() call would return Object[].