Parsing CSV input with a RegEx in Java
Original question: http://stackoverflow.com/questions/1441556/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow. Any reuse must follow the same CC BY-SA license and cite the original URL and author information.
Asked by Nathan Spears
I know, now I have two problems. But I'm having fun!
I started with this advice not to try and split, but instead to match on what is an acceptable field, and expanded from there to this expression.
final Pattern pattern = Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?=,|$)");
The expression looks like this without the annoying escaped quotes:
"([^"]*)"|(?<=,|^)([^,]*)(?=,|$)
This is working well for me - either it matches on "two quotes and whatever is between them", or "something between the start of the line or a comma and the end of the line or a comma". Iterating through the matches gets me all the fields, even if they are empty. For instance,
the quick, "brown, fox jumps", over, "the",,"lazy dog"
breaks down into
the quick
"brown, fox jumps"
over
"the"
"lazy dog"
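The iteration described above can be sketched as follows. This is a minimal illustration, not the asker's actual code: the class name `MatchListingDemo` and the helper `listMatches` are hypothetical, and the sample line is assumed to have no spaces after the commas (as in the test string used later in this thread), because a space after a comma would become part of an unquoted match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchListingDemo {
    // The expression from the question: a quoted run, or text delimited by commas / line ends.
    static final Pattern FIELD =
            Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?=,|$)");

    // Collects the full text of every match, in order (hypothetical helper).
    static List<String> listMatches(String line) {
        List<String> matches = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            matches.add(m.group()); // group() is the whole match, quotes included
        }
        return matches;
    }

    public static void main(String[] args) {
        String line = "the quick,\"brown, fox jumps\",over,\"the\",,\"lazy dog\"";
        for (String field : listMatches(line)) {
            System.out.println("[" + field + "]"); // brackets make the empty field visible
        }
    }
}
```

Because `group()` returns the entire match, quoted fields keep their surrounding quotes at this stage.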
Great! Now I want to drop the quotes, so I added the lookahead and lookbehind non-capturing groups like I was doing for the commas.
final Pattern pattern = Pattern.compile("(?<=\")([^\"]*)(?=\")|(?<=,|^)([^,]*)(?=,|$)");
Again, the expression without the escaped quotes is:
(?<=")([^"]*)(?=")|(?<=,|^)([^,]*)(?=,|$)
Instead of the desired result
the quick
brown, fox jumps
over
the
lazy dog
now I get this breakdown:
the quick
"brown
fox jumps"
,over,
"the"
,,
"lazy dog"
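The breakdown above can be reproduced with a short sketch. The class name `LookaroundDemo` and the helper `listMatches` are hypothetical, and the input is assumed to be the sample line without spaces after the commas.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // The lookaround version from the question: the quotes are asserted, not consumed.
    static final Pattern FIELD =
            Pattern.compile("(?<=\")([^\"]*)(?=\")|(?<=,|^)([^,]*)(?=,|$)");

    // Collects the full text of every match, in order (hypothetical helper).
    static List<String> listMatches(String line) {
        List<String> matches = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            // Note: text such as ,over, lies between a closing quote and the next
            // opening quote, so it also satisfies (?<=")[^"]*(?=").
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String line = "the quick,\"brown, fox jumps\",over,\"the\",,\"lazy dog\"";
        for (String field : listMatches(line)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

A lookbehind only checks that the preceding character is a quote; it cannot tell an opening quote from a closing one.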
What am I missing?
Accepted answer by Devon_C_Miller
Operator precedence. Basically there is none; it's all left to right. So the or (|) is applying to the closing-quote lookahead and the comma lookahead.
Try:
(?:(?<=")([^"]*)(?="))|(?<=,|^)([^,]*)(?=,|$)
Answer by Nathan Spears
When I started to understand what I had done wrong, I also started to understand how convoluted the lookarounds were making this. I finally realized that I didn't want all the matched text, I wanted specific groups inside of it. I ended up using something very similar to my original RegEx except that I didn't do a lookahead on the closing comma, which I think should be a little more efficient. Here is my final code.
package regex.parser;

import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CSVParser {

    /*
     * This Pattern will match on either quoted text or text between commas, including
     * whitespace, and accounting for beginning and end of line.
     */
    private final Pattern csvPattern = Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?:,|$)");
    private ArrayList<String> allMatches = null;
    private Matcher matcher = null;
    private String match = null;
    private int size;

    public CSVParser() {
        allMatches = new ArrayList<String>();
        matcher = null;
        match = null;
    }

    public String[] parse(String csvLine) {
        matcher = csvPattern.matcher(csvLine);
        allMatches.clear();
        String match;
        while (matcher.find()) {
            match = matcher.group(1);
            if (match!=null) {
                allMatches.add(match);
            }
            else {
                allMatches.add(matcher.group(2));
            }
        }

        size = allMatches.size();
        if (size > 0) {
            return allMatches.toArray(new String[size]);
        }
        else {
            return new String[0];
        }
    }

    public static void main(String[] args) {
        String lineinput = "the quick,\"brown, fox jumps\",over,\"the\",,\"lazy dog\"";

        CSVParser myCSV = new CSVParser();
        System.out.println("Testing CSVParser with: \n " + lineinput);
        for (String s : myCSV.parse(lineinput)) {
            System.out.println(s);
        }
    }
}
Answer by Tim Bender
I know this isn't what the OP wants, but for other readers: one of the String.replace methods could be used to strip the quotes from each element in the result array of the OP's current regex.
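A minimal sketch of that idea, assuming the OP's original quote-matching expression and the sample line without spaces after the commas. The class name `QuoteStripDemo` and the helper `parse` are hypothetical. Note that `String.replace` removes every quote character in the element, so a quote inside a field would be dropped as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteStripDemo {
    // The OP's original expression, which returns quoted fields with their quotes.
    static final Pattern FIELD =
            Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?=,|$)");

    // Matches each field, then strips all double quotes from it (hypothetical helper).
    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            fields.add(m.group().replace("\"", "")); // String.replace(CharSequence, CharSequence)
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "the quick,\"brown, fox jumps\",over,\"the\",,\"lazy dog\"";
        for (String field : parse(line)) {
            System.out.println("[" + field + "]"); // brackets make the empty field visible
        }
    }
}
```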
Answer by Parantapa Bhattacharya
(?:^|,)\s*(?:(?:(?=")"([^"].*?)")|(?:(?!")(.*?)))(?=,|$)
This should do what you want.
Explanation:
(?:^|,)\s*
The match must start at a comma or at the beginning of the string. Any whitespace that follows is skipped.
Look ahead to see whether the rest starts with a quote.
(?:(?=")"([^"].*?)")
If it does, match non-greedily up to the next quote.
(?:(?!")(.*?))
If it does not begin with a quote, match non-greedily up to the next comma or the end of the string.
(?=,|$)
The match must be followed by a comma or the end of the string.
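The pieces explained above can be tried out together with a short sketch. The class name `FieldPatternDemo` and the helper `parse` are hypothetical; the code assumes the field text is read from capture group 1 (quoted fields) or group 2 (unquoted fields), since the leading comma is part of the overall match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldPatternDemo {
    // The full expression from this answer, with Java string escaping applied.
    static final Pattern FIELD = Pattern.compile(
            "(?:^|,)\\s*(?:(?:(?=\")\"([^\"].*?)\")|(?:(?!\")(.*?)))(?=,|$)");

    // Returns the captured field text for each match (hypothetical helper).
    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            String quoted = m.group(1); // null when the unquoted branch matched
            fields.add(quoted != null ? quoted : m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "the quick,\"brown, fox jumps\",over,\"the\",,\"lazy dog\"";
        for (String field : parse(line)) {
            System.out.println("[" + field + "]"); // brackets make the empty field visible
        }
    }
}
```

Reading the groups rather than the whole match is what leaves out both the leading comma and the surrounding quotes.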