Original question: http://stackoverflow.com/questions/3366281/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Tokenizing a String but ignoring delimiters within quotes
Asked by Ploo
I wish to have the following String
!cmd 45 90 "An argument" Another AndAnother "Another one in quotes"
to become an array of the following
{ "!cmd", "45", "90", "An argument", "Another", "AndAnother", "Another one in quotes" }
I tried
new StringTokenizer(cmd, "\"")
but this would return "Another" and "AndAnother" as "Another AndAnother", which is not the desired effect.
Thanks.
EDIT: I have changed the example yet again, this time I believe it explains the situation best although it is no different than the second example.
Accepted answer by polygenelubricants
It's much easier to use a java.util.regex.Matcher and do a find() rather than any kind of split in these kinds of scenarios.
That is, instead of defining the pattern for the delimiter between the tokens, you define the pattern for the tokens themselves.
Here's an example:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String text = "1 2 \"333 4\" 55 6 \"77\" 8 999";
// 1 2 "333 4" 55 6 "77" 8 999
// Group 1 captures the contents of a quoted segment, group 2 a run of non-whitespace.
String regex = "\"([^\"]*)\"|(\\S+)";
Matcher m = Pattern.compile(regex).matcher(text);
while (m.find()) {
    if (m.group(1) != null) {
        System.out.println("Quoted [" + m.group(1) + "]");
    } else {
        System.out.println("Plain [" + m.group(2) + "]");
    }
}
The above prints (as seen on ideone.com):
Plain [1]
Plain [2]
Quoted [333 4]
Plain [55]
Plain [6]
Quoted [77]
Plain [8]
Plain [999]
The pattern is essentially:
"([^"]*)"|(\S+)
 \_____/  \___/
    1       2
There are 2 alternates:
- The first alternate matches the opening double quote, a sequence of anything but double quote (captured in group 1), then the closing double quote
- The second alternate matches any sequence of non-whitespace characters, captured in group 2
The order of the alternates matters in this pattern.
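Applied to the command string from the question, the same pattern yields the requested array. What follows is a minimal sketch rather than part of the original answer; the class name QuoteAwareTokenizer and the helper tokenize are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteAwareTokenizer {
    // Group 1: contents of a double-quoted segment; group 2: a run of non-whitespace.
    private static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    // Hypothetical helper (not from the answer): collects every match into a list.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Only one of the two groups participates in any given match.
            tokens.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String cmd = "!cmd 45 90 \"An argument\" Another AndAnother \"Another one in quotes\"";
        String[] parts = tokenize(cmd).toArray(new String[0]);
        for (String part : parts) {
            System.out.println("[" + part + "]");
        }
    }
}
```

If the two alternates were swapped, \S+ would be tried first and would match "An (quote included) as a plain token, so the quoted alternate would never get a chance at that position. That is the reason the order matters.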
Note that this does not handle escaped double quotes within quoted segments. If you need to do this, then the pattern becomes more complicated, but the Matcher solution still works.
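One possible shape for that more complicated pattern is sketched here. It assumes that a backslash escapes the next character inside a quoted segment; the answer does not say which escaping convention is intended, so treat that as an assumption. The class name EscapedQuoteTokenizer and the helper tokenize are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedQuoteTokenizer {
    // Regex: "((?:[^"\\]|\\.)*)"|(\S+)
    // Inside quotes, accept either a character that is neither a quote nor a
    // backslash, or a backslash followed by any single character.
    private static final Pattern TOKEN =
            Pattern.compile("\"((?:[^\"\\\\]|\\\\.)*)\"|(\\S+)");

    // Hypothetical helper: tokenizes and then removes the escaping backslashes.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                // Turn each backslash-plus-character pair back into the bare character.
                tokens.add(m.group(1).replaceAll("\\\\(.)", "$1"));
            } else {
                tokens.add(m.group(2));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The text being tokenized is: say "he said \"hi\"" now
        System.out.println(tokenize("say \"he said \\\"hi\\\"\" now"));
    }
}
```

The Matcher loop is unchanged from the simpler version; only the first alternate grew, which is the point the answer makes.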
References
- regular-expressions.info/Brackets for Grouping and Capturing, Alternation with Vertical Bar, Character Class, Repetition with Star and Plus
See also
- regular-expressions.info/Examples - Programmer - Strings, for a pattern with escaped quotes
Appendix
Note that StringTokenizer is a legacy class. It's recommended to use java.util.Scanner or String.split, or of course java.util.regex.Matcher for most flexibility.
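For plain whitespace tokenizing, the two simpler replacements might look like the sketch below. Neither is quote-aware, so this only illustrates the appendix, not a solution to the question; the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

class LegacyFreeTokenizing {
    // String.split takes a regular expression as its delimiter.
    static List<String> withSplit(String input) {
        // trim() first: leading whitespace would otherwise yield an empty first element.
        return Arrays.asList(input.trim().split("\\s+"));
    }

    // Scanner uses whitespace as its default delimiter.
    static List<String> withScanner(String input) {
        List<String> tokens = new ArrayList<>();
        try (Scanner sc = new Scanner(input)) {
            while (sc.hasNext()) {
                tokens.add(sc.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(withSplit("1 2  55   6"));
        System.out.println(withScanner("1 2  55   6"));
    }
}
```

One difference worth knowing: for an empty input, split returns a single empty string, whereas the Scanner loop returns no tokens at all.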
Answer by Nikolaos
The example you have here would just have to be split by the double quote character.
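Splitting on the double quote alone leaves the unquoted stretches unsplit, so a second pass is needed. One way this suggestion could be completed is sketched here, assuming quotes are balanced and never escaped: pieces at odd positions were inside quotes and are kept whole, while pieces at even positions are split on whitespace. The class name SplitOnQuotes is made up.

```java
import java.util.ArrayList;
import java.util.List;

class SplitOnQuotes {
    // Hypothetical helper; assumes balanced, unescaped double quotes.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        // Limit -1 keeps trailing empty pieces, so an empty "" at the end survives.
        String[] pieces = input.split("\"", -1);
        for (int i = 0; i < pieces.length; i++) {
            if (i % 2 == 1) {
                // Odd index: this piece sat between two quotes, keep it whole.
                tokens.add(pieces[i]);
            } else {
                // Even index: outside quotes, break it up on whitespace.
                for (String word : pieces[i].trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        tokens.add(word);
                    }
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String cmd = "!cmd 45 90 \"An argument\" Another AndAnother \"Another one in quotes\"";
        System.out.println(tokenize(cmd));
    }
}
```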
Answer by danyim
Try this:
String str = "One two \"three four\" five \"six seven eight\" nine \"ten\"";
String[] strArr = str.split("\"|\\s");
It's kind of tricky because you need to escape the double quotes. This regular expression should tokenize the string using either a whitespace character (\s) or a double quote.
You should use String's split method because it accepts regular expressions, whereas the constructor argument for the delimiter in StringTokenizer doesn't. At the end of what I provided above, you can just add the following:
String s = ""; // must be initialized before += will compile
for (String k : strArr) {
    s += k; // pieces are joined with no separator between them
}
StringTokenizer strTok = new StringTokenizer(s);
Answer by smp7d
Try this:
String str = "One two \"three four\" five \"six seven eight\" nine \"ten\"";
String[] strings = str.split("[ ]?\"[ ]?");
Answer by Kiersten Arnold
I don't know the context of what you're trying to do, but it looks like you're trying to parse command line arguments. In general, this is pretty tricky with all the escaping issues; if this is your goal I'd personally look at something like JCommander.
Answer by GrandmasterB
Do it the old-fashioned way. Make a function that looks at each character in a for loop. If the character is a space, take everything up to that (excluding the space) and add it as an entry to the array. Note the position, and do the same again, adding that next part to the array after a space. When a double quote is encountered, mark a boolean named 'inQuote' as true, and ignore spaces when inQuote is true. When you hit a quote while inQuote is true, flag it as false and go back to breaking things up when a space is encountered. You can then extend this as necessary to support escape chars, etc.
Could this be done with a regex? I don't know, I guess so. But the whole function would take less time to write than this reply did.
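The description above can be sketched roughly as follows. This is one reading of it, not the author's code: it tracks the start position of the current token, treats only the space character as a separator, and assumes quotes begin at a token boundary. The names CharLoopTokenizer and tokenize are invented.

```java
import java.util.ArrayList;
import java.util.List;

class CharLoopTokenizer {
    // Hypothetical helper following the prose description; no escape handling.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        boolean inQuote = false;
        int start = -1; // index where the current token began, -1 between tokens
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '"') {
                if (inQuote) {
                    // Closing quote: everything since the opening quote is one token.
                    tokens.add(input.substring(start, i));
                    inQuote = false;
                    start = -1;
                } else {
                    // Opening quote: flush any word in progress, then note the position.
                    if (start != -1) {
                        tokens.add(input.substring(start, i));
                    }
                    inQuote = true;
                    start = i + 1;
                }
            } else if (c == ' ' && !inQuote) {
                // A space outside quotes ends the current word, if there is one.
                if (start != -1) {
                    tokens.add(input.substring(start, i));
                    start = -1;
                }
            } else if (start == -1) {
                start = i; // first character of a new unquoted word
            }
        }
        // Flush whatever is left (last word, or an unterminated quote).
        if (start != -1 && start < input.length()) {
            tokens.add(input.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String cmd = "!cmd 45 90 \"An argument\" Another AndAnother \"Another one in quotes\"";
        System.out.println(tokenize(cmd));
    }
}
```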
Answer by Eyal Schneider
In an old-fashioned way:
public static String[] split(String str) {
    str += " "; // To detect last token when not quoted...
    ArrayList<String> strings = new ArrayList<String>();
    boolean inQuote = false;
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < str.length(); i++) {
        char c = str.charAt(i);
        if (c == '"' || (c == ' ' && !inQuote)) {
            if (c == '"')
                inQuote = !inQuote;
            if (!inQuote && sb.length() > 0) {
                strings.add(sb.toString());
                sb.delete(0, sb.length());
            }
        } else
            sb.append(c);
    }
    return strings.toArray(new String[strings.size()]);
}
I assume that nested quotes are illegal, and also that empty tokens can be omitted.
Answer by deadfire19
This is an old question; however, this was my solution as a finite state machine.
Efficient, predictable and no fancy tricks.
100% coverage on tests.
Drag and drop into your code.
// Requires java.util.LinkedList, java.util.List and java.util.stream.Collectors.
/**
 * Splits a command on whitespaces. Preserves whitespace in quotes. Trims excess whitespace between chunks. Supports quote
 * escape within quotes. Failed escape will preserve escape char.
 *
 * @return List of split commands
 */
static List<String> splitCommand(String inputString) {
    List<String> matchList = new LinkedList<>();
    LinkedList<Character> charList = inputString.chars()
            .mapToObj(i -> (char) i)
            .collect(Collectors.toCollection(LinkedList::new));
    // Finite-State Automaton for parsing.
    CommandSplitterState state = CommandSplitterState.BeginningChunk;
    LinkedList<Character> chunkBuffer = new LinkedList<>();
    for (Character currentChar : charList) {
        switch (state) {
            case BeginningChunk:
                switch (currentChar) {
                    case '"':
                        state = CommandSplitterState.ParsingQuote;
                        break;
                    case ' ':
                        break;
                    default:
                        state = CommandSplitterState.ParsingWord;
                        chunkBuffer.add(currentChar);
                }
                break;
            case ParsingWord:
                switch (currentChar) {
                    case ' ':
                        state = CommandSplitterState.BeginningChunk;
                        String newWord = chunkBuffer.stream().map(Object::toString).collect(Collectors.joining());
                        matchList.add(newWord);
                        chunkBuffer = new LinkedList<>();
                        break;
                    default:
                        chunkBuffer.add(currentChar);
                }
                break;
            case ParsingQuote:
                switch (currentChar) {
                    case '"':
                        state = CommandSplitterState.BeginningChunk;
                        String newWord = chunkBuffer.stream().map(Object::toString).collect(Collectors.joining());
                        matchList.add(newWord);
                        chunkBuffer = new LinkedList<>();
                        break;
                    case '\\':
                        state = CommandSplitterState.EscapeChar;
                        break;
                    default:
                        chunkBuffer.add(currentChar);
                }
                break;
            case EscapeChar:
                switch (currentChar) {
                    case '"': // Intentional fall through
                    case '\\':
                        state = CommandSplitterState.ParsingQuote;
                        chunkBuffer.add(currentChar);
                        break;
                    default:
                        state = CommandSplitterState.ParsingQuote;
                        chunkBuffer.add('\\');
                        chunkBuffer.add(currentChar);
                }
        }
    }
    if (state != CommandSplitterState.BeginningChunk) {
        String newWord = chunkBuffer.stream().map(Object::toString).collect(Collectors.joining());
        matchList.add(newWord);
    }
    return matchList;
}
private enum CommandSplitterState {
    BeginningChunk, ParsingWord, ParsingQuote, EscapeChar
}
Answer by mike rodent
Apache Commons to the rescue!
import org.apache.commons.text.StringTokenizer
import org.apache.commons.text.matcher.StringMatcher
import org.apache.commons.text.matcher.StringMatcherFactory
@Grab(group='org.apache.commons', module='commons-text', version='1.3')
def str = /is this 'completely "impossible"' or """slightly"" impossible" to parse?/
StringTokenizer st = new StringTokenizer( str )
StringMatcher sm = StringMatcherFactory.INSTANCE.quoteMatcher()
st.setQuoteMatcher( sm )
println st.tokenList
Output:
[is, this, completely "impossible", or, "slightly" impossible, to, parse?]
A few notes:
- This is written in Groovy... it is in fact a Groovy script. The @Grab line gives a clue to the sort of dependency line you need (e.g. in build.gradle)... or just include the .jar in your classpath, of course.
- StringTokenizer here is NOT java.util.StringTokenizer... as the import line shows, it is org.apache.commons.text.StringTokenizer.
- The def str = ... line is a way to produce a String in Groovy which contains both single quotes and double quotes without having to go in for escaping.
- StringMatcherFactory in Apache commons-text 1.3 has an INSTANCE which can provide you with a bunch of different StringMatchers. You could even roll your own, but you'd need to examine the StringMatcherFactory source code to see how it's done.
- YES! You can not only include the "other type of quote" and have it correctly interpreted as not being a token boundary... but you can even escape the actual quote which is being used to turn off tokenising, by doubling the quote within the tokenisation-protected bit of the String! Try implementing that with a few lines of code... or rather don't!
PS why is it better to use Apache Commons than any other solution? Apart from the fact that there is no point re-inventing the wheel, I can think of at least two reasons:
- The Apache engineers can be counted on to have anticipated all the gotchas and developed robust, comprehensively tested, reliable code
- It means you don't clutter up your beautiful code with stoopid utility methods - you just have a nice, clean bit of code which does exactly what it says on the tin, leaving you to get on with the, um, interesting stuff...
PPS Nothing obliges you to look on the Apache code as mysterious "black boxes". The source is open, and written in usually perfectly "accessible" Java. Consequently you are free to examine how things are done to your heart's content. It's often quite instructive to do so.
later
Sufficiently intrigued by ArtB's question I had a look at the source:
in StringMatcherFactory.java we see:
private static final AbstractStringMatcher.CharSetMatcher QUOTE_MATCHER = new AbstractStringMatcher.CharSetMatcher(
"'\"".toCharArray());
... rather dull ...
so that leads one to look at StringTokenizer.java:
public StringTokenizer setQuoteMatcher(final StringMatcher quote) {
    if (quote != null) {
        this.quoteMatcher = quote;
    }
    return this;
}
OK... and then, in the same java file:
private int readWithQuotes(final char[] srcChars ...
which contains the comment:
// If we've found a quote character, see if it's followed by a second quote. If so, then we need to actually put the quote character into the token rather than end the token.
... I can't be bothered to follow the clues any further. You have a choice: either your "hackish" solution, where you systematically pre-process your strings before submitting them for tokenising, turning each backslash-escaped quote into a doubled quote (i.e. you replace each \" with "")...
Or... you examine org.apache.commons.text.StringTokenizer.java to figure out how to tweak the code. It's a small file. I don't think it would be that difficult. Then you compile, essentially making a fork of the Apache code.
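The pre-processing option amounts to a single literal replacement. A minimal sketch, using only the standard library and not tested here against commons-text itself; the class and method names are hypothetical:

```java
class QuoteEscapePreprocessor {
    // Hypothetical helper: rewrites each backslash-escaped quote (\") as a
    // doubled quote (""), the form the tokenizer's quote handling understands.
    static String toDoubledQuotes(String input) {
        // String.replace works on literal text, not on a regular expression.
        return input.replace("\\\"", "\"\"");
    }

    public static void main(String[] args) {
        // The text being rewritten is: say "a \"b\" c"
        System.out.println(toDoubledQuotes("say \"a \\\"b\\\" c\""));
    }
}
```

A caveat with this shortcut: a quoted segment that legitimately ends in a backslash, such as "C:\", would also be rewritten, so it only suits input where a backslash before a quote always means an escape.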
I don't think it can be configured. But if you found a code-tweak solution which made sense you might submit it to Apache and then it might be accepted for the next iteration of the code, and your name would figure at least in the "features request" part of Apache: this could be a form of kleos through which you achieve programming immortality...
Answer by hemantvsn
Another old-school way is:
// Requires java.util.ArrayList and java.util.List.
public static void main(String[] args) {
    String text = "One two \"three four\" five \"six seven eight\" nine \"ten\"";
    String[] splits = text.split(" ");
    List<String> list = new ArrayList<>();
    String token = null; // non-null while inside a quoted section
    for (String s : splits) {
        if (token == null && s.startsWith("\"")) {
            token = s; // opening quote starts a new quoted token
        } else if (token != null) {
            token = token + " " + s; // still inside the quoted token
        } else {
            list.add(s); // plain word
            continue;
        }
        // Close the quoted token, including a single-word one such as "ten".
        if (token.length() > 1 && token.endsWith("\"")) {
            list.add(token);
            token = null;
        }
    }
    System.out.println(list);
}
Output: [One, two, "three four", five, "six seven eight", nine, "ten"]