Java: How to split a string, but also keep the delimiters?
Notice: this page is a translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original address, and attribute it to the original authors (not me): StackOverFlow
Original source: http://stackoverflow.com/questions/2206378/
Asked by Daniel Rikowski
I have a multiline string which is delimited by a set of different delimiters:
(Text1)(DelimiterA)(Text2)(DelimiterC)(Text3)(DelimiterB)(Text4)
I can split this string into its parts using String.split, but it seems that I can't get the actual string that matched the delimiter regex.
In other words, this is what I get:
Text1
Text2
Text3
Text4
This is what I want:
Text1
DelimiterA
Text2
DelimiterC
Text3
DelimiterB
Text4
Is there any JDK way to split the string using a delimiter regex but also keep the delimiters?
Accepted answer by NawaMan
You can use lookahead and lookbehind, like this:
System.out.println(Arrays.toString("a;b;c;d".split("(?<=;)")));
System.out.println(Arrays.toString("a;b;c;d".split("(?=;)")));
System.out.println(Arrays.toString("a;b;c;d".split("((?<=;)|(?=;))")));
And you will get:
[a;, b;, c;, d]
[a, ;b, ;c, ;d]
[a, ;, b, ;, c, ;, d]
The last one is what you want.
((?<=;)|(?=;)) matches the empty string (a zero-width position) immediately before ; or immediately after ;.
Hope this helps.
EDIT: Fabian Steeg's comment on readability is valid. Readability is always a problem with regexes. One thing I do to ease this is to create a variable whose name represents what the regex does, and use Java's String.format to fill in the delimiter. Like this:
static public final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";
...
public void someMethod() {
...
final String[] aEach = "a;b;c;d".split(String.format(WITH_DELIMITER, ";"));
...
}
...
This helps a little bit. :-D
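As a minimal sketch of the same idea in runnable form: the helper below wraps the WITH_DELIMITER pattern from this answer and, as an added assumption, passes the delimiter through Pattern.quote so that characters such as "." are treated literally. The class and method names are hypothetical, and the approach relies on Java's requirement that a lookbehind has a bounded length, so it suits fixed delimiters rather than patterns like o+.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class KeepDelimiterSplit {
    // Zero-width match immediately after or immediately before the delimiter.
    private static final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";

    // Splits around a literal delimiter and keeps it as its own element.
    static String[] splitKeeping(String input, String literalDelimiter) {
        // Assumption: the delimiter is literal text, so quote regex metacharacters.
        String quoted = Pattern.quote(literalDelimiter);
        return input.split(String.format(WITH_DELIMITER, quoted));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitKeeping("a;b;c;d", ";")));
        System.out.println(Arrays.toString(splitKeeping("1.2.3", ".")));
    }
}
```

For a regex delimiter rather than a literal one, the Pattern.quote call would be dropped, at the cost of having to keep the expression valid inside a lookbehind.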
Answer by Alon L
I don't know Java too well, but if you can't find a Split method that does that, I suggest you just make your own.
// C#-style sketch, not Java
string[] mySplit(string s, string delimiter)
{
string[] result = s.Split(delimiter);
for (int i = 0; i < result.Length - 1; i++)
{
result[i] += delimiter; // appends the delimiter to the end of every item except the last;
// you can modify this however you want
}
return result;
}
string[] res = mySplit(myString, myDelimiter);
It's not too elegant, but it'll do.
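For readers who want the same idea in Java, here is a rough sketch. It assumes a single, fixed, literal delimiter (so it does not cover the question's set of different delimiters), and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical Java rendering of the split-then-reattach idea above.
class AppendDelimiterSplit {
    static String[] splitAppendingDelimiter(String s, String delimiter) {
        // Quote the delimiter so it is treated as literal text, not a regex;
        // the limit -1 keeps trailing empty strings.
        String[] result = s.split(Pattern.quote(delimiter), -1);
        // Re-attach the delimiter to every piece except the last one.
        for (int i = 0; i < result.length - 1; i++) {
            result[i] += delimiter;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitAppendingDelimiter("Text1;Text2;Text3", ";")));
    }
}
```

Because the delimiter is re-attached from the argument rather than from the match, this only works when the delimiter text is known in advance.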
Answer by PhiLho
Quick answer: use zero-width boundaries like \b to split. I will experiment to see if it works in Java (I have used this in PHP and JS).
It is possible and it kind of works, but it might split too much. It really depends on the string you want to split and the result you need. Give more details and we can help you better.
Another way is to do your own split, capturing the delimiter (supposing it is variable) and adding it afterward to the result.
My quick test:
String str = "'ab','cd','eg'";
String[] stra = str.split("\\b");
for (String s : stra) System.out.print(s + "|");
System.out.println();
Result:
'|ab|','|cd|','|eg|'|
A bit too much... :-)
Answer by bdumitriu
I don't know of an existing function in the Java API that does this (which is not to say it doesn't exist), but here's my own implementation (one or more delimiters will be returned as a single token; if you want each delimiter to be returned as a separate token, it will need a bit of adaptation):
static String[] splitWithDelimiters(String s) {
if (s == null || s.length() == 0) {
return new String[0];
}
LinkedList<String> result = new LinkedList<String>();
StringBuilder sb = null;
boolean wasLetterOrDigit = !Character.isLetterOrDigit(s.charAt(0));
for (char c : s.toCharArray()) {
if (Character.isLetterOrDigit(c) ^ wasLetterOrDigit) {
if (sb != null) {
result.add(sb.toString());
}
sb = new StringBuilder();
wasLetterOrDigit = !wasLetterOrDigit;
}
sb.append(c);
}
result.add(sb.toString());
return result.toArray(new String[0]);
}
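The answer notes that returning each delimiter as a separate token needs a bit of adaptation. One possible adaptation is sketched below, assuming (as in the code above) that a delimiter is any character that is not a letter or digit. The class and method names are hypothetical, and it returns a List rather than an array for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical variant: every delimiter character becomes its own token.
class PerDelimiterSplit {
    static List<String> splitEachDelimiter(String s) {
        List<String> result = new ArrayList<>();
        if (s == null) {
            return result;
        }
        StringBuilder word = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                // Still inside a run of letters or digits.
                word.append(c);
            } else {
                // Flush the pending word, then emit the delimiter on its own.
                if (word.length() > 0) {
                    result.add(word.toString());
                    word.setLength(0);
                }
                result.add(String.valueOf(c));
            }
        }
        // Flush the last word, if the input did not end with a delimiter.
        if (word.length() > 0) {
            result.add(word.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(splitEachDelimiter("ab, cd"));
    }
}
```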
Answer by VonC
I like the idea of StringTokenizer because it is an Enumeration.
But it is also obsolete, and has been replaced by String.split, which returns a boring String[] (and does not include the delimiters).
So I implemented a StringTokenizerEx which is an Iterable, and which takes a true regexp to split a string.
A true regexp means the delimiter is not a 'character sequence' that gets repeated to form the delimiter:
'o' will only match 'o', and will split 'ooo' into three delimiters, with two empty strings between them:
[o], '', [o], '', [o]
But the regexp o+ will return the expected result when splitting "aooob":
[], 'a', [ooo], 'b', []
To use this StringTokenizerEx:
final StringTokenizerEx aStringTokenizerEx = new StringTokenizerEx("boo:and:foo", "o+");
final String firstDelimiter = aStringTokenizerEx.getDelimiter();
for(String aString: aStringTokenizerEx )
{
// uses the split String detected and memorized in 'aString'
final String nextDelimiter = aStringTokenizerEx.getDelimiter();
}
The code of this class is available at DZone Snippets.
As usual for a code-challenge response (one self-contained class with test cases included), copy-paste it (in a 'src/test' directory) and run it. Its main() method illustrates the different usages.
Note: (late 2009 edit)
The article "Final Thoughts: Java Puzzler: Splitting Hairs" does a good job of explaining the bizarre behavior of String.split().
Josh Bloch even commented in response to that article:
Yes, this is a pain. FWIW, it was done for a very good reason: compatibility with Perl.
The guy who did it is Mike "madbot" McCloskey, who now works with us at Google. Mike made sure that Java's regular expressions passed virtually every one of the 30K Perl regular expression tests (and ran faster).
The Google common library Guava also contains a Splitter, which is:
- simpler to use
- maintained by Google (and not by you)
So it may be worth checking out. From their initial rough documentation (pdf):
JDK has this:
String[] pieces = "foo.bar".split("\\.");
It's fine to use this if you want exactly what it does:
- regular expression
- result as an array
- its way of handling empty pieces
Mini-puzzler: ",a,,b,".split(",") returns...
(a) "", "a", "", "b", ""
(b) null, "a", null, "b", null
(c) "a", null, "b"
(d) "a", "b"
(e) None of the above
Answer: (e) None of the above.
",a,,b,".split(",") returns "", "a", "", "b"
Only trailing empties are skipped! (Who knows the workaround to prevent the skipping? It's a fun one...)
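The quoted text leaves the workaround as a puzzle. One known answer, added here and not taken from the quoted documentation, is the two-argument String.split with a negative limit, which keeps trailing empty strings. A small sketch, with hypothetical class and method names:

```java
import java.util.Arrays;

// Hypothetical demo class for the trailing-empties behavior of String.split.
class TrailingEmpties {
    // A negative limit applies the pattern as often as possible
    // and does not discard trailing empty strings.
    static String[] splitKeepingTrailingEmpties(String s, String regex) {
        return s.split(regex, -1);
    }

    public static void main(String[] args) {
        // Default limit of 0: the trailing empty string is dropped (4 elements).
        System.out.println(Arrays.toString(",a,,b,".split(",")));
        // Negative limit: the trailing empty string is kept (5 elements).
        System.out.println(Arrays.toString(splitKeepingTrailingEmpties(",a,,b,", ",")));
    }
}
```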
In any case, our Splitter is simply more flexible: The default behavior is simplistic:
Splitter.on(',').split(" foo, ,bar, quux,")
--> [" foo", " ", "bar", " quux", ""]
If you want extra features, ask for them!
Splitter.on(',')
.trimResults()
.omitEmptyStrings()
.split(" foo, ,bar, quux,")
--> ["foo", "bar", "quux"]
Order of config methods doesn't matter -- during splitting, trimming happens before checking for empties.
Answer by Markus Jarderot
import java.util.regex.*;
import java.util.LinkedList;
public class Splitter {
private static final Pattern DEFAULT_PATTERN = Pattern.compile("\\s+");
private Pattern pattern;
private boolean keep_delimiters;
public Splitter(Pattern pattern, boolean keep_delimiters) {
this.pattern = pattern;
this.keep_delimiters = keep_delimiters;
}
public Splitter(String pattern, boolean keep_delimiters) {
this(Pattern.compile(pattern==null?"":pattern), keep_delimiters);
}
public Splitter(Pattern pattern) { this(pattern, true); }
public Splitter(String pattern) { this(pattern, true); }
public Splitter(boolean keep_delimiters) { this(DEFAULT_PATTERN, keep_delimiters); }
public Splitter() { this(DEFAULT_PATTERN); }
public String[] split(String text) {
if (text == null) {
text = "";
}
int last_match = 0;
LinkedList<String> splitted = new LinkedList<String>();
Matcher m = this.pattern.matcher(text);
while (m.find()) {
splitted.add(text.substring(last_match,m.start()));
if (this.keep_delimiters) {
splitted.add(m.group());
}
last_match = m.end();
}
splitted.add(text.substring(last_match));
return splitted.toArray(new String[splitted.size()]);
}
public static void main(String[] argv) {
if (argv.length != 2) {
System.err.println("Syntax: java Splitter <pattern> <text>");
return;
}
Pattern pattern = null;
try {
pattern = Pattern.compile(argv[0]);
}
catch (PatternSyntaxException e) {
System.err.println(e);
return;
}
Splitter splitter = new Splitter(pattern);
String text = argv[1];
int counter = 1;
for (String part : splitter.split(text)) {
System.out.printf("Part %d: \"%s\"\n", counter++, part);
}
}
}
/*
Example:
> java Splitter "\W+" "Hello World!"
Part 1: "Hello"
Part 2: " "
Part 3: "World"
Part 4: "!"
Part 5: ""
*/
I don't really like the other way, where you get an empty element in front and back. A delimiter is usually not at the beginning or at the end of the string, thus you most often end up wasting two good array slots.
Edit: Fixed edge cases. Commented source with test cases can be found here: http://snippets.dzone.com/posts/show/6453
Answer by Alan Moore
I got here late, but returning to the original question, why not just use lookarounds?
Pattern p = Pattern.compile("(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w)");
System.out.println(Arrays.toString(p.split("'ab','cd','eg'")));
System.out.println(Arrays.toString(p.split("boo:and:foo")));
Output:
[', ab, ',', cd, ',', eg, ']
[boo, :, and, :, foo]
EDIT: What you see above is what appears on the command line when I run that code, but I now see that it's a bit confusing. It's difficult to keep track of which commas are part of the result and which were added by Arrays.toString(). SO's syntax highlighting isn't helping either. In hopes of getting the highlighting to work with me instead of against me, here's how those arrays would look if I were declaring them in source code:
{ "'", "ab", "','", "cd", "','", "eg", "'" }
{ "boo", ":", "and", ":", "foo" }
I hope that's easier to read. Thanks for the heads-up, @finnw.
Answer by cletus
I had a look at the above answers and honestly I find none of them satisfactory. What you want to do is essentially mimic the Perl split functionality. Why Java doesn't allow this and doesn't have a join() method somewhere is beyond me, but I digress. You don't even need a class for this, really. It's just a function. Run this sample program:
Some of the earlier answers have excessive null-checking, a topic I recently wrote about in response to a question here:
https://stackoverflow.com/users/18393/cletus
Anyway, the code:
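import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Split {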
public static List<String> split(String s, String pattern) {
assert s != null;
assert pattern != null;
return split(s, Pattern.compile(pattern));
}
public static List<String> split(String s, Pattern pattern) {
assert s != null;
assert pattern != null;
Matcher m = pattern.matcher(s);
List<String> ret = new ArrayList<String>();
int start = 0;
while (m.find()) {
ret.add(s.substring(start, m.start()));
ret.add(m.group());
start = m.end();
}
ret.add(start >= s.length() ? "" : s.substring(start));
return ret;
}
private static void testSplit(String s, String pattern) {
System.out.printf("Splitting '%s' with pattern '%s'%n", s, pattern);
List<String> tokens = split(s, pattern);
System.out.printf("Found %d matches%n", tokens.size());
int i = 0;
for (String token : tokens) {
System.out.printf(" %d/%d: '%s'%n", ++i, tokens.size(), token);
}
System.out.println();
}
public static void main(String args[]) {
testSplit("abcdefghij", "z"); // "abcdefghij"
testSplit("abcdefghij", "f"); // "abcde", "f", "ghij"
testSplit("abcdefghij", "j"); // "abcdefghi", "j", ""
testSplit("abcdefghij", "a"); // "", "a", "bcdefghij"
testSplit("abcdefghij", "[bdfh]"); // "a", "b", "c", "d", "e", "f", "g", "h", "ij"
}
}
Answer by Fabian Steeg
I don't think it is possible with String#split, but you can use a StringTokenizer, though that won't allow you to define your delimiter as a regex, only as a set of single characters:
new StringTokenizer("Hello, world. Hi!", ",.!", true); // true for returnDelims
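To show what the returnDelims flag does in practice, here is a short sketch that collects the tokens into a list. The class and method names are hypothetical; the StringTokenizer calls are the standard JDK ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo of StringTokenizer with returnDelims set to true.
class TokenizerDemo {
    static List<String> tokensWithDelimiters(String text, String delimiterChars) {
        // The third argument (true) makes each delimiter character a token of its own.
        StringTokenizer tokenizer = new StringTokenizer(text, delimiterChars, true);
        List<String> tokens = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Print one token per line so commas inside tokens are not confusing.
        for (String token : tokensWithDelimiters("Hello, world. Hi!", ",.!")) {
            System.out.println("'" + token + "'");
        }
    }
}
```

Each character in the delimiter string is an independent one-character delimiter, so multi-character delimiters such as DelimiterA cannot be expressed this way.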
Answer by Steve McLeod
I suggest using Pattern and Matcher, which will almost certainly achieve what you want. Your regular expression will need to be somewhat more complicated than what you are using in String.split.
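A minimal sketch of what a Pattern and Matcher approach could look like, applied to the layout from the question. The regex Delimiter[ABC], the sample input, and the class and method names are assumptions made for illustration. The sketch skips empty text between adjacent delimiters and assumes the delimiter regex never matches the empty string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: walks the matches and keeps both text and delimiters.
class MatcherSplit {
    static List<String> splitKeepingDelimiters(String input, Pattern delimiter) {
        List<String> parts = new ArrayList<>();
        Matcher m = delimiter.matcher(input);
        int start = 0;
        while (m.find()) {
            // Text between the previous delimiter and this one, if any.
            if (m.start() > start) {
                parts.add(input.substring(start, m.start()));
            }
            // The delimiter exactly as it was matched.
            parts.add(m.group());
            start = m.end();
        }
        // Remaining text after the last delimiter, if any.
        if (start < input.length()) {
            parts.add(input.substring(start));
        }
        return parts;
    }

    public static void main(String[] args) {
        Pattern delimiter = Pattern.compile("Delimiter[ABC]");
        String input = "Text1DelimiterAText2DelimiterCText3DelimiterBText4";
        for (String part : splitKeepingDelimiters(input, delimiter)) {
            System.out.println(part);
        }
    }
}
```

Unlike the lookaround approach, this places no bounded-length restriction on the delimiter regex, because the delimiter is matched directly rather than inside a lookbehind.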