Original source: http://stackoverflow.com/questions/22718744/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Why in Java 8 split sometimes removes empty strings at start of result array?
Asked by Pshemo
Before Java 8, when we split on the empty string, as in
String[] tokens = "abc".split("");
the split mechanism would split at the places marked with |
|a|b|c|
because the empty string "" exists before and after each character. As a result, it would first generate this array
["", "a", "b", "c", ""]
and would then remove the trailing empty strings (because we didn't explicitly pass a negative value for the limit argument), so it would finally return
["", "a", "b", "c"]
In Java 8, the split mechanism seems to have changed. Now when we use
"abc".split("")
we get the array ["a", "b", "c"] instead of ["", "a", "b", "c"], so it looks like empty strings at the start are also removed. But this theory fails because, for instance,
"abc".split("a")
returns an array with an empty string at the start: ["", "bc"].
Can someone explain what is going on here and how the rules of split changed in Java 8?
Accepted answer by nhahtdh
The behavior of String.split (which calls Pattern.split) changed between Java 7 and Java 8.
Documentation
Comparing the documentation of Pattern.split in Java 7 and Java 8, we observe that the following clause was added:
When there is a positive-width match at the beginning of the input sequence then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.
The same clause was also added to the documentation of String.split in Java 8, compared to Java 7.
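The documented rule can be checked with a small program. This is a minimal sketch, assuming a Java 8 or later runtime; the class and method names (SplitLeadingDemo, onEmpty, onA) are made up for illustration and do not come from the original answer.

```java
import java.util.Arrays;

class SplitLeadingDemo {
    // Splits on the empty regex: the first match is zero-width at index 0,
    // so Java 8+ does not emit a leading empty string.
    static String[] onEmpty(String s) {
        return s.split("");
    }

    // Splits on the literal "a": for "abc" the first match has positive
    // width, so a leading empty string is kept.
    static String[] onA(String s) {
        return s.split("a");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(onEmpty("abc"))); // Java 8+: [a, b, c]
        System.out.println(Arrays.toString(onA("abc")));     // [, bc]
    }
}
```

On Java 7 and earlier, the first call would instead be expected to yield a leading empty string, as described in the question.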
Reference implementation
Let us compare the code of Pattern.split in the reference implementation of Java 7 and Java 8. The code was retrieved from grepcode, for versions 7u40-b43 and 8-b132.
Java 7
public String[] split(CharSequence input, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<>();
    Matcher m = matcher(input);

    // Add segments before each match found
    while(m.find()) {
        if (!matchLimited || matchList.size() < limit - 1) {
            String match = input.subSequence(index, m.start()).toString();
            matchList.add(match);
            index = m.end();
        } else if (matchList.size() == limit - 1) { // last one
            String match = input.subSequence(index,
                                             input.length()).toString();
            matchList.add(match);
            index = m.end();
        }
    }

    // If no match was found, return this
    if (index == 0)
        return new String[] {input.toString()};

    // Add remaining segment
    if (!matchLimited || matchList.size() < limit)
        matchList.add(input.subSequence(index, input.length()).toString());

    // Construct result
    int resultSize = matchList.size();
    if (limit == 0)
        while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
            resultSize--;
    String[] result = new String[resultSize];
    return matchList.subList(0, resultSize).toArray(result);
}
Java 8
public String[] split(CharSequence input, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<>();
    Matcher m = matcher(input);

    // Add segments before each match found
    while(m.find()) {
        if (!matchLimited || matchList.size() < limit - 1) {
            if (index == 0 && index == m.start() && m.start() == m.end()) {
                // no empty leading substring included for zero-width match
                // at the beginning of the input char sequence.
                continue;
            }
            String match = input.subSequence(index, m.start()).toString();
            matchList.add(match);
            index = m.end();
        } else if (matchList.size() == limit - 1) { // last one
            String match = input.subSequence(index,
                                             input.length()).toString();
            matchList.add(match);
            index = m.end();
        }
    }

    // If no match was found, return this
    if (index == 0)
        return new String[] {input.toString()};

    // Add remaining segment
    if (!matchLimited || matchList.size() < limit)
        matchList.add(input.subSequence(index, input.length()).toString());

    // Construct result
    int resultSize = matchList.size();
    if (limit == 0)
        while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
            resultSize--;
    String[] result = new String[resultSize];
    return matchList.subList(0, resultSize).toArray(result);
}
The addition of the following code in Java 8 excludes the zero-length match at the beginning of the input string, which explains the behavior above.
    if (index == 0 && index == m.start() && m.start() == m.end()) {
        // no empty leading substring included for zero-width match
        // at the beginning of the input char sequence.
        continue;
    }
Maintaining compatibility
Following behavior in Java 8 and above
To make split behave consistently across versions, and compatibly with the behavior in Java 8:
- If your regex can match a zero-length string, just add (?!\A) at the end of the regex and wrap the original regex in a non-capturing group (?:...) (if necessary).
- If your regex can't match a zero-length string, you don't need to do anything.
- If you don't know whether the regex can match a zero-length string or not, do both actions in step 1.
(?!\A) checks that the match does not end at the beginning of the string; a match that ends there is necessarily an empty match at the beginning of the string.
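The wrapping step described above can be sketched as a small helper. This is only an illustration under stated assumptions: the class and method names (SplitCompat, java8Style, split) are hypothetical, and the blind string concatenation assumes the regex does not depend on constructs that wrapping would disturb (for example, a trailing comment in comments mode).

```java
import java.util.regex.Pattern;

class SplitCompat {
    // Wraps the regex in a non-capturing group and appends (?!\A), so a
    // match may never end at the very beginning of the input. This rules
    // out exactly the zero-width match at index 0.
    static Pattern java8Style(String regex) {
        return Pattern.compile("(?:" + regex + ")(?!\\A)");
    }

    // Splits with the wrapped pattern; trailing empty strings are removed,
    // as with the default limit of 0.
    static String[] split(String input, String regex) {
        return java8Style(regex).split(input);
    }
}
```

With this helper, SplitCompat.split("abc", "") gives ["a", "b", "c"] and SplitCompat.split("abc", "a") gives ["", "bc"] on Java 8 and later; going by the Java 7 code shown above, the same results would be expected there too, since the wrapped pattern never matches at index 0 with zero width.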
Following behavior in Java 7 and prior
There is no general solution to make split backward-compatible with Java 7 and prior, short of replacing all instances of split to point to your own custom implementation.
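Such a custom implementation could look roughly like the following. It is a sketch modeled on the Java 7 listing quoted above, not the JDK code itself; the class name LegacySplit and its static signature are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LegacySplit {
    // Splits like the Java 7 Pattern.split shown above: a zero-width match
    // at index 0 still produces a leading empty string.
    static String[] split(String input, String regex, int limit) {
        boolean matchLimited = limit > 0;
        List<String> parts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        int index = 0;

        // Add the segment before each match found.
        while (m.find()) {
            if (!matchLimited || parts.size() < limit - 1) {
                parts.add(input.substring(index, m.start()));
                index = m.end();
            } else if (parts.size() == limit - 1) { // last allowed segment
                parts.add(input.substring(index));
                index = m.end();
            }
        }

        // Same condition as the quoted code: return the input unchanged.
        if (index == 0) {
            return new String[] { input };
        }

        // Add the remaining segment.
        if (!matchLimited || parts.size() < limit) {
            parts.add(input.substring(index));
        }

        // With limit == 0, drop trailing empty strings.
        int size = parts.size();
        if (limit == 0) {
            while (size > 0 && parts.get(size - 1).isEmpty()) {
                size--;
            }
        }
        return parts.subList(0, size).toArray(new String[0]);
    }
}
```

For example, LegacySplit.split("abc", "", 0) returns ["", "a", "b", "c"] regardless of the Java version it runs on, while LegacySplit.split("abc", "a", 0) returns ["", "bc"].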
Answer by arshajii
There was a slight change in the docs for split() from Java 7 to Java 8. Specifically, the following statement was added:
When there is a positive-width match at the beginning of this string then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.
(emphasis mine)
Splitting on the empty string generates a zero-width match at the beginning, so an empty string is not included at the start of the resulting array, in accordance with what is specified above. By contrast, your second example, which splits on "a", generates a positive-width match at the start of the string, so an empty string is in fact included at the start of the resulting array.
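To see which of the two cases a given regex falls into, one can inspect the first match directly. The following is a minimal sketch; FirstMatchProbe and startsWithZeroWidthMatch are hypothetical names, not part of any library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstMatchProbe {
    // Returns true when the first match of regex in input is a zero-width
    // match at index 0, which is the case where Java 8+ omits the leading
    // empty string from the split result.
    static boolean startsWithZeroWidthMatch(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() && m.start() == 0 && m.end() == 0;
    }
}
```

For "abc" this returns true for the regex "" and false for the regex "a", matching the two examples from the question.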
Answer by Alexis C.
This is specified in the documentation of split(String regex, int limit).
When there is a positive-width match at the beginning of this string then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.
In "abc".split("") you get a zero-width match at the beginning, so the leading empty substring is not included in the resulting array.
However, in your second snippet, when you split on "a", you get a positive-width match (of width 1 in this case), so the empty leading substring is included as expected.
(Removed irrelevant source code)