Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/22718744/

Time: 2020-08-13 17:28:15  Source: igfitidea

Why in Java 8 split sometimes removes empty strings at start of result array?

java regex split java-8

Asked by Pshemo

Before Java 8, when we split on an empty string like

String[] tokens = "abc".split("");

the split mechanism would split at the places marked with |

|a|b|c|

because an empty space "" exists before and after each character. As a result, it would at first generate this array

["", "a", "b", "c", ""]

and would later remove trailing empty strings (because we didn't explicitly provide a negative value for the limit argument), so it would finally return

["", "a", "b", "c"]


In Java 8 the split mechanism seems to have changed. Now when we use

"abc".split("")

we will get the array ["a", "b", "c"] instead of ["", "a", "b", "c"], so it looks like empty strings at the start are also removed. But this theory fails because, for instance,

"abc".split("a")

returns an array with an empty string at the start: ["", "bc"].

Can someone explain what is going on here and how the rules of split have changed in Java 8?

Accepted answer by nhahtdh

The behavior of String.split (which calls Pattern.split) changed between Java 7 and Java 8.

Documentation

Comparing the documentation of Pattern.split in Java 7 and Java 8, we observe that the following clause was added:

When there is a positive-width match at the beginning of the input sequence then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.

The same clause was also added to String.split in Java 8, compared to Java 7.
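As a quick illustration of the clause, the following sketch prints the results of splitting on a zero-width pattern and on a positive-width pattern. The class name SplitLeadingEmptyDemo and the helper show are made up for this example; the outputs in the comments assume Java 8 or later.

```java
import java.util.Arrays;

public class SplitLeadingEmptyDemo {
    // Hypothetical helper: renders a split result together with its length,
    // so that empty strings in the array are easier to spot.
    static String show(String[] parts) {
        return Arrays.toString(parts) + " (length " + parts.length + ")";
    }

    public static void main(String[] args) {
        // Zero-width match at index 0: no leading empty string (Java 8+).
        System.out.println(show("abc".split("")));      // [a, b, c] (length 3)

        // Positive-width match at index 0: the leading empty string is kept.
        System.out.println(show("abc".split("a")));     // [, bc] (length 2)

        // A negative limit keeps trailing empty strings, but the leading
        // zero-width match is still skipped.
        System.out.println(show("abc".split("", -1)));  // [a, b, c, ] (length 4)
    }
}
```

On Java 7 the first call would be expected to yield ["", "a", "b", "c"] instead, as described in the question.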

Reference implementation

Let us compare the code of Pattern.split in the reference implementation of Java 7 and Java 8. The code was retrieved from grepcode, for versions 7u40-b43 and 8-b132.

Java 7

public String[] split(CharSequence input, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<>();
    Matcher m = matcher(input);

    // Add segments before each match found
    while(m.find()) {
        if (!matchLimited || matchList.size() < limit - 1) {
            String match = input.subSequence(index, m.start()).toString();
            matchList.add(match);
            index = m.end();
        } else if (matchList.size() == limit - 1) { // last one
            String match = input.subSequence(index,
                                             input.length()).toString();
            matchList.add(match);
            index = m.end();
        }
    }

    // If no match was found, return this
    if (index == 0)
        return new String[] {input.toString()};

    // Add remaining segment
    if (!matchLimited || matchList.size() < limit)
        matchList.add(input.subSequence(index, input.length()).toString());

    // Construct result
    int resultSize = matchList.size();
    if (limit == 0)
        while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
            resultSize--;
    String[] result = new String[resultSize];
    return matchList.subList(0, resultSize).toArray(result);
}

Java 8

public String[] split(CharSequence input, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<>();
    Matcher m = matcher(input);

    // Add segments before each match found
    while(m.find()) {
        if (!matchLimited || matchList.size() < limit - 1) {
            if (index == 0 && index == m.start() && m.start() == m.end()) {
                // no empty leading substring included for zero-width match
                // at the beginning of the input char sequence.
                continue;
            }
            String match = input.subSequence(index, m.start()).toString();
            matchList.add(match);
            index = m.end();
        } else if (matchList.size() == limit - 1) { // last one
            String match = input.subSequence(index,
                                             input.length()).toString();
            matchList.add(match);
            index = m.end();
        }
    }

    // If no match was found, return this
    if (index == 0)
        return new String[] {input.toString()};

    // Add remaining segment
    if (!matchLimited || matchList.size() < limit)
        matchList.add(input.subSequence(index, input.length()).toString());

    // Construct result
    int resultSize = matchList.size();
    if (limit == 0)
        while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
            resultSize--;
    String[] result = new String[resultSize];
    return matchList.subList(0, resultSize).toArray(result);
}

The addition of the following code in Java 8 excludes the zero-length match at the beginning of the input string, which explains the behavior above.

            if (index == 0 && index == m.start() && m.start() == m.end()) {
                // no empty leading substring included for zero-width match
                // at the beginning of the input char sequence.
                continue;
            }

Maintaining compatibility

Following behavior in Java 8 and above

To make split behave consistently across versions and compatibly with the behavior in Java 8:

  1. If your regex can match a zero-length string, just add (?!\A) at the end of the regex and wrap the original regex in a non-capturing group (?:...) (if necessary).
  2. If your regex can't match a zero-length string, you don't need to do anything.
  3. If you don't know whether the regex can match a zero-length string or not, do both actions from step 1.

(?!\A) checks that the match does not end at the beginning of the string; a match that ends there is necessarily an empty match at the beginning of the string.
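A minimal sketch of the steps above, assuming a hypothetical helper named guard that always wraps the regex in a non-capturing group and appends (?!\A). The class name and the sample inputs are made up for illustration.

```java
import java.util.Arrays;

public class LeadingMatchGuardDemo {
    // Hypothetical helper: wraps the regex in (?:...) and appends (?!\A),
    // so a match is rejected if it would end at the start of the input.
    static String guard(String regex) {
        return "(?:" + regex + ")(?!\\A)";
    }

    public static void main(String[] args) {
        // (?=[A-Z]) is zero-width and matches before 'H' at index 0.
        // With the guard, that leading match is rejected by the regex itself,
        // so the result no longer depends on the Java 8 special case.
        System.out.println(Arrays.toString("HelloWorld".split(guard("(?=[A-Z])"))));
        // [Hello, World]

        // A positive-width match at index 0 ends at index 1, so the guard
        // does not interfere and the leading empty string is still produced.
        System.out.println(Arrays.toString("abc".split(guard("a"))));
        // [, bc]
    }
}
```

The guard only removes the empty match at the very start. Zero-width matches elsewhere in the input are left untouched.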

Following behavior in Java 7 and prior

There is no general solution to make split backward-compatible with Java 7 and prior, short of replacing all instances of split to point to your own custom implementation.
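One possible shape for such a custom implementation is sketched below. It is adapted from the Java 7 code quoted earlier, so it omits the Java 8 check that skips a zero-width match at index 0. The class name LegacySplit and its static method are hypothetical, and it only mirrors Pattern.split; treat it as a starting point rather than a drop-in replacement.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LegacySplit {
    // Hypothetical helper following the Java 7 logic quoted above:
    // a zero-width match at index 0 still yields a leading empty string.
    static String[] split(Pattern pattern, CharSequence input, int limit) {
        int index = 0;
        boolean matchLimited = limit > 0;
        List<String> matchList = new ArrayList<>();
        Matcher m = pattern.matcher(input);

        // Add the segment before each match found.
        while (m.find()) {
            if (!matchLimited || matchList.size() < limit - 1) {
                matchList.add(input.subSequence(index, m.start()).toString());
                index = m.end();
            } else if (matchList.size() == limit - 1) { // last one
                matchList.add(input.subSequence(index, input.length()).toString());
                index = m.end();
            }
        }

        // Same "no match" shortcut as the Java 7 code.
        if (index == 0) {
            return new String[] {input.toString()};
        }

        // Add the remaining segment.
        if (!matchLimited || matchList.size() < limit) {
            matchList.add(input.subSequence(index, input.length()).toString());
        }

        // With limit == 0, drop trailing empty strings.
        int resultSize = matchList.size();
        if (limit == 0) {
            while (resultSize > 0 && matchList.get(resultSize - 1).isEmpty()) {
                resultSize--;
            }
        }
        return matchList.subList(0, resultSize).toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Leading empty string is kept, as in Java 7.
        System.out.println(Arrays.toString(split(Pattern.compile(""), "abc", 0)));
        // [, a, b, c]
    }
}
```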

Answer by arshajii

There was a slight change in the docs for split() from Java 7 to Java 8. Specifically, the following statement was added:

When there is a positive-width match at the beginning of this string then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.

(emphasis mine)

Splitting on the empty string generates a zero-width match at the beginning, so an empty string is not included at the start of the resulting array, in accordance with what is specified above. By contrast, your second example, which splits on "a", generates a positive-width match at the start of the string, so an empty string is in fact included at the start of the resulting array.
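To see the width of the first match directly, one can inspect it with a Matcher. The sketch below is only an illustration: the class FirstMatchWidth and the helper describeFirstMatch are made-up names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstMatchWidth {
    // Hypothetical helper: reports where the first match of the regex starts
    // in the input and how many characters it spans.
    static String describeFirstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return "no match";
        }
        int width = m.end() - m.start();
        return "start=" + m.start() + ", width=" + width;
    }

    public static void main(String[] args) {
        // Empty regex: zero-width match at the beginning.
        System.out.println(describeFirstMatch("", "abc"));   // start=0, width=0
        // "a": positive-width match at the beginning.
        System.out.println(describeFirstMatch("a", "abc"));  // start=0, width=1
    }
}
```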

Answer by Alexis C.

This has been specified in the documentation of split(String regex, int limit).

When there is a positive-width match at the beginning of this string then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.

In "abc".split("") you get a zero-width match at the beginning, so the leading empty substring is not included in the resulting array.

However, in your second snippet, when you split on "a", you get a positive-width match (of width 1 in this case), so the empty leading substring is included as expected.

(Removed irrelevant source code)
