Java String split on non-alphabetic characters

Question

Asked by dmoench

I want to split a String into a String array along non-alphabetic characters. For example:

"Here is an ex@mple" => "Here", "is", "an", "ex", "mple"

I tried using the String.split(String regex) method with the regular expression "(?![\\p{Alpha}])". However, this splits the string into

"Here", "_is", "_an", "_ex", "@mple"

(The underscores are there to emphasize that there is a space.) I guess this is because the ?! regex operator is "zero-width", so the split happens at the zero-width position preceding each non-alphabetic character and the character itself is kept.

How can I accomplish removal of the actual non-alpha characters while I split the string? Is there a NON-zero-width negation operator?

Answer 1

Answered by arshajii

You could try \P{Alpha}+:

"Here is an ex@mple".split("\\P{Alpha}+")

["Here", "is", "an", "ex", "mple"]

\P{Alpha} matches any non-alphabetic character (as opposed to \p{Alpha}, which matches any alphabetic character). The + means that we split on any contiguous run of such characters. For example:

"a!@#$%^&*b".split("\\P{Alpha}+")

["a", "b"]
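To make the difference with the question's attempt concrete, here is a minimal sketch (the class and method names are made up for illustration). It runs the zero-width lookahead from the question next to the consuming pattern above, and also shows a caveat worth knowing: if the input starts with a non-letter, split returns a leading empty string.

```java
import java.util.Arrays;

final class SplitLookaheadDemo {
    // Zero-width negative lookahead: matches the empty position before a non-letter,
    // so nothing is consumed and the non-letter stays in the next piece.
    static String[] splitZeroWidth(String s) {
        return s.split("(?![\\p{Alpha}])");
    }

    // Consuming pattern: each run of non-letters is matched and removed.
    static String[] splitConsuming(String s) {
        return s.split("\\P{Alpha}+");
    }

    public static void main(String[] args) {
        String input = "Here is an ex@mple";
        System.out.println(Arrays.toString(splitZeroWidth(input)));  // [Here,  is,  an,  ex, @mple]
        System.out.println(Arrays.toString(splitConsuming(input)));  // [Here, is, an, ex, mple]
        // A delimiter at the very start produces a leading empty string.
        System.out.println(Arrays.toString(splitConsuming("!Here is")));  // [, Here, is]
    }
}
```

If the leading empty string is a problem, one option is to filter out empty pieces after splitting.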

Answer 2

Answered by Sylvain Leroux

There are already several answers here, but none of them deals well with internationalization. And even if the OP's example suggests the question is about "English" letters, that may not be the case for visitors arriving here from a search...

... so it is worth mentioning that Java supports Unicode Technical Standard #18, "Unicode Regular Expressions". Pretty impressive, isn't it? In short, this is an extension to classic (Latin-centric or even English-centric) regular expressions, designed to deal with international characters.

For example, Java supports a set of Unicode binary properties to check whether a character belongs to one of the Unicode code point character classes. In particular, the \p{IsAlphabetic} character class matches any character with the Unicode Alphabetic property, that is, a letter in any Unicode-supported language.

Not clear? Here is an example:

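    // Needs java.util.regex.Pattern and java.util.regex.Matcher.
    Pattern p = Pattern.compile("\\p{IsAlphabetic}+");
    //                           ^^^^^^^^^^^^^^^^^
    //                         any alphabetic character
    //                    (in any Unicode-supported language)

    Matcher m = p.matcher("L'élève あゆみ travaille _bien_");
    // Print each run of alphabetic characters on its own line, prefixed with ">".
    while (m.find()) {
        System.out.println(">" + m.group());
    }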

Or, mostly equivalently, use split to break on non-alphabetic characters:

    for (String s : "L'élève あゆみ travaille bien".split("\\P{IsAlphabetic}+"))
        System.out.println(">" + s);

In both cases, the output is properly tokenized into words, taking into account French accented characters and Japanese hiragana characters, just as it would be for words written in any Unicode-supported language (including characters in the Supplementary Multilingual Plane).
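To see why this matters, here is a small comparison sketch (class and method names are hypothetical). By default the POSIX class \p{Alpha} is US-ASCII only, so it cuts words at accented letters. As far as I know, compiling with Pattern.UNICODE_CHARACTER_CLASS makes \p{Alpha} behave like \p{IsAlphabetic}. The sample text is written with Unicode escapes only so that the source file encoding does not matter.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class UnicodeSplitDemo {
    // The French/Japanese sample sentence, spelled with Unicode escapes.
    static final String TEXT = "L'\u00e9l\u00e8ve \u3042\u3086\u307f travaille bien";

    // Default \P{Alpha}: everything outside a-z and A-Z counts as a delimiter.
    static String[] asciiOnly(String s) {
        return s.split("\\P{Alpha}+");
    }

    // Unicode binary property: only truly non-alphabetic characters are delimiters.
    static String[] unicodeProperty(String s) {
        return s.split("\\P{IsAlphabetic}+");
    }

    // Same POSIX class, but compiled with the Unicode flag.
    static String[] unicodeFlag(String s) {
        return Pattern.compile("\\P{Alpha}+", Pattern.UNICODE_CHARACTER_CLASS).split(s);
    }

    public static void main(String[] args) {
        // ASCII-only splitting breaks the first word into "L", "l", "ve"
        // and drops the hiragana word entirely.
        System.out.println(Arrays.toString(asciiOnly(TEXT)));
        // Both Unicode-aware variants keep the five words intact.
        System.out.println(Arrays.toString(unicodeProperty(TEXT)));
        System.out.println(Arrays.toString(unicodeFlag(TEXT)));
    }
}
```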

Answer 3

Answered by Prudhvi kanth Chirumamilla

We can do this by using square brackets [] (a character class) in the split function.

Syntax: String[] strArray = text.split("[^a-zA-Z0-9]+");

The trailing + treats a run of separators (such as ", ") as a single delimiter. Note that this pattern keeps digits as well as letters.

For example, for the text "Ready, steady, go!" the string array would be strArray = [Ready, steady, go].

For example, for the text "You are the best!!!!!!!!!!!! CodeFighter ever!" the string array would be strArray = [You, are, the, best, CodeFighter, ever].
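One detail to watch with the bracket approach is the quantifier. A rough sketch (hypothetical class and method names) comparing the character class with and without a trailing +: without it, every single separator is its own delimiter, so ", " leaves an empty string between the words.

```java
import java.util.Arrays;

final class BracketSplitDemo {
    // Each non-alphanumeric character is a separate delimiter.
    static String[] splitEach(String text) {
        return text.split("[^a-zA-Z0-9]");
    }

    // A whole run of non-alphanumeric characters is one delimiter.
    static String[] splitRuns(String text) {
        return text.split("[^a-zA-Z0-9]+");
    }

    public static void main(String[] args) {
        String text = "Ready, steady, go!";
        System.out.println(Arrays.toString(splitEach(text)));  // [Ready, , steady, , go]
        System.out.println(Arrays.toString(splitRuns(text)));  // [Ready, steady, go]
    }
}
```

Trailing empty strings are dropped by split itself, which is why the final "!" causes no extra element in either case.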

Answer 4

Answered by Prudhvi kanth Chirumamilla

Wouldn't

"Here is an ex@mple".split("\\S\\w+")

work?
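A quick way to check is to run it. In the sketch below (the class name is made up), the pattern \S\w+ matches the words themselves ("Here", "is", "an", "ex", "@mple"), so split treats the words as delimiters and returns only what lies between them.

```java
final class NonSpaceWordSplitDemo {
    static String[] splitOnWords(String s) {
        // \S = one non-whitespace character, \w+ = one or more word characters.
        return s.split("\\S\\w+");
    }

    public static void main(String[] args) {
        String[] parts = splitOnWords("Here is an ex@mple");
        System.out.println(parts.length);  // 4
        // Prints [], [ ], [ ], [ ]: an empty leading piece, then the three spaces.
        for (String piece : parts) {
            System.out.println("[" + piece + "]");
        }
    }
}
```

So this pattern does the opposite of what the question asks for.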

Answer 5

Answered by Brendan Goggin

In addition to the other answers, you could iterate over the characters in the string, test if their ASCII values are in the range of lower and upper case letters, and perform your desired 'split' behavior if not.

char[] chars = str.toCharArray(); might be useful.
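A minimal sketch of that manual approach, assuming only ASCII letters count as alphabetic (the class and method names are illustrative, not from any library):

```java
import java.util.ArrayList;
import java.util.List;

final class ManualSplitDemo {
    // True for a-z and A-Z only; accented or non-Latin letters are treated as separators.
    static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static List<String> splitOnNonLetters(String str) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : str.toCharArray()) {
            if (isAsciiLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // A non-letter ends the current word; runs of non-letters add nothing.
                words.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            words.add(current.toString());  // flush the last word
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(splitOnNonLetters("Here is an ex@mple"));  // [Here, is, an, ex, mple]
    }
}
```

Unlike String.split, this version never produces empty strings, even when the input starts with a separator.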
