How can I iterate through the Unicode code points of a Java String?
Original question: http://stackoverflow.com/questions/1527856/
Note: this content comes from StackOverflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by rampion
So I know about String#codePointAt(int), but it's indexed by the char offset, not by the codepoint offset.
I'm thinking about trying something like:
- using String#charAt(int) to get the char at an index
- testing whether the char is in the high-surrogates range
  - if so, use String#codePointAt(int) to get the codepoint, and increment the index by 2
  - if not, use the given char value as the codepoint, and increment the index by 1
But my concerns are
- I'm not sure whether codepoints which are naturally in the high-surrogates range will be stored as two char values or one
- this seems like an awfully expensive way to iterate through characters
- someone must have come up with something better.
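For reference, the charAt-based steps described in the question could be sketched roughly as follows. The class name ManualCodePointWalk and the method walk are hypothetical, and the extra check that a high surrogate is actually followed by a low surrogate is an assumption added here so that an unpaired surrogate is passed through as a single char rather than skipping a neighbour.

```java
import java.util.Arrays;

// Hypothetical illustration of the manual approach from the question.
final class ManualCodePointWalk {

    // Collects the code points of s using charAt plus a surrogate test.
    static int[] walk(String s) {
        int[] out = new int[s.length()]; // at most one code point per char
        int count = 0;
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                // Valid surrogate pair: one code point stored in two chars.
                out[count++] = s.codePointAt(i);
                i += 2;
            } else {
                // BMP char (or an unpaired surrogate, kept as-is): one char.
                out[count++] = c;
                i += 1;
            }
        }
        return Arrays.copyOf(out, count);
    }

    public static void main(String[] args) {
        // "a", U+1F600 (written as a surrogate pair), "b"
        for (int cp : walk("a\uD83D\uDE00b")) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
        }
    }
}
```

Since String#codePointAt already performs the same pairing check internally, this sketch mostly shows that the manual test is redundant rather than wrong.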
Accepted answer by Jonathan Feinberg
Yes, Java uses a UTF-16-style encoding for the internal representation of Strings, and, yes, it encodes characters outside the Basic Multilingual Plane (BMP) as surrogate pairs.
If you know you'll be dealing with characters outside the BMP, then here is the canonical way to iterate over the characters of a Java String:
final int length = s.length();
for (int offset = 0; offset < length; ) {
    final int codepoint = s.codePointAt(offset);
    // do something with the codepoint
    offset += Character.charCount(codepoint);
}
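To make the surrogate encoding concrete, here is a small sketch that wraps the loop above in a method and runs it on a string containing one character outside the BMP. The class name SurrogateWalkDemo and the methods describe and walk are made up for illustration; only String#length, String#codePointCount, String#codePointAt and Character.charCount come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of the original answer.
final class SurrogateWalkDemo {

    // Compares the number of char values with the number of code points.
    static String describe(String s) {
        return "chars=" + s.length()
                + " codePoints=" + s.codePointCount(0, s.length());
    }

    // Same loop as in the answer, recording "charOffset:U+XXXX" per code point.
    static List<String> walk(String s) {
        List<String> out = new ArrayList<>();
        final int length = s.length();
        for (int offset = 0; offset < length; ) {
            final int codepoint = s.codePointAt(offset);
            out.add(String.format("%d:U+%04X", offset, codepoint));
            offset += Character.charCount(codepoint); // 1 for BMP, 2 otherwise
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // U+1F600 written as its surrogate pair
        System.out.println(describe(s)); // prints "chars=4 codePoints=3"
        System.out.println(walk(s));     // prints "[0:U+0061, 1:U+1F600, 3:U+0062]"
    }
}
```

The offsets jump from 1 to 3 because the supplementary character occupies two char positions.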
Answer by Alexander Egger
Iterating over code points is filed as a feature request at Sun.
See Sun Bug Entry
There is also an example on how to iterate over String CodePoints there.
Answer by rogerdpack
Thought I'd add a workaround method that works with foreach loops (ref), plus you can convert it to Java 8's new String#codePoints method easily when you move to Java 8:
You can use it with foreach like this:
for (int codePoint : codePoints(myString)) {
    ....
}
Here's the helper method:
import java.util.Iterator;
import java.util.NoSuchElementException;

public static Iterable<Integer> codePoints(final String string) {
    return new Iterable<Integer>() {
        public Iterator<Integer> iterator() {
            return new Iterator<Integer>() {
                int nextIndex = 0; // char offset of the next code point
                public boolean hasNext() {
                    return nextIndex < string.length();
                }
                public Integer next() {
                    if (!hasNext()) {
                        throw new NoSuchElementException();
                    }
                    int result = string.codePointAt(nextIndex);
                    // advance by 1 or 2 chars depending on the code point
                    nextIndex += Character.charCount(result);
                    return result;
                }
                public void remove() {
                    throw new UnsupportedOperationException();
                }
            };
        }
    };
}
Or alternatively, if you just want to convert a string to a list of code points (which might use more RAM than the above approach):
import java.util.ArrayList;
import java.util.List;

public static List<Integer> stringToCodePoints(String in) {
    if (in == null)
        throw new NullPointerException("got null");
    List<Integer> out = new ArrayList<Integer>();
    final int length = in.length();
    for (int offset = 0; offset < length; ) {
        final int codepoint = in.codePointAt(offset);
        out.add(codepoint);
        offset += Character.charCount(codepoint);
    }
    return out;
}
Thankfully, using "codePoints" safely handles the surrogate pairs of UTF-16 (Java's internal string representation).
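This answer mentions that the helper can be converted to Java 8's codePoints method. One possible way to do that, as a sketch and assuming Java 8 or later, is to back the Iterable with the stream's iterator. The class name CodePointIterables and the toList method are hypothetical; the conversion relies on the fact that IntStream#iterator returns a PrimitiveIterator.OfInt, which is also an Iterator<Integer>.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical Java 8+ variant of the codePoints(String) helper.
final class CodePointIterables {

    // Each call to iterator() opens a fresh stream, so the Iterable is reusable.
    static Iterable<Integer> codePoints(final String string) {
        return () -> string.codePoints().iterator();
    }

    // Usage example: gather the code points with a plain foreach loop.
    static List<Integer> toList(String s) {
        List<Integer> out = new ArrayList<>();
        for (int codePoint : codePoints(s)) {
            out.add(codePoint);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(toList("a\uD83D\uDE00b")); // prints "[97, 128512, 98]"
    }
}
```

Like the original helper, this boxes every code point into an Integer; the foreach call site stays the same.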
Answer by Alex
Java 8 added CharSequence#codePoints, which returns an IntStream containing the code points.
You can use the stream directly to iterate over them:
string.codePoints().forEach(c -> ...);
or with a for loop by collecting the stream into an array:
for (int c : string.codePoints().toArray()) {
    ...
}
These approaches are probably more expensive than Jonathan Feinberg's solution, but they are quicker to read and write, and the performance difference will usually be insignificant.
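As a concrete sketch of the two Java 8 forms in this answer, the following fills in the loop bodies with two made-up tasks: formatting each code point as hex, and counting code points outside the BMP. The class name CodePointStreamDemo and both method names are hypothetical and chosen only for illustration.

```java
// Hypothetical demo of the stream form and the array form.
final class CodePointStreamDemo {

    // Stream form: forEach over the IntStream of code points.
    static String hexCodePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(c -> sb.append(String.format("U+%04X ", c)));
        return sb.toString().trim();
    }

    // Array form: collect the stream into an int[] and use a for loop.
    static int countSupplementary(String s) {
        int count = 0;
        for (int c : s.codePoints().toArray()) {
            if (Character.isSupplementaryCodePoint(c)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00"; // "a" followed by U+1F600
        System.out.println(hexCodePoints(s));      // prints "U+0061 U+1F600"
        System.out.println(countSupplementary(s)); // prints "1"
    }
}
```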