Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/1527856/

Time: 2020-08-12 13:51:34  Source: igfitidea

How can I iterate through the Unicode code points of a Java String?

Tags: java, string, unicode

Asked by rampion

So I know about String#codePointAt(int), but it's indexed by the char offset, not by the code point offset.

I'm thinking about trying something like:

But my concerns are

  • I'm not sure whether code points which are naturally in the high-surrogates range will be stored as two char values or one
  • this seems like an awfully expensive way to iterate through characters
  • someone must have come up with something better.

Accepted answer by Jonathan Feinberg

Yes, Java uses a UTF-16-esque encoding for internal representations of Strings, and, yes, it encodes characters outside the Basic Multilingual Plane (BMP) using the surrogacy scheme.
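
To make the surrogate behaviour concrete, here is a minimal sketch (the class and method names are made up for illustration, not part of the original answer). It builds a string from U+1D11E, which lies outside the BMP, and shows that it occupies two char values but counts as one code point. It also shows why codePointAt is awkward for code point indexing: it takes a char offset, so an offset in the middle of a pair returns the lone low surrogate. String#offsetByCodePoints can translate a code point index into a char offset.

```java
class SurrogateDemo {
    // Number of char values (UTF-16 code units) in s.
    static int charLength(String s) {
        return s.length();
    }

    // Number of Unicode code points in s.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Hypothetical helper: the n-th code point (0-based), found by first
    // converting the code point index into a char offset.
    // Throws IndexOutOfBoundsException if n is out of range.
    static int nthCodePoint(String s, int n) {
        return s.codePointAt(s.offsetByCodePoints(0, n));
    }

    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) is encoded as the pair D834 DD1E.
        String clef = new String(Character.toChars(0x1D11E));
        String s = "a" + clef + "b";

        System.out.println(charLength(s));       // 4
        System.out.println(codePointLength(s));  // 3

        // Offset 1 is the high surrogate, so the full code point comes back.
        System.out.println(Integer.toHexString(s.codePointAt(1))); // 1d11e
        // Offset 2 is the low surrogate on its own.
        System.out.println(Integer.toHexString(s.codePointAt(2))); // dd1e

        // Code point index 2 is 'b', which sits at char offset 3.
        System.out.println((char) nthCodePoint(s, 2)); // b
    }
}
```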

If you know you'll be dealing with characters outside the BMP, then here is the canonical way to iterate over the characters of a Java String:

final int length = s.length();
for (int offset = 0; offset < length; ) {
   final int codepoint = s.codePointAt(offset);

   // do something with the codepoint

   offset += Character.charCount(codepoint);
}
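
For anyone who wants to try the loop above, here is a small self-contained sketch that wraps it in a class. The class name, the describe method and the U+XXXX output format are illustrative choices, not part of the original answer.

```java
class CodePointLoop {
    // Formats every code point of s as U+XXXX, separated by spaces,
    // using the offset-based loop.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        final int length = s.length();
        for (int offset = 0; offset < length; ) {
            final int codepoint = s.codePointAt(offset);

            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", codepoint));

            // Advance by 1 char for BMP code points, 2 for surrogate pairs.
            offset += Character.charCount(codepoint);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the BMP and takes two chars in the string.
        String s = "a" + new String(Character.toChars(0x1F600)) + "b";
        System.out.println(describe(s)); // U+0061 U+1F600 U+0062
    }
}
```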

Answer by Alexander Egger

Iterating over code points is filed as a feature request at Sun.

See Sun Bug Entry

There is also an example on how to iterate over String CodePoints there.

Answer by rogerdpack

Thought I'd add a workaround method that works with foreach loops (ref), plus you can easily convert it to Java 8's new String#codePoints method when you move to Java 8:

You can use it with foreach like this:

 for(int codePoint : codePoints(myString)) {
   ....
 }

Here's the helper method:

import java.util.Iterator;
import java.util.NoSuchElementException;

public static Iterable<Integer> codePoints(final String string) {
  return new Iterable<Integer>() {
    public Iterator<Integer> iterator() {
      return new Iterator<Integer>() {
        int nextIndex = 0; // char offset of the next code point
        public boolean hasNext() {
          return nextIndex < string.length();
        }
        public Integer next() {
          if (!hasNext()) {
            throw new NoSuchElementException(); // honor the Iterator contract
          }
          int result = string.codePointAt(nextIndex);
          nextIndex += Character.charCount(result); // skip 1 or 2 chars
          return result;
        }
        public void remove() {
          throw new UnsupportedOperationException();
        }
      };
    }
  };
}
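
As a rough sketch of the Java 8 conversion mentioned above: Iterable has a single abstract method, and the PrimitiveIterator.OfInt returned by IntStream#iterator is an Iterator<Integer>, so the helper can shrink to a one-line lambda. The class name and the collect method are only there to make the example runnable.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointIterable {
    // Java 8+ variant of the helper: each call to iterator() opens a
    // fresh code point stream over the string.
    static Iterable<Integer> codePoints(final String string) {
        return () -> string.codePoints().iterator();
    }

    // Illustrative caller: gathers the code points with a foreach loop.
    static List<Integer> collect(String s) {
        List<Integer> out = new ArrayList<>();
        for (int codePoint : codePoints(s)) {
            out.add(codePoint);
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "x" + new String(Character.toChars(0x1F600));
        for (int codePoint : codePoints(s)) {
            System.out.println(Integer.toHexString(codePoint));
        }
        // prints 78, then 1f600
    }
}
```

Existing foreach call sites stay unchanged with this variant.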

Alternatively, if you just want to convert a string to a list of code points (which might use more RAM than the above approach):

import java.util.ArrayList;
import java.util.List;

public static List<Integer> stringToCodePoints(String in) {
  if (in == null)
    throw new NullPointerException("got null");
  List<Integer> out = new ArrayList<Integer>();
  final int length = in.length();
  for (int offset = 0; offset < length; ) {
    final int codepoint = in.codePointAt(offset);
    out.add(codepoint);
    offset += Character.charCount(codepoint); // advance by 1 or 2 chars
  }
  return out;
}
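
If memory is the concern, one possible variation (a sketch, not part of the original answer; the names are invented) is to fill a primitive int[] instead of a List<Integer>, which avoids boxing every code point. String#codePointCount gives the exact array size up front.

```java
class CodePointArray {
    // Converts a string to a primitive array of its code points.
    static int[] toCodePointArray(String in) {
        if (in == null) {
            throw new NullPointerException("got null");
        }
        final int length = in.length();
        // One slot per code point, not per char.
        int[] out = new int[in.codePointCount(0, length)];
        int i = 0;
        for (int offset = 0; offset < length; ) {
            final int codepoint = in.codePointAt(offset);
            out[i++] = codepoint;
            offset += Character.charCount(codepoint);
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "a" + new String(Character.toChars(0x1F600)) + "b";
        int[] cps = toCodePointArray(s);
        System.out.println(cps.length); // 3
    }
}
```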

Thankfully, working with code points this way safely handles the surrogate pairs of UTF-16 (Java's internal string representation).

Answer by Alex

Java 8 added CharSequence#codePoints, which returns an IntStream containing the code points. You can use the stream directly to iterate over them:

string.codePoints().forEach(c -> ...);

or with a for loop by collecting the stream into an array:

for(int c : string.codePoints().toArray()){
    ...
}
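
Because codePoints() returns an ordinary IntStream, the usual stream operations apply too. The following sketch (class and method names are illustrative) counts the code points outside the BMP and formats each code point as U+XXXX.

```java
import java.util.stream.Collectors;

class CodePointStreams {
    // Counts code points above U+FFFF (those stored as surrogate pairs).
    static long countSupplementary(String s) {
        return s.codePoints()
                .filter(Character::isSupplementaryCodePoint)
                .count();
    }

    // Formats every code point as U+XXXX, separated by spaces.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(c -> String.format("U+%04X", c))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String s = "a" + new String(Character.toChars(0x1F600)) + "b";
        System.out.println(countSupplementary(s)); // 1
        System.out.println(describe(s));           // U+0061 U+1F600 U+0062
    }
}
```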

These approaches are probably more expensive than Jonathan Feinberg's solution, but they are faster to read and write, and the performance difference will usually be insignificant.