How can I iterate through the Unicode code points of a Java String?

Question

Asked by rampion

So I know about String#codePointAt(int), but it's indexed by the char offset, not by the code point offset.

I'm thinking about trying something like:

using String#charAt(int) to get the char at an index
testing whether the char is in the high-surrogates range
- if so, use String#codePointAt(int) to get the code point, and increment the index by 2
- if not, use the given char value as the code point, and increment the index by 1

But my concerns are

I'm not sure whether code points which are naturally in the high-surrogates range will be stored as two char values or one
this seems like an awfully expensive way to iterate through characters
someone must have come up with something better.

Answer 1

Accepted answer by Jonathan Feinberg

Yes, Java uses a UTF-16-esque encoding for the internal representation of Strings, and yes, it encodes characters outside the Basic Multilingual Plane (BMP) using surrogate pairs.
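
To make the storage question concrete, here is a small sketch of how a character outside the BMP shows up in a String. The class name SurrogateDemo and the sample string are my own illustration, not part of the answer; the sample assumes U+1F600, written as the surrogate pair \uD83D\uDE00.

```java
final class SurrogateDemo {
    // "a", then U+1F600 (outside the BMP, stored as two chars), then "b".
    static final String SAMPLE = "a\uD83D\uDE00b";

    public static void main(String[] args) {
        // length() counts UTF-16 char units, not code points.
        System.out.println(SAMPLE.length());                              // 4
        System.out.println(SAMPLE.codePointCount(0, SAMPLE.length()));    // 3

        // Index 1 holds the high surrogate; codePointAt(1) combines it
        // with the low surrogate at index 2.
        System.out.println(Character.isHighSurrogate(SAMPLE.charAt(1)));  // true
        System.out.println(Integer.toHexString(SAMPLE.codePointAt(1)));   // 1f600

        // An unpaired surrogate occupies one char and is returned unchanged.
        System.out.println(Integer.toHexString("\uD83D".codePointAt(0))); // d83d
    }
}
```

So a lone value in the surrogate range takes one char, while a real supplementary character always takes two.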

If you know you'll be dealing with characters outside the BMP, then here is the canonical way to iterate over the characters of a Java String:

final int length = s.length();
for (int offset = 0; offset < length; ) {
   final int codepoint = s.codePointAt(offset);

   // do something with the codepoint

   offset += Character.charCount(codepoint);
}
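
As a runnable illustration of the loop above, the sketch below wraps it in a method that records the char offset at which each code point starts. The class CodePointLoop, the method describe, and the sample input are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointLoop {
    /** Returns one "offset:U+XXXX" entry per code point of s. */
    static List<String> describe(String s) {
        List<String> out = new ArrayList<>();
        final int length = s.length();
        for (int offset = 0; offset < length; ) {
            final int codepoint = s.codePointAt(offset);
            out.add(String.format("%d:U+%04X", offset, codepoint));
            // charCount is 2 for supplementary code points, 1 otherwise.
            offset += Character.charCount(codepoint);
        }
        return out;
    }

    public static void main(String[] args) {
        // U+1F600 starts at offset 1 and spans two chars, so "b" is at offset 3.
        System.out.println(describe("a\uD83D\uDE00b")); // [0:U+0061, 1:U+1F600, 3:U+0062]
    }
}
```

Note that the offsets jump from 1 to 3: the loop never lands in the middle of a surrogate pair.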

Answer 2

Answered by Alexander Egger

Iterating over code points has been filed as a feature request at Sun.

See Sun Bug Entry

There is also an example there of how to iterate over a String's code points.

Answer 3

Answered by rogerdpack

Thought I'd add a workaround method that works with foreach loops (ref), plus you can easily convert it to Java 8's new String#codePoints method when you move to Java 8:

You can use it with foreach like this:

 for(int codePoint : codePoints(myString)) {
   ....
 }

Here's the helper method:

import java.util.Iterator;
import java.util.NoSuchElementException;

public static Iterable<Integer> codePoints(final String string) {
  return new Iterable<Integer>() {
    public Iterator<Integer> iterator() {
      return new Iterator<Integer>() {
        int nextIndex = 0;  // char offset of the next code point
        public boolean hasNext() {
          return nextIndex < string.length();
        }
        public Integer next() {
          if (!hasNext()) {
            throw new NoSuchElementException();
          }
          int result = string.codePointAt(nextIndex);
          nextIndex += Character.charCount(result);  // advance by 1 or 2 chars
          return result;
        }
        public void remove() {
          throw new UnsupportedOperationException();
        }
      };
    }
  };
}

Or alternately, if you just want to convert a string to a list of code points (which might use more RAM than the above approach):

import java.util.ArrayList;
import java.util.List;

public static List<Integer> stringToCodePoints(String in) {
  if (in == null)
    throw new NullPointerException("got null");
  List<Integer> out = new ArrayList<Integer>();
  final int length = in.length();
  for (int offset = 0; offset < length; ) {
    final int codepoint = in.codePointAt(offset);
    out.add(codepoint);  // boxes the int into an Integer
    // advance by 1 char for BMP code points, 2 for surrogate pairs
    offset += Character.charCount(codepoint);
  }
  return out;
}

Thankfully, using "codePoints" safely handles the surrogate pairs of UTF-16 (Java's internal string representation).

Answer 4

Answered by Alex

Java 8 added CharSequence#codePoints, which returns an IntStream containing the code points. You can use the stream directly to iterate over them:

string.codePoints().forEach(c -> ...);

or with a for loop by collecting the stream into an array:

for(int c : string.codePoints().toArray()){
    ...
}

These approaches are probably more expensive than Jonathan Feinberg's solution, but the code is quicker to read and write, and the performance difference will usually be insignificant.
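
For a complete picture of both styles, here is a minimal sketch assuming Java 8 or later. The class CodePointStreams and its two methods are hypothetical helpers written for this example; they use the stream directly in one case and the array form in the other.

```java
import java.util.stream.Collectors;

final class CodePointStreams {
    /** Stream style: formats each code point as U+XXXX, separated by spaces. */
    static String hex(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    /** Array style: reverses s code point by code point, keeping surrogate pairs intact. */
    static String reverse(String s) {
        int[] cps = s.codePoints().toArray();
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = cps.length - 1; i >= 0; i--) {
            sb.appendCodePoint(cps[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sample = "a\uD83D\uDE00b"; // "a", U+1F600, "b"
        System.out.println(hex(sample));  // U+0061 U+1F600 U+0062
        System.out.println(reverse(sample).equals("b\uD83D\uDE00a")); // true
    }
}
```

The array form allocates an int[] as long as the number of code points, which is where the extra cost mentioned above comes from.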
