Java: get the size of a String with encoding (in bytes) without converting to byte[]

Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/19852460/

Time: 2020-08-12 20:42:30  Source: igfitidea

Get size of String w/ encoding in bytes without converting to byte[]

Tags: java, string, size, byte

Asked by elhefe

I have a situation where I need to know the size of a String/encoding pair, in bytes, but cannot use the getBytes() method because 1) the String is very large and duplicating it in a byte[] array would use a large amount of memory, but more to the point 2) getBytes() allocates a byte[] array based on the length of the String * the maximum possible bytes per character. So if I have a String with 1.5B characters and UTF-16 encoding, getBytes() will try to allocate a 3GB array and fail, since arrays are limited to 2^31 - X elements (X is Java version specific).
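
To make the arithmetic above concrete, here is a minimal sketch that multiplies a character count by the encoder's reported maximum bytes per character and compares the result with the largest possible array index. The class and method names are hypothetical, and whether getBytes() sizes its buffer exactly this way depends on the JDK version; this only illustrates the worst-case estimate.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class WorstCaseEstimate {

    /** Upper bound on the encoded size: chars * encoder's max bytes per char. */
    static long worstCaseBytes(long charCount, Charset charset) {
        float maxPerChar = charset.newEncoder().maxBytesPerChar();
        return (long) Math.ceil(charCount * (double) maxPerChar);
    }

    /** A Java array index is an int, so no array can hold more elements than this. */
    static boolean fitsInArray(long bytes) {
        return bytes <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        long chars = 1_500_000_000L; // the 1.5B characters from the question
        long worst = worstCaseBytes(chars, StandardCharsets.UTF_16);
        System.out.println("worst case bytes: " + worst);
        System.out.println("fits in a byte[]: " + fitsInArray(worst));
    }
}
```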

So - is there some way to calculate the byte size of a String/encoding pair directly from the Stringobject?

UPDATE:

Here's a working implementation of jtahlborn's answer:

private class CountingOutputStream extends OutputStream {
    long total; // long, because the byte count may exceed Integer.MAX_VALUE

    @Override
    public void write(int i) {
        throw new RuntimeException("don't use");
    }

    @Override
    public void write(byte[] b) {
        total += b.length;
    }

    @Override
    public void write(byte[] b, int offset, int len) {
        total += len;
    }
}

Accepted answer by jtahlborn

Simple, just write it to a dummy output stream:

class CountingOutputStream extends OutputStream {
  // long, so that totals above Integer.MAX_VALUE do not overflow
  private long _total;

  @Override public void write(int b) {
    ++_total;
  }

  @Override public void write(byte[] b) {
    _total += b.length;
  }

  @Override public void write(byte[] b, int offset, int len) {
    _total += len;
  }

  public long getTotalSize() {
    return _total;
  }
}

CountingOutputStream cos = new CountingOutputStream();
Writer writer = new OutputStreamWriter(cos, "my_encoding");
//writer.write(myString);

// UPDATE: OutputStreamWriter does a simple copy of the _entire_ input string, to avoid that use:
for(int i = 0; i < myString.length(); i+=8096) {
  int end = Math.min(myString.length(), i+8096);
  writer.write(myString, i, end - i);
}

writer.flush();

System.out.println("Total bytes: " + cos.getTotalSize());

It's not only simple, but probably just as fast as the other "complex" answers.
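
For reference, the pieces above can be put together into one self-contained sketch. The names EncodedSizeCounter, ByteCountingStream and encodedSize are made up for this example, and the 8192-char chunk size is an arbitrary choice; it is one way to arrange the idea, not the only one.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper around the counting-stream idea.
final class EncodedSizeCounter {

    /** Discards every byte written to it and only keeps a running total. */
    static final class ByteCountingStream extends OutputStream {
        private long total; // long, so totals above 2 GB do not overflow

        @Override public void write(int b) { total++; }

        @Override public void write(byte[] b, int off, int len) { total += len; }

        long total() { return total; }
    }

    /** Encodes s chunk by chunk into a counting sink and returns the byte total. */
    static long encodedSize(String s, Charset charset) {
        final int chunk = 8192; // chars handed to the writer per call (arbitrary)
        ByteCountingStream sink = new ByteCountingStream();
        try (Writer writer = new OutputStreamWriter(sink, charset)) {
            for (int i = 0; i < s.length(); i += chunk) {
                writer.write(s, i, Math.min(chunk, s.length() - i));
            }
        } catch (IOException e) {
            // The sink never throws, so this is not expected to happen.
            throw new UncheckedIOException(e);
        }
        // Closing the writer above flushed any buffered bytes into the sink.
        return sink.total();
    }

    public static void main(String[] args) {
        String sample = "h\u00e9llo \ud83d\ude00";
        System.out.println(encodedSize(sample, StandardCharsets.UTF_8));
        System.out.println(sample.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

The two printed numbers should agree; for a real workload only the first call is needed.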

Answer by brettw

Ok, this is extremely gross. I admit that, but this stuff is hidden by the JVM, so we have to dig a little. And sweat a little.

First, we want the actual char[] that backs a String without making a copy. To do this we have to use reflection to get at the 'value' field:

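// Note: this relies on the String layout of Java 8 and earlier, where "value" is a char[].
// Since Java 9 (compact strings) the field is a byte[], so the cast below would fail there.
char[] chars = null;
for (Field field : String.class.getDeclaredFields()) {
    if ("value".equals(field.getName())) {
        field.setAccessible(true);                // bypass the private modifier
        chars = (char[]) field.get(string); // <--- got it!
        break;
    }
}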

Next you need to implement a subclass of java.nio.ByteBuffer. Something like:

class MyByteBuffer extends ByteBuffer {
    int length;            
    // Your implementation here
};

Ignore all of the getters and implement all of the put methods, such as put(byte) and putChar(char). Inside something like put(byte), increment length by 1; inside put(byte[]), increment length by the array length. Get it? For everything that is put, you add its size to length. But you're not storing anything in your ByteBuffer; you're just counting and throwing away, so no space is taken. If you set breakpoints in the put methods, you can probably figure out which ones you actually need to implement. putFloat(float) is probably not used, for example.

Now for the grand finale, putting it all together:

MyByteBuffer bbuf = new MyByteBuffer();         // your "counting" buffer
CharBuffer cbuf = CharBuffer.wrap(chars);       // wrap your char array
Charset charset = Charset.forName("UTF-8");     // your charset goes here
CharsetEncoder encoder = charset.newEncoder();  // make a new encoder
encoder.encode(cbuf, bbuf, true);               // do it!
System.out.printf("Length: %d\n", bbuf.length); // pay me US$1,000,000
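
One caveat: java.nio.ByteBuffer exposes no public or protected constructor, so a subclass written outside the java.nio package will most likely not compile. A similar "count and throw away" effect can be sketched without subclassing or reflection by encoding into a small scratch buffer and resetting it after every pass. This is a swapped-in variant, not the answer's exact method; the class name, method name and the 8192-byte buffer size are assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: counts encoded bytes using a fixed-size scratch buffer.
final class EncoderByteCounter {

    static long encodedSize(CharSequence text, Charset charset) {
        // REPLACE mirrors String.getBytes(), which substitutes bad input
        // instead of failing, so encode() only reports underflow or overflow.
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        CharBuffer in = CharBuffer.wrap(text);       // read-only view, characters are not copied
        ByteBuffer out = ByteBuffer.allocate(8192);  // scratch space, reused on every pass
        long total = 0;
        CoderResult result;
        do {
            result = encoder.encode(in, out, true);  // fill the scratch buffer
            total += out.position();                 // count what was produced...
            out.clear();                             // ...and throw it away
        } while (result.isOverflow());
        do {
            result = encoder.flush(out);             // some encoders emit trailing bytes
            total += out.position();
            out.clear();
        } while (result.isOverflow());
        return total;
    }

    public static void main(String[] args) {
        String sample = "h\u00e9llo \ud83d\ude00";
        System.out.println(encodedSize(sample, StandardCharsets.UTF_8));
        System.out.println(sample.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Memory use stays at the size of the scratch buffer regardless of how long the input is.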

Answer by elhefe

Here's an apparently working implementation:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TestUnicode {

    private final static int ENCODE_CHUNK = 100;

    public static long bytesRequiredToEncode(final String s,
            final Charset encoding) {
        long count = 0;
        for (int i = 0; i < s.length(); ) {
            int end = i + ENCODE_CHUNK;
            if (end >= s.length()) {
                end = s.length();
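            } else if (Character.isLowSurrogate(s.charAt(end))) {
                // The chunk would end between the two halves of a surrogate
                // pair; extend it by one char so the pair stays together.
                end++;
            }
            count += encoding.encode(s.substring(i, end)).remaining();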
            i = end;
        }
        return count;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            sb.appendCodePoint(11614);
            sb.appendCodePoint(1061122);
            sb.appendCodePoint(2065);
            sb.appendCodePoint(1064124);
        }
        Charset cs = StandardCharsets.UTF_8;

        System.out.println(bytesRequiredToEncode(new String(sb), cs));
        System.out.println(new String(sb).getBytes(cs).length);
    }
}

The output is:

1400
1400

In practice I'd increase ENCODE_CHUNK to 10M chars or so.

Probably slightly less efficient than brettw's answer, but simpler to implement.

Answer by 30thh

The same using apache-commons libraries:

public static long stringLength(String string, Charset charset) {

    try (NullOutputStream nul = new NullOutputStream();
         CountingOutputStream count = new CountingOutputStream(nul)) {

        IOUtils.write(string, count, charset.name());
        count.flush();
        return count.getByteCount(); // long-valued counter; getCount() is limited to int
    } catch (IOException e) {
        throw new IllegalStateException("Unexpected I/O.", e);
    }
}

Answer by Caio Cunha

Guava has an implementation according to this post:

Utf8.encodedLength()
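
If pulling in Guava is not an option, the UTF-8 case can also be computed by hand, since the byte length of each code point follows directly from its value. The sketch below is plain Java and is not Guava's code; the class and method names are hypothetical. It counts an unpaired surrogate as 1 byte to match String.getBytes(), which substitutes '?' for it.

```java
// Hypothetical helper: UTF-8 length of a char sequence without encoding it.
final class Utf8Length {

    static long utf8Length(CharSequence s) {
        long bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                bytes += 1;                       // ASCII
            } else if (c < 0x800) {
                bytes += 2;                       // U+0080..U+07FF
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                bytes += 4;                       // valid pair: supplementary code point
                i++;                              // skip the low surrogate
            } else if (Character.isSurrogate(c)) {
                bytes += 1;                       // unpaired surrogate, counted as '?'
            } else {
                bytes += 3;                       // rest of the BMP
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        String sample = "h\u00e9llo \u20ac \ud83d\ude00";
        System.out.println(utf8Length(sample));
        System.out.println(sample.getBytes(java.nio.charset.StandardCharsets.UTF_8).length);
    }
}
```

This only covers UTF-8; for other charsets one of the encoder-based approaches above is needed.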
