Converting from Windows-1252 to UTF-8 in Java: null characters with CharsetDecoder/Encoder

Question

Asked by robob

I know this is a very general question, but it is driving me mad.

I used this code:

String ucs2Content = new String(bufferToConvert, inputEncoding);
byte[] outputBuf = ucs2Content.getBytes(outputEncoding);
return outputBuf;

But I read that it is better to use CharsetDecoder and CharsetEncoder (my content probably contains some characters outside the destination encoding). I have just written this code, but it has a problem:

// Create the encoder and decoder for Win1252
Charset charsetInput = Charset.forName(inputEncoding);
CharsetDecoder decoder = charsetInput.newDecoder();

Charset charsetOutput = Charset.forName(outputEncoding);
CharsetEncoder encoder = charsetOutput.newEncoder();

// Convert the byte array from starting inputEncoding into UCS2
CharBuffer cbuf = decoder.decode(ByteBuffer.wrap(bufferToConvert));

// Convert the internal UCS2 representation into outputEncoding
ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(cbuf));
return bbuf.array();

However, this code appends a sequence of null characters to the buffer!

Could someone tell me where the problem is? I'm not very experienced with encoding conversion in Java.

Is there a better way to convert encodings in Java?

Answer 1

Accepted answer by jarnbjo

Your problem is that ByteBuffer.array() returns a direct reference to the array used as the backing store for the ByteBuffer, not a copy of the backing array's valid range. You have to respect bbuf.limit() (as Peter did in his response) and use only the array content from index 0 to bbuf.limit()-1.
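
A minimal sketch of that fix, assuming a hypothetical helper named toByteArray (it is not part of the original code): instead of returning the whole backing array, copy only the bytes between the buffer's position and its limit.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

class ValidRangeCopy {
    // Hypothetical helper: copies only the valid bytes (position..limit) of a buffer.
    static byte[] toByteArray(ByteBuffer buf) {
        byte[] out = new byte[buf.remaining()];
        buf.duplicate().get(out); // duplicate() keeps the caller's position untouched
        return out;
    }

    public static void main(String[] args) throws CharacterCodingException {
        ByteBuffer bbuf = StandardCharsets.UTF_8.newEncoder()
                .encode(CharBuffer.wrap("Hello World!"));
        byte[] exact = toByteArray(bbuf);
        // The backing array may be longer than the number of valid bytes.
        System.out.println("backing array: " + bbuf.array().length
                + " bytes, valid: " + exact.length + " bytes");
    }
}
```

In the question's code, this would mean returning toByteArray(bbuf) rather than bbuf.array().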

The extra 0 values in the backing array come from a slight flaw in how the CharsetEncoder creates the resulting ByteBuffer. Each CharsetEncoder has an "average bytes per character" value, which for the UCS2 encoder seems simple and correct (2 bytes/char). Using this fixed value, the CharsetEncoder initially allocates a ByteBuffer of "string length * average bytes per character" bytes, e.g. 20 bytes for a 10-character string. However, the UCS2 CharsetEncoder starts its output with a BOM (byte order mark), which also occupies 2 bytes, so only 9 of the 10 characters fit in the allocated ByteBuffer. The CharsetEncoder detects the overflow and allocates a new ByteBuffer with a length of 2*n+1 (n being the original length of the ByteBuffer), in this case 2*20+1 = 41 bytes. Since only 2 of the 21 new bytes are needed to encode the remaining character, the array you get from bbuf.array() has a length of 41 bytes, but bbuf.limit() indicates that only the first 22 entries are actually used.
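
The sizes described above can be inspected with a small experiment. This is only a sketch: the class and method names are made up for illustration, and the backing array length is an implementation detail that may differ between JDK versions, so the program prints the values rather than assuming them.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class BackingArrayDemo {
    // Hypothetical helper: returns {backing array length, limit} for the encoded text.
    static int[] sizes(Charset cs, String text) {
        try {
            ByteBuffer bbuf = cs.newEncoder().encode(CharBuffer.wrap(text));
            return new int[] { bbuf.array().length, bbuf.limit() };
        } catch (CharacterCodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String text = "0123456789"; // 10 characters, as in the example above
        for (Charset cs : new Charset[] { StandardCharsets.UTF_16, StandardCharsets.UTF_8 }) {
            CharsetEncoder encoder = cs.newEncoder();
            int[] s = sizes(cs, text);
            System.out.println(cs.name()
                    + ": averageBytesPerChar=" + encoder.averageBytesPerChar()
                    + ", backing array length=" + s[0]
                    + ", limit=" + s[1]);
        }
    }
}
```

Whenever the reported backing array length is larger than the limit, the bytes past the limit are the unused zeros that show up when the whole array is returned.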

Answer 2

Answer by Peter Lawrey

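I am not sure how you get a sequence of null characters. Try this:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;

// Decoder for the input bytes; windows-1252 is assumed here, as in the question title
Charset charsetInput = Charset.forName("windows-1252");
CharsetDecoder decoder = charsetInput.newDecoder();

String outputEncoding = "UTF-8";
Charset charsetOutput = Charset.forName(outputEncoding);
CharsetEncoder encoder = charsetOutput.newEncoder();

// Convert the byte array from inputEncoding into Java's internal char representation
byte[] bufferToConvert = "Hello World! £".getBytes(charsetInput);
CharBuffer cbuf = decoder.decode(ByteBuffer.wrap(bufferToConvert));

// Convert the internal representation into outputEncoding, reading only up to limit()
ByteBuffer bbuf = encoder.encode(cbuf);
System.out.println(new String(bbuf.array(), 0, bbuf.limit(), charsetOutput));

prints

Hello World! £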
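
Putting both answers together, a complete conversion routine could look roughly like the sketch below. The class name, the method name convert, and the choice of CodingErrorAction.REPLACE are assumptions made for illustration; REPLACE substitutes the charset's replacement for malformed or unmappable data instead of throwing, which may or may not be what you want for content outside the destination encoding.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class CharsetConverter {
    // Hypothetical helper: re-encodes bytes from one charset into another and
    // returns an array that contains only the valid output bytes.
    static byte[] convert(byte[] input, Charset from, Charset to) {
        CharsetDecoder decoder = from.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        CharsetEncoder encoder = to.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            CharBuffer chars = decoder.decode(ByteBuffer.wrap(input));
            ByteBuffer bytes = encoder.encode(chars);
            byte[] out = new byte[bytes.remaining()]; // position..limit only
            bytes.get(out);
            return out;
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE, but the API still declares it.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] win1252 = { 0x48, 0x69, 0x20, (byte) 0xA3 }; // "Hi £" in windows-1252
        byte[] utf8 = convert(win1252, Charset.forName("windows-1252"), StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 5: £ becomes the two bytes 0xC2 0xA3
    }
}
```

Switching the actions to CodingErrorAction.REPORT would make the method fail on bad input instead of silently replacing it.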
