Convert ANSI characters to UTF-8 in Java

Notice: this page is adapted from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/1466184/

Tags: java, utf-8, character-encoding, ansi

Asked by n002213f

Is there a way to convert an ANSI string to UTF-8 using Java?

I have a custom serializer that uses the readUTF and writeUTF methods of the DataInputStream and DataOutputStream classes to deserialize and serialize strings. If I receive a string that is encoded in ANSI and is too long (around 100000 characters), I get this error:

Caused by: java.io.UTFDataFormatException: encoded string too long: 106958 bytes

However, in my JUnit tests I am able to create a string of 120000 'a' characters, and it works perfectly.

I have checked other posts on this topic but am still getting errors.

Accepted answer by ZZ Coder

This error is not caused by the character encoding. It means the encoded UTF data is too long.

EDIT: I just realized this is a write error, not a read error.

The UTF length field is only 2 bytes, so a single string can hold at most 65535 (64K) encoded bytes. You are trying to write more than 100K, which is not going to work.

This limit is hardcoded, and there is no way to get around it:

if (utflen > 65535)
    throw new UTFDataFormatException(
            "encoded string too long: " + utflen + " bytes");
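Since the limit comes from the 2-byte length field of writeUTF, one common way to store longer strings is to write the length and the bytes yourself. The following is only a sketch under that assumption: the class LongStringIO and the helpers writeLongString and readLongString are hypothetical names, and the resulting byte layout is not compatible with data written by writeUTF.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class LongStringIO {
    // Hypothetical helper: writes a 4-byte length followed by the UTF-8 bytes,
    // so the payload is not limited to 65535 bytes.
    static void writeLongString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Hypothetical helper: reads back a string written by writeLongString.
    static String readLongString(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Build a string well above the writeUTF limit.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 120000; i++) {
            sb.append('a');
        }
        String original = sb.toString();

        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            writeLongString(out, original);
        }
        try (DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray()))) {
            System.out.println(original.equals(readLongString(in)));
        }
    }
}
```

Both sides of the stream have to agree on this layout, so it only helps if you control the reader as well as the writer.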

Answer by iammichael

byte[] asciiBytes = ...;
String unicode = new String(asciiBytes, "US-ASCII");
byte[] utfBytes = unicode.getBytes("UTF-8");

Answer by Aaron Digulla

Which ANSI codepage? Many different character encodings are referred to as "ANSI". The DOS codepage is 437 (without the drawing symbols). If you use codepage 850, this will work:

String unicode = new String(bytes, "IBM850");

(where bytes is an array containing the ANSI-encoded characters). After that, you can convert this string into a byte array in any encoding using unicode.getBytes(encoding).

Windows often uses the codepage 1252 (use "windows-1252" for that).
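The decode-then-encode steps described above can be put together roughly as follows. This is a minimal sketch that assumes the input really is windows-1252; the class AnsiToUtf8 and the helper toUtf8 are hypothetical names used only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class AnsiToUtf8 {
    // Hypothetical helper: decodes the bytes with the named source charset,
    // then re-encodes the resulting String as UTF-8.
    static byte[] toUtf8(byte[] input, String sourceCharsetName) {
        Charset source = Charset.forName(sourceCharsetName);
        String decoded = new String(input, source);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xE9 is 'é' in windows-1252; in UTF-8 it becomes two bytes.
        byte[] ansi = { (byte) 0xE9 };
        byte[] utf8 = toUtf8(ansi, "windows-1252");
        System.out.println(utf8.length);
    }
}
```

Charset.forName throws an unchecked exception if the name is not supported by the running JVM, so a wrong codepage name fails early instead of silently producing garbage.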


Answer by István

ZZ Coder already answered the question, but I have written a more detailed explanation and suggested a workaround on this blog. Basically, the problem is in DataOutputStream, because it restricts the writable String to 64 KB. There are other possible workarounds to sidestep the issue; some might work without breaking the actual binary data format one is using...
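One possible workaround, not necessarily the one from the blog post mentioned above, is to split the string into chunks that each fit within the writeUTF limit. The sketch below assumes you control both the writer and the reader; ChunkedUtf, writeChunked and readChunked are hypothetical names, and the extra chunk count in front does change the binary format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ChunkedUtf {
    // writeUTF encodes each Java char in at most 3 bytes, so a chunk of
    // 65535 / 3 chars can never exceed the 65535-byte limit.
    static final int MAX_CHARS_PER_CHUNK = 65535 / 3;

    // Hypothetical helper: writes the number of chunks, then each chunk with writeUTF.
    static void writeChunked(DataOutputStream out, String s) throws IOException {
        int chunks = (s.length() + MAX_CHARS_PER_CHUNK - 1) / MAX_CHARS_PER_CHUNK;
        out.writeInt(chunks);
        for (int i = 0; i < chunks; i++) {
            int start = i * MAX_CHARS_PER_CHUNK;
            int end = Math.min(s.length(), start + MAX_CHARS_PER_CHUNK);
            out.writeUTF(s.substring(start, end));
        }
    }

    // Hypothetical helper: reads the chunks back and joins them.
    static String readChunked(DataInputStream in) throws IOException {
        int chunks = in.readInt();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < chunks; i++) {
            sb.append(in.readUTF());
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 120000; i++) {
            sb.append('a');
        }
        String original = sb.toString();

        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            writeChunked(out, original);
        }
        try (DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray()))) {
            System.out.println(original.equals(readChunked(in)));
        }
    }
}
```

The chunk size is deliberately conservative: it is sized for the worst case of 3 bytes per char, so plain ASCII text ends up in more chunks than strictly necessary.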