Java App : Unable to read iso-8859-1 encoded file correctly

Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original URL, and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/498636/

Tags: java, encoding, character-encoding, iso-8859-1

Asked by Joel

I have a file which is encoded as iso-8859-1, and contains characters such as ô.

I am reading this file with java code, something like:

File in = new File("myfile.csv");
InputStream fr = new FileInputStream(in);
byte[] buffer = new byte[4096];
while (true) {
    int byteCount = fr.read(buffer, 0, buffer.length);
    if (byteCount <= 0) {
        break;
    }

    String s = new String(buffer, 0, byteCount,"ISO-8859-1");
    System.out.println(s);
}

However, the ô character is always garbled, usually printing as a ?.

I have read around the subject (and learnt a little on the way), but I still cannot get this working.

Interestingly, this works on my local PC (XP) but not on my Linux box.

I have checked that my JDK supports the required charsets (they are standard, so this is no surprise) using:

System.out.println(java.nio.charset.Charset.availableCharsets());

Accepted answer by Jon Skeet

I suspect that either your file isn't actually encoded as ISO-8859-1, or System.out doesn't know how to print the character.

To check for the first, I recommend examining the relevant byte in the file. To check for the second, examine the relevant character in the string, printing it out with:

 System.out.println((int) s.charAt(index));

In both cases the result should be 244 decimal; 0xf4 hex.
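
The two checks above can be sketched roughly as follows. This is only an illustration, not code from the answer: the class name `EncodingCheck` and both helper methods are hypothetical, and the file name `myfile.csv` is taken from the question. It relies on the fact that in ISO-8859-1 each byte maps to exactly one char, so the byte index and the char index coincide.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class, for illustration only.
class EncodingCheck {
    /** Index of the first byte outside the ASCII range, or -1 if there is none. */
    static int firstNonAscii(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            if ((data[i] & 0xFF) > 0x7F) {
                return i;
            }
        }
        return -1;
    }

    /** Decodes the bytes as ISO-8859-1 and returns the numeric value of the char at index. */
    static int decodedCharValue(byte[] data, int index) {
        String s = new String(data, StandardCharsets.ISO_8859_1);
        return (int) s.charAt(index);
    }

    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "myfile.csv";
        byte[] raw = Files.readAllBytes(Paths.get(path));
        int i = firstNonAscii(raw);
        if (i < 0) {
            System.out.println("No non-ASCII bytes found");
            return;
        }
        // Only digits are printed, so the console encoding cannot distort the result.
        System.out.println("byte in file:   " + (raw[i] & 0xFF));
        System.out.println("char in string: " + decodedCharValue(raw, i));
    }
}
```

If the file really contains ô in ISO-8859-1, both lines should report 244. Any other byte value points at the file's encoding rather than at the console.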

See my article on Unicode debugging for general advice (the code presented is in C#, but it's easy to convert to Java, and the principles are the same).

In general, by the way, I'd wrap the stream with an InputStreamReader with the right encoding - it's easier than creating new strings "by hand". I realise this may just be demo code though.

EDIT: Here's a really easy way to prove whether or not the console will work:

 System.out.println("Here's the character: \u00f4");

Answer by Peter Štibraný

If you can, try to run your program in a debugger to see what's inside your 's' string after it is created. It is possible that it has the correct content, but the output is garbled after the System.out.println(s) call. In that case, there is probably a mismatch between what Java thinks is the encoding of your output and the character encoding of your terminal/console on Linux.
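
If the mismatch is indeed on the output side, one possible workaround is to wrap System.out in a PrintStream with an explicitly chosen encoding instead of relying on the platform default. The following is a minimal sketch under that assumption: the class name `ExplicitOutputEncoding` and the `encode` helper are made up for illustration, and the charset you pass must match whatever your terminal actually uses.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper class, for illustration only.
class ExplicitOutputEncoding {
    /** Returns the bytes a PrintStream emits for the text when using the named charset. */
    static byte[] encode(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Assumption: the terminal expects ISO-8859-1 unless told otherwise.
        String terminalCharset = args.length > 0 ? args[0] : "ISO-8859-1";
        PrintStream out = new PrintStream(System.out, true, terminalCharset);
        out.println("Here's the character: \u00f4");
    }
}
```

Running it once with ISO-8859-1 and once with UTF-8 as the argument is a quick way to find out which encoding the terminal expects.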

Answer by Zach Scrivena

Parsing the file as fixed-size blocks of bytes is not good: what if some character has a byte representation that straddles two blocks? Use an InputStreamReader with the appropriate character encoding instead:

 BufferedReader br = new BufferedReader(
         new InputStreamReader(
         new FileInputStream("myfile.csv"), "ISO-8859-1"));

 char[] buffer = new char[4096]; // character (not byte) buffer 

 while (true)
 {
      int charCount = br.read(buffer, 0, buffer.length);

      if (charCount == -1) break; // reached end-of-stream 

      String s = String.valueOf(buffer, 0, charCount);
      // alternatively, we can append to a StringBuilder

      System.out.println(s);
 }

Btw, remember to check that the Unicode character can indeed be displayed correctly. You could also redirect the program output to a file and then compare it with the original file.

As Jon Skeet suggests, the problem may also be console-related. Try System.console().printf(s) to see if there is a difference (note that System.console() returns null when no interactive console is attached, for example when output is redirected).

Answer by Eek

Basically, if it works on your local XP PC but not on Linux, and you are parsing the exact same file (i.e. you transferred it in a binary fashion between the boxes), then it probably has something to do with the System.out.println call. I don't know how you verify the output, but if you do it by connecting with a remote shell from the XP box, then there is the character set of the shell (and the client) to consider.

Additionally, what Zach Scrivena suggests is also true - you cannot assume that you can create strings from chunks of data in that way - either use an InputStreamReader or read the complete data into an array first (obviously not going to work for a large file). However, since it does seem to work on XP, then I would venture that this is probably not your problem in this specific case.
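
To illustrate why block-wise decoding is risky in general but probably harmless for this particular file, here is a small sketch. The class name `ChunkDecoding` and the `decodeInBlocks` helper are hypothetical. It decodes each block independently, as the code in the question does: with ISO-8859-1 every character is a single byte, so no block boundary can split a character, whereas with UTF-8 a boundary can fall in the middle of a multi-byte sequence.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class ChunkDecoding {
    /** Decodes data block by block, each block independently of its neighbours. */
    static String decodeInBlocks(byte[] data, int blockSize, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += blockSize) {
            int len = Math.min(blockSize, data.length - off);
            sb.append(new String(data, off, len, charset));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "a\u00f4";

        // ISO-8859-1: bytes 61 F4, one byte per character, so any block size is safe.
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(text.equals(decodeInBlocks(latin1, 1, StandardCharsets.ISO_8859_1)));

        // UTF-8: bytes 61 C3 B4; a block size of 2 separates C3 from B4 and corrupts the character.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(text.equals(decodeInBlocks(utf8, 2, StandardCharsets.UTF_8)));
    }
}
```

The first comparison succeeds and the second fails, which is consistent with the point that the block-wise read is unlikely to be the cause for an ISO-8859-1 file.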

Answer by McDowell

@Joel - your own answer confirms that the problem is a difference between the default encoding on your operating system (UTF-8, the one Java has picked up) and the encoding your terminal is using (ISO-8859-1).

Consider this code:

public static void main(String[] args) throws IOException {
    byte[] data = { (byte) 0xF4 };
    String decoded = new String(data, "ISO-8859-1");
    if (!"\u00f4".equals(decoded)) {
        throw new IllegalStateException();
    }

    // write default charset
    System.out.println(Charset.defaultCharset());

    // dump bytes to stdout
    System.out.write(data);

    // will encode to default charset when converting to bytes
    System.out.println(decoded);
}

By default, my Ubuntu (8.04) terminal uses the UTF-8 encoding. With this encoding, this is printed:

UTF-8
?ô

If I switch the terminal's encoding to ISO 8859-1, this is printed:

UTF-8
ôÃ´

In both cases, the same bytes are being emitted by the Java program:

5554 462d 380a f4c3 b40a

The only difference is in how the terminal is interpreting the bytes it receives. In ISO 8859-1, ô is encoded as 0xF4. In UTF-8, ô is encoded as 0xC3B4. The other characters are common to both encodings.
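
The byte values quoted above are easy to reproduce. The sketch below (the class name `OCircumflexBytes` and its helpers are hypothetical) encodes ô in both charsets and then shows what happens when the UTF-8 bytes are misread as ISO-8859-1. Results are printed as hex and code point values so that the console encoding cannot interfere.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class OCircumflexBytes {
    /** Encodes the text with the given charset and renders the bytes as lowercase hex. */
    static String encodeHex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    /** Decodes the UTF-8 bytes of the text as if they were ISO-8859-1. */
    static String misreadUtf8AsLatin1(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String o = "\u00f4";
        System.out.println(encodeHex(o, StandardCharsets.ISO_8859_1)); // f4
        System.out.println(encodeHex(o, StandardCharsets.UTF_8));      // c3b4

        // The two UTF-8 bytes become two separate characters: U+00C3 and U+00B4.
        for (char c : misreadUtf8AsLatin1(o).toCharArray()) {
            System.out.println(String.format("U+%04X", (int) c));
        }
    }
}
```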
