Notice: this page is adapted from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/10070431/

Time: 2020-10-30 23:28:25  Source: igfitidea

File encoded as UCS-2 Little Endian reports 2x too many lines to Java

java character-encoding

Asked by The111

I was processing several txt files with a simple Java program, and the first step of my process is counting the lines of each file:

int count = 0;
br = new BufferedReader(new FileReader(myFile)); // myFile is the txt file in question
while (br.readLine() != null) {
    count++;
}

For one of my files, Java was counting exactly twice as many lines as there really were! This was confusing me greatly at first. I opened each file in Notepad++ and could see that the mis-counting file ended every line in exactly the same way as the other files, with a CR and LF. I did a little more poking around and noticed that all my "ok" files were ANSI encoded, and the one problem file was encoded as UCS-2 Little Endian (which I know nothing about). I got these files elsewhere, so I have no idea why the one was encoded that way, but of course switching it to ANSI fixed the issue.

But now curiosity remains. Why was the encoding causing a double line count report?

Thanks!

Answered by Lucero

Simple: if you read UCS-2 (or UTF-16) text with the wrong encoding (e.g. ANSI, or any 8-bit encoding), then every second character is a 0x0. This breaks each CR-LF into CR-0-LF, which is seen as two line breaks (one for the CR and one for the LF).
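
To see the effect in isolation, here is a minimal sketch that decodes the same UTF-16LE bytes twice, once with the right charset and once with an 8-bit one. The class name WrongCharsetDemo, the helper countLines, and the sample text are made up for illustration; ISO-8859-1 stands in for whatever "ANSI" code page the platform default happens to be.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WrongCharsetDemo {
    // Counts lines with the same readLine() loop as in the question,
    // but decodes the bytes with the charset passed in.
    static int countLines(byte[] data, Charset charset) {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), charset))) {
            int count = 0;
            while (br.readLine() != null) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Two real lines, each ended by CR LF, stored as UTF-16LE (two bytes per character).
        byte[] data = "one\r\ntwo\r\n".getBytes(StandardCharsets.UTF_16LE);

        // Correct charset: CR LF is decoded as an adjacent pair and counts as one line break.
        System.out.println("UTF-16LE:   " + countLines(data, StandardCharsets.UTF_16LE));

        // 8-bit charset: a NUL character lands between CR and LF,
        // so readLine() treats them as two separate line breaks.
        System.out.println("ISO-8859-1: " + countLines(data, StandardCharsets.ISO_8859_1));
    }
}
```

The second count comes out at roughly twice the first. It may not be exactly double, because a NUL left over after the final LF can itself be returned as one more line.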

Answered by Jon Skeet

This is the problem:

new FileReader(myFile)

That will use the platform default encoding. Don't do that. Use

new InputStreamReader(new FileInputStream(myFile), encoding)

where encoding is the appropriate encoding for the file. You've got to use the right encoding, or you won't read the file properly. Unfortunately, of course, that relies on you knowing the encoding...
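
As a rough sketch of that approach applied to the line count from the question: the class ExplicitCharsetCount and both of its helpers are hypothetical names, and the sample file is written by the sketch itself. It assumes the file starts with a byte order mark, which is common for files an editor labels "UCS-2 Little Endian"; in that case Java's UTF-16 charset picks the byte order from the mark. For a file without a mark, StandardCharsets.UTF_16LE would be the one to pass.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetCount {
    // Counts lines using an explicitly chosen charset instead of the platform default.
    static int countLines(File file, Charset charset) {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), charset))) {
            int count = 0;
            while (br.readLine() != null) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: writes a two-line sample file as UTF-16LE with a leading BOM.
    static File writeSampleFile() {
        try {
            File tmp = File.createTempFile("ucs2le-sample", ".txt");
            tmp.deleteOnExit();
            try (Writer w = new OutputStreamWriter(
                    new FileOutputStream(tmp), StandardCharsets.UTF_16LE)) {
                w.write("\uFEFFone\r\ntwo\r\n"); // U+FEFF is the byte order mark
            }
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File sample = writeSampleFile();
        // UTF_16 reads the BOM, selects little-endian, and does not pass the mark on as text.
        System.out.println(countLines(sample, StandardCharsets.UTF_16));
    }
}
```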

EDIT: To answer the question of why the lines were double counted rather than just "how do I fix it", see Lucero's answer :)
