File encoded as UCS-2 Little Endian reports 2x too many lines to Java

Question

Asked by The111

I was processing several txt files with a simple Java program, and the first step of my process is counting the lines of each file:

int count = 0;
br = new BufferedReader(new FileReader(myFile)); // myFile is the txt file in question
while (br.readLine() != null) {
    count++;
}

For one of my files, Java was counting exactly twice as many lines as there really were! This was confusing me greatly at first. I opened each file in Notepad++ and could see that the mis-counting file ended every line in exactly the same way as the other files, with a CR and LF. I did a little more poking around and noticed that all my "ok" files were ANSI encoded, and the one problem file was encoded as UCS-2 Little Endian (which I know nothing about). I got these files elsewhere, so I have no idea why the one was encoded that way, but of course switching it to ANSI fixed the issue.

But now curiosity remains. Why was the encoding causing a double line count report?

Thanks!

Answer 1

Answered by Lucero

Simple: if you read UCS-2 (or UTF-16) text with the wrong encoding (e.g. ANSI, or any other 8-bit encoding), then for ASCII-range text every second character comes out as a NUL (0x00). Each CR-LF pair is therefore decoded as CR-NUL-LF, which the reader sees as two line breaks (one for the CR and one for the LF) instead of one.
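
The effect can be reproduced without any file on disk. The following is a minimal sketch, not the asker's actual program: the class name DoubleLineCountDemo, the helper countLines, and the sample text are made up for illustration, and ISO-8859-1 stands in for "ANSI" as a typical 8-bit encoding. It encodes two CR-LF terminated lines as UTF-16 little endian and counts them once with the right charset and once with the wrong one.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DoubleLineCountDemo {

    // Hypothetical helper: same counting loop as in the question,
    // but the charset used for decoding is passed in explicitly.
    static int countLines(byte[] data, Charset charset) throws IOException {
        int count = 0;
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), charset))) {
            while (br.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Two lines, each ending in CR LF, stored as UTF-16 little endian.
        // On the wire each CR LF becomes the bytes 0D 00 0A 00.
        byte[] data = "one\r\ntwo\r\n".getBytes(StandardCharsets.UTF_16LE);

        // Decoded correctly: CR LF is a single line terminator.
        System.out.println("UTF-16LE:   " + countLines(data, StandardCharsets.UTF_16LE));

        // Decoded as an 8-bit charset: CR and LF are separated by a NUL,
        // so each one ends a "line" of its own.
        System.out.println("ISO-8859-1: " + countLines(data, StandardCharsets.ISO_8859_1));
    }
}
```

With the wrong charset the count is at least double the real one; the trailing NUL byte after the final LF can also show up as one extra short line.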

Answer 2

Answered by Jon Skeet

This is the problem:

new FileReader(myFile)

That will use the platform default encoding. Don't do that. Use

new InputStreamReader(new FileInputStream(myFile), encoding)

where encoding is the appropriate encoding for the file. You've got to use the right encoding, or you won't read the file properly. Unfortunately, of course, that relies on you knowing the encoding...
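
For the file in the question, a sketch of that fix might look like the following. It assumes the file really is UTF-16 little endian, which is roughly what Notepad++ labels "UCS-2 Little Endian"; the class name ExplicitCharsetLineCount, the helper countLines, and the temporary file are illustrative only.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharsetLineCount {

    // Hypothetical helper: counts lines while decoding with the given charset
    // instead of the platform default that FileReader would pick.
    static int countLines(Path file, Charset charset) throws IOException {
        int count = 0;
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(file.toFile()), charset))) {
            while (br.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the problem file: three CR LF terminated lines in UTF-16LE.
        Path file = Files.createTempFile("ucs2-demo", ".txt");
        try {
            Files.write(file, "one\r\ntwo\r\nthree\r\n".getBytes(StandardCharsets.UTF_16LE));
            System.out.println(countLines(file, StandardCharsets.UTF_16LE));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

If the file starts with a byte order mark, StandardCharsets.UTF_16 is an alternative worth considering, since when decoding it uses the mark to pick the byte order.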

EDIT: To answer the question of why the lines were double counted rather than just "how do I fix it", see Lucero's answer :)
