Java: reading any text file with a strange encoding?

Question

Asked by Brad

I have a text file with a strange encoding, "UCS-2 Little Endian", and I want to read its contents using Java.

[Screenshot: the text file opened in Notepad++]

As you can see in the screenshot above, the file contents display correctly in Notepad++, but when I read the file with this code, only garbage is printed to the console:

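import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

String textFilePath = "c:\\strange_file_encoding.txt";
// Decodes the file as UTF-8, which does not match its actual UCS-2 LE encoding
BufferedReader reader = new BufferedReader( new InputStreamReader( new FileInputStream( textFilePath ), "UTF8" ) );
String line = "";

while ( ( line = reader.readLine() ) != null ) {
    System.out.println( line );  // Prints garbage characters
}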

The main point is that the user selects the file to read, so it can be in any encoding. Since I can't detect the file's encoding, I decode it as "UTF8", but as the example above shows, that fails to read the file correctly.

Is there a way to read such strange files correctly? Or can I at least detect that my code will fail to read a file correctly?

Answer 1

Answered by tempoc

You are passing UTF-8 as the encoding to the InputStreamReader constructor, so it tries to interpret the bytes as UTF-8 instead of UCS-2 LE. Here is the documentation: Charset

According to it, I suppose you need to use UTF-16LE.

Here is more info on the supported character sets and their Java names: Supported Encodings
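
To illustrate the mismatch described in this answer, here is a minimal sketch that decodes the same bytes with both charsets. It assumes the file starts with the FF FE byte-order mark that editors usually write for this encoding; the class name `WrongCharsetDemo` and the helper `encodeWithBom` are made up for the example.

```java
import java.nio.charset.StandardCharsets;

public class WrongCharsetDemo {
    // Encodes text as UTF-16LE, preceded by the FF FE byte-order mark (an assumption about the file).
    static byte[] encodeWithBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_16LE);
        byte[] out = new byte[body.length + 2];
        out[0] = (byte) 0xFF;
        out[1] = (byte) 0xFE;
        System.arraycopy(body, 0, out, 2, body.length);
        return out;
    }

    // Decodes with the wrong charset: invalid UTF-8 bytes become U+FFFD replacement characters.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decodes with the matching charset, skipping the two BOM bytes.
    static String decodeAsUtf16Le(byte[] bytes) {
        return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] bytes = encodeWithBom("Hello");
        String wrong = decodeAsUtf8(bytes);
        String right = decodeAsUtf16Le(bytes);

        System.out.println("as UTF-8 contains U+FFFD: " + (wrong.indexOf('\uFFFD') >= 0));
        System.out.println("as UTF-8 contains NUL chars: " + (wrong.indexOf('\0') >= 0));
        System.out.println("as UTF-16LE: " + right);
    }
}
```

The bytes FF and FE can never appear in valid UTF-8, and every ASCII character in UTF-16LE is followed by a zero byte, which is why the UTF-8 decoding shows up as garbage in the console.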

Answer 2

Answered by Vivin Paliath

You're providing the wrong encoding to InputStreamReader. Have you tried using UTF-16LE instead of UTF8?

BufferedReader reader = new BufferedReader( new InputStreamReader( new FileInputStream( filePath ), "UTF-16LE" ) );

According to Charset:

UTF-16LE Sixteen-bit UCS Transformation Format, little-endian byte order
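
A fuller version of the line above might look like the following sketch. One detail worth knowing: Java's UTF-16LE decoder does not strip a leading byte-order mark but returns it as the character U+FEFF, so the sketch drops it from the first line. The class name `Utf16LeReader`, the method `readLines`, and the temporary sample file are assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class Utf16LeReader {
    // Reads all lines of a UTF-16LE file, dropping a leading U+FEFF (the decoded BOM) if present.
    static List<String> readLines(Path path) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_16LE)) {
            String line;
            boolean first = true;
            while ((line = reader.readLine()) != null) {
                if (first && line.startsWith("\uFEFF")) {
                    line = line.substring(1);
                }
                first = false;
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample file, written here only so the example is self-contained.
        Path path = Files.createTempFile("strange_file_encoding", ".txt");
        Files.write(path, "\uFEFFfirst line\r\nsecond line".getBytes(StandardCharsets.UTF_16LE));
        for (String line : readLines(path)) {
            System.out.println(line);
        }
        Files.delete(path);
    }
}
```

If the file may be either little-endian or big-endian and always carries a BOM, the "UTF-16" charset is an alternative, since it reads the BOM to pick the byte order and consumes it while decoding.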

Answer 3

Answered by Dror Bereznitsky

You cannot use UTF-8 for all files, especially if you do not know which encoding to expect. Use a library that can detect the file's encoding before you read it, for example juniversalchardet or jChardet.

For more info see Java : How to determine the correct charset encoding of a stream
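
The libraries named above use statistical detection, which is not reproduced here. As a much simpler stand-in using only the JDK, the sketch below checks for a byte-order mark and, separately, tests whether bytes decode under a candidate charset without errors, which addresses the question of detecting that a read will fail. The class name `CharsetGuess` and both method names are made up for the example, and the BOM check is a heuristic: it does not cover UTF-32 or files without a BOM.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class CharsetGuess {
    // Returns the charset implied by a leading byte-order mark, or null if none is recognized.
    // Limitation: a UTF-32LE BOM (FF FE 00 00) would be reported as UTF-16LE.
    static Charset fromBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        return null;
    }

    // Returns true if the bytes decode under the charset with no malformed or unmappable input.
    static boolean decodesCleanly(byte[] b, Charset cs) {
        try {
            cs.newDecoder()
              .onMalformedInput(CodingErrorAction.REPORT)
              .onUnmappableCharacter(CodingErrorAction.REPORT)
              .decode(ByteBuffer.wrap(b));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "Hi" in UTF-16LE with a BOM, standing in for the first bytes of the user's file.
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 'H', 0, 'i', 0};
        System.out.println("BOM suggests: " + fromBom(sample));
        System.out.println("valid UTF-8: " + decodesCleanly(sample, StandardCharsets.UTF_8));
    }
}
```

A clean decode does not prove the charset is right, since many byte sequences are valid in several encodings, but a failed strict decode is a reliable sign that the chosen charset is wrong.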
