Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/15511703/

Time: 2020-10-31 19:52:16  Source: igfitidea

Reading any text file having strange encoding?

Tags: java, text-files, bufferedreader, fileinputstream

Asked by Brad

I have a text file with a strange encoding, "UCS-2 Little Endian", whose contents I want to read using Java.

[Screenshot: the text file opened in Notepad++]

As you can see in the above screenshot, the file contents appear fine in Notepad++, but when I read the file using this code, only garbage is printed to the console:

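import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

// Backslashes in a Java string literal must be escaped.
String filePath = "c:\\strange_file_encoding.txt";
BufferedReader reader = new BufferedReader( new InputStreamReader( new FileInputStream( filePath ), "UTF8" ) );
String line = "";

while ( ( line = reader.readLine() ) != null ) {
    System.out.println( line );  // Prints garbage characters
}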

The main point is that the user selects the file to read, so it can be in any encoding. Since I can't detect the file's encoding, I decode it as "UTF8", but as the example above shows, that fails to read this file correctly.

Is there a way to read such strange files correctly? Or, at least, can I detect whether my code will fail to read a file correctly?

Answered by tempoc

You are passing UTF-8 as the encoding to the InputStreamReader constructor, so it tries to interpret the bytes as UTF-8 instead of UCS-2 Little Endian. Here is the documentation: Charset
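
To make the mismatch concrete, here is a minimal sketch of what happens when UTF-16LE bytes are decoded as UTF-8. The class and method names are made up for illustration, and the two-character sample text stands in for the real file.

```java
import java.nio.charset.StandardCharsets;

class WrongCharsetDemo {
    /** Decodes the given bytes as UTF-8, the way the question's code does. */
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Decodes the given bytes as UTF-16LE, matching how they were written. */
    static String decodeAsUtf16Le(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // "Hi" in UTF-16LE is the four bytes 48 00 69 00.
        byte[] bytes = "Hi".getBytes(StandardCharsets.UTF_16LE);

        // Read as UTF-8, every 0x00 byte becomes a NUL character, so the
        // two-character text comes back as four characters.
        System.out.println(decodeAsUtf8(bytes).length());  // 4
        System.out.println(decodeAsUtf16Le(bytes));        // Hi
    }
}
```

Non-ASCII text fares worse than this sample, since its byte pairs usually do not form valid UTF-8 sequences at all.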

According to it, I suppose you need to use UTF-16LE.

Here is more info on the supported character sets and their Java names: Supported Encodings
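
If you are unsure whether a name is one Java accepts, you can ask the runtime. This is a rough sketch using java.nio.charset.Charset; the helper canonicalName is a hypothetical name, not part of any library.

```java
import java.nio.charset.Charset;

class CharsetNames {
    /** Returns the canonical name for a charset name or alias, or null if Java does not accept it. */
    static String canonicalName(String name) {
        try {
            return Charset.isSupported(name) ? Charset.forName(name).name() : null;
        } catch (IllegalArgumentException e) {
            // Thrown for syntactically illegal names, for example ones containing spaces.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(canonicalName("UTF8"));                 // UTF-8 (alias)
        System.out.println(canonicalName("UTF-16LE"));             // UTF-16LE
        System.out.println(canonicalName("UCS-2 Little Endian"));  // null (Notepad++ label, not a Java name)
    }
}
```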

Answered by Vivin Paliath

You're providing the wrong encoding to InputStreamReader. Have you tried using UTF-16LE instead of UTF8?

BufferedReader reader = new BufferedReader( new InputStreamReader( new FileInputStream( filePath ), "UTF-16LE" ) );

According to Charset:

UTF-16LE Sixteen-bit UCS Transformation Format, little-endian byte order
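
For completeness, a fuller sketch of reading such a file line by line. One assumption worth flagging: files saved this way often start with a byte-order mark, and Java's UTF-16LE decoder passes it through as the character U+FEFF, so this version strips it from the first line. The class and method names are illustrative only.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class Utf16LeReader {
    /** Reads all lines of a UTF-16LE file, dropping a leading byte-order mark if present. */
    static List<String> readLines(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        // Files.newBufferedReader throws an IOException on malformed input instead of printing garbage.
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_16LE)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // The UTF-16LE decoder keeps the BOM as U+FEFF, so strip it from the first line.
                if (lines.isEmpty() && line.startsWith("\uFEFF")) {
                    line = line.substring(1);
                }
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Path taken from the question; pass another path as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "c:\\strange_file_encoding.txt");
        for (String line : readLines(file)) {
            System.out.println(line);
        }
    }
}
```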

Answered by Dror Bereznitsky

You cannot use UTF-8 for all files, especially if you do not know which encoding to expect. Use a library that can detect the file's encoding before you read the file, for example juniversalchardet or jChardet.

For more info, see "Java: How to determine the correct charset encoding of a stream".
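
If pulling in a library is not an option, a much cruder check with only the JDK is possible. To be clear, the sketch below is not how juniversalchardet or jChardet work and does not use their APIs; it only looks for a byte-order mark and tests whether the bytes decode without errors. All names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class EncodingGuess {
    /** Returns the charset named by a leading byte-order mark, or null if there is none. */
    static Charset charsetFromBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        // Limitation: FF FE 00 00 is the UTF-32LE mark and is reported as UTF-16LE here.
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        return null;
    }

    /** Returns true if the bytes decode in the given charset without malformed or unmappable input. */
    static boolean decodesCleanly(byte[] b, Charset cs) {
        try {
            cs.newDecoder()
              .onMalformedInput(CodingErrorAction.REPORT)
              .onUnmappableCharacter(CodingErrorAction.REPORT)
              .decode(ByteBuffer.wrap(b));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // FF FE is the UTF-16LE byte-order mark, followed by "H" as 48 00.
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 0x48, 0x00};
        System.out.println(charsetFromBom(sample));                          // UTF-16LE
        System.out.println(decodesCleanly(sample, StandardCharsets.UTF_8));  // false
    }
}
```

Note that a clean decode does not prove the guess is right: UTF-16LE text made of ASCII characters and saved without a byte-order mark also decodes as UTF-8 without errors, just with NUL characters in between. That gap is why a statistical detector is the better choice for arbitrary user files.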
