Java FileReader encoding issue

Question

Asked by nybon

I tried to use java.io.FileReader to read some text files and convert them into a string, but the result is wrongly decoded and not readable at all.

Here's my environment:

Windows 2003, OS encoding: CP1252
Java 5.0

My files are UTF-8 encoded or CP1252 encoded, and some of them (UTF-8 encoded files) may contain Chinese (non-Latin) characters.

I use the following code to do my work:

    import java.io.BufferedReader;
    import java.io.FileReader;

    private static String readFileAsString(String filePath)
    throws java.io.IOException{
        StringBuffer fileData = new StringBuffer(1000);
        // FileReader(String) decodes with the platform default encoding.
        FileReader fileReader = new FileReader(filePath);
        //System.out.println(fileReader.getEncoding());
        BufferedReader reader = new BufferedReader(fileReader);
        char[] buf = new char[1024];
        int numRead=0;
        while((numRead=reader.read(buf)) != -1){
            // Append only the characters actually read in this pass.
            String readData = String.valueOf(buf, 0, numRead);
            fileData.append(readData);
        }
        reader.close();
        return fileData.toString();
    }

The above code doesn't work. I found the FileReader's encoding is CP1252 even if the text is UTF-8 encoded. But the JavaDoc of java.io.FileReader says that:

The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate.

Does this mean that I am not required to set the character encoding myself if I am using FileReader? But I currently do get wrongly decoded data. What's the correct way to deal with my situation? Thanks.

Answer 1

Accepted answer by Joachim Sauer

Yes, you need to specify the encoding of the file you want to read.

Yes, this means that you have to know the encoding of the file you want to read.

No, there is no general way to guess the encoding of any given "plain text" file.

The one-argument constructors of FileReader always use the platform default encoding, which is generally a bad idea.
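To illustrate why the platform default is risky, here is a minimal in-memory sketch. The class and method names are illustrative and not part of the original answer. It decodes the same UTF-8 bytes once as UTF-8 and once as windows-1252, which is roughly what a one-argument FileReader does on a machine whose default encoding is CP1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetDemo {

    // Decodes raw bytes with an explicitly chosen charset.
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "\u00e9" is the letter e with an acute accent; UTF-8 stores it as two bytes.
        byte[] utf8Bytes = "\u00e9".getBytes(StandardCharsets.UTF_8);

        String right = decode(utf8Bytes, StandardCharsets.UTF_8);
        String wrong = decode(utf8Bytes, Charset.forName("windows-1252"));

        // The wrong charset turns one character into two unrelated ones.
        System.out.println("decoded as UTF-8:        " + right.length() + " char(s)");
        System.out.println("decoded as windows-1252: " + wrong.length() + " char(s)");
        System.out.println("platform default here:   " + Charset.defaultCharset());
    }
}
```

The bytes on disk never change; only the charset used to interpret them does, which is why the same program can work on one machine and produce garbage on another.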

Since Java 11, FileReader has also gained constructors that accept an encoding: new FileReader(file, charset) and new FileReader(fileName, charset).

In earlier versions of Java, you need to use new InputStreamReader(new FileInputStream(pathToFile), <encoding>).
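As a rough sketch of that pre-Java 11 approach, the asker's method could take the charset as a parameter. The class name and the roundTrip helper are hypothetical and exist only to make the example self-contained; the sketch assumes Java 8 or later because of try-with-resources and UncheckedIOException.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharsetReader {

    // Reads the whole file into a String, decoding it with the given charset.
    public static String readFileAsString(String filePath, Charset charset) throws IOException {
        StringBuilder fileData = new StringBuilder();
        // try-with-resources closes the reader even if reading fails.
        try (Reader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(filePath), charset))) {
            char[] buf = new char[1024];
            int numRead;
            while ((numRead = reader.read(buf)) != -1) {
                fileData.append(buf, 0, numRead);
            }
        }
        return fileData.toString();
    }

    // Demo helper (hypothetical): writes text to a temp file and reads it back.
    public static String roundTrip(String text, Charset charset) {
        try {
            Path tmp = Files.createTempFile("charset-demo", ".txt");
            try {
                Files.write(tmp, text.getBytes(charset));
                return readFileAsString(tmp.toString(), charset);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Two Chinese characters plus ASCII, written with Unicode escapes.
        String original = "\u4e2d\u6587 text";
        System.out.println(original.equals(roundTrip(original, StandardCharsets.UTF_8)));
    }
}
```

Passing a Charset object rather than a charset name avoids the checked UnsupportedEncodingException that the String-based InputStreamReader constructor declares.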

Answer 2

Answered by Michael Borgwardt

FileReader uses Java's platform default encoding, which depends on the system settings of the computer it's running on and is generally the most popular encoding among users in that locale.

If this "best guess" is not correct, then you have to specify the encoding explicitly. Unfortunately, FileReader does not allow this before Java 11 (a major oversight in the API). Instead, you have to use new InputStreamReader(new FileInputStream(filePath), encoding) and ideally get the encoding from metadata about the file.

Answer 3

Answered by Radoslav Ivanov

Since Java 11 you can use this constructor:

public FileReader(String fileName, Charset charset) throws IOException;
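A short sketch of that constructor in use, assuming a Java 11+ runtime. The class name and the temp-file helper are hypothetical and only there so the example can run on its own.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileReaderCharsetDemo {

    // Returns the first line of the file, decoded with the given charset.
    // Relies on the FileReader(String, Charset) constructor, so it needs Java 11+.
    public static String firstLine(String fileName, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(fileName, charset))) {
            return reader.readLine();
        }
    }

    // Demo helper (hypothetical): stores text in a temp file, then reads its first line back.
    public static String firstLineViaTempFile(String text, Charset charset) {
        try {
            Path tmp = Files.createTempFile("filereader-demo", ".txt");
            try {
                Files.writeString(tmp, text, charset);
                return firstLine(tmp.toString(), charset);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Cyrillic sample text written with Unicode escapes to keep the source ASCII-only.
        String text = "\u041f\u0440\u0438\u0432\u0435\u0442\nsecond line";
        System.out.println(firstLineViaTempFile(text, StandardCharsets.UTF_8));
    }
}
```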

Answer 4

Answered by Andreas Gelever

For Java 7+ you can use this:

BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8);

All the available charsets are listed in the Java documentation.

For example, if your file is in CP1252, obtain the charset like this:

Charset.forName("windows-1252");
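Putting the two snippets together, a minimal sketch of reading a CP1252 file line by line might look like the following. The class name and the readLinesFromBytes helper are hypothetical. Unlike InputStreamReader, the reader returned by Files.newBufferedReader reports malformed input with an exception instead of silently substituting replacement characters.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class NioReaderDemo {

    // Reads all lines of the file, decoding them with the given charset.
    public static List<String> readLines(Path path, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(path, charset)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Demo helper (hypothetical): writes raw bytes to a temp file and reads the lines back.
    public static List<String> readLinesFromBytes(byte[] contents, Charset charset) {
        try {
            Path tmp = Files.createTempFile("nio-demo", ".txt");
            try {
                Files.write(tmp, contents);
                return readLines(tmp, charset);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // In windows-1252 the single byte 0xE9 is the letter e with an acute accent.
        byte[] cp1252Bytes = {'c', 'a', 'f', (byte) 0xE9, '\n', 'o', 'k'};
        List<String> lines = readLinesFromBytes(cp1252Bytes, Charset.forName("windows-1252"));
        System.out.println(lines.size() + " lines read");
    }
}
```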

The Java documentation also lists the other canonical names of Java encodings, for both IO and NIO.

If you do not know exactly which encoding a file uses, you can use a third-party library, such as this tool from Google, which works fairly well.
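When the candidates are known to be only UTF-8 or CP1252, as in the question, a simple heuristic can work without a detection library. The sketch below is not the Google tool mentioned above; it is a hypothetical fallback strategy that tries strict UTF-8 decoding first and falls back to windows-1252 when the bytes are not valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8OrCp1252 {

    // Assumption: anything that is not valid UTF-8 is treated as windows-1252.
    private static final Charset FALLBACK = Charset.forName("windows-1252");

    public static String decode(byte[] bytes) {
        try {
            // REPORT makes the decoder throw instead of inserting replacement characters.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return new String(bytes, FALLBACK);
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        byte[] cp1252 = {(byte) 0xE9}; // not a valid UTF-8 sequence on its own
        System.out.println(decode(utf8).length());
        System.out.println(decode(cp1252).length());
    }
}
```

This remains a guess: a CP1252 file whose non-ASCII bytes happen to form valid UTF-8 sequences would be misread, so known metadata about the file is still preferable when it exists.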

Answer 5

Answered by Iefimenko Ievgwn

For languages that do not use the Latin alphabet, for example those written in Cyrillic, you can use something like this:

FileReader fr = new FileReader("src/text.txt", StandardCharsets.UTF_8);

and be sure that your .txt file is saved in UTF-8 format (not in the default ANSI format). Cheers!

Answer 6

Answered by Guangtong Shen

FileInputStream with InputStreamReader is better than using FileReader directly, because before Java 11 the latter doesn't allow you to specify the charset.

Here is an example using BufferedReader, FileInputStream and InputStreamReader together, so that you can read lines from a file.

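import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

List<String> words = new ArrayList<>();
List<String> meanings = new ArrayList<>();

public void readAll() throws IOException {
    String fileName = "College_Grade4.txt";
    String charset = "UTF-8";
    // try-with-resources closes the reader even if an exception is thrown
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(
                new FileInputStream(fileName), charset))) {
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.length() == 0) continue;
            // split each line at its first tab character
            int idx = line.indexOf("\t");
            if (idx < 0) continue; // skip lines without a tab instead of throwing
            words.add(line.substring(0, idx));
            meanings.add(line.substring(idx + 1));
        }
    }
}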
