Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/696626/
Java FileReader encoding issue
Asked by nybon
I tried to use java.io.FileReader to read some text files and convert them into a string, but I found the result is wrongly encoded and not readable at all.
Here's my environment:
Windows 2003, OS encoding: CP1252
Java 5.0
My files are UTF-8 encoded or CP1252 encoded, and some of them (UTF-8 encoded files) may contain Chinese (non-Latin) characters.
I use the following code to do my work:
private static String readFileAsString(String filePath)
        throws java.io.IOException {
    StringBuffer fileData = new StringBuffer(1000);
    // FileReader(String) decodes with the platform default encoding
    FileReader fileReader = new FileReader(filePath);
    //System.out.println(fileReader.getEncoding());
    BufferedReader reader = new BufferedReader(fileReader);
    char[] buf = new char[1024];
    int numRead = 0;
    while ((numRead = reader.read(buf)) != -1) {
        String readData = String.valueOf(buf, 0, numRead);
        fileData.append(readData);
        buf = new char[1024];
    }
    reader.close();
    return fileData.toString();
}
The above code doesn't work. I found the FileReader's encoding is CP1252 even if the text is UTF-8 encoded. But the JavaDoc of java.io.FileReader says that:
The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate.
Does this mean that I am not required to set the character encoding myself if I am using FileReader? But I currently get wrongly encoded data. What's the correct way to deal with my situation? Thanks.
Accepted answer by Joachim Sauer
Yes, you need to specify the encoding of the file you want to read.
Yes, this means that you have to know the encoding of the file you want to read.
No, there is no general way to guess the encoding of any given "plain text" file.
The one-argument constructors of FileReader always use the platform default encoding, which is generally a bad idea.
Since Java 11, FileReader has also gained constructors that accept an encoding: new FileReader(file, charset) and new FileReader(fileName, charset).
In earlier versions of Java, you need to use new InputStreamReader(new FileInputStream(pathToFile), <encoding>).
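As a minimal sketch of the pre-Java 11 approach described above, the question's readFileAsString can take the charset as a parameter and pass it to InputStreamReader. The class name ExplicitCharsetReader and the command-line usage in main are made up for illustration; try-with-resources assumes Java 7 or later.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Hypothetical helper class, not part of the original answer.
class ExplicitCharsetReader {
    /** Reads the whole file, decoding its bytes with the given charset. */
    static String readFileAsString(String filePath, Charset charset) throws IOException {
        StringBuilder fileData = new StringBuilder();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(filePath), charset))) {
            char[] buf = new char[1024];
            int numRead;
            while ((numRead = reader.read(buf)) != -1) {
                fileData.append(buf, 0, numRead);
            }
        }
        return fileData.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumed usage: java ExplicitCharsetReader <file> <charset name>
        Charset charset = Charset.forName(args[1]);
        System.out.println(readFileAsString(args[0], charset));
    }
}
```

The caller still has to know which charset each file uses; the code only makes that choice explicit instead of inheriting it from the platform.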
Answer by Michael Borgwardt
FileReader uses Java's platform default encoding, which depends on the system settings of the computer it's running on and is generally the most popular encoding among users in that locale.
If this "best guess" is not correct, then you have to specify the encoding explicitly. Unfortunately, before Java 11 FileReader did not allow this (a major oversight in the API). Instead, you had to use new InputStreamReader(new FileInputStream(filePath), encoding) and ideally get the encoding from metadata about the file.
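To see why a wrong "best guess" produces unreadable text, here is a small sketch that decodes the same UTF-8 bytes twice: once as UTF-8 and once as windows-1252, as a CP1252 platform default would. The class name DefaultCharsetDemo is made up, and the sketch assumes the running JRE ships the windows-1252 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
class DefaultCharsetDemo {
    /** Decodes bytes the way a reader configured with the given charset would. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The default charset of the JVM this runs on
        System.out.println("Default charset: " + Charset.defaultCharset());

        // Two Chinese characters, encoded as six UTF-8 bytes
        byte[] utf8Bytes = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);

        // Matching charset: the original two characters come back
        System.out.println(decode(utf8Bytes, StandardCharsets.UTF_8));

        // Mismatched charset: every byte becomes its own windows-1252 character
        System.out.println(decode(utf8Bytes, Charset.forName("windows-1252")));
    }
}
```

The bytes on disk are identical in both cases; only the charset used for decoding differs, which is the situation in the question.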
Answer by Radoslav Ivanov
Since Java 11 you can use this constructor:
public FileReader(String fileName, Charset charset) throws IOException;
Answer by Andreas Gelever
For Java 7+, you can use this:
BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8);
The available constants are listed in the StandardCharsets documentation.
For example, if your file is in CP1252, use this method:
Charset.forName("windows-1252");
Other canonical names for Java encodings, for both the IO and NIO APIs, are listed in the Java documentation.
If you do not know exactly which encoding a file uses, you may use a third-party library, such as a detection tool from Google (linked in the original answer), which works fairly neatly.
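For the narrower case in the question, where every file is either UTF-8 or CP1252, one heuristic that needs no third-party library is to attempt a strict UTF-8 decode and fall back to windows-1252 when the bytes are not valid UTF-8. This is only a sketch under that assumption, not a general detector: pure ASCII files decode identically either way, a CP1252 file could by coincidence be valid UTF-8, and a leading BOM is not stripped. The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class, not part of the original answer.
class Utf8OrFallback {
    /** Decodes as strict UTF-8; if the bytes are not valid UTF-8, uses the fallback charset. */
    static String decode(byte[] bytes, Charset fallback) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes invalid input throw instead of being replaced
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: assume the other known encoding
            return new String(bytes, fallback);
        }
    }

    /** Reads a file assumed to be either UTF-8 or windows-1252. */
    static String readFile(String filePath) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(filePath));
        return decode(bytes, Charset.forName("windows-1252"));
    }
}
```

Reading the whole file into memory keeps the sketch short; for large files a streaming approach would be more appropriate.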
Answer by Iefimenko Ievgwn
For languages that do not use the Latin alphabet, for example Cyrillic, you can use something like this (Java 11+):
FileReader fr = new FileReader("src/text.txt", StandardCharsets.UTF_8);
and be sure that your .txt file is saved in UTF-8 format (not the default ANSI). Cheers!
Answer by Guangtong Shen
FileInputStream with InputStreamReader is better than using FileReader directly, because before Java 11 the latter did not let you specify the charset.
Here is an example that uses BufferedReader, FileInputStream and InputStreamReader together to read lines from a file.
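import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

List<String> words = new ArrayList<>();
List<String> meanings = new ArrayList<>();

public void readAll() throws IOException {
    String fileName = "College_Grade4.txt";
    String charset = "UTF-8";
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(
                    new FileInputStream(fileName), charset))) {
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.length() == 0) continue;
            // each line holds a word and its meaning separated by a tab
            int idx = line.indexOf("\t");
            if (idx < 0) continue; // skip lines without a tab separator
            words.add(line.substring(0, idx));
            meanings.add(line.substring(idx + 1));
        }
    }
}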