Read Unicode text files with Java
Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/979932/
Asked by Ron Tuffin
Real simple question really. I need to read a Unicode text file in a Java program.
I am used to using plain ASCII text with a BufferedReader FileReader combo which is obviously not working :(
I know that I can read a String in the 'traditional' way using a BufferedReader and then convert it using something like:
temp = new String(temp.getBytes(), "UTF-16");
But is there a way to wrap the Reader in a 'Converter'?
EDIT: the file starts with FF FE
Accepted answer by objects
You wouldn't wrap the Reader; instead you would wrap the stream using an InputStreamReader. You could then wrap that with the BufferedReader you currently use:
BufferedReader in = new BufferedReader(new InputStreamReader(stream, encoding));
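To make the wrapping concrete, here is a minimal sketch, assuming the file is UTF-16 with a byte order mark (the question says it starts with FF FE). The class name `Utf16FileReader`, the helper `readAll`, and the temporary sample file are illustrative assumptions, not part of the original answer.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf16FileReader {
    // Hypothetical helper: reads every line of a BOM-marked UTF-16 file.
    // The "UTF-16" decoder takes the byte order from the BOM and skips it.
    static String readAll(String path) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_16))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Build a small little-endian sample that starts with FF FE, then read it back.
        byte[] body = "h\u00e9llo\nw\u00f6rld".getBytes(StandardCharsets.UTF_16LE);
        byte[] all = new byte[body.length + 2];
        all[0] = (byte) 0xFF;
        all[1] = (byte) 0xFE;
        System.arraycopy(body, 0, all, 2, body.length);

        Path tmp = Files.createTempFile("utf16-sample", ".txt");
        Files.write(tmp, all);
        System.out.print(readAll(tmp.toString()));
        Files.delete(tmp);
    }
}
```

The try-with-resources block closes the BufferedReader, which in turn closes the wrapped InputStreamReader and FileInputStream.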
Answer by Macarse
Check the InputStreamReader documentation: http://java.sun.com/j2se/1.4.2/docs/api/java/io/InputStreamReader.html
I would read the source file with something like:
Reader in = new InputStreamReader(new FileInputStream("file"), "UTF-8");
Answer by McDowell
Some notes:
- the "UTF-16" encoding can read either little- or big-endian encoded files marked with a BOM; see here for a list of Java 6 encodings; it is not explicitly stated what endianness will be used when writing using "UTF-16" - it appears to be big-endian - so you might want to use "UnicodeLittle" when saving the data
- be careful when using String class encode/decode methods, especially with a marked variable-width encoding like UTF-16 - use them only on whole data
- as others have said, it is often best to read character data by wrapping your InputStream with an InputStreamReader; you can concatenate your input into a single String using a StringBuilder or similar buffer.
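The first two notes can be checked with a small experiment. This is only a sketch: the class name `Utf16Notes`, the helper `decode`, and the sample bytes are assumptions made for illustration. It decodes a little-endian and a big-endian sample with the same "UTF-16" charset, prints the first two bytes that "UTF-16" writes when encoding, and shows what happens when a byte array is decoded in pieces that cut a two-byte unit in half.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf16Notes {
    // Decode bytes as "UTF-16"; the decoder picks the byte order from the BOM.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        byte[] little = {(byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0}; // FF FE + little-endian
        byte[] big = {(byte) 0xFE, (byte) 0xFF, 0, 'h', 0, 'i'};    // FE FF + big-endian

        // Both samples should decode to the same text.
        System.out.println(decode(little).equals(decode(big)));

        // Encoding with "UTF-16": the leading two bytes reveal which BOM is written.
        byte[] encoded = "hi".getBytes(StandardCharsets.UTF_16);
        System.out.printf("%02X %02X%n", encoded[0], encoded[1]);

        // Decoding in pieces: splitting at an odd offset cuts a code unit in half,
        // so the concatenated result no longer matches the whole-data decode.
        String whole = decode(little);
        String pieces = decode(Arrays.copyOfRange(little, 0, 3))
                + new String(Arrays.copyOfRange(little, 3, 6), StandardCharsets.UTF_16LE);
        System.out.println(whole.equals(pieces));
    }
}
```

A Reader avoids the last problem because it keeps decoder state between reads, which is one reason the third note recommends InputStreamReader over ad hoc String conversions.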
Answer by daniel molla
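// Backslashes in a Java string literal must be escaped.
// This constructor uses the platform default charset; Scanner(File, String) takes a charset name.
Scanner scan = new Scanner(new File("C:\\Users\\daniel\\Desktop\\Corpus.txt"));
while (scan.hasNextLine()) { // hasNextLine() is the check that matches nextLine()
    System.out.println(scan.nextLine());
}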
Answer by stenix
I would recommend using UnicodeReader from the Google Data API; see this answer to a similar question. It automatically detects the encoding from the byte order mark (BOM).
You may also consider BOMInputStream in Apache Commons IO, which does basically the same thing but does not cover all alternative versions of the BOM.
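To illustrate the idea of BOM detection without pulling in either library, here is a hand-rolled sketch using only the JDK. It is not UnicodeReader or BOMInputStream, and it does not claim to match their behavior: the class `BomSniffer` and its methods are hypothetical, and it recognizes only the UTF-8 and UTF-16 marks.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomSniffer {
    // Hypothetical helper: picks a charset from a leading BOM, else uses the fallback.
    // Only UTF-8 and UTF-16 BOMs are handled; a UTF-32LE BOM (FF FE 00 00)
    // would be mistaken for UTF-16LE here.
    static Reader open(InputStream raw, Charset fallback) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < 3) { // read up to 3 bytes, tolerating short reads
            int r = in.read(head, n, 3 - n);
            if (r < 0) break;
            n += r;
        }
        Charset cs = fallback;
        int bomLen = 0;
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            cs = StandardCharsets.UTF_8;
            bomLen = 3;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            cs = StandardCharsets.UTF_16LE;
            bomLen = 2;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            cs = StandardCharsets.UTF_16BE;
            bomLen = 2;
        }
        // Push back everything after the BOM so the reader sees only content.
        if (n > bomLen) {
            in.unread(head, bomLen, n - bomLen);
        }
        return new InputStreamReader(in, cs);
    }

    // Drains a Reader into a String.
    static String readAll(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = r.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0};
        try (Reader r = open(new ByteArrayInputStream(sample), StandardCharsets.UTF_8)) {
            System.out.println(readAll(r));
        }
    }
}
```

For production code the libraries named above are likely the better choice, since they are maintained and cover more cases than this sketch.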
Answer by aldo
String s = new String(Files.readAllBytes(Paths.get("file.txt")),"UTF-8");
Answer by Jorge Ros
I just had to add "UTF-8" to the creation of the InputStreamReader and the special characters showed up immediately.
InputStreamReader istreamReader = new InputStreamReader(inputStream,"UTF-8");
BufferedReader bufferedReader = new BufferedReader(istreamReader);

