Original source: http://stackoverflow.com/questions/766361/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
How to save Chinese characters to a file with Java?
Asked by Frank
I use the following code to save Chinese characters into a .txt file, but when I opened it with Wordpad, I couldn't read it.
StringBuffer Shanghai_StrBuf = new StringBuffer("\u4E0A\u6D77");
boolean Append = true;
FileOutputStream fos;
fos = new FileOutputStream(FileName, Append);
for (int i = 0;i < Shanghai_StrBuf.length(); i++) {
    fos.write(Shanghai_StrBuf.charAt(i));
}
fos.close();
What can I do? I know that if I cut and paste Chinese characters into Wordpad, I can save them into a .txt file. How do I do that in Java?
Accepted answer by McDowell
There are several factors at work here:
- Text files have no intrinsic metadata for describing their encoding (for all the talk of angle-bracket taxes, there are reasons XML is popular)
- The default encoding for Windows is still an 8-bit (or double-byte) "ANSI" character set with a limited range of values - text files written in this format are not portable
- To tell a Unicode file from an ANSI file, Windows apps rely on the presence of a byte order mark at the start of the file (not strictly true - Raymond Chen explains). In theory, the BOM is there to tell you the endianness (byte order) of the data. For UTF-8, even though there is only one byte order, Windows apps rely on the marker bytes to automatically figure out that it is Unicode (though you'll note that Notepad has an encoding option on its open/save dialogs).
- It is wrong to say that Java is broken because it does not write a UTF-8 BOM automatically. On Unix systems, it would be an error to write a BOM to a script file, for example, and many Unix systems use UTF-8 as their default encoding. There are times when you don't want it on Windows, either, like when you're appending data to an existing file:
fos = new FileOutputStream(FileName,Append);
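To make the byte order mark discussed above concrete, here is a small sketch that prints the bytes U+FEFF turns into under each of the three encodings. The class and method names (BomBytes, bomHex) are made up for illustration and are not part of the original answer.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomBytes {
    // Encodes U+FEFF in the given charset and renders the resulting bytes as hex.
    static String bomHex(Charset charset) {
        StringBuilder hex = new StringBuilder();
        for (byte b : "\uFEFF".getBytes(charset)) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println("UTF-8:    " + bomHex(StandardCharsets.UTF_8));    // EF BB BF
        System.out.println("UTF-16LE: " + bomHex(StandardCharsets.UTF_16LE)); // FF FE
        System.out.println("UTF-16BE: " + bomHex(StandardCharsets.UTF_16BE)); // FE FF
    }
}
```

These are the marker bytes a Windows application looks for at the start of a file when guessing its encoding.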
Here is a method of reliably appending UTF-8 data to a file:
private static void writeUtf8ToFile(File file, boolean append, String data)
        throws IOException {
    boolean skipBOM = append && file.isFile() && (file.length() > 0);
    Closer res = new Closer();
    try {
        OutputStream out = res.using(new FileOutputStream(file, append));
        Writer writer = res.using(new OutputStreamWriter(out, Charset
                .forName("UTF-8")));
        if (!skipBOM) {
            writer.write('\uFEFF');
        }
        writer.write(data);
    } finally {
        res.close();
    }
}
Usage:
public static void main(String[] args) throws IOException {
    String chinese = "\u4E0A\u6D77";
    boolean append = true;
    writeUtf8ToFile(new File("chinese.txt"), append, chinese);
}
Note: if the file already exists, you choose to append, and the existing data isn't UTF-8 encoded, the only thing this code will create is a mess.
Here is the Closer type used in this code:
public class Closer implements Closeable {
    private Closeable closeable;
    public <T extends Closeable> T using(T t) {
        closeable = t;
        return t;
    }
    @Override public void close() throws IOException {
        if (closeable != null) {
            closeable.close();
        }
    }
}
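On Java 7 and later, try-with-resources can take over the job of the Closer type. The following is only a sketch under that assumption: the class name Utf8Appender is hypothetical, and StandardCharsets.UTF_8 stands in for Charset.forName("UTF-8"). The BOM logic is the same as in writeUtf8ToFile above.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class Utf8Appender {
    // Writes a BOM only when the target is new or empty, then writes the data as UTF-8.
    static void writeUtf8ToFile(File file, boolean append, String data) throws IOException {
        boolean skipBOM = append && file.isFile() && file.length() > 0;
        // Closing the writer also closes the underlying FileOutputStream.
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream(file, append), StandardCharsets.UTF_8)) {
            if (!skipBOM) {
                writer.write('\uFEFF');
            }
            writer.write(data);
        }
    }

    public static void main(String[] args) throws IOException {
        writeUtf8ToFile(new File("chinese.txt"), true, "\u4E0A\u6D77");
    }
}
```

Calling it twice on the same file with append set to true should leave a single BOM at the start, followed by both pieces of text.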
This code makes a Windows-style best guess about how to read the file based on byte order marks:
private static final Charset[] UTF_ENCODINGS = { Charset.forName("UTF-8"),
        Charset.forName("UTF-16LE"), Charset.forName("UTF-16BE") };

// The stream must support mark/reset (for example, a BufferedInputStream).
private static Charset getEncoding(InputStream in) throws IOException {
    charsetLoop: for (Charset encodings : UTF_ENCODINGS) {
        byte[] bom = "\uFEFF".getBytes(encodings);
        in.mark(bom.length);
        for (byte b : bom) {
            if ((0xFF & b) != in.read()) {
                // Not this BOM: rewind and try the next encoding
                in.reset();
                continue charsetLoop;
            }
        }
        // BOM matched: leave it consumed and use this encoding
        return encodings;
    }
    return Charset.defaultCharset();
}

private static String readText(File file) throws IOException {
    Closer res = new Closer();
    try {
        InputStream in = res.using(new FileInputStream(file));
        InputStream bin = res.using(new BufferedInputStream(in));
        Reader reader = res.using(new InputStreamReader(bin, getEncoding(bin)));
        StringBuilder out = new StringBuilder();
        for (int ch = reader.read(); ch != -1; ch = reader.read())
            out.append((char) ch);
        return out.toString();
    } finally {
        res.close();
    }
}
Usage:
public static void main(String[] args) throws IOException {
    System.out.println(readText(new File("chinese.txt")));
}
(System.out uses the default encoding, so whether it prints anything sensible depends on your platform and configuration.)
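One way to reduce the dependence on the default encoding when printing is to wrap System.out in a PrintStream with an explicit charset. This is only a sketch: the class and method names (Utf8Console, utf8) are made up for illustration, and whether the characters actually display correctly still depends on the console's own code page and font.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

final class Utf8Console {
    // Wraps a stream so that text is encoded as UTF-8 instead of the platform default.
    static PrintStream utf8(OutputStream target) throws UnsupportedEncodingException {
        return new PrintStream(target, true, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        PrintStream out = utf8(System.out);
        out.println("\u4E0A\u6D77");
    }
}
```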
Answer by Kornel
Answer by Esko Luontola
If you can rely on the default character encoding being UTF-8 (or some other Unicode encoding), you may use the following:
Writer w = new FileWriter("test.txt");
w.append("上海");
w.close();
The safest way is to always explicitly specify the encoding:
Writer w = new OutputStreamWriter(new FileOutputStream("test.txt"), "UTF-8");
w.append("上海");
w.close();
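On Java 7 and later, the same explicit-encoding idea can be written with java.nio.file and try-with-resources, which closes the writer even if the write fails. This is a sketch rather than part of the original answer; the class name NioUtf8Write is hypothetical. Like the snippet above, it does not write a BOM.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class NioUtf8Write {
    // Creates or truncates the file and writes the text as UTF-8 (no BOM).
    static void write(Path path, String text) throws IOException {
        try (BufferedWriter w = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
            w.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        write(Paths.get("test.txt"), "\u4E0A\u6D77");
    }
}
```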
P.S. You may use any Unicode characters in Java source code, even in method and variable names, if the -encoding parameter for javac is configured correctly. That makes the source code more readable than the escaped \uXXXX form.
Answer by Matthew Flaschen
Here's one way among many. Basically, we're just specifying that the conversion be done to UTF-8 before outputting bytes to the FileOutputStream:
String FileName = "output.txt";
StringBuffer Shanghai_StrBuf=new StringBuffer("\u4E0A\u6D77");
boolean Append=true;
Writer writer = new OutputStreamWriter(new FileOutputStream(FileName,Append), "UTF-8");
writer.write(Shanghai_StrBuf.toString(), 0, Shanghai_StrBuf.length());
writer.close();
I manually verified this against the images at http://www.fileformat.info/info/unicode/char/. In the future, please follow Java coding standards, including lower-case variable names. It improves readability.
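The same check can be done in code. The sketch below compares what the question's byte-by-byte loop produces with what a UTF-8 conversion produces; the class and method names (EncodingCheck, lowBytesOnly, hex) are made up for illustration. It relies on the fact that OutputStream.write(int) keeps only the low-order 8 bits of its argument, which is why the question's loop loses data.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class EncodingCheck {
    // Mimics the question's loop: write(int) keeps only the low 8 bits of each char.
    static byte[] lowBytesOnly(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i++) {
            out.write(text.charAt(i));
        }
        return out.toByteArray();
    }

    // Renders bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String shanghai = "\u4E0A\u6D77";
        System.out.println(hex(lowBytesOnly(shanghai)));                     // 0A 77
        System.out.println(hex(shanghai.getBytes(StandardCharsets.UTF_8))); // E4 B8 8A E6 B5 B7
    }
}
```

The first line shows that the original loop writes just two bytes (a line feed and the letter w), while the UTF-8 conversion yields three bytes per character.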
Answer by Matthew Flaschen
Try this:
StringBuffer Shanghai_StrBuf=new StringBuffer("\u4E0A\u6D77");
boolean Append=true;
Writer out = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream(FileName,Append), "UTF8"));
for (int i=0;i<Shanghai_StrBuf.length();i++) out.write(Shanghai_StrBuf.charAt(i));
out.close();
Answer by Jon
Be very careful with the approaches proposed. Even specifying the encoding for the file as follows:
Writer w = new OutputStreamWriter(new FileOutputStream("test.txt"), "UTF-8");
will not work if you're running under an operating system like Windows. Even setting the file.encoding system property to UTF-8 does not fix the issue. This is because Java does not write a byte order mark (BOM) to the file. Even if you specify the encoding when writing to a file, opening the same file in an application like Wordpad will display the text as garbage because it doesn't find a BOM. I tried running the examples here on Windows (with a platform/container encoding of CP1252).
The following bug report describes the issue in Java:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058
The solution for the time being is to write the byte order mark yourself to ensure the file opens correctly in other applications. See this for more details on the BOM:
http://mindprod.com/jgloss/bom.html
and for a more correct solution see the following link:
http://tripoverit.blogspot.com/2007/04/javas-utf-8-and-unicode-writing-is.html