Java's charsets / character encoding
Original question: http://stackoverflow.com/questions/13495924/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by coconut
I have a file in Spanish so it's full of characters like:
á é í ó ú ñ Ñ Á É Í Ó Ú
I have to read the file, so I do this:
fr = new FileReader(ficheroEntrada);
BufferedReader rEntrada = new BufferedReader(fr);
String linea = rEntrada.readLine();
if (linea == null) {
    logger.error("ERROR: Empty file.");
    return null;
}
String delimitador = "[;]";
String[] tokens = null;
List<String> token = new ArrayList<String>();
while ((linea = rEntrada.readLine()) != null) {
    // Some parsing specific to my file.
    tokens = linea.split(delimitador);
    token.add(tokens[0]);
    token.add(tokens[1]);
}
logger.info("List of tokens: " + token);
return token;
When I read the list of tokens, all the special characters are gone and have been replaced by characters like these:
Ó = Ã“
Ñ = Ã‘
And so on...
What's happening? I have never had problems with charsets before (I'm assuming it's a charset issue). Is it because of this computer? What can I do?
Any extra advice will be appreciated, I'm learning! Thank you!
Answered by kosa
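You need to specify the character encoding explicitly when you open the file.
// Pass the file itself (ficheroEntrada from the question), not the FileReader:
// FileInputStream accepts a File or a path String, and cannot wrap a Reader.
BufferedReader rEntrada = new BufferedReader(
        new InputStreamReader(new FileInputStream(ficheroEntrada), "UTF-8"));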
Answered by Guido Simone
What's happening?
The answers recommending reading and writing using UTF-8 encoding should fix your problem. My answer is more about what happened and how to diagnose similar problems in the future.
The first place to start is the UTF-8 character table at http://www.utf8-chartable.de. There is a drop-down on the page which lets you browse different portions of Unicode. One of your problem characters is Ó. Checking the chart reveals that if your file was encoded in UTF-8, then the character is U+00D3 LATIN CAPITAL LETTER O WITH ACUTE and the UTF-8 sequence is two bytes, hex c3 93.
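If you would rather confirm the byte sequence from Java than from a chart, a minimal sketch like the following should do. The class and method names (Utf8Bytes, utf8Hex) are made up for this illustration and are not from the original answer.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Bytes {
    // Encodes the text as UTF-8 and returns the bytes as space-separated hex pairs.
    static String utf8Hex(String text) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // \u00D3 is the escape for the capital O with acute; using the escape keeps
        // this example independent of the encoding of the source file itself.
        System.out.println(utf8Hex("\u00D3")); // prints "c3 93"
    }
}
```

The same call with "\u00D1" (capital N with tilde) gives c3 91, which matches the second garbled pair in the question.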
Now let's check the ISO-8859-1 character set at http://en.wikipedia.org/wiki/ISO/IEC_8859-1, since this is also a popular character set. However, this is one of those single-byte character sets. Every valid character is represented by a single byte, unlike UTF-8 where a character may be represented by 1 to 4 bytes.
Note that the character at C3 is Ã, but there is no printable character at 93. So your default encoding is probably not ISO-8859-1.
Next, let's check Windows-1252 at http://en.wikipedia.org/wiki/Windows-1252. This is almost the same as ISO-8859-1 but fills in some of the blank positions with useful characters. And there we have a match. The sequence C3 93 in Windows-1252 is exactly the character string Ã“ (Ã followed by a left double quotation mark).
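The table lookups above can also be reproduced in code. The sketch below decodes the two bytes c3 93 under each candidate charset and prints the resulting code points, so the result does not depend on how your console renders text. It assumes the JVM provides the windows-1252 charset (mainstream JDKs do, but it is not one of the charsets every JVM is required to support); the names Mojibake, decode and codePoints are hypothetical.

```java
import java.nio.charset.Charset;

public class Mojibake {
    // Decodes raw bytes using the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    // Renders a string as space-separated code points, e.g. "U+00C3 U+201C".
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xC3, (byte) 0x93};
        // UTF-8: one character, capital O with acute.
        System.out.println(codePoints(decode(bytes, "UTF-8")));        // prints "U+00D3"
        // ISO-8859-1: A with tilde, then an invisible C1 control character.
        System.out.println(codePoints(decode(bytes, "ISO-8859-1")));   // prints "U+00C3 U+0093"
        // Windows-1252: A with tilde, then a left double quotation mark.
        System.out.println(codePoints(decode(bytes, "windows-1252"))); // prints "U+00C3 U+201C"
    }
}
```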
What all this tells me is that your file is UTF-8 encoded, but your Java environment is configured with Windows-1252 as its default encoding. If you modify your code to explicitly specify the character set ("UTF-8") instead of using the default, your code will be less likely to fail in different environments.
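To see which default your own JVM is using, something like this small sketch should be enough. The class and method names are invented for the example; Charset.defaultCharset() and the file.encoding system property are standard JDK facilities.

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    // Returns the canonical name of the charset the JVM uses when none is specified.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // This is the charset that FileReader(String) falls back to.
        System.out.println("Default charset: " + defaultCharsetName());
        // The system property the default has traditionally been derived from.
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
    }
}
```

On a Western European Windows installation with an older JDK this typically reports windows-1252. From JDK 18 onward the default is UTF-8 on every platform, so the same program can behave differently after a JDK upgrade, which is one more reason to name the charset explicitly.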
Keep in mind though - this could have just as easily happened the other way. If you have a file of primarily Spanish text, it could just as easily have been an ISO-8859-1 or Windows-1252 encoded file. In that case your code running on your machine would have worked just fine, and switching it to read "UTF-8" would have created a different set of garbled characters.
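A rough sketch of that opposite mismatch: text saved as ISO-8859-1 but read back as UTF-8. The sample word and the names ReverseMismatch and latin1ReadAsUtf8 are illustrative only.

```java
import java.nio.charset.StandardCharsets;

public class ReverseMismatch {
    // Simulates a file written as ISO-8859-1 and then read as UTF-8.
    static String latin1ReadAsUtf8(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "cami\u00F3n" is the Spanish word for "truck"; the accented o is a single
        // byte (0xF3) in ISO-8859-1, which is not a valid UTF-8 sequence on its own.
        String garbled = latin1ReadAsUtf8("cami\u00F3n");
        // The decoder substitutes U+FFFD (the replacement character) for the bad byte.
        System.out.println(garbled.contains("\uFFFD"));     // prints "true"
        System.out.println(garbled.equals("cami\u00F3n"));  // prints "false"
    }
}
```

So the symptom differs by direction: UTF-8 read as Windows-1252 gives pairs like Ã“, while a single-byte encoding read as UTF-8 usually gives replacement characters.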
This is part of the reason you are getting conflicting advice. Different people have encountered different mismatches based on their platform and so have discovered different fixes.
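One way to reduce the guesswork, offered here as a sketch rather than something from the original answer, is to ask a strict decoder whether the bytes are valid UTF-8 at all. Text in a single-byte encoding that contains accented letters will almost always fail this check. The class name Utf8Validator and the method isValidUtf8 are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Validator {
    // Returns true only if the whole byte array decodes as UTF-8 without errors.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // c3 93 is the UTF-8 encoding of capital O with acute.
        System.out.println(isValidUtf8(new byte[]{(byte) 0xC3, (byte) 0x93})); // prints "true"
        // d3 alone is the ISO-8859-1 encoding of the same letter.
        System.out.println(isValidUtf8(new byte[]{(byte) 0xD3}));              // prints "false"
    }
}
```

A passing check does not prove the file is UTF-8 (pure ASCII passes too), but a failing one tells you to try a different charset.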
When in doubt, I read the file in emacs and switch to hexl-mode so I can see the exact binary data in the file. I'm sure there are better and more modern ways to do this.
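If emacs is not at hand, a few lines of Java can show the raw bytes as well. This is only a minimal sketch: the class name HexDump and the 16-bytes-per-line layout are choices made for this example, and it reads the whole file into memory, so it suits small files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class HexDump {
    // Formats bytes as lowercase hex pairs, 16 per line, separated by spaces.
    static String toHex(byte[] data) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < data.length; i++) {
            if (i > 0) {
                out.append(i % 16 == 0 ? '\n' : ' ');
            }
            out.append(String.format("%02x", data[i] & 0xff));
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java HexDump <file>");
            return;
        }
        // No charset is involved here: the bytes are shown exactly as stored.
        System.out.println(toHex(Files.readAllBytes(Paths.get(args[0]))));
    }
}
```

In the dump, look for the byte right before or at each accented letter: a c3 followed by a second byte points to UTF-8, while a lone byte such as d3 or f3 points to ISO-8859-1 or Windows-1252.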
A final thought - it might be worth reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.
Answered by RobAu
Your default encoding is wrong for this file. You probably need to read it as UTF-8 or latin1 (ISO-8859-1). The snippet below sets the encoding on the streams explicitly. See also the related question "Java, default encoding".
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class Program {
    public static void main(String... args) {
        if (args.length != 2) {
            System.err.println("Usage: java Program <input> <output>");
            return;
        }
        // try-with-resources closes both streams even if an exception is thrown.
        // Closing a BufferedReader/BufferedWriter also closes its underlying stream.
        try (BufferedReader fin = new BufferedReader(new InputStreamReader(
                     new FileInputStream(args[0]), StandardCharsets.UTF_8));
             BufferedWriter fout = new BufferedWriter(new OutputStreamWriter(
                     new FileOutputStream(args[1]), StandardCharsets.UTF_8))) {
            String s;
            while ((s = fin.readLine()) != null) {
                fout.write(s);
                fout.newLine();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Answered by Thinhbk
In my experience, the text file should be read and written based on Western encoding: ISO-8859-1.
// As above, open the file itself (ficheroEntrada), not the FileReader.
BufferedReader rEntrada = new BufferedReader(
        new InputStreamReader(new FileInputStream(ficheroEntrada), "ISO-8859-1"));
Answered by ShyJ
The other answers point you in the right direction. I just wanted to add that Guava, with its Files.newReader(File, Charset) helper method, makes creating such a BufferedReader a lot more readable (pardon the pun):
BufferedReader rEntrada = Files.newReader(new File(ficheroEntrada), Charsets.UTF_8);
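If adding Guava is not an option, the JDK itself (Java 7 and later) has a comparable helper, java.nio.file.Files.newBufferedReader(Path, Charset). Note that this is a different Files class from Guava's. The sketch below writes a small UTF-8 temp file and reads it back; the class name, method name and sample text are invented for the illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class NioReaderExample {
    // Writes the text to a temp file as UTF-8, then reads the first line back
    // with an explicitly specified charset.
    static String writeThenReadFirstLine(String text) {
        try {
            Path tmp = Files.createTempFile("charset-demo", ".txt");
            try {
                Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
                try (BufferedReader reader =
                             Files.newBufferedReader(tmp, StandardCharsets.UTF_8)) {
                    return reader.readLine();
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "cami\u00F3n;Espa\u00F1a";
        System.out.println(original.equals(writeThenReadFirstLine(original))); // prints "true"
    }
}
```

One behavioral difference worth knowing: a reader obtained this way throws an IOException on malformed input instead of silently substituting replacement characters, so a wrong charset tends to fail loudly rather than produce garbled tokens.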