How can I open files containing accents in Java?
Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original address and authors, and attribute it to the original authors (not me): Stack Overflow
Original: http://stackoverflow.com/questions/3072376/
Asked by Mark Juric
(editing for clarification and adding some code)
Hello,
We have a requirement to parse data sent from users all over the world. Our Linux systems have a default locale of en_US.UTF-8. However, we often receive files with diacritical marks in their names, such as "special_á_?_è_characters.doc". Although the OS handles these files fine, and an strace shows the OS passing the correct file name to the Java program, Java mangles the names and throws a "file not found" I/O exception when it tries to open them.
This simple program can illustrate the issue:
import java.io.File;
public class load_i18n
{
  public static void main( String [] args ) {
    // List every entry in the current directory and print its name
    File actual = new File(".");
    for( File f : actual.listFiles()){
      System.out.println( f.getName() );
    }
  }
}
Running this program in a directory containing the file special_á_?_è_characters.doc, with the default US English locale, gives:
special_???_???_???_characters.doc
Setting the language via export LANG=es_ES@UTF-8 prints the filename correctly (but this is an unacceptable solution, since the entire system would then run in Spanish). Explicitly setting the Locale inside the program, as in the code below, has no effect either. Below I've modified the program to a) attempt to open the file and b) print the name both as ASCII and as a byte array when it fails to open the file:
import java.io.*;
import java.util.Locale;
public class load_i18n
{
  public static void main( String [] args ) {
    // Stream to read file
    FileInputStream fin;
    // Force a Spanish locale to see whether it changes how names are read
    Locale locale = new Locale("es", "ES");
    Locale.setDefault(locale);
    File actual = new File(".");
    System.out.println(Locale.getDefault());
    for( File f : actual.listFiles()){
      try {
        fin = new FileInputStream (f.getName());
        fin.close(); // only testing that the file opens
      }
      catch (IOException e){
        System.err.println ("Can't open the file " + f.getName() + ".  Printing as byte array.");
        // Encodes the name with the platform default charset
        byte[] textArray = f.getName().getBytes();
        for(byte b: textArray){
          System.err.print(b + " ");
        }
        System.err.println();
        System.exit(-1);
      }
      System.out.println( f.getName() );
    }
  }
}
This produces the output:
es_ES
load_i18n.class
Can't open the file special_???_???_???_characters.doc.  Printing as byte array.
115 112 101 99 105 97 108 95 -17 -65 -67 95 -17 -65 -67 95 -17 -65 -67 95 99 104 97 114 97 99 116 101 114 115 46 100 111 99
This shows that the issue is NOT just a console display problem, since the same characters show up whether the name is printed as bytes or as ASCII. In fact, console display does work under LANG=en_US.UTF-8 for some utilities, such as bash's echo:
[mjuric@arrhchadm30 tmp]$ echo $LANG
en_US.UTF-8
[mjuric@arrhchadm30 tmp]$ echo *
load_i18n.class special_á_?_è_characters.doc
[mjuric@arrhchadm30 tmp]$ ls
load_i18n.class  special_?_?_?_characters.doc
[mjuric@arrhchadm30 tmp]$
Is it possible to modify this code in such a way that when run under Linux with LANG=en_US.UTF-8, it reads the file name in such a way that it can be successfully opened?
Answered by BalusC
First, the character encoding used is not directly related to the locale. So changing the locale won't help much.
Second, the ??? is typical of the Unicode replacement character U+FFFD (�) being printed in ISO-8859-1 instead of UTF-8. Here is the evidence:
System.out.println(new String("\uFFFD".getBytes("UTF-8"), "ISO-8859-1")); // prints three characters (ï¿½) for the single U+FFFD
So there are two problems:
- Your JVM is reading those special characters as �.
- Your console is using ISO-8859-1 to display characters.
For a Sun JVM, the VM argument -Dfile.encoding=UTF-8 should fix the first problem. The second problem has to be fixed in the console settings. If you're using Eclipse, for example, you can change it in Window > Preferences > General > Workspace > Text File Encoding. Set it to UTF-8 as well.
Update: As per your update:
byte[] textArray = f.getName().getBytes();
To exclude the influence of the platform default encoding, that should have been:
byte[] textArray = f.getName().getBytes("UTF-8");
If that still displays the same, then the problem lies deeper. Which JVM exactly are you using? Run java -version. As said before, the -Dfile.encoding argument is Sun JVM specific. Some Linux machines ship with the GNU JVM or OpenJDK's JVM, and this argument may not work there.
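The byte array printed in the question can also be checked directly. The following is a minimal sketch (the class and method names are made up for illustration, not taken from the question or the answer) that decodes the bytes -17 -65 -67 as UTF-8 and confirms that they are the replacement character U+FFFD. If that holds, the accented letters were already lost before getBytes() was called, which points at how the JVM decoded the file name rather than at the console.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original question or answer.
class ReplacementCharCheck {

    // Returns true if the UTF-8 decoding of the bytes contains U+FFFD.
    static boolean containsReplacementChar(byte[] utf8Bytes) {
        String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);
        return decoded.indexOf('\uFFFD') >= 0;
    }

    public static void main(String[] args) {
        // The three bytes that stand in for each accented letter in the question's output.
        byte[] fromQuestion = { -17, -65, -67 };
        System.out.println(containsReplacementChar(fromQuestion)); // prints "true"

        // For comparison: a-acute (U+00E1) correctly encoded in UTF-8 is two bytes, C3 A1.
        byte[] aAcute = "\u00E1".getBytes(StandardCharsets.UTF_8);
        System.out.println(aAcute.length); // prints "2"
    }
}
```

A correctly decoded name would show two bytes per accented letter here, not the same three bytes repeated.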
Answered by Dennis C
It is a bug in the JRE/JDK that has existed for years.
How to fix java when if refused to open a file with special charater in filename?
File.exists() fails with unicode characters in name
I am now submitting a new bug report to them, since LC_ALL=en_us fixes some cases but makes others fail.
Answered by pomo
It's a bug in the old-skool Java File API, maybe just on a Mac? Anyway, the new java.nio API works much better. I have several files containing Unicode characters that failed to load using the java.io... classes. After converting all my code to use java.nio.file.Path, EVERYTHING started working. I also replaced Apache FileUtils (which has the same problem) with java.nio.file.Files...
Answered by erickson
The Java system property file.encoding should match the console's character encoding. The property must be set when starting java on the command line:
java -Dfile.encoding=UTF-8 …
Normally this happens automatically, because the console encoding is usually the platform default encoding, and Java will use the platform default encoding if you don't specify one explicitly.
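To see which encoding a given JVM actually picked up at startup, a small diagnostic like the one below may help. It is only a sketch: the class name is invented, and sun.jnu.encoding is an internal, undocumented property that Sun-derived JVMs appear to use for file names, so treat its presence and meaning as an assumption rather than a guarantee.

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic class; the name is made up for this example.
class EncodingReport {

    // Builds a short report of the encoding-related settings the JVM started with.
    static String report() {
        StringBuilder sb = new StringBuilder();
        sb.append("default charset  = ").append(Charset.defaultCharset().name()).append('\n');
        sb.append("file.encoding    = ").append(System.getProperty("file.encoding")).append('\n');
        // Assumption: internal property on Sun/Oracle/OpenJDK builds; may be null elsewhere.
        sb.append("sun.jnu.encoding = ").append(System.getProperty("sun.jnu.encoding")).append('\n');
        // The locale variables the JVM inherited from the shell (null if unset).
        sb.append("LANG             = ").append(System.getenv("LANG")).append('\n');
        sb.append("LC_ALL           = ").append(System.getenv("LC_ALL")).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(report());
    }
}
```

Running it once normally and once with -Dfile.encoding=UTF-8 shows whether the flag changes anything on the JVM in question.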
Answered by Eric Taix
Well, I struggled with this issue all day! My previous (wrong) code was the same as yours:
for(File f : dir.listFiles()) {
 String filename = f.getName(); // The filename here is wrong !
 FileInputStream fis = new FileInputStream (filename);
}
and it does not work. (I'm using Oracle Java 1.7 on CentOS 6, with LANG and LC_CTYPE=fr_FR.UTF-8 for all users except zimbra, which has LANG and LC_CTYPE=C. That is certainly the cause of this issue, but I can't change it without the risk that Zimbra stops working...)
So I decided to use the new classes of the java.nio.file package (Files and Paths):
DirectoryStream<Path> paths = Files.newDirectoryStream(Paths.get(outputName));
for (Iterator<Path> iterator = paths.iterator(); iterator.hasNext();) {
  Path path = iterator.next();
  String filename = path.getFileName().toString(); // The filename here is correct
  ...
}
So if you are using Java 1.7, you should give the new classes in the java.nio.file package a try: it saved my day!
Hope it helps
Answered by user3054250
When using DirectoryStream, don't forget to close the stream (try-with-resources can help here).
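As a rough sketch of that advice, the listing loop could be written with try-with-resources as follows. The class and method names are invented for the example, and each file is opened through its Path instead of being rebuilt from a String name. Whether this avoids the mangled names on a particular system still depends on the JVM and locale, as discussed in the answers above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class; names are illustrative only.
class ListAndOpen {

    // Opens every regular file in dir and returns their names.
    // Throws IOException if the directory or one of its files cannot be opened.
    static List<String> openEachFile(Path dir) throws IOException {
        List<String> names = new ArrayList<>();
        // try-with-resources closes the DirectoryStream even if an exception is thrown.
        try (DirectoryStream<Path> paths = Files.newDirectoryStream(dir)) {
            for (Path path : paths) {
                if (!Files.isRegularFile(path)) {
                    continue; // skip subdirectories and special files
                }
                // Open via the Path itself rather than via a file name String.
                try (InputStream in = Files.newInputStream(path)) {
                    in.read(); // read one byte to prove the file opened; -1 on an empty file is fine
                    names.add(path.getFileName().toString());
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        for (String name : openEachFile(Paths.get("."))) {
            System.out.println(name);
        }
    }
}
```

Both the directory stream and each input stream are closed automatically when their try blocks exit, so no handle leaks even when a file fails to open.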

