Notice: this page is taken from a Chinese-English translation of a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/1545625/

Time: 2020-10-29 17:01:52  Source: igfitidea

Java Can't Open a File with Surrogate Unicode Values in the Filename?

Tags: java, file, unicode, filenames, surrogate-pairs

Asked by Bear

I'm dealing with code that does various IO operations with files, and I want to make it able to deal with international filenames. I'm working on a Mac with Java 1.5, and if a filename contains Unicode characters that require surrogates, the JVM can't seem to locate the file. For example, my test file is:

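"草𦿶鷗外.gif", which gets broken into the Java characters \u8349\uD85B\uDFF6\u9DD7\u5916.gif (the second character, U+26FF6, lies outside the BMP and is the one that needs the surrogate pair \uD85B\uDFF6).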
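
To make the decomposition above concrete, here is a minimal sketch that prints the UTF-16 code units and code points of the base name from the question. The class name `SurrogateDemo` and the helper `escape` are hypothetical names introduced only for this illustration; the string is written with escapes so the result does not depend on the source file encoding.

```java
class SurrogateDemo {
    // Renders every UTF-16 code unit of s as a backslash-u escape with four hex digits.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("\\u%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Base name of the test file from the question, without the ".gif" extension.
        String name = "\u8349\uD85B\uDFF6\u9DD7\u5916";

        System.out.println(escape(name));
        System.out.println(name.length());                            // 5 UTF-16 code units
        System.out.println(name.codePointCount(0, name.length()));    // 4 code points
        System.out.println(Integer.toHexString(name.codePointAt(1))); // 26ff6
    }
}
```

The first line printed is \u8349\uD85B\uDFF6\u9DD7\u5916: four characters on screen, but five `char` values, because U+26FF6 occupies two of them.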

If I create a File from this filename, I can't open it because I get a FileNotFoundException. Even running the following on the folder that contains the file fails:

File[] files = folder.listFiles(); 
for (File file : files) {
    if (!file.exists()) {
        System.out.println("Failed to find File"); //Fails on the surrogate filename
    }
}

Most of the code I am actually dealing with is of the form:

FileInputStream instream = new FileInputStream(new File("草𦿶鷗外.gif"));
// operations follow

Is there some way I can address this problem, either by escaping the filenames or by opening the files differently?

Answered by bobince

I suspect either Java or the Mac is using CESU-8 instead of proper UTF-8. Java uses “modified UTF-8” (which is a slight variation of CESU-8) for a variety of internal purposes, but I wasn't aware it could use it as a filesystem/defaultCharset. Unfortunately I have neither Mac nor Java here to test with.
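
To illustrate the difference between the two encodings mentioned above, here is a minimal sketch that encodes the supplementary character from the question (U+26FF6) both as standard UTF-8 and as Java's modified UTF-8. It uses `DataOutputStream.writeUTF`, which is documented to write modified UTF-8 after a two-byte length prefix. The class name `Utf8Comparison` and its helper methods are hypothetical, and the sketch only shows how the byte sequences differ; it does not confirm that this mismatch is what causes the FileNotFoundException on the Mac.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class Utf8Comparison {
    // Formats bytes[from..] as space-separated uppercase hex pairs.
    static String hex(byte[] bytes, int from) {
        StringBuilder sb = new StringBuilder();
        for (int i = from; i < bytes.length; i++) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    // Standard UTF-8: a surrogate pair becomes one four-byte sequence.
    static String standardUtf8(String s) throws IOException {
        return hex(s.getBytes("UTF-8"), 0);
    }

    // Modified UTF-8 as written by writeUTF: each surrogate is encoded
    // separately as its own three-byte sequence.
    static String modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        out.writeUTF(s);
        out.flush();
        return hex(buffer.toByteArray(), 2); // skip the two-byte length prefix
    }

    public static void main(String[] args) throws IOException {
        String supplementary = "\uD85B\uDFF6"; // U+26FF6 as a surrogate pair

        System.out.println(standardUtf8(supplementary)); // F0 A6 BF B6
        System.out.println(modifiedUtf8(supplementary)); // ED A1 9B ED BF B6
    }
}
```

A consumer that expects standard UTF-8 would not treat the six-byte form as the same name as the four-byte form, which is the kind of mismatch the answer is speculating about.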

“Modified” is a modified way of saying “badly bugged”. Instead of outputting a four-byte UTF-8 sequence for supplementary (non-BMP) characters like
