Java File.listFiles() mangles Unicode names with JDK 6 (Unicode normalization issue)

Question

Asked by James Murty

I'm struggling with a strange file name encoding issue when listing directory contents in Java 6 on both OS X and Linux: File.listFiles() and related methods seem to return file names in a different encoding than the rest of the system.

Note that it is not merely the display of these file names that is causing me problems. I'm mainly interested in doing a comparison of file names with a remote file storage system, so I care more about the content of the name strings than the character encoding used to print output.

Here is a program to demonstrate. It creates a file with a Unicode name, then prints URL-encoded versions of the file name obtained from the directly created File and from the same file when listed under its parent directory (you should run this code in an empty directory). The results show the different encoding returned by the File.listFiles() method.

import java.io.File;
import java.net.URLEncoder;

public class ListFilesDemo {
    public static void main(String[] args) throws Exception {
        // File name containing accented characters (î and å)
        String fileName = "Trîcky Nåme";
        File file = new File(fileName);
        file.createNewFile();
        System.out.println("File name: " + URLEncoder.encode(file.getName(), "UTF-8"));

        // Get parent (current) dir and list file contents
        File parentDir = file.getAbsoluteFile().getParentFile();
        File[] children = parentDir.listFiles();
        for (File child : children) {
            System.out.println("Listed name: " + URLEncoder.encode(child.getName(), "UTF-8"));
        }
    }
}

Here's what I get when I run this test code on my systems. Note the %CC versus %C3 character representations.

OS X Snow Leopard:

File name: Tri%CC%82cky+Na%CC%8Ame
Listed name: Tr%C3%AEcky+N%C3%A5me

$ java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02-279-10M3065)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01-279, mixed mode)

KUbuntu Linux (running in a VM on same OS X system):

File name: Tri%CC%82cky+Na%CC%8Ame
Listed name: Tr%C3%AEcky+N%C3%A5me

$ java -version
java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8.1) (6b18-1.8.1-0ubuntu1)
OpenJDK Client VM (build 16.0-b13, mixed mode, sharing)

I have tried various hacks to get the strings to agree, including setting the file.encoding system property and various LC_CTYPE and LANG environment variables. Nothing helps, nor do I want to resort to such hacks.

Unlike this (somewhat related?) question, I am able to read data from the listed files despite the odd names.

Answer 1

Accepted answer by Stephen P

Using Unicode, there is more than one valid way to represent the same letter. The characters you're using in your Tricky Name are a "latin small letter i with circumflex" and a "latin small letter a with ring above".

You say "Note the %CC versus %C3 character representations", but looking closer, what you see are the sequences

i 0xCC 0x82 vs. 0xC3 0xAE
a 0xCC 0x8A vs. 0xC3 0xA5

That is, the first is the letter i followed by 0xCC 0x82, which is the UTF-8 encoding of the Unicode \u0302 "combining circumflex accent" character, while the second is the UTF-8 encoding of \u00EE "latin small letter i with circumflex". Similarly for the other pair: the first is the letter a followed by 0xCC 0x8A, the "combining ring above" character (\u030A), and the second is "latin small letter a with ring above" (\u00E5). Both of these are valid UTF-8 encodings of valid Unicode character strings, but the first is in "decomposed" form and the second in "composed" form.
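
To make the byte sequences above concrete, here is a minimal sketch that prints the UTF-8 bytes of both forms of the letter. The class and method names (Utf8FormsDemo, utf8Hex) are made up for illustration, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

class Utf8FormsDemo {
    // Render the UTF-8 bytes of a string as space-separated hex pairs.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Decomposed: letter i followed by U+0302 (combining circumflex accent)
        System.out.println("decomposed: " + utf8Hex("i\u0302")); // 69 CC 82
        // Composed: the single character U+00EE
        System.out.println("composed:   " + utf8Hex("\u00EE"));  // C3 AE
        // The two strings render the same but are not equal as Java Strings
        System.out.println("equal: " + "i\u0302".equals("\u00EE")); // false
    }
}
```

The last line is the crux of the question: String.equals compares code units, so the two spellings of the same letter do not match until one of them is normalized.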

OS X HFS Plus volumes store strings (e.g. filenames) as "fully decomposed". On a Unix file system, how a name is stored depends on what the filesystem driver chooses to do, so you can't make blanket statements across different types of filesystems.

See the Wikipedia article on Unicode Equivalence for a general discussion of composed vs. decomposed forms; it mentions OS X specifically.

See Apple's Tech Q&A QA1235 (in Objective-C, unfortunately) for information on converting forms.

A recent email thread on Apple's java-dev mailing list could be of some help to you.

Basically, you need to normalize the decomposed form into a composed form before you can compare the strings.
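
A minimal sketch of that comparison, assuming Java 6 or later for java.text.Normalizer. The helper name sameName is hypothetical, and NFC is just one possible common form; normalizing both sides to NFD would work equally well.

```java
import java.text.Normalizer;

class NormalizedCompareDemo {
    // Compare two names after bringing both to the same normalization form (NFC here).
    static boolean sameName(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String decomposed = "Tri\u0302cky Na\u030Ame"; // i + U+0302, a + U+030A
        String composed = "Tr\u00EEcky N\u00E5me";     // U+00EE, U+00E5
        System.out.println(decomposed.equals(composed));    // false
        System.out.println(sameName(decomposed, composed)); // true
    }
}
```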

Answer 2

Answer by Deduplicator

Solution extracted from question:

Thanks to Stephen P for putting me on the right track.

The fix first, for the impatient. If you are compiling with Java 6 you can use the java.text.Normalizer class to normalize strings into a common form of your choice, e.g.

// Normalize to "Normalization Form Canonical Decomposition" (NFD)
protected String normalizeUnicode(String str) {
    Normalizer.Form form = Normalizer.Form.NFD;
    if (!Normalizer.isNormalized(str, form)) {
        return Normalizer.normalize(str, form);
    }
    return str;
}

Since java.text.Normalizer is only available in Java 6 and later, if you need to compile with Java 5 you might have to resort to the sun.text.Normalizer implementation and something like this reflection-based hack. See also "How does this normalize function work?"

This alone is enough for me to decide I won't support compilation of my project with Java 5 :|

Here are other interesting things I learned in this sordid adventure.

The confusion is caused by the file names being in one of two normalization forms which cannot be directly compared: Normalization Form Canonical Decomposition (NFD) or Normalization Form Canonical Composition (NFC). The former tends to have ASCII letters followed by "modifiers" to add accents etc., while the latter has only the extended characters with no ASCII leading character. Read the wiki page Stephen P references for a better explanation.

Unicode string literals like the one contained in the example code (and those received via HTTP in my real app) are in the NFD form, while file names returned by the File.listFiles() method are NFC. The following mini-example demonstrates the differences:

String name = "Trîcky Nåme";
System.out.println("Original name: " + URLEncoder.encode(name, "UTF-8"));
System.out.println("NFC Normalized name: " + URLEncoder.encode(
    Normalizer.normalize(name, Normalizer.Form.NFC), "UTF-8"));
System.out.println("NFD Normalized name: " + URLEncoder.encode(
    Normalizer.normalize(name, Normalizer.Form.NFD), "UTF-8"));

Output:

Original name: Tri%CC%82cky+Na%CC%8Ame
NFC Normalized name: Tr%C3%AEcky+N%C3%A5me
NFD Normalized name: Tri%CC%82cky+Na%CC%8Ame

If you construct a File object with a string name, the File.getName() method will return the name in whatever form you gave it originally. However, if you call File methods that discover names on their own, they seem to return names in NFC form. This is potentially a nasty gotcha. It certainly got me.

According to the quote below from Apple's documentation, file names are stored in decomposed (NFD) form on the HFS Plus file system:

"When working within Mac OS you will find yourself using a mixture of precomposed and decomposed Unicode. For example, HFS Plus converts all file names to decomposed Unicode, while Macintosh keyboards generally produce precomposed Unicode."

So the File.listFiles() method helpfully (?) converts file names to the (pre)composed (NFC) form.
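
Since the form of a listed name cannot be relied on, one way to guard against the gotcha is to index listed names by a normalized key before matching them against names from elsewhere. The following is only a sketch: ListedNameIndex, indexByNfc and the remoteName value are hypothetical and not part of the original project.

```java
import java.io.File;
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

class ListedNameIndex {
    // Map each name, normalized to NFC, back to the name exactly as it was listed.
    static Map<String, String> indexByNfc(String[] listedNames) {
        Map<String, String> index = new HashMap<String, String>();
        for (String name : listedNames) {
            index.put(Normalizer.normalize(name, Normalizer.Form.NFC), name);
        }
        return index;
    }

    public static void main(String[] args) {
        String[] listed = new File(".").list(); // null if "." cannot be read
        if (listed == null) {
            return;
        }
        Map<String, String> index = indexByNfc(listed);
        // Hypothetical name received from a remote system, in decomposed form
        String remoteName = "Tri\u0302cky Na\u030Ame";
        String key = Normalizer.normalize(remoteName, Normalizer.Form.NFC);
        System.out.println("Found locally: " + index.containsKey(key));
    }
}
```

Keeping the listed name as the map value means the file can still be opened under the exact name the file system reported.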

Answer 3

Answer by helios

I've seen something similar before. People who uploaded files from their Mac to a webapp used filenames with é.

a) On Mac OS that character is a normal e plus a "sign for ´ applied to the previous char"

b) In Windows it's a special char: é

Both are Unicode. So... I understand you pass the (b) form when creating the File, and at some point Mac OS converts it to the (a) form. Maybe if you search the internet for this double-representation issue you can find a way to handle both situations successfully.

Hope it helps!

Answer 4

Answer by gawi

On a Unix file system, a file name really is a null-terminated byte[]. So the Java runtime has to convert from java.lang.String to byte[] during the createNewFile() operation. The char-to-byte conversion is governed by the locale. I've tested setting LC_ALL to en_US.UTF-8 and en_US.ISO-8859-1 and got coherent results. This is with Sun (...Oracle) Java 1.6.0_20. However, for LC_ALL=en_US.POSIX, the result is:

File name:   Tr%C3%AEcky+N%C3%A5me
Listed name: Tr%3Fcky+N%3Fme

3F is a question mark. It tells me that the conversion was not successful for the non-ASCII characters. So again, everything is as expected.
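
The same substitution can be reproduced in isolation. This sketch is only an analogy and does not go through the runtime's actual file-name conversion path: it encodes a string with US-ASCII via String.getBytes, which replaces unmappable characters with the charset's replacement byte. The class and method names are made up, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

class UnmappableCharDemo {
    // Encode with US-ASCII; characters the charset cannot represent are
    // replaced by its default replacement byte, which is '?' (0x3F).
    static byte[] toAscii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] bytes = toAscii("Tr\u00EEcky");
        System.out.println(new String(bytes, StandardCharsets.US_ASCII)); // Tr?cky
        System.out.println(Integer.toHexString(bytes[2]));                // 3f
    }
}
```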

But the reason your two strings differ is the equivalence between the \u00EE character (C3 AE in UTF-8) and the sequence i + \u0302 (69 CC 82 in UTF-8). \u0302 is a combining diacritical mark (combining circumflex accent). Some sort of normalization occurred during the file creation. I'm not sure if it's done in the Java runtime or the OS.

NOTE: It took me some time to figure this out, since the code snippet you posted does not contain a combining diacritical mark but the equivalent precomposed character î (i.e. \u00ee). You should have embedded the Unicode escape sequence in the string literal (but that's easy to say afterward...).
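
Because the two forms look identical on screen, it can help to dump a string with every non-ASCII character shown as an escape, which makes it obvious which form a literal or a listed name is in. This is a small sketch; escapeNonAscii is a hypothetical helper name.

```java
class UnicodeEscapeDemo {
    // Replace every non-ASCII char with a backslash-u escape so that
    // combining marks and precomposed characters become visible.
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Decomposed input: the combining mark shows up after the plain i
        System.out.println(escapeNonAscii("Tri\u0302cky"));
        // Composed input: a single escaped character replaces the i
        System.out.println(escapeNonAscii("Tr\u00EEcky"));
    }
}
```

Writing the escapes directly in the string literal, as in main above, also removes any dependence on the source file's encoding or on how the editor typed the character.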

Answer 5

Answer by BalusC

I suspect that you just have to tell javac which encoding to use to compile the .java file containing the special characters, since you've hardcoded them in the source file. Otherwise the platform default encoding will be used, which may not be UTF-8 at all.

You can use the javac option -encoding for this.

javac -encoding UTF-8 com/example/Foo.java

This way the resulting .class file will end up containing the correct characters and you will be able to create and list the correct filename as well.

Answer 6

Answer by pomo

An alternative solution is to use the newer java.nio.file API (Path and Files, available since Java 7) in place of the java.io.File API, which works perfectly.
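
A minimal sketch of listing a directory that way, assuming Java 7 or later. The class and method names are made up for illustration. Which normalization form the NIO listing returns on a given platform is not established in this thread, so the sketch normalizes each name to NFC regardless.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

class NioListingDemo {
    // List a directory with java.nio.file and return the entry names
    // normalized to NFC, ready to compare with names from elsewhere.
    static List<String> listNamesNfc(Path dir) throws IOException {
        List<String> names = new ArrayList<String>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path entry : stream) {
                String name = entry.getFileName().toString();
                names.add(Normalizer.normalize(name, Normalizer.Form.NFC));
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        for (String name : listNamesNfc(Paths.get("."))) {
            System.out.println(name);
        }
    }
}
```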
