Is there a cross-platform Java method to remove filename special characters?

Question

Asked by Ben S

I'm making a cross-platform application that renames files based on data retrieved online. I'd like to sanitize the Strings I took from a web API for the current platform.

I know that different platforms have different file-name requirements, so I was wondering if there's a cross-platform way to do this?

Edit: On Windows platforms you cannot have a question mark '?' in a file name, whereas on Linux you can. The file names may contain such characters, and I would like the platforms that support those characters to keep them, but otherwise strip them out.

Also, I would prefer a standard Java solution that doesn't require third-party libraries.

Answer 1

Answered by Stephen C

It is not clear from your question, but since you are planning to accept pathnames from a web form (?), you probably ought to block attempts to rename certain things, e.g. "C:\Program Files". This implies that you need to canonicalize the pathnames to eliminate "." and ".." before you make your access checks.

Given that, I wouldn't attempt to remove illegal characters. Instead, I'd use "new File(str).getCanonicalFile()" to produce the canonical paths, then check that they satisfy your sandboxing restrictions, and finally use "File.exists()", "File.isFile()", etc. to check that the source and destination are kosher and are not the same file system object. I'd deal with illegal characters by attempting the operations and catching the exceptions.
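The canonicalize-then-check step described above could look roughly like the sketch below. The class name SandboxCheck and the method isInside are hypothetical, and treating a failed canonicalization as "not allowed" is an assumption made for this sketch, not something the answer specifies.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper for illustration; not part of the original answer.
final class SandboxCheck {

    /** Returns true if name, resolved against baseDir, stays inside baseDir. */
    static boolean isInside(File baseDir, String name) {
        try {
            File base = baseDir.getCanonicalFile();
            // getCanonicalFile() resolves "." and ".." before the check is made.
            File candidate = new File(base, name).getCanonicalFile();
            // Path.startsWith compares whole path elements, not string prefixes.
            return candidate.toPath().startsWith(base.toPath());
        } catch (IOException e) {
            // Assumption: a name that cannot be canonicalized is rejected.
            return false;
        }
    }

    public static void main(String[] args) {
        File base = new File(System.getProperty("java.io.tmpdir"));
        System.out.println(isInside(base, "report.txt"));
        System.out.println(isInside(base, "../outside.txt"));
    }
}
```

A rename would then only be attempted when isInside returns true for both the source and the destination.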

Answer 2

Answered by Sarel Botha

As suggested elsewhere, this is not usually what you want to do. It is usually best to create a temporary file using a secure method such as File.createTempFile().

You should not do this with a whitelist that keeps only 'good' characters. If the file name is made up of only Chinese characters, a whitelist will strip everything out of it. For this reason we can't use a whitelist; we have to use a blacklist.
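To make the difference concrete, here is a small sketch comparing the two approaches on a name that contains Chinese characters. The class name and both regular expressions are assumptions chosen for illustration; the blacklist here covers the control characters and the characters Windows reserves.

```java
// Hypothetical comparison for illustration; not part of the original answer.
final class WhitelistVsBlacklist {

    // Whitelist: keep only ASCII letters, digits, dot and underscore.
    static String whitelist(String name) {
        return name.replaceAll("[^a-zA-Z0-9._]", "");
    }

    // Blacklist: drop control characters and the characters Windows reserves.
    static String blacklist(String name) {
        return name.replaceAll("[\\x00-\\x1f<>:\"/\\\\|?*]", "");
    }

    public static void main(String[] args) {
        // Two Chinese characters followed by "?.txt", written as escapes
        // so the source file encoding does not matter.
        String name = "\u62a5\u544a?.txt";
        System.out.println(whitelist(name)); // only ".txt" survives
        System.out.println(blacklist(name)); // only the '?' is removed
    }
}
```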

Linux pretty much allows anything, which can be a real pain. I would just limit Linux to the same list that you limit Windows to, so you save yourself headaches in the future.

Using this C# snippet on Windows I produced a list of characters that are not valid on Windows. There are quite a few more characters in this list than you may think (41) so I wouldn't recommend trying to create your own list.

        foreach (char c in new string(Path.GetInvalidFileNameChars()))
        {
            Console.Write((int)c);
            Console.Write(",");
        }

Here is a simple Java class which 'cleans' a file name.

import java.util.Arrays;

public class FileNameCleaner {
    // Character codes printed by the C# snippet above (invalid in Windows file names).
    final static int[] illegalChars = {34, 60, 62, 124, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 58, 42, 63, 92, 47};

    static {
        // Arrays.binarySearch requires a sorted array.
        Arrays.sort(illegalChars);
    }

    public static String cleanFileName(String badFileName) {
        StringBuilder cleanName = new StringBuilder();
        for (int i = 0; i < badFileName.length(); i++) {
            int c = (int)badFileName.charAt(i);
            // A negative result means c is not in the illegal list, so keep it.
            if (Arrays.binarySearch(illegalChars, c) < 0) {
                cleanName.append((char)c);
            }
        }
        return cleanName.toString();
    }
}

EDIT: As Stephen suggested, you should probably also verify that these file accesses only occur within the directory you allow.

The following answer has sample code for establishing a custom security context in Java and then executing code in that 'sandbox'.

How do you create a secure JEXL (scripting) sandbox?

Answer 3

Answered by David Carboni

There's a pretty good built-in Java solution - Character.isXxx().

Try Character.isJavaIdentifierPart(c):

String name = "name.é+!@#$%^&*(){}][/=?+-_\\|;:`~!'\",<>";
StringBuilder filename = new StringBuilder();

for (char c : name.toCharArray()) {
  if (c=='.' || Character.isJavaIdentifierPart(c)) {
    filename.append(c);
  }
}

Result is "name.é$_".

Answer 4

Answered by Dirk

Or just do this:

String filename = "A20/B22b#?A\\BC#?$%ld_ma.la.xps";
String sane = filename.replaceAll("[^a-zA-Z0-9\\._]+", "_");

Result: A20_B22b_A_BC_ld_ma.la.xps

Explanation:

[a-zA-Z0-9\\._] matches upper- or lowercase letters from a-z, digits, dots and underscores.

[^a-zA-Z0-9\\._] is the inverse, i.e. all characters which do not match the first expression.

[^a-zA-Z0-9\\._]+ is a sequence of characters which do not match the first expression.

So every sequence of characters which does not consist of characters from a-z, A-Z, 0-9 or . _ will be replaced with a single underscore.

Answer 5

Answered by Aaron Digulla

Here is the code I use:

public static String sanitizeName( String name ) {
    if( null == name ) {
        return "";
    }

    if( SystemUtils.IS_OS_LINUX ) {
        return name.replaceAll( "[\u0000/]+", "" ).trim();
    }

    return name.replaceAll( "[\u0000-\u001f<>:\"/\\|?*\u007f]+", "" ).trim();
}

SystemUtils is from Apache commons-lang3.
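Since the question asks for a solution without third-party libraries, a similar check can be sketched with the standard "os.name" system property. This is only an approximation of what SystemUtils.IS_OS_LINUX does; the class name OsAwareSanitizer and the two-argument overload (added so the OS name can be passed in explicitly) are assumptions for illustration.

```java
import java.util.Locale;

// Hypothetical variant for illustration; uses only the standard library.
final class OsAwareSanitizer {

    static String sanitizeName(String name, String osName) {
        if (name == null) {
            return "";
        }
        boolean linux = osName != null
                && osName.toLowerCase(Locale.ROOT).startsWith("linux");
        if (linux) {
            // On Linux only NUL and '/' are removed.
            return name.replaceAll("[\\x00/]+", "").trim();
        }
        // Everywhere else, remove control characters and Windows-reserved characters.
        return name.replaceAll("[\\x00-\\x1f<>:\"/\\\\|?*\\x7f]+", "").trim();
    }

    static String sanitizeName(String name) {
        return sanitizeName(name, System.getProperty("os.name"));
    }

    public static void main(String[] args) {
        System.out.println(sanitizeName("a?b/c", "Linux"));      // prints "a?bc"
        System.out.println(sanitizeName("a?b/c", "Windows 10")); // prints "abc"
    }
}
```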

Answer 6

Answered by Stijn de Witt

This is based on the accepted answer by Sarel Botha, which works fine as long as you don't encounter any characters outside of the Basic Multilingual Plane. If you need full Unicode support (and who doesn't?), use this code instead, which is Unicode safe:

import java.util.Arrays;

public class FileNameCleaner {
  final static int[] illegalChars = {34, 60, 62, 124, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 58, 42, 63, 92, 47};

  static {
    Arrays.sort(illegalChars);
  }

  public static String cleanFileName(String badFileName) {
    StringBuilder cleanName = new StringBuilder();
    int len = badFileName.codePointCount(0, badFileName.length());
    // codePointAt takes a char index, so track the char offset separately
    // from the code point counter.
    for (int i = 0, offset = 0; i < len; i++) {
      int c = badFileName.codePointAt(offset);
      offset += Character.charCount(c);
      if (Arrays.binarySearch(illegalChars, c) < 0) {
        cleanName.appendCodePoint(c);
      }
    }
    return cleanName.toString();
  }
}

Key changes here:

Use codePointCount in combination with length instead of just length
Use codePointAt instead of charAt
Use appendCodePoint instead of append
No need to cast chars to ints. In fact, you should never deal with chars, as they are basically broken for anything outside the BMP.
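As an alternative way to work with code points rather than chars, the same filtering can be sketched with String.codePoints() (available since Java 8). The class name CodePointCleaner and the helper isIllegal are hypothetical; the set of rejected characters is meant to match the 41 values in the list above (codes 0 to 31 plus the nine reserved characters).

```java
// Hypothetical stream-based variant for illustration; requires Java 8 or later.
final class CodePointCleaner {

    // The nine reserved characters: " < > | : * ? \ /
    private static final String RESERVED = "\"<>|:*?\\/";

    static boolean isIllegal(int codePoint) {
        return codePoint < 32 || RESERVED.indexOf(codePoint) >= 0;
    }

    static String cleanFileName(String badFileName) {
        StringBuilder cleanName = new StringBuilder();
        // codePoints() yields whole code points, so surrogate pairs stay together.
        badFileName.codePoints()
                   .filter(cp -> !isIllegal(cp))
                   .forEach(cleanName::appendCodePoint);
        return cleanName.toString();
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the BMP and is written here as a surrogate pair.
        String name = "a\uD83D\uDE00?b.txt";
        System.out.println(cleanFileName(name).equals("a\uD83D\uDE00b.txt")); // prints "true"
    }
}
```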

Answer 7

Answered by wandlang

If you want to allow more than just [A-Za-z0-9], check the MS Naming Conventions, and don't forget to filter out "...Characters whose integer representations are in the range from 1 through 31,...", as Aaron Digulla's example does. The code from David Carboni, for example, would not be sufficient for these characters.
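The reason is that Character.isJavaIdentifierPart also returns true for "ignorable" characters, which include the control characters U+0000 through U+0008 and U+000E through U+001B. A minimal sketch of one way to close that gap is shown below; the class name and the extra Character.isISOControl test are assumptions, and invisible format characters would still pass this filter.

```java
// Hypothetical extension of the isJavaIdentifierPart approach, for illustration.
final class IdentifierPartCheck {

    static String keepSafe(String name) {
        StringBuilder filename = new StringBuilder();
        for (char c : name.toCharArray()) {
            boolean allowed = c == '.' || Character.isJavaIdentifierPart(c);
            // Additionally reject ISO control characters such as U+0001.
            if (allowed && !Character.isISOControl(c)) {
                filename.append(c);
            }
        }
        return filename.toString();
    }

    public static void main(String[] args) {
        // U+0001 counts as an identifier part even though it is a control character.
        System.out.println(Character.isJavaIdentifierPart('\u0001')); // prints "true"
        System.out.println(keepSafe("a\u0001b.txt"));                 // prints "ab.txt"
    }
}
```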

Excerpt containing the list of reserved characters:

Use any character in the current code page for a name, including Unicode characters and characters in the extended character set (128–255), except for the following:
The following reserved characters:
< (less than)
> (greater than)
: (colon)
" (double quote)
/ (forward slash)
\ (backslash)
| (vertical bar or pipe)
? (question mark)
* (asterisk)
Integer value zero, sometimes referred to as the ASCII NUL character.
Characters whose integer representations are in the range from 1 through 31, except for alternate data streams where these characters are allowed. For more information about file streams, see File Streams.
Any other character that the target file system does not allow.

Answer 8

Answered by l.poellabauer

Paths.get(...) throws a detailed exception with the position of the illegal character.

public static String removeInvalidChars(final String fileName)
{
  try
  {
    Paths.get(fileName);
    return fileName;
  }
  catch (final InvalidPathException e)
  {
    if (e.getInput() != null && e.getInput().length() > 0 && e.getIndex() >= 0)
    {
      final StringBuilder stringBuilder = new StringBuilder(e.getInput());
      stringBuilder.deleteCharAt(e.getIndex());
      return removeInvalidChars(stringBuilder.toString());
    }
    throw e;
  }
}
