Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/1184176/

How can I safely encode a string in Java to use as a filename?

Tags: java, string, file, encoding

Asked by Steve McLeod

I'm receiving a string from an external process. I want to use that String to make a filename, and then write to that file. Here's my code snippet to do this:

    String s = ... // comes from external source
    File currentFile = new File(System.getProperty("user.home"), s);
    PrintWriter currentWriter = new PrintWriter(currentFile);

If s contains an invalid character, such as '/' in a Unix-based OS, then a java.io.FileNotFoundException is (rightly) thrown.

How can I safely encode the String so that it can be used as a filename?

Edit: What I'm hoping for is an API call that does this for me.

I can do this:

    String s = ... // comes from external source
    File currentFile = new File(System.getProperty("user.home"), URLEncoder.encode(s, "UTF-8"));
    PrintWriter currentWriter = new PrintWriter(currentFile);

But I'm not sure whether URLEncoder is reliable for this purpose.

Accepted answer by Stephen C

If you want the result to resemble the original file, SHA-1 or any other hashing scheme is not the answer. If collisions must be avoided, then simple replacement or removal of "bad" characters is not the answer either.

Instead you want something like this. (Note: this should be treated as an illustrative example, not something to copy and paste.)

char fileSep = '/'; // ... or do this portably.
char escape = '%'; // ... or some other legal char.
String s = ...
int len = s.length();
StringBuilder sb = new StringBuilder(len);
for (int i = 0; i < len; i++) {
    char ch = s.charAt(i);
    if (ch < ' ' || ch >= 0x7F || ch == fileSep || ... // add other illegal chars
        || (ch == '.' && i == 0) // we don't want to collide with "." or ".."!
        || ch == escape) {
        sb.append(escape);
        if (ch < 0x10) {
            sb.append('0');
        }
        sb.append(Integer.toHexString(ch));
    } else {
        sb.append(ch);
    }
}
File currentFile = new File(System.getProperty("user.home"), sb.toString());
PrintWriter currentWriter = new PrintWriter(currentFile);

This solution gives a reversible encoding (with no collisions) where the encoded strings resemble the original strings in most cases. I'm assuming that you are using 8-bit characters.

URLEncoder works, but it has the disadvantage that it encodes a whole lot of legal file name characters.

If you want a not-guaranteed-to-be-reversible solution, then simply remove the 'bad' characters rather than replacing them with escape sequences.

The reverse of the above encoding should be equally straight-forward to implement.
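
For readers who want something closer to runnable, here is a minimal sketch of the same idea together with its reverse. It is not the code above: it escapes UTF-8 bytes instead of chars, so every escape is exactly two hex digits even for non-ASCII input. The class name FilenameEscaper and the set of characters treated as illegal are assumptions made for illustration; adjust the set for your filesystem. An empty input still yields an empty name, which you have to handle separately.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original answer.
final class FilenameEscaper {
    private static final char ESCAPE = '%';

    /** Escapes each unsafe UTF-8 byte as '%' followed by two hex digits. */
    static String encode(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder(bytes.length);
        for (int i = 0; i < bytes.length; i++) {
            int b = bytes[i] & 0xFF;
            boolean unsafe = b < ' ' || b >= 0x7F  // control and non-ASCII bytes
                    || b == '/' || b == '\\'       // path separators (assumed set)
                    || (b == '.' && i == 0)        // never produce "." or ".."
                    || b == ESCAPE;                // keeps the encoding reversible
            if (unsafe) {
                sb.append(ESCAPE)
                  .append(Character.forDigit(b >> 4, 16))
                  .append(Character.forDigit(b & 0xF, 16));
            } else {
                sb.append((char) b);
            }
        }
        return sb.toString();
    }

    /** Reverses encode(); assumes the input was produced by encode(). */
    static String decode(String name) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(name.length());
        for (int i = 0; i < name.length(); i++) {
            char ch = name.charAt(i);
            if (ch == ESCAPE) {
                // The two characters after the escape are the hex value of one byte.
                out.write(Integer.parseInt(name.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write(ch);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

With this sketch, FilenameEscaper.encode("../a%b") gives "%2e.%2fa%25b", and decode() turns that back into "../a%b".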

Answer by Burkhard

You could remove the invalid chars ('/', '\', '?', '*') and then use it.

Answer by vog

It depends on whether the encoding should be reversible or not.

Reversible

Use URL encoding (java.net.URLEncoder) to replace special characters with %xx. Note that you must take care of the special cases where the string equals ".", equals ".." or is empty! [1] Many programs use URL encoding to create file names, so this is a standard technique which everybody understands.

Irreversible

Use a hash (e.g. SHA-256) of the given string. Modern hash algorithms (not MD5, and since the practical collision published in 2017, not SHA-1 either) can be considered collision-free. In fact, you'll have a break-through in cryptography if you find a collision.
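
As a rough sketch of the hashing approach, the following uses only the JDK's MessageDigest. The choice of SHA-256 and the names HashedFilenames and sha256Hex are assumptions made for this example; the result is a 64-character hex string that contains only filename-safe characters but cannot be turned back into the input.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the original answer.
final class HashedFilenames {
    /** Returns the lowercase hex SHA-256 digest of the UTF-8 bytes of s. */
    static String sha256Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b)); // two hex digits per byte
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform implementation is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```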

[1] You can handle all 3 special cases elegantly by using a prefix such as "myApp-". If you put the file directly into $HOME, you'll have to do that anyway to avoid conflicts with existing files such as ".bashrc".

public static String encodeFilename(String s)
{
    try
    {
        return "myApp-" + java.net.URLEncoder.encode(s, "UTF-8");
    }
    catch (java.io.UnsupportedEncodingException e)
    {
        throw new RuntimeException("UTF-8 is an unknown encoding!?");
    }
}
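
To get the original string back, the prefix has to be stripped before decoding. A possible counterpart is sketched below; the class name PrefixedFilenames and the method decodeFilename are made up for this example. One caveat: URLEncoder leaves '*' unencoded (along with '.', '-' and '_'), so names produced this way can still be rejected on Windows.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical wrapper around the prefix + URL-encoding idea.
final class PrefixedFilenames {
    private static final String PREFIX = "myApp-";

    static String encodeFilename(String s) {
        try {
            return PREFIX + URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 must be supported", e);
        }
    }

    /** Strips the prefix and undoes the URL encoding. */
    static String decodeFilename(String name) {
        if (!name.startsWith(PREFIX)) {
            throw new IllegalArgumentException("Not an encoded name: " + name);
        }
        try {
            return URLDecoder.decode(name.substring(PREFIX.length()), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 must be supported", e);
        }
    }
}
```

The three special cases become "myApp-.", "myApp-.." and "myApp-", all of which are ordinary file names.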

Answer by cletus

My suggestion is to take a "white list" approach, meaning don't try and filter out bad characters. Instead define what is OK. You can either reject the filename or filter it. If you want to filter it:

String name = s.replaceAll("\\W+", "");

What this does is replace any character that isn't a number, letter or underscore with nothing. Alternatively you could replace them with another character (like an underscore).

The problem is that if this is a shared directory then you don't want file name collision. Even if user storage areas are segregated by user you may end up with a colliding filename just by filtering out bad characters. The name a user put in is often useful if they ever want to download it too.

For this reason I tend to allow the user to enter what they want, store the filename based on a scheme of my own choosing (eg userId_fileId) and then store the user's filename in a database table. That way you can display it back to the user, store things how you want and you don't compromise security or wipe out other files.
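
A minimal sketch of that scheme might look like the following. The in-memory map only stands in for the database table, and UploadRegistry, register and displayName are hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the database table described above.
final class UploadRegistry {
    // Maps the name used on disk to the name the user typed.
    private final Map<String, String> displayNames = new HashMap<>();
    private long nextFileId = 1;

    /** Records the user's filename and returns the name to use on disk, e.g. "42_1". */
    String register(long userId, String userFilename) {
        String storageName = userId + "_" + nextFileId++;
        displayNames.put(storageName, userFilename);
        return storageName;
    }

    /** Looks up the name to show to the user, or null if the storage name is unknown. */
    String displayName(String storageName) {
        return displayNames.get(storageName);
    }
}
```

The user-supplied string never reaches the filesystem, so even an input like "../../etc/passwd" is stored only as data.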

You can also hash the file (eg MD5 hash) but then you can't list the files the user put in (not with a meaningful name anyway).

EDIT: Fixed regex for Java

Answer by hd1

Pick your poison from the options presented by commons-codec, for example:

// DigestUtils is org.apache.commons.codec.digest.DigestUtils; sha1Hex returns a hex String (sha1 returns byte[]).
String safeFileName = DigestUtils.sha1Hex(filename);

Answer by SharkAlley

For those looking for a general solution, these might be common criteria:

  • The filename should resemble the string.
  • The encoding should be reversible where possible.
  • The probability of collisions should be minimized.

To achieve this we can use regex to match illegal characters, percent-encode them, then constrain the length of the encoded string.

import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static final Pattern PATTERN = Pattern.compile("[^A-Za-z0-9_\\-]");

private static final int MAX_LENGTH = 127;

public static String escapeStringAsFilename(String in){

    StringBuffer sb = new StringBuffer();

    // Apply the regex.
    Matcher m = PATTERN.matcher(in);

    while (m.find()) {

        // Percent-encode each UTF-8 byte of the match as exactly two hex
        // digits, so that URLDecoder can reverse the encoding.
        StringBuilder replacement = new StringBuilder();
        for (byte b : m.group().getBytes(StandardCharsets.UTF_8)) {
            replacement.append(String.format("%%%02X", b));
        }

        m.appendReplacement(sb, replacement.toString());
    }
    m.appendTail(sb);

    String encoded = sb.toString();

    // Truncate the string.
    int end = Math.min(encoded.length(),MAX_LENGTH);
    return encoded.substring(0,end);
}

Patterns

The pattern above is based on a conservative subset of allowed characters in the POSIX spec.

If you want to allow the dot character, use:

private static final Pattern PATTERN = Pattern.compile("[^A-Za-z0-9_\\-\\.]");

Just be wary of strings like "." and "..".

If you want to avoid collisions on case insensitive filesystems, you'll need to escape capitals:

private static final Pattern PATTERN = Pattern.compile("[^a-z0-9_\\-]");

Or escape lower case letters:

private static final Pattern PATTERN = Pattern.compile("[^A-Z0-9_\\-]");

Rather than using a whitelist, you may choose to blacklist reserved characters for your specific filesystem. For example, this regex suits FAT32 filesystems:

private static final Pattern PATTERN = Pattern.compile("[%\\.\"\\*/:<>\\?\\\\\\|\\+,\\.;=\\[\\]]");

Length

On Android, 127 characters is the safe limit. Many filesystems allow 255 characters.

If you prefer to retain the tail, rather than the head of your string, use:

// Truncate the string.
int start = Math.max(0,encoded.length()-MAX_LENGTH);
return encoded.substring(start,encoded.length());

Decoding

To convert the filename back to the original string, use:

URLDecoder.decode(filename, "UTF-8");

Limitations

Because longer strings are truncated, there is the possibility of a name collision when encoding, or corruption when decoding.
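
One source of corruption is a cut that lands in the middle of a "%XX" escape. The sketch below, with the hypothetical names SafeTruncate and truncateEncoded, backs the cut up so that no escape is split. It assumes every '%' in the encoded string starts an escape, which holds when '%' itself is always escaped. It does not prevent collisions, and a multi-byte character encoded as several escapes can still be cut between its bytes.

```java
// Hypothetical helper, not part of the original answer.
final class SafeTruncate {
    /** Truncates to at most maxLength chars without splitting a "%XX" escape. */
    static String truncateEncoded(String encoded, int maxLength) {
        if (encoded.length() <= maxLength) {
            return encoded;
        }
        int end = maxLength;
        if (end >= 1 && encoded.charAt(end - 1) == '%') {
            end -= 1; // cut fell right after '%': drop the dangling '%'
        } else if (end >= 2 && encoded.charAt(end - 2) == '%') {
            end -= 2; // cut fell after "%X": drop the partial escape
        }
        return encoded.substring(0, end);
    }
}
```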

Answer by BullyWiiPlaza

Try using the following regex which replaces every invalid file name character with a space:

public static String toValidFileName(String input)
{
    // "\\\\" is a literal backslash in the regex; '/' needs no escaping.
    return input.replaceAll("[:\\\\/*\"?|<>']", " ");
}

Answer by voho

This is probably not the most effective way, but shows how to do it using Java 8 pipelines:

private static String sanitizeFileName(String name) {
    return name
            .chars()
            .mapToObj(i -> (char) i)
            .map(c -> Character.isWhitespace(c) ? '_' : c)
            .filter(c -> Character.isLetterOrDigit(c) || c == '-' || c == '_')
            .map(String::valueOf)
            .collect(Collectors.joining());
}

The solution could be improved by creating a custom collector which uses StringBuilder, so you do not have to convert each light-weight character to a heavy-weight string.
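
One way to do that without writing a full Collector is the three-argument collect on IntStream, which appends straight into a StringBuilder. This is only a sketch of the idea; the class name StreamSanitizer is made up for the example.

```java
// Hypothetical variant of the pipeline that avoids per-character Strings.
final class StreamSanitizer {
    static String sanitizeFileName(String name) {
        return name
                .chars()
                // Turn whitespace into underscores before filtering.
                .map(c -> Character.isWhitespace(c) ? '_' : c)
                // Keep only letters, digits, '-' and '_'.
                .filter(c -> Character.isLetterOrDigit(c) || c == '-' || c == '_')
                // Append each remaining char to a single StringBuilder.
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }
}
```

For example, sanitizeFileName("my file: v1.txt") yields "my_file_v1txt".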

Answer by JonasCz - Reinstate Monica

Here's what I use:

public String sanitizeFilename(String inputName) {
    return inputName.replaceAll("[^a-zA-Z0-9-_\\.]", "_");
}

What this does is replace every character which is not a letter, number, hyphen, underscore or dot with an underscore, using regex.

This means that something like "How to convert £ to $" will become "How_to_convert___to__". Admittedly, this result is not very user-friendly, but it is safe and the resulting directory/file names are guaranteed to work everywhere. In my case, the result is not shown to the user, and is thus not a problem, but you may want to alter the regex to be more permissive.

It is worth noting that another problem I encountered was that I would sometimes get identical names (since it's based on user input), so you should be aware of that, since you can't have multiple directories/files with the same name in a single directory. I just prepended the current time and date, and a short random string, to avoid that (an actual random string, not a hash of the filename, since identical filenames will result in identical hashes).
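
A prefix like that could be built roughly as follows. The timestamp format, the six hex digits of randomness and the names UniqueNames and uniqueName are all assumptions made for this sketch, not what I actually run.

```java
import java.security.SecureRandom;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, shown only to illustrate the prefixing idea.
final class UniqueNames {
    private static final SecureRandom RANDOM = new SecureRandom();
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss");

    /** Prepends the current date/time and six random hex digits to an already sanitized name. */
    static String uniqueName(String sanitized) {
        String random = String.format("%06x", RANDOM.nextInt(0x1000000));
        return LocalDateTime.now().format(STAMP) + "_" + random + "_" + sanitized;
    }
}
```

This makes clashes unlikely rather than impossible. If a clash must never overwrite anything, creating the file with java.nio.file.Files.createFile, which throws FileAlreadyExistsException when the name is taken, lets you detect it and retry.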

Also, you may need to truncate or otherwise shorten the resulting string, since it may exceed the 255 character limit some systems have.
