PHP string sanitizer for filenames
Original source: http://stackoverflow.com/questions/2021624/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
string sanitizer for filename
Asked by user151841
I'm looking for a php function that will sanitize a string and make it ready to use for a filename. Anyone know of a handy one?
(I could write one, but I'm worried that I'll overlook a character!)
Edit: for saving files on a Windows NTFS filesystem.
Accepted answer by Dominic Rodger
Instead of worrying about overlooking characters, how about using a whitelist of characters you are happy to allow? For example, you could allow just good ol' a-z, 0-9, _, and a single instance of a period (.). That's obviously more limiting than most filesystems, but should keep you safe.
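A minimal sketch of that whitelist idea, assuming lowercase ASCII names are acceptable. The function name whitelist_filename is made up for illustration, and keeping the last period (rather than the first) is one possible reading of "a single instance of a period":

```php
<?php
// Hypothetical helper: keep only a-z, 0-9 and "_", plus at most one period.
function whitelist_filename(string $name): string
{
    $name = strtolower($name);
    // Drop every character that is not on the whitelist.
    $name = preg_replace('/[^a-z0-9_.]/', '', $name);
    // Keep only the last period (the one in front of the extension).
    $lastDot = strrpos($name, '.');
    if ($lastDot !== false) {
        $name = str_replace('.', '', substr($name, 0, $lastDot)) . substr($name, $lastDot);
    }
    return $name;
}
```

Note that the result can still be empty or start with a period, so a caller may want to check for that before saving.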
Answered by Sean Vieira
Making a small adjustment to Tor Valamo's solution to fix the problem noticed by Dominic Rodger, you could use:
// Remove anything which isn't a word, whitespace, number
// or any of the following characters -_~,;[]().
// If you don't need to handle multi-byte characters
// you can use preg_replace rather than mb_ereg_replace
// Thanks @Łukasz Rysiak!
$file = mb_ereg_replace("([^\w\s\d\-_~,;\[\]\(\).])", '', $file);
// Remove any runs of periods (thanks falstro!)
$file = mb_ereg_replace("([\.]{2,})", '', $file);
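For the single-byte case mentioned in the comments, the same two steps could be sketched with preg_replace. The function name is hypothetical, and without the u modifier this variant strips non-ASCII characters, so it only suits names that are expected to be plain ASCII:

```php
<?php
// Hypothetical preg_replace variant of the two steps above (ASCII input assumed).
function sanitize_with_preg(string $file): string
{
    // Keep word characters, whitespace and -~,;[](). and drop everything else.
    $file = preg_replace('/[^\w\s\-~,;\[\]().]/', '', $file);
    // Remove any runs of two or more periods.
    $file = preg_replace('/\.{2,}/', '', $file);
    return $file;
}
```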
Answered by SequenceDigitale.com
What about using rawurlencode() ? http://www.php.net/manual/en/function.rawurlencode.php
Here is a function that sanitizes even Chinese characters:
public static function normalizeString ($str = '')
{
    $str = strip_tags($str);
    $str = preg_replace('/[\r\n\t ]+/', ' ', $str);
    $str = preg_replace('/[\"\*\/\:\<\>\?\'\|]+/', ' ', $str);
    $str = strtolower($str);
    $str = html_entity_decode( $str, ENT_QUOTES, "utf-8" );
    $str = htmlentities($str, ENT_QUOTES, "utf-8");
    // keep only the base letter of an accent entity, e.g. "&eacute;" becomes "e"
    $str = preg_replace("/(&)([a-z])([a-z]+;)/i", '$2', $str);
    $str = str_replace(' ', '-', $str);
    $str = rawurlencode($str);
    $str = str_replace('%', '-', $str);
    return $str;
}
Here is the explanation:
- Strip HTML tags
- Collapse line breaks, tabs and carriage returns into a single space
- Remove characters that are illegal in folder and file names
- Put the string in lower case
- Remove foreign accents such as é and à by converting them into HTML entities, then removing the entity code and keeping the letter
- Replace spaces with dashes
- Encode special chars that could pass the previous steps and cause filename conflicts on the server, e.g. "中文百强网"
- Replace "%" with dashes to make sure the link to the file will not be rewritten by the browser when querying the file
OK, some filenames will not be relevant, but in most cases it will work.
ex. Original Name: "საბეჭდი-და-ტიპოგრაფიული.jpg"
Output Name: "-E1-83-A1-E1-83-90-E1-83-91-E1-83-94-E1-83-AD-E1-83-93-E1-83-98--E1-83-93-E1-83-90--E1-83-A2-E1-83-98-E1-83-9E-E1-83-9D-E1-83-92-E1-83-A0-E1-83-90-E1-83-A4-E1-83-98-E1-83-A3-E1-83-9A-E1-83-98.jpg"
It's better like that than a 404 error.
Hope that was helpful.
Carl.
Answered by mgutt
This is how you can sanitize for a file system as asked
function filter_filename($name) {
    // remove illegal file system characters https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words
    $name = str_replace(array_merge(
        array_map('chr', range(0, 31)),
        array('<', '>', ':', '"', '/', '\\', '|', '?', '*')
    ), '', $name);
    // limit filename length to 255 bytes http://serverfault.com/a/9548/44086
    $ext = pathinfo($name, PATHINFO_EXTENSION);
    $name = mb_strcut(pathinfo($name, PATHINFO_FILENAME), 0, 255 - ($ext ? strlen($ext) + 1 : 0), mb_detect_encoding($name)) . ($ext ? '.' . $ext : '');
    return $name;
}
Everything else is allowed in a filesystem, so the question is perfectly answered...
... but it could be dangerous to allow, for example, single quotes (') in a filename if you use it later in an unsafe HTML context, because this absolutely legal filename:
' onerror= 'alert(document.cookie).jpg
becomes an XSS hole:
<img src='<? echo $image ?>' />
// output:
<img src=' ' onerror= 'alert(document.cookie)' />
Because of that, the popular CMS software WordPress removes them, but it covered all relevant chars only after some updates:
$special_chars = array("?", "[", "]", "/", "\\", "=", "<", ">", ":", ";", ",", "'", "\"", "&", "$", "#", "*", "(", ")", "|", "~", "`", "!", "{", "}", "%", "+", chr(0));
// ... a few lines later, whitespace is removed as well ...
preg_replace( '/[\r\n\t -]+/', '-', $filename )
Finally, their list now includes most of the characters that are part of the URI reserved characters and URL unsafe characters lists.
Of course you could simply encode all these chars on HTML output, but most developers, me included, follow the idiom "Better safe than sorry" and delete them in advance.
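For comparison, the "encode on output" route could look like the sketch below. The helper name image_tag is hypothetical; htmlspecialchars() with ENT_QUOTES is the standard PHP call that escapes both single and double quotes:

```php
<?php
// Hypothetical helper: build an <img> tag from an untrusted filename.
function image_tag(string $filename): string
{
    // ENT_QUOTES escapes single quotes as well, so the attribute cannot be closed early.
    $safe = htmlspecialchars($filename, ENT_QUOTES, 'UTF-8');
    return "<img src='" . $safe . "' />";
}
```

With the malicious filename from above, the only single quotes left in the result are the two that delimit the src attribute.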
So finally I would suggest using this:
function filter_filename($filename, $beautify=true) {
    // sanitize filename
    // (the u modifier assumes $filename is valid UTF-8; without it, \xA0 and \xAD
    // would match single bytes inside multi-byte characters)
    $filename = preg_replace(
        '~
        [<>:"/\\\\|?*]|          # file system reserved https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words
        [\x00-\x1F]|             # control characters http://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx
        [\x7F\xA0\xAD]|          # non-printing characters DEL, NO-BREAK SPACE, SOFT HYPHEN
        [#\[\]@!$&\'()+,;=]|     # URI reserved https://tools.ietf.org/html/rfc3986#section-2.2
        [{}^\~`]                 # URL unsafe characters https://www.ietf.org/rfc/rfc1738.txt
        ~xu',
        '-', $filename);
    // avoids ".", ".." or ".hiddenFiles"
    $filename = ltrim($filename, '.-');
    // optional beautification
    if ($beautify) $filename = beautify_filename($filename);
    // limit filename length to 255 bytes http://serverfault.com/a/9548/44086
    $ext = pathinfo($filename, PATHINFO_EXTENSION);
    $filename = mb_strcut(pathinfo($filename, PATHINFO_FILENAME), 0, 255 - ($ext ? strlen($ext) + 1 : 0), mb_detect_encoding($filename)) . ($ext ? '.' . $ext : '');
    return $filename;
}
Everything else that does not cause problems with the file system should be part of an additional function:
function beautify_filename($filename) {
    // reduce consecutive characters
    $filename = preg_replace(array(
        // "file   name.zip" becomes "file-name.zip"
        '/ +/',
        // "file___name.zip" becomes "file-name.zip"
        '/_+/',
        // "file---name.zip" becomes "file-name.zip"
        '/-+/'
    ), '-', $filename);
    $filename = preg_replace(array(
        // "file--.--.-.--name.zip" becomes "file.name.zip"
        '/-*\.-*/',
        // "file...name..zip" becomes "file.name.zip"
        '/\.{2,}/'
    ), '.', $filename);
    // lowercase for windows/unix interoperability http://support.microsoft.com/kb/100625
    $filename = mb_strtolower($filename, mb_detect_encoding($filename));
    // ".file-name.-" becomes "file-name"
    $filename = trim($filename, '.-');
    return $filename;
}
And at this point you need to generate a filename if the result is empty and you can decide if you want to encode UTF-8 characters. But you do not need that as UTF-8 is allowed in all file systems that are used in web hosting contexts.
The only thing you have to do is to use urlencode() (as you hopefully do with all your URLs), so the filename საბეჭდი_მანქანა.jpg becomes this URL as your <img src> or <a href>:
http://www.maxrev.de/html/img/%E1%83%A1%E1%83%90%E1%83%91%E1%83%94%E1%83%AD%E1%83%93%E1%83%98_%E1%83%9B%E1%83%90%E1%83%9C%E1%83%A5%E1%83%90%E1%83%9C%E1%83%90.jpg
Stackoverflow does that, so I can post this link as a user would do it:
http://www.maxrev.de/html/img/საბეჭდი_მანქანა.jpg
So this is a completely legal filename and not a problem, as @SequenceDigitale.com mentioned in his answer.
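The two follow-up steps described in this answer (generating a name when the sanitized result is empty, and percent-encoding the name for a URL) could be sketched as follows. Both helper names are hypothetical, and the sketch swaps in rawurlencode() for urlencode() so that spaces become %20 rather than +:

```php
<?php
// Hypothetical helper: fall back to a generated name when sanitizing left nothing.
function filename_or_fallback(string $sanitized, string $prefix = 'file'): string
{
    // random_bytes(8) gives 16 hex characters after bin2hex().
    return $sanitized !== '' ? $sanitized : $prefix . '-' . bin2hex(random_bytes(8));
}

// Hypothetical helper: percent-encode a filename for use as the last URL path segment.
function file_url(string $baseUrl, string $filename): string
{
    return rtrim($baseUrl, '/') . '/' . rawurlencode($filename);
}
```

The UTF-8 bytes of the filename are encoded one by one, which is how the Georgian name above turns into the long %E1%83... sequence.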
Answered by Philipp
SOLUTION 1 - simple and effective
$file_name = preg_replace( '/[^a-z0-9]+/', '-', strtolower( $url ) );
- strtolower() guarantees the filename is lowercase (since case does not matter inside the URL, but in the NTFS filename)
- [^a-z0-9]+ ensures the filename only keeps letters and numbers
- Substituting invalid characters with '-' keeps the filename readable
Example:
URL: http://stackoverflow.com/questions/2021624/string-sanitizer-for-filename
File: http-stackoverflow-com-questions-2021624-string-sanitizer-for-filename
SOLUTION 2 - for very long URLs
You want to cache the URL contents and just need to have unique filenames. I would use this function:
$file_name = md5( strtolower( $url ) );
This will create a filename with fixed length. The MD5 hash is in most cases unique enough for this kind of usage.
Example:
URL: https://www.amazon.com/Interstellar-Matthew-McConaughey/dp/B00TU9UFTS/ref=s9_nwrsa_gw_g318_i10_r?_encoding=UTF8&fpl=fresh&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=desktop-1&pf_rd_r=BS5M1H560SMAR2JDKYX3&pf_rd_r=BS5M1H560SMAR2JDKYX3&pf_rd_t=36701&pf_rd_p=6822bacc-d4f0-466d-83a8-2c5e1d703f8e&pf_rd_p=6822bacc-d4f0-466d-83a8-2c5e1d703f8e&pf_rd_i=desktop
File: 51301f3edb513f6543779c3a5433b01c
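Wrapped as a small function, Solution 2 might look like this. The name cache_filename and the '.cache' suffix are assumptions added for illustration and are not part of the original answer:

```php
<?php
// Hypothetical helper: derive a fixed-length cache filename from a URL.
function cache_filename(string $url, string $suffix = '.cache'): string
{
    // md5() always returns 32 lowercase hexadecimal characters,
    // so the result contains nothing a file system would reject.
    return md5(strtolower($url)) . $suffix;
}
```

Because the URL is lowercased first, URLs that differ only in case map to the same file.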
Answered by Mark Moline
Well, tempnam() will do it for you.
http://us2.php.net/manual/en/function.tempnam.php
but that creates an entirely new name.
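A short sketch of the tempnam() route, assuming the original name does not need to be preserved. The helper name and the 'upl' prefix are made up for illustration:

```php
<?php
// Hypothetical helper: let PHP pick a unique name instead of sanitizing one.
function new_upload_path(string $dir, string $prefix = 'upl'): string
{
    // tempnam() creates an empty file with a unique name and returns its path,
    // or false on failure.
    $path = tempnam($dir, $prefix);
    if ($path === false) {
        throw new RuntimeException('Could not create a temporary file in ' . $dir);
    }
    return $path;
}
```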
To sanitize an existing string just restrict what your users can enter and make it letters, numbers, period, hyphen and underscore then sanitize with a simple regex. Check what characters need to be escaped or you could get false positives.
$sanitized = preg_replace('/[^a-zA-Z0-9\-\._]/','', $filename);
Answered by Tor Valamo
preg_replace("([^\w\s\d\.\-_~,;:\[\]\(\)])", '', $file)
Add/remove more valid characters depending on what is allowed for your system.
Alternatively you can try to create the file and then return an error if it's bad.
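The "try it and report an error" approach could be sketched like this. The function name is hypothetical; fopen() with mode 'x' is a standard PHP call that creates the file and fails if it already exists:

```php
<?php
// Hypothetical helper: returns true if the file could be created, false otherwise.
function try_create_file(string $dir, string $name): bool
{
    // basename() drops any directory part, so "../" cannot escape $dir.
    $path = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . basename($name);
    // Mode "x" creates the file and fails if it already exists;
    // "@" suppresses the warning so the caller only sees the boolean.
    $handle = @fopen($path, 'x');
    if ($handle === false) {
        return false;
    }
    fclose($handle);
    return true;
}
```

This lets the file system itself decide what is valid, at the cost of not telling the user which character was the problem.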
Answered by 120DEV
PHP provides a function to sanitize text into different formats.
How to:
echo filter_var(
"Lorem Ipsum has been the industry's",FILTER_SANITIZE_URL
);
Output:
LoremIpsumhasbeentheindustry's
Answered by CarlJohnson
Making a small adjustment to Sean Vieira's solution to allow for single dots, you could use:
preg_replace("([^\w\s\d\.\-_~,;:\[\]\(\)]|[\.]{2,})", '', $file)
Answered by Sampson
The following expression creates a nice, clean, and usable string:
/[^a-z0-9\._-]+/gi
Turning "today's financial: billing" into "today-s-financial-billing".
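The expression above is written in JavaScript-style notation. In PHP the g flag does not exist (preg_replace already replaces every match), so a rough equivalent, with a hypothetical function name, would be:

```php
<?php
// Hypothetical wrapper: replace each run of disallowed characters with one dash.
function clean_name(string $name): string
{
    // The i modifier makes a-z match upper-case letters as well.
    return preg_replace('/[^a-z0-9._-]+/i', '-', $name);
}
```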

