Original source: http://stackoverflow.com/questions/2668854/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverflow
Sanitizing strings to make them URL and filename safe?
Asked by Xeoncross
I am trying to come up with a function that does a good job of sanitizing certain strings so that they are safe to use in the URL (like a post slug) and also safe to use as file names. For example, when someone uploads a file I want to make sure that I remove all dangerous characters from the name.
So far I have come up with the following function which I hope solves this problem and allows foreign UTF-8 data also.
/**
 * Convert a string to the file/URL safe "slug" form
 *
 * @param string $string the string to clean
 * @param bool $is_filename TRUE will allow additional filename characters
 * @return string
 */
function sanitize($string = '', $is_filename = FALSE)
{
    // Replace all weird characters with dashes
    $string = preg_replace('/[^\w\-'. ($is_filename ? '~_\.' : ''). ']+/u', '-', $string);
    // Only allow one dash separator at a time (and make string lowercase)
    return mb_strtolower(preg_replace('/--+/u', '-', $string), 'UTF-8');
}
Does anyone have any tricky sample data I can run against this - or know of a better way to safeguard our apps from bad names?
$is_filename allows some additional characters, such as those used in temporary vim files.
update: removed the star character since I could not think of a valid use
Accepted answer by Alan Donnelly
Some observations on your solution:
- 'u' at the end of your pattern means that the pattern, and not the text it's matching will be interpreted as UTF-8 (I presume you assumed the latter?).
- \w matches the underscore character. You specifically include it for files which leads to the assumption that you don't want them in URLs, but in the code you have URLs will be permitted to include an underscore.
- The inclusion of "foreign UTF-8" seems to be locale-dependent. It's not clear whether this is the locale of the server or client. From the PHP docs:
A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.
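The second observation is easy to check directly. The following is a minimal sketch that uses only preg_match() and the URL variant of the question's own character class; the sample input is made up for illustration.

```php
<?php
// \w already matches the underscore, with or without the /u modifier.
var_dump(preg_match('/^\w$/', '_'));   // int(1)
var_dump(preg_match('/^\w$/u', '_'));  // int(1)

// The URL variant of the question's pattern (no extra filename characters)
// therefore leaves underscores untouched.
$slug = preg_replace('/[^\w\-]+/u', '-', 'my_post title');
echo $slug, "\n"; // my_post-title
```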
Creating the slug
You probably shouldn't include accented etc. characters in your post slug since, technically, they should be percent encoded (per URL encoding rules) so you'll have ugly looking URLs.
So, if I were you, after lowercasing, I'd convert any 'special' characters to their equivalent (e.g. é -> e) and replace non [a-z] characters with '-', limiting to runs of a single '-' as you've done. There's an implementation of converting special characters here: https://web.archive.org/web/20130208144021/http://neo22s.com/slug
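The recipe in the previous paragraph could be sketched roughly as follows. The function name make_slug and the tiny $map table are hypothetical stand-ins for a full transliteration table such as the linked implementation; digits are dropped because the paragraph only mentions [a-z], though adding 0-9 to the character class would be an easy extension.

```php
<?php
/**
 * Sketch: lowercase, transliterate, then collapse everything outside
 * [a-z] into single dashes. $map is a small sample, not a complete list.
 */
function make_slug(string $text): string
{
    $map = ['é' => 'e', 'è' => 'e', 'à' => 'a', 'ü' => 'u', 'ñ' => 'n', 'ç' => 'c'];

    // Prefer mb_strtolower() when the mbstring extension is available
    $text = function_exists('mb_strtolower')
        ? mb_strtolower($text, 'UTF-8')
        : strtolower($text);

    $text = strtr($text, $map);
    $text = preg_replace('/[^a-z]+/', '-', $text);

    return trim($text, '-');
}

echo make_slug('Café au Lait!'), "\n"; // cafe-au-lait
```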
Sanitization in general
OWASP have a PHP implementation of their Enterprise Security API which among other things includes methods for safe encoding and decoding input and output in your application.
The Encoder interface provides:
canonicalize (string $input, [bool $strict = true])
decodeFromBase64 (string $input)
decodeFromURL (string $input)
encodeForBase64 (string $input, [bool $wrap = false])
encodeForCSS (string $input)
encodeForHTML (string $input)
encodeForHTMLAttribute (string $input)
encodeForJavaScript (string $input)
encodeForOS (Codec $codec, string $input)
encodeForSQL (Codec $codec, string $input)
encodeForURL (string $input)
encodeForVBScript (string $input)
encodeForXML (string $input)
encodeForXMLAttribute (string $input)
encodeForXPath (string $input)
https://github.com/OWASP/PHP-ESAPI
https://www.owasp.org/index.php/Category:OWASP_Enterprise_Security_API
Answered by Xeoncross
I found this larger function in the Chyrp code:
/**
 * Function: sanitize
 * Returns a sanitized string, typically for URLs.
 *
 * Parameters:
 *     $string - The string to sanitize.
 *     $force_lowercase - Force the string to lowercase?
 *     $anal - If set to *true*, will remove all non-alphanumeric characters.
 */
function sanitize($string, $force_lowercase = true, $anal = false) {
    $strip = array("~", "`", "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "_", "=", "+", "[", "{", "]",
                   "}", "\\", "|", ";", ":", "\"", "'", "‘", "’", "“", "”", "–", "—",
                   "a”", "a“", ",", "<", ".", ">", "/", "?");
    $clean = trim(str_replace($strip, "", strip_tags($string)));
    $clean = preg_replace('/\s+/', "-", $clean);
    $clean = ($anal) ? preg_replace("/[^a-zA-Z0-9]/", "", $clean) : $clean ;
    return ($force_lowercase) ?
        (function_exists('mb_strtolower')) ?
            mb_strtolower($clean, 'UTF-8') :
            strtolower($clean) :
        $clean;
}
and this one in the WordPress code:
/**
 * Sanitizes a filename replacing whitespace with dashes
 *
 * Removes special characters that are illegal in filenames on certain
 * operating systems and special characters requiring special escaping
 * to manipulate at the command line. Replaces spaces and consecutive
 * dashes with a single dash. Trim period, dash and underscore from beginning
 * and end of filename.
 *
 * @since 2.1.0
 *
 * @param string $filename The filename to be sanitized
 * @return string The sanitized filename
 */
function sanitize_file_name( $filename ) {
    $filename_raw = $filename;
    $special_chars = array("?", "[", "]", "/", "\\", "=", "<", ">", ":", ";", ",", "'", "\"", "&", "$", "#", "*", "(", ")", "|", "~", "`", "!", "{", "}");
    $special_chars = apply_filters('sanitize_file_name_chars', $special_chars, $filename_raw);
    $filename = str_replace($special_chars, '', $filename);
    $filename = preg_replace('/[\s-]+/', '-', $filename);
    $filename = trim($filename, '.-_');
    return apply_filters('sanitize_file_name', $filename, $filename_raw);
}
Update Sept 2012
Alix Axel has done some incredible work in this area. His phunction framework includes several great text filters and transformations.
Answered by SoLoGHoST
This should make your filenames safe...
$string = preg_replace(array('/\s/', '/\.[\.]+/', '/[^\w_\.\-]/'), array('_', '.', ''), $string);
and a deeper solution to this is:
// Remove special accented characters - ie. sí.
$clean_name = strtr($string, array('Š' => 'S','Ž' => 'Z','š' => 's','ž' => 'z','Ÿ' => 'Y','À' => 'A','Á' => 'A','Â' => 'A','Ã' => 'A','Ä' => 'A','Å' => 'A','Ç' => 'C','È' => 'E','É' => 'E','Ê' => 'E','Ë' => 'E','Ì' => 'I','Í' => 'I','Î' => 'I','Ï' => 'I','Ñ' => 'N','Ò' => 'O','Ó' => 'O','Ô' => 'O','Õ' => 'O','Ö' => 'O','Ø' => 'O','Ù' => 'U','Ú' => 'U','Û' => 'U','Ü' => 'U','Ý' => 'Y','à' => 'a','á' => 'a','â' => 'a','ã' => 'a','ä' => 'a','å' => 'a','ç' => 'c','è' => 'e','é' => 'e','ê' => 'e','ë' => 'e','ì' => 'i','í' => 'i','î' => 'i','ï' => 'i','ñ' => 'n','ò' => 'o','ó' => 'o','ô' => 'o','õ' => 'o','ö' => 'o','ø' => 'o','ù' => 'u','ú' => 'u','û' => 'u','ü' => 'u','ý' => 'y','ÿ' => 'y'));
$clean_name = strtr($clean_name, array('Þ' => 'TH', 'þ' => 'th', 'Ð' => 'DH', 'ð' => 'dh', 'ß' => 'ss', 'Œ' => 'OE', 'œ' => 'oe', 'Æ' => 'AE', 'æ' => 'ae', 'µ' => 'u'));
$clean_name = preg_replace(array('/\s/', '/\.[\.]+/', '/[^\w_\.\-]/'), array('_', '.', ''), $clean_name);
This assumes that you want a dot in the filename. If you want it converted to lowercase, just use
$clean_name = strtolower($clean_name);
for the last line.
Answered by John Conde
Try this:
function normal_chars($string)
{
    // Turn accented characters into named entities, e.g. é becomes &eacute;
    $string = htmlentities($string, ENT_QUOTES, 'UTF-8');
    // Keep only the base letter(s) of each entity, e.g. &eacute; becomes e
    $string = preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', $string);
    $string = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
    $string = preg_replace(array('~[^0-9a-z]~i', '~[ -]+~'), ' ', $string);
    return trim($string, ' -');
}
Examples:
echo normal_chars('Álix----_Ãxel!?!?'); // Alix Axel
echo normal_chars('áéíóúÁÉÍÓÚ'); // aeiouAEIOU
echo normal_chars('üÿÄËÏÖÜŸåÅ'); // uyAEIOUYaA
Based on the selected answer in this thread: URL Friendly Username in PHP?
Answered by Alix Axel
This isn't exactly an answer as it doesn't provide any solutions (yet!), but it's too big to fit in a comment...
I did some testing (regarding file names) on Windows 7 and Ubuntu 12.04 and what I found out was that:
1. PHP Can't Handle non-ASCII Filenames
Although both Windows and Ubuntu can handle Unicode filenames (even RTL ones as it seems) PHP 5.3 requires hacks to deal even with the plain old ISO-8859-1, so it's better to keep it ASCII only for safety.
2. The Length of the Filename Matters (Especially on Windows)
On Ubuntu, the maximum length a filename can have (including extension) is 255 (excluding path):
/var/www/uploads/123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345/
However, on Windows 7 (NTFS) the maximum length a filename can have depends on its absolute path:
(0 + 0 + 244 + 11 chars) C:3456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123434567.txt
(0 + 3 + 240 + 11 chars) C:3345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789034567.txt
(3 + 3 + 236 + 11 chars) C:3634567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345634567.txt
Wikipedia says that:
NTFS allows each path component (directory or filename) to be 255 characters long.
To the best of my knowledge (and testing), this is wrong.
In total (counting slashes) all these examples have 259 chars; if you strip the C:\ that gives 256 characters (not 255?!). The directories were created using Explorer, and you'll notice that it restrains itself from using all the available space for the directory name. The reason for this is to allow the creation of files using the 8.3 file naming convention. The same thing happens for other partitions.
Files don't need to reserve the 8.3 length requirements, of course:
(255 chars) E:345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901.txt
You can't create any more sub-directories if the absolute path of the parent directory has more than 242 characters, because 256 = 242 + 1 + \ + 8 + . + 3. Using Windows Explorer, you can't create another directory if the parent directory has more than 233 characters (depending on the system locale), because 256 = 233 + 10 + \ + 8 + . + 3; the 10 here is the length of the string New folder.
The Windows file system poses a nasty problem if you want to ensure interoperability between file systems.
3. Beware of Reserved Characters and Keywords
Aside from removing non-ASCII, non-printable and control characters, you also need to re(place/move):
"*/:<>?\|
Just removing these characters might not be the best idea because the filename might lose some of its meaning. I think that, at the very least, multiple occurrences of these characters should be replaced by a single underscore (_), or perhaps something more representative (this is just an idea):
"*? -> _
/\| -> -
: -> [ ]-[ ]
< -> (
> -> )
There are also special keywords that should be avoided (like NUL), although I'm not sure how to overcome that. Perhaps a black list with a random name fallback would be a good approach to solve it.
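The two ideas in this point, substituting reserved characters and falling back to a random name for reserved keywords, might be sketched as below. Both function names and the random-name format are assumptions made for this example. Only NUL is named above; the rest of the list (CON, PRN, AUX, COM1-9, LPT1-9) is the usual set of Windows device names, added here as an assumption. The sketch does not collapse repeated substitutions into one.

```php
<?php
// Character substitutions from the list above; strtr() replaces every
// key in a single pass and never re-scans its own output.
function replace_reserved_chars(string $name): string
{
    return strtr($name, [
        '"' => '_', '*' => '_', '?' => '_',
        '/' => '-', '\\' => '-', '|' => '-',
        ':' => ' - ',
        '<' => '(', '>' => ')',
    ]);
}

// Black list with a random-name fallback (hypothetical helper).
function avoid_reserved_name(string $filename): string
{
    $reserved = ['CON', 'PRN', 'AUX', 'NUL'];
    for ($i = 1; $i <= 9; $i++) {
        $reserved[] = 'COM' . $i;
        $reserved[] = 'LPT' . $i;
    }

    // Compare the part before the first dot, so "nul.txt" is caught as well.
    $base = strtoupper(explode('.', $filename, 2)[0]);
    if (!in_array($base, $reserved, true)) {
        return $filename;
    }

    $extension = pathinfo($filename, PATHINFO_EXTENSION);
    $random = bin2hex(random_bytes(8)); // 16 hex characters

    return $extension === '' ? $random : $random . '.' . $extension;
}

echo replace_reserved_chars('a:b <c>?.txt'), "\n"; // a - b (c)_.txt
echo avoid_reserved_name('photo.jpg'), "\n";       // photo.jpg
```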
4. Case Sensitiveness
This should go without saying, but if you want to ensure file uniqueness across different operating systems, you should transform file names to a normalized case; that way my_file.txt and My_File.txt on Linux won't both become the same my_file.txt file on Windows.
5. Make Sure It's Unique
If the file name already exists, a unique identifier should be appended to its base file name.
Common unique identifiers include the UNIX timestamp, a digest of the file contents or a random string.
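As a rough illustration of this point, the sketch below appends a short digest of the file contents when the target name is already taken. The function name unique_filename and the "-<8 hex chars>" suffix format are hypothetical, and the sketch does not re-check whether the suffixed name is itself free.

```php
<?php
/**
 * Returns $filename unchanged if it is free in $directory; otherwise
 * appends a short SHA-1 digest of the contents to the base name.
 */
function unique_filename(string $directory, string $filename, string $contents): string
{
    if (!file_exists($directory . DIRECTORY_SEPARATOR . $filename)) {
        return $filename;
    }

    $info = pathinfo($filename);
    $suffix = substr(sha1($contents), 0, 8);
    $extension = isset($info['extension']) ? '.' . $info['extension'] : '';

    return $info['filename'] . '-' . $suffix . $extension;
}
```

A timestamp or a random string could be swapped in for the digest; the digest has the side effect that identical uploads map to the same name.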
6. Hidden Files
Just because it can be named doesn't mean it should...
Dots are usually white-listed in file names but in Linux a hidden file is represented by a leading dot.
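One simple way to act on this is to strip leading dots before saving. This is a minimal sketch; the helper name and the fallback name 'file' are arbitrary choices for the example.

```php
<?php
// Strip leading dots so an upload cannot become a hidden file on Linux.
function strip_leading_dots(string $filename): string
{
    $clean = ltrim($filename, '.');

    // A name made only of dots would end up empty, so use a placeholder.
    return $clean === '' ? 'file' : $clean;
}

echo strip_leading_dots('.htaccess'), "\n"; // htaccess
```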
7. Other Considerations
If you have to strip some chars from the file name, the extension is usually more important than the base name of the file. Allowing a considerable maximum number of characters for the file extension (8-16), one should strip the characters from the base name. It's also important to note that in the unlikely event of having more than one long extension, such as _.graphmlz.tag.gz, only _ (not _.graphmlz.tag) should be considered as the file base name.
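A sketch of this truncation rule follows. It assumes an ASCII-only name (see point 1), treats everything from the first dot onward as the extension, and uses a hypothetical function name. It does not cap the extension length, so a very long extension can still push the result past the limit.

```php
<?php
/**
 * Shortens a file name to at most $max bytes by cutting the base name
 * and keeping the (possibly multi-part) extension intact.
 */
function truncate_filename(string $filename, int $max = 255): string
{
    if (strlen($filename) <= $max) {
        return $filename;
    }

    $dot = strpos($filename, '.');
    if ($dot === false) {
        return substr($filename, 0, $max);
    }

    $extension = substr($filename, $dot);        // includes the leading dot
    $keep = max(0, $max - strlen($extension));   // room left for the base name

    return substr($filename, 0, $keep) . $extension;
}
```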
8. Resources
Calibre handles file name mangling pretty decently:
Wikipedia page on file name mangling and linked chapter from Using Samba.
If, for instance, you try to create a file that violates any of rules 1, 2 or 3, you'll get a very useful error:
Warning: touch(): Unable to create file ... because No error in ... on line ...
Answered by alex
I've always thought Kohana did a pretty good job of it.
public static function title($title, $separator = '-', $ascii_only = FALSE)
{
    if ($ascii_only === TRUE)
    {
        // Transliterate non-ASCII characters
        $title = UTF8::transliterate_to_ascii($title);
        // Remove all characters that are not the separator, a-z, 0-9, or whitespace
        $title = preg_replace('![^'.preg_quote($separator).'a-z0-9\s]+!', '', strtolower($title));
    }
    else
    {
        // Remove all characters that are not the separator, letters, numbers, or whitespace
        $title = preg_replace('![^'.preg_quote($separator).'\pL\pN\s]+!u', '', UTF8::strtolower($title));
    }
    // Replace all separator characters and whitespace by a single separator
    $title = preg_replace('!['.preg_quote($separator).'\s]+!u', $separator, $title);
    // Trim separators from the beginning and end
    return trim($title, $separator);
}
The handy UTF8::transliterate_to_ascii() will turn stuff like ñ => n.
Of course, you could replace the other UTF8::* stuff with mb_* functions.
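As a sketch of that substitution, here is the non-ASCII branch written as a standalone function with mb_strtolower() in place of UTF8::strtolower(). The function name title_slug is hypothetical, and the strtolower() fallback for installations without mbstring is an addition for this example, not part of Kohana.

```php
<?php
function title_slug(string $title, string $separator = '-'): string
{
    $sep = preg_quote($separator, '!');

    // mb_strtolower() stands in for Kohana's UTF8::strtolower()
    $title = function_exists('mb_strtolower')
        ? mb_strtolower($title, 'UTF-8')
        : strtolower($title);

    // Remove all characters that are not the separator, letters, numbers, or whitespace
    $title = preg_replace('![^' . $sep . '\pL\pN\s]+!u', '', $title);
    // Replace all separator characters and whitespace by a single separator
    $title = preg_replace('![' . $sep . '\s]+!u', $separator, $title);

    // Trim separators from the beginning and end
    return trim($title, $separator);
}

echo title_slug('Hello,  World!'), "\n"; // hello-world
```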
Answered by jah
In terms of file uploads, you would be safest to prevent the user from controlling the file name. As has already been hinted at, store the canonicalised filename in a database along with a randomly chosen and unique name which you'll use as the actual filename.
Using OWASP ESAPI, these names could be generated thus:
$userFilename = ESAPI::getEncoder()->canonicalize($input_string);
$safeFilename = ESAPI::getRandomizer()->getRandomFilename();
You could append a timestamp to the $safeFilename to help ensure that the randomly generated filename is unique without even checking for an existing file.
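If ESAPI is not available, a similar timestamp-plus-random name can be produced with PHP built-ins alone (PHP 7 or later for random_bytes()). This is only a stand-in sketch: the function name and the "<timestamp>_<random>" layout are assumptions, and the extension passed in is expected to have been validated elsewhere.

```php
<?php
// Builds a storage name that does not depend on user input.
function random_storage_name(string $extension = ''): string
{
    // Unix timestamp plus 32 hex characters from a CSPRNG
    $name = time() . '_' . bin2hex(random_bytes(16));

    return $extension === '' ? $name : $name . '.' . $extension;
}

echo random_storage_name('jpg'), "\n";
```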
In terms of encoding for URL, and again using ESAPI:
$safeForURL = ESAPI::getEncoder()->encodeForURL($input_string);
This method performs canonicalisation before encoding the string and will handle all character encodings.
Answered by John Magnolia
I have adapted this from another source and added a couple of extras; maybe a little overkill.
/**
 * Convert a string into a url safe address.
 *
 * @param string $unformatted
 * @return string
 */
public function formatURL($unformatted) {
    $url = strtolower(trim($unformatted));
    //replace accent characters, foreign languages
    $search = array('À', 'Á', 'Â', 'Ã', 'Ä', 'Å', 'Æ', 'Ç', 'È', 'É', 'Ê', 'Ë', 'Ì', 'Í', 'Î', 'Ï', 'Ð', 'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', 'Ø', 'Ù', 'Ú', 'Û', 'Ü', 'Ý', 'ß', 'à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ñ', 'ò', 'ó', 'ô', 'õ', 'ö', 'ø', 'ù', 'ú', 'û', 'ü', 'ý', 'ÿ', 'Ā', 'ā', 'Ă', 'ă', 'Ą', 'ą', 'Ć', 'ć', 'Ĉ', 'ĉ', 'Ċ', 'ċ', 'Č', 'č', 'Ď', 'ď', 'Đ', 'đ', 'Ē', 'ē', 'Ĕ', 'ĕ', 'Ė', 'ė', 'Ę', 'ę', 'Ě', 'ě', 'Ĝ', 'ĝ', 'Ğ', 'ğ', 'Ġ', 'ġ', 'Ģ', 'ģ', 'Ĥ', 'ĥ', 'Ħ', 'ħ', 'Ĩ', 'ĩ', 'Ī', 'ī', 'Ĭ', 'ĭ', 'Į', 'į', 'İ', 'ı', 'Ĳ', 'ĳ', 'Ĵ', 'ĵ', 'Ķ', 'ķ', 'Ĺ', 'ĺ', 'Ļ', 'ļ', 'Ľ', 'ľ', 'Ŀ', 'ŀ', 'Ł', 'ł', 'Ń', 'ń', 'Ņ', 'ņ', 'Ň', 'ň', 'ŉ', 'Ō', 'ō', 'Ŏ', 'ŏ', 'Ő', 'ő', 'Œ', 'œ', 'Ŕ', 'ŕ', 'Ŗ', 'ŗ', 'Ř', 'ř', 'Ś', 'ś', 'Ŝ', 'ŝ', 'Ş', 'ş', 'Š', 'š', 'Ţ', 'ţ', 'Ť', 'ť', 'Ŧ', 'ŧ', 'Ũ', 'ũ', 'Ū', 'ū', 'Ŭ', 'ŭ', 'Ů', 'ů', 'Ű', 'ű', 'Ų', 'ų', 'Ŵ', 'ŵ', 'Ŷ', 'ŷ', 'Ÿ', 'Ź', 'ź', 'Ż', 'ż', 'Ž', 'ž', 'ſ', 'ƒ', 'Ơ', 'ơ', 'Ư', 'ư', 'Ǎ', 'ǎ', 'Ǐ', 'ǐ', 'Ǒ', 'ǒ', 'Ǔ', 'ǔ', 'Ǖ', 'ǖ', 'Ǘ', 'ǘ', 'Ǚ', 'ǚ', 'Ǜ', 'ǜ', 'Ǻ', 'ǻ', 'Ǽ', 'ǽ', 'Ǿ', 'ǿ');
    $replace = array('A', 'A', 'A', 'A', 'A', 'A', 'AE', 'C', 'E', 'E', 'E', 'E', 'I', 'I', 'I', 'I', 'D', 'N', 'O', 'O', 'O', 'O', 'O', 'O', 'U', 'U', 'U', 'U', 'Y', 's', 'a', 'a', 'a', 'a', 'a', 'a', 'ae', 'c', 'e', 'e', 'e', 'e', 'i', 'i', 'i', 'i', 'n', 'o', 'o', 'o', 'o', 'o', 'o', 'u', 'u', 'u', 'u', 'y', 'y', 'A', 'a', 'A', 'a', 'A', 'a', 'C', 'c', 'C', 'c', 'C', 'c', 'C', 'c', 'D', 'd', 'D', 'd', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'G', 'g', 'G', 'g', 'G', 'g', 'G', 'g', 'H', 'h', 'H', 'h', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'IJ', 'ij', 'J', 'j', 'K', 'k', 'L', 'l', 'L', 'l', 'L', 'l', 'L', 'l', 'l', 'l', 'N', 'n', 'N', 'n', 'N', 'n', 'n', 'O', 'o', 'O', 'o', 'O', 'o', 'OE', 'oe', 'R', 'r', 'R', 'r', 'R', 'r', 'S', 's', 'S', 's', 'S', 's', 'S', 's', 'T', 't', 'T', 't', 'T', 't', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'W', 'w', 'Y', 'y', 'Y', 'Z', 'z', 'Z', 'z', 'Z', 'z', 's', 'f', 'O', 'o', 'U', 'u', 'A', 'a', 'I', 'i', 'O', 'o', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'A', 'a', 'AE', 'ae', 'O', 'o');
    $url = str_replace($search, $replace, $url);
    //replace common characters
    $search = array('&', '£', '$');
    $replace = array('and', 'pounds', 'dollars');
    $url = str_replace($search, $replace, $url);
    // replace spaces and union characters with -
    // (double quotes so that "\r\n" and "\n" are real line breaks)
    $find = array(' ', '&', "\r\n", "\n", '+', ',', '//');
    $url = str_replace($find, '-', $url);
    //delete and replace rest of special chars
    $find = array('/[^a-z0-9\-<>]/', '/[\-]+/', '/<[^>]*>/');
    $replace = array('', '-', '');
    $uri = preg_replace($find, $replace, $url);
    return $uri;
}
Answered by cedric.walter
And this is the Joomla 3.3.2 version of JFile::makeSafe($file):
public static function makeSafe($file)
{
    // Remove any trailing dots, as those aren't ever valid file names.
    $file = rtrim($file, '.');
    $regex = array('#(\.){2,}#', '#[^A-Za-z0-9\.\_\- ]#', '#^\.#');
    return trim(preg_replace($regex, '', $file));
}
Answered by Motin
I recommend* URLify for PHP (480+ stars on GitHub): "the PHP port of URLify.js from the Django project. Transliterates non-ascii characters for use in URLs".
Basic usage:
To generate slugs for URLs:
<?php
echo URLify::filter (' J\'étudie le français ');
// "jetudie-le-francais"
echo URLify::filter ('Lo siento, no hablo español.');
// "lo-siento-no-hablo-espanol"
?>
To generate slugs for file names:
<?php
echo URLify::filter ('фото.jpg', 60, "", true);
// "foto.jpg"
?>
*None of the other suggestions matched my criteria:
- Should be installable via composer
- Should not depend on iconv since it behaves differently on different systems
- Should be extendable to allow overrides and custom character replacements
- Popular (for instance many stars on Github)
- Has tests
As a bonus, URLify also removes certain words and strips away all characters not transliterated.
Here is a test case with tons of foreign characters being transliterated properly using URLify: https://gist.github.com/motin/a65e6c1cc303e46900d10894bf2da87f

