Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/14114411/

Time: 2020-08-25 06:40:05  Source: igfitidea

Remove all special characters from a string

php regex url slug

Asked by user115422

I am facing an issue with URLs. I want to convert titles that could contain anything into strings stripped of all special characters, so that they contain only letters and numbers, and of course I would like to replace spaces with hyphens.

How would this be done? I've heard a lot about regular expressions (regex) being used...

Answered by Terry Harvey

This should do what you're looking for:

function clean($string) {
   $string = str_replace(' ', '-', $string); // Replaces all spaces with hyphens.

   return preg_replace('/[^A-Za-z0-9\-]/', '', $string); // Removes special chars.
}

Usage:

echo clean('a|"bc!@£de^&$f g');

Will output: abcdef-g

Edit:

Hey, just a quick question: how can I prevent multiple hyphens from being next to each other, and have them replaced with just one?

function clean($string) {
   $string = str_replace(' ', '-', $string); // Replaces all spaces with hyphens.
   $string = preg_replace('/[^A-Za-z0-9\-]/', '', $string); // Removes special chars.

   return preg_replace('/-+/', '-', $string); // Replaces multiple hyphens with single one.
}
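
One thing the version above does not cover is a title that starts or ends with a space or a special character, which leaves a hyphen at the edge of the slug. A minimal sketch of a variant that also trims those, assuming that is what you want; the name clean_trimmed() is made up for this example and is not part of the answer:

```php
<?php
// Hypothetical variant of clean(); the function name is illustrative only.
function clean_trimmed(string $string): string {
    $string = preg_replace('/\s+/', '-', $string);          // Any run of whitespace becomes one hyphen.
    $string = preg_replace('/[^A-Za-z0-9-]/', '', $string); // Drop everything except letters, digits and hyphens.
    $string = preg_replace('/-+/', '-', $string);           // Collapse repeated hyphens into one.

    return trim($string, '-');                              // Strip hyphens left at either end.
}

echo clean_trimmed('  Hello,   World! -- 100% '), "\n"; // Hello-World-100
```

Using \s+ in the first step is a choice made here so that tabs and newlines are handled like spaces; str_replace(' ', '-', ...) as in the answer only touches plain spaces.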

Answered by LSerni

Update

The solution below has an "SEO friendlier" version:

function hyphenize($string) {
    $dict = array(
        "I'm"      => "I am",
        "thier"    => "their",
        // Add your own replacements here
    );
    return strtolower(
        preg_replace(
          array( '#[\s-]+#', '#[^A-Za-z0-9. -]+#' ),
          array( '-', '' ),
          // the full cleanString() can be downloaded from http://www.unexpectedit.com/php/php-clean-string-of-utf8-chars-convert-to-similar-ascii-char
          cleanString(
              str_replace( // preg_replace can be used to support more complicated replacements
                  array_keys($dict),
                  array_values($dict),
                  urldecode($string)
              )
          )
        )
    );
}

function cleanString($text) {
    // Maps accented and typographic UTF-8 characters to plain ASCII look-alikes.
    $utf8 = array(
        '/[áàâãªä]/u'   =>   'a',
        '/[ÁÀÂÃÄ]/u'    =>   'A',
        '/[ÍÌÎÏ]/u'     =>   'I',
        '/[íìîï]/u'     =>   'i',
        '/[éèêë]/u'     =>   'e',
        '/[ÉÈÊË]/u'     =>   'E',
        '/[óòôõºö]/u'   =>   'o',
        '/[ÓÒÔÕÖ]/u'    =>   'O',
        '/[úùûü]/u'     =>   'u',
        '/[ÚÙÛÜ]/u'     =>   'U',
        '/ç/'           =>   'c',
        '/Ç/'           =>   'C',
        '/ñ/'           =>   'n',
        '/Ñ/'           =>   'N',
        '/–/'           =>   '-', // UTF-8 en dash to "normal" hyphen
        '/[’‘‹›‚]/u'    =>   ' ', // Typographic single quotes
        '/[“”«»„]/u'    =>   ' ', // Typographic double quotes
        '/\x{00A0}/u'   =>   ' ', // Non-breaking space (U+00A0, decimal 160)
    );
    return preg_replace(array_keys($utf8), array_values($utf8), $text);
}

The rationale for the above functions (which I find way inefficient; the one below is better) is that a service that shall not be named apparently ran spelling checks and keyword recognition on the URLs.

After losing a long time on a customer's paranoias, I found out they were not imagining things after all -- their SEO experts [I am definitely not one] reported that, say, converting "Viaggi Economy Perù" to viaggi-economy-peru "behaved better" than viaggi-economy-per (the previous "cleaning" removed UTF8 characters; Bogotà became bogot, Medellìn became medelln, and so on).
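
To make the problem concrete, here is a small sketch of what an ASCII-only cleanup does to those titles. The function name strip_non_ascii_slug() is made up for this illustration; it only mimics the kind of "cleaning" described, not the customer's actual code:

```php
<?php
// Hypothetical stand-in for the earlier "cleaning" step: anything that is not
// an ASCII letter, digit or hyphen is simply dropped.
function strip_non_ascii_slug(string $title): string {
    $slug = preg_replace('/[\s-]+/', '-', $title);       // Whitespace and hyphen runs become one hyphen.
    $slug = preg_replace('/[^A-Za-z0-9-]+/', '', $slug); // Accented letters are removed, not converted.

    return strtolower($slug);
}

echo strip_non_ascii_slug('Viaggi Economy Perù'), "\n"; // viaggi-economy-per
echo strip_non_ascii_slug('Bogotà'), "\n";              // bogot
echo strip_non_ascii_slug('Medellìn'), "\n";            // medelln
```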

There were also some common misspellings that seemed to influence the results, and the only explanation that made sense to me is that our URLs were being unpacked, the words singled out, and used to drive God knows what ranking algorithms. And those algorithms apparently had been fed with UTF8-cleaned strings, so that "Perù" became "Peru" instead of "Per". "Per" did not match and sort of took it in the neck.

In order to both keep UTF8 characters and replace some misspellings, the faster function below became the more accurate (?) function above. $dict needs to be hand tailored, of course.
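
A caveat with a plain str_replace() dictionary is that it also rewrites matches inside longer words ("clothier" would become "clotheir"). A rough sketch of the preg_replace-style alternative hinted at in the code comment, using word boundaries; apply_dictionary() is a made-up helper name, not something from the answer:

```php
<?php
// Hypothetical helper: replaces dictionary keys only when they appear as whole words.
function apply_dictionary(string $text, array $dict): string {
    foreach ($dict as $wrong => $right) {
        // \b anchors the match at word boundaries; preg_quote() escapes any
        // regex metacharacters that a dictionary key might contain.
        $pattern = '/\b' . preg_quote($wrong, '/') . '\b/';

        // A callback is used so that "$" or "\" in the replacement is taken literally.
        $text = preg_replace_callback($pattern, function () use ($right) {
            return $right;
        }, $text);
    }

    return $text;
}

$dict = array("I'm" => "I am", "thier" => "their");

echo apply_dictionary("I'm sure thier clothier is fine", $dict), "\n";
// I am sure their clothier is fine
```

This runs one regex per dictionary entry, so it is slower than str_replace(); whether that matters depends on how large $dict gets.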

Previous answer

A simple approach:

// Remove all characters except A-Z, a-z, 0-9, dots, hyphens and spaces
// Note that the hyphen must go last so that it is not read as a range (as in A-Z)
// and the dot, NOT being special inside a character class (I know. My life was a lie), is NOT escaped

$str = preg_replace('/[^A-Za-z0-9. -]/', '', $str);

// Replace sequences of spaces with hyphen
$str = preg_replace('/  */', '-', $str);

// The above means "a space, followed by a space repeated zero or more times"
// (should be equivalent to / +/)

// You may also want to try this alternative:
$str = preg_replace('/\s+/', '-', $str);

// where \s+ means "one or more whitespaces" (a space is not necessarily the
// same as a whitespace) just to be sure and include everything
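
The difference between the two patterns only shows up when the input contains tabs or newlines. A quick sketch to check it; the sample strings are invented for the demonstration:

```php
<?php
$input = "a  b\tc\nd"; // two spaces, then a tab, then a newline

// '/ +/' collapses runs of literal space characters only.
$spacesOnly = preg_replace('/ +/', '-', $input);

// '/\s+/' collapses runs of any whitespace: spaces, tabs, newlines.
$anyWhitespace = preg_replace('/\s+/', '-', $input);

var_dump($spacesOnly === "a-b\tc\nd");  // bool(true): tab and newline survive
var_dump($anyWhitespace === 'a-b-c-d'); // bool(true)

// Inside a character class the trailing hyphen and the dot are both literal.
$kept = preg_replace('/[^A-Za-z0-9. -]/', '', 'v1.2_beta-3!');
echo $kept, "\n"; // v1.2beta-3
```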

Note that you might have to first urldecode() the URL, since %20 and + both are actually spaces. I mean, if you have "Never%20gonna%20give%20you%20up" you want it to become Never-gonna-give-you-up, not Never20gonna20give20you20up. You might not need it, but I thought I'd mention the possibility.
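
For reference, a short sketch of the decoding step on its own. Both functions are standard PHP; the only difference shown is that urldecode() turns + into a space while rawurldecode() leaves it alone. The input string is made up to contain both encodings:

```php
<?php
$encoded = 'Never%20gonna+give%20you%20up';

$decoded    = urldecode($encoded);    // %20 and + both become spaces
$rawDecoded = rawurldecode($encoded); // only %20 becomes a space

echo $decoded, "\n";    // Never gonna give you up
echo $rawDecoded, "\n"; // Never gonna+give you up

// Skipping the decoding step leaves the digits of "%20" in the slug.
$undecoded = preg_replace('/[^A-Za-z0-9. -]/', '', 'Never%20gonna%20give%20you%20up');
echo $undecoded, "\n";  // Never20gonna20give20you20up
```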

So the finished function along with test cases:

function hyphenize($string) {
    return 
    ## strtolower(
          preg_replace(
            array('#[\s-]+#', '#[^A-Za-z0-9. -]+#'),
            array('-', ''),
        ##     cleanString(
              urldecode($string)
        ##     )
        )
    ## )
    ;
}

print implode("\n", array_map(
    function($s) {
            return $s . ' becomes ' . hyphenize($s);
    },
    array(
    'Never%20gonna%20give%20you%20up',
    "I'm not the man I was",
    "'Légeresse', dit sa majesté",
    )));


Output with the ## lines uncommented (strtolower() and cleanString() enabled):

Never%20gonna%20give%20you%20up    becomes  never-gonna-give-you-up
I'm not the man I was              becomes  im-not-the-man-i-was
'Légeresse', dit sa majesté        becomes  legeresse-dit-sa-majeste

To handle UTF-8 I used a cleanString implementation found online (the link has since broken, but a stripped-down copy with all the not-too-esoteric UTF8 characters is at the beginning of the answer; it's also easy to add more characters to it if you need) that converts UTF8 characters to normal characters, thus preserving the word "look" as much as possible. It could be simplified and wrapped inside the function here for performance.
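
As one possible way to simplify it, the per-character regexes could be swapped for a single strtr() lookup table, which makes one pass over the string. This is a sketch of a different technique, not the cleanString implementation referred to above; to_ascii() is a made-up name, the map is deliberately short, and it assumes both the source file and the input are UTF-8 with precomposed characters:

```php
<?php
// Hypothetical helper: one-pass transliteration with a lookup table.
function to_ascii(string $text): string {
    // Illustrative subset only; extend the map with whatever characters you need.
    static $map = array(
        'á' => 'a', 'à' => 'a', 'é' => 'e', 'è' => 'e',
        'í' => 'i', 'ì' => 'i', 'ó' => 'o', 'ò' => 'o',
        'ú' => 'u', 'ù' => 'u', 'ç' => 'c', 'ñ' => 'n',
    );

    // With an array argument, strtr() replaces each key it finds in a single scan.
    return strtr($text, $map);
}

echo to_ascii('Viaggi Economy Perù'), "\n"; // Viaggi Economy Peru
echo to_ascii('dit sa majesté'), "\n";      // dit sa majeste
```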

The function above also implements converting to lowercase, but that's a matter of taste. The code to do so has been commented out.

Answered by Jeffrey

Here, check out this function:

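function seo_friendly_url($string){
    // Note: this array holds ONE string, the literal sequence [', '] (not two separate brackets).
    $string = str_replace(array('[\', \']'), '', $string);
    // Remove bracketed segments such as "[draft]" (/U makes .* ungreedy).
    $string = preg_replace('/\[.*\]/U', '', $string);
    // Replace HTML entities already present in the input (e.g. &amp; or &#39;) with a hyphen.
    $string = preg_replace('/&(amp;)?#?[a-z0-9]+;/i', '-', $string);
    // Encode the remaining special characters as entities (é becomes &eacute;).
    $string = htmlentities($string, ENT_COMPAT, 'utf-8');
    // Keep only the base letter of accent-style entities (&eacute; becomes e).
    $string = preg_replace('/&([a-z])(acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig|quot|rsquo);/i', '\1', $string );
    // Turn every other non-alphanumeric character into a hyphen, then collapse hyphen runs.
    $string = preg_replace(array('/[^a-z0-9]/i', '/[-]+/') , '-', $string);
    // Trim hyphens from both ends and lowercase the result.
    return strtolower(trim($string, '-'));
}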