Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/8781911/

Time: 2020-08-26 05:29:52  Source: igfitidea

Remove non-ascii characters from string

php

Asked by LordZardeck

I'm getting strange characters when pulling data from a website:

?

How can I remove anything that isn't a non-extended ASCII character?

Answered by Chris Bornhoft

A regex replace would be the best option. Using $str as an example string and matching it against :print:, which is a POSIX character class:

$str = 'aAÂ'; // 'Â' stands in for any non-ASCII character
$str = preg_replace('/[[:^print:]]/', '', $str); // should be aA

What :print: does is look for all printable characters. The reverse, :^print:, looks for all non-printable characters. Any characters that are not part of the current character set will be removed.

Note: Before using this method, you must ensure that your current character set is ASCII. POSIX character classes support both ASCII and Unicode and will match only according to the current character set. As of PHP 5.6, the default charset is UTF-8.
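
One side effect worth keeping in mind: [[:^print:]] also removes control characters such as tabs and line breaks, because they are not printable. The following is a minimal sketch, assuming you want to drop non-ASCII bytes but keep that whitespace. The helper name strip_non_ascii_keep_whitespace and the sample string are hypothetical and not part of the answer.

```php
<?php
// Hypothetical helper (not from the answer): drops every byte outside
// printable 7-bit ASCII, but keeps tab, line feed, and carriage return.
function strip_non_ascii_keep_whitespace(string $input): string
{
    // No /u modifier: the pattern works on raw bytes, so each byte of a
    // multi-byte UTF-8 sequence (all of them are >= 0x80) is removed.
    return (string) preg_replace('/[^\x09\x0A\x0D\x20-\x7E]/', '', $input);
}

// "café" in UTF-8 plus a stray NUL byte; the tab and newline survive.
$sample = "line1\tok\nline2 caf\xC3\xA9\x00";
echo strip_non_ascii_keep_whitespace($sample);
```

Spelling out the byte ranges avoids any dependence on the current locale or character set, at the cost of being less readable than the POSIX class.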

Answered by DamirR

Do you want only printable ASCII characters? Use this:

<?php
header('Content-Type: text/html; charset=UTF-8');
$str = "abqwrešđčžsff";
$res = preg_replace('/[^\x20-\x7E]/','', $str);
echo "($str)($res)";

Or even better, convert your input to UTF-8 and use the phputf8 lib to translate 'not normal' characters into their ASCII representation:

require_once('libs/utf8/utf8.php');
require_once('libs/utf8/utils/bad.php');
require_once('libs/utf8/utils/validation.php');
require_once('libs/utf8_to_ascii/utf8_to_ascii.php');

if(!utf8_is_valid($str))
{
  $str=utf8_bad_strip($str);
}

$str = utf8_to_ascii($str, '' );

Answered by Utopia

$clearstring=filter_var($rawstring, FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_HIGH);
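
Note that FILTER_SANITIZE_STRING also strips tags and encodes quotes, and it is deprecated as of PHP 8.1. Below is a minimal sketch of an alternative, assuming only the high (non-ASCII) bytes should go: FILTER_UNSAFE_RAW with the same flag leaves everything else untouched. The sample input is made up for illustration.

```php
<?php
// Illustrative input: "café" encoded as UTF-8, followed by some markup.
$rawstring = "caf\xC3\xA9 <b>menu</b>";

// FILTER_UNSAFE_RAW changes nothing by itself;
// FILTER_FLAG_STRIP_HIGH removes every byte above 127.
$clearstring = filter_var($rawstring, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_HIGH);

echo $clearstring; // caf <b>menu</b>
```

Adding FILTER_FLAG_STRIP_LOW (combined with |) would also remove control characters below 32.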

Answered by Silas Palmer

Somewhat related: we had a web application that had to send data to a legacy system that could only handle 7-bit ASCII (the 128 characters with codes 0 to 127).

The solution we had to use was something that would "translate" as many characters as possible into close-matching ASCII equivalents, but leave alone anything that could not be translated.

Normally I would do something like this:

<?php
// transliterate
if (function_exists('iconv')) {
    $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
}
?>

... but that replaces everything that can't be translated with a question mark (?).

So we ended up doing the following. See the end of this function for a (commented-out) PHP regex that simply strips out non-ASCII characters.

<?php
public function cleanNonAsciiCharactersInString($orig_text) {

    $text = $orig_text;

    // Single letters
    $text = preg_replace("/[??α?áàa?a?]/u",      "a", $text);
    $text = preg_replace("/[?лДΛдАáà???]/u",     "A", $text);
    $text = preg_replace("/[?ЪЬБъь]/u",           "b", $text);
    $text = preg_replace("/[βвВ]/u",            "B", $text);
    $text = preg_replace("/[???с]/u",            "c", $text);
    $text = preg_replace("/[?С]/u",              "C", $text);        
    $text = preg_replace("/[δ?]/u",             "d", $text);
    $text = preg_replace("/[éèê???èε?е?ё?эЭ]/u", "e", $text);
    $text = preg_replace("/[éèê?ξ?Е∑]/u",     "E", $text);
    $text = preg_replace("/[?]/u",               "F", $text);
    $text = preg_replace("/[Нн??]/u",           "H", $text);
    $text = preg_replace("/[???]/u",            "h", $text);
    $text = preg_replace("/[íì??]/u",           "I", $text);
    $text = preg_replace("/[íì??ι???]/u",       "i", $text);
    $text = preg_replace("/[??]/u",             "j", $text);
    $text = preg_replace("/[Κ?К]/u",            'K', $text);
    $text = preg_replace("/[?к]/u",             'k', $text);
    $text = preg_replace("/[?∟]/u",             'l', $text);
    $text = preg_replace("/[Мм]/u",             "M", $text);
    $text = preg_replace("/[?η?ηπ?]/u",            "n", $text);
    $text = preg_replace("/[?∏пПИЙийΝЛ]/u",       "N", $text);
    $text = preg_replace("/[óò??o?ο??Фσ?о]/u", "o", $text);
    $text = preg_replace("/[óò???θΩθО?]/u",     "O", $text);
    $text = preg_replace("/[ρφрРф]/u",          "p", $text);
    $text = preg_replace("/[?яЯ]/u",              "R", $text); 
    $text = preg_replace("/[Г?г?]/u",              "r", $text); 
    $text = preg_replace("/[?]/u",              "S", $text);
    $text = preg_replace("/[?]/u",              "s", $text);
    $text = preg_replace("/[Тт]/u",              "T", $text);
    $text = preg_replace("/[τ??]/u",              "t", $text);
    $text = preg_replace("/[úù?ü?μ?μυ??]/u",     "u", $text);
    $text = preg_replace("/[√]/u",               "v", $text);
    $text = preg_replace("/[úù?ü?Цц]/u",         "U", $text);
    $text = preg_replace("/[Ψψω????щш?]/u",      "w", $text);
    $text = preg_replace("/[???ШЩ]/u",          "W", $text);
    $text = preg_replace("/[ΧχЖХж]/u",          "x", $text);
    $text = preg_replace("/[??¥]/u",           "Y", $text);
    $text = preg_replace("/[?γ??Ууч]/u",       "y", $text);
    $text = preg_replace("/[ζ]/u",              "Z", $text);

    // Punctuation
    $text = preg_replace("/[????]/u", ",", $text);        
    $text = preg_replace("/[`?′'‘]/u", "'", $text);
    $text = preg_replace("/[″“”???]/u", '"', $text);
    $text = preg_replace("/[—–―?– ̄?─?→←]/u", '-', $text);
    $text = preg_replace("/[  ]/u", ' ', $text);

    $text = str_replace("…", "...", $text);
    $text = str_replace("≠", "!=", $text);
    $text = str_replace("≤", "<=", $text);
    $text = str_replace("≥", ">=", $text);
    $text = preg_replace("/[?≈≡]/u", "=", $text);


    // Exciting combinations    
    $text = str_replace("ыЫ", "bl", $text);
    $text = str_replace("℅", "c/o", $text);
    $text = str_replace("?", "Pts", $text);
    $text = str_replace("?", "tm", $text);
    $text = str_replace("№", "No", $text);        
    $text = str_replace("Ч", "4", $text);                
    $text = str_replace("‰", "%", $text);
    $text = preg_replace("/[??]/u", "*", $text);
    $text = str_replace("?", "<", $text);
    $text = str_replace("?", ">", $text);
    $text = str_replace("?", "!!", $text);
    $text = str_replace("?", "/", $text);
    $text = str_replace("∕", "/", $text);
    $text = str_replace("?", "7/8", $text);
    $text = str_replace("?", "5/8", $text);
    $text = str_replace("?", "3/8", $text);
    $text = str_replace("?", "1/8", $text);        
    $text = preg_replace("/[‰]/u", "%", $text);
    $text = preg_replace("/[??]/u", "Ab", $text);
    $text = preg_replace("/[Юю]/u", "IO", $text);
    $text = preg_replace("/[????]/u", "fi", $text);
    $text = preg_replace("/[зЗ]/u", "3", $text); 
    $text = str_replace("£", "(pounds)", $text);
    $text = str_replace("?", "(lira)", $text);
    $text = preg_replace("/[‰]/u", "%", $text);
    $text = preg_replace("/[??↓↑│]/u", "|", $text);
    $text = preg_replace("/[∞∩∫???]/u", "", $text);


    //2) Translation CP1252.
    $trans = get_html_translation_table(HTML_ENTITIES);
    $trans['f'] = '&fnof;';    // Latin Small Letter F With Hook
    $trans['-'] = array(
        '&hellip;',     // Horizontal Ellipsis
        '&tilde;',      // Small Tilde
        '&ndash;'       // Dash
        );
    $trans["+"] = '&dagger;';    // Dagger
    $trans['#'] = '&Dagger;';    // Double Dagger         
    $trans['M'] = '&permil;';    // Per Mille Sign
    $trans['S'] = '&Scaron;';    // Latin Capital Letter S With Caron        
    $trans['OE'] = '&OElig;';    // Latin Capital Ligature OE
    $trans["'"] = array(
        '&lsquo;',  // Left Single Quotation Mark
        '&rsquo;',  // Right Single Quotation Mark
        '&rsaquo;', // Single Right-Pointing Angle Quotation Mark
        '&sbquo;',  // Single Low-9 Quotation Mark
        '&circ;',   // Modifier Letter Circumflex Accent
        '&lsaquo;'  // Single Left-Pointing Angle Quotation Mark
        );

    $trans['"'] = array(
        '&ldquo;',  // Left Double Quotation Mark
        '&rdquo;',  // Right Double Quotation Mark
        '&bdquo;',  // Double Low-9 Quotation Mark
        );

    $trans['*'] = '&bull;';    // Bullet
    $trans['n'] = '&ndash;';    // En Dash
    $trans['m'] = '&mdash;';    // Em Dash        
    $trans['tm'] = '&trade;';    // Trade Mark Sign
    $trans['s'] = '&scaron;';    // Latin Small Letter S With Caron
    $trans['oe'] = '&oelig;';    // Latin Small Ligature OE
    $trans['Y'] = '&Yuml;';    // Latin Capital Letter Y With Diaeresis
    $trans['euro'] = '&euro;';    // euro currency symbol
    ksort($trans);

    foreach ($trans as $k => $v) {
        $text = str_replace($v, $k, $text);
    }

    // 3) remove <p>, <br/> ...
    $text = strip_tags($text);

    // 4) &amp; => &, &quot; => "
    $text = html_entity_decode($text);


    // transliterate
    // if (function_exists('iconv')) {
    // $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
    // }

    // remove non ascii characters
    // $text =  preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $text);      

    return $text;
}

?>
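
For comparison, here is a much shorter sketch of the same idea that leans on the intl extension instead of hand-written tables. This is not the author's method. It assumes intl may or may not be installed, and unlike the function above it drops whatever cannot be transliterated instead of leaving it alone. The helper name to_ascii_sketch is hypothetical.

```php
<?php
// Hypothetical helper, not part of the answer above.
function to_ascii_sketch(string $text): string
{
    // Assumption: the intl extension is optional, so check before calling it.
    if (function_exists('transliterator_transliterate')) {
        // ICU compound transform: convert any script to Latin, then Latin to ASCII.
        $converted = transliterator_transliterate('Any-Latin; Latin-ASCII', $text);
        if ($converted !== false) {
            $text = $converted;
        }
    }

    // Whatever is still outside printable ASCII (tab, LF, CR excepted) is dropped.
    return (string) preg_replace('/[^\x09\x0A\x0D\x20-\x7E]/', '', $text);
}

echo to_ascii_sketch("Crème brûlée, Ελλάδα");
```

The exact transliterations depend on the ICU version, so the output is best checked against your own data before relying on it.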

Answered by simhumileco

I also think that the best solution might be to use a regular expression.

Here's my suggestion:

function convert_to_normal_text($text) {

    $normal_characters = "a-zA-Z0-9\s`~!@#$%^&*()_+-={}|:;<>?,.\/\"\'\\[\]";
    $normal_text = preg_replace("/[^$normal_characters]/", '', $text);

    return $normal_text;
}

Then you can use it like this:

$before = 'Some "normal characters": Abc123!+, some ASCII characters: ABC+?? and some non-ASCII characters: ?????.';
$after = convert_to_normal_text($before);
echo $after;

Displays:

Some "normal characters": Abc123!+, some ASCII characters: ABC+ and some non-ASCII characters: .

Answered by ALHaines

I just had to add this header:

header('Content-Type: text/html; charset=UTF-8');

Answered by Goran Jakovljevic

This should be pretty straightforward, with no need for the iconv function:

// Remove all characters that are not the separator, a-z, 0-9, or whitespace
$string = preg_replace('![^'.preg_quote('-').'a-z0-9\s]+!', '', strtolower($string));
// Replace all separator characters and whitespace by a single separator
$string = preg_replace('!['.preg_quote('-').'\s]+!u', '-', $string);
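
The two steps above can be wrapped into a small reusable function. This is a minimal sketch under stated assumptions: the function name slugify_sketch, the $separator parameter, and the final trim() are additions for illustration and not part of the answer.

```php
<?php
// Hypothetical wrapper around the two preg_replace() steps.
function slugify_sketch(string $string, string $separator = '-'): string
{
    // Escape the separator so it is safe inside a character class
    // and inside a pattern that uses "!" as its delimiter.
    $quoted = preg_quote($separator, '!');

    // Step 1: drop everything that is not the separator, a-z, 0-9, or whitespace.
    $string = (string) preg_replace('![^' . $quoted . 'a-z0-9\s]+!', '', strtolower($string));

    // Step 2: collapse runs of separators and whitespace into one separator.
    $string = (string) preg_replace('![' . $quoted . '\s]+!', $separator, $string);

    // Assumption: leading and trailing separators are unwanted.
    return trim($string, $separator);
}

echo slugify_sketch("Hello,  Wörld -- 2024!"); // hello-wrld-2024
```

Note that non-ASCII letters such as "ö" are removed and not transliterated, which is why "Wörld" becomes "wrld".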

Answered by websolutions.gr

I think the best way to do something like this is to use the ord() function. This way you will be able to keep characters written in any language, as long as the text is in a single-byte charset. Just remember to test your text's ord() results first. This will not work on Unicode, because ord() only looks at one byte at a time.

$name = "βγδεζηΘKgfgebhjrf!@#$%^&";

// Removes every character that is not Greek or English.
// Expects a string in the single-byte Greek ISO charset (ISO-8859-7), not UTF-8.
function replace_characters($string)
{
    $new_string = ''; // initialize so the concatenation below never hits an undefined variable
    $str_length = strlen($string);
    for ($x = 0; $x < $str_length; $x++) {
        $character = $string[$x];
        $code = ord($character);
        if (($code > 64 && $code < 91)          // A-Z
            || ($code > 96 && $code < 123)      // a-z
            || ($code > 192 && $code < 210)     // Greek capital letters
            || ($code > 210 && $code < 218)
            || ($code > 219 && $code < 250)     // Greek small letters
            || $code == 252 || $code == 254) {
            $new_string .= $character;
        }
    }
    return $new_string;
}
// end function

$name = replace_characters($name);

echo $name;
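
For UTF-8 input, where ord() cannot be used this way, a Unicode-aware regex can express the same filter. This is a minimal sketch, assuming the source file and the string are UTF-8; the helper name keep_greek_and_english is hypothetical and not part of the answer.

```php
<?php
// Hypothetical helper: keeps Greek and English letters in a UTF-8 string.
function keep_greek_and_english(string $string): string
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8;
    // \p{Greek} matches any character of the Greek script.
    return (string) preg_replace('/[^\p{Greek}a-zA-Z]/u', '', $string);
}

// Single quotes keep "$" literal in the sample string.
echo keep_greek_and_english('βγδεζηΘKgfgebhjrf!@#$%^&'); // βγδεζηΘKgfgebhjrf
```

If the input is not valid UTF-8, preg_replace() returns null under /u, which the cast turns into an empty string, so validating the encoding first may be worthwhile.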