How to replace Microsoft-encoded quotes in PHP

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/1262038/

Date: 2020-08-25 01:47:49  Source: igfitidea

Tags: php, string, encoding, character-encoding

Asked by Misha M

I need to replace Microsoft Word's version of single and double quotation marks (“ ” ‘ ’) with regular quotes (' and ") due to an encoding issue in my application. I do not need them to be HTML entities and I cannot change my database schema.

I have two options: to use either a regular expression or an associative array.

Is there a better way to do this?

Answered by Justin Dominic

I have found an answer to this question. You need just one line of code using the iconv() function in PHP:

// Replace Microsoft Word's single and double quotation marks (“ ” ‘ ’) with regular quotes (' and ")
$output = iconv('UTF-8', 'ASCII//TRANSLIT', $input);
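
As a hedged sketch of how that one-liner might be wrapped for real input: iconv() returns false (and raises a notice) when the input is not valid in the source encoding, and what //TRANSLIT produces for a given character depends on the iconv implementation PHP was built against. The helper name to_ascii_or_null is hypothetical, and returning null on failure is just one possible choice.

```php
<?php
// Hypothetical helper: transliterate UTF-8 to ASCII, or return null on failure.
function to_ascii_or_null($input)
{
    // @ suppresses the notice iconv() raises on illegal input;
    // the false return value is still checked explicitly below.
    $output = @iconv('UTF-8', 'ASCII//TRANSLIT', $input);
    return $output === false ? null : $output;
}

// Plain ASCII passes through unchanged.
var_dump(to_ascii_or_null('plain text'));

// Input that is not valid UTF-8 (here: CP1252 quote bytes) may fail,
// in which case the caller gets null instead of a notice.
var_dump(to_ascii_or_null("\x93quoted\x94"));
```

Checking the return value matters because a failed conversion would otherwise silently replace the text with false.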

Answered by Pascal MARTIN

Considering you only want to replace a few specific and well-identified characters, I would go for str_replace with an array: you obviously don't need the heavy artillery that regex would bring you ;-)

And if you encounter some other special characters (damn copy-paste from Microsoft Word...), you can just add them to that array whenever necessary / whenever they are identified.

The best answer I can give to your comment is probably this link: Convert Smart Quotes with PHP

And the associated code (quoting that page):

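function convert_smart_quotes($string)
{
    // Single-byte Windows-1252 (CP1252) codes that Microsoft Word uses
    $search = array(chr(145),  // left single quote
                    chr(146),  // right single quote
                    chr(147),  // left double quote
                    chr(148),  // right double quote
                    chr(151)); // em dash

    $replace = array("'",
                     "'",
                     '"',
                     '"',
                     '-');

    return str_replace($search, $replace, $string);
}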

(I don't have Microsoft Word on this computer, so I can't test by myself)

I don't remember exactly what we used at work (I was not the one having to deal with that kind of input), but it was the same kind of stuff...
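
For what it's worth, here is a minimal usage sketch of that approach, assuming the input really is in a single-byte Windows-1252 (CP1252) encoding, which is what byte values such as chr(147) correspond to. The function name replace_cp1252_quotes and the sample string are made up for illustration.

```php
<?php
// Hypothetical wrapper around the str_replace() approach quoted above.
function replace_cp1252_quotes($string)
{
    $search  = array(chr(145), chr(146), chr(147), chr(148), chr(151));
    $replace = array("'", "'", '"', '"', '-');
    return str_replace($search, $replace, $string);
}

// "\x93" and "\x94" are the CP1252 double quotes, "\x92" the right single quote.
$cp1252 = "\x93It\x92s done\x94";
echo replace_cp1252_quotes($cp1252), "\n"; // prints: "It's done"
```

If the string is UTF-8 instead, these single bytes are not the quote characters, so this sketch should not be applied to it.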

Answered by Gumbo

Your Microsoft-encoded quotes are probably the typographic quotation marks. You can simply replace them with str_replace if you know the encoding of the string in which you want to replace them.

Here's an example for UTF-8 but using a single mapping array with strtr:

$quotes = array(
    "\xC2\xAB"     => '"', // « (U+00AB) in UTF-8
    "\xC2\xBB"     => '"', // » (U+00BB) in UTF-8
    "\xE2\x80\x98" => "'", // ‘ (U+2018) in UTF-8
    "\xE2\x80\x99" => "'", // ’ (U+2019) in UTF-8
    "\xE2\x80\x9A" => "'", // ‚ (U+201A) in UTF-8
    "\xE2\x80\x9B" => "'", // ‛ (U+201B) in UTF-8
    "\xE2\x80\x9C" => '"', // “ (U+201C) in UTF-8
    "\xE2\x80\x9D" => '"', // ” (U+201D) in UTF-8
    "\xE2\x80\x9E" => '"', // „ (U+201E) in UTF-8
    "\xE2\x80\x9F" => '"', // ‟ (U+201F) in UTF-8
    "\xE2\x80\xB9" => "'", // ‹ (U+2039) in UTF-8
    "\xE2\x80\xBA" => "'", // › (U+203A) in UTF-8
);
$str = strtr($str, $quotes);

If you need another encoding, you can use mb_convert_encoding to convert the keys.
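
A rough sketch of what converting the keys could look like, assuming the mbstring extension is available and the target encoding is Windows-1252. The helper name convert_quote_map is hypothetical. One caveat worth guarding against: characters that do not exist in the target encoding are replaced by mbstring's substitute character ("?" by default), and such keys should be skipped so that ordinary question marks are not rewritten.

```php
<?php
// UTF-8 keyed map (a subset of the array above).
$utf8Quotes = array(
    "\xE2\x80\x98" => "'", // U+2018
    "\xE2\x80\x99" => "'", // U+2019
    "\xE2\x80\x9C" => '"', // U+201C
    "\xE2\x80\x9D" => '"', // U+201D
);

// Hypothetical helper: re-key the map for another encoding.
function convert_quote_map(array $map, $targetEncoding)
{
    $converted = array();
    foreach ($map as $utf8Key => $replacement) {
        $key = mb_convert_encoding($utf8Key, $targetEncoding, 'UTF-8');
        // Skip characters the target encoding cannot represent
        // (mbstring substitutes "?" for them by default).
        if ($key === '?') {
            continue;
        }
        $converted[$key] = $replacement;
    }
    return $converted;
}

$cp1252Quotes = convert_quote_map($utf8Quotes, 'Windows-1252');

// CP1252 input: \x93 \x94 are double quotes, \x91 \x92 single quotes.
$str = "\x93Hello\x94 and \x91hi\x92";
echo strtr($str, $cp1252Quotes), "\n"; // prints: "Hello" and 'hi'
```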

Answered by thelastshadow

If like me you arrive here with an enormous range of broken ASCII / Microsoft Word characters that are doing weird things to your CMS or RTE and iconv isn't working, then this mad function might just be for you.

Make sure your encoding is UTF-8 when you save this function to a file.

<?php
    /**
     * fixMSWord
     *
     * Replace ASCII chars with UTF-8. Note there are ASCII characters that don't
     * correctly map and will be replaced by spaces.
     *
     * @author      Robin Cafolla
     * @date        2013-03-22
     */
    function fixMSWord($string) {
        $map = Array(
            '33' => '!', '34' => '"', '35' => '#', '36' => '$', '37' => '%', '38' => '&', '39' => "'", '40' => '(', '41' => ')', '42' => '*',
            '43' => '+', '44' => ',', '45' => '-', '46' => '.', '47' => '/', '48' => '0', '49' => '1', '50' => '2', '51' => '3', '52' => '4',
            '53' => '5', '54' => '6', '55' => '7', '56' => '8', '57' => '9', '58' => ':', '59' => ';', '60' => '<', '61' => '=', '62' => '>',
            '63' => '?', '64' => '@', '65' => 'A', '66' => 'B', '67' => 'C', '68' => 'D', '69' => 'E', '70' => 'F', '71' => 'G', '72' => 'H',
            '73' => 'I', '74' => 'J', '75' => 'K', '76' => 'L', '77' => 'M', '78' => 'N', '79' => 'O', '80' => 'P', '81' => 'Q', '82' => 'R',
            '83' => 'S', '84' => 'T', '85' => 'U', '86' => 'V', '87' => 'W', '88' => 'X', '89' => 'Y', '90' => 'Z', '91' => '[', '92' => '\\',
            '93' => ']', '94' => '^', '95' => '_', '96' => '`', '97' => 'a', '98' => 'b', '99' => 'c', '100'=> 'd', '101'=> 'e', '102'=> 'f',
            '103'=> 'g', '104'=> 'h', '105'=> 'i', '106'=> 'j', '107'=> 'k', '108'=> 'l', '109'=> 'm', '110'=> 'n', '111'=> 'o', '112'=> 'p',
            '113'=> 'q', '114'=> 'r', '115'=> 's', '116'=> 't', '117'=> 'u', '118'=> 'v', '119'=> 'w', '120'=> 'x', '121'=> 'y', '122'=> 'z',
            '123'=> '{', '124'=> '|', '125'=> '}', '126'=> '~', '127'=> ' ', '128'=> '&#8364;', '129'=> ' ', '130'=> ',', '131'=> ' ', '132'=> '"',
            '133'=> '.', '134'=> ' ', '135'=> ' ', '136'=> '^', '137'=> ' ', '138'=> ' ', '139'=> '<', '140'=> ' ', '141'=> ' ', '142'=> ' ',
            '143'=> ' ', '144'=> ' ', '145'=> "'", '146'=> "'", '147'=> '"', '148'=> '"', '149'=> '.', '150'=> '-', '151'=> '-', '152'=> '~',
            '153'=> ' ', '154'=> ' ', '155'=> '>', '156'=> ' ', '157'=> ' ', '158'=> ' ', '159'=> ' ', '160'=> ' ', '161'=> '¡', '162'=> '¢',
            '163'=> '£', '164'=> '¤', '165'=> '¥', '166'=> '¦', '167'=> '§', '168'=> '¨', '169'=> '©', '170'=> 'ª', '171'=> '«', '172'=> '¬',
            '173'=> '-', '174'=> '®', '175'=> '¯', '176'=> '°', '177'=> '±', '178'=> '²', '179'=> '³', '180'=> '´', '181'=> 'µ', '182'=> '¶',
            '183'=> '·', '184'=> '¸', '185'=> '¹', '186'=> 'º', '187'=> '»', '188'=> '¼', '189'=> '½', '190'=> '¾', '191'=> '¿', '192'=> 'À',
            '193'=> 'Á', '194'=> 'Â', '195'=> 'Ã', '196'=> 'Ä', '197'=> 'Å', '198'=> 'Æ', '199'=> 'Ç', '200'=> 'È', '201'=> 'É', '202'=> 'Ê',
            '203'=> 'Ë', '204'=> 'Ì', '205'=> 'Í', '206'=> 'Î', '207'=> 'Ï', '208'=> 'Ð', '209'=> 'Ñ', '210'=> 'Ò', '211'=> 'Ó', '212'=> 'Ô',
            '213'=> 'Õ', '214'=> 'Ö', '215'=> '×', '216'=> 'Ø', '217'=> 'Ù', '218'=> 'Ú', '219'=> 'Û', '220'=> 'Ü', '221'=> 'Ý', '222'=> 'Þ',
            '223'=> 'ß', '224'=> 'à', '225'=> 'á', '226'=> 'â', '227'=> 'ã', '228'=> 'ä', '229'=> 'å', '230'=> 'æ', '231'=> 'ç', '232'=> 'è',
            '233'=> 'é', '234'=> 'ê', '235'=> 'ë', '236'=> 'ì', '237'=> 'í', '238'=> 'î', '239'=> 'ï', '240'=> 'ð', '241'=> 'ñ', '242'=> 'ò',
            '243'=> 'ó', '244'=> 'ô', '245'=> 'õ', '246'=> 'ö', '247'=> '÷', '248'=> 'ø', '249'=> 'ù', '250'=> 'ú', '251'=> 'û', '252'=> 'ü',
            '253'=> 'ý', '254'=> 'þ', '255'=> 'ÿ'
        );

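        // Build a byte => replacement map. strtr() replaces every byte in a
        // single pass, so the UTF-8 bytes it emits are never matched again
        // (str_replace() with arrays works through the pairs in order and can
        // re-replace its own output).
        $pairs = Array();

        foreach ($map as $s => $r) {
            $pairs[chr((int)$s)] = $r;
        }

        return strtr($string, $pairs);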
    }

Answered by ceejayoz

We used the following. It deals with a few more special characters.

$text = str_replace(chr(130), ',', $text);    // Baseline single quote
$text = str_replace(chr(132), '"', $text);    // Baseline double quote
$text = str_replace(chr(133), '...', $text);  // Ellipsis
$text = str_replace(chr(145), "'", $text);    // Left single quote
$text = str_replace(chr(146), "'", $text);    // Right single quote
$text = str_replace(chr(147), '"', $text);    // Left double quote
$text = str_replace(chr(148), '"', $text);    // Right double quote

$text = mb_convert_encoding($text, 'HTML-ENTITIES', 'UTF-8');

Answered by NobleUplift

Every single one of the previous answers except for Gumbo's will mangle Unicode strings:

echo convert_smart_quotes("This is Yi: ?. Point ⒒ this breaks Yi. Yi broke–why? I need a longer––point. This makes Han 嗗 mad.");

Results in:

This is Yi: ?''. Point ?'' this breaks Yi. Yi broke?"why? I need a longer?"?"point. This makes Han ?-- mad.
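
To illustrate why this happens, here is a small sketch of my own (not part of the original test): in UTF-8, an en dash (U+2013) is the three bytes E2 80 93, and the last byte has the same value as chr(147), the CP1252 left double quote. A byte-wise replacement therefore cuts the sequence apart and leaves invalid UTF-8 behind. The mb_check_encoding() call assumes the mbstring extension is loaded.

```php
<?php
$enDash = "\xE2\x80\x93"; // U+2013 EN DASH encoded as UTF-8
$input  = "broke" . $enDash . "why";

// Byte-wise replacement of chr(147) (0x93), as the CP1252-based functions do.
$mangled = str_replace(chr(147), '"', $input);

// The trailing 0x93 of the en dash has been turned into 0x22 (a double quote).
echo bin2hex($mangled), "\n"; // prints: 62726f6b65e28022776879

// The result is no longer valid UTF-8.
var_dump(mb_check_encoding($input, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($mangled, 'UTF-8')); // bool(false)
```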

The iconv:

$output = iconv('UTF-8', 'ASCII//TRANSLIT', $input);

Results in:

PHP Notice: iconv(): Detected an illegal character in input string in php shell code on line 1

You can change it to //IGNORE, which will remove the characters, but not translate them.

For Microsoft quotes encoded in CP1252, the best way to replace them is the function below; if they are in Unicode and you need to replace them, use Gumbo's answer instead:

function convert_cp1252_to_ascii($input, $default = '') {
    if ($input === null || $input == '') {
        return $default;
    }

    // https://en.wikipedia.org/wiki/UTF-8
    // https://en.wikipedia.org/wiki/ISO/IEC_8859-1
    // https://en.wikipedia.org/wiki/Windows-1252
    // http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
    $encoding = mb_detect_encoding($input, array('Windows-1252', 'ISO-8859-1'), true);
    if ($encoding == 'ISO-8859-1' || $encoding == 'Windows-1252') {
        /*
         * Use the search/replace arrays if a character needs to be replaced with
         * something other than its Unicode equivalent.
         */

        $replace = array(
            128 => "E",    // http://www.fileformat.info/info/unicode/char/20AC/index.htm EURO SIGN
            129 => "",     // UNDEFINED
            130 => ",",    // http://www.fileformat.info/info/unicode/char/201A/index.htm SINGLE LOW-9 QUOTATION MARK
            131 => "f",    // http://www.fileformat.info/info/unicode/char/0192/index.htm LATIN SMALL LETTER F WITH HOOK
            132 => ",,",   // http://www.fileformat.info/info/unicode/char/201e/index.htm DOUBLE LOW-9 QUOTATION MARK
            133 => "...",  // http://www.fileformat.info/info/unicode/char/2026/index.htm HORIZONTAL ELLIPSIS
            134 => "t",    // http://www.fileformat.info/info/unicode/char/2020/index.htm DAGGER
            135 => "T",    // http://www.fileformat.info/info/unicode/char/2021/index.htm DOUBLE DAGGER
            136 => "^",    // http://www.fileformat.info/info/unicode/char/02c6/index.htm MODIFIER LETTER CIRCUMFLEX ACCENT
            137 => "%",    // http://www.fileformat.info/info/unicode/char/2030/index.htm PER MILLE SIGN
            138 => "S",    // http://www.fileformat.info/info/unicode/char/0160/index.htm LATIN CAPITAL LETTER S WITH CARON
            139 => "<",    // http://www.fileformat.info/info/unicode/char/2039/index.htm SINGLE LEFT-POINTING ANGLE QUOTATION MARK
            140 => "OE",   // http://www.fileformat.info/info/unicode/char/0152/index.htm LATIN CAPITAL LIGATURE OE
            141 => "",     // UNDEFINED
            142 => "Z",    // http://www.fileformat.info/info/unicode/char/017d/index.htm LATIN CAPITAL LETTER Z WITH CARON
            143 => "",     // UNDEFINED
            144 => "",     // UNDEFINED
            145 => "'",    // http://www.fileformat.info/info/unicode/char/2018/index.htm LEFT SINGLE QUOTATION MARK
            146 => "'",    // http://www.fileformat.info/info/unicode/char/2019/index.htm RIGHT SINGLE QUOTATION MARK
            147 => "\"",   // http://www.fileformat.info/info/unicode/char/201c/index.htm LEFT DOUBLE QUOTATION MARK
            148 => "\"",   // http://www.fileformat.info/info/unicode/char/201d/index.htm RIGHT DOUBLE QUOTATION MARK
            149 => "*",    // http://www.fileformat.info/info/unicode/char/2022/index.htm BULLET
            150 => "-",    // http://www.fileformat.info/info/unicode/char/2013/index.htm EN DASH
            151 => "--",   // http://www.fileformat.info/info/unicode/char/2014/index.htm EM DASH
            152 => "~",    // http://www.fileformat.info/info/unicode/char/02DC/index.htm SMALL TILDE
            153 => "TM",   // http://www.fileformat.info/info/unicode/char/2122/index.htm TRADE MARK SIGN
            154 => "s",    // http://www.fileformat.info/info/unicode/char/0161/index.htm LATIN SMALL LETTER S WITH CARON
            155 => ">",    // http://www.fileformat.info/info/unicode/char/203A/index.htm SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
            156 => "oe",   // http://www.fileformat.info/info/unicode/char/0153/index.htm LATIN SMALL LIGATURE OE
            157 => "",     // UNDEFINED
            158 => "z",    // http://www.fileformat.info/info/unicode/char/017E/index.htm LATIN SMALL LETTER Z WITH CARON
            159 => "Y",    // http://www.fileformat.info/info/unicode/char/0178/index.htm LATIN CAPITAL LETTER Y WITH DIAERESIS
        );

        $find = array();
        foreach (array_keys($replace) as $key) {
            $find[] = chr($key);
        }

        $input = str_replace($find, array_values($replace), $input);
        /*
         * Because ISO-8859-1 and CP1252 are identical except for 0x80 through 0x9F
         * and control characters, always convert from Windows-1252 to UTF-8.
         */
        $input = iconv('Windows-1252', 'UTF-8//IGNORE', $input);
    }
    return $input;
}

Taken from this answer, with some modifications. If you want control over what you find/replace, use that function.
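
As a sketch only, the choice between the CP1252 route and the UTF-8 route could be made per string by checking whether the input is valid UTF-8. The function name normalize_quotes is hypothetical, the maps are cut down to the four common quotes, and treating every non-UTF-8 string as CP1252 is an assumption that will not hold for every data source.

```php
<?php
// Hypothetical helper, not taken from any of the answers above.
function normalize_quotes($input)
{
    $utf8Map = array(
        "\xE2\x80\x98" => "'", // U+2018 LEFT SINGLE QUOTATION MARK
        "\xE2\x80\x99" => "'", // U+2019 RIGHT SINGLE QUOTATION MARK
        "\xE2\x80\x9C" => '"', // U+201C LEFT DOUBLE QUOTATION MARK
        "\xE2\x80\x9D" => '"', // U+201D RIGHT DOUBLE QUOTATION MARK
    );
    $cp1252Map = array(
        "\x91" => "'",
        "\x92" => "'",
        "\x93" => '"',
        "\x94" => '"',
    );

    // Valid UTF-8 (plain ASCII included): replace whole multi-byte sequences.
    if (mb_check_encoding($input, 'UTF-8')) {
        return strtr($input, $utf8Map);
    }

    // Assumption: anything else is single-byte CP1252.
    return strtr($input, $cp1252Map);
}

// UTF-8 input: the quotes are replaced, the en dash (E2 80 93) is left intact.
$fromUtf8 = normalize_quotes("\xE2\x80\x9CYi broke\xE2\x80\x93why?\xE2\x80\x9D");
echo bin2hex($fromUtf8), "\n";

// CP1252 input: the single-byte quotes are replaced.
echo normalize_quotes("\x93It\x92s fine\x94"), "\n"; // prints: "It's fine"
```

strtr() with an array is used here so that each position in the input is matched at most once.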
