Original source: http://stackoverflow.com/questions/20025030/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me) on Stack Overflow.
Convert all types of smart quotes with PHP
Asked by Xeoncross
I am trying to convert all types of smart quotes to regular quotes when working with text. However, the following function I've compiled still seems to be lacking support and proper design.
Does anyone know how to properly get all quote characters converted?
function convert_smart_quotes($string)
{
    $quotes = array(
        "\xC2\xAB"     => '"', // « (U+00AB) in UTF-8
        "\xC2\xBB"     => '"', // » (U+00BB) in UTF-8
        "\xE2\x80\x98" => "'", // ‘ (U+2018) in UTF-8
        "\xE2\x80\x99" => "'", // ’ (U+2019) in UTF-8
        "\xE2\x80\x9A" => "'", // ‚ (U+201A) in UTF-8
        "\xE2\x80\x9B" => "'", // ‛ (U+201B) in UTF-8
        "\xE2\x80\x9C" => '"', // “ (U+201C) in UTF-8
        "\xE2\x80\x9D" => '"', // ” (U+201D) in UTF-8
        "\xE2\x80\x9E" => '"', // „ (U+201E) in UTF-8
        "\xE2\x80\x9F" => '"', // ‟ (U+201F) in UTF-8
        "\xE2\x80\xB9" => "'", // ‹ (U+2039) in UTF-8
        "\xE2\x80\xBA" => "'", // › (U+203A) in UTF-8
    );
    $string = strtr($string, $quotes);

    // Version 2
    $search = array(
        chr(145),
        chr(146),
        chr(147),
        chr(148),
        chr(151)
    );
    $replace = array("'", "'", '"', '"', ' - ');
    $string = str_replace($search, $replace, $string);

    // Version 3
    $string = str_replace(
        array('‘', '’', '“', '”'),
        array("'", "'", '"', '"'),
        $string
    );

    // Version 4
    $search = array(
        '‘',
        '’',
        '“',
        '”',
        '—',
        '–',
    );
    $replace = array("'", "'", '"', '"', ' - ', '-');
    $string = str_replace($search, $replace, $string);

    return $string;
}
Note: This question is a complete query about the full gamut of quotes, including the "Microsoft" quotes asked about here. This is a "duplicate" in the same way that asking about all tire sizes is a "duplicate" of asking for a car tire size.
Answered by Walter Tross
You need something like this (assuming UTF-8 input, and ignoring CJK (Chinese, Japanese, Korean)):
$chr_map = array(
    // Windows codepage 1252
    "\xC2\x82"     => "'", // U+0082⇒U+201A single low-9 quotation mark
    "\xC2\x84"     => '"', // U+0084⇒U+201E double low-9 quotation mark
    "\xC2\x8B"     => "'", // U+008B⇒U+2039 single left-pointing angle quotation mark
    "\xC2\x91"     => "'", // U+0091⇒U+2018 left single quotation mark
    "\xC2\x92"     => "'", // U+0092⇒U+2019 right single quotation mark
    "\xC2\x93"     => '"', // U+0093⇒U+201C left double quotation mark
    "\xC2\x94"     => '"', // U+0094⇒U+201D right double quotation mark
    "\xC2\x9B"     => "'", // U+009B⇒U+203A single right-pointing angle quotation mark

    // Regular Unicode     // U+0022 quotation mark (")
                           // U+0027 apostrophe     (')
    "\xC2\xAB"     => '"', // U+00AB left-pointing double angle quotation mark
    "\xC2\xBB"     => '"', // U+00BB right-pointing double angle quotation mark
    "\xE2\x80\x98" => "'", // U+2018 left single quotation mark
    "\xE2\x80\x99" => "'", // U+2019 right single quotation mark
    "\xE2\x80\x9A" => "'", // U+201A single low-9 quotation mark
    "\xE2\x80\x9B" => "'", // U+201B single high-reversed-9 quotation mark
    "\xE2\x80\x9C" => '"', // U+201C left double quotation mark
    "\xE2\x80\x9D" => '"', // U+201D right double quotation mark
    "\xE2\x80\x9E" => '"', // U+201E double low-9 quotation mark
    "\xE2\x80\x9F" => '"', // U+201F double high-reversed-9 quotation mark
    "\xE2\x80\xB9" => "'", // U+2039 single left-pointing angle quotation mark
    "\xE2\x80\xBA" => "'", // U+203A single right-pointing angle quotation mark
);
$chr = array_keys  ($chr_map); // but: for efficiency you should
$rpl = array_values($chr_map); // pre-calculate these two arrays
$str = str_replace($chr, $rpl, html_entity_decode($str, ENT_QUOTES, "UTF-8"));
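One way to follow the "pre-calculate these two arrays" advice is to wrap the replacement in a function that builds the arrays only once. The sketch below is an assumption about how that could look, not part of the original answer: the function name `normalize_quotes` is hypothetical, and the map is abbreviated to four entries to keep the example short.

```php
<?php
// Hypothetical wrapper (not from the original answer).
function normalize_quotes(string $str): string
{
    // Built on the first call only, then reused on later calls.
    static $chr = null, $rpl = null;
    if ($chr === null) {
        // Abbreviated map for illustration; the full $chr_map would go here.
        $map = array(
            "\xE2\x80\x98" => "'", // U+2018 left single quotation mark
            "\xE2\x80\x99" => "'", // U+2019 right single quotation mark
            "\xE2\x80\x9C" => '"', // U+201C left double quotation mark
            "\xE2\x80\x9D" => '"', // U+201D right double quotation mark
        );
        $chr = array_keys($map);
        $rpl = array_values($map);
    }
    // Decode entities such as &lsquo; first, so they are converted as well.
    return str_replace($chr, $rpl, html_entity_decode($str, ENT_QUOTES, "UTF-8"));
}

$result = normalize_quotes("\u{201C}Hello\u{201D}, &lsquo;world&rsquo;");
echo $result, PHP_EOL;
```

Because html_entity_decode() runs before the replacement, both the literal smart quotes and the named entities in the sample input end up as plain ASCII quotes.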
Here comes the background:
Every Unicode character belongs to exactly one "General Category", of which the ones that can contain quote characters are the following:
Ps: "Punctuation, Open"
Pe: "Punctuation, Close"
Pi: "Punctuation, Initial quote (may behave like Ps or Pe depending on usage)"
Pf: "Punctuation, Final quote (may behave like Ps or Pe depending on usage)"
Po: "Punctuation, Other"
(these pages are handy for checking that you didn't miss anything - there is also an index of categories)
It is sometimes useful to match these categories in a Unicode-enabled regex.
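As a rough illustration of category matching, the sketch below lists the characters of a string that fall into the Pi or Pf categories, using PCRE's `\p{..}` escapes with the `/u` modifier. The function name `find_initial_final_quotes` is hypothetical. Note that this only finds the characters; it does not tell single quotes from double ones, and low-9 quotes such as U+201A and U+201E are in Ps, so they are not matched here.

```php
<?php
// Hypothetical helper: returns every character in category Pi or Pf.
function find_initial_final_quotes(string $text): array
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    preg_match_all('/[\p{Pi}\p{Pf}]/u', $text, $matches);
    return $matches[0];
}

$found = find_initial_final_quotes("\u{201C}Hi\u{201D} and \u{00AB}yo\u{00BB}, it's fine");
print_r($found);
```

The ASCII apostrophe in "it's" is not reported, because U+0027 belongs to Po rather than Pi or Pf.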
Furthermore, Unicode characters have "properties", of which the one you are interested in is Quotation_Mark. Unfortunately, these are not accessible in a regex.
In Wikipedia you can find the group of characters with the Quotation_Mark property. The definitive reference is PropList.txt on unicode.org, but this is an ASCII text file.
In case you need to translate CJK characters too, you only have to get their code points, decide their translation, and find their UTF-8 encoding, e.g., by looking it up in fileformat.info (e.g., for U+301E: http://www.fileformat.info/info/unicode/char/301e/index.htm).
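Instead of looking up the UTF-8 encoding by hand, it can also be computed. The following is a minimal sketch, assuming the mbstring extension is available (for `mb_chr()`); the helper name `utf8_escape_for` is made up for this example. It prints the byte sequence in the same `\xNN` notation used as keys in the maps above.

```php
<?php
// Hypothetical helper: returns the "\xNN..." escape text for a code point's UTF-8 bytes.
function utf8_escape_for(int $codepoint): string
{
    $bytes = mb_chr($codepoint, 'UTF-8');   // requires the mbstring extension
    $hex   = strtoupper(bin2hex($bytes));   // e.g. "E3809E"
    // Single-quoted '\x' is a literal backslash followed by x.
    return '\x' . implode('\x', str_split($hex, 2));
}

$escape = utf8_escape_for(0x301E);
echo $escape, PHP_EOL;
```

The printed text can then be pasted into a double-quoted PHP string as a new map key, with whatever replacement you decide on as its value.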
Regarding Windows codepage 1252: Unicode defines the first 256 code points to represent exactly the same characters as ISO-8859-1. However, ISO-8859-1 is often confused with Windows codepage 1252, so all browsers render the range 0x80-0x9F, which is "empty" in ISO-8859-1 (more exactly: it contains control characters), as if it were Windows codepage 1252. The table in the Wikipedia page lists the Unicode equivalents.
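The difference between the two interpretations can be seen by converting the same byte to UTF-8 under each assumption. This is only a sketch, assuming the mbstring extension is installed; the variable names are arbitrary.

```php
<?php
$byte = "\x93"; // a single byte from the 0x80-0x9F range

// Read as ISO-8859-1, the byte maps to U+0093, a C1 control character.
$as_latin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

// Read as Windows codepage 1252, the same byte is U+201C, a left double quotation mark.
$as_cp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');

echo bin2hex($as_latin1), PHP_EOL; // UTF-8 bytes of U+0093
echo bin2hex($as_cp1252), PHP_EOL; // UTF-8 bytes of U+201C
```

The first result is exactly the "\xC2\x93" key found in the Windows codepage 1252 part of $chr_map, which is why that part of the map exists.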
Note: strtr() is often slower than str_replace(). Time it with your input and your PHP version. If it's fast enough, you can directly use a map like my $chr_map.
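A quick way to time the two approaches is sketched below. The sample input, the iteration count, and the abbreviated map are all arbitrary assumptions, and `hrtime()` needs PHP 7.3 or later; the numbers will vary with your data and PHP version, so no particular outcome is implied.

```php
<?php
// Abbreviated map for illustration; substitute your full map.
$map = array(
    "\xE2\x80\x98" => "'", // U+2018
    "\xE2\x80\x99" => "'", // U+2019
    "\xE2\x80\x9C" => '"', // U+201C
    "\xE2\x80\x9D" => '"', // U+201D
);
$chr = array_keys($map);
$rpl = array_values($map);

// Arbitrary sample input; use text that resembles your real data.
$input = str_repeat("\u{201C}quoted\u{201D} and \u{2018}single\u{2019} ", 1000);

$t0 = hrtime(true);
for ($i = 0; $i < 100; $i++) {
    $a = strtr($input, $map);
}
$t1 = hrtime(true);
for ($i = 0; $i < 100; $i++) {
    $b = str_replace($chr, $rpl, $input);
}
$t2 = hrtime(true);

printf("strtr:       %.2f ms\n", ($t1 - $t0) / 1e6);
printf("str_replace: %.2f ms\n", ($t2 - $t1) / 1e6);
```

For a map like this one, where no replacement text contains another key, both calls produce the same string, so the choice comes down to speed.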
If you are not sure that your input is UTF-8 encoded, AND are willing to assume that if it's not, then it's ISO-8859-1 or Windows codepage 1252, then you can do this before anything else:
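// preg_match() with /u does not return 1 when $str is not valid UTF-8.
if (!preg_match('/^\X*$/u', $str)) {
    // utf8_encode() converts ISO-8859-1 to UTF-8 (deprecated as of PHP 8.2).
    $str = utf8_encode($str);
}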
Warning: in very rare cases this regex can fail to detect a non-UTF-8 encoding. E.g., "Gruß…"/*CP-1252*/ == "Gru\xDF\x85" looks like UTF-8 to this regex (U+07C5 is the N'ko digit 5). The regex can be slightly enhanced, but unfortunately it can be shown that there is NO completely foolproof solution to the problem of encoding detection.
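The false positive described above can be reproduced directly. The sketch below feeds the regex the CP-1252 bytes of "Gruß…" and, for contrast, the CP-1252 bytes of "Gruß" alone; the variable names are arbitrary.

```php
<?php
// "Gruß…" encoded in CP-1252: ß is 0xDF, the ellipsis is 0x85.
$cp1252 = "Gru\xDF\x85";

// The bytes DF 85 happen to form a valid UTF-8 sequence (U+07C5),
// so the check wrongly accepts the string as UTF-8.
$looks_utf8 = preg_match('/^\X*$/u', $cp1252) === 1;
$same_bytes = ($cp1252 === "Gru\u{07C5}");

// "Gruß" alone in CP-1252 ends with a lone lead byte, which is invalid
// UTF-8, so here the check correctly reports a non-UTF-8 input.
$truncated = "Gru\xDF";
$detected  = preg_match('/^\X*$/u', $truncated) !== 1;

var_dump($looks_utf8, $same_bytes, $detected);
```

All three values come out true: the first string slips through the check, while the second is caught.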
If you want to normalize the range 0x80-0x9F that stems from Windows codepage 1252 to regular Unicode code points, you can do this (and remove the first part of the $chr_map above):
$normalization_map = array(
    "\xC2\x80" => "\xE2\x82\xAC", // U+20AC Euro sign
    "\xC2\x82" => "\xE2\x80\x9A", // U+201A single low-9 quotation mark
    "\xC2\x83" => "\xC6\x92",     // U+0192 latin small letter f with hook
    "\xC2\x84" => "\xE2\x80\x9E", // U+201E double low-9 quotation mark
    "\xC2\x85" => "\xE2\x80\xA6", // U+2026 horizontal ellipsis
    "\xC2\x86" => "\xE2\x80\xA0", // U+2020 dagger
    "\xC2\x87" => "\xE2\x80\xA1", // U+2021 double dagger
    "\xC2\x88" => "\xCB\x86",     // U+02C6 modifier letter circumflex accent
    "\xC2\x89" => "\xE2\x80\xB0", // U+2030 per mille sign
    "\xC2\x8A" => "\xC5\xA0",     // U+0160 latin capital letter s with caron
    "\xC2\x8B" => "\xE2\x80\xB9", // U+2039 single left-pointing angle quotation mark
    "\xC2\x8C" => "\xC5\x92",     // U+0152 latin capital ligature oe
    "\xC2\x8E" => "\xC5\xBD",     // U+017D latin capital letter z with caron
    "\xC2\x91" => "\xE2\x80\x98", // U+2018 left single quotation mark
    "\xC2\x92" => "\xE2\x80\x99", // U+2019 right single quotation mark
    "\xC2\x93" => "\xE2\x80\x9C", // U+201C left double quotation mark
    "\xC2\x94" => "\xE2\x80\x9D", // U+201D right double quotation mark
    "\xC2\x95" => "\xE2\x80\xA2", // U+2022 bullet
    "\xC2\x96" => "\xE2\x80\x93", // U+2013 en dash
    "\xC2\x97" => "\xE2\x80\x94", // U+2014 em dash
    "\xC2\x98" => "\xCB\x9C",     // U+02DC small tilde
    "\xC2\x99" => "\xE2\x84\xA2", // U+2122 trade mark sign
    "\xC2\x9A" => "\xC5\xA1",     // U+0161 latin small letter s with caron
    "\xC2\x9B" => "\xE2\x80\xBA", // U+203A single right-pointing angle quotation mark
    "\xC2\x9C" => "\xC5\x93",     // U+0153 latin small ligature oe
    "\xC2\x9E" => "\xC5\xBE",     // U+017E latin small letter z with caron
    "\xC2\x9F" => "\xC5\xB8",     // U+0178 latin capital letter y with diaeresis
);
$chr = array_keys  ($normalization_map); // but: for efficiency you should
$rpl = array_values($normalization_map); // pre-calculate these two arrays
$str = str_replace($chr, $rpl, $str);

