Original question: http://stackoverflow.com/questions/10066647/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Multibyte trim in PHP?
Asked by federico-t
Apparently there's no mb_trim in the mb_* family, so I'm trying to implement one on my own.
I recently found this regex in a comment in php.net:
/(^\s+)|(\s+$)/u
So, I'd implement it in the following way:
function multibyte_trim($str)
{
    if (!function_exists("mb_trim") || !extension_loaded("mbstring")) {
        return preg_replace("/(^\s+)|(\s+$)/u", "", $str);
    } else {
        return mb_trim($str);
    }
}
The regex seems correct to me, but I'm a complete beginner with regular expressions. Will this effectively remove any Unicode space at the beginning/end of a string?
Answer by deceze
The standard trim function trims a handful of space and space-like characters. These are defined as ASCII characters, which means certain specific bytes from 0 to 0100 0000.
Proper UTF-8 input will never contain multi-byte characters that are made up of bytes 0xxx xxxx. All the bytes in proper UTF-8 multibyte characters start with 1xxx xxxx.
This means that in a proper UTF-8 sequence, the bytes 0xxx xxxx can only refer to single-byte characters. PHP's trim function, with its default character list, will therefore never trim away "half a character", assuming you have a proper UTF-8 sequence. (Be very, very careful about improper UTF-8 sequences.)
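As a rough illustration of the byte-level argument above (a sketch, not part of the original answer; the sample character is an arbitrary choice), the following checks that every byte of a multibyte character has its high bit set, and that trim() with its default character list leaves such a character intact:

```php
<?php
// "ホ" is U+30DB, encoded in UTF-8 as the three bytes E3 83 9B.
$char = "ホ";

// Every byte of a multibyte UTF-8 character is >= 0x80 (1xxx xxxx),
// while trim()'s default characters are all below 0x80.
foreach (str_split($char) as $byte) {
    assert(ord($byte) >= 0x80);
}

// trim() with its default character list only removes the ASCII whitespace.
$trimmed = trim(" \t" . $char . "\n ");
assert($trimmed === $char);

// The result is still a valid UTF-8 sequence.
assert(preg_match('//u', $trimmed) === 1);
```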
The \s in ASCII regular expressions will mostly match the same characters as trim.
The preg functions with the /u modifier only work on UTF-8 encoded regular expressions, and /\s/u also matches UTF-8's nbsp. This behaviour with non-breaking spaces is the only advantage to using it.
If you want to replace space characters in other, non ASCII-compatible encodings, neither method will work.
In other words, if you're trying to trim the usual spaces from an ASCII-compatible string, just use trim. When using /\s/u, be careful with the meaning of nbsp for your text.
Take care:
$s1 = html_entity_decode(" Hello &nbsp; "); // the NBSP
$s2 = " exotic test ホ ";
echo "\nCORRECT trim: [". trim($s1) ."], [". trim($s2) ."]";
echo "\nSAME: [". trim($s1) ."] == [". preg_replace('/^\s+|\s+$/','',$s1) ."]";
echo "\nBUT: [". trim($s1) ."] != [". preg_replace('/^\s+|\s+$/u','',$s1) ."]";
echo "\n!INCORRECT trim: [". trim($s2,' ') ."]"; // DANGER! not UTF8 safe!
echo "\nSAFE ONLY WITH preg: [".
preg_replace('/^[\s]+|[\s]+$/u', '', $s2) ."]";
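To make the "not UTF8 safe" warning concrete, here is a small sketch with sample strings chosen for illustration (they are not from the original answer). trim() treats its second argument as a list of bytes, so a multibyte character in that list can cause a different character that merely shares a byte with it to be cut in half:

```php
<?php
// "ホ" is E3 83 9B and "あ" is E3 81 82: both start with the byte E3.
// trim() sees the character list "ホ" as the three separate bytes E3, 83, 9B.
$broken = trim("あホ", "ホ");

// The trailing "ホ" is removed, but so is the leading E3 byte of "あ",
// leaving two orphaned continuation bytes.
assert($broken === "\x81\x82");
assert(preg_match('//u', $broken) !== 1); // no longer valid UTF-8

// With the /u modifier, the character class works on whole characters.
$safe = preg_replace('/^[ホ]+|[ホ]+$/u', '', "あホ");
assert($safe === "あ");
```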
Answer by kba
I don't know what you're trying to do with that endless recursive function you're defining, but if you just want a multibyte-safe trim, this will work.
function mb_trim($str) {
    return preg_replace("/(^\s+)|(\s+$)/us", "", $str);
}
Answer by Edson Medina
This version supports the second optional parameter $charlist:
function mb_trim ($string, $charlist = null)
{
    if (is_null($charlist)) {
        return trim ($string);
    }
    $charlist = str_replace ('/', '\/', preg_quote ($charlist));
    return preg_replace ("/(^[$charlist]+)|([$charlist]+$)/us", '', $string);
}
Does not support ".." for ranges though.
Answer by Michael Taggart
Ok, so I took @edson-medina's solution, fixed a bug, and added some unit tests. Here are the three functions we use to give mb counterparts to trim, rtrim, and ltrim.
////////////////////////////////////////////////////////////////////////////////////
//Add some multibyte core functions not in PHP
////////////////////////////////////////////////////////////////////////////////////
function mb_trim($string, $charlist = null) {
    if (is_null($charlist)) {
        return trim($string);
    } else {
        $charlist = preg_quote($charlist, '/');
        return preg_replace("/(^[$charlist]+)|([$charlist]+$)/us", '', $string);
    }
}
function mb_rtrim($string, $charlist = null) {
    if (is_null($charlist)) {
        return rtrim($string);
    } else {
        $charlist = preg_quote($charlist, '/');
        return preg_replace("/([$charlist]+$)/us", '', $string);
    }
}
function mb_ltrim($string, $charlist = null) {
    if (is_null($charlist)) {
        return ltrim($string);
    } else {
        $charlist = preg_quote($charlist, '/');
        return preg_replace("/(^[$charlist]+)/us", '', $string);
    }
}
////////////////////////////////////////////////////////////////////////////////////
Here are the unit tests I wrote, for anyone interested:
public function test_trim() {
    $this->assertEquals(trim(' foo '), mb_trim(' foo '));
    $this->assertEquals(trim(' foo ', ' o'), mb_trim(' foo ', ' o'));
    $this->assertEquals('foo', mb_trim(' ?fooホ ', ' ?ホ'));
}
public function test_rtrim() {
    $this->assertEquals(rtrim(' foo '), mb_rtrim(' foo '));
    $this->assertEquals(rtrim(' foo ', ' o'), mb_rtrim(' foo ', ' o'));
    $this->assertEquals('foo', mb_rtrim('fooホ ', ' ホ'));
}
public function test_ltrim() {
    $this->assertEquals(ltrim(' foo '), mb_ltrim(' foo '));
    $this->assertEquals(ltrim(' foo ', ' o'), mb_ltrim(' foo ', ' o'));
    $this->assertEquals('foo', mb_ltrim(' ?foo', ' ?'));
}
Answer by Opty
You can also trim non-ASCII-compatible spaces (a non-breaking space, for example) on UTF-8 strings with preg_replace('/^\p{Z}+|\p{Z}+$/u','',$str);
\s will only match "ASCII compatible" space characters even with the u modifier, but \p{Z} will match all known Unicode space characters.
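One caveat worth illustrating: \p{Z} is the Unicode "separator" category, which does not include control characters such as tab or newline. A minimal sketch that combines both classes might look like the following; the function name unicode_trim is hypothetical, and falling back to the untouched input when preg_replace() fails is an assumption of this sketch rather than something the answer prescribes:

```php
<?php
// Hypothetical helper: trims ASCII whitespace (\s) and Unicode separators (\p{Z}).
function unicode_trim($str)
{
    $result = preg_replace('/^[\s\p{Z}]+|[\s\p{Z}]+$/u', '', $str);
    // preg_replace() returns null on error (e.g. invalid UTF-8 input);
    // this sketch then returns the input unchanged.
    return $result === null ? $str : $result;
}

$nbsp = "\u{00A0}"; // non-breaking space (PHP 7+ escape syntax)

// Tab, NBSP, space and newline are all removed from both ends.
assert(unicode_trim("\t" . $nbsp . " foo ホ" . $nbsp . "\n") === "foo ホ");

// \p{Z} on its own does not match a tab, so the tab survives.
assert(preg_replace('/^\p{Z}+|\p{Z}+$/u', '', "\tfoo" . $nbsp) === "\tfoo");
```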
Answer by trapper_hag
mb_ereg_replace seems to get around that:
function mb_trim($str, $regex = "(^\s+)|(\s+$)") {
    // mb_ereg patterns take no delimiters, so a trailing "/us" would be matched literally
    return mb_ereg_replace($regex, "", $str);
}
..but I don't know enough about regular expressions to know how you'd then add the "charlist" parameter that people would expect to be able to pass to trim() (i.e. a list of characters to trim), so I have just made the regex a parameter.
It might be that you could have an array of special characters, then step through it for each character in the charlist and escape them accordingly when building the regex string.
Answer by Anthony Rutledge
My two cents
The actual solution to your question is that you should first do encoding checks before working to alter foreign input strings. Many are quick to learn about "sanitizing and validating" input data, but slow to learn the step of identifying the underlying nature (character encoding) of the strings they are working with early on.
How many bytes will be used to represent each character? With properly formatted UTF-8, it can be 1 (the characters trim deals with), 2, 3, or 4 bytes. The problem comes in when legacy, or malformed, representations of UTF-8 come into play: the byte/character boundaries might not line up as expected (in layman's terms).
In PHP, some advocate that all strings should be forced to conform to proper UTF-8 encoding (1, 2, 3, or 4 bytes per character), where functions like trim() will still work because the byte/character boundary for the characters it deals with will be congruent for the Extended ASCII / 1-byte values that trim() seeks to eliminate from the start and end of a string (trim manual page).
However, because computer programming is a diverse field, one cannot possibly have a blanket approach that works in all scenarios. With that said, write your application the way it needs to be to function properly. Just doing a basic database-driven website with form inputs? Yes, for my money, force everything to be UTF-8.
Note: You will still have internationalization issues, even if your UTF-8 issue is stable. Why? Many non-English character sets exist in the 2, 3, or 4 byte space (code points, etc.). Obviously, if you use a computer that must deal with Chinese, Japanese, Russian, Arabic, or Hebrew scripts, you want everything to work with 2, 3, and 4 bytes as well! Remember, the PHP trim function can trim default characters, or user-specified ones. This matters, especially if you need your trim to account for some Chinese characters.
I would much rather deal with the problem of someone not being able to access my site than the problem of access and responses that should not be occurring. When you think about it, this falls in line with the principles of least privilege (security) and universal design (accessibility).
Summary
If input data will not conform to proper UTF-8 encoding, you may want to throw an exception. You can attempt to use the PHP multi-byte functions to determine your encoding, or some other multi-byte library. If, and when, PHP is written to fully support Unicode (like Perl, Java ...), PHP will be all the better for it. The PHP Unicode effort died a few years ago, hence you are forced to use extra libraries to deal with UTF-8 multi-byte strings sanely. Just adding the /u flag to preg_replace() is not looking at the big picture.
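The "check the encoding first, then trim" advice could be sketched roughly as follows. The function name trim_utf8_or_fail and the choice of exception are assumptions for illustration only; mb_check_encoding() is used when the mbstring extension is loaded, with an empty /u pattern as a fallback validity check:

```php
<?php
// Hypothetical helper: refuse input that is not valid UTF-8, then trim it.
function trim_utf8_or_fail($input)
{
    // Prefer mbstring when available; otherwise let PCRE validate the string.
    $isUtf8 = function_exists('mb_check_encoding')
        ? mb_check_encoding($input, 'UTF-8')
        : preg_match('//u', $input) === 1;

    if (!$isUtf8) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }

    // Trim ASCII whitespace and Unicode separators from both ends.
    return preg_replace('/^[\s\p{Z}]+|[\s\p{Z}]+$/u', '', $input);
}
```

Calling it with a well-formed string such as "\u{00A0} ホ \n" returns "ホ", while a malformed byte sequence such as "\xC3\x28" raises the exception instead of being silently altered.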
Update:
That being said, I believe the following multibyte trim would be useful for those trying to extract REST resources from the path component of a URL (less the query string, naturally). Note: this would be useful after sanitizing and validating the path string.
function mb_path_trim($path)
{
    return preg_replace("/^(?:\/)|(?:\/)$/u", "", $path);
}
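As a usage sketch, the trimmed path can then be split into its resource segments. The helper names path_trim_sketch and path_segments are hypothetical; the regex is copied from mb_path_trim() so that the example runs on its own. Note that the pattern removes at most one slash from each end:

```php
<?php
// Same regex as mb_path_trim(), repeated so this sketch is self-contained.
function path_trim_sketch($path)
{
    return preg_replace("/^(?:\/)|(?:\/)$/u", "", $path);
}

// Hypothetical helper: split an already validated path into its segments.
function path_segments($path)
{
    $trimmed = path_trim_sketch($path);
    // An empty path (for example "/") has no segments.
    return $trimmed === '' ? [] : explode('/', $trimmed);
}

$segments = path_segments('/users/ホ/posts/');
assert($segments === ['users', 'ホ', 'posts']);
```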

