PHP: How to remove all non printable characters in a string?

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/1176904/

php utf-8 ascii

Asked by Stewart Robinson

I imagine I need to remove chars 0-31 and 127.

Is there a function or piece of code to do this efficiently?

Answered by Paul Dixon

7 bit ASCII?

If your Tardis just landed in 1963, and you just want the 7 bit printable ASCII chars, you can rip out everything from 0-31 and 127-255 with this:

$string = preg_replace('/[\x00-\x1F\x7F-\xFF]/', '', $string);

It matches anything in range 0-31, 127-255 and removes it.

8 bit extended ASCII?

You fell into a Hot Tub Time Machine, and you're back in the eighties. If you've got some form of 8 bit ASCII, then you might want to keep the chars in range 128-255. An easy adjustment: just look for 0-31 and 127.

$string = preg_replace('/[\x00-\x1F\x7F]/', '', $string);
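
To make the difference between the two patterns concrete, here is a minimal sketch. The helper names are hypothetical, and the sample input assumes an ISO-8859-1 style string in which the byte 0xE9 stands for "é".

```php
<?php
// Hypothetical helper names; the patterns are the two shown above.

// Keep only 7-bit printable ASCII (bytes 32-126).
function strip_to_7bit(string $s): string
{
    return preg_replace('/[\x00-\x1F\x7F-\xFF]/', '', $s);
}

// Keep bytes 128-255 too; drop only the control bytes 0-31 and 127.
function strip_controls_8bit(string $s): string
{
    return preg_replace('/[\x00-\x1F\x7F]/', '', $s);
}

// \x07 is BEL, \x7F is DEL, \xE9 is an 8-bit "extended" character.
$input = "caf\xE9\x07 ok\x7F";

var_dump(bin2hex(strip_to_7bit($input)));       // the \xE9 byte is removed as well
var_dump(bin2hex(strip_controls_8bit($input))); // the \xE9 byte survives
```

Both calls drop BEL and DEL; only the 7-bit version also throws away the high byte.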

UTF-8?

Ah, welcome back to the 21st century. If you have a UTF-8 encoded string, then the /u modifier can be used on the regex:

$string = preg_replace('/[\x00-\x1F\x7F]/u', '', $string);

This just removes 0-31 and 127. This works in ASCII and UTF-8 because both share the same control set range (as noted by mgutt below). Strictly speaking, this would work without the /u modifier. But it makes life easier if you want to remove other chars...

If you're dealing with Unicode, there are potentially many non-printing elements, but let's consider a simple one: NO-BREAK SPACE (U+00A0).

In a UTF-8 string, this would be encoded as the two bytes 0xC2 0xA0. You could look for and remove that specific sequence, but with the /u modifier in place, you can simply add \xA0 to the character class:

$string = preg_replace('/[\x00-\x1F\x7F\xA0]/u', '', $string);
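
A short sketch of why the /u modifier matters once \xA0 is in the class. The sample string is my own; it relies on the fact that "à" (U+00E0) is encoded in UTF-8 as the bytes C3 A0, so it shares a byte with NO-BREAK SPACE.

```php
<?php
// "\u{00E0}" is "à" (bytes C3 A0); "\u{00A0}" is NO-BREAK SPACE (bytes C2 A0).
$input = "voil\u{00E0}\u{00A0}!";

// With /u the class matches whole code points, so only U+00A0 is removed.
$withU = preg_replace('/[\x00-\x1F\x7F\xA0]/u', '', $input);

// Without /u the class matches single bytes, so every A0 byte is removed,
// including the second byte of "à". The leftover bytes are not valid UTF-8.
$withoutU = preg_replace('/[\x00-\x1F\x7F\xA0]/', '', $input);

var_dump($withU === "voil\u{00E0}!");     // bool(true)
var_dump($withoutU === "voil\xC3\xC2!");  // bool(true)
```

So for byte-oriented 8-bit data the pattern without /u is fine, but on UTF-8 text it can silently damage multi-byte characters.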

Addendum: What about str_replace?

preg_replace is pretty efficient, but if you're doing this operation a lot, you could build an array of the chars you want to remove and use str_replace, as noted by mgutt below, e.g.:

//build an array we can re-use across several operations
$badchar=array(
    // control characters
    chr(0), chr(1), chr(2), chr(3), chr(4), chr(5), chr(6), chr(7), chr(8), chr(9), chr(10),
    chr(11), chr(12), chr(13), chr(14), chr(15), chr(16), chr(17), chr(18), chr(19), chr(20),
    chr(21), chr(22), chr(23), chr(24), chr(25), chr(26), chr(27), chr(28), chr(29), chr(30),
    chr(31),
    // non-printing characters
    chr(127)
);

//replace the unwanted chars
$str2 = str_replace($badchar, '', $str);

Intuitively, this seems like it would be fast, but that's not always the case, so you should definitely benchmark to see if it saves you anything. I did some benchmarks across a variety of string lengths with random data, and this pattern emerged using PHP 7.0.12:

     2 chars str_replace     5.3439ms preg_replace     2.9919ms preg_replace is 44.01% faster
     4 chars str_replace     6.0701ms preg_replace     1.4119ms preg_replace is 76.74% faster
     8 chars str_replace     5.8119ms preg_replace     2.0721ms preg_replace is 64.35% faster
    16 chars str_replace     6.0401ms preg_replace     2.1980ms preg_replace is 63.61% faster
    32 chars str_replace     6.0320ms preg_replace     2.6770ms preg_replace is 55.62% faster
    64 chars str_replace     7.4198ms preg_replace     4.4160ms preg_replace is 40.48% faster
   128 chars str_replace    12.7239ms preg_replace     7.5412ms preg_replace is 40.73% faster
   256 chars str_replace    19.8820ms preg_replace    17.1330ms preg_replace is 13.83% faster
   512 chars str_replace    34.3399ms preg_replace    34.0221ms preg_replace is  0.93% faster
  1024 chars str_replace    57.1141ms preg_replace    67.0300ms str_replace  is 14.79% faster
  2048 chars str_replace    94.7111ms preg_replace   123.3189ms str_replace  is 23.20% faster
  4096 chars str_replace   227.7029ms preg_replace   258.3771ms str_replace  is 11.87% faster
  8192 chars str_replace   506.3410ms preg_replace   555.6269ms str_replace  is  8.87% faster
 16384 chars str_replace  1116.8811ms preg_replace  1098.0589ms preg_replace is  1.69% faster
 32768 chars str_replace  2299.3128ms preg_replace  2222.8632ms preg_replace is  3.32% faster

The timings themselves are for 10000 iterations, but what's more interesting is the relative differences. Up to 512 chars, I was seeing preg_replace always win. In the 1-8kb range, str_replace had a marginal edge.

I thought it was an interesting result, so I'm including it here. The important thing is not to take this result and use it to decide which method to use, but to benchmark against your own data and then decide.
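
If you want to run that kind of comparison yourself, something along these lines may be a starting point. This is only a sketch, not the harness used for the numbers above: the time_it helper, the string lengths and the iteration count are all hypothetical choices, and random_bytes needs PHP 7 or later.

```php
<?php
// Hypothetical timing helper: runs $fn $iterations times, returns elapsed ms.
function time_it(callable $fn, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return (microtime(true) - $start) * 1000;
}

// Same character set as the $badchar array above: bytes 0-31 and 127.
$badchar = array_map('chr', array_merge(range(0, 31), [127]));

foreach ([16, 512, 8192] as $length) {
    $data = random_bytes($length); // swap in a sample of your real data here

    $viaStr = time_it(function () use ($data, $badchar) {
        str_replace($badchar, '', $data);
    }, 1000);

    $viaPreg = time_it(function () use ($data) {
        preg_replace('/[\x00-\x1F\x7F]/', '', $data);
    }, 1000);

    printf("%6d chars  str_replace %9.4fms  preg_replace %9.4fms\n", $length, $viaStr, $viaPreg);
}
```

Random bytes are a poor stand-in for real text, so the numbers only become meaningful once $data holds strings like the ones you actually process.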

Answered by Dalin

Many of the other answers here do not take into account Unicode characters (e.g. ü, й, η, ы). In this case you can use the following:

$string = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/u', '', $string);

There's a strange class of characters in the range \x80-\x9F (just above the 7-bit ASCII range of characters) that are technically control characters, but over time have been misused for printable characters. If you don't have any problems with these, then you can use:

$string = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/u', '', $string);
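
As a quick illustration of what this pattern keeps, the sketch below (sample input is my own) shows that tab, carriage return, line feed and non-ASCII letters survive, while other control characters are dropped.

```php
<?php
// Contains TAB, CR, LF, "ï" (U+00EF), NUL and ESC (\x1B).
$input = "col1\tcol2\r\nna\u{00EF}ve\x00\x1B";

// \x09 (TAB), \x0A (LF) and \x0D (CR) are deliberately left out of the class.
$clean = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/u', '', $input);

var_dump($clean === "col1\tcol2\r\nna\u{00EF}ve"); // bool(true)
```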

If you wish to also strip line feeds, carriage returns, tabs, non-breaking spaces, and soft-hyphens, you can use:

$string = preg_replace('/[\x00-\x1F\x7F-\xA0\xAD]/u', '', $string);

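Note that you must use single quotes for the above examples. Inside double quotes, PHP itself would expand escapes such as \x00 into raw bytes before the regex engine sees the pattern.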

If you wish to strip everything except basic printable ASCII characters (all the example characters above will be stripped) you can use:

$string = preg_replace('/[^[:print:]]/', '', $string);

For reference see http://www.fileformat.info/info/charset/UTF-8/list.htm

Answered by Kevin Nelson

Starting with PHP 5.2, we also have access to filter_var, which I have not seen mentioned, so I thought I'd throw it out there. To use filter_var to strip non-printable characters < 32 and > 127, you can do:

Filter ASCII characters below 32

$string = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW);

Filter ASCII characters above 127

$string = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_HIGH);

Strip both:

$string = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW|FILTER_FLAG_STRIP_HIGH);

You can also html-encode low characters (newline, tab, etc.) while stripping high:

$string = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_ENCODE_LOW|FILTER_FLAG_STRIP_HIGH);
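
A small sketch of how the strip flags behave on one input. The sample string is my own; the results are dumped as hex so the surviving bytes are visible.

```php
<?php
// Bytes: TAB (09), "a" (61), "é" as UTF-8 (C3 A9), LF (0A).
$input = "\ta\xC3\xA9\n";

$low  = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW);
$high = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_HIGH);
$both = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW | FILTER_FLAG_STRIP_HIGH);

var_dump(bin2hex($low));  // string(6) "61c3a9"  : TAB and LF removed
var_dump(bin2hex($high)); // string(6) "09610a"  : both bytes of "é" removed
var_dump(bin2hex($both)); // string(2) "61"      : only "a" is left
```

Note that FILTER_FLAG_STRIP_HIGH works on bytes, so it removes every non-ASCII character from a UTF-8 string.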

There are also options for stripping HTML, sanitizing e-mails and URLs, etc. So, lots of options for sanitization (strip out data) and even validation (return false if not valid rather than silently stripping).

Sanitization: http://php.net/manual/en/filter.filters.sanitize.php

Validation: http://php.net/manual/en/filter.filters.validate.php

However, there is still the problem that FILTER_FLAG_STRIP_LOW will strip out newlines and carriage returns, which are completely valid characters for a textarea. So some of the regex answers, I guess, are still necessary at times. For example, after reviewing this thread, I plan to do this for textareas:

$string = preg_replace( '/[^[:print:]\r\n]/', '',$input);
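
Wrapped in a function, the textarea variant might look like the sketch below. The helper name is hypothetical, and the behaviour shown assumes the default "C" locale, where [:print:] covers only printable ASCII.

```php
<?php
// Hypothetical helper: keeps printable ASCII plus CR and LF. Everything else
// (tabs, NUL, and all bytes above 127) is removed.
function clean_textarea(string $input): string
{
    return preg_replace('/[^[:print:]\r\n]/', '', $input);
}

$cleaned = clean_textarea("Line 1\r\nLine 2\t\x00");

var_dump($cleaned === "Line 1\r\nLine 2"); // bool(true)
```

Because there is no /u modifier here, accented and other non-ASCII letters are stripped too, which may or may not be what you want for user-entered text.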

This seems more readable than a number of the regexes that strip characters by numeric range.

Answered by ghostdog74

You can use character classes:

/[[:cntrl:]]+/
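
For example, used with preg_replace (a minimal sketch with a made-up input), the POSIX class [:cntrl:] covers bytes 0-31 and 127:

```php
<?php
// NUL, SOH, TAB and DEL are all control characters.
$input = "a\x00\x01b\tc\x7F";

$clean = preg_replace('/[[:cntrl:]]+/', '', $input);

var_dump($clean); // string(3) "abc"
```

Keep in mind that this also removes tabs, carriage returns and line feeds.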

Answered by Hymantrade

This is simpler:

$string = preg_replace('/[[:cntrl:]]/', '', $string);

Answered by Wayne Weibel

All of the solutions work partially, and even the one below probably does not cover all cases. My issue was in trying to insert a string into a utf8 MySQL table. The string (and its bytes) all conformed to UTF-8, but had several bad sequences. I assume that most of them were control or formatting characters.

function clean_string($string) {
  $s = trim($string);
  $s = iconv("UTF-8", "UTF-8//IGNORE", $s); // drop all non utf-8 characters

  // this is some bad utf-8 byte sequence that makes mysql complain - control and formatting i think
  $s = preg_replace('/(?>[\x00-\x1F]|\xC2[\x80-\x9F]|\xE2[\x80-\x8F]{2}|\xE2\x80[\xA4-\xA8]|\xE2\x81[\x9F-\xAF])/', ' ', $s);

  $s = preg_replace('/\s+/', ' ', $s); // reduce all multiple whitespace to a single space

  return $s;
}

To further exacerbate the problem, there is the table vs. server vs. connection vs. rendering of the content, as talked about a little here

Answered by cedivad

My UTF-8 compliant version:

preg_replace('/[^\p{L}\s]/u','',$value);

Note that this keeps only letters and whitespace, so digits and punctuation are removed as well.

Answered by Richy B.

You could use a regular expression to remove everything apart from those characters you wish to keep:

$string=preg_replace('/[^A-Za-z0-9 _\-\+\&]/','',$string);

This replaces everything that is not (^) one of the letters A-Z or a-z, the numbers 0-9, space, underscore, hyphen, plus or ampersand with nothing (i.e. removes it).
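
A quick sketch of the whitelist in action, with a sample string of my own choosing:

```php
<?php
// Parentheses, "é" (bytes C3 A9), NUL and "!" are not on the whitelist.
$string = "Tom & Jerry_2+2 (caf\xC3\xA9)\x00!";

$string = preg_replace('/[^A-Za-z0-9 _\-\+\&]/', '', $string);

var_dump($string); // string(19) "Tom & Jerry_2+2 caf"
```

The whitelist approach is strict: anything you forget to list, including accented letters and ordinary punctuation, disappears.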

Answered by Gajus

preg_replace('/(?!\n)[\p{Cc}]/', '', $response);

This will remove all the control characters (http://uk.php.net/manual/en/regexp.reference.unicode.php), leaving the \n newline characters. From my experience, the control characters are the ones that most often cause the printing issues.
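
Here is a small sketch of the same idea applied to a UTF-8 string. Adding the /u modifier is my assumption, not part of the answer above; it makes \p{Cc} match whole code points rather than single bytes, which matters for C1 controls such as U+0085.

```php
<?php
// NUL and DEL are C0 controls; U+0085 (NEXT LINE) is a C1 control (bytes C2 85).
$response = "a\x00b\nc\x7F\u{0085}d";

// Assumption: /u added so the pattern is applied to code points.
// The (?!\n) lookahead protects line feeds from being removed.
$clean = preg_replace('/(?!\n)\p{Cc}/u', '', $response);

var_dump($clean === "ab\ncd"); // bool(true)
```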

Answered by Junaid Masood

To strip all non-ASCII characters from the input string:

$result = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $string);

That code removes any bytes in the decimal ranges 0-31 and 128-255 (hex 00-1F and 80-FF), leaving only the characters 32-127 in the resulting string, which I call $result in this example.