Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must comply with the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/34412754/

Date: 2020-09-18 14:02:57  Source: igfitidea

Trying to remove non-printable characters (junk values) from a UNIX file

bash unix awk sed non-printing-characters

Asked by Pranav

I am trying to remove non-printable characters (e.g. ^@) from the records in my file. Since the volume of records in the file is too big, using cat is not an option, as the loop takes too much time. I tried using

sed -i 's/[^@a-zA-Z 0-9`~!@#$%^&*()_+\[\]\{}|;'\'':",.\/<>?]//g' FILENAME

but the ^@ characters are still not removed. I also tried using

awk '{ sub("[^a-zA-Z0-9\"!@#$%^&*|_\[](){}", ""); print }' FILENAME > NEWFILE

but it also did not help.

Can anybody suggest some alternative way to remove non-printable characters?

I also tried tr -cd, but it removes accented characters, which are required in the file.

Answered by Tom Fenech

Perhaps you could go with the complement of [:print:], which contains all printable characters:

tr -cd '[:print:]' < file > newfile
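One thing to watch with this command: newline is a control character, so it is not part of [:print:], and tr -cd '[:print:]' deletes the line breaks along with the junk, joining all records into one line. A minimal sketch that keeps newlines by adding \n to the retained set (the function name is hypothetical, not from the original answer):

```shell
# Hypothetical helper: keep printable characters plus newline, drop the rest.
# Note: many tr implementations work byte by byte, so multi-byte
# (e.g. accented UTF-8) characters may still be removed.
strip_nonprint_tr() {
    tr -cd '[:print:]\n'
}

# Two sample records containing a NUL byte and a control byte (octal 001).
# Prints two lines: "abc" and "de".
printf 'ab\000c\nd\001e\n' | strip_nonprint_tr
```

With a real file this would be used as strip_nonprint_tr < file > newfile.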

If your version of tr doesn't support multi-byte characters (it seems that many don't), this works for me with GNU sed (with UTF-8 locale settings):

sed 's/[^[:print:]]//g' file
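Because the result depends on the locale in effect, it may help to set the locale explicitly for the sed call. A minimal sketch, assuming GNU sed and assuming a UTF-8 locale named C.UTF-8 is installed (locale -a lists the ones available; substitute another UTF-8 locale if needed). The function name is hypothetical:

```shell
# Hypothetical helper: strip non-printable characters line by line.
# Assumption: the C.UTF-8 locale exists on this system. If it does not,
# sed falls back to the C locale and will treat accented characters
# as non-printable.
strip_nonprint_sed() {
    LC_ALL=C.UTF-8 sed 's/[^[:print:]]//g'
}

# Sample input: "café" (é as UTF-8 bytes), a control byte (octal 001), "x".
# The control byte is removed; the é should survive in a UTF-8 locale.
printf 'caf\303\251\001x\n' | strip_nonprint_sed
```

Unlike the tr command, sed works line by line, so the newlines that separate records are left in place.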

Answered by Pranav

Remove all control characters first:

tr -dc '\007-\011\012-\015\040-\376' < file > newfile

Then try your string:

sed -i 's/[^@a-zA-Z 0-9`~!@#$%^&*()_+\[\]\{}|;'\'':",.\/<>?]//g' newfile

I believe that the ^@ you see is in fact a zero value, \0. The tr filter above will remove those as well.

Answered by derek

strings -1 file... > outputfile

seems to work
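To check whether the ^@ shown by an editor or by cat -v really is a NUL byte, and whether any of the commands above removed it, one option is to count the NUL bytes before and after. A minimal sketch (the function name is hypothetical):

```shell
# Hypothetical helper: count the NUL bytes arriving on standard input.
count_nul() {
    # Delete every byte except NUL, then count what remains.
    LC_ALL=C tr -cd '\000' | wc -c
}

# Sample input containing two NUL bytes; prints 2
# (some wc implementations pad the number with spaces).
printf 'a\000b\000c\n' | count_nul
```

With real data this would be run as count_nul < file and then count_nul < newfile; the second count should be 0 if the cleanup worked.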