How to search for non-ASCII characters with bash tools?
Original question: http://stackoverflow.com/questions/13596531/ (Stack Overflow, licensed under CC BY-SA 4.0; attribute the original authors when reusing or sharing)
Asked by Jonas Stein
I have a large text file that contains a few Unicode characters that make LaTeX crash. How can I find non-ASCII characters in a file with sed and the like in a Linux bash shell?
Answered by pixelbeat
Try:
nonascii() { LANG=C grep --color=always '[^ -~]\+'; }
It can be used like this:
printf 'ŨTF8\n' | nonascii
Within [], ^ means "not", so [^ -~] matches any character that is not between space and ~. Control characters aside, this matches non-ASCII characters, and it is a more portable though slightly less accurate version of [^\x00-\x7f] below. The \+ means "1 or more"; it makes the color wrap each complete multibyte character (or run of characters) rather than being interspersed between individual bytes, which would corrupt the multibyte sequence.
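To make the caveat about control characters concrete, here is a minimal sketch that is not part of the original answer. It assumes GNU grep on Linux; the function name find_nonprintable and the sample file are made up for illustration. It uses the same [^ -~] bracket expression in the C locale and adds -n so each match is reported with its line number.

```shell
# Hypothetical helper (not from the answer): list the lines that contain
# any byte outside the printable ASCII range (space through ~),
# each prefixed with its line number.
find_nonprintable() {
    # LC_ALL=C makes grep compare raw bytes instead of locale characters.
    LC_ALL=C grep -n '[^ -~]' "$@"
}

# Build a small sample file: line 2 contains a UTF-8 "é" (bytes 0xC3 0xA9),
# line 3 contains a tab, which is ASCII but not in the space-to-~ range.
sample=$(mktemp)
printf 'plain ascii\ncaf\303\251\ntab\there\n' > "$sample"

find_nonprintable "$sample"
rm -f "$sample"
```

Under these assumptions both line 2 and line 3 are reported: the tab is an ASCII control character, yet it falls outside the space-to-~ range. That is the sense in which this pattern is less accurate than [^\x00-\x7f].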
Answered by kev
Try this command:
grep -P '[^\x00-\x7f]' file
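As a rough illustration of how this command might be used on a LaTeX source, the sketch below wraps it in a small function and adds -n to print line numbers. The function name nonascii_lines and the sample file are assumptions for the example, not part of the answer. Note that -P needs a GNU grep built with PCRE support; some systems, such as the BSD grep shipped with macOS, do not provide it.

```shell
# Hypothetical wrapper (not from the answer): print every line that
# contains a character outside the ASCII range 0x00-0x7f, with line numbers.
nonascii_lines() {
    # -n: prefix matches with the line number
    # -P: Perl-compatible regex, needed for the \xNN escapes
    grep -nP '[^\x00-\x7f]' "$@"
}

# Sample input: line 2 contains a UTF-8 "é" (bytes 0xC3 0xA9),
# line 3 contains only ASCII, including a tab.
sample=$(mktemp)
printf 'plain ascii\ncaf\303\251\ntab\there\n' > "$sample"

nonascii_lines "$sample"
rm -f "$sample"
```

Here only line 2 should be reported, since the tab on line 3 is within 0x00-0x7f.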