How to search for non-ASCII characters with bash tools?

Notice: this page is a Chinese-English parallel translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/13596531/

Date: 2020-09-09 23:00:54  Source: igfitidea

Tags: bash, unicode, grep

Asked by Jonas Stein

I have a large text file that contains a few Unicode characters that make LaTeX crash. How can I find non-ASCII characters in a file using sed or similar tools in bash on Linux?

Answered by pixelbeat

Try:

nonascii() { LANG=C grep --color=always '[^ -~]\+'; }

Which can be used like:

printf 'ŨTF8\n' | nonascii

Within [], ^ means "not", so [^ -~] matches any character that is not between space and ~. Apart from also matching ASCII control characters (such as tab), this matches non-ASCII characters, and is a more portable though slightly less accurate version of [^\x00-\x7f] below. The \+ means "one or more", so a multibyte character is highlighted as a whole, rather than having color codes interspersed between its bytes, which would corrupt the multibyte sequence.
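The effect of \+ and the control-character caveat can be checked with a small sketch. It assumes a grep that supports -o and \+ (GNU grep does). It uses LC_ALL=C rather than LANG=C, which is my choice and not part of the answer, so that an LC_ALL already set in the environment cannot override the byte-wise C locale. The sample strings are made up for illustration.

```shell
# Sample line: "a", then é as its two UTF-8 bytes (octal 303 251), then "b".
sample='a\303\251b\n'

# Without \+ each non-ASCII byte is a separate match: two matches.
printf "$sample" | LC_ALL=C grep -o '[^ -~]' | wc -l

# With \+ the adjacent bytes are grouped into a single match: one match.
printf "$sample" | LC_ALL=C grep -o '[^ -~]\+' | wc -l

# A tab is an ASCII control character, but it lies outside space..~,
# so the line is reported as well: one matching line.
printf 'with\ttab\n' | LC_ALL=C grep -c '[^ -~]'
```

If tabs in the input produce unwanted matches, the byte-range pattern from the next answer avoids them.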

Answered by kev

Try this command:

grep -P '[^\x00-\x7f]' file
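A usage sketch for the command above, assuming GNU grep built with PCRE support (-P is not available in every grep). The sample file content is hypothetical, and the -n flag and LC_ALL=C are my additions rather than part of the answer. -n prints line numbers, which helps when locating the offending characters in a large LaTeX source.

```shell
# Create a throwaway sample file (made-up content for illustration);
# line 2 contains é encoded as the UTF-8 bytes 303 251 (octal).
tmp=$(mktemp)
printf 'plain ascii\ncaf\303\251\nmore ascii\n' > "$tmp"

# -n prefixes each matching line with its line number.
# LC_ALL=C makes the pattern match individual bytes, so the result
# does not depend on the locale of the calling shell.
LC_ALL=C grep -n -P '[^\x00-\x7f]' "$tmp"

rm -f "$tmp"
```

Only the second line of the sample should be reported, since it is the only one that contains bytes above 0x7f.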