sed replacing ASCII characters in Linux
Original question: http://stackoverflow.com/questions/33670231/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by gaurus
I want to replace the ASCII/English characters in a file and keep the Unicode characters, in a Linux environment.
INSERT INTO text (old_id,old_text,old_flags) VALUES (2815829,'[[?????:Youth-soccer-indiana.jpg|thumb|300px|right|???? ?? ???.???????? ??????, ??? ?????? ??? ??, ?? ??? ??? ?? ?? ??????? ??????? ?? ?? ?????? ???? ???]]\n\n\'\'\'???\'\'\', ?? [[??????]] ??? [[???????]] ?????? ??????? ???? ???? ?? [[?????????]] ??????? ??? \'\'???\'\'
I have tried
~$ sed 's/[^\u0900-\u097F]/ /g' hi.text
but I get
sed: -e expression #1, char 23: Invalid range end
I also tried this, and it seems to work, but not fully:
sed 's/[a-zA-Z 0-9`~!@#$%^&*()_+\[\]\{}|;'\'':",.\/<>?]//g' enwiki-latest-pages-articles-multistream_3.sql >result.txt
Can anyone tell me how to get sed working with a Unicode range regex?
Answered by Thomas Dickey
ASCII codes are in the range 0 to 127 inclusive. From that range, 0-31 and 127 are control characters. Non-ASCII characters encoded as UTF-8 use only bytes from the range 128 to 255 inclusive.
Because sed is line-oriented, newline (code 10, control/J) is treated specially. Your file may include tab (code 9) and carriage return (code 13). But in practice you likely only care about tabs and printable ASCII.
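To check the code values mentioned above, the bytes can be dumped with `od`. This is only a small sketch: the helper name `byte_values` and the sample strings are made up for illustration, and the second sample is one UTF-8 encoded character written in octal so the example does not depend on the terminal encoding.

```shell
# Print the decimal value of each input byte on one line.
# od -An drops the offset column; -tu1 prints unsigned 1-byte decimals.
byte_values() {
    od -An -tu1 | tr -s ' \n' '  ' | sed -e 's/^ //' -e 's/ $//'
}

# 'A', tab, newline, '~': all below 128
printf 'A\t\n~' | byte_values
echo

# One non-ASCII character (U+0916) in UTF-8: every byte is 128 or above
printf '\340\244\226' | byte_values
echo
```

The first call shows the tab, newline and tilde codes; the second shows that a UTF-8 encoded non-ASCII character contributes no byte below 128.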
Tilde (~) is code 126 (something handy to know), the last printable ASCII character; space, code 32, is the first.
So:
sed -e 's/[ -~\t]/ /g'
where \t is ASCII tab (depending on the implementation you may need a literal tab), will replace all of the printable ASCII with spaces, leaving newlines and UTF-8 untouched.
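Where `\t` is not understood inside a bracket expression, a literal tab can be spliced in from the shell. The sketch below assumes a POSIX shell; the function name `blank_ascii` and the sample input are made up, and `LC_ALL=C` is an addition not in the answer, used so that the range from space to tilde is interpreted by byte value.

```shell
# Hold a literal tab character in a variable.
tab=$(printf '\t')

# Replace every printable ASCII character and every tab with a space.
# LC_ALL=C (an assumption here) makes sed work on single bytes.
blank_ascii() {
    LC_ALL=C sed -e "s/[ -~${tab}]/ /g"
}

# 'a', 'b', one UTF-8 encoded character (octal bytes), a tab, 'c'
printf 'ab\303\251\tc\n' | blank_ascii
```

Each of the four ASCII characters becomes a space, while the two bytes of the UTF-8 character and the trailing newline pass through unchanged.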
Answered by Giuseppe Ricupero
PERL
If you don't mind using perl, try a mnemonic character class:
# this version also replaces newlines
perl -pe 's/[[:ascii:]]/ /g;' filename
UPDATE: Using @user1516947's example, I've slightly modified the perl solution to collapse each run of ASCII characters into one space (and remove unwanted leading and trailing spaces):
perl -pe 's/[[:ascii:]]+/ /g; s/^\s+|\s+$//g' filename
Command-line usage example based on the sample input:
echo "INSERT INTO text (old_id,old_text,old_flags) VALUES (2815829,'[[?????:Youth-soccer-indiana.jpg|thumb|300px|right|???? ?? ???.???????? ??????, ??? ?????? ??? ??, ?? ??? ??? ?? ?? ??????? ??????? ?? ?? ?????? ???? ???]]\n\n\'\'\'???\'\'\', ???????? ==\n\"???\" (\"???????\") ???? ?? [[?????? ??????]] ???? \'\'????????? (desport)\'\' ?? ???????? ??? ??, ????? ???? \"?????\" ???\n\n== ?????? ==\n\n[[?????:Greek statue discus thrower 2 century aC.jpg|thumb|150px|right|2" | perl -pe 's/[[:ascii:]]+/ /g; s/^\s+|\s+$//g'
Output:
????? ???? ?? ??? ???????? ?????? ??? ?????? ??? ?? ?? ??? ??? ?? ?? ??????? ??????? ?? ?? ?????? ???? ??? ??? ???????? ??? ??????? ???? ?? ?????? ?????? ???? ????????? ?? ???????? ??? ?? ????? ???? ????? ??? ?????? ?????
(GNU) SED
Or in sed (in a Linux environment you have to set the LANG environment variable to make the sed range valid; if LC_ALL is already set it takes precedence over LANG, so set LC_ALL=C instead):
# this version does not replace newlines
LANG=C sed 's/[\d0-\d127]/ /g' filename
A less readable sed version that replaces all newlines too (except the last one):
LANG=C sed ':a;N;$!ba;s/[\d0-\d127]/ /g' filename
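The `:a;N;$!ba` prefix is what lets the substitution see newlines at all. The sketch below isolates that loop with a plain newline-to-space substitution; the function name `join_lines` and the sample input are made up, and the commands are passed as separate `-e` arguments, which should also work with non-GNU sed.

```shell
# :a        define a label named "a"
# N         append the next input line to the pattern space,
#           separated by an embedded newline
# $!ba      if this is not the last line, branch back to "a"
# s/\n/ /g  runs once, on the whole input held in the pattern space
join_lines() {
    sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ /g'
}

printf 'one\ntwo\nthree\n' | join_lines
```

The embedded newlines are replaced, but sed still ends its output with one newline, which is why all newlines except the last one disappear.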
Answered by Giuseppe Ricupero
To get rid of the ASCII characters you can run sed over the byte range. sed leaves newlines alone, though, so if you want those gone too you need to follow it with tr.
echo -e "hi ? \nthere ?" | LANG=C sed "s/[\x01-\x7F]//g" | tr -d '\n'
??
Conversely, if you want to get rid of the Unicode characters, you can specify the non-ASCII byte range instead:
echo -e "hi ? \nthere ?" | LANG=C sed "s/[\x80-\xFF]//g"
hi
there
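As an alternative to the sed commands above, and not something the answer itself proposes, `tr` alone can delete a byte range, newlines included, using octal escapes. This is a sketch under the assumption that `LC_ALL=C` makes `tr` operate on single bytes; the function names and the sample input are made up for illustration.

```shell
# Delete every ASCII byte (0-127), newlines included, keeping only
# the bytes of non-ASCII characters.
keep_non_ascii() { LC_ALL=C tr -d '\000-\177'; }

# Delete every byte from 128 to 255, keeping only ASCII.
keep_ascii() { LC_ALL=C tr -d '\200-\377'; }

# Two lines, each ending in one UTF-8 encoded character (octal bytes).
printf 'hi \342\230\272 \nthere \342\230\273\n' | keep_non_ascii
echo
printf 'hi \342\230\272 \nthere \342\230\273\n' | keep_ascii
```

The first call prints only the two non-ASCII characters on one line; the second prints the ASCII text with its spaces and newlines intact.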