bash: replacing ASCII characters with sed in Linux

Question

Asked by gaurus

I want to replace the ASCII/English characters in a file and keep the Unicode characters, in a Linux environment.

INSERT INTO text (old_id,old_text,old_flags) VALUES (2815829,'[[?????:Youth-soccer-indiana.jpg|thumb|300px|right|???? ?? ???.???????? ??????, ??? ?????? ??? ??, ?? ??? ??? ?? ?? ??????? ??????? ?? ?? ?????? ???? ???]]\n\n\'\'\'???\'\'\', ?? [[??????]] ??? [[???????]] ?????? ??????? ???? ???? ?? [[?????????]] ??????? ??? \'\'???\'\'

I have tried

~$ sed 's/[^\u0900-\u097F]/ /g' hi.text

but I get

sed: -e expression #1, char 23: Invalid range end

I also tried this and it seems to work but not fully

sed 's/[a-zA-Z 0-9`~!@#$%^&*()_+\[\]\{}|;'\'':",.\/<>?]//g' enwiki-latest-pages-articles-multistream_3.sql  >result.txt

Can anyone tell me how to get sed working with a Unicode range regex?

Answer 1

Answered by Thomas Dickey

ASCII codes are in the range 0 to 127 inclusive. Within that range, 0-31 and 127 are control characters. Non-ASCII Unicode characters encoded as UTF-8 use only bytes in the range 128 to 255.

Because sed is line-oriented, newline (code 10, control/J) is treated specially. Your file may include tab (code 9) and carriage return (code 13). But in practice you likely only care about tabs and printable ASCII.

Tilde (~) is code 126 (something handy to know).
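
The byte values mentioned above can be checked directly. This is a minimal sketch that assumes a POSIX `od` is available; the variable names `codes` and `summary` are only illustrative.

```shell
# Dump tab, newline and tilde as unsigned decimal byte values.
codes=$(printf '\t\n~' | od -An -tu1)

# $codes is left unquoted on purpose so that od's column padding
# is removed by word splitting and each value fills one %s.
summary=$(printf 'tab=%s newline=%s tilde=%s' $codes)
printf '%s\n' "$summary"
```

Space is code 32, so the range from space to tilde covers exactly the printable ASCII characters.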

So:

sed -e 's/[ -~\t]/ /g'

where \t is ASCII tab (depending on the implementation you may need a literal tab) will replace every printable ASCII character and tab with a space, leaving newlines and UTF-8 untouched.
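
A small check of that command on known bytes, as a sketch rather than a definitive recipe. It assumes the Devanagari letter U+0915 (UTF-8 bytes E0 A4 95, written in octal below) as sample data, forces `LC_ALL=C` so the range is interpreted bytewise (an assumption, the answer does not set a locale), and passes a literal tab instead of `\t` for portability. The names `tab`, `out` and `expected` are illustrative.

```shell
# A literal tab character, for sed implementations that do not understand \t.
tab=$(printf '\t')

# Input: "a1," + tab + "b" + space (6 ASCII bytes), then U+0915 in UTF-8.
out=$(printf 'a1,\tb \340\244\225\n' | LC_ALL=C sed -e "s/[ -~$tab]/ /g")

# Expected: six spaces followed by the untouched UTF-8 bytes.
expected=$(printf '%6s\340\244\225' '')

if [ "$out" = "$expected" ]; then
    echo "ASCII replaced, UTF-8 bytes preserved"
fi
```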

Answer 2

Answered by Giuseppe Ricupero

PERL

If you don't mind using perl try a mnemonic:

# this version replaces every ASCII character, including newlines
perl -pe 's/[[:ascii:]]/ /g;' filename

UPDATE: Using @user1516947's example, I've slightly modified the perl solution to collapse runs of ASCII characters into one space (and remove unwanted leading and trailing spaces):

perl -pe 's/[[:ascii:]]+/ /g; s/^\s+|\s+$//g' filename

Command line usage example based on sample input:

echo "INSERT INTO text (old_id,old_text,old_flags) VALUES (2815829,'[[?????:Youth-soccer-indiana.jpg|thumb|300px|right|???? ?? ???.???????? ??????, ??? ?????? ??? ??, ?? ??? ??? ?? ?? ??????? ??????? ?? ?? ?????? ???? ???]]\n\n\'\'\'???\'\'\', ???????? ==\n\"???\" (\"???????\") ???? ?? [[?????? ??????]] ???? \'\'????????? (desport)\'\' ?? ???????? ??? ??, ????? ???? \"?????\" ???\n\n== ?????? ==\n\n[[?????:Greek statue discus thrower 2 century aC.jpg|thumb|150px|right|2" | perl -pe 's/[[:ascii:]]+/ /g; s/^\s+|\s+$//g'

Output:

 ????? ???? ?? ??? ???????? ?????? ??? ?????? ??? ?? ?? ??? ??? ?? ?? ??????? ??????? ?? ?? ?????? ???? ??? ??? ???????? ??? ??????? ???? ?? ?????? ?????? ???? ????????? ?? ???????? ??? ?? ????? ???? ????? ??? ?????? ?????

(GNU) SED

Or in sed (in a Linux environment you have to set the LANG environment variable to make the sed range valid; note that LC_ALL, if set, takes precedence over LANG):

# this version does not replace newlines
LANG=C sed 's/[\d0-\d127]/ /g' filename

A less readable sed version that also replaces all newlines (except the last one):

LANG=C sed ':a;N;$!ba;s/[\d0-\d127]/ /g' filename

Answer 3

Answered by Giuseppe Ricupero

To get rid of the ASCII characters you can run sed over the ASCII byte range. sed consumes the newline that ends each line before matching, though, so if you want those gone too you need to follow it with tr.

echo -e "hi ? \nthere ?" | LANG=C sed "s/[\x01-\x7F]//g" | tr -d '\n'
??

Conversely, if you want to get rid of the Unicode characters, you can specify the non-ASCII byte range instead:

echo -e "hi ? \nthere ?" | LANG=C sed "s/[\x80-\xFF]//g"
hi
there
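
The `\x80-\xFF` escapes above rely on GNU sed. As a hedged alternative, and not something the answer itself proposes, the same byte range can be deleted with `tr` and octal escapes. The sketch below assumes the Devanagari letters U+0915 and U+0916 as sample input (written as octal UTF-8 bytes); `out` and `expected` are illustrative names.

```shell
# Delete every byte in the range 128-255 (octal 200-377), which removes
# all non-ASCII UTF-8 sequences and leaves ASCII, including newlines, alone.
out=$(printf 'hi \340\244\225\nthere \340\244\226\n' | LC_ALL=C tr -d '\200-\377')

# Expected: only the ASCII text remains (each line keeps its trailing space).
expected=$(printf 'hi \nthere ')

if [ "$out" = "$expected" ]; then
    echo "non-ASCII bytes removed"
fi
```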
