bash: How to remove non-digit garbage from a file

Question

Asked by Mike

Here's an output from less:

487451
487450<A3><BA>1<A3><BA>1
487449<A3><BA>1<A3><BA>1
487448<A3><BA>1<A3><BA>1
487447<A3><BA>1<A3><BA>1
487446<A3><BA>1<A3><BA>1
487445<A3><BA>1<A3><BA>1
484300<A3><BA>1<A3><BA>1
484299<A3><BA>1<A3><BA>1
484297<A3><BA>1<A3><BA>1
484296<A3><BA>1<A3><BA>1
484295<A3><BA>1<A3><BA>1
484294<A3><BA>1<A3><BA>1
484293<A3><BA>1<A3><BA>1
483496
483495
483494
483493
483492
483491

I see a bunch of nonprintable characters here. How do I remove them using sed/tr?

My try was 's/\([0-9][0-9]*\)/\1/g', but it doesn't work.

EDIT: Okay, let's go further down the source. The numbers are extracted from this file:

487451"><img src="Manage/pic/20100901/Adidas running-429.JPG" alt="Adidas running-429" height="120" border="0" class="BK01" onload='javascript:if(this.width>160){this.width=160}' /></a></td>
487450"><img src="Manage/pic/20100901/Adidas fs 1<A3><BA>1-060.JPG" alt="Adidas fs 1<A3><BA>1-060" height="120" border="0" class="BK01" onload='javascript:if(this.width>160){this.width=160}' /></a></td>

The first line is perfectly normal, as are most of the lines. The second is "corrupted". I'd just like to extract the number at the beginning (using 's/\([0-9][0-9]*\).*/\1/g'), but somehow the nonprintables get into the regex, which should stop at the " character.

EDIT II: Here's a clarification: There are no brackets in the text file. These are character codes of nonprintable characters. The brackets are there because I copied the file from less. Mac's Terminal, on the other hand, uses ?? to represent such characters. I bet xterm on my Ubuntu would print that white oval with a question mark.

Answer 1

Answered by Jonathan Leffler

Classic job for either sed or Unix's tr command.

sed 's/[^0-9]//g' $file

(Anything that is not a digit - or newline - is deleted.)

tr -cd '0-9\012' < $file > $file.1

Delete (-d) the complement (-c) of the digits and newline...
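
As a minimal sketch of the tr variant, assuming the stray bytes are 0xA3 0xBA as less shows them. The helper name digits_only and the sample lines are made up for illustration, and LC_ALL=C is an added assumption so that tr works on raw bytes. Note that every digit on a line survives, including the 1s that follow the junk bytes.

```shell
#!/bin/sh
# digits_only: hypothetical helper name, not part of the original answer.
# Reads stdin and keeps only the bytes 0-9 and newline.
digits_only() {
    # LC_ALL=C (an assumption here) makes tr treat the input as raw bytes.
    LC_ALL=C tr -cd '0-9\n'
}

# \243\272 are the octal escapes for the bytes 0xA3 0xBA.
printf '487451\n487450\243\2721\243\2721\n' | digits_only
# prints:
# 487451
# 48745011
```

If only the leading number is wanted, deleting non-digits is not enough on its own, because the trailing 1s are digits too.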

Answer 2

Answered by deong

You missed the bit where you match the rest of the line.

sed 's/\([0-9][0-9]*\)[^0-9]*/\1/g'
                      ^^^^^^^
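
A rough sketch of how a substitution of this shape behaves on one of the corrupted lines, assuming the group is written back with \1 and the g flag is kept. The helper name keep_digit_runs and the sample line are made up for illustration. LC_ALL=C is an added assumption: in a UTF-8 locale some sed builds will not match the stray bytes with [^0-9] at all.

```shell
#!/bin/sh
# keep_digit_runs: hypothetical helper name for illustration only.
# Keeps each run of digits and drops the non-digit bytes that follow it.
keep_digit_runs() {
    # LC_ALL=C (an assumption) makes sed match raw bytes, one at a time.
    LC_ALL=C sed 's/\([0-9][0-9]*\)[^0-9]*/\1/g'
}

# \243\272 are the octal escapes for the bytes 0xA3 0xBA.
printf '487450\243\2721\243\2721\n' | keep_digit_runs
# prints 48745011
```

Because of the g flag, the later digit runs are kept as well, so the result is 48745011 rather than 487450.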

Answer 3

Answered by josh.trow

If you know the crap will always be inside brackets, why not delete that crap?

sed 's/<[^>]*>//g'

EDIT: Thanks, Mike, that makes sense. In that case, how about:

sed -E 's/([0-9]+).*/\1/g'

Answer 4

Answered by anubhava

Try this sed command:

sed 's/^\([0-9][0-9]*\).*$/\1/' file.txt

OUTPUT (running above command on the input file you provided)
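
As a rough check of this approach, the sketch below runs an anchored substitution on a line modeled after the corrupted sample from the question. The helper name leading_number and the shortened sample line are made up for illustration, and the \1 back-reference in the replacement is assumed. LC_ALL=C is also an assumption: it lets . match the stray bytes, which some sed builds refuse to do in a UTF-8 locale.

```shell
#!/bin/sh
# leading_number: hypothetical helper name for illustration only.
# Replaces each line that starts with digits by just those digits.
leading_number() {
    # LC_ALL=C (an assumption) so that . matches any byte, valid text or not.
    LC_ALL=C sed 's/^\([0-9][0-9]*\).*$/\1/'
}

# \243\272 are the octal escapes for the bytes 0xA3 0xBA.
printf '487450"><img src="Adidas fs 1\243\2721-060.JPG" height="120" />\n' | leading_number
# prints 487450
```

Lines that do not start with a digit do not match the pattern and pass through unchanged.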

Answer 5

Answered by user2461982

If the data is always like the sample, deleting from the less-than sign to the end of the line would work fine: sed -i "s/<.*$//" file
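
Since the question's second edit says the file holds raw bytes rather than literal brackets, a variant of the same idea is to cut at the first non-digit byte instead of at a less-than sign. This is a swapped-in pattern, not the command from this answer. The helper name cut_at_first_nondigit and the sample lines are made up for illustration, and LC_ALL=C is an assumption so that sed treats the stray bytes as ordinary characters.

```shell
#!/bin/sh
# cut_at_first_nondigit: hypothetical helper name for illustration only.
# Deletes everything from the first non-digit byte to the end of each line.
cut_at_first_nondigit() {
    # LC_ALL=C (an assumption) so that [^0-9] and . match raw bytes.
    LC_ALL=C sed 's/[^0-9].*$//'
}

# \243\272 are the octal escapes for the bytes 0xA3 0xBA.
printf '487451"><img src="a.JPG" />\n487450\243\2721\243\2721\n' | cut_at_first_nondigit
# prints:
# 487451
# 487450
```

A line that does not start with a digit comes out empty with this pattern, so it only suits input where every line begins with the number.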
