Notice: this page is adapted from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/5859628/

Date: 2020-09-17 23:54:38  Source: igfitidea

How to remove nonnumeric junk from a file

Tags: regex, bash, text, sed

Asked by Mike

Here's an output from less:

487451
487450<A3><BA>1<A3><BA>1
487449<A3><BA>1<A3><BA>1
487448<A3><BA>1<A3><BA>1
487447<A3><BA>1<A3><BA>1
487446<A3><BA>1<A3><BA>1
487445<A3><BA>1<A3><BA>1
484300<A3><BA>1<A3><BA>1
484299<A3><BA>1<A3><BA>1
484297<A3><BA>1<A3><BA>1
484296<A3><BA>1<A3><BA>1
484295<A3><BA>1<A3><BA>1
484294<A3><BA>1<A3><BA>1
484293<A3><BA>1<A3><BA>1
483496
483495
483494
483493
483492
483491

I see a bunch of nonprintable characters here. How do I remove them using sed/tr?

My try was 's/\([0-9][0-9]*\)/\1/g', but it doesn't work.

EDIT: Okay, let's go further down the source. The numbers are extracted from this file:

487451"><img src="Manage/pic/20100901/Adidas running-429.JPG" alt="Adidas running-429" height="120" border="0" class="BK01" onload='javascript:if(this.width>160){this.width=160}' /></a></td>
487450"><img src="Manage/pic/20100901/Adidas fs 1<A3><BA>1-060.JPG" alt="Adidas fs 1<A3><BA>1-060" height="120" border="0" class="BK01" onload='javascript:if(this.width>160){this.width=160}' /></a></td>

The first line is perfectly normal, and most of the lines look like it. The second is "corrupted". I'd just like to extract the number at the beginning (using 's/\([0-9][0-9]*\).*/\1/g'), but somehow the nonprintables get in the way of the regex, which should consume everything from the " onward.

EDIT II: Here's a clarification: There are no brackets in the text file. These are character codes of nonprintable characters. The brackets are there because I copied the file from less. Mac's Terminal, on the other hand, uses ?? to represent such characters. I bet xterm on my Ubuntu would print that white oval with a question mark.

Answered by Jonathan Leffler

Classic job for either sed or Unix's tr command.

sed 's/[^0-9]//g' $file

(Anything that is not a digit - or newline - is deleted.)

tr -cd '0-9\012' < "$file" > "$file.1"

Delete (-d) the complement (-c) of the digits and newline...
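
As a minimal sketch of the tr approach, the snippet below builds a two-line sample modeled on the question's output and strips everything except digits and newlines. The temporary file, the variable names, and the use of LC_ALL=C (to make tr work on plain bytes) are assumptions for illustration, not part of the original answer.

```shell
# Build a small sample: one clean line and one line carrying the raw bytes
# 0xA3 0xBA (octal 243 272), which less displays as <A3><BA>.
sample=$(mktemp)
printf '487451\n487450\243\2721\243\2721\n' > "$sample"

# Delete (-d) the complement (-c) of digits and newline.
# LC_ALL=C (an assumption here) makes tr treat the input as single bytes.
cleaned=$(LC_ALL=C tr -cd '0-9\n' < "$sample")
printf '%s\n' "$cleaned"

rm -f "$sample"
```

Note that this keeps every digit on the line, so the second sample line becomes 48745011 rather than 487450. If only the leading number is wanted, a pattern that discards the rest of the line is a better fit.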

Answered by deong

You missed the bit where you match the rest of the line.

sed 's/\([0-9][0-9]*\)[^0-9]*/\1/g'
                      ^^^^^^^^
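
To illustrate the difference, here is a small sketch comparing the attempt from the question with a version that also matches the trailing non-digits. The input line is a made-up, shortened stand-in for the real data, and the variable names and LC_ALL=C are assumptions.

```shell
# Hypothetical, shortened input line: a number followed by non-digit text.
line='487451">rest'

# The attempt from the question replaces each run of digits with itself,
# so the line comes out unchanged.
noop=$(printf '%s\n' "$line" | LC_ALL=C sed 's/\([0-9][0-9]*\)/\1/g')

# Also matching the following non-digits removes them from the output.
fixed=$(printf '%s\n' "$line" | LC_ALL=C sed 's/\([0-9][0-9]*\)[^0-9]*/\1/g')

printf '%s\n%s\n' "$noop" "$fixed"
```

Because of the g flag, any digits that appear later on a line are kept as well, so on lines like those in the question this may return more than the leading number.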

Answered by josh.trow

If you know the crap will always be inside brackets, why not delete that crap?

sed 's/<[^>]*>//g'

EDIT: Thanks, Mike, that makes sense. In that case, how about:

sed -E 's/([0-9]+).*/\1/g'

Answered by anubhava

Try this sed command:

sed 's/^\([0-9][0-9]*\).*$/\1/' file.txt

OUTPUT (running the above command on the input file you provided):

487451
487450
487449
487448
487447
487446
487445
484300
484299
484297
484296
484295
484294
484293
483496
483495
483494
483493
483492
483491
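
One plausible reason the .* in the question stopped early: in a UTF-8 locale, GNU sed does not let . match bytes that do not form a valid character. The sketch below runs the anchored pattern under LC_ALL=C, so every byte counts as a character. The sample line is a shortened, hypothetical version of the corrupted source line, and the locale override is an assumption rather than part of the original answer.

```shell
# Hypothetical sample modeled on the corrupted line: a leading number, then
# text containing the raw bytes 0xA3 0xBA (octal 243 272) and more digits.
# LC_ALL=C makes sed treat the input as single bytes, so .* can consume
# the stray bytes and only the captured leading number is kept.
number=$(printf '487450">Adidas fs 1\243\2721-060.JPG\n' |
  LC_ALL=C sed 's/^\([0-9][0-9]*\).*$/\1/')

printf '%s\n' "$number"
```

Anchoring at the start of the line and discarding everything after the first run of digits avoids picking up digits that appear later in the line.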

Answered by user2461982

If the data always looks like the sample, deleting from the less-than sign to the end of the line would work fine:

sed -i "s/<.*$//" file