bash: Removing Parts of a String With sed

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/3106809/

Time: 2020-09-09 19:23:37  Source: igfitidea

Removing Parts of String With Sed

linux bash unix sed

Asked by neversaint

I have lines of data that look like this:

sp_A0A342_ATPB_COFAR_6_+_contigs_full.fasta
sp_A0A342_ATPB_COFAR_9_-_contigs_full.fasta
sp_A0A373_RK16_COFAR_10_-_contigs_full.fasta
sp_A0A373_RK16_COFAR_8_+_contigs_full.fasta
sp_A0A4W3_SPEA_GEOSL_15_-_contigs_full.fasta

How can I use sed to delete the part of each line after the 4th column (_ separated)? The result should be:

sp_A0A342_ATPB_COFAR
sp_A0A342_ATPB_COFAR
sp_A0A373_RK16_COFAR
sp_A0A373_RK16_COFAR
sp_A0A4W3_SPEA_GEOSL

Answered by Matthew Flaschen

cut is a better fit.

cut -d_ -f 1-4 old_file

This simply means: use _ as the delimiter and keep fields 1-4.
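
As a quick check against the sample data in the question, the cut command can be wrapped in a small shell function and fed lines on stdin. This is only a sketch: the function name first_four_fields is hypothetical and not part of the original answer.

```shell
#!/bin/sh
# first_four_fields is a hypothetical helper name, introduced for illustration.
# It reads lines on stdin and keeps underscore-separated fields 1 through 4.
first_four_fields() {
  cut -d_ -f1-4
}

# Two sample lines from the question, fed on stdin.
printf '%s\n' \
  'sp_A0A342_ATPB_COFAR_6_+_contigs_full.fasta' \
  'sp_A0A4W3_SPEA_GEOSL_15_-_contigs_full.fasta' |
  first_four_fields
# prints:
# sp_A0A342_ATPB_COFAR
# sp_A0A4W3_SPEA_GEOSL
```

A line with fewer than four fields is passed through unchanged, since cut only prints the fields that exist.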

If you insist on sed:

sed 's/\(_[^_]*\)\{4\}$//'

The left-hand side matches exactly four repetitions of a group consisting of an underscore followed by zero or more non-underscores. After those four groups, we must be at the end of the line. The whole match is replaced by nothing.
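
Because the pattern is anchored at the end of the line, it removes the last four fields rather than keeping the first four. The sketch below illustrates the difference. The helper name strip_last_four and the nine-field input line are made up for illustration.

```shell
#!/bin/sh
# strip_last_four is a hypothetical helper name wrapping the sed command above.
strip_last_four() {
  sed 's/\(_[^_]*\)\{4\}$//'
}

# Eight fields (sample line from the question): the first four remain.
printf '%s\n' 'sp_A0A342_ATPB_COFAR_6_+_contigs_full.fasta' | strip_last_four
# prints: sp_A0A342_ATPB_COFAR

# Nine fields (made-up input): five fields remain, not four.
printf '%s\n' 'sp_A0A342_ATPB_COFAR_6_+_contigs_full_v2.fasta' | strip_last_four
# prints: sp_A0A342_ATPB_COFAR_6
```

So this form fits the question's data, where every line has exactly eight fields, but it is not a general "keep the first four fields" command.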

Answered by Scott Thomson

sed -e 's/\([^_]*\)_\([^_]*\)_\([^_]*\)_\([^_]*\)_.*/\1_\2_\3_\4/' infile > outfile

Match any number of characters that are not '_', saving what was matched between \( and \), followed by '_'. Do this 4 times, then match anything for the rest of the line (to be discarded). Substitute with the four saved matches (\1 through \4), separated by '_'.

Answered by Owen S.

Here's another possibility:

sed -E -e 's|^([^_]+(_[^_]+){3}).*$|\1|'

where -E, like -r in GNU sed, turns on extended regular expressions for readability.
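
The same extended-regex idea can be parameterized on the number of fields to keep. This is a minimal sketch, assuming a sed that accepts -E (GNU and BSD sed both do). The function name keep_fields and its argument are hypothetical.

```shell
#!/bin/sh
# keep_fields is a hypothetical helper: keep the first N underscore-separated
# fields of each line read from stdin. Assumes sed supports -E.
keep_fields() {
  n=$1
  # One leading field, then (n - 1) repetitions of "_field"; the rest of the
  # line is matched by .* and dropped, leaving only capture group 1.
  sed -E "s|^([^_]+(_[^_]+){$((n - 1))}).*\$|\\1|"
}

printf '%s\n' 'sp_A0A373_RK16_COFAR_8_+_contigs_full.fasta' | keep_fields 4
# prints: sp_A0A373_RK16_COFAR
```

Lines with fewer than N non-empty fields do not match the pattern and are printed unchanged.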

Just because you can do it in sed, though, doesn't mean you should. I like cut much, much better for this.

Answered by Paused until further notice.

AWK likes to play in the fields:

awk 'BEGIN{FS=OFS="_"}{print $1,$2,$3,$4}' inputfile

or, more generally:

awk -v count=4 'BEGIN{FS="_"}{sep="";for(i=1;i<=count;i++){printf "%s%s",sep,$i;sep=FS};printf "\n"}'
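
The general form can also be written out as a shell function, which makes two details easier to see: the separator has to be cleared at the start of every record, otherwise every line after the first begins with "_", and the loop should stop at NF so that short lines do not gain trailing separators. This is a sketch only. The function name first_n_fields and the default of 4 are assumptions for illustration.

```shell
#!/bin/sh
# first_n_fields is a hypothetical wrapper; count defaults to 4 if no
# argument is given. Input lines are read from stdin.
first_n_fields() {
  awk -v count="${1:-4}" 'BEGIN { FS = "_" }
  {
    sep = ""                       # reset per record so no line starts with "_"
    n = (NF < count) ? NF : count  # do not print empty fields past the end
    for (i = 1; i <= n; i++) {
      printf "%s%s", sep, $i
      sep = FS
    }
    printf "\n"
  }'
}

printf '%s\n' \
  'sp_A0A342_ATPB_COFAR_9_-_contigs_full.fasta' \
  'sp_A0A373_RK16_COFAR_10_-_contigs_full.fasta' |
  first_n_fields 2
# prints:
# sp_A0A342
# sp_A0A373
```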

Answered by Slartibartfast

sed -e 's/_[0-9][0-9]*_[+-]_contigs_full\.fasta$//'

Still, the cut answer is probably faster and just generally better.

Answered by Peter Ajtai

Yes, cut is way better, and yes, matching the back of each line is easier.

I finally got a match using the beginning of each line:

sed -r 's/(([^_]*_){3}([^_]*)).*/\1/' oldFile > newFile