Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/31024928/

Time: 2020-09-18 13:15:33  Source: igfitidea

How to match for multiple patterns in the specific column?

Tags: bash, unix, awk, grep, pattern-matching

Asked by Learner

I was wondering whether there is a more efficient way to use awk/grep/sed to solve the following problem.

I would like to parse a certain column of my input file (column 1 in this example) and use awk/grep/any other tool to select the rows whose value in that column matches my query. For example, given the file below:

chr1    3009844 3009908 DXX 42  -
chr2    3000386 3000450 DXX 15  -
chr3    3000386 3000450 DXX 15  -
chr4    3000386 3000450 DXX 15  -
chr5    3000386 3000450 DXX 15  -
chr6    3000386 3000450 DXX 15  -
chr7    3000386 3000450 DXX 15  -
chr8    3000386 3000450 DXX 15  -
chr9    3000386 3000450 DXX 15  -
chr10   3000386 3000450 DXX 15  -
chr11   3000386 3000450 DXX 15  -
chr12   3000386 3000450 DXX 15  -
chr13   3000386 3000450 DXX 15  -
chr14   3000386 3000450 DXX 15  -
chr15   3000386 3000450 DXX 15  -
chr16   3000386 3000450 DXX 15  -
chr17   3000386 3000450 DXX 15  -
chr18   3000386 3000450 DXX 15  -
chr19   3000386 3000450 DXX 15  -
chrX    3000386 3000450 DXX 15  -
chrY    3000386 3000450 DXX 15  -
chr1_GL456210_random    3000386 3000450 DXX 15  -
chr1_GL456211_random    3000386 3000450 DXX 15  -
chr1_GL456212_random    3000386 3000450 DXX 15  -
chr1_GL456221_random    3000386 3000450 DXX 15  -
chr4_GL456216_random    3000386 3000450 DXX 15  -
chr4_JH584292_random    3000386 3000450 DXX 15  -
chr4_JH584295_random    3000386 3000450 DXX 15  -
chr5_GL456354_random    3000386 3000450 DXX 15  -
chr5_JH584296_random    3000386 3000450 DXX 15  -
chr5_JH584297_random    3000386 3000450 DXX 15  -
chr5_JH584299_random    3000386 3000450 DXX 15  -
chrX_GL456233_random    3000386 3000450 DXX 15  -

I would like the output to contain only the rows with chr1-chr22, chrX or chrY in the first column, for instance:

chr1    3009844 3009908 DXX 42  -
chr2    3000386 3000450 DXX 15  -
chr3    3000386 3000450 DXX 15  -
chr4    3000386 3000450 DXX 15  -
chr5    3000386 3000450 DXX 15  -
chr6    3000386 3000450 DXX 15  -
chr7    3000386 3000450 DXX 15  -
chr8    3000386 3000450 DXX 15  -
chr9    3000386 3000450 DXX 15  -
chr10   3000386 3000450 DXX 15  -
chr11   3000386 3000450 DXX 15  -
chr12   3000386 3000450 DXX 15  -
chr13   3000386 3000450 DXX 15  -
chr14   3000386 3000450 DXX 15  -
chr15   3000386 3000450 DXX 15  -
chr16   3000386 3000450 DXX 15  -
chr17   3000386 3000450 DXX 15  -
chr18   3000386 3000450 DXX 15  -
chr19   3000386 3000450 DXX 15  -
chrX    3000386 3000450 DXX 15  -
chrY    3000386 3000450 DXX 15  -

I managed to find the solution using the command below:

awk '$1 == "chr1" || $1 == "chr2" || $1 == "chr3" || $1 == "chr4" || $1 == "chr5" || $1 == "chr6" || $1 == "chr7" || $1 == "chr8" || $1 == "chr9" || $1 == "chr10" || $1 == "chr11" || $1 == "chr12" || $1 == "chr13" || $1 == "chr14" || $1 == "chr15" || $1 == "chr16" || $1 == "chr17" || $1 == "chr18" || $1 == "chr19" || $1 == "chr20" || $1 == "chrX" || $1 == "chrY"' in_file > out_file

It works fine, but I was wondering whether anyone has a more elegant way to solve the problem. If you could also point me to resources for learning awk/grep on Linux, it would be much appreciated!

Answered by fedorqui 'SO stop harming'

Use a regular expression:

awk '$1 ~ /^chr(1?[0-9]|2[0-2]|X|Y)$/' file

This uses $1 ~ /^pattern$/ to choose the lines whose first field consists of exactly pattern (note ^ for the beginning and $ for the end).

The pattern has the form chr(..|..|..), meaning: match chr followed by any one of the |-separated alternatives within ().

Each alternative can be any of:

  • a number from 0 to 19: a digit, optionally preceded by 1 (1?[0-9])
  • a number from 20 to 22: 2 followed by any of 0, 1, 2 (2[0-2])
  • X
  • Y

Demo automatically explained: https://regex101.com/r/gH1kS4/2
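
A minimal, self-contained sketch of this regular expression in action. The file names sample.txt and filtered.txt and the handful of rows are assumptions chosen for illustration (names in the style of the question's input, plus a hypothetical chr23 row to show the upper bound):

```shell
# Build a small tab-separated sample (file name and rows are illustrative).
printf '%s\t3000386\t3000450\tDXX\t15\t-\n' \
    chr1 chr10 chr22 chr23 chrX chrY chr1_GL456210_random chrX_GL456233_random \
    > sample.txt

# Keep only rows whose first column is exactly chr + (0-19 | 20-22 | X | Y).
awk '$1 ~ /^chr(1?[0-9]|2[0-2]|X|Y)$/' sample.txt > filtered.txt

# Show the first column of the surviving rows.
cut -f1 filtered.txt
```

Because the pattern is anchored with ^ and $ and tested against $1 only, chr1_GL456210_random is rejected even though it starts with chr1. Note that 1?[0-9] would also accept chr0, should such a name ever occur in the input.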

Answered by henfiber

If you want something easier to maintain (e.g. editing or adding new lines/patterns to match) and also easier to understand, especially if you have just started working with regular expressions, use the grep -f match.list input.txt format:

Create a file with the patterns you want to match (match.list):

^chr[1-9][[:space:]]\|      # this matches chr1-chr9
^chr1[0-9][[:space:]]\|     # this matches chr10-chr19
^chr2[0-2][[:space:]]\|     # this matches chr20-chr22
^chr[XY][[:space:]]\|       # this matches chrX and chrY
new_string_or_pattern\|     # ... your new pattern ...

Then just call grep like this:

grep -f match.list input.txt

As you can see above, you can even add comments to the list of patterns using the \| trick (ending each pattern with \|), so you can remember what you did yesterday or where you found the regex. This works because GNU grep treats \| as alternation in basic regular expressions, so the comment text is simply a second alternative that is unlikely to match any input line. You can add new fixed strings or patterns just by adding new lines. Also, if you find it difficult to write a complex regex, you can create a pattern file with just the fixed strings you want to match:

^chrX
^chrY
...

Another benefit of this approach is that you may maintain several pattern files, representing different sub-queries you may need to run daily. E.g.

grep -f chromosomes_n input.txt
grep -f chromosomes_xy input.txt
grep -f chromosomes_random input.txt

The only drawback of this approach is that grep will get slower if you add more than a dozen patterns to each file. But that will be a problem only if your input file has hundreds of thousands of lines.
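
A minimal end-to-end sketch of the pattern-file approach, kept free of comments inside the pattern file for simplicity. The rows written to input.txt, the output file kept.txt and the variable prefix_hits are assumptions for illustration, and only a subset of the patterns is used:

```shell
# Illustrative input (rows are assumptions, in the style of the question).
printf '%s\t3000386\t3000450\tDXX\t15\t-\n' \
    chr2 chr14 chrX chr4_JH584292_random chrX_GL456233_random > input.txt

# One pattern per line; [[:space:]] ends the match at the column boundary.
cat > match.list <<'EOF'
^chr[1-9][[:space:]]
^chr1[0-9][[:space:]]
^chr[XY][[:space:]]
EOF

# A line is printed if it matches any pattern in the file.
grep -f match.list input.txt | cut -f1 > kept.txt
cat kept.txt

# A bare prefix such as ^chrX has no column boundary, so it also matches
# chrX_GL456233_random; count how many lines it hits.
prefix_hits=$(grep -c '^chrX' input.txt)
echo "lines matching the bare prefix ^chrX: $prefix_hits"
```

The last command is a reminder that fixed-string patterns anchored only at the start will also select names that merely begin with the same text, so a trailing [[:space:]] is worth keeping when exact column values are wanted.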

Answered by arco444

You can use this simplified regex with grep:

grep "^chr\(1\?[0-9]\|2[012]\|[XY]\)[[:space:]]" filename

The logic is contained within the parentheses \(..\):

  • 1\?[0-9] - match 0-9, optionally preceded by 1
  • 2[012] - match 2 followed by 0, 1 or 2
  • [XY] - match X or Y
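
The \? and \| forms used above are GNU extensions to basic regular expressions rather than POSIX syntax. A minimal sketch of the same filter written as an extended regular expression with grep -E, which needs no backslashes; the rows written to filename and the output file ere_kept.txt are assumptions for illustration:

```shell
# Illustrative input (rows are assumptions, in the style of the question).
printf '%s\t3000386\t3000450\tDXX\t15\t-\n' \
    chr7 chr19 chr21 chrY chr5_GL456354_random > filename

# Same logic as the basic-regex version, written as an ERE.
grep -E '^chr(1?[0-9]|2[012]|[XY])[[:space:]]' filename | cut -f1 > ere_kept.txt
cat ere_kept.txt
```

Single quotes are used here so the shell passes the pattern to grep untouched.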

Answered by Ed Morton

Given your posted example, all you need to get the output you want is either of these (or other simple REs):

awk '$1 !~ /_/' file
awk '$1 ~ /^[[:alnum:]]+$/' file

so you MAY not have to list specific "patterns" at all depending on your real world requirements.
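
A minimal sketch of where the generic filter and an explicit chr1-chr22/X/Y regex can diverge. The chrM row is a hypothetical name that does not appear in the posted example, and the file names generic.txt and explicit.txt and the variable extra are likewise assumptions:

```shell
# Illustrative input; chrM is a hypothetical name with no underscore.
printf '%s\t3000386\t3000450\tDXX\t15\t-\n' \
    chr1 chrX chrM chr1_GL456210_random > file

# Generic filter: drop any row whose first column contains an underscore.
awk '$1 !~ /_/' file | cut -f1 > generic.txt

# Explicit filter: only chr1-chr22, chrX and chrY.
awk '$1 ~ /^chr(1?[0-9]|2[0-2]|X|Y)$/' file | cut -f1 > explicit.txt

# Names kept by the generic filter but not by the explicit one.
extra=$(grep -v -x -F -f explicit.txt generic.txt)
echo "$extra"
```

If names like this can occur in the real data and are not wanted, the explicit pattern is the safer choice; otherwise the one-line generic filter is enough.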

Answered by Themis Giannoulis

The following will do the job.

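# -w is dropped here: "_" counts as a word character, so -w 'random' would not match names like chr1_GL456210_random
grep -v 'random' file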
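
How -w treats underscores is worth checking before relying on whole-word matching for these names. A minimal sketch, assuming GNU grep word semantics (letters, digits and the underscore are word characters); the rows and the variable names with_w and without_w are illustrative:

```shell
# Illustrative input (rows are assumptions, in the style of the question).
printf '%s\t3000386\t3000450\tDXX\t15\t-\n' \
    chr1 chrY chr1_GL456210_random chrX_GL456233_random > file

# With -w, "random" must be delimited by non-word characters; "_" is a word
# character, so "_random" does not count as a whole-word match.
with_w=$(grep -v -w 'random' file | wc -l)

# Without -w, any line containing "random" is excluded.
without_w=$(grep -v 'random' file | wc -l)

echo "kept with -w: $with_w, kept without -w: $without_w"
```

On this sample the -w variant keeps all four lines, while the plain substring match keeps only the chr1 and chrY rows.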