Original source: http://stackoverflow.com/questions/31024928/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How to match for multiple patterns in the specific column?
Asked by Learner
I was wondering if there would be a more efficient way to use awk/grep/sed to solve the following problem?
I would like to parse a certain column of my input file (column 1 in this example) and use awk/grep/any other tool to select the lines whose value in that column matches my query. For example, given the file below:
chr1 3009844 3009908 DXX 42 -
chr2 3000386 3000450 DXX 15 -
chr3 3000386 3000450 DXX 15 -
chr4 3000386 3000450 DXX 15 -
chr5 3000386 3000450 DXX 15 -
chr6 3000386 3000450 DXX 15 -
chr7 3000386 3000450 DXX 15 -
chr8 3000386 3000450 DXX 15 -
chr9 3000386 3000450 DXX 15 -
chr10 3000386 3000450 DXX 15 -
chr11 3000386 3000450 DXX 15 -
chr12 3000386 3000450 DXX 15 -
chr13 3000386 3000450 DXX 15 -
chr14 3000386 3000450 DXX 15 -
chr15 3000386 3000450 DXX 15 -
chr16 3000386 3000450 DXX 15 -
chr17 3000386 3000450 DXX 15 -
chr18 3000386 3000450 DXX 15 -
chr19 3000386 3000450 DXX 15 -
chrX 3000386 3000450 DXX 15 -
chrY 3000386 3000450 DXX 15 -
chr1_GL456210_random 3000386 3000450 DXX 15 -
chr1_GL456211_random 3000386 3000450 DXX 15 -
chr1_GL456212_random 3000386 3000450 DXX 15 -
chr1_GL456221_random 3000386 3000450 DXX 15 -
chr4_GL456216_random 3000386 3000450 DXX 15 -
chr4_JH584292_random 3000386 3000450 DXX 15 -
chr4_JH584295_random 3000386 3000450 DXX 15 -
chr5_GL456354_random 3000386 3000450 DXX 15 -
chr5_JH584296_random 3000386 3000450 DXX 15 -
chr5_JH584297_random 3000386 3000450 DXX 15 -
chr5_JH584299_random 3000386 3000450 DXX 15 -
chrX_GL456233_random 3000386 3000450 DXX 15 -
I would like the output to contain only the lines that have chr1-chr22, chrX or chrY in the first column, for instance:
chr1 3009844 3009908 DXX 42 -
chr2 3000386 3000450 DXX 15 -
chr3 3000386 3000450 DXX 15 -
chr4 3000386 3000450 DXX 15 -
chr5 3000386 3000450 DXX 15 -
chr6 3000386 3000450 DXX 15 -
chr7 3000386 3000450 DXX 15 -
chr8 3000386 3000450 DXX 15 -
chr9 3000386 3000450 DXX 15 -
chr10 3000386 3000450 DXX 15 -
chr11 3000386 3000450 DXX 15 -
chr12 3000386 3000450 DXX 15 -
chr13 3000386 3000450 DXX 15 -
chr14 3000386 3000450 DXX 15 -
chr15 3000386 3000450 DXX 15 -
chr16 3000386 3000450 DXX 15 -
chr17 3000386 3000450 DXX 15 -
chr18 3000386 3000450 DXX 15 -
chr19 3000386 3000450 DXX 15 -
chrX 3000386 3000450 DXX 15 -
chrY 3000386 3000450 DXX 15 -
I managed to find the solution using the command below:
awk '$1 == "chr1" || $1 == "chr2" || $1 == "chr3" || $1 == "chr4" || $1 == "chr5" || $1 == "chr6" || $1 == "chr7" || $1 == "chr8" || $1 == "chr9" || $1 == "chr10" || $1 == "chr11" || $1 == "chr12" || $1 == "chr13" || $1 == "chr14" || $1 == "chr15" || $1 == "chr16" || $1 == "chr17" || $1 == "chr18" || $1 == "chr19" || $1 == "chr20" || $1 == "chrX" || $1 == "chrY"' in_file > out_file
It works fine, but I was wondering whether there is a more elegant way to solve the problem. If you could also point me to resources for learning awk/grep on Linux, it would be much appreciated!
Answered by fedorqui 'SO stop harming'
Use a regular expression:
awk '$1 ~ /^chr(1?[0-9]|2[0-2]|X|Y)$/' file
This uses $1 ~ /^pattern$/ to choose the lines whose first field consists of exactly pattern (note ^ for the beginning and $ for the end).
The pattern has the form chr(..|..|..), meaning: match chr followed by any one of the |-separated alternatives within ().
These conditions can be either of:
- a digit, optionally preceded by 1 (1?[0-9])
- 2 followed by any of 0, 1, 2 (2[0-2])
- X
- Y
Demo automatically explained: https://regex101.com/r/gH1kS4/2
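To check the behaviour locally, here is a minimal sketch that runs the same awk test on a few lines. The sample variable and its contents are made up for illustration (only the line format follows the question), and the chr22 line is added just to exercise the 2[0-2] branch.

```shell
# Hypothetical sample: three primary chromosomes and one unplaced scaffold.
sample='chr1 3009844 3009908 DXX 42 -
chr22 3000386 3000450 DXX 15 -
chrX 3000386 3000450 DXX 15 -
chr1_GL456210_random 3000386 3000450 DXX 15 -'

# Keep only the rows whose first field is exactly chr followed by
# a one- or two-digit number allowed by the pattern, X or Y.
kept=$(printf '%s\n' "$sample" | awk '$1 ~ /^chr(1?[0-9]|2[0-2]|X|Y)$/')

printf '%s\n' "$kept"
```

Because the pattern is anchored with ^ and $ and is tested against $1 only, the scaffold line is rejected even though it starts with chr1.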
Answered by henfiber
If you want something easier to maintain (e.g. editing or adding new lines/patterns to match) and also easier to understand, especially if you have just started working with regular expressions, use the grep -f match.list input.txt form:
Create a file with the patterns you want to match (match.list):
^chr[1-9][[:space:]]\| # this matches chr1-chr9
^chr1[0-9][[:space:]]\| # this matches chr10-chr19
^chr2[0-2][[:space:]]\| # this matches chr20-chr22
^chr[XY][[:space:]]\| # this matches chrX and chrY
new_string_or_pattern\| # ... your new pattern ...
Then just call grep like this:
grep -f match.list input.txt
As you can see above, you can even add comments to the list of patterns using the \| trick (ending each pattern with \|), so you can remember what you did yesterday or where you found the regex. Note that grep treats the text after \| as a second alternative, so a line containing that exact comment text would also match. You can add new fixed strings or patterns just by adding new lines. Also, if you find it difficult to write a complex regex, you can simply create a pattern file with the fixed strings you want to match:
^chrX
^chrY
...
Another benefit of this approach is that you may maintain several pattern files, representing different sub-queries you may need to run daily. E.g.
grep -f chromosomes_n input.txt
grep -f chromosomes_xy input.txt
grep -f chromosomes_random input.txt
The only drawback of this approach is that grep gets slower if you add more than a dozen patterns to each file, but that is a problem only if your input file has hundreds of thousands of lines.
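As an end-to-end illustration of the pattern-file approach, the sketch below builds both files in a temporary directory and runs grep -f on them. It uses a simplified, comment-free pattern file that covers chr1-chr22, chrX and chrY; the workdir variable and the sample input lines are assumptions for this demo, and mktemp -d is assumed to be available (it is on common Linux and macOS systems).

```shell
# Work in a throwaway directory; the file names follow the answer above.
workdir=$(mktemp -d)

# Hypothetical pattern file: one basic regular expression per line.
cat > "$workdir/match.list" <<'EOF'
^chr[1-9][[:space:]]
^chr1[0-9][[:space:]]
^chr2[0-2][[:space:]]
^chr[XY][[:space:]]
EOF

# Hypothetical input with one scaffold line that should be dropped.
cat > "$workdir/input.txt" <<'EOF'
chr2 3000386 3000450 DXX 15 -
chr10 3000386 3000450 DXX 15 -
chrY 3000386 3000450 DXX 15 -
chr4_JH584292_random 3000386 3000450 DXX 15 -
EOF

# A line is printed if it matches any of the patterns in match.list.
matched=$(grep -f "$workdir/match.list" "$workdir/input.txt")
printf '%s\n' "$matched"

# Clean up the temporary files.
rm -r "$workdir"
```

The trailing [[:space:]] in each pattern is what keeps chr4_JH584292_random out: after chr4 the next character is an underscore, not whitespace.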
Answered by arco444
You can use this simplified regex with grep:
grep "^chr\(1\?[0-9]\|2[012]\|[XY]\)[[:space:]]" filename
The logic is contained within the parentheses \(..\):
1\?[0-9] - match 0-9 optionally preceded by 1
2[012] - match 2 followed by 0, 1 or 2
[XY] - match X or Y
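The \? and \| operators in a basic regular expression are GNU extensions, so on other grep implementations the command may behave differently. A possible portable variant, sketched here under that assumption, is the same pattern written as an extended regular expression with grep -E, where ?, | and () need no backslashes. The sample lines are made up for the demo; chr23 is included only to show that it is rejected.

```shell
# Hypothetical sample; chr23 and the scaffold line should be rejected.
sample='chr9 3000386 3000450 DXX 15 -
chr20 3000386 3000450 DXX 15 -
chr23 3000386 3000450 DXX 15 -
chrX_GL456233_random 3000386 3000450 DXX 15 -'

# Same logic as the answer's pattern, in extended (ERE) syntax.
ere=$(printf '%s\n' "$sample" | grep -E '^chr(1?[0-9]|2[012]|[XY])[[:space:]]')

printf '%s\n' "$ere"
```

Only the chr9 and chr20 lines remain, because the alternative inside the parentheses must be followed directly by whitespace.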
Answered by Ed Morton
Given your posted example, all you need to get the output you want is either of these (or other simple REs):
awk '$1 !~ /_/' file
awk '$1 ~ /^[[:alnum:]]+$/' file
so you MAY not have to list specific "patterns" at all depending on your real world requirements.
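To make the trade-off concrete, here is a small sketch that applies both filters to a made-up sample. The chrM line is a hypothetical addition (it does not appear in the question) that shows what "depending on your real world requirements" can mean: these filters keep any name without an underscore, not just chr1-chr22, chrX and chrY.

```shell
# Hypothetical sample; chrM is not in the question's data.
sample='chr1 3009844 3009908 DXX 42 -
chrX 3000386 3000450 DXX 15 -
chrM 3000386 3000450 DXX 15 -
chr5_JH584296_random 3000386 3000450 DXX 15 -'

# Drop rows whose first field contains an underscore.
no_underscore=$(printf '%s\n' "$sample" | awk '$1 !~ /_/')

# Keep rows whose first field is purely alphanumeric.
alnum_only=$(printf '%s\n' "$sample" | awk '$1 ~ /^[[:alnum:]]+$/')

printf '%s\n' "$no_underscore"
```

With a POSIX-compliant awk, both variables should hold the same three lines for this sample, chrM included. If a name like that must be excluded, an explicit pattern is the safer choice.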
Answered by Themis Giannoulis
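The command below will do the job. Note that -w is left out: grep counts the underscore as a word character, so -w 'random' would not match names such as chr1_GL456210_random.
grep -v 'random' file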