Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/19959746/

Extracting columns from text file with different delimiters in Linux

linux

Asked by user1815498

I have very large genotype files that are basically impossible to open in R, so I am trying to extract the rows and columns of interest using the Linux command line. Rows are straightforward enough using head/tail, but I'm having difficulty figuring out how to handle the columns.

If I attempt to extract (say) the 100th through 105th tab- or space-delimited columns using

 cut -c100-105 myfile >outfile

this obviously won't work if there are strings of multiple characters in each column. Is there some way to modify cut with appropriate arguments so that it extracts the entire string within a column, where columns are defined as space or tab (or any other character) delimited?

Accepted answer by hek2mgl

If the command should work with both tabs and spaces as the delimiter, I would use awk:

awk '{print $100,$101,$102,$103,$104,$105}' myfile > outfile

As long as you only need a handful of fields, it is fine to just type them out; for longer ranges you can use a for loop:

awk '{for(i=100;i<=105;i++) printf "%s%s", $i, (i<105 ? OFS : ORS)}' myfile > outfile
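If the range changes often, the loop can be wrapped in a small shell function. This is only a sketch: the name extract_fields and its argument order are made up for illustration and are not part of awk. It relies on awk's default field splitting, which treats any run of spaces and/or tabs as a single separator, and it writes one output row per input row.

```shell
# Hypothetical helper (the name is an assumption, not a standard tool):
# print fields FIRST..LAST of every input line, keeping one row per line.
extract_fields() {
    first=$1
    last=$2
    shift 2
    # Remaining arguments (if any) are input files; with none, awk reads stdin.
    awk -v first="$first" -v last="$last" '{
        for (i = first; i <= last; i++)
            printf "%s%s", $i, (i < last ? OFS : ORS)
    }' "$@"
}

# Tiny inline sample mixing spaces and tabs; extract fields 2 through 4.
printf 'a  b\tc d e\n1 2   3\t4 5\n' | extract_fields 2 4
```

For the files in the question this would be called as `extract_fields 100 105 myfile > outfile`. The output fields are joined with awk's OFS, a single space by default; passing `-v OFS='\t'` inside the function would give tab-separated output instead.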


If you want to use cut, you need to use the -f option:

cut -f100-105 myfile > outfile

If the field delimiter is different from TAB, you need to specify it using -d:

cut -d' ' -f100-105 myfile > outfile
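One thing to watch for with a space delimiter: cut treats every single delimiter character as a field boundary, so two spaces in a row produce an empty field between them. If the columns are padded with runs of spaces, one possible workaround (a sketch, not the only way) is to squeeze the runs into single tabs with tr first. The helper name squeeze_to_tabs is made up for illustration.

```shell
# Hypothetical helper: turn every run of spaces into a single tab.
# Note that tr -s also squeezes runs of tabs already present in the input,
# so genuinely empty tab-delimited fields would be collapsed as well.
squeeze_to_tabs() {
    tr -s ' ' '\t'
}

# Two spaces between "a" and "b": field 2 is the empty string.
printf 'a  b c\n' | cut -d' ' -f2

# After squeezing, field 2 is "b" (tab is cut's default delimiter).
printf 'a  b c\n' | squeeze_to_tabs | cut -f2
```

With the file from the question, this would look like `squeeze_to_tabs < myfile | cut -f100-105 > outfile`.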

Check the man page for more info on the cut command.

Answer by asalic

You can use cut with a delimiter like this:

With a space delimiter:

cut -d " " -f1-100,1000-1005 infile.csv > outfile.csv

With a tab delimiter:

cut -d$'\t' -f1-100,1000-1005 infile.csv > outfile.csv

I gave you the form of cut that lets you extract a comma-separated list of field ranges.
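To see how a list of ranges behaves, here is a small sketch on an inline sample line with seven tab-separated fields (the sample data is invented for illustration). Tab is cut's default delimiter, so -d is left out here.

```shell
# Select fields 1-2 and 6-7 from a seven-field, tab-separated line.
# Output: f1, f2, f6, f7 separated by tabs.
printf 'f1\tf2\tf3\tf4\tf5\tf6\tf7\n' | cut -f1-2,6-7

# cut writes the selected fields in input order, whatever the order of the list,
# so this prints the same line as the command above.
printf 'f1\tf2\tf3\tf4\tf5\tf6\tf7\n' | cut -f6-7,1-2
```

If the columns need to come out in a different order than they appear in the file, awk is the better fit, since it prints fields in whatever order they are named.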

Hope it helps!
