Linux: how to subset a file - select a number of rows or columns

Disclaimer: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/6491532/

how to subset a file - select a number of rows or columns

linux unix sed awk cut

Asked by jianfeng.mao

I would like your advice/help on how to subset a big file (millions of rows or columns).

For example,

(1) I have a big file (millions of rows, tab-delimited). I want a subset of this file containing only rows 10000 to 100000.

(2) I have a big file (millions of columns, tab-delimited). I want a subset of this file containing only columns 10000 to 100000.

I know there are tools like head, tail, cut, split, awk and sed. I can use them for simple subsetting, but I do not know how to do this particular job.

Could you please give any advice? Thanks in advance.

Accepted answer by Drakosha

Filtering rows is easy, for example with AWK:

cat largefile | awk 'NR >= 10000  && NR <= 100000 { print }'

Filtering columns is easier with CUT:

cat largefile | cut -d $'\t' -f 10000-100000

As Rahul Dravid mentioned, cat is not a must here, and as Zsolt Botykai added, you can improve performance using:

awk 'NR > 100000 { exit } NR >= 10000 && NR <= 100000' largefile
cut -d $'\t' -f 10000-100000 largefile
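
A further small simplification of the row filter: since awk exits before any line past 100000 is tested, the upper-bound check can be dropped:

awk 'NR > 100000 { exit } NR >= 10000' largefile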

Answer by Vijay

Some different solutions:

For row ranges, in sed:

sed -n 10000,100000p somefile.txt
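
If the file is very large, a minor variant (assuming a sed that accepts semicolon-separated commands, as GNU and BSD sed both do) stops reading as soon as the range has been printed instead of scanning to the end of the file:

sed -n '10000,100000p;100001q' somefile.txt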

For column ranges, in awk:

awk -v f=10000 -v t=100000 '{ for (i=f; i<=t;i++) printf("%s%s", $i,(i==t) ? "\n" : OFS) }' details.txt
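
A quick way to sanity-check the loop is to run it on a small made-up range, here columns 2 to 4 of a 5-column file built with seq and paste:

$ seq 15 | paste - - - - - | awk -v f=2 -v t=4 '{ for (i=f; i<=t; i++) printf("%s%s", $i, (i==t) ? "\n" : OFS) }'
2 3 4
7 8 9
12 13 14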

Answer by Fredrik Pihl

Was beaten to it for the sed solution, so I'll post a perl ditto instead. To print selected lines:

$ seq 100 | perl -ne 'print if $. >= 10 && $. <= 20' 
10
11
12
13
14
15
16
17
18
19
20

To print selected columns, use:

perl -lane 'print "@F[1..3]"'

-F is used in conjunction with -a to choose the delimiter on which to split lines.
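
For instance, to split on commas instead of whitespace and print the second and third columns (a small made-up input, just to illustrate -F):

$ echo a,b,c,d | perl -F, -lane 'print join ",", @F[1..2]'
b,c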

To test, use seq and paste to generate some columns:

$ seq 50 | paste - - - - -
1   2   3   4   5
6   7   8   9   10
11  12  13  14  15
16  17  18  19  20
21  22  23  24  25
26  27  28  29  30
31  32  33  34  35
36  37  38  39  40
41  42  43  44  45
46  47  48  49  50

Let's print everything except the first and the last column:

$ seq 50 | paste - - - - - | perl -lane 'print join "   ", @F[1..3]'
2   3   4
7   8   9
12  13  14
17  18  19
22  23  24
27  28  29
32  33  34
37  38  39
42  43  44
47  48  49

In the join statement above, the separator is a literal tab; you get it by typing Ctrl-V Tab.
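
If typing a literal tab is inconvenient, the escape sequence \t inside the one-liner gives the same result:

$ seq 50 | paste - - - - - | perl -lane 'print join "\t", @F[1..3]'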

Answer by Warren

For the first problem, selecting a set of rows from a large file, piping tail to head is very simple. You want rows 10000 through 100000 of largefile, which is 90001 lines: tail grabs the back end of largefile starting at row 10000, and then head chops off all but the first 90001 rows.

tail -n +10000 largefile | head -n 90001
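
To sanity-check the arithmetic on small data, the same pipeline can select rows 10 through 20 of a 100-line stream:

seq 100 | tail -n +10 | head -n 11

This prints the numbers 10 through 20, the same lines the perl example above selected.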