Linux: how to subset a file - select a number of rows or columns

Disclaimer: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/6491532/

how to subset a file - select a number of rows or columns

linux unix sed awk cut

Asked by jianfeng.mao

I would like your advice/help on how to subset a big file (millions of rows or columns).

For example,

(1) I have a big file (millions of rows, tab-delimited). I want a subset of this file containing only rows 10000 to 100000.

(2) I have a big file (millions of columns, tab-delimited). I want a subset of this file containing only columns 10000 to 100000.

I know there are tools like head, tail, cut, split, awk and sed. I can use them for simple subsetting, but I do not know how to do this particular job.

Could you please give any advice? Thanks in advance.

Accepted answer by Drakosha

Filtering rows is easy, for example with AWK:

cat largefile | awk 'NR >= 10000  && NR <= 100000 { print }'

Filtering columns is easier with CUT:

cat largefile | cut -d $'\t' -f 10000-100000

As Rahul Dravid mentioned, cat is not a must here, and as Zsolt Botykai added, you can improve performance using:

awk 'NR > 100000 { exit } NR >= 10000 && NR <= 100000' largefile
cut -d $'\t' -f 10000-100000 largefile
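
A further small simplification of the row filter: since awk exits before any line past 100000 is tested, the upper-bound check can be dropped:

awk 'NR > 100000 { exit } NR >= 10000' largefile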

Answer by Vijay

Some different solutions:

For row ranges, in sed:

sed -n 10000,100000p somefile.txt
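
If the file is very large, a minor variant (assuming a sed that accepts semicolon-separated commands, as GNU and BSD sed both do) stops reading as soon as the range has been printed instead of scanning to the end of the file:

sed -n '10000,100000p;100001q' somefile.txt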

For column ranges, in awk:

awk -v f=10000 -v t=100000 '{ for (i=f; i<=t;i++) printf("%s%s", $i,(i==t) ? "\n" : OFS) }' details.txt
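
A quick way to sanity-check the loop is to run it on a small made-up range, here columns 2 to 4 of a 5-column file built with seq and paste:

$ seq 15 | paste - - - - - | awk -v f=2 -v t=4 '{ for (i=f; i<=t; i++) printf("%s%s", $i, (i==t) ? "\n" : OFS) }'
2 3 4
7 8 9
12 13 14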

Answer by Fredrik Pihl

Was beaten to it for the sed solution, so I'll post a perl ditto instead. To print selected lines:

$ seq 100 | perl -ne 'print if $. >= 10 && $. <= 20' 
10
11
12
13
14
15
16
17
18
19
20

To print selected columns, use:

perl -lane 'print "@F[1..3]"'

-F is used in conjunction with -a to choose the delimiter on which to split lines.
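
For instance, to split on commas instead of whitespace and print the second and third columns (a small made-up input, just to illustrate -F):

$ echo a,b,c,d | perl -F, -lane 'print join ",", @F[1..2]'
b,c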

To test, use seq and paste to generate some columns:

$ seq 50 | paste - - - - -
1   2   3   4   5
6   7   8   9   10
11  12  13  14  15
16  17  18  19  20
21  22  23  24  25
26  27  28  29  30
31  32  33  34  35
36  37  38  39  40
41  42  43  44  45
46  47  48  49  50

Let's print everything except the first and the last column:

$ seq 50 | paste - - - - - | perl -lane 'print join "   ", @F[1..3]'
2   3   4
7   8   9
12  13  14
17  18  19
22  23  24
27  28  29
32  33  34
37  38  39
42  43  44
47  48  49

In the join statement above, the separator is a literal tab; you get it by typing Ctrl-V Tab.
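
If typing a literal tab is inconvenient, the escape sequence \t inside the one-liner gives the same result:

$ seq 50 | paste - - - - - | perl -lane 'print join "\t", @F[1..3]'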

Answer by Warren

For the first problem, selecting a set of rows from a large file, piping tail to head is very simple. You want rows 10000 through 100000 of largefile, which is 90001 lines: tail grabs the back end of largefile starting at row 10000, and then head chops off all but the first 90001 rows.

tail -n +10000 largefile | head -n 90001
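
To sanity-check the arithmetic on small data, the same pipeline can select rows 10 through 20 of a 100-line stream:

seq 100 | tail -n +10 | head -n 11

This prints the numbers 10 through 20, the same lines the perl example above selected.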