
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/6022384/


Bash tool to get nth line from a file

Tags: bash, shell, unix, awk, sed

Asked by Vlad Vivdovitch

Is there a "canonical" way of doing that? I've been using `head -n | tail -1`, which does the trick, but I've been wondering if there's a Bash tool that specifically extracts a line (or a range of lines) from a file.

By "canonical" I mean a program whose main function is doing that.


Answered by anubhava

`head` piped with `tail` will be slow for a huge file. I would suggest `sed` like this:

sed 'NUMq;d' file

Where `NUM` is the number of the line you want to print; so, for example, `sed '10q;d' file` prints the 10th line of `file`.

Explanation:


`NUMq` will quit immediately when the line number is `NUM`.

`d` will delete the line instead of printing it; this is inhibited on the last line because `q` causes the rest of the script to be skipped when quitting.

If you have `NUM` in a variable, you will want to use double quotes instead of single quotes:

sed "${NUM}q;d" file
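As a quick sanity check, here is a small sketch (the file path and its contents are invented for the demo):

```shell
# create a throwaway sample file (hypothetical path)
printf '%s\n' alpha beta gamma delta > /tmp/demo.txt

NUM=3
sed "${NUM}q;d" /tmp/demo.txt   # prints: gamma
```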

Answered by jm666

sed -n '2p' < file.txt

will print the 2nd line

sed -n '2011p' < file.txt

the 2011th line

sed -n '10,33p' < file.txt

line 10 up to line 33


sed -n '1p;3p' < file.txt

the 1st and 3rd lines

and so on...

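A compact sketch of these `sed -n` address forms on a throwaway file (path and contents invented for the demo):

```shell
# hypothetical five-line sample file
printf 'line%d\n' 1 2 3 4 5 > /tmp/nums.txt

sed -n '2p'    < /tmp/nums.txt   # line2
sed -n '2,4p'  < /tmp/nums.txt   # line2 line3 line4
sed -n '1p;3p' < /tmp/nums.txt   # line1 line3
```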

For adding lines with sed, you can check this:


sed: insert a line in a certain position


Answered by CaffeineConnoisseur

I have a unique situation where I can benchmark the solutions proposed on this page, and so I'm writing this answer as a consolidation of the proposed solutions with included run times for each.


Set Up


I have a 3.261 gigabyte ASCII text data file with one key-value pair per row. The file contains 3,339,550,320 rows in total and defies opening in any editor I have tried, including my go-to Vim. I need to subset this file in order to investigate some of the values that I've discovered only start around row ~500,000,000.


Because the file has so many rows:


  • I need to extract only a subset of the rows to do anything useful with the data.
  • Reading through every row leading up to the values I care about is going to take a long time.
  • If the solution reads past the rows I care about and continues reading the rest of the file it will waste time reading almost 3 billion irrelevant rows and take 6x longer than necessary.

My best-case scenario is a solution that extracts only a single line from the file without reading any of the other rows, but I can't think of how I would accomplish this in Bash.

For the purposes of my sanity I'm not going to be trying to read the full 500,000,000 lines I'd need for my own problem. Instead I'll be trying to extract row 50,000,000 out of 3,339,550,320 (which means reading the full file will take 60x longer than necessary).


I will be using the `time` built-in to benchmark each command.

Baseline


First let's see how the `head` + `tail` solution performs:

$ time head -50000000 myfile.ascii | tail -1
pgm_icnt = 0

real    1m15.321s

The baseline for row 50 million is 00:01:15.321; if I'd gone straight for row 500 million it'd probably be ~12.5 minutes.

cut

I'm dubious of this one, but it's worth a shot:


$ time cut -f50000000 -d$'\n' myfile.ascii
pgm_icnt = 0

real    5m12.156s

This one took 00:05:12.156 to run, which is much slower than the baseline! I'm not sure whether it read through the entire file or just up to line 50 million before stopping, but regardless this doesn't seem like a viable solution to the problem.


AWK


I only ran the solution with the `exit` because I wasn't going to wait for the full file to run:

$ time awk 'NR == 50000000 {print; exit}' myfile.ascii
pgm_icnt = 0

real    1m16.583s

This code ran in 00:01:16.583, which is only ~1 second slower, but still not an improvement on the baseline. At this rate if the exit command had been excluded it would have probably taken around ~76 minutes to read the entire file!


Perl


I ran the existing Perl solution as well:


$ time perl -wnl -e '$.== 50000000 && print && exit;' myfile.ascii
pgm_icnt = 0

real    1m13.146s

This code ran in 00:01:13.146, which is ~2 seconds faster than the baseline. If I'd run it on the full 500,000,000 it would probably take ~12 minutes.


sed


The top answer on the board, here's my result:


$ time sed "50000000q;d" myfile.ascii
pgm_icnt = 0

real    1m12.705s

This code ran in 00:01:12.705, which is 3 seconds faster than the baseline, and ~0.4 seconds faster than Perl. If I'd run it on the full 500,000,000 rows it would have probably taken ~12 minutes.


mapfile


I have bash 3.1 and therefore cannot test the mapfile solution.


Conclusion


It looks like, for the most part, it's difficult to improve upon the `head` + `tail` solution. At best the `sed` solution provides a ~3% increase in efficiency.

(percentages calculated with the formula `% = (runtime/baseline - 1) * 100`)
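For instance, plugging the row-50,000,000 `sed` and baseline times into that formula can be checked with a one-liner (values taken from the results in this answer):

```shell
# baseline = head|tail (75.321 s), runtime = sed (72.705 s)
awk 'BEGIN { printf "%.2f%%\n", (72.705/75.321 - 1) * 100 }'   # prints: -3.47%
```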

Row 50,000,000


  1. 00:01:12.705 (-00:00:02.616 = -3.47%) sed
  2. 00:01:13.146 (-00:00:02.175 = -2.89%) perl
  3. 00:01:15.321 (+00:00:00.000 = +0.00%) head|tail
  4. 00:01:16.583 (+00:00:01.262 = +1.68%) awk
  5. 00:05:12.156 (+00:03:56.835 = +314.43%) cut

Row 500,000,000


  1. 00:12:07.050 (-00:00:26.160) sed
  2. 00:12:11.460 (-00:00:21.750) perl
  3. 00:12:33.210 (+00:00:00.000) head|tail
  4. 00:12:45.830 (+00:00:12.620) awk
  5. 00:52:01.560 (+00:40:31.650) cut

Row 3,339,550,320 (the last row)

  1. 01:20:54.599 (-00:03:05.327) sed
  2. 01:21:24.045 (-00:02:25.227) perl
  3. 01:23:49.273 (+00:00:00.000) head|tail
  4. 01:25:13.548 (+00:02:35.735) awk
  5. 05:47:23.026 (+04:24:26.246) cut

Answered by fedorqui 'SO stop harming'

With `awk` it is pretty fast:

awk 'NR == num_line' file

When this is true, the default behaviour of `awk` is performed: `{print $0}`.
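To illustrate, these two commands behave identically (sample file invented for the demo):

```shell
printf '%s\n' one two three > /tmp/words.txt

awk 'NR == 2' /tmp/words.txt              # implicit default action
awk 'NR == 2 {print $0}' /tmp/words.txt   # explicit equivalent; both print: two
```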



Alternative versions


If your file happens to be huge, you'd better `exit` after reading the required line. This way you save CPU time; see the time comparison at the end of the answer.

awk 'NR == num_line {print; exit}' file

If you want to give the line number from a bash variable you can use:


awk 'NR == n' n="$num" file
awk -v n="$num" 'NR == n' file   # equivalent


See how much time is saved by using `exit`, especially if the line happens to be in the first part of the file:

# Let's create a 10M lines file
for ((i=0; i<100000; i++)); do echo "bla bla"; done > 100Klines
for ((i=0; i<100; i++)); do cat 100Klines; done > 10Mlines

$ time awk 'NR == 1234567 {print}' 10Mlines
bla bla

real    0m1.303s
user    0m1.246s
sys 0m0.042s
$ time awk 'NR == 1234567 {print; exit}' 10Mlines
bla bla

real    0m0.198s
user    0m0.178s
sys 0m0.013s

So the difference is 0.198s vs 1.303s, around 6x faster.

Answered by Philipp Claßen

According to my tests, in terms of performance and readability my recommendation is:


tail -n+N | head -1


`N` is the line number that you want. For example, `tail -n+7 input.txt | head -1` will print the 7th line of the file.

`tail -n+N` will print everything starting from line `N`, and `head -1` will make it stop after one line.
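A tiny sketch verifying this on an invented eight-line file:

```shell
# hypothetical sample file with eight lines
printf 'row %d\n' 1 2 3 4 5 6 7 8 > /tmp/rows.txt

tail -n+7 /tmp/rows.txt | head -1   # prints: row 7
```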



The alternative `head -N | tail -1` is perhaps slightly more readable. For example, this will print the 7th line:

head -7 input.txt | tail -1


When it comes to performance, there is not much difference for smaller sizes, but it will be outperformed by `tail | head` (from above) when the files become huge.

The top-voted `sed 'NUMq;d'` is interesting to know, but I would argue that it will be understood by fewer people out of the box than the head/tail solution, and it is also slower than tail/head.

In my tests, both tail/head versions outperformed `sed 'NUMq;d'` consistently. That is in line with the other benchmarks that were posted. It is hard to find a case where tail/head was really bad. It is also not surprising, as these are operations that you would expect to be heavily optimized in a modern Unix system.

To get an idea about the performance differences, these are the number that I get for a huge file (9.3G):


  • tail -n+N | head -1: 3.7 sec
  • head -N | tail -1: 4.6 sec
  • sed Nq;d: 18.8 sec

Results may differ, but the performance of `head | tail` and `tail | head` is, in general, comparable for smaller inputs, and `sed` is always slower by a significant factor (around 5x or so).

To reproduce my benchmark, you can try the following, but be warned that it will create a 9.3G file in the current working directory:


#!/bin/bash
readonly file=tmp-input.txt
readonly size=1000000000
readonly pos=500000000
readonly retries=3

seq 1 $size > $file
echo "*** head -N | tail -1 ***"
for i in $(seq 1 $retries) ; do
    time head "-$pos" $file | tail -1
done
echo "-------------------------"
echo
echo "*** tail -n+N | head -1 ***"
echo

seq 1 $size > $file
ls -alhg $file
for i in $(seq 1 $retries) ; do
    time tail -n+$pos $file | head -1
done
echo "-------------------------"
echo
echo "*** sed Nq;d ***"
echo

seq 1 $size > $file
ls -alhg $file
for i in $(seq 1 $retries) ; do
    time sed $pos'q;d' $file
done
/bin/rm $file

Here is the output of a run on my machine (ThinkPad X1 Carbon with an SSD and 16G of memory). I assume in the final run everything will come from the cache, not from disk:


*** head -N | tail -1 ***
500000000

real    0m9,800s
user    0m7,328s
sys     0m4,081s
500000000

real    0m4,231s
user    0m5,415s
sys     0m2,789s
500000000

real    0m4,636s
user    0m5,935s
sys     0m2,684s
-------------------------

*** tail -n+N | head -1 ***

-rw-r--r-- 1 phil 9,3G Jan 19 19:49 tmp-input.txt
500000000

real    0m6,452s
user    0m3,367s
sys     0m1,498s
500000000

real    0m3,890s
user    0m2,921s
sys     0m0,952s
500000000

real    0m3,763s
user    0m3,004s
sys     0m0,760s
-------------------------

*** sed Nq;d ***

-rw-r--r-- 1 phil 9,3G Jan 19 19:50 tmp-input.txt
500000000

real    0m23,675s
user    0m21,557s
sys     0m1,523s
500000000

real    0m20,328s
user    0m18,971s
sys     0m1,308s
500000000

real    0m19,835s
user    0m18,830s
sys     0m1,004s

Answered by David W.

Wow, all the possibilities!


Try this:


sed -n "${lineNum}p" $file

or one of these depending upon your version of Awk:


awk -v lineNum=$lineNum 'NR == lineNum {print $0}' $file
awk -v lineNum=4 '{if (NR == lineNum) {print $0}}' $file
awk '{if (NR == lineNum) {print $0}}' lineNum=$lineNum $file

(You may have to try the `nawk` or `gawk` command.)

Is there a tool that only prints that particular line? Not among the standard tools. However, `sed` is probably the closest and simplest to use.

Answered by Steven Penny

# print line number 52
sed '52!d' file

Useful one-line scripts for sed


Answered by gniourf_gniourf

This question being tagged Bash, here's the Bash (≥4) way of doing it: use `mapfile` with the `-s` (skip) and `-n` (count) options.

If you need to get the 42nd line of a file `file`:

mapfile -s 41 -n 1 ary < file

At this point, you'll have an array `ary` whose fields contain the lines of `file` (including the trailing newline), where we have skipped the first 41 lines (`-s 41`) and stopped after reading one line (`-n 1`). So that's really the 42nd line. To print it out:

printf '%s' "${ary[0]}"

If you need a range of lines, say the range 42–666 (inclusive), and say you don't want to do the math yourself, and print them on stdout:

mapfile -s $((42-1)) -n $((666-42+1)) ary < file
printf '%s' "${ary[@]}"

If you need to process these lines too, it's not really convenient to store the trailing newline. In this case use the `-t` option (trim):

mapfile -t -s $((42-1)) -n $((666-42+1)) ary < file
# do stuff
printf '%s\n' "${ary[@]}"

You can have a function do that for you:

print_file_range() {
    # $1-$2 is the range of file $3 to be printed to stdout
    local ary
    mapfile -s $(($1-1)) -n $(($2-$1+1)) ary < "$3"
    printf '%s' "${ary[@]}"
}

No external commands, only Bash builtins!

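A quick usage sketch for the `print_file_range` function above (requires Bash ≥ 4; the sample file is invented for the demo, and the function is redeclared so the snippet is self-contained):

```shell
# same function as in the answer, repeated here for a standalone demo
print_file_range() {
    local ary
    mapfile -s $(($1-1)) -n $(($2-$1+1)) ary < "$3"
    printf '%s' "${ary[@]}"
}

printf '%s\n' a b c d e > /tmp/letters.txt
print_file_range 2 4 /tmp/letters.txt   # prints lines 2-4: b, c, d
```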

Answered by bernd

You may also use `sed` print and quit:

sed -n '10{p;q;}' file   # print line 10

Answered by Timofey Stolbov

You can also use Perl for this:


perl -wnl -e '$.== NUM && print && exit;' some.file