Disclaimer: this page mirrors a popular StackOverflow question. It is provided under the CC BY-SA 4.0 license; you are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow.

Original question: http://stackoverflow.com/questions/6022384/
Bash tool to get nth line from a file
Asked by Vlad Vivdovitch

Is there a "canonical" way of doing that? I've been using head -n | tail -1, which does the trick, but I've been wondering if there's a Bash tool that specifically extracts a line (or a range of lines) from a file.

By "canonical" I mean a program whose main function is doing that.
Answered by anubhava

head with a pipe to tail will be slow for a huge file. I would suggest sed like this:
sed 'NUMq;d' file
Where NUM is the number of the line you want to print; so, for example, sed '10q;d' file prints the 10th line of file.
Explanation:
NUMq will quit immediately when the line number is NUM.
d will delete the line instead of printing it; this is inhibited on the last line because the q causes the rest of the script to be skipped when quitting.
If you have NUM in a variable, you will want to use double quotes instead of single:
sed "${NUM}q;d" file
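As a quick sanity check, here is a hypothetical session (file name and contents invented for the demo) showing the variable form in action:

```shell
# Create a throwaway 3-line sample file.
printf 'alpha\nbeta\ngamma\n' > /tmp/sample.txt

NUM=2
# Double quotes let the shell expand ${NUM} before sed sees the script:
sed "${NUM}q;d" /tmp/sample.txt    # prints: beta

# With single quotes, sed would receive the literal text ${NUM}q;d,
# which is not a valid sed address, and the command would fail.
```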
Answered by jm666

sed -n '2p' < file.txt

will print the 2nd line

sed -n '2011p' < file.txt

the 2011th line

sed -n '10,33p' < file.txt

lines 10 up to 33

sed -n '1p;3p' < file.txt

the 1st and 3rd lines

and so on...

For adding lines with sed, you can check this:
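A small invented demo of these sed addressing forms:

```shell
# Build a 5-line sample file.
printf 'one\ntwo\nthree\nfour\nfive\n' > /tmp/demo.txt

sed -n '2p'    /tmp/demo.txt   # prints: two
sed -n '2,4p'  /tmp/demo.txt   # prints lines 2 through 4
sed -n '1p;3p' /tmp/demo.txt   # prints lines 1 and 3
```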
Answered by CaffeineConnoisseur

I have a unique situation where I can benchmark the solutions proposed on this page, so I'm writing this answer as a consolidation of the proposed solutions, with run times included for each.

Set Up

I have a 3.261 gigabyte ASCII text data file with one key-value pair per row. The file contains 3,339,550,320 rows in total and defies opening in any editor I have tried, including my go-to Vim. I need to subset this file in order to investigate some of the values that I've discovered only start around row ~500,000,000.

Because the file has so many rows:
- I need to extract only a subset of the rows to do anything useful with the data.
- Reading through every row leading up to the values I care about is going to take a long time.
- If the solution reads past the rows I care about and continues reading the rest of the file, it will waste time reading almost 3 billion irrelevant rows and take 6x longer than necessary.

My best-case scenario is a solution that extracts only a single line from the file without reading any of the other rows, but I can't think of how I would accomplish this in Bash.

For the purposes of my sanity, I'm not going to try to read the full 500,000,000 lines I'd need for my own problem. Instead I'll try to extract row 50,000,000 out of 3,339,550,320 (which means reading the full file would take 60x longer than necessary).
I will be using the time built-in to benchmark each command.

Baseline

First let's see how the head | tail solution performs:
$ time head -50000000 myfile.ascii | tail -1
pgm_icnt = 0
real 1m15.321s
The baseline for row 50 million is 00:01:15.321; if I'd gone straight for row 500 million it'd probably be ~12.5 minutes.

cut

I'm dubious of this one, but it's worth a shot:
$ time cut -f50000000 -d$'\n' myfile.ascii
pgm_icnt = 0
real 5m12.156s
This one took 00:05:12.156 to run, which is much slower than the baseline! I'm not sure whether it read through the entire file or just up to line 50 million before stopping, but regardless this doesn't seem like a viable solution to the problem.

AWK

I only ran the solution with the exit because I wasn't going to wait for the full file to run:
$ time awk 'NR == 50000000 {print; exit}' myfile.ascii
pgm_icnt = 0
real 1m16.583s
This code ran in 00:01:16.583, which is only ~1 second slower, but still not an improvement on the baseline. At this rate, if the exit command had been excluded, it would probably have taken around ~76 minutes to read the entire file!

Perl

I ran the existing Perl solution as well:
$ time perl -wnl -e '$.== 50000000 && print && exit;' myfile.ascii
pgm_icnt = 0
real 1m13.146s
This code ran in 00:01:13.146, which is ~2 seconds faster than the baseline. If I'd run it on the full 500,000,000 rows it would probably take ~12 minutes.

sed

The top answer on the board, here's my result:
$ time sed "50000000q;d" myfile.ascii
pgm_icnt = 0
real 1m12.705s
This code ran in 00:01:12.705, which is 3 seconds faster than the baseline, and ~0.4 seconds faster than Perl. If I'd run it on the full 500,000,000 rows it would probably have taken ~12 minutes.

mapfile

I have bash 3.1 and therefore cannot test the mapfile solution.

Conclusion
It looks like, for the most part, it's difficult to improve upon the head | tail solution. At best the sed solution provides a ~3% increase in efficiency.

(percentages calculated with the formula % = (runtime/baseline - 1) * 100)
Row 50,000,000

- sed: 00:01:12.705 (-00:00:02.616 = -3.47%)
- perl: 00:01:13.146 (-00:00:02.175 = -2.89%)
- head|tail: 00:01:15.321 (+00:00:00.000 = +0.00%)
- awk: 00:01:16.583 (+00:00:01.262 = +1.68%)
- cut: 00:05:12.156 (+00:03:56.835 = +314.43%)
Row 500,000,000

- sed: 00:12:07.050 (-00:00:26.160)
- perl: 00:12:11.460 (-00:00:21.750)
- head|tail: 00:12:33.210 (+00:00:00.000)
- awk: 00:12:45.830 (+00:00:12.620)
- cut: 00:52:01.560 (+00:40:31.650)
Row 3,338,559,320

- sed: 01:20:54.599 (-00:03:05.327)
- perl: 01:21:24.045 (-00:02:25.227)
- head|tail: 01:23:49.273 (+00:00:00.000)
- awk: 01:25:13.548 (+00:02:35.735)
- cut: 05:47:23.026 (+04:24:26.246)
Answered by fedorqui 'SO stop harming'

With awk it is pretty fast:

awk 'NR == num_line' file

When this is true, the default behaviour of awk is performed: {print $0}.
Alternative versions

If your file happens to be huge, you'd better exit after reading the required line. This way you save CPU time; see the time comparison at the end of the answer.
awk 'NR == num_line {print; exit}' file
If you want to give the line number from a bash variable you can use:
awk 'NR == n' n=$num file
awk -v n=$num 'NR == n' file # equivalent
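A hypothetical end-to-end run of the -v form (file name, contents, and variable name are invented); quoting the shell variable is a good habit:

```shell
# Build a small 4-line sample file.
printf 'a\nb\nc\nd\n' > /tmp/lines.txt
num=3

# -v assigns the awk variable n before any input is read:
awk -v n="$num" 'NR == n {print; exit}' /tmp/lines.txt   # prints: c
```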
See how much time is saved by using exit, especially if the line happens to be in the first part of the file:
# Let's create a 10M lines file
for ((i=0; i<100000; i++)); do echo "bla bla"; done > 100Klines
for ((i=0; i<100; i++)); do cat 100Klines; done > 10Mlines
$ time awk 'NR == 1234567 {print}' 10Mlines
bla bla
real 0m1.303s
user 0m1.246s
sys 0m0.042s
$ time awk 'NR == 1234567 {print; exit}' 10Mlines
bla bla
real 0m0.198s
user 0m0.178s
sys 0m0.013s
So the difference is 0.198s vs 1.303s, around 6x faster.
Answered by Philipp Claßen

According to my tests, in terms of performance and readability my recommendation is:

tail -n+N | head -1

N is the line number that you want. For example, tail -n+7 input.txt | head -1 will print the 7th line of the file.
tail -n+N will print everything starting from line N, and head -1 will make it stop after one line.

The alternative head -N | tail -1 is perhaps slightly more readable. For example, this will print the 7th line:

head -7 input.txt | tail -1
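To make the two pipelines concrete, here is a tiny invented check that both extract the same line:

```shell
# Five numbered lines in a scratch file.
printf 'l1\nl2\nl3\nl4\nl5\n' > /tmp/t.txt

tail -n+4 /tmp/t.txt | head -1   # prints: l4
head -4 /tmp/t.txt | tail -1     # prints: l4
```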
When it comes to performance, there is not much difference for smaller sizes, but it will be outperformed by the tail | head (from above) when the files become huge.

The top-voted sed 'NUMq;d' is interesting to know, but I would argue that it will be understood out of the box by fewer people than the head/tail solution, and it is also slower than tail/head.

In my tests, both tail/head versions outperformed sed 'NUMq;d' consistently. That is in line with the other benchmarks that were posted. It is hard to find a case where tail/head was really bad. It is also not surprising, as these are operations that you would expect to be heavily optimized in a modern Unix system.

To get an idea about the performance differences, these are the numbers that I get for a huge file (9.3G):
- tail -n+N | head -1: 3.7 sec
- head -N | tail -1: 4.6 sec
- sed Nq;d: 18.8 sec
Results may differ, but the performance of head | tail and tail | head is, in general, comparable for smaller inputs, and sed is always slower by a significant factor (around 5x or so).

To reproduce my benchmark, you can try the following, but be warned that it will create a 9.3G file in the current working directory:
#!/bin/bash
readonly file=tmp-input.txt
readonly size=1000000000
readonly pos=500000000
readonly retries=3
seq 1 $size > $file
echo "*** head -N | tail -1 ***"
for i in $(seq 1 $retries) ; do
time head "-$pos" $file | tail -1
done
echo "-------------------------"
echo
echo "*** tail -n+N | head -1 ***"
echo
seq 1 $size > $file
ls -alhg $file
for i in $(seq 1 $retries) ; do
time tail -n+$pos $file | head -1
done
echo "-------------------------"
echo
echo "*** sed Nq;d ***"
echo
seq 1 $size > $file
ls -alhg $file
for i in $(seq 1 $retries) ; do
time sed $pos'q;d' $file
done
/bin/rm $file
Here is the output of a run on my machine (ThinkPad X1 Carbon with an SSD and 16G of memory). I assume in the final run everything will come from the cache, not from disk:
*** head -N | tail -1 ***
500000000
real 0m9,800s
user 0m7,328s
sys 0m4,081s
500000000
real 0m4,231s
user 0m5,415s
sys 0m2,789s
500000000
real 0m4,636s
user 0m5,935s
sys 0m2,684s
-------------------------
*** tail -n+N | head -1 ***
-rw-r--r-- 1 phil 9,3G Jan 19 19:49 tmp-input.txt
500000000
real 0m6,452s
user 0m3,367s
sys 0m1,498s
500000000
real 0m3,890s
user 0m2,921s
sys 0m0,952s
500000000
real 0m3,763s
user 0m3,004s
sys 0m0,760s
-------------------------
*** sed Nq;d ***
-rw-r--r-- 1 phil 9,3G Jan 19 19:50 tmp-input.txt
500000000
real 0m23,675s
user 0m21,557s
sys 0m1,523s
500000000
real 0m20,328s
user 0m18,971s
sys 0m1,308s
500000000
real 0m19,835s
user 0m18,830s
sys 0m1,004s
Answered by David W.

Wow, all the possibilities!

Try this:

sed -n "${lineNum}p" $file

or one of these, depending upon your version of Awk:

awk -vlineNum=$lineNum 'NR == lineNum {print $0}' $file
awk -v lineNum=4 '{if (NR == lineNum) {print $0}}' $file
awk '{if (NR == lineNum) {print $0}}' lineNum=$lineNum $file

(You may have to try the nawk or gawk command).

Is there a tool that only does the print-that-particular-line job? Not one of the standard tools. However, sed is probably the closest and simplest to use.
Answered by Steven Penny
mapfile -s $((42-1)) -n $((666-42+1)) ary < file
printf '%s' "${ary[@]}"
回答by gniourf_gniourf
This question being tagged Bash, here's the Bash (≥4) way of doing it: use mapfile with the -s (skip) and -n (count) options.

If you need to get the 42nd line of a file file:

mapfile -s 41 -n 1 ary < file
At this point, you'll have an array ary whose fields contain the lines of file (including the trailing newline), where we have skipped the first 41 lines (-s 41) and stopped after reading one line (-n 1). So that's really the 42nd line. To print it out:

printf '%s' "${ary[0]}"
If you need a range of lines, say the range 42–666 (inclusive), and say you don't want to do the math yourself, and want to print them on stdout:

mapfile -s $((42-1)) -n $((666-42+1)) ary < file
printf '%s' "${ary[@]}"
If you need to process these lines too, it's not really convenient to store the trailing newline. In this case use the -t option (trim):

mapfile -t -s $((42-1)) -n $((666-42+1)) ary < file
# do stuff
printf '%s\n' "${ary[@]}"
You can have a function do that for you:

print_file_range() {
    # $1-$2 is the range of file $3 to be printed to stdout
    local ary
    mapfile -s $(($1-1)) -n $(($2-$1+1)) ary < "$3"
    printf '%s' "${ary[@]}"
}

No external commands, only Bash builtins!
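Assuming Bash ≥ 4, a usage sketch of such a range-printing function (the function body and sample file here are illustrative, not canonical):

```shell
#!/usr/bin/env bash

print_file_range() {
    # $1-$2 is the range of file $3 to be printed to stdout
    local ary
    mapfile -s $(($1-1)) -n $(($2-$1+1)) ary < "$3"
    printf '%s' "${ary[@]}"
}

printf 'a\nb\nc\nd\ne\n' > /tmp/r.txt
print_file_range 2 4 /tmp/r.txt   # prints lines 2-4: b, c, d
```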
Answered by bernd

You may also use sed print and quit:

sed -n '10{p;q;}' file   # print line 10

Answered by Timofey Stolbov
You can also use Perl for this:

perl -wnl -e '$. == NUM && print && exit;' some.file