bash - Fastest way to print a single line in a file
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverFlow
Original question: http://stackoverflow.com/questions/15632691/
Fastest way to print a single line in a file
Asked by JBoy
I have to fetch one specific line out of a big file (1,500,000 lines), multiple times in a loop over multiple files, and I was asking myself what the best option would be in terms of performance. There are many ways to do this; I mainly use these two:
cat ${file} | head -1
or
cat ${file} | sed -n '1p'
I could not find an answer to this: do they both fetch only the first line, or does one of them (or both) first open the whole file and then fetch line 1?
Answered by Chris Seymour
Drop the useless use of cat and do:
$ sed -n '1{p;q}' file
This will quit the sed script after the line has been printed.
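The same early-exit idea generalizes to any line number. A small sketch (the file name demo.txt and its contents are invented for illustration):

```shell
# Build a tiny sample file (hypothetical name and contents)
printf 'alpha\nbeta\ngamma\n' > demo.txt

# sed: print line 2, then quit without reading the rest of the file
sed -n '2{p;q}' demo.txt

# awk: the same early exit via NR (the current line number)
awk 'NR==2{print; exit}' demo.txt
```

Both commands print `beta` and stop reading as soon as the target line is found.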
Benchmarking script:
#!/bin/bash
TIMEFORMAT='%3R'
n=25
heading=('head -1 file' 'sed -n 1p file' "sed -n '1{p;q}' file" 'read line < file && echo $line')
# files up to a hundred million lines (if you're on a slow machine, decrease!!)
for (( j=1; j<=100000000; j=j*10 ))
do
echo "Lines in file: $j"
# create file containing j lines
seq 1 $j > file
# initial read of file
cat file > /dev/null
for comm in {0..3}
do
avg=0
echo
echo ${heading[$comm]}
for (( i=1; i<=$n; i++ ))
do
case $comm in
0)
t=$( { time head -1 file > /dev/null; } 2>&1);;
1)
t=$( { time sed -n 1p file > /dev/null; } 2>&1);;
2)
t=$( { time sed '1{p;q}' file > /dev/null; } 2>&1);;
3)
t=$( { time read line < file && echo $line > /dev/null; } 2>&1);;
esac
avg=$avg+$t
done
echo "scale=3;($avg)/$n" | bc
done
done
Just save it as benchmark.sh and run bash benchmark.sh.
Results:
head -1 file
.001
sed -n 1p file
.048
sed -n '1{p;q} file
.002
read line < file && echo $line
0
*Results from a file with 1,000,000 lines.*
So the times for sed -n 1p will grow linearly with the length of the file, but the timings for the other variations will be constant (and negligible), as they all quit after reading the first line.
Note: timings are different from original post due to being on a faster Linux box.
Answered by jim mcnamara
If you are really just getting the very first line and reading hundreds of files, then consider shell builtins instead of external commands: use read, which is a shell builtin for bash and ksh. This eliminates the overhead of process creation with awk, sed, head, etc.
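Applied to the many-files case the answer describes, a loop using only the read builtin forks no child processes at all. A sketch (the demo file names and contents are invented for illustration):

```shell
# Create two throwaway sample files (hypothetical names and contents)
printf 'first-a\nrest-a\n' > demo_a.txt
printf 'first-b\nrest-b\n' > demo_b.txt

# Grab line 1 of each file with the read builtin alone -- no external
# head/sed/awk process is spawned per file
for f in demo_a.txt demo_b.txt; do
    IFS= read -r line < "$f"    # -r keeps backslashes, IFS= keeps leading whitespace
    printf '%s: %s\n' "$f" "$line"
done
```

With hundreds of files, avoiding one fork+exec per file is where the savings come from.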
The other issue is doing timed performance analysis on I/O. The first time you open and then read a file, the file data is probably not cached in memory. However, if you try a second command on the same file, the data as well as the inode have been cached, so the timed results may be faster, pretty much regardless of the command you use. Plus, inodes can stay cached practically forever. They do on Solaris, for example. Or anyway, several days.
For example, Linux caches everything and the kitchen sink, which is a good performance attribute, but it makes benchmarking problematic if you are not aware of the issue.
All of this caching effect "interference" is both OS and hardware dependent.
So: pick one file and read it with a command. Now it is cached. Run the same test command several dozen times; this samples the effect of command and child-process creation, not your I/O hardware.
Here is sed vs. read for 10 iterations of fetching the first line of the same file, after reading the file once:
sed: sed '1{p;q}' uopgenl20121216.lis
real 0m0.917s
user 0m0.258s
sys 0m0.492s
read: read foo < uopgenl20121216.lis ; export foo; echo "$foo"
real 0m0.017s
user 0m0.000s
sys 0m0.015s
This is clearly contrived, but does show the difference between builtin performance vs using a command.
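The warm-up step this answer recommends can be sketched as follows (the file name and size are arbitrary, invented for illustration):

```shell
# Generate a throwaway test file (hypothetical name and size)
seq 1 100000 > bench.txt

# First read pulls the file into the page cache
cat bench.txt > /dev/null

# Subsequent timed runs now measure process creation and parsing,
# not disk I/O ("time" reports to stderr; the line itself goes to stdout)
time head -n 1 bench.txt
```

Repeat the timed command a few dozen times and average, as the benchmarking script above does, to smooth out scheduler noise.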
Answered by Elisiano Petrini
How about avoiding pipes? Both sed and head support the filename as an argument. In this way you avoid passing through cat. I didn't measure it, but head should be faster on larger files, as it stops the computation after N lines (whereas sed goes through all of them, even if it doesn't print them, unless you specify the quit option as suggested above).
Examples:
sed -n '1{p;q}' /path/to/file
head -n 1 /path/to/file
Again, I didn't test the efficiency.
Answered by dvvrt
If you want to print only 1 line (say the 20th one) from a large file you could also do:
head -20 filename | tail -1
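Parameterized over the target line number, the pipeline looks like this (the demo file is invented for illustration):

```shell
# Sample file with 30 numbered lines (hypothetical)
seq 1 30 > demo.txt

n=20
# head stops reading after line n; tail keeps only the last of those n lines
head -n "$n" demo.txt | tail -n 1
```

This prints line 20. Note that head still has to read the first n lines, so the cost grows with n, just at a smaller constant than sed's per-line processing.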
I did a "basic" test with bash and it seems to perform better than the sed -n '1{p;q}' solution above.
The test takes a large file and prints a line from somewhere in the middle (at line 10000000), repeated 100 times, each time selecting the next line. So it selects lines 10000000, 10000001, 10000002, ... and so on up to 10000099.
$wc -l english
36374448 english
$time for i in {0..99}; do j=$((i+10000000)); sed -n $j'{p;q}' english >/dev/null; done;
real 1m27.207s
user 1m20.712s
sys 0m6.284s
vs.
$time for i in {0..99}; do j=$((i+10000000)); head -$j english | tail -1 >/dev/null; done;
real 1m3.796s
user 0m59.356s
sys 0m32.376s
For printing a line out of multiple files
$wc -l english*
36374448 english
17797377 english.1024MB
3461885 english.200MB
57633710 total
$time for i in english*; do sed -n '10000000{p;q}' $i >/dev/null; done;
real 0m2.059s
user 0m1.904s
sys 0m0.144s
$time for i in english*; do head -10000000 $i | tail -1 >/dev/null; done;
real 0m1.535s
user 0m1.420s
sys 0m0.788s