bash - Fastest way to print a single line in a file
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverFlow
Original question: http://stackoverflow.com/questions/15632691/
Fastest way to print a single line in a file
Asked by JBoy
I have to fetch one specific line out of a big file (1,500,000 lines), multiple times in a loop over multiple files, and I was asking myself what the best option would be in terms of performance. There are many ways to do this; I mainly use these two:
cat ${file} | head -1
or
cat ${file} | sed -n '1p'
I could not find an answer to this: do they both fetch only the first line, or does one of them (or both) first open the whole file and then fetch line 1?
Answered by Chris Seymour
Drop the useless use of cat and do:
$ sed -n '1{p;q}' file
This will quit the sed script after the line has been printed.
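The same early-exit idea generalizes to any line number. A small sketch (the file name demo.txt and its contents are invented for illustration):

```shell
# Build a tiny sample file (hypothetical name and contents)
printf 'alpha\nbeta\ngamma\n' > demo.txt

# sed: print line 2, then quit without reading the rest of the file
sed -n '2{p;q}' demo.txt

# awk: the same early exit via NR (the current line number)
awk 'NR==2{print; exit}' demo.txt
```

Both commands print `beta` and stop reading as soon as the target line is found.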
Benchmarking script:
#!/bin/bash
TIMEFORMAT='%3R'
n=25
heading=('head -1 file' 'sed -n 1p file' "sed -n '1{p;q}' file" 'read line < file && echo $line')
# files up to a hundred million lines (if you're on a slow machine, decrease!!)
for (( j=1; j<=100000000; j=j*10 ))
do
echo "Lines in file: $j"
# create file containing j lines
seq 1 $j > file
# initial read of file
cat file > /dev/null
for comm in {0..3}
do
avg=0
echo
echo ${heading[$comm]}
for (( i=1; i<=$n; i++ ))
do
case $comm in
0)
t=$( { time head -1 file > /dev/null; } 2>&1);;
1)
t=$( { time sed -n 1p file > /dev/null; } 2>&1);;
2)
t=$( { time sed '1{p;q}' file > /dev/null; } 2>&1);;
3)
t=$( { time read line < file && echo $line > /dev/null; } 2>&1);;
esac
avg=$avg+$t
done
echo "scale=3;($avg)/$n" | bc
done
done
Just save it as benchmark.sh and run bash benchmark.sh.
Results:
head -1 file
.001
sed -n 1p file
.048
sed -n '1{p;q} file
.002
read line < file && echo $line
0
*Results from a file with 1,000,000 lines.*
So the times for sed -n 1p will grow linearly with the length of the file, but the timings for the other variations will be constant (and negligible), as they all quit after reading the first line.
Note: timings are different from original post due to being on a faster Linux box.
Answered by jim mcnamara
If you are really just getting the very first line and reading hundreds of files, then consider shell builtins instead of external commands: use read, which is a shell builtin for bash and ksh. This eliminates the overhead of process creation with awk, sed, head, etc.
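Applied to the many-files case the answer describes, a loop using only the read builtin forks no child processes at all. A sketch (the demo file names and contents are invented for illustration):

```shell
# Create two throwaway sample files (hypothetical names and contents)
printf 'first-a\nrest-a\n' > demo_a.txt
printf 'first-b\nrest-b\n' > demo_b.txt

# Grab line 1 of each file with the read builtin alone -- no external
# head/sed/awk process is spawned per file
for f in demo_a.txt demo_b.txt; do
    IFS= read -r line < "$f"    # -r keeps backslashes, IFS= keeps leading whitespace
    printf '%s: %s\n' "$f" "$line"
done
```

With hundreds of files, avoiding one fork+exec per file is where the savings come from.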
The other issue is doing timed performance analysis on I/O. The first time you open and then read a file, the file data is probably not cached in memory. However, if you try a second command on the same file, the data as well as the inode have been cached, so the timed results may be faster, pretty much regardless of the command you use. Plus, inodes can stay cached practically forever. They do on Solaris, for example. Or anyway, several days.
For example, Linux caches everything and the kitchen sink, which is a good performance attribute, but it makes benchmarking problematic if you are not aware of the issue.
All of this caching effect "interference" is both OS and hardware dependent.
So: pick one file and read it with a command. Now it is cached. Run the same test command several dozen times; this samples the effect of command and child-process creation, not your I/O hardware.
Here is sed vs. read for 10 iterations of fetching the first line of the same file, after reading the file once:
sed: sed '1{p;q}' uopgenl20121216.lis
real 0m0.917s
user 0m0.258s
sys 0m0.492s
read: read foo < uopgenl20121216.lis ; export foo; echo "$foo"
real 0m0.017s
user 0m0.000s
sys 0m0.015s
This is clearly contrived, but does show the difference between builtin performance vs using a command.
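The warm-up step this answer recommends can be sketched as follows (the file name and size are arbitrary, invented for illustration):

```shell
# Generate a throwaway test file (hypothetical name and size)
seq 1 100000 > bench.txt

# First read pulls the file into the page cache
cat bench.txt > /dev/null

# Subsequent timed runs now measure process creation and parsing,
# not disk I/O ("time" reports to stderr; the line itself goes to stdout)
time head -n 1 bench.txt
```

Repeat the timed command a few dozen times and average, as the benchmarking script above does, to smooth out scheduler noise.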
Answered by Elisiano Petrini
How about avoiding pipes? Both sed and head support the filename as an argument. In this way you avoid passing through cat. I didn't measure it, but head should be faster on larger files, as it stops the computation after N lines (whereas sed goes through all of them, even if it doesn't print them, unless you specify the quit option as suggested above).
Examples:
sed -n '1{p;q}' /path/to/file
head -n 1 /path/to/file
Again, I didn't test the efficiency.
Answered by dvvrt
If you want to print only 1 line (say the 20th one) from a large file you could also do:
head -20 filename | tail -1
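Parameterized over the target line number, the pipeline looks like this (the demo file is invented for illustration):

```shell
# Sample file with 30 numbered lines (hypothetical)
seq 1 30 > demo.txt

n=20
# head stops reading after line n; tail keeps only the last of those n lines
head -n "$n" demo.txt | tail -n 1
```

This prints line 20. Note that head still has to read the first n lines, so the cost grows with n, just at a smaller constant than sed's per-line processing.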
I did a "basic" test with bash and it seems to perform better than the sed -n '1{p;q}' solution above.
The test takes a large file and prints a line from somewhere in the middle (at line 10000000), repeated 100 times, each time selecting the next line. So it selects lines 10000000, 10000001, 10000002, ... and so on up to 10000099.
$wc -l english
36374448 english
$time for i in {0..99}; do j=$((i+10000000)); sed -n $j'{p;q}' english >/dev/null; done;
real 1m27.207s
user 1m20.712s
sys 0m6.284s
vs.
$time for i in {0..99}; do j=$((i+10000000)); head -$j english | tail -1 >/dev/null; done;
real 1m3.796s
user 0m59.356s
sys 0m32.376s
For printing a line out of multiple files
$wc -l english*
36374448 english
17797377 english.1024MB
3461885 english.200MB
57633710 total
$time for i in english*; do sed -n '10000000{p;q}' $i >/dev/null; done;
real 0m2.059s
user 0m1.904s
sys 0m0.144s
$time for i in english*; do head -10000000 $i | tail -1 >/dev/null; done;
real 0m1.535s
user 0m1.420s
sys 0m0.788s