bash 'while read line' efficiency with big file

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/10364570/

Time: 2020-09-18 02:09:13  Source: igfitidea

bash, performance

Asked by leemzoon

I was using a while loop to process a task that reads records from a big file of about 10 million lines. I found that the processing became slower and slower as time went by. I made a simulated script with 1 million lines, shown below, which reveals the problem. But I still don't know why. How does the read command work?

seq 1000000 > seq.dat
while read s;
do
    if [ `expr $s % 50000` -eq 0 ];then
        echo -n $( expr `date +%s` - $A) ' ';
        A=`date +%s`;
    fi
done < seq.dat

The terminal outputs the time interval:

98 98 98 98 98 97 98 97 98 101 106 112 121 121 127 132 135 134

At about 50,000 lines, the processing becomes obviously slower.

Answer by shellter

Using your code, I saw the same pattern of increasing times (right from the beginning!). If you want faster processing, you should rewrite using shell internal features. Here's my bash version:

tabChar=$'\t'  # a real tab character
seq 1000000 > seq.dat
while read s;
do
    if (( ! ( s % 50000 ) )) ;then
        echo $s "${tabChar}" $( expr `date +%s` - $A) 
        A=$(date +%s);
    fi
done < seq.dat

Edit: fixed a bug. The output indicated that each line was being processed; now only every 50000th line gets the timing treatment. Doh!

was

  if ((  s % 50000 )) ;then

fixed to

  if (( ! ( s % 50000 ) )) ;then
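
The fix hinges on how (( )) reports truth: the command succeeds when the expression evaluates to non-zero, so (( s % 50000 )) fires on every line that is NOT a multiple of 50000. A minimal sketch comparing the two conditions on a few sample values (the function names buggy_match and fixed_match are made up for illustration, they are not from the post):

```shell
#!/usr/bin/env bash
# (( expr )) exits with status 0 (true) when expr evaluates to non-zero.

buggy_match() {   # hypothetical name: the pre-edit condition
    local s=$1
    (( s % 50000 ))
}

fixed_match() {   # hypothetical name: the corrected condition
    local s=$1
    (( ! ( s % 50000 ) ))
}

# Show which values each condition selects.
for s in 1 49999 50000 100000; do
    buggy_match "$s" && b=yes || b=no
    fixed_match "$s" && f=yes || f=no
    echo "$s buggy=$b fixed=$f"
done
```

Only 50000 and 100000 should report fixed=yes; the buggy form selects the other two values instead.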

Output now, under ksh (echo ${.sh.version} = Version JM 93t+ 2010-05-24):

50000
100000   1
150000   0
200000   1
250000   0
300000   1
350000   0
400000   1
450000   0
500000   1
550000   0
600000   1
650000   0
700000   1
750000   0
800000   1
850000   0
900000   1
950000   0
1e+06    1

Output under bash:

50000    480
100000   3
150000   2
200000   3
250000   3
300000   2
350000   3
400000   3
450000   2
500000   2
550000   3
600000   2
650000   2
700000   3
750000   3
800000   2
850000   2
900000   3
950000   2

As to why your original test case is taking so long ... not sure. I was surprised to see both the time for each test cycle AND the increase in time. If you really need to understand this, you may need to spend time instrumenting more test code. Maybe you'd see something by running truss or strace (depending on your base OS).
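
One hedged way to follow that suggestion is to run the test script under a tracer in counting mode and compare how many child processes and system calls each variant generates. The sketch below assumes strace (Linux) or truss (Solaris, FreeBSD) is installed and that the loop has been saved to a file; the helper names pick_tracer and trace_summary and the file name slow-loop.sh are made up for illustration.

```shell
#!/usr/bin/env bash
# pick_tracer: print the first available tracer, or fail if none is installed.
pick_tracer() {
    local t
    for t in strace truss; do
        if command -v "$t" >/dev/null 2>&1; then
            echo "$t"
            return 0
        fi
    done
    return 1
}

# trace_summary SCRIPT: run SCRIPT under the tracer with
#   -f  follow child processes (each expr and date call is one)
#   -c  print a per-syscall count summary instead of every single call
trace_summary() {
    local tracer
    tracer=$(pick_tracer) || { echo "no strace or truss found" >&2; return 1; }
    "$tracer" -f -c bash "$1"
}

# Usage (slow-loop.sh is a placeholder for the script under test):
#   trace_summary slow-loop.sh
```

Running it once on the expr-based loop and once on the (( ))-based loop should make the difference in forked processes visible in the summary.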

I hope this helps.

Answer by Tim Pote

read is a comparatively slow process, as the author of "Learning the Korn Shell" points out* (just above Section 7.2.2.1). There are other programs, such as awk or sed, that have been highly optimized to do what is essentially the same thing: read from a file one line at a time and perform some operations using that input.

Not to mention that you're calling an external process every time you do subtraction or take the modulus, which can get expensive. awk has both of those functionalities built in.
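
If the loop has to stay in the shell, the external calls can also be avoided there. A minimal sketch, assuming bash: it uses the built-in arithmetic of (( )) and $(( )) plus bash's SECONDS variable in place of expr and date, so nothing is forked inside the loop. The function name time_chunks and its arguments are made up for illustration, and the demo file is kept smaller than the one in the question.

```shell
#!/usr/bin/env bash
# time_chunks FILE [CHUNK]: for every CHUNK-th number read from FILE, print the
# number and the whole seconds elapsed since the previous report.
# Hypothetical helper; only shell builtins run inside the loop.
time_chunks() {
    local file=$1 chunk=${2:-50000}
    local s last=$SECONDS
    while IFS= read -r s; do
        if (( s % chunk == 0 )); then
            printf '%s\t%s\n' "$s" "$(( SECONDS - last ))"
            last=$SECONDS
        fi
    done < "$file"
}

seq 200000 > seq.dat
time_chunks seq.dat
```

SECONDS has one-second resolution, the same as date +%s, so the reported intervals are comparable to the ones in the question.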

As the following test shows, awk is quite a bit faster:

#!/usr/bin/env bash

seq 1000000 | 
awk '
  BEGIN {
    command = "date +%s"
    prevTime = 0
  }
  $0 % 50000 == 0 {
    command | getline currentTime
    close(command)

    print currentTime - prevTime
    prevTime = currentTime
  }
'

Output:

1335629268
0   
0   
0   
0   
0   
0   
0   
0   
0   
0   
0   
0   
0   
0   
1   
0   
0   
0   
0

Note that the first number is equivalent to date +%s. Just like in your test case, I let the first match be.

Note

*Yes, the author is talking about the Korn shell, not bash as the OP tagged, but bash and ksh are rather similar in a lot of ways; bash borrowed many of its features from ksh. So I would assume that the read command is not drastically different from one shell to another.