Perl 比 bash 快吗?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/1132135/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): Stack Overflow

时间:2020-09-17 21:03:54  来源:igfitidea

Is Perl faster than bash?

perl, bash, optimization, comparison, performance

提问by Brent

I have a bash script that cuts out a section of a logfile between 2 timestamps, but because of the size of the files, it takes quite a while to run.

我有一个 bash 脚本,它在 2 个时间戳之间切出日志文件的一部分,但是由于文件的大小,运行需要很长时间。

If I were to rewrite the script in Perl, could I achieve a significant speed increase - or would I have to move to something like C to accomplish this?

如果我要在 Perl 中重写脚本,我能否实现显着的速度提升 - 或者我是否必须转向 C 之类的东西来实现这一目标?

#!/bin/bash

if [ $# -ne 3 ]; then
  echo "USAGE $0 <logfile(s)> <from date (epoch)> <to date (epoch)>"
  exit 1
fi

LOGFILES=$1
FROM=$2
TO=$3
rm -f /tmp/getlogs??????
TEMP=`mktemp /tmp/getlogsXXXXXX`

## LOGS NEED TO BE LISTED CHRONOLOGICALLY
# NOTE: the awk field numbers below were stripped by the page scrape
# and have been reconstructed from context.
ls -lnt $LOGFILES | awk '{print $NF}' > $TEMP
LOGFILES=`tac $TEMP`
cp /dev/null $TEMP

findEntry() {
  RETURN=0
  dt=$1
  fil=$2
  ln1=$3
  ln2=$4
  t1=`tail -n+$ln1 $fil | head -n1 | cut -c1-15`
  dt1=`date -d "$t1" +%s`
  t2=`tail -n+$ln2 $fil | head -n1 | cut -c1-15`
  dt2=`date -d "$t2" +%s`
  if [ $dt -ge $dt2 ]; then
    mid=$dt2
  else
    mid=$(( (($ln2-$ln1)*($dt-$dt1)/($dt2-$dt1))+$ln1 ))
  fi
  t3=`tail -n+$mid $fil | head -n1 | cut -c1-15`
  dt3=`date -d "$t3" +%s`
  # finished
  if [ $dt -eq $dt3 ]; then
    # FOUND IT (scroll back to the first match)
    while [ $dt -eq $dt3 ]; do
      mid=$(( $mid-1 ))
      t3=`tail -n+$mid $fil | head -n1 | cut -c1-15`
      dt3=`date -d "$t3" +%s`
    done
    RETURN=$(( $mid+1 ))
    return
  fi
  if [ $(( $mid-1 )) -eq $ln1 ] || [ $(( $ln2-1 )) -eq $mid ]; then
    # FOUND NEAR IT
    RETURN=$mid
    return
  fi
  # not finished yet
  if [ $dt -lt $dt3 ]; then
    # too high
    findEntry $dt $fil $ln1 $mid
  else
    if [ $dt -ge $dt3 ]; then
      # too low
      findEntry $dt $fil $mid $ln2
    fi
  fi
}

# Check timestamps on logfiles
LOGS=""
for LOG in $LOGFILES; do
  filetime=`ls -ln $LOG | awk '{print $6,$7}'`
  timestamp=`date -d "$filetime" +%s`
  if [ $timestamp -ge $FROM ]; then
    LOGS="$LOGS $LOG"
  fi
done

# Check first and last dates in LOGS to refine further
for LOG in $LOGS; do
  if [ ${LOG%.gz} != $LOG ]; then
    gunzip -c $LOG > $TEMP
  else
    cp $LOG $TEMP
  fi
  t=`head -n1 $TEMP | cut -c1-15`
  FIRST=`date -d "$t" +%s`
  t=`tail -n1 $TEMP | cut -c1-15`
  LAST=`date -d "$t" +%s`
  if [ $TO -lt $FIRST ] || [ $FROM -gt $LAST ]; then
    # This file is entirely out of range
    cp /dev/null $TEMP
  else
    if [ $FROM -le $FIRST ]; then
      if [ $TO -ge $LAST ]; then
        # Entire file is within range
        cat $TEMP
      else
        # Last part of file is out of range
        STARTLINENUMBER=1
        ENDLINENUMBER=`wc -l < $TEMP`
        findEntry $TO $TEMP $STARTLINENUMBER $ENDLINENUMBER
        head -n$RETURN $TEMP
      fi
    else
      if [ $TO -ge $LAST ]; then
        # First part of file is out of range
        STARTLINENUMBER=1
        ENDLINENUMBER=`wc -l < $TEMP`
        findEntry $FROM $TEMP $STARTLINENUMBER $ENDLINENUMBER
        tail -n+$RETURN $TEMP
      else
        # range is entirely within this logfile
        STARTLINENUMBER=1
        ENDLINENUMBER=`wc -l < $TEMP`
        findEntry $FROM $TEMP $STARTLINENUMBER $ENDLINENUMBER
        n1=$RETURN
        findEntry $TO $TEMP $STARTLINENUMBER $ENDLINENUMBER
        n2=$RETURN
        tail -n+$n1 $TEMP | head -n$(( $n2-$n1 ))
      fi
    fi
  fi
done
rm -f /tmp/getlogs??????

回答by Daniel C. Sobral

Perl is absurdly faster than Bash. And, for text manipulation, you can actually achieve better performance with Perl than with C, unless you take the time to write complex algorithms. Of course, for simple stuff C can be unbeatable.

Perl 比 Bash 快得多。而且,对于文本操作,使用 Perl 实际上可以获得比使用 C 更好的性能,除非您花时间编写复杂的算法。当然,对于简单的东西,C 可能是无与伦比的。

That said, if your "bash" script is not looping, just calling other programs, then there isn't any gain to be had. For example, if your script looks like "cat X | grep Y | tr -f 3-5 | sort | uniq", then most of the time is spent on cat, grep, tr, sort and uniq, NOT on Bash.

也就是说,如果您的“bash”脚本没有循环,只是调用其他程序,那么就没有任何收益。例如,如果您的脚本看起来像“ cat X | grep Y | tr -f 3-5 | sort | uniq”,那么大部分时间都花在 cat、grep、tr、sort 和 uniq 上,而不是花在 Bash 上。

You'll gain performance if there is any loop in the script, or if you can avoid multiple reads of the same file.

如果脚本中有任何循环,或者如果您能避免多次读取同一个文件,您将获得性能提升。
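To make that concrete, here is a hedged sketch (the sample file and TIMESTAMP markers are made up): the grep/grep/tail pattern scans the file up to three times, while a single sed range scans it once and quits at the end marker.

为了更具体地说明这一点,这里有一个示意性的草图(示例文件和 TIMESTAMP 标记都是虚构的):grep/grep/tail 模式最多会扫描文件三次,而单个 sed 区间只扫描一次,并在遇到结束标记时立即退出。

```shell
printf '%s\n' 'A before' 'TIMESTAMP1 start' 'B inside' 'TIMESTAMP2 end' 'C after' > /tmp/sample.log

# Three reads of the file: two greps plus the tail.
LINE1=$(grep -n TIMESTAMP1 /tmp/sample.log | head -1 | cut -d ':' -f 1)
LINE2=$(grep -n TIMESTAMP2 /tmp/sample.log | head -1 | cut -d ':' -f 1)
tail -n +"$LINE1" /tmp/sample.log | head -n $((LINE2 - LINE1))

# One read, one process, and it quits as soon as TIMESTAMP2 is seen
# (note: unlike the head arithmetic above, this range includes the end line).
sed -n '/TIMESTAMP1/,/TIMESTAMP2/{p;/TIMESTAMP2/q;}' /tmp/sample.log
```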

You say you cut stuff between two timestamps on a file. Let's say your Bash script looks like this:

你说你在一个文件的两个时间戳之间剪切东西。假设您的 Bash 脚本如下所示:

LINE1=`grep -n TIMESTAMP1 filename | head -1 | cut -d ':' -f 1`
LINE2=`grep -n TIMESTAMP2 filename | head -1 | cut -d ':' -f 1`
tail +$LINE1 filename | head -$(($LINE2-$LINE1))

Then you'll gain performance, because you are reading the whole file three times: once for each command where "filename" appears. In Perl, you would do something like this:

然后您将获得性能,因为您正在读取整个文件三次:对于出现“filename”的每个命令一次。在 Perl 中,你会做这样的事情:

my $state = 0;
while (<>) {
    exit if /TIMESTAMP2/;
    print $_ if $state == 1;
    $state = 1 if /TIMESTAMP1/;
}

This will read the file only once and will also stop once you read TIMESTAMP2. Since you are processing multiple files, you'd use "last" or "break" instead of "exit", so that the script can continue to process the files.

这只会读取文件一次,并且在您读取 TIMESTAMP2 后也会停止。由于您正在处理多个文件,您将使用“last”或“break”而不是“exit”,以便脚本可以继续处理文件。

Anyway, seeing your script I'm positive you'll gain a lot by rewriting it in Perl. Notwithstanding the loops dealing with file names (whose speed WILL be improved, but is probably insignificant), for each file which is not fully inside or outside scope you do:

无论如何,看到你的脚本我肯定你会通过用 Perl 重写它而获得很多。尽管有处理文件名的循环(其速度将得到提高,但可能无关紧要),对于每个不完全在范围内或范围外的文件,您都可以:

  1. Read the WHOLE file to count lines!
  2. Do multiple tails on the file
  3. Finish by "head" or "tail" the file once again
  1. 读取整个文件以计算行数!
  2. 对文件做多次 tail
  3. 最后再对文件做一次 head 或 tail

Furthermore, you head your tails. Each time you do that, some piece of code is reading that data. Some of those lines are being read up to 10 times or more!

此外,你还对 tail 的结果再做 head。每次这样做时,都会有一段代码读取这些数据。其中一些行被读取了 10 次甚至更多!
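For instance, every `tail -n+$mid $fil | head -n1` probe re-reads the file from the top. A hedged sketch of the single-pass alternative (the line number 42 is arbitrary):

例如,每次 `tail -n+$mid $fil | head -n1` 探测都会从头重新读取文件。下面是单遍替代方案的示意草图(行号 42 是随意选的):

```shell
seq 1 100 > /tmp/lines.txt

# The script's pattern: re-reads the file from the top on every probe.
tail -n +42 /tmp/lines.txt | head -n 1

# One read that stops as soon as the wanted line is found.
awk 'NR == 42 { print; exit }' /tmp/lines.txt
```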

回答by chaos

You will almost certainly realize a massive speed benefit from writing your script in Perl just by cutting off the file read when you pass your second timestamp.

几乎可以肯定,只要在越过第二个时间戳时停止读取文件,用 Perl 重写脚本就会带来巨大的速度优势。

More generally, yes; a bash script of any complexity, unless it's a truly amazing piece of wizardry, can handily be outperformed by a Perl script for equivalent inputs and outputs.

更一般地说,是的;任何有一定复杂度的 bash 脚本,除非它是真正令人惊叹的魔法之作,否则对于等效的输入和输出,都能被 Perl 脚本轻松胜过。

回答by Sinan ünür

Updated script based on Brent's comment: This one is untested.

根据 Brent 的评论更新了脚本:此脚本未经测试。

#!/usr/bin/perl

use strict;
use warnings;

my %months = (
    jan => 1, feb => 2,  mar => 3,  apr => 4,
    may => 5, jun => 6,  jul => 7,  aug => 8,
    sep => 9, oct => 10, nov => 11, dec => 12,
);

while ( my $line = <> ) {
    my $ts = substr $line, 0, 15;
    next if parse_date($ts) lt '0201100543';
    last if parse_date($ts) gt '0715123456';
    print $line;
}

sub parse_date {
    my ($month, $day, $time) = split ' ', $_[0];
    my ($hour, $min, $sec) = split /:/, $time;
    return sprintf(
        '%2.2d%2.2d%2.2d%2.2d%2.2d',
        $months{lc $month}, $day,
        $hour, $min, $sec,
    );
}

__END__

Previous answer for reference: What is the format of the file? Here is a short script which assumes the first column is a timestamp and prints only lines that have timestamps in a certain range. It also assumes that the timestamps are sorted. On my system, it took about a second to filter 900,000 lines out of a million:

以前的答案供参考:文件的格式是什么?这是一个简短的脚本,它假设第一列是时间戳,并且只打印时间戳在特定范围内的行。它还假设时间戳已排序。在我的系统上,从一百万行中筛选出 900,000 行大约需要一秒钟:

#!/usr/bin/perl

use strict;
use warnings;

while ( <> ) {
    my ($ts) = split;
    next if $ts < 1247672719;
    last if $ts > 1252172093;
    print $ts, "\n";
}

__END__

回答by Tanktalus

Based on the shell code you have, with multiple calls to tail/head, I'd say absolutely Perl could be faster. C could be even faster, but the development time probably won't be worth it, so I'd stick to Perl. (I say "could" because you can write shell scripts in Perl, and I've seen enough of those to cringe. That obviously wouldn't have the speed benefit that you want.)

根据您现有的 shell 代码,多次调用 tail/head,我认为 Perl 绝对可以更快。C 可能还要更快,但开发时间可能不值得,所以我会坚持使用 Perl。(我说“可以”是因为你也能用 Perl 写出 shell 风格的脚本,这种让人皱眉的脚本我见得够多了。那样显然不会有你想要的速度优势。)

Perl has a higher startup cost, or so it's claimed. Honestly, I've never noticed. If your alternative is to do it in Java, Perl has no startup cost. Compared to Bash, I simply haven't noticed. What I have noticed is that as I get away from calling all the specialised Unix tools, which are great when you don't have alternatives, and get toward doing it all in a single process, speed goes up. The overhead of creating new processes on Unix isn't as severe as it may have been on Windows, but it's still not entirely negligible as you have to reinitialise the C runtime library (libC) each time, parse arguments, open files (perhaps), etc. In Perl, you end up using vast swaths of memory as you pass everything around in a list or something, but it is all in memory, so it's faster. And many of the tools you're used to are either built in (map/grep, regexes) or are available in modules on CPAN. A good combination of these would get the job done easily.

据说 Perl 的启动成本更高。老实说,我从来没有注意到。如果您的替代方案是用 Java 完成,Perl 就谈不上有启动成本。与 Bash 相比,我根本没有注意到。我注意到的是,当我不再调用各种专门的 Unix 工具(在您没有替代品时它们非常有用),而是在单个进程中完成所有工作时,速度会提高。在 Unix 上创建新进程的开销不像在 Windows 上那样严重,但仍然不能完全忽略,因为您每次都必须重新初始化 C 运行时库 (libc)、解析参数、打开文件(可能)等。在 Perl 中,当您把所有内容放在列表之类的结构中传递时,最终会使用大量内存,但它们都在内存中,因此速度更快。而且您习惯使用的许多工具要么是内置的(map/grep、正则表达式),要么可以在 CPAN 的模块中找到。把它们很好地组合起来就能轻松完成工作。
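The same point applies inside bash itself: every backtick spawns a process, while a built-in expansion does not. A small hedged sketch (the log line is invented; the 15-character prefix mirrors the `cut -c1-15` calls in the question's script):

同样的道理也适用于 bash 内部:每个反引号都会派生一个进程,而内置展开则不会。一个小的示意草图(日志行是虚构的;15 个字符的前缀对应问题脚本中的 cut -c1-15 调用):

```shell
line='Jun 15 16:49:48 myhost myapp: request served'

# A subprocess per extraction - the pattern used throughout the question's script:
ts1=$(printf '%s' "$line" | cut -c1-15)

# The same 15-character prefix with a bash built-in - no process is forked:
ts2=${line:0:15}

echo "$ts1"
echo "$ts2"
```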

The big thing is to avoid re-reading files. It's costly. And you're doing it many times. Heck, you could use the :gzip modifier on open to read your gzip files directly, saving yet another pass - and this would be faster in that you'd be reading less from disk.

最重要的是避免重新读取文件。这很昂贵,而且你做了很多次。哎呀,您甚至可以在 open 上使用 :gzip 修饰符直接读取 gzip 文件,从而再省掉一遍处理 - 而且由于从磁盘读取的数据更少,这样还会更快。
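In shell terms the same saving is available by streaming instead of staging through a temp file (the file name here is hypothetical):

用 shell 的方式也能获得同样的节省:用流式管道代替临时文件中转(这里的文件名是假设的):

```shell
printf '%s\n' 'one' 'two' 'three' | gzip > /tmp/app.log.gz

# Instead of staging to a temp file (gunzip -c "$LOG" > "$TEMP"; grep ... "$TEMP"),
# stream the decompressed data straight into the filter:
gunzip -c /tmp/app.log.gz | grep two
```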

回答by Nick Presta

I would profile all three solutions and pick which is best in terms of initial startup speed, processing speed, and memory usage.

我将分析所有三个解决方案,并选择在初始启动速度、处理速度和内存使用方面最好的解决方案。

Something like Perl/Python/Ruby may not be the absolute fastest, but you can rapidly develop in those languages - much faster than in C and even Bash.

像 Perl/Python/Ruby 这样的东西可能不是绝对最快的,但你可以用这些语言快速开发——比 C 甚至 Bash 快得多。

回答by ghostdog74

It depends on how your bash script is written. If you are not using awk to parse the log file, but are instead using bash's while read loop, then changing it to awk will improve the speed.

这取决于您的 bash 脚本是如何编写的。如果您没有使用 awk 来解析日志文件,而是使用 bash 的 while read 循环,那么将其更改为 awk 将提高速度。
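A hedged sketch of that substitution, with a made-up two-field log:

该替换的示意草图,使用虚构的两列日志:

```shell
printf '%s\n' '1247672719 alpha' '1252172093 beta' > /tmp/kv.log

# bash while-read loop: the shell re-parses the fields on every iteration.
while read -r ts msg; do
  echo "$ts"
done < /tmp/kv.log

# awk does the same field split in one fast pass:
awk '{ print $1 }' /tmp/kv.log
```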

回答by David

bash actually reads the file a line at a time as it interprets it on the fly (which you'll be made painfully aware of if you ever modify a bash script while it's still running), rather than preloading and parsing it all at once. So yeah, Perl will generally be a lot faster if you're doing anything that you wouldn't normally do in bash anyway.

bash 实际上是一次读取一行文件、边读边解释的(如果你在 bash 脚本仍在运行时修改它,你会痛苦地意识到这一点),而不是一次性预加载并解析整个文件。所以是的,如果你要做的事情本来就不是 bash 擅长的,Perl 通常会快得多。

回答by mas

I agree that moving from a bash-only script to Perl (or even awk if a perl environment is not readily available) could yield a speed benefit, assuming both are equally well written.

我同意从纯 bash 脚本迁移到 Perl(如果 perl 环境不可用,甚至 awk)可以产生速度优势,假设两者都编写得同样好。

However, if the extract was amenable to being formed by a bash script that creates parameters for and then calls grep with a regex then that could be faster than a 'pure' script.

但是,如果提取适合由 bash 脚本形成,该脚本创建参数,然后使用正则表达式调用 grep,那么这可能比“纯”脚本更快。
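A minimal sketch of that idea (the timestamps are invented): the shell only assembles the addresses, and a single sed pass scans the data.

这个想法的最小草图(时间戳是虚构的):shell 只负责拼装地址,由单个 sed 进程扫描数据。

```shell
FROM='2009-06-15 00:00:01'
TO='2009-06-15 00:00:03'
printf '2009-06-15 00:00:0%d x\n' 1 2 3 4 > /tmp/range.log

# The shell builds the parameters; one sed pass does the scanning.
sed -n "/^$FROM/,/^$TO/p" /tmp/range.log
```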

回答by Mathieu Longtin

In your bash script, put this:

在你的 bash 脚本中,输入:

perl -ne "print if /$FROM/../$TO/" $LOGFILES

$FROM and $TO are really regexes matching your start and end times.

$FROM 和 $TO 实际上是匹配开始和结束时间的正则表达式。

They are inclusive, so you might want to put 2009-06-14 23:59:59 for your end time, since 2009-06-15 00:00:00 will include transactions at midnight.

它们是包含边界的,因此您可能希望把 2009-06-14 23:59:59 作为结束时间,因为 2009-06-15 00:00:00 会把午夜的交易也包括进来。

回答by joe

Well, bash is interpreted line by line as it runs and depends on calling a lot of external progs (depending on what you want to do). You often have to use temp files as intermediate storage for result sets. It (shell) was originally designed to talk to the system and automate cmd sequences (shell files).

好吧,bash 在运行时逐行解释,并且依赖于调用大量外部程序(取决于您想要做什么)。您通常必须使用临时文件作为结果集的中间存储。它(shell)最初设计用于与系统对话并自动执行 cmd 序列(shell 文件)。
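A tiny hedged illustration of that temp-file pattern next to its pipe equivalent (the file names are invented):

下面用一个小例子对比临时文件模式和等价的管道写法(文件名是虚构的):

```shell
printf '%s\n' 3 1 2 > /tmp/results.txt

# Temp file as intermediate storage for a result set:
sort -n /tmp/results.txt > /tmp/stage.txt
head -n 1 /tmp/stage.txt

# Same result through a pipe - no intermediate file:
sort -n /tmp/results.txt | head -n 1
```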

Perl is more like C: it's largely self-contained, with a huge library of free code, and it's compiled, so it runs much faster - e.g. about 80-90% of the speed of C - but is easier to program (e.g. variable sizes are dynamic).

Perl 更像 C:它基本上是自包含的,拥有庞大的免费代码库,而且它是编译执行的,所以运行速度快得多,例如大约是 C 的 80-90%,但更容易编程(例如变量大小是动态的)。