
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/12793562/

Text processing - Python vs Perl performance

python regex performance perl text-processing

Asked by ihightower

Here are my Perl and Python scripts to do some simple text processing on about 21 log files, each about 300 KB to 1 MB (maximum), x 5 times repeated (total of 125 files, due to the log repeated 5 times).

Python Code (modified to use compiled re patterns and re.I)

#!/usr/bin/python

import re
import fileinput

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

for line in fileinput.input():
    fn = fileinput.filename()
    currline = line.rstrip()

    mprev = exists_re.search(currline)

    if(mprev):
        xlogtime = mprev.group(1)

    mcurr = location_re.search(currline)

    if(mcurr):
        print fn, xlogtime, mcurr.group(1)

Perl Code


#!/usr/bin/perl

while (<>) {
    chomp;

    if (m/^(.*?) INFO.*Such a record already exists/i) {
        $xlogtime = $1;
    }

    if (m/^AwbLocation (.*?) insert into/i) {
        print "$ARGV $xlogtime $1\n";
    }
}

And, on my PC both scripts generate exactly the same result file of 10,790 lines. And here are the timings on Cygwin's Perl and Python implementations.

User@UserHP /cygdrive/d/tmp/Clipboard
# time /tmp/scripts/python/afs/process_file.py *log* *log* *log* *log* *log* >
summarypy.log

real    0m8.185s
user    0m8.018s
sys     0m0.092s

User@UserHP /cygdrive/d/tmp/Clipboard
# time /tmp/scripts/python/afs/process_file.pl *log* *log* *log* *log* *log* >
summarypl.log

real    0m1.481s
user    0m1.294s
sys     0m0.124s

Originally, it took 10.2 seconds using Python and only 1.9 secs using Perl for this simple text processing.


(UPDATE) But after switching to the compiled re version in Python, it now takes 8.2 seconds in Python and 1.5 seconds in Perl. Perl is still much faster.

Is there any way to improve the speed of Python, or is it simply obvious that Perl will be the speedy one for simple text processing?

By the way, this was not the only test I did for simple text processing... And however I restructure the source code, Perl always wins by a large margin. Not once did Python perform better for simple m/regex/ match-and-print work.

Please do not suggest using C, C++, Assembly, other flavours of Python, etc.

I am looking for a solution using standard Python with its built-in modules, compared against standard Perl (not even using modules). Boy, I wish I could use Python for all my tasks due to its readability, but give up that much speed? I don't think so.

So, please suggest how the code can be improved to get results comparable to Perl's.

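One tactic that sometimes narrows the gap in standard Python is to read each file into memory once and scan the whole buffer with a single combined, MULTILINE-compiled pattern, instead of paying Python-level overhead (rstrip, two search calls) on every line. A minimal sketch of that idea (the combined alternation and the summarize helper are my own construction, not from the original scripts):

```python
import re

# Both original patterns folded into one alternation; re.M makes ^ match
# at every line start, so a single finditer pass covers the whole buffer.
combined = re.compile(
    r'^(?P<time>.*?) INFO.*Such a record already exists'
    r'|^AwbLocation (?P<loc>.*?) insert into',
    re.I | re.M)

def summarize(text, fname):
    """Replay the per-line state machine over one big buffer."""
    results = []
    xlogtime = None
    for m in combined.finditer(text):
        if m.group('time') is not None:   # the "already exists" branch
            xlogtime = m.group('time')
        else:                             # the AwbLocation branch
            results.append((fname, xlogtime, m.group('loc')))
    return results
```

finditer yields matches in document order, so the xlogtime carried between lines behaves exactly as in the per-line loop.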

UPDATE: 2012-10-18


As other users suggested, Perl has its place and Python has its own.

So, for this question, one can safely conclude that for a simple regex match on each line of hundreds or thousands of text files, writing the results to a file (or printing to screen), Perl will always, always WIN in performance for this job. It's as simple as that.

Please note that when I say Perl wins in performance... only standard Perl and Python are compared... not resorting to obscure modules (obscure for a normal user like me), and not calling C, C++, or assembly libraries from Python or Perl. We don't have time to learn all those extra steps and installations for a simple text-matching job.

So, Perl rocks for text processing and regex.


Python has its place to rock in other areas.

Update 2013-05-29: An excellent article that does a similar comparison is here. Perl again wins for simple text matching... For more details, read the article.

Answered by jrd1

In general, all artificial benchmarks are evil. However, everything else being equal (algorithmic approach), you can make improvements on a relative basis. However, it should be noted that I don't use Perl, so I can't argue in its favor. That being said, with Python you can try using Pyrex or Cython to improve performance. Or, if you are adventurous, you can try converting the Python code into C++ via ShedSkin (which works for most of the core language, and some, but not all, of the core modules).

Nevertheless, you can follow some of the tips posted here:


http://wiki.python.org/moin/PythonSpeed/PerformanceTips


Answered by Josh Wright

This is exactly the sort of stuff that Perl was designed to do, so it doesn't surprise me that it's faster.


One easy optimization in your Python code would be to precompile those regexes, so they aren't getting recompiled each time.


exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists')
location_re = re.compile(r'^AwbLocation (.*?) insert into')

And then in your loop:


mprev = exists_re.search(currline)

and

mcurr = location_re.search(currline)

That by itself won't magically bring your Python script in line with your Perl script, but repeatedly calling re in a loop without compiling first is bad practice in Python.

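The effect can be checked with a quick timeit sketch (a toy micro-benchmark, not the original log data). Note that module-level re.search() does cache compiled patterns internally, so precompiling mainly removes the per-call cache lookup:

```python
import re
import timeit

line = "2012-10-01 12:00:00 INFO foo Such a record already exists"
pattern = r'^(.*?) INFO.*Such a record already exists'
compiled = re.compile(pattern)

# Module-level call: looks the pattern up in re's internal cache each time.
t_module = timeit.timeit(lambda: re.search(pattern, line), number=100000)
# Precompiled object: the compile/lookup cost is paid once, up front.
t_object = timeit.timeit(lambda: compiled.search(line), number=100000)

print("module-level: %.3fs  precompiled: %.3fs" % (t_module, t_object))
```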

Answered by Don O'Donnell

Function calls are a bit expensive in terms of time in Python. And yet you have a loop invariant function call to get the file name inside the loop:


fn = fileinput.filename()

Move this line above the for loop and you should see some improvement in your Python timing. Probably not enough to beat Perl, though.
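The general effect of hoisting a loop-invariant call can be sketched with a toy example (the get_name helper is a hypothetical stand-in for fileinput.filename(); note that with several input files the filename is only invariant per file, so in the real script it would need refreshing whenever the file changes):

```python
def get_name():
    # hypothetical stand-in for fileinput.filename()
    return "file.log"

def call_inside_loop():
    total = 0
    for _ in range(1000):
        name = get_name()   # invariant call paid on every iteration
        total += len(name)
    return total

def call_hoisted():
    name = get_name()       # paid once, before the loop
    total = 0
    for _ in range(1000):
        total += len(name)
    return total

# Same result either way; the hoisted version just skips 999 function calls.
```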

Answered by ikegami

Hypothesis: Perl spends less time backtracking in lines that don't match due to optimisations it has that Python doesn't.


What do you get by replacing


^(.*?) INFO.*Such a record already exists

with

^((?:(?! INFO).)*?) INFO.*Such a record already exists

or


^(?>(.*?) INFO).*Such a record already exists
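On matching lines the guarded variant extracts the same group as the original; the difference only shows on non-matching lines, where the lookahead stops the lazy quantifier from retrying every prefix. A quick check (the sample lines are made up; the atomic-group form (?>...) is Perl syntax that Python's re module only accepts from 3.11 on, so it is omitted here):

```python
import re

original = re.compile(r'^(.*?) INFO.*Such a record already exists')
# (?! INFO) forbids " INFO" inside the lazily grown prefix, so failing
# lines are rejected without quadratic backtracking.
guarded = re.compile(r'^((?:(?! INFO).)*?) INFO.*Such a record already exists')

hit = "2012-10-01 INFO worker Such a record already exists"
miss = "2012-10-01 DEBUG nothing to see here"

print(original.search(hit).group(1))   # 2012-10-01
print(guarded.search(hit).group(1))    # 2012-10-01
print(original.search(miss), guarded.search(miss))   # None None
```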

Answered by pepr

I expect Perl to be faster. Just being curious, can you try the following?

#!/usr/bin/python

import re
import glob
import sys
import os

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

for mask in sys.argv[1:]:
    for fname in glob.glob(mask):
        if os.path.isfile(fname):
            f = open(fname)
            for line in f:
                mex = exists_re.search(line)
                if mex:
                    xlogtime = mex.group(1)

                mloc = location_re.search(line)
                if mloc:
                    print fname, xlogtime, mloc.group(1)
            f.close()

Update, as a reaction to "it is too complex".

Of course, it looks more complex than the Perl version. Perl was built around regular expressions; this way, you can hardly find an interpreted language that is faster at regular expressions. The Perl syntax...

while (<>) {
    ...
}

... also hides a lot of things that have to be done somehow in a more general language. On the other hand, it is quite easy to make the Python code more readable if you move the unreadable part out:


#!/usr/bin/python

import re
import glob
import sys
import os

def input_files():
    '''The generator loops through the files defined by masks from cmd.'''
    for mask in sys.argv[1:]:
        for fname in glob.glob(mask):
            if os.path.isfile(fname):
                yield fname


exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

for fname in input_files():
    with open(fname) as f:        # Now the f.close() is done automatically
        for line in f:
            mex = exists_re.search(line)
            if mex:
                xlogtime = mex.group(1)

            mloc = location_re.search(line)
            if mloc:
                print fname, xlogtime, mloc.group(1)

Here the def input_files() could be placed elsewhere (say, in another module), or it can be reused. It is possible to mimic even Perl's while (<>) {...} easily, though not the same way syntactically:

#!/usr/bin/python

import re
import glob
import sys
import os

def input_lines():
    '''The generator loops through the lines of the files defined by masks from cmd.'''
    for mask in sys.argv[1:]:
        for fname in glob.glob(mask):
            if os.path.isfile(fname):
                with open(fname) as f: # now the f.close() is done automatically
                    for line in f:
                        yield fname, line

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

for fname, line in input_lines():
    mex = exists_re.search(line)
    if mex:
        xlogtime = mex.group(1)

    mloc = location_re.search(line)
    if mloc:
        print fname, xlogtime, mloc.group(1)

Then the last for loop may look as easy (in principle) as Perl's while (<>) {...}. Such readability enhancements are more difficult in Perl.

Anyway, it will not make the Python program faster. Perl will be faster again here. Perl is a file/text cruncher. But, in my opinion, Python is a better programming language for more general purposes.