Why is numpy/pandas parsing of a csv file with long lines so slow?

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/29678154/



python, parsing, csv, numpy, pandas

Asked by dshepherd

I'm trying to efficiently parse a csv file with around 20,000 entries per line (and a few thousand lines) to a numpy array (or list of arrays, or anything similar really). I found a number of other questions, along with this blog post, which suggest that pandas's csv parser is extremely fast. However I've benchmarked pandas, numpy and some pure-python approaches and it appears that the trivial pure-python string splitting + list comprehension beats everything else by quite a large margin.


  • What's going on here?

  • Are there any csv parsers that would be more efficient?

  • If I change the format of the input data will it help?

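For reference, a file of comparable shape to the one described above can be generated along these lines (a sketch: the output path and the exact row count are assumptions, since the actual data file isn't included):

# Generate a test file with a few thousand rows of ~20,000 comma-separated doubles.
# The path and dimensions are assumptions based on the description above.
import numpy as np

np.savetxt("../data/temp_fixed_l_no_initial", np.random.randn(2000, 20000),
           delimiter=",")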

Here's the source code I'm benchmarking with (the sum() is just to make sure any lazy iterators are forced to evaluate everything):


#! /usr/bin/env python3

import sys

import time
import gc

import numpy as np
from pandas.io.parsers import read_csv
import csv

def python_iterator_csv():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        for line in f.readlines():
            all_data = line.strip().split(",")
            print(sum(float(x) for x in all_data))


def python_list_csv():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        for line in f.readlines():
            all_data = line.strip().split(",")
            print(sum([float(x) for x in all_data]))


def python_array_csv():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        for line in f.readlines():
            all_data = line.strip().split(",")
            print(sum(np.array([float(x) for x in all_data])))


def numpy_fromstring():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        for line in f.readlines():
            print(sum(np.fromstring(line, sep = ",")))


def numpy_csv():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        for row in np.loadtxt(f, delimiter = ",", dtype = np.float, ndmin = 2):
            print(sum(row))


def csv_loader(csvfile):
    return read_csv(csvfile,
                      header = None,
                      engine = "c",
                      na_filter = False,
                      quoting = csv.QUOTE_NONE,
                      index_col = False,
                      sep = ",")

def pandas_csv():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        for row in np.asarray(csv_loader(f).values, dtype = np.float64):
            print(sum(row))


def pandas_csv_2():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        print(csv_loader(f).sum(axis=1))


def simple_time(func, repeats = 3):
    gc.disable()

    for i in range(0, repeats):
        start = time.perf_counter()
        func()
        end = time.perf_counter()
        print(func, end - start, file = sys.stderr)
        gc.collect()

    gc.enable()
    return


if __name__ == "__main__":

    simple_time(python_iterator_csv)
    simple_time(python_list_csv)
    simple_time(python_array_csv)
    simple_time(numpy_csv)
    simple_time(pandas_csv)
    simple_time(numpy_fromstring)

    simple_time(pandas_csv_2)

The output (to stderr) is:


<function python_iterator_csv at 0x7f22302b1378> 19.754893831999652
<function python_iterator_csv at 0x7f22302b1378> 19.62786615600271
<function python_iterator_csv at 0x7f22302b1378> 19.66641107099713

<function python_list_csv at 0x7f22302b1ae8> 18.761991592000413
<function python_list_csv at 0x7f22302b1ae8> 18.722911622000538
<function python_list_csv at 0x7f22302b1ae8> 19.00348913199923

<function python_array_csv at 0x7f222baffa60> 41.8681991630001
<function python_array_csv at 0x7f222baffa60> 42.141840383999806
<function python_array_csv at 0x7f222baffa60> 41.86879085799956

<function numpy_csv at 0x7f222ba5cc80> 47.957625758001086
<function numpy_csv at 0x7f222ba5cc80> 47.245571732000826
<function numpy_csv at 0x7f222ba5cc80> 47.25457685799847

<function pandas_csv at 0x7f2228572620> 43.39656048499819
<function pandas_csv at 0x7f2228572620> 43.5016079220004
<function pandas_csv at 0x7f2228572620> 43.567352316000324

<function numpy_fromstring at 0x7f593ed3cc80> 32.490607361
<function numpy_fromstring at 0x7f593ed3cc80> 32.421125410997774
<function numpy_fromstring at 0x7f593ed3cc80> 32.37903898300283

<function pandas_csv_2 at 0x7f846d1aa730> 24.903284349999012
<function pandas_csv_2 at 0x7f846d1aa730> 25.498485038999206
<function pandas_csv_2 at 0x7f846d1aa730> 25.03262125800029

From the blog post linked above it seems that pandas can import a csv matrix of random doubles at a data rate of 145/1.279502 = 113 MB/s. My file is 814 MB, so pandas only manages ~19 MB/s for me!


edit: As pointed out by @ASGM, this wasn't really fair to pandas because it is not designed for row-wise iteration. I've included the suggested improvement in the benchmark but it's still slower than the pure python approaches. (Also: I've played around with profiling similar code, before simplifying it to this benchmark, and the parsing always dominated the time taken.)


edit2: Best of three times (in seconds) without the sum:


python_list_csv    17.8
python_array_csv   23.0
numpy_csv          28.6
numpy_fromstring   13.3
pandas_csv_2       24.2

so without the summation numpy.fromstring beats pure python by a small margin (I think fromstring is written in C so this makes sense).

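A quick way to see the fromstring-vs-split gap in isolation is to time both on a single synthetic line (a sketch, separate from the benchmark above; the line width is made up to match the description):

# Compare np.fromstring against split + float on one synthetic ~20,000-entry line.
import timeit
import numpy as np

line = ",".join("%.18e" % x for x in np.random.randn(20000))

t_split = timeit.timeit(lambda: [float(x) for x in line.split(",")], number=100)
t_fromstring = timeit.timeit(lambda: np.fromstring(line, sep=","), number=100)

print("split + float: %.3f s" % t_split)
print("np.fromstring: %.3f s" % t_fromstring)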

edit3:


I've done some experimentation with the C/C++ float parsing code here and it looks like I'm probably expecting too much from pandas/numpy. Most of the robust parsers listed there give times of 10+ seconds just to parse this number of floats. The only parser which resoundingly beats numpy.fromstring is boost's spirit::qi, which is C++ and so not likely to make it into any python libraries.


[ More precise results: spirit::qi ~3s, lexical_cast ~7s, atof and strtod ~10s, sscanf ~18s; stringstream and stringstream reused are incredibly slow at 50s and 28s. ]


Answered by ali_m

Does your CSV file contain column headers? If not, then explicitly passing header=None to pandas.read_csv can give a slight performance improvement for the Python parsing engine (but not for the C engine):


In [1]: np.savetxt('test.csv', np.random.randn(1000, 20000), delimiter=',')

In [2]: %timeit pd.read_csv('test.csv', delimiter=',', engine='python')
1 loops, best of 3: 9.19 s per loop

In [3]: %timeit pd.read_csv('test.csv', delimiter=',', engine='c')
1 loops, best of 3: 6.47 s per loop

In [4]: %timeit pd.read_csv('test.csv', delimiter=',', engine='python', header=None)
1 loops, best of 3: 6.26 s per loop

In [5]: %timeit pd.read_csv('test.csv', delimiter=',', engine='c', header=None)
1 loops, best of 3: 6.46 s per loop

Update


If there are no missing or invalid values then you can do a little better by passing na_filter=False (only valid for the C engine):


In [6]: %timeit pd.read_csv('test.csv', sep=',', engine='c', header=None)
1 loops, best of 3: 6.42 s per loop

In [7]: %timeit pd.read_csv('test.csv', sep=',', engine='c', header=None, na_filter=False)
1 loops, best of 3: 4.72 s per loop

There may also be small gains to be had by specifying the dtype explicitly:


In [8]: %timeit pd.read_csv('test.csv', sep=',', engine='c', header=None, na_filter=False, dtype=np.float64)
1 loops, best of 3: 4.36 s per loop

Update 2


Following up on @morningsun's comment, setting low_memory=False squeezes out a bit more speed:


In [9]: %timeit pd.read_csv('test.csv', sep=',', engine='c', header=None, na_filter=False, dtype=np.float64, low_memory=True)
1 loops, best of 3: 4.3 s per loop

In [10]: %timeit pd.read_csv('test.csv', sep=',', engine='c', header=None, na_filter=False, dtype=np.float64, low_memory=False)
1 loops, best of 3: 3.27 s per loop

For what it's worth, these benchmarks were all done using the current dev version of pandas (0.16.0-19-g8d2818e).


Answered by ASGM

In the pure python case, you're iterating over the rows and printing as you go. In the pandas case, you're importing the whole thing into a DataFrame, and then iterating over the rows. But pandas' strength isn't in iterating over the rows - it's in operations that take place over the whole DataFrame. Compare the speed of:


def pandas_csv():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        print(csv_loader(f).sum(axis=1))

This is still somewhat slower than the pure python approach, which you're welcome to use if this is the extent of your use case. But as @ali_m's comment points out, if you want to do more than print the sum of the rows, or if you want to transform the data in any way, you will probably find pandas or numpy to be more efficient both in processing time and programming time.

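For instance, once the data is in a DataFrame, further transforms stay vectorized over the whole frame; a small illustrative sketch (the operations and shapes here are made up, not from the question):

# Illustrative only: whole-frame, vectorized operations on an already-loaded DataFrame.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 20000))

row_sums = df.sum(axis=1)                   # per-row sums, as in pandas_csv_2
centered = df.sub(df.mean(axis=1), axis=0)  # subtract each row's mean
logged = np.log1p(df.abs())                 # elementwise transform on the whole frame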

Answered by hpaulj

The array_csv and numpy_csv times are quite similar. If you look at the loadtxt code you'll see that the actions are quite similar. With array_csv you construct an array for each line and use it, while numpy_csv collects the parsed (and converted) lines into one list, which is converted to an array at the end.


loadtxt for each row does:


        vals = split_line(line)
        ...
        # Convert each value according to its column and store
        items = [conv(val) for (conv, val) in zip(converters, vals)]
        # Then pack it according to the dtype's nesting
        items = pack_items(items, packing)
        X.append(items)

with a final


X = np.array(X, dtype)

That [conv(val) for ...] line is just a generalization of your [float(val) for val in ...].

If a plain list does the job, don't convert it to an array. That just adds unnecessary overhead.

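As a rough illustration of that overhead (a sketch with a made-up line width, mirroring the difference between python_list_csv and python_array_csv above):

# Summing a plain list vs. converting it to an array first and summing with
# the built-in sum(), as python_array_csv does (sizes are made up).
import timeit
import numpy as np

values = [float(i) for i in range(20000)]

t_list = timeit.timeit(lambda: sum(values), number=100)
t_array = timeit.timeit(lambda: sum(np.array(values)), number=100)

print("sum(list):          %.3f s" % t_list)
print("sum(np.array(...)): %.3f s" % t_array)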

Functions like loadtxt are most valuable when the csv columns contain a mix of data types. They streamline the work of creating structured arrays from that data. For pure numeric data such as yours they don't add much.

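For example, loadtxt can go straight from mixed columns to a structured array (an illustrative sketch; the columns here are made up, not the OP's data):

# Mixed integer/float columns parsed directly into a structured array.
import io
import numpy as np

mixed = io.StringIO("1,2.5,30\n2,3.5,40\n3,4.5,50\n")
arr = np.loadtxt(mixed, delimiter=",",
                 dtype=[("id", "i8"), ("score", "f8"), ("count", "i4")])

print(arr["score"].mean())   # field access on the structured array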

I can't speak for pandas, except that it has yet another layer on top of numpy, and does a lot of its own hardcoding.


Answered by Eyad

If you give pandas the dtypes as a dictionary (e.g. pd.read_csv(..., dtype={'x': np.float})), it will make things much faster, because otherwise pandas tries to check the data type of every column.

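A sketch of what that looks like (the file name and column names are hypothetical; with header=None the dict keys would be integer column positions instead):

# Per-column dtypes passed as a dict; file name and column names are made up.
import numpy as np
import pandas as pd

df = pd.read_csv("data_with_header.csv",
                 dtype={"x": np.float64, "y": np.float64, "label": "category"})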