pandas - extremely extremely slow

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/24653897/

Date: 2020-09-13 22:14:28  Source: igfitidea

pandas - extremely extremely slow

python, performance, pandas

Asked by coffeequant

I am trying to do a df.apply on date objects, but it's far too slow!!

My %prun output gives:

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1999   14.563    0.007   14.563    0.007 {pandas.tslib.array_to_timedelta64}
 13998    0.103    0.000   15.221    0.001 series.py:126(__init__)
  9999    0.093    0.000    0.093    0.000 {method 'reduce' of 'numpy.ufunc' objects}
272012    0.093    0.000    0.125    0.000 {isinstance}
  5997    0.089    0.000    0.196    0.000 common.py:199(_isnull_ndarraylike)

So basically it's 14 seconds for a 2000-element array. My actual array size is > 100,000, which translates to a run time of > 15 minutes, or maybe more.

Isn't it stupid of pandas to call this function, "pandas.tslib.array_to_timedelta64", which is the bottleneck? I really don't understand why this function call is necessary: both operands of the subtraction have the same data type. I explicitly converted them beforehand using the pd.to_datetime() method, and no, that conversion time is not included in this measurement.
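A quick sketch of my own (not from the question) of what the fast path looks like: when a column really is datetime64[ns], subtracting a timestamp stays vectorized in timedelta64[ns] and never round-trips through object dtype. One plausible reading of the profile is that the row-wise apply forces exactly that conversion once per row.

```python
import numpy as np
import pandas as pd

# Toy data: a proper datetime64[ns] column, as pd.to_datetime() should produce.
dates = pd.Series(pd.date_range("2013-01-01", periods=5, freq="D"))
anchor = pd.Timestamp("2013-01-03")

diff = (dates - anchor).abs()  # vectorized; result stays timedelta64[ns]
print(diff.dtype)
print(int(diff.idxmin()))      # index of the nearest date
```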

So, all in all, you can understand my frustration with this pathetic code!!!

The actual code looks like this:

import numpy as np
import pandas as pd

df = pd.DataFrame(bet_endtimes)

def testing():
    close_indices = df.apply(lambda x: np.argmin(np.abs(currentdata['date'] - x[0])), axis=1)
    print(close_indices)

%prun testing()
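For comparison, a vectorized version of the same lookup (my sketch; currentdata and bet_endtimes are the question's variables, stood in here with toy data of the assumed shapes) computes every nearest index in one broadcasted call instead of one apply() per row:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the question's variables (names assumed from the post).
currentdata = pd.DataFrame({"date": pd.date_range("2013-01-01", periods=10, freq="60min")})
bet_endtimes = pd.to_datetime(pd.Series(["2013-01-01 02:10", "2013-01-01 07:40"]))

# Broadcast all (end time, date) differences at once, then take the
# per-row argmin -- one vectorized call instead of a Python-level loop.
diffs = np.abs(bet_endtimes.values[:, None] - currentdata["date"].values[None, :])
close_indices = diffs.argmin(axis=1)
print(list(close_indices))
```

Note this builds the full M×N difference matrix, so for very large inputs a sorted-search approach is more memory-friendly.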

Answered by Jeff

I'd recommend consulting the documentation: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-deltas. It's also very helpful to include sample data, so I don't have to guess what you are doing.

Using apply is always the last operation to try; vectorized methods are much faster.

In [55]: pd.set_option('max_rows',10)

In [56]: df = DataFrame(dict(A = pd.date_range('20130101',periods=100000, freq='s')))

In [57]: df
Out[57]: 
                        A
0     2013-01-01 00:00:00
1     2013-01-01 00:00:01
2     2013-01-01 00:00:02
3     2013-01-01 00:00:03
4     2013-01-01 00:00:04
...                   ...
99995 2013-01-02 03:46:35
99996 2013-01-02 03:46:36
99997 2013-01-02 03:46:37
99998 2013-01-02 03:46:38
99999 2013-01-02 03:46:39

[100000 rows x 1 columns]

In [58]:  (df['A']-df.loc[10,'A']).abs()
Out[58]: 
0   00:00:10
1   00:00:09
2   00:00:08
...
99997   1 days, 03:46:27
99998   1 days, 03:46:28
99999   1 days, 03:46:29
Name: A, Length: 100000, dtype: timedelta64[ns]

In [59]: %timeit  (df['A']-df.loc[10,'A']).abs()
1000 loops, best of 3: 1.47 ms per loop

In response to:

    "It's stupid of pandas to call this function 'pandas.tslib.array_to_timedelta64' which is the bottleneck?"

When you contribute to pandas, you can name the methods.