pandas df.loc[z,x]=y how to improve speed?

Note: this content comes from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow

Original: http://stackoverflow.com/questions/37757844/
pandas df.loc[z,x]=y how to improve speed?
Asked by AltSheets
I have identified one pandas command
timeseries.loc[z, x] = y
as responsible for most of the time spent in each iteration, and now I am looking for better approaches to accelerate it. The loop covers not even 50k elements (the production goal is ~250k or more), but already needs a sad 20 seconds.
Here is my code (ignore the top half; it is just the timing helper):
import time

def populateTimeseriesTable(df, observable, timeseries):
"""
Go through all rows of df and
put the observable into the timeseries
at correct row (symbol), column (tsMean).
"""
print "len(df.index)=", len(df.index) # show number of rows
global bf, t
bf = time.time() # set 'before' to now
t = dict([(i,0) for i in range(5)]) # fill category timing with zeros
def T(i):
"""
timing helper: Add passed time to category 'i'. Then set 'before' to now.
"""
global bf, t
t[i] = t[i] + (time.time()-bf)
bf = time.time()
for i in df.index: # this is the slow loop
bf = time.time()
sym = df["symbol"][i]
T(0)
tsMean = df["tsMean"][i]
T(1)
tsMean = tsFormatter(tsMean)
T(2)
o = df[observable][i]
T(3)
timeseries.loc[sym, tsMean] = o
T(4)
from pprint import pprint
print "times needed (total = %.1f seconds) for each command:" % sum(t.values())
pprint (t)
return timeseries
With (not important, not slow)
def tsFormatter(ts):
"as human readable string, only up to whole seconds"
return time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(ts))
--> The to-be-optimized code is in the for-loop.
(T and t are just a helper function & dict, for the timing.)
I have timed every step. The vast majority of time:
len(df.index)= 47160
times needed (total = 20.2 seconds) for each command:
{0: 1.102,
1: 0.741,
2: 0.243,
3: 0.792,
4: 17.371}
is spent in the last step
timeseries.loc[sym, tsMean] = o
I have already downloaded and installed PyPy - but sadly, that doesn't support pandas yet.
Any ideas how to speed up populating a 2D array?
Thanks!
Edit: Sorry, hadn't mentioned - 'timeseries' is a dataframe too:
timeseries = pd.DataFrame({"name": titles}, index=index)
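For illustration, here is a minimal, self-contained sketch of the hot loop with .loc swapped for the scalar accessor .at, which the answers below benchmark as much faster for single-cell access. This is not code from the question or an answer: the toy data, column names, and tsFormatter merely mimic the question's setup, and all row/column labels are created up front because .at, unlike .loc, may not enlarge the frame with new labels on every pandas version:

import time
import pandas as pd

# Toy stand-ins for the question's df / timeseries (hypothetical data).
df = pd.DataFrame({"symbol": ["AAA", "BBB"] * 1000,
                   "tsMean": [1465800000 + 60 * i for i in range(2000)],
                   "price":  range(2000)})

def tsFormatter(ts):
    # as in the question: human-readable string, whole seconds only
    return time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(ts))

# Pre-create every row and column label, so each write below is a plain
# scalar assignment instead of an enlargement of the frame.
timeseries = pd.DataFrame(index=df["symbol"].unique(),
                          columns=df["tsMean"].map(tsFormatter).unique())

for i in df.index:
    # .at is the scalar accessor; the benchmarks below time it at
    # roughly 5-10x faster than .loc for single-cell access.
    timeseries.at[df["symbol"][i], tsFormatter(df["tsMean"][i])] = df["price"][i]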
Answered by MaxU
UPDATE: starting from Pandas 0.20.1 the .ix indexer is deprecated, in favor of the stricter .iloc and .loc indexers.
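A tiny illustration of the strict replacements, on a hypothetical toy frame:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})  # hypothetical toy frame

# .ix mixed label- and position-based lookups; the strict replacements are:
print(df.loc[2, 'B'])    # label-based: row label 2, column label 'B' -> 6
print(df.iloc[2, 1])     # position-based: third row, second column  -> 6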
=====================================================================
@jezrael has provided an interesting comparison and I decided to repeat it using more indexing methods and against a 10M-row DF (actually the size doesn't matter in this particular case):
setup:
In [15]: df = pd.DataFrame(np.random.rand(10**7, 5), columns=list('abcde'))
In [16]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 5 columns):
a float64
b float64
c float64
d float64
e float64
dtypes: float64(5)
memory usage: 381.5 MB
In [17]: df.shape
Out[17]: (10000000, 5)
Timing:
In [37]: %timeit df.loc[random.randint(0, 10**7), 'b']
1000 loops, best of 3: 502 μs per loop
In [38]: %timeit df.iloc[random.randint(0, 10**7), 1]
1000 loops, best of 3: 394 μs per loop
In [39]: %timeit df.at[random.randint(0, 10**7), 'b']
10000 loops, best of 3: 66.8 μs per loop
In [41]: %timeit df.iat[random.randint(0, 10**7), 1]
10000 loops, best of 3: 32.9 μs per loop
In [42]: %timeit df.ix[random.randint(0, 10**7), 'b']
10000 loops, best of 3: 64.8 μs per loop
In [43]: %timeit df.ix[random.randint(0, 10**7), 1]
1000 loops, best of 3: 503 μs per loop
Results as a bar plot:
Timing data as DF:
In [88]: r
Out[88]:
method timing
0 loc 502.0
1 iloc 394.0
2 at 66.8
3 iat 32.9
4 ix_label 64.8
5 ix_integer 503.0
In [89]: r.to_dict()
Out[89]:
{'method': {0: 'loc',
1: 'iloc',
2: 'at',
3: 'iat',
4: 'ix_label',
5: 'ix_integer'},
'timing': {0: 502.0,
1: 394.0,
2: 66.799999999999997,
3: 32.899999999999999,
4: 64.799999999999997,
5: 503.0}}
Plotting:
import seaborn as sns

ax = sns.barplot(data=r, x='method', y='timing')
ax.tick_params(labelsize=16)
[ax.annotate(str(round(p.get_height(),2)), (p.get_x() + 0.2, p.get_height() + 5)) for p in ax.patches]
ax.set_xlabel('indexing method', size=20)
ax.set_ylabel('timing (microseconds)', size=20)
Answered by jezrael
I always thought at was the fastest, but it is not. ix is faster:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
print (df.at[2, 'B'])
6
print (df.ix[2, 'B'])
6
print (df.loc[2, 'B'])
6
In [77]: %timeit df.at[2, 'B']
10000 loops, best of 3: 44.6 μs per loop
In [78]: %timeit df.ix[2, 'B']
10000 loops, best of 3: 40.7 μs per loop
In [79]: %timeit df.loc[2, 'B']
1000 loops, best of 3: 681 μs per loop
EDIT:
I tried MaxU's df, and the differences are caused by the random.randint function:
df = pd.DataFrame(np.random.rand(10**7, 5), columns=list('ABCDE'))
In [4]: %timeit (df.ix[2, 'B'])
The slowest run took 25.80 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 20.7 μs per loop
In [5]: %timeit (df.ix[random.randint(0, 10**7), 'B'])
The slowest run took 9.42 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 28 μs per loop
Answered by Afshin Amiri
If you are adding rows inside a loop, consider these performance issues: for roughly the first 1000 to 2000 records, "my_df.loc" performs better, but it gradually becomes slower as the number of records in the loop increases.
If you plan to do this inside a big loop (say ~10M records or so), you are better off using a mixture of "iloc" and "append": fill a temp dataframe with iloc until its size reaches around 1000, then append it to the original dataframe and empty the temp dataframe. This can boost your performance around 10 times (see the sketch below).
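A rough, self-contained sketch of that batching idea, with made-up names and sizes. Two hedges: the enlargement writes into the temp frame actually need .loc (not .iloc, which cannot add new rows), and DataFrame.append existed when this answer was written but has since been deprecated in favor of pd.concat:

import numpy as np
import pandas as pd

CHUNK = 1000                                  # flush threshold (arbitrary)
cols = list('abc')
main_df = pd.DataFrame(columns=cols)
temp = pd.DataFrame(columns=cols)

for i in range(10000):                        # stand-in for the big loop
    temp.loc[len(temp)] = np.random.rand(3)   # cheap write into the small frame
    if len(temp) >= CHUNK:
        # one append per chunk instead of one per row
        main_df = main_df.append(temp, ignore_index=True)
        temp = pd.DataFrame(columns=cols)

if len(temp):                                 # flush whatever is left over
    main_df = main_df.append(temp, ignore_index=True)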