pandas df.loc[z,x]=y how to improve speed?

Note: this content comes from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow

Original: http://stackoverflow.com/questions/37757844/
pandas df.loc[z,x]=y how to improve speed?
Asked by AltSheets
I have identified one pandas command
timeseries.loc[z, x] = y
as responsible for most of the time spent in each iteration, and now I am looking for better approaches to accelerate it. The loop covers not even 50k elements (the production goal is ~250k or more), but already needs a sad 20 seconds.
Here is my code (ignore the top half; it is just the timing helper):
import time

def populateTimeseriesTable(df, observable, timeseries):
"""
Go through all rows of df and
put the observable into the timeseries
at correct row (symbol), column (tsMean).
"""
print "len(df.index)=", len(df.index) # show number of rows
global bf, t
bf = time.time() # set 'before' to now
t = dict([(i,0) for i in range(5)]) # fill category timing with zeros
def T(i):
"""
timing helper: Add passed time to category 'i'. Then set 'before' to now.
"""
global bf, t
t[i] = t[i] + (time.time()-bf)
bf = time.time()
for i in df.index: # this is the slow loop
bf = time.time()
sym = df["symbol"][i]
T(0)
tsMean = df["tsMean"][i]
T(1)
tsMean = tsFormatter(tsMean)
T(2)
o = df[observable][i]
T(3)
timeseries.loc[sym, tsMean] = o
T(4)
from pprint import pprint
print "times needed (total = %.1f seconds) for each command:" % sum(t.values())
pprint (t)
return timeseries
With (not important, not slow)
def tsFormatter(ts):
"as human readable string, only up to whole seconds"
return time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(ts))
--> The to-be-optimized code is in the for-loop.
(T and t are just a helper function & dict, for the timing.)
I have timed every step. The vast majority of time:
len(df.index)= 47160
times needed (total = 20.2 seconds) for each command:
{0: 1.102,
1: 0.741,
2: 0.243,
3: 0.792,
4: 17.371}
is spent in the last step
timeseries.loc[sym, tsMean] = o
I have already downloaded and installed PyPy - but sadly, that doesn't support pandas yet.
Any ideas how to speed up populating a 2D array?
Thanks!
Edit: Sorry, hadn't mentioned - 'timeseries' is a dataframe too:
timeseries = pd.DataFrame({"name": titles}, index=index)
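For illustration, here is a minimal, self-contained sketch of the hot loop with .loc swapped for the scalar accessor .at, which the answers below benchmark as much faster for single-cell access. This is not code from the question or an answer: the toy data, column names, and tsFormatter merely mimic the question's setup, and all row/column labels are created up front because .at, unlike .loc, may not enlarge the frame with new labels on every pandas version:

import time
import pandas as pd

# Toy stand-ins for the question's df / timeseries (hypothetical data).
df = pd.DataFrame({"symbol": ["AAA", "BBB"] * 1000,
                   "tsMean": [1465800000 + 60 * i for i in range(2000)],
                   "price":  range(2000)})

def tsFormatter(ts):
    # as in the question: human-readable string, whole seconds only
    return time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(ts))

# Pre-create every row and column label, so each write below is a plain
# scalar assignment instead of an enlargement of the frame.
timeseries = pd.DataFrame(index=df["symbol"].unique(),
                          columns=df["tsMean"].map(tsFormatter).unique())

for i in df.index:
    # .at is the scalar accessor; the benchmarks below time it at
    # roughly 5-10x faster than .loc for single-cell access.
    timeseries.at[df["symbol"][i], tsFormatter(df["tsMean"][i])] = df["price"][i]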
Answered by MaxU
UPDATE: starting from Pandas 0.20.1 the .ix indexer is deprecated, in favor of the stricter .iloc and .loc indexers.
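A tiny illustration of the strict replacements, on a hypothetical toy frame:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})  # hypothetical toy frame

# .ix mixed label- and position-based lookups; the strict replacements are:
print(df.loc[2, 'B'])    # label-based: row label 2, column label 'B' -> 6
print(df.iloc[2, 1])     # position-based: third row, second column  -> 6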
=====================================================================
@jezrael has provided an interesting comparison and I decided to repeat it using more indexing methods and against a 10M-row DF (actually the size doesn't matter in this particular case):
setup:
In [15]: df = pd.DataFrame(np.random.rand(10**7, 5), columns=list('abcde'))
In [16]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 5 columns):
a float64
b float64
c float64
d float64
e float64
dtypes: float64(5)
memory usage: 381.5 MB
In [17]: df.shape
Out[17]: (10000000, 5)
Timing:
In [37]: %timeit df.loc[random.randint(0, 10**7), 'b']
1000 loops, best of 3: 502 μs per loop
In [38]: %timeit df.iloc[random.randint(0, 10**7), 1]
1000 loops, best of 3: 394 μs per loop
In [39]: %timeit df.at[random.randint(0, 10**7), 'b']
10000 loops, best of 3: 66.8 μs per loop
In [41]: %timeit df.iat[random.randint(0, 10**7), 1]
10000 loops, best of 3: 32.9 μs per loop
In [42]: %timeit df.ix[random.randint(0, 10**7), 'b']
10000 loops, best of 3: 64.8 μs per loop
In [43]: %timeit df.ix[random.randint(0, 10**7), 1]
1000 loops, best of 3: 503 μs per loop
Results as a bar plot:
Timing data as DF:
In [88]: r
Out[88]:
method timing
0 loc 502.0
1 iloc 394.0
2 at 66.8
3 iat 32.9
4 ix_label 64.8
5 ix_integer 503.0
In [89]: r.to_dict()
Out[89]:
{'method': {0: 'loc',
1: 'iloc',
2: 'at',
3: 'iat',
4: 'ix_label',
5: 'ix_integer'},
'timing': {0: 502.0,
1: 394.0,
2: 66.799999999999997,
3: 32.899999999999999,
4: 64.799999999999997,
5: 503.0}}
Plotting:
import seaborn as sns

ax = sns.barplot(data=r, x='method', y='timing')
ax.tick_params(labelsize=16)
[ax.annotate(str(round(p.get_height(),2)), (p.get_x() + 0.2, p.get_height() + 5)) for p in ax.patches]
ax.set_xlabel('indexing method', size=20)
ax.set_ylabel('timing (microseconds)', size=20)
Answered by jezrael
I always thought at was the fastest, but it is not. ix is faster:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
print (df.at[2, 'B'])
6
print (df.ix[2, 'B'])
6
print (df.loc[2, 'B'])
6
In [77]: %timeit df.at[2, 'B']
10000 loops, best of 3: 44.6 μs per loop
In [78]: %timeit df.ix[2, 'B']
10000 loops, best of 3: 40.7 μs per loop
In [79]: %timeit df.loc[2, 'B']
1000 loops, best of 3: 681 μs per loop
EDIT:
I tried MaxU's df, and the differences are caused by the random.randint function:
df = pd.DataFrame(np.random.rand(10**7, 5), columns=list('ABCDE'))
In [4]: %timeit (df.ix[2, 'B'])
The slowest run took 25.80 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 20.7 μs per loop
In [5]: %timeit (df.ix[random.randint(0, 10**7), 'B'])
The slowest run took 9.42 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 28 μs per loop
Answered by Afshin Amiri
If you are adding rows inside a loop, consider these performance issues: for roughly the first 1000 to 2000 records, "my_df.loc" performs better, but it gradually becomes slower as the number of records in the loop increases.
If you plan to do this inside a big loop (say ~10M records or so), you are better off using a mixture of "iloc" and "append": fill a temp dataframe with iloc until its size reaches around 1000, then append it to the original dataframe and empty the temp dataframe. This can boost your performance around 10 times (see the sketch below).
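A rough, self-contained sketch of that batching idea, with made-up names and sizes. Two hedges: the enlargement writes into the temp frame actually need .loc (not .iloc, which cannot add new rows), and DataFrame.append existed when this answer was written but has since been deprecated in favor of pd.concat:

import numpy as np
import pandas as pd

CHUNK = 1000                                  # flush threshold (arbitrary)
cols = list('abc')
main_df = pd.DataFrame(columns=cols)
temp = pd.DataFrame(columns=cols)

for i in range(10000):                        # stand-in for the big loop
    temp.loc[len(temp)] = np.random.rand(3)   # cheap write into the small frame
    if len(temp) >= CHUNK:
        # one append per chunk instead of one per row
        main_df = main_df.append(temp, ignore_index=True)
        temp = pd.DataFrame(columns=cols)

if len(temp):                                 # flush whatever is left over
    main_df = main_df.append(temp, ignore_index=True)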