Add multiple columns to a Pandas dataframe quickly
Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, licensed under CC BY-SA 4.0. If you reuse it, you must follow the CC BY-SA license, cite the original URL and author information, and attribute it to the original author (not me): StackOverflow
原文地址: http://stackoverflow.com/questions/29188578/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): StackOverflow
Add multiple columns to a Pandas dataframe quickly
Asked by Ben Kuhn
I'm writing some performance-sensitive code in which I have to add a large number of columns to a Pandas dataframe quickly.
I've managed to get a 3x improvement over naively repeating df[foo] = bar by constructing a second dataframe from a dict and concatenating them:
import pandas as pd

def mkdf1(n):
    # Naive approach: assign each new column to the frame one at a time.
    df = pd.DataFrame(index=range(10, 20), columns=list('qwertyuiop'))
    for i in xrange(n):
        df['col%d' % i] = range(i, 10 + i)
    return df

def mkdf2(n):
    # Faster approach: collect the new columns in a dict, build a second
    # dataframe from it, and concatenate the two frames in a single call.
    df = pd.DataFrame(index=range(10, 20), columns=list('qwertyuiop'))
    newcols = {}
    for i in xrange(n):
        newcols['col%d' % i] = range(i, 10 + i)
    return pd.concat([df, pd.DataFrame(newcols, index=df.index)], axis=1)
The timings show substantial improvement:
%timeit -r 1 mkdf1(100)
100 loops, best of 1: 16.6 ms per loop
%timeit -r 1 mkdf2(100)
100 loops, best of 1: 5.5 ms per loop
Are there any other optimizations I can make here?
EDIT: Also, the concat call is taking much longer in my real code than in my toy example; in particular the get_result function takes a lot longer even though the production df has fewer rows, and I can't figure out why. Any advice on how to speed this up would be appreciated.
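One direction worth probing, shown here only as a sketch (mkdf3 and its layout are illustrative assumptions, not part of the original question), is to skip concat entirely by putting the original columns and the new columns into a single dict and constructing the DataFrame once:

import pandas as pd

def mkdf3(n):
    # Hypothetical variant: build every column (the original empty ones and
    # the new ones) in one dict, then construct the DataFrame in a single
    # call, so there is no concat step at all.
    newcols = {c: [None] * 10 for c in 'qwertyuiop'}
    for i in range(n):
        newcols['col%d' % i] = range(i, 10 + i)
    order = list('qwertyuiop') + ['col%d' % i for i in range(n)]
    return pd.DataFrame(newcols, index=range(10, 20), columns=order)

Whether this actually beats mkdf2 would need timing against the real data; it only removes the concat, not the per-column Python loop.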
Accepted answer by JohnE
I'm a little confused about exactly what your dataframe should look like, but it's easy to speed this up a lot with a general technique. Basically, for pandas/numpy speed you want to avoid for loops and any concat/merge/join/append, if possible.
Your best bet here is most likely to use numpy to create an array that will be the input to a dataframe and then name the columns however you like. Both of these operations should be trivial as far as computation time goes.
Here's the numpy part; it looks like you already know how to construct the column names.
%timeit pd.DataFrame( np.ones([10,100]).cumsum(axis=0)
                      + np.ones([10,100]).cumsum(axis=1) )
10000 loops, best of 3: 158 μs per loop
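As a bridge back to the question's setup, here is a hedged sketch (the variable names and the arange trick are assumptions for illustration, not taken from the answer) of building all the new columns as one 2-D array and attaching the column names afterwards:

import numpy as np
import pandas as pd

n = 100  # number of new columns (placeholder)
# Element [r, i] equals r + i, which reproduces range(i, 10 + i) for column i.
data = np.arange(10)[:, None] + np.arange(n)[None, :]   # shape (10, n)
df_new = pd.DataFrame(data,
                      index=range(10, 20),
                      columns=['col%d' % i for i in range(n)])

A single vectorized construction like this keeps both the array creation and the column naming out of any Python-level loop, which is the point of the answer.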
I think you are trying to make something like this? (If not, just check out numpy if you aren't familiar with it; it has all sorts of array operations that should make it very easy to do whatever you are trying to do here.)
In [63]: df.ix[:5,:10]
Out[63]:
   0   1   2   3   4   5   6   7   8   9  10
0  2   3   4   5   6   7   8   9  10  11  12
1  3   4   5   6   7   8   9  10  11  12  13
2  4   5   6   7   8   9  10  11  12  13  14
3  5   6   7   8   9  10  11  12  13  14  15
4  6   7   8   9  10  11  12  13  14  15  16
5  7   8   9  10  11  12  13  14  15  16  17

